# Text Classifier fine tuning using IMDb with PyTorch

This notebook demonstrates fine tuning pretrained models from [Hugging Face](https://huggingface.co) using the [IMDb dataset](https://huggingface.co/datasets/imdb) to analyze the sentiment of movie review text. The full dataset has 25,000 training and 25,000 test examples, but this notebook uses a subset of the dataset for quicker training. The notebook uses [Intel® Extension for PyTorch*](https://github.com/intel/intel-extension-for-pytorch), which extends PyTorch with optimizations for an extra performance boost on Intel hardware.

Please install the dependencies from the [requirements.txt](requirements.txt) file before executing this notebook.

The notebook performs the following steps:
1. [Import dependencies and setup parameters](#1.-Import-dependencies-and-setup-parameters)
2. [Prepare the dataset](#2.-Prepare-the-dataset)
3. [Get the model and setup the Trainer](#3.-Get-the-model-and-setup-the-Trainer)
4. [Fine tuning and evaluation](#4.-Fine-tuning-and-evaluation)
5. [Export the model](#5.-Export-the-model)
6. [Reload the model and make predictions](#6.-Reload-the-model-and-make-predictions)

## 1. Import dependencies and setup parameters

In [None]:
import intel_extension_for_pytorch as ipex
import logging
import numpy as np
import os
import pandas as pd
import sys
import torch
import warnings

from datasets import load_dataset, load_metric
from datasets import logging as datasets_logging
from transformers.utils import logging as transformers_logging
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)

# Set the logging stream to stdout
for handler in transformers_logging._get_library_root_logger().handlers:
    handler.setStream(sys.stdout)

sh = datasets_logging.logging.StreamHandler(sys.stdout)

warnings.filterwarnings('ignore')
os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "1"

In [None]:
# Specify the name of the Hugging Face pretrained model to use (https://huggingface.co/models)
# For example: 
#   albert-base-v2
#   bert-base-uncased
#   distilbert-base-uncased
#   distilbert-base-uncased-finetuned-sst-2-english
#   roberta-base
model_name = "distilbert-base-uncased"

# Name of the Hugging Face dataset
dataset_name = "imdb"

# Define an output directory
output_dir = os.environ["OUTPUT_DIR"] if "OUTPUT_DIR" in os.environ else \
    os.path.join(os.environ["HOME"], "{}-{}-output".format(model_name, dataset_name))

# Define a cache directory for dataset and pretrained model files
cache_dir = os.environ["CACHE_DIR"] if "CACHE_DIR" in os.environ else \
    os.path.join(os.environ["HOME"], "pytorch-text-classification-cache")

print("Model name:", model_name)
print("Dataset name:", dataset_name)
print("Output directory:", output_dir)
print("Cache directory:", cache_dir)
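
The environment-variable fallback used above can also be written with `os.environ.get`. The following is a minimal sketch, not part of the original notebook: the helper name `resolve_dir` is hypothetical, and `os.path.expanduser("~")` is swapped in for `os.environ["HOME"]` as an assumption so that the lookup also works where `HOME` is not set.

```python
import os


def resolve_dir(env_var, default_name):
    """Return the directory named by env_var, or a folder under the user's home directory."""
    # Hypothetical helper: falls back to ~/<default_name> when env_var is not set
    default_path = os.path.join(os.path.expanduser("~"), default_name)
    return os.environ.get(env_var, default_path)


model_name = "distilbert-base-uncased"
dataset_name = "imdb"

output_dir = resolve_dir("OUTPUT_DIR", "{}-{}-output".format(model_name, dataset_name))
cache_dir = resolve_dir("CACHE_DIR", "pytorch-text-classification-cache")

print("Output directory:", output_dir)
print("Cache directory:", cache_dir)
```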

## 2. Prepare the dataset

The notebook gets the [IMDb movie review dataset](https://huggingface.co/datasets/imdb) using the Hugging Face datasets API. If the notebook is executed multiple times, the dataset is loaded from the cache directory, which reduces the time it takes to run.

The IMDb dataset in Hugging Face has 3 splits: `train`, `test`, and `unsupervised`. This notebook uses data from the `train` split for training and data from the `test` split for evaluation. The data has 2 columns: `text` (string with the movie review) and `label` (integer class label). The code in the next cell is set up to run using the IMDb dataset, so if you use a different dataset, you may need to change the split names and/or the column names.
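
As a rough illustration of that column layout, the sketch below builds two made-up records with `text` and `label` fields and maps the integer labels to names. The review strings are invented, and the label order (`neg` = 0, `pos` = 1) is the one the Hugging Face IMDb dataset reports; treat it as an assumption for any other dataset.

```python
# Made-up records that mimic the IMDb column layout (not real dataset rows)
records = [
    {"text": "A dull, predictable plot.", "label": 0},
    {"text": "Great acting and a moving story.", "label": 1},
]

# Assumed label order for IMDb: index 0 is "neg", index 1 is "pos"
label_names = ["neg", "pos"]

for record in records:
    print("{} -> {}".format(label_names[record["label"]], record["text"]))
```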

In [None]:
# For quicker training and debug runs, use a subset of the dataset by specifying the size of the train/eval datasets.
# Set the sizes to `None` to use the full dataset. The full IMDb dataset has 25,000 training and 25,000 test examples.
train_dataset_size = 1000
eval_dataset_size = 1000

# Name of the dataset splits (the split names may vary if you are not using the IMDb dataset)
train_split_name = "train"
eval_split_name = "test"

# Name of the columns in the dataset (the column names may vary if you are not using the IMDb dataset)
dataset_sentence1_key = "text"
dataset_sentence2_key = None
dataset_label_key = "label"

datasets_logging.set_verbosity_error()

# Load the dataset from the Hugging Face dataset API
dataset = load_dataset(dataset_name, cache_dir=cache_dir)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

def tokenize_function(examples):
    # Define the tokenizer args, depending on if the data has 2 sentences or just 1
    args = ((examples[dataset_sentence1_key],) if dataset_sentence2_key is None \
             else (examples[dataset_sentence1_key], examples[dataset_sentence2_key]))
    return tokenizer(*args, padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Remove the raw text from the tokenized dataset
raw_text_columns = [dataset_sentence1_key, dataset_sentence2_key] if dataset_sentence2_key else [dataset_sentence1_key]
tokenized_dataset = tokenized_dataset.remove_columns(raw_text_columns)

# Get the training and eval dataset based on the specified dataset sizes
train_dataset = tokenized_dataset[train_split_name].shuffle().select(range(train_dataset_size)) if train_dataset_size \
    else tokenized_dataset[train_split_name]    
eval_dataset = tokenized_dataset[eval_split_name].shuffle().select(range(eval_dataset_size)) if eval_dataset_size \
    else tokenized_dataset[eval_split_name]

# Save the class label information to use later when predicting
class_labels = dataset[train_split_name].features[dataset_label_key]
print("Label names:", class_labels.names)
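
Because `shuffle()` is called without a seed in the cell above, each run may select a different subset, so metrics can vary between runs. `Dataset.shuffle` accepts a `seed` argument for repeatable selection. The NumPy sketch below only illustrates the idea of a seeded shuffle followed by keeping the first `n` indices; the helper name `select_subset_indices` and the seed value are hypothetical, and it does not reproduce the exact ordering that the datasets library would produce.

```python
import numpy as np


def select_subset_indices(num_examples, subset_size, seed=42):
    """Return subset_size shuffled row indices, or all indices when subset_size is None."""
    if not subset_size:
        return np.arange(num_examples)
    rng = np.random.default_rng(seed)
    # Shuffle all row indices, then keep the first subset_size of them
    return rng.permutation(num_examples)[:subset_size]


first_run = select_subset_indices(25000, 1000)
second_run = select_subset_indices(25000, 1000)

# The same seed yields the same subset on every run
print(np.array_equal(first_run, second_run))
```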

The next cell displays a sample of the text and labels so that we can see what our training data looks like.

In [None]:
# Get a sample of the dataset to display
sample_size = 7
sentence1_sample = dataset[train_split_name][dataset_sentence1_key][:sample_size]
sentence2_sample = dataset[train_split_name][dataset_sentence2_key][:sample_size] if dataset_sentence2_key else None
label_sample = dataset[train_split_name][dataset_label_key][:sample_size]
dataset_sample = zip(sentence1_sample, sentence2_sample, label_sample) if dataset_sentence2_key \
    else zip(sentence1_sample, label_sample)

columns = [dataset_sentence1_key, dataset_sentence2_key, dataset_label_key] if dataset_sentence2_key else \
    [dataset_sentence1_key, dataset_label_key]

# Display the sample using a dataframe
sample = pd.DataFrame(dataset_sample, columns=columns)
sample.style.hide_index()

## 3. Get the model and setup the Trainer

This step gets the pretrained model from [Hugging Face](https://huggingface.co/models) and sets up the
[TrainingArguments](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments) and the
[Trainer](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.Trainer). For simplicity, this example is using default values for most of the training args, but we are specifying our output directory and the number of training epochs. If your output directory already has checkpoints from a previous run,
training will resume from the last checkpoint. The `overwrite_output_dir` training argument can be set to
`True` if you want to instead overwrite previously generated checkpoints.

> Note that it is expected to see a warning at this step about some weights not being used. This is because
> the pretraining head from the original model is being replaced with a classification head.

In [None]:
num_train_epochs = 2

# Load the model using the pretrained weights
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(class_labels.names))

# Apply the ipex optimize function to the model
model = ipex.optimize(model)

# Set up the training args
training_args = TrainingArguments(output_dir=output_dir, num_train_epochs=num_train_epochs)

# Compute metrics used for evaluation
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Define the Trainer
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=eval_dataset,
                  compute_metrics=compute_metrics)
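
To see what `compute_metrics` returns, the sketch below applies the same argmax step to a small hand-written array of logits and computes accuracy directly with NumPy instead of `load_metric`. The logits and labels are invented for illustration, and `accuracy_from_logits` is a hypothetical name.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare against the reference labels."""
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float(np.mean(predictions == labels))}


# Invented logits for 4 examples and 2 classes (columns: neg, pos)
logits = np.array([[2.0, -1.0],
                   [0.3, 0.9],
                   [1.5, 1.1],
                   [-0.7, 0.2]])
labels = np.array([0, 1, 1, 1])

print(accuracy_from_logits(logits, labels))
```

Here three of the four predictions match the labels, so the accuracy is 0.75. When the `Trainer` reports evaluation results, it prefixes metric names with `eval_`, so this value appears as `eval_accuracy`.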

## 4. Fine tuning and evaluation

This step uses the [Trainer](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.Trainer)
defined in the previous step to train and evaluate the model.

In [None]:
%%time

trainer.train()

In [None]:
# Evaluate and print metrics
metrics = trainer.evaluate()

for key in metrics.keys():
    print("{}: {}".format(key, metrics[key]))

## 5. Export the model

In [None]:
# Save the model to our output directory
trainer.save_model(output_dir)

## 6. Reload the model and make predictions

The output directory is used to reload the model. In the next cell, we evaluate the reloaded model to verify that we are getting the same metrics that we saw after fine tuning.

In [None]:
reloaded_model = AutoModelForSequenceClassification.from_pretrained(output_dir)

# Set the model in evaluation mode
reloaded_model.eval()

# Use the reloaded model in the trainer
trainer.model = reloaded_model
reloaded_model_metrics = trainer.evaluate()

for key in reloaded_model_metrics.keys():
    print("{}: {}".format(key, reloaded_model_metrics[key]))
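
The check that the reloaded model reproduces the earlier metrics can also be automated. The sketch below assumes both results are plain dictionaries like the ones returned by `trainer.evaluate()` and skips timing-related keys, which are expected to differ from run to run. The helper name `metrics_match`, the tolerance, and the example values are all hypothetical.

```python
import math


def metrics_match(before, after, rel_tol=1e-4,
                  skip_substrings=("runtime", "per_second")):
    """Return True if every non-timing metric in `before` is close to its value in `after`."""
    for key, value in before.items():
        # Timing metrics are expected to change between runs, so ignore them
        if any(s in key for s in skip_substrings):
            continue
        if key not in after or not math.isclose(value, after[key], rel_tol=rel_tol):
            return False
    return True


# Invented example values, not results from an actual run
metrics_before = {"eval_loss": 0.412, "eval_accuracy": 0.86, "eval_runtime": 35.2}
metrics_after = {"eval_loss": 0.412, "eval_accuracy": 0.86, "eval_runtime": 33.9}

print(metrics_match(metrics_before, metrics_after))
```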

Next, we demonstrate how to encode raw text input and get predictions from the reloaded model.

In [None]:
# Set up some raw text input
raw_text_input = ["It was okay. I finished it, but wouldn't watch it again.",
                  "So bad",
                  "Definitely not my favorite",
                  "Highly recommended"]

# Encode the raw text using the tokenizer
encoded_input = tokenizer(raw_text_input, padding=True, return_tensors='pt')

# Send the encoded input to the reloaded model and get the predicted results
# (no gradients are needed for inference)
with torch.no_grad():
    output = reloaded_model(**encoded_input)
_, predictions = torch.max(output.logits, dim=1)

# Translate the predictions to class label strings
prediction_labels = class_labels.int2str(predictions)

# Create a dataframe to display the results
result_list = [list(x) for x in zip(raw_text_input, prediction_labels)]
result_df = pd.DataFrame(result_list, columns=["Input Text", "Predicted Label"])
result_df.style.hide_index()
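
The logits can also be turned into per-class probabilities with a softmax, which gives a rough confidence for each prediction. The NumPy sketch below uses invented logits rather than real model output; with the cell above, `output.logits.detach().numpy()` would be the assumed input.

```python
import numpy as np


def softmax(logits):
    """Convert each row of logits into probabilities that sum to 1."""
    # Subtract the row maximum first for numerical stability
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


# Invented logits for 2 inputs and 2 classes (columns: neg, pos)
example_logits = np.array([[1.2, -0.8],
                           [-2.0, 2.5]])
probabilities = softmax(example_logits)

# Predicted class index and its probability for each input
predicted_class = np.argmax(probabilities, axis=-1)
confidence = np.max(probabilities, axis=-1)
print(predicted_class, confidence)
```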

## Citations

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
```