Fine-tuning a pretrained BERT transformer model for customized sentiment analysis using the Hugging Face Transformers Trainer (PyTorch)

# Intro

Hugging Face provides three ways to fine-tune a pretrained text classification model: TensorFlow Keras, native PyTorch, and the Transformers `Trainer`. The `Trainer` is an API for feature-complete training in PyTorch without writing the training loop by hand. This tutorial will use the `Trainer` to fine-tune a text classification model. We will talk about the following:
* How does transfer learning work?
* How to convert a pandas dataframe into a Hugging Face Dataset?
* How to tokenize text, load a pretrained model, set training arguments, and train a transfer learning model?
* How to make predictions and evaluate the model performance of a fine-tuned transfer learning model for text classification?
* How to save the model and re-load the model?

If you are interested in learning how to implement transfer learning using TensorFlow, please check out my previous tutorial [Customized Sentiment Analysis: Transfer Learning Using Tensorflow with Hugging Face](https://medium.com/grabngoinfo/customized-sentiment-analysis-transfer-learning-using-tensorflow-with-hugging-face-1b439eedf167).

Let's get started!

# Step 0: Transfer Learning Algorithms

In step 0, we will talk about how transfer learning works.

Transfer learning is a machine learning technique that reuses a pretrained large deep learning model on a new task. It usually includes the following steps:
1. Select a pretrained model that is suitable for the new task. For example, if the new task includes text from different languages, a multi-language pretrained model needs to be selected.
2. Keep all the weights and biases from the pretrained model except for the output layer. This is because the output layer of the pretrained model is built for the pretraining tasks, so it needs to be replaced with one for the new task.
3. Randomly initialize the weights and biases of the new head for the new task. For a sentiment analysis transfer learning (aka fine-tuning) model built on a pretrained BERT model, we remove the head that predicts masked words and replace it with a head that outputs the two sentiment labels, positive and negative.
4. Retrain the model for the new task with the new data, utilizing the pretrained weights and biases. Because the weights and biases store the knowledge learned from the pretrained model, the fine-tuned transfer learning model can build on that knowledge and does not need to learn from scratch.
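
The steps above can be sketched with a toy model in NumPy. This is only an illustration under stated assumptions: the layer names (`encoder`, `mlm_head`, `classifier`), the tiny sizes, and the helper `build_finetune_weights` are hypothetical and do not reflect how the `transformers` library stores weights internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "pretrained" weights: a body plus a head for the pretraining task
hidden_size, vocab_size = 8, 20
pretrained = {
    "encoder": rng.normal(size=(hidden_size, hidden_size)),
    "mlm_head": rng.normal(size=(hidden_size, vocab_size)),
}

def build_finetune_weights(pretrained, num_labels, rng):
    """Keep the pretrained body, drop the old head, add a randomly initialized new head."""
    weights = {name: w.copy() for name, w in pretrained.items() if name != "mlm_head"}
    weights["classifier"] = rng.normal(
        scale=0.02, size=(pretrained["encoder"].shape[1], num_labels)
    )
    return weights

finetune = build_finetune_weights(pretrained, num_labels=2, rng=rng)
print(sorted(finetune))               # names of the layers in the new model
print(finetune["classifier"].shape)   # shape of the new two-label head
```

Training would then update both the copied body and the new head on the sentiment data, starting from the pretrained values rather than from scratch.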

# Step 1: Install And Import Python Libraries

In step 1, we will install and import python libraries.

Firstly, let's install `transformers`, `datasets`, and `evaluate`.

In [19]:
# Install libraries
!pip install transformers datasets evaluate

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


After installing the python packages, we will import the python libraries.
* `pandas` and `numpy` are imported for data processing.
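* `transformers` is imported for modeling. `tensorflow` is imported only for its softmax function, which is used in step 10 to convert logits into probabilities.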
* `Dataset` is imported for the Hugging Face dataset format.
* `evaluate` is imported for model performance evaluation.

In [22]:
# Data processing
import pandas as pd
import numpy as np

# Modeling
import tensorflow as tf
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback, TextClassificationPipeline

# Hugging Face Dataset
from datasets import Dataset

# Model performance evaluation
import evaluate

# Step 2: Download And Read Data

The second step is to download and read the dataset.

The UCI Machine Learning Repository has the review data from three websites: imdb.com, amazon.com, and yelp.com. We will use the review data from amazon.com for this tutorial. Please follow these steps to download the data.
1. Go to: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
2. Click "Data Folder"
3. Download "sentiment labeled sentences.zip"
4. Unzip "sentiment labeled sentences.zip"
5. Copy the file "amazon_cells_labelled.txt" to your project folder

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.
* `drive.mount` is used to mount to the Google drive so the colab notebook can access the data on the Google drive.
* `os.chdir` is used to change the default directory on Google drive. I set the default directory to the folder where the review dataset is saved.
* `!pwd` is used to print the current working directory.

Please check out [Google Colab Tutorial for Beginners](https://medium.com/towards-artificial-intelligence/google-colab-tutorial-for-beginners-834595494d44) for details about using Google Colab for data science projects.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/nlp")

# Print out the current directory
!pwd

Mounted at /content/drive
/content/drive/My Drive/contents/nlp


Now let's read the data into a `pandas` dataframe and see what the dataset looks like.

The dataset has two columns. One column contains the reviews and the other column contains the sentiment label for the review.

In [3]:
# Read in data
amz_review = pd.read_csv('amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])

# Take a look at the data
amz_review.head()

| | review | label |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


`.info` helps us to get information about the dataset.

From the output, we can see that this data set has 1000 records and no missing data. The `review` column is the `object` type and the `label` column is the `int64` type.

In [23]:
# Get the dataset information
amz_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  1000 non-null   object
 1   label   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


The label value of 0 represents negative reviews and the label value of 1 represents positive reviews. The dataset has 500 positive reviews and 500 negative reviews. It is well-balanced, so we can use accuracy as the metric to evaluate the model performance.

In [5]:
# Check the label distribution
amz_review['label'].value_counts()

| label | count |
|---|---|
| 0 | 500 |
| 1 | 500 |
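
To see why balance matters when choosing accuracy, the sketch below compares a trivial classifier that always predicts the most frequent class on a balanced and on an imbalanced label set. The 900/100 split is a made-up contrast, not data from this tutorial, and `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common = max(set(labels), key=labels.count)
    return sum(label == most_common for label in labels) / len(labels)

balanced = [0] * 500 + [1] * 500      # same label distribution as the Amazon reviews
imbalanced = [0] * 900 + [1] * 100    # hypothetical skewed dataset

print(majority_baseline_accuracy(balanced))    # 0.5
print(majority_baseline_accuracy(imbalanced))  # 0.9
```

On the balanced data a model has to beat 50% to be useful, so accuracy is informative. On the skewed data a useless model already scores 90%, which is why metrics such as F1 or recall are preferred there.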


# Step 3: Train Test Split

In step 3, we will split the dataset and have 80% as the training dataset and 20% as the testing dataset.

Using the `sample` method, we set `frac=0.8`, which randomly samples 80% of the data. `random_state=42` ensures that the sampling result is reproducible.

Dropping the `train_data` rows from the review dataset gives us the remaining 20% of the data, which is our testing dataset.

In [6]:
# Training dataset
train_data = amz_review.sample(frac=0.8, random_state=42)

# Testing dataset
test_data = amz_review.drop(train_data.index)

# Check the number of records in training and testing dataset.
print(f'The training dataset has {len(train_data)} records.')
print(f'The testing dataset has {len(test_data)} records.')

The training dataset has 800 records.
The testing dataset has 200 records.


After the train test split, there are 800 reviews in the training dataset and 200 reviews in the testing dataset.
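
A quick sanity check for this kind of split is that the two parts do not overlap and together cover every row. The sketch below runs the same `sample`/`drop` pattern on a small made-up dataframe (`toy_reviews` is hypothetical) so it can be executed without the review file.

```python
import pandas as pd

# Hypothetical stand-in for the review data
toy_reviews = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

toy_train = toy_reviews.sample(frac=0.8, random_state=42)
toy_test = toy_reviews.drop(toy_train.index)

# No row appears in both parts, and no row is lost
overlap = set(toy_train.index) & set(toy_test.index)
covered = set(toy_train.index) | set(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap), covered == set(toy_reviews.index))
```

Note that `sample` is not stratified, so the share of positive and negative reviews in each part can deviate slightly from the 50/50 split of the full dataset.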

# Step 4: Convert Pandas Dataframe to Hugging Face Dataset

In step 4, the training and the testing datasets will be converted from pandas dataframe to Hugging Face Dataset format.

Hugging Face Dataset objects are memory-mapped from disk, so they are not limited by the available RAM, which is very helpful for processing large datasets.

We use `Dataset.from_pandas` to convert a pandas dataframe to a Hugging Face Dataset.

In [7]:
# Convert pandas dataframe to Hugging Face Arrow dataset
hg_train_data = Dataset.from_pandas(train_data)
hg_test_data = Dataset.from_pandas(test_data)

The length of the Hugging Face Dataset is the same as the number of records in the pandas dataframe. For example, there are 800 records in the pandas dataframe for the training dataset, and the length of the converted Hugging Face Dataset for the training dataset is 800 too.

`hg_train_data[0]` gives us the first record in the Hugging Face Dataset. It is a dictionary with three keys, `review`, `label`, and `__index_level_0__`.
* `review` is the variable name for the review text. The name is inherited from the column name of the pandas dataframe.
* `label` is the variable name for the sentiment of the review text. The name is inherited from the column name of the pandas dataframe too.
* `__index_level_0__` is an automatically generated field from the pandas dataframe. It stores the index of the corresponding record.

In [8]:
# Length of the Dataset
print(f'The length of hg_train_data is {len(hg_train_data)}.\n')

# Check one review
hg_train_data[0]

The length of hg_train_data is 800.



{'review': 'Thanks again to Amazon for having the things I need for a good price!',
 'label': 1,
 '__index_level_0__': 521}

In this example, we can see that the review is `Thanks again to Amazon for having the things I need for a good price!`, the sentiment for the review is positive/1, and the index of this record is 521 in the pandas dataframe.

Checking index 521 in the pandas dataframe confirms that it holds the same information as the Hugging Face Dataset record.

In [9]:
# Validate the record in pandas dataframe (look up by index label)
amz_review.loc[[521]]

| | review | label |
|---|---|---|
| 521 | Thanks again to Amazon for having the things I... | 1 |
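
If the original index is not needed, it can be left out of the Hugging Face Dataset. One option, sketched below on a made-up dataframe, is to reset the pandas index before converting; `Dataset.from_pandas` also accepts a `preserve_index` argument for this purpose. The conversion calls are left as comments so the snippet runs with pandas alone, and the statement about the resulting columns reflects my understanding of the default behavior, so please verify it against your installed `datasets` version.

```python
import pandas as pd

# Hypothetical four-row dataframe
toy = pd.DataFrame({"review": ["a", "b", "c", "d"], "label": [1, 0, 1, 0]})
shuffled = toy.sample(frac=0.5, random_state=42)   # keeps the original index labels

# Replace the sampled index labels with a fresh 0..n-1 index
clean = shuffled.reset_index(drop=True)
print(list(clean.index))       # [0, 1]
print(list(clean.columns))     # ['review', 'label']

# With the `datasets` package installed, either call should avoid the extra index column:
# Dataset.from_pandas(clean)
# Dataset.from_pandas(shuffled, preserve_index=False)
```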


# Step 5: Tokenize Text

In step 5, we will tokenize the review text using a tokenizer.

A tokenizer converts text into numbers to use as the input of the NLP (Natural Language Processing) models. Each number represents a token, which can be a word, part of a word, punctuation, or special tokens. How the text is tokenized is determined by the pretrained model. `AutoTokenizer.from_pretrained("bert-base-cased")` is used to download vocabulary from the pretrained `bert-base-cased` model, meaning that the text will be tokenized like a BERT model.

In [10]:
# Tokenizer from a pretrained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Take a look at the tokenizer
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

We can see that the tokenizer contains information such as model name, vocabulary size, max length, padding position, truncation position, and special tokens.

There are five special tokens for the BERT model. Other models may have different special tokens.
 * Tokens that are not in the BERT vocabulary are mapped to the unknown token. The unknown token is [UNK] and the ID for the unknown token is 100.
 * The separator token is [SEP] and the ID for the separator token is 102.
 * The pad token is [PAD] and the ID for the pad token is 0.
 * The sentence level classification token is [CLS] and the ID for the classification token is 101.
 * The mask token is [MASK] and the ID for the mask token is 103.

In [11]:
# Mapping between special tokens and their IDs.
print(f'The unknown token is {tokenizer.unk_token} and the ID for the unknown token is {tokenizer.unk_token_id}.')
print(f'The separator token is {tokenizer.sep_token} and the ID for the separator token is {tokenizer.sep_token_id}.')
print(f'The pad token is {tokenizer.pad_token} and the ID for the pad token is {tokenizer.pad_token_id}.')
print(f'The sentence level classification token is {tokenizer.cls_token} and the ID for the classification token is {tokenizer.cls_token_id}.')
print(f'The mask token is {tokenizer.mask_token} and the ID for the mask token is {tokenizer.mask_token_id}.')

The unknown token is [UNK] and the ID for the unknown token is 100.
The separator token is [SEP] and the ID for the separator token is 102.
The pad token is [PAD] and the ID for the pad token is 0.
The sentence level classification token is [CLS] and the ID for the classification token is 101.
The mask token is [MASK] and the ID for the mask token is 103.
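
To make the role of the special tokens concrete, here is a toy encoder. It is a sketch, not the real WordPiece algorithm: `toy_encode` and `TOY_VOCAB` are hypothetical, the four word IDs are read off the `input_ids` printed for the first training record later in this tutorial (assuming one token per word), and a real BERT tokenizer would first try to split an unseen word into known sub-word pieces before falling back to [UNK].

```python
# Special-token IDs as printed above
SPECIAL = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103}
# Tiny hypothetical vocabulary for illustration
TOY_VOCAB = {**SPECIAL, "Thanks": 5749, "again": 1254, "to": 1106, "Amazon": 9786}

def toy_encode(text):
    """Whitespace-split the text, look each word up, and wrap the IDs in [CLS] ... [SEP]."""
    ids = [TOY_VOCAB.get(word, SPECIAL["[UNK]"]) for word in text.split()]
    return [SPECIAL["[CLS]"]] + ids + [SPECIAL["[SEP]"]]

print(toy_encode("Thanks again to Amazon"))   # [101, 5749, 1254, 1106, 9786, 102]
print(toy_encode("Thanks again to Zorblax"))  # [101, 5749, 1254, 1106, 100, 102]
```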


After downloading the model vocabulary, we call the `tokenizer` object to tokenize the review corpus.
* `max_length` indicates the maximum number of tokens kept for each document.
 * If the document has more tokens than the `max_length`, it will be truncated (when truncation is enabled).
 * If the document has fewer tokens than the `max_length`, it will be padded with the pad token, whose ID is 0 (when `padding='max_length'`).
 * If `max_length` is unset or set to `None`, the maximum length from the pretrained model will be used. If the pretrained model does not have a maximum length parameter, `max_length` will be deactivated.
* `truncation` controls how the token truncation is implemented. `truncation=True` indicates that the truncation length is the length specified by `max_length`. If `max_length` is not specified, the max_length of the pretrained model is used.
* `padding` means adding zeros to shorter reviews in the dataset. The `padding` argument controls how `padding` is conducted.  
 * `padding=True` is the same as `padding='longest'`. It checks the longest sequence in the batch and pads zeros to that length. There is no padding if only one text document is provided.
 * `padding='max_length'` pads to `max_length` if it is specified, otherwise, it pads to the maximum acceptable input length for the model.
 * `padding=False` is the same as `padding='do_not_pad'`. It is the default, indicating that no padding is applied, so it can output a batch with sequences of different lengths.
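
The truncation and padding rules above can be sketched in plain Python. This is a simplified illustration, assuming the input already starts with [CLS] (101) and ends with [SEP] (102); `pad_or_truncate` is a hypothetical helper, not part of the `transformers` API.

```python
def pad_or_truncate(ids, max_length, pad_id=0, sep_id=102):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Assumes `ids` already starts with [CLS] and ends with [SEP].
    """
    if len(ids) > max_length:
        # Cut the content but keep [SEP] as the final token
        ids = ids[:max_length - 1] + [sep_id]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids = [101, 5749, 1254, 102]
print(pad_or_truncate(short_ids, max_length=6))
# ([101, 5749, 1254, 102, 0, 0], [1, 1, 1, 1, 0, 0])

long_ids = [101, 5749, 1254, 1106, 9786, 1111, 102]
print(pad_or_truncate(long_ids, max_length=5))
# ([101, 5749, 1254, 1106, 102], [1, 1, 1, 1, 1])
```

The second return value mirrors the attention mask: 1 for real tokens and 0 for padding.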

In [12]:
# Function to tokenize data
def tokenize_dataset(data):
    return tokenizer(data["review"],
                     max_length=32,
                     truncation=True,
                     padding="max_length")

# Tokenize the dataset
dataset_train = hg_train_data.map(tokenize_dataset)
dataset_test = hg_test_data.map(tokenize_dataset)

After tokenization, we can see that both the training and the testing Dataset have 6 features, `'review'`, `'label'`, `'__index_level_0__'`, `'input_ids'`, `'token_type_ids'`, and `'attention_mask'`. The number of rows is stored with `num_rows`.

In [13]:
# Take a look at the data
print(dataset_train)
print(dataset_test)

Dataset({
    features: ['review', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 800
})
Dataset({
    features: ['review', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 200
})


`dataset_train[0]` gives us the content for the first record in the training dataset in a dictionary format.
* `'review'` has the review text. The first review of the training dataset is `'Thanks again to Amazon for having the things I need for a good price!'`.
* `'label'` is the label of the classification. The first record is a positive review, so the label is 1.
* `'__index_level_0__'` is the index of the record. 521 means that the first record in the training dataset has the index 521 in the original pandas dataframe.
* `'input_ids'` are the IDs for the tokens. There are 32 token IDs because the `max_length` is 32 for the tokenization.
* `'token_type_ids'` is also called segment IDs.
 * BERT was trained on two tasks, Masked Language Modeling and Next Sentence Prediction. `'token_type_ids'` is for the Next Sentence Prediction, where two sentences are used to predict whether the second sentence is the next sentence for the first one.
 * The first sentence has all the tokens represented by zeros, and the second sentence has all the tokens represented by ones.
 * Because our classification task does not have a second sentence, all the values for `'token_type_ids'` are zeros.
* `'attention_mask'` indicates which token ID should get attention from the model, so the padding tokens are all zeros and other tokens are 1s.
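
The segment IDs are easier to picture with a sentence pair. The sketch below builds `input_ids` and `token_type_ids` by hand, following the BERT layout `[CLS] A [SEP] B [SEP]`. The helper `segment_ids` is hypothetical and the word IDs are only placeholders.

```python
def segment_ids(first_ids, second_ids=None, cls_id=101, sep_id=102):
    """Build input_ids and token_type_ids for one sentence or a sentence pair."""
    input_ids = [cls_id] + first_ids + [sep_id]
    type_ids = [0] * len(input_ids)            # first segment, including [CLS] and its [SEP]
    if second_ids is not None:
        input_ids += second_ids + [sep_id]
        type_ids += [1] * (len(second_ids) + 1)  # second segment and its closing [SEP]
    return input_ids, type_ids

print(segment_ids([5749, 1254]))
# ([101, 5749, 1254, 102], [0, 0, 0, 0])
print(segment_ids([5749, 1254], [1363, 3945]))
# ([101, 5749, 1254, 102, 1363, 3945, 102], [0, 0, 0, 0, 1, 1, 1])
```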



In [14]:
# Check the first record
dataset_train[0]

{'review': 'Thanks again to Amazon for having the things I need for a good price!',
 'label': 1,
 '__index_level_0__': 521,
 'input_ids': [101, 5749, 1254, 1106, 9786, 1111, 1515, 1103, 1614, 146, 1444, 1111, 170, 1363, 3945, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

# Step 6: Load Pretrained Model

In step 6, we will load the pretrained model for sentiment analysis.

* `AutoModelForSequenceClassification` loads the BERT model with a sequence classification head on top. The pretraining head is discarded and a new classification head is added in its place.
* The method `from_pretrained()` loads the weights from the pretrained model into the new model, so the weights in the new model are not randomly initialized. Note that the new weights for the new sequence classification head are going to be randomly initialized.
* `bert-base-cased` is the name of the pretrained model. We can change it to a different model based on the nature of the project.
* `num_labels` indicates the number of classes. Our dataset has two classes, positive and negative, so `num_labels=2`.

In [15]:
# Load model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
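
The two newly initialized tensors named in the message, `classifier.weight` and `classifier.bias`, form a single linear layer on top of BERT. Assuming the hidden size of 768 that `bert-base` models use (it is hard-coded below rather than read from the checkpoint), the arithmetic shows how small this new head is compared with the pretrained body.

```python
hidden_size = 768   # hidden size of bert-base models (assumed here, not read from the checkpoint)
num_labels = 2      # positive and negative

classifier_weight = hidden_size * num_labels   # one weight per (hidden unit, label) pair
classifier_bias = num_labels                   # one bias per label
new_head_parameters = classifier_weight + classifier_bias
print(new_head_parameters)                     # 1538
```

Only these parameters start from random values. Everything else begins from the pretrained weights.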


# Step 7: Set Training Argument

In step 7, we will set the training arguments for the model.

`TrainingArguments` exposes a large number of parameters (96 when this tutorial was written; the exact count varies by `transformers` version), which provides a lot of flexibility in fine-tuning the transfer learning model.
* `output_dir` is the directory to write the model checkpoints and model predictions.
* `logging_dir` is the directory for saving logs.
* `logging_strategy` is the strategy for logging the training information.
 * `'no'` means no logging for the training.
 * `'epoch'` means logging at the end of each epoch.
 * `'steps'` means logging at the end of each `logging_steps`.
* `logging_steps` is the number of steps between two logs. The default is 500.
* `num_train_epochs` is the total number of training epochs. The default value is 3.
* `per_device_train_batch_size` is the batch size per GPU/TPU core/CPU for training. The default value is 8.
* `per_device_eval_batch_size` is the batch size per GPU/TPU core/CPU for evaluation. The default value is 8.
* `learning_rate` is the initial learning rate for AdamW optimizer. The default value is 5e-5.
* `seed` is for reproducibility.
* `save_strategy` is the strategy for saving the checkpoint during training.
 * `'no'` means do not save during training.
 * `'epoch'` means saving at the end of each epoch.
 * `'steps'` means saving at the end of each `save_steps`. `'steps'` is the default value.
* `save_steps` is the number of steps between two checkpoint saves. The default value is 500.
* `eval_strategy` is the strategy for evaluation during training (older releases of `transformers` name this argument `evaluation_strategy`). It's helpful for us to monitor the model performance during model fine-tuning.
 * `'no'` means no evaluation during training. `'no'` is the default value.
 * `'epoch'` means evaluating at the end of each epoch and the evaluation results will be printed out at the end of each epoch.
 * `'steps'` means evaluating and reporting at the end of each `eval_steps`.
* `eval_steps` is the number of steps between two evaluations if `eval_strategy='steps'`. It defaults to the same value as `logging_steps` if not set.
* `load_best_model_at_end=True` indicates that the best model will be loaded at the end of the training. The default is `False`. When it is set to `True`, the `save_strategy` and `eval_strategy` must be the same. When both arguments are `'steps'`, the value of `save_steps` needs to be a round multiple of the value of `eval_steps`.
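
The constraint on `load_best_model_at_end` can be written down as a small check. `check_best_model_config` is a hypothetical helper for illustration only; the library performs its own validation when `TrainingArguments` is created, and its error messages differ from the strings used here.

```python
def check_best_model_config(save_strategy, eval_strategy, save_steps=500, eval_steps=500):
    """Return a list of problems with a load_best_model_at_end=True setup (empty if none)."""
    problems = []
    if save_strategy != eval_strategy:
        problems.append("save_strategy and eval_strategy must match")
    elif save_strategy == "steps" and save_steps % eval_steps != 0:
        problems.append("save_steps must be a round multiple of eval_steps")
    return problems

print(check_best_model_config("epoch", "epoch"))             # []
print(check_best_model_config("steps", "epoch"))
print(check_best_model_config("steps", "steps", 500, 200))
print(check_best_model_config("steps", "steps", 400, 200))   # []
```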

In [16]:
# Training arguments for the Trainer
training_args = TrainingArguments(
    output_dir="./sentiment_transfer_learning_transformer/",
    logging_dir='./sentiment_transfer_learning_transformer/logs',
    logging_strategy='epoch',
    logging_steps=100,  # only used when logging_strategy='steps'
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=5e-6,
    seed=42,
    save_strategy='epoch',
    save_steps=100,  # only used when save_strategy='steps'
    eval_strategy='epoch',
    load_best_model_at_end=True,
    report_to="none"
)


# Step 8: Set Evaluation Metrics

In step 8, we will set the evaluation metric because, without a `compute_metrics` function, the Hugging Face Trainer reports only the loss when it evaluates the model during training.

Hugging Face has an `evaluate` library with over 100 evaluation modules. We can see the list of all the modules using `evaluate.list_evaluation_modules()`.

In [24]:
# Number of evaluation modules
print(f'There are {len(evaluate.list_evaluation_modules())} evaluation modules in Hugging Face.\n')

# List all evaluation metrics
evaluate.list_evaluation_modules()

There are 216 evaluation modules in Hugging Face.



['lvwerra/test',
 'angelina-wang/directional_bias_amplification',
 'cpllab/syntaxgym',
 'lvwerra/bary_score',
 'hack/test_metric',
 'yzha/ctc_eval',
 'codeparrot/apps_metric',
 'mfumanelli/geometric_mean',
 'daiyizheng/valid',
 'erntkn/dice_coefficient',
 'mgfrantz/roc_auc_macro',
 'Vlasta/pr_auc',
 'gorkaartola/metric_for_tp_fp_samples',
 'idsedykh/metric',
 'idsedykh/codebleu2',
 'idsedykh/codebleu',
 'idsedykh/megaglue',
 'Vertaix/vendiscore',
 'GMFTBY/dailydialogevaluate',
 'GMFTBY/dailydialog_evaluate',
 'jzm-mailchimp/joshs_second_test_metric',
 'ola13/precision_at_k',
 'yulong-me/yl_metric',
 'abidlabs/mean_iou',
 'abidlabs/mean_iou2',
 'KevinSpaghetti/accuracyk',
 'NimaBoscarino/weat',
 'ronaldahmed/nwentfaithfulness',
 'Viona/infolm',
 'kyokote/my_metric2',
 'kashif/mape',
 'Ochiroo/rouge_mn',
 'leslyarun/fbeta_score',
 'anz2/iliauniiccocrevaluation',
 'zbeloki/m2',
 'xu1998hz/sescore',
 'dvitel/codebleu',
 'NCSOFT/harim_plus',
 'JP-SystemsX/nDCG',
 'sportlosos/sescore',
 'Dru

Since our dataset is highly balanced, we will use accuracy as the evaluation metric. It can be loaded using `evaluate.load("accuracy")`. After getting predictions from the model, the metric is computed using `metric.compute`.

In [25]:
# Load the accuracy metric once, so it is not reloaded at every evaluation
metric = evaluate.load("accuracy")

# Function to compute the metric
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The class with the largest logit is the predicted label
    predictions = np.argmax(logits, axis=1)
    return metric.compute(predictions=predictions, references=labels)
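
What the accuracy metric computes can be reproduced by hand with NumPy. The logits and labels below are made up for illustration; the calculation is the same `argmax` followed by a comparison with the true labels.

```python
import numpy as np

# Hypothetical logits for four reviews (columns: label 0, label 1) and their true labels
logits = np.array([[-2.3, 2.1], [1.1, -2.0], [0.4, -1.2], [-1.1, 0.6]])
labels = np.array([1, 0, 1, 1])

predictions = np.argmax(logits, axis=1)            # index of the larger logit per row
accuracy = float(np.mean(predictions == labels))   # share of correct predictions
print(predictions, accuracy)                       # [1 0 0 1] 0.75
```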

# Step 9: Train Model Using Transformer Trainer

In step 9, we will train the model using the transformer `Trainer`.
* `model` is the model for training, evaluation, or prediction by the `Trainer`.
* `args` takes the `TrainingArguments` that configure the `Trainer`. If it is not provided, a basic `TrainingArguments` instance is used.
* `train_dataset` is the training dataset. If the dataset is in `Dataset` format, the unused columns will be automatically ignored. In our training dataset, `__index_level_0__` and `review` are not used by the model, so they are ignored.
* `eval_dataset` is the evaluation dataset. Similar to the `train_dataset`, the unused columns will be automatically ignored for the `Dataset` format.
* `compute_metrics` takes the function for calculating evaluation metrics.
* `callbacks` takes a list of callbacks to customize the training loop. `EarlyStoppingCallback` stops the training when the monitored metric has not improved for `early_stopping_patience` consecutive evaluation calls. It is designed to be used together with `load_best_model_at_end=True`, which we set in step 7. There is no practical need to use early stopping because there are only two epochs for the model. It is included as an example code reference.
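
The patience logic behind early stopping can be sketched in a few lines. `first_stop` is a hypothetical helper that mimics the idea on a list of validation losses, where lower is better; it is not the library's implementation, which also supports an improvement threshold and other metrics.

```python
def first_stop(eval_losses, patience):
    """Return the 1-based evaluation at which training would stop, or None if it never stops."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best, bad_evals = loss, 0   # improvement: remember it and reset the counter
        else:
            bad_evals += 1              # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

print(first_stop([0.50, 0.40, 0.45, 0.30], patience=1))  # 3
print(first_stop([0.50, 0.40, 0.45, 0.30], patience=2))  # None
```

With `patience=1`, a single evaluation without improvement is enough to stop, so the later, better loss of 0.30 is never reached. A larger patience tolerates such temporary setbacks.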

In [26]:
# Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train,
    eval_dataset=dataset_test,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)]
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5649 | 0.275119 | 0.91 |
| 2 | 0.2298 | 0.214304 | 0.93 |


TrainOutput(global_step=400, training_loss=0.39739394187927246, metrics={'train_runtime': 57.9379, 'train_samples_per_second': 27.616, 'train_steps_per_second': 6.904, 'total_flos': 26311105536000.0, 'train_loss': 0.39739394187927246, 'epoch': 2.0})

We can see that the accuracy is above 90 percent in just 2 epochs.
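
The `global_step=400` in the `TrainOutput` can be checked with simple arithmetic. The sketch assumes training on a single device without gradient accumulation, which matches the settings used in this tutorial as far as they are shown.

```python
import math

train_records = 800   # size of the training dataset
batch_size = 4        # per_device_train_batch_size
epochs = 2            # num_train_epochs

steps_per_epoch = math.ceil(train_records / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # 200 400
```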

# Step 10: Make Predictions for Text Classification

In step 10, we will talk about how to make predictions using the Hugging Face transformer Trainer model.

Passing the tokenized `Dataset` to the `.predict` method, we get the predictions for the customized transfer learning sentiment model. We can see that the prediction results contain multiple pieces of information.
* `Num examples = 200` indicates that there are 200 reviews in the testing dataset.
* `Batch size = 4` means that 4 reviews are processed each time.
* Under `PredictionOutput`, `predictions` has the logits for each class. Logits are the raw outputs of the last layer of the neural network, before softmax is applied. `label_ids` has the actual labels. Please note that these are not the predicted labels, even though they appear under `PredictionOutput`. We need to calculate the predicted labels from the logit values.
* Under `metrics` there is information about the testing predictions.
 * `test_loss` is the loss for the testing dataset.  
 * `test_accuracy` is the percentage of correct predictions.
 * `test_runtime` is the runtime for testing.
 * `test_samples_per_second` is the number of samples the model can process in one second.
 * `test_steps_per_second` is the number of steps the model can process in one second.


In [27]:
# Predictions
y_test_predict = trainer.predict(dataset_test)

# Take a look at the predictions
y_test_predict

PredictionOutput(predictions=array([[-2.337913  ,  2.0567882 ],
       [-2.0842035 ,  2.1617513 ],
       [-2.3695347 ,  2.0440009 ],
       [ 1.058769  , -2.0527053 ],
       [ 1.1460142 , -2.0395653 ],
       [-2.2710783 ,  2.2040794 ],
       [ 1.0685357 , -1.8443897 ],
       [ 1.0599883 , -2.0149977 ],
       [-2.2190018 ,  2.0326715 ],
       [ 1.106114  , -2.0032737 ],
       [-2.4095082 ,  2.1275632 ],
       [ 0.39245996, -1.2386558 ],
       [-1.1339428 ,  0.6357692 ],
       [-1.5931114 ,  1.2322338 ],
       [-2.2777    ,  2.2067661 ],
       [ 0.84898484, -1.7747142 ],
       [ 0.98557854, -2.139289  ],
       [-2.473114  ,  2.2452862 ],
       [-2.2006621 ,  2.2210886 ],
       [ 1.0887971 , -2.0489047 ],
       [-2.2802224 ,  1.9937029 ],
       [ 0.33930895, -1.2468055 ],
       [-2.4372137 ,  2.188267  ],
       [ 1.1117644 , -1.8673589 ],
       [-1.6723467 ,  1.1742327 ],
       [-2.338886  ,  2.0677106 ],
       [ 1.146054  , -1.9198551 ],
       [ 1.1817392 , -2.07

The predicted logits for the transfer learning text classification model can be extracted using `.predictions`.

We can see that the prediction has two columns. The first column is the predicted logit for label 0 and the second column is the predicted logit for label 1. Unlike probabilities, the logit values in a row do not sum up to 1.

In [28]:
# Predicted logits
y_test_logits = y_test_predict.predictions

# First 5 predicted logits
y_test_logits[:5]

array([[-2.337913 ,  2.0567882],
       [-2.0842035,  2.1617513],
       [-2.3695347,  2.0440009],
       [ 1.058769 , -2.0527053],
       [ 1.1460142, -2.0395653]], dtype=float32)

To get the predicted probabilities, we need to apply softmax on the predicted logit values.

After applying softmax, we can see that the predicted probability for each review sums up to 1.

In [29]:
# Predicted probabilities
y_test_probabilities = tf.nn.softmax(y_test_logits)

# First 5 predicted probabilities
y_test_probabilities[:5]

<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[0.01219209, 0.9878079 ],
       [0.01411983, 0.9858802 ],
       [0.01196733, 0.9880327 ],
       [0.95736355, 0.04263642],
       [0.960288  , 0.03971201]], dtype=float32)>
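
The same probabilities can be reproduced with NumPy alone, which is one way to avoid importing TensorFlow into a PyTorch workflow. The `softmax` function below is a small hand-written sketch; the logits are the first five rows from the prediction output above.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# First five logits from the prediction output above
logits = np.array([[-2.337913, 2.0567882],
                   [-2.0842035, 2.1617513],
                   [-2.3695347, 2.0440009],
                   [1.058769, -2.0527053],
                   [1.1460142, -2.0395653]])

probs = softmax(logits)
print(probs.round(4))
print(probs.sum(axis=1))   # each row sums to 1 (up to floating point rounding)

# Softmax preserves the ordering within a row, so the predicted label is unchanged
print(np.array_equal(probs.argmax(axis=1), logits.argmax(axis=1)))   # True
```

Because the ordering is preserved, `argmax` can be applied directly to the logits when only the predicted labels are needed.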

To get the predicted labels, `argmax` is used to return the index of the maximum probability for each review, which corresponds to the labels of zeros and ones.

In [30]:
# Predicted labels
y_test_pred_labels = np.argmax(y_test_probabilities, axis=1)

# First 5 predicted labels
y_test_pred_labels[:5]

array([1, 1, 1, 0, 0])

The actual labels can be extracted using `y_test_predict.label_ids`.

In [31]:
# Actual labels
y_test_actual_labels = y_test_predict.label_ids

# First 5 actual labels
y_test_actual_labels[:5]

array([1, 1, 1, 0, 0])

# Step 11: Model Performance Evaluation

In step 11, we will evaluate the performance of the transfer learning text classification model.

`trainer.evaluate` is a quick way to get the loss and the accuracy of the testing dataset.

We can see that the model has a loss of 0.21 and an accuracy of 93%.

In [32]:
# Trainer evaluate
trainer.evaluate(dataset_test)

{'eval_loss': 0.21430419385433197,
 'eval_accuracy': 0.93,
 'eval_runtime': 0.9603,
 'eval_samples_per_second': 208.277,
 'eval_steps_per_second': 52.069,
 'epoch': 2.0}

To calculate more model performance metrics, we can use `evaluate.load` to load the metrics of interest.

The results show that the testing dataset has an `f1` value of 0.93 and a `recall` value of 0.89.

In [33]:
# Load f1 metric
metric_f1 = evaluate.load("f1")

# Compute f1 metric
metric_f1.compute(predictions=y_test_pred_labels, references=y_test_actual_labels)

{'f1': 0.9263157894736842}

In [None]:
# Load recall metric
metric_recall = evaluate.load("recall")

# Compute recall metric
metric_recall.compute(predictions=y_test_pred_labels, references=y_test_actual_labels)

{'recall': 0.8877551020408163}
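
For reference, recall and F1 can be computed by hand from the counts of true positives, false positives, and false negatives. The sketch below uses made-up labels and a hypothetical helper `binary_scores`; it treats label 1 as the positive class, which I believe matches the default of the `f1` and `recall` metrics for binary labels.

```python
def binary_scores(predictions, references, positive=1):
    """Precision, recall, and F1 for one positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical predictions and actual labels
predictions = [1, 1, 0, 0, 1, 0]
references = [1, 0, 0, 1, 1, 1]
scores = binary_scores(predictions, references)
print(scores)   # precision 2/3, recall 1/2, f1 4/7
```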

# Step 12: Save and Load Model

In step 12, we will talk about how to save the model and reload it for prediction.

`tokenizer.save_pretrained` saves the tokenizer information to disk and `trainer.save_model` saves the model to disk.

In [34]:
# Save tokenizer
tokenizer.save_pretrained('./sentiment_transfer_learning_transformer/')

# Save model
trainer.save_model('./sentiment_transfer_learning_transformer/')

We can load the saved tokenizer later using `AutoTokenizer.from_pretrained()` and load the saved model using `AutoModelForSequenceClassification.from_pretrained()`.

In [35]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./sentiment_transfer_learning_transformer/")

# Load model
loaded_model = AutoModelForSequenceClassification.from_pretrained('./sentiment_transfer_learning_transformer/')
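
Predictions from a reloaded model still come out as one score per class, so a last step is to map the winning class back to a readable sentiment. The sketch below shows that mapping on its own, using probabilities like the ones computed in step 10. `ID2LABEL` and `to_sentiment` are hypothetical names for illustration and are not part of the saved model.

```python
# Label encoding of the review data: 0 is negative, 1 is positive
ID2LABEL = {0: "negative", 1: "positive"}

def to_sentiment(probabilities):
    """Turn one row of class probabilities into a readable prediction."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {"label": ID2LABEL[best], "score": probabilities[best]}

print(to_sentiment([0.0122, 0.9878]))   # {'label': 'positive', 'score': 0.9878}
print(to_sentiment([0.9574, 0.0426]))   # {'label': 'negative', 'score': 0.9574}
```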

# References

* [Hugging Face documentation on fine-tuning a pretrained model](https://huggingface.co/docs/transformers/training)
* [Hugging Face notebook on fine-tuning a pretrained model](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb)
* [Hugging Face documentation on tokenizer](https://huggingface.co/transformers/v3.5.1/main_classes/tokenizer.html)
* [Deeplearning.AI transfer learning video from Andrew Ng](https://www.youtube.com/watch?v=yofjFQddwHE)
* [Hugging Face TensorFlow predictions and metrics video](https://youtu.be/nx10eh4CoOs)
* [Hugging Face documentation on prepare_tf_dataset()](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb)
* [Hugging Face documentation on transformers.Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer)
* [Hugging Face documentation on Datasets](https://huggingface.co/docs/datasets/v1.7.0/index.html#:~:text=Datasets%20and%20evaluation%20metrics%20for,Natural%20Language%20Processing%20(NLP).)