# Pre-Trained Model Demo

In this Python notebook, we show how to use a pre-trained model to predict the sentiment of a sentence.
This notebook goes together with our State of the Art paper on [Transformer Models](https://github.com/Lorenc1o/transformer_models_SoE/blob/main/22_23_BDS_Transformers.pdf) for the [eBISS 2023](https://cs.ulb.ac.be/conferences/ebiss2023/index.html).

## 1. Install Dependencies

First, we need to install the dependencies. We will use the [transformers](https://huggingface.co/docs/transformers/index) library from HuggingFace, which provides many pre-trained models, one of which we will fine-tune for our task. From this library we will also use the BERT tokenizer. A tokenizer is a function that splits a sentence into tokens, which are the basic units the model works with. For example, the sentence "I love transformers" can be tokenized into the following tokens: ["I", "love", "transformers"]. In practice, BERT's tokenizer may also split rare words into smaller sub-word pieces. Each token is then mapped to a vector in a continuous space, called an embedding, and these embeddings are the input to the model.
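To make the idea of tokenization more concrete, here is a minimal sketch of greedy longest-match sub-word splitting, the general strategy behind WordPiece tokenizers such as BERT's. The vocabulary `TOY_VOCAB` and the helper names are hypothetical and chosen only for illustration; the real tokenizer uses a much larger vocabulary and may split words differently.

```python
# Toy vocabulary: hypothetical, far smaller than a real BERT vocabulary.
TOY_VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "transform": 3, "##ers": 4, "##s": 5, "movie": 6}

def wordpiece_tokenize(word, vocab):
    """Greedily split one lowercase word into the longest known sub-word pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

def tokenize(sentence, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, then split each word into sub-word pieces."""
    tokens = []
    for word in sentence.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens

tokens = tokenize("I love transformers")
ids = [TOY_VOCAB[t] for t in tokens]  # integer ids are what the embedding layer looks up
print(tokens)
print(ids)
```

With this toy vocabulary, "transformers" is not a known word, so it is split into two known pieces. The integer ids are what the model's embedding layer would then turn into vectors.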

We will also use the [datasets](https://huggingface.co/docs/datasets/index.html) library from HuggingFace, which provides a convenient way to load the [IMDB](https://huggingface.co/datasets/imdb) dataset.

In [1]:
!pip install torch transformers datasets
!pip install accelerate -U

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


## 2. Load the dataset and prepare the data

We will use the [IMDB](https://huggingface.co/datasets/imdb) dataset, which contains 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/). Each review is labeled as either positive or negative.

Our objective is to train a model that can predict the sentiment of a movie review. 

First, we will use a pre-trained model to predict the sentiment of a movie review. Then, we will fine-tune the model on the IMDB dataset and compare the results.

In [2]:
from datasets import load_dataset

dataset = load_dataset('imdb')

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset imdb (/home/jose/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
100%|██████████| 3/3 [00:00<00:00, 886.25it/s]


Let's take a quick look at the dataset. We shuffle it and print 5 random examples together with their labels.

In [18]:
# Select 5 random samples. The seed is passed to shuffle() directly,
# because random.seed() does not affect how the datasets library shuffles.
dataset = dataset.shuffle(seed=42)
for i in range(5):
    print('---')
    print(dataset['train'][i]['text'])
    print('negative' if dataset['train'][i]['label'] == 0 else 'positive')

---
There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
positive
---
This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe on the moon" It 

We can observe the nature of the dataset. Each example is a movie or series review, and the label is either positive or negative. The reviews are quite long and contain a lot of information. This task is easy for us, but it is not so easy for a machine. We will see how we can train a model to perform this task.

Now, we need to tokenize the sentences, so that our models can understand them. 

In [3]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    """Tokenize a batch of reviews.

    Input:
        - examples, a dictionary whose 'text' key holds the texts to tokenize

    The tokenizer is called with:
        - padding='max_length': pad every sequence with padding tokens up to
          max_length, so that all sequences in a batch have the same length.
        - truncation=True: cut any input that is longer than max_length.
        - max_length=128: the length every sequence ends up with.

    Output: the tokenized text (input_ids, token_type_ids, attention_mask)
    """
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at /home/jose/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-2f019c60187261be.arrow
Loading cached processed dataset at /home/jose/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-271ea7d24566f69c.arrow
Loading cached processed dataset at /home/jose/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-76898d4e76125c65.arrow
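The effect of padding and truncation can be illustrated with a small stand-alone sketch. The function `pad_or_truncate` and the token ids below are hypothetical and only mimic the idea; the real tokenizer additionally handles special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    pad_id=0 is an assumption made for this sketch.
    """
    ids = token_ids[:max_length]      # truncation: drop everything past max_length
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_length - len(ids)     # padding: how many slots are still empty
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions hold real tokens and which are padding that should be ignored.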


## 3. Select a pre-trained model and load it

Now, we are going to load a pre-trained model. We will use the [BERT](https://huggingface.co/transformers/model_doc/bert.html) model, which is a transformer model that was pre-trained on a large corpus of text.

Of course, there are alternatives, such as the [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html) model, which is a variant of BERT that was trained on a larger corpus of text, or the [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) model, which is a smaller and faster version of BERT.

In [4]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

2023-06-27 12:13:45.777585: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequence

## 4. Evaluate the vanilla model

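At this point, we can evaluate the model on the test set. We will use scikit-learn's `classification_report`, which shows the precision, recall and f1-score of each class, together with the overall [accuracy](https://huggingface.co/metrics/accuracy), that is, the fraction of correct predictions.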

Then, we will be able to compare the results with the fine-tuned model.

In [8]:
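import torch
from sklearn.metrics import classification_report

def evaluate_model(model, dataset, tokenizer, label_names, batch_size=32):
    model.eval()  # disable dropout for deterministic predictions
    true_labels = []
    predictions = []

    # Slicing a Dataset returns a dictionary of lists, i.e. a real mini-batch
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        inputs = tokenizer(batch['text'], return_tensors="pt", padding=True, truncation=True, max_length=128)
        inputs = inputs.to(model.device)  # keep the inputs on the same device as the model
        with torch.no_grad():  # no gradients are needed for evaluation
            outputs = model(**inputs)
        predictions.extend(outputs.logits.argmax(dim=1).tolist())
        true_labels.extend(batch['label'])

    print(classification_report(true_labels, predictions, target_names=label_names))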
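The step `outputs.logits.argmax(dim=1)` picks, for each review, the class with the highest score. A minimal pure-Python sketch of that step, with a softmax added to turn the scores into probabilities, could look as follows. The logit values and the helper names are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, label_names=("negative", "positive")):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over the classes
    return label_names[best], probs[best]

# Hypothetical logits for one review: index 0 = negative, index 1 = positive
label, confidence = predict_label([-0.3, 1.2])
print(label, round(confidence, 3))
```

Since softmax preserves the order of the scores, taking the argmax of the logits directly, as the evaluation function does, gives the same label.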

In [6]:
# Evaluate the vanilla model
print("\nPerformance of the Vanilla Model:")
evaluate_model(model, tokenized_datasets['test'], tokenizer, ['negative', 'positive'])


Performance of the Vanilla Model:
              precision    recall  f1-score   support

    negative       0.54      0.04      0.07     12500
    positive       0.50      0.97      0.66     12500

    accuracy                           0.50     25000
   macro avg       0.52      0.50      0.37     25000
weighted avg       0.52      0.50      0.37     25000



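As we can see, the recall for the negative class is just 0.04, while the recall for the positive class is 0.97: the model labels almost every review as positive. This is expected, because only the BERT encoder is pre-trained; the classification layer that `BertForSequenceClassification` adds on top of it starts from random weights. Let's see if we can improve this result by fine-tuning the model.

As for the accuracy, we obtain a poor 0.50, which is no better than random guessing on this balanced test set, so we cannot rely on the vanilla model for this task.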
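To see how precision, recall and f1-score behave when a classifier predicts almost only one class, here is a small sketch that computes them by hand. The labels below are made up for illustration and are not taken from the IMDB results; `precision_recall_f1` is a hypothetical helper, not part of scikit-learn.

```python
def precision_recall_f1(y_true, y_pred, target):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)  # correctly detected
    fp = sum(1 for t, p in pairs if t != target and p == target)  # false alarms
    fn = sum(1 for t, p in pairs if t == target and p != target)  # missed cases
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up example: 0 = negative, 1 = positive; the classifier says "positive" almost always
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [1, 1, 1, 0, 1, 1, 1, 1]

for target, name in ((0, "negative"), (1, "positive")):
    p, r, f = precision_recall_f1(y_true, y_pred, target)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case the positive class gets a perfect recall only because nearly everything is labeled positive, while the negative class is almost never detected. This is the same pattern as in the report above.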

## 5. Fine-tune the model

Now, we will fine-tune the model on the IMDB dataset. We will use the [Trainer](https://huggingface.co/transformers/main_classes/trainer.html) class from the transformers library, which provides a convenient way to fine-tune a model.

In [5]:
from transformers import TrainingArguments, Trainer

# Training arguments for fine-tuning
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Fine-tune the vanilla model (this will take a while)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
)

trainer.train()

  5%|▌         | 500/9375 [17:34<5:13:51,  2.12s/it]

{'loss': 0.5194, 'learning_rate': 5e-05, 'epoch': 0.16}


 11%|█         | 1000/9375 [35:24<4:49:17,  2.07s/it]

{'loss': 0.4712, 'learning_rate': 4.71830985915493e-05, 'epoch': 0.32}


 16%|█▌        | 1500/9375 [52:38<4:30:12,  2.06s/it]

{'loss': 0.4143, 'learning_rate': 4.436619718309859e-05, 'epoch': 0.48}


 21%|██▏       | 2000/9375 [1:09:40<4:15:00,  2.07s/it]

{'loss': 0.3892, 'learning_rate': 4.154929577464789e-05, 'epoch': 0.64}


 27%|██▋       | 2500/9375 [1:26:44<3:52:57,  2.03s/it]

{'loss': 0.366, 'learning_rate': 3.8732394366197184e-05, 'epoch': 0.8}


 32%|███▏      | 3000/9375 [1:43:42<3:34:07,  2.02s/it]

{'loss': 0.3652, 'learning_rate': 3.5915492957746486e-05, 'epoch': 0.96}


 37%|███▋      | 3500/9375 [2:00:52<3:23:14,  2.08s/it]

{'loss': 0.3024, 'learning_rate': 3.3098591549295775e-05, 'epoch': 1.12}


 43%|████▎     | 4000/9375 [2:18:09<3:06:48,  2.09s/it]

{'loss': 0.2816, 'learning_rate': 3.028169014084507e-05, 'epoch': 1.28}


 48%|████▊     | 4500/9375 [2:35:34<2:53:49,  2.14s/it]

{'loss': 0.2937, 'learning_rate': 2.746478873239437e-05, 'epoch': 1.44}


 53%|█████▎    | 5000/9375 [2:52:36<2:30:23,  2.06s/it]

{'loss': 0.2793, 'learning_rate': 2.4647887323943664e-05, 'epoch': 1.6}


 59%|█████▊    | 5500/9375 [3:09:54<2:16:06,  2.11s/it]

{'loss': 0.2936, 'learning_rate': 2.1830985915492956e-05, 'epoch': 1.76}


 64%|██████▍   | 6000/9375 [3:27:01<1:55:10,  2.05s/it]

{'loss': 0.2666, 'learning_rate': 1.9014084507042255e-05, 'epoch': 1.92}


 69%|██████▉   | 6500/9375 [3:44:11<1:39:51,  2.08s/it]

{'loss': 0.2068, 'learning_rate': 1.619718309859155e-05, 'epoch': 2.08}


 75%|███████▍  | 7000/9375 [4:01:30<1:21:01,  2.05s/it]

{'loss': 0.129, 'learning_rate': 1.3380281690140845e-05, 'epoch': 2.24}


 80%|████████  | 7500/9375 [4:18:41<1:04:45,  2.07s/it]

{'loss': 0.1335, 'learning_rate': 1.056338028169014e-05, 'epoch': 2.4}


 85%|████████▌ | 8000/9375 [4:36:03<47:29,  2.07s/it]  

{'loss': 0.1369, 'learning_rate': 7.746478873239436e-06, 'epoch': 2.56}


 91%|█████████ | 8500/9375 [4:53:22<29:53,  2.05s/it]

{'loss': 0.1447, 'learning_rate': 4.929577464788732e-06, 'epoch': 2.72}


 96%|█████████▌| 9000/9375 [5:10:50<12:56,  2.07s/it]

{'loss': 0.1279, 'learning_rate': 2.112676056338028e-06, 'epoch': 2.88}


100%|██████████| 9375/9375 [5:24:40<00:00,  2.08s/it]

{'train_runtime': 19480.7727, 'train_samples_per_second': 3.85, 'train_steps_per_second': 0.481, 'train_loss': 0.27903265055338544, 'epoch': 3.0}





TrainOutput(global_step=9375, training_loss=0.27903265055338544, metrics={'train_runtime': 19480.7727, 'train_samples_per_second': 3.85, 'train_steps_per_second': 0.481, 'train_loss': 0.27903265055338544, 'epoch': 3.0})
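The numbers in the training log can be reproduced with a little arithmetic. The sketch below assumes 25,000 training reviews, a single device, no gradient accumulation, a peak learning rate of 5e-05 (the value logged at step 500) and a schedule that warms up linearly and then decays linearly to zero. These assumptions are consistent with the log above, but they are not stated explicitly in the training arguments, and the helper names are hypothetical.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs

def linear_warmup_decay_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, followed by linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = total_training_steps(num_examples=25_000, batch_size=8, num_epochs=3)
print(total)  # 9375, the total shown in the progress bar

# Compare with the 'learning_rate' values logged at these steps
for step in (500, 1000, 9000):
    lr = linear_warmup_decay_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=total)
    print(step, lr)
```

Under these assumptions, the learning rate peaks at the end of the 500 warmup steps and then shrinks at a constant rate, which matches the evenly spaced `learning_rate` values in the log.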

We can save the fine-tuned model to disk, so that we can load it later and use it to make predictions.
This is not strictly necessary, but training took about 5.4 hours here (see `train_runtime` above), so loading the saved model from disk is much cheaper than training it again.

In [6]:
model.save_pretrained('./fine-tuned_model')
tokenizer.save_pretrained('./tokenizer')

('./tokenizer/tokenizer_config.json',
 './tokenizer/special_tokens_map.json',
 './tokenizer/vocab.txt',
 './tokenizer/added_tokens.json')

Finally, we evaluate the fine-tuned model on the test set and compare the results with the vanilla model.

In [11]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer
loaded_model = AutoModelForSequenceClassification.from_pretrained('./fine-tuned_model')
loaded_tokenizer = AutoTokenizer.from_pretrained('./tokenizer')

# Evaluate the fine-tuned model
print("\nPerformance of the Fine-tuned Model:")
evaluate_model(loaded_model, tokenized_datasets['test'], loaded_tokenizer, ['negative', 'positive'])


Performance of the Fine-tuned Model:
              precision    recall  f1-score   support

    negative       0.88      0.89      0.88     12500
    positive       0.89      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000



Observe how the precision for the negative class has increased from 0.54 to 0.88, and the recall from 0.04 to 0.89. This means that the fine-tuned model is much better at detecting negative reviews. The f1-score for the negative class has also increased from 0.07 to 0.88.

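As for the positive class, the precision has gone from 0.50 to 0.89, while the recall has dropped from 0.97 to 0.88. The high recall of the vanilla model was misleading, since it came from labeling almost every review as positive. The f1-score, which balances precision and recall, has increased from 0.66 to 0.88. This means that the fine-tuned model is also better at detecting positive reviews, although the improvement is not as big as for the negative class.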

Also, the overall accuracy has increased from 0.50 to 0.88, which means that the fine-tuned model is much better at predicting the sentiment of a movie review.

## 6. Conclusion

In this notebook, we have shown how to use a pre-trained model to predict the sentiment of a sentence. We have also shown how to fine-tune the model on a dataset and how to evaluate the fine-tuned model on a test set.

We have observed that the fine-tuned model is much better at predicting the sentiment of a movie review than the vanilla model. As we explained in our paper, this approach is very useful, since generic models that take a long time to train can be fine-tuned on a specific task and achieve state-of-the-art results, reducing the time and resources required to train a model from scratch.