# BERT - IMDB Sentiment Analysis

In [None]:
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install accelerate

In [5]:
import numpy as np

import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertTokenizer,
    BertTokenizerFast,
    DataCollatorWithPadding,
)

from transformers import (
    pipeline,
    Trainer,
    TrainingArguments,
    AutoModel,
    AutoModelForSequenceClassification,
)

In [4]:
from huggingface_hub import notebook_login
notebook_login()


# Dataset

## Loading

In [3]:
imdb = load_dataset("imdb")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## Analyzing

The IMDb dataset [\[2\]](https://huggingface.co/datasets/stanfordnlp/imdb) is partitioned into train and test splits of 25,000 samples each. Each sample is a dictionary with two fields, ***text*** and ***label***.

- Positive Review: label 1
- Negative Review: label 0

In [4]:
# train/test dataset size
print("Train size: ", len(imdb["train"]))
print("Test size: ", len(imdb["test"]))
print()

train_set = imdb["train"]
test_set = imdb["test"]

# a data sample = <text, label> dict
print("Type of a sample: ", type(train_set[0]))
print("Text: ", train_set[0]["text"])
print("Label: ", train_set[0]["label"])

Train size:  25000
Test size:  25000

Type of a sample:  <class 'dict'>
Text:  I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was consider

## Text Preprocessing - Tokenizer

### 1. Structure of Tokenizer:

In HuggingFace, tokenizers are generally provided in two different forms:

 - Standard tokenizers → python-based
 - Fast tokenizers → rust-based

The Rust implementation makes fast tokenizers considerably quicker, especially for batched tokenization. Fast tokenizers also offer more advanced alignment methods that map between the original string and the token space [\[3, ](https://huggingface.co/docs/transformers/main_classes/tokenizer)[4\]](https://discuss.huggingface.co/t/difference-betweeen-distilberttokenizerfast-and-distilberttokenizer/5961/2).

Here, an uncased BERT tokenizer [\[5\]](https://huggingface.co/google-bert/bert-base-uncased) is loaded to split the input text into a series of tokens. The function `imdb.map()` is suitable for this: it applies a pre-processing function to the samples of every split, in batches when `batched=True` [\[6\]](https://huggingface.co/docs/datasets/en/about_map_batch). For this purpose we define the `tokenize()` function, which calls the tokenizer instance and thereby invokes the `__call__()` method of the *PreTrainedTokenizerBase* class [\[7\]](https://huggingface.co/docs/transformers/internal/tokenization_utils). It tokenizes the text and truncates sequences that are longer than the maximum length BERT accepts.



In [5]:
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

print("It is BertTokenizer: ", isinstance(tokenizer, BertTokenizer))
print("It is BertTokenizerFast: ", isinstance(tokenizer, BertTokenizerFast))

# Inheritance: BertTokenizerFast -> PreTrainedTokenizerFast -> PreTrainedTokenizerBase
# tokenize() calls the tokenizer instance directly, which invokes the
# __call__ method of the PreTrainedTokenizerBase class [7].
# That method tokenizes one or several sequences for the model
# and returns a "BatchEncoding" object [7].

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(tokenize, batched=True)

It is BertTokenizer:  False
It is BertTokenizerFast:  True
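
As a rough illustration of what `truncation=True` does for a single sequence, the sketch below uses plain Python rather than the tokenizer's actual implementation. It assumes BERT's maximum input length of 512 positions and the `[CLS]` and `[SEP]` special tokens that wrap each sequence; `truncate_for_bert` is a hypothetical helper name.

```python
MAX_LEN = 512  # maximum sequence length of bert-base-uncased


def truncate_for_bert(tokens, max_len=MAX_LEN):
    """Keep at most max_len tokens in total, including [CLS] and [SEP]."""
    body = tokens[: max_len - 2]  # reserve two positions for the special tokens
    return ["[CLS]"] + body + ["[SEP]"]


long_review = ["word"] * 1000   # stand-in for a very long tokenized review
short_review = ["great", "movie"]

print(len(truncate_for_bert(long_review)))  # 512
print(truncate_for_bert(short_review))      # ['[CLS]', 'great', 'movie', '[SEP]']
```

Short reviews pass through unchanged; only the tail of an over-long review is dropped.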



### 2. Tokens and Vocabulary:

The function `map()` generates a new dataset in which each text is represented by a series of tokens. With the `batched=True` option, multiple elements of the dataset are processed at once.
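
To see why `tokenize()` indexes `examples["text"]` and receives a list, the following sketch mimics batched mapping in plain Python. `batched_map` and `count_words` are hypothetical helpers, not part of the `datasets` library; the real `map()` uses a default batch size of 1000 and is considerably more elaborate.

```python
def batched_map(rows, fn, batch_size=2):
    """Apply fn to column-wise batches and merge the returned columns into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start : start + batch_size]
        # A batch is a dict of lists: one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_cols.items()})
            out.append(merged)
    return out


def count_words(examples):
    # examples["text"] is a list of strings, one per row in the batch
    return {"num_words": [len(text.split()) for text in examples["text"]]}


rows = [
    {"text": "a great film", "label": 1},
    {"text": "dull", "label": 0},
    {"text": "not bad at all", "label": 1},
]
mapped = batched_map(rows, count_words)
print([row["num_words"] for row in mapped])  # [3, 1, 4]
```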

The tokens are either frequent whole words or sub-words. This approach is called *WordPiece*: prefixes, suffixes, and other fragments that recur across many English words are chosen as tokens, and combining them yields distinct words, so not every word has to be stored as a token. Each token is represented by an index, which an *Embedding* layer maps to a vector that is fed into the model [\[8, ](https://www.youtube.com/watch?v=zHvTiHr506c)[9\]](https://huggingface.co/docs/transformers/en/tokenizer_summary).
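
The splitting step can be sketched as a greedy longest-match-first search over the vocabulary, where pieces that continue a word carry a `##` prefix. This is a simplified illustration, not the library's implementation: the vocabulary below is a made-up toy set, and `wordpiece` is a hypothetical function name.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical); the real BERT vocabulary has about 30k entries.
toy_vocab = {"play", "##ing", "##ed", "un", "##break", "##able", "movie"}

print(wordpiece("playing", toy_vocab))      # ['play', '##ing']
print(wordpiece("unbreakable", toy_vocab))  # ['un', '##break', '##able']
print(wordpiece("xyz", toy_vocab))          # ['[UNK]']
```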

Tokenizers expose their vocabulary through the `get_vocab()` function, which returns a dict of token-index pairs. You can invert it to map from indices back to textual tokens, or count its entries to get the vocabulary size.


**Tokenizers and Models:**

* ***Unigram:*** XLNet, ALBERT
* ***WordPiece:*** BERT, DistilBERT
* ***Byte-Pair Encoding:*** GPT-2, RoBERTa


In [6]:
# The tokenized dataset has a new data field, "input_ids".
# This field is a list of indices, each referring to one token of the raw string.
tokenized_train_set = tokenized_imdb["train"]
print("Sample 1 - Text: ", tokenized_train_set[0]["text"])
print("Sample 1 - Token indices: ", tokenized_train_set[0]["input_ids"])

# Getting vocabulary and inv vocabulary of BERT
vocabulary = tokenizer.get_vocab()
inv_vocab = {index: token for token, index in vocabulary.items()}

# Mapping indices to tokens for sample 1
tokens = [inv_vocab[index] for index in tokenized_train_set[0]["input_ids"]]
print("Sample 1 - Tokens: ", tokens)

# BERT - Vocabulary Size
print("Vocabulary size: ", len(vocabulary))

Sample 1 - Text:  I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few an

### 3. Batching Tokens

* In general, the sentences or paragraphs processed by a language model have different lengths, so after tokenization each one has a different number of tokens. This is a problem because batched inputs must be fixed-size tensors, which is where padding comes in: a special padding token is appended so that all sequences have the same length, either that of the longest sequence or the maximum length accepted by the model [\[10\]](https://huggingface.co/docs/transformers/en/pad_truncation#).

* Our goal is to create batches of padded samples. [`DataCollatorWithPadding`](https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/data_collator#transformers.DataCollatorWithPadding) does exactly this: it dynamically pads the sequences to the longest length in the batch during collation. Alternatively, every sample in the dataset could be padded to the maximum length, but that is unnecessary. Only the samples within the same batch need to share a length, so dynamic padding per batch is more efficient.
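
A minimal sketch of per-batch padding in plain Python is shown here; it is not the collator's actual code. It assumes `0` as the padding id (the id of `[PAD]` in `bert-base-uncased`) and uses `101`/`102` for `[CLS]`/`[SEP]`; the remaining ids are arbitrary, and `pad_batch` is a hypothetical helper.

```python
PAD_ID = 0  # id of [PAD] in bert-base-uncased


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2307, 3185, 102], [101, 102]])
print(batch["input_ids"])       # [[101, 2307, 3185, 102], [101, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Because the target length is computed per batch, a batch of short reviews is never padded out to the length of the longest review in the whole dataset.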

In [7]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Evaluate

To evaluate the model during training and validation, we need some metrics. The most commonly used ones for classification are accuracy, precision, and recall. We can use the HF [evaluate](https://huggingface.co/docs/evaluate/en/index) library.

We implement the `compute_metrics()` function, which is invoked automatically to calculate these three metrics on the evaluation set during training. When multiple metrics are calculated in this function, it must return them in a dictionary [\[11, ](https://huggingface.co/docs/transformers/en/main_classes/trainer)[12\]](https://discuss.huggingface.co/t/combine-multiple-metrics-in-compute-metrics-for-validation/90088).

In [8]:
recall = evaluate.load("recall")
precision = evaluate.load("precision")
accuracy = evaluate.load("accuracy")


def compute_metrics(preds_labels) -> dict:
    preds, labels = preds_labels
    preds = np.argmax(preds, axis=1)  # expected shape: (B, 2)

    acc = accuracy.compute(predictions=preds, references=labels)["accuracy"]
    rec = recall.compute(predictions=preds, references=labels)["recall"]
    pre = precision.compute(predictions=preds, references=labels)["precision"]

    return {"accuracy": acc, "recall": rec, "precision": pre}
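
To make the quantities returned by `compute_metrics()` concrete, the sketch below recomputes them by hand for the positive class (label 1) from a few made-up logits. It is an illustration only: `binary_metrics` is a hypothetical helper, the logits and labels are toy values, and the zero-division cases that a library implementation must handle are ignored.

```python
def binary_metrics(logits, labels):
    """Accuracy, recall and precision for the positive class (label 1)."""
    preds = [row.index(max(row)) for row in logits]  # argmax over the two classes
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # true positives
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))  # false positives
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # false negatives
    correct = sum(p == y for p, y in zip(preds, labels))
    return {
        "accuracy": correct / len(labels),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 1]
print(binary_metrics(toy_logits, toy_labels))
# {'accuracy': 0.5, 'recall': 0.5, 'precision': 0.5}
```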

# Training

The BERT classifier extends the BERT architecture [\[13\]](https://huggingface.co/docs/transformers/en/model_doc/bert) with a classification head on top of the pooled output. The head maps the embedding space to the class space, producing one logit per class; a softmax over these logits gives the class probabilities.

The parameters of BERT and of the BERT classifier are each collected in a dict and counted, and their names are then printed side by side. The names match one to one (those of the classifier carry a `bert.` prefix), except that the BERT classifier has an additional weight and bias for the classification head.
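
The extra weight and bias can be pictured as a single linear map from the pooled vector to one score per class. The sketch below is a plain-Python stand-in under stated assumptions: a hidden size of 768 (as in `bert-base-uncased`), random values in place of real weights and activations, and the dropout that precedes the linear layer in the library model left out. `classify` is a hypothetical name.

```python
import random

HIDDEN_SIZE = 768  # hidden size of bert-base-uncased
NUM_LABELS = 2

random.seed(0)
# Newly initialized head: weight of shape (NUM_LABELS, HIDDEN_SIZE) plus one bias per label.
weight = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS


def classify(pooled_output):
    """Map one pooled vector to one raw score (logit) per class."""
    return [
        sum(w * x for w, x in zip(row, pooled_output)) + b
        for row, b in zip(weight, bias)
    ]


pooled = [random.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]  # stand-in for BERT's pooled output
logits = classify(pooled)

extra_params = NUM_LABELS * HIDDEN_SIZE + NUM_LABELS
print(len(logits))   # 2
print(extra_params)  # 1538
```

These two tensors account for the difference between the 199 and 201 parameter groups reported by the cell.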



In [9]:
# Dictionaries to map from class ids to corresponding labels or vice versa.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# instantiating bert model and its text classifier version
bert = AutoModel.from_pretrained("google-bert/bert-base-uncased")
bert_classifier = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased",
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# collecting the named parameter tensors of BERT
bert_dict = dict(bert.named_parameters())
print("\nNumber of param groups in BERT: ", len(bert_dict))

# collecting the named parameter tensors of the BERT classifier
bert_classifier_dict = dict(bert_classifier.named_parameters())
print("Number of param groups in BERT Classifier: ", len(bert_classifier_dict))

# Comparison between their parameter names.
# Two empty names pad the BERT list so that zip() also reaches
# the two extra classifier entries.
layer_names1 = list(bert_dict.keys()) + ["", ""]
layer_names2 = list(bert_classifier_dict.keys())

print("\nLayer Weights:")
print("--------------")
for name1, name2 in zip(layer_names1, layer_names2):
    print(name1, "\t", name2)

print(bert_classifier_dict["classifier.weight"].shape)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Number of param groups in BERT:  199
Number of param groups in BERT Classifier:  201

Layer Weights:
--------------
embeddings.word_embeddings.weight 	 bert.embeddings.word_embeddings.weight
embeddings.position_embeddings.weight 	 bert.embeddings.position_embeddings.weight
embeddings.token_type_embeddings.weight 	 bert.embeddings.token_type_embeddings.weight
embeddings.LayerNorm.weight 	 bert.embeddings.LayerNorm.weight
embeddings.LayerNorm.bias 	 bert.embeddings.LayerNorm.bias
encoder.layer.0.attention.self.query.weight 	 bert.encoder.layer.0.attention.self.query.weight
encoder.layer.0.attention.self.query.bias 	 bert.encoder.layer.0.attention.self.query.bias
encoder.layer.0.attention.self.key.weight 	 bert.encoder.layer.0.attention.self.key.weight
encoder.layer.0.attention.self.key.bias 	 bert.encoder.layer.0.attention.self.key.bias
encoder.layer.0.attention.self.value.weight 	 bert.encoder.layer.0.attention.self.value.weight
encoder.layer.0.attention.self.value.bias 	 bert.encoder.

## Training Arguments and Trainer

To train the model, we need to set the hyperparameters, which is done through `TrainingArguments` [\[11\]](https://huggingface.co/docs/transformers/en/main_classes/trainer).

- ***output_dir:*** The directory where model predictions and checkpoints will be saved.

- ***per_device_train_batch_size:*** The training batch size per device (GPU/TPU core/CPU). With multiple devices, the effective batch size is this value times the number of devices.

- ***per_device_eval_batch_size:*** The evaluation batch size per device (GPU/TPU core/CPU).

- ***num_train_epochs:*** The number of epochs the model will be trained for.

- ***eval_strategy:*** It can be "no", "steps", or "epoch".
    - no: No evaluation is done during training.
    - steps: Evaluation is done and logged every *"eval_steps"* steps.
    - epoch: Evaluation is done at the end of each epoch.


- ***save_strategy:*** It can be "no", "steps", "epoch", or "best".
    - no: No checkpoint is saved during training.
    - steps: A checkpoint is saved every *"save_steps"* steps.
    - epoch: A checkpoint is saved at the end of each epoch.
    - best: A checkpoint is saved whenever a new *"best_metric"* is achieved.

- ***load_best_model_at_end:*** A boolean that controls whether the best checkpoint found during training, judged by `metric_for_best_model`, is loaded at the end of training. When `metric_for_best_model` is not set, the evaluation loss is used.
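
The effect of `load_best_model_at_end=True` can be sketched with the per-epoch numbers from the training log further down in this notebook. The assumption here is that `metric_for_best_model` is left at its default, so the evaluation loss decides and lower is better; `pick_best` is a hypothetical helper and the bookkeeping inside `Trainer` is more involved.

```python
# Validation loss and accuracy per epoch, copied from the training log of this notebook.
eval_loss = {1: 0.24565, 2: 0.244284, 3: 0.284517, 4: 0.311895}
accuracy = {1: 0.91016, 2: 0.9274, 3: 0.93908, 4: 0.94028}


def pick_best(scores, greater_is_better):
    """Return the epoch whose score is best under the given direction."""
    chooser = max if greater_is_better else min
    return chooser(scores, key=scores.get)


print(pick_best(eval_loss, greater_is_better=False))  # 2
print(pick_best(accuracy, greater_is_better=True))    # 4
```

Under this assumption the checkpoint from epoch 2 would be the one reloaded, even though accuracy keeps improving until epoch 4. Selecting by accuracy instead would presumably require setting `metric_for_best_model="accuracy"` in `TrainingArguments`.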

In [10]:
training_args = TrainingArguments(
    output_dir="bert_imdb",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
    report_to="none",
)

trainer = Trainer(
    model=bert_classifier,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

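| Epoch | Training Loss | Validation Loss | Accuracy | Recall | Precision |
|-------|---------------|-----------------|----------|---------|-----------|
| 1 | 0.2099 | 0.24565 | 0.91016 | 0.84808 | 0.968305 |
| 2 | 0.1379 | 0.244284 | 0.9274 | 0.89112 | 0.960838 |
| 3 | 0.0752 | 0.284517 | 0.93908 | 0.95088 | 0.928957 |
| 4 | 0.0352 | 0.311895 | 0.94028 | 0.94296 | 0.937933 |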


No files have been modified since last commit. Skipping to prevent empty commit.


TrainOutput(global_step=6252, training_loss=0.12284956799053795, metrics={'train_runtime': 2639.6527, 'train_samples_per_second': 37.884, 'train_steps_per_second': 2.368, 'total_flos': 2.603344526150064e+16, 'train_loss': 0.12284956799053795, 'epoch': 4.0})
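
The numbers in `TrainOutput` can be checked with a little arithmetic. The sketch assumes a single device and no gradient accumulation, so one optimizer step consumes one batch of 16 samples; the runtime is taken from the output above.

```python
import math

train_samples = 25_000
batch_size = 16            # per_device_train_batch_size
epochs = 4                 # num_train_epochs
train_runtime = 2639.6527  # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_samples / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)                                   # 1563
print(total_steps)                                       # 6252
print(round(total_steps / train_runtime, 3))             # 2.368
print(round(train_samples * epochs / train_runtime, 3))  # 37.884
```

The results agree with `global_step`, `train_steps_per_second`, and `train_samples_per_second` in the output.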

In [11]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/goktug14/bert_imdb/commit/7e21762b691fd87d11d9fbccd299d5b21ba9bf4c', commit_message='End of training', commit_description='', oid='7e21762b691fd87d11d9fbccd299d5b21ba9bf4c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/goktug14/bert_imdb', endpoint='https://huggingface.co', repo_type='model', repo_id='goktug14/bert_imdb'), pr_revision=None, pr_num=None)

# Inference

In [9]:
# 9/10 - substance - 2024
text1 = """This is one of those movies that would not do justice seeing at home.
          You need to be in a packed theater, feel the waves and rushes of energy
          from the crowd. It is an experience to say the least. You will laugh,
          you will look away, your jaw will drop, you will feel uncomfortable.
          But it is all worth it. It is incredible filmmaking, award winning acting
          and a smashing soundtrack all in one. Huge applause to Demi for taking
          on such a vulnerable role covering a subject that is rarely discussed.
          Looking forward to see what comes of this, hopefully more open conversation
          within the industry and more doors opened than closed."""

# 4/10 - smile 2 - 2024
text2 = """Smile 2 boasted impressive cinematography and strong performances, yet
         ultimately stumbled due to a shallow plot and a poorly developed core
         concept. The film revolves around a core demon spirit, a chiling premise
          that unfortunately remained largely unexplored."""

model = pipeline("sentiment-analysis", model="goktug14/bert_imdb")

out1 = model(text1)
out2 = model(text2)

print(out1)
print(out2)

Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9953957200050354}]
[{'label': 'NEGATIVE', 'score': 0.995867133140564}]
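
The `label` and `score` fields can be understood as the result of a softmax over the two logits followed by a lookup in `id2label`. The sketch below illustrates that post-processing in plain Python; it is not the pipeline's own code, the logits are hypothetical, and `to_prediction` is a made-up helper name.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def to_prediction(logits):
    """Softmax over the logits, then report the most probable label."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    probs = [e / sum(exps) for e in exps]
    best = probs.index(max(probs))
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for a clearly positive review.
print(to_prediction([-2.7, 2.7]))
```

A large gap between the two logits gives a score close to 1, which is the pattern seen in both outputs above.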


<a name="ref"></a>

# References

1. [HF Tutorial](https://huggingface.co/docs/transformers/en/tasks/sequence_classification#inference)

2. [Imdb Dataset](https://huggingface.co/datasets/stanfordnlp/imdb)

3. [Tokenizer Documentation](https://huggingface.co/docs/transformers/main_classes/tokenizer)

4. [Standard/Fast Tokenizer Discussion](https://discuss.huggingface.co/t/difference-betweeen-distilberttokenizerfast-and-distilberttokenizer/5961/2)

5. [Uncased BERT Models](https://huggingface.co/google-bert/bert-base-uncased)

6. [Batch Mapping in Datasets](https://huggingface.co/docs/datasets/en/about_map_batch)

7. [PreTrainedTokenizerBase Class](https://huggingface.co/docs/transformers/internal/tokenization_utils)

8. [Sub-Word Tokenization](https://www.youtube.com/watch?v=zHvTiHr506c)

9. [Summary of Tokenizers](https://huggingface.co/docs/transformers/en/tokenizer_summary)

10. [Padding and Truncation](https://huggingface.co/docs/transformers/en/pad_truncation#)

11. [Compute Metrics and Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer)

12. [Computation of Multiple Metrics Discussion](https://discuss.huggingface.co/t/combine-multiple-metrics-in-compute-metrics-for-validation/90088)

13. [BERT Architecture](https://huggingface.co/docs/transformers/en/model_doc/bert)