[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tayKZyBpQX0nswLEjX6f7smUwvHii6D9?usp=sharing)

# Text classification - IMDB Dataset

In [1]:
# Install the transformers library (with the sentencepiece extra) and the datasets library,
# both used for natural language processing tasks.
!pip install transformers[sentencepiece] datasets

## Datasets library

In [30]:
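# List the datasets available on the Hugging Face Hub
# (recent releases of the datasets library deprecate this helper in favor of huggingface_hub.list_datasets)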
from datasets import list_datasets
list_datasets()

In [33]:
# Load the IMDb movie review dataset
# The result is a dictionary-like object holding the train, test and unsupervised splits, each with its own number of records
from datasets import load_dataset
imdb = load_dataset("imdb")
imdb

- The returned object is a `DatasetDict`: it behaves like a Python dictionary in which each key corresponds to a different split

In [35]:
# Print the first record of the imdb training dataset
imdb['train'][0]

In [41]:
# Print the first three records of the imdb testing dataset
imdb['test'][:3]

In [44]:
# Select and save 2000 random records from the imdb training dataset
imdb['train'] = imdb['train'].shuffle(seed=1).select(range(2000))
imdb['train']

In [46]:
# Split the training dataset into train and test datasets, where the training dataset holds 80% of the records
# and the remaining 20% are in the testing dataset
imdb_train_validation = imdb['train'].train_test_split(train_size=0.8)
imdb_train_validation
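
As a rough illustration of what the sampling and splitting steps above do, the following plain-Python sketch draws 2000 random row indices with a fixed seed and splits them 80/20. It does not use the `datasets` library: `sample_and_split` is a hypothetical helper, and the 25,000-row size of the original training split is an assumption that the notebook itself does not state.

```python
import random


def sample_and_split(n_rows, n_keep, train_size, seed):
    """Pick n_keep random row indices, then split them into train/validation."""
    rng = random.Random(seed)             # fixed seed -> same selection every run
    indices = list(range(n_rows))
    rng.shuffle(indices)                  # analogous to Dataset.shuffle(seed=...)
    kept = indices[:n_keep]               # analogous to .select(range(n_keep))
    cut = round(n_keep * train_size)      # 80% of the kept rows go to training
    return kept[:cut], kept[cut:]


train_idx, val_idx = sample_and_split(n_rows=25_000, n_keep=2_000,
                                      train_size=0.8, seed=1)
print(len(train_idx), len(val_idx))       # 1600 400
```

Note that `train_test_split` is called without a `seed` in the cell above, so the actual split can differ between runs; passing `seed=` should make it repeatable in the same way the sketch is.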

In [49]:
# See how the generated testing dataset is structured (features and number of records)
imdb_train_validation['test']

In [56]:
# Print the first record of the imdb_train_validation testing dataset
imdb_train_validation['test'][0]

In [57]:
# Set the imdb_train_validation validation dataset by popping the test dataset
imdb_train_validation['validation'] = imdb_train_validation.pop('test')
imdb_train_validation

In [59]:
# Update the imdb dataset dictionary by adding the train and validation datasets
imdb.update(imdb_train_validation)
imdb
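
The pop and update steps above rely on `DatasetDict` behaving like a dictionary. A minimal sketch of the same rearrangement with plain dictionaries, where the string values are only placeholders for the actual splits and the names `imdb_like` and `split_like` are hypothetical:

```python
# Plain dictionaries stand in for DatasetDict objects; the string values are
# placeholders for the actual splits.
imdb_like = {"train": "2000 sampled rows", "test": "original test rows",
             "unsupervised": "unlabeled rows"}
split_like = {"train": "1600 rows", "test": "400 rows"}

# Rename the held-out part from 'test' to 'validation'.
split_like["validation"] = split_like.pop("test")

# update() overwrites 'train', adds 'validation', and leaves 'test' untouched.
imdb_like.update(split_like)
print(sorted(imdb_like))   # ['test', 'train', 'unsupervised', 'validation']
```

Renaming before calling `update()` matters: without it, the 20% held-out part would overwrite the original test split.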

In [61]:
# Optional (left commented out): select 400 random records from the imdb test dataset and store them in the test dataset
# imdb['test'] = imdb['test'].shuffle(seed=1).select(range(400))
# imdb['test']

In [63]:
# Print the first three records in the imdb unsupervised dataset
imdb['unsupervised'][:3]

In [65]:
# Remove the imdb unsupervised dataset from the imdb dataset dictionary and keep
# only the train, test and validation datasets
imdb.pop('unsupervised')
imdb

## Overview of IMDB Dataset

In [68]:
import pandas as pd
import matplotlib.pyplot as plt

# Set an option in Pandas to display columns with a maximum width of 250 characters
pd.set_option('max_colwidth', 250)

In [74]:
# Switch the output format of the imdb DatasetDict to pandas, so that slicing a split returns a DataFrame.
# Load the complete training split into the df DataFrame
# and display 10 randomly sampled rows (the max_colwidth option set above keeps the text readable)
imdb.set_format('pandas')
df = imdb['train'][:]
df.sample(frac=1, random_state=1).head(10)

index,text,label
75,"The emotional powers and characters of Dominick and Eugene are the things that Hollywood doesn't make anymore. This is one of the most emotional, sensitive, and heart-felt movies that I have ever seen! Roy Liotta, Tom Hulce, and supporting actres...",1
1284,I rank this the best of the Zorro chapterplays.The exciting musical score adds punch to an exciting screen play.There is an excellent supporting cast and mystery villain that will keep you guessing until the final chapter.Reed Hadley does a fine ...,1
408,"It's interesting at first. A naive park ranger (Colin Firth) marries a pretty, mysterious woman (Lisa Zane) he's only known for a short time. They seem to be happy, then she disappears without warning. He searches for her and, after a few dead en...",0
1282,"The Dirty Harry series began with very gritty cop action, and was almost immediately lightened up for ""Magnum Force"". By the time that ""The Enforcer"" rolled around, Dirty Harry was little more than a television cop show (saved only by Tyne Daly)....",1
1447,"The first time I saw this film in the theatre at a foreign film festival, I thought it intriguing, fascinating, the sensitive bi-sexual artist. So very European, so very Dutch! I recently rented it for a second viewing and could hardly keep from ...",1
1144,"I greatly enjoyed Margaret Atwood's novel 'The Robber Bride', and I was thrilled to see there was a movie version. A woman frames a cop boyfriend for her own murder, and his buddy, an ex-cop journalist, tries to clear his name by checking up on t...",0
1381,"Welcome to Collinwood is one of the most delightful films I have ever seen. A superb ensemble cast, tight editing and wonderful direction. A caper movie that doesn't get bogged down in the standard tricks.<br /><br />Not much can be said about th...",1
181,"I believe a lot of people down rated the movie, NOT because of the lack of quality. But it did not follow the standard Hollywood formula. Some of the conflicts are not resolved. The ending is just a little too real for others, but the journey the...",1
1183,"This 1939 film tried to capitalize on the much better Michael Curtiz's film ""Angels with Dirty Faces"". As directed by Ray Enright, the only interesting thing is how tamed these kids were in comparison with what's going on with the youth in Americ...",1
1103,"Vampires, sexy guys, guns and some blood. Who could ask for more? Moon Child delivers it all in one nicely packaged flick! Gackt is the innocent Sho - who befriends a Vampire Kei (HYDE), their relationship grows with time but as Sho ages, Kei's i...",1


In [77]:
# Print the text column of the first record of the dataset 
df.loc[0, 'text']

In [78]:
# Replace each <br /> tag with a space (an empty string would fuse the words on either side of the tag)
# and print the text column of the first record of the dataset
df['text'] = df['text'].str.replace('<br />', ' ', regex=False)
df.loc[0, 'text']
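
The reviews contain `<br />` tags between sentences, so what the tag is replaced with affects the word counts computed later. A minimal sketch of a slightly more defensive cleanup using only the standard library; `BR_TAG` and `clean_review` are hypothetical names and are not part of the notebook:

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def clean_review(text):
    """Replace line-break tags with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


sample = "A caper movie.<br /><br />Not much can be said."
print(clean_review(sample))   # A caper movie. Not much can be said.
```

If this variant is preferred, it could be applied to the DataFrame with `df['text'].map(clean_review)`.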

In [80]:
# Count how many times each label value occurs in the label column
df.label.value_counts()

In [84]:
# Create a new column named Words per review that counts the number of words present in the text column for each record.
df["Words per review"] = df["text"].str.split().apply(len)

# Create a boxplot where the "Words per review" argument specifies the column to be plotted, 
# "label" specifies the column used to group the data, and grid=False and showfliers=False remove the grid 
# and outliers from the plot. The color="black" argument sets the color of the plot to black.
df.boxplot("Words per review", by="label", grid=False, showfliers=False, color="black")

# Remove the title of the plot
plt.suptitle("")

# Remove the x-axis label of the plot
plt.xlabel("")
plt.show()
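
A box plot summarizes each group by its median and quartiles. The sketch below computes roughly those statistics by hand for a few made-up reviews, using only the standard library; the toy `reviews` list and the variable names are assumptions for illustration, not data from the notebook.

```python
from statistics import median, quantiles

# Toy (label, text) pairs standing in for the DataFrame rows.
reviews = [
    (0, "dull and far too long"),
    (0, "bad"),
    (0, "not worth the ticket price at all"),
    (1, "great fun"),
    (1, "a warm funny and moving film"),
    (1, "loved every single minute of it"),
]

# Group the word counts by label, like boxplot(..., by="label") does.
words_per_label = {}
for label, text in reviews:
    words_per_label.setdefault(label, []).append(len(text.split()))

for label, counts in sorted(words_per_label.items()):
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    print(label, "median:", median(counts), "interquartile range:", q3 - q1)
```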

In [86]:
# 0 is negative
# 1 is positive

# Select all rows in the df DataFrame where the length of the 'text' column is less than 200 characters.
# The str.len() method is used to calculate the length of each review in characters, and then the 
# condition text.str.len() < 200 is used to filter the rows where the length is less than 200.
df[df.text.str.len() < 200]

In [88]:
# Reset the output format of the DatasetDict to the default, in which rows are returned as plain Python objects.
# For the IMDb dataset this means that each record of every remaining split ('train', 'validation', 'test')
# is a dictionary with two keys: 'text' and 'label'.
imdb.reset_format()

## Tokenizer

In [22]:
# Import the AutoTokenizer class from the transformers package.
# Set the checkpoint to "distilbert-base-cased", the name of a pre-trained DistilBERT model available on the
# Hugging Face Hub ("bert-base-cased", a pre-trained BERT model, is a drop-in alternative).
# Then use the AutoTokenizer.from_pretrained() method to load the tokenizer that matches the checkpoint.
# This creates a tokenizer object that can be used to tokenize the text data.

from transformers import AutoTokenizer

checkpoint = "distilbert-base-cased"
#checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


# The tokenize_function() function tokenizes a batch of texts using the tokenizer object.
# It takes a batch of data in the form of a dictionary, where the key "text" holds the list of
# texts to be tokenized, and calls the tokenizer on that list. truncation=True cuts every sequence
# down to the maximum length the model accepts, and padding=True pads shorter sequences to the length
# of the longest one in the batch, so that all tokenized sequences in a batch have the same length.
def tokenize_function(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

# The map() method applies tokenize_function() to every split of the IMDb DatasetDict.
# batched=True passes batches of examples to the function rather than individual examples, and
# batch_size=None passes each split as a single batch, so every sequence in a split is padded to
# the same length. The result is a new DatasetDict called imdb_encoded, which keeps the original
# 'text' and 'label' columns and adds the 'input_ids' and 'attention_mask' columns.
imdb_encoded = imdb.map(tokenize_function, batched=True, batch_size=None)
imdb_encoded
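
To make the effect of `padding=True` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the Hugging Face tokenizer: `pad_and_truncate` is a hypothetical helper, the token IDs are made up, and a real tokenizer also keeps its closing special token when it truncates, which this sketch does not.

```python
def pad_and_truncate(token_ids_batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in token_ids_batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three made-up token ID sequences of different lengths.
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model ignore the padding positions. Padding a whole split to one length is simple but can waste computation when a few reviews are much longer than the rest.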

In [23]:
# Print the first example in the tokenized 'train' split of the IMDb dataset after applying
# the tokenize_function() defined in the previous code snippet.
# The output is a dictionary whose keys are the features ['text', 'label', 'input_ids', 'attention_mask']
# and whose values are this example's text, label, list of token IDs and attention mask
print(imdb_encoded['train'][0])

## Tiny IMDB

In [90]:
import transformers
import re

# Find all the classes in the transformers module that start with the string "AutoModel".
[x for x in dir(transformers) if re.search(r'^AutoModel', x)]

In [25]:
import torch
from transformers import AutoModelForSequenceClassification

# Define a PyTorch model for sequence classification using the AutoModelForSequenceClassification
# class from the transformers library.
# First, check whether a GPU is available: the device is set to "cuda" if it is, and to "cpu" otherwise.
# Next, set the number of labels to 2 (0 or 1, i.e. negative or positive), which is appropriate for the
# binary classification problem of the IMDb dataset.
# Finally, create an instance of the AutoModelForSequenceClassification class by calling its from_pretrained()
# method with the previously specified checkpoint and number of labels, and move the model to the selected
# device using the to() method.
# The result is a pre-trained transformer with a newly initialized classification head,
# which will be fine-tuned on the IMDb dataset.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_labels = 2
model = (AutoModelForSequenceClassification
         .from_pretrained(checkpoint, num_labels=num_labels)
         .to(device))

In [26]:
from datasets import DatasetDict

# The code below creates a smaller version of the IMDb dataset called tiny_imdb.
# It is a DatasetDict, a dictionary-like object that contains the different splits
# of the dataset (train, validation, test). tiny_imdb is created by selecting
# 50 examples from the training split and 10 examples each from the validation and
# test splits. Each split is shuffled with a seed of 1 before selection.
# Next, tokenize_function() is applied to tiny_imdb using the map() method,
# with the same batched and batch_size arguments as before. The resulting encoded dataset is stored in tiny_imdb_encoded.

tiny_imdb = DatasetDict()
tiny_imdb['train'] = imdb['train'].shuffle(seed=1).select(range(50))
tiny_imdb['validation'] = imdb['validation'].shuffle(seed=1).select(range(10))
tiny_imdb['test'] = imdb['test'].shuffle(seed=1).select(range(10))

tiny_imdb_encoded = tiny_imdb.map(tokenize_function, batched=True, batch_size=None)
tiny_imdb_encoded

In [27]:
from transformers import Trainer, TrainingArguments

# The code below sets up the training arguments for fine-tuning the pre-trained transformer model
# on the tiny_imdb dataset. batch_size is set to 8, which means that the model is trained on batches
# of 8 examples at a time. logging_steps is set to the number of training examples divided by the batch size
# (rounded down), so the training loss is logged roughly once per epoch. model_name is set to a string that
# combines the name of the pre-trained checkpoint with a description of the fine-tuning task
# (in this case, "tiny-imdb").
# TrainingArguments is a class from the transformers library that takes a variety of arguments to configure
# the training process. The specific arguments used in this code are:
    # output_dir: the directory where the fine-tuned model and other output will be saved
    # num_train_epochs: the number of epochs to train for
    # learning_rate: the learning rate for the optimizer
    # per_device_train_batch_size: the batch size per device for training
    # per_device_eval_batch_size: the batch size per device for evaluation
    # weight_decay: the weight decay coefficient for the optimizer
    # evaluation_strategy: when to evaluate the model during training (in this case, after each epoch);
    #                      recent transformers releases rename this argument to eval_strategy
    # disable_tqdm: whether to disable the progress bar during training
    # logging_steps: how often to log training progress (in this case, every logging_steps optimizer steps)
    # log_level: the logging verbosity (in this case, only errors are logged)
    # optim: the optimizer to use (in this case, the PyTorch implementation of AdamW)
# Overall, this code sets up the training arguments that will be used in the Trainer object to fine-tune
# the pre-trained transformer model on the tiny_imdb dataset.

batch_size = 8
logging_steps = len(tiny_imdb_encoded["train"]) // batch_size
model_name = f"{checkpoint}-finetuned-tiny-imdb"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  log_level="error",
                                  optim='adamw_torch'
                                  )
training_args
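
The relationship between dataset size, batch size and `logging_steps` can be checked with a little arithmetic. The sketch below is an assumption-laden illustration, not part of the `transformers` API: `logging_schedule` is a hypothetical helper, and it assumes a single device, no gradient accumulation and the default step-based logging, so that one optimizer step consumes exactly one batch.

```python
import math


def logging_schedule(n_examples, batch_size, epochs, logging_steps):
    """Return (steps per epoch, total steps, steps at which the loss is logged).

    Assumes a single device and no gradient accumulation, so one optimizer
    step consumes exactly one batch.
    """
    steps_per_epoch = math.ceil(n_examples / batch_size)  # last batch may be partial
    total_steps = steps_per_epoch * epochs
    logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
    return steps_per_epoch, total_steps, logged_at


n_examples, batch_size = 50, 8
print(logging_schedule(n_examples, batch_size, epochs=2,
                       logging_steps=n_examples // batch_size))
# (7, 14, [6, 12])
```

Because 50 is not a multiple of 8, the floor division gives 6 while an epoch takes 7 steps, so the log lines land slightly before each epoch boundary. When the training set size is an exact multiple of the batch size, the two coincide.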

In [28]:
from transformers import Trainer

# torch.cuda.empty_cache() releases the unoccupied memory held by PyTorch's caching allocator so that
# other GPU applications can use it. It does not free memory occupied by live tensors and does not
# increase the amount of GPU memory available to PyTorch itself, so on its own it rarely prevents
# out-of-memory errors. It does not affect the contents or computation of tensors.
torch.cuda.empty_cache()

# This code trains a transformer model on a small subset of the IMDB reviews dataset (50 training examples
# and 10 validation examples). The Trainer class from the transformers library is used to train the model.
# The model being trained is the pre-trained DistilBERT-base-cased model, which has been configured
# to perform binary classification (positive or negative sentiment) on the text input.
# The TrainingArguments object specifies the hyperparameters of the training process, such as the number of epochs,
# learning rate and batch size, all defined in the previous cell.
# The train_dataset and eval_dataset arguments specify the training and validation datasets, respectively, after
# they have been preprocessed and encoded by the tokenizer.
# Finally, the trainer.train() method is called to start the training process. During training, the model
# is fine-tuned on the training dataset in mini-batches, and the loss is computed and backpropagated through
# the network to update the model's parameters. After each epoch, the model is evaluated on the validation
# dataset to measure its performance on unseen examples. Training runs for the specified number of
# epochs; no early stopping is configured here.
trainer = Trainer(model=model, 
                  args=training_args, 
                  train_dataset=tiny_imdb_encoded["train"],
                  eval_dataset=tiny_imdb_encoded["validation"],
                  tokenizer=tokenizer)
trainer.train();

In [None]:
# The `trainer.predict()` method generates predictions from the trained model, here on the test split
# of the tiny IMDB dataset. The `preds` variable contains the raw class scores (logits, not probabilities)
# for each example in the test split, along with the corresponding true labels and the computed metrics.

preds = trainer.predict(tiny_imdb_encoded['test'])
preds

In [None]:
# Print the shape of the predictions array, which in our case is 10 rows (examples) and 2 columns (classes)
preds.predictions.shape

In [None]:
# The `preds.predictions` attribute contains the logits for each example in the test dataset,
# where each row corresponds to an example and each column corresponds to a class. The `argmax(axis=-1)` method
# finds the index of the column with the highest score for each example, which corresponds to
# the predicted label. The output is an array with one predicted label per example in the test dataset.
preds.predictions.argmax(axis=-1)
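
A sequence-classification head returns raw scores (logits) rather than probabilities. The following minimal sketch, in plain Python with made-up logits, shows how a softmax turns such scores into probabilities and that the argmax is the same either way; `softmax` here is a hand-written helper, not a call into any library.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]   # subtract the max for stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three reviews; columns are class 0 and class 1.
logits = [[1.2, -0.3], [-0.8, 0.9], [0.1, 0.4]]
probs = [softmax(row) for row in logits]
labels_from_logits = [row.index(max(row)) for row in logits]
labels_from_probs = [row.index(max(row)) for row in probs]
print(labels_from_logits)                        # [0, 1, 1]
print(labels_from_logits == labels_from_probs)   # True
```

Since softmax preserves the ordering of the scores, taking the argmax of the logits directly is enough when only the predicted label is needed.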

In [None]:
# The `preds.label_ids` attribute holds the true labels of the test dataset in the same order as the predictions.
# In other words, it is a NumPy array of integers representing the actual label (0 or 1) of each test example.
preds.label_ids

In [None]:
from sklearn.metrics import accuracy_score

# Check the accuracy of the model
accuracy_score(preds.label_ids, preds.predictions.argmax(axis=-1))

In [None]:
# The `get_accuracy` function takes `preds` as an argument, which is the output of `trainer.predict` on
# a test dataset (the Trainer passes an object with the same two attributes when it calls the function
# during evaluation). It uses `preds.predictions.argmax(axis=-1)` to get the predicted labels and
# `preds.label_ids` to get the true labels. It then calculates the accuracy score using `accuracy_score`
# from the scikit-learn library. Finally, it returns a dictionary containing the accuracy score.
def get_accuracy(preds):
    predictions = preds.predictions.argmax(axis=-1)
    labels = preds.label_ids
    accuracy = accuracy_score(labels, predictions)
    return {'accuracy': accuracy}
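
With only 10 test examples, accuracy moves in steps of 0.1, so it can help to look at a few more numbers. The sketch below computes accuracy, precision, recall and F1 by hand for binary labels; `binary_metrics` is a hypothetical helper and the label lists are made up, so the values say nothing about the model trained in this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}; 1 is positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Made-up true and predicted labels for ten reviews.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```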

In [None]:
from transformers import Trainer

torch.cuda.empty_cache()

# Instantiate a `Trainer` object from the `transformers` library, with the specified parameters:
# - `model`: The model for sequence classification. It is the same object that was trained in the previous
#   run, so this run continues from the already fine-tuned weights.
# - `compute_metrics`: The function that computes the evaluation metric(s) reported during evaluation.
#   In this case, the function `get_accuracy()` is used to calculate the accuracy score of the predictions.
# - `args`: The `TrainingArguments` object that specifies the hyperparameters and settings for training the model.
# - `train_dataset`: The dataset of training examples, which is a tokenized and encoded version of the `tiny_imdb` dataset.
# - `eval_dataset`: The dataset of validation examples, which is a tokenized and encoded version of the `tiny_imdb` dataset.
# - `tokenizer`: The tokenizer object used to tokenize the text data.
# The purpose of this `Trainer` object is to train the model on the `train_dataset` and evaluate it on the
# `eval_dataset` using the specified hyperparameters and settings. At each evaluation, the `compute_metrics`
# function is called to calculate the accuracy score of the predictions.
trainer = Trainer(model=model, 
                  compute_metrics=get_accuracy,
                  args=training_args, 
                  train_dataset=tiny_imdb_encoded["train"],
                  eval_dataset=tiny_imdb_encoded["validation"],
                  tokenizer=tokenizer)
trainer.train();

## Training run

In [None]:
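# Same training arguments as for tiny_imdb, now with a batch size of 16 and the complete imdb_encoded dataset
batch_size = 16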
logging_steps = len(imdb_encoded["train"]) // batch_size
model_name = f"{checkpoint}-finetuned-imdb"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  log_level="error",
                                  optim='adamw_torch'
                                  )

In [None]:
from transformers import Trainer

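torch.cuda.empty_cache()

# Train on the imdb_encoded train split and evaluate on its validation split after each epoch.
# The model object is reused, so training continues from the weights fine-tuned on tiny_imdb.
trainer = Trainer(model=model,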
                  args=training_args, 
                  compute_metrics=get_accuracy,
                  train_dataset=imdb_encoded["train"],
                  eval_dataset=imdb_encoded["validation"],
                  tokenizer=tokenizer)
trainer.train();

In [None]:
# Evaluate the trained model on the validation split (the eval_dataset passed to the Trainer)
trainer.evaluate()

In [None]:
# Save the model and the tokenizer to the output_dir set in the training arguments
trainer.save_model()

In [None]:
# Show the name of the directory the model was saved to
model_name

In [None]:
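# Load the saved model from that directory into a text-classification pipeline and classify a sentence
from transformers import pipeline
classifier = pipeline('text-classification', model=model_name)
classifier('This is not my idea of fun')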

In [None]:
# Classify a second sentence
classifier('This was beyond incredible')
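
Because no label names were configured when the model was created, the pipeline is expected to report generic labels such as `LABEL_0` and `LABEL_1`. A minimal sketch of mapping them to readable names; `ID2LABEL` and `to_sentiment` are hypothetical, the mapping assumes the IMDb convention noted earlier (0 is negative, 1 is positive), and the example output is hand-written rather than produced by the model.

```python
# Assumed mapping, based on the IMDb label convention (0 = negative, 1 = positive).
ID2LABEL = {"LABEL_0": "negative", "LABEL_1": "positive"}


def to_sentiment(pipeline_output):
    """Rename generic pipeline labels; expects a list of {'label', 'score'} dicts."""
    return [{"label": ID2LABEL.get(item["label"], item["label"]),
             "score": item["score"]}
            for item in pipeline_output]


# Hand-written stand-in for a classifier result; the score is made up.
example_output = [{"label": "LABEL_1", "score": 0.93}]
print(to_sentiment(example_output))   # [{'label': 'positive', 'score': 0.93}]
```

In the notebook this could be used as `to_sentiment(classifier('This was beyond incredible'))`. Unknown labels are passed through unchanged, so the helper does no harm if the model already carries readable label names.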