## Assignment 7: Transformers (Solution)

Transformer models ([wiki](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)), [paper](https://arxiv.org/pdf/1706.03762.pdf)) revolutionized natural language processing by using self-attention to model long-range dependencies, outperforming earlier recurrent models such as RNNs and LSTMs.

------------
In this assignment we will gain a better understanding of how transformers work and how they can be fine-tuned for a specific natural language processing task. We will use the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model ([wiki](https://en.wikipedia.org/wiki/BERT_(language_model)), [paper](https://arxiv.org/pdf/1810.04805.pdf)) and fine-tune it to **classify** toxic comments found online.

We will make use of the [HuggingFace](https://huggingface.co) library to obtain pre-trained model weights and to simplify processes like tokenization.

*Note:* 
1. *This assignment requires a GPU, so please run this notebook in Google Colab (https://colab.research.google.com/). The BERT model has about 110 million parameters, so a laptop GPU most likely won't be able to fit a model this big.*
2. *Before you begin, please make sure you know how to run a GPU instance in Colab.*

In [1]:
# Install Hugging Face Transformers
!pip install transformers -U

# Obtain the dataset
!wget https://media.githubusercontent.com/media/poudel-bibek/AI-Assignments/main/Datasets/Transformers/toxic_comments.csv

--2023-03-15 15:49:56--  https://media.githubusercontent.com/media/poudel-bibek/AI-Assignments/main/Datasets/Transformers/toxic_comments.csv
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68802655 (66M) [text/plain]
Saving to: ‘toxic_comments.csv’


2023-03-15 15:50:02 (11.1 MB/s) - ‘toxic_comments.csv’ saved [68802655/68802655]



In [2]:
# import packages
import numpy as np
import pandas as pd

import torch 
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split

from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

In [3]:
# Data loading and exploration
# The data has a lot of columns, we will just consider two 
columns = ['comment_text', 'toxic']
data = pd.read_csv('./toxic_comments.csv')[columns]

# Example non-toxic comments
print(data[data['toxic']==0])

# 0 = Not toxic, 1 = toxic
print("\nBefore balance")
print(data['toxic'].value_counts())
# Total: 144277 + 15294 = 159,571 data points

# The data is heavily imbalanced: class toxic=1 has only about 10% as many rows as class toxic=0
# We balance it by taking an equal number of data points from each class
# We also work on a small subset of the data (5,000 data points per class)
# Select the rows where 'toxic' is 0 and take the first 5000
df1 = data[data['toxic'] == 0].iloc[:5000] #[0:15294]

# Select the rows where 'toxic' is 1 and take the first 5000
df2 = data[data['toxic'] == 1].iloc[:5000]

# Concatenate the two dataframes
new_data = pd.concat([df1, df2], ignore_index=True)

print("\nAfter balance")
print(new_data['toxic'].value_counts())

# Define the inputs and targets
X = list(new_data['comment_text'])
y = list(new_data['toxic'])

print(f"\nData: {len(X)}, {len(y)}")

                                             comment_text  toxic
0       Explanation\nWhy the edits made under my usern...      0
1       D'aww! He matches this background colour I'm s...      0
2       Hey man, I'm really not trying to edit war. It...      0
3       "\nMore\nI can't make any real suggestions on ...      0
4       You, sir, are my hero. Any chance you remember...      0
...                                                   ...    ...
159566  ":::::And for the second time of asking, when ...      0
159567  You should be ashamed of yourself \n\nThat is ...      0
159568  Spitzer \n\nUmm, theres no actual article for ...      0
159569  And it looks like it was actually you who put ...      0
159570  "\nAnd ... I really don't think you understand...      0

[144277 rows x 2 columns]

Before balance
0    144277
1     15294
Name: toxic, dtype: int64

After balance
0    5000
1    5000
Name: toxic, dtype: int64

Data: 10000, 10000


In [4]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

print(f"Train: {len(X_train), len(y_train)}")
print(f"Test: {len(X_test), len(y_test)}")

Train: (9000, 9000)
Test: (1000, 1000)


In [5]:
# This cell will take about a minute
# Get the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

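# Note: max_length is the maximum sequence length in tokens, not the embedding size
# BERT-base accepts at most 512 tokens per input; each token is embedded as a 768-dimensional vector
# padding=True pads every sequence to the longest one in the batch; don't worry about padding for now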
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)

# Let's explore the tokenized inputs
print(X_train_tokenized.keys())

# Length of each field for the first example
print(len(X_train_tokenized['input_ids'][0]))
print(len(X_train_tokenized['token_type_ids'][0]))
print(len(X_train_tokenized['attention_mask'][0]))

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
512
512
512


Each tokenized data point has 3 fields:

**1. input_ids:** The tokenized input sequence, where each token is mapped to an integer that is its index in the vocabulary. For example, the token "cat" might be mapped to the integer 42.

**2. token_type_ids:** Used to distinguish between different segments of the input sequence (if present). For example, in a question-answering task, the input sequence might consist of a question followed by a paragraph of text that contains the answer. A value of 0 would be assigned to all tokens in the question and a value of 1 to all tokens in the paragraph. Here every input is a single comment, so all values are 0.

**3. attention_mask:** Marks which tokens in the input sequence the model should attend to (value 1) and which it should ignore (value 0). Padding tokens get a 0.
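
The three fields can be illustrated with a toy encoder. This is only a sketch: the vocabulary `TOY_VOCAB`, the function `toy_encode`, and the whitespace splitting are hypothetical stand-ins for the real BERT tokenizer, which uses a WordPiece vocabulary with different integer ids.

```python
# Toy illustration only: a made-up vocabulary, not BERT's real WordPiece vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
             "who": 4, "wrote": 5, "it": 6, "alice": 7}

def toy_encode(first, second=None, max_length=10):
    """Encode one or two whitespace-split segments into the three BERT-style fields."""
    tokens = ["[CLS]"] + first.lower().split() + ["[SEP]"]
    token_type_ids = [0] * len(tokens)          # first segment -> 0
    if second is not None:
        second_tokens = second.lower().split() + ["[SEP]"]
        tokens += second_tokens
        token_type_ids += [1] * len(second_tokens)  # second segment -> 1

    # Naive truncation to max_length tokens
    tokens = tokens[:max_length]
    token_type_ids = token_type_ids[:max_length]

    input_ids = [TOY_VOCAB.get(t, TOY_VOCAB["[UNK]"]) for t in tokens]
    attention_mask = [1] * len(input_ids)       # real tokens are attended to

    # Pad every field up to max_length; padding positions get mask 0
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [TOY_VOCAB["[PAD]"]] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }

encoded = toy_encode("who wrote it", "alice wrote it", max_length=10)
for name, values in encoded.items():
    print(f"{name:15s} {values}")
```

The nine real tokens get an attention mask of 1 and the single padding position gets 0, while `token_type_ids` switches from 0 to 1 at the start of the second segment.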

In [6]:
# Helper function, pytorch Dataset class
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

In [7]:
# Create a PyTorch compatible parallelizable dataset from our data
train_dataset = Dataset(X_train_tokenized, y_train)
test_dataset = Dataset(X_test_tokenized, y_test)

# A peek at what each input looks like
print(train_dataset[6])

{'input_ids': tensor([  101,  2074,  2178,  2317, 10514, 28139, 22911,  2923,  3899,  1012,
         1005, 16371,  4246,  2056,  1012,  2069,  1037,  2388, 11263,  9102,
        11758,  5895,  1012, 24501, 19317, 16048, 26332,  7559,  2038,  2042,
         7065,  8743,  2075,  2023,  2005,  2070,  2051,  1012,  1045,  4687,
         2065,  2010,  2023, 10369,  1005,  1055, 12608,  1029,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0, 

In [8]:
# Obtain the pre-trained BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=2)

# Uncomment the line below to see the model description
#print(model)

# Parameter count of the model
print(f"\nNumber of trainable parameters in the model: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")

# Load model to GPU
# Make sure you are running a GPU instance before you execute the line below
model = model.to('cuda')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at


Number of trainable parameters in the model: 109483778


Natural Language Processing (NLP) models can be really large: this model has 109,483,778 (about 110 million) parameters.
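
The reported count of 109483778 can be reproduced by hand from the model's layer sizes. The sketch below assumes the standard `bert-base-uncased` configuration (vocabulary of 30522 tokens, hidden size 768, 12 encoder layers, feed-forward size 3072, 512 positions, 2 segment types) plus a 2-class classification head; these values are not printed in this notebook, so treat them as assumptions.

```python
# Configuration values assumed for bert-base-uncased (not printed in this notebook).
vocab_size, hidden, layers = 30522, 768, 12
intermediate, max_positions, type_vocab, num_labels = 3072, 512, 2, 2

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # one scale and one shift vector

# Word, position, and segment embeddings, followed by a LayerNorm
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

# One encoder layer: query/key/value/output projections, then the feed-forward block
attention = 4 * linear(hidden, hidden) + layer_norm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)

pooler = linear(hidden, hidden)          # dense layer applied to the [CLS] position
classifier = linear(hidden, num_labels)  # the new, randomly initialized head

total = embeddings + encoder + pooler + classifier
print(f"embeddings: {embeddings:,}")
print(f"encoder:    {encoder:,}")
print(f"total:      {total:,}")
```

Under these assumptions most of the parameters sit in the 12 encoder layers, and the classification head added for fine-tuning contributes only 1,538 of them.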

---
# Exercise 1.

Please finish the following task. 
Write code to calculate 4 metrics: 
1. Accuracy
2. Recall 
3. Precision
4. F1 score. 

The functions have already been imported (see above). Relevant documentation: [Scikit-learn metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)
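
As a reference for what the four scikit-learn functions compute in the binary case, here is a minimal pure-Python sketch. The helper `binary_metrics` and the example labels are hypothetical and only meant to make the definitions concrete; the exercise itself should use the imported scikit-learn functions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 4 toxic (1) and 4 non-toxic (0) comments
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
print(m)
```

Here 2 of the 3 predicted-toxic comments are truly toxic (precision 2/3), 2 of the 4 toxic comments are found (recall 1/2), and F1 is their harmonic mean, 4/7.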

In [9]:
# Helper function, Compute metrics during training
def compute_metrics(datum):
    #print(type(datum))
    pred, labels = datum
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

In [10]:
# Training setup
num_epochs = 2
args = TrainingArguments(output_dir="output", 
                         num_train_epochs=num_epochs, 
                         per_device_train_batch_size=16,
                         per_device_eval_batch_size = 16,
                         warmup_steps = 100,
                         weight_decay = 0.01,
                         logging_strategy = 'steps',
                         logging_dir = './logs',
                         logging_steps = 200,
                         evaluation_strategy = 'steps',
                         )
# load_best_model_at_end = True?

# Create a trainer; this only sets it up and does not train yet
trainer = Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=test_dataset, compute_metrics=compute_metrics)

In [11]:
# Actual training
results = trainer.train()

# Save a trained model
trainer.save_model(f'Trained_models_{num_epochs}')

# This will take a while (30 minutes for train and eval). 
# So skip training for now, terminate this cell and run the cells below



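| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|------|---------------|-----------------|----------|-----------|--------|----|
| 200 | 0.3661 | 0.268855 | 0.907 | 0.854895 | 0.97996 | 0.913165 |
| 400 | 0.2444 | 0.175646 | 0.938 | 0.958071 | 0.915832 | 0.936475 |
| 600 | 0.2001 | 0.204991 | 0.949 | 0.927481 | 0.973948 | 0.950147 |
| 800 | 0.1401 | 0.174832 | 0.951 | 0.942913 | 0.95992 | 0.951341 |
| 1000 | 0.1281 | 0.161089 | 0.953 | 0.95749 | 0.947896 | 0.952669 |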
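
The five logged steps above follow from the training setup. The arithmetic below is a sketch that assumes a single GPU, no gradient accumulation, and that evaluation runs on the same interval as `logging_steps` because no separate evaluation interval was set.

```python
import math

train_examples = 9000   # size of the training split
batch_size = 16         # per_device_train_batch_size
num_epochs = 2
logging_steps = 200     # assumed to also be the evaluation interval

# The last batch of each epoch is partial, so round up
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which loss and metrics are reported
eval_steps = list(range(logging_steps, total_steps + 1, logging_steps))

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"evaluated at:    {eval_steps}")
```

With 1126 optimizer steps in total, the last report lands at step 1000, which is why the log ends there even though training continues for another 126 steps.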


In [12]:
# Load a model that was trained on an RTX 3090 GPU with a batch size of 32 for 10 epochs (6.5 hours) on a dataset that was 3x larger
#!git clone https://huggingface.co/matrix-multiply/BERT_toxic_classifier

In [13]:
!wget -r -nH --no-parent --reject="index.html*" https://github.com/poudel-bibek/AI-Assignments/blob/main/Trained_models/Transformers/BERT.zip
!unzip ./poudel-bibek/AI-Assignments/blob/main/Trained_models/Transformers/BERT.zip*raw=true -d ./

--2023-03-15 15:57:33--  https://github.com/poudel-bibek/AI-Assignments/blob/main/Trained_models/Transformers/BERT.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘poudel-bibek/AI-Assignments/blob/main/Trained_models/Transformers/BERT.zip’

poudel-bibek/AI-Ass     [ <=>                ] 137.13K  --.-KB/s    in 0.06s   

2023-03-15 15:57:33 (2.11 MB/s) - ‘poudel-bibek/AI-Assignments/blob/main/Trained_models/Transformers/BERT.zip’ saved [140417]

Loading robots.txt; please ignore errors.
--2023-03-15 15:57:34--  https://github.com/robots.txt
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 1508 (1.5K) [text/plain]
Saving to: ‘robots.txt’


2023-03-15 15:57:34 (1.91 MB/s) - ‘robots.txt’ saved [1508/1508]

--2023-03-15 15:57:34--  https://github.com/poudel-bibek/AI-Assignmen

In [14]:
# Uncomment this if you want to use the model that you trained
#trained_model =  BertForSequenceClassification.from_pretrained(f'Trained_models_{num_epochs}').to('cuda')

trained_model =  BertForSequenceClassification.from_pretrained("BERT").to('cuda')

In [15]:
# Helper function
# The model outputs one real-valued logit per class (2 values per input); apply softmax to turn them into probabilities
def show_result(outputs):
  #print("Model output:", outputs)
  predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
  predictions = predictions.cpu().detach().numpy()

  #print("Predictions:", predictions) # This is unnecessarily 2D
  possible_results = ["Not a toxic comment!", "Toxic comment!"]
  # The final result is the class for which the model has the highest probability
  i = int(np.argmax(predictions[0]))
  confidence = predictions[0][i]
  print(f"Model output: {possible_results[i]}\nConfidence: {100*confidence}%")
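
To make the softmax step concrete, here is a small standalone sketch in pure Python. The logit values are hypothetical and were not produced by the model; the function mirrors what `torch.nn.functional.softmax` does for a single pair of logits.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

possible_results = ["Not a toxic comment!", "Toxic comment!"]

logits = [-2.0, 3.0]                       # hypothetical model output for one input
probs = softmax(logits)

# Pick the class with the highest probability
best = max(range(len(probs)), key=lambda k: probs[k])
print(f"Probabilities: {[round(p, 4) for p in probs]}")
print(f"Model output: {possible_results[best]}\nConfidence: {100 * probs[best]:.2f}%")
```

A gap of 5 between the two logits already yields more than 99% confidence, which helps explain why the confidences reported for clear-cut inputs are so close to 100%.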

---
# Exercise 2.

Please finish the following task. 
1. Apply the tokenizer to the input. Make sure to include `return_tensors='pt'` as an argument to the tokenizer.

In [16]:
# Let's try some new sentences to test this model
text_dict = {'text1': 'Thanks for the encouraging review', 
            'text2': 'go to hell',
            'text3': 'This is some crappy work that you put out',
            'text4': 'Warm welcome and best wishes'

        # add more sentences if you like
        }

for i in range(1,len(text_dict)+1):
  text = text_dict.get(f"text{i}")  # named `text` to avoid shadowing the built-in input()
  print(f"\nInput: {text}")

  tokenized_input = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors='pt')

  # Load tokenized input to GPU
  tokenized_input = tokenized_input.to('cuda')

  # Pass input to the model (no gradients are needed for inference)
  with torch.no_grad():
    outputs = trained_model(**tokenized_input)
  
  show_result(outputs)


Input: Thanks for the encouraging review
Model output: Not a toxic comment!
Confidence: 99.99946355819702%

Input: go to hell
Model output: Toxic comment!
Confidence: 99.99806880950928%

Input: This is some crappy work that you put out
Model output: Toxic comment!
Confidence: 99.99808073043823%

Input: Warm welcome and best wishes
Model output: Not a toxic comment!
Confidence: 99.99942779541016%
