**Name: Hamidreza Amirzadeh**

**Std. No.: 401206999**

<div style="direction:rtl;line-height:200%;"><font face="B Nazanin" size=5>
<p>
Note: This assignment was completed in collaboration with fellow student Mr. Amirmohammad Mansourian.
</p>

# 0. Introduction

In this notebook, we aim to build a classifier that identifies spam messages. We will use a dataset that consists of 5000 SMS texts. Some of these texts are labeled as `spam` while the rest are considered `ham`.

To this end, we will use **BERT** word embeddings from the `transformers` library. We will not train a transformer from scratch, as that requires a lot of GPU power; instead, we will fine-tune a pre-trained transformer encoder (**BERT**) for our classification problem.


In [1]:
# !pip install --quiet transformers torch



In [7]:
# IMPORTS
from math import ceil
import numpy as np

import pandas as pd

import torch
import torch.nn as nn

from transformers import BertTokenizer, BertModel

# 1. Data

In [8]:
df = pd.read_csv('spam.csv', encoding='latin-1')
df = df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)

In [9]:
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [10]:
######################   TODO 1.1   ########################
# change the label column so that `spam` labels get `1` 
# and `ham` gets `0`
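# assign the mapped column back instead of using an in-place replace on a
# column selection, which newer pandas versions warn about
df['label'] = df['label'].map({'spam': 1, 'ham': 0})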
###################### (2 points) ##########################

In [15]:
######################   TODO 1.2   ########################
# split the dataframe into two sections of train and val. 
# keep the train size 10 times of val.
from sklearn.model_selection import train_test_split
df_train, df_val = train_test_split(df, train_size=0.9, random_state=10)
###################### (3 points) ##########################

In [16]:
######################   TODO 1.3   ########################
# based on what you did in homework 1, create a dataset and 
# a dataloader. Your dataset should return a text with its 
# respective label when iterated.
###################### (10 points) ##########################

class CustomDataset:
    def __init__(self, df):
        # reset the index: after train_test_split the rows keep their original
        # labels, so label-based lookups such as df['text'][0] can fail
        self.dataset = df.reset_index(drop=True)

    def __getitem__(self, index):
        # positional lookup, returns (text, label)
        row = self.dataset.iloc[index]
        return row['text'], row['label']

    def __len__(self):
        return len(self.dataset)


class CustomDataloader:
    def __init__(self, dataset, batch_size, shuffle=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __len__(self):
        # number of batches; the last batch may be smaller than batch_size
        return ceil(len(self.dataset) / self.batch_size)

    def __iter__(self):
        indexes = list(range(len(self.dataset)))
        if self.shuffle:
            np.random.shuffle(indexes)

        for start in range(0, len(indexes), self.batch_size):
            batch_texts = []
            batch_labels = []

            # map each position in the batch to a dataset index exactly once
            for i in indexes[start:start + self.batch_size]:
                txt, lbl = self.dataset[i]
                batch_texts.append(txt)
                batch_labels.append(lbl)

            yield batch_texts, batch_labels



In [17]:
######################   TODO 1.4   ########################
# initialize a dataloader for each of your train and val
# splits.
train_dataset = CustomDataset(df_train)
val_dataset = CustomDataset(df_val)

batch_size = 64
# shuffle the training batches every epoch; validation order does not matter
train_dataloader = CustomDataloader(train_dataset, batch_size, shuffle=True)
val_dataloader = CustomDataloader(val_dataset, batch_size)
###################### (5 points) ##########################
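As a quick sanity check on the batching logic above, the sketch below slices a list into batches and confirms that the number of batches is `ceil(n / batch_size)`, with a shorter final batch. The helper `iter_batches` and the toy list are hypothetical and only stand in for the SMS data.

```python
from math import ceil


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


toy_items = list(range(10))  # hypothetical stand-in for 10 messages
batches = list(iter_batches(toy_items, batch_size=4))

# the batch count matches the formula used in CustomDataloader.__len__
print(len(batches) == ceil(len(toy_items) / 4))
print([len(b) for b in batches])
```

With 10 items and a batch size of 4, the final batch holds only the 2 remaining items, which is why `__len__` rounds up instead of using integer division.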

# 2. Pretrained Language Model

In this section we will use the pretrained **BERT** model from the `transformers` library together with its `tokenizer`. **BERT** is a transformer encoder that is suited to various downstream NLP tasks, such as *sequence classification*.

In [18]:
# Defining the tokenizer and model
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained("bert-base-uncased")



In [19]:
text = "What is your name?"
tokenized = bert_tokenizer(text, max_length=128, padding="max_length", truncation=True, return_tensors='pt')
encoding = bert_model(**tokenized)

**TODO 2.1.** In the section below, try to explain the arguments that `bert_tokenizer` gets as input. (text, max_length, padding, truncation, return_tensors) *(10 points)*

**text**: The string (or list of strings) to be tokenized.

**max_length**: The maximum number of tokens in the tokenized output, counting special tokens such as `[CLS]` and `[SEP]`. It is the target length used by both `padding="max_length"` and truncation.

**padding**: Controls how short sequences are padded. `True` or `'longest'` pads to the longest sequence in the batch. `'max_length'` pads to the length given by `max_length`, or to the maximum length accepted by the model if `max_length` is not provided. `False` or `'do_not_pad'` applies no padding.

**truncation**: Controls how long sequences are shortened. `True` or `'longest_first'` truncates token by token, removing a token from the longest sequence of a pair until the proper length is reached. `'only_first'` truncates only the first sentence of a pair and `'only_second'` only the second. `False` or `'do_not_truncate'` applies no truncation. In every truncating mode, the target length is `max_length`, or the maximum length accepted by the model if `max_length` is not provided.

**return_tensors**: Specifies the type of the returned tensors: `'pt'` for PyTorch, `'tf'` for TensorFlow, and `'np'` for NumPy. If it is omitted, plain Python lists of integers are returned.
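To make the effect of `max_length`, `padding="max_length"` and `truncation=True` concrete, here is a toy sketch for a single sequence of token ids. It is not the library's implementation: `pad_or_truncate` is a hypothetical helper, the ids are made up, and unlike the real tokenizer it does not preserve the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy model of padding="max_length" plus truncation=True for one sequence.

    Returns (input_ids, attention_mask), both of length `max_length`.
    """
    ids = token_ids[:max_length]            # truncation: drop tokens past max_length
    attention_mask = [1] * len(ids)         # 1 marks real tokens
    n_pad = max_length - len(ids)           # padding: fill up to max_length
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


short_ids = [11, 12, 13]                    # made-up token ids
long_ids = [11, 12, 13, 14, 15, 16, 17]

short_out = pad_or_truncate(short_ids, max_length=5)
long_out = pad_or_truncate(long_ids, max_length=5)
print(short_out)
print(long_out)
```

Either way, every sequence comes out with the same length, which is what allows a batch of messages to be stacked into a single tensor. The attention mask tells the model which positions are padding.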

# 3. Model

If you inspect the `encoding` returned by `BERT`, you will see that `BERT` produces one vector for each token of the input sentence. However, not all of these token vectors are needed for a simple classification task.

Instead, we can use the representation of the first token (`[CLS]`), which summarizes the whole sequence. `BERT` exposes it through a special output called `pooler_output`: the final hidden state of the first token passed through a linear layer and a tanh activation. We will use this `pooler_output` as the input of the classification head inside our classifier model.
![BERT pooler output](https://miro.medium.com/max/1100/1*Or3YV9sGX7W8QGF83es3gg.webp)
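The pooling step can be sketched in a few lines of plain Python. This is only an illustration with tiny made-up vectors and weights (the real model uses 768-dimensional vectors and learned parameters); the function name `pooler` and all numbers are hypothetical.

```python
import math


def pooler(hidden_states, weight, bias):
    """Return tanh(W @ h_first + b), using only the first token's vector."""
    first = hidden_states[0]
    return [
        math.tanh(sum(w * h for w, h in zip(row, first)) + b)
        for row, b in zip(weight, bias)
    ]


# two tokens with 2-dimensional hidden states (made-up values)
hidden_states = [[1.0, -1.0], [5.0, 5.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]   # identity matrix, for readability
bias = [0.0, 0.0]

pooled = pooler(hidden_states, weight, bias)
print(pooled)
```

Only the first row of `hidden_states` enters the computation, so the output has one fixed-size vector per sentence regardless of how many tokens the sentence contains.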

In [20]:
class SpamClassifier(nn.Module):
    def __init__(self, embedding_tokenizer, embedding_model):
        super().__init__()
        ######################   TODO 3.1   ########################
        # construct layers and structure of the network
        self.embedding_size = 768

        self.tokenizer = embedding_tokenizer
        self.embedding = embedding_model
        self.classifier = torch.nn.Linear(self.embedding_size, 1)
        self.sigmoid = nn.Sigmoid()
        ###################### (10 points) #########################

    def forward(self, x):
        ######################   TODO 3.2   ########################
        # implement the forward pass of your model. first tokenize
        # the sentence, then get the embeddings from your language
        # model, then use the `pooler_output` for your classifier
        # layer. 
        tokenized = self.tokenizer(x, max_length=128, padding="max_length", truncation=True, return_tensors='pt')
        encoding = self.embedding(**tokenized)
        return self.sigmoid(self.classifier(encoding.pooler_output))
        ###################### (10 points) #########################

    def predict(self, x):
        ######################   TODO 3.3   ########################
        # get the predicted class of x.
        # threshold the probabilities at 0.5; returns a tensor of 0/1 labels
        # with one entry per input text, so it also works on batches
        # (calling .item() fails when x holds more than one text)
        with torch.no_grad():
            probs = self.forward(x)
        return (probs > 0.5).long().view(-1)
        ###################### (5 points) #########################

# 4. Training and Evaluation

In [21]:
######################   TODO 4.1   ########################
# define the learning parameters here (lr and epochs.)
# then initialize your model, an appropriate optimizer
# and loss function.
model = SpamClassifier(bert_tokenizer, bert_model)
# 1e-3 is the same value as the original 10e-4; note that the BERT authors
# suggest much smaller rates (2e-5 to 5e-5) for fine-tuning
lr, epochs = 1e-3, 10
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
###################### (10 points) ##########################
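For reference, the loss chosen above can be reproduced by hand. The sketch below applies a sigmoid to two made-up logits and averages the binary cross-entropy, which corresponds to the default `reduction='mean'` of `nn.BCELoss`. The helper names and numbers are hypothetical, and the clamping that PyTorch applies to very large log terms is left out.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(probs, targets):
    """Mean binary cross-entropy over a batch of probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


logits = [2.0, -1.0]     # made-up classifier outputs before the sigmoid
targets = [1.0, 0.0]     # 1 = spam, 0 = ham

probs = [sigmoid(z) for z in logits]
loss = bce(probs, targets)
predictions = [int(p > 0.5) for p in probs]
print(probs, loss, predictions)
```

The targets are floats on purpose: `nn.BCELoss` expects float targets with the same shape as the predicted probabilities.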

In [22]:
from copy import deepcopy

class Trainer:
    def __init__(self, 
        train_dataloader, val_dataloader, model,
        optimizer, criterion, *args, **kwargs
    ):
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.model = model
        self.best_model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.train_losses = []
        self.val_losses = []
        self.val_loss = None
        self.min_val_loss = np.inf

    def train(self, epochs, log_each_n_percent_epoch):
        train_steps = len(self.train_dataloader)
        # at least 1, so the modulo below never divides by zero
        log_steps = max(1, int(train_steps * log_each_n_percent_epoch/100))

        self.model.train()

        for epoch in range(epochs):
            print(f"epoch {epoch+1} started".title().center(50, "="))
            listLoss = []
            for step, (data, labels) in enumerate(self.train_dataloader):
                ######################   TODO 3.1   ########################

                labels_pred = self.model(data)
                # BCELoss expects float targets shaped like the (batch, 1) predictions
                labels = torch.tensor(list(map(int, labels)), dtype=torch.float).view(-1, 1)
                loss = self.criterion(labels_pred, labels)
                listLoss.append(loss.item())
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
                
                ###################### (10 points) #########################

                if (step + 1) % log_steps == 0:
                    # evaluate() appends the validation loss to self.val_losses itself
                    self.val_loss, accuracy = self.evaluate(save=True)
                    self.model.train()  # evaluate() leaves the model in eval mode
                    self.train_losses.append(np.mean(listLoss))
                    listLoss = []
                    info_text = f'Validation Loss: {self.val_loss:.6f}\t Accuracy-score: {accuracy:.2f}'
                    print(info_text)
                    self.post_evaluation_actions()

    def post_evaluation_actions(self):
        # this method was called but not defined in the original cell;
        # assumed intent: keep a copy of the model with the lowest validation loss
        if self.val_loss < self.min_val_loss:
            self.min_val_loss = self.val_loss
            self.best_model = deepcopy(self.model)

    def evaluate(self, save=False):
        listLoss = []
        self.model.eval()  # disable dropout during evaluation
        with torch.no_grad():
            y_true, y_pred = [], []
            ######################   TODO 3.2   ########################

            for step, (data, labels) in enumerate(self.val_dataloader):
                # `data` is a list of strings that the model tokenizes itself,
                # so there is nothing to move to a device here
                labels_pred = self.model(data)
                labels = torch.tensor(list(map(int, labels)), dtype=torch.float).view(-1, 1)
                loss = self.criterion(labels_pred, labels)
                listLoss.append(loss.item())
                # threshold the probabilities at 0.5 to get class predictions
                y_true.extend(labels.view(-1).long().tolist())
                y_pred.extend((labels_pred.view(-1) > 0.5).long().tolist())
                
            # mean over batches, so the value does not depend on the number of batches
            val_loss = np.mean(listLoss)
            accuracy = np.mean([l1 == l2 for (l1, l2) in zip(y_true, y_pred)])
            
            ###################### (5 points) #########################
            self.val_losses.append(val_loss)
            return val_loss, accuracy

In [None]:
trainer = Trainer(train_dataloader, val_dataloader, model, optimizer, criterion)
trainer.train(epochs=5, log_each_n_percent_epoch=10)

In [None]:
######################   TODO 4.2   ########################
# implement your training loop and train your model.
# return to homework 1 if needed.
from copy import deepcopy
from sklearn.metrics import accuracy_score

class Trainer:
    def __init__(self, train_dataloader, val_dataloader, model, optimizer, criterion):
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.model = model
        self.best_model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.train_losses = []
        self.val_losses = []

    @staticmethod
    def _to_target(labels):
        # BCELoss expects float targets shaped like the (batch, 1) predictions
        return torch.tensor(list(map(int, labels)), dtype=torch.float).view(-1, 1)

    def train(self, epochs):
        min_val_loss = np.inf
        for epoch in range(epochs):
            print(f"epoch {epoch+1} started".title().center(50, "="))
            self.model.train()  # test() switches to eval mode, so reset every epoch
            train_loss, n_samples = 0.0, 0
            for step, (data, labels) in enumerate(self.train_dataloader):
                self.optimizer.zero_grad()
                prediction = self.model(data)
                loss = self.criterion(prediction, self._to_target(labels))
                loss.backward()
                self.optimizer.step()
                train_loss += loss.item() * len(data)
                n_samples += len(data)
            self.train_losses.append(train_loss / n_samples)

            # validate after every epoch and keep a copy of the best model so far
            val_loss, accuracy = self.test()
            print(f'Validation Loss: {val_loss:.6f}\t Accuracy-score: {accuracy:.2f}')
            if val_loss < min_val_loss:
                min_val_loss = val_loss
                self.best_model = deepcopy(self.model)
    
    def test(self, save=False):
        self.model.eval()
        with torch.no_grad():
            y_true, y_pred = [], []
            val_loss, n_samples = 0.0, 0
            for data, labels in self.val_dataloader:

                prediction = self.model(data)
                loss = self.criterion(prediction, self._to_target(labels))
                # weight each batch loss by its size, then divide by the sample count
                val_loss += loss.item() * len(data)
                n_samples += len(data)

                # threshold the probabilities at 0.5 to get class predictions
                y_pred.extend((prediction.view(-1) > 0.5).long().cpu().numpy())
                y_true.extend(map(int, labels))

            val_loss = val_loss / n_samples
            accuracy = accuracy_score(y_true, y_pred)
            self.val_losses.append(val_loss)
            return val_loss, accuracy


trainer = Trainer(train_dataloader, val_dataloader, model, optimizer, criterion)
trainer.train(epochs)
###################### (10 points) ##########################

# 5. Using HuggingFace

The [HuggingFace library](http://huggingface.co/) provides a convenient API for NLP tasks built around transformers. To get familiar with this comprehensive library, in this section you are asked to use the HuggingFace `Trainer`, `Dataset`, and `BertForSequenceClassification` to repeat what we did above.

Feel free to refer to the library documentation to learn about these modules.

In [None]:
######################   TODO 5.1   ########################
# use huggingface Trainer and Dataset API and train the 
# `SpamClassifier`. You should not use the `SpamClassifier`
# we implemented previously. Instead you should use 
# `BertForSequenceClassification` here.
###################### (25 points) #########################

In [None]:
# !pip install datasets

In [23]:
from datasets import load_metric
from transformers import BertForSequenceClassification
from transformers import TrainingArguments, Trainer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', return_dict=True)
model.train()


In [24]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer):
        self.target = df['label'].values
        self.data = df['text'].values
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # pad or truncate every message to exactly 128 tokens so that the
        # examples of a batch can be stacked; without truncation, messages
        # longer than 128 tokens would produce tensors of different lengths
        encoded_sent = self.tokenizer(
            self.data[index],
            add_special_tokens=True,
            max_length=128,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
            return_attention_mask=True
        )
        input_ids = encoded_sent.get('input_ids')
        attn_masks = encoded_sent.get('attention_mask')
        return {
            'input_ids': input_ids.squeeze(0),
            'attention_mask': attn_masks.squeeze(0),
            'labels': int(self.target[index])
        }

In [None]:
BATCH_SIZE = 32
train_dataset = Dataset(df_train, bert_tokenizer)
test_dataset = Dataset(df_val, bert_tokenizer)
metric = load_metric('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="DRIVE_PATH",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=2,
    seed=0,
    load_best_model_at_end=True,
)

trainer = Trainer(model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()    
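
The accuracy metric passed to the trainer boils down to an argmax over the two logits of each example followed by a comparison with the labels. A minimal pure-Python sketch of that computation, with made-up logits and a hypothetical helper name, could look like this:

```python
def accuracy_from_logits(logits, labels):
    """Pick the class with the largest logit per row and compare to the labels."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# one row of two logits (ham, spam) per message; values are made up
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.5, 1.0]]
toy_labels = [0, 1, 0, 0]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```

Unlike the sigmoid-based classifier from the earlier sections, `BertForSequenceClassification` emits one logit per class, so the prediction is an argmax rather than a 0.5 threshold.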