# 0. Introduction

In this notebook, we aim to make a classifier to identify spam messages. We will use a dataset that is consisted of 5000 SMS texts. Some of theses texts are labeled as `spam` while the rest are considered `ham`.

For this aim, we will use **BERT** word-embeddings from the `transformers` library. We will not train a transformer, as it requires a lot of GPU power, but we will fine-tune a pre-trained transformer encoder (**BERT**) for our classification problem.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install --quiet transformers torch

[K     |████████████████████████████████| 5.8 MB 15.2 MB/s 
[K     |████████████████████████████████| 182 kB 69.1 MB/s 
[K     |████████████████████████████████| 7.6 MB 62.1 MB/s 
[?25h

In [3]:
# IMPORTS
from math import ceil
import pandas as pd
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import BertTokenizer, BertModel

# 1. Data

In [4]:
%cd /content/drive/MyDrive/MSC1401_1/DeepLearning/HW4

/content/drive/.shortcut-targets-by-id/1jw5EbFPPDheYFyKvd8dZjLnzLFdd40Z9/MSC1401_1/DeepLearning/HW4


In [5]:
df = pd.read_csv('spam.csv', encoding='latin-1')
df = df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)

In [6]:
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [7]:
######################   TODO 1.1   ########################
# change the label column so that `spam` labels get `1` 
# and `ham` gets `0`
###################### (2 points) ##########################
df['label'] = df['label'].map({'ham':0, 'spam':1})
df.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [8]:
def train_validate_split(df, validate_percent=.1, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int((1-validate_percent) * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    return train, validate

In [9]:
######################   TODO 1.2   ########################
# split the dataframe into two sections of train and val. 
# keep the train size 10 times of val.
###################### (3 points) ##########################
df_train, df_val = train_validate_split(df,0.1,0)

In [10]:
df_train

Unnamed: 0,label,text
4456,0,Aight should I just plan to come up later toni...
690,0,Was the farm open?
944,0,I sent my scores to sophas and i had to do sec...
3768,0,Was gr8 to see that message. So when r u leavi...
1189,0,In that case I guess I'll see you at campus lodge
...,...,...
424,0,Send this to ur friends and receive something ...
4421,0,MMM ... Fuck .... Merry Christmas to me
3715,0,Networking technical support associate.
664,0,Yes baby! We can study all the positions of th...


In [11]:
df_val

Unnamed: 0,label,text
447,0,I wont get concentration dear you know you are...
709,1,4mths half price Orange line rental & latest c...
1924,0,Ok
1221,0,Prakesh is there know.
3922,0,Okay lor... Will they still let us go a not ah...
...,...,...
4859,0,"\Response\"" is one of d powerful weapon 2 occu..."
4931,0,Match started.india &lt;#&gt; for 2
3264,1,"44 7732584351, Do you want a New Nokia 3510i c..."
1653,0,I was at bugis juz now wat... But now i'm walk...


In [12]:
######################   TODO 1.3   ########################
# based on what you did in homework 1, create a dataset and 
# a dataloader. Your dataset should return a text with its 
# respective label when iterated.
###################### (10 points) ##########################

class CustomDataset:
    def __init__(self, df):
        self.data = df['text'].values
        self.targets = df['label'].values

    def __getitem__(self, index):
        return self.data[index].lower(), self.targets[index]

    def __len__(self):
        return len(self.data)

class CustomDataloader:
    def __init__(self, dataset, batch_size, shuffle=False, drop_last=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last

    def __len__(self):
        if self.drop_last:
          return len(self.dataset)//self.batch_size

        return ceil(len(self.dataset)/self.batch_size)

    def __iter__(self, calm=True):
        indexes = list(range(len(self.dataset)))

        if self.shuffle:
          random.shuffle(indexes)

        for idx in range(0, len(self.dataset), self.batch_size):

            batch_indexes = indexes[idx:idx + self.batch_size]
            texts = self.dataset.data[batch_indexes]
            batch_labels = self.dataset.targets[batch_indexes]
            yield list(texts), torch.tensor(batch_labels)

        return

In [13]:
######################   TODO 1.4   ########################
# initialize a dataloader for each of your train and val
# splits.
###################### (5 points) ##########################
train_set = CustomDataset(df_train)
val_set = CustomDataset(df_val)
train_loader = CustomDataloader(train_set, 64, True, False)
val_loader = CustomDataloader(val_set, 64, False, False)

In [14]:
for data, labels in train_loader:
      print(type(data))
      print(labels.shape)
      break

<class 'list'>
torch.Size([64])


# 2. Pretrained Language Model

In this section we will use the pretrained **BERT** model from the `transformers` library with its respective `tokenizer`. **BERT** is a transformer encoder which is suited for various downstream NLP tasks namely *Sequence classification*.

In [15]:
# Defining the tokenizer and model
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
bert_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [17]:
text = "What is your name?"
tokenized = bert_tokenizer(text, max_length=128, padding="max_length", truncation=True, return_tensors='pt')
encoding = bert_model(**tokenized)

**TODO 2.1.** In section bellow, try to explain the arguments that `bert_tokenizer` gets as input. (text, max_length, padding, truncation, return_tensors) *(10 points)*

<font color=red>please write your answer in this cell</font>



### **text**: 
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string).

### **padding**:
Activates and controls padding. 
It accepts the following values:
<li>True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence provided).
<li>'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
<li>False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).

### **truncation**:
Activates and controls truncation. Accepts the following values:
<li>True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.</li>
<li>'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.</li>
<li>'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.</li>
<li>False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).

### **max_length**: 
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.

### **return_tensors**:
If set, will return tensors instead of list of python integers. Acceptable values are:
'tf': Return TensorFlow tf.constant objects.
'pt': Return PyTorch torch.Tensor objects.
'np': Return Numpy np.ndarray objects.

### **Example**
'My name is Reihaneh'


The max_length specifies the length of the tokenized text. By default, BERT performs word-piece tokenization, followed by adding [CLS] token at the beginning of the sentence, and [SEP] token at the end of sentence. Thus, it first tokenizes the sentence, truncates it to max_length-2 (if truncation=True), then prepend [CLS] at the beginning and [SEP] token at the end.(So a total length of max_length)

padding='max_length': In this example if we have a max_length of 10, the tokenized text corresponds to [101, 2026, 2171, 2003, 11754, 102, 0, 0, 0, 0], where 101 is id of [CLS] and 102 is id of [SEP] tokens. Thus, padded by zeros to make all the text to the length of max_length.

Likewise, truncate=True will ensure that the max_length is strictly adhered, i.e, longer sentences are truncated to max_length only if truncate=True


# 3. Model

If you inspect the `encoding` of the `BERT`, you will realize that `BERT` gives a vector for each of the tokens included in the input sentence. However, all of these word tokens are not needed for a simple classification task.

Instead, we can use the first token representation, as it captures the whole tokens meanings. `BERT` provides this token for us in a special variable called `pooler_output`. We will use this `pooler_output` as the input of our classification head inside our classifier model.
![BERT pooler output](https://miro.medium.com/max/1100/1*Or3YV9sGX7W8QGF83es3gg.webp)

In [18]:
class SpamClassifier(nn.Module):
    def __init__(self, embedding_tokenizer, embedding_model):
        super().__init__()
        ######################   TODO 3.1   ########################
        # construct layers and structure of the network

        self.tokenizer = embedding_tokenizer
        self.embedding = embedding_model
        self.classifier = nn.Linear(self.embedding.config.hidden_size,2)
        self.sigmoid = nn.Sigmoid()
        ###################### (10 points) #########################

    def forward(self, x):
        ######################   TODO 3.2   ########################
        # implement the forward pass of your model. first tokenizer
        # the sentence, the get the embeddings from your language
        # model, then use the `pooler_output` for your classifier
        # layer. 
        x = self.tokenizer(x,max_length=128, padding="max_length", truncation=True, return_tensors='pt')
        x = {k: v.cuda() for k,v in x.items()}
        x = self.embedding(**x)
        x = self.classifier(x['pooler_output'])
        x = self.sigmoid(x)
        return x
        ###################### (10 points) #########################

    def predict(self, x):
        ######################   TODO 3.3   ########################
        # get the predicted class of x.
        outputs = self.forward(x)
        _,prediction = torch.max(outputs, 1)
        return prediction
        ###################### (5 points) #########################

# 4. Training and Evaluation

In [19]:
######################   TODO 4.1   ########################
# define the learning parameters here (lr and epochs.)
# then initilizer your model, an appropriate optimizer
# and loss function.
EPOCH = 5

model = SpamClassifier(bert_tokenizer,bert_model)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=1e-4)
save_path = '/content/drive/MyDrive/MSC1401_1/DeepLearning/HW4/Checkpoints/best_model.pt'
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()
###################### (10 points) ##########################

In [20]:
######################   TODO 4.2   ########################
# implement your training loop and train your model.
# return to homework 1 if needed.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

def train(n_epochs, loaders, model, optimizer, criterion, use_cuda, save_path):
    """returns trained model"""
    valid_loss_min = np.Inf 
    for epoch in range(1, n_epochs+1):
        train_loss = 0.0
        valid_loss = 0.0
        train_corrects = 0
        valid_corrects = 0
        model.train()
        for batch_idx, (data,target) in enumerate(loaders['train']):
            if use_cuda:
              target = target.to(device)
            optimizer.zero_grad()
            outputs = model(data)
            _,preds = torch.max(outputs, 1)
            loss = criterion(outputs,target)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            train_corrects += torch.sum(preds.data == target.data)
        model.eval()
        for batch_idx, (data, target) in enumerate(loaders['valid']):
            if use_cuda:
              target = target.to(device)
            outputs = model(data)
            _,preds = torch.max(outputs, 1)
            loss = criterion(outputs,target)
            valid_loss += loss.item()
            valid_corrects += torch.sum(preds == target.data)

        train_loss = train_loss/len(loaders['train'])
        train_acc = train_corrects / len(loaders['train'].dataset)

        valid_loss = valid_loss/len(loaders['valid'])
        valid_acc = valid_corrects / len(loaders['valid'].dataset)

        print(" "*10+"="*5+' Epoch: {}'.format(epoch),"="*5+' \nTraining Loss: {:.6f} Acc: {:.6f} \nValidation: Loss: {:.6f} Acc: {:.6f}'.format(
            train_loss,
            train_acc,
            valid_loss,
            valid_acc
            ))
        if valid_loss < valid_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
            valid_loss_min,
            valid_loss))
            torch.save(model.state_dict(), save_path)
            valid_loss_min = valid_loss 
    return model
###################### (10 points) ##########################

In [21]:
loaders = {'train': train_loader, 'valid': val_loader}
model = train(EPOCH,loaders,model,optimizer,criterion,True,save_path)

          ===== Epoch: 1 ===== 
Training Loss: 0.612331 Acc: 0.866973 
Validation: Loss: 0.582147 Acc: 0.856373
Validation loss decreased (inf --> 0.582147).  Saving model ...
          ===== Epoch: 2 ===== 
Training Loss: 0.554235 Acc: 0.866973 
Validation: Loss: 0.527511 Acc: 0.856373
Validation loss decreased (0.582147 --> 0.527511).  Saving model ...
          ===== Epoch: 3 ===== 
Training Loss: 0.493951 Acc: 0.896290 
Validation: Loss: 0.450711 Acc: 0.976661
Validation loss decreased (0.527511 --> 0.450711).  Saving model ...
          ===== Epoch: 4 ===== 
Training Loss: 0.436636 Acc: 0.981452 
Validation: Loss: 0.411346 Acc: 0.994614
Validation loss decreased (0.450711 --> 0.411346).  Saving model ...
          ===== Epoch: 5 ===== 
Training Loss: 0.409161 Acc: 0.987036 
Validation: Loss: 0.392862 Acc: 0.994614
Validation loss decreased (0.411346 --> 0.392862).  Saving model ...


In [22]:
def test(loaders, model, criterion, use_cuda):
    test_loss = 0.
    correct = 0.
    total_data = 0.
    model.eval()
    for batch_idx, (data, labels) in enumerate(loaders['valid']):
        if use_cuda:
            labels = labels.cuda()
        outputs = model.predict(data)
        total_data += labels.size(0)
        correct += (outputs == labels).sum().item()
        loss = criterion(outputs.float(),labels.float())
        test_loss += loss.item()

    print('Test Loss: {:.6f}\n'.format(test_loss/len(loaders['valid'].dataset)))
    print('\nTest Accuracy: %2d%% (%2d/%2d)' % (
        100. * correct / total_data, correct, total_data))

model.load_state_dict(torch.load(save_path))
test(loaders, model, criterion, True)


Test Loss: 0.486483


Test Accuracy: 99% (554/557)


# 5. Using HuggingFace

[HuggingFace library](http://huggingface.co/) has built a nice API for NLP tasks around the transformers. To get familiar with this comrehensive library, In this section you are asked to use the huggingface `Trainer`, `Dataset`, and `BertForSequenceClassification` to do what we did above again.

Feel free to refer to the library documentation to learn about these modules.

In [None]:
! pip install transformers datasets

In [5]:
from datasets import load_dataset
from sklearn.model_selection import train_test_split


dataset = load_dataset("sms_spam")
dataset["train"][100]




  0%|          | 0/1 [00:00<?, ?it/s]

{'sms': "Please don't text me anymore. I have nothing else to say.\n",
 'label': 0}

In [2]:
shuffled_dataset = dataset["train"].shuffle(seed=0)
ds_train = shuffled_dataset.select(range(int(0.9*len(dataset["train"]))))
ds_val = shuffled_dataset.select(range(int(0.9*len(dataset["train"])),len(dataset["train"])))

In [3]:
ds_train

Dataset({
    features: ['sms', 'label'],
    num_rows: 5016
})

In [4]:
ds_val

Dataset({
    features: ['sms', 'label'],
    num_rows: 558
})

In [6]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


def tokenize_function(examples):
    return tokenizer(examples["sms"], padding="max_length", truncation=True)


tokenized_dataset_train = ds_train.map(tokenize_function, batched=True)
tokenized_dataset_val = ds_val.map(tokenize_function, batched=True)

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [7]:
tokenized_dataset_val

Dataset({
    features: ['sms', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 558
})

In [9]:
######################   TODO 5.1   ########################
# use huggingface Trainer and Dataset API and train the 
# `SpamClassifier`. You should not use the `SpamClassifier`
# we implemented previously. Instead you should use 
# `BertForSequenceClassification` here.
###################### (25 points) #########################
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_metric
import numpy as np

metric = load_metric('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", 
    num_labels = 2,          
    return_dict=True
)



training_args = TrainingArguments( output_dir="test_trainer",
    evaluation_strategy="epoch", 
    save_strategy="epoch",
    num_train_epochs=2,
    seed=0,
    load_best_model_at_end=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_train,
    eval_dataset=tokenized_dataset_val,
    compute_metrics=compute_metrics,
)

trainer.train()

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/pytorch_model.bin
Some weights of the model check

Epoch,Training Loss,Validation Loss,Accuracy
1,0.0757,0.034638,0.994624
2,0.0429,0.029118,0.996416


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sms. If sms are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 558
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-627
Configuration saved in test_trainer/checkpoint-627/config.json
Model weights saved in test_trainer/checkpoint-627/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sms. If sms are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 558
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-1254
Configuration saved in test_trainer/checkpoint-1254/config.json
Model weights saved in test_trainer/checkpo

TrainOutput(global_step=1254, training_loss=0.049634043489726914, metrics={'train_runtime': 926.202, 'train_samples_per_second': 10.831, 'train_steps_per_second': 1.354, 'total_flos': 2639530107371520.0, 'train_loss': 0.049634043489726914, 'epoch': 2.0})