<a href="https://colab.research.google.com/github/DanielHolzwart/spam-filter-with-distilbert-and-gpt2/blob/main/Spam_filter_with_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam filter with GPT2
In this notebook we build a spam filter with GPT2 that differentiates spam SMS from ham SMS (ham meaning legitimate, non-spam messages). This is a rather uncommon approach, as GPT2 is an autoregressive model. The classical transformer-based approach would be BERT with a classification head, which works quite well, but using GPT2 is an interesting alternative.

The idea is quite simple: take an SMS, e.g. "How are you?", which in this case is obviously ham, reshape it to "How are you? This SMS is: ham", and let GPT2 learn how the label depends on the content of the SMS.
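
The reshaping described above can be sketched as a small helper. This is only an illustration: the function name `to_prompt` and the `LABEL_NAMES` mapping are hypothetical, the encoding 0 = ham and 1 = spam is assumed, and the end-of-text token is written out as a literal instead of being read from the tokenizer.

```python
# GPT-2's end-of-text token, written as a literal for this sketch.
EOS_TOKEN = "<|endoftext|>"

# Assumed label encoding: 0 = ham, 1 = spam.
LABEL_NAMES = {0: "ham", 1: "spam"}


def to_prompt(sms: str, label: int, eos_token: str = EOS_TOKEN) -> str:
    """Append the label sentence and the end-of-text token to one SMS."""
    return f"{sms} This SMS is: {LABEL_NAMES[label]} {eos_token}"


print(to_prompt("How are you?", 0))
```

At inference time the label and the end-of-text token are left out, and the model is expected to generate them itself.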

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the distilled version of GPT-2
tokenizer_gpt2 = AutoTokenizer.from_pretrained("distilgpt2")
model_gpt2 = AutoModelForCausalLM.from_pretrained("distilgpt2")

In [2]:
%%capture
%pip install datasets

from datasets import load_dataset # load dataset
dataset = load_dataset("ucirvine/sms_spam")

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sms', 'label'],
        num_rows: 5574
    })
})

The dataset is imbalanced with a lot more ham than spam sms.

In [19]:
# label 1 = spam, so the sum of the labels counts the spam messages
print(f"Spam percentage: {sum(dataset['train']['label']) / len(dataset['train']):.2%}")

Spam percentage: 13.40%
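
Since the labels are encoded as 1 for spam and 0 for ham (the same encoding used when the training strings are built), the sum of the labels counts the spam messages, so the 13.40% is the spam share. A minimal sketch of that arithmetic on a made-up label list (`toy_labels` is illustrative data, not the real dataset):

```python
from collections import Counter

# Toy labels for illustration only: 1 = spam, 0 = ham.
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

counts = Counter(toy_labels)
spam_share = counts[1] / len(toy_labels)  # identical to sum(toy_labels) / len(toy_labels)
ham_share = counts[0] / len(toy_labels)

print(f"spam: {spam_share:.2%}, ham: {ham_share:.2%}")
```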


In [21]:
from sklearn.model_selection import train_test_split # load library for splitting dataset

In [22]:
X_train, X_test, y_train, y_test = train_test_split(dataset['train']['sms'], dataset['train']['label'], test_size=0.2, shuffle = True, stratify = dataset['train']["label"])

Define the training set in the form described above. We append the end-of-sequence token via tokenizer_gpt2.eos_token, which helps GPT2 learn to stop generating once the label has been written.

In [23]:
ds_train = [sms + f' This SMS is: spam {tokenizer_gpt2.eos_token}' if label == 1 else sms + f' This SMS is: ham {tokenizer_gpt2.eos_token}' for sms, label in zip(X_train,y_train)]

In [24]:
ds_train[:2]

["Jay says that you're a double-faggot\n This SMS is: ham <|endoftext|>",
 'A bit of Ur smile is my hppnss, a drop of Ur tear is my sorrow, a part of Ur heart is my life, a heart like mine wil care for U, forevr as my GOODFRIEND\n This SMS is: ham <|endoftext|>']

Next we need to define a dataset and a dataloader for training. Moreover, we add a padding token to the tokenizer, as the default GPT2 tokenizer does not have one.

In [25]:
from torch.utils.data import Dataset, DataLoader
import torch

In [27]:
tokenizer_gpt2.add_special_tokens({'pad_token': '[PAD]'}) # add padding token
model_gpt2.resize_token_embeddings(len(tokenizer_gpt2)) # tell the model that the tokenizer got one additional token
model_gpt2.config.pad_token_id = tokenizer_gpt2.pad_token_id # define id of padding token
tokenizer_gpt2.padding_side = "left" # pad on the left so that every sequence ends with real tokens, not padding
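
The effect of left padding can be illustrated without the tokenizer. The following is a plain-Python sketch, not the tokenizer's implementation: `pad_batch` is a hypothetical helper and `PAD_ID` is a placeholder value for the id of the added `[PAD]` token.

```python
PAD_ID = 50257  # placeholder id for the added [PAD] token (assumed value)


def pad_batch(sequences, pad_id=PAD_ID, side="left"):
    """Pad lists of token ids to equal length and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_padding = [0] * len(padding)
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_padding + [1] * len(seq))
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append([1] * len(seq) + mask_padding)
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13], [21]])
print(ids)   # the short sequence is padded at the front
print(mask)  # zeros mark the padding positions
```

With left padding the last position of every row is a real token, so generation continues directly from the end of the SMS. With right padding the model would have to continue after a run of padding tokens.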

In [28]:
class SMSDataset(Dataset): # dataset wrapping the tokenizer output
    def __init__(self, sms):
        self.sms = sms

    def __getitem__(self, idx):
        # no 'labels' key needed: the data collator creates the labels when batching for training
        item = {'input_ids' : self.sms['input_ids'][idx],
                'attention_mask' : self.sms['attention_mask'][idx]}

        return item

    def __len__(self):
        return len(self.sms['input_ids'])

Note that the dataset has no labels key in the item dictionary. This is because we are going to use a data collator for training. It automatically generates the labels and sets them to -100 at the padding positions, so that the loss function knows to ignore padding during training.
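
The masking rule can be sketched in plain Python. This mirrors the idea only and is not the library's implementation: `make_labels` is a hypothetical helper and `PAD_ID` is a placeholder value for the id of the added `[PAD]` token.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
PAD_ID = 50257       # placeholder id for the added [PAD] token (assumed value)


def make_labels(input_ids, pad_id=PAD_ID):
    """Copy the input ids and replace every padding position with -100."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row]
            for row in input_ids]


batch = [[PAD_ID, PAD_ID, 21, 22], [11, 12, 13, 14]]
labels = make_labels(batch)
print(labels)
```

The labels are otherwise a copy of the input ids. The shift by one position needed for next-token prediction happens inside the model when `labels` is passed to it.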

In [29]:
import transformers
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer_gpt2, mlm=False)

Tokenize the dataset and initialize the dataloader.

In [31]:
dataset_enc = tokenizer_gpt2(ds_train ,return_tensors = 'pt', truncation = True, padding = True)
dataset_encoded = SMSDataset(dataset_enc)
train_dataloader = DataLoader(dataset_encoded, batch_size = 4, shuffle = True, collate_fn = data_collator)

Define the optimizer and select the device (GPU if available). The model is moved to the device at the start of training.

In [32]:
optim = torch.optim.AdamW(model_gpt2.parameters(), lr = 5e-5, weight_decay=0.01)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Now we can finally train the model. We set the number of epochs to 3.

In [33]:
NUM_EPOCHS = 3
model_gpt2.to(device)

for epoch in range(NUM_EPOCHS):

    model_gpt2.train()

    for batch_idx, batch in enumerate(train_dataloader):

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        output = model_gpt2(input_ids = input_ids, attention_mask = attention_mask, labels = labels)
        loss, logits = output.loss, output.logits

        optim.zero_grad()
        loss.backward()
        optim.step()

        if not batch_idx % 250:
            print(f'Epoch: {epoch+1:04d}/{NUM_EPOCHS:04d}'
                  f' | Batch '
                  f'{batch_idx:04d}/'
                  f'{len(train_dataloader):04d} | '
                  f'Loss: {loss:.4f}')

Epoch: 0001/0003 | Batch 0000/1115 | Loss: 6.6023
Epoch: 0001/0003 | Batch 0250/1115 | Loss: 3.1030
Epoch: 0001/0003 | Batch 0500/1115 | Loss: 3.2274
Epoch: 0001/0003 | Batch 0750/1115 | Loss: 3.2250
Epoch: 0001/0003 | Batch 1000/1115 | Loss: 3.6394
Epoch: 0002/0003 | Batch 0000/1115 | Loss: 2.7749
Epoch: 0002/0003 | Batch 0250/1115 | Loss: 3.3694
Epoch: 0002/0003 | Batch 0500/1115 | Loss: 3.5371
Epoch: 0002/0003 | Batch 0750/1115 | Loss: 3.8696
Epoch: 0002/0003 | Batch 1000/1115 | Loss: 2.8214
Epoch: 0003/0003 | Batch 0000/1115 | Loss: 3.4172
Epoch: 0003/0003 | Batch 0250/1115 | Loss: 2.3326
Epoch: 0003/0003 | Batch 0500/1115 | Loss: 2.2196
Epoch: 0003/0003 | Batch 0750/1115 | Loss: 2.7871
Epoch: 0003/0003 | Batch 1000/1115 | Loss: 1.9675


Looking at the loss output, it might make sense to train for a few additional epochs, but we will leave it as it is for the moment.
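
Each logged value is the loss of a single batch of four messages, so it fluctuates considerably. Averaging the logged values per epoch makes the trend easier to see. The sketch below uses only the five values printed per epoch in the log above, so the results are rough indicators, not true epoch means.

```python
# Losses copied from the training log above (one logged batch every 250 steps).
logged_losses = {
    1: [6.6023, 3.1030, 3.2274, 3.2250, 3.6394],
    2: [2.7749, 3.3694, 3.5371, 3.8696, 2.8214],
    3: [3.4172, 2.3326, 2.2196, 2.7871, 1.9675],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


epoch_means = {epoch: mean(values) for epoch, values in logged_losses.items()}
for epoch, value in epoch_means.items():
    print(f"Epoch {epoch}: mean of logged losses = {value:.4f}")
```

The averages still decrease from epoch to epoch, which is consistent with the impression that the model has not converged yet.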

Now we evaluate the model on the test set. As above, we need to define a dataloader, this time without the label suffix, since the model has to generate it.

In [34]:
ds_test = [sms for sms in X_test]
ds_test_label = [label for label in y_test]

In [35]:
ds_test_encoded = tokenizer_gpt2(ds_test ,return_tensors = 'pt', truncation = True, padding = True)

In [41]:
dataset_encoded_test = SMSDataset(ds_test_encoded)
train_dataloader_test = DataLoader(dataset_encoded_test, batch_size = 1, shuffle = False)

Let us start with just a couple of examples to see it working.

In [48]:
model_gpt2.eval()
text_model = []
text_orig = []
for batch_idx, batch in enumerate(train_dataloader_test):
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    output = model_gpt2.generate(input_ids = input_ids, attention_mask = attention_mask, pad_token_id = tokenizer_gpt2.pad_token_id)
    text_model.append(tokenizer_gpt2.batch_decode(output, skip_special_tokens = True))
    text_orig.append(tokenizer_gpt2.batch_decode(input_ids, skip_special_tokens = True))
    if batch_idx == 2: # stop after three examples
        break

In [49]:
for text, orig in zip(text_model, text_orig):
    print(text)
    print(orig)

['Oh did you charge camera\n no need to pay for the camera.\n This SMS is: ham ']
['Oh did you charge camera\n']
['Ok no prob\n ok.\n This SMS is: ham ']
['Ok no prob\n']
['Is ur paper today in e morn or aft?\n this SMS is: ham ']
['Is ur paper today in e morn or aft?\n']


We notice a couple of things.


1. There is an extra space after "ham", possibly coming from the space before the eos_token in the training dataset.
2. The generated label sentence is sometimes capitalized and sometimes not ("This SMS is" vs. "this SMS is").
3. For some reason, GPT2 sometimes continues the SMS with extra text before adding the spam or ham label. In the generate function we simply use greedy decoding. We could change that by passing num_beams, or do_sample=True together with top_k or top_p.
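
Observations 1 and 2 suggest that the predicted label should be extracted in a way that tolerates surrounding whitespace and differences in capitalization. A minimal sketch, assuming the label is the first "ham" or "spam" that follows the phrase "This SMS is:" (the names `extract_label` and `accuracy` are hypothetical helpers, not part of the notebook):

```python
import re

# Case-insensitive match of the label sentence; \s* tolerates extra spaces.
LABEL_PATTERN = re.compile(r"this sms is:\s*(ham|spam)\b", re.IGNORECASE)
LABEL_IDS = {"ham": 0, "spam": 1}  # assumed encoding: 0 = ham, 1 = spam


def extract_label(generated: str):
    """Return 'ham' or 'spam' from the generated text, or None if absent."""
    match = LABEL_PATTERN.search(generated)
    return match.group(1).lower() if match else None


def accuracy(generated_texts, true_labels):
    """Share of texts whose extracted label equals the true label id."""
    predictions = [LABEL_IDS.get(extract_label(text)) for text in generated_texts]
    correct = sum(pred == true for pred, true in zip(predictions, true_labels))
    return correct / len(true_labels)


examples = [
    "Oh did you charge camera\n no need to pay for the camera.\n This SMS is: ham ",
    "Is ur paper today in e morn or aft?\n this SMS is: ham ",
]
print([extract_label(text) for text in examples])
```

A text without a recognizable label yields `None`, which `accuracy` counts as a wrong prediction. Whether that is the right treatment of missing labels is a design choice.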

