## Setup

In [None]:
from pathlib import Path
path = Path('../input/spamham-email-classification-nlp')
! pip install -q -U datasets 
! pip install -q -U evaluate
! pip install -q -U huggingface_hub
! pip install -q -U peft
! pip install -q -U transformers

**Optional:** Log in to Hugging Face to save the dataset and model. The Hugging Face Hub is a hosting platform similar to GitHub, but for ML models and datasets. You can explore more [here](https://huggingface.co/).

In [None]:
from huggingface_hub import notebook_login
notebook_login()

The data is in the `emails.csv` file. For opening, manipulating, and viewing CSV files, it's generally best to use the pandas library.

## Load and process data

In [3]:
import pandas as pd
df = pd.read_csv(path/'emails.csv')

This creates a DataFrame, which is a table of named columns, a bit like a database table. To view the first and last rows, and row count of a DataFrame, just type its name:

In [4]:
df

Unnamed: 0,Text,Spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0


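The DataFrame has a useful method called `describe()` that summarizes its columns. With `include='object'`, it reports the count, the number of unique values, the most frequent value, and that value's frequency for each text column.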

In [5]:
df.describe(include='object')

Unnamed: 0,Text
count,5728
unique,5695
top,"Subject: re : contact info glenn , please , ..."
freq,2


We can see that across the 5728 rows there are only 5695 unique text values, so 33 rows repeat an earlier email. Duplicated rows can bias the model, and the same email could end up in both the training and evaluation sets. Luckily, this is a common problem, and pandas provides a method for dropping duplicated rows.

In [6]:
df = df.drop_duplicates()
df.describe(include='object')

Unnamed: 0,Text
count,5695
unique,5695
top,Subject: naturally irresistible your corporate...
freq,1
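As a rough illustration of what `drop_duplicates()` does by default (keep the first occurrence of each fully identical row), here is a minimal sketch in plain Python. The toy rows are made up for illustration and are not taken from `emails.csv`.

```python
# Toy (text, label) rows; hypothetical data, not from emails.csv.
rows = [
    ("Subject: free sample", 1),
    ("Subject: meeting notes", 0),
    ("Subject: free sample", 1),   # exact duplicate of the first row
    ("Subject: free sample", 0),   # same text, different label: not a full-row duplicate
]

# dict.fromkeys keeps the first occurrence of each key and preserves order,
# which mirrors dropping later copies of fully identical rows.
deduped = list(dict.fromkeys(rows))

print(len(rows), "->", len(deduped))  # 4 -> 3
```

Because whole rows are compared, two rows with the same text but different labels would both survive. In this dataset, `count` and `unique` are both 5695 after the drop, so no such rows remain.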


Now that the data has been cleaned, we need to split it into 3 sets: train, validation and test. This [blog](https://blog.roboflow.com/train-test-split/) provides an excellent explanation of why we need 3 separate sets. In short:
* train: this set is used for training your model.
* validation: this set is also used during the training process to check for overfitting/underfitting and validate the model hyperparameters.
* test: this set acts as a future dataset. It is held out from the model during training and it is only used at the very end of the process for evaluating your model.

In [7]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.2, stratify=df['Spam'])
val_df, test_df = train_test_split(val_df, test_size=0.5, stratify=val_df['Spam'])



For this demo, the train/validation/test ratio is 80%/10%/10%. Note that the `stratify` option is set on the `Spam` column to ensure that the proportion of spam to "ham" emails is the same in all sets. This can be verified after splitting the data:

In [8]:
print("Spam percentage in train:", len(train_df[train_df["Spam"] == 1]) / len(train_df))
print("Spam percentage in validation:", len(val_df[val_df["Spam"] == 1]) / len(val_df))
print("Spam percentage in test:", len(test_df[test_df["Spam"] == 1]) / len(test_df))

Spam percentage in train: 0.24012291483757683
Spam percentage in validation: 0.24077328646748683
Spam percentage in test: 0.24035087719298245
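The 80%/10%/10% ratio comes from chaining two splits: 20% of the rows are held out first, then that held-out part is halved. Below is a small sketch of the arithmetic. It assumes the held-out part is rounded up when its size is not a whole number, which is an assumption about how `train_test_split` rounds, though it is consistent with the 4556/569/570 rows this dataset ends up with. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math
from fractions import Fraction

def split_sizes(n_rows, held_out_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up (assumed rule)."""
    n_held_out = math.ceil(held_out_fraction * n_rows)
    return n_rows - n_held_out, n_held_out

# 5695 rows remain after dropping duplicates.
n_train, n_rest = split_sizes(5695, Fraction(1, 5))   # first split: test_size=0.2
n_val, n_test = split_sizes(n_rest, Fraction(1, 2))   # second split: test_size=0.5

print(n_train, n_val, n_test)  # 4556 569 570
```

Exact fractions are used instead of floats so that the rounding step is not affected by floating-point error.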


The DataFrames are then converted to the `Dataset` format, which is more convenient and more efficient to process and use in training.

In [9]:
train_df.reset_index(drop=True, inplace=True)
val_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)

In [10]:
from datasets import Dataset,DatasetDict

ds = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "val": Dataset.from_pandas(val_df),
    "test": Dataset.from_pandas(test_df),
})

In [11]:
ds

DatasetDict({
    train: Dataset({
        features: ['Text', 'Spam'],
        num_rows: 4556
    })
    val: Dataset({
        features: ['Text', 'Spam'],
        num_rows: 569
    })
    test: Dataset({
        features: ['Text', 'Spam'],
        num_rows: 570
    })
})

**Optional:** Push the dataset to the Hugging Face Hub to persist the data.

In [12]:
ds.push_to_hub('spamming-email-classification')

## Preprocessing data for finetuning

**Optional:** Load the dataset from huggingface hub

In [3]:
from datasets import load_dataset
ds = load_dataset('legacy107/spamming-email-classification')

Although it is called a "language model", the model itself does not receive raw text as input. A deep learning model expects numbers as inputs, so we need to do two things:

- *Tokenization*: Split each text up into words (or actually, as we'll see, into *tokens*)
- *Numericalization*: Convert each word (or token) into a number.

The details about how this is done actually depend on the particular model we use. So first we'll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is the well-known `BERT` model (Bidirectional Encoder Representations from Transformers). This model has a maximum input length of 512 tokens. Note that for this demo, we will use the uncased version, which treats `A` and `a` as the same character.

***Note:*** *replace "base" with "large" for a slower but more accurate model, once you've finished exploring*

In [4]:
model_nm = 'bert-base-uncased'
max_length = 512

`AutoTokenizer` will create a tokenizer appropriate for a given model:

In [5]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm, model_max_length=max_length)

Here's an example of how the tokenizer splits a text into "tokens" (which are like words, but can be sub-word pieces, as you see below):

In [16]:
tokz.tokenize("G'day folks, welcome to COS30018 Intelligent system")

['g',
 "'",
 'day',
 'folks',
 ',',
 'welcome',
 'to',
 'co',
 '##s',
 '##30',
 '##01',
 '##8',
 'intelligent',
 'system']

Uncommon words will be split into pieces. Pieces that continue a word, rather than start one, are prefixed with `##`.

In [17]:
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")

['a',
 'pl',
 '##at',
 '##yp',
 '##us',
 'is',
 'an',
 'or',
 '##ni',
 '##thor',
 '##hy',
 '##nch',
 '##us',
 'ana',
 '##tin',
 '##us',
 '.']
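To get a feel for where the `##` pieces come from, here is a simplified sketch of greedy longest-match-first sub-word splitting, the general idea behind WordPiece tokenizers such as BERT's. It is not the library's implementation: `wordpiece` is a hypothetical helper, and `toy_vocab` is a tiny made-up vocabulary chosen so that "platypus" splits the same way as in the output above.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while len(word) > start:
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical mini-vocabulary; the real one has about 30,000 entries.
toy_vocab = {"a", "is", "pl", "##at", "##yp", "##us"}

print(wordpiece("platypus", toy_vocab))  # ['pl', '##at', '##yp', '##us']
print(wordpiece("zebra", toy_vocab))     # ['[UNK]']
```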

Here's a simple function which tokenizes our `Text` input. Since the model accepts at most 512 tokens, longer emails need to be truncated.

In [7]:
def tok_func(x): return tokz(x["Text"], truncation=True, max_length=max_length)

To run this quickly on every row in our dataset, use `map` with `batched=True`, which passes the rows to the tokenizer in batches:

In [8]:
tok_ds = ds.map(tok_func, batched=True)

This adds a new item to our dataset called `input_ids`, along with 2 additional items: `token_type_ids` and `attention_mask`. For instance, here are the input text and IDs for the second row of our data:

In [9]:
row = tok_ds["train"][1]
row['Text'], row['input_ids']

('Subject: vince and stinson ,  i got this resume from my friend ming sit who has a ph . d . from stanford .  please take a look at his resume to see if we can use him . i classify him as  a structurer , but things may change after all these years .  zimin  - - - - - - - - - - - - - - - - - - - - - - forwarded by zimin lu / hou / ect on 05 / 17 / 2000 04 : 08 pm  - - - - - - - - - - - - - - - - - - - - - - - - - - -  " sit , ming " on 05 / 17 / 2000 02 : 41 : 50 pm  to : " zimin lu ( e - mail ) "  cc :  subject :  - resume . doc',
 [101,
  3395,
  1024,
  12159,
  1998,
  2358,
  7076,
  2239,
  1010,
  1045,
  2288,
  2023,
  13746,
  2013,
  2026,
  2767,
  11861,
  4133,
  2040,
  2038,
  1037,
  6887,
  1012,
  1040,
  1012,
  2013,
  8422,
  1012,
  3531,
  2202,
  1037,
  2298,
  2012,
  2010,
  13746,
  2000,
  2156,
  2065,
  2057,
  2064,
  2224,
  2032,
  1012,
  1045,
  26268,
  2032,
  2004,
  1037,
  3252,
  2099,
  1010,
  2021,
  2477,
  2089,
  2689,
  2044,
  2035,
  2

The token IDs come from a dictionary called `vocab` in the tokenizer, which maps every possible token string to a unique integer. We can look them up like this, for instance to find the ID for the token "this":

In [21]:
tokz.vocab['this']

2023

Looking above at our input IDs, we do indeed see that `2023` appears as expected.
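Numericalization is then just a dictionary lookup per token. Here is a minimal sketch with a tiny hand-copied vocabulary. The token/ID pairs are read off the example row above, and `101`, the first ID in that row, is assumed to be the special `[CLS]` token that the tokenizer prepends. `numericalize` is a hypothetical helper.

```python
# Token -> ID pairs read off the example row above (a tiny subset of the real vocabulary).
toy_vocab = {"[CLS]": 101, "subject": 3395, ":": 1024, "i": 1045, "got": 2288, "this": 2023}

# Reverse mapping, for turning IDs back into tokens.
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def numericalize(tokens):
    """Map each token string to its integer ID (hypothetical helper)."""
    return [toy_vocab[tok] for tok in tokens]

ids = numericalize(["[CLS]", "subject", ":"])
print(ids)                            # [101, 3395, 1024]
print([id_to_token[i] for i in ids])  # ['[CLS]', 'subject', ':']
```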

Finally, we need to prepare our labels. Transformers assumes that your label column is named `labels`, but in our dataset it's currently `Spam`, so we need to rename it. Then we remove columns the model does not need, e.g. `Text`.

In [22]:
tok_ds = tok_ds.rename_columns({'Spam':'labels'})
tok_ds = tok_ds.remove_columns(['Text'])

In [23]:
tok_ds

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4556
    })
    val: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 569
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 570
    })
})

## Finetuning

To train a model in Transformers we'll use the `Trainer` class which has already implemented the training loop and other things for us.

In [24]:
from transformers import TrainingArguments,Trainer

We pick a batch size that fits our GPU, and a small number of epochs so we can run experiments quickly. We also define the learning rate for the model.

In [25]:
bs = 16
epochs = 3
lr = 8e-5

Transformers uses the `TrainingArguments` class to set up arguments. Don't worry too much about the values we're using here; they should generally work fine in most cases. It's just the 3 parameters above that you may need to change for different models.

In [26]:
args = TrainingArguments(
    'email-spam-classification',
    learning_rate=lr,
    warmup_ratio=0.1,
    lr_scheduler_type='linear',
    fp16=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    num_train_epochs=epochs,
    logging_steps=100,
    eval_steps=100,
    weight_decay=0.01,
    report_to='none',
    push_to_hub=True,
)
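To make `warmup_ratio=0.1` and `lr_scheduler_type='linear'` more concrete, here is a sketch of the learning rate over training. It assumes the usual shape of a linear schedule with warmup: the rate rises linearly from 0 to the peak during warmup, then decays linearly to 0 by the last step. `linear_schedule_lr` is a hypothetical helper, not a Transformers function. The step counts are derived from the 4556 training rows, batch size 16 and 3 epochs on a single device, and rounding the warmup length up is an assumption.

```python
import math

def linear_schedule_lr(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at a given optimizer step (sketch of linear warmup + linear decay)."""
    if step >= warmup_steps:
        remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
        return peak_lr * max(0.0, remaining)
    return peak_lr * step / max(1, warmup_steps)

steps_per_epoch = math.ceil(4556 / 16)        # 285 batches per epoch
total_steps = steps_per_epoch * 3             # 855 steps over 3 epochs
warmup_steps = math.ceil(0.1 * total_steps)   # 86 warmup steps (assumed rounding)

for step in (0, 43, warmup_steps, total_steps):
    print(step, linear_schedule_lr(step, total_steps, warmup_steps, 8e-5))
```

The rate is 0 at step 0, half of the peak at step 43, exactly `8e-5` at the end of warmup, and back to 0 at step 855.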

The pretrained model can be loaded using `AutoModelForSequenceClassification`:

In [27]:
id2label = {0: "Ham", 1: "Spam"}
label2id = {"Ham": 0, "Spam": 1}

In [28]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_nm,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
    return_dict=True,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let's check the number of trainable parameters for this model.

In [29]:
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [30]:
print_trainable_parameters(model)

trainable params: 109483778 || all params: 109483778 || trainable%: 100.0


109,483,778 parameters is a pretty large number, and it could take a very long time to finetune our model. Instead of full finetuning, we will apply the IA3 method to significantly reduce the number of trainable parameters.

First, we can view the overall architecture of the model by just printing it.

In [31]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

<img src="https://docs.adapterhub.ml/_images/ia3.png" width="300" height="500" />
<div>
    IA3 architecture from <a href="https://docs.adapterhub.ml/methods.html">adapterhub.ml</a>
</div>

Based on the `BERT` and `IA3` architectures, we will inject trainable vectors into 3 components: `query`, `value` and `output.dense`. The IA3 method is fully supported by the `peft` library. We can define the config and load the IA3 model using `PeftModelForSequenceClassification`.

In [32]:
from peft import PeftModelForSequenceClassification, get_peft_config

config = {
    "peft_type": "IA3",
    "task_type": "SEQ_CLS",
    "inference_mode": False,
    "target_modules": ["query", "value", "output.dense"],
    "feedforward_modules": ["output.dense"],
    "modules_to_save": ["classifier"]
}

peft_config = get_peft_config(config)
model = PeftModelForSequenceClassification(model, peft_config)

In [33]:
model.print_trainable_parameters()

trainable params: 67,588 || all params: 109,549,828 || trainable%: 0.061696126076984804


The number of trainable parameters has been reduced to only 67,588, which is about **0.06%** of the pretrained model!
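The added parameters can be roughly accounted for by hand. The sketch below rests on several assumptions that the printed output does not confirm:

- each targeted `Linear` gets one IA3 vector;
- `query` and `value` vectors have the output size (768, matching the `768x1` shape shown for `query`);
- feedforward vectors have the layer's input size;
- `output.dense` matches both the attention output layer and the feedforward output layer in every block;
- the feedforward width is 3072, the standard BERT-base value.

```python
hidden = 768          # from the printed model: Linear(in_features=768, out_features=768)
intermediate = 3072   # assumed BERT-base feedforward width (not visible in the truncated printout)
n_layers = 12         # from the printed model: (0-11): 12 x BertLayer
num_labels = 2

# One vector per targeted Linear in each layer (assumed sizes):
# query, value, attention output.dense (768 each) and feedforward output.dense (3072).
ia3_per_layer = hidden + hidden + hidden + intermediate
ia3_params = n_layers * ia3_per_layer

# Classifier head: weight (768 x 2) plus bias (2).
classifier_params = hidden * num_labels + num_labels

print(ia3_params, classifier_params)    # 64512 1538
print(ia3_params + classifier_params)   # 66050
print(109_549_828 - 109_483_778)        # 66050, the growth in "all params" reported above
```

The reported trainable count, 67,588, is 1,538 higher than 66,050. That would be consistent with the classifier head being counted twice (the original plus the copy kept by `modules_to_save`), but this is an inference from the numbers, not something the output states.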

In [34]:
model

PeftModelForSequenceClassification(
  (base_model): IA3Model(
    (model): BertForSequenceClassification(
      (bert): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0-11): 12 x BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(
                    in_features=768, out_features=768, bias=True
                    (ia3_l): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 768x1])
                  )
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value):

We can inspect the PEFT model to verify that the trainable vectors have been injected into the correct components (look for `ia3_l`).

Before actually training the model, we need to define 2 additional things: a data collator and a `compute_metrics` function.

The data collator will group multiple inputs into batches. Then it will add a special token called the "pad token" to make all items in a batch equal in length. That process effectively forms matrices. By doing that, we can take advantage of the GPU's parallel processing power, since it is designed to efficiently perform operations on matrices.

In [35]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokz)
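Conceptually, padding a batch looks like the sketch below. It is a plain-Python illustration rather than the collator's actual code. `pad_batch` is a hypothetical helper, the pad ID `0` is taken from `padding_idx=0` in the printed embedding layer, and `102` is assumed to be the ID of the closing `[SEP]` token.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every ID list to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (102 is assumed to be [SEP]).
batch = pad_batch([[101, 2023, 102], [101, 3395, 1024, 2023, 102]])

print(batch["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 3395, 1024, 2023, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding only to the longest item in each batch, rather than to the full 512 tokens, keeps the matrices as small as possible.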

The `compute_metrics` function, as its name suggests, outputs the metric values for model evaluation. For this demo, we will use the accuracy metric. The accuracy of a model is the fraction of its predictions that are correct.

In [36]:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
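The same computation can be written out in plain Python to show what happens to the model's raw scores. This is a sketch with made-up logits. `argmax` and `accuracy_from_logits` are hypothetical helpers that mirror the `np.argmax` and accuracy steps above.

```python
def argmax(row):
    """Index of the largest score in one row of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the true labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up scores for 4 emails: column 0 is "Ham", column 1 is "Spam".
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.4], [-0.2, 0.9]]
toy_labels = [0, 1, 1, 1]

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```

The third email is misclassified because its "Ham" score is slightly higher than its "Spam" score, so 3 of the 4 predictions are correct.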

Now we can finally create the `Trainer` and train our model.

In [37]:
trainer = Trainer(
    model,
    args,
    train_dataset=tok_ds['train'],
    eval_dataset=tok_ds['val'],
    tokenizer=tokz,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

In [38]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Accuracy
100,0.5996,0.513706,0.759227
200,0.4682,0.499727,0.762742
300,0.4676,0.445968,0.789104
400,0.3963,0.429654,0.803163
500,0.4024,0.403716,0.822496
600,0.3955,0.3905,0.829525
700,0.3659,0.395892,0.827768
800,0.3605,0.392036,0.831283


TrainOutput(global_step=855, training_loss=0.42854927464535364, metrics={'train_runtime': 692.3448, 'train_samples_per_second': 19.742, 'train_steps_per_second': 1.235, 'total_flos': 3586829861952000.0, 'train_loss': 0.42854927464535364, 'epoch': 3.0})

The training results show that both training and validation loss decrease, while validation accuracy rises from about 76% to about 83%. This suggests the model is learning without obvious overfitting. Let's evaluate the final model on the validation set.

In [39]:
trainer.evaluate()

{'eval_loss': 0.38854390382766724,
 'eval_accuracy': 0.8330404217926186,
 'eval_runtime': 11.4759,
 'eval_samples_per_second': 49.582,
 'eval_steps_per_second': 3.137,
 'epoch': 3.0}

We achieve around 83% accuracy on the validation set. Not too bad given that we only trained for 3 epochs.

After the training process, the IA3 vectors can be merged back into the pretrained model for faster inference.

In [40]:
merged_model = trainer.model.merge_and_unload()
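Merging is possible because IA3 only rescales activations with a learned vector, and an elementwise rescaling of a linear layer's output can be folded into the rows of its weight matrix. Here is a minimal sketch for a bias-free layer with made-up numbers. It is a simplification: the real layers also have biases, and `matvec` is a hypothetical helper.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[1, 2, 3], [4, 5, 6]]   # toy frozen weights (2 outputs, 3 inputs)
x = [1, 0, -1]               # toy input
l = [2, 3]                   # toy learned IA3 vector, one scale per output

# Adapter form: run the frozen layer, then rescale each output.
adapter_out = [li * hi for li, hi in zip(l, matvec(W, x))]

# Merged form: scale each row of W once, then run a plain layer.
W_merged = [[li * w for w in row] for li, row in zip(l, W)]
merged_out = matvec(W_merged, x)

print(adapter_out, merged_out)  # [-4, -6] [-4, -6]
```

After merging, inference needs no extra multiplication per layer, which is why the merged model can be saved and used as an ordinary `BertForSequenceClassification`.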

**Optional:** Push our finetuned model to huggingface hub for future use

In [41]:
trainer.tokenizer.push_to_hub("legacy107/email-spam-classification-merged")

CommitInfo(commit_url='https://huggingface.co/legacy107/email-spam-classification-merged/commit/c91c506ad5e621e354a26f4b261a643dd3673b46', commit_message='Upload tokenizer', commit_description='', oid='c91c506ad5e621e354a26f4b261a643dd3673b46', pr_url=None, pr_revision=None, pr_num=None)

In [42]:
merged_model.push_to_hub("legacy107/email-spam-classification-merged")

CommitInfo(commit_url='https://huggingface.co/legacy107/email-spam-classification-merged/commit/3b18ac0ea02d7dd95b8b7856c00491dfdfd3b3bf', commit_message='Upload BertForSequenceClassification', commit_description='', oid='3b18ac0ea02d7dd95b8b7856c00491dfdfd3b3bf', pr_url=None, pr_revision=None, pr_num=None)

## Evaluate

**Optional**: Load model from huggingface hub

In [20]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
merged_model = AutoModelForSequenceClassification.from_pretrained("legacy107/email-spam-classification-merged")
tokz = AutoTokenizer.from_pretrained(model_nm, model_max_length=max_length)

To evaluate our model on the test set, we will use the `evaluate` library from Hugging Face. First, initialise an evaluator for text classification. Then pass the model, the test data, the metric and other required parameters to the evaluator.

In [43]:
from evaluate import evaluator
from datasets import load_dataset

task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=merged_model,
    data=ds["test"],
    input_column="Text",
    label_column="Spam",
    metric="accuracy",
    label_mapping=label2id,
    strategy="simple",
    tokenizer = tokz,
)
results

{'accuracy': 0.8631578947368421,
 'total_time_in_seconds': 9.84486404900008,
 'samples_per_second': 57.898209377293895,
 'latency_in_seconds': 0.017271691314035227}

An accuracy of 86.3% indicates that our finetuned model correctly classifies 86.3% of the emails in the test set.
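To put this number in context, it helps to convert it back into counts and compare it with a trivial baseline that labels every email as ham. The sketch below uses only figures reported earlier in this notebook: the 570 test rows, the test accuracy, and the spam fraction in the test set. Rounding to whole emails is an assumption about how those fractions were produced.

```python
n_test = 570
accuracy = 0.8631578947368421        # reported test accuracy
spam_fraction = 0.24035087719298245  # reported spam share of the test set

n_correct = round(accuracy * n_test)    # emails classified correctly
n_spam = round(spam_fraction * n_test)  # spam emails in the test set
baseline = (n_test - n_spam) / n_test   # accuracy of always predicting "Ham"

print(n_correct, n_test - n_correct)    # 492 78
print(round(baseline, 4))               # 0.7596
```

So the finetuned model gets 492 of 570 emails right, roughly 10 percentage points above what always answering "Ham" would achieve on this split.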

## Inference

Now we can use our model to run inference on actual data using the text classification pipeline from the `transformers` library:

In [24]:
from transformers import pipeline

row_id = 10
text = ds["test"][row_id]["Text"]
classifier = pipeline("text-classification", model=merged_model, tokenizer=tokz)
prediction = classifier(text)

print(f'Text: {ds["test"][row_id]["Text"]}')
print(f'Label: {ds["test"][row_id]["Spam"]}')
print(f'Prediction: {prediction[0]["label"]}')

Text: Subject: look 10 years younger - free sample ! ! ! ! ! ! ! ! esoy  this e - mail ad is being sent in full compliance with u . s . senate bill 1618 , title # 3 , section 301  to remove yourself send a blank e - mail to : removal 992002 @ yahoo . com  free sample ! free tape !  new cosmetic breakthru !  look 10 years younger in ( 6 ) weeks or less !  look good duo . . from the inside out . . . . .  > from the outside in !  introducing . . . . natures answer to faster  and more obvious results for :  * * wrinkles  * * cellulite  * * dark circles  * * brown spots . . .  * * lifts the skin  * * strenghtens the hair and nails  also helps to . . . . . . . .  * reduce cell damage from excessive sun exposure  * stimulate colllagen formation  * provide protection against skin disorder  * and is hopoallergenic  find out what ! where ! and how !  to order your free sample and tape send your  request to :  lookyoungnow 2000 @ yahoo . com  subject : subscribe to free sample :  your name : . . 