---

# Ioannou_Georgios


## Copyright © 2023 by Georgios Ioannou


---

<h1 align="center"> NLP Fine-Tuning With Hugging Face </h1>

<h2 align="center"> spam_dataset.csv </h2>


---

- Fine-tuning a natural language processing (NLP) model means continuing to train a pre-trained model on a smaller, task-specific dataset so that its weights adapt to the new task. Doing this well involves choosing hyperparameters such as the learning rate, batch size, and number of epochs, and often cleaning or restructuring the dataset. It can be time-consuming and requires a good understanding of both the model and the task at hand.

- Fine-tuning improves performance on a specific task by adapting the model to the characteristics of that task and its dataset. In this notebook, a pre-trained DistilBERT model is fine-tuned to classify emails as spam or not-spam.


---

# INSTALLATIONS


In [1]:
# ! pip install transformers
# ! pip install beautifulsoup4
# ! pip install lxml
# ! pip install evaluate

---

# LIBRARIES


In [2]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
import evaluate
from sklearn.metrics import classification_report



---

# PRE-TRAINED MODEL


In [3]:
# FINE TUNING THIS PRETRAINED MODEL.

model_name = "distilbert-base-uncased"

---

# DATASET


In [4]:
df = pd.read_csv("spam_dataset.csv")

In [5]:
df.head()

Unnamed: 0,email,category
0,"URL: http://www.newsisfree.com/click/-1,817167...",not-spam
1,"On Thu, 19 Sep 2002, Bill Stoddard wrote:\n\n-...",not-spam
2,Dan Kohn <dan@dankohn.com> writes:\n\n\n\n> Gu...,not-spam
3,wintermute wrote:\n\n>>Anyone know where in Ir...,not-spam
4,"I attended the same conference, and was impres...",not-spam


In [6]:
df.shape

(3796, 2)

In [7]:
df["category"].value_counts()

category
not-spam    1900
spam        1896
Name: count, dtype: int64

In [8]:
df.groupby("category").count()

Unnamed: 0_level_0,email
category,Unnamed: 1_level_1
not-spam,1900
spam,1896


---

# TAKING AN EXTREMELY SMALL SUBSET FOR THE LECTURE


In [9]:
np.random.seed(42)


shuffled_indices = np.random.permutation(df.index)
df = df.loc[shuffled_indices].reset_index(drop=True)
df = df[:100]

In [10]:
df.shape

(100, 2)

In [11]:
df.head()

Unnamed: 0,email,category
0,"URL: http://www.newsisfree.com/click/-2,841368...",not-spam
1,"On January 1st 2002, the European countries be...",spam
2,\n\nI think what you're looking at with the du...,not-spam
3,IMPORTANT NOTICE: Regarding your domain name\...,spam
4,"<html>\n\nHello, <br><br>\n\n<div align=""cente...",spam


In [12]:
df["category"].value_counts()

category
not-spam    51
spam        49
Name: count, dtype: int64

In [13]:
df.groupby("category").count()

Unnamed: 0_level_0,email
category,Unnamed: 1_level_1
not-spam,51
spam,49


---

# CLEAN DATASET


In [14]:
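class Cleaner:
    """Strip HTML markup from raw email text."""

    def put_line_breaks(self, text):
        # Add a newline after each closing </p> so paragraphs stay
        # separated once the tags are removed.
        text = text.replace("</p>", "</p>\n")
        return text

    def remove_html_tags(self, text):
        # Parse with lxml and keep only the text nodes.
        cleantext = BeautifulSoup(text, "lxml").text
        return cleantext

    def clean(self, text):
        # Run both steps in order: line breaks first, then tag removal.
        text = self.put_line_breaks(text)
        text = self.remove_html_tags(text)
        return text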

In [15]:
cleaner = Cleaner()
df["text_cleaned"] = df["email"].apply(cleaner.clean)



In [16]:
df.head()

Unnamed: 0,email,category,text_cleaned
0,"URL: http://www.newsisfree.com/click/-2,841368...",not-spam,"URL: http://www.newsisfree.com/click/-2,841368..."
1,"On January 1st 2002, the European countries be...",spam,"On January 1st 2002, the European countries be..."
2,\n\nI think what you're looking at with the du...,not-spam,I think what you're looking at with the dual a...
3,IMPORTANT NOTICE: Regarding your domain name\...,spam,IMPORTANT NOTICE: Regarding your domain name\...
4,"<html>\n\nHello, <br><br>\n\n<div align=""cente...",spam,"\n\nHello, \nPremium Phone Qualified \n\nBusin..."
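To make the two cleaning steps concrete, here is a minimal, dependency-free sketch that approximates what `Cleaner.clean` does. It uses the standard-library `html.parser` instead of BeautifulSoup with `lxml`, so this is a swapped-in technique, and `TextExtractor`, `strip_tags`, and `clean_sketch` are hypothetical names used only for illustration. Its output may differ from BeautifulSoup's on malformed HTML.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML string (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def clean_sketch(text):
    # Same two steps as Cleaner.clean: add line breaks, then drop the tags.
    text = text.replace("</p>", "</p>\n")
    return strip_tags(text)


sample = "<p>Hello</p><p>World</p>"
print(repr(strip_tags(sample)))    # 'HelloWorld'
print(repr(clean_sketch(sample)))  # 'Hello\nWorld\n'
```

Without the line-break step the two paragraphs run together, which is why `put_line_breaks` is applied before the tags are removed.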


---

# Label Encoder


In [17]:
le = preprocessing.LabelEncoder()
le.fit(df["category"].tolist())
df["label"] = le.transform(df["category"].tolist())

In [18]:
df.head()

Unnamed: 0,email,category,text_cleaned,label
0,"URL: http://www.newsisfree.com/click/-2,841368...",not-spam,"URL: http://www.newsisfree.com/click/-2,841368...",0
1,"On January 1st 2002, the European countries be...",spam,"On January 1st 2002, the European countries be...",1
2,\n\nI think what you're looking at with the du...,not-spam,I think what you're looking at with the dual a...,0
3,IMPORTANT NOTICE: Regarding your domain name\...,spam,IMPORTANT NOTICE: Regarding your domain name\...,1
4,"<html>\n\nHello, <br><br>\n\n<div align=""cente...",spam,"\n\nHello, \nPremium Phone Qualified \n\nBusin...",1
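The mapping above (`not-spam` to 0, `spam` to 1) follows from how `LabelEncoder` works: it sorts the distinct class names and numbers them from 0. The sketch below reproduces that behaviour in plain Python on a made-up list of categories; `class_to_id` is a hypothetical name, not part of scikit-learn.

```python
categories = ["not-spam", "spam", "spam", "not-spam"]  # made-up sample

# Sort the distinct classes and number them from 0, as LabelEncoder does.
classes = sorted(set(categories))
class_to_id = {name: idx for idx, name in enumerate(classes)}
labels = [class_to_id[c] for c in categories]

print(class_to_id)  # {'not-spam': 0, 'spam': 1}
print(labels)       # [0, 1, 1, 0]
```

With the fitted encoder from the cell above, `le.inverse_transform` maps the integer ids back to the original category names.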


---

# train_test_split


In [19]:
df_train, df_test = train_test_split(df, test_size=0.2)

In [20]:
df_train

Unnamed: 0,email,category,text_cleaned,label
85,This is a multipart message in MIME format\n\n...,not-spam,This is a multipart message in MIME format\n\n...,0
49,"Yes, I know, dreadful subject. However, becaus...",not-spam,"Yes, I know, dreadful subject. However, becaus...",0
9,"\n\nIn a message dated 9/19/2002 7:46:37 AM, c...",not-spam,"In a message dated 9/19/2002 7:46:37 AM, chuck...",0
90,<HTML>\n\n<HEAD>\n\n</HEAD>\n\n<BODY>\n\n<FONT...,spam,"\n\n\n\n A man endowed with a 7-8"" hammer is s...",1
67,"On Mon, 22 Jul 2002, Adam Rifkin wrote:\n\n\n\...",not-spam,"On Mon, 22 Jul 2002, Adam Rifkin wrote:\n\n\n\...",0
...,...,...,...,...
28,\n\nNot true on the choice part.\n\n\n\nAfter ...,not-spam,Not true on the choice part.\n\n\n\nAfter thre...,0
23,Do you want to make money from home? Are you ...,spam,Do you want to make money from home? Are you ...,1
53,Financial Services Company will pay a minimum ...,spam,Financial Services Company will pay a minimum ...,1
11,"First, thanks for all the rpms, and especially...",not-spam,"First, thanks for all the rpms, and especially...",0


In [21]:
df_test

Unnamed: 0,email,category,text_cleaned,label
1,"On January 1st 2002, the European countries be...",spam,"On January 1st 2002, the European countries be...",1
34,This is a multi-part message in MIME format.\n...,spam,This is a multi-part message in MIME format.\n...,1
96,--==_Exmh_1547759024P\n\nContent-Type: text/pl...,not-spam,--==_Exmh_1547759024P\n\nContent-Type: text/pl...,0
48,<!-- saved from url=3D(0022)http://internet.e-...,spam,Bright Teeth now!\n\n\n\n\n\n\n\n\n \n\n...,1
45,"<html><body bgColor=""#CCCCCC"" topmargin=1 onMo...",spam,"\nHello, jm@netnoteinc.com\nHuman \n\nGrowth \...",1
82,URL: http://www.askbjoernhansen.com/archives/2...,not-spam,URL: http://www.askbjoernhansen.com/archives/2...,0
66,------=_NextPart_000_00E4_17A73C2D.E7104E07\n\...,spam,------=_NextPart_000_00E4_17A73C2D.E7104E07\n\...,1
91,<html>\n\n\n\n\n\n<HEAD> \n\n<META charset=3DU...,spam,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,1
29,<HTML><HEAD><TITLE>FREE Motorola Cell Phone wi...,spam,FREE Motorola Cell Phone with $50 Cash Back!\n...,1
13,"<HR>\n\n<html>\n\n<div bgcolor=3D""#FFFFCC"">\n\...",spam,"\n\n\nTremendous Savings\n\non Toners, \n\n\nI...",1
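A plain random split does not guarantee that both classes are represented in the same proportion in the two parts; in this run the test set holds 7 not-spam and 13 spam emails (see the support column of the test report further down). scikit-learn's `train_test_split` accepts a `stratify` argument for this, for example `stratify=df["label"]`. The sketch below shows the idea in plain Python; `stratified_split` is a hypothetical helper, not the scikit-learn implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Same class counts as the 100-row subset: 51 not-spam (0) and 49 spam (1).
labels = [0] * 51 + [1] * 49
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)

print(len(train_idx), len(test_idx))     # 80 20
print(sum(labels[i] for i in test_idx))  # 10 (spam rows in the test set)
```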


---

# Convert to Hugging Face Dataset


In [22]:
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)

---

# Tokenizer


In [23]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [24]:
def preprocess_function(examples):
    return tokenizer(examples["text_cleaned"], truncation=True)

In [25]:
tokenized_train = train_dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 80/80 [00:00<00:00, 2145.01 examples/s]


In [26]:
tokenized_test = test_dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 20/20 [00:00<00:00, 1415.89 examples/s]
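`truncation=True` cuts every tokenized email down to the model's maximum input length, which is 512 tokens for `distilbert-base-uncased`, so long emails lose their tail. The toy function below illustrates the effect on a list of token ids; it is a simplified sketch and not the tokenizer's actual implementation. `truncate_with_special_tokens` is a hypothetical name and the content ids are made up.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in the BERT uncased vocabulary


def truncate_with_special_tokens(content_ids, max_length):
    # Reserve two positions for [CLS] and [SEP], then cut the content.
    kept = content_ids[: max_length - 2]
    return [CLS_ID] + kept + [SEP_ID]


long_email = list(range(1000, 1600))  # 600 made-up token ids
encoded = truncate_with_special_tokens(long_email, max_length=512)

print(len(encoded))             # 512
print(encoded[0], encoded[-1])  # 101 102
```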


---

# Initialize Model


In [27]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


---

# Train Model


In [28]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
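`DataCollatorWithPadding` pads each batch only up to the length of its longest sequence, instead of padding the whole dataset to one fixed length. The sketch below shows that idea on two short sequences. `pad_batch` is a hypothetical helper, the token ids are illustrative, and the padding id of 0 is an assumption matching the BERT uncased vocabulary.

```python
PAD_ID = 0  # assumed [PAD] id, as in the BERT uncased vocabulary


def pad_batch(batch, pad_id=PAD_ID):
    # Pad every sequence to the longest one in this batch.
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # The mask is 1 for real tokens and 0 for padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]
padded = pad_batch(batch)

print(padded["input_ids"][0])       # [101, 7592, 102, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 0, 0]
```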

In [29]:
metric = evaluate.load("accuracy")

In [30]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
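`compute_metrics` turns the model's logits into class predictions with an argmax and then measures the fraction that match the labels. A plain-Python sketch of the same computation, on made-up logits, looks like this; `argmax` and `accuracy` are hypothetical helpers standing in for `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(row):
    # Index of the largest value in one row of logits.
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]  # made-up values
labels = [0, 1, 0, 0]

print(accuracy(logits, labels))  # {'accuracy': 0.75}
```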

In [31]:
# Hyperparameters.

num_train_epochs = 5
learning_rate = 2e-4
per_device_train_batch_size = 8
per_device_eval_batch_size = 8
weight_decay = 0.01

In [32]:
evaluation_strategy = "epoch"
logging_strategy = "epoch"

training_args = TrainingArguments(
    output_dir="./results_tweets",
    learning_rate=learning_rate,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=weight_decay,
    evaluation_strategy=evaluation_strategy,
    logging_strategy=logging_strategy,
)

In [33]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [34]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

{'loss': 0.529, 'learning_rate': 0.00016, 'epoch': 1.0}
{'eval_loss': 0.24027982354164124, 'eval_accuracy': 0.85, 'eval_runtime': 4.144, 'eval_samples_per_second': 4.826, 'eval_steps_per_second': 0.724, 'epoch': 1.0}
{'loss': 0.0366, 'learning_rate': 0.00012, 'epoch': 2.0}
{'eval_loss': 0.400242418050766, 'eval_accuracy': 0.9, 'eval_runtime': 3.2234, 'eval_samples_per_second': 6.205, 'eval_steps_per_second': 0.931, 'epoch': 2.0}
{'loss': 0.0498, 'learning_rate': 8e-05, 'epoch': 3.0}
{'eval_loss': 0.26463380455970764, 'eval_accuracy': 0.95, 'eval_runtime': 3.3234, 'eval_samples_per_second': 6.018, 'eval_steps_per_second': 0.903, 'epoch': 3.0}
{'loss': 0.129, 'learning_rate': 4e-05, 'epoch': 4.0}
{'eval_loss': 0.3331300616264343, 'eval_accuracy': 0.95, 'eval_runtime': 3.1954, 'eval_samples_per_second': 6.259, 'eval_steps_per_second': 0.939, 'epoch': 4.0}
{'loss': 0.0008, 'learning_rate': 0.0, 'epoch': 5.0}
{'eval_loss': 0.2604975998401642, 'eval_accuracy': 0.95, 'eval_runtime': 3.0168, 'eval_samples_per_second': 6.63, 'eval_steps_per_second': 0.994, 'epoch': 5.0}
{'train_runtime': 224.9988, 'train_samples_per_second': 1.778, 'train_steps_per_second': 0.222, 'train_loss': 0.14903955729678273, 'epoch': 5.0}

TrainOutput(global_step=50, training_loss=0.14903955729678273, metrics={'train_runtime': 224.9988, 'train_samples_per_second': 1.778, 'train_steps_per_second': 0.222, 'train_loss': 0.14903955729678273, 'epoch': 5.0})
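The step count and the logged learning rates can be checked by hand. With 80 training examples and a batch size of 8 there are 10 steps per epoch, so 50 steps over 5 epochs. The logged learning rates are consistent with a linear decay from `2e-4` to 0, which is the default scheduler of `Trainer`. The sketch below reproduces them, assuming a single device, no gradient accumulation, and no warmup steps; `linear_lr` is a hypothetical helper.

```python
import math

train_examples = 80
batch_size = 8
epochs = 5
initial_lr = 2e-4

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step):
    # Linear decay from initial_lr to 0, assuming no warmup steps.
    return initial_lr * (1 - step / total_steps)


print(total_steps)  # 50
for epoch in range(1, epochs + 1):
    # Learning rate at the end of each epoch, as logged by the Trainer.
    print(epoch, f"{linear_lr(epoch * steps_per_epoch):.5f}")
# 1 0.00016
# 2 0.00012
# 3 0.00008
# 4 0.00004
# 5 0.00000
```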

In [35]:
trainer.save_model("spam_model")

---

# Evaluate Model


In [36]:
# Evaluating on the training data.

preds = trainer.predict(tokenized_train)
# trainer.predict returns a PredictionOutput; take the argmax of its logits.
preds = np.argmax(preds.predictions, axis=1)
GT = df_train["label"].tolist()
print(classification_report(GT, preds))

100%|██████████| 10/10 [00:11<00:00,  1.14s/it]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        44
           1       1.00      1.00      1.00        36

    accuracy                           1.00        80
   macro avg       1.00      1.00      1.00        80
weighted avg       1.00      1.00      1.00        80






In [37]:
# Evaluating on the testing data.

preds = trainer.predict(tokenized_test)
# trainer.predict returns a PredictionOutput; take the argmax of its logits.
preds = np.argmax(preds.predictions, axis=1)
GT = df_test["label"].tolist()
print(classification_report(GT, preds))

100%|██████████| 3/3 [00:01<00:00,  1.70it/s]

              precision    recall  f1-score   support

           0       1.00      0.86      0.92         7
           1       0.93      1.00      0.96        13

    accuracy                           0.95        20
   macro avg       0.96      0.93      0.94        20
weighted avg       0.95      0.95      0.95        20
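The numbers in the test report can be traced back to a confusion matrix. The counts below are not printed by the notebook; they are inferred from the report, and they are the only counts consistent with its support, precision, and recall columns: one not-spam email was classified as spam, and every spam email was caught. `per_class_scores` is a hypothetical helper that recomputes the report rows from those counts.

```python
# Rows are true classes, columns are predicted classes (0 = not-spam, 1 = spam).
# Counts inferred from the test report above, not printed by the notebook.
confusion = [[6, 1],
             [0, 13]]


def per_class_scores(confusion, cls):
    tp = confusion[cls][cls]
    fp = sum(confusion[r][cls] for r in range(len(confusion))) - tp
    fn = sum(confusion[cls]) - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for cls in (0, 1):
    p, r, f1 = per_class_scores(confusion, cls)
    print(cls, f"{p:.2f} {r:.2f} {f1:.2f}")
# 0 1.00 0.86 0.92
# 1 0.93 1.00 0.96

total = sum(map(sum, confusion))
accuracy = sum(confusion[i][i] for i in range(2)) / total
print(f"{accuracy:.2f}")  # 0.95
```

With only 20 test emails, a single misclassification moves accuracy by 5 percentage points, so these scores should be read as a rough indication only.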




