**Code for fine-tuning a pre-trained LLM (DistilBERT) and building a GUI with Gradio to test it against user input**

In [None]:
!pip install transformers


Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

**The above cell installs the transformers library to get started with the project.**

transformers is the main library we need for fine-tuning because it provides our pre-trained LLMs.
If you have already installed transformers, this cell shouldn't take much time; otherwise it could take around a minute.

**This next cell mounts Google Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**The next cell** reads the spam.csv file, which you must store locally. Copy its path and paste it into the "path =" line.

(I've just used the function given to us in the previous assignments.)

In [None]:
import pandas as pd
def getdata(path):

  # <START>
  df = pd.read_csv(path,encoding='ISO-8859-1')
  return df
  # <END>

# Insert the path to the file in the space below
# <START>
path = '/content/spam.csv'
# <END>

df = getdata(path)
df.head(5)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Right now the file has several empty extra columns, and the remaining columns could use clearer names.

The next few cells are for **preprocessing this csv file**.

(The renaming is of course not compulsory, but it makes the code that follows easier to understand.)

In [None]:
df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'],inplace=True)
df.head(5)

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
df.rename(columns={'v1':'label','v2':'message'},inplace=True)
df.head(5)

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
df.shape

(5572, 2)

Next we convert each column into a list: one list for the messages and another for the labels.

In [None]:
X=list(df['message'])

In [None]:
y=list(df['label'])

We then check the y list and decide what we must do to convert it into data which our neural network can train on.

In [None]:
y[:10]

['ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'spam']

We then use the 'get_dummies' function to convert every label into either 1 or 0.
Here 1 represents spam, while 0 represents ham (not spam).

In [None]:
# Cast to int so the labels are 0/1 even when pandas returns booleans
y = list(pd.get_dummies(y, drop_first=True)['spam'].astype(int))


In [None]:
y[9]

1
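
The same 0/1 encoding can also be written without pandas, which makes the mapping explicit. This is only a sketch: the names `LABEL_TO_ID` and `encode_labels` are illustrative and are not part of the original notebook.

```python
# Hypothetical helper: explicit ham/spam -> 0/1 mapping (not from the original notebook).
LABEL_TO_ID = {"ham": 0, "spam": 1}

def encode_labels(labels):
    """Convert a list of 'ham'/'spam' strings into a list of 0/1 integers."""
    return [LABEL_TO_ID[label] for label in labels]

# The first ten labels shown in the y[:10] output above
sample = ['ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'spam']
encoded = encode_labels(sample)
print(encoded)  # [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]
```

An explicit dictionary raises a KeyError on an unexpected label, which can help catch a malformed row early.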

**Splitting training and testing data using sklearn**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
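
As a quick sanity check, the sizes of the two splits can be worked out by hand. The sketch below assumes the test count is rounded up and the remainder goes to training, which agrees with the test set size of 1115 reported later in this notebook.

```python
import math

n_samples = 5572          # rows reported by df.shape above
test_size = 0.20          # same fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 4457 1115
```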




Check how your input text looks

In [None]:
X_train[:10]

['No no:)this is kallis home ground.amla home town is durban:)',
 'I am in escape theatre now. . Going to watch KAVALAN in a few minutes',
 'We walked from my moms. Right on stagwood pass right on winterstone left on victors hill. Address is &lt;#&gt;',
 'I dunno they close oredi not... ÌÏ v ma fan...',
 'Yo im right by yo work',
 '\\Its Ur luck to Love someone. Its Ur fortune to Love the one who Loves U. But',
 'He also knows about lunch menu only da. . I know',
 'Oh yeah! And my diet just flew out the window',
 "Nah it's straight, if you can just bring bud or drinks or something that's actually a little more useful than straight cash",
 'SplashMobile: Choose from 1000s of gr8 tones each wk! This is a subscrition service with weekly tones costing 300p. U have one credit - kick back and ENJOY']

**Importing our pre-trained model**

In this project we will be using DistilBERT and its tokenizer.

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


We use the pre-trained model's tokenizer to encode our training and testing data.

In [None]:
train_encodings = tokenizer(X_train, truncation=True, padding=True)
test_encodings = tokenizer(X_test, truncation=True, padding=True)
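
To make the two flags concrete: truncation=True cuts any message longer than the model's maximum length, and padding=True pads every message up to the longest one in the call, with an attention mask marking which positions are real tokens. The sketch below imitates this in plain Python. The function `pad_batch`, the token ids, and the pad id of 0 are assumptions for illustration; the real tokenizer also handles special tokens when truncating, which this simplification ignores.

```python
def pad_batch(sequences, max_length=512, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids standing in for two short messages
batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because every sequence in one call is padded to the same length, each element of the resulting dataset has a fixed shape, as the element spec printed for test_dataset further down shows (181 in that run).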


In [None]:
y_train[:10]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

We must now convert our encoded train and test data into a form that TensorFlow training code can consume.

We do this using the tf.data.Dataset.from_tensor_slices function, which pairs each encoded message with its label.

In [None]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))

We now set up the fine-tuning procedure, in which we fine-tune the weights of the DistilBERT LLM.

We set the required hyperparameters and also choose our output and logs directories.



In [None]:
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=2,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    eval_steps=1000
)
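
To get a feel for what warmup_steps=500 means relative to the length of this run, the sketch below computes an approximate step count and a simple warmup-then-decay learning rate. Several things here are assumptions rather than facts from the notebook: the base learning rate of 5e-5, the linear decay to zero, the training set size of 4457 (80% of 5572 rows), and the function name `lr_at_step`. The schedule the trainer actually builds may differ in detail.

```python
import math

def lr_at_step(step, base_lr=5e-5, warmup_steps=500, total_steps=1116):
    """Linear warmup to base_lr, then linear decay to zero (illustrative only)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)

steps_per_epoch = math.ceil(4457 / 8)   # assumed training examples / per-device batch size
total_steps = 2 * steps_per_epoch       # num_train_epochs = 2

for step in (0, 250, 500, 808, total_steps):
    print(step, lr_at_step(step, total_steps=total_steps))
```

Under these assumptions the warmup covers a little under half of the run, so the learning rate is below its peak for a large share of training.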

We load the model, pass it to the trainer together with the hyperparameters set above, and run the training with trainer.train().

In [None]:
with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset             # evaluation dataset
)

trainer.train()




Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Check the loss on the test data.

It should be close to zero for the number of epochs specified above.

(In my run the model made 0 errors on a test set of 1115 messages.)

In [None]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.015707785742623465}

In [None]:
trainer.predict(test_dataset)

PredictionOutput(predictions=array([[ 3.2332547, -3.5660896],
       [ 2.5724397, -2.8574035],
       [ 1.9241046, -2.1230915],
       ...,
       [ 3.0308146, -3.392411 ],
       [ 3.038345 , -3.3618686],
       [ 1.8400801, -2.009504 ]], dtype=float32), label_ids=array([0, 0, 0, ..., 0, 0, 0], dtype=int32), metrics={'eval_loss': 0.015700714928763255})
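
The predictions array holds raw scores (logits), one pair per message: the first value is the score for ham (0) and the second for spam (1). A minimal sketch of turning one such pair into probabilities and a predicted class is shown below; the `softmax` helper is written out by hand for illustration and is not part of the original notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.2332547, -3.5660896]   # first row of the predictions shown above
probs = softmax(logits)
predicted_class = probs.index(max(probs))   # 0 = ham, 1 = spam
print(probs, predicted_class)
```

Taking the index of the larger score gives the same class as taking the larger probability, so the softmax step is only needed when a confidence value is wanted.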

In [None]:
test_dataset

<_TensorSliceDataset element_spec=({'input_ids': TensorSpec(shape=(181,), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(181,), dtype=tf.int32, name=None)}, TensorSpec(shape=(), dtype=tf.int32, name=None))>

In [None]:
trainer.predict(test_dataset)[1].shape

(1115,)

In [None]:
trainer.predict(test_dataset)[1]

array([0, 0, 0, ..., 0, 0, 0], dtype=int32)
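
Note that index 1 of the prediction output is label_ids, the true labels, not the model's guesses. To count errors, the class with the highest score in each row of predictions has to be compared with these labels. The sketch below shows the idea on toy values; the helper name `accuracy` and the toy data are illustrative assumptions.

```python
def accuracy(logit_rows, true_labels):
    """Fraction of rows whose highest-scoring class matches the true label."""
    predicted = [row.index(max(row)) for row in logit_rows]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)

# Toy values for illustration: the first three rows echo the rounded output above,
# the last row and its label are made up.
toy_logits = [[3.23, -3.57], [2.57, -2.86], [1.92, -2.12], [-2.10, 2.40]]
toy_labels = [0, 0, 0, 1]
print(accuracy(toy_logits, toy_labels))  # 1.0
```

With the real output, the equivalent comparison would be between predictions.argmax(axis=1) and label_ids of the object returned by trainer.predict.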

Save the model, including its configuration and weights.

In [None]:
trainer.save_model('spam_detector')

Define a classify function that takes a text input (presumably an SMS) and, based on what the model learned in training, returns whether the text is spam or not.

In [None]:
def classify(text):
  """Return a message saying whether the given SMS text is spam."""
  custom_input = [text]
  custom_encodings = tokenizer(custom_input, truncation=True, padding=True)

  # Wrap the encoded input in a dataset, as trainer.predict expects
  custom_dataset = tf.data.Dataset.from_tensor_slices((
      dict(custom_encodings),
      [1]  # Placeholder label; it does not affect the predicted scores
  ))

  # Predict on the custom input
  predictions = trainer.predict(custom_dataset)

  # Pick the class with the highest score (0 = ham, 1 = spam)
  predicted_labels = predictions.predictions.argmax(axis=1)

  # Return the verdict as text
  if predicted_labels[0] == 0:
    return "the message is not spam"
  else:
    return "the message is spam"


Test the classify function on any text input you want to give!

In [None]:
classify("hello")

'the message is not spam'

In [None]:
!pip install gradio  # installing gradio (required for our GUI)

Collecting gradio
  Downloading gradio-3.40.1-py3-none-any.whl (20.0 MB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.101.0-py3-none-any.whl (65 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.1.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client>=0.4.0 (from gradio)
  Downloading gradio_client-0.4.0-py3-none-any.whl (297 kB)
Collecting httpx (from gradio)
  Downloading httpx-0.24.1-py3-none-any.whl (75 kB)

**Launching our GUI**

Finally, we wrap the model in a simple GUI using Gradio, with an input box and an output box, so that any user input can be checked for spam.

We use our classify function to report whether the given input text is spam or not.

With this, we are done fine-tuning DistilBERT to recognize spam messages.

In [1]:
import gradio as gr

# Examples are a nested array, with each inner array containing all the values
# corresponding to each input field for the example. In our case, since we have
# only one input field, we may just use an array of strings instead


title = "Spam Detector Model-DistilBert"

# The function that takes the text input and generates a text output
def process_input(text):
  return classify(text)

model_gui = gr.Interface(
  process_input,
  gr.Textbox(lines=3,label="Input"),
  gr.Textbox(lines=3, label="Spam or Not Spam"),
  title=title,

)
model_gui.launch()

ModuleNotFoundError: ignored