## Project: Fine-Tuning a Pre-trained Model

> The SMS spam [dataset](https://drive.google.com/file/d/1Q6VJqq72vjy1RuM9zXXA77_II9WTIrcB/view?usp=sharing) is used for fine-tuning.




In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Col

Mount Google Drive to access the dataset

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import required libraries

In [None]:
import pandas as pd
import tensorflow as tf
import transformers
from sklearn.model_selection import train_test_split


In [None]:
df = pd.read_csv('/content/SMSSpamCollection.txt', sep='\t', names=["labels", "messages"])

df.head(5)   # sample of data

|   | labels | messages |
|---|--------|----------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
df.shape     # size of data

(5572, 2)

In [None]:
X=list(df['messages'])   # Independent variable


In [None]:
y=list(df['labels'])   # Dependent variable

Encode the labels as binary values. With `drop_first=True` the `ham` column is dropped, so `spam` maps to 1 and `ham` to 0

In [None]:
y = list(pd.get_dummies(y, drop_first=True)['spam'].astype(int))  # spam -> 1, ham -> 0
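
Because `drop_first=True` drops the alphabetically first category (`ham`), the remaining `spam` column is 1 for spam and 0 for ham. A minimal equivalent sketch in plain Python makes that mapping explicit. The helper name `encode_labels` is hypothetical and not part of the notebook.

```python
def encode_labels(labels):
    """Map 'ham' to 0 and 'spam' to 1; any other label raises KeyError."""
    mapping = {"ham": 0, "spam": 1}
    return [mapping[label] for label in labels]


# Toy labels for illustration only (not taken from the dataset).
sample = ["ham", "ham", "spam", "ham"]
encoded = encode_labels(sample)
print(encoded)
```

Writing the mapping out this way also documents which integer the model's output index refers to when predictions are decoded later.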


Split the data into training and test sets

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 1)
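
Spam is the minority class in this dataset, so it can be worth checking that both splits contain a similar share of spam. The sketch below uses a hypothetical helper, `spam_fraction`, and assumes the labels are already encoded as 0/1. The toy lists stand in for `y_train` and `y_test` and are not real data.

```python
def spam_fraction(labels):
    """Return the share of labels equal to 1 (spam)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for label in labels if label == 1) / len(labels)


# Toy labels standing in for y_train and y_test.
toy_train = [0, 0, 0, 1, 0, 0, 1, 0]
toy_test = [0, 1, 0, 0]
print(spam_fraction(toy_train), spam_fraction(toy_test))
```

If the two shares differ noticeably, `train_test_split` accepts a `stratify=y` argument that preserves the class proportions in both splits.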

Using DistilBERT pre-trained model for sequence classification

In [None]:
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # converts messages into token IDs
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
batch_size = 2

# Tokenize the training and test messages (truncate long ones, pad short ones)
train_encodings = tokenizer(X_train, truncation=True, padding=True)
test_encodings = tokenizer(X_test, truncation=True, padding=True)



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 
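
With `padding=True` the tokenizer pads every sequence to the length of the longest one in the call and returns an `attention_mask` that marks real tokens with 1 and padding with 0. The toy sketch below reproduces that idea on made-up token IDs. `pad_batch`, the IDs, and the default `pad_id=0` are illustrative assumptions, not the tokenizer's actual implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)       # right-pad with pad_id
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths.
batch = pad_batch([[101, 7592, 102], [101, 2489, 4443, 2000, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The model uses the mask to ignore padded positions, which is why both `input_ids` and `attention_mask` are passed to it in the next step.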

Convert the encodings and labels into TensorFlow datasets

In [None]:
# Create batched tf.data datasets that Keras can consume
train_dataset = tf.data.Dataset.from_tensor_slices((
    {"input_ids": train_encodings["input_ids"], "attention_mask": train_encodings["attention_mask"]},
    y_train
)).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((
    {"input_ids": test_encodings["input_ids"], "attention_mask": test_encodings["attention_mask"]},
    y_test
)).batch(batch_size)

In [None]:
# Define optimizer, loss, and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy("accuracy")]
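
`SparseCategoricalCrossentropy(from_logits=True)` takes integer class labels and raw logits, applies a softmax internally, and returns the negative log-probability of the true class, averaged over the batch by default. A small worked sketch in plain Python, with made-up logits, shows the per-example arithmetic. The function names are illustrative and are not Keras APIs.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(logits, label):
    """Negative log-probability assigned to the true integer class."""
    return -math.log(softmax(logits)[label])


# Made-up logits that strongly favor class 0.
loss_confident = sparse_cross_entropy([2.0, -1.0], 0)  # true class is 0
loss_wrong = sparse_cross_entropy([2.0, -1.0], 1)      # true class is 1
print(loss_confident, loss_wrong)
```

The loss is small when the larger logit belongs to the true class and grows as the model becomes confidently wrong.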


In [None]:
# Compile the model
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

In [None]:
# Train the model using fit()
epochs = 2
model.fit(train_dataset, epochs=epochs)

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7ed38ec97b50>

In [None]:
# Evaluate the model using evaluate()
eval_results = model.evaluate(test_dataset)
print("Test loss:", eval_results[0])
print("Test accuracy:", eval_results[1])

Test loss: 0.04092314839363098
Test accuracy: 0.9919282793998718


In [None]:
# Make predictions using predict()
test_predictions = model.predict(test_dataset)
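
`test_predictions` holds the raw model outputs. For Hugging Face TensorFlow models the logits are typically available as `test_predictions.logits`, with one row of two scores per message; treat that attribute access as an assumption to verify in your environment. Because accuracy can look high on an imbalanced dataset, precision and recall for the spam class give a fuller picture. The sketch below works on plain lists with made-up values, and the helper names are hypothetical.

```python
def argmax_rows(logit_rows):
    """Return the index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logit_rows]


def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up logits and labels for four messages.
toy_logits = [[2.1, -1.3], [-0.4, 1.7], [0.9, 0.2], [-1.0, 2.2]]
toy_true = [0, 1, 1, 1]
toy_pred = argmax_rows(toy_logits)
precision, recall = precision_recall(toy_true, toy_pred)
print(toy_pred, precision, recall)
```

In the notebook, the same helpers could be applied to `test_predictions.logits.tolist()` and `y_test`, under the assumption stated above.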




Save the model and its supporting files so they can be used by an interactive interface

In [None]:
model.save('/content/drive/MyDrive/Colab Notebooks/Sentimental')
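
`model.save` writes a Keras SavedModel. If the goal is to reload the classifier later with `from_pretrained`, the `save_pretrained` methods on the model and the tokenizer write the weights, configuration, and vocabulary files into one directory. A minimal sketch follows; `save_for_inference` is a hypothetical helper, and the directory is whatever path you choose.

```python
def save_for_inference(model, tokenizer, save_dir):
    """Write model weights/config and tokenizer files into save_dir."""
    model.save_pretrained(save_dir)      # weights and config.json
    tokenizer.save_pretrained(save_dir)  # vocabulary and tokenizer config
    return save_dir
```

The saved directory can then be passed to `TFAutoModelForSequenceClassification.from_pretrained(save_dir)` and `AutoTokenizer.from_pretrained(save_dir)` to restore both objects.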



In [None]:
# Save the configuration of the fine-tuned model next to the saved model.
# Reuse the existing `model`: loading "distilbert-base-uncased" again would
# discard the trained classifier weights that predict() relies on.
config = model.config
config.save_pretrained("/content/drive/MyDrive/Colab Notebooks/Sentimental")

In [None]:
!pip install gradio



In [None]:
import gradio as gr  # the interface code below refers to the `gr` alias

ImportError: ignored

Create an interface that takes a message as input and reports whether it is spam or ham

In [None]:
class_names = {0: "ham", 1: "spam"}  # spam was encoded as 1 by get_dummies(..., drop_first=True)

def predict(text):
    # Tokenize a single message and return TensorFlow tensors
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="tf")
    outputs = model(**inputs)
    # The index of the larger logit is the predicted class
    prediction = int(outputs.logits.numpy().argmax())
    predicted_class_name = class_names[prediction]
    return f"Predicted Class: {predicted_class_name}"
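
`gr.Label` can also display per-class confidences when the function returns a dictionary mapping class names to probabilities. A hedged sketch of that conversion follows. It assumes spam was encoded as 1 (as in the `get_dummies` step) and uses made-up logits; `logits_to_confidences` and `CLASS_NAMES` are hypothetical names.

```python
import math

# Assumption: index 0 is ham and index 1 is spam, matching the label encoding.
CLASS_NAMES = {0: "ham", 1: "spam"}


def logits_to_confidences(logits):
    """Apply a softmax and key the probabilities by class name."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {CLASS_NAMES[i]: e / total for i, e in enumerate(exps)}


# Made-up logits for a single message.
confidences = logits_to_confidences([-1.2, 2.3])
print(confidences)
```

Inside `predict`, the input to this helper would be the single row of logits, for example `outputs.logits.numpy()[0].tolist()`.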



In [None]:
# Create Gradio interface
input_textbox = gr.components.Textbox(lines=3, label="Input Text")
output_label = gr.components.Label(label="Predicted Class")
interface = gr.Interface(
    fn=predict,
    inputs=input_textbox,
    outputs=output_label,
    live=True,
    title="senti_model"
)
interface.launch(share=True)