## Sentiment_analysis: Fine-tune DistilBERT
* DistilBERT uncased
  - Dataset: SMS Spam Collection (`SMSSpamCollection.txt`)
  - Classification: whether a message is spam or not (ham)

In [None]:
!pip install transformers


In [None]:
import pandas as pd

df  = pd.read_csv("SMSSpamCollection.txt", sep='\t',
                  names=["label", "message"])
df.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
df.shape

(5572, 2)

In [None]:
# independent feature (message text) and target (label)
X = list(df['message'])
y = list(df['label'])

In [None]:
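# convert the ham/spam labels into 0s and 1s (spam = 1);
# dtype=int keeps integer labels on pandas versions where get_dummies returns booleans
y = list(pd.get_dummies(y, drop_first=True, dtype=int)['spam'])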
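
The same conversion can be written as a plain dictionary lookup, which makes the class-to-id mapping explicit. This is a minimal sketch: `LABEL_TO_ID` and `encode_labels` are illustrative names that do not appear in the notebook, and the sample labels stand in for `df['label']`.

```python
from collections import Counter

# Hypothetical mapping used for illustration: ham -> 0, spam -> 1.
LABEL_TO_ID = {"ham": 0, "spam": 1}


def encode_labels(labels):
    """Map string class labels to integer ids; an unknown label raises KeyError."""
    return [LABEL_TO_ID[label] for label in labels]


# Toy labels standing in for df['label'].
sample_labels = ["ham", "ham", "spam", "ham", "ham"]
encoded = encode_labels(sample_labels)

# Counting the ids is a quick way to see how imbalanced the classes are.
class_counts = Counter(encoded)
print(encoded)
print(class_counts)
```

An explicit mapping fails loudly on an unexpected label instead of silently creating another column.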

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
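
The split sizes follow from simple arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remaining rows go to the training set; the result agrees with the 1115 test rows reported later in the notebook.

```python
import math

n_samples = 5572   # rows reported by df.shape
test_size = 0.20   # fraction passed to train_test_split

# Assumption: the test count is rounded up and the remainder is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Because spam is the minority class, passing `stratify=y` to `train_test_split` is a common way to keep the class ratio the same in both splits.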

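DistilBERT is a lightweight, distilled version of BERT. The `distilbert-base-uncased` checkpoint used here is a general-purpose pretrained model, not one trained on SST-2. It is fine-tuned below for sequence classification, the same task family that sentiment analysis belongs to.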

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

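* Transformer models take token ids rather than raw text, so the dataset must be tokenized first.

  - truncation - cuts any sequence longer than the model's maximum input length (512 tokens for DistilBERT).
  - padding - with `padding=True`, pads shorter sequences up to the longest sequence in the batch (238 tokens for the training set here), not to a fixed 512.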
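
The effect of the two options can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer also handles special tokens (for example, it keeps the closing `[SEP]` when truncating).

```python
def pad_and_truncate(sequences, max_length=512, pad_id=0):
    """Cut each sequence to max_length, then pad all of them to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; max_length is set to 4 only to keep the example small.
batch = pad_and_truncate([[101, 7, 8, 9, 102], [101, 9, 102]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model distinguish real tokens from padding, which is why the tokenizer returns it alongside `input_ids`.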

In [None]:
train_encs = tokenizer(X_train, padding=True, truncation=True)
test_encs = tokenizer(X_test, padding=True, truncation=True)

Convert the encodings and labels into `tf.data.Dataset` objects.

In [None]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encs),
    y_train
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encs),
    y_test
))

In [None]:
train_dataset

<TensorSliceDataset element_spec=({'input_ids': TensorSpec(shape=(238,), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(238,), dtype=tf.int32, name=None)}, TensorSpec(shape=(), dtype=tf.int32, name=None))>
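
The element spec above shows that each dataset element is a pair: a dictionary holding one row of `input_ids` and `attention_mask`, plus a scalar label. A rough pure-Python picture of that slicing, using a hypothetical `slice_examples` helper and made-up token ids, looks like this:

```python
def slice_examples(features, labels):
    """Yield one (feature_dict, label) pair per row, i.e. slice along the first axis."""
    for i in range(len(labels)):
        row = {name: column[i] for name, column in features.items()}
        yield row, labels[i]


# Toy stand-ins for dict(train_encs) and y_train.
features = {
    "input_ids": [[101, 7, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
labels = [0, 1]

examples = list(slice_examples(features, labels))
print(examples[1])
```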

In [None]:
# import the model class, the trainer, and the training arguments
# (TFTrainer is deprecated in later transformers releases, so this cell assumes a version that still ships it)
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir="./results",         # output directory
    num_train_epochs=2,             # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=16,  # batch size for evaluation
    warmup_steps=500,               # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,              # strength of weight decay
    logging_dir="./logs",           # directory for storing logs
    logging_steps=10,               # log every 10 steps
)
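
These settings determine how many optimizer steps the run takes and how much of it is spent in warmup. The sketch below works that out under stated assumptions: a single device, a final partial batch counted as a step, and a linear warmup followed by linear decay. `lr_at` and `peak_lr` are illustrative and are not taken from the notebook.

```python
import math

n_train = 4457      # 80% of the 5572 rows
batch_size = 8      # per_device_train_batch_size
epochs = 2          # num_train_epochs
warmup_steps = 500  # warmup_steps

# Assumption: one device, and the last partial batch still counts as a step.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs


def lr_at(step, peak_lr=5e-5):
    """Linear warmup to peak_lr, then linear decay to zero (illustrative schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


warmup_fraction = warmup_steps / total_steps
print(steps_per_epoch, total_steps, round(warmup_fraction, 2))
```

Under these assumptions the warmup covers a large share of the run, so with a dataset this small a lower `warmup_steps` may be worth trying.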

* Stanford Sentiment Treebank (SST-2)
Model description (for reference): `distilbert-base-uncased-finetuned-sst-2-english` is a fine-tuned checkpoint of DistilBERT-base-uncased, trained on SST-2. It reaches an accuracy of 91.3 on the dev set (for comparison, bert-base-uncased reaches 92.7). The cell below loads the base `distilbert-base-uncased` checkpoint, not this SST-2 checkpoint.

link: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english

In [None]:
with training_args.strategy.scope():
  model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

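# set up the trainer
trainer = TFTrainer(
    model=model,                  # the instantiated Hugging Face model
    args=training_args,           # training arguments
    train_dataset=train_dataset,  # training dataset
    eval_dataset=test_dataset     # evaluation dataset
)

# fine-tune the model; without this call the new classifier head stays randomly initialized
trainer.train()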

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'activation_13', 'vocab_transform', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'pre_classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use i

In [None]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.709601811000279}
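
A useful reference point for a two-class evaluation loss is the cross-entropy of a model that assigns probability 0.5 to each class, which equals ln 2 ≈ 0.693. The recorded value of about 0.71 sits just above that, which is what one would expect from a classifier head that has not been trained yet. The sketch below computes both the chance-level loss and, for contrast, the loss of a confident correct prediction; the probabilities are made up for illustration.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


# A model that cannot tell the classes apart predicts 0.5 / 0.5.
chance_loss = cross_entropy([0.5, 0.5], 1)

# A confident, correct prediction gives a much smaller loss.
confident_loss = cross_entropy([0.05, 0.95], 1)

print(round(chance_loss, 4), round(confident_loss, 4))
```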

In [None]:
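# predict() returns (predictions, label_ids, metrics): index [0] holds the logits and
# index [1] holds the true labels, so take the argmax over the logits to get class ids
output = trainer.predict(test_dataset)[0].argmax(axis=-1)

In [None]:
output.shape

(1115,)

In [None]:
from sklearn.metrics import confusion_matrix

# Note: the matrix recorded below came from a run where `output` was
# trainer.predict(test_dataset)[1], i.e. the true labels, so it compares y_test with
# itself and is diagonal by construction. Re-run this cell to get the model's real matrix.
cm = confusion_matrix(y_test, output)
cm

array([[955,   0],
       [  0, 160]])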
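
A confusion matrix only measures the model when the predicted class ids come from the model's logits. The sketch below shows the full path in plain Python: argmax over logits, a confusion matrix with true classes as rows and predicted classes as columns, and precision and recall for the spam class. All logits and labels are made up, and `argmax_rows` and `confusion` are illustrative helpers.

```python
def argmax_rows(logits):
    """Return the index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def confusion(y_true, y_pred, n_classes=2):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical logits for five messages (column 0 = ham, column 1 = spam).
logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4], [0.8, -0.2]]
y_true = [0, 0, 1, 0, 1]

y_pred = argmax_rows(logits)
cm_example = confusion(y_true, y_pred)

tn, fp = cm_example[0]
fn, tp = cm_example[1]
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Comparing the labels with themselves always yields a diagonal matrix.
self_cm = confusion(y_true, y_true)
print(cm_example, precision, recall, self_cm)
```

With an imbalanced dataset such as this one, precision and recall for the spam class are usually more informative than overall accuracy.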

In [None]:
trainer.save_model("sentiment_model")