Reference doc: https://huggingface.co/docs/transformers/tasks/sequence_classification

### TODO Recording:

- Show the runtime used (T4)
- Upload the twitter_training.csv file to Colab

In [None]:
%pip install transformers datasets evaluate accelerate



Importing required libraries

In [None]:
import numpy as np
import pandas as pd

Loading data.
Dataset link-

In [None]:
columns = ["id", "country", "Label", "Text"]

tweets_data = pd.read_csv("twitter_training.csv", names = columns)

tweets_data.sample(10)

Unnamed: 0,id,country,Label,Text
60608,3584,Facebook,Negative,"Hey, another warning about fake accounts.. My ..."
29609,691,ApexLegends,Positive,Ayy!! just love this damn wingman!!!!
43461,10260,PlayerUnknownsBattlegrounds(PUBG),Positive,I started to feel the way I felt after I playe...
25104,4710,Google,Negative,@AndroidDev I've just noticed an issue while u...
52032,10539,RedDeadRedemption(RDR),Positive,Phenomenal ending. . Unforgettable.
26577,965,AssassinsCreed,Irrelevant,Ghost of Tsushima looks amazing! Next ps4 game...
50149,6210,FIFA,Positive,FM20: loaded. Beer: open. Oxford United: happy...
44892,11710,Verizon,Neutral,Facebook Tries to Contain Damage as Verizon Jo...
5917,219,Amazon,Neutral,Langwu Ladies elegant comfortable casual short...
70487,10875,TomClancysGhostRecon,Positive,. Is there @GhostRecon.


Checking dimension of data

In [None]:
tweets_data.shape

(74682, 4)

In [None]:
tweets_data = tweets_data.rename(columns = {"Text": "text", "Label": "label"})

tweets_data = tweets_data.drop(columns = ["id", "country"])

tweets_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74682 entries, 0 to 74681
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   74682 non-null  object
 1   text    73996 non-null  object
dtypes: object(2)
memory usage: 1.1+ MB


Dropping rows with missing values in the text column, then dropping duplicate rows

In [None]:
tweets_data.dropna(inplace = True, axis = 0)

tweets_data = tweets_data.drop_duplicates()

tweets_data.shape

(69769, 2)

Encoding the string labels as integer ids

In [None]:
tweets_data["label"] = tweets_data["label"].replace({"Negative": 0, "Neutral": 1, "Positive": 2, "Irrelevant": 3})

tweets_data.sample(10)

Unnamed: 0,label,text
52350,2,I painted my favorite location in Red Dead Red...
4071,3,Finally!!!! Death awaits Modern Warfare!! Oh t...
3376,1,Wow I can ’... t even believe it ’ s been 10 y...
38583,1,Trial by Felfire Challenges are Closed! New Le...
18004,0,This Damn
35250,2,special shoutouts to microsoft excel 2013
51391,1,Red Dead Redemption 2 (for PC) pcmag.com / rev...
24707,2,Thank a @YouTubeIndia<unk> for the support.
52486,2,It certainly does indeed. The landscapes and n...
3786,1,fiverr.com/share/xkDQya.

Checking the class distribution

In [None]:
tweets_data["label"].value_counts()

0    21237
2    19138
1    17110
3    12284
Name: label, dtype: int64


Converting the pandas DataFrame to a Hugging Face Dataset

In [None]:
from datasets import Dataset

tweets_ds = Dataset.from_pandas(tweets_data)

tweets_ds

Dataset({
    features: ['label', 'text', '__index_level_0__'],
    num_rows: 69769
})

In [None]:
tweets_ds = tweets_ds.remove_columns(["__index_level_0__"])

tweets_ds

Dataset({
    features: ['label', 'text'],
    num_rows: 69769
})

Splitting the dataset into a train set and a test set with the train_test_split method

In [None]:
tweets_ds = tweets_ds.train_test_split(test_size = 0.2)

tweets_ds

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 55815
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 13954
    })
})
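The row counts above follow from the 20% test fraction. A minimal sketch of the arithmetic, assuming the test count is rounded up and the remainder goes to the train set (the rounding rule is an assumption, chosen because it reproduces the counts printed above; `split_sizes` is a hypothetical helper):

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest is used for training.
    """
    n_test = math.ceil(n_rows * test_size)
    n_train = n_rows - n_test
    return n_train, n_test

n_train, n_test = split_sizes(69769, 0.2)
print(n_train, n_test)
```

Because the split is random, passing a fixed seed to train_test_split is worth considering if the same split is needed across runs.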

A few training examples with their integer labels (0 = Negative, 1 = Neutral, 2 = Positive, 3 = Irrelevant):

In [None]:
tweets_ds['train'][0]

{'label': 1,
 'text': 'Next. Was that awful rainbowsixde... Miss all of those...'}

In [None]:
tweets_ds['train'][4]

{'label': 3,
 'text': "Someone tell me why  @SpeakerPelosi @SenSchumer @ChrisMurphyCT @SenBlumenthal should'nt make this political? Squeeze him every minute of everyday it's his fault and what's his plan to get us out of this."}

In [None]:
tweets_ds['train'][5]

{'label': 0,
 'text': 'Big disagreements from great people who really create a great environment - so important in the world of Fifa toxicity!'}

In [None]:
tweets_ds['train'][3]

{'label': 0,
 'text': 'I ordered really shit ton of stuff on amazon lately..... I think I have a retail therapy day'}

### TODO Recording:

- Go to https://huggingface.co/distilbert-base-uncased and show the model card

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Creating a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT’s maximum input length

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation = True)
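
With truncation = True and no explicit max_length, inputs are cut to the model's maximum length, which is 512 tokens for DistilBERT including the special tokens. A rough pure-Python sketch of that behaviour for a single sequence follows; `truncate_ids` and the made-up token ids are hypothetical stand-ins, not part of the tokenizer's API:

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids that appear in this notebook's tokenizer output

def truncate_ids(token_ids, max_length=512):
    """Keep at most max_length ids in total, reserving two slots for [CLS] and [SEP]."""
    body = token_ids[: max_length - 2]
    return [CLS_ID] + body + [SEP_ID]

long_input = list(range(1000, 1600))  # 600 made-up token ids
encoded = truncate_ids(long_input)
print(len(encoded))  # 512
```

Tweets are short, so truncation rarely triggers here, but it guards against the occasional very long input.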

In [None]:
tokenizer("Worst thing to ever happen out in Gta history")

{'input_ids': [101, 5409, 2518, 2000, 2412, 4148, 2041, 1999, 14181, 2050, 2381, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
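
The output has 12 ids for a 9-word sentence: two are the special tokens [CLS] (101) and [SEP] (102), and one word was split into two sub-word pieces. DistilBERT uses a WordPiece vocabulary. The sketch below shows greedy longest-match splitting on a tiny toy vocabulary; the vocabulary and the `wordpiece` helper are illustrative assumptions, not the real tokenizer:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"worst", "history", "gt", "##a"}  # hypothetical vocabulary
words = "worst gta history".split()
tokens = ["[CLS]"] + [p for w in words for p in wordpiece(w, toy_vocab)] + ["[SEP]"]
print(tokens)  # ['[CLS]', 'worst', 'gt', '##a', 'history', '[SEP]']
```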

In [None]:
tokenizer.vocab_size

30522

To apply the preprocessing function over the entire dataset, use 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once:

In [None]:
tokenized_tweets_ds = tweets_ds.map(preprocess_function, batched = True)


In [None]:
tweets_ds["train"][40]

{'label': 2,
 'text': 'ahh these are SO awesome! I cover absolutely everything. Such a total deal! And a fast way to pass the time until launch.'}

The same instance after tokenization:

In [None]:
tokenized_tweets_ds["train"][40]

{'label': 2,
 'text': 'ahh these are SO awesome! I cover absolutely everything. Such a total deal! And a fast way to pass the time until launch.',
 'input_ids': [101, 6289, 2232, 2122, 2024, 2061, 12476, 999, 1045, 3104, 7078, 2673, 1012, 2107, 1037, 2561, 3066, 999, 1998, 1037, 3435, 2126, 2000, 3413, 1996, 2051, 2127, 4888, 1012, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Detokenizing the text to get back the input text

In [None]:
tokenizer.decode(tokenized_tweets_ds["train"][40]["input_ids"])

'[CLS] ahh these are so awesome! i cover absolutely everything. such a total deal! and a fast way to pass the time until launch. [SEP]'

Next, creating batches of examples with DataCollatorWithPadding. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length.

In NLP, when processing sequences of text (like sentences or paragraphs), it's common to batch these sequences together for efficient training. However, since these sequences can vary in length, DataCollatorWithPadding automatically pads all sequences in a batch to match the length of the longest sequence in that batch. This ensures that all sequences in a batch have the same length, which is a requirement for most deep learning models.
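
To make the idea concrete, here is a minimal pure-Python sketch of dynamic padding. `pad_batch` is a hypothetical helper, not the collator's real implementation, and it assumes the pad token id is 0 (the [PAD] id in the BERT uncased vocabulary):

```python
PAD_ID = 0  # assumed pad token id

def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 5409, 2518, 102], [101, 2381, 102]])
print(batch["input_ids"])       # [[101, 5409, 2518, 102], [101, 2381, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Only one pad token is added here because the longest sequence in this batch has four tokens; padding to a fixed global maximum would add far more.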

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer = tokenizer, return_tensors = "tf")

Including a metric during training is often helpful for evaluating your model’s performance. You can quickly load an evaluation metric with the 🤗 Evaluate library. For this task, we load the accuracy metric.

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")


Then creating a function that converts the model's logits to predicted class ids and passes the predictions and labels to compute to calculate the accuracy

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return accuracy.compute(predictions = predictions, references = labels)
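
As a rough illustration of what this function computes, the sketch below reproduces the argmax and accuracy steps in pure Python. `argmax_accuracy` is a hypothetical helper and the logits are made up:

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row, then return the fraction matching the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Made-up logits for three tweets over the four classes.
logits = [
    [2.1, 0.3, -1.0, 0.2],  # highest score at index 0
    [0.1, 0.2, 3.0, -0.5],  # highest score at index 2
    [0.0, 1.5, 0.4, 0.9],   # highest score at index 1
]
labels = [0, 2, 3]
result = argmax_accuracy(logits, labels)
print(result)  # two of the three predictions match
```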

Before training the model, create a map from the expected ids to their labels with id2label, and the reverse map with label2id.

In [None]:
id2label = {0: "Negative", 1: "Neutral", 2: "Positive", 3: "Irrelevant"}

label2id = {"Negative": 0, "Neutral": 1, "Positive": 2, "Irrelevant":3}
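
The two dictionaries must stay exact inverses of each other and must match the integer encoding applied to the label column. One possible way to guard against drift, shown here as a sketch rather than a required step, is to derive one map from the other and check it:

```python
label2id = {"Negative": 0, "Neutral": 1, "Positive": 2, "Irrelevant": 3}

# Derive the reverse map from label2id so the two dictionaries cannot drift apart.
id2label = {idx: name for name, idx in label2id.items()}

# Sanity checks: ids are contiguous from 0, and the mapping round-trips.
assert sorted(id2label) == list(range(len(label2id)))
assert all(label2id[id2label[i]] == i for i in id2label)
print(id2label)
```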

To fine-tune a model in TensorFlow, start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters:

In [None]:
from transformers import create_optimizer
import tensorflow as tf

batch_size = 16
num_epochs = 3  # should match the epochs value passed to model.fit so the schedule spans the whole run

batches_per_epoch = len(tokenized_tweets_ds["train"]) // batch_size

total_train_steps = int(batches_per_epoch * num_epochs)

optimizer, schedule = create_optimizer(init_lr = 2e-5, num_warmup_steps = 0, num_train_steps = total_train_steps)
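
The step count drives the learning rate schedule, so it helps to see the numbers. The sketch below assumes that, with no warmup, the schedule decays linearly from init_lr to zero over num_train_steps; `total_steps` and `linear_decay_lr` are hypothetical helpers written for illustration:

```python
def total_steps(n_examples, batch_size, num_epochs):
    """Number of optimizer steps, ignoring the last partial batch of each epoch."""
    return (n_examples // batch_size) * num_epochs

def linear_decay_lr(step, init_lr, num_train_steps):
    """Learning rate after `step` updates, decaying linearly from init_lr to 0."""
    step = min(step, num_train_steps)
    return init_lr * (1 - step / num_train_steps)

steps = total_steps(55815, 16, 3)  # 3 epochs, the value passed to fit in this notebook
print(steps)  # 10464
print(linear_decay_lr(0, 2e-5, steps))           # starts at init_lr
print(linear_decay_lr(steps // 2, 2e-5, steps))  # about half of init_lr
print(linear_decay_lr(steps, 2e-5, steps))       # 0.0 at the end
```

If the schedule is built for more steps than training actually runs, the learning rate never reaches zero by the end of training.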

In [None]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels = 4, id2label = id2label, label2id = label2id
)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Converting train and validation datasets to the tf.data.Dataset format with prepare_tf_dataset():

In [None]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_tweets_ds["train"],
    shuffle = True,
    batch_size = 16,
    collate_fn = data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_tweets_ds["test"],
    shuffle = False,
    batch_size = 16,
    collate_fn = data_collator,
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Configuring the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don’t need to specify one unless you want to.

In [None]:
import tensorflow as tf

model.compile(optimizer = optimizer)

The last two things to set up before training are a way to compute the accuracy from the predictions and a way to push the model to the Hub. Both are done with Keras callbacks.

Passing the compute_metrics function to KerasMetricCallback:

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn = compute_metrics, eval_dataset = tf_validation_set)

Pushing to the Hub requires logging in with an access token that has write permission.

### TODO Recording:

- After running the notebook cell - go to https://huggingface.co/
- Go to the top-right corner click on account avatar -> Settings
- Go to Access Tokens (we should already have a token)
- Copy and paste that in here



In [None]:
from huggingface_hub import notebook_login

notebook_login()


Specifying where to push your model and tokenizer in the PushToHubCallback:

In [None]:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir = "twitter_text_classification_model",
    tokenizer = tokenizer,
)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/jinxxx123/twitter_text_classification_model into local empty directory.


### TODO Recording:

- Click on the Hugging Face repository link above
- Show the empty model card, files and versions (should be empty), community, and settings
- Keep this open in another tab

Bundling both callbacks together

In [None]:
callbacks = [metric_callback, push_to_hub_callback]

callbacks

[<transformers.keras_callbacks.KerasMetricCallback at 0x7f1a20649960>,
 <transformers.keras_callbacks.PushToHubCallback at 0x7f1a20648940>]

Starting training. Call fit with the training and validation datasets, the number of epochs, and the callbacks to fine-tune the model:

In [None]:
model.fit(
    x = tf_train_set, validation_data = tf_validation_set,
    epochs = 3, callbacks = callbacks
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x7f1a206492d0>

### TODO Recording:

- Go back to the model and show that the model has been pushed (make sure you refresh)
- Show the  model card, files and versions (should NOT be empty), community, and settings

In [None]:
from transformers import pipeline

text = "This is awful. I get that profit-wise it was less than expected due to a huge budget."

classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

classifier(text)

[{'label': 'Negative', 'score': 0.9740213751792908}]

### TODO Recording:

- Go to the model open in another tab

In [None]:
text = "This was an amazing movie. I would like to explore similar movies."

classifier = pipeline("sentiment-analysis", model = "jinxxx123/twitter_text_classification_model")

classifier(text)


Some layers from the model checkpoint at jinxxx123/twitter_text_classification_model were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at jinxxx123/twitter_text_classification_model and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'label': 'Positive', 'score': 0.9736701250076294}]

In [None]:
text = "Let's play the next game tomorrow"

classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

classifier(text)

[{'label': 'Positive', 'score': 0.9707736968994141}]

In [None]:
text = "Overall I would say this movie was so so, I would not watch it again"

classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

classifier(text)

[{'label': 'Negative', 'score': 0.9909830093383789}]

In [None]:
text = "Going out right now"

classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

classifier(text)

[{'label': 'Neutral', 'score': 0.9195804595947266}]

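Manually replicating the result above. First, the text is tokenized with the tokenizer loaded from the local twitter_text_classification_model directory, where the fine-tuned model is also available.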

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("twitter_text_classification_model")

inputs = tokenizer(text, return_tensors = "tf")

Passing inputs to the model to generate logits

In [None]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("twitter_text_classification_model")

logits = model(**inputs).logits

Some layers from the model checkpoint at twitter_text_classification_model were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at twitter_text_classification_model and are newly initialized: ['dropout_59']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
predicted_class_id = int(tf.math.argmax(logits, axis = -1)[0])

model.config.id2label[predicted_class_id]

'Neutral'
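
The argmax gives the label but not the score that the pipeline reports. For a single-label classifier that score is typically the softmax probability of the predicted class. A minimal sketch under that assumption, using made-up logits rather than the model's real output:

```python
import math

id2label = {0: "Negative", 1: "Neutral", 2: "Positive", 3: "Irrelevant"}

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, 3.4, 0.1, -0.8]  # made-up values for illustration
probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 4))
```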