# Problem Statement
The objective of this project is to develop a text classification model to classify Twitter posts into six distinct emotion categories: Sadness, Joy, Love, Anger, Fear, and Surprise. The goal is to leverage a pre-trained BERT model to achieve high accuracy in predicting the correct emotion from the text of a tweet.

# Model Details
- **Base Model:** bert-base-uncased from Hugging Face's transformers library.
- **Custom Model:** A custom TensorFlow model named BERTForClassification that incorporates the BERT model and an additional dense layer with a softmax activation function to predict one of the six emotion labels.
- **Number of Classes:** 6 emotion categories (Sadness, Joy, Love, Anger, Fear, Surprise).

# Important Libraries Used:
- TensorFlow: For creating, training, and evaluating the custom classification model.
- Transformers: For loading the pre-trained BERT model and tokenizer.
- Datasets: For loading and managing the Twitter emotion dataset.

# Tokenizer Information
- Tokenizer: AutoTokenizer from the transformers library.
- Function: Converts input text into token IDs, attention masks, and token type IDs, making it suitable for BERT processing.

The dataset has six labels, encoded numerically as follows:
- Sadness: 0
- Joy: 1
- Love: 2
- Anger: 3
- Fear: 4
- Surprise: 5

## Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


## Change Directory

In [2]:
%cd /gdrive/My Drive/ML_Datasets/

/gdrive/My Drive/ML_Datasets


## Import necessary libraries

In [2]:
from transformers import TFAutoModel, AutoTokenizer
import tensorflow as tf


## Load Pre-trained BERT Model and Tokenizer:

In [3]:
model = TFAutoModel.from_pretrained("bert-base-uncased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT

In [4]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [5]:
inputs = tokenizer(['Hello world','Hi how are you'], padding=True, truncation = True,
                   return_tensors = 'tf')
inputs

{'input_ids': <tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[ 101, 7592, 2088,  102,    0,    0],
       [ 101, 7632, 2129, 2024, 2017,  102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 1]], dtype=int32)>}
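
To see where the zeros in `input_ids` and `attention_mask` come from, the sketch below reproduces the padding step by hand in plain Python. It is not how the tokenizer is implemented; `pad_batch` is a hypothetical helper, and the token IDs are copied from the output above (101 and 102 are the `[CLS]` and `[SEP]` tokens, 0 is the padding ID).

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Unpadded token IDs for 'Hello world' and 'Hi how are you', taken from the output above.
batch = [
    [101, 7592, 2088, 102],
    [101, 7632, 2129, 2024, 2017, 102],
]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

The shorter sentence is padded to the length of the longer one, which is why both rows have six entries.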

In [6]:
output = model(inputs)
output

TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=<tf.Tensor: shape=(2, 6, 768), dtype=float32, numpy=
array([[[-0.16888346,  0.13606353, -0.13940021, ..., -0.6251122 ,
          0.05217258,  0.36714542],
        [-0.36327425,  0.14121991,  0.8799884 , ...,  0.10432979,
          0.28875816,  0.37267938],
        [-0.6985947 , -0.6987969 ,  0.06450288, ..., -0.221037  ,
          0.00986866, -0.5939791 ],
        [ 0.83098316,  0.12366718, -0.15119025, ...,  0.10309617,
         -0.67792654, -0.2628524 ],
        [-0.4026664 , -0.01928189,  0.5732511 , ..., -0.20656863,
          0.02338578,  0.20126325],
        [-0.6228405 , -0.27453387,  0.18117723, ..., -0.12944841,
         -0.03839084, -0.05733161]],

       [[ 0.09286507, -0.02636369, -0.12239276, ..., -0.2106356 ,
          0.1738639 ,  0.17250952],
        [ 0.4074202 , -0.05931015,  0.5523475 , ..., -0.6790574 ,
          0.6555744 , -0.29456657],
        [-0.2115531 , -0.685863  , -0.46280605, ...,  0.15278436

In [7]:
# The model returns two outputs: "last_hidden_state" and "pooler_output".
# last_hidden_state has shape (2, 6, 768):
#   2   is the batch size (two input sentences),
#   6   is the padded sequence length in tokens,
#   768 is the hidden size, i.e. the length of the vector BERT produces for each token.
# BERT-base has a hidden size of 768; BERT-large has a hidden size of 1024.
# last_hidden_state holds one vector per token, but text classification needs
# a single vector per sentence rather than one per token.
# That is what pooler_output provides: it is derived from the first ([CLS]) token
# and has no sequence axis.
# pooler_output has shape (2, 768).
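
The shape difference between the two outputs can be illustrated with a toy example. In the Hugging Face BERT implementation, the pooler takes the hidden state of the first token (`[CLS]`), passes it through a dense layer, and applies `tanh`. The sketch below mimics that with plain Python lists and a hidden size of 3 instead of 768; the weights are made-up numbers, and `pool` is a hypothetical helper.

```python
import math


def pool(last_hidden_state, weight, bias):
    """Toy pooler: dense layer + tanh applied to the first token of each sequence."""
    pooled = []
    for sequence in last_hidden_state:
        cls_vector = sequence[0]  # hidden state of the [CLS] token
        row = []
        for j in range(len(bias)):
            z = sum(cls_vector[i] * weight[i][j] for i in range(len(cls_vector))) + bias[j]
            row.append(math.tanh(z))
        pooled.append(row)
    return pooled


# Toy tensor of shape (batch=2, seq_len=4, hidden=3); real BERT-base uses hidden=768.
last_hidden_state = [
    [[0.1, 0.2, 0.3], [0.0, 0.1, 0.0], [0.5, 0.5, 0.5], [0.0, 0.0, 0.0]],
    [[-0.3, 0.4, 0.0], [0.2, 0.2, 0.2], [0.1, 0.0, 0.1], [0.9, 0.1, 0.3]],
]
# Made-up (hidden x hidden) weight matrix and bias, for illustration only.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

pooler_output = pool(last_hidden_state, weight, bias)
print(len(pooler_output), len(pooler_output[0]))  # 2 3 -> the sequence axis is gone
```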



## Now Load Datasets

In [8]:
# !pip install datasets

In [9]:
# load_dataset is a function from the Hugging Face datasets library
from datasets import load_dataset

In [10]:
emotions = load_dataset('SetFit/emotion')

Downloading readme:   0%|          | 0.00/194 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/2.23M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/276k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/279k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/16000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [11]:
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
})

The dataset comes in three splits:
- Train: 16,000 rows
- Validation: 2,000 rows
- Test: 2,000 rows


### Map Labels to Text

In [12]:
# Define a dictionary to map label indices to label texts
label_texts = {
    0: "label_text_0",
    1: "label_text_1",
    2: "label_text_2",
    3: "label_text_3",
    4: "label_text_4",
    5: "label_text_5",
}


# Print train data
print("Train:")
for example in emotions['train']:
    text = example['text']
    label = example['label']
    label_text = label_texts[label]
    print("Text:", text)
    print("Label:", label)
    print("Label Text:", label_text)
    print()

# Print validation data
print("Validation:")
for example in emotions['validation']:
    text = example['text']
    label = example['label']
    label_text = label_texts[label]
    print("Text:", text)
    print("Label:", label)
    print("Label Text:", label_text)
    print()

# Print test data
print("Test:")
for example in emotions['test']:
    text = example['text']
    label = example['label']
    label_text = label_texts[label]
    print("Text:", text)
    print("Label:", label)
    print("Label Text:", label_text)
    print()


Streaming output truncated to the last 5000 lines.
Text: i lost a few pounds but i also started to feel really awful
Label: 0
Label Text: label_text_0

Text: i enjoy about his work is the genuine feel and the pleasant message he is trying to deliver with all this
Label: 1
Label Text: label_text_1

Text: i knew except they ve lost that girly feeling and gained a graceful wisdom
Label: 1
Label Text: label_text_1

Text: i am feeling a bit ungrateful and choose to correct that
Label: 0
Label Text: label_text_0

Text: i notice how different this question is from why i am feeling so agitated
Label: 4
Label Text: label_text_4

Text: i feel like such a noob when the customers make really dull and stupid jokes that im supposed to find funny
Label: 0
Label Text: label_text_0

Text: getting sent on a company expense trip to another state to work for a week at that plan
Label: 1
Label Text: label_text_1

Text: i feel insulted by how those heroes of cosplay goons said they don t care 
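
Printing all 20,000 examples floods the notebook, which is why the output above is truncated. A lighter way to inspect a split is to print only the first few rows. The sketch below uses `itertools.islice` on a small stand-in list; with the real data the same loop should work on `emotions['train']`, since iterating a Hugging Face `Dataset` yields one dictionary per row. The `label_text` values in the stand-in rows are assumed from the label table at the top of the notebook, and `preview` is a hypothetical helper.

```python
from itertools import islice


def preview(rows, n=3):
    """Return the first n rows formatted as 'label (label_text): text' strings."""
    lines = []
    for row in islice(rows, n):
        lines.append(f"{row['label']} ({row['label_text']}): {row['text']}")
    return lines


# Stand-in rows; texts and labels are copied from the output above, and the
# label_text values are assumed from the label table (0 = sadness, 1 = joy, 4 = fear).
sample_rows = [
    {"text": "i lost a few pounds but i also started to feel really awful",
     "label": 0, "label_text": "sadness"},
    {"text": "i knew except they ve lost that girly feeling and gained a graceful wisdom",
     "label": 1, "label_text": "joy"},
    {"text": "i notice how different this question is from why i am feeling so agitated",
     "label": 4, "label_text": "fear"},
    {"text": "i am feeling a bit ungrateful and choose to correct that",
     "label": 0, "label_text": "sadness"},
]

for line in preview(sample_rows, n=2):
    print(line)
```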

## Dataset Preparation:

In [13]:
# The first step is to convert the dataset from raw strings into tokenized form.

In [14]:
def tokenize(batch):
  return tokenizer(batch["text"], padding = True, truncation = True)

In [15]:
emotions_encoded = emotions.map(tokenize, batched = True, batch_size=None)

Map:   0%|          | 0/16000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [16]:
emotions_encoded

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
})

In [17]:
# Setting Batch_size to 64
BATCH_SIZE = 64

def order(inp):
    '''
    This function prepares the input data for BERT model.
    '''
    return {
        'input_ids': inp['input_ids'],
        'attention_mask': inp['attention_mask'],
        'token_type_ids': inp['token_type_ids']
    }, inp['label']

# Convert the train split of 'emotions_encoded' to a tf.data.Dataset
train_dataset = tf.data.Dataset.from_tensor_slices(emotions_encoded['train'][:])

# Shuffle individual examples first, then group them into batches
train_dataset = train_dataset.shuffle(1000).batch(BATCH_SIZE)

# Map the 'order' function
train_dataset = train_dataset.map(order, num_parallel_calls=tf.data.AUTOTUNE)

# Do the same for the test set (no shuffling is needed for evaluation)
test_dataset = tf.data.Dataset.from_tensor_slices(emotions_encoded['test'][:])
test_dataset = test_dataset.batch(BATCH_SIZE)
test_dataset = test_dataset.map(order, num_parallel_calls=tf.data.AUTOTUNE)
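
The order of `batch` and `shuffle` in a `tf.data` pipeline matters. Shuffling after batching only reorders whole batches, so the same examples always travel together; shuffling before batching mixes individual examples. The plain-Python sketch below illustrates the difference on ten integers. It only imitates the two orderings with the `random` module and is not how `tf.data` is implemented; both helper functions are hypothetical.

```python
import random


def batch_then_shuffle(items, batch_size, rng):
    """Group items into fixed batches first, then shuffle the order of the batches."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    rng.shuffle(batches)
    return batches


def shuffle_then_batch(items, batch_size, rng):
    """Shuffle individual items first, then group them into batches."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


rng = random.Random(0)
data = list(range(10))

a = batch_then_shuffle(data, 5, rng)
b = shuffle_then_batch(data, 5, rng)

# With batch-then-shuffle every batch still holds the same consecutive examples.
print(a)
# With shuffle-then-batch the examples are mixed before the batches are formed.
print(b)
```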


In [19]:
inp, out = next(iter(train_dataset))
print(inp,'\n\n' , out)

{'input_ids': <tf.Tensor: shape=(64, 87), dtype=int32, numpy=
array([[ 101, 2044, 4909, ...,    0,    0,    0],
       [ 101, 1045, 2064, ...,    0,    0,    0],
       [ 101, 1045, 2245, ...,    0,    0,    0],
       ...,
       [ 101, 1045, 2018, ...,    0,    0,    0],
       [ 101, 1045, 2031, ...,    0,    0,    0],
       [ 101, 4921, 2063, ...,    0,    0,    0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(64, 87), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(64, 87), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>} 

 tf.Tensor(
[0 1 2 0 0 1 1 4 0 1 1 3

# Building the Model BERT for Classification

In [20]:
class BERTForClassification(tf.keras.Model):
  def __init__(self, bert_model, num_classes):
    super().__init__()  # call the __init__ of the parent class (tf.keras.Model)
    self.bert = bert_model  # keep the pre-trained BERT model as an attribute
    self.fc = tf.keras.layers.Dense(num_classes, activation = 'softmax')  # final dense layer with one unit per class

  def call(self, inputs):  # forward pass; 'inputs' is the dictionary of tokenizer outputs
    x = self.bert(inputs)[1]  # index 1 selects pooler_output
    return self.fc(x)

### Compile and train Model:

In [21]:
classifier = BERTForClassification(model, num_classes = 6)

classifier.compile(
    optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-5),
    loss = tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics = ['accuracy']
)
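
For reference, the softmax output layer and the `SparseCategoricalCrossentropy` loss amount to the following arithmetic for a single example: the six raw scores are turned into probabilities, and the loss is the negative log of the probability assigned to the true class. The sketch below is plain Python with made-up scores, not the Keras implementation; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(true_label, probs):
    """Loss for one example when the label is an integer class index."""
    return -math.log(probs[true_label])


# Made-up scores for the six emotion classes (index 1 = Joy is the largest).
scores = [0.5, 3.0, 1.0, -1.0, 0.0, 0.2]
probs = softmax(scores)

loss_if_joy = sparse_categorical_crossentropy(1, probs)   # small: the model favours Joy
loss_if_fear = sparse_categorical_crossentropy(4, probs)  # larger: Fear got little probability
print(probs)
print(loss_if_joy, loss_if_fear)
```

Because the labels are plain integers (0 to 5), the "sparse" variant of the loss is used and no one-hot encoding is needed.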

In [22]:
history = classifier.fit(train_dataset,epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


### Evaluate the Model

In [23]:
classifier.evaluate(test_dataset)



[0.186604306101799, 0.9194999933242798]
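
`evaluate` returns the loss followed by the compiled metrics, so the list above is `[loss, accuracy]`. As a quick sanity check, the accuracy can be converted back into an approximate count of correctly classified tweets, using the 2,000 test rows reported earlier.

```python
# Values copied from the evaluate() output above: [loss, accuracy].
loss, accuracy = [0.186604306101799, 0.9194999933242798]

TEST_ROWS = 2000  # size of the test split reported by the DatasetDict
correct = round(accuracy * TEST_ROWS)

print(f"loss={loss:.4f}, accuracy={accuracy:.2%}, correct={correct}/{TEST_ROWS}")
```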

## Make Predictions:

In [38]:
# Tokenize your input text
input_text = "i did a body scan and realized that everything was feeling amazing"
tokenized_input = tokenizer(input_text, padding=True, truncation=True, return_tensors="tf")

# Convert the tokenized input into a TensorFlow Dataset
input_dataset = tf.data.Dataset.from_tensor_slices({
    'input_ids': [tokenized_input['input_ids']],  # Wrap the input_ids in a list
    'attention_mask': [tokenized_input['attention_mask']],  # Wrap the attention_mask in a list
    'token_type_ids': [tokenized_input['token_type_ids']]  # Wrap the token_type_ids in a list
})

# Make predictions using the model
predictions = classifier.predict(input_dataset)

# `predictions` will contain the predicted probabilities for each class
# You can then take the argmax to get the predicted class
predicted_classes = tf.argmax(predictions, axis=-1)

# Print or use the predicted classes as needed
print(predicted_classes)


tf.Tensor([5], shape=(1,), dtype=int64)


In [25]:
predicted_classes

<tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>
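
The predicted class index can be turned into a readable label with a small lookup. The sketch below uses a made-up probability vector in place of a row of `predictions`, and a hypothetical `ID2LABEL` list that follows the label table at the top of the notebook.

```python
# Hypothetical lookup list; the order follows the label table (0 = Sadness ... 5 = Surprise).
ID2LABEL = ["Sadness", "Joy", "Love", "Anger", "Fear", "Surprise"]


def decode(probabilities):
    """Return (label name, probability) for the most likely class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return ID2LABEL[best], probabilities[best]


# Made-up softmax output for one tweet, standing in for a row of `predictions`.
example_probs = [0.02, 0.90, 0.03, 0.01, 0.01, 0.03]
label, prob = decode(example_probs)
print(label, prob)  # Joy 0.9
```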

## Save the Model:

In [50]:
# Save the model and tokenizer
save_directory = '/gdrive/My Drive/ML_Datasets/'
classifier.save(save_directory)
tokenizer.save_pretrained(save_directory)


('/gdrive/My Drive/ML_Datasets/tokenizer_config.json',
 '/gdrive/My Drive/ML_Datasets/special_tokens_map.json',
 '/gdrive/My Drive/ML_Datasets/vocab.txt',
 '/gdrive/My Drive/ML_Datasets/added_tokens.json',
 '/gdrive/My Drive/ML_Datasets/tokenizer.json')

In [48]:
# Load the model from Google Drive
loaded_model = tf.keras.models.load_model('/gdrive/My Drive/ML_Datasets/')

# **Conclusion**
This project demonstrates the use of a pre-trained BERT model fine-tuned on a Twitter dataset to classify tweets into six distinct emotions: Sadness, Joy, Love, Anger, Fear, and Surprise. By leveraging the contextual embeddings provided by BERT, the custom TensorFlow model reaches about 92% accuracy on the test split after three epochs of fine-tuning.