
In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare Dataset

In [None]:
data = pd.read_csv('Emotion.csv')

In [None]:
data.head(5)

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


In [None]:
data.isnull().sum()

text     0
label    0
dtype: int64

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    20000 non-null  object
 1   label   20000 non-null  object
dtypes: object(2)
memory usage: 312.6+ KB


# Preprocessing

Before modelling, the text is cleaned. Characters other than letters and whitespace (punctuation, digits, and other special characters) are removed to reduce noise, and the text is converted to lowercase so that the same word is treated consistently regardless of case. Tokenization, which breaks the text into tokens, is handled later by the ALBERT tokenizer. Finally, the categorical emotion labels are converted to integers with scikit-learn's `LabelEncoder`, because the model expects numeric class ids.

In [None]:
import re

def clean_text(text):
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    return text

data['text'] = data['text'].apply(clean_text)

data.head(5)

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |
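
The first five rows look the same before and after cleaning because they already contain only lowercase letters and spaces. As a minimal sketch of what `clean_text` does to messier input, the snippet below applies the same logic to a made-up sentence (the sample string is an assumption for illustration and does not come from `Emotion.csv`). Note that the pattern keeps whitespace, so removing a token such as `100%` leaves a double space behind.

```python
import re

def clean_text(text):
    # Same logic as the notebook cell: drop everything that is not a
    # letter or whitespace, then lowercase the result
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

sample = "Wow!!! I'm 100% happy"   # hypothetical input, not from the dataset
cleaned = clean_text(sample)
print(repr(cleaned))               # 'wow im  happy' (note the double space)

# Optional extra step, not in the notebook: collapse repeated whitespace
normalized = " ".join(cleaned.split())
print(repr(normalized))            # 'wow im happy'
```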


In [None]:
texts = data['text'].values
labels = data['label'].values

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(labels)

Convert the categorical emotion labels to integer class ids.
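
To make the encoding concrete, here is a small sketch on a toy list of the six emotion names (the list itself is made up for illustration). `LabelEncoder` sorts the distinct labels and numbers them from 0, and `inverse_transform` maps the integers back to names, which is what the evaluation step relies on later.

```python
from sklearn.preprocessing import LabelEncoder

# Toy sample containing the six emotion names that appear in the dataset
emotions = ["sadness", "anger", "love", "joy", "fear", "surprise", "joy"]

encoder = LabelEncoder()
codes = encoder.fit_transform(emotions)

# classes_ holds the sorted label names; the position is the integer id
mapping = {name: idx for idx, name in enumerate(encoder.classes_)}
print(mapping)
# {'anger': 0, 'fear': 1, 'joy': 2, 'love': 3, 'sadness': 4, 'surprise': 5}
print(codes.tolist())
# [4, 0, 3, 2, 1, 5, 2]

# Map the integer ids back to the original names
decoded = encoder.inverse_transform(codes)
```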

In [None]:
!pip install SentencePiece



Split the data into training, validation, and test sets (70% / 15% / 15%).

In [None]:
from sklearn.model_selection import train_test_split
from transformers import AlbertTokenizer, TFAlbertForSequenceClassification
import tensorflow as tf

train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.3, random_state=99)
val_texts, test_texts, val_labels, test_labels = train_test_split(test_texts, test_labels, test_size=0.5, random_state=99)

print("train size:", len(train_texts))
print("validation size:", len(val_texts))
print("test size:", len(test_texts))

train size: 14000
validation size: 3000
test size: 3000
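
The split above is purely random, so a small class can end up slightly over- or under-represented in one of the splits. One possible variant, which is not what this notebook ran, is to pass `stratify` to `train_test_split` so that each split keeps the overall class proportions. The sketch below uses synthetic labels with made-up class shares rather than the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded labels: 6 imbalanced classes, 2000 rows
rng = np.random.default_rng(99)
toy_labels = rng.choice(6, size=2000, p=[0.14, 0.12, 0.33, 0.08, 0.29, 0.04])
toy_texts = np.array([f"text {i}" for i in range(2000)])

# 70% train, 30% held out, preserving class proportions
tr_x, rest_x, tr_y, rest_y = train_test_split(
    toy_texts, toy_labels, test_size=0.3, random_state=99, stratify=toy_labels)

# Split the held-out 30% in half: 15% validation, 15% test
va_x, te_x, va_y, te_y = train_test_split(
    rest_x, rest_y, test_size=0.5, random_state=99, stratify=rest_y)

# Class shares should be nearly identical in all three splits
for name, y in [("train", tr_y), ("val", va_y), ("test", te_y)]:
    share = np.bincount(y, minlength=6) / len(y)
    print(name, len(y), np.round(share, 3))
```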


In [None]:
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = TFAlbertForSequenceClassification.from_pretrained('albert-base-v2', num_labels=len(set(labels)))

All PyTorch model weights were used when initializing TFAlbertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFAlbertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, padding=True)
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True)

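Tokenize the train, validation, and test texts. The ALBERT tokenizer splits each sentence into SentencePiece sub-word tokens and maps them to integer ids. With `truncation=True`, sequences longer than the model's maximum length are cut, and with `padding=True`, shorter sequences are padded so that all sequences in a split have the same length. Unlike stop-word filtering, this step does not remove words.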
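
The effect of padding and truncation can be illustrated with a toy helper in plain Python. This is only a sketch: `pad_batch`, its `pad_id` default, and the token ids are all made up, and the real `AlbertTokenizer` additionally performs sub-word splitting and keeps the closing special token when it truncates.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Pad (and optionally truncate) lists of token ids to one common length."""
    if max_length is not None:
        # Naive truncation: simply cut each sequence at max_length
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three sentences of different lengths
batch = [[2, 31, 208, 3], [2, 31, 589, 1123, 77, 3], [2, 15, 3]]
encoded = pad_batch(batch, max_length=5)

print(encoded["input_ids"])
# [[2, 31, 208, 3, 0], [2, 31, 589, 1123, 77], [2, 15, 3, 0, 0]]
print(encoded["attention_mask"])
# [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The attention mask is what lets the model ignore the padded positions.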

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), tf.convert_to_tensor(train_labels)))
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), tf.convert_to_tensor(val_labels)))
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), tf.convert_to_tensor(test_labels)))

Pair the tokenized texts with their labels to build a `tf.data.Dataset` for each of the train, validation, and test splits.

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
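
A short worked example may help explain `from_logits=True`. The classification head returns raw scores (logits), so the loss has to apply softmax itself before taking the negative log-probability of the true class. The sketch below reproduces that arithmetic for a single example in plain Python; the helper name and the logit values are made up, and Keras additionally averages the per-example values over the batch.

```python
import math

def sparse_ce_from_logits(logits, true_class):
    """Cross-entropy for one example given raw logits and an integer label."""
    m = max(logits)                            # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]          # softmax probabilities
    return -math.log(probs[true_class]), probs

logits = [0.2, -1.0, 3.1, 0.5, -0.3, -2.0]     # hypothetical scores, 6 classes
loss_value, probs = sparse_ce_from_logits(logits, true_class=2)

# The predicted class is the index of the largest probability (or logit)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(loss_value, 4))

# The same logits give a larger loss if the true class were a low-scoring one
wrong_loss, _ = sparse_ce_from_logits(logits, true_class=1)
print(round(wrong_loss, 4))
```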

In [None]:
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

In [None]:
model.fit(train_dataset.batch(32), epochs=8, validation_data=val_dataset.batch(32))

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.src.callbacks.History at 0x7bf9b04011b0>

In [None]:
from sklearn.metrics import classification_report

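# model.predict returns an output object whose .logits holds the raw class scores
outputs = model.predict(test_dataset.batch(32))
# The predicted class is the index of the largest logit
y_pred = tf.argmax(outputs.logits, axis=1)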

y_pred_labels = label_encoder.inverse_transform(y_pred.numpy())
test_true_labels = label_encoder.inverse_transform(test_labels)


tf.Tensor([2 2 2 ... 4 1 0], shape=(3000,), dtype=int64)


In [None]:
results_df = pd.DataFrame({'Actual Labels': test_true_labels, 'Predicted Labels': y_pred_labels})

In [None]:
results_df

| | Actual Labels | Predicted Labels |
|---|---|---|
| 0 | joy | joy |
| 1 | joy | joy |
| 2 | joy | joy |
| 3 | fear | anger |
| 4 | anger | fear |
| ... | ... | ... |
| 2995 | love | love |
| 2996 | joy | joy |
| 2997 | sadness | sadness |
| 2998 | fear | fear |


In [None]:
print(classification_report(test_true_labels, y_pred_labels))

              precision    recall  f1-score   support

       anger       0.96      0.92      0.94       436
        fear       0.83      0.97      0.90       355
         joy       0.95      0.95      0.95       999
        love       0.84      0.86      0.85       235
     sadness       0.98      0.98      0.98       872
    surprise       0.97      0.54      0.70       103

    accuracy                           0.94      3000
   macro avg       0.92      0.87      0.89      3000
weighted avg       0.94      0.94      0.93      3000



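Overall, the model recognizes emotions well, reaching 94% accuracy on the 3,000 test examples with a macro-averaged F1-score of 0.89. Precision and recall are well balanced for most classes, giving F1-scores of 0.85 or higher for five of the six emotions. The exception is 'surprise', the smallest class with only 103 test examples. Its precision is high (0.97) but its recall is only 0.54, so the model misses almost half of the 'surprise' examples and its F1-score drops to 0.70.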
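
To see how the summary rows relate to the per-class rows, the sketch below recomputes the macro and weighted averages from the rounded figures in the report above. The macro average treats every class equally, so the weak 'surprise' class pulls it down, while the weighted average is dominated by the large classes. Because the inputs are already rounded to two decimals, the results are only expected to match the report after rounding.

```python
# (precision, recall, f1, support) per class, copied from the report above
report = {
    "anger":    (0.96, 0.92, 0.94, 436),
    "fear":     (0.83, 0.97, 0.90, 355),
    "joy":      (0.95, 0.95, 0.95, 999),
    "love":     (0.84, 0.86, 0.85, 235),
    "sadness":  (0.98, 0.98, 0.98, 872),
    "surprise": (0.97, 0.54, 0.70, 103),
}

total = sum(support for _, _, _, support in report.values())

# Macro average: plain mean over classes, ignoring class size
macro_f1 = sum(f1 for _, _, f1, _ in report.values()) / len(report)
macro_recall = sum(rec for _, rec, _, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1 * support for _, _, f1, support in report.values()) / total

print(total, round(macro_f1, 2), round(macro_recall, 2), round(weighted_f1, 2))
# 3000 0.89 0.87 0.93
```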