##IMPORTING NECESSARY LIBRARIES

In [1]:
import pandas as pd
import numpy as np
import re
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

##MACHINE LEARNING MODELS
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier

## DEEP LEARNING MODELS
from  tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Dense, Dropout, LSTM


## ACCURACY CHECKING
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, f1_score

In [2]:
### READING THE DATASET

data = pd.read_csv('/content/tweet_emotions.csv', encoding = 'latin-1')
pd.set_option('display.max_colwidth', None)
data.head()

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,@tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =[
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin on your call...
2,1956967696,sadness,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,wants to hang out with friends SOON!
4,1956968416,neutral,"@dannycastillo We want to trade with someone who has Houston tickets, but no one will."


##DATA PREPROCESSING

In [3]:
data.info() # the data has 40000 entries with 3 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   tweet_id   40000 non-null  int64 
 1   sentiment  40000 non-null  object
 2   content    40000 non-null  object
dtypes: int64(1), object(2)
memory usage: 937.6+ KB


In [4]:
## As the id column is not relevant to the classification of emotions, we can drop the column

data = data.drop(["tweet_id"], axis = 1)
data.head()

Unnamed: 0,sentiment,content
0,empty,@tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =[
1,sadness,Layin n bed with a headache ughhhh...waitin on your call...
2,sadness,Funeral ceremony...gloomy friday...
3,enthusiasm,wants to hang out with friends SOON!
4,neutral,"@dannycastillo We want to trade with someone who has Houston tickets, but no one will."


In [5]:
data.isna().sum()

Unnamed: 0,0
sentiment,0
content,0


Neither column contains missing values, so no imputation or row removal is needed.



In [6]:
data["sentiment"].value_counts()

sentiment,count
neutral,8638
worry,8459
happiness,5209
sadness,5165
love,3842
surprise,2187
fun,1776
relief,1526
hate,1323
empty,827


As we can see, there are 13 sentiment labels in this dataset (the output above lists only the ten most frequent), and some of them belong to similar classes.

For example:

happiness, love, and fun can be considered positive emotions;

worry, sadness, hate, and anger can be considered negative emotions;

neutral, relief, empty, and boredom can be considered neutral emotions;

surprise and enthusiasm can be considered excitement.

To simplify the task, we map the 13 labels onto these four classes.

In [7]:
# mapping different emotions to similar class

emotion_map = {
    "neutral" : "NEUTRAL",
    "relief" : "NEUTRAL",
    "empty" : "NEUTRAL",
    "boredom" : "NEUTRAL",
    "fun" : "POSITIVE OR HAPPY",
    "happiness" : "POSITIVE OR HAPPY",
    "love" : "POSITIVE OR HAPPY",
    "enthusiasm" : "EXCITEMENT",
    "surprise": "EXCITEMENT",
    "worry" : "NEGATIVE OR SAD",
    "hate" : "NEGATIVE OR SAD",
    "anger" : "NEGATIVE OR SAD",
    "sadness" : "NEGATIVE OR SAD",
}

# Adding mapped emotion to the dataset
data['sentiment'] = data['sentiment'].map(emotion_map)
print(data['sentiment'].value_counts())


sentiment
NEGATIVE OR SAD      15057
NEUTRAL              11170
POSITIVE OR HAPPY    10827
EXCITEMENT            2946
Name: count, dtype: int64


The tweets contain special characters, numbers, URLs, mentions, and hashtags, which we remove to make the text easier to analyse.
They also contain upper-case letters, which we convert to lower case.

In [8]:
def clean_data(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and whitespace (also strips digits, '@' and '#')
    text = re.sub(r'\d+', '', text)  # Remove numbers (no-op here: digits are already gone)
    text = re.sub(r'http\S+', '', text)  # Remove what is left of URLs
    text = re.sub(r'@\w+', '', text)  # Remove mentions (no-op here: '@' is already gone)
    text = re.sub(r'#\w+', '', text)  # Remove hashtags (no-op here: '#' is already gone)
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace
    return text


data['content'] = data['content'].astype(str).apply(clean_data)

In [9]:
data.head(20)

Unnamed: 0,sentiment,content
0,NEUTRAL,tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part
1,NEGATIVE OR SAD,layin n bed with a headache ughhhhwaitin on your call
2,NEGATIVE OR SAD,funeral ceremonygloomy friday
3,EXCITEMENT,wants to hang out with friends soon
4,NEUTRAL,dannycastillo we want to trade with someone who has houston tickets but no one will
5,NEGATIVE OR SAD,repinging ghostridah why didnt you go to prom bc my bf didnt like my friends
6,NEGATIVE OR SAD,i should be sleep but im not thinking about an old friend who i want but hes married now damn amp he wants me scandalous
7,NEGATIVE OR SAD,hmmm is down
8,NEGATIVE OR SAD,charviray charlene my love i miss you
9,NEGATIVE OR SAD,kelcouch im sorry at least its friday


After cleaning, the content column contains only lowercase letters and single spaces.
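In the output above, usernames survive without their `@` (for example `tiffanylue`) and words separated only by punctuation are fused (`ughhhhwaitin`), because `clean_data` strips every non-letter character before the mention and hashtag patterns run. A minimal sketch of a reordered variant follows. The name `clean_tweet` is hypothetical, and this variant was not used for any of the results reported in this notebook.

```python
import re

def clean_tweet(text: str) -> str:
    """Hypothetical variant of clean_data with the substitutions reordered."""
    text = text.lower()
    text = re.sub(r'http\S+', ' ', text)   # URLs first, while '://' is still present
    text = re.sub(r'@\w+', ' ', text)      # mentions, while '@' is still present
    text = re.sub(r'#\w+', ' ', text)      # hashtags, while '#' is still present
    text = re.sub(r'[^a-z\s]', ' ', text)  # any other non-letter becomes a space
    return re.sub(r'\s+', ' ', text).strip()

print(clean_tweet("@tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =["))
# i know i was listenin to bad habit earlier and i started freakin at his part
print(clean_tweet("Layin n bed with a headache ughhhh...waitin on your call..."))
# layin n bed with a headache ughhhh waitin on your call
```

Replacing punctuation with a space keeps neighbouring words apart, at the cost of splitting contractions (`didn't` becomes `didn t`). Whether that trade-off helps the classifiers would need to be checked by re-running the models.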


##ENCODING THE DATASET





After mapping, only four classes remain.
Because the models need numerical targets, we use LabelEncoder to convert the sentiment column from class names to integer labels.

In [10]:
le = LabelEncoder()
data["sentiment"] = le.fit_transform(data["sentiment"])
num_class = len(le.classes_)
data.head()


Unnamed: 0,sentiment,content
0,2,tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part
1,1,layin n bed with a headache ughhhhwaitin on your call
2,1,funeral ceremonygloomy friday
3,0,wants to hang out with friends soon
4,2,dannycastillo we want to trade with someone who has houston tickets but no one will


In [11]:
data['sentiment'].value_counts()

sentiment,count
1,15057
2,11170
3,10827
0,2946
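LabelEncoder assigns integers in sorted order of the class names, which is why NEGATIVE OR SAD (15057 rows) ends up as label 1. The sketch below reproduces that ordering in plain Python so the mapping can be read off directly. The variable names are illustrative only; inside the notebook the same information is available from `le.classes_`.

```python
# The four class names produced by emotion_map
class_names = ["NEUTRAL", "POSITIVE OR HAPPY", "EXCITEMENT", "NEGATIVE OR SAD"]

# LabelEncoder sorts the unique labels and numbers them from 0
label_to_id = {name: idx for idx, name in enumerate(sorted(class_names))}

for name, idx in label_to_id.items():
    print(idx, name)
# 0 EXCITEMENT
# 1 NEGATIVE OR SAD
# 2 NEUTRAL
# 3 POSITIVE OR HAPPY
```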


##SPLITTING THE DATASET




Now we split the dataset into training and test sets, with 80% of the data used for training and 20% for testing. The split is stratified on the label so both sets keep the same class proportions.

In [12]:
x = data["content"]
y = data["sentiment"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42, stratify = y)

## FEATURE EXTRACTION


####TOKENIZE AND PAD THE DATA

In [18]:
## TF-IDF features for the machine learning models

vectorizer = TfidfVectorizer(max_features=5000)
x_train_t = vectorizer.fit_transform(x_train)
x_test_t = vectorizer.transform(x_test)
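To make the TF-IDF weighting concrete, here is a small pure-Python sketch on two cleaned tweets from the table above. It assumes the scikit-learn defaults as documented (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and uses a simple whitespace split instead of the vectorizer's own tokenizer, so it is an illustration rather than a reimplementation.

```python
import math
from collections import Counter

docs = [
    "funeral ceremonygloomy friday",
    "kelcouch im sorry at least its friday",
]
tokenized = [d.split() for d in docs]
n_docs = len(docs)

# Document frequency: in how many tweets each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenized tweet."""
    tf = Counter(tokens)
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

`friday` occurs in both tweets, so it receives a lower weight than `funeral`, which occurs in only one. This is the effect that lets the linear models focus on distinctive words.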

In [14]:
## Tokenization is used to break text into smaller units called tokens.

tk = Tokenizer(num_words = 10000)
tk.fit_on_texts(x_train)
x_train_seq = tk.texts_to_sequences(x_train)
x_test_seq = tk.texts_to_sequences(x_test)

In [15]:
## Padding is done so that all sequences will be of same length when fed into the model

x_train_pad = pad_sequences(x_train_seq, maxlen = 100, padding='post')
x_test_pad = pad_sequences(x_test_seq, maxlen = 100, padding='post')
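The difference between post- and pre-padding is easy to see on a short example. The helper below is an illustrative stand-in written for this note, not the Keras function, and the token ids are made up.

```python
def pad(seq, maxlen, padding="post", value=0):
    """Pad or truncate one integer sequence to maxlen (illustrative helper)."""
    seq = list(seq)[-maxlen:]  # keep the last maxlen tokens, like the Keras default truncating='pre'
    fill = [value] * (maxlen - len(seq))
    return seq + fill if padding == "post" else fill + seq

tokens = [12, 7, 431, 9]  # hypothetical token ids for a four-word tweet

post = pad(tokens, maxlen=10, padding="post")
pre = pad(tokens, maxlen=10, padding="pre")
print(post)  # [12, 7, 431, 9, 0, 0, 0, 0, 0, 0]
print(pre)   # [0, 0, 0, 0, 0, 0, 12, 7, 431, 9]
```

With `padding='post'` and `maxlen = 100`, a short tweet ends in a long run of zeros, and a recurrent layer without masking reads those zeros last, which can make training harder. Two standard Keras options to consider are `pad_sequences(..., padding='pre')` and `Embedding(..., mask_zero=True)`; neither was tried in this notebook.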

## MODEL BUILDING AND EVALUATION

###MACHINE LEARNING MODELS



##### 1) LOGISTIC REGRESSION

In [19]:
lr = LogisticRegression()
lr.fit(x_train_t, y_train)
y_pred_lr = lr.predict(x_test_t)

print("ACCURACY: ",accuracy_score(y_test, y_pred_lr),"\n")

ACCURACY:  0.55025 



In [48]:
# Displaying a sample prediction

for i in range(5):
    print(f"Tweet: {x_test.values[i]}")
    print(f"Predicted Sentiment: {le.inverse_transform([y_pred_lr[i]])[0]}")
    print(f"Actual Sentiment: {le.inverse_transform([y_test.values[i]])[0]}")
    print("-" * 50)

Tweet: a simple nice dinner doesnt exist in my world
Predicted Sentiment: POSITIVE OR HAPPY
Actual Sentiment: NEGATIVE OR SAD
--------------------------------------------------
Tweet: does anyone know any good rap songs i need to make a rapfun cd and i have no idea helllllpppp
Predicted Sentiment: NEGATIVE OR SAD
Actual Sentiment: NEGATIVE OR SAD
--------------------------------------------------
Tweet: currently playing part of the list
Predicted Sentiment: NEUTRAL
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------
Tweet: heading to bed with tea to finish breaking dawn
Predicted Sentiment: NEUTRAL
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------
Tweet: morning all its beautiful out already glorious sunshine wash and vacum car done fill with petrol done bbq stuff buy done
Predicted Sentiment: POSITIVE OR HAPPY
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------


##### 2) XGBOOST

In [23]:
xgb_model = xgb.XGBClassifier()
xgb_model.fit(x_train_t, y_train)
y_pred_xgb = xgb_model.predict(x_test_t)
acc_xgb = accuracy_score(y_test, y_pred_xgb)


print("Accuracy for XGBoost = ",acc_xgb)

Accuracy for XGBoost =  0.53175


In [47]:
# Displaying a sample prediction

for i in range(5):
    print(f"Tweet: {x_test.values[i]}")
    print(f"Predicted Sentiment: {le.inverse_transform([y_pred_xgb[i]])[0]}")
    print(f"Actual Sentiment: {le.inverse_transform([y_test.values[i]])[0]}")
    print("-" * 50)

Tweet: a simple nice dinner doesnt exist in my world
Predicted Sentiment: POSITIVE OR HAPPY
Actual Sentiment: NEGATIVE OR SAD
--------------------------------------------------
Tweet: does anyone know any good rap songs i need to make a rapfun cd and i have no idea helllllpppp
Predicted Sentiment: NEGATIVE OR SAD
Actual Sentiment: NEGATIVE OR SAD
--------------------------------------------------
Tweet: currently playing part of the list
Predicted Sentiment: NEUTRAL
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------
Tweet: heading to bed with tea to finish breaking dawn
Predicted Sentiment: NEUTRAL
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------
Tweet: morning all its beautiful out already glorious sunshine wash and vacum car done fill with petrol done bbq stuff buy done
Predicted Sentiment: NEGATIVE OR SAD
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------


##### 3) RANDOM FOREST

In [25]:
rf = RandomForestClassifier()
rf.fit(x_train_t,y_train)
y_pred_rf = rf.predict(x_test_t)
acc_rf = accuracy_score(y_test,y_pred_rf)


print("Accuracy for random forest model = ",acc_rf)

Accuracy for random forest model =  0.52975


In [46]:
# Displaying a sample prediction

for i in range(5):
    print(f"Tweet: {x_test.values[i]}")
    print(f"Predicted Sentiment: {le.inverse_transform([y_pred_rf[i]])[0]}")
    print(f"Actual Sentiment: {le.inverse_transform([y_test.values[i]])[0]}")
    print("-" * 50)

Tweet: a simple nice dinner doesnt exist in my world
Predicted Sentiment: POSITIVE OR HAPPY
Actual Sentiment: NEGATIVE OR SAD
--------------------------------------------------
Tweet: does anyone know any good rap songs i need to make a rapfun cd and i have no idea helllllpppp
Predicted Sentiment: NEGATIVE OR SAD
Actual Sentiment: NEGATIVE OR SAD
--------------------------------------------------
Tweet: currently playing part of the list
Predicted Sentiment: NEUTRAL
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------
Tweet: heading to bed with tea to finish breaking dawn
Predicted Sentiment: NEUTRAL
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------
Tweet: morning all its beautiful out already glorious sunshine wash and vacum car done fill with petrol done bbq stuff buy done
Predicted Sentiment: POSITIVE OR HAPPY
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------


### DEEP LEARNING MODELS

##### 1) LONG SHORT-TERM MEMORY (LSTM): Used to learn long-term dependencies in data

In [26]:
## Build the LSTM model: embedding -> LSTM -> dense classifier

model_LSTM = Sequential([
    Embedding(input_dim = 10000, output_dim=128, input_length= 100),  # vocabulary size and sequence length match the tokenizer and padding above
    LSTM(128, return_sequences=False),  # keep only the final hidden state
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(num_class, activation='softmax')  # output layer: one softmax probability per class
])

In [27]:
model_LSTM.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [29]:
### TRAINING THE MODEL

history = model_LSTM.fit(
    x_train_pad, y_train,
    validation_split= 0.2,
    epochs= 20,
    batch_size= 32,
    verbose= 1
)

Epoch 1/20
800/800 ━━━━━━━━━━━━━━━━━━━━ 8s 10ms/step - accuracy: 0.3744 - loss: 1.2761 - val_accuracy: 0.3850 - val_loss: 1.2627
Epoch 2/20
800/800 ━━━━━━━━━━━━━━━━━━━━ 8s 9ms/step - accuracy: 0.3673 - loss: 1.2742 - val_accuracy: 0.3850 - val_loss: 1.2629
Epoch 3/20
800/800 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.3673 - loss: 1.2754 - val_accuracy: 0.3850 - val_loss: 1.2639
Epoch 4/20
800/800 ━━━━━━━━━━━━━━━━━━━━ 10s 9ms/step - accuracy: 0.3780 - loss: 1.2717 - val_accuracy: 0.3850 - val_loss: 1.2631
Epoch 5/20
800/800 ━━━━━━━━━━━━━━━━━━━━ 11s 9ms/step - accuracy: 0.3730 - loss: 1.2758 - val_accuracy: 0.3850 - val_loss: 1.2621
Epoch 6/20
800/800 ━━━━━━━━━━━━━━━━━━━━ 11s 10ms/step - accuracy: 0.3805 - loss: 1.2685 - val_accuracy: 0.3850 - val_loss: 1.2632
Epoch 7/20
800/800

In [30]:
#EVALUATE THE MODEL

model_LSTM.evaluate(x_test_pad, y_test)

250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.3789 - loss: 1.2620


[1.2698235511779785, 0.3765000104904175]

In [31]:
# Evaluate the model on the test set

test_loss, test_accuracy = model_LSTM.evaluate(x_test_pad, y_test, verbose=1)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")
print(f"Test Loss: {test_loss:.4f}")

250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.3789 - loss: 1.2620
Test Accuracy: 37.65%
Test Loss: 1.2698


In [43]:
# Make predictions on the test set

y_pred = model_LSTM.predict(x_test_pad)
y_pred_classes = y_pred.argmax(axis=1)

250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step


In [44]:
# Displaying a sample prediction

for i in range(5):
    print(f"Tweet: {x_test.values[i]}")
    print(f"Predicted Sentiment: {le.inverse_transform([y_pred_classes[i]])[0]}")
    print(f"Actual Sentiment: {le.inverse_transform([y_test.values[i]])[0]}")
    print("-" * 50)

Tweet: a simple nice dinner doesnt exist in my world
Predicted Sentiment: NEGATIVE OR SAD
Actual Sentiment: NEGATIVE OR SAD
--------------------------------------------------
Tweet: does anyone know any good rap songs i need to make a rapfun cd and i have no idea helllllpppp
Predicted Sentiment: NEGATIVE OR SAD
Actual Sentiment: NEGATIVE OR SAD
--------------------------------------------------
Tweet: currently playing part of the list
Predicted Sentiment: NEGATIVE OR SAD
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------
Tweet: heading to bed with tea to finish breaking dawn
Predicted Sentiment: NEGATIVE OR SAD
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------
Tweet: morning all its beautiful out already glorious sunshine wash and vacum car done fill with petrol done bbq stuff buy done
Predicted Sentiment: NEGATIVE OR SAD
Actual Sentiment: POSITIVE OR HAPPY
--------------------------------------------------


In [45]:
# Checking the distribution of 4 classes in the data

unique, counts = np.unique(y_train, return_counts=True)
class_distribution = dict(zip(unique, counts))
print("Class Distribution in Training Set:", class_distribution)

Class Distribution in Training Set: {0: 2357, 1: 12045, 2: 8936, 3: 8662}
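These counts allow a quick baseline check: how accurate would a model be if it always predicted the largest class? The sketch below derives the test-set counts by subtracting the training counts above from the full-dataset counts printed after label encoding. Only numbers already shown in this notebook are used; the variable names are illustrative.

```python
# Full-dataset counts (after label encoding) and training-set counts, copied from the outputs above
total_counts = {0: 2946, 1: 15057, 2: 11170, 3: 10827}
train_counts = {0: 2357, 1: 12045, 2: 8936, 3: 8662}

# Whatever is not in the training set is in the test set
test_counts = {label: total_counts[label] - train_counts[label] for label in total_counts}

majority_label = max(test_counts, key=test_counts.get)
baseline_accuracy = test_counts[majority_label] / sum(test_counts.values())

print(test_counts)                                  # {0: 589, 1: 3012, 2: 2234, 3: 2165}
print(majority_label, round(baseline_accuracy, 4))  # 1 0.3765
```

The baseline of 0.3765 equals the LSTM test accuracy reported above, which suggests the network is predicting label 1 (NEGATIVE OR SAD) for every tweet rather than learning from the text.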


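Comparing the test accuracies:

Logistic regression: 55.025%

XGBoost: 53.175%

Random forest: 52.975%

LSTM: 37.65%


Logistic regression on TF-IDF features gives the highest test accuracy of the four models on the tweet emotions dataset, so we select it as the best model here. The LSTM did not learn: its validation accuracy stayed at 0.3850 in every logged epoch, and all five of its sample predictions are NEGATIVE OR SAD, the largest class.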