# Twitter Sentiment Analysis

## *You can download the dataset from here:* [Dataset Link](https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text)

#### In this project, I used <u>*SpaCy*</u> for data cleaning (removing punctuation, stop words, etc.) and <u>*Gensim*</u> for tokenization. I used a <u>*regular expression*</u> to remove the tags and finally a <u>*Word2Vec*</u> model to convert the text into vectors.

In [1]:
import gensim

In [2]:
import gensim.downloader as api
from gensim.models import Word2Vec

In [3]:
import pandas as pd

df = pd.read_csv("E:/Downloads/My Projects with datasets/Twitter Tweets Sentiment Analysis/tweet_emotions.csv")
df.head()

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,wants to hang out with friends SOON!
4,1956968416,neutral,@dannycastillo We want to trade with someone w...


In [4]:
df.isna().sum()

tweet_id     0
sentiment    0
content      0
dtype: int64

In [5]:
df.sentiment.value_counts()

sentiment
neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: count, dtype: int64

In [6]:
df.content[0]

'@tiffanylue i know  i was listenin to bad habit earlier and i started freakin at his part =['

In [7]:
df1 = df.copy()

# Text Cleaning and Preprocessing...

In [8]:
import re

def remove_tags(text):
    return re.sub(r"@\w+","",text)

df1['content'] = df1.content.apply(remove_tags)
df1.content[0]

' i know  i was listenin to bad habit earlier and i started freakin at his part =['
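The cleaned tweet above still begins with a space and contains a double space where the tag used to be. A minimal sketch, assuming it is acceptable to normalise whitespace at this stage as well (the helper name `remove_tags_and_squeeze` is hypothetical and not part of the original notebook):

```python
import re

TAG_PATTERN = re.compile(r"@\w+")    # same pattern as remove_tags above
SPACE_PATTERN = re.compile(r"\s+")   # any run of whitespace

def remove_tags_and_squeeze(text):
    """Drop @mentions, then collapse repeated whitespace and trim the ends."""
    without_tags = TAG_PATTERN.sub("", text)
    return SPACE_PATTERN.sub(" ", without_tags).strip()

sample = "@tiffanylue i know  i was listenin to bad habit earlier"
print(remove_tags_and_squeeze(sample))
```

The later tokenization steps discard extra whitespace anyway, so this change is mostly cosmetic.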

In [9]:
import spacy

#### I am using SpaCy for data cleaning: removing stop words and punctuation, and converting the text to lowercase

#### Function for data cleaning:

In [10]:
nlp = spacy.load("en_core_web_sm")
def preprocess(text):
    doc = nlp(text)
    tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        tokens.append(token.text.lower())
    return " ".join(tokens)

In [11]:
df1['content'] = df1['content'].apply(preprocess)
df1.content[0:5]

0      know   listenin bad habit earlier started fr...
1                 layin n bed headache   ughhhh waitin
2                       funeral ceremony gloomy friday
3                              wants hang friends soon
4                           want trade houston tickets
Name: content, dtype: object

In [12]:
df1.head()

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,know listenin bad habit earlier started fr...
1,1956967666,sadness,layin n bed headache ughhhh waitin
2,1956967696,sadness,funeral ceremony gloomy friday
3,1956967789,enthusiasm,wants hang friends soon
4,1956968416,neutral,want trade houston tickets


In [13]:
df1['content'] = df1.content.apply(gensim.utils.simple_preprocess) # Ignore the error shown here: it appeared only because I ran this cell twice

In [14]:
df1.content[:5]

0    [know, listenin, bad, habit, earlier, started,...
1               [layin, bed, headache, ughhhh, waitin]
2                  [funeral, ceremony, gloomy, friday]
3                         [wants, hang, friends, soon]
4                      [want, trade, houston, tickets]
Name: content, dtype: object

In [15]:
df1.head()

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,"[know, listenin, bad, habit, earlier, started,..."
1,1956967666,sadness,"[layin, bed, headache, ughhhh, waitin]"
2,1956967696,sadness,"[funeral, ceremony, gloomy, friday]"
3,1956967789,enthusiasm,"[wants, hang, friends, soon]"
4,1956968416,neutral,"[want, trade, houston, tickets]"


In [16]:
df1['content'] = df1.content.apply(lambda x: " ".join(x))
df1.head()

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,know listenin bad habit earlier started freakin
1,1956967666,sadness,layin bed headache ughhhh waitin
2,1956967696,sadness,funeral ceremony gloomy friday
3,1956967789,enthusiasm,wants hang friends soon
4,1956968416,neutral,want trade houston tickets


In [17]:
df2 = df1.drop('tweet_id',axis = 1)

In [18]:
df2.head()

Unnamed: 0,sentiment,content
0,empty,know listenin bad habit earlier started freakin
1,sadness,layin bed headache ughhhh waitin
2,sadness,funeral ceremony gloomy friday
3,enthusiasm,wants hang friends soon
4,neutral,want trade houston tickets


In [19]:
df2.shape

(40000, 2)

In [20]:
df2.isna().sum()

sentiment    0
content      0
dtype: int64

# Balancing The Dataset...

### I am using SMOTEN() here to balance the dataset. SMOTEN() is the SMOTE variant for nominal (categorical) features, so it can be applied to a column of text strings, with each distinct string treated as one category

In [21]:
from imblearn.over_sampling import SMOTEN
sampler = SMOTEN(random_state = 2002)

In [22]:
df2_content,df2_sentiment = sampler.fit_resample(df2[['content']],df2['sentiment'])

In [23]:
df2_sentiment.head()

0         empty
1       sadness
2       sadness
3    enthusiasm
4       neutral
Name: sentiment, dtype: object

In [24]:
df2_content.head()

Unnamed: 0,content
0,know listenin bad habit earlier started freakin
1,layin bed headache ughhhh waitin
2,funeral ceremony gloomy friday
3,wants hang friends soon
4,want trade houston tickets


In [25]:
df3 = pd.DataFrame()

In [26]:
df3 = pd.concat([df2_content,df2_sentiment],axis = 1)

In [27]:
df3.head()

Unnamed: 0,content,sentiment
0,know listenin bad habit earlier started freakin,empty
1,layin bed headache ughhhh waitin,sadness
2,funeral ceremony gloomy friday,sadness
3,wants hang friends soon,enthusiasm
4,want trade houston tickets,neutral


In [28]:
df3.sentiment.value_counts()

sentiment
empty         8638
sadness       8638
enthusiasm    8638
neutral       8638
worry         8638
surprise      8638
love          8638
fun           8638
hate          8638
happiness     8638
boredom       8638
relief        8638
anger         8638
Name: count, dtype: int64
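The balanced counts follow directly from the original distribution: every class is oversampled up to the size of the largest one. A small arithmetic check, assuming the sampler's default behaviour of resampling all classes except the majority up to the majority size:

```python
# Class counts before resampling, copied from the value_counts() output earlier
counts = {
    "neutral": 8638, "worry": 8459, "happiness": 5209, "sadness": 5165,
    "love": 3842, "surprise": 2187, "fun": 1776, "relief": 1526,
    "hate": 1323, "empty": 827, "enthusiasm": 759, "boredom": 179,
    "anger": 110,
}

majority = max(counts.values())                                # largest class size
added = {label: majority - n for label, n in counts.items()}   # rows added per class
balanced_total = majority * len(counts)

print(sum(counts.values()))   # original number of rows: 40000
print(balanced_total)         # rows after balancing: 112294
print(added["anger"])         # generated rows for the rarest class: 8528
```

The total of 112,294 matches the `w2v.corpus_count` value reported further down. Since 8,528 of the 8,638 anger rows are generated rather than original, the very high scores for the rarest classes later on should be read with that in mind.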

# Feature Engineering...

In [29]:
df4 = df3.copy()

In [30]:
df4.head()

Unnamed: 0,content,sentiment
0,know listenin bad habit earlier started freakin,empty
1,layin bed headache ughhhh waitin,sadness
2,funeral ceremony gloomy friday,sadness
3,wants hang friends soon,enthusiasm
4,want trade houston tickets,neutral


In [31]:
w2v = Word2Vec(
    window= 10,
    min_count=2,
    workers=4
)

In [32]:
df4['content'] = df4.content.apply(gensim.utils.simple_preprocess)

In [33]:
df4.content[112288]

['hate', 'storms']

***Building Vocabulary for my Word2Vec model..***

In [34]:
w2v.build_vocab(df4['content'],progress_per=1000)

In [35]:
w2v.corpus_count

112294

In [36]:
w2v.epochs

5

***Training my Word2Vec Model:***

In [36]:
w2v.train(df4.content, total_examples = w2v.corpus_count, epochs = 7)

(2548836, 4389742)

In [37]:
w2v.wv.key_to_index

{'time': 0,
 'bioshock': 1,
 'like': 2,
 'good': 3,
 'sorry': 4,
 'right': 5,
 'night': 6,
 'upset': 7,
 'copy': 8,
 'knows': 9,
 'sims': 10,
 'legit': 11,
 'ea': 12,
 'apperently': 13,
 'internet': 14,
 'online': 15,
 'dead': 16,
 'liking': 17,
 'livebox': 18,
 'tonight': 19,
 'best': 20,
 'soon': 21,
 'stay': 22,
 'lady': 23,
 'shall': 24,
 'decided': 25,
 'farewell': 26,
 'pittsburgh': 27,
 'tomorrow': 28,
 'better': 29,
 'sleep': 30,
 'shiit': 31,
 'feeeel': 32,
 'child': 33,
 'glad': 34,
 'apologised': 35,
 'okay': 36,
 'safe': 37,
 'played': 38,
 'fantastic': 39,
 'amazing': 40,
 'makes': 41,
 'beer': 42,
 'company': 43,
 'chicken': 44,
 'great': 45,
 'getting': 46,
 'day': 47,
 'aw': 48,
 'july': 49,
 'reasons': 50,
 'mood': 51,
 'http': 52,
 'got': 53,
 'work': 54,
 'quot': 55,
 'happy': 56,
 'love': 57,
 'today': 58,
 'nt': 59,
 'going': 60,
 'lol': 61,
 'com': 62,
 'know': 63,
 'amp': 64,
 'thanks': 65,
 'home': 66,
 'new': 67,
 'think': 68,
 'morning': 69,
 'oh': 70,
 'want'

In [38]:
import numpy as np

#### Function to convert sentences to word vectors:

In [40]:
def get_average_vector(sentence, model):
    # sentence is expected to be a list of tokens, as produced by simple_preprocess
    vector_list = []
    for word in sentence:
        if word in model.wv:
            vector_list.append(model.wv[word])
    if vector_list:
        return np.mean(vector_list, axis=0)
    else:
        return np.zeros(model.vector_size)
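To make the averaging concrete, here is a toy sketch of the same idea. It uses a plain dictionary of made-up 3-dimensional vectors as a stand-in for `w2v.wv`; the names `toy_vectors` and `average_vector` are illustrative and not part of the notebook:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for w2v.wv
toy_vectors = {
    "funeral": np.array([1.0, 0.0, 2.0]),
    "gloomy": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, size):
    """Mean of the known tokens' vectors, or zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)

# "ceremony" is not in the toy vocabulary, so it is skipped
print(average_vector(["funeral", "ceremony", "gloomy"], toy_vectors, 3))
# No known tokens at all: falls back to a zero vector
print(average_vector(["ughhhh"], toy_vectors, 3))
```

Unknown words are silently ignored, so a tweet made up entirely of out-of-vocabulary tokens ends up as the zero vector.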


In [73]:
df5 = df4.copy()

In [74]:
df5['content_vector'] = df5['content'].apply(lambda x: get_average_vector(x, w2v))

In [75]:
df5.content_vector

0         [-0.14128602, 0.091950305, 0.025827577, -0.020...
1         [-0.2824797, -0.07272955, -0.00021652206, 0.03...
2         [-0.30620986, -0.019180704, -0.13035025, 0.067...
3         [-0.9374821, 0.37097648, -0.1253948, 0.7537924...
4         [-0.17938209, 0.19556782, 0.11605928, 0.384373...
                                ...                        
112289    [0.13496372, -0.054701038, 0.26649117, 0.76622...
112290    [0.13496372, -0.054701038, 0.26649117, 0.76622...
112291    [0.30860466, 0.3969116, -0.77708006, 0.514085,...
112292    [0.13496372, -0.054701038, 0.26649117, 0.76622...
112293    [0.13496372, -0.054701038, 0.26649117, 0.76622...
Name: content_vector, Length: 112294, dtype: object

In [76]:
df5.isna().sum()

content           0
sentiment         0
content_vector    0
dtype: int64

In [77]:
df5.head()

Unnamed: 0,content,sentiment,content_vector
0,"[know, listenin, bad, habit, earlier, started,...",empty,"[-0.14128602, 0.091950305, 0.025827577, -0.020..."
1,"[layin, bed, headache, ughhhh, waitin]",sadness,"[-0.2824797, -0.07272955, -0.00021652206, 0.03..."
2,"[funeral, ceremony, gloomy, friday]",sadness,"[-0.30620986, -0.019180704, -0.13035025, 0.067..."
3,"[wants, hang, friends, soon]",enthusiasm,"[-0.9374821, 0.37097648, -0.1253948, 0.7537924..."
4,"[want, trade, houston, tickets]",neutral,"[-0.17938209, 0.19556782, 0.11605928, 0.384373..."


In [78]:
df5.content_vector[0]

array([-0.14128602,  0.0919503 ,  0.02582758, -0.02040857,  0.4036786 ,
       -0.30895033,  0.44488683,  0.14477934, -0.2633002 ,  0.10567006,
       -0.12466274, -0.49473903, -0.24031489, -0.13036118,  0.33984882,
       -0.2979195 , -0.29503193, -0.0266986 ,  0.14794147, -0.3322997 ,
       -0.08796207,  0.10037561, -0.36047584,  0.0220992 , -0.02464748,
       -0.37422752,  0.17843887, -0.43788633, -0.32111925, -0.25633413,
       -0.01223435,  0.03150762, -0.03962642,  0.01314007, -0.17837751,
        0.18412909,  0.03661321,  0.00764329, -0.0087408 , -0.13488027,
        0.17240727, -0.16894673,  0.09960955,  0.17656216,  0.23261161,
        0.06754521,  0.08710126,  0.08126742, -0.00840445, -0.16469704,
       -0.05738096, -0.05407866, -0.02778674,  0.04730238, -0.15498182,
        0.07617431, -0.10247932,  0.05184773, -0.23991361,  0.04039104,
        0.21173395,  0.34663263, -0.13509484, -0.10857193, -0.03518109,
        0.07976367,  0.23150727,  0.01024074, -0.19221583,  0.30

In [79]:
df5.content_vector.values[0]

array([-0.14128602,  0.0919503 ,  0.02582758, -0.02040857,  0.4036786 ,
       -0.30895033,  0.44488683,  0.14477934, -0.2633002 ,  0.10567006,
       -0.12466274, -0.49473903, -0.24031489, -0.13036118,  0.33984882,
       -0.2979195 , -0.29503193, -0.0266986 ,  0.14794147, -0.3322997 ,
       -0.08796207,  0.10037561, -0.36047584,  0.0220992 , -0.02464748,
       -0.37422752,  0.17843887, -0.43788633, -0.32111925, -0.25633413,
       -0.01223435,  0.03150762, -0.03962642,  0.01314007, -0.17837751,
        0.18412909,  0.03661321,  0.00764329, -0.0087408 , -0.13488027,
        0.17240727, -0.16894673,  0.09960955,  0.17656216,  0.23261161,
        0.06754521,  0.08710126,  0.08126742, -0.00840445, -0.16469704,
       -0.05738096, -0.05407866, -0.02778674,  0.04730238, -0.15498182,
        0.07617431, -0.10247932,  0.05184773, -0.23991361,  0.04039104,
        0.21173395,  0.34663263, -0.13509484, -0.10857193, -0.03518109,
        0.07976367,  0.23150727,  0.01024074, -0.19221583,  0.30

In [80]:
# Caution: as the next output shows, this assignment keeps only the first component of each vector
df5['content_vector'] = np.vstack(df5["content_vector"].values)

In [81]:
df5.content_vector[0]

-0.14128601551055908
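The value printed above is only the first component of the vector shown in the previous cells, so from this point on each tweet is represented by a single number. A minimal sketch of one way to keep every dimension, assuming the features are held in a separate 2-D array rather than in a DataFrame column (the toy vectors and the name `feature_matrix` are illustrative):

```python
import numpy as np

# Three toy sentence vectors standing in for df5["content_vector"].values
sentence_vectors = [
    np.array([-0.14, 0.09, 0.03, -0.02]),
    np.array([-0.28, -0.07, 0.00, 0.03]),
    np.array([-0.31, -0.02, -0.13, 0.07]),
]

# Stack the per-row vectors into one (n_samples, n_dimensions) matrix
feature_matrix = np.vstack(sentence_vectors)

print(feature_matrix.shape)   # (3, 4)
print(feature_matrix[0])      # the full first vector, not just its first entry
```

A matrix of this shape can be passed to `train_test_split` and to scikit-learn estimators as `x` directly.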

In [148]:
# df5["cv"] = list(cv)  # Convert back to a list of (100,) arrays
# df5.cv[0]

In [82]:
df5.head()

Unnamed: 0,content,sentiment,content_vector
0,"[know, listenin, bad, habit, earlier, started,...",empty,-0.141286
1,"[layin, bed, headache, ughhhh, waitin]",sadness,-0.28248
2,"[funeral, ceremony, gloomy, friday]",sadness,-0.30621
3,"[wants, hang, friends, soon]",enthusiasm,-0.937482
4,"[want, trade, houston, tickets]",neutral,-0.179382


In [83]:
df_final = df5.drop(["content"],axis =1)
df_final.head()

Unnamed: 0,sentiment,content_vector
0,empty,-0.141286
1,sadness,-0.28248
2,sadness,-0.30621
3,enthusiasm,-0.937482
4,neutral,-0.179382


In [84]:
df_final.sentiment.value_counts()

sentiment
empty         8638
sadness       8638
enthusiasm    8638
neutral       8638
worry         8638
surprise      8638
love          8638
fun           8638
hate          8638
happiness     8638
boredom       8638
relief        8638
anger         8638
Name: count, dtype: int64

In [85]:
df_final.sentiment.values

array(['empty', 'sadness', 'sadness', ..., 'worry', 'worry', 'worry'],
      dtype=object)

In [86]:
df_final['num_sentiment'],_ = pd.factorize(df_final['sentiment'])
df_final.head()

Unnamed: 0,sentiment,content_vector,num_sentiment
0,empty,-0.141286,0
1,sadness,-0.28248,1
2,sadness,-0.30621,1
3,enthusiasm,-0.937482,2
4,neutral,-0.179382,3


In [87]:
df_final[df_final.num_sentiment == 12]

Unnamed: 0,sentiment,content_vector,num_sentiment
494,anger,-0.096789,12
527,anger,-0.477329,12
612,anger,0.328363,12
1377,anger,-0.092990,12
1384,anger,-0.171938,12
...,...,...,...
48523,anger,-0.401329,12
48524,anger,-0.401329,12
48525,anger,-0.401329,12
48526,anger,-0.401329,12


In [88]:
df_final.sentiment.value_counts()

sentiment
empty         8638
sadness       8638
enthusiasm    8638
neutral       8638
worry         8638
surprise      8638
love          8638
fun           8638
hate          8638
happiness     8638
boredom       8638
relief        8638
anger         8638
Name: count, dtype: int64

In [89]:
df_final.isna().sum()

sentiment         0
content_vector    0
num_sentiment     0
dtype: int64

In [90]:
x = df_final[['content_vector']]
y = df_final.sentiment.values

In [91]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [92]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 42,stratify = y)

In [93]:
x_train.shape,y_train.shape

((89835, 1), (89835,))

# KNN Model

In [94]:
knn_model = KNeighborsClassifier(n_neighbors=35,metric ="euclidean")

In [95]:
knn_model.fit(x_train,y_train)

In [96]:
from sklearn.metrics import classification_report

In [97]:
y_pred = knn_model.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

       anger       1.00      0.99      0.99      1727
     boredom       0.99      0.98      0.98      1728
       empty       0.85      0.91      0.88      1728
  enthusiasm       0.94      0.91      0.92      1728
         fun       0.97      0.80      0.87      1728
   happiness       0.54      0.44      0.48      1727
        hate       0.97      0.86      0.91      1728
        love       0.75      0.60      0.67      1728
     neutral       0.22      0.39      0.28      1727
      relief       0.94      0.80      0.86      1727
     sadness       0.62      0.44      0.52      1728
    surprise       0.94      0.74      0.83      1727
       worry       0.24      0.36      0.29      1728

    accuracy                           0.71     22459
   macro avg       0.77      0.71      0.73     22459
weighted avg       0.77      0.71      0.73     22459
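
The f1-score column is the harmonic mean of precision and recall. A short check against a few rows of the report above, using the rounded values as printed (so other rows may differ by 0.01 when recomputed this way):

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded (precision, recall) pairs taken from the KNN report above
rows = {"neutral": (0.22, 0.39), "happiness": (0.54, 0.44), "worry": (0.24, 0.36)}

for label, (p, r) in rows.items():
    print(label, round(f1_score_from(p, r), 2))
```

These three rows reproduce the reported 0.28, 0.48, and 0.29, which are also the weakest classes in the report.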



In [98]:
knn_model.score(x_test,y_test)

0.7087136559953693

# Random Forest Model:

In [99]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=37)
rf.fit(x_train,y_train)

In [100]:
print(classification_report(y_test,rf.predict(x_test)))

              precision    recall  f1-score   support

       anger       0.99      0.99      0.99      1727
     boredom       0.98      0.98      0.98      1728
       empty       0.83      0.91      0.87      1728
  enthusiasm       0.90      0.91      0.91      1728
         fun       0.80      0.81      0.80      1728
   happiness       0.48      0.47      0.48      1727
        hate       0.86      0.87      0.86      1728
        love       0.62      0.62      0.62      1728
     neutral       0.24      0.23      0.23      1727
      relief       0.83      0.81      0.82      1727
     sadness       0.49      0.48      0.48      1728
    surprise       0.77      0.75      0.76      1727
       worry       0.24      0.24      0.24      1728

    accuracy                           0.70     22459
   macro avg       0.69      0.70      0.70     22459
weighted avg       0.69      0.70      0.70     22459



# KNN Model Using the Minkowski Metric:

In [101]:
knn_model_min = KNeighborsClassifier(n_neighbors=47,metric ="minkowski")
knn_model_min.fit(x_train,y_train)

In [102]:
y_pred_min = knn_model_min.predict(x_test)
print(classification_report(y_test,y_pred_min))

              precision    recall  f1-score   support

       anger       1.00      0.99      0.99      1727
     boredom       0.98      0.98      0.98      1728
       empty       0.84      0.91      0.87      1728
  enthusiasm       0.94      0.91      0.92      1728
         fun       0.97      0.80      0.87      1728
   happiness       0.60      0.42      0.50      1727
        hate       0.97      0.86      0.91      1728
        love       0.77      0.59      0.67      1728
     neutral       0.22      0.40      0.28      1727
      relief       0.93      0.80      0.86      1727
     sadness       0.68      0.43      0.52      1728
    surprise       0.93      0.74      0.82      1727
       worry       0.23      0.39      0.29      1728

    accuracy                           0.71     22459
   macro avg       0.77      0.71      0.73     22459
weighted avg       0.77      0.71      0.73     22459
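
This report is almost identical to the first KNN report, which is expected: in scikit-learn's `KNeighborsClassifier`, the Minkowski metric defaults to `p=2`, and Minkowski distance with `p=2` is the Euclidean distance. The two models therefore differ only in `n_neighbors` (47 versus 35). A small pure-Python sketch of the distance itself, with made-up points:

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a = [0.0, 3.0]
b = [4.0, 0.0]

print(minkowski(a, b, 2))   # order 2: Euclidean distance
print(math.dist(a, b))      # standard-library Euclidean distance
print(minkowski(a, b, 1))   # order 1: Manhattan distance, for comparison
```

To actually compare a different metric, the `p` parameter would need to be changed (for example `p=1`).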




# KNN Model Predictions:

In [104]:
def convert_to_vector(text):
    df = pd.DataFrame({'tweet':[text]})
    def get_average_vector(sentence, model):
        words = sentence.split()
        vector_list = []
        for word in words:
            if word in model.wv:
                vector_list.append(model.wv[word])
        if vector_list:
            return np.mean(vector_list, axis=0)
        else:
            return np.zeros(model.vector_size)

    df['vector'] = df['tweet'].apply(lambda x: get_average_vector(x, w2v))

    df['vector'] = np.vstack(df["vector"].values)
    return df.vector
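
Note that `convert_to_vector` splits the raw tweet on whitespace, whereas the training text had tags, stop words, and punctuation removed first. A rough sketch of applying comparable cleaning to a new tweet before vectorising it; the tiny stop-word set is a hypothetical stand-in for spaCy's list, and `prepare_tweet` is not part of the original notebook:

```python
import re

# Hypothetical, deliberately tiny stop-word list (spaCy's real list is much larger)
STOP_WORDS = {"i", "was", "to", "and", "at", "his", "the", "a"}

def prepare_tweet(text):
    """Remove @mentions, lowercase, keep alphabetic tokens, drop stop words."""
    text = re.sub(r"@\w+", "", text).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

tweet = "@tiffanylue i know  i was listenin to bad habit earlier"
print(prepare_tweet(tweet))
```

The resulting token list could then be averaged with the list-based `get_average_vector(tokens, w2v)` defined in the feature engineering section.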

In [105]:
tweet = '@tiffanylue i know  i was listenin to bad habit earlier and i started freakin at his part =['
t = convert_to_vector(tweet)

In [106]:
# knn_model.predict(t.reshape(1, -1))  # Reshape to (1, 100)

knn_model.predict([t])



array(['neutral'], dtype=object)

In [107]:
knn_model.predict(x_test[:5])

array(['hate', 'sadness', 'anger', 'surprise', 'worry'], dtype=object)

In [108]:
y_test[:5]

array(['hate', 'sadness', 'anger', 'surprise', 'sadness'], dtype=object)

In [109]:
knn_model.predict(x_test[50:70])

array(['enthusiasm', 'neutral', 'love', 'neutral', 'surprise', 'sadness',
       'hate', 'worry', 'enthusiasm', 'hate', 'enthusiasm', 'empty',
       'neutral', 'enthusiasm', 'anger', 'love', 'love', 'fun',
       'enthusiasm', 'enthusiasm'], dtype=object)

In [110]:
y_test[50:70]

array(['enthusiasm', 'worry', 'love', 'neutral', 'surprise', 'sadness',
       'hate', 'fun', 'enthusiasm', 'hate', 'enthusiasm', 'empty',
       'worry', 'enthusiasm', 'anger', 'love', 'love', 'fun',
       'enthusiasm', 'enthusiasm'], dtype=object)

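### You can see that out of these 20 predictions, only 3 are wrong (17 correct). This sample does better than the overall test accuracy of about 0.71, so the model works reasonably well, although it remains weak on the neutral and worry classes.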
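
The count of wrong predictions can be verified by comparing the two arrays printed above element by element. A small sketch using those outputs copied into plain lists:

```python
# Outputs of knn_model.predict(x_test[50:70]) and y_test[50:70], copied from above
predicted = ['enthusiasm', 'neutral', 'love', 'neutral', 'surprise', 'sadness',
             'hate', 'worry', 'enthusiasm', 'hate', 'enthusiasm', 'empty',
             'neutral', 'enthusiasm', 'anger', 'love', 'love', 'fun',
             'enthusiasm', 'enthusiasm']
actual = ['enthusiasm', 'worry', 'love', 'neutral', 'surprise', 'sadness',
          'hate', 'fun', 'enthusiasm', 'hate', 'enthusiasm', 'empty',
          'worry', 'enthusiasm', 'anger', 'love', 'love', 'fun',
          'enthusiasm', 'enthusiasm']

# Collect (position, predicted, actual) for every disagreement
mismatches = [(i, p, a) for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]

print(len(mismatches))                                # 3
print((len(actual) - len(mismatches)) / len(actual))  # 0.85
for position, p, a in mismatches:
    print(position, p, a)
```

All three errors involve the neutral and worry labels, in line with their low scores in the classification reports.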