<a href="https://colab.research.google.com/github/Harshitnitw/generative-ai-practice/blob/main/sentiment_analysis_using_word2vec_and_ML_algorithms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Data link](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

#### Key insights:

- Converting each review to vectors token by token (word by word) with word2vec gives one vector per token, a list of token vectors per review, and a list of reviews for the whole dataset. The result is in effect a 3-dimensional array, and a ragged one, since reviews differ in length. ML algorithms need a 2-dimensional feature matrix for training, so we take the column-wise mean (`axis=0`) of the token vectors within each review, which yields one fixed-length vector per review.

- If `min_count` is set to 2 or more when configuring the word2vec model, rarer tokens are left out of the model vocabulary, and looking them up raises a `KeyError`. Each token therefore has to be checked against the vocabulary while iterating over the corpus to build the vectors. We also need to handle the edge case where none of the tokens in a review are in the model vocabulary.

- Bag of words and TF-IDF had accuracy scores similar to word2vec with RandomForest; with naive Bayes, word2vec in fact performed worse than the other two.
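The shape argument in the first insight can be illustrated with toy data. This is a minimal sketch, not the notebook's pipeline: the vectors are random stand-ins for word2vec output, and `vector_size = 4` is an assumed toy value (the model trained below uses 100).

```python
import numpy as np

rng = np.random.default_rng(0)
vector_size = 4  # toy size, assumed for illustration only

# Two "reviews" with different token counts; each is a
# (n_tokens, vector_size) array of stand-in token vectors.
review_a = rng.normal(size=(3, vector_size))
review_b = rng.normal(size=(5, vector_size))

# Averaging over axis=0 collapses the token dimension,
# leaving one fixed-length vector per review.
pooled = np.vstack([review.mean(axis=0) for review in (review_a, review_b)])

print(pooled.shape)  # one row per review, one column per embedding dimension
```

Because the reviews have different lengths, they cannot be stacked into a regular 3-dimensional array directly; averaging first is what makes the final matrix rectangular.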

In [2]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


In [3]:
import numpy as np
import pandas as pd

In [4]:
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [6]:
data_path = "/content/gdrive/MyDrive/llm/IMDB Dataset.csv"

In [7]:
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [8]:
df.shape

(50000, 2)

In [9]:
df = df.iloc[:10000].copy()  # copy so later in-place edits act on an independent frame, not a view of the full dataset
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [10]:
df.shape

(10000, 2)

In [11]:
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [12]:
df['sentiment'].value_counts()

Unnamed: 0_level_0,count
sentiment,Unnamed: 1_level_1
positive,5028
negative,4972


In [13]:
df.isnull().sum()

Unnamed: 0,0
review,0
sentiment,0


In [14]:
df.duplicated().sum()

17

In [15]:
df.drop_duplicates(inplace=True)

In [16]:
df.duplicated().sum()

0

# Basic Preprocessing
  - Remove HTML tags
  - Lowercase
  - Remove stopwords

In [17]:
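import re
def remove_tags(raw_text):
    # Non-greedy match strips each HTML tag, e.g. <br />.
    # Note: replacing with '' joins the words on either side of a tag
    # ("me.<br /><br />The" becomes "me.The").
    cleaned_text = re.sub(r'<.*?>', '', raw_text)
    return cleaned_text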

In [18]:
df['review'] = df['review'].apply(remove_tags)

In [19]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [20]:
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

In [21]:
df['review'] = df['review'].apply(lambda x:x.lower())

In [22]:
df['review'][0]

"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.i would say the main appeal of the show is due to the fact that it goes where other shows wo

In [23]:
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
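sw_list = set(stopwords.words('english'))  # a set makes each membership test O(1)

# Split on whitespace and drop stopwords; each review becomes a list of tokens
df['review'] = df['review'].apply(lambda x: [item for item in x.split() if item not in sw_list])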

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
df['review'][0]

['one',
 'reviewers',
 'mentioned',
 'watching',
 '1',
 'oz',
 'episode',
 'hooked.',
 'right,',
 'exactly',
 'happened',
 'me.the',
 'first',
 'thing',
 'struck',
 'oz',
 'brutality',
 'unflinching',
 'scenes',
 'violence,',
 'set',
 'right',
 'word',
 'go.',
 'trust',
 'me,',
 'show',
 'faint',
 'hearted',
 'timid.',
 'show',
 'pulls',
 'punches',
 'regards',
 'drugs,',
 'sex',
 'violence.',
 'hardcore,',
 'classic',
 'use',
 'word.it',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary.',
 'focuses',
 'mainly',
 'emerald',
 'city,',
 'experimental',
 'section',
 'prison',
 'cells',
 'glass',
 'fronts',
 'face',
 'inwards,',
 'privacy',
 'high',
 'agenda.',
 'em',
 'city',
 'home',
 'many..aryans,',
 'muslims,',
 'gangstas,',
 'latinos,',
 'christians,',
 'italians,',
 'irish',
 'more....so',
 'scuffles,',
 'death',
 'stares,',
 'dodgy',
 'dealings',
 'shady',
 'agreements',
 'never',
 'far',
 'away.i',
 'would',
 'say',
 'main',
 'ap

In [25]:
df.head()

Unnamed: 0,review,sentiment
0,"[one, reviewers, mentioned, watching, 1, oz, e...",positive
1,"[wonderful, little, production., filming, tech...",positive
2,"[thought, wonderful, way, spend, time, hot, su...",positive
3,"[basically, there's, family, little, boy, (jak...",negative
4,"[petter, mattei's, ""love, time, money"", visual...",positive
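The token lists above still carry punctuation (`'hooked.'`, `'right,'`) and fused words (`'me.the'`, `'word.it'`), because tags are replaced with an empty string and the text is split on whitespace only. One possible refinement is sketched below. It is not what this notebook runs: `preprocess` is a hypothetical helper, and `STOPWORDS` is a tiny assumed set standing in for NLTK's English list.

```python
import re
import string

TAG_RE = re.compile(r"<.*?>")
# Maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))
# Hypothetical mini stopword set; the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "is", "this", "of", "and", "was"}

def preprocess(raw_text, stopwords=STOPWORDS):
    """Strip HTML tags, lowercase, drop punctuation, and remove stopwords."""
    text = TAG_RE.sub(" ", raw_text)      # a space keeps neighbouring words apart
    text = text.lower()
    text = text.translate(PUNCT_TO_SPACE)
    return [tok for tok in text.split() if tok not in stopwords]

tokens = preprocess("This is what happened with me.<br /><br />The first thing...")
print(tokens)  # ['what', 'happened', 'with', 'me', 'first', 'thing']
```

The trade-off is that contractions are split as well (`you'll` becomes `you` and `ll`), so whether this helps the downstream classifiers would need to be checked empirically.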


In [26]:
X = df.iloc[:,0:1]
y = df['sentiment']

In [27]:
X.head()

Unnamed: 0,review
0,"[one, reviewers, mentioned, watching, 1, oz, e..."
1,"[wonderful, little, production., filming, tech..."
2,"[thought, wonderful, way, spend, time, hot, su..."
3,"[basically, there's, family, little, boy, (jak..."
4,"[petter, mattei's, ""love, time, money"", visual..."


In [28]:
y

Unnamed: 0,sentiment
0,positive
1,positive
2,positive
3,negative
4,positive
...,...
9995,positive
9996,negative
9997,negative
9998,negative


In [29]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)

In [30]:
y

array([1, 1, 1, ..., 0, 0, 1])
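`LabelEncoder` assigns integer codes in sorted class order, which is why `positive` shows up as 1 and `negative` as 0 in the array above. The mapping can be reproduced in plain Python; this is an illustrative sketch with a made-up label list, not a replacement for the encoder.

```python
labels = ["positive", "positive", "negative", "positive"]  # toy labels

# Sorted class names get consecutive integer codes.
classes = sorted(set(labels))
class_to_id = {name: idx for idx, name in enumerate(classes)}
encoded = [class_to_id[name] for name in labels]

print(class_to_id)  # {'negative': 0, 'positive': 1}
print(encoded)      # [1, 1, 0, 1]
```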

In [31]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [32]:
X_train.shape

(7986, 1)

In [33]:
X_test.shape

(1997, 1)

In [34]:
import gensim



In [35]:
!pip install --upgrade gensim --user



In [36]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)
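With `min_count=2`, tokens that occur fewer than two times in the corpus are dropped from the vocabulary. The effect can be sketched without gensim by counting tokens directly. The corpus below is invented for illustration, and the filter only mimics the frequency threshold, not gensim's internals.

```python
from collections import Counter

corpus = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["masterpiece"],  # every token here occurs only once in the corpus
]
min_count = 2

# Count each token across the whole corpus and keep the frequent ones.
counts = Counter(tok for sentence in corpus for tok in sentence)
vocab = {tok for tok, n in counts.items() if n >= min_count}

kept_per_sentence = [[tok for tok in sentence if tok in vocab] for sentence in corpus]
for sentence, kept in zip(corpus, kept_per_sentence):
    print(sentence, "->", kept)
```

The last sentence ends up with no in-vocabulary tokens at all, which is the edge case that needs a default vector later on.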

In [37]:
X

Unnamed: 0,review
0,"[one, reviewers, mentioned, watching, 1, oz, e..."
1,"[wonderful, little, production., filming, tech..."
2,"[thought, wonderful, way, spend, time, hot, su..."
3,"[basically, there's, family, little, boy, (jak..."
4,"[petter, mattei's, ""love, time, money"", visual..."
...,...
9995,"[fun,, entertaining, movie, wwii, german, spy,..."
9996,"[give, break., anyone, say, ""good, hockey, mov..."
9997,"[movie, bad, movie., watching, endless, series..."
9998,"[movie, probably, made, entertain, middle, sch..."


In [38]:
X_arr=np.array([x for x in X['review']],dtype='object')

In [39]:
X_arr.shape

(9983,)

In [40]:
X_arr

array([list(['one', 'reviewers', 'mentioned', 'watching', '1', 'oz', 'episode', 'hooked.', 'right,', 'exactly', 'happened', 'me.the', 'first', 'thing', 'struck', 'oz', 'brutality', 'unflinching', 'scenes', 'violence,', 'set', 'right', 'word', 'go.', 'trust', 'me,', 'show', 'faint', 'hearted', 'timid.', 'show', 'pulls', 'punches', 'regards', 'drugs,', 'sex', 'violence.', 'hardcore,', 'classic', 'use', 'word.it', 'called', 'oz', 'nickname', 'given', 'oswald', 'maximum', 'security', 'state', 'penitentary.', 'focuses', 'mainly', 'emerald', 'city,', 'experimental', 'section', 'prison', 'cells', 'glass', 'fronts', 'face', 'inwards,', 'privacy', 'high', 'agenda.', 'em', 'city', 'home', 'many..aryans,', 'muslims,', 'gangstas,', 'latinos,', 'christians,', 'italians,', 'irish', 'more....so', 'scuffles,', 'death', 'stares,', 'dodgy', 'dealings', 'shady', 'agreements', 'never', 'far', 'away.i', 'would', 'say', 'main', 'appeal', 'show', 'due', 'fact', 'goes', 'shows', 'dare.', 'forget', 'pretty', '

In [41]:
len(X_arr[0])

168

In [42]:
model.build_vocab(X_arr)

In [43]:
model.train(X_arr, total_examples=model.corpus_count, epochs=model.epochs)

(5532271, 6145055)

In [46]:
model.wv.most_similar('fairly')

[('suspense', 0.9908896684646606),
 ('storyline', 0.9898179769515991),
 ('incredibly', 0.9887595772743225),
 ('movie.the', 0.9865071773529053),
 ('extremely', 0.9856547713279724),
 ('storyline,', 0.982740581035614),
 ('pacing', 0.9823650121688843),
 ('decent,', 0.9816403985023499),
 ('gore.', 0.9815224409103394),
 ('weak', 0.9797107577323914)]
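The scores returned by `most_similar` are cosine similarities between word vectors. A minimal NumPy sketch of the measure follows, using made-up 3-dimensional vectors rather than the trained embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(round(sim_ab, 6), round(sim_ac, 6))
```

Scores close to 0.99 for loosely related words, as in the output above, may indicate that the embeddings are not yet well separated; longer training or cleaner tokens might spread them out, though that is not tested here.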

In [47]:
model.wv['fairly'].shape

(100,)

In [48]:
word2vec_model_data_path = "/content/gdrive/MyDrive/llm ds with bappy/word2vec_IMBD_dataset.model"
model.save(word2vec_model_data_path)

In [52]:
X['review'][0][0]

'one'

In [53]:
model.wv[X['review'][0][0]]

array([-0.56175935,  0.15326197, -1.105652  , -0.3028208 ,  0.30853847,
       -0.4850984 , -0.44698477,  1.1297549 ,  0.02745492, -0.6093492 ,
        0.3345873 , -0.9033632 ,  0.28406754,  1.4278433 ,  0.64180666,
       -1.4092679 ,  0.29003546,  0.61800903, -1.0315362 , -0.8713858 ,
       -0.01178522, -0.8450057 , -0.08715782, -0.3932116 ,  1.3563823 ,
        0.06977923, -0.6232024 , -0.40595934, -0.16032824,  1.0133965 ,
        1.7535251 , -0.66755   ,  0.2687928 , -1.5933821 , -1.1184654 ,
        0.7417158 , -0.10615756,  0.11110865,  0.34293333, -1.1513401 ,
       -0.04305315,  0.13977104, -0.08799618, -0.31057367,  0.15017398,
       -0.35402   , -0.2732866 , -0.03993792,  0.51712966,  0.33943543,
        0.8986214 ,  0.12564334, -1.7580487 , -0.33642873, -0.18774396,
        0.5804615 , -0.28337327,  0.6846111 , -0.630163  ,  0.6326389 ,
        0.14977302,  0.6003814 ,  0.34143755,  1.602379  , -0.52430725,
        1.9307755 , -0.5857035 ,  0.47500855, -0.9707908 ,  1.54

In [55]:
len(model.wv)

53988

In [56]:
len(model.wv[X['review'][0][0]])

100

In [59]:
valid_tokens = [model.wv[token] for token in X['review'][0] if token in model.wv]

if valid_tokens:
    mean_embedding = np.mean(valid_tokens, axis=0)
else:
    mean_embedding = np.zeros(model.wv.vector_size)  # or any other default value

print(mean_embedding)


[-9.21961442e-02  1.81292400e-01  1.47400960e-01 -8.74398798e-02
  1.41746238e-01 -6.37932003e-01  2.33018443e-01  7.91034341e-01
 -2.72815824e-01 -2.87537962e-01 -7.99840167e-02 -5.69146574e-01
 -9.63844284e-02  4.13802952e-01  2.95508087e-01 -2.22581998e-01
  9.68990265e-04 -3.48267376e-01  2.58386116e-02 -9.19312775e-01
  3.15803081e-01  1.95930123e-01  1.58254758e-01 -4.74543199e-02
 -2.04271581e-02  1.34665817e-01 -2.66012877e-01 -2.60218978e-01
 -3.76691550e-01 -6.99974671e-02  4.84093547e-01 -2.36638561e-02
  5.48713170e-02 -2.35289797e-01 -1.67152658e-01  3.79245937e-01
 -7.90904313e-02 -2.70752043e-01 -1.73162833e-01 -9.69056904e-01
  3.82116809e-02 -5.99378586e-01 -1.62050352e-01  2.05737531e-01
  2.56121188e-01 -8.57213736e-02 -3.93786490e-01  6.53400421e-02
  5.94848096e-02  3.00321698e-01  1.73231244e-01 -5.24099410e-01
 -2.68663257e-01 -5.82210459e-02 -2.24718586e-01  1.42669663e-01
  2.86806792e-01  7.95635656e-02 -3.12523156e-01  1.79270416e-01
 -3.70352268e-02  2.00557

In [64]:
X_train.head()

Unnamed: 0,review
6713,"[i've, waiting, superhero, movie, like, long, ..."
1178,"[movie, excellent, acted,, excellent, directed..."
4707,"[movie, makes, want, throw, every, time, see, ..."
6772,"[first, saw, movie, elementary, school,, back,..."
7461,"[show, made, persons, iq, lower, 80., jokes, s..."


In [65]:
mean_embeddings=[]

for row in X_train['review']:
  valid_tokens = [model.wv[token] for token in row if token in model.wv]

  if valid_tokens:
    mean_embedding = np.mean(valid_tokens, axis=0)

  else:
    mean_embedding = np.zeros(model.wv.vector_size)

  mean_embeddings.append(mean_embedding)

In [66]:
len(mean_embeddings)

7986

In [70]:
X_test_mean_embeddings=[]

for row in X_test['review']:
  valid_tokens = [model.wv[token] for token in row if token in model.wv]

  if valid_tokens:
    mean_embedding = np.mean(valid_tokens, axis=0)

  else:
    mean_embedding = np.zeros(model.wv.vector_size)

  X_test_mean_embeddings.append(mean_embedding)

In [71]:
len(X_test_mean_embeddings)

1997
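The train and test loops above are identical apart from their input, so they could be folded into one helper that returns a 2-dimensional array directly. The sketch below is one way to do that. `mean_pool` is a hypothetical name, and a plain dict stands in for `model.wv`, since both support `token in vectors` and `vectors[token]`.

```python
import numpy as np

def mean_pool(reviews, vectors, vector_size):
    """Return a (n_reviews, vector_size) array of averaged token vectors.

    `vectors` is any mapping supporting `token in vectors` and `vectors[token]`.
    Reviews with no known tokens are left as all-zero rows.
    """
    pooled = np.zeros((len(reviews), vector_size), dtype=np.float32)
    for i, tokens in enumerate(reviews):
        known = [vectors[tok] for tok in tokens if tok in vectors]
        if known:
            pooled[i] = np.mean(known, axis=0)
    return pooled

# Toy stand-in for model.wv (assumed values, for illustration only).
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([0.0, 1.0]),
}
reviews = [["good", "bad"], ["good", "unseen"], ["unseen"]]
features = mean_pool(reviews, toy_vectors, vector_size=2)
print(features)
```

In the notebook this would presumably be called as `mean_pool(X_train['review'], model.wv, model.wv.vector_size)` and likewise for `X_test`.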

In [75]:
mean_embeddings[0]

array([-1.14405625e-01,  2.16396585e-01,  2.34749317e-01, -6.37036487e-02,
        1.22654386e-01, -6.19826436e-01,  2.69692361e-01,  8.43321979e-01,
       -2.86658376e-01, -2.44258925e-01, -1.93744153e-02, -5.67772865e-01,
       -1.44635618e-01,  3.15603405e-01,  2.40026265e-01, -2.23786950e-01,
       -4.81551401e-02, -3.22576612e-01, -1.54136838e-02, -8.88029158e-01,
        2.71140307e-01,  2.07875654e-01,  1.36906445e-01, -7.66206384e-02,
       -1.26276419e-01,  6.74938709e-02, -2.96142429e-01, -1.41560256e-01,
       -4.01061594e-01,  8.53507742e-02,  5.00052214e-01,  3.12591419e-02,
       -2.02932322e-04, -1.75245732e-01, -1.15980372e-01,  4.82362807e-01,
       -4.06168699e-02, -4.20480371e-01, -2.62756079e-01, -8.75108480e-01,
        3.38919796e-02, -5.70926130e-01, -1.89924181e-01,  3.07379395e-01,
        2.77409583e-01, -1.55138284e-01, -2.58718431e-01,  4.12821285e-02,
        1.60001010e-01,  2.64180422e-01,  1.67310491e-01, -5.03490984e-01,
       -2.19406068e-01, -

In [74]:
X_test_mean_embeddings[0]

array([-0.17093244,  0.18681996,  0.19606256, -0.06301402,  0.09193455,
       -0.59106076,  0.20399846,  0.81979096, -0.25458387, -0.24327709,
       -0.05661563, -0.56896585, -0.20537831,  0.33973223,  0.16644718,
       -0.19772665, -0.01473749, -0.33524045, -0.02493157, -0.9037894 ,
        0.25045288,  0.20274128,  0.11852077, -0.09692533, -0.11364084,
        0.05390318, -0.28876135, -0.19444653, -0.37319294,  0.02346391,
        0.45492306,  0.03476888,  0.02688003, -0.22530489, -0.20268396,
        0.4124645 , -0.05880081, -0.36623386, -0.21205898, -0.85632586,
        0.02692242, -0.53669345, -0.19999425,  0.26289645,  0.28959188,
       -0.15620767, -0.2519223 ,  0.05246327,  0.15645516,  0.31755307,
        0.19715622, -0.48284125, -0.18984298, -0.02275031, -0.25326854,
        0.16194667,  0.26196766,  0.06102298, -0.36206827,  0.18273987,
       -0.01813997,  0.24551034, -0.09692648, -0.0129061 , -0.49208653,
        0.5368746 ,  0.05901195,  0.32375133, -0.6303746 ,  0.45

In [77]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(mean_embeddings,y_train)

In [78]:
y_pred = gnb.predict(X_test_mean_embeddings)

from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_test,y_pred)

0.6745117676514772

In [79]:
confusion_matrix(y_test,y_pred)

array([[636, 316],
       [334, 711]])
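The confusion matrix can be unpacked into the usual metrics by hand. In scikit-learn's layout, rows are true labels and columns are predicted labels, so with 0 = negative and 1 = positive the four cells are TN, FP, FN and TP. The arithmetic below is a worked example on the numbers printed above.

```python
# Cells of the GaussianNB confusion matrix shown above.
tn, fp = 636, 316   # true negatives, false positives
fn, tp = 334, 711   # false negatives, true positives

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of the reviews predicted positive, how many were positive
recall = tp / (tp + fn)      # of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way reproduces the `accuracy_score` value of about 0.6745 reported above, and the errors are split fairly evenly between the two classes.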

In [80]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(mean_embeddings,y_train)

y_pred = rf.predict(X_test_mean_embeddings)

accuracy_score(y_test,y_pred)

0.7481221832749124


