# Sentiment Analysis - NLP

### Importing Libraries

In [3]:
import pandas as pd
import numpy as np

### Loading Dataset

In [6]:
data = pd.read_csv('IMDB Dataset.csv')

### Data Exploration

In [11]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [13]:
data.isna().sum()

review       0
sentiment    0
dtype: int64

In [15]:
data.dtypes

review       object
sentiment    object
dtype: object

In [17]:
data.shape

(50000, 2)

### Sample Data

In [20]:
data = data.sample(1000)

In [22]:
data

Unnamed: 0,review,sentiment
25637,The Poverty Row horror pictures of the 1930s a...,negative
16403,I can't see the point in burying a movie like ...,positive
47693,"I have read over 100 of the Nancy Drew books, ...",positive
23545,"Originally I wrote what was a sarcastic,scathi...",negative
27622,"In the 70's in Afghanistan, the Pushtun boy Am...",positive
...,...,...
42543,"bad acting , combats are very awful , 3-4 seco...",negative
13038,Oh Mr. Carell! How far you've fallen! After a ...,negative
6542,Any one who writes that this is any good there...,negative
33185,Here is what happened:<br /><br />1) Head of B...,negative


### Checking Unique Values and Encoding Sentiment as Numbers

In [25]:
data['sentiment'].unique()

array(['negative', 'positive'], dtype=object)

In [27]:
# Assign the result back to the column. Calling replace(..., inplace=True) on a
# selected column acts on an intermediate copy and raises a pandas FutureWarning.
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})


In [29]:
data.head()

Unnamed: 0,review,sentiment
25637,The Poverty Row horror pictures of the 1930s a...,0
16403,I can't see the point in burying a movie like ...,1
47693,"I have read over 100 of the Nancy Drew books, ...",1
23545,"Originally I wrote what was a sarcastic,scathi...",0
27622,"In the 70's in Afghanistan, the Pushtun boy Am...",1


In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 25637 to 6510
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     1000 non-null   object
 1   sentiment  1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 23.4+ KB


### Removing HTML Tags with Regular Expressions

In [34]:
import re

In [36]:
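def clean_html(text):
    # Match a complete tag: '<', then as few characters as possible, then '>'.
    # The earlier pattern '<,*>?' removed only the '<' itself, which is why
    # fragments such as 'br />' appear in the outputs recorded in this notebook.
    clean = re.compile(r'<.*?>')
    return re.sub(clean, '', text)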

In [38]:
data["review"] = data["review"].apply(clean_html)

In [40]:
data["review"].iloc[5]

"My daughter, her friends and I have watched this movie literally dozens of times. I bought it twice and some little girlfriends absconded with it. Subsequently, I rented it so very many times. It just never gets old!!! Blockbuster doesn't even have it in their listings anymore and I have tried to buy, find, rent it for over 5 years. Without a doubt, this was and is my most favourite movie of my daughter's childhood...it has it all! We laughed, we cried, we discussed real life and how hard some children have it in the world. There was nothing pretend about this movie. We related to every second and every line Bill! Thanks a million for restoring our faith in human nature. Sincerely, Shelleen and Kailin Vandermey. Craven, Saskatchewan. CANADA,eh!!! :-)br />br />August '07 update:br />br />Who are we to judge if a rich woman falls in love with a poor man; or a man who has love chooses to raise a child who is not his own. It may not be my or your life. It is not only believable, it happen

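The `br />` fragments left in the review above are the kind of residue a pattern like `<,*>?` produces: it matches a `<`, then any run of commas, then an optional `>`, so on `<br />` it removes only the `<`. The sketch below contrasts it with the non-greedy pattern `<.*?>` on a made-up sample string. The sample text and variable names are illustrative assumptions, not part of the notebook.

```python
import re

sample = "Great film.<br /><br />Loved it."

# '<,*>?' means: '<', zero or more commas, an optional '>'.
# On '<br />' it therefore removes only the '<'.
comma_pattern = re.compile(r'<,*>?')

# '<.*?>' means: '<', the shortest possible run of characters, then '>'.
# It removes the whole tag.
tag_pattern = re.compile(r'<.*?>')

comma_result = comma_pattern.sub('', sample)
tag_result = tag_pattern.sub(' ', sample)  # a space keeps neighbouring words apart

print(comma_result)  # Great film.br />br />Loved it.
print(tag_result)    # Great film.  Loved it.
```

Substituting a space rather than an empty string is a design choice: it prevents the words on either side of a tag from being glued together.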
### Removing Special Characters

In [43]:
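def remove_special(text):
    # Replace every non-alphanumeric character with a space.
    # The leading ' ' keeps the behavior of the original accumulator, which started as ' '.
    return ' ' + ''.join(ch if ch.isalnum() else ' ' for ch in text)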

In [45]:
data["review"] = data["review"].apply(remove_special)

In [47]:
data["review"].iloc[5]

' My daughter  her friends and I have watched this movie literally dozens of times  I bought it twice and some little girlfriends absconded with it  Subsequently  I rented it so very many times  It just never gets old    Blockbuster doesn t even have it in their listings anymore and I have tried to buy  find  rent it for over 5 years  Without a doubt  this was and is my most favourite movie of my daughter s childhood   it has it all  We laughed  we cried  we discussed real life and how hard some children have it in the world  There was nothing pretend about this movie  We related to every second and every line Bill  Thanks a million for restoring our faith in human nature  Sincerely  Shelleen and Kailin Vandermey  Craven  Saskatchewan  CANADA eh       br   br   August  07 update br   br   Who are we to judge if a rich woman falls in love with a poor man  or a man who has love chooses to raise a child who is not his own  It may not be my or your life  It is not only believable  it happe

In [49]:
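# Convert text to lowercase
def convert_low(text):
    return text.lower()

# Apply the function; defining it alone leaves the column unchanged.
# The outputs recorded below were captured without this call, so capitalized
# stopwords such as 'I' and 'It' still appear in them.
data["review"] = data["review"].apply(convert_low)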

In [51]:
data["review"].iloc[5]

' My daughter  her friends and I have watched this movie literally dozens of times  I bought it twice and some little girlfriends absconded with it  Subsequently  I rented it so very many times  It just never gets old    Blockbuster doesn t even have it in their listings anymore and I have tried to buy  find  rent it for over 5 years  Without a doubt  this was and is my most favourite movie of my daughter s childhood   it has it all  We laughed  we cried  we discussed real life and how hard some children have it in the world  There was nothing pretend about this movie  We related to every second and every line Bill  Thanks a million for restoring our faith in human nature  Sincerely  Shelleen and Kailin Vandermey  Craven  Saskatchewan  CANADA eh       br   br   August  07 update br   br   Who are we to judge if a rich woman falls in love with a poor man  or a man who has love chooses to raise a child who is not his own  It may not be my or your life  It is not only believable  it happe

### Stopwords

In [54]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Johnson\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [55]:
from nltk.corpus import stopwords

In [60]:
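stop_words = set(stopwords.words("english"))  # a set gives constant-time membership tests

In [62]:
def remove_stop(text):
    # Split on whitespace and keep only the tokens that are not stopwords.
    # The comparison is case-sensitive, so the text should already be lowercase.
    return [word for word in text.split() if word not in stop_words]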

In [64]:
data["review"] = data["review"].apply(remove_stop)

In [66]:
data["review"].iloc[5]

['My',
 'daughter',
 'friends',
 'I',
 'watched',
 'movie',
 'literally',
 'dozens',
 'times',
 'I',
 'bought',
 'twice',
 'little',
 'girlfriends',
 'absconded',
 'Subsequently',
 'I',
 'rented',
 'many',
 'times',
 'It',
 'never',
 'gets',
 'old',
 'Blockbuster',
 'even',
 'listings',
 'anymore',
 'I',
 'tried',
 'buy',
 'find',
 'rent',
 '5',
 'years',
 'Without',
 'doubt',
 'favourite',
 'movie',
 'daughter',
 'childhood',
 'We',
 'laughed',
 'cried',
 'discussed',
 'real',
 'life',
 'hard',
 'children',
 'world',
 'There',
 'nothing',
 'pretend',
 'movie',
 'We',
 'related',
 'every',
 'second',
 'every',
 'line',
 'Bill',
 'Thanks',
 'million',
 'restoring',
 'faith',
 'human',
 'nature',
 'Sincerely',
 'Shelleen',
 'Kailin',
 'Vandermey',
 'Craven',
 'Saskatchewan',
 'CANADA',
 'eh',
 'br',
 'br',
 'August',
 '07',
 'update',
 'br',
 'br',
 'Who',
 'judge',
 'rich',
 'woman',
 'falls',
 'love',
 'poor',
 'man',
 'man',
 'love',
 'chooses',
 'raise',
 'child',
 'It',
 'may',
 'li
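
The token list above still contains `My`, `I`, `It`, and `We`: the membership test is case-sensitive and the stopword list is lowercase, so capitalized forms pass through. A minimal sketch of the effect follows, using a small hand-written stopword set as a stand-in for NLTK's list (an assumption, so that the example runs without any downloads).

```python
# A tiny stand-in for NLTK's English stopword list (assumption for illustration).
stop_words = {"i", "my", "it", "we", "and", "her", "have"}

def remove_stop(text):
    # Case-sensitive membership test, as in the notebook's remove_stop.
    return [word for word in text.split() if word not in stop_words]

sample = "My daughter her friends and I have watched it"

as_is = remove_stop(sample)            # capitalized stopwords survive
lowered = remove_stop(sample.lower())  # lowercasing first removes them

print(as_is)    # ['My', 'daughter', 'friends', 'I', 'watched']
print(lowered)  # ['daughter', 'friends', 'watched']
```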

In [68]:
from nltk.stem.porter import PorterStemmer

In [70]:
ps = PorterStemmer()

In [72]:
def stem_words(text):
    # Reduce each token to its Porter stem; `text` is the token list from remove_stop.
    return [ps.stem(word) for word in text]

In [74]:
data["review"] = data["review"].apply(stem_words)

In [75]:
data["review"].iloc[5]

['my',
 'daughter',
 'friend',
 'i',
 'watch',
 'movi',
 'liter',
 'dozen',
 'time',
 'i',
 'bought',
 'twice',
 'littl',
 'girlfriend',
 'abscond',
 'subsequ',
 'i',
 'rent',
 'mani',
 'time',
 'it',
 'never',
 'get',
 'old',
 'blockbust',
 'even',
 'list',
 'anymor',
 'i',
 'tri',
 'buy',
 'find',
 'rent',
 '5',
 'year',
 'without',
 'doubt',
 'favourit',
 'movi',
 'daughter',
 'childhood',
 'we',
 'laugh',
 'cri',
 'discuss',
 'real',
 'life',
 'hard',
 'children',
 'world',
 'there',
 'noth',
 'pretend',
 'movi',
 'we',
 'relat',
 'everi',
 'second',
 'everi',
 'line',
 'bill',
 'thank',
 'million',
 'restor',
 'faith',
 'human',
 'natur',
 'sincer',
 'shelleen',
 'kailin',
 'vandermey',
 'craven',
 'saskatchewan',
 'canada',
 'eh',
 'br',
 'br',
 'august',
 '07',
 'updat',
 'br',
 'br',
 'who',
 'judg',
 'rich',
 'woman',
 'fall',
 'love',
 'poor',
 'man',
 'man',
 'love',
 'choos',
 'rais',
 'child',
 'it',
 'may',
 'life',
 'it',
 'believ',
 'happen',
 'everi',
 'day',
 'thank',

In [78]:
def join(list_input):
    return " ".join(list_input)

In [80]:
data["review"] = data["review"].apply(join)

In [82]:
data["review"].iloc[5]

'my daughter friend i watch movi liter dozen time i bought twice littl girlfriend abscond subsequ i rent mani time it never get old blockbust even list anymor i tri buy find rent 5 year without doubt favourit movi daughter childhood we laugh cri discuss real life hard children world there noth pretend movi we relat everi second everi line bill thank million restor faith human natur sincer shelleen kailin vandermey craven saskatchewan canada eh br br august 07 updat br br who judg rich woman fall love poor man man love choos rais child it may life it believ happen everi day thank god keep faith human natur aliv celebr'

In [84]:
data

Unnamed: 0,review,sentiment
25637,the poverti row horror pictur 1930 40 depress ...,0
16403,i see point buri movi like sulfur sarcasm way ...,1
47693,i read 100 nanci drew book bright enough catch...,1
23545,origin i wrote sarcast scath review pathet pie...,0
27622,in 70 afghanistan pushtun boy amir zekeria ebr...,1
...,...,...
42543,bad act combat aw 3 4 second text bad music ba...,0
13038,oh mr carel how far fallen after glow moment l...,0
6542,ani one write good kid may work put money god ...,0
33185,here happen br br 1 head bbc3 need make progra...,0


In [86]:
X = data["review"]
y = data["sentiment"]

### Implementing ML algorithms

In [89]:
from sklearn.feature_extraction.text import CountVectorizer

In [91]:
cv=CountVectorizer()

In [93]:
cv.fit_transform(data["review"])

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 100374 stored elements and shape (1000, 13042)>
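
The sparse matrix above has one row per review and one column per distinct token, and each stored element is a nonzero count. The pure-Python sketch below illustrates the same bag-of-words idea on three made-up documents. It is a simplification under stated assumptions: it splits on whitespace, whereas `CountVectorizer` applies its own tokenization, and the names `docs`, `vocabulary`, and `matrix` are illustrative.

```python
from collections import Counter

docs = [
    "movi good act good",
    "movi bad act",
    "good stori",
]

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = sorted({token for doc in docs for token in doc.split()})
vocabulary = {token: idx for idx, token in enumerate(all_tokens)}

# One row per document, one column per vocabulary entry, each cell a token count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(token, 0) for token in vocabulary])

print(vocabulary)  # {'act': 0, 'bad': 1, 'good': 2, 'movi': 3, 'stori': 4}
print(matrix)      # [[1, 0, 2, 1, 0], [1, 1, 0, 1, 0], [0, 0, 1, 0, 1]]

# Share of nonzero cells in the notebook's matrix, from the figures in its repr.
density = 100374 / (1000 * 13042)
print(round(density * 100, 2))  # 0.77 (percent)
```

With fewer than 1% of cells nonzero, the sparse representation is what keeps a 1000 × 13042 matrix compact.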

In [95]:
X = cv.fit_transform(data["review"]).toarray()

In [97]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [99]:
from sklearn.model_selection import train_test_split

In [101]:
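# random_state fixes the split so reruns are comparable; 42 is an arbitrary seed.
# The outputs recorded below came from an unseeded split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)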

In [103]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,confusion_matrix

In [105]:
nb = GaussianNB()

In [107]:
nb.fit(X_train,y_train)
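
`GaussianNB` models each feature as a continuous, normally distributed value, whereas the features here are word counts. scikit-learn also provides `MultinomialNB`, which is designed for count features and accepts the sparse matrix directly, so `.toarray()` is not needed. Wrapping the vectorizer and classifier in a pipeline additionally means the vocabulary is learned from the training texts only. The sketch below shows this as a possible alternative on made-up data; it is not the notebook's method, and the texts and labels are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, already-stemmed reviews (1 = positive, 0 = negative).
train_texts = [
    "good great movi",
    "great act good stori",
    "bad aw movi",
    "aw bore bad act",
]
train_labels = [1, 1, 0, 0]
test_texts = ["great stori", "bore aw"]

# The pipeline fits the vectorizer on the training texts only,
# then feeds the sparse counts straight to MultinomialNB.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(predictions.tolist())  # [1, 0]
```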

In [109]:
y_pred = nb.predict(X_test)

In [111]:
y_pred

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1,
       1, 0], dtype=int64)

In [113]:
cm = confusion_matrix(y_test,y_pred)

In [115]:
cm

array([[61, 29],
       [53, 57]], dtype=int64)
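
`accuracy_score` is imported above but never called, yet the confusion matrix already contains what is needed. In scikit-learn's convention rows are true labels and columns are predicted labels, so for labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. A short sketch of the arithmetic, using the values printed above:

```python
# Confusion matrix from the notebook: rows = true label, columns = predicted label.
cm = [[61, 29],
      [53, 57]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all test reviews classified correctly
precision = tp / (tp + fp)     # of reviews predicted positive, share truly positive
recall = tp / (tp + fn)        # of truly positive reviews, share that were found

print(total)                # 200
print(round(accuracy, 3))   # 0.59
print(round(precision, 3))  # 0.663
print(round(recall, 3))     # 0.518
```

An accuracy of 0.59 on a roughly balanced two-class test set is only modestly above chance; `accuracy_score(y_test, y_pred)` should report the same figure.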