# Sentiment Analysis

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
data = pd.read_csv("../Datasets/imdb_dataset.csv")

In [4]:
data.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [5]:
data.describe()

|        | review | sentiment |
|--------|--------|-----------|
| count  | 50000 | 50000 |
| unique | 49582 | 2 |
| top    | Loved today's show!!! It was a variety and not... | positive |
| freq   | 5 | 25000 |
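
The `describe()` output reports 50,000 reviews but only 49,582 unique ones, so 418 rows repeat a review that already appears elsewhere. The notebook keeps them. If you would rather drop them before modelling, a minimal sketch on a toy frame (the `toy` DataFrame is hypothetical, not the IMDB data) could look like this:

```python
import pandas as pd

# Hypothetical toy frame standing in for the IMDB data.
toy = pd.DataFrame({
    "review": ["great film", "bad film", "great film"],
    "sentiment": ["positive", "negative", "positive"],
})

# Number of rows whose review text already appeared earlier in the frame.
n_duplicates = toy["review"].duplicated().sum()

# Keep the first occurrence of each review and renumber the index.
deduped = toy.drop_duplicates(subset="review").reset_index(drop=True)

print(n_duplicates, len(deduped))
```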


In [6]:
data.isnull().sum()

review       0
sentiment    0
dtype: int64

In [7]:
data['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [9]:
data['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

## Text Cleaning

1. Remove HTML Tags
2. Remove Special Characters
3. Convert Text into Lower Case
4. Remove Stop Words
5. Stemming
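
The five steps can also be chained into one function. The sketch below is a standalone illustration, not the notebook's code: `TOY_STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list, and stemming is passed in as an optional callable so the example runs without NLTK.

```python
import re

# Tiny hypothetical stand-in for NLTK's English stop-word list.
TOY_STOP_WORDS = {"a", "the", "is", "this", "was", "it"}

TAG_RE = re.compile(r"<.*?>")
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")


def clean_text(text, stop_words=TOY_STOP_WORDS, stem=None):
    """Apply the five cleaning steps in the order listed above."""
    text = TAG_RE.sub("", text)          # 1. remove HTML tags
    text = NON_ALNUM_RE.sub("", text)    # 2. remove special characters
    text = text.lower()                  # 3. convert to lower case
    words = [w for w in text.split() if w not in stop_words]  # 4. stop words
    if stem is not None:
        words = [stem(w) for w in words]  # 5. stemming (optional here)
    return " ".join(words)


print(clean_text("This was a <b>GREAT</b> movie!"))
```

With NLTK installed, `PorterStemmer().stem` could be passed as the `stem` argument.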

### Step 1. Remove HTML Tags

In [10]:
# Convert the binary label column to 1 (positive) and 0 (negative).
# Assigning the result back avoids the chained `inplace=True` call, which
# pandas warns will stop modifying the original DataFrame in pandas 3.0.
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})


In [11]:
data.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


In [12]:
# clean HTML tags using `re` library
import re

clean = re.compile('<.*?>')

In [13]:
# testing
print('With HTML tags:', data.iloc[2].review)
print('Without HTML tags:', re.sub(clean, '', data.iloc[2].review))

With HTML tags: I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than "Devil Wears Prada" and more interesting than "Superman" a great comedy to go see with friends.
Without HTML tags: I thought this was a wonderful way to 

In [14]:
# function for removing all the tags
def remove_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

In [15]:
# apply function to all the reviews
data['review'] = data['review'].apply(remove_tags)

In [16]:
data.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. The filming tec... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


In [17]:
# testing 
data['review'][45063]

'You know, this movie isn\'t that great, but, I mean, c\'mon, it\'s about angels helping a baseball team. I find the plot line to be hilarious anyways, this kid\'s dad says he\'ll take him back if the angels win the pennant (because he knows they won\'t) Kid prays to his fake god to help the angels win, god helps the whole time (via the angel Christopher Lloyd, RIP) And in the end, his dad doesn\'t take him back and rides off on his motorcycle right in that kids face. it\'s hilarious until Danny Glover adopts it and it\'s friend.I guess the upside is that the old lady is left alone to die with her stitchin\' projects and her stories. The real winner here, though, is god. Because later he got a job as a writer for numerous prank shows.As a kids movie, it gets a 7. As a movie about the mysteries of blind, stupid faith, and the nature of "god," it gets a 10.'
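
In the output above, `friend.I guess` and `shows.As a kids movie` show a side effect of replacing tags with an empty string: the sentences on either side of a removed tag (most likely a `<br /><br />`) are joined, and once punctuation is stripped in the next step the two words merge into one token. One possible variant, not used in this notebook, substitutes a space and then collapses repeated whitespace. The name `remove_tags_keep_spacing` is hypothetical.

```python
import re

TAG_RE = re.compile(r"<.*?>")
WHITESPACE_RE = re.compile(r"\s+")


def remove_tags_keep_spacing(text):
    """Replace each HTML tag with a space, then collapse runs of whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", without_tags).strip()


sample_text = "grown to love.<br /><br />This was the most I'd laughed"
print(remove_tags_keep_spacing(sample_text))
```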

### Step 2. Remove Special Characters

In [18]:
# Function for removing special characters
def remove_special_char(text):
    return re.sub(r'[^A-Za-z0-9\s]', '', text)

In [19]:
# testing
test = data['review'][2]
test = remove_special_char(test)
print(test)

I thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air conditioned theater and watching a lighthearted comedy The plot is simplistic but the dialogue is witty and the characters are likable even the well bread suspected serial killer While some may be disappointed when they realize this is not Match Point 2 Risk Addiction I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to loveThis was the most Id laughed at one of Woodys comedies in years dare I say a decade While Ive never been impressed with Scarlet Johanson in this she managed to tone down her sexy image and jumped right into a average but spirited young womanThis may not be the crown jewel of his career but it was wittier than Devil Wears Prada and more interesting than Superman a great comedy to go see with friends


In [20]:
data['review'] = data['review'].apply(remove_special_char)

In [21]:
data['review'][45063]

'You know this movie isnt that great but I mean cmon its about angels helping a baseball team I find the plot line to be hilarious anyways this kids dad says hell take him back if the angels win the pennant because he knows they wont Kid prays to his fake god to help the angels win god helps the whole time via the angel Christopher Lloyd RIP And in the end his dad doesnt take him back and rides off on his motorcycle right in that kids face its hilarious until Danny Glover adopts it and its friendI guess the upside is that the old lady is left alone to die with her stitchin projects and her stories The real winner here though is god Because later he got a job as a writer for numerous prank showsAs a kids movie it gets a 7 As a movie about the mysteries of blind stupid faith and the nature of god it gets a 10'

### Step 3. Convert Text into Lower Case

In [22]:
# function to convert into lower case
def convert_lower_case(text):
    return text.lower()

In [23]:
# testing
test = convert_lower_case(test)
print(test)

i thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air conditioned theater and watching a lighthearted comedy the plot is simplistic but the dialogue is witty and the characters are likable even the well bread suspected serial killer while some may be disappointed when they realize this is not match point 2 risk addiction i thought it was proof that woody allen is still fully in control of the style many of us have grown to lovethis was the most id laughed at one of woodys comedies in years dare i say a decade while ive never been impressed with scarlet johanson in this she managed to tone down her sexy image and jumped right into a average but spirited young womanthis may not be the crown jewel of his career but it was wittier than devil wears prada and more interesting than superman a great comedy to go see with friends


In [24]:
data['review'] = data['review'].apply(convert_lower_case)

In [25]:
data['review'][45063]

'you know this movie isnt that great but i mean cmon its about angels helping a baseball team i find the plot line to be hilarious anyways this kids dad says hell take him back if the angels win the pennant because he knows they wont kid prays to his fake god to help the angels win god helps the whole time via the angel christopher lloyd rip and in the end his dad doesnt take him back and rides off on his motorcycle right in that kids face its hilarious until danny glover adopts it and its friendi guess the upside is that the old lady is left alone to die with her stitchin projects and her stories the real winner here though is god because later he got a job as a writer for numerous prank showsas a kids movie it gets a 7 as a movie about the mysteries of blind stupid faith and the nature of god it gets a 10'

### Step 4. Remove Stop Words

In [26]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ask50\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [27]:
from nltk.corpus import stopwords

stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [28]:
len(stopwords.words('english'))

179

In [29]:
# function for removing stop words
def remove_stop_words(text):
    stop_words = set(stopwords.words('english'))
    words = text.split()
    filtered_text = " ".join([word for word in words if word not in stop_words])
    return filtered_text

In [30]:
# testing
test = remove_stop_words(test)
print(test)

thought wonderful way spend time hot summer weekend sitting air conditioned theater watching lighthearted comedy plot simplistic dialogue witty characters likable even well bread suspected serial killer may disappointed realize match point 2 risk addiction thought proof woody allen still fully control style many us grown lovethis id laughed one woodys comedies years dare say decade ive never impressed scarlet johanson managed tone sexy image jumped right average spirited young womanthis may crown jewel career wittier devil wears prada interesting superman great comedy go see friends


In [31]:
data['review'] = data['review'].apply(remove_stop_words)

In [32]:
print(data['review'][45063])

know movie isnt great mean cmon angels helping baseball team find plot line hilarious anyways kids dad says hell take back angels win pennant knows wont kid prays fake god help angels win god helps whole time via angel christopher lloyd rip end dad doesnt take back rides motorcycle right kids face hilarious danny glover adopts friendi guess upside old lady left alone die stitchin projects stories real winner though god later got job writer numerous prank showsas kids movie gets 7 movie mysteries blind stupid faith nature god gets 10
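
Words such as `isnt`, `wont` and `doesnt` survive in the output above. The stop-word list stores contractions with their apostrophes (for example `"you're"` and `"it's"` in the listing above), but Step 2 has already stripped apostrophes from the reviews, so those entries can no longer match. A possible workaround, sketched here with a small hypothetical list rather than the full NLTK one, applies the same character filter to the stop words and builds the set once instead of on every call:

```python
import re

# Small hypothetical sample of stop words; the notebook uses NLTK's list.
raw_stop_words = ["i", "it", "it's", "isn't", "won't", "that", "the"]

# Apply the same filter used on the reviews so both sides match.
STOP_WORDS = {re.sub(r"[^A-Za-z0-9\s]", "", word) for word in raw_stop_words}


def remove_stop_words_normalized(text):
    """Drop stop words, comparing against the apostrophe-free set."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


print(remove_stop_words_normalized("i know this movie isnt that great"))
```

Whether negations such as `isnt` should be removed at all is a separate modelling choice, since they can carry sentiment.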


### Step 5. Stemming

In [33]:
from nltk.stem import PorterStemmer

# function for stemming the words
def stemming(text):
    stem = PorterStemmer()
    words = text.split()
    stemmed_text = " ".join([stem.stem(word) for word in words])
    return stemmed_text

In [34]:
test = stemming(test)
print(test)

thought wonder way spend time hot summer weekend sit air condit theater watch lightheart comedi plot simplist dialogu witti charact likabl even well bread suspect serial killer may disappoint realiz match point 2 risk addict thought proof woodi allen still fulli control style mani us grown lovethi id laugh one woodi comedi year dare say decad ive never impress scarlet johanson manag tone sexi imag jump right averag spirit young womanthi may crown jewel career wittier devil wear prada interest superman great comedi go see friend


In [42]:
data['review'] = data['review'].apply(stemming)

In [46]:
data['review'][45063]

'know movi isnt great mean cmon angel help basebal team find plot line hilari anyway kid dad say hell take back angel win pennant know wont kid pray fake god help angel win god help whole time via angel christoph lloyd rip end dad doesnt take back ride motorcycl right kid face hilari danni glover adopt friendi guess upsid old ladi left alon die stitchin project stori real winner though god later got job writer numer prank showsa kid movi get 7 movi mysteri blind stupid faith natur god get 10'
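
To see what the Porter stemmer does to individual words, the short check below stems a few words taken from the reviews above. It also creates the `PorterStemmer` once and reuses it, which avoids constructing a new object for each review. `stem_text` is a hypothetical name for that variant.

```python
from nltk.stem import PorterStemmer

# Create the stemmer once and reuse it for every review.
stemmer = PorterStemmer()


def stem_text(text):
    """Stem every whitespace-separated word in the text."""
    return " ".join(stemmer.stem(word) for word in text.split())


print(stem_text("wonderful comedy movie angels"))
```

The stems (`wonder`, `comedi`, `movi`, `angel`) are not always dictionary words; what matters for the model is that different forms of a word map to the same token.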

##### Using 10,000 samples from the dataset

In [49]:
sample = data.sample(10000)

In [50]:
print(len(sample))

10000
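
`data.sample(10000)` draws a different random subset on every run, so the accuracies reported below will vary slightly between runs. Passing `random_state` makes the draw repeatable. A minimal sketch on a hypothetical toy frame (the seed value 42 is an arbitrary choice):

```python
import pandas as pd

# Hypothetical toy frame; the notebook samples from the cleaned IMDB data.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(100)],
    "sentiment": [i % 2 for i in range(100)],
})

# The same seed yields the same rows on every call.
first = toy.sample(10, random_state=42)
second = toy.sample(10, random_state=42)

print(first.index.equals(second.index))
```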


## Feature Extraction 

In [80]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1000)

In [81]:
X = cv.fit_transform(sample['review']).toarray()

In [82]:
X.shape

(10000, 1000)

In [83]:
X.nbytes

80000000

In [84]:
y = sample.iloc[:,-1].values

In [85]:
type(y)

numpy.ndarray

### Split Data for training and testing

In [86]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=42)

#### Training Models with Naive Bayes

In [87]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score

Gaussian Naive Bayes

In [88]:
clf_ga = GaussianNB()
clf_ga.fit(X_train, y_train)

In [89]:
y_pred_ga = clf_ga.predict(X_test)

In [90]:
accuracy_score(y_test, y_pred_ga)

0.7776
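
`accuracy_score` is the fraction of test reviews whose predicted label matches the true label. With `test_size=0.25` on 10,000 samples the test set holds 2,500 reviews, so 0.7776 corresponds to 1,944 correct predictions. A hand-rolled equivalent on made-up labels:

```python
# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 0]

# Accuracy = number of matching labels / number of labels.
correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)

print(correct, accuracy)
```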

Multinomial Naive Bayes

In [91]:
clf_mul = MultinomialNB()
clf_mul.fit(X_train, y_train)

In [92]:
y_pred_mul = clf_mul.predict(X_test)

In [93]:
accuracy_score(y_test, y_pred_mul)

0.8264

Bernoulli Naive Bayes

In [94]:
clf_br = BernoulliNB()
clf_br.fit(X_train, y_train)

In [95]:
y_pred_br = clf_br.predict(X_test)

In [96]:
accuracy_score(y_test, y_pred_br)

0.83
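
The three variants differ in what they assume about each feature: `GaussianNB` models it as a continuous, normally distributed value, `MultinomialNB` as a count, and `BernoulliNB` as a present/absent flag (it binarizes counts at 0 by default). Since the fit, predict and score steps are identical for all three, they can be run in a loop. A sketch on a hypothetical toy corpus, far too small to say anything about real accuracy:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy data: 1 = positive, 0 = negative.
train_texts = ["great fun film", "love great plot", "bad bore film", "hate bad plot"]
train_labels = [1, 1, 0, 0]
test_texts = ["great film", "bad plot"]
test_labels = [1, 0]

vectorizer = CountVectorizer()
X_train_toy = vectorizer.fit_transform(train_texts).toarray()
X_test_toy = vectorizer.transform(test_texts).toarray()

# Fit and score each Naive Bayes variant on the same features.
scores = {}
for model in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    model.fit(X_train_toy, train_labels)
    predictions = model.predict(X_test_toy)
    scores[type(model).__name__] = accuracy_score(test_labels, predictions)

print(scores)
```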

#### Training with full dataset

In [97]:
X = cv.fit_transform(data['review']).toarray()

In [98]:
X.shape

(50000, 1000)

In [99]:
y = data.iloc[:,-1].values

In [100]:
y.shape

(50000,)

In [101]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=42)

In [102]:
clf_ga.fit(X_train, y_train)

In [104]:
y_pred_ga = clf_ga.predict(X_test)
accuracy_score(y_test, y_pred_ga)

0.77824

In [105]:
clf_mul = MultinomialNB()
clf_mul.fit(X_train, y_train)

In [106]:
y_pred_mul = clf_mul.predict(X_test)
accuracy_score(y_test, y_pred_mul)

0.82536

In [107]:
clf_br = BernoulliNB()
clf_br.fit(X_train, y_train)

In [108]:
y_pred_br = clf_br.predict(X_test)
accuracy_score(y_test, y_pred_br)

0.82504
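
In the cells above the vectorizer is fitted on every review before the split, so the vocabulary is chosen with knowledge of the test reviews, a mild form of leakage. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, which fits the vocabulary on the training portion only. The sketch below uses a hypothetical toy corpus; with the notebook's data, `texts` and `labels` would be `data['review']` and `data['sentiment']`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus of cleaned reviews: 1 = positive, 0 = negative.
texts = ["great fun film", "love great plot", "wonder act", "great stori",
         "bad bore film", "hate bad plot", "poor act", "bad stori"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the test reviews stay unseen.
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

model = Pipeline([
    ("vectorize", CountVectorizer(max_features=1000)),
    ("classify", MultinomialNB()),
])
model.fit(text_train, label_train)  # the vocabulary comes from training text only

print(model.score(text_test, label_test))
```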