# Sentiment Analysis

We perform sentiment analysis on the movie_reviews dataset available through nltk.corpus. A TF-IDF vectorizer turns the reviews into features, and a multinomial naive Bayes classifier is trained on them to predict the sentiment of each review.

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import movie_reviews

**Extracting the reviews (features) from the movie_reviews dataset**

In [2]:
print(movie_reviews.categories())

['neg', 'pos']


In [3]:
pos_rev = movie_reviews.fileids('pos')
len(pos_rev)

1000

In [4]:
neg_rev = movie_reviews.fileids('neg')
len(neg_rev)

1000

In [5]:
rev_list = []

**Applying some general pre-processing steps**

In [6]:
for rev in pos_rev:
    rev_text_pos = nltk.corpus.movie_reviews.words(rev)
    review_one_string = " ".join(rev_text_pos)
    review_one_string = review_one_string.replace(' ,' , ',')
    review_one_string = review_one_string.replace(' .' , '.')
    review_one_string = review_one_string.replace("\' " , "'")
    review_one_string = review_one_string.replace(" \'", "'")
    rev_list.append(review_one_string)

In [7]:
for rev in neg_rev:
    rev_text_neg = nltk.corpus.movie_reviews.words(rev)
    review_one_string = " ".join(rev_text_neg)
    review_one_string = review_one_string.replace(' ,' , ',')
    review_one_string = review_one_string.replace(' .' , '.')
    review_one_string = review_one_string.replace("\' " , "'")
    review_one_string = review_one_string.replace(" \'", "'")
    rev_list.append(review_one_string)
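
The two loops above apply the same clean-up to each class. A minimal sketch of how that step could be factored into one helper follows; `join_tokens` and the sample token list are hypothetical names introduced for illustration and are not part of the original notebook.

```python
def join_tokens(words):
    """Join corpus tokens into one string and re-attach detached punctuation."""
    text = " ".join(words)
    text = text.replace(" ,", ",")   # "mail ," -> "mail,"
    text = text.replace(" .", ".")   # "works ." -> "works."
    text = text.replace("' ", "'")   # "' ve" -> "'ve"
    text = text.replace(" '", "'")   # "you 've" -> "you've"
    return text

# Hypothetical token list in the style returned by movie_reviews.words(fileid)
sample = ["you", "'", "ve", "got", "mail", ",", "it", "works", "."]
cleaned = join_tokens(sample)
print(cleaned)  # you've got mail, it works.
```

With the corpus loaded, calling `join_tokens(movie_reviews.words(rev))` inside a single loop over the positive and then the negative file ids should build the same `rev_list` as the two cells above.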

In [8]:
len(rev_list)

2000

In [9]:
pd.Series(rev_list)

0       films adapted from comic books have had plenty...
1       every now and then a movie comes along from a ...
2       you've got mail works alot better than it dese...
3       " jaws " is a rare film that grabs your attent...
4       moviemaking is a lot like being the general ma...
                              ...                        
1995    if anything, " stigmata " should be taken as a...
1996    john boorman's " zardoz " is a goofy cinematic...
1997    the kids in the hall are an acquired taste. it...
1998    there was a time when john carpenter was a gre...
1999    two party guys bob their heads to haddaway's d...
Length: 2000, dtype: object

**Creating the sentiment (target) column**

In [10]:
pos_targets = np.ones((1000,), dtype=int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent
len(pos_targets)

1000

In [11]:
neg_targets = np.zeros((1000,), dtype=int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent
len(neg_targets)

1000

In [12]:
target_list = []

In [13]:
for pos_tar in pos_targets:
    target_list.append(pos_tar)

In [14]:
for neg_tar in neg_targets:
    target_list.append(neg_tar)

In [15]:
len(target_list)

2000

In [16]:
pd.Series(target_list)

0       1
1       1
2       1
3       1
4       1
       ..
1995    0
1996    0
1997    0
1998    0
1999    0
Length: 2000, dtype: int64
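
The target column built in the cells above is just 1000 ones followed by 1000 zeros, so it can also be produced without explicit loops. The sketch below assumes the positive reviews come first, as they do in `rev_list`; `build_targets` is a hypothetical helper name.

```python
def build_targets(n_pos, n_neg):
    """Return labels: 1 for each positive review, then 0 for each negative one."""
    return [1] * n_pos + [0] * n_neg

# In the notebook the sizes would be len(pos_rev) and len(neg_rev)
target_list = build_targets(1000, 1000)
print(len(target_list), sum(target_list))  # 2000 1000
```

Passing the lengths in, rather than hard-coding 1000, keeps the labels aligned with the reviews if the corpus size ever changes.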

**Creating the Dataframe**

In [17]:
Data = pd.DataFrame(data={"Reviews":rev_list,"Sentiment":target_list})
Data

| | Reviews | Sentiment |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you've got mail works alot better than it dese... | 1 |
| 3 | " jaws " is a rare film that grabs your attent... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |
| ... | ... | ... |
| 1995 | if anything, " stigmata " should be taken as a... | 0 |
| 1996 | john boorman's " zardoz " is a goofy cinematic... | 0 |
| 1997 | the kids in the hall are an acquired taste. it... | 0 |
| 1998 | there was a time when john carpenter was a gre... | 0 |


**Performing some more pre-processing steps**

In [18]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes each membership test O(1)

def remove_stopwords(text):
    # Drop every whitespace-separated token that is an English stopword
    no_stopword_text = [w for w in text.split() if w not in stop]
    return ' '.join(no_stopword_text)

Data["Reviews"] = Data["Reviews"].apply(remove_stopwords)
Data["Reviews"]

0       films adapted comic books plenty success, whet...
1       every movie comes along suspect studio, every ...
2       got mail works alot better deserves to. order ...
3       " jaws " rare film grabs attention shows singl...
4       moviemaking lot like general manager nfl team ...
                              ...                        
1996    john boorman's " zardoz " goofy cinematic deba...
1997    kids hall acquired taste. took least season wa...
1998    time john carpenter great horror director. cou...
1999    two party guys bob heads haddaway's dance hit ...
Name: Reviews, Length: 2000, dtype: object
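
In the output above, row 2 still contains "to." because the token carries a trailing period and therefore does not match the stopword "to". The sketch below shows one possible way to handle this, assuming it is acceptable to strip surrounding punctuation before the lookup. `TOY_STOP` is a hypothetical stand-in for the NLTK stopword list so that the example runs without any downloads, and `remove_stopwords_punct` is likewise a hypothetical name.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english'))
TOY_STOP = {"than", "it", "to", "in"}

def remove_stopwords_punct(text, stop=TOY_STOP):
    """Drop tokens whose punctuation-stripped form is a stopword."""
    kept = []
    for token in text.split():
        core = token.strip(string.punctuation)  # "to." -> "to"
        if core not in stop:
            kept.append(token)
    return " ".join(kept)

result = remove_stopwords_punct("works better than it deserves to. in order")
print(result)  # works better deserves order
```

Note the trade-off: removing "to." also removes the sentence-ending period attached to it, which matters only if later steps rely on sentence boundaries.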

In [19]:
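from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # Tokenize, lemmatize each token (as a noun, the default), and re-join
    word_tokens = word_tokenize(text)
    lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in word_tokens]
    return " ".join(lemmatized_word)

# Note: the result of apply() is not assigned back, so Data["Reviews"] is left
# unchanged and the output below matches the previous cell. To use the lemmatized
# text, assign it: Data["Reviews"] = Data["Reviews"].apply(lemmatize)
Data["Reviews"].apply(lemmatize)
Data["Reviews"]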

0       films adapted comic books plenty success, whet...
1       every movie comes along suspect studio, every ...
2       got mail works alot better deserves to. order ...
3       " jaws " rare film grabs attention shows singl...
4       moviemaking lot like general manager nfl team ...
                              ...                        
1996    john boorman's " zardoz " goofy cinematic deba...
1997    kids hall acquired taste. took least season wa...
1998    time john carpenter great horror director. cou...
1999    two party guys bob heads haddaway's dance hit ...
Name: Reviews, Length: 2000, dtype: object

**Applying TF-IDF vectorizer on the features**

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
TFIDF_vect = TfidfVectorizer(ngram_range = (1,2),max_df=0.3,min_df=7)
X = TFIDF_vect.fit_transform(Data["Reviews"])
X_names = TFIDF_vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
X = pd.DataFrame(X.toarray(),columns=X_names)
X

Unnamed: 0,00,000,007,10,10 000,10 10,10 minutes,10 scale,10 things,10 year,...,zest,zeta,zeta jones,zingers,zombie,zombies,zone,zooms,zucker,zwick
0,0.069403,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.000000,0.02576,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1996,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1997,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1998,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
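
The vectorizer above uses `min_df=7` and `max_df=0.3`, which drop terms that appear in fewer than 7 reviews or in more than 30% of them. The sketch below illustrates that document-frequency filter on a toy corpus. It is a simplified illustration, not scikit-learn's implementation: it assumes whitespace tokenization and unigrams only, and `prune_vocabulary` and `docs` are hypothetical names.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df=0.5):
    """Keep terms found in at least min_df documents and in at most
    a max_df fraction of all documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # set() so a term is counted once per document, however often it occurs
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, count in doc_freq.items()
        if count >= min_df and count <= max_df * n_docs
    )

docs = [
    "great film great cast",
    "dull film",
    "great plot",
    "dull plot film",
    "odd film",
]
vocab = prune_vocabulary(docs, min_df=2, max_df=0.5)
print(vocab)  # ['dull', 'great', 'plot']
```

Here "film" is dropped for being too common (4 of 5 documents) and "cast" and "odd" for being too rare (1 document each), which mirrors how the two thresholds trim both ends of the vocabulary.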


In [21]:
Y = Data["Sentiment"]

**Splitting the data and then applying multinomial naive bayes algorithm on it**

In [22]:
from sklearn.model_selection import train_test_split
Xtrain,Xtest,Ytrain,Ytest = train_test_split(X,Y,test_size=0.25,random_state=42)

In [23]:
from sklearn.naive_bayes import MultinomialNB  
NB_model = MultinomialNB()
NB_model.fit(Xtrain,Ytrain)

MultinomialNB()

In [24]:
Ypred = NB_model.predict(Xtest)

In [25]:
from sklearn.metrics import confusion_matrix
CM = confusion_matrix(Ytest,Ypred)
CM

array([[200,  43],
       [ 56, 201]], dtype=int64)

In [26]:
from sklearn.metrics import accuracy_score
AC = accuracy_score(Ytest,Ypred)
AC

0.802
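
The accuracy can be read directly off the confusion matrix above. In scikit-learn's layout, rows are the true classes and columns the predicted classes, with labels in sorted order (0 = negative, 1 = positive). The following worked arithmetic uses the reported counts; precision and recall are added for illustration and were not computed in the original notebook.

```python
cm = [[200, 43],
      [56, 201]]  # rows: true class (0, 1); columns: predicted class (0, 1)

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)  # of the reviews predicted positive, how many were positive
recall = tp / (tp + fn)     # of the truly positive reviews, how many were found

print(total, round(accuracy, 3), round(precision, 3), round(recall, 3))
# 500 0.802 0.824 0.782
```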

**Tuning the smoothing hyperparameter alpha with grid search and 5-fold cross-validation in an attempt to improve the model's accuracy**

In [27]:
NB2_model = MultinomialNB()
# GridSearchCV clones and refits the estimator itself, so this fit is optional
NB2_model.fit(X,Y)

MultinomialNB()

In [28]:
from sklearn.model_selection import GridSearchCV
param_grid ={"alpha":[0.2,0.5,0.7,1.0,1.2,1.5,1.7,2.0,2.5,3.0]}
GD_obj = GridSearchCV(NB2_model,param_grid,cv=5)
GD_obj.fit(X,Y)

GridSearchCV(cv=5, estimator=MultinomialNB(),
             param_grid={'alpha': [0.2, 0.5, 0.7, 1.0, 1.2, 1.5, 1.7, 2.0, 2.5,
                                   3.0]})

In [29]:
GD_obj.best_score_

0.833

In [30]:
GD_obj.best_estimator_

MultinomialNB(alpha=2.0)
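
The `alpha` selected above is the additive smoothing parameter of multinomial naive Bayes: it is added to every term's per-class total so that a term never seen in a class does not get probability zero. A minimal sketch of that estimate follows, with made-up counts; `smoothed_likelihood` and its arguments are hypothetical names for illustration.

```python
def smoothed_likelihood(term_total, class_total, vocab_size, alpha):
    """Additive-smoothing estimate of P(term | class)."""
    return (term_total + alpha) / (class_total + alpha * vocab_size)

# Made-up numbers: a class with total weight 10 spread over a 5-term vocabulary
for alpha in (0.2, 1.0, 2.0):
    unseen = smoothed_likelihood(0, 10, 5, alpha)  # term never seen in the class
    seen = smoothed_likelihood(4, 10, 5, alpha)    # term with weight 4 in the class
    print(alpha, round(unseen, 3), round(seen, 3))
# 0.2 0.018 0.382
# 1.0 0.067 0.333
# 2.0 0.1 0.3
```

Larger values of `alpha` pull all term probabilities toward a uniform distribution, which can help when many features are rare, as is typical for bigram TF-IDF features.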

In [31]:
NB_optimized = MultinomialNB(alpha=2.0)
NB_optimized.fit(Xtrain,Ytrain)
Ypred_NB2 = NB_optimized.predict(Xtest)

In [32]:
CM_NB2 = confusion_matrix(Ytest,Ypred_NB2)
CM_NB2

array([[202,  41],
       [ 57, 200]], dtype=int64)

In [33]:
AC_NB2 = accuracy_score(Ytest,Ypred_NB2)
AC_NB2

0.804
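
In the cells above, both the TF-IDF vectorizer and the grid search are fit on all 2000 reviews, including the 500 that are later used for testing, so the reported scores may be somewhat optimistic. A sketch of one way to avoid this follows: split first, then wrap the vectorizer and classifier in a `Pipeline` so that cross-validation refits both on training folds only. The toy `texts` and `labels`, the step names `tfidf` and `nb`, and the small alpha grid are assumptions made so that the example is self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews standing in for Data["Reviews"] / Data["Sentiment"]
texts = [
    "great film wonderful cast", "wonderful story great acting",
    "great fun loved it", "loved the wonderful plot",
    "brilliant film loved it", "great acting brilliant story",
    "dull film terrible cast", "terrible story awful acting",
    "awful boring mess", "boring plot dull script",
    "terrible film hated it", "awful acting dull story",
]
labels = [1] * 6 + [0] * 6

# Split the raw text first so the test reviews never influence the vocabulary
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("nb", MultinomialNB()),
])

# "nb__alpha" addresses the alpha parameter of the "nb" pipeline step
search = GridSearchCV(pipe, {"nb__alpha": [0.5, 1.0, 2.0]}, cv=2)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)  # accuracy on held-out reviews
print(search.best_params_)
print(test_accuracy)
```

On the real data, the same structure would use the notebook's `ngram_range`, `max_df`, `min_df`, alpha grid, and `cv=5`. The toy corpus is far too small for its scores to mean anything.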