<h3> Sentiment analysis of IMDB movie reviews</h3>

In [1]:
#Importing the required libraries
import nltk
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from textblob import Word
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn import svm 
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
import spacy


In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Santhosh
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\Santhosh
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
#loading the train dataset
dataset_train = pd.read_csv(r'G:\machine_learning\sentiment\Train.csv')  # raw string keeps the backslashes literal
dataset_train.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


In [3]:
#loading the test dataset
dataset_test = pd.read_csv(r'G:\machine_learning\sentiment\Test.csv')  # raw string keeps the backslashes literal
dataset_test.head()

Unnamed: 0,text,label
0,I always wrote this series off as being a comp...,0
1,1st watched 12/7/2002 - 3 out of 10(Dir-Steve ...,0
2,This movie was so poorly written and directed ...,0
3,The most interesting thing about Miryang (Secr...,1
4,"when i first read about ""berlin am meer"" i did...",0


- <b> Data preprocessing - Lowercasing the text, removing special characters, numbers and stop words, then lemmatizing each remaining word.</b>

In [4]:
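#Cleaning steps to remove noise from the dataset
def pre_processing(df, stop_words):
    stop_words = set(stop_words)  # set membership is faster than scanning a list
    # Lowercase every token
    df['text'] = df['text'].apply(lambda text: " ".join(w.lower() for w in text.split()))
    # Strip punctuation, then digits; regex=True is needed because pandas >= 2.0 defaults to literal matching
    df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
    df['text'] = df['text'].str.replace(r'\d', '', regex=True)
    # Drop stop words, then lemmatize each remaining token with TextBlob
    df['text'] = df['text'].apply(lambda text: " ".join(w for w in text.split() if w not in stop_words))
    df['text'] = df['text'].apply(lambda text: " ".join(Word(w).lemmatize() for w in text.split()))
    return df  # note: df is modified in place and also returned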

In [5]:
#Removing noise from the train dataset
stop_words = stopwords.words('english')
dataset_train_v2 = pre_processing(dataset_train,stop_words)
dataset_train_v2.head()

Unnamed: 0,text,label
0,grew b watching loving thunderbird mate school...,0
1,put movie dvd player sat coke chip expectation...,0
2,people know particular time past like feel nee...,0
3,even though great interest biblical movie bore...,0
4,im die hard dad army fan nothing ever change g...,1


In [6]:
#Removing noise from the test dataset
dataset_test_v2 = pre_processing(dataset_test, stop_words)
dataset_test_v2.head()

Unnamed: 0,text,label
0,always wrote series complete stinkfest jim bel...,0
1,st watched dirsteve purcell typical mary kate ...,0
2,movie poorly written directed fell asleep minu...,0
3,interesting thing miryang secret sunshine acto...,1
4,first read berlin meer didnt expect much thoug...,0
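The cleaning steps can be traced on a single string. This is a minimal sketch using only the standard library: the `TOY_STOP_WORDS` set and the `clean_text` helper are hypothetical stand-ins (the notebook uses NLTK's English stop-word list), and the lemmatization step is left out.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses nltk.corpus.stopwords.words('english').
TOY_STOP_WORDS = {"i", "the", "and", "up", "a", "this"}

def clean_text(text, stop_words):
    """Lowercase, drop punctuation and digits, then drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    text = re.sub(r"\d", "", text)        # remove digits
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_text("I grew up (b. 1965) watching and loving the show!", TOY_STOP_WORDS))
# prints: grew b watching loving show
```

The stray `b` left over from "(b. 1965)" also appears in the cleaned train data above, because punctuation and digits are removed without dropping the single letters they leave behind.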


In [11]:
train_reviews = dataset_train_v2.text[:]
train_sentiments = dataset_train_v2.label[:]

In [12]:
test_reviews = dataset_test_v2.text[:]
test_sentiments = dataset_test_v2.label[:]

- <b> Encoding the text data using TF-IDF vectorizer</b>

In [13]:
vec = TfidfVectorizer(min_df = 4, max_df = 0.9)
train_vec = vec.fit_transform(train_reviews)
test_vec = vec.transform(test_reviews)
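To make the vectorizer settings concrete, here is a hand-rolled sketch of TF-IDF on a toy corpus. It assumes scikit-learn's documented defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows); the `tfidf_rows` helper and the toy documents are hypothetical, and tokenisation is simplified to whitespace splitting. An integer `min_df` is an absolute document count, while a float `max_df` is a proportion of documents.

```python
import math
from collections import Counter

def tfidf_rows(docs, min_df=1, max_df=1.0):
    """Hand-rolled TF-IDF: smoothed idf and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Keep terms seen in at least min_df documents and in at most max_df of them.
    vocab = sorted(t for t, c in df.items() if c >= min_df and c / n_docs <= max_df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Hypothetical toy corpus, not the notebook's data.
toy_docs = ["great movie great cast", "boring movie", "great fun"]
vocab, rows = tfidf_rows(toy_docs, min_df=2, max_df=0.9)
print(vocab)
print(rows[0])
```

With `min_df=2`, the words that occur in only one toy document are dropped from the vocabulary. The notebook's `min_df = 4` applies the same filter to rare words in the review corpus, and `max_df = 0.9` drops words that appear in more than 90% of the reviews.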

- <b> SVM model to classify sentiments</b> 

In [14]:
model = svm.SVC(kernel ='linear')
model.fit(train_vec,train_sentiments)
prediction = model.predict(test_vec)

In [15]:
accr_score = accuracy_score(test_sentiments,prediction)
print("Accuracy score of the model:",accr_score)

Accuracy score of the model: 0.8952


In [16]:
# Labels are sorted as [0, 1] and 0 means negative, so the names must follow that order
svm_report=classification_report(test_sentiments,prediction,target_names=['Negative','Positive'])
print(svm_report)

             precision    recall  f1-score   support

   Negative       0.90      0.89      0.89      2495
   Positive       0.89      0.90      0.90      2505

avg / total       0.90      0.90      0.90      5000
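The precision, recall and F1 columns of the report are derived per class from counts of true positives, false positives and false negatives. The sketch below shows that arithmetic with the standard library only; the `per_class_metrics` helper and the toy labels are hypothetical and are not taken from the notebook's data.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall and F1 for one class, computed from raw counts."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy labels (0 = negative, 1 = positive).
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 1, 0, 1, 0]
for cls in (0, 1):
    print(cls, per_class_metrics(toy_true, toy_pred, cls))
```

The support column is the number of true examples of each class in the test set.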



- <b> Naive Bayes Model to classify sentiments</b>

In [17]:
mnb_model = MultinomialNB()
mnb_model.fit(train_vec,train_sentiments)
prediction_mnb = mnb_model.predict(test_vec)
print(prediction_mnb)

[0 0 0 ... 0 0 1]


In [18]:
accr_score = accuracy_score(test_sentiments,prediction_mnb)
print("Accuracy score of the model:",accr_score)

Accuracy score of the model: 0.866


In [19]:
# Labels are sorted as [0, 1] and 0 means negative, so the names must follow that order
mnb_report=classification_report(test_sentiments,prediction_mnb,target_names=['Negative','Positive'])
print(mnb_report)

             precision    recall  f1-score   support

   Negative       0.86      0.88      0.87      2495
   Positive       0.87      0.85      0.86      2505

avg / total       0.87      0.87      0.87      5000
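For intuition about what the Naive Bayes classifier estimates, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing (`alpha=1.0`, which is also the scikit-learn default). It is an illustration under simplifying assumptions: it works on raw word counts rather than the TF-IDF weights used in the notebook, and the `train_nb` and `predict_nb` helpers and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(text, log_prior, log_lik):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in text.split() if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy corpus (0 = negative, 1 = positive), not the notebook's data.
toy_docs = ["boring waste film", "utter rubbish film", "great fun film", "loved great cast"]
toy_labels = [0, 0, 1, 1]
log_prior, log_lik = train_nb(toy_docs, toy_labels)
print(predict_nb("great film", log_prior, log_lik))  # prints 1
```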



- <b> Gradient boosting classifier to classify sentiments</b>

In [20]:
gb = GradientBoostingClassifier(n_estimators=20,verbose=2)
gb.fit(train_vec,train_sentiments)
prediction_gb = gb.predict(test_vec)

      Iter       Train Loss   Remaining Time 
         1           1.3551           51.79s
         2           1.3294           48.15s
         3           1.3074           45.52s
         4           1.2889           42.64s
         5           1.2727           40.60s
         6           1.2580           37.60s
         7           1.2445           34.86s
         8           1.2323           32.00s
         9           1.2210           29.12s
        10           1.2103           26.44s
        11           1.2008           23.95s
        12           1.1913           21.24s
        13           1.1823           18.59s
        14           1.1736           15.89s
        15           1.1653           13.24s
        16           1.1575           10.58s
        17           1.1505            7.99s
        18           1.1433            5.31s
        19           1.1367            2.64s
        20           1.1300            0.00s


In [21]:
accr_score = accuracy_score(test_sentiments,prediction_gb)
print("Accuracy score of the model:",accr_score)

Accuracy score of the model: 0.733


In [22]:
# Labels are sorted as [0, 1] and 0 means negative, so the names must follow that order
gb_report=classification_report(test_sentiments,prediction_gb,target_names=['Negative','Positive'])
print(gb_report)

             precision    recall  f1-score   support

   Negative       0.81      0.61      0.69      2495
   Positive       0.69      0.86      0.76      2505

avg / total       0.75      0.73      0.73      5000



- <b> Text Categorizer model to classify sentiments </b>

In [8]:
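# Note: this cell and the training cells that follow use the spaCy v2 API (create_pipe with a config dict)
nlp_model = spacy.blank('en')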
text_cat = nlp_model.create_pipe("textcat", config={"exclusive_classes": True, "architecture": "bow"})
nlp_model.add_pipe(text_cat)

In [13]:
train_nlp = dataset_train_v2
train_nlp['label'] = np.where(train_nlp['label'] == 0, "negative", "positive")
train_nlp.head()

Unnamed: 0,text,label
0,grew b watching loving thunderbird mate school...,negative
1,put movie dvd player sat coke chip expectation...,negative
2,people know particular time past like feel nee...,negative
3,even though great interest biblical movie bore...,negative
4,im die hard dad army fan nothing ever change g...,positive


In [14]:
text_cat.add_label("negative")
text_cat.add_label("positive")

1

In [17]:
train_text  = train_nlp['text'].values
train_labels = [{'cats': {'negative': label == 'negative',
                          'positive': label == 'positive'}} 
                for label in train_nlp['label']]

train_data = list(zip(train_text, train_labels))
train_data[:2]

[('grew b watching loving thunderbird mate school watched played thunderbird school lunch school wanted virgil scott one wanted alan counting became art form took child see movie hoping would get glimpse loved child bitterly disappointing high point snappy theme tune could compare original score thunderbird thankfully early saturday morning one television channel still play rerun series gerry anderson wife created jonatha frakes hand director chair version completely hopeless waste film utter rubbish cgi remake may acceptable replacing marionette homo sapiens subsp sapiens huge error judgment',
  {'cats': {'negative': True, 'positive': False}}),
 ('put movie dvd player sat coke chip expectation hoping movie would contain strongpoints first movie awsome animation good flowing story excellent voice cast funny comedy kickass soundtrack disappointment found atlantis milo return read review first might let following paragraph directed seen first movie enjoyed primarily point mentionedbr br 

In [18]:
from spacy.util import minibatch
import random
random.seed(1)
spacy.util.fix_random_seed(1)
nlp_optimizer = nlp_model.begin_training()

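# Created once outside the loop, so spaCy keeps adding to it:
# the values printed after each epoch are running totals, not per-epoch losses
losses = {}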

for epoch in range(10):
    random.shuffle(train_data)
    batches = minibatch(train_data, size=8)
    for batch in batches:
        texts, labels = zip(*batch)
        nlp_model.update(texts, labels, sgd=nlp_optimizer, losses=losses)
    print(losses)


{'textcat': 14.852532352518665}
{'textcat': 22.629297904982288}
{'textcat': 27.8142326073555}
{'textcat': 31.501532514833542}
{'textcat': 34.14742987158691}
{'textcat': 36.174788963063314}
{'textcat': 37.7906354758264}
{'textcat': 39.15777847441762}
{'textcat': 40.2595147863914}
{'textcat': 41.16428734303316}
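The printed losses grow from epoch to epoch, which is consistent with the `losses` dictionary being created once and accumulated across all epochs. Under that assumption, the per-epoch loss is the difference between consecutive printed totals. A small sketch, using the printed values rounded to three decimals:

```python
# Cumulative 'textcat' losses as printed after each epoch (rounded to 3 decimals).
cumulative = [14.853, 22.629, 27.814, 31.502, 34.147,
              36.175, 37.791, 39.158, 40.260, 41.164]

# Per-epoch loss = difference between consecutive cumulative totals.
per_epoch = [cumulative[0]] + [
    round(b - a, 3) for a, b in zip(cumulative, cumulative[1:])
]
print(per_epoch)
```

Read this way, the loss added in each epoch shrinks steadily, from about 14.9 in the first epoch to under 1.0 in the tenth. Resetting `losses = {}` at the start of each epoch would print these per-epoch values directly.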


In [20]:
test_nlp = dataset_test_v2
test_nlp['label'] = np.where(test_nlp['label'] == 0, "negative", "positive") 
test_nlp.head()

Unnamed: 0,text,label
0,always wrote series complete stinkfest jim bel...,negative
1,st watched dirsteve purcell typical mary kate ...,negative
2,movie poorly written directed fell asleep minu...,negative
3,interesting thing miryang secret sunshine acto...,positive
4,first read berlin meer didnt expect much thoug...,negative


In [22]:
test_data = list(test_nlp['text'])
test_data[:3]

['always wrote series complete stinkfest jim belushi involved heavily one day tragic happenstance occurred white sox game ended realized remote way side room somehow could gotten walked across room get remote even tv turn channel get walk across country watch tv another state nut said decided hang tight couch take whatever fate store fate store episode show episode remember little except made broad general sweeping blanket judgment based zero objective experiential evidence nothing whatsoever back opinion completely right show total crudpie belushi comedic delivery hairy lighthouse foghorn woman physically attractive stepfordis elicit real feeling viewer absolutely reason stop running local tv station gasoline flamethrower sending every copy mutt howling back hell br br except br br except wonderful comic sty ling larry joe campbell america greatest comic character actor guy play belushis brotherinlaw andy gold good really well aside funny job make belushi look good thats like trying m

In [33]:
docs = [nlp_model.tokenizer(text) for text in test_data]
textcat = nlp_model.get_pipe('textcat')
scores, _ = textcat.predict(docs)
print(scores)

[[1.0022707e-01 8.9977288e-01]
 [9.9481165e-01 5.1883864e-03]
 [1.0000000e+00 1.2402998e-11]
 ...
 [9.7325277e-01 2.6747242e-02]
 [1.0000000e+00 1.4417805e-08]
 [9.4097376e-01 5.9026275e-02]]


In [36]:
predicted_labels = scores.argmax(axis=1)
prediction_result = [textcat.labels[label] for label in predicted_labels]
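The `argmax` step picks, for each review, the column with the highest score and maps its index back to a label name. The sketch below repeats that logic with plain Python on the first three score rows printed above (rounded). It assumes the score columns follow the order in which the labels were added, "negative" and then "positive"; the `argmax` helper is hypothetical.

```python
# First three score rows as printed above (rounded); assumed column order:
# "negative", then "positive" (the order in which the labels were added).
labels = ("negative", "positive")
sample_scores = [
    [0.100, 0.900],
    [0.995, 0.005],
    [1.000, 0.000],
]

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

predicted = [labels[argmax(row)] for row in sample_scores]
print(predicted)  # prints ['positive', 'negative', 'negative']
```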

In [37]:
test_data_labels = list(test_nlp['label'])
# test_data_labels[:3]
accr_score = accuracy_score(test_data_labels,prediction_result)
print("Accuracy score of the model:",accr_score)

Accuracy score of the model: 0.878


<b>Conclusion</b>
 -  Accuracy of the <b>SVM model</b> is 89.5%.
 -  Accuracy of the <b>text categorizer model</b> is 87.8%.
 -  The SVM and text categorizer models perform better on this dataset than Multinomial Naive Bayes (86.6%) and gradient boosting (73.3%).
 -  Accuracy might be improved further with neural network and deep learning methods.