# Sentiment Analysis: The ML Way

#You are given short reviews of some movies. Each review speaks well or badly of the movie. A human reader can identify the sentiment of a text from the words in the sentence. How can we make a machine understand the sentiment in the text?

#One way is the ML way. A ground truth is created for some corpus, i.e. we have both positive and negative reviews tagged with their respective class. This forms the base: the algorithm is trained on this data (after converting it to a structured form), and classification is done based on the words used (the machine tries to learn a pattern from the data).

#Another way is the dictionary approach, where we create a dictionary of positive and negative words and explicitly state which words are positive and which are negative. We can then count the positive and negative words in a sentence and compute a score. If the score is positive, the sentence is positive; otherwise it is negative.

#In either case, manual work is involved (creating the ground truth in case 1, or creating the dictionary in case 2).
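
The dictionary approach described above can be sketched in a few lines. This is only an illustration: the word lists and the helper names `lexicon_score` and `lexicon_label` are made up here, and a usable lexicon would be far larger.

```python
# Hypothetical, tiny word lists for illustration only.
POSITIVE_WORDS = {"good", "great", "fun", "honest", "rare"}
NEGATIVE_WORDS = {"bad", "silly", "tedious", "stale", "sour"}


def lexicon_score(text):
    """Return (#positive words - #negative words) for a whitespace-tokenized text."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    return pos - neg


def lexicon_label(text):
    """Label a text 'pos' if its score is above zero, otherwise 'neg'."""
    return "pos" if lexicon_score(text) > 0 else "neg"


print(lexicon_label("simplistic , silly and tedious ."))
print(lexicon_label("wasabi is a good place to start ."))
```

Note that a score of exactly zero falls into the negative class here, which follows the rule stated above; a neutral class would be another reasonable choice.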

In [1]:
import re
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix

In [2]:
# Read the positive reviews; the context manager closes the file after reading.
with open("short_reviews/positive.txt", "r", encoding="latin-1") as f1:   # "r" is for reading
    short_pos = f1.readlines()

In [3]:
short_pos[1]

'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . \n'

In [4]:
type(short_pos)

list

In [5]:
short_pos = [re.sub("\n","",i) for i in short_pos]

x_short_pos = short_pos[:1000]

In [6]:
# Read the negative reviews; the context manager closes the file after reading.
with open("short_reviews/negative.txt", "r", encoding="latin-1") as f2:
    short_neg = f2.readlines()

short_neg = [re.sub("\n", "", i) for i in short_neg]

x_short_neg = short_neg[:1000]

In [7]:
short_neg[:1000]

['simplistic , silly and tedious . ',
 "it's so laddish and juvenile , only teenage boys could possibly find it funny . ",
 'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . ',
 '[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . ',
 'a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . ',
 "the story is also as unoriginal as they come , already having been recycled more times than i'd care to count . ",
 "about the only thing to give the movie points for is bravado -- to take an entirely stale concept and push it through the audience's meat grinder one more time . ",
 'not so much farcical as sour . ',
 'unfortunately the story and the actors are served with a hack script . ',
 'all the more disquieting for its relatively gore-free allusions to the serial murders , but it

In [7]:
type(short_neg)

list

In [8]:
short_pos[:1000]

['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . ',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . ',
 'effective but too-tepid biopic',
 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start . ',
 "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . ",
 'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game . ',
 'offers that rare combination of entertainment and education . ',
 'perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions . ',
 "steers 

In [9]:
cv = CountVectorizer(stop_words='english',lowercase=True, strip_accents='unicode',decode_error='ignore')

data = x_short_pos + x_short_neg

tdm = cv.fit_transform(data)

Mat = tdm.todense()

In [10]:
Mat.shape

(2000, 7399)
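
To make the shape above easier to read, here is a small sketch of what `CountVectorizer` produces on a two-review toy corpus. The names `toy_reviews`, `toy_cv` and `toy_counts` are introduced only for this example. Each row is a review, each column a vocabulary term, and each cell a word count.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = [
    "a good movie with a good cast",
    "a tedious and silly movie",
]

toy_cv = CountVectorizer(stop_words="english", lowercase=True)
toy_counts = toy_cv.fit_transform(toy_reviews)   # SciPy sparse matrix: reviews x terms

# vocabulary_ maps each kept term to its column index
print(sorted(toy_cv.vocabulary_))
print(toy_counts.toarray())
```

The result of `fit_transform` is sparse, and scikit-learn estimators such as `LogisticRegression` accept it directly, so the dense conversion is mainly useful for inspection.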

In [11]:
import pandas as pd

Mat = pd.DataFrame(Mat)

# Add the target variable y: the first 1000 rows are positive and the next 1000 are negative.
Mat['type'] = ['pos']*1000 + ['neg']*1000

Mat = Mat.sample(frac = 1,random_state=1234)

train = Mat.iloc[:1800]

test = Mat.iloc[1800:]
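
The shuffle-and-slice split above could also be written with scikit-learn's `train_test_split`. The sketch below uses stand-in data (`toy_X`, `toy_y`), not the notebook's matrix. The `stratify` argument is an addition that the manual split does not have: it keeps the pos/neg ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 20 feature rows with balanced labels, mirroring the pos/neg layout above.
toy_X = [[i, i % 3] for i in range(20)]
toy_y = ["pos"] * 10 + ["neg"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y,
    test_size=0.1,       # same 90/10 proportion as the 1800/200 split
    shuffle=True,
    stratify=toy_y,      # keep the pos/neg ratio equal in both parts
    random_state=1234,
)
print(len(X_train), len(X_test), sorted(y_test))
```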

In [12]:
test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7390,7391,7392,7393,7394,7395,7396,7397,7398,type
343,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pos
1691,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neg
1235,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neg
1342,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neg
1585,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neg
1337,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neg
211,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pos
1662,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neg
573,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pos
1472,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,neg


In [13]:
from sklearn.linear_model import LogisticRegression

In [14]:
logreg = LogisticRegression()

X = train.iloc[:,:-1].values

Y = train.iloc[:,-1].values

In [15]:
logreg.fit(X,Y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [16]:
test1 = test.iloc[:,:-1].values  # all columns except the last one (the target)

true = test.iloc[:,-1].values  # the last column holds the target

pred = logreg.predict(test1)

In [17]:
confusion_matrix(true, pred)

array([[58, 39],
       [26, 77]])

In [18]:
from sklearn.metrics import recall_score, precision_score, accuracy_score

acc = accuracy_score(true, pred)

rec = recall_score(true, pred, pos_label = 'neg')

prec = precision_score(true, pred, pos_label = 'neg')

print(acc)
print(rec)
print(prec)

0.675
0.5979381443298969
0.6904761904761905
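
The three numbers above follow directly from the confusion matrix. As a worked check in plain Python: rows are the true labels and columns the predictions, both in sorted label order ('neg', 'pos'), and 'neg' is treated as the class of interest, as in the `pos_label = 'neg'` calls.

```python
# Confusion matrix reported above: rows = true labels, columns = predictions,
# both ordered as ('neg', 'pos').
cm = [[58, 39],
      [26, 77]]

true_neg_pred_neg, true_neg_pred_pos = cm[0]
true_pos_pred_neg, true_pos_pred_pos = cm[1]
total = sum(sum(row) for row in cm)

accuracy = (true_neg_pred_neg + true_pos_pred_pos) / total
# Recall for 'neg': of all truly negative reviews, how many were predicted 'neg'.
recall_neg = true_neg_pred_neg / (true_neg_pred_neg + true_neg_pred_pos)
# Precision for 'neg': of all reviews predicted 'neg', how many were truly negative.
precision_neg = true_neg_pred_neg / (true_neg_pred_neg + true_pos_pred_neg)

print(accuracy, recall_neg, precision_neg)
```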


# Work with a TF-IDF vectorizer and check whether the accuracy improves

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
# max_df -> ignore tokens whose document frequency is above this threshold (a float is a proportion of documents)
# min_df -> ignore tokens whose document frequency is below this threshold
vectorizer = TfidfVectorizer(max_df = 0.8,stop_words='english')
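
A toy illustration of the two thresholds may help. The four documents and the names `high_cut` and `both_cuts` are invented for this sketch. The word "movie" occurs in every document, so its document frequency is 1.0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "movie great fun",
    "movie boring plot",
    "movie great cast",
    "movie dull plot",
]

# "movie" appears in 4/4 documents (document frequency 1.0 > 0.8), so max_df drops it.
high_cut = TfidfVectorizer(max_df=0.8)
high_cut.fit(toy_docs)
print(sorted(high_cut.vocabulary_))

# min_df=2 additionally drops tokens found in fewer than 2 documents.
both_cuts = TfidfVectorizer(max_df=0.8, min_df=2)
both_cuts.fit(toy_docs)
print(sorted(both_cuts.vocabulary_))
```

An integer threshold is read as an absolute document count, while a float is read as a fraction of the corpus.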

In [21]:
tdm = vectorizer.fit_transform(data)
Mat = tdm.todense()


In [22]:
Mat = pd.DataFrame(Mat)

# label the rows: the first 1000 are positive, the next 1000 negative
Mat['type'] = ['pos']*1000 + ['neg']*1000

# shuffling the data
Mat = Mat.sample(frac = 1,random_state=1234)

train = Mat.iloc[:1800]

test = Mat.iloc[1800:]

In [23]:
test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7396,7397,7398,7399,7400,7401,7402,7403,7404,type
343,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,pos
1691,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neg
1235,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neg
1342,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neg
1585,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neg
1337,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neg
211,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,pos
1662,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neg
573,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,pos
1472,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,neg


In [62]:
X = train.iloc[:,:-1].values

Y = train.iloc[:,-1].values

In [25]:
logreg.fit(X,Y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [26]:
test1 = test.iloc[:,:-1].values  # all columns except the last one (the target)

true = test.iloc[:,-1].values  # the last column holds the target

pred = logreg.predict(test1)

In [27]:
confusion_matrix(true, pred)

array([[61, 36],
       [25, 78]])

In [28]:
acc = accuracy_score(true, pred)

rec = recall_score(true, pred, pos_label = 'neg')

prec = precision_score(true, pred, pos_label = 'neg')

print(acc)
print(rec)
print(prec)

0.695
0.6288659793814433
0.7093023255813954
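
The scores above come from a single 200-review test set, and the vectorizer was fitted on all 2000 reviews before the split. One possible alternative, sketched here on a made-up corpus (`toy_texts`, `toy_labels`), is to put the vectorizer and the classifier in a pipeline and cross-validate it. The vectorizer is then refit inside every fold, and the result is an average over several test folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Stand-in corpus; in the notebook this would be `data` and its pos/neg labels.
toy_texts = [
    "great fun and honest", "rare and entertaining", "keenly observed and great",
    "a good place to start", "honest and entertaining fun", "great insight and fun",
    "silly and tedious", "stale and sour script", "tedious hack script",
    "narratively opaque and vapid", "unoriginal and stale", "silly sour and vapid",
]
toy_labels = ["pos"] * 6 + ["neg"] * 6

# The vectorizer is refit on the training part of each fold only.
model = make_pipeline(
    TfidfVectorizer(max_df=0.8, stop_words="english"),
    LogisticRegression(),
)
scores = cross_val_score(model, toy_texts, toy_labels, cv=3, scoring="accuracy")
print(scores.mean())
```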


# Work with any other classification model and check if you can improve the accuracies

In [29]:
from sklearn import svm

In [30]:
svm_linear = svm.SVC(kernel='linear') 
svm_linear.fit(X,Y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [33]:
test1 = test.iloc[:,:-1].values  # all columns except the last one (the target)

true = test.iloc[:,-1].values  # the last column holds the target

pred = svm_linear.predict(test1)

In [34]:
confusion_matrix(true, pred)

array([[59, 38],
       [27, 76]])

In [35]:
acc = accuracy_score(true, pred)

rec = recall_score(true, pred, pos_label = 'neg')

prec = precision_score(true, pred, pos_label = 'neg')

print(acc)
print(rec)
print(prec)

0.675
0.6082474226804123
0.686046511627907


#What else could be done to improve the accuracy?

Grid Search with SVM

In [63]:
type(X)
type(Y)


numpy.ndarray

In [74]:
# Y = pd.get_dummies(Y)
# Convert the labels to integers: 'pos' -> 1, 'neg' -> 0
import numpy as np
newY = np.where(Y == 'pos', 1, 0)
newY

array([0, 1, 0, ..., 1, 1, 0])

In [89]:
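# Grid Search with SVM

from sklearn.model_selection import GridSearchCV
Cs = [0.01, 0.1, 1]
gammas = [0.01, 0.1, 1]  # note: gamma is ignored by the linear kernel, so only C affects the fit
param_grid = {'C': Cs, 'gamma': gammas}
# class_weight favours class 1 ('pos') and scoring='recall' rewards finding every positive,
# so a model that labels every review as positive can reach the best score.
clf = GridSearchCV(svm.SVC(kernel='linear', class_weight={1: 2, 0: 0.5}), param_grid, n_jobs=4, scoring='recall')
clf.fit(X=X, y=newY)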

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight={1: 2, 0: 0.5}, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=4,
       param_grid={'C': [0.01, 0.1, 1], 'gamma': [0.01, 0.1, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='recall', verbose=0)

In [91]:
svm_model = clf.best_estimator_
print(clf.best_score_, clf.best_params_)

1.0 {'C': 0.01, 'gamma': 0.01}
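
A perfect recall from the search is worth a second look, because recall alone can be maximised by labelling every review as positive. A hedged variant of the search is sketched below on synthetic data (`toy_X`, `toy_y` from `make_classification`, not the review features). It searches only `C`, since the linear kernel does not use `gamma`, and scores with F1, which also penalises false positives. The choice of F1 is an assumption for illustration; accuracy would be another option.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features and the 0/1 labels used above.
toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=0)

# Only C is searched: the linear kernel does not use gamma.
param_grid = {"C": [0.01, 0.1, 1]}
search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid,
    scoring="f1",   # combines precision and recall, unlike recall alone
    cv=5,
)
search.fit(toy_X, toy_y)
print(search.best_params_, search.best_score_)
```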


In [92]:
test2 = test.iloc[:,:-1].values  # all columns except the last one (the target)

true = test.iloc[:,-1].values  # the last column holds the target
new_true = np.where(true == 'pos', 1, 0)  # same encoding as the training labels: 'pos' -> 1, 'neg' -> 0
print(new_true)
pred = svm_model.predict(test2)
print(pred)

[1 0 0 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 0
 0 0 1 1 1 0 1 0 1 0 1 0 1 0 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 0 1
 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 0 1 0 0 1 1 1 0 0 1 0
 0 0 0 1 1 1 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 0
 0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0
 0 0 1 0 0 1 1 0 1 0 0 0 0 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


In [85]:
confusion_matrix(new_true, pred)

array([[  0,  97],
       [  0, 103]])

In [83]:
acc = accuracy_score(new_true, pred)

rec = recall_score(new_true, pred, pos_label = 0)

prec = precision_score(new_true, pred, pos_label = 0)

print(acc)
print(rec)
print(prec)

0.515
0.0
0.0
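
These numbers can be read straight off the confusion matrix: the tuned model predicts class 1 for every test review. A short worked check in plain Python:

```python
# Confusion matrix reported above for the grid-searched model
# (rows: true 0/1, columns: predicted 0/1).
cm = [[0, 97],
      [0, 103]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
recall_pos = cm[1][1] / (cm[1][0] + cm[1][1])   # recall for class 1 ('pos')
recall_neg = cm[0][0] / (cm[0][0] + cm[0][1])   # recall for class 0 ('neg')
predicted_neg = cm[0][0] + cm[1][0]             # number of reviews predicted as class 0

print(accuracy, recall_pos, recall_neg, predicted_neg)
```

Because no review is predicted as class 0, precision for that class is 0/0 and therefore undefined; scikit-learn reports 0.0 in that case. The accuracy of 0.515 is simply the share of positive reviews in the test set (103 of 200).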


