## Sentiment Analysis With Machine Learning

In this project, we use classical machine learning models with TF-IDF/CountVectorizer word representations for sentiment analysis. Perhaps surprisingly, these models reach 88.01% accuracy, beating deep learning models with GloVe word embeddings, which reach about 81% accuracy.

### Data Read
Here we take a look at the movie reviews and their labels.

In [1]:
import numpy as np
import pandas as pd
with open("data/reviews.txt",'r') as f:
    reviews = f.read()
with open("data/labels.txt",'r') as f:
    labels = f.read()
print (reviews[:1000])
print (labels[:100])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

### Data Exploration
Before representing words as numerical data, we split the reviews, check for zero-length reviews, and remove them.
1. The reviews are delimited by \n, so we split the reviews by \n.
2. We check for zero-length reviews and remove them.
3. We explore review length and the most common words to compare positive and negative reviews.

1. We split the reviews and labels by \n.

In [2]:
reviews_list = reviews.split('\n')
labels_list = labels.split('\n')

2. We find one review with zero length and remove it, along with its label.

In [3]:
reviews_series = pd.Series([r for r in reviews_list])
print ("Checking null values:  ",sum(reviews_series.isna()))
review_length = reviews_series.apply(lambda r:len(r))
zero_index = [i for i,v in enumerate(review_length) if v==0]
print ("the zero-length review have index  ",zero_index)
for i in sorted(zero_index, reverse=True):  # delete from the end so earlier indices stay valid
    del reviews_list[i]
    del labels_list[i]
print ("Remove the zero-length review!")

Checking null values:   0
the zero-length review have index   [25000]
Remove the zero-length review!
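
The same clean-up can be done without tracking indices at all. The following is a minimal sketch on toy data, not the notebook's actual code; the helper name `drop_empty_reviews` and the toy strings are assumptions made for illustration. It filters reviews and labels in lockstep, so an empty review anywhere in the list is dropped together with its label.

```python
def drop_empty_reviews(reviews, labels):
    """Return (reviews, labels) with every blank review and its label removed."""
    kept = [(r, l) for r, l in zip(reviews, labels) if r.strip()]
    kept_reviews = [r for r, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_reviews, kept_labels

# Toy data: splitting on '\n' leaves a trailing empty string when the text
# ends with a newline, which is how an empty last entry arises.
toy_reviews = "great film\nawful plot\n".split("\n")
toy_labels = "positive\nnegative\n".split("\n")

clean_reviews, clean_labels = drop_empty_reviews(toy_reviews, toy_labels)
print(clean_reviews, clean_labels)
```

Because the pairs are filtered together, reviews and labels cannot drift out of alignment.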


In [4]:
reviews_series = reviews_series.drop(index=zero_index,axis=0,inplace=False)
#print (reviews_series.shape)
labels_series = pd.Series([l for l in labels_list]).map({'positive':1,'negative':0})
print ("Sample labels:\n",labels_series[:3])
print ("The total reviews is {}.The ratio of positive reviews is {:.2f} ".format(reviews_series.shape[0],labels_series.sum()/labels_series.count()))

Sample labels:
 0    1
1    0
2    1
dtype: int64
The total reviews is 25000.The ratio of positive reviews is 0.50 


3. We explore and compare review lengths for positive and negative reviews.

In [5]:
dt = pd.DataFrame(reviews_series,columns=["review"])
dt["review_length"] = dt.review.apply(len)
dt["label"] = labels_list
dt.head()

| | review | review_length | label |
|---|---|---|---|
| 0 | bromwell high is a cartoon comedy . it ran at ... | 832 | positive |
| 1 | story of a man who has unnatural feelings for ... | 667 | negative |
| 2 | homelessness or houselessness as george carli... | 2398 | positive |
| 3 | airport starts as a brand new luxury pla... | 4476 | negative |
| 4 | brilliant over acting by lesley ann warren . ... | 857 | positive |


In [6]:
dt.groupby("label").review_length.describe()

| label | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| negative | 12500.0 | 1324.638 | 969.383844 | 55.0 | 722.0 | 993.0 | 1595.0 | 9117.0 |
| positive | 12500.0 | 1367.62336 | 1058.787192 | 74.0 | 707.0 | 998.0 | 1673.0 | 13859.0 |


On average, positive reviews are slightly longer than negative reviews (mean 1,368 vs. 1,325 characters), although the medians are nearly identical (998 vs. 993).

In [7]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))

dt[dt.label=='positive'].review_length.plot(bins=35, kind='hist', color='green', 
                                       label='positive', alpha=0.6)
dt[dt.label=='negative'].review_length.plot(bins=35,kind='hist', color='red', 
                                       label='negative', alpha=0.6)
plt.legend()
plt.xlabel("review Length")

[Figure: overlaid histograms of review length for positive (green) and negative (red) reviews; x-axis: review length]

In [8]:
from collections import Counter
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build these once; calling stopwords.words('english') for every word is very slow.
STOP_WORDS = set(stopwords.words('english'))
ps = PorterStemmer()

def text_process(review):
    """
    Remove punctuation and stopwords from a review, then stem the remaining words.

    Return:
    A single string of cleaned, lower-cased, stemmed words joined by spaces.
    """
    nopunc = ''.join(char for char in review if char not in string.punctuation)  # remove punctuation
    words = [word.lower() for word in nopunc.split() if word.lower() not in STOP_WORDS]  # remove stopwords
    return ' '.join(ps.stem(w) for w in words)  # word stemming


The most common words (for example 'br', 'film', 'movi', 'one' and 'like') appear in both positive and negative reviews, so they carry little signal for predicting the label. We'll remove these very common words when building the model.

In [9]:
dt["review_clean"] = dt.review.apply(text_process)
words_pos = dt[dt.label=='positive'].review_clean.apply(lambda x:[w for w in x.split()])
words_pos_counter = Counter()
for w in words_pos:
    words_pos_counter.update(w)
print ("The most common 50 words in positive reviews: ",words_pos_counter.most_common(50))
words_neg = dt[dt.label=='negative'].review_clean.apply(lambda x:[w for w in x.split()])
words_neg_counter = Counter()
for w in words_neg:
    words_neg_counter.update(w)
print ("The most common 50 words in negative reviews:  ",words_neg_counter.most_common(50))

The most common 50 words in positive reviews:  [('br', 49235), ('film', 25309), ('movi', 22661), ('one', 14170), ('like', 10461), ('time', 8497), ('good', 7839), ('see', 7492), ('stori', 7481), ('charact', 7075), ('make', 6968), ('well', 6703), ('watch', 6554), ('great', 6489), ('get', 6476), ('love', 6167), ('show', 5590), ('also', 5550), ('realli', 5476), ('would', 5400), ('even', 5121), ('play', 5099), ('scene', 4994), ('first', 4756), ('much', 4685), ('end', 4660), ('peopl', 4547), ('way', 4532), ('best', 4323), ('think', 4295), ('go', 4249), ('life', 4207), ('look', 4178), ('year', 4055), ('work', 3968), ('made', 3823), ('mani', 3776), ('two', 3735), ('perform', 3653), ('know', 3612), ('thing', 3511), ('man', 3486), ('act', 3476), ('take', 3442), ('come', 3421), ('seen', 3415), ('still', 3363), ('littl', 3341), ('say', 3271), ('actor', 3251)]
The most common 50 words in negative reviews:   [('br', 52637), ('movi', 29046), ('film', 22894), ('one', 13572), ('like', 12340), ('make', 8
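
The overlap between the two lists can be checked directly rather than by eye. The following is a minimal sketch; the helper name `shared_top_words` is an assumption made for illustration, and the toy counters hold only a handful of the counts printed above instead of the full `words_pos_counter` and `words_neg_counter`.

```python
from collections import Counter

def shared_top_words(counter_a, counter_b, n):
    """Return the words that appear in the top-n lists of both counters."""
    top_a = [w for w, _ in counter_a.most_common(n)]
    top_b = {w for w, _ in counter_b.most_common(n)}
    return [w for w in top_a if w in top_b]  # keep counter_a's ranking order

# A few counts copied from the output above (not the full counters).
pos = Counter({"br": 49235, "film": 25309, "movi": 22661, "great": 6489, "love": 6167})
neg = Counter({"br": 52637, "movi": 29046, "film": 22894, "one": 13572, "like": 12340})

shared = shared_top_words(pos, neg, n=5)
print(shared)
```

In the notebook, the same call on the real counters with `n=50` would list the words that rank highly in both classes.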

### Model and Train

We split the data into training and test sets, then use CountVectorizer and TfidfTransformer to vectorize the words. The vectorizer must be fitted on the training set only and then applied to the test set, so train_test_split has to come first. If we fitted on the whole dataset and split afterwards, the TF-IDF vectors would carry information from the test set into training, which is data leakage. Be aware that when we fit on the training set, words that appear only in the test set are discarded when the test set is transformed. We could also use TfidfVectorizer, which converts a collection of raw documents directly into a matrix of TF-IDF features. It combines CountVectorizer and TfidfTransformer.
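
The fit-on-train, transform-on-test pattern can be illustrated on a toy corpus. This is a minimal sketch under the assumption of default vectorizer settings; the toy documents and variable names are hypothetical and unrelated to the review data.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

train_docs = ["good movie", "bad movie", "good acting"]
test_docs = ["good plot"]  # "plot" never appears in the training documents

# Fit the vocabulary and the IDF weights on the training documents only.
count_vect = CountVectorizer()
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(count_vect.fit_transform(train_docs))

# Transform (never fit) the test documents with the already fitted objects.
test_tfidf = tfidf.transform(count_vect.transform(test_docs))

vocab = sorted(count_vect.vocabulary_)
print(vocab)             # "plot" is absent: it was unseen during fitting
print(test_tfidf.shape)  # one row, one column per training-vocabulary term

# TfidfVectorizer performs both steps in a single object.
one_step = TfidfVectorizer().fit(train_docs)
same = np.allclose(one_step.transform(test_docs).toarray(), test_tfidf.toarray())
print(same)
```

Wrapping the vectorizer and the classifier in a `Pipeline`, as the cells below do, applies this pattern automatically inside each cross-validation fold.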

In [10]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(dt.review_clean,dt.label,train_size = 0.6,random_state=42)

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB


pipeline1 = Pipeline([('CountVect',CountVectorizer(lowercase=True,stop_words='english',analyzer='word',ngram_range=(1,2))),
                      ('TfidfTrans',TfidfTransformer()),
                      ('MultinomialNB',MultinomialNB())])

param = dict(CountVect__max_df=[0.1,0.2],CountVect__min_df=[2,5])

gridsearch1 = GridSearchCV(pipeline1,param_grid=param,scoring='accuracy',cv=5)

#print (gridsearch1.estimator.get_params().keys())
gridsearch1.fit(X_train,y_train)
print (gridsearch1.best_estimator_)

Pipeline(memory=None,
         steps=[('CountVect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=0.1,
                                 max_features=None, min_df=2,
                                 ngram_range=(1, 2), preprocessor=None,
                                 stop_words='english', strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('TfidfTrans',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('MultinomialNB',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)
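
The grid search selects `max_df=0.1` and `min_df=2`. As we understand the scikit-learn documentation, a float `max_df` drops terms that occur in more than that fraction of documents, and an integer `min_df` drops terms that occur in fewer than that many documents. The following pure-Python sketch imitates that filtering on a toy corpus; the function name `filter_by_document_frequency` is hypothetical, and this is not scikit-learn's implementation.

```python
from collections import Counter

def filter_by_document_frequency(docs, max_df, min_df):
    """Keep terms whose document frequency lies in [min_df, max_df * len(docs)]."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    upper = max_df * len(docs)             # float max_df is a fraction of documents
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= upper)

docs = ["br film good", "br film bad", "br plot good", "br act bad"]
kept = filter_by_document_frequency(docs, max_df=0.5, min_df=2)
print(kept)
```

Here 'br' occurs in every document and is removed by `max_df`, while 'plot' and 'act' occur only once and are removed by `min_df`.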


In [12]:
from sklearn import metrics
y_predict = gridsearch1.predict(X_test)
metrics.accuracy_score(y_test,y_predict)

0.8742

Removing the TF-IDF transformer gives a slightly lower score (0.8729) than the best estimator with TF-IDF (0.8742), so we keep TF-IDF.

In [13]:
pipeline_onlycv = Pipeline([
    ('CountVect',CountVectorizer(lowercase=True,stop_words='english',analyzer='word',ngram_range=(1,2),max_df=0.1,min_df=2)),
    ('MultinomialNB',MultinomialNB())
]
)
pipeline_onlycv.fit(X_train,y_train)
y_predict = pipeline_onlycv.predict(X_test)
metrics.accuracy_score(y_test,y_predict)

0.8729

We then test the model with logistic regression and get better performance on the test set (0.8801).

In [15]:
pipeline_lr = Pipeline([
    ('CountVect',CountVectorizer(lowercase=True,stop_words='english',analyzer='word',ngram_range=(1,2),max_df=0.1,min_df=2)),
    ('tfidf',TfidfTransformer()),
    ('logisticreg',LogisticRegression(solver='liblinear'))
]
)
pipeline_lr.fit(X_train,y_train)
y_predict = pipeline_lr.predict(X_test)
metrics.accuracy_score(y_test,y_predict)

0.8801

In [16]:
metrics.confusion_matrix(y_test, y_predict)

array([[4276,  692],
       [ 507, 4525]])
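
The confusion matrix above can be unpacked into the usual metrics. This is a minimal sketch that copies the four counts from the output. It assumes scikit-learn's convention that rows are true labels, columns are predicted labels, and string labels are sorted alphabetically, so index 0 is 'negative' and index 1 is 'positive'.

```python
# Counts copied from the confusion matrix output above.
# Assumed layout: rows = true label, columns = predicted label,
# index 0 = 'negative', index 1 = 'positive'.
cm = [[4276, 692],
      [507, 4525]]
tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of reviews predicted positive, the share truly positive
recall = tp / (tp + fn)     # of truly positive reviews, the share found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The accuracy reproduces the 0.8801 reported above. The model mislabels negative reviews as positive (692 cases) more often than the reverse (507 cases).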

In [45]:
y_test1 = pd.Series(y_test).map({'positive':1,'negative':0})
y_predict1 = pd.Series(y_predict).map({'positive':1,'negative':0})

In [46]:
y_test1 = y_test1.reset_index(drop=True) #reset index

In [58]:
X_index = X_test.index
np.array(dt.review.iloc[X_index])[(y_test1==1)&(y_test1==y_predict1)][2:3]

array(['part of the enjoyment that i took from this film stemmed from the fact that i knew nothing more about it than that it starred john turturro and emily watson   reasons enough to watch   was a period piece and involved chess . everything that evolved before me was completely unexpected . i shan  t  therefore  give away much more . suffice to say that turturro is magnificent as an eccentric  obsessive and deeply vulnerable chess genius and em matches him step for step as the strong  minded woman who is drawn to him . it  s about love and obsession  rather than the venerated board game and after drawing me in gradually over the first half hour  became totally compelling . and i defy anyone to second  guess the ending .  '],
      dtype=object)

In [59]:
print ("Sample correct prediction (true positive). \n")
print (np.array(dt.review.iloc[X_index])[(y_test1==1)&(y_test1==y_predict1)][2:3])
print ("\n Sample correct prediction (true negative). \n")
print (np.array(dt.review.iloc[X_index])[(y_test1==0)&(y_test1==y_predict1)][2:3])

Sample correct prediction (true positive). 

['part of the enjoyment that i took from this film stemmed from the fact that i knew nothing more about it than that it starred john turturro and emily watson   reasons enough to watch   was a period piece and involved chess . everything that evolved before me was completely unexpected . i shan  t  therefore  give away much more . suffice to say that turturro is magnificent as an eccentric  obsessive and deeply vulnerable chess genius and em matches him step for step as the strong  minded woman who is drawn to him . it  s about love and obsession  rather than the venerated board game and after drawing me in gradually over the first half hour  became totally compelling . and i defy anyone to second  guess the ending .  ']

 Sample correct prediction (true negative). 

['the director tries to be quentin tarantino  the screenwriters try to be tennessee williams  deborah kara unger tries to be faye dunaway  the late james coburn tries to be orson we

In [63]:
print ("Sample predict error with false positive. \n")
print (np.array(dt.review.iloc[X_index])[y_test1<y_predict1][2:3])
print ("\n Sample predict error with false negative. \n")
print (np.array(dt.review.iloc[X_index])[y_test1>y_predict1][2:3])

Sample predict error with false positive. 

['snow white  which just came out in locarno  where i had the chance to see it  of course refers to the world famous fairy tale . and it also refers to coke . in the end  real snow of the swiss alps plays its part as well .  br    br   thus all three aspects of the title are addressed in this film . there is a lot of dope on scene  and there is also a pale  dark haired girl  with a prince who has to go through all kind of trouble to come to her rescue .  br    br   but it  s not a fairy tale . it  s supposed to be a realistic drama located in zurich  switzerland  according to the tagline  .  br    br   technically the movie is close to perfect . unfortunately a weak plot  foreseeable dialogs  a mostly unreal scenery and the mixed acting don  t add up to create authenticity . thus as a spectator i remained untouched .  br    br   and then there were the clichs  which drove me crazy one by one snow white is a rich and spoiled upper class daught

### Conclusion:
Classical machine learning algorithms with TF-IDF/CountVectorizer word representations do a good job on sentiment analysis, reaching 88.01% accuracy. That is better than the LSTM and CNN models with GloVe word embeddings, which reach about 81% accuracy.