# Assignment 4: Sentiment Analysis

Use the UCI Sentiment Labelled Sentences Data Set in this assignment: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences.
Create a sentiment analyzer using the TF-IDF representation of the Amazon reviews from the UCI Sentiment Labelled Sentences Data Set. Use the TfidfVectorizer class in scikit-learn and train a logistic regression model.

Load the dataset and select the Amazon reviews

In [128]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, PorterStemmer
# from wordcloud import WordCloud, STOPWORDS
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [129]:
df1 = pd.read_table('amazon_cells_labelled.txt',header=None)
df2 = pd.read_table('imdb_labelled.txt',header=None)
df3 = pd.read_table('yelp_labelled.txt',header=None)

In [130]:
df1.head()

Unnamed: 0,0,1
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


In [131]:
df2.head()

Unnamed: 0,0,1
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [132]:
df3.head()

Unnamed: 0,0,1
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [133]:
frames = [df1, df2, df3]
df = pd.concat(frames)

In [134]:
df.head()

Unnamed: 0,0,1
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


In [135]:
df.columns = ["review","target"]
df.head()

Unnamed: 0,review,target
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1
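
The prompt asks for the Amazon reviews, while the cells above stack all three files into one frame. A minimal sketch of one way to keep track of where each row came from, so the Amazon subset can be selected later, is given here. The tiny frames are toy stand-ins for the real files, and the `source` column name is an assumption introduced for illustration. Passing `ignore_index=True` also gives every row a unique index label, which the plain `pd.concat(frames)` call does not.

```python
import pandas as pd

# Toy stand-ins for the three files; the notebook reads the real ones with pd.read_table.
amazon = pd.DataFrame({0: ["Great for the jawbone.", "The mic is great."], 1: [1, 1]})
imdb = pd.DataFrame({0: ["Very little music or anything to speak of."], 1: [0]})
yelp = pd.DataFrame({0: ["Crust is not good."], 1: [0]})

# `source` is a hypothetical column name, not part of the original notebook.
frames = {"amazon": amazon, "imdb": imdb, "yelp": yelp}
tagged = pd.concat(
    [frame.assign(source=name) for name, frame in frames.items()],
    ignore_index=True,  # avoid repeated index labels across the three files
)
tagged.columns = ["review", "target", "source"]

# Keep only the rows that came from the Amazon file.
amazon_only = tagged[tagged["source"] == "amazon"]
print(amazon_only)
```

Whether to train on the Amazon rows alone or on all three sources is a modelling choice; the rest of this notebook continues with the combined frame.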


Preprocess the data (tokenization, stop word removal, lemmatization). The cell below uses a Snowball stemmer in place of a lemmatizer.

In [137]:
stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words('english'))
def preprocess(text_string):
    text_string = text_string.lower() # Convert everything to lower case.
    text_string = re.sub('[^A-Za-z0-9]+', ' ', text_string) # Remove special characters and punctuations
    
    x = text_string.split()
    new_text = []
    
    for word in x:
        if word not in stop_words:
            new_text.append(stemmer.stem(word))
            
    text_string = ' '.join(new_text)
    return text_string

In [138]:
df['preprocessed_reviews'] = df['review'].apply(preprocess)

In [139]:
df.head()

Unnamed: 0,review,target,preprocessed_reviews
0,So there is no way for me to plug it in here i...,0,way plug us unless go convert
1,"Good case, Excellent value.",1,good case excel valu
2,Great for the jawbone.,1,great jawbon
3,Tied to charger for conversations lasting more...,0,tie charger convers last 45 minut major problem
4,The mic is great.,1,mic great


In [140]:
df.dtypes

review                  object
target                   int64
preprocessed_reviews    object
dtype: object

Split the data into 80% train and 20% test, convert the text to TF-IDF vectors, and train a logistic regression model to classify the sentiment

In [141]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(df[['preprocessed_reviews']],df['target'],test_size=0.2,random_state=40)
# model = LogisticRegression()
# model.fit(X_train,y_train)
# model.score(X_test,y_test)

In [142]:
tfidf_vec = TfidfVectorizer(ngram_range=(1,2), max_features=10000)
tfidf_train = tfidf_vec.fit_transform(X_train['preprocessed_reviews'])
tfidf_test = tfidf_vec.transform(X_test['preprocessed_reviews'])

print(tfidf_train.shape)
print(y_train.shape)
print(tfidf_test.shape)
print(y_test.shape)

(2198, 10000)
(2198,)
(550, 10000)
(550,)
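
To make the `ngram_range=(1,2)` setting more concrete, here is a small sketch on three toy documents (stemmed phrases in the style of the preprocessed reviews, not the real training set). It lists the unigram and bigram features that `TfidfVectorizer` learns and checks that each row is L2-normalised, which is the vectorizer's default behaviour. In the notebook the matrix has exactly 10000 columns, which suggests the `max_features` cap was reached.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, assumed for illustration only.
docs = ["mic great", "great jawbon", "good case excel valu"]

vec = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams, no feature cap
X = vec.fit_transform(docs)

# vocabulary_ maps each learned term to its column index.
terms = sorted(vec.vocabulary_)
print(terms)
print(X.shape)

# With the default norm="l2", every non-empty row has unit Euclidean length.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)
```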


In [143]:
from sklearn.linear_model import LogisticRegression

model_params = {
    'logistic_regression': {
        'model': LogisticRegression(),
        'params': {
            'max_iter':[100,500,1000],
            'penalty': ['l2'],
            'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
         }
    }
}

In [144]:
# Model Selection and hyper-parameter tuning 
from sklearn.model_selection import GridSearchCV

scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'],mp['params'], cv=5, return_train_score=False)
    clf.fit(tfidf_train, y_train)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df_temp = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df_temp

Unnamed: 0,model,best_score,best_params
0,logistic_regression,0.799355,"{'max_iter': 100, 'penalty': 'l2', 'solver': '..."


In [145]:
clf.best_estimator_.score(tfidf_test,y_test)

0.7981818181818182
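
In the cells above, the vectorizer is fitted once on the full training split and the grid search then cross-validates on the resulting matrix, so each validation fold has already influenced the vocabulary and IDF weights. One alternative, sketched here under the assumption that a scikit-learn `Pipeline` is acceptable for the assignment, is to wrap both steps together so TF-IDF is refitted inside every fold. The toy texts, the step names `tfidf` and `clf`, and the `C` grid are illustrative choices, not values from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, assumed for illustration; the notebook would pass the preprocessed reviews.
texts = [
    "great mic", "good case", "love phone", "excel valu",
    "bad charger", "poor sound", "wast money", "major problem",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=2)  # cv=2 only because the toy set is tiny
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["great valu", "bad sound"]))
```

With this layout, `search.score(raw_test_texts, y_test)` would take raw (preprocessed) strings directly, since the fitted pipeline applies the vectorizer itself.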

What are some examples of sentiments classified wrongly?

In [146]:
y_pred = clf.best_estimator_.predict(tfidf_test)

In [147]:
X_test = X_test.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
X_test['target'] = y_test
X_test['pred'] = y_pred

In [152]:
X_test[X_test['target']!=X_test['pred']]

Unnamed: 0,preprocessed_reviews,target,pred
794,internet access fine rare instanc work,0,1
34,car charger well ac charger includ make sure n...,1,0
133,movi seem drag hero realli work freedom,0,1
374,good item work start problem auto revers tape ...,0,1
73,cannot believ actor agre film,0,1
...,...,...,...
66,problem thought actor play villain low rent mi...,0,1
72,great disappoint,0,1
451,food gooodd,1,0
282,first wear well,0,1
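
Most of the rows visible above are negative reviews predicted as positive. A confusion matrix is one way to check whether that pattern holds across the whole test set. The sketch below uses made-up labels purely to show the call and how to read the result; in the notebook the inputs would be `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration, not the notebook's results.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"false positives={fp}, false negatives={fn}")
```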
