# Case Study: Sentiment Analysis

In this lab we use part of the 'Amazon_Unlocked_Mobile.csv' dataset published on Kaggle. The dataset contains the following information:
* Product Name
* Brand Name
* Price
* Rating
* Reviews
* Review Votes

We are mainly interested in the 'Reviews' (X) and the 'Rating' (y).

The goal is to predict the 'Rating' after reading the 'Reviews'. I've prepared TRAIN and TEST sets for you.

## 1. Load dataset

In [2]:
import pandas as pd
import numpy as np

In [5]:
TRAIN = pd.read_csv('dataset/TRAIN.csv.gz')
display(TRAIN.head())

Unnamed: 0.1,Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [6]:
TEST = pd.read_csv('dataset/TEST.csv.gz')
display(TEST.head())

Unnamed: 0.1,Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,13320,Apple iPhone 4S 16GB Unlocked GSM - White (Cer...,Apple,129.99,4,Very nice well taken care of phone.... Works g...,0.0
1,13321,Apple iPhone 4S 16GB Unlocked GSM - White (Cer...,Apple,129.99,5,Exactly as described. Perfect.,0.0
2,13322,Apple iPhone 4S 16GB Unlocked GSM - White (Cer...,Apple,129.99,4,"phone is good work in Ukraine, but charge incl...",1.0
3,13323,Apple iPhone 4S 16GB Unlocked GSM - White (Cer...,Apple,129.99,3,"At the moment I have to say its ok,because I h...",1.0
4,13324,Apple iPhone 4S 16GB Unlocked GSM - White (Cer...,Apple,129.99,3,iPhone is flawless. But the battery doesn't st...,0.0


## 2. Build X (features vectors) and y (labels)

#### 2.a.  Construct X_train and y_train

In [4]:
TRAIN.shape

(10000, 7)

In [5]:
X_train = TRAIN['Reviews']
y_train = TRAIN['Rating']
len(X_train)

10000

In [6]:
X_test = TEST['Reviews']
y_test = TEST['Rating']
len(X_test)

5000

In [None]:
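import nltk
nltk.download('punkt')
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
y_train = TRAIN['Rating']
# Tokenize each review on its own so that every row still lines up with its label in y_train.
# Joining all reviews into one long string before tokenizing would lose that alignment.
X_train_tokens = TRAIN['Reviews'].astype(str).apply(
    lambda review: [sentence.split() for sentence in tokenizer.tokenize(review)])
X_train_tokens.head()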

#### 2.b.  Construct X_test and y_test


In [None]:
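# Tokenize each test review separately, mirroring what is done for the training set.
X_test_tokens = TEST['Reviews'].astype(str).apply(
    lambda review: [sentence.split() for sentence in tokenizer.tokenize(review)])
y_test = TEST['Rating']
X_test_tokens.head()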

In [None]:
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(12, 3), sharex=True, sharey=True)

sns.countplot(x='Rating', data=TRAIN, palette='rainbow', ax=ax[0]).set_title('Ratings for training dataset')
sns.countplot(x='Rating', data=TEST, palette='rainbow', ax=ax[1]).set_title('Ratings for test dataset')
plt.show()
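
Since the evaluation metric is accuracy, it helps to know what accuracy a model would reach by always predicting the most frequent rating. The sketch below computes that reference value on a small invented Series; `toy_ratings` is a hypothetical stand-in, and the same two lines could be applied to `TRAIN['Rating']` or `TEST['Rating']`.

```python
import pandas as pd

# Toy ratings, invented for illustration only (not taken from TRAIN or TEST).
toy_ratings = pd.Series([5, 5, 4, 1, 5, 3, 5, 4, 1, 5], name='Rating')

# Share of each rating value, sorted by rating.
class_share = toy_ratings.value_counts(normalize=True).sort_index()
print(class_share)

# Accuracy obtained by always predicting the most frequent rating.
majority_rating = toy_ratings.mode()[0]
majority_accuracy = class_share.max()
print(majority_rating, majority_accuracy)
```

Any classifier built later should beat this majority-class accuracy to be considered useful.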


## 3. Construct a Baseline
Using CountVectorizer and a classifier seen in a previous lecture, build a first model.

For this model, do not pre-process the text and use only single words (no N-grams).

The evaluation metric is accuracy.
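
As a rough illustration of this baseline recipe, the sketch below runs CountVectorizer and a classifier on a handful of invented reviews. The `toy_` names and the toy data are assumptions made for the example; MultinomialNB is chosen only because it is the classifier used in the cells of this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy reviews and ratings, invented for illustration only.
toy_train_reviews = ["great phone love it", "terrible battery broke fast",
                     "love the screen great value", "broke after a week terrible"]
toy_train_ratings = [5, 1, 5, 1]
toy_test_reviews = ["great value love it", "terrible phone broke"]
toy_test_ratings = [5, 1]

# Learn the vocabulary on the training reviews only, then reuse it for the test reviews.
toy_cv = CountVectorizer()
toy_train_counts = toy_cv.fit_transform(toy_train_reviews)
toy_test_counts = toy_cv.transform(toy_test_reviews)

# Fit the classifier on word counts and score it with accuracy.
toy_nb = MultinomialNB()
toy_nb.fit(toy_train_counts, toy_train_ratings)
toy_predictions = toy_nb.predict(toy_test_counts)
print(accuracy_score(toy_test_ratings, toy_predictions))
```

The important design point is that `fit_transform` is called on the training text only, while the test text goes through `transform`, so both matrices share the same columns.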

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)

In [None]:
print(X_train[:5].toarray())  # densify only the first rows; the full matrix is large and mostly zeros

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train)
X_train_tfidf.shape

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [None]:
nb.fit(X_train_tfidf,y_train)

In [None]:
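# The classifier was fitted on TF-IDF features, so the test counts need the same weighting.
X_test_tfidf = tfidf_transformer.transform(X_test)
predictions = nb.predict(X_test_tfidf)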

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictions)

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test,predictions))
print('\n')
print(classification_report(y_test,predictions))

## 4. A better classifier with a preprocessing
Still using single words and the same classifier, try to get a better model by pre-processing the text with one or more techniques from the "text-preprocessing" notebook.

The evaluation metric is still accuracy.
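
A minimal sketch of such a pre-processing step is given below: lowercasing, keeping only word characters, and dropping stop words. The helper name `clean_review` is hypothetical, and the stop-word list is scikit-learn's built-in `ENGLISH_STOP_WORDS`, swapped in here so the example runs without downloading the NLTK corpus used in the cells of this section.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def clean_review(text):
    """Lowercase, keep only word characters, and drop stop words (hypothetical helper)."""
    words = re.findall(r'\w+', text.lower())
    return ' '.join(word for word in words if word not in ENGLISH_STOP_WORDS)


cleaned = clean_review("The battery was OLD & barely holds a charge!")
print(cleaned)
```

On a pandas Series of reviews, the same helper could be applied with `Series.apply(clean_review)`.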

In [None]:
from nltk.corpus import stopwords
import string
import nltk

nltk.download('stopwords')
# Build the stop-word set once instead of reloading the list for every word.
STOP_WORDS = set(stopwords.words('english'))

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Drop every punctuation character, then join the rest back into a string.
    nopunc = ''.join(char for char in mess if char not in string.punctuation)

    # Now just remove any stopwords.
    return [word for word in nopunc.split() if word.lower() not in STOP_WORDS]

In [None]:
import string
X_train.apply(text_process)


In [8]:
X_train = X_train.str.lower()
X_test = X_test.str.lower()

print(X_train[20])
print(X_test[20])

the battery was old & had been over used because it barely holds a charge. otherwise, no issues with the phone itself.
this phone looks like new and works perfectly. just inserted the sim card from my old phone, went through the easy setup prompts and done!


In [10]:
import re
# Raw string avoids an invalid-escape warning for \w; keep only runs of word characters.
X_train = X_train.apply(lambda x: " ".join(re.findall(r'\w+', x)))
X_test = X_test.apply(lambda x: " ".join(re.findall(r'\w+', x)))

print(X_train[20])
print(X_test[20])

the battery was old had been over used because it barely holds a charge otherwise no issues with the phone itself
this phone looks like new and works perfectly just inserted the sim card from my old phone went through the easy setup prompts and done


In [11]:
from nltk.corpus import stopwords

# Build the stop-word set once; calling stopwords.words() for every word is very slow.
STOP_WORDS = set(stopwords.words('english'))

def remove_stopWords(s):
    '''Remove English stop words from a whitespace-separated string.'''
    return ' '.join(word for word in s.split() if word not in STOP_WORDS)

X_train = X_train.apply(remove_stopWords)
X_test = X_test.apply(remove_stopWords)

print(X_train[20])
print(X_test[20])

battery old used barely holds charge otherwise issues phone
phone looks like new works perfectly inserted sim card old phone went easy setup prompts done


In [13]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)

In [14]:
X_train.shape

(10000, 10342)

In [29]:
print(X_train)

  (0, 887)	1
  (0, 930)	1
  (0, 3675)	1
  (0, 3683)	1
  (0, 3731)	1
  (0, 3892)	1
  (0, 4381)	1
  (0, 4495)	1
  (0, 4551)	1
  (0, 5438)	1
  (0, 5453)	1
  (0, 5605)	1
  (0, 6355)	1
  (0, 6364)	2
  (0, 6751)	3
  (0, 7385)	1
  (0, 7453)	1
  (0, 7919)	1
  (0, 8106)	2
  (0, 8500)	1
  (0, 8517)	1
  (0, 8526)	1
  (0, 9199)	1
  (0, 9752)	1
  (0, 9754)	1
  :	:
  (9998, 7947)	1
  (9999, 913)	1
  (9999, 977)	1
  (9999, 1524)	1
  (9999, 1544)	1
  (9999, 1692)	1
  (9999, 1766)	1
  (9999, 1893)	1
  (9999, 1898)	1
  (9999, 3865)	1
  (9999, 4911)	1
  (9999, 5437)	1
  (9999, 5551)	1
  (9999, 6164)	1
  (9999, 6186)	1
  (9999, 6539)	1
  (9999, 6751)	3
  (9999, 7084)	1
  (9999, 7174)	1
  (9999, 7191)	1
  (9999, 7440)	1
  (9999, 8009)	1
  (9999, 8017)	1
  (9999, 8319)	1
  (9999, 10165)	1


In [None]:
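from sklearn.feature_extraction.text import CountVectorizer
# A custom analyzer expects raw text, so fit on the original reviews rather than on the count matrix.
X_transformer = CountVectorizer(analyzer=text_process).fit(TRAIN['Reviews'])
#X_test = cv.transform(X_test)

In [None]:
X_train_cv = X_transformer.transform(TRAIN['Reviews'])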

In [None]:
print(X_transformer.get_feature_names()[2549])

In [28]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train)
X_train_tfidf.shape
print(X_train_tfidf)

  (0, 10276)	0.16161994903310473
  (0, 10224)	0.10021757451082593
  (0, 9996)	0.13672784393996923
  (0, 9799)	0.34408866480251593
  (0, 9777)	0.16750089401734622
  (0, 9754)	0.19861238996161037
  (0, 9752)	0.16825692374376866
  (0, 9199)	0.13814917281714342
  (0, 8526)	0.1568519762158024
  (0, 8517)	0.16700688851400694
  (0, 8500)	0.16533712522250962
  (0, 8106)	0.2438086575403722
  (0, 7919)	0.1342367793446165
  (0, 7453)	0.12575746037194296
  (0, 7385)	0.1166257728355801
  (0, 6751)	0.15580294920785784
  (0, 6364)	0.19442981783350388
  (0, 6355)	0.13795586975312746
  (0, 5605)	0.22453393219319448
  (0, 5453)	0.1767040220142858
  (0, 5438)	0.18139010432810335
  (0, 4551)	0.24313963194598473
  (0, 4495)	0.16161994903310473
  (0, 4381)	0.15981819039555778
  (0, 3892)	0.15150370523428736
  :	:
  (9998, 21)	0.7326834829949377
  (9999, 10165)	0.18051471233754432
  (9999, 8319)	0.15416642138736006
  (9999, 8017)	0.13023317392857917
  (9999, 8009)	0.17875287508448678
  (9999, 7440)	0.3096487

In [16]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [22]:
nb.fit(X_train_tfidf,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [26]:
predictions = nb.predict(X_test)

In [27]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictions)

0.6702
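
One detail worth checking in the cells above: `nb` is fitted on `X_train_tfidf`, while `predict` receives `X_test`, which holds raw counts. A possible way to keep the training and test transformations in step is a scikit-learn `Pipeline`. The sketch below uses invented toy data, and the `toy_` names are hypothetical; it is not a rerun of the experiment above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy reviews and ratings, invented for illustration only.
toy_train_reviews = ["great phone love it", "terrible battery broke fast",
                     "love the screen great value", "broke after a week terrible"]
toy_train_ratings = [5, 1, 5, 1]
toy_test_reviews = ["great value love it", "terrible phone broke"]

# Each step is fitted on the training text; predict() replays the same steps on new text.
toy_pipeline = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
toy_pipeline.fit(toy_train_reviews, toy_train_ratings)
toy_pipeline_predictions = toy_pipeline.predict(toy_test_reviews)
print(toy_pipeline_predictions)
```

With a pipeline, raw text goes in at both `fit` and `predict`, so the test set cannot accidentally skip the TF-IDF step.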

## 5. A better classifier using Ngrams
Starting from the previous work, but using bigrams or trigrams, try to get a better classification.

The evaluation metric is still accuracy.
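
To make the effect of `ngram_range` concrete, the sketch below lists the features extracted from one invented sentence. The `toy_` names are hypothetical. Note that `ngram_range=(1, 2)` keeps single words and adds bigrams, whereas `(2, 2)` keeps bigrams only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One invented review, for illustration only.
toy_review = ["battery life is not good"]

# ngram_range=(1, 2) keeps single words and adds every pair of adjacent words.
toy_ngram_cv = CountVectorizer(ngram_range=(1, 2))
toy_ngram_cv.fit(toy_review)
toy_features = sorted(toy_ngram_cv.vocabulary_)
print(toy_features)
```

A bigram such as "not good" carries a negation that the separate words "not" and "good" lose, which is the usual motivation for trying N-grams in sentiment tasks.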

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2, 2))
X_train_cv1 = cv.fit_transform(X_train)
X_test_cv1 = cv.transform(X_test)

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf1 = tfidf_transformer.fit_transform(X_train_cv1)
X_train_tfidf1.shape

(10000, 101848)

In [9]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [10]:
nb.fit(X_train_tfidf1,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [11]:
predictions1 = nb.predict(X_test_cv1)

In [12]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictions1)

0.653

## 6. (Optional) A better classifier by combining many classifiers

The goal of [ensemble methods](https://scikit-learn.org/stable/modules/ensemble.html) is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

I propose to use **averaging methods**. In this family, the driving principle is to build several estimators independently and then average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.
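
The averaging idea can be sketched by hand: fit several estimators independently, average their predicted class probabilities, and pick the most probable class. The example below averages two different models on invented toy data, which is plain probability averaging rather than bagging; the `toy_` names and the choice of LogisticRegression as second model are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy reviews and ratings, invented for illustration only.
toy_reviews = ["great phone love it", "terrible battery broke fast",
               "love the screen great value", "broke after a week terrible"]
toy_ratings = [5, 1, 5, 1]
toy_counts = CountVectorizer().fit_transform(toy_reviews)

# Fit each estimator independently on the same features.
toy_models = [MultinomialNB(), LogisticRegression(max_iter=1000)]
for toy_model in toy_models:
    toy_model.fit(toy_counts, toy_ratings)

# Average the class probabilities; both models order their classes identically (sorted labels).
avg_proba = np.mean([m.predict_proba(toy_counts) for m in toy_models], axis=0)
avg_predictions = toy_models[0].classes_[avg_proba.argmax(axis=1)]
print(avg_predictions)
```

`BaggingClassifier` follows the same principle, but builds its estimators from one algorithm trained on different bootstrap samples of the training set.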



In [30]:
from sklearn.ensemble import BaggingClassifier

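# Pass the base estimator positionally: the keyword was renamed from base_estimator to estimator in newer scikit-learn.
model = BaggingClassifier(nb)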

In [31]:
model.fit(X_train_tfidf,y_train)

BaggingClassifier(base_estimator=MultinomialNB(alpha=1.0, class_prior=None,
                                               fit_prior=True),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=None, verbose=0,
                  warm_start=False)

In [32]:
predictions2 = model.predict(X_test)

In [35]:
accuracy_score(y_test,predictions2)

0.6704

In [34]:
from sklearn import model_selection

results = model_selection.cross_val_score(model, X_train_tfidf,y_train,cv=10)
for i in range(len(results)):
    print("Model: "+str(i)+" Accuracy is: "+str(results[i]))

Model: 0 Accuracy is: 0.6007984031936128
Model: 1 Accuracy is: 0.624750499001996
Model: 2 Accuracy is: 0.6453546453546454
Model: 3 Accuracy is: 0.6283716283716284
Model: 4 Accuracy is: 0.6303696303696303
Model: 5 Accuracy is: 0.6433566433566433
Model: 6 Accuracy is: 0.6266266266266266
Model: 7 Accuracy is: 0.6202404809619239
Model: 8 Accuracy is: 0.6392785571142284
Model: 9 Accuracy is: 0.6148445336008024
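
Each line printed above is the accuracy on one of the 10 cross-validation folds (`cv=10`). A common way to summarize them is the mean and standard deviation across folds; the sketch below does this with the values copied from the output above. The variable names are hypothetical.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
fold_scores = np.array([
    0.6007984031936128, 0.624750499001996, 0.6453546453546454,
    0.6283716283716284, 0.6303696303696303, 0.6433566433566433,
    0.6266266266266266, 0.6202404809619239, 0.6392785571142284,
    0.6148445336008024,
])

# The mean estimates the expected accuracy; the standard deviation shows how much it varies by fold.
mean_accuracy = fold_scores.mean()
std_accuracy = fold_scores.std()
print(f"CV accuracy: {mean_accuracy:.4f} +/- {std_accuracy:.4f}")
```

Comparing this mean with the single test-set score gives a rough idea of how sensitive the result is to the particular train/test split.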


## 7. Summarize your conclusion here

Due to lack of time, I could not analyze the results in depth.

Without any preprocessing, the model reaches an accuracy of 0.66, which is quite low.
I then applied some text preprocessing (removing punctuation and stopwords), but it gave a worse accuracy score than the first model. Applying n-grams did not improve the score either, which surprised me.

I suspect I may have made a mistake in how I applied the count vectorizer at the beginning. I will keep working on improving the accuracy score.
