# Case Study: Sentiment Analysis

In this lab we use part of the 'Amazon_Unlocked_Mobile.csv' dataset published on Kaggle. The dataset contains the following information:
* Product Name
* Brand Name
* Price
* Rating
* Reviews
* Review Votes

We are mainly interested in the 'Reviews' column (X) and the 'Rating' column (y).

The goal is to predict the 'Rating' from the text of the 'Reviews'. I've prepared the TRAIN and TEST sets for you.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-dataset" data-toc-modified-id="Load-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load dataset</a></span></li><li><span><a href="#Build-X-(features-vectors)-and-y-(labels)" data-toc-modified-id="Build-X-(features-vectors)-and-y-(labels)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Build X (features vectors) and y (labels)</a></span><ul class="toc-item"><li><span><a href="#Construct-X_train-and-y_train" data-toc-modified-id="Construct-X_train-and-y_train-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Construct X_train and y_train</a></span></li><li><span><a href="#Construct-X_test-and-y_test" data-toc-modified-id="Construct-X_test-and-y_test-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Construct X_test and y_test</a></span></li></ul></li><li><span><a href="#Construct-a-Baseline" data-toc-modified-id="Construct-a-Baseline-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Construct a Baseline</a></span></li><li><span><a href="#A-better-classifier-with-a-preprocessing" data-toc-modified-id="A-better-classifier-with-a-preprocessing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>A better classifier with a preprocessing</a></span></li><li><span><a href="#Summarize-your-conclusion-here" data-toc-modified-id="Summarize-your-conclusion-here-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Summarize your conclusion here</a></span></li></ul></div>

## Load dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
TRAIN = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/train.csv.gz")
TEST = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/test.csv.gz")

TRAIN.head()

| Unnamed: 0 | Product Name | Brand Name | Price | Rating | Reviews | Review Votes |
|---|---|---|---|---|---|---|
| 0 | Samsung Galaxy Note 4 N910C Unlocked Cellphone... | Samsung | 449.99 | 4 | I love it!!! I absolutely love it!! 👌👍 | 0.0 |
| 1 | BLU Energy X Plus Smartphone - With 4000 mAh S... | BLU | 139.0 | 5 | I love the BLU phones! This is my second one t... | 4.0 |
| 2 | Apple iPhone 6 128GB Silver AT&T | Apple | 599.95 | 5 | Great phone | 1.0 |
| 3 | BLU Advance 4.0L Unlocked Smartphone -US GSM -... | BLU | 51.99 | 4 | Very happy with the performance. The apps work... | 2.0 |
| 4 | Huawei P8 Lite US Version- 5 Unlocked Android ... | Huawei | 198.99 | 5 | Easy to use great price | 0.0 |


## Build X (features vectors) and y (labels)

### Construct X_train and y_train

In [56]:
X_train = TRAIN['Reviews']
y_train = TRAIN['Rating']
X_train.shape, y_train.shape

((5000,), (5000,))

### Construct X_test and y_test


In [57]:
X_test = TEST['Reviews']
y_test = TEST['Rating']
X_test.shape, y_test.shape

((1000,), (1000,))

## Construct a Baseline
Using a binary `CountVectorizer` and a `LogisticRegression` classifier, both covered in a previous lecture, build a first model.

For this model, do not pre-process the text and use only single words (no N-grams). Leave all other parameters at their default values.

The evaluation metric is accuracy.
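The baseline described above can be sketched as a single scikit-learn pipeline. This is only an illustration on a handful of made-up reviews (the `toy_*` names and their contents are assumptions, not part of the lab data); it shows where `binary=True` goes and how accuracy is computed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Toy reviews and ratings (illustrative only, not taken from the dataset).
toy_train_reviews = [
    "great phone love it",
    "love the screen great battery",
    "terrible phone broke quickly",
    "bad battery terrible screen",
]
toy_train_ratings = [5, 5, 1, 1]
toy_test_reviews = ["love this great phone", "terrible bad phone"]
toy_test_ratings = [5, 1]

# binary=True records word presence (0/1) instead of word counts.
baseline = make_pipeline(CountVectorizer(binary=True), LogisticRegression())
baseline.fit(toy_train_reviews, toy_train_ratings)

toy_pred = baseline.predict(toy_test_reviews)
toy_accuracy = accuracy_score(toy_test_ratings, toy_pred)
print(toy_accuracy)
```

Wrapping both steps in a pipeline ensures the vocabulary is learned from the training reviews only and then reused unchanged on the test reviews.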

In [58]:
# Encode X_train
from sklearn.feature_extraction.text import CountVectorizer

# Default settings store word counts; pass binary=True for the binary encoding described above.
cv = CountVectorizer()
cv.fit(X_train)
X_train_encoded = cv.transform(X_train)
X_train_encoded.shape

(5000, 8991)

In [59]:
# Encode X_test
X_test_encoded = cv.transform(X_test)
X_test_encoded.shape

(1000, 8991)

In [60]:
# Fit a model on the training set
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_encoded, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [61]:
# Evaluate the model on the test set
from sklearn.metrics import accuracy_score

y_pred = lr.predict(X_test_encoded)
accuracy_score(y_test, y_pred)

0.666
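To put a baseline accuracy in context, it can help to compare it with a classifier that always predicts the most frequent rating. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (the `toy_*` values are assumptions); on the lab data you would pass `y_train` and `y_test` instead.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy ratings (illustrative only, not taken from the dataset).
toy_y_train = [5, 5, 5, 4, 1]
toy_y_test = [5, 5, 1, 4]

# DummyClassifier ignores the features, so placeholder columns are enough.
toy_X_train = [[0]] * len(toy_y_train)
toy_X_test = [[0]] * len(toy_y_test)

# Always predict the most frequent training label (here: 5).
majority = DummyClassifier(strategy="most_frequent")
majority.fit(toy_X_train, toy_y_train)

majority_accuracy = accuracy_score(toy_y_test, majority.predict(toy_X_test))
print(majority_accuracy)
```

A text model is only useful to the extent that it beats this majority-class score.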

## A better classifier with a preprocessing

It's up to you. Try to get a better score (accuracy) using what we have seen in this course:
- efficient text pre-processing
- choice of feature extraction
- use of a more powerful classifier or better hyper-parameters for LogisticRegression.

The model must be trained on TRAIN and evaluated on TEST. You can of course use GridSearchCV or RandomizedSearchCV.
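As a rough sketch of how `GridSearchCV` could be used here, the example below tunes a vectorizer and a `LogisticRegression` together through cross-validation on the training data only. The toy reviews, the choice of `TfidfVectorizer`, and the parameter grid are all assumptions made for illustration, not a recommended configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy training data (illustrative only, not taken from the dataset).
toy_reviews = [
    "great phone love it",
    "love the screen great battery",
    "excellent camera works great",
    "very happy love this phone",
    "terrible phone broke quickly",
    "bad battery terrible screen",
    "stopped working very disappointed",
    "poor camera bad purchase",
]
toy_ratings = [5, 5, 5, 5, 1, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy set is tiny; a larger value suits real data.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_reviews, toy_ratings)

print(search.best_params_)
print(search.best_score_)
```

Because the search scores each candidate on held-out folds of the training set, the test set stays untouched until the final evaluation with `search.predict`.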

In [62]:
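import nltk
from nltk.tokenize import word_tokenize

# word_tokenize and WordNetLemmatizer need NLTK data; fetch it with nltk.download() if it is missing.
# Tokens to drop: punctuation marks and the possessive "'s"
# (these are not stop words in the usual sense, https://en.wikipedia.org/wiki/Stop_word).
stop_word_list = [".", ",", "[", "]", "`", "(", ")", "?", "'", "'s", ":", "!"]

def remove_stop_word(txt_token, stop_word_list):
    """Return the tokens that are not in stop_word_list."""
    return [w for w in txt_token if w not in stop_word_list]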

def stemming(txt_token):
    porter = nltk.PorterStemmer()
    return [porter.stem(w) for w in txt_token]

def lemmatization(txt_token):
    WNlemma = nltk.WordNetLemmatizer()
    return [WNlemma.lemmatize(w) for w in txt_token]

In [63]:
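# Series.append was removed in pandas 2.0; pd.concat is the supported equivalent.
all_reviews = pd.concat([X_train, X_test])
tokenized_sentences = [lemmatization(stemming(remove_stop_word(word_tokenize(sentence),
                                                               stop_word_list)))
                       for sentence in all_reviews]
tokenized_sentences = list(map(" ".join, tokenized_sentences))

# Split back into train and test at the size of the training set (5000 rows).
X_train_tokenized = tokenized_sentences[:len(X_train)]
X_test_tokenized = tokenized_sentences[len(X_train):]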

In [64]:
# stop_words only applies when analyzer="word", so it is omitted here (it had no effect).
cv = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))
cv.fit(X_train_tokenized)
X_train_encoded = cv.transform(X_train_tokenized)
X_test_encoded = cv.transform(X_test_tokenized)
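To see what features the `char_wb` analyzer with `ngram_range=(1, 4)` produces, one can call the vectorizer's analyzer directly on a short string. This is a small illustration; the sample text `"ok phone"` is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same analyzer settings as the cell above.
char_vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))

# build_analyzer() returns the function that turns one document into features.
analyze = char_vectorizer.build_analyzer()
ngrams = analyze("ok phone")

# Each word is padded with a space on both sides, and character n-grams
# (1 to 4 characters long) are taken inside each padded word only.
print(ngrams)
```

Since n-grams never cross word boundaries, a feature such as `"k p"` (end of one word, start of the next) is not generated, whereas the plain `"char"` analyzer would produce it.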



### Testing different hyperparameters

In [66]:
lr = LogisticRegression(penalty="l2", solver="newton-cg", max_iter=100)
lr.fit(X_train_encoded, y_train)
y_pred = lr.predict(X_test_encoded)
accuracy_score(y_test, y_pred)

0.647

In [68]:
lr = LogisticRegression(solver="newton-cg", max_iter=1000)
lr.fit(X_train_encoded, y_train)
y_pred = lr.predict(X_test_encoded)
accuracy_score(y_test, y_pred)

0.647

In [69]:
lr = LogisticRegression()
lr.fit(X_train_encoded, y_train)
y_pred = lr.predict(X_test_encoded)
accuracy_score(y_test, y_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.671

In [71]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_encoded, y_train)
y_pred = lr.predict(X_test_encoded)
accuracy_score(y_test, y_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.637

In [72]:
lr = LogisticRegression(solver="liblinear")
lr.fit(X_train_encoded, y_train)
y_pred = lr.predict(X_test_encoded)
accuracy_score(y_test, y_pred)

0.644
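The cells above repeat the same fit, predict, and score steps for each setting. One possible way to organize such a comparison is a loop over candidate settings, sketched here on made-up data (the `toy_*` names and the `candidate_settings` dictionary are assumptions for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy data (illustrative only, not taken from the dataset).
toy_train_reviews = [
    "great phone love it",
    "love the screen great battery",
    "terrible phone broke quickly",
    "bad battery terrible screen",
]
toy_train_ratings = [5, 5, 1, 1]
toy_test_reviews = ["love this great phone", "terrible bad phone"]
toy_test_ratings = [5, 1]

# Encode once, then reuse the same matrices for every model.
vectorizer = CountVectorizer()
X_tr = vectorizer.fit_transform(toy_train_reviews)
X_te = vectorizer.transform(toy_test_reviews)

# Hypothetical settings to compare, keyed by a readable label.
candidate_settings = {
    "lbfgs (default)": {"solver": "lbfgs"},
    "newton-cg, max_iter=1000": {"solver": "newton-cg", "max_iter": 1000},
    "liblinear": {"solver": "liblinear"},
}

results = {}
for label, params in candidate_settings.items():
    model = LogisticRegression(**params)
    model.fit(X_tr, toy_train_ratings)
    results[label] = accuracy_score(toy_test_ratings, model.predict(X_te))

for label, accuracy in results.items():
    print(f"{label}: {accuracy:.3f}")
```

Keeping the scores in one dictionary makes it easier to report the best configuration in the conclusion.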

## Summarize your conclusion here

Give the best score obtained and describe the pipeline that allows you to obtain it.