## Project
### **Performing Sentiment Analysis on Amazon reviews**
The data is located here: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Cell_Phones_and_Accessories_5.json.gz.
Because it is a gzipped JSON file, I would like to download and extract it directly in Colab. This can be done using the following lines:

In [0]:
!curl http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Cell_Phones_and_Accessories_5.json.gz -o reviews.json.gz
!gunzip reviews.json.gz
!ls

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  161M  100  161M    0     0  10.7M      0  0:00:15  0:00:15 --:--:-- 11.0M
y

reviews.json  sample_data
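
As an alternative to shelling out to `curl` and `gunzip`, a gzipped JSON-lines file can also be read without extracting it first. The sketch below uses only the standard library. The two-record file it writes (`sample_reviews.json.gz`) and the helper name `read_json_lines_gz` are hypothetical stand-ins for the real download, not part of the original notebook.

```python
import gzip
import json
import os
import tempfile

def read_json_lines_gz(path):
    """Yield one dict per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Hypothetical miniature file standing in for the real review archive.
sample_records = [
    {"overall": 5, "reviewText": "Looks even better in person."},
    {"overall": 2, "reviewText": "Did not fit my phone."},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for record in sample_records:
        handle.write(json.dumps(record) + "\n")

loaded = list(read_json_lines_gz(sample_path))
print(len(loaded), loaded[0]["overall"])
```

pandas can do the same in one call with `pd.read_json(path, lines=True, compression='gzip')`, which avoids keeping an uncompressed copy on disk.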


**Quick look at the Data**

In [0]:
import pandas as pd
data = pd.read_json('reviews.json',lines=True)

In [0]:
data

|  | overall | verified | reviewTime | reviewerID | asin | style | reviewerName | reviewText | summary | unixReviewTime | vote | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | True | 08 4, 2014 | A24E3SXTC62LJI | 7508492919 | {'Color:': ' Bling'} | Claudia Valdivia | Looks even better in person. Be careful to not... | Can't stop won't stop looking at it | 1407110400 |  |  |
| 1 | 5 | True | 02 12, 2014 | A269FLZCB4GIPV | 7508492919 |  | sarah ponce | When you don't want to spend a whole lot of ca... | 1 | 1392163200 |  |  |
| 2 | 3 | True | 02 8, 2014 | AB6CHQWHZW4TV | 7508492919 |  | Kai | so the case came on time, i love the design. I... | Its okay | 1391817600 |  |  |
| 3 | 2 | True | 02 4, 2014 | A1M117A53LEI8 | 7508492919 |  | Sharon Williams | DON'T CARE FOR IT. GAVE IT AS A GIFT AND THEY... | CASE | 1391472000 |  |  |
| 4 | 4 | True | 02 3, 2014 | A272DUT8M88ZS8 | 7508492919 |  | Bella Rodriguez | I liked it because it was cute, but the studs ... | Cute! | 1391385600 |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1128432 | 4 | True | 12 22, 2016 | A1QWMCG1FNEP3A | B01HJC7N4C |  | Amazon Customer | Good for viewing. But doesn't have a button or... | Good | 1482364800 |  |  |
| 1128433 | 5 | False | 07 15, 2016 | A3FOBEJ9UVUTR3 | B01HJC7N4C |  | David Harlow | I was given the Rockrok 3D VR Glasses Headset ... | THE FUTURE IS NOW!!!!!!! | 1468540800 |  |  |
| 1128434 | 5 | False | 07 14, 2016 | AMUEAMKB4E33M | B01HJC7N4C |  | Tom D | Super Fun! The RockRoc 3d vr headset is waaaay... | Get more out of your smartphone ....... | 1468454400 |  | [https://images-na.ssl-images-amazon.com/image... |
| 1128435 | 5 | False | 07 13, 2016 | A2EV91MMOJ3IL4 | B01HJC7N4C |  | Timber12 | Love it!\n\nI've had other VR glasses which al... | Join the VR fun train! | 1468368000 |  |  |


I can potentially perform sentiment analysis using the overall rating column as well as the review text.
I can also label each review according to its sentiment.
The summary (title) can also carry positive, negative, or neutral information about the review.

In [0]:
data.overall.value_counts()

5    707038
4    184431
3     98254
1     81539
2     57175
Name: overall, dtype: int64

In [0]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128437 entries, 0 to 1128436
Data columns (total 12 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   overall         1128437 non-null  int64 
 1   verified        1128437 non-null  bool  
 2   reviewTime      1128437 non-null  object
 3   reviewerID      1128437 non-null  object
 4   asin            1128437 non-null  object
 5   style           605241 non-null   object
 6   reviewerName    1128302 non-null  object
 7   reviewText      1127672 non-null  object
 8   summary         1127920 non-null  object
 9   unixReviewTime  1128437 non-null  int64 
 10  vote            92034 non-null    object
 11  image           27107 non-null    object
dtypes: bool(1), int64(2), object(9)
memory usage: 95.8+ MB


Based on the information above:

Dropping the style, vote and image columns, since most of their values are null.

In [0]:
df = data.drop(['style', 'vote','image'], axis = 1)

Before I explore the dataset, I am going to split the data into a training set and a test set.

My aim is to ultimately train a sentiment analysis classifier.

Since the majority of reviews are positive (5 stars), I will do a stratified split on the review score so that the training and test sets both keep the same rating proportions as the full dataset. Note that stratifying preserves the class imbalance in both sets; it does not remove it.
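
To make the idea of a stratified split concrete, here is a minimal pure-Python sketch. It is an illustration of the principle, not scikit-learn's implementation. The function name `stratified_split` and the toy ratings list are assumptions made up for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), splitting every class at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # same fraction per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy ratings, skewed toward 5 stars like the review data.
ratings = [5] * 60 + [4] * 20 + [3] * 10 + [2] * 5 + [1] * 5
train_idx, test_idx = stratified_split(ratings, test_size=0.2)

train_counts = Counter(ratings[i] for i in train_idx)
test_counts = Counter(ratings[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Each rating keeps the same share in both subsets, which is exactly what the proportion checks later in the notebook verify on the real data.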

In [0]:
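from sklearn.model_selection import StratifiedShuffleSplit
# n_splits=1: a single split suffices (with more splits the loop keeps only the last one)
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2)
for train_index, test_index in split.split(df, df["overall"]):
    # split() yields positional indices; they match the labels here because df has a default RangeIndex
    strat_train = df.reindex(train_index)
    strat_test = df.reindex(test_index)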

Checking that the train and test sets are stratified in proportion to the raw data.

In [0]:
len(strat_train)

902749

In [0]:
strat_train["overall"].value_counts()/len(strat_train) # value_counts() counts the occurrences of each rating; dividing by the length gives proportions

5    0.626564
4    0.163440
3    0.087071
1    0.072258
2    0.050667
Name: overall, dtype: float64

In [0]:
len(strat_test)

225688

In [0]:
strat_test["overall"].value_counts()/len(strat_test)

5    0.626564
4    0.163438
3    0.087072
1    0.072259
2    0.050667
Name: overall, dtype: float64

In [0]:
details = strat_train.copy()
details.head(2)

|  | overall | verified | reviewTime | reviewerID | asin | reviewerName | reviewText | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 649589 | 4 | True | 03 2, 2015 | AN6M72QDDWRDI | B00S5PZIMW | Deitre Malory | very pleased | Four Stars | 1425254400 |
| 208918 | 4 | True | 10 2, 2013 | AHO6LKSR001ZQ | B00BKWY6AW | Karin Stern | Saw this case demo'd on an online review and l... | Great case, especially for the price | 1380672000 |


Map the ratings from 1-5 to positive (4-5), neutral (3), and negative (1-2) sentiment.

In [0]:
def sentiments(overall):
    if (overall == 5) or (overall == 4):
        return "Positive"
    elif overall == 3:
        return "Neutral"
    elif (overall == 2) or (overall == 1):
        return "Negative"
# Add sentiments to the data
strat_train["Sentiment"] = strat_train["overall"].apply(sentiments)
strat_test["Sentiment"] = strat_test["overall"].apply(sentiments)
strat_train["Sentiment"][:20]

649589     Positive
208918     Positive
996414     Positive
7420       Positive
1090692    Positive
255262     Positive
515552     Positive
1086853    Positive
362674     Positive
952464     Positive
116881     Negative
841421     Positive
546658      Neutral
201829     Positive
431458     Positive
224293     Positive
427372     Positive
724886     Positive
1050266    Negative
162732     Positive
Name: Sentiment, dtype: object

In [0]:
X_train = strat_train["reviewText"]
X_train_targetSentiment = strat_train["Sentiment"]
X_test = strat_test["reviewText"]
X_test_targetSentiment = strat_test["Sentiment"]
print(len(X_train), len(X_test))

902749 225688


Using CountVectorizer, which performs:

Text preprocessing: tokenization (breaking sentences into words)

Stop words (optionally filtering "the", "are", etc.; with the default stop_words=None used here, nothing is filtered)

Occurrence counting (builds a dictionary that maps each word to an integer feature index and counts the word occurrences)

Feature vector (converts the collection of text documents into a matrix of token counts)
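
The steps listed above can be sketched in a few lines of plain Python. This is a simplified illustration rather than scikit-learn's actual code. It reuses the default `token_pattern` that appears in the printed pipeline parameters in this notebook, while the helper names `tokenize` and `count_vector` and the two toy documents are made up for the example.

```python
import re
from collections import Counter

# CountVectorizer's default token pattern: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

docs = ["The case is good, really good.", "The case broke."]

# Vocabulary: every distinct token mapped to a column index (sorted for stability).
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def count_vector(text):
    """Return the occurrence count of every vocabulary token in one document."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

matrix = [count_vector(doc) for doc in docs]
print(vocabulary)
print(matrix)
```

The real `CountVectorizer` stores the result as a sparse matrix, which is what makes a 902749 x 118421 matrix feasible in memory.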

In [0]:
# reviewText has a few missing values; replace them with a blank string so the vectorizer accepts them
X_train = X_train.fillna(' ')
X_test = X_test.fillna(' ')
X_train_targetSentiment = X_train_targetSentiment.fillna(' ')
X_test_targetSentiment = X_test_targetSentiment.fillna(' ')

# Text preprocessing and occurrence counting
from sklearn.feature_extraction.text import CountVectorizer 
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train) 
X_train_counts.shape

(902749, 118421)

Raw counts are higher on average for longer documents and for very common words that carry little meaning, which overshadows shorter documents that use the same words at the same relative frequency. To reduce this effect, I will use TfidfTransformer:

Term Frequencies (Tf) rescales the occurrence counts of each document so that document length no longer dominates (scikit-learn normalizes each row to unit L2 norm by default)

Term Frequencies times Inverse Document Frequency (Tfidf) additionally downscales the weights of words that occur in many documents (assigns less value to uninformative words, e.g. "the", "are")
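
The weighting described above can be worked through by hand on a toy example. The sketch below assumes the smoothed idf formula that scikit-learn documents for `smooth_idf=True`, namely `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The count matrix and the helper name `l2_normalize` are invented for illustration.

```python
import math

# Term counts for two toy documents over the columns ["case", "good", "the"].
counts = [
    [1, 2, 1],  # "the case good good"
    [1, 0, 1],  # "the case"
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: the number of documents each term appears in.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

# Term frequencies only (what use_idf=False computes).
tf_only = [l2_normalize(row) for row in counts]
# Term frequencies weighted by idf (what use_idf=True computes).
tfidf = [l2_normalize([v * w for v, w in zip(row, idf)]) for row in counts]

print([round(w, 3) for w in idf])
print([[round(v, 3) for v in row] for row in tfidf])
```

In this toy case "good" appears in only one document, so it receives a larger idf weight than "case" and "the", which appear in both.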

In [0]:
from sklearn.feature_extraction.text import TfidfTransformer
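# use_idf=False computes normalized term frequencies only; the pipelines below use the default use_idf=True
tfidf_transformer = TfidfTransformer(use_idf=False)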
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(902749, 118421)


Logistic Regression Classifier

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
clf_logReg_pipe = Pipeline([("vect", CountVectorizer()), 
                            ("tfidf", TfidfTransformer()), 
                            ("clf_logReg", LogisticRegression())])
clf_logReg_pipe.fit(X_train, X_train_targetSentiment)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf_logReg',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling

In [0]:
import numpy as np
predictedLogReg = clf_logReg_pipe.predict(X_test)
np.mean(predictedLogReg == X_test_targetSentiment)

0.8645873949877707

Support Vector Machine Classifier

In [0]:
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
clf_linearSVC_pipe = Pipeline([("vect", CountVectorizer()), 
                               ("tfidf", TfidfTransformer()),
                               ("clf_linearSVC", LinearSVC())])
clf_linearSVC_pipe.fit(X_train, X_train_targetSentiment)



Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf_linearSVC',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
           

In [0]:
import numpy as np
predictedLinearSVC = clf_linearSVC_pipe.predict(X_test)
np.mean(predictedLinearSVC == X_test_targetSentiment)

0.8638385736060402

Support Vector Machine Regressor

In [0]:
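from sklearn.svm import SVR
# SVR needs numeric inputs: vectorize the text and regress on the star rating, not the string label.
# Kernel SVR scales poorly with the number of samples, so fitting all ~900k reviews may be impractical.
svr_linear_pipe = Pipeline([("vect", CountVectorizer()), ("tfidf", TfidfTransformer()),
                            ("svr", SVR(kernel='linear', gamma='scale', C=1.0, epsilon=0.1))])
svr_linear_pipe.fit(X_train, strat_train["overall"])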

Multinomial Naive Bayes

In [0]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
clf_multiNB_pipe = Pipeline([("vect", CountVectorizer()), ("tfidf", TfidfTransformer()), ("clf_nominalNB", MultinomialNB())])
clf_multiNB_pipe.fit(X_train, X_train_targetSentiment)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf_nominalNB',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [0]:
import numpy as np
predictedMultiNB = clf_multiNB_pipe.predict(X_test)
np.mean(predictedMultiNB == X_test_targetSentiment)

0.808616319875226
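
To put these accuracies in context, it helps to compare them with a baseline that always predicts the majority class. The sketch below derives that baseline from the test-set rating proportions printed earlier in this notebook. The function name `sentiment` and the dictionary names are illustrative, not part of the original code.

```python
# Test-set class proportions by star rating, copied from the output above.
test_proportions = {5: 0.626564, 4: 0.163438, 3: 0.087072, 1: 0.072259, 2: 0.050667}

def sentiment(overall):
    """Same rating-to-label mapping as the sentiments() function in the notebook."""
    if overall >= 4:
        return "Positive"
    if overall == 3:
        return "Neutral"
    return "Negative"

# Sum the rating proportions per sentiment label.
share = {}
for rating, proportion in test_proportions.items():
    label = sentiment(rating)
    share[label] = share.get(label, 0.0) + proportion

majority_label = max(share, key=share.get)
baseline_accuracy = share[majority_label]
print(majority_label, round(baseline_accuracy, 4))

# Accuracies reported by the three pipelines above.
reported = {
    "LogisticRegression": 0.8645873949877707,
    "LinearSVC": 0.8638385736060402,
    "MultinomialNB": 0.808616319875226,
}
for name, acc in reported.items():
    print(f"{name}: {acc:.4f} ({acc - baseline_accuracy:+.4f} vs. baseline)")
```

Always answering "Positive" would already be right for about 79% of the test reviews, so the naive Bayes pipeline improves on the baseline only slightly, while the other two models gain roughly seven percentage points.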

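Logistic regression (0.8646) and the Support Vector Machine classifier (0.8638) reach nearly the same accuracy, both well ahead of multinomial naive Bayes (0.8086). I will continue with the Support Vector Machine classifier.
Now I will fine-tune the Support Vector Machine model (LinearSVC) to improve it while avoiding any potential over-fitting.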

Fine tuning the Support Vector Machine Classifier

Instead of tweaking the parameters of the individual components of the chain by hand (e.g. use_idf in TfidfTransformer), I will run a grid search for the best parameters over a grid of possible values. The grid search is run with the LinearSVC classifier pipeline, the parameter grid, and all available CPU cores (n_jobs=-1).
Then I will fit the grid search to the training data set and use the final classifier (after fine-tuning) to test some arbitrary reviews.
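
To see how much work the grid search implies, the candidate combinations can be enumerated by hand. The sketch below expands the same parameter grid used in the next cell with `itertools.product`. The fold count of 5 is an assumption (it is the `GridSearchCV` default in recent scikit-learn releases; older releases used 3).

```python
from itertools import product

# The same grid as in the notebook cell.
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
}

# Every combination of one value per parameter is one candidate.
names = list(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

n_folds = 5  # assumption: default cv in recent scikit-learn releases
total_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(total_fits)
```

Each candidate is fitted once per fold, and `GridSearchCV` then refits the best candidate on the whole training set (its `refit=True` default), which is why the search takes considerably longer than fitting a single pipeline.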

In [0]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],    
             'tfidf__use_idf': (True, False), 
             } 
gs_clf_LinearSVC_pipe = GridSearchCV(clf_linearSVC_pipe, parameters, n_jobs=-1)
gs_clf_LinearSVC_pipe = gs_clf_LinearSVC_pipe.fit(X_train, X_train_targetSentiment)
new_text = ["The product is good.", # positive
            "The product is ok.", # neutral
            "The product is not good."] # negative

# predict() returns the class labels directly; indexing X_train_targetSentiment
# with them would look up nonexistent index labels and yield NaN
print(", ".join(gs_clf_LinearSVC_pipe.predict(new_text)))


Positive, Neutral, Negative

After testing some arbitrary reviews, the classifier appears to behave correctly, returning Positive, Neutral and Negative for the three examples.

In [0]:
predictedGS_clf_LinearSVC_pipe = gs_clf_LinearSVC_pipe.predict(X_test)
np.mean(predictedGS_clf_LinearSVC_pipe == X_test_targetSentiment)

0.8844776860090036

Analyzing the best mean cross-validated score of the grid search

Analyzing the best estimator

Analyzing the best parameters

In [0]:
for performance_analysis in (gs_clf_LinearSVC_pipe.best_score_, 
                             gs_clf_LinearSVC_pipe.best_estimator_, 
                             gs_clf_LinearSVC_pipe.best_params_):
        print(performance_analysis)

0.8829043281924406
Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 2), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=False)),
                ('clf_linearSVC',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_sc

Precision: the fraction of the objects selected for a class that actually belong to it

Recall: the fraction of the objects that should have been selected that were actually selected

F1 score: the harmonic mean of precision and recall (1 is the best possible value, 0 the worst)

Support: the number of occurrences of each class in the test set

In [0]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print(classification_report(X_test_targetSentiment, predictedGS_clf_LinearSVC_pipe))
print('Accuracy: {}'.format(accuracy_score(X_test_targetSentiment, predictedGS_clf_LinearSVC_pipe)))

              precision    recall  f1-score   support

    Negative       0.76      0.77      0.76     27743
     Neutral       0.53      0.20      0.29     19651
    Positive       0.92      0.98      0.95    178294

    accuracy                           0.88    225688
   macro avg       0.74      0.65      0.67    225688
weighted avg       0.86      0.88      0.87    225688

Accuracy: 0.8844776860090036


In [0]:
from sklearn import metrics
metrics.confusion_matrix(X_test_targetSentiment, predictedGS_clf_LinearSVC_pipe)

array([[ 21313,   1619,   4811],
       [  4755,   3932,  10964],
       [  2120,   1803, 174371]])
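
The figures in the classification report can be recomputed directly from this confusion matrix, which makes the definitions of precision, recall and F1 concrete. The sketch below hard-codes the matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the sorted label order Negative, Neutral, Positive.

```python
labels = ["Negative", "Neutral", "Positive"]
# Rows are true classes, columns are predicted classes.
confusion = [
    [21313, 1619, 4811],
    [4755, 3932, 10964],
    [2120, 1803, 174371],
]

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    true_positives = confusion[i][i]
    predicted = sum(row[i] for row in confusion)  # column sum: predicted as this class
    actual = sum(confusion[i])                    # row sum: truly this class (support)
    precision = true_positives / predicted
    recall = true_positives / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, actual)
    print(f"{label:>8}  {precision:.2f}  {recall:.2f}  {f1:.2f}  {actual}")

macro_f1 = sum(values[2] for values in report.values()) / len(report)
print(f"accuracy {accuracy:.4f}  macro F1 {macro_f1:.2f}")
```

The middle row shows where the Neutral class loses recall: most truly neutral reviews are predicted as Positive (10964) or Negative (4755), and only 3932 are labeled correctly.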

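Finally, the overall result here shows that the products in this dataset are generally positively rated. The classifier reflects this: it identifies positive reviews well (recall 0.98) but misses most neutral ones (recall 0.20), so the 0.88 accuracy is driven largely by the dominant positive class.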

Reference: https://github.com/mick-zhang/Amazon-Reviews-using-Sentiment-Analysis/
