### Amazon Kindle Review Sentiment Analysis

Context:  
A small subset of a dataset of product reviews from the Amazon Kindle Store category.

Content:  
A 5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains 982619 entries in total. Each reviewer and each product has at least 5 reviews in this dataset.

Columns:  
- asin - ID of the product, like B000FA64PK  
- helpful - helpfulness rating of the review, for example 2/3 (stored in the file as "[2, 3]").  
- overall - rating of the product.  
- reviewText - full text of the review (body).  
- reviewTime - time of the review (raw).  
- reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN  
- reviewerName - name of the reviewer.  
- summary - short summary of the review (headline).  
- unixReviewTime - unix timestamp.  
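
The `helpful` column arrives as text such as "[2, 3]". A minimal sketch of turning it into numbers, assuming the first value is the count of helpful votes and the second the total votes (the helper name `helpful_ratio` is hypothetical, not part of the dataset or the notebook):

```python
import ast

def helpful_ratio(raw: str):
    """Parse a string such as "[2, 3]" into (helpful_votes, total_votes, ratio).

    Assumption: the first number counts helpful votes, the second all votes.
    The ratio is None when nobody voted, to avoid dividing by zero.
    """
    helpful_votes, total_votes = ast.literal_eval(raw)
    ratio = helpful_votes / total_votes if total_votes else None
    return helpful_votes, total_votes, ratio

# Values copied from the first rows of the dataset preview
examples = ["[0, 0]", "[2, 2]", "[0, 1]"]
parsed = [helpful_ratio(raw) for raw in examples]
for raw, row in zip(examples, parsed):
    print(raw, "->", row)
```

On a DataFrame the same function could be applied with `df['helpful'].apply(helpful_ratio)`.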

Acknowledgements:  
This dataset is taken from Amazon product data, Julian McAuley, UCSD website. http://jmcauley.ucsd.edu/data/amazon/  

The license to the data files belongs to them.  

Inspiration:  
1) Sentiment analysis on reviews.  
2) Understanding how people rate the usefulness of a review and what factors influence its helpfulness.  
3) Fake reviews and outliers.  
4) Best-rated product IDs, or similarity between products based on reviews alone.

In [51]:
import numpy as np
import pandas as pd

df = pd.read_csv(r'C:\Users\Nitin Flavier\Desktop\Data Nexus\Data Science\ML_BootCamp\ML_Algos\NLP\Data\kindle_reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,B000F83SZQ,"[0, 0]",5,I enjoy vintage books and movies so I enjoyed ...,"05 5, 2014",A1F6404F1VG29J,Avidreader,Nice vintage story,1399248000
1,1,B000F83SZQ,"[2, 2]",4,This book is a reissue of an old one; the auth...,"01 6, 2014",AN0N05A9LIJEQ,critters,Different...,1388966400
2,2,B000F83SZQ,"[2, 2]",4,This was a fairly interesting read. It had ol...,"04 4, 2014",A795DMNCJILA6,dot,Oldie,1396569600
3,3,B000F83SZQ,"[1, 1]",5,I'd never read any of the Amy Brewster mysteri...,"02 19, 2014",A1FV0SX13TWVXQ,"Elaine H. Turley ""Montana Songbird""",I really liked it.,1392768000
4,4,B000F83SZQ,"[0, 1]",4,"If you like period pieces - clothing, lingo, y...","03 19, 2014",A3SPTOKDG7WBLN,Father Dowling Fan,Period Mystery,1395187200


In [None]:
print(df.shape)

# Randomly select 12,000 records
df = df.sample(n=12000, random_state=42)


(982619, 10)


In [53]:
# We will focus on Reviews and Ratings

df = df[['reviewText','overall']]
df.head()

Unnamed: 0,reviewText,overall
869697,ARC provided by author in exchange for an hone...,5
760913,Wild Ride by Nancy WarrenChanging Gears Series...,5
159841,"Well thought out story, with many things going...",5
868915,This is book four of a five part serial. By n...,3
980703,I really enjoyed this book. It kept me interes...,5


In [54]:
df.isnull().sum()

reviewText    1
overall       0
dtype: int64

In [55]:
# drop the missing values
df = df.dropna()
df.head()

Unnamed: 0,reviewText,overall
869697,ARC provided by author in exchange for an hone...,5
760913,Wild Ride by Nancy WarrenChanging Gears Series...,5
159841,"Well thought out story, with many things going...",5
868915,This is book four of a five part serial. By n...,3
980703,I really enjoyed this book. It kept me interes...,5


In [56]:
df['overall'].unique()
df['overall'].value_counts()

overall
5    7035
4    3094
3    1146
2     420
1     304
Name: count, dtype: int64

In [57]:
# Turn the ratings into a binary label: 3 stars or more is positive (1), anything lower is negative (0)
df['sentiment'] = df['overall'].apply(lambda x:1 if x>=3 else 0)
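
To see what this threshold does to the class balance, the sketch below applies the same rule to the rating counts printed by `value_counts()` above. The names `rating_counts` and `to_sentiment` are introduced here for illustration only.

```python
# Rating counts copied from the value_counts() output of the 12,000-row sample
rating_counts = {5: 7035, 4: 3094, 3: 1146, 2: 420, 1: 304}

def to_sentiment(rating: int) -> int:
    """Same rule as the notebook: ratings >= 3 are positive (1), others negative (0)."""
    return 1 if rating >= 3 else 0

# Aggregate the per-rating counts into per-label counts
sentiment_counts = {0: 0, 1: 0}
for rating, count in rating_counts.items():
    sentiment_counts[to_sentiment(rating)] += count

total = sum(sentiment_counts.values())
positive_share = sentiment_counts[1] / total
print(sentiment_counts, f"positive share = {positive_share:.3f}")
```

With 3-star reviews counted as positive, roughly 94% of the sample ends up in the positive class, so the labels are heavily imbalanced.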

In [58]:
df.drop(['overall'],axis=1,inplace=True)
df.head()

Unnamed: 0,reviewText,sentiment
869697,ARC provided by author in exchange for an hone...,1
760913,Wild Ride by Nancy WarrenChanging Gears Series...,1
159841,"Well thought out story, with many things going...",1
868915,This is book four of a five part serial. By n...,1
980703,I really enjoyed this book. It kept me interes...,1


In [59]:
# Cleaning and preprocessing the text data

# Lower all the text
df['reviewText'] = df['reviewText'].str.lower()
df.head()

Unnamed: 0,reviewText,sentiment
869697,arc provided by author in exchange for an hone...,1
760913,wild ride by nancy warrenchanging gears series...,1
159841,"well thought out story, with many things going...",1
868915,this is book four of a five part serial. by n...,1
980703,i really enjoyed this book. it kept me interes...,1


In [60]:
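import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Fetch the NLTK resources used below (a no-op once they are downloaded)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
stop_words = set(stopwords.words('english'))  # a set gives O(1) membership checks
lemmatizer = WordNetLemmatizer()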

In [61]:
%pip install beautifulsoup4 lxml

Note: you may need to restart the kernel to use updated packages.


In [62]:
import re
from bs4 import BeautifulSoup

# remove urls/links:
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+','',x))

# removing html tags:
df['reviewText'] = df['reviewText'].apply(lambda x: BeautifulSoup(x,'lxml').get_text())

# remove special characters:
df['reviewText'] = df['reviewText'].apply(lambda x:re.sub('[^a-z A-Z 0-9-]','',x))

# remove stop words:
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join([y for y in x.split() if y not in stop_words]))

# remove additional spaces
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join(x.split()))
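
One side effect of the special-character step is worth seeing on a toy string: replacing punctuation with an empty string glues neighbouring words together when no space follows the punctuation. The sketch below uses a hypothetical helper, `strip_special`, and a made-up sample sentence; it compares the empty-string replacement used above with a single-space replacement.

```python
import re

def strip_special(text: str, replacement: str) -> str:
    """Replace every character outside a-z, A-Z, 0-9, space and hyphen,
    then collapse repeated whitespace."""
    cleaned = re.sub(r'[^a-zA-Z0-9 -]', replacement, text)
    return " ".join(cleaned.split())

sample = "honest review.this book wasn't bad"  # made-up example text

fused = strip_special(sample, '')    # same replacement as the cell above
spaced = strip_special(sample, ' ')  # alternative: keep a word boundary

print(fused)   # honest reviewthis book wasnt bad
print(spaced)  # honest review this book wasn t bad
```

Neither variant is strictly better: the space replacement keeps "review" and "this" apart but splits contractions into fragments, so the choice depends on what the downstream vectorizer should see.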



In [63]:
df.head()

Unnamed: 0,reviewText,sentiment
869697,arc provided author exchange honest reviewthis...,1
760913,wild ride nancy warrenchanging gears seriesdun...,1
159841,well thought story many things going time alie...,1
868915,book four five part serial suspense highest im...,1
980703,really enjoyed book kept interested page one w...,1


In [64]:
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join([lemmatizer.lemmatize(w) for w in x.split()]))

In [65]:
from sklearn.model_selection import train_test_split 

X_train,X_test,y_train,y_test = train_test_split(df['reviewText'],df['sentiment'],test_size=0.33,random_state=32)

### BOW

In [66]:
from sklearn.feature_extraction.text import CountVectorizer 

bow = CountVectorizer()

X_train_bow = bow.fit_transform(X_train).toarray()
X_test_bow = bow.transform(X_test).toarray()
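
As a rough illustration of what the bag-of-words step produces, here is a hand-rolled sketch on a made-up corpus. It is not how `CountVectorizer` is implemented (for example, it tokenizes with a plain `split()`), but it shows the same idea: the vocabulary is learned from the training documents only, and words that appear only in the test documents are ignored.

```python
from collections import Counter

# Made-up toy documents, already cleaned and lowercased
train_docs = ["good book good story", "bad book"]
test_docs = ["good plot"]

# "fit": collect the vocabulary from the training documents only
vocabulary = sorted({token for doc in train_docs for token in doc.split()})

def to_counts(doc: str) -> list:
    """"transform": count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

train_matrix = [to_counts(doc) for doc in train_docs]
test_matrix = [to_counts(doc) for doc in test_docs]

print(vocabulary)
print(train_matrix)
print(test_matrix)  # "plot" is not in the vocabulary, so it contributes nothing
```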

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer 

tfIdf = TfidfVectorizer() 

X_train_tfidf = tfIdf.fit_transform(X_train).toarray()
X_test_tfidf = tfIdf.transform(X_test).toarray()
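
The weighting behind TF-IDF can be sketched by hand on a made-up corpus. The version below assumes the scikit-learn defaults (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document); it is an illustration of the formula, not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Made-up, pre-tokenized toy corpus
docs = [["good", "book"], ["bad", "book"], ["good", "good", "story"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + freq)) + 1 for term, freq in doc_freq.items()}

def tfidf(doc: list) -> dict:
    """Weight each term by count * idf, then scale the vector to unit L2 length."""
    term_counts = Counter(doc)
    weights = {term: count * idf[term] for term, count in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

Terms that occur in fewer documents ("story", "bad") get a larger IDF than terms that occur in several ("good", "book"), which is the main difference from the raw counts used by bag of words.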

In [72]:
# Gaussian Naive Bayes, fitted separately on each representation
from sklearn.naive_bayes import GaussianNB 
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Use one estimator per feature set: fit() returns the estimator itself,
# so reusing a single instance would make both names point to the same model
model_bow = GaussianNB().fit(X_train_bow, y_train)
y_pred_bow = model_bow.predict(X_test_bow)

model_tf = GaussianNB().fit(X_train_tfidf, y_train)
y_pred_tf = model_tf.predict(X_test_tfidf)


print("BOW Word ---> vector")
print()
print("The accuracy score for BOW", accuracy_score(y_test,y_pred_bow))
print()
print("Confusion Matrix for BOW \n", confusion_matrix(y_test,y_pred_bow))
print()
print("Classification Report for BOW", classification_report(y_test,y_pred_bow))
print()
print()

print("TF-IDF Word ---> vector")
print()

print("The accuracy score for TF-IDF", accuracy_score(y_test,y_pred_tf))
print()
print("Confusion Matrix for TF-IDF \n", confusion_matrix(y_test,y_pred_tf))
print()
print("Classification Report TF-IDF", classification_report(y_test,y_pred_tf))
print()

BOW Word ---> vector

The accuracy score for BOW 0.8285353535353536

Confusion Matrix for BOW 
 [[  45  174]
 [ 505 3236]]

Classification Report for BOW               precision    recall  f1-score   support

           0       0.08      0.21      0.12       219
           1       0.95      0.87      0.91      3741

    accuracy                           0.83      3960
   macro avg       0.52      0.54      0.51      3960
weighted avg       0.90      0.83      0.86      3960



TF-IDF Word ---> vector

The accuracy score for TF-IDF 0.8282828282828283

Confusion Matrix for TF-IDF 
 [[  45  174]
 [ 506 3235]]

Classification Report TF-IDF               precision    recall  f1-score   support

           0       0.08      0.21      0.12       219
           1       0.95      0.86      0.90      3741

    accuracy                           0.83      3960
   macro avg       0.52      0.54      0.51      3960
weighted avg       0.90      0.83      0.86      3960
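
The headline accuracy hides how the model treats the minority class. The sketch below recomputes the class-0 metrics from the BOW confusion matrix printed above and compares the accuracy with a naive baseline that always predicts the majority class. The variable names are introduced here for illustration.

```python
# BOW confusion matrix from the output above
# rows: true class 0, true class 1; columns: predicted 0, predicted 1
confusion = [[45, 174],
             [505, 3236]]

true_neg_hits, neg_missed = confusion[0]   # true 0 predicted as 0 / as 1
pos_as_neg, true_pos_hits = confusion[1]   # true 1 predicted as 0 / as 1
total = sum(sum(row) for row in confusion)

accuracy = (true_neg_hits + true_pos_hits) / total

# Metrics for the negative class (label 0)
precision_neg = true_neg_hits / (true_neg_hits + pos_as_neg)
recall_neg = true_neg_hits / (true_neg_hits + neg_missed)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

# Baseline: always predict the most frequent true class
majority_baseline = max(sum(row) for row in confusion) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"class 0 precision = {precision_neg:.2f}, recall = {recall_neg:.2f}, f1 = {f1_neg:.2f}")
print(f"majority baseline = {majority_baseline:.4f}")
```

The numbers match the report (0.08, 0.21, 0.12 for class 0), and the baseline of about 0.94 is higher than the model's 0.83 accuracy, so on this imbalanced split accuracy alone is not a reliable measure of quality.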



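
A possible variation, not what the cells above do: `MultinomialNB` is designed for count-like features and accepts the sparse matrices that the vectorizers return, so the `.toarray()` conversion can be skipped. The sketch below runs on a made-up toy corpus; whether it improves the scores on the Kindle sample is untested here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data standing in for the cleaned reviews
train_texts = ["great story loved it", "loved the characters",
               "boring plot", "terrible boring book"]
train_labels = [1, 1, 0, 0]
test_texts = ["loved the story", "boring book"]

vectorizer = CountVectorizer()
X_train_sparse = vectorizer.fit_transform(train_texts)  # stays a scipy sparse matrix
X_test_sparse = vectorizer.transform(test_texts)

clf = MultinomialNB()  # default Laplace smoothing (alpha=1.0)
clf.fit(X_train_sparse, train_labels)
predictions = clf.predict(X_test_sparse)
print(predictions)
```

Keeping the matrices sparse matters more as the vocabulary grows, since a dense array stores a value for every document and term pair.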

### Word2Vec:

To be continued.