### About Dataset
Context

This is a small subset of the Amazon Kindle Store book-review dataset.

Content

A 5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains 982,619 entries in total. Every reviewer and every product has at least 5 reviews in this dataset. Columns:

* asin - ID of the product, e.g. B000FA64PK.
* helpful - helpfulness rating of the review, e.g. 2/3 (stored in the file as a list such as [2, 2]).
* overall - rating of the product (1 to 5).
* reviewText - full text of the review (body).
* reviewTime - time of the review (raw string, e.g. "05 5, 2014").
* reviewerID - ID of the reviewer, e.g. A3SPTOKDG7WBLN.
* reviewerName - name of the reviewer.
* summary - short summary of the review (headline).
* unixReviewTime - Unix timestamp of the review.
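When the CSV is loaded, the `helpful` column arrives as strings such as `"[2, 2]"` rather than as lists. The sketch below shows one way to split it into numeric columns. It assumes the two numbers mean `[helpful votes, total votes]`, which matches the "2/3" example in the column description but is not confirmed by the file itself. The frame `sample` and the new column names are made up for illustration.

```python
import ast
import pandas as pd

# Toy frame mimicking the `helpful` column as read from CSV (strings, not lists).
sample = pd.DataFrame({"helpful": ["[0, 0]", "[2, 2]", "[0, 1]", "[2, 3]"]})

# Safely parse each string into a Python list.
# Assumption: the list is [helpful_votes, total_votes].
parsed = [ast.literal_eval(s) for s in sample["helpful"]]
sample["helpful_votes"] = [p[0] for p in parsed]
sample["total_votes"] = [p[1] for p in parsed]

# The ratio is undefined when nobody voted, so keep it as NaN in that case.
nonzero_total = sample["total_votes"].where(sample["total_votes"] > 0)
sample["helpful_ratio"] = sample["helpful_votes"] / nonzero_total

print(sample)
```

Keeping the zero-vote rows as NaN instead of 0 avoids treating "no votes yet" as "unhelpful" in any later analysis.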

Acknowledgements

This dataset is taken from the Amazon product data collection published by Julian McAuley (UCSD): http://jmcauley.ucsd.edu/data/amazon/

The license for the data files belongs to them.

Inspiration

* Sentiment analysis on reviews.
* Understanding how people rate the usefulness of a review, and which factors influence helpfulness.
* Detecting fake reviews and outliers.
* Finding the best-rated product IDs, or measuring similarity between products based on reviews alone (probably not the best idea).
* Any other interesting analysis.

Best Practices:
 * Preprocessing and cleaning
 * Train/test split
 * BOW, TF-IDF, Word2Vec
 * Train ML algorithms

In [1]:
# Load the dataset
import pandas as pd
data=pd.read_csv('kindle_review.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,B000F83SZQ,"[0, 0]",5,I enjoy vintage books and movies so I enjoyed ...,"05 5, 2014",A1F6404F1VG29J,Avidreader,Nice vintage story,1399248000
1,1,B000F83SZQ,"[2, 2]",4,This book is a reissue of an old one; the auth...,"01 6, 2014",AN0N05A9LIJEQ,critters,Different...,1388966400
2,2,B000F83SZQ,"[2, 2]",4,This was a fairly interesting read. It had ol...,"04 4, 2014",A795DMNCJILA6,dot,Oldie,1396569600
3,3,B000F83SZQ,"[1, 1]",5,I'd never read any of the Amy Brewster mysteri...,"02 19, 2014",A1FV0SX13TWVXQ,"Elaine H. Turley ""Montana Songbird""",I really liked it.,1392768000
4,4,B000F83SZQ,"[0, 1]",4,"If you like period pieces - clothing, lingo, y...","03 19, 2014",A3SPTOKDG7WBLN,Father Dowling Fan,Period Mystery,1395187200


In [4]:
df=data[['reviewText','overall']]
df.head()

Unnamed: 0,reviewText,overall
0,I enjoy vintage books and movies so I enjoyed ...,5
1,This book is a reissue of an old one; the auth...,4
2,This was a fairly interesting read. It had ol...,4
3,I'd never read any of the Amy Brewster mysteri...,5
4,"If you like period pieces - clothing, lingo, y...",4


In [5]:
df.shape

(982619, 2)

In [6]:
## Missing Values
df.isnull().sum()

reviewText    22
overall        0
dtype: int64
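The 22 missing `reviewText` values could be dropped before any text cleaning so that they never reach the vectorizer. The sketch below shows this on a made-up frame (`toy` and `toy_clean` are hypothetical names). Applying it to the real data would change the row counts shown in later cells.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing review, standing in for the real data.
toy = pd.DataFrame({
    "reviewText": ["Great read", np.nan, "Too slow for me"],
    "overall": [5, 4, 2],
})

# Drop rows where the review text is missing, then renumber the index.
toy_clean = toy.dropna(subset=["reviewText"]).reset_index(drop=True)

print(toy_clean.isnull().sum())
print(toy_clean.shape)
```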

In [7]:
df['overall'].unique()

array([5, 4, 3, 2, 1], dtype=int64)

In [8]:
df['overall'].value_counts()

overall
5    575264
4    254013
3     96194
2     34130
1     23018
Name: count, dtype: int64

## Preprocessing And Cleaning

In [9]:
## positive review (rating >= 3) is 1 and negative review (rating < 3) is 0
df['overall']=df['overall'].apply(lambda x:0 if x<3 else 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['overall']=df['overall'].apply(lambda x:0 if x<3 else 1)
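The `SettingWithCopyWarning` above appears because `df` was created by selecting columns from `data`, so pandas cannot tell whether the assignment is meant to affect `data` as well. One common way to avoid it is to take an explicit `.copy()` when creating the subset. A minimal sketch on a made-up frame (`toy_data` and `toy_df` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for the full `data` frame.
toy_data = pd.DataFrame({
    "asin": ["B1", "B2", "B3"],
    "reviewText": ["Loved it", "It was fine", "Not for me"],
    "overall": [5, 3, 2],
})

# .copy() makes the subset an independent frame, so later assignments
# are unambiguous and leave toy_data untouched.
toy_df = toy_data[["reviewText", "overall"]].copy()

# Same labeling rule as the notebook: ratings below 3 become 0, the rest 1.
toy_df["overall"] = (toy_df["overall"] >= 3).astype(int)

print(toy_df)
```

The vectorized comparison gives the same labels as the `apply` with a lambda and tends to be faster on a frame of this size.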


In [10]:
df['overall'].value_counts()

overall
1    925471
0     57148
Name: count, dtype: int64

In [11]:
## 1. Lowercase all review text
df['reviewText']=df['reviewText'].str.lower()


In [12]:
df.head()

Unnamed: 0,reviewText,overall
0,i enjoy vintage books and movies so i enjoyed ...,1
1,this book is a reissue of an old one; the auth...,1
2,this was a fairly interesting read. it had ol...,1
3,i'd never read any of the amy brewster mysteri...,1
4,"if you like period pieces - clothing, lingo, y...",1


In [13]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LENOVO_PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
from bs4 import BeautifulSoup

In [19]:
# Convert all entries to string (avoids TypeError on missing values;
# note that missing reviews may become the literal string 'nan')
df['reviewText'] = df['reviewText'].astype(str)

# Remove URLs first, while ':' and '/' are still present for the pattern to match
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://[^\s]+', '', x))

# Remove HTML tags before '<' and '>' are stripped
df['reviewText'] = df['reviewText'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

# Remove special characters (keeps letters, digits, whitespace, and hyphens)
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s-]', '', x))

# Remove stopwords
stops = set(stopwords.words('english'))
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join([word for word in x.split() if word.lower() not in stops]))

# Remove extra spaces
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join(x.split()))

In [20]:
df.head()

Unnamed: 0,reviewText,overall
0,enjoy vintage books movies enjoyed reading boo...,1
1,book reissue old one author born 1910 era say ...,1
2,fairly interesting read old- style terminology...,1
3,id never read amy brewster mysteries one reall...,1
4,like period pieces - clothing lingo enjoy myst...,1
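Order matters in a cleaning chain like this one: the URL pattern needs `:` and `/` to match, so it only works if it runs before punctuation is stripped. A minimal sketch on a made-up string, using the same two regular expressions as the cleaning cell (the sample text and the `example.com` URL are illustrative, not from the dataset):

```python
import re

URL_RE = re.compile(r'(http|https|ftp|ssh)://[^\s]+')
SPECIAL_RE = re.compile(r'[^a-zA-Z0-9\s-]')

text = "loved it see http://example.com/book for more"

def squeeze(s):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(s.split())

# Punctuation first: ':' , '/' and '.' are already gone, so the URL
# pattern finds nothing and the mangled link stays in the text.
punct_first = squeeze(URL_RE.sub('', SPECIAL_RE.sub('', text)))

# URL first: the whole link is removed before punctuation is stripped.
url_first = squeeze(SPECIAL_RE.sub('', URL_RE.sub('', text)))

print(punct_first)
print(url_first)
```

The same reasoning applies to HTML tags, which can only be recognized while `<` and `>` are still in the text.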


In [21]:
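## Lemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet', quiet=True)  # WordNet data is required by the lemmatizer
lemmatizer=WordNetLemmatizer()
def lemmatize_words(text):
    # lemmatize() treats every word as a noun by default, so plurals are
    # reduced ("books" -> "book") but verb forms such as "enjoyed" are kept
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])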

In [22]:
df['reviewText']=df['reviewText'].apply(lambda x:lemmatize_words(x))


In [23]:
df.head()

Unnamed: 0,reviewText,overall
0,enjoy vintage book movie enjoyed reading book ...,1
1,book reissue old one author born 1910 era say ...,1
2,fairly interesting read old- style terminology...,1
3,id never read amy brewster mystery one really ...,1
4,like period piece - clothing lingo enjoy myste...,1


In [24]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df['reviewText'],df['overall'],
                                              test_size=0.20)
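Because only about 6% of the reviews are labeled negative, it can help to stratify the split so that both parts keep the same class ratio, and to fix `random_state` so the split is repeatable. The sketch below uses a made-up frame (`toy`, `X_tr`, `X_te`, `y_tr`, `y_te` are hypothetical names, and the seed value is arbitrary). It is not the split that produced the results in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 10 negative (0) and 90 positive (1) reviews.
toy = pd.DataFrame({
    "reviewText": [f"review {i}" for i in range(100)],
    "overall": [0] * 10 + [1] * 90,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["reviewText"],
    toy["overall"],
    test_size=0.20,
    stratify=toy["overall"],  # keep the 10/90 class ratio in both parts
    random_state=42,          # arbitrary seed, makes the split repeatable
)

# Share of positive labels in each part.
print(y_tr.mean(), y_te.mean())
```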

In [27]:
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer(max_features=2000)
X_train_bow=bow.fit_transform(X_train)
X_test_bow=bow.transform(X_test)

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
X_train_tfidf=tfidf.fit_transform(X_train)
X_test_tfidf=tfidf.transform(X_test)

In [31]:
from sklearn.naive_bayes import MultinomialNB
nb_model_bow = MultinomialNB().fit(X_train_bow, y_train)
nb_model_tfidf = MultinomialNB().fit(X_train_tfidf, y_train)
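With labels this imbalanced, `MultinomialNB` can lean heavily toward the majority class. scikit-learn also provides `ComplementNB`, a variant its documentation describes as suited to imbalanced data sets. The sketch below wires it into a pipeline on a tiny made-up corpus (`texts`, `labels`, and `model` are hypothetical). It only shows the API shape and was not evaluated on the Kindle reviews, so treat it as an option to try, not as a known improvement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: 1 = positive, 0 = negative.
texts = [
    "great book loved it",
    "wonderful story great characters",
    "loved the plot",
    "great read",
    "terrible boring book",
    "awful waste of time",
]
labels = [1, 1, 1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together, so the
# vocabulary is always learned from the training texts only.
model = make_pipeline(TfidfVectorizer(), ComplementNB())
model.fit(texts, labels)

preds = model.predict(["boring and awful", "great story"])
print(preds)
```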

In [33]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
y_pred_bow=nb_model_bow.predict(X_test_bow)
y_pred_tfidf=nb_model_tfidf.predict(X_test_tfidf)
confusion_matrix(y_test,y_pred_bow)

array([[  7154,   4309],
       [ 11245, 173816]], dtype=int64)

In [34]:
print("BOW accuracy: ",accuracy_score(y_test,y_pred_bow))

BOW accuracy:  0.9208544503470314


In [35]:
confusion_matrix(y_test,y_pred_tfidf)

array([[     2,  11461],
       [     0, 185061]], dtype=int64)

In [36]:
print("TFIDF accuracy: ",accuracy_score(y_test,y_pred_tfidf))

TFIDF accuracy:  0.9416814231340701
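Accuracy alone is misleading here. The TF-IDF confusion matrix shows the model predicting "positive" for all but 2 test reviews, so its accuracy is essentially the share of positive reviews in the test set. The sketch below recomputes per-class recall and balanced accuracy from the two confusion matrices printed above. The helper `summarize` is a hypothetical name, and in the notebook itself `classification_report(y_test, y_pred_bow)` would give a similar per-class breakdown.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows = true class, columns = predicted class; class 0 = negative).
cm_bow = np.array([[7154, 4309], [11245, 173816]])
cm_tfidf = np.array([[2, 11461], [0, 185061]])

def summarize(cm):
    """Accuracy, per-class recall, and balanced accuracy for a 2x2 matrix."""
    recall = np.diag(cm) / cm.sum(axis=1)
    return {
        "accuracy": float(np.trace(cm) / cm.sum()),
        "recall_negative": float(recall[0]),
        "recall_positive": float(recall[1]),
        "balanced_accuracy": float(recall.mean()),
    }

# Accuracy of always predicting the majority (positive) class.
majority_baseline = float(cm_bow[1].sum() / cm_bow.sum())

print("BOW   :", summarize(cm_bow))
print("TFIDF :", summarize(cm_tfidf))
print("Always-positive baseline accuracy:", majority_baseline)
```

On these numbers the BOW model recovers a majority of the negative reviews, while the TF-IDF model recovers almost none, despite its higher headline accuracy.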
