## About Dataset

This is a small subset of a dataset of book reviews from the Amazon Kindle Store category.

### Content

A 5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. The full dataset contains 982,619 entries; every reviewer and every product in it has at least 5 reviews.

### Columns

* asin - ID of the product, like B000FA64PK.
* helpful - helpfulness rating of the review, e.g. 2/3 (stored in the CSV as a pair such as [8, 10], i.e. 8 of 10 votes).
* overall - rating of the product (this column is named `rating` in the CSV loaded below).
* reviewText - text of the review (body).
* reviewTime - time of the review (raw).
* reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN.
* reviewerName - name of the reviewer.
* summary - summary of the review (heading).
* unixReviewTime - Unix timestamp of the review.

### Acknowledgements

This dataset is taken from the Amazon product data published by Julian McAuley (UCSD): http://jmcauley.ucsd.edu/data/amazon/

The license for the data files belongs to them.
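
As a rough illustration of the column formats listed above, the sketch below parses a `helpful` value into a ratio and converts `unixReviewTime` into a date. The helper name `helpful_ratio` and the two-row frame are assumptions made for this example (the values are copied from rows shown later in this notebook), and it assumes `helpful` is stored as a string such as `"[8, 10]"`.

```python
import ast
import pandas as pd

def helpful_ratio(raw):
    """Turn a string like "[8, 10]" into helpful_votes / total_votes.

    Returns None when the review has no votes, to avoid dividing by zero.
    (Hypothetical helper, not part of the original notebook.)
    """
    helpful_votes, total_votes = ast.literal_eval(raw)
    if total_votes == 0:
        return None
    return helpful_votes / total_votes

# Two sample rows with values taken from the data preview in this notebook
sample = pd.DataFrame({
    "helpful": ["[8, 10]", "[0, 0]"],
    "unixReviewTime": [1283385600, 1397174400],
})
sample["helpful_ratio"] = sample["helpful"].apply(helpful_ratio)
sample["reviewDate"] = pd.to_datetime(sample["unixReviewTime"], unit="s")
print(sample)
```

Reviews without votes end up as missing values in `helpful_ratio`, which keeps them distinguishable from reviews that were voted unhelpful.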

## Inspiration

* Sentiment analysis on reviews.
* Understanding how people rate the usefulness of a review / what factors influence the helpfulness of a review.
* Fake reviews / outliers.
* Best-rated product IDs, or similarity between products based on reviews alone (not the best idea, I know).
* Any other interesting analysis.

## Outline
1. Preprocessing and cleaning
2. Train/test split
3. BoW, TF-IDF, Word2Vec
4. Train ML algorithms

In [19]:
import pandas as pd 
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [5]:
data = pd.read_csv("all_kindle_review.csv")
data.head()

| | Unnamed: 0.1 | Unnamed: 0 | asin | helpful | rating | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 11539 | B0033UV8HI | [8, 10] | 3 | Jace Rankin may be short, but he's nothing to ... | 09 2, 2010 | A3HHXRELK8BHQG | Ridley | Entertaining But Average | 1283385600 |
| 1 | 1 | 5957 | B002HJV4DE | [1, 1] | 5 | Great short read. I didn't want to put it dow... | 10 8, 2013 | A2RGNZ0TRF578I | Holly Butler | Terrific menage scenes! | 1381190400 |
| 2 | 2 | 9146 | B002ZG96I4 | [0, 0] | 3 | I'll start by saying this is the first of four... | 04 11, 2014 | A3S0H2HV6U1I7F | Merissa | Snapdragon Alley | 1397174400 |
| 3 | 3 | 7038 | B002QHWOEU | [1, 3] | 3 | Aggie is Angela Lansbury who carries pocketboo... | 07 5, 2014 | AC4OQW3GZ919J | Cleargrace | very light murder cozy | 1404518400 |
| 4 | 4 | 1776 | B001A06VJ8 | [0, 1] | 4 | I did not expect this type of book to be in li... | 12 31, 2012 | A3C9V987IQHOQD | Rjostler | Book | 1356912000 |


In [7]:
# Work on an explicit copy so later column assignments do not touch `data`
data_text = data[['reviewText', 'rating']].copy()
data_text.head()

| | reviewText | rating |
|---|---|---|
| 0 | Jace Rankin may be short, but he's nothing to ... | 3 |
| 1 | Great short read. I didn't want to put it dow... | 5 |
| 2 | I'll start by saying this is the first of four... | 3 |
| 3 | Aggie is Angela Lansbury who carries pocketboo... | 3 |
| 4 | I did not expect this type of book to be in li... | 4 |


In [10]:
data_text.shape

(12000, 2)

In [11]:
# missing values 
data_text.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [12]:
data_text['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [13]:
data_text['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

In [20]:
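## Preprocessing and cleaning

# positive review (rating >= 3) => 1
# negative review (rating < 3)  => 0
# Derive the labels from the untouched data['rating'] so that re-running this
# cell does not binarize an already-binarized column (which would map every label to 0).
data_text['rating'] = data['rating'].apply(lambda x: 0 if x < 3 else 1)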

In [17]:
df = data_text
df.head()

| | reviewText | rating |
|---|---|---|
| 0 | Jace Rankin may be short, but he's nothing to ... | 1 |
| 1 | Great short read. I didn't want to put it dow... | 1 |
| 2 | I'll start by saying this is the first of four... | 1 |
| 3 | Aggie is Angela Lansbury who carries pocketboo... | 1 |
| 4 | I did not expect this type of book to be in li... | 1 |
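
One thing to watch for when binarizing labels in a notebook is that the cell may be executed more than once. The sketch below, using a toy `ratings` series and a hypothetical `binarize` helper (neither is part of the original notebook), shows what happens when the same threshold rule is applied to a column that has already been binarized.

```python
import pandas as pd

ratings = pd.Series([3, 5, 4, 2, 1], name="rating")  # toy values, 1 to 5 stars

def binarize(series):
    # Same rule as in the notebook: ratings below 3 -> 0, everything else -> 1
    return series.apply(lambda x: 0 if x < 3 else 1)

once = binarize(ratings)
twice = binarize(once)  # simulates re-running the cell on the same column

print(once.tolist())   # [1, 1, 1, 0, 0]
print(twice.tolist())  # [0, 0, 0, 0, 0]
print(once.value_counts().to_dict())
```

Because 0 and 1 are both below the threshold of 3, a second pass collapses every label to 0. Checking `value_counts()` after the step is a cheap way to confirm that both classes are still present.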


In [33]:
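import re

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the NLTK 'punkt' (or 'punkt_tab' on recent NLTK versions),
# 'stopwords' and 'wordnet' resources; install them once with nltk.download(...)
# if they are missing.

# Use a separate name so the imported `stopwords` module is not shadowed
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, strip punctuation and digits, drop stopwords, lemmatize."""

    # Convert to lowercase
    text = text.lower()

    # Remove punctuation, then digits
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Apply lemmatization (by default WordNet treats every word as a noun)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Rejoin tokens into a single string
    return ' '.join(tokens)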

In [34]:
df['cleanReviewText'] = df['reviewText'].apply(preprocess_text)

In [35]:
df.head()

| | reviewText | rating | cleanReviewText |
|---|---|---|---|
| 0 | jace rankin may be short, but he's nothing to ... | 0 | jace rankin may short he nothing mess man haul... |
| 1 | great short read. i didn't want to put it dow... | 0 | great short read didnt want put read one sitti... |
| 2 | i'll start by saying this is the first of four... | 0 | ill start saying first four book wasnt expecti... |
| 3 | aggie is angela lansbury who carries pocketboo... | 0 | aggie angela lansbury carry pocketbook instead... |
| 4 | i did not expect this type of book to be in li... | 0 | expect type book library pleased find price right |
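
The `cleanReviewText` column above contains tokens such as `didnt`, `wasnt` and `ill`. These appear because punctuation is stripped before stopwords are removed, so contractions no longer match the stopword list. The sketch below illustrates the effect of the ordering with a tiny hand-written stopword set and plain whitespace splitting; both are simplifications assumed for this example, and the function names are hypothetical.

```python
import re

# Tiny illustrative stopword list; NLTK's English list is much longer
# and also contains contractions such as "didn't".
STOP_WORDS = {"i", "to", "it", "didn't"}

def strip_then_filter(text):
    """Order used in the notebook: remove punctuation, then drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [w for w in text.split() if w not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: drop stopwords first, then remove punctuation."""
    kept = [w for w in text.lower().split() if w not in STOP_WORDS]
    cleaned = [re.sub(r"[^\w\s]", "", w) for w in kept]
    return [w for w in cleaned if w]

sentence = "I didn't want to put it down!"
print(strip_then_filter(sentence))  # ['didnt', 'want', 'put', 'down']
print(filter_then_strip(sentence))  # ['want', 'put', 'down']
```

Neither order is right in general. Keeping negations such as `didnt` may actually help a sentiment model, while the alternative order misses stopwords that are directly followed by punctuation.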


In [36]:
## Train Test Split

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df['cleanReviewText'],df['rating'],
                                              test_size=0.20)
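
The split above is random and unseeded, so the class proportions and the resulting scores can vary between runs. A possible variant, sketched here on toy data, uses the `stratify` and `random_state` parameters of `train_test_split`; the seed value and the toy arrays are arbitrary choices for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 documents, 15 positive (1) and 5 negative (0)
texts = np.array([f"doc {i}" for i in range(20)])
labels = np.array([1] * 15 + [0] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.20,
    stratify=labels,   # keep the 75/25 class ratio in both splits
    random_state=42,   # arbitrary seed, only for repeatability
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

With stratification, both splits keep the same share of positive labels as the full toy set.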

### Vectorization using BoW

In [37]:
from sklearn.feature_extraction.text import CountVectorizer

bow=CountVectorizer()

X_train_bow=bow.fit_transform(X_train).toarray()

X_test_bow=bow.transform(X_test).toarray()

### Vectorization using TF-IDF

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf=TfidfVectorizer()

X_train_tfidf=tfidf.fit_transform(X_train).toarray()

X_test_tfidf=tfidf.transform(X_test).toarray()
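
As a sketch of what `TfidfVectorizer` computes with its default settings, the example below rebuilds the weights by hand on a toy corpus and compares them with scikit-learn's result. It assumes the defaults `smooth_idf=True` and `norm="l2"`; the corpus and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["great short read", "boring read", "great great book"]  # toy corpus

# Document frequency: in how many documents each term appears
counts = CountVectorizer().fit_transform(corpus).toarray()
doc_freq = (counts > 0).sum(axis=0)
n_docs = counts.shape[0]

# Smoothed idf, as used when smooth_idf=True: ln((1 + n) / (1 + df)) + 1
idf_manual = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then L2-normalise each row (norm="l2")
weighted = counts * idf_manual
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

tfidf_vec = TfidfVectorizer()
tfidf_sklearn = tfidf_vec.fit_transform(corpus).toarray()

print(np.allclose(idf_manual, tfidf_vec.idf_))
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Terms that occur in fewer documents receive a larger idf, so they contribute more to the normalised vector than terms shared by most documents.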

### Vectorization using Word2Vec

In [42]:
# Load the pretrained Google News Word2Vec model (300-dimensional vectors)
from gensim.models import KeyedVectors
model_path = r"C:\Users\hassa\gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz"

model = KeyedVectors.load_word2vec_format(model_path, binary=True)

In [43]:
def sentence_to_vector(sentence, model, vector_size=300):
    words = sentence.split()
    word_vectors = [model[word] for word in words if word in model]
    if len(word_vectors) == 0:
        return np.zeros(vector_size)
    return np.mean(word_vectors, axis=0)

# Convert training and testing text into vectors
X_train_word2vec = np.array([sentence_to_vector(sent, model) for sent in X_train])
X_test_word2vec = np.array([sentence_to_vector(sent, model) for sent in X_test])

In [46]:
X_test_word2vec.shape

(2400, 300)
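
The averaging step can be checked by hand on a tiny example. The sketch below reuses `sentence_to_vector` unchanged but swaps the real model for a made-up dictionary of 3-dimensional vectors (an assumption for illustration; a plain dict supports the same `in` and `[]` lookups the function relies on).

```python
import numpy as np

# Made-up 3-dimensional "embeddings"; the real model has 300 dimensions
toy_model = {
    "great": np.array([1.0, 0.0, 2.0]),
    "read": np.array([3.0, 2.0, 0.0]),
}

def sentence_to_vector(sentence, model, vector_size=300):
    words = sentence.split()
    # Words missing from the model are silently skipped
    word_vectors = [model[word] for word in words if word in model]
    if len(word_vectors) == 0:
        # No known word at all: fall back to a zero vector
        return np.zeros(vector_size)
    return np.mean(word_vectors, axis=0)

print(sentence_to_vector("great unknownword read", toy_model, vector_size=3))  # [2. 1. 1.]
print(sentence_to_vector("xyzzy", toy_model, vector_size=3))                   # [0. 0. 0.]
```

Averaging discards word order, and reviews made up entirely of out-of-vocabulary words all map to the same zero vector.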

### Train the model

In [44]:
from sklearn.naive_bayes import GaussianNB

nb_model_bow=GaussianNB().fit(X_train_bow,y_train)
nb_model_tfidf=GaussianNB().fit(X_train_tfidf,y_train)
nb_model_word2vec=GaussianNB().fit(X_train_word2vec,y_train)
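
`GaussianNB` works on dense arrays, which is why the count and TF-IDF features above are converted with `.toarray()`. As a possible alternative for count features (not what this notebook uses), `MultinomialNB` accepts the sparse matrix directly. The sketch below uses a hand-written toy training set, so the texts and labels are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy, hand-written training data (1 = positive, 0 = negative)
train_texts = [
    "great read loved it",
    "loved the story",
    "boring and slow",
    "slow boring plot",
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
X_counts = vec.fit_transform(train_texts)  # stays sparse, no .toarray()

clf = MultinomialNB().fit(X_counts, train_labels)

new_texts = ["loved this great story", "boring slow read"]
predictions = clf.predict(vec.transform(new_texts))
print(predictions)  # [1 0]
```

Note that `MultinomialNB` expects non-negative features, so it would not be a drop-in replacement for the averaged Word2Vec vectors, which can contain negative values.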

### Evaluate the models

In [47]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Predict with the model that was trained on the matching representation
y_pred_bow = nb_model_bow.predict(X_test_bow)
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)
y_pred_word2vec = nb_model_word2vec.predict(X_test_word2vec)

# A bare confusion_matrix(...) call is only echoed when it is the last
# expression in the cell; wrap it in print() to display it.
confusion_matrix(y_test, y_pred_bow)
print("BOW accuracy: ", accuracy_score(y_test, y_pred_bow))

confusion_matrix(y_test, y_pred_tfidf)
print("TFIDF accuracy: ", accuracy_score(y_test, y_pred_tfidf))

confusion_matrix(y_test, y_pred_word2vec)
print("Word2Vec accuracy: ", accuracy_score(y_test, y_pred_word2vec))


BOW accuracy:  1.0
TFIDF accuracy:  1.0
Word2Vec accuracy:  1.0
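
An accuracy of exactly 1.0 for all three representations is unusual for text classification, and the `df.head()` output earlier shows a rating of 0 for rows whose original ratings were 3 to 5. This suggests the labels may have collapsed to a single class, in which case any classifier would score 1.0. A hedged sanity check is to look at the label counts and compare against a trivial baseline; the sketch below does this on hypothetical labels and made-up predictions, using scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical test labels: 8 positive and 2 negative reviews
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])  # made-up model predictions

# Baseline that always predicts the most frequent label it was fitted on
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this strategy
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_dummy))
model_acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print("label counts:", np.bincount(y_true))
print("baseline accuracy:", baseline_acc)
print("model accuracy:", model_acc)
print(cm)
```

If the baseline already reaches the same accuracy as the trained models, the score says more about the label distribution than about the features. For imbalanced labels, `classification_report` (already imported above) gives per-class precision and recall.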
