### Comments and Explanations before Peer Review

1. In this notebook, we investigate which model to use for building a recommender system.
2. We plan to use K-nearest neighbors (KNN). Because it finds the k most similar items, we believe it is a natural candidate for a recommender system.
3. Our model does not strictly require hyperparameter tuning, but we do try all combinations of two parameters, n_neighbors and algorithm, to find the best pair.
4. Our features are the summaries of the review texts. We tokenize the summaries and transform the tokens using CountVectorizer. We chose the summaries because they contain keywords that indicate sentiment polarity. We do not use the full review texts because our hardware is limited and processing them takes significantly longer.
5. We do not perform cross-validation, since we have plenty of data. However, we do split our data into training and test sets.
6. We use accuracy, precision, recall, and F1-score to evaluate the model.
7. We assess both the training and testing accuracy (and all other metrics) and check that the differences between them are acceptable. A large gap would suggest overfitting.
8. We think these results are about the best KNN can do. To improve, we might need other models, such as naive Bayes, TextCNN, or other more complex deep learning models.

In [1]:
!pip install wordcloud



In [1]:
import os
import json
import gzip
import pandas as pd
import numpy as np
import seaborn as sns
import nltk
import re
import string

from tqdm import tqdm
from matplotlib import pyplot as plt
from urllib.request import urlopen
from numpy.linalg import norm
from collections import defaultdict
from math import sqrt


from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn import neighbors

from scipy.spatial.distance import cosine

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_curve

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

from string import punctuation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import ngrams
from itertools import chain
from wordcloud import WordCloud, STOPWORDS
from fractions import Fraction

# default plot configurations 
%matplotlib inline 
plt.rcParams['figure.figsize'] = (16,8)
plt.rcParams['figure.dpi'] = 150
sns.set()

### Loading Data

In [2]:
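def parse(path):
    # Stream a gzip-compressed JSON-lines file, yielding one record (dict) per line
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield json.loads(l)

def getDF(path):
    # Collect all records into a dict keyed by row number, then build a DataFrame
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
        if i % 500000 == 0: print(i)  # progress indicator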
    return pd.DataFrame.from_dict(df, orient='index')
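
To check the loader without the full dataset, the sketch below writes two made-up records to a temporary gzip file and reads them back line by line, the same way `parse` does. The helper name `parse_json_lines`, the file name, and the records are illustrative assumptions, not part of the Amazon data.

```python
import gzip
import json
import os
import tempfile

def parse_json_lines(path):
    """Yield one dict per line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rb") as g:
        for line in g:
            yield json.loads(line)

# Two made-up review records (assumed fields, for illustration only)
sample_records = [
    {"asin": "A0000001", "overall": 5.0, "summary": "Five Stars"},
    {"asin": "A0000002", "overall": 1.0, "summary": "Money trap"},
]

# Write the records as one JSON object per line, gzip-compressed
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")

loaded = list(parse_json_lines(sample_path))
print(len(loaded), loaded[0]["summary"])
```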

In [4]:
print("Start loading review data")
review_data = getDF(r'C:\Users\Xylon\Desktop\data200_grad\data\Toys_and_Games.json.gz')
print("Finish loading review data")

# number of rows; each row is one review
print(len(review_data))

Start loading review data
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
4500000
5000000
5500000
6000000
6500000
7000000
7500000
8000000
Finish loading review data
8201231


In [5]:
review_data.head(5)

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,image,style
0,2.0,12.0,False,"09 22, 2016",A1IDMI31WEANAF,20232233,Mackenzie Kent,"When it comes to a DM's screen, the space on t...",The fact that 50% of this space is wasted on a...,1474502400,,
1,1.0,21.0,False,"09 18, 2016",A4BCEVVZ4Y3V3,20232233,Jonathan Christian,An Open Letter to GaleForce9*:\n\nYour unpaint...,Another worthless Dungeon Master's screen from...,1474156800,,
2,3.0,19.0,True,"09 12, 2016",A2EZ9PY1IHHBX0,20232233,unpreparedtodie,"Nice art, nice printing. Why two panels are f...","pretty, but also pretty useless",1473638400,,
3,5.0,,True,"03 2, 2017",A139PXTTC2LGHZ,20232233,Ashley,Amazing buy! Bought it as a gift for our new d...,Five Stars,1488412800,,
4,1.0,3.0,True,"02 8, 2017",A3IB33V29XIL8O,20232233,Oghma_EM,As my review of GF9's previous screens these w...,Money trap,1486512000,,


### Data Cleaning

A user can post at most one review per product at a given reviewTime. Therefore, we remove duplicates that share the same reviewerID, product asin, and unixReviewTime.

In [6]:
review_data.drop_duplicates(subset=['reviewerID', 'asin', 'unixReviewTime'], inplace=True)
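
As a small illustration of this step, the sketch below applies the same `drop_duplicates` call to a made-up four-row frame. Only rows that agree on all three key columns are treated as duplicates, and the first occurrence is kept. The toy values are assumptions.

```python
import pandas as pd

# Made-up reviews: rows 0 and 1 share reviewerID, asin, and unixReviewTime
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u1", "u2"],
    "asin": ["p1", "p1", "p1", "p1"],
    "unixReviewTime": [100, 100, 200, 100],
    "overall": [5.0, 5.0, 4.0, 3.0],
})

# Same subset as in the notebook; keeps the first row of each duplicate group
deduped = toy.drop_duplicates(subset=["reviewerID", "asin", "unixReviewTime"])
print(len(toy), "->", len(deduped))
```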

Extract useful columns.

In [7]:
review_data = review_data[['overall', 'reviewerID', 'asin', 'summary']]

Fill NaN values in 'summary' with an empty string. Drop the remaining rows that contain NaN values.

In [8]:
review_data['summary'] = review_data['summary'].fillna('')

In [9]:
review_data = review_data.dropna()

In [10]:
print(len(review_data))

8002579


In [11]:
review_data.head(5)

Unnamed: 0,overall,reviewerID,asin,summary
0,2.0,A1IDMI31WEANAF,20232233,The fact that 50% of this space is wasted on a...
1,1.0,A4BCEVVZ4Y3V3,20232233,Another worthless Dungeon Master's screen from...
2,3.0,A2EZ9PY1IHHBX0,20232233,"pretty, but also pretty useless"
3,5.0,A139PXTTC2LGHZ,20232233,Five Stars
4,1.0,A3IB33V29XIL8O,20232233,Money trap


### Data Preprocessing

Select reviews of products that have at least 50 reviews.

In [12]:
count = review_data.groupby("asin", as_index=False).count()

df_merge = pd.merge(review_data, count, how='right', on=['asin'])

df_merge = df_merge.rename(columns={"overall_x": "overall", "summary_x": "summary", "reviewerID_y": "numReviewer"})

df_merge = df_merge.sort_values(by='numReviewer', ascending=False)
df_50 = df_merge[df_merge['numReviewer'] >= 50]

df_50 = df_50[['overall', 'asin', 'summary', 'numReviewer']]
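
The same filter could be sketched more directly with `groupby(...).transform("size")`, which attaches each product's row count to every review without the intermediate merge and rename. This is an alternative under stated assumptions, not the code used for the results here: the toy data is made up, the threshold of 2 stands in for 50, and `size` counts rows whereas the merge above counts non-null reviewerID values (the two agree once NaN rows are dropped).

```python
import pandas as pd

# Made-up reviews for three products
toy = pd.DataFrame({
    "asin": ["p1", "p1", "p1", "p2", "p3", "p3"],
    "overall": [5.0, 4.0, 1.0, 3.0, 5.0, 5.0],
})

min_reviews = 2  # stands in for the threshold of 50

# Broadcast the per-product row count back onto every review row
toy["numReviewer"] = toy.groupby("asin")["overall"].transform("size")
kept = toy[toy["numReviewer"] >= min_reviews]
print(kept)
```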

In [13]:
df_50.head(5)

Unnamed: 0,overall,asin,summary,numReviewer
2343547,5.0,B004S8F7QM,Five Stars,8815
2345034,5.0,B004S8F7QM,Must have for every party or adult game night.,8815
2345032,1.0,B004S8F7QM,Trash!,8815
2345031,5.0,B004S8F7QM,Five Stars,8815
2345030,5.0,B004S8F7QM,Expect to laugh a lot,8815


In [14]:
len(df_50)

4843224

Group all review summaries by product ID into lists.

In [15]:
summary_product = df_50.groupby("asin")["summary"].apply(list)
df_summary_product = pd.DataFrame(summary_product)

In [16]:
df_summary_product.head(5)

Unnamed: 0_level_0,summary
asin,Unnamed: 1_level_1
486448789,"[Great idea..., good, but not a good price, Sh..."
545561647,"[Three Stars, they love them, but it's a ton o..."
615638996,"[Four Stars, Great product, Five Stars, Worth ..."
692770445,"[The egg is a great teaching idea, The 7-year-..."
735333467,"[Super cute but not durable, So cute!, At firs..."


Append the average overall rating for each product

In [17]:
df_mean = review_data.groupby("asin", as_index=False).mean()

In [18]:
df = pd.merge(df_summary_product, df_mean, on="asin", how='inner')
df = df[['asin','summary','overall']]

In [19]:
df['summary'] = df['summary'].astype(str)

In [20]:
df.head(5)

Unnamed: 0,asin,summary,overall
0,486448789,"['Great idea...', 'good, but not a good price'...",3.85567
1,545561647,"['Three Stars', ""they love them, but it's a to...",3.950739
2,615638996,"['Four Stars', 'Great product', 'Five Stars', ...",4.651515
3,692770445,"['The egg is a great teaching idea', 'The 7-ye...",4.294872
4,735333467,"['Super cute but not durable', 'So cute!', 'At...",4.508475


Preprocessing the summary

In [21]:
# tokenizer
regEx = re.compile('[^a-z]+')
def cleanReviews(reviewText):
    reviewText = reviewText.lower()
    reviewText = regEx.sub(' ', reviewText).strip()
    return reviewText
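
To make the effect of this cleaning step concrete, the sketch below runs the same function on a made-up stringified list of summaries. Everything that is not a lowercase letter, including digits, quotes, and brackets, collapses into a single space, which is why "7-year-old" shows up as "year old" in the cleaned column.

```python
import re

regEx = re.compile('[^a-z]+')

def cleanReviews(reviewText):
    # Lowercase, then replace every run of non-letters with one space
    reviewText = reviewText.lower()
    return regEx.sub(' ', reviewText).strip()

# Made-up input in the same shape as the stringified summary lists
raw = "['The egg is a great teaching idea', 'The 7-year-old loved it!']"
cleaned = cleanReviews(raw)
print(cleaned)
# the egg is a great teaching idea the year old loved it
```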

In [22]:
# apply the cleaning function to every product's concatenated summaries
df["summaryClean"] = df["summary"].apply(cleanReviews)

In [23]:
df.head(5)

Unnamed: 0,asin,summary,overall,summaryClean
0,486448789,"['Great idea...', 'good, but not a good price'...",3.85567,great idea good but not a good price should be...
1,545561647,"['Three Stars', ""they love them, but it's a to...",3.950739,three stars they love them but it s a ton of d...
2,615638996,"['Four Stars', 'Great product', 'Five Stars', ...",4.651515,four stars great product five stars worth the ...
3,692770445,"['The egg is a great teaching idea', 'The 7-ye...",4.294872,the egg is a great teaching idea the year old ...
4,735333467,"['Super cute but not durable', 'So cute!', 'At...",4.508475,super cute but not durable so cute at first i ...


### Feature Extraction

In [24]:
reviews = df["summaryClean"] 

In [25]:
# might be able to use TfIdf tokenizer
countVector = CountVectorizer(max_features = 300, stop_words='english') 
transformed_reviews = countVector.fit_transform(reviews) 
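
As a rough illustration of what the count features represent, the sketch below builds a tiny bag-of-words matrix by hand with `collections.Counter`. It is a simplification and not CountVectorizer's implementation: the stop-word set is a made-up three-word list rather than scikit-learn's English list, tokenization is a plain whitespace split, and the documents are invented.

```python
from collections import Counter

# Made-up cleaned documents, one per product
docs = [
    "great toy great price",
    "cute toy but not durable",
    "great price for the kids",
]
stop_words = {"but", "for", "the"}  # assumed toy stop-word list
max_features = 3

# Tokenize and drop stop words
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]

# Keep the max_features most frequent words across the corpus,
# then order the columns alphabetically
corpus_counts = Counter(w for doc in tokenized for w in doc)
vocab = sorted(w for w, _ in corpus_counts.most_common(max_features))

# One row per document, one column per vocabulary word
matrix = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
print(vocab)
print(matrix)
```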

In [26]:
df_feature = pd.DataFrame(transformed_reviews.A, columns=countVector.get_feature_names())
df_feature = df_feature.astype(int)

In [27]:
df_feature.head(5)

Unnamed: 0,absolutely,actually,addition,adorable,adults,advertised,age,ages,amazing,amazon,...,work,worked,working,works,worth,wrong,year,years,young,yr
0,0,0,1,0,0,1,1,0,0,0,...,0,1,0,0,0,0,4,0,0,1
1,0,0,0,9,1,0,2,1,0,0,...,1,0,0,1,3,0,7,0,0,1
2,0,0,1,0,0,0,1,1,0,0,...,3,0,1,2,3,0,2,0,0,1
3,0,0,1,0,1,0,0,1,0,0,...,0,0,0,0,0,0,2,1,0,0
4,2,0,0,2,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0


### Train-test split

In [28]:
X = np.array(df_feature)

train_size = 0.8
tsize = int(np.floor(train_size * len(df_feature)))
X_train = X[:tsize]
X_test = X[tsize:]

print("Length of training set:", len(X_train))
print("Length of test set:", len(X_test))

Length of training set: 23846
Length of test set: 5962
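
The split above takes the first 80% of rows in their existing order, which appears to follow the asin ordering produced by the merge. If that ordering correlates with the target, a shuffled split may be worth comparing against. Below is a minimal sketch on placeholder row indices. The seed and the size of 10 are arbitrary assumptions. `train_test_split`, which is already imported, shuffles by default and could serve the same purpose.

```python
import random

n_rows = 10        # placeholder for len(df_feature)
train_size = 0.8

# Shuffle row indices reproducibly, then cut at the 80% mark
indices = list(range(n_rows))
rng = random.Random(42)  # fixed seed, arbitrary choice
rng.shuffle(indices)

tsize = int(train_size * n_rows)
train_idx, test_idx = indices[:tsize], indices[tsize:]
print(len(train_idx), len(test_idx))
```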


### Recommendation System (KNN)

In [29]:
knn = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(X_train)
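
With the default metric, NearestNeighbors ranks training rows by Euclidean distance to the query vector. The sketch below performs the same lookup by brute force on a made-up four-row count matrix, to show what the index array returned by `kneighbors` contains. The vectors and the helper `k_nearest` are illustrative assumptions.

```python
import math

# Made-up count vectors for four training products
X_train_toy = [
    [2, 0, 1],
    [0, 3, 0],
    [2, 1, 1],
    [0, 0, 4],
]
query = [2, 0, 2]  # made-up test product

def k_nearest(train, q, k):
    """Return indices of the k rows of train closest to q (Euclidean distance)."""
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], q))
    return order[:k]

neighbors_idx = k_nearest(X_train_toy, query, k=2)
print(neighbors_idx)
# [0, 2]
```

The returned values are row positions in the training matrix, which is why the next cell uses them to look up `asin` and `overall` in `df`.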

In [30]:
# find most related products for the first 20 products
for i in range(20):
    a = knn.kneighbors([X_test[i]])
    related_product_list = a[1]

    # indices (into the training set) of the two nearest neighbors
    first_related_product = int(related_product_list[0][0])
    second_related_product = int(related_product_list[0][1])
    
    print ("Based on product reviews, for ", df["asin"][len(X_train) + i] ," average rating is ",df["overall"][len(X_train) + i])
    print ("The first similar product is ", df["asin"][first_related_product] ," average rating is ",df["overall"][first_related_product])
    print ("The second similar product is ", df["asin"][second_related_product] ," average rating is ",df["overall"][second_related_product])
    print ("-----------------------------------------------------------")

Based on product reviews, for  B00WXYNLYS  average rating is  4.796296296296297
The first similar product is  B00EZIKSZK  average rating is  4.96969696969697
The second similar product is  B00F14IHO6  average rating is  4.515625
-----------------------------------------------------------
Based on product reviews, for  B00WXYNS2S  average rating is  4.71830985915493
The first similar product is  B00R8ZVPVS  average rating is  4.783333333333333
The second similar product is  B00DR7T8W4  average rating is  4.672727272727273
-----------------------------------------------------------
Based on product reviews, for  B00WXYNSJQ  average rating is  4.633333333333334
The first similar product is  B00I3MOU58  average rating is  4.69811320754717
The second similar product is  B00AZP3ZGG  average rating is  4.773584905660377
-----------------------------------------------------------
Based on product reviews, for  B00WZU720Y  average rating is  4.26530612244898
The first similar product is  B0078Z

### Predict overall rating using KNN

#### n_neighbors=3, algorithm='ball_tree'

In [31]:
y_train = df["overall"][:len(X_train)]
y_test = df["overall"][len(X_train):]
# astype(int) truncates the average rating (e.g. 4.8 -> 4) to get integer class labels
y_train = y_train.astype(int)
y_test = y_test.astype(int)
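
Note that `astype(int)` truncates rather than rounds, so a product averaging 4.8 stars is labeled 4, and only products averaging exactly 5.0 land in class 5. This would explain the very small support for class 5 in the reports that follow. The sketch compares truncation with rounding on a few averages taken from the outputs above, plus an exact 5.0. The rounded labels are shown for comparison only and are not used in this notebook.

```python
# Average ratings copied from earlier outputs, plus an exact 5.0
averages = [4.796296296296297, 4.515625, 3.85567, 4.96969696969697, 5.0]

# int() drops the fractional part, like astype(int) does for positive floats
truncated = [int(a) for a in averages]

# Round-half-up, for comparison only
rounded = [int(a + 0.5) for a in averages]

print(truncated)  # [4, 4, 3, 4, 5]
print(rounded)    # [5, 5, 4, 5, 5]
```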

In [32]:
n_neighbors = 3
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance', algorithm='ball_tree')
knnclf.fit(X_train, y_train)
y_train_pred = knnclf.predict(X_train)
y_test_pred = knnclf.predict(X_test)

print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00        81
           2       1.00      1.00      1.00       814
           3       1.00      1.00      1.00      5225
           4       1.00      1.00      1.00     17715
           5       1.00      1.00      1.00        11

    accuracy                           1.00     23846
   macro avg       1.00      1.00      1.00     23846
weighted avg       1.00      1.00      1.00     23846

              precision    recall  f1-score   support

           1       0.50      0.56      0.53        18
           2       0.78      0.43      0.55       226
           3       0.65      0.38      0.48      1340
           4       0.84      0.96      0.89      4376
           5       0.00      0.00      0.00         2

    accuracy                           0.81      5962
   macro avg       0.55      0.46      0.49      5962
weighted avg       0.79      0.81      0.79      5962
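
To show how the columns of the test report relate, the sketch below recomputes F1 as the harmonic mean of precision and recall, and the macro average as the unweighted mean of the per-class F1 scores. The inputs are the two-decimal values printed above, so a recomputed F1 can differ from the reported one by about 0.01.

```python
# (precision, recall, reported f1) per class, copied from the test report above
rows = {
    1: (0.50, 0.56, 0.53),
    2: (0.78, 0.43, 0.55),
    3: (0.65, 0.38, 0.48),
    4: (0.84, 0.96, 0.89),
    5: (0.00, 0.00, 0.00),
}

def f1(p, r):
    # Harmonic mean of precision and recall; 0 when both are 0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

for label, (p, r, reported) in rows.items():
    print(label, round(f1(p, r), 3), reported)

# Macro average: every class counts equally, regardless of support
reported_f1 = [reported for (_, _, reported) in rows.values()]
macro_f1 = sum(reported_f1) / len(reported_f1)
print(round(macro_f1, 2))
```

Because the macro average weights the rare classes 1 and 5 as heavily as class 4, it comes out well below the weighted average.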



#### n_neighbors=3, algorithm='brute'

In [33]:
n_neighbors = 3
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance', algorithm='brute')
knnclf.fit(X_train, y_train)
y_test_pred = knnclf.predict(X_test)

print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           1       0.50      0.56      0.53        18
           2       0.79      0.42      0.55       226
           3       0.65      0.37      0.47      1340
           4       0.83      0.96      0.89      4376
           5       0.00      0.00      0.00         2

    accuracy                           0.81      5962
   macro avg       0.55      0.46      0.49      5962
weighted avg       0.79      0.81      0.78      5962



#### n_neighbors=3, algorithm='kd_tree'

In [34]:
n_neighbors = 3
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance', algorithm='kd_tree')
knnclf.fit(X_train, y_train)
y_test_pred = knnclf.predict(X_test)

print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           1       0.50      0.56      0.53        18
           2       0.78      0.43      0.56       226
           3       0.65      0.37      0.47      1340
           4       0.83      0.96      0.89      4376
           5       0.00      0.00      0.00         2

    accuracy                           0.81      5962
   macro avg       0.55      0.46      0.49      5962
weighted avg       0.79      0.81      0.79      5962



#### n_neighbors=5, algorithm='ball_tree'

In [35]:
n_neighbors = 5
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance', algorithm='ball_tree')
knnclf.fit(X_train, y_train)
y_test_pred = knnclf.predict(X_test)

print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           1       0.53      0.44      0.48        18
           2       0.81      0.42      0.55       226
           3       0.67      0.34      0.45      1340
           4       0.83      0.97      0.90      4376
           5       0.00      0.00      0.00         2

    accuracy                           0.81      5962
   macro avg       0.57      0.44      0.48      5962
weighted avg       0.79      0.81      0.78      5962





#### n_neighbors=5, algorithm='brute'

In [36]:
n_neighbors = 5
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance', algorithm='brute')
knnclf.fit(X_train, y_train)
y_test_pred = knnclf.predict(X_test)

print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           1       0.53      0.44      0.48        18
           2       0.80      0.42      0.55       226
           3       0.67      0.35      0.46      1340
           4       0.83      0.97      0.90      4376
           5       0.00      0.00      0.00         2

    accuracy                           0.81      5962
   macro avg       0.57      0.44      0.48      5962
weighted avg       0.79      0.81      0.78      5962





#### n_neighbors=5, algorithm='kd_tree'

In [37]:
n_neighbors = 5
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance', algorithm='kd_tree')
knnclf.fit(X_train, y_train)
y_test_pred = knnclf.predict(X_test)

print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           1       0.53      0.44      0.48        18
           2       0.81      0.42      0.55       226
           3       0.67      0.34      0.45      1340
           4       0.83      0.97      0.90      4376
           5       0.00      0.00      0.00         2

    accuracy                           0.81      5962
   macro avg       0.57      0.44      0.48      5962
weighted avg       0.79      0.81      0.78      5962
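
All six configurations reach 0.81 test accuracy, so it may help to compare that figure with the trivial baseline of always predicting the most frequent class. The sketch below derives that baseline from the support column of the reports above. The already imported DummyClassifier could compute a similar baseline directly, using the most frequent class of the training set.

```python
# Test-set support per class, copied from the classification reports above
support = {1: 18, 2: 226, 3: 1340, 4: 4376, 5: 2}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy obtained by predicting the majority class for every product
baseline_accuracy = support[majority_class] / total
print(majority_class, round(baseline_accuracy, 2))
# 4 0.73
```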





### Modifications after Peer Review
1. We add new features: price and image (whether the review includes images).
2. We use reviewText (instead of summary) to stay consistent with the logistic regression.
3. We tune the parameters using grid search.

### Loading Data

In [3]:
print("Start loading review data")
review_data_m = getDF(r'C:\Users\Xylon\Desktop\data200_grad\data\Toys_and_Games.json.gz')
print("Finish loading review data")

# number of rows; each row is one review
print(len(review_data_m))

Start loading review data
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
4500000
5000000
5500000
6000000
6500000
7000000
7500000
8000000
Finish loading review data
8201231


In [4]:
review_data_m.head(5)

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,image,style
0,2.0,12.0,False,"09 22, 2016",A1IDMI31WEANAF,20232233,Mackenzie Kent,"When it comes to a DM's screen, the space on t...",The fact that 50% of this space is wasted on a...,1474502400,,
1,1.0,21.0,False,"09 18, 2016",A4BCEVVZ4Y3V3,20232233,Jonathan Christian,An Open Letter to GaleForce9*:\n\nYour unpaint...,Another worthless Dungeon Master's screen from...,1474156800,,
2,3.0,19.0,True,"09 12, 2016",A2EZ9PY1IHHBX0,20232233,unpreparedtodie,"Nice art, nice printing. Why two panels are f...","pretty, but also pretty useless",1473638400,,
3,5.0,,True,"03 2, 2017",A139PXTTC2LGHZ,20232233,Ashley,Amazing buy! Bought it as a gift for our new d...,Five Stars,1488412800,,
4,1.0,3.0,True,"02 8, 2017",A3IB33V29XIL8O,20232233,Oghma_EM,As my review of GF9's previous screens these w...,Money trap,1486512000,,


In [5]:
print("Start loading meta data")
meta_data = getDF(r'C:\Users\Xylon\Desktop\data200_grad\data\meta_Toys_and_Games.json.gz')
print("Finish loading meta data")
# number of rows; each row is one product
print(len(meta_data))

Start loading meta data
500000
Finish loading meta data
633883


### Data Cleaning

#### review

A user can post at most one review per product at a given reviewTime. Therefore, we remove duplicates that share the same reviewerID, product asin, and unixReviewTime.

In [6]:
review_data_m.drop_duplicates(subset=['reviewerID', 'asin', 'unixReviewTime'], inplace=True)

Extract useful columns.

In [7]:
review_data_m = review_data_m[['overall', 'reviewerID', 'asin', 'reviewText', 'summary', 'image']]

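Fill NaN values in 'reviewText' and 'summary' with an empty string, and encode 'image' as a binary 'hasImage' flag (1 if the review has images, 0 otherwise). Drop the remaining rows that contain NaN values.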

In [8]:
review_data_m['reviewText'] = review_data_m['reviewText'].fillna('')
review_data_m['summary'] = review_data_m['summary'].fillna('')

In [9]:
review_data_m['hasImage'] = np.where(review_data_m['image'].isnull(), 0, 1)
review_data_m = review_data_m.drop(columns=['image'])

In [10]:
review_data_m = review_data_m.dropna()

In [11]:
print(len(review_data_m))

8002579


In [12]:
review_data_m.head(5)

Unnamed: 0,overall,reviewerID,asin,reviewText,summary,hasImage
0,2.0,A1IDMI31WEANAF,20232233,"When it comes to a DM's screen, the space on t...",The fact that 50% of this space is wasted on a...,0
1,1.0,A4BCEVVZ4Y3V3,20232233,An Open Letter to GaleForce9*:\n\nYour unpaint...,Another worthless Dungeon Master's screen from...,0
2,3.0,A2EZ9PY1IHHBX0,20232233,"Nice art, nice printing. Why two panels are f...","pretty, but also pretty useless",0
3,5.0,A139PXTTC2LGHZ,20232233,Amazing buy! Bought it as a gift for our new d...,Five Stars,0
4,1.0,A3IB33V29XIL8O,20232233,As my review of GF9's previous screens these w...,Money trap,0


#### meta

In [13]:
meta_data = meta_data[['price', 'asin']]

In [14]:
meta_data = meta_data[meta_data['price'].str.match(r'\$[0-9]*.[0-9]*')]
meta_data['price'] = meta_data['price'].str.extract(r'\$(.*)')
meta_data['price'] = pd.to_numeric(meta_data['price'],errors='coerce')
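
To see which price strings survive this cleaning, the sketch below applies the same two patterns to a few made-up values using the standard `re` module. `float()` with a `None` fallback stands in for `pd.to_numeric(..., errors='coerce')`, so edge cases may differ from pandas. Because the dot in the first pattern is unescaped, it matches any character, so a value such as "$5" passes the filter as well.

```python
import re

starts_with_price = re.compile(r'\$[0-9]*.[0-9]*')  # filter pattern from the cell above
after_dollar = re.compile(r'\$(.*)')                # extraction pattern from the cell above

def parse_price(raw):
    """Return the price as a float, or None when it cannot be parsed."""
    if not starts_with_price.match(raw):
        return None
    text = after_dollar.search(raw).group(1)
    try:
        return float(text)
    except ValueError:
        return None  # stands in for errors='coerce' producing NaN

# Made-up price strings
samples = ["$13.50", "$5", "$1,299.00", "$10.00 - $20.00", "", "Currently unavailable"]
parsed = [parse_price(s) for s in samples]
print(parsed)
# [13.5, 5.0, None, None, None, None]
```

Values with thousands separators or price ranges fail the numeric conversion and are removed by the `dropna()` that follows.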

In [15]:
meta_data = meta_data.dropna()
print(len(meta_data))

314276


In [16]:
meta_data.head(5)

Unnamed: 0,price,asin
3,24.95,0004983289
4,4.92,0006466222
5,13.5,0020232233
9,35.09,019848710X
10,28.81,0198487126


#### merge

In [17]:
all_data = pd.merge(review_data_m, meta_data, on="asin")

In [18]:
len(all_data)

5629215

In [19]:
all_data.head(5)

Unnamed: 0,overall,reviewerID,asin,reviewText,summary,hasImage,price
0,2.0,A1IDMI31WEANAF,20232233,"When it comes to a DM's screen, the space on t...",The fact that 50% of this space is wasted on a...,0,13.5
1,1.0,A4BCEVVZ4Y3V3,20232233,An Open Letter to GaleForce9*:\n\nYour unpaint...,Another worthless Dungeon Master's screen from...,0,13.5
2,3.0,A2EZ9PY1IHHBX0,20232233,"Nice art, nice printing. Why two panels are f...","pretty, but also pretty useless",0,13.5
3,5.0,A139PXTTC2LGHZ,20232233,Amazing buy! Bought it as a gift for our new d...,Five Stars,0,13.5
4,1.0,A3IB33V29XIL8O,20232233,As my review of GF9's previous screens these w...,Money trap,0,13.5


### Data Preprocessing

Select reviews of products that have at least 50 reviews.

In [20]:
count = all_data.groupby("asin", as_index=False).count()

In [21]:
df_merge = pd.merge(all_data, count, how='right', on=['asin'])

In [22]:
df_merge.head(5)

Unnamed: 0,overall_x,reviewerID_x,asin,reviewText_x,summary_x,hasImage_x,price_x,overall_y,reviewerID_y,reviewText_y,summary_y,hasImage_y,price_y
0,5.0,ASZZ869682237,4983289,Love this game!!! Game came with 4 different c...,Fun Game!!,0,24.95,5,5,5,5,5,5
1,5.0,A2N9CIXRV5BBKM,4983289,"Love this game! It's fast, entertaining and j...",Five Stars,0,24.95,5,5,5,5,5,5
2,5.0,A34H7FVUZCM87N,4983289,One of the best games I have played. Fun and p...,Love It,0,24.95,5,5,5,5,5,5
3,5.0,AZSMJSWX83C48,4983289,My son was very excited to receive this as a g...,It is a fun family card game,0,24.95,5,5,5,5,5,5
4,4.0,AWIZYTEY5MXC0,4983289,Fun game!,Four Stars,0,24.95,5,5,5,5,5,5


In [23]:
df_merge = df_merge.rename(columns={"overall_x": "overall", "reviewText_x": "reviewText", "reviewerID_y": "numReviewer", "hasImage_x": "hasImage", "price_x": "price", "summary_x": "summary"})
df_merge = df_merge.sort_values(by='numReviewer', ascending=False)

In [24]:
df_50 = df_merge[df_merge['numReviewer'] >= 50]

In [25]:
df_50 = df_50[['overall', 'asin', 'reviewText', 'summary', 'hasImage', 'price', 'numReviewer']]

In [26]:
df_50.head(5)

Unnamed: 0,overall,asin,reviewText,summary,hasImage,price,numReviewer
1744783,5.0,B004S8F7QM,Another great party game to get people to come...,Five Stars,0,25.0,8815
1747505,5.0,B004S8F7QM,Do I really have to tell yo how much fun this ...,Five Stars,0,25.0,8815
1747492,5.0,B004S8F7QM,Best crude game ever!,Too funny!,0,25.0,8815
1747493,5.0,B004S8F7QM,such a great game to play with other adults wh...,Fun adult game even better with drinks!,0,25.0,8815
1747494,3.0,B004S8F7QM,Another fun game for the family,Three Stars,0,25.0,8815


In [27]:
len(df_50)

3827457

Group the review texts and the summaries by product ID into lists.

In [28]:
review_product = df_50.groupby("asin")["reviewText"].apply(list)
df_review_product = pd.DataFrame(review_product)

In [29]:
df_review_product.head(5)

Unnamed: 0_level_0,reviewText
asin,Unnamed: 1_level_1
486448789,[I wouldn't give it one star. I paid $4.39 and...
545561647,[This is much harder than it looks especially ...
615638996,"[Great tool #SchoolPsychologist, A little expe..."
692770445,"[Loved loved loved!, Wonderful product, A beau..."
735333467,[These are so much cuter for my daughter's mag...


In [30]:
summary_product = df_50.groupby("asin")["summary"].apply(list)
df_summary_product = pd.DataFrame(summary_product)

In [31]:
df_summary_product.head(5)

Unnamed: 0_level_0,summary
asin,Unnamed: 1_level_1
486448789,"[Don't wast your money., Four Stars, awesome!,..."
545561647,"[It's hard but worth it, Had so much fun with ..."
615638996,"[Five Stars, helpful therapy tool for younger ..."
692770445,"[Five Stars, Five Stars, Five Stars, This is o..."
735333467,"[So cute!, At first I loved these magnets for ..."


Append the average overall rating for each product

In [32]:
df_mean = all_data.groupby("asin", as_index=False).mean()

In [33]:
df = pd.merge(df_review_product, df_mean, on="asin", how='inner')

In [35]:
df['reviewText'] = df['reviewText'].astype(str)

In [37]:
df = pd.merge(df_summary_product, df, on="asin", how='inner')

In [42]:
df['summary'] = df['summary'].astype(str)

In [38]:
df.head(5)

Unnamed: 0,asin,summary,reviewText,overall,hasImage,price
0,486448789,"[Don't wast your money., Four Stars, awesome!,...","[""I wouldn't give it one star. I paid $4.39 an...",3.85567,0.010309,3.04
1,545561647,"[It's hard but worth it, Had so much fun with ...","[""This is much harder than it looks especially...",3.950739,0.044335,17.45
2,615638996,"[Five Stars, helpful therapy tool for younger ...","['Great tool #SchoolPsychologist', ""A little e...",4.651515,0.0,19.95
3,692770445,"[Five Stars, Five Stars, Five Stars, This is o...","['Loved loved loved!', 'Wonderful product', 'A...",4.294872,0.089744,36.99
4,735333467,"[So cute!, At first I loved these magnets for ...","[""These are so much cuter for my daughter's ma...",4.508475,0.016949,22.6


In [39]:
# tokenizer
regEx = re.compile('[^a-z]+')
def cleanReviews(reviewText):
    reviewText = reviewText.lower()
    reviewText = regEx.sub(' ', reviewText).strip()
    return reviewText

In [40]:
df["reviewClean"] = df["reviewText"].apply(cleanReviews)

In [43]:
df["summaryClean"] = df["summary"].apply(cleanReviews)

In [44]:
df.head(5)

Unnamed: 0,asin,summary,reviewText,overall,hasImage,price,reviewClean,summaryClean
0,486448789,"[""Don't wast your money."", 'Four Stars', 'awes...","[""I wouldn't give it one star. I paid $4.39 an...",3.85567,0.010309,3.04,i wouldn t give it one star i paid and the pri...,don t wast your money four stars awesome read ...
1,545561647,"[""It's hard but worth it"", 'Had so much fun wi...","[""This is much harder than it looks especially...",3.950739,0.044335,17.45,this is much harder than it looks especially w...,it s hard but worth it had so much fun with it...
2,615638996,"['Five Stars', 'helpful therapy tool for young...","['Great tool #SchoolPsychologist', ""A little e...",4.651515,0.0,19.95,great tool schoolpsychologist a little expensi...,five stars helpful therapy tool for younger ki...
3,692770445,"['Five Stars', 'Five Stars', 'Five Stars', 'Th...","['Loved loved loved!', 'Wonderful product', 'A...",4.294872,0.089744,36.99,loved loved loved wonderful product a beautifu...,five stars five stars five stars this is one o...
4,735333467,"['So cute!', 'At first I loved these magnets f...","[""These are so much cuter for my daughter's ma...",4.508475,0.016949,22.6,these are so much cuter for my daughter s magn...,so cute at first i loved these magnets for my ...


### Feature Extraction

#### summary only

In [45]:
summaries = df["summaryClean"] 

In [46]:
# might be able to use TfIdf tokenizer
countVector = CountVectorizer(max_features = 300, stop_words='english') 
transformed_summaries = countVector.fit_transform(summaries) 

In [47]:
feature_summary_only = pd.DataFrame(transformed_summaries.A, columns=countVector.get_feature_names())
feature_summary_only = feature_summary_only.astype(int)

In [48]:
feature_summary_only.head(5)

Unnamed: 0,absolutely,activity,actually,addition,adorable,adults,advertised,age,ages,amazing,...,work,worked,working,works,worth,wrong,year,years,young,yr
0,0,7,0,1,0,0,1,1,0,0,...,0,1,0,0,0,0,4,0,0,1
1,0,1,0,0,9,1,0,2,1,0,...,1,0,0,1,3,0,7,0,0,1
2,0,0,0,1,0,0,0,1,1,0,...,3,0,1,2,3,0,2,0,0,1
3,0,0,0,1,0,1,0,0,1,0,...,0,0,0,0,0,0,2,1,0,0
4,2,0,0,0,2,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0


#### summary + price + image

In [49]:
summaries = df["summaryClean"] 

In [50]:
# might be able to use TfIdf tokenizer
countVector = CountVectorizer(max_features = 300, stop_words='english') 
transformed_summaries = countVector.fit_transform(summaries) 

In [51]:
feature_summary = pd.DataFrame(transformed_summaries.A, columns=countVector.get_feature_names())
feature_summary = feature_summary.astype(int)

In [52]:
feature_summary['hasImage'] = df['hasImage']
feature_summary['price'] = df['price']

In [53]:
feature_summary.head(5)

Unnamed: 0,absolutely,activity,actually,addition,adorable,adults,advertised,age,ages,amazing,...,worked,working,works,worth,wrong,year,years,young,yr,hasImage
0,0,7,0,1,0,0,1,1,0,0,...,1,0,0,0,0,4,0,0,1,0.010309
1,0,1,0,0,9,1,0,2,1,0,...,0,0,1,3,0,7,0,0,1,0.044335
2,0,0,0,1,0,0,0,1,1,0,...,0,1,2,3,0,2,0,0,1,0.0
3,0,0,0,1,0,1,0,0,1,0,...,0,0,0,0,0,2,1,0,0,0.089744
4,2,0,0,0,2,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0.016949


#### review only

In [77]:
reviews = df["reviewClean"] 

In [78]:
# might be able to use TfIdf tokenizer
countVector = CountVectorizer(max_features = 300, stop_words='english') 
transformed_reviews = countVector.fit_transform(reviews) 

In [79]:
feature_review_only = pd.DataFrame(transformed_reviews.A, columns=countVector.get_feature_names())
feature_review_only = feature_review_only.astype(int)

In [80]:
feature_review_only.head(5)

Unnamed: 0,able,absolutely,actually,add,adorable,adults,age,ages,amazing,amazon,...,wonderful,work,worked,working,works,worth,year,years,young,yr
0,1,0,2,0,0,0,3,2,0,4,...,1,6,1,2,1,8,20,4,1,0
1,9,1,8,4,21,6,6,4,0,1,...,3,13,2,1,4,2,50,10,7,4
2,3,1,5,2,0,3,7,16,0,0,...,5,18,0,12,7,2,17,0,1,2
3,3,2,3,2,0,2,4,1,1,0,...,4,0,1,0,0,0,16,10,3,2
4,1,4,1,0,4,0,0,0,0,2,...,1,0,0,0,1,3,7,1,0,1


#### review + price + image

In [54]:
reviews = df["reviewClean"] 

In [55]:
# might be able to use TfIdf tokenizer
countVector = CountVectorizer(max_features = 300, stop_words='english') 
transformed_reviews = countVector.fit_transform(reviews) 

In [56]:
feature_review = pd.DataFrame(transformed_reviews.A, columns=countVector.get_feature_names())
feature_review = feature_review.astype(int)

In [57]:
feature_review['hasImage'] = df['hasImage']
feature_review['price'] = df['price']

In [58]:
feature_review.head(5)

Unnamed: 0,able,absolutely,actually,add,adorable,adults,age,ages,amazing,amazon,...,work,worked,working,works,worth,year,years,young,yr,hasImage
0,1,0,2,0,0,0,3,2,0,4,...,6,1,2,1,8,20,4,1,0,0.010309
1,9,1,8,4,21,6,6,4,0,1,...,13,2,1,4,2,50,10,7,4,0.044335
2,3,1,5,2,0,3,7,16,0,0,...,18,0,12,7,2,17,0,1,2,0.0
3,3,2,3,2,0,2,4,1,1,0,...,0,1,0,0,0,16,10,3,2,0.089744
4,1,4,1,0,4,0,0,0,0,2,...,0,0,0,1,3,7,1,0,1,0.016949
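
The feature table above mixes word counts (values up to 50) with `hasImage` (values below 0.1) and `price`. KNN ranks neighbors by distance, so columns with larger numeric ranges carry more weight. A minimal sketch of standardizing the columns first, assuming `StandardScaler` is acceptable for this data. The `toy` frame reuses a few values from the table; the `price` values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small stand-in for feature_review; the price values are made up for illustration
toy = pd.DataFrame({
    "able": [1, 9, 3, 3],
    "year": [20, 50, 17, 16],
    "hasImage": [0.010309, 0.044335, 0.0, 0.089744],
    "price": [12.99, 34.50, 7.25, 19.99],
})

# Rescale every column to zero mean and unit variance
scaler = StandardScaler()
toy_scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
print(toy_scaled.round(2))
```

In a real run the scaler should be fit on the training rows only and then applied to the test rows, so that test statistics do not leak into training.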


## summary only

### Train-test split

In [59]:
X = np.array(feature_summary_only)

train_size = 0.8
tsize = int(np.floor(train_size * len(feature_summary_only)))
X_train = X[:tsize]
X_test = X[tsize:]

print("Length of training set:", len(X_train))
print("Length of test set:", len(X_test))

Length of training set: 17463
Length of test set: 4366


In [60]:
y_train = df["overall"][:len(X_train)]
y_test = df["overall"][len(X_train):]
y_train = y_train.astype(int)
y_test = y_test.astype(int)
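
The split above takes the first 80% of rows for training and the rest for testing, without shuffling. If the rows of `df` are ordered (for example by product or by date), the two parts can have different rating distributions. A minimal sketch of a shuffled, stratified alternative using `train_test_split`; `X_toy` and `y_toy` are synthetic stand-ins, and whether the real `df` is ordered is not known from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the features and the "overall" ratings
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(100, 4))
y_toy = np.array([4] * 75 + [3] * 20 + [2] * 5)

# shuffle=True mixes the rows; stratify keeps class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, shuffle=True, stratify=y_toy, random_state=42
)

print("Length of training set:", len(X_tr))
print("Length of test set:", len(X_te))
print("Share of rating 4 in test:", (y_te == 4).mean())
```

Stratification needs at least two samples per class, so very rare ratings would have to be merged or dropped before using it.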

### Recommendation System (KNN) with Grid Search

In [61]:
from sklearn.model_selection import GridSearchCV

In [62]:
knn = neighbors.KNeighborsClassifier()

# candidate values of k (number of neighbors) to search over
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)

# 10-fold cross-validated grid search over k, scored by accuracy
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False, verbose=1)

# fit one model per (k, fold) combination on the training split
grid_search = grid.fit(X_train, y_train)

Fitting 10 folds for each of 30 candidates, totalling 300 fits




In [63]:
print(grid_search.best_params_)

{'n_neighbors': 6}


In [64]:
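# best_score_ is the mean cross-validated accuracy of the best k, not accuracy on the full training set
accuracy = grid_search.best_score_ * 100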
print("Accuracy for our training dataset with tuning is : {:.2f}%".format(accuracy) )

Accuracy for our training dataset with tuning is : 82.56%
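
The number printed above comes from `best_score_`, which is the mean accuracy over the held-out folds for the best `n_neighbors`, not the accuracy of a model scored on the data it was fit on. A minimal sketch on synthetic data (generated with `make_classification`, not the review features) that shows the two quantities side by side:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5, scoring="accuracy"
)
grid.fit(X_toy, y_toy)

# Mean held-out accuracy for each candidate k; best_score_ is the largest of these
mean_cv = grid.cv_results_["mean_test_score"]
# Accuracy of the refitted best model on the same data it was trained on
train_acc = grid.best_estimator_.score(X_toy, y_toy)

print("mean CV accuracy per k:", mean_cv)
print("best_score_:", grid.best_score_)
print("accuracy on the training data itself:", train_acc)
```

Comparing `train_acc` with `best_score_` is one way to look for overfitting before touching the test set.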


In [67]:
knn = neighbors.KNeighborsClassifier(n_neighbors=6)

knn.fit(X_train, y_train)

y_test_pred = knn.predict(X_test) 

test_accuracy = accuracy_score(y_test, y_test_pred) * 100

print("Accuracy for our testing dataset with tuning is : {:.2f}%".format(test_accuracy) )

Accuracy for our testing dataset with tuning is : 82.91%


In [68]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           1       0.57      0.33      0.42        12
           2       0.86      0.48      0.61       151
           3       0.67      0.39      0.50       906
           4       0.85      0.97      0.91      3295
           5       0.00      0.00      0.00         2

    accuracy                           0.83      4366
   macro avg       0.59      0.43      0.49      4366
weighted avg       0.81      0.83      0.81      4366
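
The two average rows in the report differ because `macro avg` weights every rating class equally, while `weighted avg` weights each class by its support. A short worked check using the rounded per-class precision values and supports printed above (so the results are only accurate to about two decimals):

```python
# Per-class precision and support copied from the report above (ratings 1..5)
precision = [0.57, 0.86, 0.67, 0.85, 0.00]
support = [12, 151, 906, 3295, 2]

# Macro average: plain mean over classes, so rare classes count as much as rating 4
macro = sum(precision) / len(precision)

# Weighted average: each class contributes in proportion to its number of test samples
weighted = sum(p * s for p, s in zip(precision, support)) / sum(support)

print(f"macro avg precision:    {macro:.2f}")
print(f"weighted avg precision: {weighted:.2f}")
```

The gap between the two averages reflects how much the result is carried by rating 4, which makes up most of the test set.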





## summary + price + image

### Train-test split

In [69]:
X = np.array(feature_summary)

train_size = 0.8
tsize = int(np.floor(train_size * len(feature_summary)))
X_train = X[:tsize]
X_test = X[tsize:]

print("Length of training set:", len(X_train))
print("Length of test set:", len(X_test))

Length of training set: 17463
Length of test set: 4366


In [70]:
y_train = df["overall"][:len(X_train)]
y_test = df["overall"][len(X_train):]
y_train = y_train.astype(int)
y_test = y_test.astype(int)

### Recommendation System (KNN) with Grid Search

In [71]:
from sklearn.model_selection import GridSearchCV

In [72]:
knn = neighbors.KNeighborsClassifier()

# candidate values of k (number of neighbors) to search over
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)

# 10-fold cross-validated grid search over k, scored by accuracy
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False, verbose=1)

# fit one model per (k, fold) combination on the training split
grid_search = grid.fit(X_train, y_train)

Fitting 10 folds for each of 30 candidates, totalling 300 fits




In [73]:
print(grid_search.best_params_)

{'n_neighbors': 6}


In [74]:
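# best_score_ is the mean cross-validated accuracy of the best k, not accuracy on the full training set
accuracy = grid_search.best_score_ * 100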
print("Accuracy for our training dataset with tuning is : {:.2f}%".format(accuracy) )

Accuracy for our training dataset with tuning is : 81.80%


In [75]:
knn = neighbors.KNeighborsClassifier(n_neighbors=6)

knn.fit(X_train, y_train)

y_test_pred = knn.predict(X_test) 

test_accuracy = accuracy_score(y_test, y_test_pred) * 100

print("Accuracy for our testing dataset with tuning is : {:.2f}%".format(test_accuracy) )

Accuracy for our testing dataset with tuning is : 81.06%


In [76]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           1       0.83      0.42      0.56        12
           2       0.86      0.39      0.54       151
           3       0.61      0.33      0.43       906
           4       0.84      0.96      0.90      3295
           5       0.00      0.00      0.00         2

    accuracy                           0.81      4366
   macro avg       0.63      0.42      0.48      4366
weighted avg       0.79      0.81      0.78      4366
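
Since rating 4 accounts for 3295 of the 4366 test samples, a useful reference point is the accuracy of a trivial model that always predicts 4. A short worked calculation using the support counts and the test accuracy reported above:

```python
# Support counts taken from the classification report above
support_rating_4 = 3295
total_test_samples = 4366

# Accuracy of always predicting the most frequent rating
baseline_pct = support_rating_4 / total_test_samples * 100

# Test accuracy reported for the summary + price + image features
knn_test_pct = 81.06

print(f"majority-class baseline: {baseline_pct:.2f}%")
print(f"KNN gain over baseline:  {knn_test_pct - baseline_pct:.2f} percentage points")
```

Reading the KNN accuracy against this baseline gives a more honest picture than the raw percentage alone.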





## review only

### Train-test split

In [81]:
X = np.array(feature_review_only)

train_size = 0.8
tsize = int(np.floor(train_size * len(feature_review_only)))
X_train = X[:tsize]
X_test = X[tsize:]

print("Length of training set:", len(X_train))
print("Length of test set:", len(X_test))

Length of training set: 17463
Length of test set: 4366


In [82]:
y_train = df["overall"][:len(X_train)]
y_test = df["overall"][len(X_train):]
y_train = y_train.astype(int)
y_test = y_test.astype(int)

### Recommendation System (KNN) with Grid Search

In [83]:
from sklearn.model_selection import GridSearchCV

In [84]:
knn = neighbors.KNeighborsClassifier()

# candidate values of k (number of neighbors) to search over
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)

# 10-fold cross-validated grid search over k, scored by accuracy
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False, verbose=1)

# fit one model per (k, fold) combination on the training split
grid_search = grid.fit(X_train, y_train)

Fitting 10 folds for each of 30 candidates, totalling 300 fits




In [85]:
print(grid_search.best_params_)

{'n_neighbors': 8}


In [86]:
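# best_score_ is the mean cross-validated accuracy of the best k, not accuracy on the full training set
accuracy = grid_search.best_score_ * 100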
print("Accuracy for our training dataset with tuning is : {:.2f}%".format(accuracy) )

Accuracy for our training dataset with tuning is : 81.44%


In [87]:
knn = neighbors.KNeighborsClassifier(n_neighbors=8)

knn.fit(X_train, y_train)

y_test_pred = knn.predict(X_test) 

test_accuracy = accuracy_score(y_test, y_test_pred) * 100

print("Accuracy for our testing dataset with tuning is : {:.2f}%".format(test_accuracy) )

Accuracy for our testing dataset with tuning is : 80.05%


In [88]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           1       0.50      0.08      0.14        12
           2       0.59      0.15      0.24       151
           3       0.56      0.39      0.46       906
           4       0.84      0.95      0.89      3295
           5       0.00      0.00      0.00         2

    accuracy                           0.80      4366
   macro avg       0.50      0.31      0.35      4366
weighted avg       0.77      0.80      0.78      4366
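
The 0.00 row for rating 5 appears because the classifier never predicts 5, so precision for that class is undefined; scikit-learn reports it as 0 and emits a warning. A minimal sketch, on made-up labels, of making that choice explicit with the `zero_division` argument of `classification_report`:

```python
from sklearn.metrics import classification_report

# Made-up labels: rating 5 occurs once in y_true but is never predicted
y_true = [4, 4, 4, 3, 3, 5]
y_pred = [4, 4, 4, 3, 4, 4]

# zero_division=0 reports undefined precision as 0.0 without raising a warning
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

print("precision for rating 5:", report["5"]["precision"])
print("recall for rating 5:   ", report["5"]["recall"])
```

With only 2 samples of rating 5 in the test set, the metrics for that class carry very little information either way.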





## review + price + image

### Train-test split

In [89]:
X = np.array(feature_review)

train_size = 0.8
tsize = int(np.floor(train_size * len(feature_review)))
X_train = X[:tsize]
X_test = X[tsize:]

print("Length of training set:", len(X_train))
print("Length of test set:", len(X_test))

Length of training set: 17463
Length of test set: 4366


In [90]:
y_train = df["overall"][:len(X_train)]
y_test = df["overall"][len(X_train):]
y_train = y_train.astype(int)
y_test = y_test.astype(int)

### Recommendation System (KNN) with Grid Search

In [91]:
from sklearn.model_selection import GridSearchCV

In [None]:
knn = neighbors.KNeighborsClassifier()

# candidate values of k (number of neighbors) to search over
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)

# 10-fold cross-validated grid search over k, scored by accuracy
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False, verbose=1)

# fit one model per (k, fold) combination on the training split
grid_search = grid.fit(X_train, y_train)

Fitting 10 folds for each of 30 candidates, totalling 300 fits




In [None]:
print(grid_search.best_params_)

In [None]:
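# best_score_ is the mean cross-validated accuracy of the best k, not accuracy on the full training set
accuracy = grid_search.best_score_ * 100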
print("Accuracy for our training dataset with tuning is : {:.2f}%".format(accuracy) )

In [None]:
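# use the k selected by this section's grid search rather than a value copied from an earlier run
knn = neighbors.KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])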

knn.fit(X_train, y_train)

y_test_pred = knn.predict(X_test) 

test_accuracy = accuracy_score(y_test, y_test_pred) * 100

print("Accuracy for our testing dataset with tuning is : {:.2f}%".format(test_accuracy) )

In [None]:
print(classification_report(y_test, y_test_pred))
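
The sections above re-type the selected `n_neighbors` by hand (6, 6, 8) when refitting. With the default `refit=True`, `GridSearchCV` already retrains the best model on the full training split, so the fitted search object can be used for prediction directly. A minimal sketch on synthetic data (generated with `make_classification`, not the review features):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the feature matrix and ratings
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": list(range(1, 11))}, cv=5, scoring="accuracy"
)
grid.fit(X_tr, y_tr)

# The selected k, read from the search instead of typed in by hand
best_k = grid.best_params_["n_neighbors"]

# grid.predict delegates to best_estimator_, which was refit on all of X_tr
y_te_pred = grid.predict(X_te)
test_accuracy = accuracy_score(y_te, y_te_pred) * 100

print("selected k:", best_k)
print("Accuracy for the toy test set: {:.2f}%".format(test_accuracy))
```

This keeps the evaluation cell correct even when the search picks a different `k` for a new feature set.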