## Analysis of an E-commerce Dataset Part 3 (s2 2023)


In this Portfolio task, you will continue working with the dataset you used in Portfolio 2. The difference is that the ratings have been converted to like (score 1) and dislike (score 0). Your task is to train classification models such as KNN to predict whether a user likes or dislikes an item.


The header of the csv file is shown below. 

| userId | timestamp | review | item | helpfulness | gender | category | item_id | item_price | user_city | rating |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
    
Your high-level goal in this notebook is to build and evaluate predictive models for 'rating' from the other available features, that is, to predict like (rating 1) or dislike (rating 0) from some of the other fields. More specifically, you need to complete the following major steps: 
1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.
2) Convert object features into numeric features by using an encoder.
3) Study the correlation between these features. 
4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K in an ad-hoc manner in this step. Evaluate the accuracy of your model.
6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance.

Note 1: We did not provide any description of each step in the notebook. You should learn how to properly comment your notebook by yourself to make your notebook file readable. 

Note 2: you are not being evaluated on the ___accuracy___ of the model but on the ___process___ that you use to generate it. Please use both ___Logistic Regression model___ and ___KNN model___ for solving this classification problem. Accordingly, discuss the performance of these two methods.
    

In [491]:
import warnings
warnings.filterwarnings('ignore')

In [492]:
# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Preprocessing, model selection and feature selection
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_selection import RFE

# Classification models and evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


df = pd.read_csv('portfolio_3.csv')

df.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,Not always McCrap,McDonald's,3,M,Restaurants & Gourmet,41,30.74,4,1
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,4,M,Restaurants & Gourmet,74,108.3,4,0
2,4081,72000,The Wonderful World of Wendy,Wendy's,4,M,Restaurants & Gourmet,84,69.0,4,1
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",3,M,Movies,68,143.11,4,1
4,4081,100399,Hey! Gimme some pie!,American Pie,3,M,Movies,6,117.89,4,0


In [493]:
# Explore the data. Clean the data if necessary.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2685 entries, 0 to 2684
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   userId       2685 non-null   int64  
 1   timestamp    2685 non-null   int64  
 2   review       2685 non-null   object 
 3   item         2685 non-null   object 
 4   helpfulness  2685 non-null   int64  
 5   gender       2685 non-null   object 
 6   category     2685 non-null   object 
 7   item_id      2685 non-null   int64  
 8   item_price   2685 non-null   float64
 9   user_city    2685 non-null   int64  
 10  rating       2685 non-null   int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 230.9+ KB
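
The `df.info()` output above reports 2685 non-null values in every column, so there are no missing values to replace here. As a minimal sketch of the checks that the exploration step could include on other data, the example below counts missing values, drops incomplete rows, and reports the class balance of the target. The `sample` frame is a made-up miniature stand-in, not the portfolio data, and dropping rows is only one possible cleaning choice.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the real dataset (values are made up).
sample = pd.DataFrame({
    "helpfulness": [3, 4, 4, np.nan, 3, 4],
    "gender": ["M", "M", "F", "F", None, "M"],
    "item_price": [30.74, 108.30, 69.00, 143.11, 117.89, 108.30],
    "rating": [1, 0, 1, 1, 0, 0],
})

# Count missing values per column.
missing_per_column = sample.isnull().sum()

# Drop rows that contain any missing value (one possible cleaning choice).
cleaned = sample.dropna().reset_index(drop=True)

# Class balance of the target: the share of dislikes (0) and likes (1).
class_balance = cleaned["rating"].value_counts(normalize=True)

print(missing_per_column)
print(class_balance)
```

Knowing the class balance matters later, because it sets the accuracy that a trivial "always predict the majority class" model would reach.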


In [494]:
# Convert object (string) features into numeric codes by using an ordinal encoder.
# The codes follow the sorted order of the labels and carry no real ordinal meaning.

ord_enc = OrdinalEncoder()
df["genderCode"] = ord_enc.fit_transform(df[["gender"]])
df["categoryCode"] = ord_enc.fit_transform(df[["category"]])
df["reviewCode"] = ord_enc.fit_transform(df[["review"]])
df["itemCode"] = ord_enc.fit_transform(df[["item"]])
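
The encoding step can also be done in a single call, since `OrdinalEncoder` accepts several columns at once and stores one sorted label array per column in `categories_`. The sketch below assumes a tiny made-up frame (`sample`) rather than the portfolio data, and the `...Code` column names simply mirror the naming used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical miniature frame; the labels are illustrative only.
sample = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "category": ["Movies", "Restaurants & Gourmet", "Movies", "Games"],
})

object_columns = ["gender", "category"]
code_columns = [name + "Code" for name in object_columns]

# Fit one encoder on all object columns and attach the codes as new columns.
encoder = OrdinalEncoder()
codes = pd.DataFrame(
    encoder.fit_transform(sample[object_columns]),
    columns=code_columns,
    index=sample.index,
)
sample = pd.concat([sample, codes], axis=1)

# categories_[i] lists the labels of column i; a label's position is its code.
gender_mapping = {label: code for code, label in enumerate(encoder.categories_[0])}

print(sample)
print(gender_mapping)
```

Keeping the fitted encoder around makes it possible to look up which label each code stands for when interpreting the model later.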




In [495]:
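# Study the correlation between these features.
# numeric_only=True skips the text columns; pandas 2.x raises an error on them otherwise.
df.corr(numeric_only=True)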

Unnamed: 0,userId,timestamp,helpfulness,item_id,item_price,user_city,rating,genderCode,categoryCode,reviewCode,itemCode
userId,1.0,-0.069176,-0.166136,-0.005549,0.024576,-0.030031,0.066444,-0.058324,-0.041362,0.007139,-0.005513
timestamp,-0.069176,1.0,0.014179,-0.004452,0.010979,-0.014934,-0.009739,-0.003367,0.015009,0.007029,-0.003543
helpfulness,-0.166136,0.014179,1.0,-0.019882,0.004112,0.012086,-0.010622,0.075947,-0.013408,-0.028259,-0.020433
item_id,-0.005549,-0.004452,-0.019882,1.0,-0.05445,-0.005576,0.057107,0.00237,-0.045268,0.163544,0.999765
item_price,0.024576,0.010979,0.004112,-0.05445,1.0,-0.023427,0.026062,-0.040596,-0.115571,-0.041421,-0.049885
user_city,-0.030031,-0.014934,0.012086,-0.005576,-0.023427,1.0,-0.034866,-0.065638,0.008017,0.045626,-0.00522
rating,0.066444,-0.009739,-0.010622,0.057107,0.026062,-0.034866,1.0,-0.022169,-0.142479,-0.041756,0.057793
genderCode,-0.058324,-0.003367,0.075947,0.00237,-0.040596,-0.065638,-0.022169,1.0,0.022549,-0.037884,0.001925
categoryCode,-0.041362,0.015009,-0.013408,-0.045268,-0.115571,0.008017,-0.142479,0.022549,1.0,0.00197,-0.045988
reviewCode,0.007139,0.007029,-0.028259,0.163544,-0.041421,0.045626,-0.041756,-0.037884,0.00197,1.0,0.16309
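
In the matrix above, every correlation with `rating` is weak; the largest in magnitude is `categoryCode` at -0.142. Note also that `item_id` and `itemCode` are almost perfectly correlated (0.999765), so they carry nearly the same information. To read the `rating` column more easily, one option is to sort the features by the absolute value of their correlation with the target. The sketch below does this on a small made-up frame (`sample`), not on the portfolio data.

```python
import pandas as pd

# Hypothetical miniature frame with one text column and a binary target.
sample = pd.DataFrame({
    "item": ["a", "b", "c", "d", "e", "f"],
    "helpfulness": [3, 4, 4, 3, 3, 4],
    "categoryCode": [0, 1, 0, 1, 0, 1],
    "rating": [1, 0, 1, 0, 1, 1],
})

# Keep numeric columns only, then take each feature's correlation with the target.
numeric = sample.select_dtypes(include="number")
rating_corr = numeric.corr()["rating"].drop("rating")

# Order the features from strongest to weakest absolute correlation.
order = rating_corr.abs().sort_values(ascending=False).index
ranked = rating_corr.reindex(order)

print(ranked)
```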


In [496]:
# Split the dataset and train a logistic regression model to predict 'rating' based on other features.
train, test = train_test_split(df, test_size=0.33, random_state=142)

# Input features: every numeric or encoded column except the target 'rating'
features = ["userId", "timestamp", "item_id", "item_price", "user_city",
            "reviewCode", "itemCode", "helpfulness", "genderCode", "categoryCode"]
XTrain = train[features]
YTrain = train["rating"]
XTest = test[features]
YTest = test["rating"]

model = LogisticRegression()
model.fit(XTrain, YTrain)

In [497]:
# Evaluate the accuracy of the model on the training data.
predicted_train = model.predict(XTrain)
print('training accuracy is: ', accuracy_score(YTrain, predicted_train))

training accuracy is:  0.639599555061179


In [498]:
# Evaluate the accuracy of the model on the testing data.
predicted_test = model.predict(XTest)
print('testing accuracy is: ', accuracy_score(YTest, predicted_test))

testing accuracy is:  0.6459977452085682
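
An accuracy of about 0.64 is hard to judge without a reference point. The notebook does not report the share of likes and dislikes, so it is unclear how far this is above a model that always predicts the majority class. A minimal sketch of such a baseline, using scikit-learn's `DummyClassifier` on made-up labels (the arrays below are hypothetical, not the portfolio data), could look like this:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels; the features are placeholders because the baseline ignores them.
y_train = np.array([1, 1, 1, 0, 1, 0, 1, 1])
y_test = np.array([1, 0, 1, 1, 0, 1])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent class seen in the training labels.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, baseline_pred)
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
matrix = confusion_matrix(y_test, baseline_pred, labels=[0, 1])

print("baseline accuracy:", baseline_accuracy)
print(matrix)
```

In the notebook, the same baseline could be fitted on `XTrain`/`YTrain` and scored on `XTest`/`YTest`, so that the logistic regression and KNN accuracies can be compared against it.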


In [499]:
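# Use recursive feature elimination (RFE) to rank the features.
# Rank 1 marks the 3 features that RFE keeps; higher ranks were eliminated earlier.
select = RFE(model, n_features_to_select=3)
select = select.fit(XTrain, YTrain)
select.ranking_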

array([6, 8, 1, 5, 4, 7, 1, 2, 3, 1])
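
The ranking array is easier to read when paired with the feature names. The sketch below maps the ranking printed above onto the column order used for `XTrain`; the helper names (`feature_names`, `selected`, `by_rank`) are introduced here for illustration only.

```python
import numpy as np

# Column order used for XTrain in this notebook.
feature_names = ["userId", "timestamp", "item_id", "item_price", "user_city",
                 "reviewCode", "itemCode", "helpfulness", "genderCode", "categoryCode"]

# Ranking printed by select.ranking_ above.
ranking = np.array([6, 8, 1, 5, 4, 7, 1, 2, 3, 1])

# Features with rank 1 are the ones RFE selected.
selected = [name for name, rank in zip(feature_names, ranking) if rank == 1]

# All features ordered from most to least important according to RFE.
by_rank = [feature_names[i] for i in np.argsort(ranking, kind="stable")]

print(selected)
print(by_rank)
```

In the notebook itself, `XTrain.columns[select.support_]` should give the same selected names. Note that the cell above only ranks the features; to see whether the selection improves the model, the logistic regression would need to be refitted on the selected columns and evaluated again.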

In [519]:
# Split the dataset and train a KNN model to predict 'rating' based on other features.
# The train/test split from the logistic regression step is reused here.

# Create and train a KNN classifier with an ad-hoc K of 2
clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(XTrain, YTrain)

# Evaluate the accuracy of the model on the testing data.
y_pred = clf.predict(XTest)
accuracy = accuracy_score(YTest, y_pred)
print('Testing accuracy is: ', accuracy)



Testing accuracy is:  0.46110484780157834
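
One possible contributor to this low score is feature scale. KNN classifies by distance, so columns with large values (for example `timestamp`) can dominate columns with small values (for example `genderCode`). Whether scaling helps on the portfolio data is not shown in this notebook; the sketch below only illustrates the effect on synthetic data, where one small-scale feature equals the label and one large-scale feature is pure noise.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the portfolio data): alternating labels 0, 1, 0, 1, ...
rng = np.random.default_rng(0)
n = 300
y = np.arange(n) % 2
informative = y.astype(float)               # small-scale feature equal to the label
noise = rng.uniform(0, 100_000, size=n)     # large-scale feature unrelated to the label
X = np.column_stack([informative, noise])

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# KNN on raw features: distances are dominated by the noisy large-scale column.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# KNN on standardized features: both columns contribute on a comparable scale.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)

raw_accuracy = accuracy_score(y_test, raw_knn.predict(X_test))
scaled_accuracy = accuracy_score(y_test, scaled_knn.predict(X_test))

print("raw features:   ", raw_accuracy)
print("scaled features:", scaled_accuracy)
```

Putting the scaler inside a pipeline means it is fitted on the training data only, which avoids leaking test-set statistics into the model.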


In [510]:
# Testing accuracy has not improved with the KNN model: 0.461 with K=2, compared with 0.646 for logistic regression.

In [517]:
# Tune the hyper-parameter K in KNN to see how it influences the prediction performance
# Define the search space: K from 1 to 99
parameter_grid = {'n_neighbors': range(1, 100)}

# Grid search with cross-validation (5-fold by default) over a KNN classifier
clf = GridSearchCV(KNeighborsClassifier(), parameter_grid)
clf.fit(XTrain, YTrain)

# Report the best K and its mean cross-validated accuracy
print(clf.best_params_)
print('The accuracy: ', clf.best_score_)

{'n_neighbors': 91}
The accuracy:  0.6357056638811514
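
The grid search reports only the best K, but the task asks how K influences performance. The fitted search object keeps the mean cross-validated accuracy of every candidate in `cv_results_`, which can be collected into a table or plotted against K. The sketch below uses synthetic data from `make_classification` and a smaller grid, both chosen here for illustration; it does not reproduce the numbers above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for XTrain/YTrain (not the portfolio data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Odd values of K from 1 to 31.
param_grid = {"n_neighbors": range(1, 32, 2)}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# One mean cross-validated accuracy per candidate K.
ks = [params["n_neighbors"] for params in search.cv_results_["params"]]
scores = pd.Series(search.cv_results_["mean_test_score"], index=ks, name="mean_cv_accuracy")
best_k = search.best_params_["n_neighbors"]

print(scores.round(3))
print("best K:", best_k)
```

Calling `scores.plot()` on this series would give the accuracy-versus-K curve, which shows whether the best K is a clear peak or one point on a flat plateau.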


In [518]:
# Create and train a KNN classifier with the best K found by the grid search
clf = KNeighborsClassifier(n_neighbors=91)
clf.fit(XTrain, YTrain)

# Use the model to predict the testing data and evaluate its accuracy
y_pred = clf.predict(XTest)
accuracy = accuracy_score(YTest, y_pred)
print('Testing accuracy is: ', accuracy)

Testing accuracy is:  0.6437429537767756


In [489]:
# Testing accuracy has improved with K=91 (0.644, up from 0.461 with K=2), but it is still slightly below the logistic regression testing accuracy (0.646).