## Analysis of an E-commerce Dataset Part 3 (s2 2023)


In this Portfolio task, you will continue working with the dataset you used in Portfolio 2. The difference is that the ratings have been converted to like (score 1) and dislike (score 0). Your task is to train classification models such as KNN to predict whether a user likes or dislikes an item.


The header of the csv file is shown below. 

| userId | timestamp | review | item | helpfulness | gender | category | item_id | item_price | user_city | rating |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |

Your high-level goal in this notebook is to build and evaluate predictive models for 'rating' from the other available features, that is, to predict like (rating 1) or dislike (rating 0) from some of the other fields. More specifically, you need to complete the following major steps: 
1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.
2) Convert object features into numeric features by using an encoder.
3) Study the correlation between these features. 
4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K in an ad-hoc manner in this step. Evaluate the accuracy of your model.
6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance.

Note 1: We did not provide any description of each step in the notebook. You should learn how to properly comment your notebook by yourself to make your notebook file readable. 

Note 2: you are not being evaluated on the ___accuracy___ of the model but on the ___process___ that you use to generate it. Please use both ___Logistic Regression model___ and ___KNN model___ for solving this classification problem. Accordingly, discuss the performance of these two methods.
    

### 1. Explore the data

In [1]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# silence library warnings (note: this also hides convergence warnings from LogisticRegression)
warnings.filterwarnings("ignore")

%matplotlib inline

In [2]:
# import dataset
ec = pd.read_csv("portfolio_3.csv")
ec.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,Not always McCrap,McDonald's,3,M,Restaurants & Gourmet,41,30.74,4,1
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,4,M,Restaurants & Gourmet,74,108.3,4,0
2,4081,72000,The Wonderful World of Wendy,Wendy's,4,M,Restaurants & Gourmet,84,69.0,4,1
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",3,M,Movies,68,143.11,4,1
4,4081,100399,Hey! Gimme some pie!,American Pie,3,M,Movies,6,117.89,4,0


In [3]:
# explore
ec.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2685 entries, 0 to 2684
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   userId       2685 non-null   int64  
 1   timestamp    2685 non-null   int64  
 2   review       2685 non-null   object 
 3   item         2685 non-null   object 
 4   helpfulness  2685 non-null   int64  
 5   gender       2685 non-null   object 
 6   category     2685 non-null   object 
 7   item_id      2685 non-null   int64  
 8   item_price   2685 non-null   float64
 9   user_city    2685 non-null   int64  
 10  rating       2685 non-null   int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 230.9+ KB


In [4]:
ec.isnull().sum()

userId         0
timestamp      0
review         0
item           0
helpfulness    0
gender         0
category       0
item_id        0
item_price     0
user_city      0
rating         0
dtype: int64

In [5]:
ec.describe()

Unnamed: 0,userId,timestamp,helpfulness,item_id,item_price,user_city,rating
count,2685.0,2685.0,2685.0,2685.0,2685.0,2685.0,2685.0
mean,4673.237616,58812.687151,3.908007,43.478585,83.09165,19.456983,0.639851
std,3517.893437,37013.726118,0.289069,26.630426,42.227558,11.397281,0.480133
min,4.0,10100.0,3.0,0.0,12.0,0.0,0.0
25%,1310.0,22000.0,4.0,21.0,49.0,9.0,0.0
50%,4666.0,52800.0,4.0,42.0,73.65,19.0,1.0
75%,7651.0,91000.0,4.0,67.0,129.82,28.0,1.0
max,10779.0,123199.0,4.0,88.0,149.0,39.0,1.0


### 2. Encoding 

In [6]:
ec.review = LabelEncoder().fit_transform(ec.review)
ec.item = LabelEncoder().fit_transform(ec.item)
ec.gender = LabelEncoder().fit_transform(ec.gender)
ec.category = LabelEncoder().fit_transform(ec.category)

ec.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,1618,37,3,1,8,41,30.74,4,1
1,4081,72000,1125,67,4,1,8,74,108.3,4,0
2,4081,72000,2185,77,4,1,8,84,69.0,4,1
3,4081,100399,2243,61,3,1,5,68,143.11,4,1
4,4081,100399,1033,5,3,1,5,6,117.89,4,0
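
`LabelEncoder` assigns integer codes in sorted order of the distinct values, which is why `M` becomes 1 in the output above. The following is a minimal sketch on made-up values (the toy list is an assumption, not data from `portfolio_3.csv`) showing how keeping the fitted encoder lets the codes be mapped back to the original labels:

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration only; not taken from portfolio_3.csv
genders = ["M", "F", "M", "F", "M"]

gender_encoder = LabelEncoder()
codes = gender_encoder.fit_transform(genders)

# classes_ holds the original labels, indexed by their integer code
mapping = {label: int(code) for code, label in enumerate(gender_encoder.classes_)}
print(mapping)  # {'F': 0, 'M': 1}

# inverse_transform recovers the original labels from the codes
decoded = list(gender_encoder.inverse_transform(codes))
print(decoded)
```

The integer codes imply an ordering that has no meaning for nominal features such as category, so one-hot encoding may be worth considering as an alternative, especially for distance-based models like KNN.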


### 3. Correlation

In [7]:
ec.corr()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
userId,1.0,-0.069176,0.007139,-0.005513,-0.166136,-0.058324,-0.041362,-0.005549,0.024576,-0.030031,0.066444
timestamp,-0.069176,1.0,0.007029,-0.003543,0.014179,-0.003367,0.015009,-0.004452,0.010979,-0.014934,-0.009739
review,0.007139,0.007029,1.0,0.16309,-0.028259,-0.037884,0.00197,0.163544,-0.041421,0.045626,-0.041756
item,-0.005513,-0.003543,0.16309,1.0,-0.020433,0.001925,-0.045988,0.999765,-0.049885,-0.00522,0.057793
helpfulness,-0.166136,0.014179,-0.028259,-0.020433,1.0,0.075947,-0.013408,-0.019882,0.004112,0.012086,-0.010622
gender,-0.058324,-0.003367,-0.037884,0.001925,0.075947,1.0,0.022549,0.00237,-0.040596,-0.065638,-0.022169
category,-0.041362,0.015009,0.00197,-0.045988,-0.013408,0.022549,1.0,-0.045268,-0.115571,0.008017,-0.142479
item_id,-0.005549,-0.004452,0.163544,0.999765,-0.019882,0.00237,-0.045268,1.0,-0.05445,-0.005576,0.057107
item_price,0.024576,0.010979,-0.041421,-0.049885,0.004112,-0.040596,-0.115571,-0.05445,1.0,-0.023427,0.026062
user_city,-0.030031,-0.014934,0.045626,-0.00522,0.012086,-0.065638,0.008017,-0.005576,-0.023427,1.0,-0.034866


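All features correlate only weakly with rating: the largest absolute coefficient is category at about -0.14, and every other feature is below 0.07 in absolute value. Note also that item and item_id are almost perfectly correlated (0.9998), so they carry essentially the same information.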
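
Pairs of features that are almost perfectly correlated with each other, such as item and item_id in the matrix above, can be found programmatically. The sketch below uses a synthetic frame (the data and the 0.95 threshold are assumptions chosen for illustration) and scans the upper triangle of the absolute correlation matrix:

```python
import numpy as np
import pandas as pd

# Synthetic frame for illustration; item_id is built as a near-copy of item
rng = np.random.default_rng(0)
item = rng.integers(0, 80, size=200)
toy = pd.DataFrame({
    "item": item,
    "item_id": item + rng.integers(0, 2, size=200),  # almost identical to item
    "item_price": rng.uniform(10, 150, size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# A column is flagged if it is highly correlated with any earlier column
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(redundant)
```

One column of each flagged pair could then be dropped before modelling, since it adds little information beyond its partner.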

### 4. Logistic Regression Model

#### Build Models

In [8]:
# train & test split
train, test = train_test_split(ec, test_size = 0.2, random_state = 142)
print(train.shape)
print(test.shape)

(2148, 11)
(537, 11)
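
Because likes and dislikes are not equally frequent, it can help to keep the class ratio the same in both splits. This is a hedged sketch using the `stratify` argument of `train_test_split` on a made-up frame (the column values are assumptions, with 64% likes to roughly mirror the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 64 likes and 36 dislikes
toy = pd.DataFrame({"feature": np.arange(100), "rating": [1] * 64 + [0] * 36})

# stratify keeps the like/dislike proportion (almost) equal in both splits
train_s, test_s = train_test_split(
    toy, test_size=0.2, random_state=142, stratify=toy["rating"]
)

train_ratio = train_s["rating"].mean()
test_ratio = test_s["rating"].mean()
print(train_ratio, test_ratio)
```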


In [9]:
# fit model on train dataset - without feature selection
X_train_log = train.drop(['rating'], axis = 1)
y_train_log = train['rating']

X_test_log = test.drop(['rating'], axis = 1)
y_test_log = test['rating']

model = LogisticRegression()
model.fit(X_train_log, y_train_log)

In [10]:
# prediction 
fitted_y_train = model.predict(X_train_log)
fitted_y_test = model.predict(X_test_log)

In [11]:
# evaluate the performance 
print("Accuracy score on training set: ", accuracy_score(y_train_log, fitted_y_train))
print("Accuracy score on testing set: ", accuracy_score(y_test_log, fitted_y_test))

Accuracy score on training set:  0.6317504655493482
Accuracy score on testing set:  0.6685288640595903
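
Accuracy alone does not show whether a model separates the two classes or mostly predicts the majority class. A confusion matrix makes this visible. The sketch below uses hand-written labels (assumed for illustration, not the notebook's predictions):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration; 1 = like, 0 = dislike
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # predicts "like" almost everywhere

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.7

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
```

Here the accuracy looks acceptable, yet the first row of the matrix shows that three of the four dislikes were predicted as likes. In the notebook the same call would take `y_test_log` and `fitted_y_test`.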


In [12]:
# feature selection
rfe = RFE(estimator = model, n_features_to_select = 5, step = 1)
rfe.fit(X_train_log, y_train_log)

In [13]:
# evaluation on feature selection
fitted_y_test = rfe.predict(X_test_log)
print("Accuracy score on test set: ", accuracy_score(y_test_log, fitted_y_test))

Accuracy score on test set:  0.6554934823091247


In [15]:
# summarise all features
for i in range(X_train_log.shape[1]):
    print('Column: %d, Selected %s, Rank: %.3f' % (i, rfe.support_[i], rfe.ranking_[i]))

Column: 0, Selected False, Rank: 5.000
Column: 1, Selected False, Rank: 6.000
Column: 2, Selected False, Rank: 4.000
Column: 3, Selected True, Rank: 1.000
Column: 4, Selected True, Rank: 1.000
Column: 5, Selected True, Rank: 1.000
Column: 6, Selected True, Rank: 1.000
Column: 7, Selected True, Rank: 1.000
Column: 8, Selected False, Rank: 3.000
Column: 9, Selected False, Rank: 2.000
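
The column indices printed above can be translated to feature names by indexing the training columns with the boolean mask `rfe.support_`. A minimal sketch on synthetic data (the frame, the label rule, and the names `X_toy`, `y_toy`, `rfe_toy` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic features; the label depends almost entirely on 'category'
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["item", "helpfulness", "gender", "category"],
)
y_toy = (X_toy["category"] + 0.1 * rng.normal(size=200) > 0).astype(int)

rfe_toy = RFE(estimator=LogisticRegression(), n_features_to_select=2, step=1)
rfe_toy.fit(X_toy, y_toy)

# support_ is a boolean mask aligned with the column order of X_toy
selected = list(X_toy.columns[rfe_toy.support_])
print(selected)
```

Applied to the notebook, `X_train_log.columns[rfe.support_]` would name columns 3 to 7 of the output above, which correspond to item, helpfulness, gender, category and item_id.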


In [16]:
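# fit model on train dataset - with feature selection
# note: RFE above selected columns 3-7 (item, helpfulness, gender, category, item_id);
# the subset used here (review, item, helpfulness, category) differs from that selection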
X_train_log_fs = train[['review', 'item', 'helpfulness', 'category']]
y_train_log_fs = train['rating']

X_test_log_fs = test[['review', 'item', 'helpfulness', 'category']]
y_test_log_fs = test['rating']

model = LogisticRegression()
model.fit(X_train_log_fs, y_train_log_fs)

In [17]:
# prediction - with feature selection
fitted_y_train_fs = model.predict(X_train_log_fs)
fitted_y_test_fs = model.predict(X_test_log_fs)

In [18]:
# evaluation of model - with feature selection
print("Accuracy score on training set: ", accuracy_score(y_train_log_fs, fitted_y_train_fs))
print("Accuracy score on testing set: ", accuracy_score(y_test_log_fs, fitted_y_test_fs))

Accuracy score on training set:  0.6405959031657356
Accuracy score on testing set:  0.6759776536312849


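**Model Evaluation**: With the reduced feature set (review, item, helpfulness, category), the model performs slightly better: training accuracy rises from 0.632 to 0.641 and testing accuracy from 0.669 to 0.676. Both gains are under one percentage point. For comparison, the five-feature model fitted by RFE scored 0.655 on the test set.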

### 5 & 6. KNN Model & K Tuning

In [43]:
# fit model with K = 3
X_train = train.drop(['rating'], axis = 1)
y_train = train['rating']

X_test = test.drop(['rating'], axis = 1)
y_test = test['rating']

knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)

In [44]:
# prediction 
y_hat_train = knn_model.predict(X_train)
y_hat_test = knn_model.predict(X_test)

In [45]:
# evaluation
print("Accuracy score on training set: ", accuracy_score(y_train, y_hat_train))
print("Accuracy score on testing set: ", accuracy_score(y_test, y_hat_test))

Accuracy score on training set:  0.7611731843575419
Accuracy score on testing set:  0.590316573556797


In [35]:
# tune K
grid_search = GridSearchCV(knn_model, param_grid = {'n_neighbors': range(5, 51, 1)}, cv=5) #5 to 50 
grid_search.fit(X_train, y_train)

In [39]:
# best K
print("Best K: ", grid_search.best_params_)

Best K:  {'n_neighbors': 43}
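
Beyond the single best value, `GridSearchCV` stores the mean cross-validated score for every K in `cv_results_`, which shows how K influences performance. A hedged sketch on synthetic data (the data, the K range, and the names `search` and `scores` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 31, 2)},
    cv=5,
)
search.fit(X_toy, y_toy)

# One row per candidate K with its mean accuracy across the 5 folds
scores = pd.DataFrame({
    "k": [params["n_neighbors"] for params in search.cv_results_["params"]],
    "mean_cv_accuracy": search.cv_results_["mean_test_score"],
})
print(scores)
```

In the notebook, the same table could be built from `grid_search.cv_results_` and plotted as accuracy against K.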


In [27]:
# fit new KNN model with K = 43
new_knn_model = KNeighborsClassifier(n_neighbors=43)
new_knn_model.fit(X_train, y_train)

In [28]:
# prediction
y_hat_train = new_knn_model.predict(X_train)
y_hat_test = new_knn_model.predict(X_test)

In [29]:
# evaluation
print("Accuracy score on training set: ", accuracy_score(y_train, y_hat_train))
print("Accuracy score on testing set: ", accuracy_score(y_test, y_hat_test))

Accuracy score on training set:  0.6322160148975792
Accuracy score on testing set:  0.6610800744878957


**Comment:** Here we trained and tested two KNN models, one with K = 3 (chosen ad hoc) and one with K = 43 (selected by GridSearchCV with 5-fold cross-validation over K = 5 to 50). With K = 3, training accuracy (0.76) is much higher than testing accuracy (0.59), which suggests overfitting. With K = 43, the two scores are much closer (0.63 and 0.66).

**Evaluation of Logistic Regression and KNN Models**  
Comparing test accuracy after feature selection (logistic regression, 0.676) and after tuning K (KNN, 0.661), the two models perform similarly.
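
To put these accuracies in context, note that `ec.describe()` reports a mean rating of about 0.64, so always predicting "like" would already score about 0.64 on the full dataset. The sketch below shows how such a majority-class baseline can be computed with `DummyClassifier`. The labels are synthetic and only mimic that class balance:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with roughly the class balance of 'rating' (about 64% likes)
y_toy = np.array([1] * 64 + [0] * 36)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)  # 0.64
```

In the notebook, fitting the baseline on `X_train` and `y_train` and scoring it on `X_test` and `y_test` would give the reference value for this particular split.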