## Analysis of an E-commerce Dataset Part 3 (s2 2023)


In this Portfolio task, you will continue working with the dataset you used in Portfolio 2. The difference is that the ratings have been converted to like (score 1) and dislike (score 0). Your task is to train classification models such as KNN to predict whether a user likes or dislikes an item.  


The header of the csv file is shown below. 

| userId | timestamp | review | item | helpfulness | gender | category | item_id | item_price | user_city | rating |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
    
Your high-level goal in this notebook is to build and evaluate predictive models for 'rating' from the other available features, that is, to predict like (rating 1) or dislike (rating 0) from some of the other fields. More specifically, you need to complete the following major steps: 
1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.
2) Convert object features into numerical features by using an encoder.
3) Study the correlation between these features. 
4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K in an ad hoc manner in this step. Evaluate the accuracy of your model.
6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance.

Note 1: We did not provide any description of each step in the notebook. You should learn how to properly comment your notebook by yourself to make your notebook file readable. 

Note 2: you are not being evaluated on the ___accuracy___ of the model but on the ___process___ that you use to generate it. Please use both ___Logistic Regression model___ and ___KNN model___ for solving this classification problem. Accordingly, discuss the performance of these two methods.
    

In [1]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
# import dataset
df = pd.read_csv("portfolio_3.csv")
df

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,Not always McCrap,McDonald's,3,M,Restaurants & Gourmet,41,30.74,4,1
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,4,M,Restaurants & Gourmet,74,108.30,4,0
2,4081,72000,The Wonderful World of Wendy,Wendy's,4,M,Restaurants & Gourmet,84,69.00,4,1
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",3,M,Movies,68,143.11,4,1
4,4081,100399,Hey! Gimme some pie!,American Pie,3,M,Movies,6,117.89,4,0
...,...,...,...,...,...,...,...,...,...,...,...
2680,2445,22000,Great movie!,Austin Powers: The Spy Who Shagged Me,3,M,Movies,9,111.00,5,1
2681,2445,30700,Good food!,Outback Steakhouse,3,M,Restaurants & Gourmet,50,25.00,5,1
2682,2445,61500,Great movie!,Fight Club,3,M,Movies,26,97.53,5,1
2683,2445,100500,Awesome Game.,The Sims 2: Open for Business for Windows,4,M,Games,79,27.00,5,1


In [3]:
# length of dataframe
len(df)

2685

In [4]:
# dataframe first 10 rows
df.head(10)

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,Not always McCrap,McDonald's,3,M,Restaurants & Gourmet,41,30.74,4,1
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,4,M,Restaurants & Gourmet,74,108.3,4,0
2,4081,72000,The Wonderful World of Wendy,Wendy's,4,M,Restaurants & Gourmet,84,69.0,4,1
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",3,M,Movies,68,143.11,4,1
4,4081,100399,Hey! Gimme some pie!,American Pie,3,M,Movies,6,117.89,4,0
5,4081,100399,Good for sci-fi,Matrix,3,M,Movies,40,24.51,4,0
6,4081,100399,Scary? you bet!,Blair Witch Project,3,M,Movies,12,44.0,4,1
7,4081,101899,Fox - the 4th basic channel,FOX,4,M,Media,25,80.0,4,1
8,4081,112099,Amen!,Dogma,3,M,Movies,22,87.59,4,1
9,4081,122899,mama mia!,Olive Garden,3,M,Restaurants & Gourmet,49,32.0,4,1


In [5]:
# info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2685 entries, 0 to 2684
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   userId       2685 non-null   int64  
 1   timestamp    2685 non-null   int64  
 2   review       2685 non-null   object 
 3   item         2685 non-null   object 
 4   helpfulness  2685 non-null   int64  
 5   gender       2685 non-null   object 
 6   category     2685 non-null   object 
 7   item_id      2685 non-null   int64  
 8   item_price   2685 non-null   float64
 9   user_city    2685 non-null   int64  
 10  rating       2685 non-null   int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 230.9+ KB


In [6]:
# null values check
df.isnull().sum()

userId         0
timestamp      0
review         0
item           0
helpfulness    0
gender         0
category       0
item_id        0
item_price     0
user_city      0
rating         0
dtype: int64

In [7]:
# convert categorical values into numerical values
from sklearn.preprocessing import OrdinalEncoder

ord_enc = OrdinalEncoder(dtype=int)
df[["review", "item", "gender", "category"]] = ord_enc.fit_transform(df[["review", "item", "gender", "category"]])
df.head(10)

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,1618,37,3,1,8,41,30.74,4,1
1,4081,72000,1125,67,4,1,8,74,108.3,4,0
2,4081,72000,2185,77,4,1,8,84,69.0,4,1
3,4081,100399,2243,61,3,1,5,68,143.11,4,1
4,4081,100399,1033,5,3,1,5,6,117.89,4,0
5,4081,100399,925,36,3,1,5,40,24.51,4,0
6,4081,100399,1854,11,3,1,5,12,44.0,4,1
7,4081,101899,795,23,4,1,4,25,80.0,4,1
8,4081,112099,262,21,3,1,5,22,87.59,4,1
9,4081,122899,2643,44,3,1,8,49,32.0,4,1
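
To make the encoding step concrete, the sketch below applies `OrdinalEncoder` to a small made-up frame (the `toy` data is hypothetical and not taken from `portfolio_3.csv`). It illustrates that each column's categories are sorted and replaced by their position, so the resulting integers carry an arbitrary order that a model may treat as meaningful.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "category": ["Movies", "Games", "Media", "Movies"],
    "gender": ["M", "F", "F", "M"],
})

toy_enc = OrdinalEncoder(dtype=int)
encoded = toy_enc.fit_transform(toy)

# categories_ holds the sorted categories per column; a value's code is its index
for col, cats in zip(toy.columns, toy_enc.categories_):
    print(col, {str(c): i for i, c in enumerate(cats)})
print(encoded)
```

If the imposed order is a concern, one-hot encoding is a common alternative, at the cost of many more columns for fields such as `review` and `item`.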


In [8]:
# correlation matrix
df.corr()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
userId,1.0,-0.069176,0.007139,-0.005513,-0.166136,-0.058324,-0.041362,-0.005549,0.024576,-0.030031,0.066444
timestamp,-0.069176,1.0,0.007029,-0.003543,0.014179,-0.003367,0.015009,-0.004452,0.010979,-0.014934,-0.009739
review,0.007139,0.007029,1.0,0.16309,-0.028259,-0.037884,0.00197,0.163544,-0.041421,0.045626,-0.041756
item,-0.005513,-0.003543,0.16309,1.0,-0.020433,0.001925,-0.045988,0.999765,-0.049885,-0.00522,0.057793
helpfulness,-0.166136,0.014179,-0.028259,-0.020433,1.0,0.075947,-0.013408,-0.019882,0.004112,0.012086,-0.010622
gender,-0.058324,-0.003367,-0.037884,0.001925,0.075947,1.0,0.022549,0.00237,-0.040596,-0.065638,-0.022169
category,-0.041362,0.015009,0.00197,-0.045988,-0.013408,0.022549,1.0,-0.045268,-0.115571,0.008017,-0.142479
item_id,-0.005549,-0.004452,0.163544,0.999765,-0.019882,0.00237,-0.045268,1.0,-0.05445,-0.005576,0.057107
item_price,0.024576,0.010979,-0.041421,-0.049885,0.004112,-0.040596,-0.115571,-0.05445,1.0,-0.023427,0.026062
user_city,-0.030031,-0.014934,0.045626,-0.00522,0.012086,-0.065638,0.008017,-0.005576,-0.023427,1.0,-0.034866


### Analysis
Based on the correlation coefficients, 'rating' is only weakly correlated with every other feature; the strongest relationship is with category (about -0.14).
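
As a quick check of this reading, the sketch below ranks the features by the absolute value of their correlation with 'rating'. The numbers are copied from the 'rating' column of the matrix above; in the notebook itself the same Series could presumably be obtained with `df.corr()['rating'].drop('rating')`.

```python
import pandas as pd

# Correlation of each feature with 'rating', copied from the matrix above
rating_corr = pd.Series({
    "userId": 0.066444,
    "timestamp": -0.009739,
    "review": -0.041756,
    "item": 0.057793,
    "helpfulness": -0.010622,
    "gender": -0.022169,
    "category": -0.142479,
    "item_id": 0.057107,
    "item_price": 0.026062,
    "user_city": -0.034866,
})

# Rank by strength of the linear relationship, ignoring the sign
ranked = rating_corr.abs().sort_values(ascending=False)
print(ranked)
```

The matrix also shows that `item` and `item_id` are almost perfectly correlated with each other (0.999765), so they carry essentially the same information.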

## Dataset split

In [9]:
from sklearn.model_selection import train_test_split

# split training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop(['rating'], axis=1), df['rating'], stratify=df['rating'], test_size=.2, random_state=7)

# checking shapes of each
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test.shape: ", y_test.shape)

X_train shape:  (2148, 10)
y_train shape:  (2148,)
X_test shape:  (537, 10)
y_test.shape:  (537,)


In [10]:
X_train.columns

Index(['userId', 'timestamp', 'review', 'item', 'helpfulness', 'gender',
       'category', 'item_id', 'item_price', 'user_city'],
      dtype='object')

## Logistic Regression Model

In [11]:
from sklearn.linear_model import LogisticRegression

# training model with all features
lr = LogisticRegression().fit(X_train, y_train)

# Evaluating trained model on training and test set
from sklearn.metrics import accuracy_score

# making predictions on training and test set
y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

# calculating accuracy score on training set
print("Acc on training set: ", accuracy_score(y_train, y_pred_train))

# calculating accuracy score on test set
print("Acc on test set: ", accuracy_score(y_test, y_pred_test))

Acc on training set:  0.6391992551210428
Acc on test set:  0.638733705772812


### Analysis

- The results (around 64% accuracy on both the training and test sets) show that the model performs poorly
- The next step is to use RFE to select the most important features
- Then train the model with the selected important features
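
Whether roughly 64% is poor depends on the class balance, which this notebook does not report. One possible check, sketched here on made-up labels (the 130/70 class split and `X_demo` are assumptions for illustration only), is to compare against a majority-class baseline from `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical data: 130 likes and 70 dislikes, with uninformative features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 130 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=7)

# Always predicts the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
print("Baseline accuracy:", baseline_acc)
```

A model whose accuracy is close to this baseline has learned little beyond the class frequencies.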

In [17]:
from sklearn.feature_selection import RFE
rfe = RFE(lr)
rfe_model = rfe.fit(X_train, y_train)
print("No. of features: ", rfe_model.n_features_)
print("Selected features: ", rfe_model.support_)
print("Features ranking: ", rfe_model.ranking_)

No. of features:  5
Selected features:  [False False False  True  True  True  True  True False False]
Features ranking:  [5 6 4 1 1 1 1 1 3 2]


#### Analysis

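- The RFE results show that item, helpfulness, gender, category and item_id are the five selected (rank 1) features, while review, userId and timestamp rank lowest (4, 5 and 6)
- Try to re-train the model without the three lowest-ranked features and without item, which is almost perfectly correlated with item_id (0.9998 in the correlation matrix) 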
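
The boolean mask and ranking printed by RFE are easier to read once they are mapped back to column names. The sketch below does this with the values copied from the output above and the column order from `X_train.columns`; in the notebook the same arrays are available as `rfe_model.support_` and `rfe_model.ranking_`.

```python
import numpy as np

# Column order as printed by X_train.columns
feature_names = ["userId", "timestamp", "review", "item", "helpfulness",
                 "gender", "category", "item_id", "item_price", "user_city"]

# Values copied from the RFE output above
support = np.array([False, False, False, True, True, True, True, True, False, False])
ranking = np.array([5, 6, 4, 1, 1, 1, 1, 1, 3, 2])

# Features kept by RFE (rank 1)
selected = [name for name, keep in zip(feature_names, support) if keep]
print("Selected:", selected)

# Eliminated features, from the last removed (rank 2) to the first removed
eliminated = sorted(
    (int(rank), name) for name, rank in zip(feature_names, ranking) if rank > 1)
print("Eliminated:", eliminated)
```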

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# split training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop(['userId', 'timestamp', 'review', 'item', 'rating'], axis=1), df['rating'], stratify=df['rating'], test_size=.2, random_state=7)


# checking shapes of each
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test.shape: ", y_test.shape)

# training model on the six remaining features (helpfulness, gender, category, item_id, item_price, user_city)
lr = LogisticRegression().fit(X_train, y_train)

# making predictions on training and test set
y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

# calculating accuracy score on training set
print("Acc on training set: ", accuracy_score(y_train, y_pred_train))

# calculating accuracy score on test set
print("Acc on test set: ", accuracy_score(y_test, y_pred_test))

### Analysis

- Based on the results, the accuracy achieved by logistic regression is still poor
- The next step is therefore to try a KNN model

### More testing using KNN model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [None]:
# split training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop(['rating'], axis=1), df['rating'], stratify=df['rating'], test_size=.2, random_state=7)

In [None]:
# Create and train a KNN classifier model (K=7 chosen ad hoc)
clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(X_train, y_train)

# Use the model to predict testing data
y_pred = clf.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
print('Testing accuracy is: ', accuracy)

In [None]:
from sklearn.model_selection import GridSearchCV

# Define search space for parameters
parameter_grid = {'n_neighbors': range(5, 80)}

X = df.drop(['rating'], axis=1)
y = df['rating']

# Create the machine learning model
knn_clf = KNeighborsClassifier()
clf = GridSearchCV(knn_clf, parameter_grid, scoring='accuracy', cv=5)
clf.fit(X, y)

# Identify the best parameter(s)
print('Best K value: ', clf.best_params_['n_neighbors'])
print('The accuracy: ', clf.best_score_)

Based on the results, it can be seen that:
* The accuracy achieved by KNN is around 64%. Although this is better than logistic regression, it is still quite low, so the model is considered poor. This is to be expected, because the correlation coefficients between 'rating' and the other features are all very low.
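
Two points may be worth checking before drawing a final conclusion. KNN relies on distances, so unscaled features with large ranges (such as `timestamp` and `userId`) can dominate the result, and a grid search fitted on the full dataset leaves no held-out data for the final estimate. The sketch below shows one possible setup that addresses both; it uses synthetic data from `make_classification` as a stand-in (an assumption), and the names `knn_pipe` and `search` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with the notebook's X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=7)

# Scaling sits inside the pipeline, so each CV fold is scaled from its own training part
knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
search = GridSearchCV(knn_pipe, {"knn__n_neighbors": range(5, 30, 2)},
                      scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)  # tune K on the training split only

best_k = search.best_params_["knn__n_neighbors"]
test_acc = search.score(X_te, y_te)  # accuracy of the refit best model on held-out data
print("Best K:", best_k)
print("Mean CV accuracy:", search.best_score_)
print("Held-out test accuracy:", test_acc)
```

Whether scaling changes the outcome on this dataset is not known in advance; the sketch only shows how the comparison could be set up.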