## Analysis of an E-commerce Dataset Part 3 (s2 2023)


In this Portfolio task, you will continue working with the dataset you used in Portfolio 2. The difference is that the ratings have been converted to like (score 1) and dislike (score 0). Your task is to train classification models such as KNN to predict whether a user likes or dislikes an item.  


The header of the csv file is shown below. 

| userId | timestamp | review | item | helpfulness | gender | category | item_id | item_price | user_city | rating |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |

Your high-level goal in this notebook is to build and evaluate predictive models for 'rating' from the other available features, that is, to predict like (rating 1) or dislike (rating 0) from some of the other fields. More specifically, you need to complete the following major steps: 
1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.
2) Convert object features into numeric features by using an encoder.
3) Study the correlation between these features. 
4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K in an ad hoc manner in this step. Evaluate the accuracy of your model.
6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance.

Note 1: We did not provide any description of each step in the notebook. You should learn how to properly comment your notebook by yourself to make your notebook file readable. 

Note 2: you are not being evaluated on the ___accuracy___ of the model but on the ___process___ that you use to generate it. Please use both ___Logistic Regression model___ and ___KNN model___ for solving this classification problem. Accordingly, discuss the performance of these two methods.
    

In [68]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
from sklearn.feature_selection import RFE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV


ds = '/Users/Corinthians/portfolio-part-3-Saurabh0017/portfolio_3.csv'

df = pd.read_csv(ds)


In [47]:
# understanding the data by displaying the first 5 rows
df.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,Not always McCrap,McDonald's,3,M,Restaurants & Gourmet,41,30.74,4,1
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,4,M,Restaurants & Gourmet,74,108.3,4,0
2,4081,72000,The Wonderful World of Wendy,Wendy's,4,M,Restaurants & Gourmet,84,69.0,4,1
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",3,M,Movies,68,143.11,4,1
4,4081,100399,Hey! Gimme some pie!,American Pie,3,M,Movies,6,117.89,4,0


In [48]:
# # checking if there are any null values in the data frame
# print(df.isnull().sum())

In [49]:
# df['rating'].value_counts().plot(kind = 'bar')
# plt.xlabel('Rating')
# plt.ylabel('Count')
# plt.title('Distribution of Ratings')

In [50]:
# converting the object (string) features into numeric features
encoder = OrdinalEncoder()
df[['review', 'item', 'gender', 'category']] = encoder.fit_transform(df[['review', 'item', 'gender', 'category']])
df.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,1618.0,37.0,3,1.0,8.0,41,30.74,4,1
1,4081,72000,1125.0,67.0,4,1.0,8.0,74,108.3,4,0
2,4081,72000,2185.0,77.0,4,1.0,8.0,84,69.0,4,1
3,4081,100399,2243.0,61.0,3,1.0,5.0,68,143.11,4,1
4,4081,100399,1033.0,5.0,3,1.0,5.0,6,117.89,4,0
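
The encoding step above can be illustrated with a small from-scratch sketch. With its default settings, `OrdinalEncoder` sorts the unique values of each column and assigns them the codes 0, 1, 2, and so on, which is consistent with `M` becoming `1.0` in the output above. The `genders` list below is made up for illustration and is not taken from the dataset.

```python
# Illustrative values only; the real columns contain many more rows and categories.
genders = ["M", "F", "M", "F", "M"]

# Sort the unique values, then number them 0, 1, 2, ... in that order.
categories = sorted(set(genders))
mapping = {value: float(code) for code, value in enumerate(categories)}

# Replace every value by its code, as the encoder does column by column.
encoded = [mapping[g] for g in genders]

print(mapping)   # {'F': 0.0, 'M': 1.0}
print(encoded)   # [1.0, 0.0, 1.0, 0.0, 1.0]
```

The integer codes carry an order that has no real meaning for fields such as `review` or `item`, which is worth keeping in mind for distance-based models like KNN.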


In [51]:
# studying the correlation between these features 
df.corr()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
userId,1.0,-0.069176,0.007139,-0.005513,-0.166136,-0.058324,-0.041362,-0.005549,0.024576,-0.030031,0.066444
timestamp,-0.069176,1.0,0.007029,-0.003543,0.014179,-0.003367,0.015009,-0.004452,0.010979,-0.014934,-0.009739
review,0.007139,0.007029,1.0,0.16309,-0.028259,-0.037884,0.00197,0.163544,-0.041421,0.045626,-0.041756
item,-0.005513,-0.003543,0.16309,1.0,-0.020433,0.001925,-0.045988,0.999765,-0.049885,-0.00522,0.057793
helpfulness,-0.166136,0.014179,-0.028259,-0.020433,1.0,0.075947,-0.013408,-0.019882,0.004112,0.012086,-0.010622
gender,-0.058324,-0.003367,-0.037884,0.001925,0.075947,1.0,0.022549,0.00237,-0.040596,-0.065638,-0.022169
category,-0.041362,0.015009,0.00197,-0.045988,-0.013408,0.022549,1.0,-0.045268,-0.115571,0.008017,-0.142479
item_id,-0.005549,-0.004452,0.163544,0.999765,-0.019882,0.00237,-0.045268,1.0,-0.05445,-0.005576,0.057107
item_price,0.024576,0.010979,-0.041421,-0.049885,0.004112,-0.040596,-0.115571,-0.05445,1.0,-0.023427,0.026062
user_city,-0.030031,-0.014934,0.045626,-0.00522,0.012086,-0.065638,0.008017,-0.005576,-0.023427,1.0,-0.034866
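
The table reports a correlation of 0.999765 between `item` and `item_id`, so the two columns carry almost the same information. As a rough check, the sketch below recomputes the Pearson coefficient by hand on only the five rows shown in the `df.head()` output after encoding, so the value will differ slightly from the full-dataset figure. The `pearson` helper is a hypothetical name introduced here.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# First five rows of the encoded 'item' column and of 'item_id'.
item = [37.0, 67.0, 77.0, 61.0, 5.0]
item_id = [41, 74, 84, 68, 6]

r = pearson(item, item_id)
print(round(r, 4))
```

When two features are this strongly correlated, keeping only one of them is usually enough for a linear model.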


In [52]:
#splitting the dataset 
x_train, x_test, y_train, y_test = train_test_split(df.drop(['rating'], axis = 1), df['rating'], test_size = 0.2, random_state = 42)

print("x_train shape: ", x_train.shape)
print("x_test shape: ", x_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)
                                           

x_train shape:  (2148, 10)
x_test shape:  (537, 10)
y_train shape:  (2148,)
y_test shape:  (537,)
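
The shapes above follow directly from the 80/20 split of 2685 rows (2148 + 537). The sketch below reproduces that arithmetic with a seeded shuffle from the standard library. It is only an illustration of the idea: the permutation is not the one `train_test_split` produces for `random_state = 42`.

```python
import random

n_rows = 2685        # 2148 + 537, taken from the shapes printed above
test_size = 0.2

# Shuffle the row indices reproducibly, then cut off the first 20% as the test set.
indices = list(range(n_rows))
random.Random(42).shuffle(indices)

n_test = round(n_rows * test_size)
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(len(train_idx), len(test_idx))
```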


In [53]:
# training a logistic regression model to predict 'rating' based on other features. 
train = LogisticRegression()
train.fit(x_train, y_train)


LogisticRegression()

In [54]:
# evaluating how accurate the model is 

y_pred = train.predict(x_test)
print("Accuracy on test set: ", accuracy_score(y_test, y_pred))

Accuracy on test set:  0.6368715083798883


In [55]:
# Conclusion
# With the logistic regression model, the test accuracy is only about 64% (0.637), which is quite low.
# Therefore, we should try other models, such as KNN.
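
Whether an accuracy around 64% is really low depends on how the two classes are balanced, so it helps to compare it with a trivial baseline that always predicts the most frequent training label. The sketch below shows that comparison on made-up labels. `majority_baseline_accuracy` is a hypothetical helper, and the demo lists are not the notebook's data.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for y in y_test if y == majority_label)
    return majority_label, hits / len(y_test)

# Illustrative labels only (1 = like, 0 = dislike).
y_train_demo = [1, 1, 1, 0, 1, 0, 1, 1]
y_test_demo = [1, 0, 1, 1]

label, baseline_acc = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(label, baseline_acc)   # 1 0.75
```

In the notebook the same function could be called with `list(y_train)` and `list(y_test)`. A model is only informative if it beats this number.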

In [56]:
warnings.filterwarnings("ignore")

In [57]:
# using RFE (recursive feature elimination) to keep only the most relevant features and try to improve accuracy

selector = RFE(train, n_features_to_select=3)
selector = selector.fit(x_train, y_train)
selector.ranking_

array([7, 8, 6, 1, 3, 1, 1, 2, 5, 4])
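
The ranking array is easier to read once it is paired with the column names. The sketch below assumes the columns of `x_train` are in the same order as the CSV header with `rating` dropped. A rank of 1 marks a feature that RFE kept.

```python
# Column order assumed to match the CSV header without 'rating'.
feature_names = ["userId", "timestamp", "review", "item", "helpfulness",
                 "gender", "category", "item_id", "item_price", "user_city"]
ranking = [7, 8, 6, 1, 3, 1, 1, 2, 5, 4]   # copied from selector.ranking_ above

# Features with rank 1 are the ones selected by RFE.
selected = [name for name, rank in zip(feature_names, ranking) if rank == 1]
print(selected)   # ['item', 'gender', 'category']

# Full list from most to least important according to RFE.
by_rank = sorted(zip(ranking, feature_names))
print(by_rank[:4])
```

In the notebook itself, `x_train.columns[selector.support_]` gives the same selected names without copying the array by hand.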

In [58]:
# Using the RFE-selected columns as input features to train the logistic regression model again
x_train, x_test, y_train, y_test = train_test_split(df[["item", "gender", "category"]], df['rating'], test_size=0.2, random_state=42)
print("x_train shape: ", x_train.shape)
print("x_test shape: ", x_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

x_train shape:  (2148, 3)
x_test shape:  (537, 3)
y_train shape:  (2148,)
y_test shape:  (537,)


In [59]:
train = LogisticRegression()
train.fit(x_train, y_train)
y_pred = train.predict(x_test)
print("The accuracy on the test set:", accuracy_score(y_test, y_pred))

The accuracy on the test set: 0.6443202979515829


In [60]:
# We can see that the accuracy has increased only very slightly, from about 63.7% to 64.4%.
# We can try other models like KNN. 

In [64]:
# training a KNN model (with an ad hoc K = 3) to predict 'rating' based on the selected features
neigh = KNeighborsClassifier(n_neighbors = 3)
neigh.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=3)

In [65]:
y_pred = neigh.predict(x_test)
print("KNN accuracy on the test set is:", accuracy_score(y_test, y_pred))

KNN accuracy on the test set is: 0.6759776536312849
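
To see why the choice of K matters, here is a minimal from-scratch sketch of the KNN decision rule: find the K closest training points and take a majority vote over their labels. The points and labels are invented for illustration, and `knn_predict` is a hypothetical helper, not the scikit-learn implementation.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: two points labelled 0 near the query, four points labelled 1 further away.
train_points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
train_labels = [0, 0, 1, 1, 1, 1]
query = (2.0, 2.0)

for k in (1, 3, 5):
    print(k, knn_predict(train_points, train_labels, query, k))
```

With K = 1 or K = 3 the nearby points labelled 0 win the vote, while with K = 5 the more numerous points labelled 1 outvote them, so the prediction flips.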


In [None]:
# tune the hyper-parameter K in KNN to see how it influences the prediction performance 

In [66]:
x_train.head()

Unnamed: 0,item,gender,category
1210,73.0,1.0,5.0
57,38.0,0.0,5.0
2593,47.0,1.0,7.0
2431,28.0,0.0,5.0
229,22.0,0.0,5.0


In [67]:
y_train.head()

1210    1
57      0
2593    0
2431    0
229     0
Name: rating, dtype: int64

In [70]:
parameters = {'n_neighbors': range(1,100)}
train = GridSearchCV(neigh, parameters)
train.fit(x_train, y_train)

GridSearchCV(estimator=KNeighborsClassifier(n_neighbors=3),
             param_grid={'n_neighbors': range(1, 100)})

In [71]:
print('Best K value:', train.best_params_)

Best K value: {'n_neighbors': 22}


In [72]:
# best_score_ is the mean cross-validated accuracy on the training data, not the test-set accuracy
print('best accuracy with optimal K value', train.best_score_)

best accuracy with optimal K value 0.7453504634899983
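
Note that `best_score_` is the mean accuracy over the cross-validation folds of the training data (5 folds by default), so the 0.745 above is not directly comparable with the test accuracy of 0.676 obtained earlier with K = 3. A test-set figure for the tuned model would come from `train.score(x_test, y_test)`. The sketch below shows, on invented one-dimensional data, roughly what the grid search does for each candidate K. It is a simplification: the folds are built by taking every fifth row rather than by stratified splitting, and `knn_predict` and `cross_val_accuracy` are hypothetical helpers.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cross_val_accuracy(points, labels, k_neighbors, n_folds=5):
    """Mean validation accuracy over n_folds interleaved folds."""
    n = len(points)
    fold_scores = []
    for fold in range(n_folds):
        val_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        tr_points = [points[i] for i in train_idx]
        tr_labels = [labels[i] for i in train_idx]
        hits = sum(knn_predict(tr_points, tr_labels, points[i], k_neighbors) == labels[i]
                   for i in val_idx)
        fold_scores.append(hits / len(val_idx))
    return sum(fold_scores) / n_folds

# Toy data: two separated clusters, with one deliberately mislabelled point at x = 4.
points = [(float(x),) for x in range(10)] + [(float(x),) for x in range(20, 30)]
labels = [0] * 10 + [1] * 10
labels[4] = 1

# Score every candidate K and keep the one with the highest mean fold accuracy.
scores = {k: cross_val_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data K = 1 is misled by the mislabelled point when classifying its neighbour, while larger K values smooth it out, which is the kind of effect the grid search over `range(1, 100)` measures.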
