## Analysis of an E-commerce Dataset Part 3 (s2 2023)


In this Portfolio task, you will continue working with the dataset you used in Portfolio 2. The difference is that the ratings have been converted to like (score 1) and dislike (score 0). Your task is to train classification models such as KNN to predict whether a user likes or dislikes an item.  


The header of the csv file is shown below. 

| userId | timestamp | review | item | helpfulness | gender | category | item_id | item_price | user_city | rating |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |

Your high-level goal in this notebook is to build and evaluate predictive models for 'rating' from the other available features, that is, to predict like (rating 1) or dislike (rating 0) from some of the other fields. More specifically, you need to complete the following major steps: 
1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.
2) Convert object features into numeric features by using an encoder.
3) Study the correlation between these features. 
4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K in an ad-hoc manner in this step. Evaluate the accuracy of your model.
6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance.

Note 1: We did not provide any description of each step in the notebook. You should learn how to properly comment your notebook by yourself to make your notebook file readable. 

Note 2: you are not being evaluated on the ___accuracy___ of the model but on the ___process___ that you use to generate it. Please use both ___Logistic Regression model___ and ___KNN model___ for solving this classification problem. Accordingly, discuss the performance of these two methods.
    

In [1]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

In [2]:
data = pd.read_csv('portfolio_3.csv')

## Exploring the data

In [3]:
data.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,Not always McCrap,McDonald's,3,M,Restaurants & Gourmet,41,30.74,4,1
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,4,M,Restaurants & Gourmet,74,108.3,4,0
2,4081,72000,The Wonderful World of Wendy,Wendy's,4,M,Restaurants & Gourmet,84,69.0,4,1
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",3,M,Movies,68,143.11,4,1
4,4081,100399,Hey! Gimme some pie!,American Pie,3,M,Movies,6,117.89,4,0


In [4]:
data.describe()

Unnamed: 0,userId,timestamp,helpfulness,item_id,item_price,user_city,rating
count,2685.0,2685.0,2685.0,2685.0,2685.0,2685.0,2685.0
mean,4673.237616,58812.687151,3.908007,43.478585,83.09165,19.456983,0.639851
std,3517.893437,37013.726118,0.289069,26.630426,42.227558,11.397281,0.480133
min,4.0,10100.0,3.0,0.0,12.0,0.0,0.0
25%,1310.0,22000.0,4.0,21.0,49.0,9.0,0.0
50%,4666.0,52800.0,4.0,42.0,73.65,19.0,1.0
75%,7651.0,91000.0,4.0,67.0,129.82,28.0,1.0
max,10779.0,123199.0,4.0,88.0,149.0,39.0,1.0


In [5]:
data.shape

(2685, 11)

## Checking for duplicate values and null values in the data

In [6]:
data.isna().values.any()

False

In [7]:
data.duplicated().unique()

array([False])
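
The two checks above only answer "is there any problem at all?". A minimal sketch of a slightly more detailed check is shown here, assuming a small hypothetical frame `toy` in place of the portfolio data; it counts missing values per column and duplicated rows, then drops both.

```python
import pandas as pd

# Hypothetical toy frame standing in for the portfolio data.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "gender": ["M", "M", "F", None],
    "rating": [1, 1, 0, 1],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Number of rows that repeat an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# Drop repeated rows first, then rows that still contain a missing value.
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print(missing_per_column)
print("duplicates:", n_duplicates, "| rows kept:", len(cleaned))
```

On the real data both counts are zero, so no rows would be removed.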

## Checking for data consistency

In [8]:
data.helpfulness.unique()

array([3, 4], dtype=int64)

In [9]:
data.gender.unique()

array(['M', 'F'], dtype=object)

In [10]:
data.category.unique()

array(['Restaurants & Gourmet', 'Movies', 'Media', 'Kids & Family',
       'Online Stores & Services', 'Games', 'Hotels & Travel', 'Books',
       'Personal Finance'], dtype=object)

In [11]:
data.rating.unique()

array([1, 0], dtype=int64)

In [12]:
data.user_city.unique()

array([ 4, 10,  9, 35, 31, 14, 34, 17, 15, 38, 19, 20, 22, 26, 25, 16,  7,
       36, 23,  8,  2, 32, 27,  1, 24, 18, 29, 28,  3,  5, 12, 39, 21, 13,
       37, 11, 33, 30,  6,  0], dtype=int64)

## Encoding the object features

In [13]:
from sklearn.preprocessing import OrdinalEncoder

In [14]:
ord_enc = OrdinalEncoder()
data["gen_code"] = ord_enc.fit_transform(data[["gender"]])

In [15]:
data.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating,gen_code
0,4081,71900,Not always McCrap,McDonald's,3,M,Restaurants & Gourmet,41,30.74,4,1,1.0
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,4,M,Restaurants & Gourmet,74,108.3,4,0,1.0
2,4081,72000,The Wonderful World of Wendy,Wendy's,4,M,Restaurants & Gourmet,84,69.0,4,1,1.0
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",3,M,Movies,68,143.11,4,1,1.0
4,4081,100399,Hey! Gimme some pie!,American Pie,3,M,Movies,6,117.89,4,0,1.0


In [16]:
ord_enc = OrdinalEncoder()
data["cat_code"] = ord_enc.fit_transform(data[["category"]])

In [17]:
data.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating,gen_code,cat_code
0,4081,71900,Not always McCrap,McDonald's,3,M,Restaurants & Gourmet,41,30.74,4,1,1.0,8.0
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,4,M,Restaurants & Gourmet,74,108.3,4,0,1.0,8.0
2,4081,72000,The Wonderful World of Wendy,Wendy's,4,M,Restaurants & Gourmet,84,69.0,4,1,1.0,8.0
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",3,M,Movies,68,143.11,4,1,1.0,5.0
4,4081,100399,Hey! Gimme some pie!,American Pie,3,M,Movies,6,117.89,4,0,1.0,5.0


In [18]:
data.gen_code.unique()

array([1., 0.])

In [19]:
data.cat_code.unique()

array([8., 5., 4., 3., 6., 1., 2., 0., 7.])
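
To see which category received which code, the fitted encoder can be inspected. The sketch below assumes a hypothetical three-row frame `toy`; it encodes two columns in one call and reads the mapping from `categories_`, where the code of a value is its position in the sorted category array.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame with two object columns.
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "category": ["Movies", "Books", "Games"],
})

columns = ["gender", "category"]
encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy[columns])  # one float column per input column

# categories_ holds one sorted array per input column;
# the index of a value in that array is its code.
mapping = {
    col: {str(cat): idx for idx, cat in enumerate(cats)}
    for col, cats in zip(columns, encoder.categories_)
}
print(mapping)
```

Because the codes follow alphabetical order, they carry no meaningful ranking for nominal features such as category; this is worth keeping in mind when interpreting correlations or distances computed from them.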

In [20]:
ord_enc = OrdinalEncoder()
data["rev_code"] = ord_enc.fit_transform(data[["review"]])

In [21]:
data.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating,gen_code,cat_code,rev_code
0,4081,71900,Not always McCrap,McDonald's,3,M,Restaurants & Gourmet,41,30.74,4,1,1.0,8.0,1618.0
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,4,M,Restaurants & Gourmet,74,108.3,4,0,1.0,8.0,1125.0
2,4081,72000,The Wonderful World of Wendy,Wendy's,4,M,Restaurants & Gourmet,84,69.0,4,1,1.0,8.0,2185.0
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",3,M,Movies,68,143.11,4,1,1.0,5.0,2243.0
4,4081,100399,Hey! Gimme some pie!,American Pie,3,M,Movies,6,117.89,4,0,1.0,5.0,1033.0


## Studying Correlation between the data

In [22]:
data.corr(method='pearson', numeric_only=True)


Unnamed: 0,userId,timestamp,helpfulness,item_id,item_price,user_city,rating,gen_code,cat_code,rev_code
userId,1.0,-0.069176,-0.166136,-0.005549,0.024576,-0.030031,0.066444,-0.058324,-0.041362,0.007139
timestamp,-0.069176,1.0,0.014179,-0.004452,0.010979,-0.014934,-0.009739,-0.003367,0.015009,0.007029
helpfulness,-0.166136,0.014179,1.0,-0.019882,0.004112,0.012086,-0.010622,0.075947,-0.013408,-0.028259
item_id,-0.005549,-0.004452,-0.019882,1.0,-0.05445,-0.005576,0.057107,0.00237,-0.045268,0.163544
item_price,0.024576,0.010979,0.004112,-0.05445,1.0,-0.023427,0.026062,-0.040596,-0.115571,-0.041421
user_city,-0.030031,-0.014934,0.012086,-0.005576,-0.023427,1.0,-0.034866,-0.065638,0.008017,0.045626
rating,0.066444,-0.009739,-0.010622,0.057107,0.026062,-0.034866,1.0,-0.022169,-0.142479,-0.041756
gen_code,-0.058324,-0.003367,0.075947,0.00237,-0.040596,-0.065638,-0.022169,1.0,0.022549,-0.037884
cat_code,-0.041362,0.015009,-0.013408,-0.045268,-0.115571,0.008017,-0.142479,0.022549,1.0,0.00197
rev_code,0.007139,0.007029,-0.028259,0.163544,-0.041421,0.045626,-0.041756,-0.037884,0.00197,1.0


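Features ranked by the absolute value of their correlation with rating, in descending order: cat_code, userId, item_id, rev_code, user_city, item_price, gen_code, helpfulness, timestamp. All of these correlations are weak (the largest magnitude is about 0.14, for cat_code).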
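
The ordering above follows the magnitude of each coefficient rather than its sign. A minimal sketch of producing such a ranking programmatically is shown here, using a handful of the coefficients from the 'rating' row of the matrix; the variable names are illustrative only.

```python
import pandas as pd

# A few coefficients copied from the 'rating' row of the correlation matrix.
corr_with_rating = pd.Series({
    "timestamp": -0.009739,
    "item_id": 0.057107,
    "cat_code": -0.142479,
    "helpfulness": -0.010622,
    "userId": 0.066444,
})

# Rank by magnitude: a strong negative correlation is as informative as a positive one.
ranked = corr_with_rating.abs().sort_values(ascending=False)
print(ranked)
```

On the full frame, the same idea could be applied to `data.corr(numeric_only=True)["rating"].drop("rating")`.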


## Training five logistic regression models on different feature subsets to find which features give the most accurate model

In [23]:
corra = data[['cat_code','userId','item_id','rev_code','user_city','item_price']]

In [24]:
#model A
X_train, X_test, y_train, y_test = train_test_split(corra, data.rating, train_size=0.7)

In [25]:
modela = LogisticRegression()

In [26]:
modela.fit(X_train,y_train)

In [27]:
modela.predict(X_test)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [28]:
logscorea = modela.score(X_test,y_test)

In [29]:
corrb = data[['cat_code','userId','item_id','rev_code','user_city']]

In [30]:
#model B
X_train, X_test, y_train, y_test = train_test_split(corrb, data.rating, train_size=0.7)

In [31]:
modelb = LogisticRegression()

In [32]:
modelb.fit(X_train,y_train)

In [33]:
modelb.predict(X_test)

array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [34]:
logscoreb = modelb.score(X_test,y_test)

In [35]:
corrc = data[['cat_code','userId','item_id','rev_code']]

In [36]:
#model C
X_train, X_test, y_train, y_test = train_test_split(corrc, data.rating, train_size=0.7)

In [37]:
modelc = LogisticRegression()

In [38]:
modelc.fit(X_train,y_train)

In [39]:
modelc.predict(X_test)

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,

In [40]:
logscorec = modelc.score(X_test,y_test)

In [41]:
corrd = data[['cat_code','userId','item_id']]

In [42]:
#model D
X_train, X_test, y_train, y_test = train_test_split(corrd, data.rating, train_size=0.7)

In [43]:
modeld = LogisticRegression()

In [44]:
modeld.fit(X_train,y_train)

In [45]:
modeld.predict(X_test)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [46]:
logscored = modeld.score(X_test,y_test)

In [47]:
corre = data[['cat_code','userId']]

In [48]:
#model E
X_train, X_test, y_train, y_test = train_test_split(corre, data.rating, train_size=0.7)

In [49]:
modele = LogisticRegression()

In [50]:
modele.fit(X_train,y_train)

In [51]:
modele.predict(X_test)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [52]:
logscoree = modele.score(X_test,y_test)

## The models and their respective scores are tabulated below

In [63]:
combo = [['A', logscorea], ['B', logscoreb], ['C', logscorec],['D',logscored],['E',logscoree]]
df = pd.DataFrame(combo, columns=['Model', 'Score'])
df

Unnamed: 0,Model,Score
0,A,0.657568
1,B,0.662531
2,C,0.658809
3,D,0.637717
4,E,0.620347
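
Each model above was scored on its own random train/test split, so part of the difference between scores may come from the split rather than from the features. One possible way to compare feature subsets on identical data is to reuse the same cross-validation folds for every subset. The sketch below assumes synthetic data and hypothetical subset names; it is not the procedure used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the encoded portfolio data (assumption).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "cat_code": rng.integers(0, 9, n),
    "item_id": rng.integers(0, 89, n),
    "user_city": rng.integers(0, 40, n),
})
# Synthetic target loosely driven by cat_code.
toy["rating"] = (toy["cat_code"] + rng.normal(0, 3, n) < 5).astype(int)

feature_sets = {
    "A": ["cat_code", "item_id", "user_city"],
    "B": ["cat_code", "item_id"],
    "C": ["cat_code"],
}

# A fixed random_state makes every subset see exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {
    name: cross_val_score(
        LogisticRegression(max_iter=1000), toy[cols], toy["rating"], cv=cv
    ).mean()
    for name, cols in feature_sets.items()
}
print(pd.Series(cv_scores, name="mean CV accuracy"))
```

With the real data, `toy` would be replaced by `data` and the subsets by the column lists `corra` to `corre`.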


Model B has the highest accuracy score, so its features ('cat_code', 'userId', 'item_id', 'rev_code', 'user_city') are used to split the dataset and build the KNN model.

In [64]:
X_train, X_test, y_train, y_test = train_test_split(corrb, data['rating'], test_size=0.3)

Fitting a KNN model with K = 5, predicting rating on the test set, and evaluating the model

In [65]:
knn = KNeighborsClassifier(n_neighbors=5)

In [66]:
knn.fit(X_train, y_train)

In [67]:
y_pred = knn.predict(X_test)

In [68]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 score:', f1)

Accuracy: 0.5744416873449132
Precision: 0.6632302405498282
Recall: 0.724202626641651
F1 score: 0.6923766816143498
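
An accuracy figure is easier to interpret next to a trivial baseline. `data.describe()` reports a mean rating of about 0.64, so roughly 64% of the rows are likes, and a model that always predicts "like" would score about 0.64 on a representative split. The sketch below shows one way to compute such a baseline with `DummyClassifier`; the label arrays are hypothetical and only mimic that class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with a 64% share of likes (assumption, not the real split).
y_train_toy = np.array([1] * 64 + [0] * 36)
y_test_toy = np.array([1] * 16 + [0] * 9)

# The baseline ignores the features, so placeholder columns are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print("Baseline accuracy:", baseline_accuracy)
```

Fitting the same baseline on the notebook's `X_train`, `y_train` and scoring it on `X_test`, `y_test` would give the reference point for the KNN and logistic regression accuracies.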


Fitting, predicting, and evaluating the accuracies of KNN models with K ranging from 1 to 9

In [70]:
k_range = np.arange(1, 10)

knn = KNeighborsClassifier()

accuracies = []
for k in k_range:
    knn.n_neighbors = k
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = np.mean(y_pred == y_test)
    accuracies.append(accuracy)

for k, accuracy in zip(k_range, accuracies):
    print('k = {} | Accuracy = {}'.format(k, accuracy))

k = 1 | Accuracy = 0.5409429280397022
k = 2 | Accuracy = 0.45409429280397023
k = 3 | Accuracy = 0.5421836228287841
k = 4 | Accuracy = 0.4913151364764268
k = 5 | Accuracy = 0.5744416873449132
k = 6 | Accuracy = 0.5260545905707196
k = 7 | Accuracy = 0.5669975186104218
k = 8 | Accuracy = 0.5297766749379652
k = 9 | Accuracy = 0.5893300248138957


The model with K = 9 has the best accuracy score on the test set (about 0.589).
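
KNN relies on distances between rows, so a feature with a large numeric range (such as userId, which runs into the thousands) can dominate a feature with a small range (such as cat_code, 0 to 8). One common remedy is to standardise the features before fitting. The sketch below repeats the K sweep with a `StandardScaler` in a pipeline; the data are synthetic and the variable names are illustrative, so it shows the technique rather than a result for this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): one small-range and one large-range feature.
rng = np.random.default_rng(1)
n = 400
small = rng.integers(0, 9, n)       # plays the role of cat_code
large = rng.integers(0, 10000, n)   # plays the role of userId
X = np.column_stack([small, large]).astype(float)
y = (small < 5).astype(int)         # target depends only on the small-range feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

scaled_acc = {}
for k in range(1, 10):
    # The scaler is fitted on the training part only, then applied to the test part.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_tr, y_tr)
    scaled_acc[k] = model.score(X_te, y_te)

for k, acc in scaled_acc.items():
    print("k = {} | Accuracy = {:.3f}".format(k, acc))
```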

Hyperparameter tuning using grid search: for each value of K from 1 to 9, the model is cross-validated with 5 folds on the training set and the fold accuracies are averaged. The K with the best average is chosen, and the accuracy of the resulting model is then evaluated on the test set. 

In [73]:
param_grid = {'n_neighbors': np.arange(1, 10)}

# Create a KNN classifier
knn = KNeighborsClassifier()

# Perform a grid search to find the best value for `k`
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best value for `k`
best_k = grid_search.best_params_['n_neighbors']

# Train a KNN classifier with the best value for `k`
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)

# Evaluate the performance of the model on the test set
y_pred = knn.predict(X_test)
accuracy = np.mean(y_pred == y_test)

# Print the accuracy of the model
print('The best value of K is:',best_k ,'Accuracy:', accuracy)

The best value of K is: 7 Accuracy: 0.5669975186104218
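
Grid search selects K from the cross-validated accuracy on the training folds, not from the test set, which is why its choice can differ from the K that happened to score best in the manual sweep. The per-K cross-validation scores can be read from `cv_results_`. The sketch below uses synthetic data from `make_classification` as a stand-in for the training set.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": np.arange(1, 10)}, cv=5
)
grid.fit(X, y)

# Mean and spread of the 5 fold accuracies for every candidate K.
results = pd.DataFrame(grid.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score"]
]
print(results)
print("Best K:", grid.best_params_["n_neighbors"], "| CV accuracy:", grid.best_score_)
```

With the default `refit=True`, `grid.best_estimator_` is already refitted on all the data passed to `fit`, so it can be used for prediction directly.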
