# K-nearest Neighbour
## Properties:
- Its a classification algorithm
- _K_ here stands for the _number of nearest neighbours_

### What value of _K_ should we choose?
- The accuracy of the prediction is dependent on the value of _K_
- There's no method or equation to find the value of _K_.More of we go with a hit & trial method to get good value of _K_
- Avoid choosing smaller values of _K_
- Recommended to keep value of _K_ as odd, as choosing even value can lead to 'tie' such that we wont be able to predict the winning class for a certain value
- Could use hyper parameter tuning to test out the accuracies of various values of _K_ to figure put the best value of _K_

### Pros:
- Simple algorithm
- Better result as compared to others
- Can be used for multiple classes case

### Cons:
- Need to find good value of _K_ as it will affect the accuracy.
- Will add another column to store values of Euclidean distance or Manhattan distance for comparison thus more storage and computation required
- Need knowledge of various distance finding algorithms i.e Euclidean, Manhattan, Minkowski, Chebyshev etc 


## Practical Implementation

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold

In [3]:
# The dataset used
df = pd.read_csv('https://raw.githubusercontent.com/omairaasim/machine_learning/master/project_10_logistic_regression/iphone_purchase_records.csv')

In [4]:
df.head()

Unnamed: 0,Gender,Age,Salary,Purchase Iphone
0,Male,19,19000,0
1,Male,35,20000,0
2,Female,26,43000,0
3,Female,27,57000,0
4,Male,19,76000,0


Based on someone's gender, age and salary we would predict of the person purchased an _I phone_

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Gender           400 non-null    object
 1   Age              400 non-null    int64 
 2   Salary           400 non-null    int64 
 3   Purchase Iphone  400 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 12.6+ KB


In [6]:
df.describe()

Unnamed: 0,Age,Salary,Purchase Iphone
count,400.0,400.0,400.0
mean,37.655,69742.5,0.3575
std,10.482877,34096.960282,0.479864
min,18.0,15000.0,0.0
25%,29.75,43000.0,0.0
50%,37.0,70000.0,0.0
75%,46.0,88000.0,1.0
max,60.0,150000.0,1.0


In [7]:
# How many Male and Female
df.Gender.value_counts()

Female    204
Male      196
Name: Gender, dtype: int64

In [8]:
# How many Male have Iphone and Female have Iphone
df.loc[df['Purchase Iphone']==1, "Gender"].value_counts()

Female    77
Male      66
Name: Gender, dtype: int64

In [9]:
df.head(50)

Unnamed: 0,Gender,Age,Salary,Purchase Iphone
0,Male,19,19000,0
1,Male,35,20000,0
2,Female,26,43000,0
3,Female,27,57000,0
4,Male,19,76000,0
5,Male,27,58000,0
6,Female,27,84000,0
7,Female,32,150000,1
8,Male,25,33000,0
9,Female,35,65000,0


as seen above the dataset is imbalanced

In [10]:
x = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [11]:
x

Unnamed: 0,Gender,Age,Salary
0,Male,19,19000
1,Male,35,20000
2,Female,26,43000
3,Female,27,57000
4,Male,19,76000
...,...,...,...
395,Female,46,41000
396,Male,51,23000
397,Female,50,20000
398,Male,36,33000


In [12]:
y

0      0
1      0
2      0
3      0
4      0
      ..
395    1
396    1
397    1
398    0
399    1
Name: Purchase Iphone, Length: 400, dtype: int64

In [13]:
# Label encoding
from sklearn.preprocessing import LabelEncoder

In [14]:
enc = LabelEncoder()

In [15]:
x.Gender = enc.fit_transform(x.Gender)

In [16]:
x

Unnamed: 0,Gender,Age,Salary
0,1,19,19000
1,1,35,20000
2,0,26,43000
3,0,27,57000
4,1,19,76000
...,...,...,...
395,0,46,41000
396,1,51,23000
397,0,50,20000
398,1,36,33000


In [17]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Gender  400 non-null    int32
 1   Age     400 non-null    int64
 2   Salary  400 non-null    int64
dtypes: int32(1), int64(2)
memory usage: 7.9 KB


In [18]:
# Split the data
# use skf
skf = StratifiedKFold(n_splits=5)

In [30]:
for train_index, test_index in skf.split(x, y):
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

In [20]:
from sklearn.linear_model import LogisticRegression

In [48]:
log = LogisticRegression()
knn = KNeighborsClassifier(n_neighbors=5)

In [22]:
# feature scaling
scale = StandardScaler()

In [26]:
newX = x.iloc[:, 1:]

In [27]:
newX

Unnamed: 0,Age,Salary
0,19,19000
1,35,20000
2,26,43000
3,27,57000
4,19,76000
...,...,...
395,46,41000
396,51,23000
397,50,20000
398,36,33000


In [49]:
x_train = scale.fit_transform(x_train)
x_test = scale.fit_transform(x_test)

In [28]:
newX = scale.fit_transform(newX)

In [24]:
x_train

array([[ 1.02532046, -1.61062735, -1.48261454],
       [ 1.02532046, -0.10854866, -1.45444971],
       [-0.97530483, -0.95346792, -0.80665849],
       [-0.97530483, -0.859588  , -0.41235079],
       [ 1.02532046, -1.61062735,  0.12278108],
       [ 1.02532046, -0.859588  , -0.38418596],
       [-0.97530483, -0.859588  ,  0.34809976],
       [-0.97530483, -0.39018841,  2.20697891],
       [ 1.02532046, -1.04734784, -1.08830685],
       [-0.97530483, -0.10854866, -0.18703211],
       [-0.97530483, -0.95346792,  0.23544042],
       [-0.97530483, -0.95346792, -0.55317497],
       [ 1.02532046, -1.51674743,  0.40442943],
       [ 1.02532046, -0.39018841, -1.51077938],
       [ 1.02532046, -1.70450727,  0.29177009],
       [ 1.02532046, -0.67182817,  0.23544042],
       [ 1.02532046,  1.01801037, -1.31362553],
       [ 1.02532046,  0.83025053, -1.28546069],
       [ 1.02532046,  0.92413045, -1.22913102],
       [-0.97530483,  1.11189029, -1.20096619],
       [ 1.02532046,  0.83025053, -1.398

In [29]:
newX

array([[-1.78179743, -1.49004624],
       [-0.25358736, -1.46068138],
       [-1.11320552, -0.78528968],
       [-1.01769239, -0.37418169],
       [-1.78179743,  0.18375059],
       [-1.01769239, -0.34481683],
       [-1.01769239,  0.41866944],
       [-0.54012675,  2.35674998],
       [-1.20871865, -1.07893824],
       [-0.25358736, -0.13926283],
       [-1.11320552,  0.30121002],
       [-1.11320552, -0.52100597],
       [-1.6862843 ,  0.47739916],
       [-0.54012675, -1.51941109],
       [-1.87731056,  0.35993973],
       [-0.82666613,  0.30121002],
       [ 0.89257019, -1.3138571 ],
       [ 0.70154394, -1.28449224],
       [ 0.79705706, -1.22576253],
       [ 0.98808332, -1.19639767],
       [ 0.70154394, -1.40195167],
       [ 0.89257019, -0.60910054],
       [ 0.98808332, -0.84401939],
       [ 0.70154394, -1.40195167],
       [ 0.79705706, -1.37258681],
       [ 0.89257019, -1.46068138],
       [ 1.08359645, -1.22576253],
       [ 0.89257019, -1.16703281],
       [-0.82666613,

In [50]:
# training the model
log.fit(x_train, y_train)
knn.fit(x_train, y_train)

KNeighborsClassifier()

In [34]:
y_knn_pred = knn.predict(x_test)

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


In [51]:
newdf = pd.DataFrame({"Actual": y_test, "Predicted":y_knn_pred})

In [52]:
newdf.head()

Unnamed: 0,Actual,Predicted
266,0,0
267,0,0
269,0,0
270,0,1
276,0,0


In [53]:
confusion_matrix(y_test, y_knn_pred)

array([[48,  3],
       [22,  7]], dtype=int64)

In [54]:
acc = (48+7)/80
acc

0.6875

accuracy of _~69%_

In [39]:
lis = [i for i in range(3, 100) if i%2!=0]  # pick all odd numebr between 3 to 100

In [40]:
lis

[3,
 5,
 7,
 9,
 11,
 13,
 15,
 17,
 19,
 21,
 23,
 25,
 27,
 29,
 31,
 33,
 35,
 37,
 39,
 41,
 43,
 45,
 47,
 49,
 51,
 53,
 55,
 57,
 59,
 61,
 63,
 65,
 67,
 69,
 71,
 73,
 75,
 77,
 79,
 81,
 83,
 85,
 87,
 89,
 91,
 93,
 95,
 97,
 99]

In [41]:
from sklearn.metrics import accuracy_score

In [55]:
# some hyper parameter tuning
accuracy = []
dic = {}
for i in lis:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    y_knn_pred = knn.predict(x_test)
    accuracy.append(accuracy_score(y_test, y_knn_pred))
    dic[i] = accuracy_score(y_test, y_knn_pred)

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mo

In [57]:
print(max(accuracy))

0.8875


In [47]:
dic

{3: 0.8375,
 5: 0.85,
 7: 0.85,
 9: 0.8625,
 11: 0.8875,
 13: 0.8875,
 15: 0.875,
 17: 0.85,
 19: 0.8,
 21: 0.7875,
 23: 0.775,
 25: 0.7625,
 27: 0.7375,
 29: 0.7375,
 31: 0.7375,
 33: 0.7375,
 35: 0.7375,
 37: 0.7375,
 39: 0.7375,
 41: 0.7375,
 43: 0.7375,
 45: 0.7375,
 47: 0.7375,
 49: 0.7375,
 51: 0.7375,
 53: 0.7375,
 55: 0.7375,
 57: 0.7375,
 59: 0.7375,
 61: 0.7375,
 63: 0.725,
 65: 0.725,
 67: 0.725,
 69: 0.725,
 71: 0.725,
 73: 0.725,
 75: 0.725,
 77: 0.725,
 79: 0.725,
 81: 0.725,
 83: 0.725,
 85: 0.725,
 87: 0.725,
 89: 0.7125,
 91: 0.7,
 93: 0.7,
 95: 0.7,
 97: 0.6875,
 99: 0.6875}

n_neighbors of _11_ and _13_ give us the highest accuracy

In [58]:
y_log_pred = log.predict(x_test)

In [59]:
accuracy_score(y_test, y_log_pred)

0.725

K-nearest neighbor could give us higher accuracy that Logistic regression