## **SHAPEAI - DATA SCIENTIST TRAINING & INTERNSHIP**

### **Author: Midhir Nambiar**

### **Task: KNN (K-Nearest Neighbours)**

Outline:
1.   Importing Libraries
2.   Data Analysis
3.   Feature Scaling
4.   Train-test split
5.   Model Building
6.   Model Evaluation



**Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

**Data Analysis**

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/omairaasim/machine_learning/master/project_11_k_nearest_neighbor/iphone_purchase_records.csv')
df.head()

   Gender  Age  Salary  Purchase Iphone
0    Male   19   19000                0
1    Male   35   20000                0
2  Female   26   43000                0
3  Female   27   57000                0
4    Male   19   76000                0


In [3]:
print(df.shape)

(400, 4)


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Gender           400 non-null    object
 1   Age              400 non-null    int64 
 2   Salary           400 non-null    int64 
 3   Purchase Iphone  400 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 12.6+ KB


In [5]:
df.describe()

             Age        Salary  Purchase Iphone
count      400.0         400.0            400.0
mean      37.655       69742.5           0.3575
std    10.482877  34096.960282         0.479864
min         18.0       15000.0              0.0
25%        29.75       43000.0              0.0
50%         37.0       70000.0              0.0
75%         46.0       88000.0              1.0
max         60.0      150000.0              1.0


In [6]:
df.isnull().sum()

Gender             0
Age                0
Salary             0
Purchase Iphone    0
dtype: int64

In [7]:
df['Purchase Iphone'].value_counts()

0    257
1    143
Name: Purchase Iphone, dtype: int64
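
The class counts above give a useful reference point: a model that always predicts the majority class (0, no purchase) would already be right for 257 of the 400 records. A minimal sketch of that baseline, using only the counts printed above (the variable names are illustrative and not part of the original notebook):

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 257, 1: 143}

total = sum(class_counts.values())
# The majority class is the label with the largest count
majority_class = max(class_counts, key=class_counts.get)
# Accuracy of a model that always predicts the majority class
baseline_accuracy = class_counts[majority_class] / total

print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2%}")
```

Any classifier accuracy on this data is worth reading against this floor of roughly 64%.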

In [8]:
df.Gender.value_counts()

Female    204
Male      196
Name: Gender, dtype: int64

In [9]:
df.loc[df['Purchase Iphone']==1, 'Gender'].value_counts()

Female    77
Male      66
Name: Gender, dtype: int64

In [10]:
df.loc[df['Purchase Iphone']==0, 'Gender'].value_counts()

Male      130
Female    127
Name: Gender, dtype: int64
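
The two breakdowns above are easier to compare as purchase rates per gender. A small sketch that derives those rates from the printed counts (the dictionaries below are typed in by hand from the outputs, and their names are illustrative):

```python
# Counts taken from the two value_counts() outputs above
purchased = {"Female": 77, "Male": 66}
not_purchased = {"Female": 127, "Male": 130}

# Share of each gender that purchased
purchase_rate = {
    gender: purchased[gender] / (purchased[gender] + not_purchased[gender])
    for gender in purchased
}

for gender, rate in purchase_rate.items():
    print(f"{gender}: {rate:.1%}")
```

The two rates differ by only about four percentage points, which suggests that Gender on its own may be a weak predictor here.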

In [11]:
x = df[['Gender', 'Age', 'Salary']]
y = df['Purchase Iphone']

In [12]:
x.head()

   Gender  Age  Salary
0    Male   19   19000
1    Male   35   20000
2  Female   26   43000
3  Female   27   57000
4    Male   19   76000


In [13]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Purchase Iphone, dtype: int64

In [14]:
#Label Encoding
from sklearn.preprocessing import LabelEncoder

In [15]:
enc = LabelEncoder()

In [16]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
x = x.assign(Gender=enc.fit_transform(x['Gender']))
x.head()


   Gender  Age  Salary
0       1   19   19000
1       1   35   20000
2       0   26   43000
3       0   27   57000
4       1   19   76000
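
LabelEncoder assigns integer codes in sorted order of the class labels, which is why Female becomes 0 and Male becomes 1 in the table above. A plain-Python sketch that mimics this mapping for the first five rows (this is an illustration of the idea, not scikit-learn's implementation):

```python
# Gender values of the first five rows shown above
genders = ["Male", "Male", "Female", "Female", "Male"]

# Sorted unique labels play the role of LabelEncoder's classes_
classes = sorted(set(genders))
# Each label's code is its position in the sorted list
code_for = {label: code for code, label in enumerate(classes)}
encoded = [code_for[g] for g in genders]

print(classes)
print(encoded)
```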


**Feature Scaling**

In [17]:
scale = StandardScaler()
x = scale.fit_transform(x)
x

array([[ 1.02020406, -1.78179743, -1.49004624],
       [ 1.02020406, -0.25358736, -1.46068138],
       [-0.98019606, -1.11320552, -0.78528968],
       ...,
       [-0.98019606,  1.17910958, -1.46068138],
       [ 1.02020406, -0.15807423, -1.07893824],
       [-0.98019606,  1.08359645, -0.99084367]])
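
StandardScaler replaces each value with z = (value - mean) / std, where std is the population standard deviation (ddof=0). The first Age entry above can be checked by hand from the describe() statistics, as a rough sketch. Note that describe() reports the sample standard deviation (ddof=1), so it is converted first; the variable names are illustrative.

```python
from math import sqrt

n = 400                      # number of rows
age_mean = 37.655            # mean Age from describe()
age_sample_std = 10.482877   # std from describe(), computed with ddof=1

# Convert the sample std (ddof=1) to the population std (ddof=0)
age_population_std = age_sample_std * sqrt((n - 1) / n)

# Standardise the first row's Age of 19
z = (19 - age_mean) / age_population_std
print(round(z, 4))
```

The result agrees with the first Age value in the scaled array, about -1.7818.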

In [18]:
y

0      0
1      0
2      0
3      0
4      0
      ..
395    1
396    1
397    1
398    0
399    1
Name: Purchase Iphone, Length: 400, dtype: int64

**Train-test split**

In [19]:
from sklearn.model_selection import train_test_split
# No random_state is passed, so the split and every score below change from run to run
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

In [20]:
print(x_train.shape)
print(x_test.shape)

(280, 3)
(120, 3)
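
In the cells above the scaler is fitted on all 400 rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. One common alternative, sketched here under assumptions, is to split first and let a pipeline fit the scaler on the training rows only. The data below is a synthetic stand-in generated with `make_classification` (not the iPhone data set), and `random_state=0` and `stratify=y` are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 400 rows, 3 features, binary target
X, y = make_classification(n_samples=400, n_features=3, n_informative=2,
                           n_redundant=0, random_state=0)

# Split first; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation when predicting on test rows
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print(X_train.shape, X_test.shape, round(test_accuracy, 3))
```

With only 400 rows the practical difference may be small, but keeping test rows out of every fitting step makes the reported accuracy easier to trust.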


**Model Building**

In [21]:
# By default the value of k is 5
knn = KNeighborsClassifier()

In [22]:
knn.fit(x_train,y_train)

In [23]:
y_pred = knn.predict(x_test)
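
To make the prediction step less of a black box, here is a simplified sketch of what a k-nearest-neighbours classifier does for one query point: measure the Euclidean distance to every training point, keep the k closest, and take a majority vote. The function `knn_predict` and the tiny data set are made up for illustration; scikit-learn's implementation is more elaborate (faster neighbour search, its own tie handling).

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Indices of the training points, ordered by Euclidean distance to the query
    nearest = sorted(range(len(train_points)),
                     key=lambda i: dist(train_points[i], query))[:k]
    # Count the labels of the k nearest points and return the most frequent one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Tiny made-up, already-scaled data set (illustrative values only)
train_points = [(-1.0, -1.2), (-0.8, -0.9), (-1.1, -0.7),
                (1.0, 1.1), (0.9, 1.3), (1.2, 0.8)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (-0.9, -1.0), k=3))
print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))
```

With two classes, an odd k guarantees that the vote cannot end in a tie.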

**Model Evaluation**

In [24]:
confusion_matrix(y_test,y_pred)

array([[75,  9],
       [ 4, 32]])
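
In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels, so the matrix above reads as 75 true negatives, 9 false positives, 4 false negatives and 32 true positives. A short sketch that derives precision, recall and F1 for the "purchased" class from those four numbers (variable names are illustrative):

```python
# Confusion matrix copied from the output above
# Rows: true labels (0, 1); columns: predicted labels (0, 1)
matrix = [[75, 9],
          [4, 32]]
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)   # of the predicted buyers, how many really bought
recall = tp / (tp + fn)      # of the real buyers, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the classes are imbalanced (84 negatives against 36 positives in this test split), these per-class figures add detail that a single accuracy number hides.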

In [25]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.8916666666666667

In [30]:
# Same accuracy recomputed by hand from the confusion matrix: (TN + TP) / total
accuracy = ((75 + 32) / (75 + 9 + 4 + 32)) * 100
print(accuracy)

89.16666666666667


Checking the accuracy for odd values of k from 3 to 99

In [27]:
list1 = list(range(3, 100, 2))  # odd values of k from 3 to 99

In [28]:
from sklearn.metrics import accuracy_score

dic = {}  # maps k -> test accuracy in percent
for i in list1:
    # Refit the classifier on the same training split for each k
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    y_pred = knn.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    dic[i] = acc * 100

In [29]:
dic

{3: 87.5,
 5: 89.16666666666667,
 7: 90.0,
 9: 90.0,
 11: 90.83333333333333,
 13: 92.5,
 15: 90.83333333333333,
 17: 90.0,
 19: 90.83333333333333,
 21: 89.16666666666667,
 23: 90.83333333333333,
 25: 88.33333333333333,
 27: 85.0,
 29: 85.83333333333333,
 31: 85.0,
 33: 83.33333333333334,
 35: 83.33333333333334,
 37: 84.16666666666667,
 39: 84.16666666666667,
 41: 83.33333333333334,
 43: 83.33333333333334,
 45: 82.5,
 47: 81.66666666666667,
 49: 80.83333333333333,
 51: 80.0,
 53: 80.0,
 55: 80.83333333333333,
 57: 80.83333333333333,
 59: 80.0,
 61: 80.0,
 63: 80.0,
 65: 80.0,
 67: 80.0,
 69: 78.33333333333333,
 71: 78.33333333333333,
 73: 78.33333333333333,
 75: 78.33333333333333,
 77: 78.33333333333333,
 79: 78.33333333333333,
 81: 79.16666666666666,
 83: 79.16666666666666,
 85: 79.16666666666666,
 87: 78.33333333333333,
 89: 80.83333333333333,
 91: 80.0,
 93: 79.16666666666666,
 95: 81.66666666666667,
 97: 81.66666666666667,
 99: 82.5}
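
In the results above, k = 13 gives the highest accuracy (92.5%) on this particular split, and accuracy drifts down towards 80% as k grows. Since k is being chosen by looking at test-set scores, that best figure may be optimistic. A possible alternative, sketched here under assumptions, is to choose k by cross-validation on the training rows and touch the test rows only once at the end. The data is a synthetic stand-in from `make_classification` (not the iPhone data set), and the shorter range of k, `cv=5` and `random_state=0` are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 400 rows, 3 features, binary target
X, y = make_classification(n_samples=400, n_features=3, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

candidate_ks = list(range(3, 32, 2))  # odd values only, shorter than the notebook's range
cv_accuracy = {}
for k in candidate_ks:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 folds of the training rows; the test rows stay untouched
    cv_accuracy[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

# k with the highest mean cross-validated accuracy
best_k = max(cv_accuracy, key=cv_accuracy.get)

# Refit with the chosen k and score once on the held-out test rows
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_k, round(test_accuracy, 3))
```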

**THANK YOU**