# KNN (K-Nearest Neighbors) Algorithm

## Overview
KNN is a simple, widely used classification algorithm: it labels a new point by majority vote among the **k** training points closest to it. It is intuitive and easy to implement.

### Properties:
- It is a **simple classification algorithm** with no explicit training phase: the stored training data is the model.
- **k** represents the number of nearest neighbors considered for classification.
- The accuracy and behavior of the model heavily depend on the choice of **k**.
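
The properties above can be made concrete with a minimal from-scratch sketch. The function name `knn_predict` and the toy points are made up for illustration; this is not the scikit-learn implementation used later in the notebook, only the core idea of voting among the k closest points.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small, well-separated clusters.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (2.0, 2.0), k=3))  # query near the first cluster
print(knn_predict(train_points, train_labels, (6.0, 6.5), k=3))  # query near the second cluster
```

Changing `k` changes how many neighbors take part in the vote, which is why the choice of k matters so much.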

### How to Choose the Value of k?
- There is **no fixed formula** for the optimal value of **k**; it depends on the data.
- Use **trial and error**: evaluate several candidate values and keep the one that performs best.
- Avoid **very small values of k** (such as 1), which make predictions sensitive to noisy points.
- For binary classification, prefer an **odd k**: an even **k** can produce tied votes, and the tie-break can bias predictions toward one class.
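
One common way to make this search systematic is cross-validation: score each candidate k on several train/validation splits and keep the value with the best average. The sketch below assumes synthetic data from `make_classification` as a stand-in (not the iPhone dataset) and restricts the search to odd k; names such as `X_demo` and `cv_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=2, n_redundant=0, random_state=0
)

cv_scores = {}
for k in range(1, 32, 2):  # odd k only, so a two-class vote cannot tie
    # Scaling lives inside the pipeline, so it is refit on each training split.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 3))
```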

### Pros and Cons

#### Pros:
- **Simple Algorithm**: Easy to understand and implement.
- **Good Accuracy**: Often gives competitive results on classification tasks.
- **Handles Multiple Classes**: Works well with multi-class problems.

#### Cons:
- **Sensitive to the Choice of k**: Small changes in the value of **k** can significantly affect accuracy.
- **High Storage and Computation Requirements**: The algorithm keeps the entire training set in memory and computes distances from each query point to the training points at prediction time, which becomes expensive with large datasets.
- **Distance Metric Knowledge Required**: You need to understand different distance metrics (e.g., Euclidean, Manhattan) to effectively apply KNN.
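
As a small illustration of the two metrics named above, the sketch below computes both for one pair of points. The helper names `euclidean` and `manhattan` are made up for this example.

```python
def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

In scikit-learn, the `metric` parameter of `KNeighborsClassifier` selects which distance is used.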


## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

## Import the dataset

In [2]:
link = 'https://raw.githubusercontent.com/omairaasim/machine_learning/master/project_11_k_nearest_neighbor/iphone_purchase_records.csv'
df = pd.read_csv(link)

## Exploratory Data Analysis (EDA)

In [3]:
df.head()

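|   | Gender | Age | Salary | Purchase Iphone |
|---|--------|-----|--------|-----------------|
| 0 | Male   | 19  | 19000  | 0 |
| 1 | Male   | 35  | 20000  | 0 |
| 2 | Female | 26  | 43000  | 0 |
| 3 | Female | 27  | 57000  | 0 |
| 4 | Male   | 19  | 76000  | 0 |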


In [4]:
df.shape

(400, 4)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Gender           400 non-null    object
 1   Age              400 non-null    int64 
 2   Salary           400 non-null    int64 
 3   Purchase Iphone  400 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 12.6+ KB


In [6]:
df.describe()

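|       | Age       | Salary       | Purchase Iphone |
|-------|-----------|--------------|-----------------|
| count | 400.0     | 400.0        | 400.0    |
| mean  | 37.655    | 69742.5      | 0.3575   |
| std   | 10.482877 | 34096.960282 | 0.479864 |
| min   | 18.0      | 15000.0      | 0.0      |
| 25%   | 29.75     | 43000.0      | 0.0      |
| 50%   | 37.0      | 70000.0      | 0.0      |
| 75%   | 46.0      | 88000.0      | 1.0      |
| max   | 60.0      | 150000.0     | 1.0      |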


In [7]:
# Number of male and female
df.Gender.value_counts()

Gender
Female    204
Male      196
Name: count, dtype: int64

In [8]:
df[df['Purchase Iphone'] == 1]['Gender'].value_counts()

Gender
Female    77
Male      66
Name: count, dtype: int64

In [9]:
# The classes are imbalanced (143 purchases out of 400 rows) and the rows are not well shuffled, so a stratified split is used below
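
The size of the imbalance follows directly from the counts printed above. The short sketch below redoes that arithmetic by hand; the variable names are illustrative.

```python
# Buyer counts per gender, taken from the value_counts() output above.
buyers = {"Female": 77, "Male": 66}
total_rows = 400  # number of rows reported by df.shape

n_buyers = sum(buyers.values())
n_non_buyers = total_rows - n_buyers
buyer_share = n_buyers / total_rows

print(n_buyers, n_non_buyers, buyer_share)  # 143 257 0.3575
```

The share of buyers matches the mean of the `Purchase Iphone` column reported by `df.describe()`.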

In [10]:
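# Features: every column except the last; .copy() avoids modifying a view of df later
X = df.iloc[:, :-1].copy()
# Target: the last column ('Purchase Iphone')
y = df.iloc[:, -1]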

In [11]:
# Label Encoding

In [12]:
from sklearn.preprocessing import LabelEncoder

In [13]:
enc = LabelEncoder()

In [14]:
X.Gender = enc.fit_transform(X.Gender)

In [15]:
# Splitting the data into sets
skf = StratifiedKFold(n_splits = 5)

In [16]:
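# Each iteration overwrites the previous split, so only the last of the
# 5 folds is kept: 320 rows for training and 80 rows for testing
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]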
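
Because the loop above keeps only its final split, every later score is measured on a single fold. A hedged alternative is to evaluate inside the loop and average over all folds. The sketch below assumes synthetic stand-in data and shuffled folds, neither of which is used in this notebook; names such as `fold_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=2, n_redundant=0, random_state=0
)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in skf_demo.split(X_demo, y_demo):
    # Fit the scaler on the training part of this fold only.
    scaler = StandardScaler().fit(X_demo[train_idx])
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    preds = model.predict(scaler.transform(X_demo[test_idx]))
    fold_scores.append(accuracy_score(y_demo[test_idx], preds))

print(np.round(fold_scores, 3), round(float(np.mean(fold_scores)), 3))
```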

### Feature Scaling

In [17]:
scale = StandardScaler()

In [18]:
# Fit the scaler on X_train only, then apply the same transformation to X_test
# so that no information from the test fold leaks into the scaling
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)
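
Scaling matters for KNN because distances are dominated by whichever feature has the largest numeric range. The sketch below standardizes the `(Age, Salary)` pairs from the first five rows shown earlier, using plain Python rather than `StandardScaler`; the helper `standardize` is made up for illustration.

```python
import math
from statistics import mean, pstdev

# (Age, Salary) pairs from the first five rows of the dataset shown earlier.
rows = [(19, 19000), (35, 20000), (26, 43000), (27, 57000), (19, 76000)]

def standardize(rows):
    """Subtract each column's mean and divide by its population std."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sigmas)) for r in rows]

scaled = standardize(rows)

# Unscaled, the 1000-unit salary gap swamps the 16-year age gap.
raw_d = math.dist(rows[0], rows[1])
# Scaled, both features contribute on a comparable footing.
scaled_d = math.dist(scaled[0], scaled[1])
print(raw_d, scaled_d)

# Each scaled column now has mean 0 and standard deviation 1.
scaled_cols = list(zip(*scaled))
col_means = [mean(c) for c in scaled_cols]
col_stds = [pstdev(c) for c in scaled_cols]
```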

### Model Selection

In [19]:
log = LogisticRegression()  # baseline model for comparison
# n_neighbors sets k (the number of neighbors); n_jobs would only set the number of parallel workers
knn = KNeighborsClassifier(n_neighbors=5)

### Training the model

In [20]:
log.fit(X_train, y_train)
knn.fit(X_train, y_train)

### Test the model

In [22]:
y_knn_pred = knn.predict(X_test)

In [23]:
y_log_pred = log.predict(X_test)

In [24]:
newdf = pd.DataFrame({'Actual' : y_test, 'Predicted' : y_knn_pred})

In [25]:
newdf

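|     | Actual | Predicted |
|-----|--------|-----------|
| 266 | 0   | 0   |
| 267 | 0   | 0   |
| 269 | 0   | 0   |
| 270 | 0   | 1   |
| 276 | 0   | 1   |
| ... | ... | ... |
| 395 | 1   | 1   |
| 396 | 1   | 1   |
| 397 | 1   | 1   |
| 398 | 0   | 0   |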


In [26]:
confusion_matrix(y_test, y_knn_pred)

array([[42,  9],
       [ 2, 27]], dtype=int64)

In [27]:
(42+27)/80

0.8625
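
The confusion matrix above carries more than accuracy. Assuming scikit-learn's layout (rows are actual classes, columns are predicted classes), the sketch below derives precision, recall, and F1 for the positive class from the same four numbers. The variable names are illustrative.

```python
# Confusion matrix printed above: rows = actual, columns = predicted.
cm = [[42, 9],
      [2, 27]]

tn, fp = cm[0]  # actual 0: correctly rejected, falsely flagged as buyers
fn, tp = cm[1]  # actual 1: missed buyers, correctly identified buyers

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted buyers, how many really bought
recall = tp / (tp + fn)     # of real buyers, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, round(recall, 4), round(f1, 4))  # 0.8625 0.75 0.931 0.8308
```

Here most of the errors are false positives (9) rather than missed buyers (2), which accuracy alone does not show.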

### Performance Metrics

In [28]:
from sklearn.metrics import accuracy_score

In [29]:
accuracy_score(y_test, y_knn_pred)

0.8625

In [30]:
accuracy_score(y_test, y_log_pred)

0.7

In [31]:
# Hyperparameter tuning: candidate values of k (even numbers from 2 to 100)
lis = list(range(2, 101, 2))

# Other candidate lists are worth trying too, for example odd values only

In [32]:
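# Note: n_jobs sets the number of parallel workers, not k. Every model in this
# loop therefore uses the default n_neighbors=5, so the accuracy never changes.
# The next cell tunes n_neighbors instead.
acc = []
for i in lis:
    knn = KNeighborsClassifier(n_jobs = i)
    knn.fit(X_train, y_train)
    y_knn_pred = knn.predict(X_test)
    acc.append(accuracy_score(y_test, y_knn_pred))
print(max(acc))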

0.8625
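
The unchanged accuracy above has a simple explanation: `n_jobs` controls parallelism, while `n_neighbors` is the k of the algorithm. A quick check, with illustrative variable names:

```python
from sklearn.neighbors import KNeighborsClassifier

# n_jobs only sets how many parallel workers the neighbor search may use.
model_a = KNeighborsClassifier(n_jobs=2)
model_b = KNeighborsClassifier(n_jobs=4)

# Both models still use the default k.
print(model_a.n_neighbors, model_b.n_neighbors)  # 5 5
```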


In [33]:
acc = []
dic = {}  # maps k -> accuracy on the test fold
for i in lis:
    knn = KNeighborsClassifier(n_neighbors = i)  # n_neighbors is the k being tuned
    knn.fit(X_train, y_train)
    y_knn_pred = knn.predict(X_test)
    score = accuracy_score(y_test, y_knn_pred)  # compute once, store in both places
    acc.append(score)
    dic[i] = score
print(max(acc))
print(dic)

0.95
{2: 0.875, 4: 0.875, 6: 0.9125, 8: 0.925, 10: 0.9375, 12: 0.95, 14: 0.95, 16: 0.9375, 18: 0.9125, 20: 0.85, 22: 0.8, 24: 0.8125, 26: 0.7875, 28: 0.775, 30: 0.775, 32: 0.7625, 34: 0.75, 36: 0.75, 38: 0.75, 40: 0.75, 42: 0.75, 44: 0.75, 46: 0.7375, 48: 0.7375, 50: 0.7375, 52: 0.7375, 54: 0.7375, 56: 0.7375, 58: 0.7375, 60: 0.7375, 62: 0.725, 64: 0.725, 66: 0.7125, 68: 0.7125, 70: 0.725, 72: 0.7125, 74: 0.7125, 76: 0.7125, 78: 0.725, 80: 0.725, 82: 0.725, 84: 0.725, 86: 0.725, 88: 0.7125, 90: 0.7, 92: 0.7, 94: 0.7, 96: 0.6875, 98: 0.6875, 100: 0.6875}
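
To read the best k off these results instead of scanning the dictionary by eye, one option is to collect every k that reaches the maximum accuracy. The sketch below copies the first ten entries of the output above into a hypothetical `scores` dictionary.

```python
# First ten entries of the dictionary printed above (k -> accuracy on the test fold).
scores = {2: 0.875, 4: 0.875, 6: 0.9125, 8: 0.925, 10: 0.9375,
          12: 0.95, 14: 0.95, 16: 0.9375, 18: 0.9125, 20: 0.85}

best_acc = max(scores.values())
# Several values of k can share the top score, so keep all of them.
best_ks = [k for k, a in scores.items() if a == best_acc]

print(best_acc, best_ks)  # 0.95 [12, 14]
```

Since k was chosen on the same fold that is used to report accuracy, 0.95 is likely an optimistic estimate; a separate validation split or cross-validation would give a fairer number.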
