
**Learning to Create a Model Using the K-Nearest Neighbours Classifier**
Reference: [Article by KDnuggets](https://buff.ly/3MyEnEV)

**INTRODUCTION**
K-Nearest Neighbours (KNN) is a simple machine learning model that predicts the class of a target data point from the majority class of its K nearest data points in the feature space. A practical example is deciding what to wear in certain weather based on what your neighbours wore. The algorithm assumes that similar things exist in close proximity, which makes it one of the more intuitive algorithms to understand.
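
The majority-vote idea described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `knn_predict` and the toy clothing data are hypothetical and are not part of the notebook or of scikit-learn.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Count the labels of those neighbours and keep the most common one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (temperature, humidity) -> what a neighbour wore
points = [(30, 40), (32, 45), (29, 50), (10, 80), (12, 85), (8, 90)]
labels = ['t-shirt', 't-shirt', 't-shirt', 'coat', 'coat', 'coat']

print(knn_predict(points, labels, query=(28, 48), k=3))  # t-shirt
```

The three closest points to the query are all warm-weather observations, so the vote returns their shared label.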

DATASET: The dataset is created from a dictionary. Whether or not golf is played is the target, and outlook, temperature, humidity and wind are the features.
[Dataset Visual Representation](https://miro.medium.com/v2/resize:fit:720/format:webp/1*I0ERfwsbpdnnUMVOSN1cng.png)

In [11]:
#importing the libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

#making the dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
original_df = pd.DataFrame(dataset_dict)

print(original_df)

     Outlook  Temperature  Humidity   Wind Play
0      sunny         85.0      85.0  False   No
1      sunny         80.0      90.0   True   No
2   overcast         83.0      78.0  False  Yes
3      rainy         70.0      96.0  False  Yes
4      rainy         68.0      80.0  False  Yes
5      rainy         65.0      70.0   True   No
6   overcast         64.0      65.0   True  Yes
7      sunny         72.0      95.0  False   No
8      sunny         69.0      70.0  False  Yes
9      rainy         75.0      80.0  False  Yes
10     sunny         75.0      70.0   True  Yes
11  overcast         72.0      90.0   True  Yes
12  overcast         81.0      75.0  False  Yes
13     rainy         71.0      80.0   True   No
14     sunny         81.0      88.0   True   No
15  overcast         74.0      92.0  False  Yes
16     rainy         76.0      85.0  False  Yes
17     sunny         78.0      75.0   True   No
18     sunny         82.0      92.0  False   No
19     rainy         67.0      90.0   Tr

In [12]:
original_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Outlook      28 non-null     object 
 1   Temperature  28 non-null     float64
 2   Humidity     28 non-null     float64
 3   Wind         28 non-null     bool   
 4   Play         28 non-null     object 
dtypes: bool(1), float64(2), object(2)
memory usage: 1.0+ KB


In [13]:
original_df.describe()

       Temperature   Humidity
count    28.000000  28.000000
mean     75.714286  80.607143
std       6.710175  10.354242
min      64.000000  60.000000
25%      70.750000  70.000000
50%      75.500000  80.000000
75%      81.000000  90.000000
max      88.000000  96.000000


In [14]:
from sklearn.preprocessing import StandardScaler

# Preprocess the data: get_dummies() one-hot encodes the categorical 'Outlook' column into 0/1 integer columns
df = pd.get_dummies(original_df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
# astype() converts the boolean 'Wind' column and the Yes/No 'Play' column to 0/1 integers
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
df = df[['sunny','rainy','overcast','Temperature','Humidity','Wind','Play']]

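# Split the data into a training set and a testing set (first half / second half, no shuffling)
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Work on explicit copies so the in-place scaling below cannot raise a SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

# Standardize the float features: fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])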

# Print results
print(pd.concat([X_train, y_train], axis=1).round(2), '\n')
print(pd.concat([X_test, y_test], axis=1).round(2), '\n')

    sunny  rainy  overcast  Temperature  Humidity  Wind  Play
0       1      0         0         1.80      0.50     0     0
1       1      0         0         1.02      1.02     1     0
2       0      0         1         1.49     -0.24     0     1
3       0      1         0        -0.56      1.66     0     1
4       0      1         0        -0.88     -0.03     0     1
5       0      1         0        -1.35     -1.08     1     0
6       0      0         1        -1.51     -1.61     1     1
7       1      0         0        -0.25      1.55     0     0
8       1      0         0        -0.72     -1.08     0     1
9       0      1         0         0.23     -0.03     0     1
10      1      0         0         0.23     -1.08     1     1
11      0      0         1        -0.25      1.02     1     1
12      0      0         1         1.17     -0.56     0     1
13      0      1         0        -0.41     -0.03     1     0 

    sunny  rainy  overcast  Temperature  Humidity  Wind  Play
14    
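
Standardizing matters for KNN because the distance is computed over all features at once, so a feature measured in large units can dominate the 0/1 columns. The sketch below illustrates this on a hypothetical two-column array (a 0/1 flag and a temperature); the values are made up for the illustration and are not rows of the golf dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [sunny_flag, temperature]
X_demo = np.array([[1.0, 85.0],
                   [0.0, 85.0],
                   [1.0, 75.0],
                   [0.0, 65.0]])

# Before scaling: a 10-degree gap outweighs a change of category ten times over
raw_flag_gap = np.linalg.norm(X_demo[0] - X_demo[1])  # rows differ only in the flag
raw_temp_gap = np.linalg.norm(X_demo[0] - X_demo[2])  # rows differ only in temperature

# Standardize the temperature column only, leaving the 0/1 flag untouched
X_scaled = X_demo.copy()
X_scaled[:, 1:] = StandardScaler().fit_transform(X_demo[:, 1:])
scaled_flag_gap = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_temp_gap = np.linalg.norm(X_scaled[0] - X_scaled[2])

print('raw   :', raw_flag_gap, raw_temp_gap)
print('scaled:', scaled_flag_gap, round(scaled_temp_gap, 2))
```

After scaling, the two gaps are of comparable size, so neither feature decides the neighbours on its own.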

**Training Steps**

1. Choosing a value of K, where K is the number of neighbours to be used for evaluation.



In [15]:
from sklearn.neighbors import KNeighborsClassifier

# Select the Number of Neighbors ('k')
k = 5
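
The notebook fixes k by hand. One common way to compare candidate values, shown here only as a sketch, is cross-validation with scikit-learn's `cross_val_score`. The synthetic data from `make_classification` and the candidate list are assumptions made for the illustration; they stand in for a real dataset and are not the golf data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the golf dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical candidate values; odd numbers avoid tied votes in binary problems
candidate_ks = [1, 3, 5, 7, 9]
mean_scores = {}
for candidate_k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=candidate_k)
    # Mean accuracy over 5 cross-validation folds
    mean_scores[candidate_k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print('Best k on this synthetic data:', best_k)
```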

2. Selecting a distance metric. In our case we are going to use the Euclidean metric.
For an explanation of the Euclidean metric:
https://miro.medium.com/v2/resize:fit:720/format:webp/1*tqKuQD4yb5NRXj-SGrEdyQ.png
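
The Euclidean distance between two points is the square root of the sum of their squared coordinate differences. A minimal sketch, using rows 0 and 1 of the standardized training set as they appear in the rounded printout above (the helper name `euclidean` is hypothetical):

```python
import math
import numpy as np

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Columns: sunny, rainy, overcast, Temperature, Humidity, Wind (rounded to 2 decimals)
row0 = [1, 0, 0, 1.80, 0.50, 0]
row1 = [1, 0, 0, 1.02, 1.02, 1]

manual = euclidean(row0, row1)
print(round(manual, 2))  # 1.37

# The same value via NumPy's vector norm
assert np.isclose(manual, np.linalg.norm(np.array(row0) - np.array(row1)))
```

Because the inputs here are rounded to two decimals, the result differs slightly from the value NumPy reports on the unrounded data.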


In [17]:
# Choose a Distance Metric
distance_metric = 'euclidean'

# Calculate the Euclidean distance between training rows ID 0 and ID 1 by hand
print(np.linalg.norm(X_train.loc[0].values - X_train.loc[1].values))

# k-NN stores all the training data points and their corresponding labels when it is fit

# Initialize the k-NN Classifier
knn_clf = KNeighborsClassifier(n_neighbors=k, metric=distance_metric)

# "Train" the kNN (although no real training happens)
knn_clf.fit(X_train, y_train)

1.3789269844186147


In [18]:
# 3. Calculate the distances using the distance metric chosen above

from scipy.spatial import distance

# Compute the distances from the first row of X_test to all rows in X_train
distances = distance.cdist(X_test.iloc[0:1], X_train, metric='euclidean')

# Create a DataFrame to display the distances
distance_df = pd.DataFrame({
    'Train_ID': X_train.index,
    'Distance': distances[0].round(2),
    'Label': y_train
}).set_index('Train_ID')

print(distance_df.sort_values(by='Distance'))

          Distance  Label
Train_ID                 
1             0.26      0
0             1.22      0
7             1.89      0
11            2.02      1
2             2.05      1
10            2.12      1
9             2.15      1
12            2.21      1
13            2.28      0
3             2.59      1
4             2.82      1
8             2.86      1
5             3.46      0
6             3.88      1
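
With the distances in hand, the prediction for this test row can be reproduced by hand: keep the k = 5 smallest distances and take the most common label among them. The sketch below rebuilds the table from the values printed above rather than from the notebook's variables.

```python
import pandas as pd

# Distances and labels copied from the printed table above
distance_df = pd.DataFrame({
    'Train_ID': [1, 0, 7, 11, 2, 10, 9, 12, 13, 3, 4, 8, 5, 6],
    'Distance': [0.26, 1.22, 1.89, 2.02, 2.05, 2.12, 2.15, 2.21, 2.28, 2.59, 2.82, 2.86, 3.46, 3.88],
    'Label':    [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1],
}).set_index('Train_ID')

k = 5
# The k rows with the smallest distance are the nearest neighbours
nearest = distance_df.nsmallest(k, 'Distance')
# Majority vote: the most frequent label among those neighbours
predicted = int(nearest['Label'].mode().iloc[0])

print(nearest.index.tolist(), predicted)  # [1, 0, 7, 11, 2] 0
```

Three of the five neighbours carry label 0, so the vote for the first test row is 0 (do not play).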


In [19]:
# 4. Identify the k nearest neighbours from the calculated distances, then assign their most common class as the predicted class

# Use the k-NN Classifier to make predictions
y_pred = knn_clf.predict(X_test)
print("Label    :",list(y_test))
print("Prediction:",list(y_pred))

Label    : [0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1]
Prediction: [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]


In [20]:
# Evaluating the accuracy of the model
from sklearn.metrics import accuracy_score

# Evaluation Phase
accuracy = accuracy_score(y_test, y_pred)
# Format as a percentage; this works whether accuracy is a NumPy float or a plain Python float
print(f'Accuracy: {accuracy:.2%}')

Accuracy: 71.43%
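
Accuracy alone does not show which kind of mistake the model makes. As a sketch, the labels and predictions printed above can be passed to scikit-learn's `confusion_matrix` to split the 14 test rows into true/false positives and negatives (the variable names here are chosen for the illustration).

```python
from sklearn.metrics import confusion_matrix

# Copied from the printed Label / Prediction lists above
y_true = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1]
y_hat = [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 2 3 1 8

# Accuracy is the share of correct predictions: (tp + tn) / total
print(f'Accuracy: {(tp + tn) / len(y_true):.2%}')  # Accuracy: 71.43%
```

Most of the errors are false positives: the model predicts "play" on three days when the label was "no".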


**Conclusion:**

KNN’s power lies in its ability to make predictions based on the proximity of data points, without requiring a complex training process. Keep the following factors in mind to get an accurate model:

1. A smaller K makes the model more sensitive to noise and more prone to overfitting, while a larger K smooths the decision boundary but can blur real class boundaries.
2. When picking a distance metric, consider your dataset and its feature types before making the choice.
3. Consider your weight function. The options include uniform (all neighbours are weighted equally) and distance (closer neighbours have greater influence).
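
The effect of the weight function can be seen on a tiny example. This is a sketch on hypothetical one-dimensional data, built so that one very close neighbour disagrees with two farther ones; `weights` is the `KNeighborsClassifier` parameter that switches between the two options.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D data: one class-1 point very close to the query, two class-0 points farther away
X_toy = [[0.0], [1.0], [1.2]]
y_toy = [1, 0, 0]
query = [[0.1]]

# Uniform weights: each of the 3 neighbours gets one vote, so class 0 wins 2 to 1
uniform_pred = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, y_toy).predict(query)[0]

# Distance weights: votes are weighted by 1/distance, so the close class-1 point dominates
distance_pred = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy).predict(query)[0]

print(int(uniform_pred), int(distance_pred))  # 0 1
```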





**Pros:**

1.   No training phase: new data can be incorporated without retraining.
2.   Simplicity: easy to understand.
3.   Versatility: used for both regression and classification.
4.   No assumptions about the underlying data distribution.
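
As a brief illustration of the regression use, scikit-learn provides `KNeighborsRegressor`, which averages the target values of the nearest neighbours instead of voting. The data below is hypothetical and only meant to show the mechanics.

```python
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D data with numeric targets
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0.0, 10.0, 20.0, 30.0]

reg = KNeighborsRegressor(n_neighbors=2).fit(X_toy, y_toy)

# The two nearest points to 1.4 are 1.0 and 2.0, so the prediction is the mean of 10 and 20
prediction = float(reg.predict([[1.4]])[0])
print(prediction)  # 15.0
```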



**Cons:**

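1.   Computationally expensive: a prediction may need the distance to every stored training point.
2.   Memory intensive: the whole training set has to be kept.
3.   Sensitive to irrelevant features: every feature contributes to the distance.
4.   Curse of dimensionality: distances become less informative as the number of features grows.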
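
The curse of dimensionality can be made concrete with a small experiment. The sketch below draws uniformly random points (an assumption made purely for illustration) and compares the nearest and farthest distance from a random query in 2 and in 1000 dimensions; the helper name `nearest_to_farthest_ratio` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the smallest to the largest distance from one random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratio_low = nearest_to_farthest_ratio(2)
ratio_high = nearest_to_farthest_ratio(1000)

# A ratio near 0 means the nearest point is much closer than the farthest;
# a ratio near 1 means all points are almost equally far away
print(round(float(ratio_low), 3), round(float(ratio_high), 3))
```

When the ratio approaches 1, the "nearest" neighbours are barely nearer than any other point, which is why KNN tends to degrade with many features.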

