BY: **RIYA JOSHI**

EMAIL: riya.joshi@somaiya.edu



---


### **Basic idea behind the KNN (K-Nearest Neighbours) algorithm**:

* This algorithm classifies a data point based on how its neighbors are classified.
* Any new case is classified by its similarity to the available, already labeled cases.
* Technically, the algorithm classifies an unknown item by finding its k nearest already-classified neighbors, measured over the item's attributes, and taking a majority vote of their labels.
* It can be used for regression as well as classification, but it is mostly used for classification problems.
* **Lazy Learning Algorithm**: It is a lazy learner because it has no explicit training phase. It simply stores the training dataset, and all computation is deferred until classification.
* The KNN algorithm is a good choice if you have a small, labeled dataset that is free of noise.
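
The voting idea in the bullets above can be illustrated on a handful of made-up 2-D points. This is only a minimal sketch: the points, the labels, and the helper name `knn_vote` are hypothetical and are not part of the dataset used in this notebook.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: two small clusters with labels 0 and 1
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def knn_vote(query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = np.linalg.norm(points - query, axis=1)  # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                 # indices of the k smallest distances
    return Counter(labels[nearest].tolist()).most_common(1)[0][0]

print(knn_vote(np.array([1.2, 1.4])))  # query placed near the first cluster
print(knn_vote(np.array([6.4, 6.2])))  # query placed near the second cluster
```

Nothing is learned ahead of time here: each call computes distances to all stored points, which is what "lazy" refers to.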





### **Applications of KNN**:
* Text mining
* Agriculture
* Finance
* Medical
* Facial recognition
* Recommendation systems (Amazon, Hulu, Netflix, etc.)




---




In [12]:
# Importing required libraries
import numpy as np
import pandas as pd

from statistics import mode  # only mode() is needed, for majority voting

from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings('ignore')

In [13]:
df = pd.read_csv('./smoke.csv')  # importing dataset
df.head()  # displaying first five records

Unnamed: 0.1,Unnamed: 0,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
0,0,1654733331,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0,0
1,1,1654733332,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1,0
2,2,1654733333,20.029,55.96,0,400,12374,18764,939.738,0.0,0.0,0.0,0.0,0.0,2,0
3,3,1654733334,20.044,55.28,0,400,12390,18849,939.736,0.0,0.0,0.0,0.0,0.0,3,0
4,4,1654733335,20.059,54.69,0,400,12403,18921,939.744,0.0,0.0,0.0,0.0,0.0,4,0


In [14]:
# replacing null values with 0
df = df.fillna(value=0)

In [15]:
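# selecting the feature and target columns; .copy() gives an independent frame
# so that the column assignments below do not act on a view of df
new_df = df[['UTC', 'Temperature[C]', 'Humidity[%]', 'TVOC[ppb]', 'eCO2[ppm]',
             'Raw H2', 'Raw Ethanol', 'Pressure[hPa]', 'PM1.0', 'PM2.5',
             'NC0.5', 'NC1.0', 'NC2.5', 'CNT', 'Fire Alarm']].copy()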
new_df.head()

Unnamed: 0,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
0,1654733331,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0,0
1,1654733332,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1,0
2,1654733333,20.029,55.96,0,400,12374,18764,939.738,0.0,0.0,0.0,0.0,0.0,2,0
3,1654733334,20.044,55.28,0,400,12390,18849,939.736,0.0,0.0,0.0,0.0,0.0,3,0
4,1654733335,20.059,54.69,0,400,12403,18921,939.744,0.0,0.0,0.0,0.0,0.0,4,0


In [16]:
# casting 'Fire Alarm' to string so that LabelEncoder treats it as a categorical column
new_df['Fire Alarm'] = new_df['Fire Alarm'].astype(dtype='string', copy=True)

In [17]:
# converting categorical column 'Fire Alarm' to a numerical column with LabelEncoder
new_df['Fire Alarm'] = LabelEncoder().fit_transform(new_df['Fire Alarm'])

In [18]:
# changing data type of whole dataset to int
# (fractional parts, e.g. in Temperature[C] and Humidity[%], are truncated)
new_df = new_df.astype(int)

In [19]:
# splitting dataset into train and test sets in a 70:30 ratio (in row order, without shuffling)

# Defining train size
train_size = int(0.7 * len(new_df))

# Splitting dataset
train_set = new_df[:train_size]
test_set = new_df[train_size:]

In [20]:
# separating train_set into X and y
X_train = train_set.drop('Fire Alarm', axis=1)
y_train = train_set['Fire Alarm']

# separating test_set into X and y
X_test = test_set.drop('Fire Alarm', axis=1)
y_test = test_set['Fire Alarm']
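
The split above takes the first 70% of the rows for training and the remaining rows for testing, in file order. A common alternative is a shuffled, stratified split. The sketch below shows one way to do that with scikit-learn's `train_test_split` on a small made-up frame; `toy_df`, its column names, and `random_state=42` are illustrative assumptions, not the notebook's data or settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with an imbalanced binary label (30 zeros, 70 ones)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "label": np.repeat([0, 1], [30, 70]),
})

X = toy_df.drop("label", axis=1)
y = toy_df["label"]

# stratify=y keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of label 1 in each part
```

Whether shuffling is appropriate depends on the data. The readings here are ordered by `UTC`, so a row-order split and a shuffled split answer slightly different questions.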

# **KNN**

**How to Find the Ideal K?**

1- Fit a KNN classifier for each of several odd values of K (an odd K avoids tied votes in binary classification).

2- Generate predictions with each fitted model.

3- Evaluate each model's performance using the predictions from step 2.

4- Compare the results across the models and choose the K with the lowest error.
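
The four steps above can be sketched as a simple loop. This is an illustration only: it uses scikit-learn's `KNeighborsClassifier` on made-up Gaussian data, and the names `X_fit`, `X_val`, `errors`, and `best_k`, as well as the range of K values, are assumptions rather than anything taken from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: two Gaussian blobs standing in for a real dataset
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out the last 60 rows for validation
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_fit, y_fit = X[:140], y[:140]
X_val, y_val = X[140:], y[140:]

errors = {}
for k in range(1, 16, 2):                                # step 1: odd values of K
    model = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
    preds = model.predict(X_val)                         # step 2: predictions
    errors[k] = 1.0 - accuracy_score(y_val, preds)       # step 3: error rate

best_k = min(errors, key=errors.get)                     # step 4: lowest error
print(errors)
print("best k:", best_k)
```

The error is measured on held-out rows that were not used for fitting, so the chosen K is not tuned on the same rows it is judged on.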

In [21]:
def euclidean_dist(p1, p2):
    """Return the Euclidean distance between two points given as NumPy arrays."""
    dist = np.sqrt(np.sum((p1 - p2) ** 2))
    return dist


def predict(x_train, y_train, x_input, k):
    """Predict a label for each row of x_input by majority vote of its k nearest training rows."""
    output_labels = []

    # Loop through the datapoints to be classified
    for i in x_input:

        # List to store distances from this datapoint to every training datapoint
        point_dist = []

        # Loop through each training datapoint
        for j in range(len(x_train)):
            # Calculating the distance to the j-th training datapoint
            distance = euclidean_dist(np.array(x_train[j, :]), i)
            point_dist.append(distance)
        point_dist = np.array(point_dist)

        # Sorting the distances while preserving the index
        # and keeping the indices of the K nearest datapoints
        dist = np.argsort(point_dist)[:k]

        # Labels of the K datapoints from above
        labels = y_train[dist]

        # Majority voting
        lab = mode(labels)
        output_labels.append(lab)

    return output_labels
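
The inner loop above computes one distance at a time. As a rough sketch of an alternative, the same distances can be computed for all training rows at once with NumPy broadcasting. The function names `distances_loop` and `distances_vectorized` and the random data below are hypothetical and only serve to check that both approaches agree.

```python
import numpy as np

def distances_loop(x_train, point):
    """Euclidean distance from point to each training row, one row at a time."""
    return np.array([np.sqrt(np.sum((row - point) ** 2)) for row in x_train])

def distances_vectorized(x_train, point):
    """Same distances, computed for all rows at once via broadcasting."""
    # (x_train - point) subtracts point from every row; axis=1 sums within each row
    return np.sqrt(np.sum((x_train - point) ** 2, axis=1))

# Hypothetical data: 50 training rows with 4 features, and one query point
rng = np.random.default_rng(0)
x_train = rng.normal(size=(50, 4))
point = rng.normal(size=4)

print(np.allclose(distances_loop(x_train, point),
                  distances_vectorized(x_train, point)))
```

The vectorized form avoids a Python-level loop over training rows, which usually matters once the training set has tens of thousands of rows.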

In [22]:
# Predicting
y_pred = predict(X_train.values, y_train.values, X_test.values, 7)  # k = 7
 
# Calculating accuracy
accuracy_score(y_test.values, y_pred)

In [None]:
confusion_matrix(y_test,y_pred)

array([[ 132, 1029],
       [ 319, 4004]])

**Comparing our model with the scikit-learn model**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Minkowski distance with p=2 is the Euclidean distance used above
model = KNeighborsClassifier(n_neighbors=7, metric='minkowski', p=2)
model.fit(X_train, y_train)
preds = model.predict(X_test)

In [None]:
# Calculating accuracy
accuracy_score(y_test, preds)

0.7543763676148797

In [None]:
confusion_matrix(y_test,preds)

array([[ 133, 1028],
       [ 319, 4004]])
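
Accuracy alone hides how the errors are distributed across the two classes. As a small sketch, the per-class figures can be derived directly from the scikit-learn confusion matrix printed above. The matrix values are copied from that output; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the scikit-learn output above
# (scikit-learn convention: rows are true labels, columns are predicted labels)
cm = np.array([[133, 1028],
               [319, 4004]])

accuracy = np.trace(cm) / cm.sum()                   # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)      # correct / true count, per class
precision_per_class = np.diag(cm) / cm.sum(axis=0)   # correct / predicted count, per class

print(accuracy)
print(recall_per_class)
print(precision_per_class)
```

The accuracy computed this way matches the `accuracy_score` value reported above, while the recall for class 0 comes out far lower than for class 1.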