#### K-Nearest Neighbors

> The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.

![k-Nearest-Neighbour](./img/knn.png)

Notice in the image above that most of the time, similar data points are close to each other. 

The KNN algorithm hinges on this assumption being true enough for the algorithm to be useful. 

KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with mathematics many of us learned in school: calculating the distance between points on a graph.
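To make the distance idea concrete, here is a minimal sketch of the straight-line (Euclidean) distance between two points. The helper name `euclidean_distance` and the sample points are our own illustration. scikit-learn's KNN estimators use the Minkowski metric with `p=2` by default, which is the same Euclidean distance.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical points on a 2D graph
point_a = (1.0, 2.0)
point_b = (4.0, 6.0)

# sqrt((1 - 4)^2 + (2 - 6)^2) = sqrt(9 + 16)
print(euclidean_distance(point_a, point_b))  # 5.0
```

The same function works for any number of features, which is how the idea carries over from a 2D graph to a table with many columns.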

#### When do we use the KNN algorithm?

KNN can be used for both classification and regression predictive problems. 

However, in industry it is more widely used for classification problems.

##### How does the KNN algorithm work?

Let’s take a simple case to understand this algorithm. The following is a spread of red circles (RC) and green squares (GS):

![k-Nearest-Neighbour](./img/knn-scenario1.png)

We want to find out the class of the blue star (BS). BS can be either RC or GS and nothing else.

The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let’s say K = 3. We then draw a circle centered on BS that is just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:

![k-Nearest-Neighbour](./img/knn-scenario2.png)

The three closest points to BS are all RC. Hence, with good confidence, we can say that BS belongs to the class RC. Here the choice was obvious, as all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm. Next, we will look at the factors to consider when choosing the best K.
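The voting procedure described above can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: the function name `knn_predict` and the toy coordinates are hypothetical, chosen so that the blue star sits near the red circles, and they are not read from the figures.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training label with its distance to the query point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the votes and return the most common label
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three red circles (RC) and three green squares (GS)
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["RC", "RC", "RC", "GS", "GS", "GS"]
blue_star = (2.0, 2.0)

print(knn_predict(train_points, train_labels, blue_star, k=3))  # RC
```

With K = 3 all three votes go to RC, mirroring the scenario in the diagram. A library implementation such as `KNeighborsClassifier` follows the same idea but uses faster neighbor search.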

Exercise: Build a simple KNN ML model. Experiment with different values of K (the `n_neighbors` parameter).

In [3]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import f1_score

In [4]:
#import data
df = pd.read_csv('./data/winequality-red.csv')
df.head()

|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [5]:
df.describe()

|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


In [6]:
df.quality.unique()

array([5, 6, 7, 4, 8, 3])

In [7]:
#assign X (features) and y (target)

y = df.quality
X = df.drop(['quality'],axis=1)

In [8]:
#split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [9]:
#build the model

# --define
# --fit
# --predict
# --evaluate
knc = KNeighborsClassifier(n_neighbors=3) #define the model

#fit the model
knc.fit(X_train,y_train)

#predict with model
knc_preds = knc.predict(X_test)

In [10]:
#evaluate the model
#note: for single-label multiclass data, micro-averaged f1 equals accuracy
knc_f1 = f1_score(y_test,knc_preds,average = 'micro')

knc_f1_train = f1_score(y_train,knc.predict(X_train),average = 'micro')

In [11]:
print(f'The f1 score of KNN Classifier for test data is {np.round(knc_f1,2)}')
print(f'The f1 score of KNN Classifier for train data is {np.round(knc_f1_train,2)}')
print('.')
print('.')

The f1 score of KNN Classifier for test data is 0.45
The f1 score of KNN Classifier for train data is 0.74
.
.
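The gap between the train score (0.74) and the test score (0.45) suggests that K = 3 fits the training data much more closely than unseen data. As the exercise proposes, one way to explore this is to loop over several values of `n_neighbors` and compare train and test scores side by side. The sketch below is self-contained, so it uses synthetic data from `make_classification` as a stand-in (an assumption on our part). The names `X_demo`, `y_demo` and `scores` are hypothetical. In the notebook you would reuse `X_train`, `X_test`, `y_train` and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 3 classes, 10 numeric features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit one model per K and record (train f1, test f1)
scores = {}
for k in (1, 3, 5, 7, 9, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (
        f1_score(y_tr, model.predict(X_tr), average="micro"),
        f1_score(y_te, model.predict(X_te), average="micro"),
    )

for k, (train_f1, test_f1) in scores.items():
    print(f"k={k:2d}  train f1={train_f1:.2f}  test f1={test_f1:.2f}")
```

With K = 1 every training point is its own nearest neighbor, so the train score is perfect and tells us little. The test score is the one to watch when picking K.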


##### Using the KNN Regressor

In [12]:
#define the model
knr = KNeighborsRegressor(n_neighbors=3)

#fit the model
knr.fit(X_train,y_train)

#predict
knr_preds = knr.predict(X_test)

In [13]:
#round the continuous predictions to the nearest integer so they can be scored as quality labels
knr_preds = np.round(knr_preds,0)

In [14]:
#evaluate the model
knr_f1 = f1_score(y_test,knr_preds,average = 'micro')

knr_f1_train = f1_score(y_train,np.round(knr.predict(X_train),0),average = 'micro')

In [15]:
print(f'The f1 score of KNN Regressor for test data is {np.round(knr_f1,2)}')
print(f'The f1 score of KNN Regressor for train data is {np.round(knr_f1_train,2)}')
print('.')
print('.')

The f1 score of KNN Regressor for test data is 0.49
The f1 score of KNN Regressor for train data is 0.69
.
.
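The regressor differs from the classifier in one step: instead of taking a vote, it averages the target values of the K nearest neighbors (with the default uniform weights). That is why its predictions are fractional and had to be rounded before computing an f1 score. A minimal NumPy sketch of this idea follows. The function `knn_regress` and the single-feature toy data are hypothetical and are not taken from the wine dataset.

```python
import numpy as np

def knn_regress(train_X, train_y, query, k=3):
    """Predict the mean target of the k training rows closest to `query`."""
    # Euclidean distance from the query to every training row
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return train_y[nearest].mean()

# Hypothetical toy data: one feature and an integer quality score
train_X = np.array([[9.4], [9.8], [10.0], [12.5], [13.0]])
train_y = np.array([5, 5, 6, 7, 8])
query = np.array([9.9])

raw = knn_regress(train_X, train_y, query, k=3)
print(raw)            # mean of the three nearest targets (5, 5, 6)
print(np.round(raw))  # rounded to the nearest quality label: 5.0
```

Rounding turns the regression output back into a label, but it treats quality as an ordered scale, whereas the classifier treats each quality value as an unrelated class.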
