# KNN Classification using Scikit-learn

LINK: https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn

K-Nearest Neighbors (KNN) is a simple, easy-to-understand, and versatile algorithm, and one of the most widely used in machine learning.

KNN is used in a variety of applications, such as finance, healthcare, political science, handwriting detection, image recognition, and video recognition.

In credit rating, financial institutions predict the credit rating of customers. In loan disbursement, banks predict whether a loan is safe or risky.

In political science, potential voters are classified into two classes: will vote or won't vote.

The KNN algorithm is used for both classification and regression problems. It is based on a feature similarity approach.

# K-Nearest Neighbors

KNN is a non-parametric and lazy learning algorithm.

Non-parametric means that it makes no assumption about the underlying data distribution. In other words, the model structure is determined from the dataset.

This is very helpful in practice, because most real-world datasets do not follow theoretical mathematical assumptions. Lazy means that the algorithm does not use the training data points to build a model ahead of time.

Instead, all training data is used in the testing phase. This makes training faster and the testing phase slower and costlier, in terms of both time and memory.

In the worst case, KNN needs more time because it scans all data points, and more memory because it must store all the training data.

# How does the KNN algorithm work?

In KNN, K is the number of nearest neighbors. The number of neighbors is the core deciding factor. K is generally an odd number if the number of classes is 2. When K=1, the algorithm is known as the nearest neighbor algorithm. This is the simplest case. Suppose P1 is the point whose label needs to be predicted. First, you find the one closest point to P1, and then the label of that nearest point is assigned to P1.

<img src="http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1531424125/Knn_k1_z96jba.png">

Now suppose K is greater than 1 and P1 is again the point whose label needs to be predicted. First, you find the k closest points to P1, and then you classify P1 by a majority vote of its k neighbors. Each neighbor votes for its class, and the class with the most votes is taken as the prediction. To find the closest points, you compute the distance between points using a distance measure such as Euclidean distance, Hamming distance, Manhattan distance, or Minkowski distance. KNN has the following basic steps:

1. Calculate distance
2. Find closest neighbors
3. Vote for labels
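
The three steps above can be sketched in plain Python. This is only a minimal illustration, not how scikit-learn implements KNN; the function names `euclidean` and `knn_predict` and the toy points are made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: calculate the distance from the query to every training point.
    distances = [
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: find the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: vote; the most common label among the neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters, labeled "A" and "B".
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, query=(2, 2), k=3))  # A
print(knn_predict(points, labels, query=(6, 5), k=3))  # B
```

Note that there is no separate training step here: the function keeps all training points and does its work at prediction time, which is what makes KNN a lazy learner.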

<img src="http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1531424125/KNN_final1_ibdm8a.png">


 

# How do we decide the number of neighbors in KNN?



The number of neighbors (K) in KNN is a hyperparameter that you need to choose at the time of model building. You can think of K as a controlling variable for the prediction model.

Research has shown that no single number of neighbors is optimal for all kinds of datasets. Each dataset has its own requirements. With a small number of neighbors, noise has a higher influence on the result, while a large number of neighbors makes prediction computationally expensive. Research has also shown that a small number of neighbors gives the most flexible fit, which has low bias but high variance, while a large number of neighbors gives a smoother decision boundary, which means lower variance but higher bias.

Generally, data scientists choose an odd number for K if the number of classes is even. You can also build the model with different values of K and compare their performance. You can also try the elbow method here.
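
One way to compare different values of K is cross-validation. The sketch below is an illustration under assumptions: it uses synthetic data from `make_classification` as a stand-in dataset, and the range of K values and the 5 folds are arbitrary choices, not recommendations from this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for this sketch, not from the tutorial).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

# Mean 5-fold cross-validated accuracy for each odd K from 1 to 15.
scores_by_k = {}
for k in range(1, 16, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores_by_k[k] = cross_val_score(knn, X, y, cv=5).mean()

for k, score in scores_by_k.items():
    print(f"k={k:2d}  mean accuracy={score:.3f}")

# The K with the highest mean accuracy on this particular data.
best_k = max(scores_by_k, key=scores_by_k.get)
print("Best k on this data:", best_k)
```

Plotting these scores (or the corresponding error rates) against K gives the kind of curve that the elbow method inspects.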

<img src="http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1531424125/KNN_final_a1mrv9.png">




# Building the classifier

In [1]:
#STEP-1

# Defining dataset
# Assigning features and label variables
# First Feature
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
# Second Feature
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']

# Label or target variable
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']

In [5]:
#STEP-2

#Encoding data columns
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
print(weather_encoded)

# Converting the remaining columns into numbers.
# The same LabelEncoder object is refit on each column: every fit_transform call
# replaces the previous mapping, so after the last call `le` only remembers the classes of `play`.

temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)

[2 2 0 1 1 1 0 2 2 1 2 0 0 1]


In [4]:
#STEP_3

# Combining features: merge the two encoded columns into a single list of (weather, temp) tuples using the "zip" function

features=list(zip(weather_encoded,temp_encoded))

The encoded labels are as follows (LabelEncoder assigns codes in alphabetical order):

    1.   a) Cool --> 0      b) Hot --> 1      c) Mild --> 2    (TEMP)
    2.   a) Overcast --> 0  b) Rainy --> 1    c) Sunny --> 2   (WEATHER)
    3.   a) No --> 0        b) Yes --> 1                       (OUTCOME)
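
Instead of working out the mapping by hand, you can read it from the encoder itself. The sketch below is a small illustration; the variable names `temp_encoder` and `mapping` are made up for this example, and using one encoder per column is a design choice of the sketch, not something the steps above require.

```python
from sklearn.preprocessing import LabelEncoder

temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']

# A dedicated encoder for this column, so its mapping is not overwritten
# by a later fit on another column.
temp_encoder = LabelEncoder()
temp_encoded = temp_encoder.fit_transform(temp)

# classes_ holds the sorted original values; the position of each value is its code.
mapping = {str(value): code for code, value in enumerate(temp_encoder.classes_)}
print(mapping)  # {'Cool': 0, 'Hot': 1, 'Mild': 2}

# inverse_transform turns codes back into the original strings.
print(temp_encoder.inverse_transform([0, 1, 2]))
```

Keeping a separate encoder per column also lets you decode predictions or features later with `inverse_transform`.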

In [13]:
#STEP_4

#Generating Model

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)

# Train the model using the training sets
model.fit(features,label)

# Predict output for a new observation
predicted = model.predict([[0, 0]])  # [weather, temp]: 0 = Overcast, 0 = Cool

if predicted[0] == 0:
    print("NO PLAYING")
else:
    print("Play")


Play
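
To see why the model answers "Play", you can ask it which training points it used. The sketch below rebuilds the same model from the already-encoded values (written out by hand here so the block is self-contained) and uses `kneighbors` and `predict_proba`, which are part of `KNeighborsClassifier`.

```python
from sklearn.neighbors import KNeighborsClassifier

# Encoded (weather, temp) pairs and labels, as produced by LabelEncoder on the lists above.
features = [(2, 1), (2, 1), (0, 1), (1, 2), (1, 0), (1, 0), (0, 0),
            (2, 2), (2, 0), (1, 2), (2, 2), (0, 2), (0, 1), (1, 2)]
label = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(features, label)

query = [[0, 0]]  # Overcast, Cool

# Distances to, and row indices of, the 3 nearest training points.
distances, indices = model.kneighbors(query)
print("Distances to the 3 nearest neighbors:", distances[0])
print("Their labels:", [label[i] for i in indices[0]])

# Share of neighbors voting for each class, in the order [No, Yes].
print("Class probabilities:", model.predict_proba(query)[0])
print("Prediction:", model.predict(query)[0])  # 1, which means "Yes"
```

Several training points are tied at distance 1 from the query, so which of them are picked as neighbors can vary, but the majority vote comes out as "Yes" either way.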


# KNN with Multiple Classes


Next, you will learn about KNN with multiple classes.

In the model building part, you can use the wine dataset, which is a very famous multi-class classification problem. This data is the result of a chemical analysis of wines grown in the same region in Italy using three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The dataset comprises 13 features ('alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline') and a target (type of cultivars).

This data has three types of cultivar classes: 'class_0', 'class_1', and 'class_2'. Here, you can build a model to classify the type of cultivar. The dataset is available in the scikit-learn library, or you can also download it from the UCI Machine Learning Repository.

# STEP-1

In [14]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
wine = datasets.load_wine()

# STEP-2

In [25]:
# print the names of the features
print(wine.feature_names)
print()
# print the class names (class_0, class_1, class_2)
print(wine.target_names)

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

['class_0' 'class_1' 'class_2']


# STEP-3

In [28]:
# print the wine data (first 5 records)
print("the wine data")
print(wine.data[0:5])
print()
# print the wine labels (0: class_0, 1: class_1, 2: class_2)
print("the wine labels")
print(wine.target)

the wine data
[[1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
  2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 1.120e+01 1.000e+02 2.650e+00 2.760e+00
  2.600e-01 1.280e+00 4.380e+00 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 1.860e+01 1.010e+02 2.800e+00 3.240e+00
  3.000e-01 2.810e+00 5.680e+00 1.030e+00 3.170e+00 1.185e+03]
 [1.437e+01 1.950e+00 2.500e+00 1.680e+01 1.130e+02 3.850e+00 3.490e+00
  2.400e-01 2.180e+00 7.800e+00 8.600e-01 3.450e+00 1.480e+03]
 [1.324e+01 2.590e+00 2.870e+00 2.100e+01 1.180e+02 2.800e+00 2.690e+00
  3.900e-01 1.820e+00 4.320e+00 1.040e+00 2.930e+00 7.350e+02]]

the wine labels
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

# STEP-4

In [29]:
# print the shape of the data (features)
print(wine.data.shape)
# print the shape of the target (labels)
print(wine.target.shape)

(178, 13)
(178,)
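
The shapes show how many samples there are, but not how they are spread across the three classes. A quick way to check, sketched here with NumPy's `bincount` (an addition to the steps in this tutorial), is:

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()

# Count how many samples belong to each class label (0, 1, 2).
counts = np.bincount(wine.target)
for name, count in zip(wine.target_names, counts):
    print(name, count)
# class_0 59
# class_1 71
# class_2 48
```

The classes are somewhat unequal in size, which is worth keeping in mind when splitting the data and reading the accuracy.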


# STEP-5

In [30]:
# Splitting Data

# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3) # 70% training and 30% test
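
The split above is random, so each run produces a different training and test set, and therefore different predictions and accuracy. A possible variation, shown here only as a sketch, fixes `random_state` for reproducibility and passes `stratify` so that both sets keep roughly the same class proportions. The seed value 42 is arbitrary, and the outputs shown in the following steps come from the unseeded split, not from this one.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()

X_train, X_test, y_train, y_test = train_test_split(
    wine.data,
    wine.target,
    test_size=0.3,
    random_state=42,       # arbitrary seed, chosen only for this sketch
    stratify=wine.target,  # keep class proportions similar in both sets
)

print(X_train.shape, X_test.shape)  # (124, 13) (54, 13)
```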

# STEP-6

In [35]:
# Generating Model for K=8

# Import the k-nearest neighbors classifier
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier with 8 neighbors
knn = KNeighborsClassifier(n_neighbors=8)

#Train the model using the training sets
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = knn.predict(X_test)

print("predicted values are as \n",y_pred)

predicted values are as 
 [1 1 0 0 1 0 0 0 1 1 1 1 2 0 0 0 0 1 2 2 1 0 0 1 2 2 0 0 1 0 0 0 2 1 1 1 1
 2 2 0 2 1 0 2 0 1 2 2 1 2 0 0 1 1]


# STEP-7

In [33]:
# Model evaluation for the classifier trained above

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7592592592592593
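
Because KNN relies on distances, features with large numeric ranges (such as `proline`, with values in the hundreds or thousands) can dominate features with small ranges (such as `hue`, with values around 1). One common remedy, which is an addition here and not part of the steps above, is to standardize the features before fitting. The sketch below compares the same classifier with and without `StandardScaler`; the seed 42 is arbitrary, and the exact accuracies depend on the split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42  # arbitrary seed
)

# The same classifier, with and without standardizing the features first.
plain_knn = KNeighborsClassifier(n_neighbors=8)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=8))

plain_knn.fit(X_train, y_train)
scaled_knn.fit(X_train, y_train)

# score() returns the mean accuracy on the given test data.
plain_accuracy = plain_knn.score(X_test, y_test)
scaled_accuracy = scaled_knn.score(X_test, y_test)
print("Accuracy without scaling:", plain_accuracy)
print("Accuracy with scaling:   ", scaled_accuracy)
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training set only and then reused on the test set.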
