## Q1. What is the main difference between the Euclidean distance metric and the Manhattan distance metric in KNN? How might this difference affect the performance of a KNN classifier or regressor?

Ans- Distance metrics are a key part of several machine learning algorithms. They are used in both supervised and unsupervised learning, generally to measure how similar two data points are. A well-chosen distance metric improves the performance of a machine learning model, whether for classification or for clustering.
Types of Distance Metrics in Machine Learning

1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance

Euclidean Distance
Euclidean Distance is the straight-line (shortest) distance between two points. It is the square root of the sum of squared differences between corresponding elements.
The Euclidean distance is the L2 norm of the difference between two vectors. Cosine similarity, by contrast, is the dot product of two vectors divided by the product of their magnitudes, so it compares direction rather than distance.
Formula for Euclidean Distance

In 2 dimensions, the formula is:

$$d = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$$

We can generalize this for an n-dimensional space as:

$$d = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

Where,

n = number of dimensions
$p_i$, $q_i$ = the i-th coordinates of the data points p and q
Let’s code Euclidean Distance in Python. This will give you a better understanding of how this distance metric works.

We will first import the required library. I will be using SciPy, which provides ready-made implementations of most of the common distance functions:

In [3]:
# importing the library
from scipy.spatial import distance

# defining the points
point_1=(1,2,3)
point_2=(4,5,6)
point_1, point_2

((1, 2, 3), (4, 5, 6))
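The cell above only defines the two points. A minimal sketch of the distance computation itself, implementing the formula directly with the standard library, could look like this (the helper name `euclidean_distance` is an illustrative assumption, not a SciPy function):

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared differences between p and q."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


point_1 = (1, 2, 3)
point_2 = (4, 5, 6)

# Each coordinate differs by 3, so the result is sqrt(9 + 9 + 9) = sqrt(27)
euclidean = euclidean_distance(point_1, point_2)
print('Euclidean Distance b/w', point_1, 'and', point_2, 'is: ', euclidean)
```

SciPy's `distance.euclidean(point_1, point_2)` should return the same value, √27 ≈ 5.196.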

Manhattan Distance
Manhattan Distance is the sum of absolute differences between points across all the dimensions.
Formula for Manhattan Distance
In a 2-dimensional space, we take the sum of the absolute differences in both the x and y directions:

$$d = |p_1 - q_1| + |p_2 - q_2|$$

In an n-dimensional space this generalizes to:

$$d = \sum_{i=1}^{n} |p_i - q_i|$$

Where,

n = number of dimensions
$p_i$, $q_i$ = the i-th coordinates of the data points p and q
Now, we will calculate the Manhattan Distance between the two points:

In [6]:
# computing the manhattan distance
manhattan_distance = distance.cityblock(point_1, point_2)
print('Manhattan Distance b/w', point_1, 'and', point_2, 'is: ', manhattan_distance)

Manhattan Distance b/w (1, 2, 3) and (4, 5, 6) is:  9


Note that Manhattan Distance is also known as city block distance. SciPy has a function called cityblock that returns the Manhattan Distance between two points.

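The main difference is how the per-feature differences are combined: Euclidean distance squares them, so a large gap in a single feature dominates the result, while Manhattan distance adds absolute differences, so every feature contributes linearly. As a result, Manhattan distance is less sensitive to outliers in individual features and is often preferred for high-dimensional data, whereas Euclidean distance suits low-dimensional, continuous features on comparable scales. Because the metric decides which points count as nearest neighbors, this choice affects both kinds of model: KNN regression predicts the value of the output variable as a local average over the neighbors, and KNN classification predicts the class to which the output variable belongs from the local class probabilities.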
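To see that the two metrics can disagree about which neighbor is nearest, here is a small sketch. The query point and the two candidate points are made-up illustrative values, and the helper functions are hypothetical names written for this example:

```python
def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


query = (0, 0)
candidates = {"A": (3, 3), "B": (0, 5)}

# A: Euclidean sqrt(18) ~ 4.24, Manhattan 6
# B: Euclidean 5,               Manhattan 5
nearest_euclidean = min(candidates, key=lambda name: euclidean(query, candidates[name]))
nearest_manhattan = min(candidates, key=lambda name: manhattan(query, candidates[name]))

print("Nearest by Euclidean:", nearest_euclidean)
print("Nearest by Manhattan:", nearest_manhattan)
```

With K=1, a KNN model would therefore copy the label of A under Euclidean distance but the label of B under Manhattan distance.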

## Q2. How do you choose the optimal value of k for a KNN classifier or regressor? What techniques can be used to determine the optimal k value?

Ans- The choice of k depends largely on the input data: data with more outliers or noise will likely perform better with higher values of k. For binary classification, an odd k is recommended to avoid tied votes, and cross-validation can help you choose the best k for your dataset.

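The K value is the number of nearest neighbors consulted for each prediction. KNN has no real training phase: it simply stores the labeled training points, and for every test point it computes the distance to each of them. Because all of this work is deferred to prediction time, KNN is called a lazy learning algorithm, and prediction becomes computationally expensive on large training sets.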
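A minimal from-scratch sketch of this behavior is shown below. The class `SimpleKNNClassifier` and the toy data are hypothetical and exist only for illustration; they are not part of scikit-learn:

```python
import math
from collections import Counter


class SimpleKNNClassifier:
    """Hypothetical minimal KNN classifier, for illustration only."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only memorizes the data; no distances are computed yet.
        self.X_train = list(X)
        self.y_train = list(y)
        return self

    def predict_one(self, x):
        # All distance computations happen here, at prediction time.
        distances = [(math.dist(x, xi), yi)
                     for xi, yi in zip(self.X_train, self.y_train)]
        distances.sort(key=lambda pair: pair[0])
        k_labels = [label for _, label in distances[: self.k]]
        # Majority vote among the k nearest neighbors.
        return Counter(k_labels).most_common(1)[0][0]


# Toy data: three points near (1, 1) labeled "A", three near (6, 6) labeled "B"
X_train = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
y_train = ["A", "A", "A", "B", "B", "B"]

clf = SimpleKNNClassifier(k=3).fit(X_train, y_train)
pred_near_a = clf.predict_one((2, 2))
pred_near_b = clf.predict_one((6, 5))
print(pred_near_a, pred_near_b)
```

Note that `fit` does no work beyond storing the data, while every call to `predict_one` scans the whole training set.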

For example, suppose most of a test point's three nearest neighbors belong to class B, while most of its seven nearest neighbors belong to class A. With K=3 we predict that the test input belongs to class B, and with K=7 we predict that it belongs to class A.
This shows how strongly the K value can affect KNN performance.
Then how do we select the optimal K value?
There is no closed-form statistical method for finding the most favorable value of K, so it is chosen empirically.
Start from an initial K value and evaluate the model.
Choosing a small value of K leads to unstable decision boundaries that are sensitive to noise.
A larger K smooths the decision boundaries, although a K that is too large can blur genuine class boundaries and underfit.
Plot the error rate against K over a defined range, then choose the K value with the minimum error rate.
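The error-rate-versus-K procedure described above can be sketched as follows. This is one possible setup, not a prescribed method: it assumes scikit-learn's built-in digits dataset, 5-fold cross-validation, and odd K values from 1 to 15, all of which are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Odd values of K only (an assumption made for this example)
k_values = list(range(1, 16, 2))
error_rates = []

for k in k_values:
    # Mean 5-fold cross-validated accuracy for this K
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    error_rates.append(1 - scores.mean())

# Pick the K with the lowest cross-validated error rate
best_k = k_values[error_rates.index(min(error_rates))]
print("Error rates:", [round(e, 4) for e in error_rates])
print("Best K:", best_k)
```

Plotting `error_rates` against `k_values`, for instance with `matplotlib.pyplot.plot`, gives the error-rate curve from which the minimum can be read off.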

## Q3. How does the choice of distance metric affect the performance of a KNN classifier or regressor? In what situations might you choose one distance metric over the other?

## Q4. What are some common hyperparameters in KNN classifiers and regressors, and how do they affect the performance of the model? How might you go about tuning these hyperparameters to improve model performance?

Ans- k-Nearest Neighbours:
It is an algorithm which classifies a new data point based on its proximity to groups of other data points. The closer the new data point is to one group, the higher the likelihood of it being classified into that group.

Distance between data points is measured by distance metrics such as Euclidean distance, Manhattan distance, Minkowski distance, Mahalanobis distance, tangent distance, cosine distance and many more.

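For data points X and Y with n features, the Minkowski distance of order p is:

$$d(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Setting p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.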

kNN using Scikit-learn:
kNN hyper-parameters:
In machine learning, before we can use any algorithm, we need to choose the value of hyper-parameters for that model. In case of kNN, important hyper-parameters are:

1. n_neighbors: Number of neighbours in a neighbourhood.

2. weights: If set to uniform, all points in each neighbourhood have equal influence on the predicted class, i.e. the predicted class is the class with the highest number of points in the neighbourhood. If set to distance, closer neighbours have greater influence than neighbours further away: each neighbour's vote is weighted by the inverse of its distance, so the class with more points close to the new data point becomes the predicted class.

3. metric: The distance metric used to find the nearest neighbours, whatever the weights setting. The default value is minkowski, which with the default p=2 is the Euclidean distance. We can change the default value to use other distance metrics.

4. p: The power parameter for the minkowski metric. If p=1, then the distance metric is manhattan_distance. If p=2, then the distance metric is euclidean_distance. We can experiment with higher values of p if we want to.
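The role of p can be checked with a few lines of plain Python. This is a sketch of the Minkowski formula itself, not scikit-learn's internal implementation, and `minkowski_distance` is a hypothetical helper name:

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


x = (1, 2, 3)
y = (4, 5, 6)

d_p1 = minkowski_distance(x, y, 1)  # same as Manhattan distance: 3 + 3 + 3
d_p2 = minkowski_distance(x, y, 2)  # same as Euclidean distance: sqrt(27)
d_p3 = minkowski_distance(x, y, 3)  # cube root of 27 + 27 + 27

print(d_p1, d_p2, d_p3)
```

For a fixed pair of points, the distance does not increase as p grows, because large per-feature differences increasingly dominate the sum.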

# kNN hyper-parameters (shown with scikit-learn's default values)
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='minkowski', p=2)
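One common way to tune these hyper-parameters together is a cross-validated grid search. The sketch below is one possible setup rather than a recommended configuration: the candidate values in `param_grid`, the 5-fold setting, and the use of scikit-learn's digits dataset are all assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate values for each hyper-parameter (illustrative choices)
param_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = Manhattan, 2 = Euclidean (with the default minkowski metric)
}

# Try every combination with 5-fold cross-validation on the training set
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", best_params)
print("Test accuracy:", test_accuracy)
```

The held-out test set is used only once, after the search, so that the reported accuracy is not biased by the tuning itself.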

kNN classifier:
We will be building a classifier to classify handwritten digits into one of the classes from 0 to 9. The data comes from scikit-learn's built-in digits dataset (load_digits), a set of 1,797 8×8 pixel grayscale images of individual handwritten digits between 0 and 9. It is a much smaller relative of the MNIST database, whose 60,000 training images are 28×28 pixels.

In [1]:
# To load the digits image data
from sklearn.datasets import load_digits
# kNN Classifier
from sklearn.neighbors import KNeighborsClassifier
# Confusion matrix to check model performance
from sklearn.metrics import confusion_matrix
# To split data into training and testing set
from sklearn.model_selection import train_test_split
# For plotting digit
import matplotlib.pyplot as plt

Loading the digits data:

In [2]:
digits = load_digits()

Transforming data to use with kNN classifier:

In [3]:
# Number of images
n_samples = len(digits.images)
# Flattening each 8x8 image into a sequence of 64 pixel values
X = digits.images.reshape((n_samples, -1))
# Getting the already known targets for each image
y = digits.target

Creating our training and testing sets:

In [4]:
# Splitting data to train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Creating and training model:

In [5]:
# Creating model
clf = KNeighborsClassifier(n_neighbors=3)
# Training model
clf.fit(X_train, y_train)

Getting predictions for test data:

In [6]:
# Predictions for test data
predicted = clf.predict(X_test)

Comparing actual and predicted target values using confusion matrix:

In [7]:
# Print confusion matrix
confusion_matrix(y_test, predicted)

array([[37,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0, 42,  0,  0,  0,  1,  0,  0,  0,  0],
       [ 0,  0, 44,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  1, 44,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0, 37,  0,  0,  1,  0,  0],
       [ 0,  0,  0,  0,  0, 47,  0,  0,  0,  1],
       [ 0,  0,  0,  0,  0,  0, 52,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 48,  0,  0],
       [ 0,  0,  0,  2,  0,  0,  0,  0, 46,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0, 47]])

In the matrix, rows represent actual target values, where the first row is for label 0, the second for label 1, and so on. Similarly, columns represent predictions, where the first column is for label 0, the second for label 1, and so on.

Values along the diagonal of the matrix are the counts of samples that were predicted correctly.

Consider the value 2 in the 9th row and 4th column. It is a mistake: our model misclassified two images of the digit 8 as 3.

Overall, our model did a good job of classifying the digits: the off-diagonal values, i.e. the misclassifications, are mostly zero and never larger than 2.
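The matrix can also be summarized numerically. The sketch below copies the confusion matrix printed above into a plain list of lists and derives the overall accuracy and the per-class recall from it. The variable names are illustrative.

```python
# Confusion matrix from the output above (rows = actual, columns = predicted)
cm = [
    [37, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 42, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 44, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 44, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 37, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 47, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 52, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 48, 0, 0],
    [0, 0, 0, 2, 0, 0, 0, 0, 46, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 47],
]

# Correct predictions sit on the diagonal
correct = sum(cm[i][i] for i in range(len(cm)))
total = sum(sum(row) for row in cm)
accuracy = correct / total

# Recall per digit: correct predictions divided by actual samples of that digit
recall_per_class = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

print(f"{correct} of {total} test images classified correctly ({accuracy:.4f})")
print("Recall for digit 8:", round(recall_per_class[8], 4))
```

Here 444 of the 450 test images lie on the diagonal, and digit 8 has the lowest recall because of its two misclassifications as 3.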

## Q5. How does the size of the training set affect the performance of a KNN classifier or regressor? What techniques can be used to optimize the size of the training set?

ANS- K Nearest Neighbour Classifier

In [8]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [9]:
from sklearn.datasets import make_classification

X,y = make_classification(
    n_samples=1000, # 1000 observations
    n_features=3, # 3 total features
    n_redundant=1, # 1 redundant feature
    n_classes=2, # binary target/label
    random_state=999
)

In [10]:
X

array([[-0.33504974,  0.02852654,  1.16193084],
       [-1.37746253, -0.4058213 ,  0.44359618],
       [-1.04520026, -0.72334759, -3.10470423],
       ...,
       [-0.75602574, -0.51816111, -2.20382324],
       [ 0.56066316, -0.07335845, -2.15660348],
       [-1.87521902, -1.11380394, -4.04620773]])

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [13]:
from sklearn.neighbors import KNeighborsClassifier

In [14]:
classifier=KNeighborsClassifier(n_neighbors=5,algorithm='auto')
classifier.fit(X_train,y_train)

In [15]:
y_pred=classifier.predict(X_test)

In [16]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

In [17]:
# scikit-learn metrics expect the true labels first, then the predictions
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[158  11]
 [ 20 141]]
0.906060606060606
              precision    recall  f1-score   support

           0       0.89      0.93      0.91       169
           1       0.93      0.88      0.90       161

    accuracy                           0.91       330
   macro avg       0.91      0.91      0.91       330
weighted avg       0.91      0.91      0.91       330



In [20]:
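from sklearn.model_selection import GridSearchCV
knn = KNeighborsClassifier()
k_range = list(range(1, 10))  # candidate values of k
param_grid = dict(n_neighbors=k_range)
# 5-fold cross-validated search for the best k
grid = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)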

## KNN Regressor

In [23]:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=2, noise=10, random_state=42)

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [25]:
from sklearn.neighbors import KNeighborsRegressor

In [26]:
regressor=KNeighborsRegressor(n_neighbors=6,algorithm='auto')
regressor.fit(X_train,y_train)

In [27]:
y_pred=regressor.predict(X_test)


In [28]:
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error

In [29]:
print(r2_score(y_test,y_pred))
print(mean_absolute_error(y_test,y_pred))
print(mean_squared_error(y_test,y_pred))

0.9189275159979495
9.009462452972217
127.45860414317289
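
One way to probe how training-set size affects a KNN model is to refit the regressor on progressively larger slices of the training data and score each fit on the same test set. The sketch below assumes the same synthetic regression data and split as above. The subset sizes in `train_sizes` are arbitrary illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=1000, n_features=2, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Train on the first n training samples only, always testing on the same test set
train_sizes = [50, 100, 300, len(X_train)]
scores = []

for n in train_sizes:
    model = KNeighborsRegressor(n_neighbors=6)
    model.fit(X_train[:n], y_train[:n])
    score = r2_score(y_test, model.predict(X_test))
    scores.append(score)
    print(f"training samples: {n:4d}  R2 on test set: {score:.4f}")
```

If the score stops improving well before the full training set is used, a smaller training set may be enough, which also reduces KNN's prediction cost.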
