# COURSE:   PGP [AI&ML]

## Learner :  Chaitanya Kumar Battula
## Module  : Machine Learning
## Topic   : KNN (Supervised Algorithm)

# Machine Learning  : KNN (Supervised Algorithm)

## What is k-Nearest Neighbors

The model for kNN is the entire training dataset. When a prediction is required for an unseen data instance, the kNN algorithm searches through the training dataset for the k most similar instances. The prediction attribute of these most similar instances is summarized and returned as the prediction for the unseen instance.

The similarity measure depends on the type of data. For real-valued data, the Euclidean distance can be used. For other types of data, such as categorical or binary data, the Hamming distance can be used.
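As a small illustration of the Hamming distance mentioned above, the sketch below counts the positions at which two equal-length sequences differ. The function name `hamming_distance` and the sample records are made up for this example.

```python
def hamming_distance(x, y):
    """Count the positions where two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))


# Two hypothetical categorical records: they differ only in the second field.
record_a = ["red", "small", "round"]
record_b = ["red", "large", "round"]
print(hamming_distance(record_a, record_b))

# Two binary vectors that differ in the first and last positions.
print(hamming_distance([1, 0, 1, 1], [0, 0, 1, 0]))
```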

In the case of regression problems, the average of the predicted attribute may be returned. In the case of classification, the most prevalent class may be returned.
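The prediction step described above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation used later: the helper names (`euclidean`, `k_nearest`, `knn_classify`, `knn_regress`) and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_nearest(train_X, train_y, query, k):
    """Return the target values of the k training rows closest to query."""
    ranked = sorted(zip(train_X, train_y), key=lambda row: euclidean(row[0], query))
    return [target for _, target in ranked[:k]]


def knn_classify(train_X, train_y, query, k=3):
    # Classification: return the most common class among the neighbours.
    return Counter(k_nearest(train_X, train_y, query, k)).most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    # Regression: return the mean target value of the neighbours.
    neighbours = k_nearest(train_X, train_y, query, k)
    return sum(neighbours) / len(neighbours)


# Made-up toy data: two features per row.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8]]
labels = ["benign", "benign", "benign", "malignant", "malignant"]
values = [1.0, 2.0, 3.0, 10.0, 12.0]

print(knn_classify(points, labels, [2, 2], k=3))
print(knn_regress(points, values, [2, 2], k=3))
```

Note that all the work happens at prediction time: nothing is computed until a query point arrives.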

#### Reference Link

**1. https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/**

## How does k-Nearest Neighbors Work

The kNN algorithm belongs to the family of instance-based, competitive learning and lazy learning algorithms.

Instance-based algorithms are those algorithms that model the problem using data instances (or rows) in order to make predictive decisions. The kNN algorithm is an extreme form of instance-based methods because all training observations are retained as part of the model.

It is a competitive learning algorithm, because it internally uses competition between model elements (data instances) in order to make a predictive decision. The objective similarity measure between data instances causes each data instance to compete to “win” or be most similar to a given unseen data instance and contribute to a prediction.

Lazy learning refers to the fact that the algorithm does not build a model until the time that a prediction is required. It is lazy because it only does work at the last second. This has the benefit of only including data relevant to the unseen data, called a localized model. A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger training datasets.

Finally, kNN is powerful because it does not assume anything about the data, other than that a distance measure can be calculated consistently between any two instances. As such, it is called non-parametric or non-linear, as it does not assume a functional form.

## KNN classifier implementation in scikit-learn

We are going to examine the Breast Cancer Dataset using the Python sklearn library to model the K-nearest neighbor algorithm. After modeling the knn classifier, we will use the trained knn model to predict whether a patient has a benign or a malignant tumor. A strength of sklearn is that it lets us implement machine learning algorithms in a few lines of code.


As discussed, the principle behind the KNN (K-Nearest Neighbor) classifier is to find a predefined number K of training samples closest in distance to the new point and predict the label from these. The distance measure is commonly taken to be Euclidean distance.

## Euclidean distance

Euclidean distance is the most commonly used distance measure, and is often called simply “distance”. It is a good choice of proximity measure when the data is dense or continuous. The Euclidean distance between two points is the length of the straight line segment connecting them. The Pythagorean theorem gives this distance between two points.
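For two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the distance described above is usually written as:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

In two dimensions this reduces to the Pythagorean theorem. For example, the distance between $(0, 0)$ and $(3, 4)$ is $\sqrt{3^2 + 4^2} = 5$.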

### Euclidean distance implementation in python

In [None]:
from math import sqrt

def euclidean_distance(x, y):
    # Square root of the summed squared differences between paired coordinates.
    return sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))


In [None]:
from math import sqrt

In [None]:
euclidean_distance([0,3,4],[7,6,3])
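Worked by hand, the call above is the square root of (0-7)² + (3-6)² + (4-3)² = 49 + 9 + 1 = 59, which is roughly 7.68. As a cross-check, the same distance can be computed with `math.dist` (available from Python 3.8) or with NumPy. Neither is used in the original notebook, so treat this as an optional alternative.

```python
import math

import numpy as np

x, y = [0, 3, 4], [7, 6, 3]

# Standard-library helper (available from Python 3.8).
d_math = math.dist(x, y)

# NumPy: norm of the difference vector.
d_numpy = float(np.linalg.norm(np.array(x) - np.array(y)))

print(d_math, d_numpy)  # both are the square root of 59, roughly 7.68
```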

## Wisconsin Breast Cancer Data Set

The Wisconsin Breast Cancer Database was collected by Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, USA. Each record consists of a sample id, 9 attributes scored on a scale of 1 to 10, and 1 target class attribute. The class attribute shows the observation result: whether the patient has a benign or a malignant tumor. Benign tumors do not spread to other parts of the body, while malignant tumors are cancerous. The dataset was collected and openly distributed so that patterns could be found in the data.

## Breast Cancer Data Set Attribute Information:

1. Sample code number: id number
2. Clump Thickness: 1 – 10
3. Uniformity of Cell Size: 1 – 10
4. Uniformity of Cell Shape: 1 – 10
5. Marginal Adhesion: 1 – 10
6. Single Epithelial Cell Size: 1 – 10
7. Bare Nuclei: 1 – 10
8. Bland Chromatin: 1 – 10
9. Normal Nucleoli: 1 – 10
10. Mitoses: 1 – 10
11. Class: (2 for benign, 4 for malignant)

#### Problem Statement:

To model the knn classifier using the Breast Cancer data for predicting whether a patient is suffering from the benign tumor or malignant tumor.

#### KNN Model for Cancerous tumor detection:

To diagnose breast cancer, a doctor draws on experience while analyzing details provided by:

   * Patient’s Past Medical History
   * Reports of all the tests performed.


Using the modeled KNN classifier, we will solve the problem in a way similar to the procedure used by doctors. The modeled KNN classifier will compare the new patient’s test reports and observation metrics with the records of patients (training data) that were correctly classified as benign or malignant.


In [None]:
import numpy as np
from sklearn.impute import SimpleImputer as Imputer
from sklearn.model_selection import train_test_split
from sklearn import metrics, model_selection, preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

## Data Import:

We are using the breast cancer data, which you can download from the archive.ics.uci.edu website. To import and manipulate the data, we are going to use numpy arrays.
The genfromtxt() function reads a text file into a 2d numpy array. We are passing 3 parameters:

   * fname: The name of the file to read, including its extension.
   * delimiter: The string used to separate values. In our dataset “,” (comma) is the separator.
   * dtype: The data type of the resulting array.

All the values in our dataset are numeric, but some values are missing and are marked with “?”. Because we read the file with the float dtype, genfromtxt() stores each “?” as NaN (not a number), which we later replace through data imputation.

In [None]:
cancer_data = np.genfromtxt(
 fname ='data/breast-cancer-wisconsin.data', delimiter= ',', dtype= float)
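To see how the “?” markers are handled, the sketch below reads a few rows from an in-memory string instead of the real file. The rows are invented for illustration and only mimic the file's comma-separated layout.

```python
from io import StringIO

import numpy as np

# Three invented rows in the same layout as the data file (id, 9 features,
# class); the second row has a "?" in place of one value.
sample = StringIO(
    "101,5,1,1,1,2,1,3,1,1,2\n"
    "102,8,4,5,1,2,?,7,3,1,4\n"
    "103,8,10,10,8,7,10,9,7,1,4\n"
)

rows = np.genfromtxt(sample, delimiter=",", dtype=float)

print(rows.shape)
print(np.isnan(rows).sum(axis=0))  # number of missing values per column
```

Counting NaN values per column in this way is a quick check on which columns will need imputation.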

In [None]:
import pandas as pd
df=pd.DataFrame(cancer_data)
df

In [None]:
print("Dataset Shape:: ", cancer_data.shape)

The cancer dataset’s first column consists of the patient’s id. This id carries no diagnostic information, so to keep the prediction process unbiased we should remove it. We can use the numpy delete() function for this operation.

delete(): It returns a new transformed array. Three parameters should be passed.

   * arr: It holds the array name.
   * obj: It indicates which sub-arrays to remove.
   * axis: The axis along which to delete. axis = 1 is used for columns & axis = 0 for rows.

In [None]:
cancer_data = np.delete(arr = cancer_data, obj= 0, axis = 1)
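The effect of the axis parameter is easiest to see on a tiny array. The values below are made up, with column 0 playing the role of the id column.

```python
import numpy as np

# Small made-up array: column 0 plays the role of the id column.
demo = np.array([[101, 5, 1, 2],
                 [102, 8, 4, 4]])

without_id = np.delete(arr=demo, obj=0, axis=1)   # drop column 0
without_row = np.delete(arr=demo, obj=0, axis=0)  # drop row 0

print(without_id)
print(without_row)
print(demo.shape)  # the original array is unchanged
```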

Now, we wish to divide the dataset into a feature dataset and a label dataset. The feature data holds the predictor variables that help us predict the labels (the criterion variable). Here, the first 9 columns hold the variables that will help us predict whether a patient has a benign or a malignant tumor.

In [None]:
X = cancer_data[:, 0:9]

In [None]:
Y = cancer_data[:,9]

## Data Imputation:

Imputation is the process of replacing missing values with substituted values. In our dataset, some columns have missing values. We can replace missing values with the mean, median, mode or any particular value.
Sklearn provides the SimpleImputer class (imported above as Imputer) to perform imputation in a couple of lines of code. We just need to define missing_values and strategy; values are imputed column by column. We are using the “median” value of each column to substitute for its missing values.

In [None]:
?Imputer

In [None]:
imp = Imputer(missing_values=np.nan, strategy='median')
X = imp.fit_transform(X)
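A small, made-up matrix shows what median imputation does. The `statistics_` attribute holds the per-column medians learned during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with one missing value in the second column.
demo = np.array([[1.0, 2.0],
                 [3.0, np.nan],
                 [5.0, 4.0],
                 [7.0, 10.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled = imputer.fit_transform(demo)

print(imputer.statistics_)  # per-column medians, computed from non-missing values
print(filled)               # the NaN is replaced by its column's median
```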

## Train, Test data split:

To divide the data into train data & test data, we are using the train_test_split() function from sklearn.
train_test_split(): We are using 4 parameters: X, Y, test_size, random_state

   * X, Y:  X is a numpy array consisting of the feature dataset & Y contains the label for each record.

   * test_size: It represents the proportion of the data to set aside as test data. If we use 0.4, 40% of the data is separated and saved as testing data. Below we use 0.20.

   * random_state: It’s the seed of the pseudo-random number generator used for random sampling. If you want to replicate our results, then use the same value of random_state.

Now, X_train & y_train are the training datasets, and X_test & y_test are the testing datasets.
Because Y was selected as a single column, y_train & y_test are already 1d numpy arrays. Calling ravel() simply guarantees a 1d shape, which is what the classifier expects for labels.

In [None]:
X_train, X_test, y_train, y_test =train_test_split(X, Y, test_size = 0.20, random_state = 100)
y_train = y_train.ravel()
y_test = y_test.ravel()
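The effect of test_size and random_state can be checked on a small, made-up dataset. The variable names here are chosen for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, and one label per sample.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([2, 4] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=100
)
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=100
)
print(np.array_equal(X_te, X_te2))
```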

## KNN Implementation:

Now we fit the KNN algorithm on the training data, predict labels for the test dataset, and print the accuracy of the model for different values of K (ranging from 1 to 25).

**KNeighborsClassifier():** This is the classifier class for KNN and the main entry point for the algorithm. Some important parameters are:

   * n_neighbors: It holds the value of K and must be an integer. If we don’t pass n_neighbors, it defaults to 5.
   * weights: It names the weight function used in prediction. It can be ‘uniform’, ‘distance’ or any user defined function.
      * ‘uniform’: all points in the neighborhood are weighted equally. This is the default value for weights.
      * ‘distance’: closer neighbors get a higher weight and farther neighbors a lower weight, i.e., points are weighted by the inverse of their distance.
      * user defined function: used when we want to produce custom weight values. It accepts an array of distances and returns an array of weights of the same shape.
   * algorithm: It specifies the algorithm used to compute the nearest neighbors. It can take values like ‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’. It is an optional parameter.
      * a) ‘ball_tree’ and ‘kd_tree’ use the ball tree and k-d tree algorithms respectively. These are special kinds of data structures for space partitioning.
      * b) ‘brute’ uses a brute-force search.
      * c) ‘auto’ gives control to the system: it attempts to choose the most appropriate algorithm based on the training data passed to fit().

**fit():** This method fits the model and must be called with the training data before predicting. It is passed two parameters:

   * X: The training data features.
   * Y: The training data labels.

**predict():** It predicts class labels for the data provided as its parameter. If an array of feature data is passed in, an array of labels is returned.
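The difference between the weights options can be seen on a tiny example. The data and the custom weight function `exp_weights` below are hypothetical, chosen so that the options disagree: the query point has one close neighbor of class 0 and two farther neighbors of class 1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D training data: one class-0 point and two class-1 points.
X_demo = np.array([[0.0], [3.0], [4.0]])
y_demo = np.array([0, 1, 1])
query = np.array([[1.0]])  # distances to the training points: 1, 2, 3


def exp_weights(distances):
    # Hypothetical custom weight function: weights decay exponentially.
    return np.exp(-distances)


predictions = {}
for name, weights in [("uniform", "uniform"),
                      ("distance", "distance"),
                      ("custom", exp_weights)]:
    model = KNeighborsClassifier(n_neighbors=3, weights=weights, algorithm="brute")
    model.fit(X_demo, y_demo)
    predictions[name] = int(model.predict(query)[0])

print(predictions)
```

With ‘uniform’ weights the two class-1 neighbors outvote the single class-0 neighbor. With ‘distance’ weights the class-0 neighbor contributes 1/1 = 1, against 1/2 + 1/3 for class 1, so class 0 wins. The exponential weights favor the closest neighbor in the same way.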

#### Accuracy Score:

accuracy_score(): This function computes the accuracy of the KNN classifier’s predictions. By accuracy, we mean the ratio of correctly predicted data points to all predicted data points. Accuracy as a metric helps us understand the effectiveness of our algorithm. It takes 4 parameters:

   * y_true,
   * y_pred,
   * normalize,
   * sample_weight.

Out of these 4, normalize & sample_weight are optional parameters. The parameter y_true accepts an array of correct labels and y_pred takes an array of predicted labels returned by the classifier. It returns the accuracy as a float value.
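A quick check with made-up labels shows what the function returns. Passing normalize=False gives the number of correct predictions instead of the fraction.

```python
from sklearn.metrics import accuracy_score

# Made-up labels using the dataset's coding: 2 = benign, 4 = malignant.
y_true = [2, 4, 2, 2]
y_pred = [2, 4, 4, 2]

fraction_correct = accuracy_score(y_true, y_pred)
count_correct = accuracy_score(y_true, y_pred, normalize=False)

print(fraction_correct)  # fraction of correct predictions (3 out of 4)
print(count_correct)     # number of correct predictions
```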

In [None]:
?KNeighborsClassifier

In [None]:
# Try every K from 1 to 25 and report the test accuracy for each.
for K_value in range(1, 26):
    neigh = KNeighborsClassifier(n_neighbors=K_value, weights='uniform', algorithm='auto')
    neigh.fit(X_train, y_train)
    y_pred = neigh.predict(X_test)
    print("Accuracy is ", accuracy_score(y_test, y_pred)*100, "% for K-Value:", K_value)

The output shows that we are getting 95.71% accuracy at K = 3 and K = 5. Choosing a large value of K leads to a greater amount of execution time and to underfitting, while a small value of K leads to overfitting. There is no guaranteed way to find the best value of K, so to keep things quick we use K = 3 for this tutorial.
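One common way to compare values of K more robustly than with a single train/test split is cross-validation. The sketch below is not part of the original notebook. It uses synthetic stand-in data from `make_classification` rather than the breast cancer file, so the chosen K says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the breast cancer file): 300 samples, 9 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

mean_scores = {}
for k in range(1, 26):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds.
    mean_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print("Best K by cross-validation:", best_k)
```

Averaging over several folds makes the comparison less sensitive to which rows happen to land in the test set.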

### Nearest Neighbours: Pros and Cons

#### Pros:
   * Simple to implement
   * Flexible to feature / distance choices
   * Naturally handles multi-class cases
   * Can do well in practice with enough representative data


#### Cons:
   * Large search problem to find nearest neighbours
   * Storage of data
   * Requires a meaningful distance function
