##### *This Jupyter Notebook is part of the Practical Machine Learning course on Coursera. The code is taken from the Coursera Applied Machine Learning class notebooks; it has been modified, and explanations have been added to make it easier for beginners to follow.*

#### Credits: Coursera-Practical Machine Learning 

> The K-Nearest Neighbors (k-NN) algorithm can be used for classification and regression. It is an instance-based (or memory-based) supervised learning method, i.e. it learns by memorizing the labeled examples from the training set. The k in k-NN refers to the number of nearest neighbors the classifier uses to make a prediction.

> __k-NN Classifier__ How it works:

> 1. Given a new, unseen instance to classify, the k-NN classifier looks into the set of stored training examples to find the k examples whose features are closest to those of the new instance.

> 2. The classifier will then look at the class labels of the k-nearest training examples, combine those labels and make a prediction of the label of the new unseen test example, by taking a majority vote. 
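
> The two steps above can be sketched in plain Python. This is a minimal illustration and not how scikit-learn implements it: the function name `knn_predict`, the use of Euclidean distance, and the toy width/height values are assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Step 1: order the training examples by Euclidean distance to the query
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    # Step 2: collect the labels of the k closest examples and take a majority vote
    nearest_labels = [train_labels[i] for i in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data, made up for illustration: [width, height] pairs
toy_points = [[6.0, 4.6], [6.2, 4.7], [5.9, 4.3], [8.4, 7.3], [8.0, 6.8], [7.4, 7.2]]
toy_labels = ['mandarin', 'mandarin', 'mandarin', 'apple', 'apple', 'apple']

print(knn_predict(toy_points, toy_labels, [6.1, 4.5], k=3))
print(knn_predict(toy_points, toy_labels, [7.9, 7.0], k=3))
```

> Note that nothing is computed ahead of time here: all the work happens when a prediction is requested, which is why k-NN is described as memory-based.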

###### We will first import the modules required for building the model. 

In [88]:
%matplotlib notebook
%config InlineBackend.figure_format = 'retina'

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

fruits = pd.read_table('fruit_data_with_colors.txt')

**** 
##### We will look at our data to gain some insight into its features

In [4]:
fruits.head()

Unnamed: 0,fruit_label,fruit_name,fruit_subtype,mass,width,height,color_score
0,1,apple,granny_smith,192,8.4,7.3,0.55
1,1,apple,granny_smith,180,8.0,6.8,0.59
2,1,apple,granny_smith,176,7.4,7.2,0.6
3,2,mandarin,mandarin,86,6.2,4.7,0.8
4,2,mandarin,mandarin,84,6.0,4.6,0.79


**** 

##### We will create a mapping from fruit label value to fruit name to make results easier to interpret. For this we use the zip() function to pair the unique fruit label values with the unique fruit names. The resulting zip object is an iterator: like a generator, it produces elements on demand. Reference: https://wiki.python.org/moin/Generators


In [8]:
lookup_fruit_name = zip(fruits.fruit_label.unique(), fruits.fruit_name.unique())  
lookup_fruit_name

<zip at 0x111484748>
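
> Because a zip object produces its elements on demand, it can only be consumed once. The short sketch below illustrates this with plain lists standing in for the two columns (the variable names are made up for the example).

```python
labels = [1, 2, 3, 4]
names = ['apple', 'mandarin', 'orange', 'lemon']

pairs = zip(labels, names)

first_pass = list(pairs)    # consumes every (label, name) pair
second_pass = list(pairs)   # the iterator is now exhausted

print(first_pass)
print(second_pass)
```

> This is one reason to convert the zip object into a dictionary before using it for repeated lookups.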

##### We convert this into a dictionary using the dict() function.

In [12]:
lookup_fruit_name = dict(lookup_fruit_name)

# This tells us that we have 4 categories of fruit in our data.
lookup_fruit_name

{1: 'apple', 2: 'mandarin', 3: 'orange', 4: 'lemon'}
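
> Zipping the two unique() results relies on both columns listing their categories in the same order of first appearance. A possible alternative, sketched here with plain lists standing in for the columns, is to pair label and name row by row and let the dictionary discard the duplicates. The list contents are made up for illustration.

```python
# One entry per row, as the fruit_label and fruit_name columns would provide
fruit_label = [1, 1, 1, 2, 2, 3, 4]
fruit_name = ['apple', 'apple', 'apple', 'mandarin', 'mandarin', 'orange', 'lemon']

# Each row contributes its own (label, name) pair; repeated keys simply overwrite
# themselves with the same value, so the pairing never depends on ordering.
lookup = dict(zip(fruit_label, fruit_name))

print(lookup)
```

> With the DataFrame from this notebook, the equivalent call would be `dict(zip(fruits.fruit_label, fruits.fruit_name))`.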

*****

##### Splitting data to create test and train data

In [15]:
# X holds the features of the data set: height, width, mass and color_score.
# This collection of features is called the feature space
X = fruits[['height', 'width', 'mass', 'color_score']]

# y holds the corresponding label values for the instances in X
y = fruits['fruit_label']

In [138]:
# We use the train_test_split function of scikit-learn to randomly split the data into training and test sets.
# By default the split is 75% training / 25% test.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
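
> To make the idea of a seeded random split concrete, here is a rough sketch in plain Python. It is not scikit-learn's implementation: the function name `simple_train_test_split`, the rounding up of the test set size, and the 20-row example are assumptions of this sketch.

```python
import math
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows with a fixed seed and hold out a fraction as the test set."""
    rng = random.Random(seed)            # fixed seed makes the split reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = math.ceil(test_fraction * len(rows))   # assumption: round the test size up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(20))                   # stand-in for 20 data rows
train_rows, test_rows = simple_train_test_split(rows)

print(len(train_rows), len(test_rows))
```

> Calling the function twice with the same seed returns the same split, which is the role random_state=0 plays in the cell above.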

*****
###### Creating a classifier object 

In [139]:
# We create a k-NN classifier object using the sklearn library
from sklearn.neighbors import KNeighborsClassifier

# We set the number of neighbors by passing a value for the n_neighbors parameter; here it is 5
knn = KNeighborsClassifier(n_neighbors = 5)
type(knn)

sklearn.neighbors.classification.KNeighborsClassifier

##### Training the model using the training data set that we created

In [140]:
# We train our classifier using the X_train and y_train using the fit method
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

> Here the fit method takes the training data and changes the state of the classifier. For k-NN, fitting essentially stores the training examples (possibly in an index structure for faster lookup); k, the number of nearest neighbors, is then used at prediction time rather than during training.

##### How accurate is our classifier? 

In [141]:
# We use the score method to check the accuracy of the classifier on the test data that we made earlier.
# Accuracy is defined as the fraction of test set labels that were correctly predicted by the classifier
knn.score(X_test, y_test)

0.53333333333333333
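
> The accuracy definition used by score can be written out in a few lines. This is a sketch with made-up labels; the function name `accuracy` is chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up true and predicted fruit labels for six test examples
y_true = [1, 1, 2, 3, 4, 4]
y_pred = [1, 3, 2, 3, 4, 1]

print(accuracy(y_true, y_pred))   # 4 of the 6 predictions match
```

> By the same arithmetic, a score of about 0.533 would correspond, for example, to 8 correct predictions out of 15 test examples.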

**** 
###### Predicting new unseen instances 

In [142]:
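# first example: a small fruit with mass 20g, width 4.3 cm, height 5.5 cm, color_score 0.55
# Caution: predict reads the values in the column order of X (height, width, mass, color_score),
# whereas the list below is ordered mass, width, height, color_score.
first_fruit_prediction = knn.predict([[20, 4.3, 5.5, 0.55]])
lookup_fruit_name[first_fruit_prediction[0]]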

'mandarin'

In [143]:
# second example: a larger, elongated fruit with mass 192g, width 8.4 cm, height 7.3 cm, color_score 0.55
# Note: predict reads these values in the column order of X (height, width, mass, color_score).
second_fruit_prediction = knn.predict([[192, 8.4, 7.3, 0.55]])
lookup_fruit_name[second_fruit_prediction[0]]


'mandarin'
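
> Since predict takes plain lists of numbers, the values have to follow the column order used when X was built (height, width, mass, color_score). One way to avoid ordering mistakes, sketched below, is to describe the new fruit by name and build the row from the feature list. The names `feature_order` and `sample` are introduced only for this example.

```python
# Column order used to build X earlier in the notebook
feature_order = ['height', 'width', 'mass', 'color_score']

# A new fruit described by name, so the order of the entries does not matter
sample = {'mass': 20, 'width': 4.3, 'height': 5.5, 'color_score': 0.55}

# Arrange the values in the order the classifier expects
row = [sample[name] for name in feature_order]

print(row)
```

> The resulting row can then be wrapped in a list and passed to predict, for example `knn.predict([row])`.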

##### Plotting Decision Boundaries 
###### We would like to see the decision boundaries for various values of k
> The third argument of plot_fruit_knn is the number of nearest neighbors k (1, 5 and 10 in the cells below) and the last argument is the weighting method. Here we use the string 'uniform' to treat all neighbors equally when combining their labels. We can also pass 'distance' to weight each neighbor by the inverse of its distance, or pass our own function.
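
> The difference between the two weighting schemes can be sketched as follows. This is an illustration, not scikit-learn's code: the function `weighted_vote` and the neighbor distances are made up, and the sketch assumes all distances are greater than zero.

```python
from collections import defaultdict

def weighted_vote(neighbors, weights='uniform'):
    """Combine the labels of the k nearest neighbors, given as (distance, label) pairs."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        if weights == 'uniform':
            totals[label] += 1.0             # every neighbor counts the same
        elif weights == 'distance':
            totals[label] += 1.0 / distance  # closer neighbors count more
        else:
            raise ValueError("weights must be 'uniform' or 'distance'")
    return max(totals, key=totals.get)

# Three nearest neighbors of some query point: one very close lemon, two farther apples
neighbors = [(0.1, 'lemon'), (1.0, 'apple'), (1.2, 'apple')]

print(weighted_vote(neighbors, 'uniform'))
print(weighted_vote(neighbors, 'distance'))
```

> With uniform weights the two apples outvote the lemon; with distance weights the single very close lemon dominates.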



In [144]:
from adspy_shared_utilities import plot_fruit_knn

plot_fruit_knn(X_train, y_train, 1, 'uniform')


In [145]:
plot_fruit_knn(X_train, y_train, 5, 'uniform')


In [146]:
plot_fruit_knn(X_train, y_train, 10, 'uniform')


> As we can see, for k=1 the prediction is sensitive to noise, outliers, mislabeled data, and other sources of variation in individual data points.
> For larger values of k, the regions assigned to different classes are smoother, less fragmented, and more robust to noise in individual points, but more individual training points may be misclassified. This is an example of what is known as the bias-variance tradeoff.
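
> The effect of a single mislabeled point can be reproduced with a small sketch. The data below are made up: four points of class 'A' near (1, 1), one point in the same area wrongly labeled 'B', and three genuine 'B' points near (5, 5). The helper `knn_predict` is a minimal plain-Python version written for this example.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training (point, label) pairs closest to `query`."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [
    ((1.0, 1.0), 'A'), ((1.2, 0.9), 'A'), ((0.8, 1.1), 'A'), ((1.1, 1.2), 'A'),
    ((1.05, 1.05), 'B'),   # mislabeled point sitting inside the 'A' cluster
    ((5.0, 5.0), 'B'), ((5.2, 4.9), 'B'), ((4.8, 5.1), 'B'),
]

query = (1.04, 1.04)       # lies inside the 'A' cluster, right next to the noisy point

print(knn_predict(train, query, k=1))
print(knn_predict(train, query, k=5))
```

> With k=1 the prediction follows the noisy neighbor, while with k=5 the surrounding 'A' points outvote it.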

****

#### How the choice of k affects the accuracy of the classifier

In [148]:
# Here we look at the effect of k on the accuracy of the classifier.
# For each k we fit the classifier on the training data, compute its accuracy on the test data, and then plot the results.

k_range = range(1,20)
scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0,5,10,15,20]);



In [151]:
# As we can see, accuracy is highest for k=6, at around 60%; we can check the decision boundary for it.
# The value of k should be checked for various splits of training and testing data, in order to find the best k.
# This is important when doing model selection.

knn = KNeighborsClassifier(n_neighbors = 6)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

0.59999999999999998
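
> Checking k over several random splits, rather than a single one, might look like the sketch below. It uses plain Python and synthetic two-class data so that it runs on its own; the helper names, the cluster positions, the 30 repetitions and the candidate values of k are all assumptions of this example, not part of the original notebook.

```python
import math
import random
from collections import Counter
from statistics import mean

def knn_predict(train, query, k):
    """Majority vote among the k training (point, label) pairs closest to `query`."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def split_accuracy(data, k, rng, test_fraction=0.25):
    """Accuracy of k-NN on one random train/test split of `data`."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = math.ceil(test_fraction * len(shuffled))
    test, train = shuffled[:n_test], shuffled[n_test:]
    hits = sum(knn_predict(train, point, k) == label for point, label in test)
    return hits / len(test)

# Synthetic data (made up): two overlapping Gaussian clusters
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 'a') for _ in range(40)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), 'b') for _ in range(40)]

# Average the test accuracy over 30 random splits for each candidate k
mean_accuracy = {
    k: mean(split_accuracy(data, k, rng) for _ in range(30))
    for k in (1, 3, 5, 7, 9)
}
best_k = max(mean_accuracy, key=mean_accuracy.get)

print(mean_accuracy)
print(best_k)
```

> Averaging over many splits reduces the chance that a value of k looks best only because of one lucky split. In scikit-learn, the `cross_val_score` function in `sklearn.model_selection` offers a related, more systematic way to do this.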

In [149]:
plot_fruit_knn(X_train, y_train, 6, 'uniform')



***** 

###### k-NN classification sensitivity on train/test split proportion

In [154]:
# training set proportions for the various splits on which we want to test the accuracy of our classifier
t = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

# we choose 5 as the number of nearest neighbors that our classifier will consider
knn = KNeighborsClassifier(n_neighbors = 5)

plt.figure()

# we loop through the different proportions and calculate accuracy for each split
for s in t:
    scores = []                                   # list to store score results
    for i in range(1000):                         # we run this calculation 1000 times for each split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1-s) # splitting data into train and test sets
        knn.fit(X_train, y_train)                 # fitting our classifier for a particular split
        scores.append(knn.score(X_test, y_test))  # finding accuracy for a particular split
    plt.plot(s, np.mean(scores), 'bo')            # plotting s on x-axis and mean score on y-axis 

# making labels for x and y axis
plt.xlabel('Training set proportion')             # s is a fraction between 0 and 1, not a percentage
plt.ylabel('accuracy');


> We can see that accuracy is also sensitive to how we split our data into training set and test set.  