# Classification Using K Nearest Neighbours (KNN)

In [248]:
import numpy as np
from sklearn import datasets
from sklearn import model_selection
from sklearn import metrics
from sklearn import preprocessing
import pandas as pd
import requests as req


import matplotlib.pyplot as plt
%matplotlib inline

## Classifying Iris Dataset With KNN
### Load Data
Here we load the Iris dataset from scikit-learn. We use `iris.data` as the features and `iris.target` as the class labels.

In [28]:
iris = datasets.load_iris()

As usual, `dir(iris)` lists the attributes of the Iris dataset object.<br> `iris.data.shape` shows the shape of the data.<br>
`iris.target_names` shows the classes that we want to predict.<br>
`iris.feature_names` shows the names of the features used for training.

In [29]:
dir(iris)

['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']

In [30]:
iris.data.shape

(150, 4)

In [31]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [32]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [90]:
data = iris.data.astype(np.float32)
target = iris.target.astype(np.float32)

Split the data into training and test sets, holding out 30% of the samples for testing.

In [33]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    data, target, test_size=0.3, random_state=123
)

In [34]:
X_train.shape, y_train.shape

((105, 4), (105,))

In [35]:
X_test.shape, y_test.shape

((45, 4), (45,))

### Model Training
We will use the K Nearest Neighbours classifier from scikit-learn.

In [36]:
from sklearn.neighbors import KNeighborsClassifier

Initialize the model.<br>
Set the number of neighbors to 3.

In [37]:
model = KNeighborsClassifier(n_neighbors=3)

Train the model on the training set.

In [38]:
# Fit the model to the training data
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
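To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of the KNN prediction rule: measure the Euclidean distance from a query point to every training point, take the `k` closest, and return the majority label. The function `knn_predict` and the toy arrays are hypothetical and exist only for illustration. This is not how scikit-learn implements `KNeighborsClassifier` (it can use tree-based search and breaks ties differently).

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, X_query, k=3):
    """Label each query row by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for row in X_query:
        # Euclidean distance from this query point to every training point
        distances = np.sqrt(((X_train - row) ** 2).sum(axis=1))
        # Indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of those k neighbours
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


# Made-up toy data: two well-separated clusters with labels 0 and 1
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(toy_X, toy_y, np.array([[0.05, 0.05], [5.0, 5.1]]), k=3)
print(toy_pred)
```

Because KNN has no real training phase, `fit` mostly just stores the training data; the work happens at prediction time.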

### Evaluation

In [39]:
predictions = model.predict(X_test)

The function `metrics.confusion_matrix` summarizes the model's performance as a confusion matrix: rows are the true classes and columns are the predicted classes.

In [40]:
print(metrics.confusion_matrix(y_test,predictions))

[[18  0  0]
 [ 0  9  1]
 [ 0  1 16]]


In [41]:
metrics.accuracy_score(y_test, predictions)

0.9555555555555556
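The accuracy can also be read directly off the confusion matrix: the diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the total. The sketch below re-enters the matrix printed above by hand (the variable names are illustrative) and also derives per-class recall.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[18, 0, 0],
               [0, 9, 1],
               [0, 1, 16]])

correct = np.trace(cm)      # correctly classified samples on the diagonal
total = cm.sum()            # all test samples
accuracy = correct / total  # 43 / 45

# Recall per class: correct predictions divided by the true count in each row
per_class_recall = np.diag(cm) / cm.sum(axis=1)

print(correct, total, accuracy)
print(per_class_recall)
```

The two misclassified samples are one versicolor predicted as virginica and one virginica predicted as versicolor; setosa is classified perfectly.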

## Classifying Glass Dataset from UCI Machine Learning Repository

### Load Data

Here, we load the glass data from the UCI ML Repository into a DataFrame using **pandas**.<br> `glass` stores the dataset, and `description` stores the text describing the data.

In [303]:
glass = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data", 
    names=['ID','Refractive Index','Na','Mg','Al','Si','K','Ca','Ba','Fe','Class']
)
description = req.get("https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.names").text

In [296]:
print(description)

1. Title: Glass Identification Database

2. Sources:
    (a) Creator: B. German
        -- Central Research Establishment
           Home Office Forensic Science Service
           Aldermaston, Reading, Berkshire RG7 4PN
    (b) Donor: Vina Spiehler, Ph.D., DABFT
               Diagnostic Products Corporation
               (213) 776-0180 (ext 3014)
    (c) Date: September, 1987

3. Past Usage:
    -- Rule Induction in Forensic Science
       -- Ian W. Evett and Ernest J. Spiehler
       -- Central Research Establishment
          Home Office Forensic Science Service
          Aldermaston, Reading, Berkshire RG7 4PN
       -- Unknown technical note number (sorry, not listed here)
       -- General Results: nearest neighbor held its own with respect to the
             rule-based system

4. Relevant Information:n
      Vina conducted a comparison test of her rule-based system, BEAGLE, the
      nearest-neighbor algorithm, and discriminant analysis.  BEAGLE is 
      a product available 

The `glass` DataFrame contains both the features and the class labels. From the description, the features we are interested in are in columns **2 - 10** (column 1 is only an ID). <br>As in many datasets, the **expected values/categories** are in the last column.<br><br> Using `iloc`, separate the data into `glass_data`, which contains the features, and `glass_target`, which contains the expected values/categories.

In [297]:
glass_data = glass.iloc[:,1:-1]
glass_target = glass.iloc[:,-1]
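As a quick illustration of what these two `iloc` slices select, the sketch below applies the same indexing to a small made-up DataFrame (`toy` and its column names are hypothetical). `iloc` is position-based and zero-indexed, so `1:-1` skips the first column and stops before the last one, while `-1` picks the last column alone.

```python
import pandas as pd

# Made-up frame with the same layout: ID first, features in the middle, class last
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "feat_a": [0.1, 0.2, 0.3],
    "feat_b": [10, 20, 30],
    "Class": [1, 2, 1],
})

toy_features = toy.iloc[:, 1:-1]  # all rows, every column except the first and last
toy_labels = toy.iloc[:, -1]      # all rows, last column only (returned as a Series)

print(list(toy_features.columns))
print(toy_labels.tolist())
```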

Perform **feature scaling** on `glass_data` and store the result in **`glass_data_scaled`**.

In [298]:
scaler = preprocessing.StandardScaler()
glass_data_scaled = scaler.fit_transform(glass_data)
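Scaling matters for KNN because the distance is dominated by whichever feature has the largest numeric range. `StandardScaler` standardizes each column to zero mean and unit variance, i.e. it computes `(x - mean) / std` per feature. The sketch below checks this against a manual computation on a small made-up array (`toy` is hypothetical).

```python
import numpy as np
from sklearn import preprocessing

# Made-up data: the second column is on a much larger scale than the first
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

# Manual standardization, using the population standard deviation (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same transformation via scikit-learn
scaled = preprocessing.StandardScaler().fit_transform(toy)

same = np.allclose(manual, scaled)
print(same)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns contribute on a comparable scale to the Euclidean distance.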

Split `glass_data_scaled` into **test and train data**.<br>Test size = 0.3

In [299]:
X_train2, X_test2, y_train2, y_test2 = model_selection.train_test_split(
    glass_data_scaled, glass_target, test_size=0.3, random_state=123
)
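Note that in the cells above the scaler is fitted on the whole dataset before the split, so the test rows influence the mean and standard deviation used for scaling. A possible alternative, sketched here, is to split first and wrap the scaler and classifier in a pipeline so the scaler is fitted on the training rows only. To keep the example self-contained it uses scikit-learn's built-in wine dataset as a stand-in for the glass data; the variable names are illustrative, and the resulting score will not match the glass results in this notebook.

```python
import numpy as np
from sklearn import datasets, model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption: used only so the example runs offline)
X, y = datasets.load_wine(return_X_y=True)

# Split first, so the test rows stay unseen during scaling
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=123
)

# The pipeline fits the scaler on X_tr only, then reuses it to transform X_te
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
fitted_scaler = pipe.named_steps["standardscaler"]
print(test_accuracy)
```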

### Model Training

Initialize a KNN model named `model_2` with `k=3`.

In [322]:
model_2 = KNeighborsClassifier(n_neighbors=3)

In [None]:
model_2.fit(X_train2,y_train2)

### Evaluation

Predict the classes for the test data, then compute the **`accuracy score`** and the **`confusion matrix`**.

In [323]:
prediction = model_2.predict(X_test2)
metrics.accuracy_score(y_test2,prediction)

0.7692307692307693

In [325]:
metrics.confusion_matrix(y_test2,prediction)

array([[16,  1,  0,  0,  0,  0],
       [ 3, 19,  0,  0,  0,  0],
       [ 4,  1,  0,  0,  0,  0],
       [ 0,  1,  0,  4,  0,  0],
       [ 0,  0,  0,  0,  2,  1],
       [ 3,  1,  0,  0,  0,  9]], dtype=int64)
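The overall accuracy hides how unevenly the classes are handled. The sketch below re-enters the matrix above by hand (variable names are illustrative) and computes the number of test samples and the recall for each row, assuming the usual scikit-learn layout of true classes in rows and predicted classes in columns.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[16, 1, 0, 0, 0, 0],
               [3, 19, 0, 0, 0, 0],
               [4, 1, 0, 0, 0, 0],
               [0, 1, 0, 4, 0, 0],
               [0, 0, 0, 0, 2, 1],
               [3, 1, 0, 0, 0, 9]])

accuracy = np.trace(cm) / cm.sum()  # 50 correct out of 65 test samples
support = cm.sum(axis=1)            # number of true samples per class
recall = np.diag(cm) / support      # fraction of each class predicted correctly

print(accuracy)
print(support)
print(recall)
```

The third row stands out: none of its five samples are predicted correctly, and four of them are assigned to the first class. The rows with few test samples are also the ones where a single error changes the recall the most.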

## References
C.L. Blake and C.J. Merz (1998). UCI Repository of Machine Learning Databases. University of California. [http://www.ics.uci.edu/~mlearn/MLRepository.html]