# Perform Classification by using K Nearest Neighbour (KNN)

In [None]:
import numpy as np
from sklearn import datasets
from sklearn import model_selection
from sklearn import metrics
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE
import pandas as pd
import requests as req


import matplotlib.pyplot as plt
%matplotlib inline

## Classifying Iris Dataset With KNN
### Load Data
Here we load the Iris dataset from **scikit-learn**. As usual, `iris.data` holds the features and `iris.target` holds the class labels.

In [None]:
iris = datasets.load_iris()

As usual, `dir(iris)` shows the attributes of the iris dataset.<br> 
- `iris.data.shape` shows the shape of the data.<br>
- `iris.target_names` shows the classes that we want to classify.<br>
- `iris.feature_names` shows the names of the features that we train on.

In [None]:
dir(iris)

In [None]:
iris.data.shape

In [None]:
iris.target_names

In [None]:
iris.feature_names

In [None]:
data = iris.data.astype(np.float32)
target = iris.target.astype(np.float32)

Split data into train and test sets.

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    data, target, test_size=0.3, random_state=123
)
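
To make the split concrete, here is a minimal sketch of what a shuffled 70/30 split does to 150 sample indices. The helper `split_indices` is a hypothetical name used for illustration only; it is a simplification and not scikit-learn's implementation, so the exact samples chosen will differ from `train_test_split`.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Shuffle the indices 0..n_samples-1 and carve off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]  # (train, test)


train_idx, test_idx = split_indices(n_samples=150, test_size=0.3, seed=123)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state=123` above: rerunning the cell gives the same partition.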

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

### Model Training
We will use the K Nearest Neighbours classifier from scikit-learn.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
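
Before using the library class, it may help to see the idea behind KNN in a few lines. The sketch below is a simplified pure-Python version, assuming Euclidean distance and an unweighted majority vote; `knn_predict` and the toy data are hypothetical, and tie-breaking here is not guaranteed to match scikit-learn.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points
    nearest_labels = [label for _, label in sorted(distances)[:k]]
    # The most common label among those neighbours wins
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, query=(1.1, 1.0)))
print(knn_predict(points, labels, query=(5.1, 5.0)))
```

Note that there is no real training step: the model simply stores the training data and does all the work at prediction time.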

Initialize the model.<br>
Set the number of neighbors to 3.

In [None]:
# TODO: Assign number of neighbors, k=3

model = Classifier(n_neighbors)

Train the model by using train dataset.

In [None]:
# TODO: Call fit on the model with the training data


### Evaluation

In [None]:
predictions = model.predict(X_test)

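The function `metrics.confusion_matrix` summarizes the performance of the model as a confusion matrix: each row corresponds to a true class and each column to a predicted class, so correct predictions lie on the diagonal.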

In [None]:
print(metrics.confusion_matrix(y_test,predictions))

In [None]:
metrics.accuracy_score(y_test, predictions)
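
To see how the two numbers above relate, here is a small hand-rolled sketch. The function `build_confusion_matrix` and the labels are hypothetical; the layout follows the same convention as scikit-learn (rows are true classes, columns are predicted classes).

```python
def build_confusion_matrix(y_true, y_pred, classes):
    """Rows index the true class, columns index the predicted class."""
    position = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[position[true_label]][position[predicted_label]] += 1
    return matrix


# Hypothetical labels for six test samples across three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

cm = build_confusion_matrix(y_true, y_pred, classes=[0, 1, 2])
for row in cm:
    print(row)

# Accuracy is the diagonal (correct predictions) divided by the total count
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(y_true)
print(accuracy)
```

The single off-diagonal entry is the one sample of class 1 that was predicted as class 2, which is why the accuracy is 5 out of 6.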

## Classifying Glass Dataset from UCI Machine Learning Repository

### Load Data

Here, we load the glass data from the UCI ML Repository into a DataFrame using **pandas**.<br> `glass` stores the dataset and `description` stores the text describing the data.

In [None]:
glass = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data", 
    names=['ID','Refractive Index','Na','Mg','Al','Si','K','Ca','Ba','Fe','Class']
)
description = req.get("https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.names").text

In [None]:
print(description)

The `glass` dataset combines features and categories. From the description, we know that the features we are interested in are in columns **2 - 10** (column 1 is only an ID). <br>It is common practice for datasets to keep their **expected values/categories** in the last column, which is also the case here.<br><br> Using `iloc`, separate the data into:<br> `glass_data`, which contains the features <br>`glass_target`, which contains the expected values/categories.

In [None]:
glass_data = glass.iloc[:,1:-1]
glass_target = glass.iloc[:,-1]

Notice that the number of samples in each class varies widely. This is an example of what is called **imbalanced data**.<br><br>
There are a few ways to tackle this problem. Here, we choose a method called **oversampling**.<br><br>
**Oversampling** refers to increasing the number of data points in the minority classes.<br><br>
There are a few techniques for oversampling:
1. Random oversampling
2. SMOTE: Synthetic Minority Over-sampling Technique
3. ADASYN: Adaptive Synthetic Sampling

For more details about oversampling, refer to https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/.<br><br>
In this case, we are going to use `SMOTE`. Instead of duplicating existing rows, `SMOTE` creates new minority-class points by interpolating between neighbouring samples, which makes it less prone to overfitting than random oversampling. To oversample the data, we are going to use an external library called `imblearn`.<br><i>Note: `imblearn` is distributed on PyPI as `imbalanced-learn`. To install it, run `pip install imbalanced-learn` in the command line/terminal.</i>

In [None]:
!pip install imbalanced-learn

In [None]:
oversample = SMOTE()
glass_data, glass_target = oversample.fit_resample(glass_data,glass_target)
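
The core step of SMOTE can be sketched in a few lines: pick a minority-class sample, pick one of its nearest same-class neighbours, and place a new point somewhere on the line between them. The code below illustrates only that interpolation step with hypothetical values; it is not `imblearn`'s implementation, which also handles the neighbour search and decides how many points each class needs.

```python
import random


def synthetic_point(sample, neighbour, rng):
    """Return a random point on the line segment from `sample` to `neighbour`."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(0)
# Hypothetical minority-class sample and one of its nearest same-class neighbours
sample = [1.0, 2.0]
neighbour = [3.0, 6.0]

new_point = synthetic_point(sample, neighbour, rng)
print(new_point)
```

Because the same `gap` is applied to every feature, the new point always lies between two real samples of the same class rather than being an exact copy of either.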

Split `glass_data` into **test and train data**.<br>Test size = 0.3

In [None]:
# TODO: Split data into 70% training and 30% test
X_train2, X_test2, y_train2, y_test2 = model_selection.train_test_split(
    glass_data, glass_target, test_size=, random_state=123
)

Perform **feature scaling** on `X_train2` and `X_test2`, storing the results in **`X_train2_scaled`** and **`X_test2_scaled`** respectively.<br>
<i>Hint: use `fit_transform` on the training data and only `transform` on the test data, so that the scaling statistics come from the training set alone.</i>

In [None]:
# TODO: Replace {} with your answer to scale the data

scaler = preprocessing.{}()
X_train2_scaled = scaler.{}(X_train2)
X_test2_scaled = scaler.{}(X_test2)
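
The fit-on-train, transform-on-test pattern from the hint can be illustrated without any library. This sketch assumes z-score standardization (subtract the mean, divide by the standard deviation); the exercise may intend a different scaler, but the same two-step pattern applies. The function names and data are hypothetical.

```python
import statistics


def fit_standardizer(column):
    """Learn the mean and (population) standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)


def standardize(column, mean, std):
    """Apply previously learned statistics to any data."""
    return [(value - mean) / std for value in column]


# Hypothetical single-feature data
train_feature = [2.0, 4.0, 6.0, 8.0]
test_feature = [5.0, 10.0]

mean, std = fit_standardizer(train_feature)  # statistics come from training data only
train_scaled = standardize(train_feature, mean, std)
test_scaled = standardize(test_feature, mean, std)  # reuse them, never refit on test data

print(train_scaled)
print(test_scaled)
```

Scaling matters for KNN in particular because distances are dominated by whichever feature has the largest numeric range.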

### Model Training

Initialize a KNN model named `model_2` with `k=3`.

In [None]:
# TODO: Initialize KNN model

model_2 = (n_neighbors=)

In [None]:
# TODO: Fit data into the model to train the model

model_2.

### Evaluation

Predict the values for the test data, then compute the **`accuracy score`** and the **`confusion matrix`**.

In [None]:
prediction = model_2.predict(X_test2_scaled)
metrics.accuracy_score(y_test2,prediction)

In [None]:
metrics.confusion_matrix(y_test2,prediction)

Besides the accuracy score and the confusion matrix, **precision** and **recall** both provide insight into any classification model that you train. For a multi-class problem such as this one, they are computed per class, treating that class as positive and all other classes as negative.<br>
- **`Precision`**: the fraction of samples predicted as positive that are actually positive.
$$\text{Precision} = \frac{TP}{TP+FP}$$ 
where: <br>
$TP$ = True positive<br>
$FP$ = False positive<br><br>
- **`Recall`**: the fraction of actual positive samples that the model correctly identifies.
$$\text{Recall} = \frac{TP}{TP+FN}$$ 
where: <br>
$TP$ = True positive<br>
$FN$ = False negative<br>


In [None]:
print(metrics.recall_score(y_test2,prediction,average=None))

In [None]:
print(metrics.precision_score(y_test2,prediction,average=None))

In [None]:
print(metrics.classification_report(y_test2,prediction))
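
The per-class precision and recall values can be reproduced by hand by counting true positives, false positives and false negatives for one class at a time. The following is a minimal sketch with hypothetical labels; `precision_recall` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, positive):
    """Treat `positive` as the class of interest and all others as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for eight test samples across three classes
y_true = [1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 1, 3]

for label in sorted(set(y_true)):
    precision, recall = precision_recall(y_true, y_pred, positive=label)
    print(label, round(precision, 3), round(recall, 3))
```

In this toy example class 2 has perfect recall but imperfect precision, because every true class 2 sample was found while one class 1 sample was also labelled as class 2.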

Occasionally we want to check whether the model has overfit the training data. To do so, we can measure the accuracy of its predictions on the training data itself.<br><br>
Here we compare that result with the test accuracy above.<br><br>
If the training accuracy is close to the test accuracy, the model is well fit; a training accuracy that is much higher suggests overfitting.

In [None]:
print(metrics.accuracy_score(y_train2,model_2.predict(X_train2_scaled)))

## References
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. University of California. http://www.ics.uci.edu/~mlearn/MLRepository.html

Kohli, S. (2019, November 18). Understanding a Classification Report For Your Machine Learning Model. Retrieved August 06, 2020, from https://medium.com/@kohlishivam5522/understanding-a-classification-report-for-your-machine-learning-model-88815e2ce397