Topic 4: k Nearest Neighbours

## Task 4

Using the famous iris data set, suggest whether the setosa class is easily separable from the other two classes. Provide evidence for your answer.

### Aim

In this exercise we train a machine learning model on feature data from the iris dataset, then test how accurately it predicts which of the three flower varieties a set of features belongs to.

The inputs are recorded measurements of the flowers' features under the column headings: sepal_length, sepal_width, petal_length and petal_width. The varieties are setosa, versicolor and virginica under the heading class.

### scikit-learn

We will use scikit-learn, a machine learning library for Python. It is built on NumPy and SciPy and contains a wide range of tools for machine learning, including classification, regression, clustering, and dimension reduction. It is a popular choice for machine learning beginners because it has good documentation and a broad library of functions, which makes it an excellent platform for learning. The package is imported in Python under the shorter name `sklearn`. It is generally included with Anaconda, so there is no need to install it separately.

>Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

https://scikit-learn.org/stable/getting_started.html


### scikit-learn fit method

The scikit-learn `fit()` method trains the learning model on a specific dataset. A newly created model has only its default settings and no knowledge of flowers, so it cannot yet predict the type of a flower. Calling `fit()` lets the model learn the patterns in the training data, and it then uses those patterns to make predictions on new data. Most machine learning models require a training step with a fit method before they can be used to make predictions.
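As a rough illustration of the fit-then-predict pattern, here is a minimal sketch on made-up data. The names `X_toy`, `y_toy` and the labels `"short"` and `"long"` are hypothetical and are not part of the iris exercise.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature (a length in cm) and two made-up labels.
X_toy = [[1.4], [1.5], [1.3], [4.7], [5.1], [4.9]]
y_toy = ["short", "short", "short", "long", "long", "long"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_toy, y_toy)  # training step: the model learns from labelled examples

# Only after fit() can the model classify measurements it has not seen.
toy_predictions = model.predict([[1.6], [4.5]])
print(toy_predictions)
```

Calling `predict()` before `fit()` would raise an error, which is why the training step always comes first.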

## Start coding

In [1]:
#imports required to run the notebook code
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
import math

Let's have a look at the data and check for null values.

In [2]:
iris = pd.read_csv(r"data\iris.csv")
iris.head()

   sepal_length  sepal_width  petal_length  petal_width   class
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [3]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [4]:
iris.isnull().values.any()

False

The dataset is small, with just 150 entries, but it has no null (or `NaN`) values. This is good, because we don't need to worry about our results being skewed by empty cells. The `head()` output shows the column headings: the varieties are under the heading `class`, and the four feature headings are the ones we will use as model inputs.

To train and test a model on the data we first need to separate out the target variable. In this case we are looking to predict class, so we separate it out as `y`. The feature data we are using to predict the class is separated into the variable `X`. After that we split the dataset into training (70%) and testing (30%) sets using the `train_test_split()` function from scikit-learn.
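Alongside the null check, it can help to confirm how many rows each class has. The sketch below assumes that scikit-learn's bundled copy of the iris data (loaded with `load_iris()`) matches the CSV used in this notebook; the name `iris_check` is hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for the notebook's CSV file.
bunch = load_iris()
iris_check = pd.DataFrame(
    bunch.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
iris_check["class"] = bunch.target_names[bunch.target]

null_counts = iris_check.isnull().sum()               # nulls per column
class_counts = iris_check["class"].value_counts()     # rows per variety

print(null_counts)
print(class_counts)
```

A balanced dataset means overall accuracy is not dominated by one variety.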

In [5]:
# Separate out the target variable. We want to predict class, so it becomes y.
# The feature columns used to predict the class become X.
X = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = iris['class']

In [6]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
#sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

#Split the dataset into training (70%) and testing (30%) sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

Below, using `head()` again, we can see that the `X_train` and `X_test` sets have the same columns but contain different randomly selected rows. Also notice that the class we wish to predict is now missing. It has been separated into the `y` datasets.

In [7]:
X_train.head()

     sepal_length  sepal_width  petal_length  petal_width
56            6.3          3.3           4.7          1.6
33            5.5          4.2           1.4          0.2
15            5.7          4.4           1.5          0.4
120           6.9          3.2           5.7          2.3
109           7.2          3.6           6.1          2.5


In [8]:
X_test.head()

     sepal_length  sepal_width  petal_length  petal_width
31            5.4          3.4           1.5          0.4
128           6.4          2.8           5.6          2.1
92            5.8          2.6           4.0          1.2
132           6.4          2.8           5.6          2.2
8             4.4          2.9           1.4          0.2
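The split used in this notebook is shuffled without a fixed seed, so the selected rows differ on every run. As an optional variation (not what the notebook does), the sketch below fixes `random_state` and passes `stratify` so the split is repeatable and keeps the class proportions equal. It assumes scikit-learn's bundled iris data in place of the CSV, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data (integer labels 0, 1, 2) stands in for the CSV.
X_demo, y_demo = load_iris(return_X_y=True)

# random_state makes the shuffle repeatable; stratify keeps class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

test_class_counts = np.bincount(y_te)  # rows per class in the test set
print(len(X_tr), len(X_te), test_class_counts)
```

With stratification each variety contributes the same number of rows to the test set, which makes per-class scores easier to compare.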


K-Nearest Neighbors (KNN) is an algorithm that classifies a new data point based on the majority class of its k nearest neighbors in the training set.

The k-value is an important parameter in KNN. It determines the number of nearest neighbors considered. The best k-value differs from dataset to dataset and can be optimised by experimentation.

https://www.ibm.com/topics/knn What is the k-nearest neighbors algorithm?
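To make the majority-vote idea concrete, here is a minimal from-scratch sketch in plain Python. It is an illustration only, not how scikit-learn implements KNN. The function name `knn_predict` is hypothetical, and the points are (petal_length, petal_width) pairs taken from the rows printed above.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]

# (petal_length, petal_width) pairs from rows shown earlier in this notebook.
points = [(1.4, 0.2), (1.5, 0.4), (1.3, 0.2),
          (4.7, 1.6), (4.0, 1.2),
          (5.7, 2.3), (6.1, 2.5), (5.6, 2.1)]
labels = ["setosa", "setosa", "setosa",
          "versicolor", "versicolor",
          "virginica", "virginica", "virginica"]

small_flower = knn_predict(points, labels, (1.5, 0.3), k=3)
large_flower = knn_predict(points, labels, (5.8, 2.2), k=3)
print(small_flower, large_flower)
```

A larger k smooths out the effect of individual neighbours, while a smaller k makes the prediction more sensitive to single points.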

>The optimal K value usually found is the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multi-label classes, but you must be aware of the outliers.

Amey Band, "How to find the optimal value of K in KNN?", 23 May 2020. https://towardsdatascience.com/how-to-find-the-optimal-value-of-k-in-knn-35d936e554eb
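The square-root rule is only a starting point, and the quoted advice to compare accuracy across k values can be sketched as follows. This is an assumed approach using 5-fold cross-validation rather than a plot. It uses scikit-learn's bundled iris data in place of the CSV, and the names `mean_accuracy` and `best_k` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris data stands in for the notebook's CSV file.
X_cv, y_cv = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each odd k from 1 to 25.
mean_accuracy = {}
for k in range(1, 26, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_cv, y_cv, cv=5)
    mean_accuracy[k] = scores.mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
for k, acc in mean_accuracy.items():
    print(f"k={k:2d}  mean accuracy={acc:.3f}")
print("best k in this search:", best_k)
```

Odd values of k are used here to reduce the chance of tied votes. The same dictionary could be passed to a plotting function to produce the accuracy plot the quote mentions.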



In [9]:
# Rule-of-thumb starting point: k is the rounded square root of the number of samples
knn_value = round(math.sqrt(len(iris)))

In [10]:
# Create and fit a KNeighborsClassifier model
knn = KNeighborsClassifier(n_neighbors=knn_value)
knn.fit(X_train, y_train)
# Print the fitted model to confirm its settings (no need to fit a second time)
print(knn)

KNeighborsClassifier(n_neighbors=12)


In [11]:
# score() returns the fraction of correct test predictions; convert it to a percentage
accuracy = knn.score(X_test, y_test) * 100
accuracy

95.55555555555556

### Results

When the model is evaluated on the testing set, that set contains a mixture of all three classes. The model uses what it learned about each class during training to predict a class for every test row. As a result, the accuracy is a measure of its ability to correctly classify all three varieties of flowers, not just Iris setosa.

https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d

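In this case the model predicts a flower's variety or `class` with 95.56% accuracy on the test set. Because `train_test_split()` shuffles the rows without a fixed `random_state`, this figure changes slightly from run to run.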
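Since the split is random, a single accuracy figure depends on which rows land in the test set. The sketch below repeats the split with different seeds to show the spread. It assumes scikit-learn's bundled iris data in place of the CSV and reuses k = 12; the seeds and the name `accuracies` are hypothetical choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris data stands in for the notebook's CSV file.
X_rep, y_rep = load_iris(return_X_y=True)

accuracies = []
for seed in range(20):
    # Same 70/30 shuffled split as the notebook, but with a different seed each time.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_rep, y_rep, test_size=0.3, shuffle=True, random_state=seed
    )
    model = KNeighborsClassifier(n_neighbors=12).fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))

print(f"min={min(accuracies):.3f}  "
      f"mean={sum(accuracies) / len(accuracies):.3f}  "
      f"max={max(accuracies):.3f}")
```

Reporting the range or mean over several splits gives a fairer picture than one run.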


### Results for setosa

Let's look at the results more closely to see how accurately setosa can be predicted. For this we can use the scikit-learn `classification_report()` function. It outputs a report of our model's performance by giving scores for `precision`, `recall`, `F1-score`, and `support` for each class.

    Precision is the proportion of predictions of a class that are correct

    Recall is the proportion of actual members of a class that are correctly identified

    F1-score is the harmonic mean of precision and recall, a measure of the balance between them

    Support is the number of actual occurrences of each class in the test labels

https://www.kaggle.com/code/prashant111/knn-classifier-tutorial
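The four definitions above can be worked through by hand on a tiny made-up example. The function `per_class_metrics` and the six labels below are hypothetical and only illustrate the arithmetic.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Compute precision, recall, F1 and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true members of the class
    return precision, recall, f1, support

# Made-up labels: one versicolor flower is wrongly predicted as virginica.
true_labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
pred_labels = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

for name in ["setosa", "versicolor", "virginica"]:
    p, r, f1, s = per_class_metrics(true_labels, pred_labels, name)
    print(f"{name:<11} precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={s}")
```

In this toy case the single mistake lowers recall for versicolor and precision for virginica, while setosa is unaffected.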

In [12]:
y_pred = knn.predict(X_test)

In [13]:
#https://medium.com/@mehtashubh1029/iris-flower-classification-using-knn-1eef6e7f3f84
#The classification_report function builds a text report showing the main classification metrics. 
#https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       0.93      0.93      0.93        15
   virginica       0.93      0.93      0.93        15

    accuracy                           0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45
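A confusion matrix shows which classes are mistaken for which, which speaks directly to the question about setosa. The sketch below is a separate, self-contained run and not the notebook's own split. It assumes scikit-learn's bundled iris data, a fixed `random_state`, a stratified split and k = 12; the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris data stands in for the notebook's CSV file.
bunch = load_iris()
X_all = bunch.data
y_all = bunch.target_names[bunch.target]  # string labels such as "setosa"

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all
)
clf = KNeighborsClassifier(n_neighbors=12).fit(X_tr, y_tr)

class_order = ["setosa", "versicolor", "virginica"]
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te), labels=class_order)
print(cm)

setosa_missed = cm[0, 1:].sum()          # true setosa predicted as another class
others_called_setosa = cm[1:, 0].sum()   # other classes predicted as setosa
print("setosa missed:", setosa_missed)
print("others predicted as setosa:", others_called_setosa)
```

If both counts are zero, every error in the matrix is a confusion between versicolor and virginica.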



### Conclusions

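Using scikit-learn on the iris dataset we predicted a flower's variety or `class` with 95.56% accuracy on the test set. More importantly for this exercise, every `setosa` flower in the test set was classified correctly (precision and recall of 1.00), and the only errors were confusions between versicolor and virginica. This suggests that the setosa class is easily separable from the other two classes.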
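As further evidence that does not depend on any model, the petal length ranges of the classes can be compared directly. The sketch assumes scikit-learn's bundled iris data matches the CSV used here; the names `petal_ranges`, `setosa_max` and `others_min` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: bundled iris data stands in for the notebook's CSV file.
bunch = load_iris()
df = pd.DataFrame(
    bunch.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
df["class"] = bunch.target_names[bunch.target]

# Smallest and largest petal length recorded for each variety.
petal_ranges = df.groupby("class")["petal_length"].agg(["min", "max"])
print(petal_ranges)

setosa_max = petal_ranges.loc["setosa", "max"]
others_min = petal_ranges.drop("setosa")["min"].min()
print("gap between setosa and the other classes:", others_min - setosa_max)
```

If the longest setosa petal is shorter than the shortest petal of the other two varieties, a single threshold on `petal_length` is enough to separate setosa.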