# **Flower Species Classifier**
## A Supervised Machine Learning Classification Project
    By Karan Singh Solanki, as a ShapeAI ML Bootcamp project
---
### Algorithm used: [K-nearest neighbour (KNN)](https://youtu.be/SQ84-3uwKLk?t=3007)
### Dataset used: Iris plants dataset (`load_iris`) from `sklearn.datasets`
### Description...
- It takes measurements of various flowers: the length and width of each flower's petals and sepals (parts of a flower).
- Each flower is labelled with its species; there are 3 species (classes) in total.
- The model learns from these measurements and their species labels.

### The task: whenever measurements of a new flower are given to the model, it should predict that flower's species.


# **About the scikit-learn iris dataset and analyzing what's inside...**

Importing the required libraries

In [83]:
import numpy as np
import pandas as pd   # np and pd are the conventional aliases for these two libraries

Loading the dataset

In [7]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [8]:
print(f"Keys of iris_dataset: \n{iris_dataset.keys()}")

Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


In [17]:
short_desc = iris_dataset['DESCR'][:230]
print("Short Description: \n" + short_desc + "\n...")

Short Description: 
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    
...


In [97]:
# 3 classes of flowers
target_names = iris_dataset['target_names']
print(f"Target names (3 Classes of flowers):\n{target_names}")

Target names (3 Classes of flowers):
['setosa' 'versicolor' 'virginica']


In [77]:
# Features of the dataset (one per column of the data)
print(f"Feature names (description of each feature):\n{iris_dataset['feature_names']}")

Feature names (description of each feature):
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [76]:
data = iris_dataset['data']
print(f"Type of data: {type(data)}")   # numpy.ndarray
print(f"Shape of data: {data.shape}")  

print(f"{data.shape[0]} instances of flowers, each having the {data.shape[1]} features listed above")

Type of data: <class 'numpy.ndarray'>
Shape of data: (150, 4)
150 instances of flowers, each having the 4 features listed above


In [48]:
print(f"First 5 rows (data points/samples) of data:\n{data[:5]}")

First 5 rows (data points/samples) of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


In [52]:
target = iris_dataset['target']
print(f"Type of target: {type(target)}")
print(f"Shape of target: {target.shape}")  # 1D array: one label per flower

Type of target: <class 'numpy.ndarray'>
Shape of target: (150,)


In [61]:
print(f"Target (Species code):\n{target}")
print("\n0: setosa, 1: versicolor, 2: virginica")

Target (Species code):
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

0: setosa, 1: versicolor, 2: virginica
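
Since pandas is already imported, the same data can also be viewed as a labelled table. The following is a minimal sketch that is not part of the original notebook; the names `iris_df` and `species` are chosen here purely for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# One column per feature, one row per flower
iris_df = pd.DataFrame(iris_dataset["data"], columns=iris_dataset["feature_names"])

# Map the numeric codes (0, 1, 2) to species names by indexing target_names
iris_df["species"] = iris_dataset["target_names"][iris_dataset["target"]]

print(iris_df.head())
print(iris_df["species"].value_counts())
```

Each species should appear 50 times, matching the dataset description printed earlier.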


# **Actual Training and Testing begins here...**

### Dividing the whole data into training and testing data...
We divide the data into two parts: 75% is used to train the model and the remaining 25% is used to test it.

Training can be thought of as studying for an exam, and testing as the exam itself. The score (accuracy) on this test tells us how well the model performs on data it has not seen.

The `sklearn.model_selection` module has a function, `train_test_split()`, that splits the data for this purpose. By default it holds out 25% of the samples for testing.

Capital X represents the attributes (length and width of petals and sepals) and lowercase y represents the labels (flower species).

So we'll have X and y for both the train and test data, giving four variables...

X_train: attributes for training

X_test: attributes for testing

y_train: labels for training

y_test: labels for testing

In [80]:
from sklearn.model_selection import train_test_split
# data and target hold the values of the 'data' and 'target' keys loaded earlier
X_train, X_test, y_train, y_test = train_test_split(
    data, target, random_state = 0
)


### What does random_state = 0 do?
By default, train_test_split() shuffles the rows randomly before splitting them.
random_state = 0 fixes the seed of that shuffle, so the same rows land in the
training and test sets every time the code is run.
The shuffle itself matters because the rows are sorted by species. Without it
(shuffle=False), the training data would be the first 75% of the rows and the
test data the last 25%: training would see very few flowers of the third
species, and the test set would consist entirely of the third species.
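
To check how shuffling and `random_state` affect the split, here is a small sketch. It is an illustration rather than part of the original notebook, and the variable names with `_a`, `_b`, and `_sorted` suffixes are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
data, target = iris_dataset["data"], iris_dataset["target"]

# Same seed twice: the shuffled split is identical on every run
_, _, _, y_test_a = train_test_split(data, target, random_state=0)
_, _, _, y_test_b = train_test_split(data, target, random_state=0)
print(np.array_equal(y_test_a, y_test_b))

# No shuffling: rows are split in their original, species-sorted order
_, _, y_train_sorted, y_test_sorted = train_test_split(data, target, shuffle=False)
print(np.bincount(y_train_sorted))   # flowers per species in the training set
print(np.unique(y_test_sorted))      # species present in the test set
```

Without shuffling, the training set holds 50, 50, and 12 flowers of the three species, and the test set contains only species 2 (virginica).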

Let's see how our splitting worked...

In [84]:
print(f"X train shape: {X_train.shape}")  # 4 attributes
print(f"X test shape: {X_test.shape}")    # 4 attributes
print(f"y train shape: {y_train.shape}")  # 1 label
print(f"y test shape: {y_test.shape}")    # 1 label

X train shape: (112, 4)
X test shape: (38, 4)
y train shape: (112,)
y test shape: (38,)


### Building our model...
Using scikit-learn's K-Nearest Neighbours classifier. With n_neighbors=1, each prediction copies the label of the single closest training flower.

In [87]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

### Training our model created just above...

In [91]:
knn.fit(X_train, y_train)

# The model is trained on the X_train and y_train training data

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

### Training done, let's test our model (Make Predictions)...

In [93]:
# Predicting for a single flower with measurements [5, 2.9, 1, 0.2]
# We put it in a 2D array because scikit-learn expects 2D arrays (samples x features) for the data.

X_new = np.array([[5, 2.9, 1, 0.2]])
print(f"Shape of X_new: {X_new.shape}")

Shape of X_new: (1, 4)
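
If the measurements arrive as a flat list or 1D array instead, one way to get the 2D shape is `reshape(1, -1)`. This is a small sketch; the name `flower` is hypothetical and used only here.

```python
import numpy as np

# A single flower written as a flat (1D) array of 4 measurements
flower = np.array([5, 2.9, 1, 0.2])
print(flower.shape)        # (4,)

# reshape(1, -1) turns it into 1 row with as many columns as needed
X_new = flower.reshape(1, -1)
print(X_new.shape)         # (1, 4)
```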


In [100]:
prediction = knn.predict(X_new)

print(f"Prediction: {prediction}")
print(f"Predicted target (species) name: {target_names[prediction]}")

print(f"\nSo, according to our model, this new X_new flower is of type {target_names[prediction]}")

Prediction: [0]
Predicted target (species) name: ['setosa']

So, according to our model, this new X_new flower is of type ['setosa']
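
To make the prediction less of a black box, the sketch below reproduces it by hand for n_neighbors=1 and the default Euclidean distance (metric='minkowski' with p=2). It is an illustration only: scikit-learn may use faster internal data structures, and the names `distances`, `nearest_index`, and `manual_prediction` are invented for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)
X_new = np.array([[5, 2.9, 1, 0.2]])

# Euclidean distance from the new flower to every training flower
distances = np.sqrt(((X_train - X_new) ** 2).sum(axis=1))

# The label of the closest training flower is the 1-NN prediction
nearest_index = np.argmin(distances)
manual_prediction = y_train[nearest_index]

# Compare with scikit-learn's classifier
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(manual_prediction, knn.predict(X_new)[0])
```

Both values should agree: the closest training flower is a setosa (code 0).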


## Evaluating and testing our model

We use the 25% of the data held out earlier. We know the true species of these flowers, so we can compare them with the model's predictions and compute a score.

In [101]:
# Predict the species of every flower in X_test

y_pred = knn.predict(X_test)
print(f"Test set predictions:\n{y_pred}")

Test set predictions:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


### Score (Accuracy) of our Model...

y_test: The actual species of the flowers in the 25% test set

y_pred: The species predicted by our model for those flowers

`y_pred == y_test` compares the two arrays element by element, giving True where the prediction is correct and False where it is not.

`np.mean(y_pred == y_test)` treats True as 1 and False as 0, so the mean is the fraction of test flowers predicted correctly, i.e. the accuracy.
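
A tiny worked example shows how the mean of a comparison becomes an accuracy. The labels below are made up for illustration and are not taken from the iris data.

```python
import numpy as np

# Toy labels, invented for this example
actual = np.array([0, 1, 2, 1])
predicted = np.array([0, 1, 1, 1])

# Element-wise comparison: True where the prediction is correct
matches = predicted == actual
print(matches)            # [ True  True False  True]

# True counts as 1 and False as 0, so the mean is 3 correct out of 4
print(np.mean(matches))   # 0.75
```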

In [108]:
score = np.mean(y_pred == y_test)
print(f"Test set score: {score}")

Test set score: 0.9736842105263158


In [109]:
print(f"Accuracy in percentage: {round(score * 100, 2)} %")

Accuracy in percentage: 97.37 %
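
The same accuracy can also be obtained from scikit-learn directly, via the classifier's `score()` method or `sklearn.metrics.accuracy_score`. The sketch below rebuilds the model so it runs on its own; the names `manual_score`, `method_score`, and `metric_score` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Three equivalent ways to compute the fraction of correct predictions
manual_score = np.mean(y_pred == y_test)
method_score = knn.score(X_test, y_test)
metric_score = accuracy_score(y_test, y_pred)

print(manual_score, method_score, metric_score)
```

All three values should be identical, since each one is the fraction of test flowers classified correctly.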
