# <span style="color:darkred"> Explore Classification algorithms applied on the Iris Flower data set associated with Ronald Fisher</span>
***

## <span style="color:darkred">Importing Libraries for this notebook</span>
***

Before running any code in this notebook, I first import the libraries it needs. Libraries are collections of pre-written code, so the programmer does not have to write the same functionality again. Each library is imported and, using the ``as`` keyword, given a shorter alias. For example, ``import pandas as pd`` makes pandas available under the alias ``pd``. This keeps the code simpler and tidier when the libraries are used throughout the notebook.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 
import sklearn as sk
import numpy as np

## <span style="color:darkred"> Introduction</span>
***

The aim of this notebook is to explore classification algorithms by applying them to the well-known *Iris data set*.
- I will begin by explaining what supervised learning is and what classification algorithms are in machine learning.
- Next, I will describe some of the classification algorithms and demonstrate them on the *Iris data set* using the ``scikit-learn`` Python library.
- Throughout the notebook, I will use appropriate plots, mathematical notation and diagrams to explain relevant concepts.

# <span style="color:darkred">What is Supervised Learning?</span>
***

Before explaining supervised learning, I will first explain the concept of machine learning, since supervised learning is part of the machine learning family.

### <span style="color:darkred">Machine Learning</span>
According to [IBM, (n.d.)](https://www.ibm.com/topics/machine-learning)<sup>1</sup>, Machine Learning (ML) is a branch of Artificial Intelligence and computer science which uses data and algorithms to imitate the way humans learn, gradually improving its accuracy over time.

The article [*What is machine learning and how does it work? In-depth guide*](https://www.techtarget.com/searchenterpriseai/definition/machine-learning-ML#:~:text=In%2Ddepth%20guide,-Share%20this%20item&text=Machine%20learning%20(ML)%20is%20a,improve%20their%20performance%20over%20time.) by Tucci, L. (n.d.)<sup>2</sup> explains the concept of Machine Learning (ML). She describes it as a type of Artificial Intelligence (AI) which focuses on building computer systems that learn from data. Machine Learning algorithms take data that we already have as input and are trained to find relationships and patterns in it. The trained algorithms can then predict outputs, classify information, cluster data points, reduce dimensionality and help to generate new content or data.

Under the Machine Learning umbrella, according to [IBM, (n.d.)](https://www.ibm.com/topics/machine-learning)<sup>3</sup>, there are three main methods: *Supervised machine learning*, *Unsupervised machine learning* and *Semi-Supervised machine learning*. In this notebook, however, I will focus on the most common type, *Supervised machine learning*.

### <span style="color:darkred">Supervised Learning</span>
Supervised machine learning (or supervised learning) is a form of machine learning that uses a dataset in which we already have the inputs (X variables) and the corresponding outputs (y variables). Data that pairs inputs with known outputs is known as labeled data [DataCamp, (2022)](https://www.datacamp.com/blog/supervised-machine-learning)<sup>4</sup>.

This labeled input and output data then goes through a stage of *training*, during which the program learns the relationship between the input data and the output data [Shee, Ed, (2022)](https://www.seldon.io/supervised-vs-unsupervised-learning-explained)<sup>5</sup>. Once the program has been trained on the training data, it can be used to predict the output (y) from given inputs (X) for data that it has not been trained on. For example, in the case of the *Iris data set*, once the program has been trained and someone came along with measurements for more iris flowers, the trained algorithm could be used to predict which of the three species of iris (y) each flower belongs to (i.e. Setosa, Versicolor or Virginica), given the input (X) data (i.e. sepal length and width, and petal length and width).
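As a rough illustration of the train-then-predict idea described above, the sketch below fits a classifier on labeled iris data and asks it to classify one flower it has not seen. It makes several assumptions that are not part of this notebook so far: it uses the copy of the data set bundled with ``scikit-learn`` (``load_iris``) rather than the Iris.csv file, it uses k-nearest neighbours purely as an example algorithm, and the measurements in ``new_flower`` are made up for the demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: use scikit-learn's bundled copy of the Iris data set
iris = load_iris()
X = iris.data    # inputs: sepal length/width and petal length/width (cm)
y = iris.target  # outputs: species encoded as 0, 1 or 2

# Training: the model learns the relationship between X and y
# (k-nearest neighbours is only an illustrative choice of algorithm)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)

# Prediction: a hypothetical new flower, measurements invented for this sketch
new_flower = [[5.0, 3.4, 1.5, 0.2]]
predicted_index = model.predict(new_flower)[0]
predicted_species = str(iris.target_names[predicted_index])
print(predicted_species)
```

The short petal in ``new_flower`` is typical of the Setosa species, so that is the label the model should return here.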

### <span style="color:darkred">What are classification algorithms?</span>
While there are many different definitions of classification in machine learning, according to [DataCamp, (2022)](https://www.datacamp.com/blog/classification-machine-learning)<sup>6</sup> classification is where the model tries to predict the correct output label for some given input data. The model is fully trained on a training set of data and evaluated on a test set before it is used to make predictions on new data that it has not seen before.
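The training set and test set mentioned above can be sketched as follows. This is a minimal example under stated assumptions rather than the method used later in the notebook: the 75/25 split, the ``random_state`` value and the k-nearest neighbours classifier are all arbitrary choices for illustration, and the data comes from ``scikit-learn``'s bundled ``load_iris``.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
features, labels = load_iris(return_X_y=True)

# Hold back 25% of the samples as a test set (split size chosen arbitrarily);
# stratify keeps the three species equally represented in both sets
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)

# Train on the training set only
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Evaluate on the test set, which the model has not seen during training
test_accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Scoring on held-back data gives an estimate of how the model is likely to behave on genuinely new flowers, which accuracy on the training data alone cannot show.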

According to [Shiksha Online, (2023)](https://www.shiksha.com/online-courses/articles/predicting-categorical-data-using-classification-algorithms/)<sup>7</sup>, the output predicted by a classification algorithm must be a categorical variable. A categorical variable is a variable that has a limited number of possible values or categories, and can be either nominal or ordinal [IBM, (2021)](https://www.ibm.com/docs/en/spss-statistics/27.0.0?topic=charts-variable-types)<sup>8</sup>. In the case of the Iris data set, the output variable (the species) is categorical and nominal: there is no ranking among its values.
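To make the idea of a nominal categorical variable concrete, the sketch below stores a few species labels as a pandas ``category`` column. The four labels are a hand-typed sample for illustration only, not the full data set.

```python
import pandas as pd

# A small hand-made sample of species labels (illustration only)
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
    dtype="category",
)

# The limited set of possible values
categories = species.cat.categories.tolist()

# False means the categories are nominal: pandas attaches no ranking to them
is_ordered = species.cat.ordered

# Integer codes are just identifiers for each category, not a ranking
codes = species.cat.codes.tolist()

print(categories)
print(is_ordered)
print(codes)
```

The integer codes are assigned alphabetically by category name, so they carry no meaning about one species being "greater" than another.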

## <span style="color:darkred">The Iris data set</span>
***

The data set contains 150 rows or samples in total: 50 observations of each of the three species of Iris flower, *Iris setosa, Iris versicolor and Iris virginica*. Each sample has four characteristics or attributes: *sepal length and width, and petal length and width*.
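The counts quoted above can be checked quickly in code. This sketch assumes the copy of the data set that ships with ``scikit-learn`` (``load_iris``), which may differ in naming from the Iris.csv file used later in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the Iris data set
iris = load_iris()

# Number of rows (samples) and number of measured attributes
n_samples, n_features = iris.data.shape

# Count how many samples belong to each species
counts = np.bincount(iris.target)
samples_per_species = {
    str(name): int(count) for name, count in zip(iris.target_names, counts)
}

print(n_samples, n_features)
print(samples_per_species)
```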

</br>

|Setosa|Virginica|Versicolor|
|:-:|:-:|:-:|
|![Setosa](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Irissetosa1.jpg/640px-Irissetosa1.jpg)|![Virginica](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Iris_virginica_2.jpg/480px-Iris_virginica_2.jpg)|![Versicolor](https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Blue_Flag%2C_Ottawa.jpg/480px-Blue_Flag%2C_Ottawa.jpg)|

Image Sources:</br>
https://en.wikipedia.org/wiki/Iris_setosa <br/>
https://en.wikipedia.org/wiki/Iris_versicolor <br/>
https://en.wikipedia.org/wiki/Iris_virginica </br>
</br>

This data set was first introduced in [Ronald Fisher's](https://en.wikipedia.org/wiki/Ronald_Fisher)<sup>9</sup> 1936 paper [*The use of multiple measurements in taxonomic problems*](https://onlinelibrary.wiley.com/doi/epdf/10.1111/j.1469-1809.1936.tb02137.x) <sup>10</sup> and is commonly known as *Fisher's Iris data set* for this reason.

</br>

<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/a/aa/Youngronaldfisher2.JPG"
         height="250"
         alt="Ronald Fisher">
    <figcaption>Sir Ronald Fisher</figcaption>
    </br>
    <figcaption>Image source: https://en.wikipedia.org/wiki/Ronald_Fisher</figcaption>
</figure>

</br>

However, it is also sometimes known as *Anderson's Iris data set* as it was [Edgar Anderson](https://en.wikipedia.org/wiki/Edgar_Anderson)<sup>11</sup> who collected the data for the compilation of the data set [Wikipedia, (2023)](https://en.wikipedia.org/wiki/Iris_flower_data_set)<sup>12</sup>.

</br>

<figure>
    <img src="http://thedailygardener.org/wp-content/uploads/2019/12/Edgar-Shannon-Anderson-1.jpg"
         height="275"
         alt="Edgar Shannon Anderson">
    <figcaption>Edgar Shannon Anderson</figcaption>
    </br>
    <figcaption>Image source: https://thedailygardener.org/otb20190618/ </figcaption>
</figure>



## <span style="color:darkred">Reading in the iris.csv data set</span>

For this notebook, I have sourced a version of the Iris data set from [Kaggle. UCI Machine Learning](https://www.kaggle.com/datasets/uciml/iris/)<sup>13</sup>.

Using [Pandas](https://pandas.pydata.org/docs/)<sup>14</sup> which is the first of the libraries imported above, I have read in the Iris.csv file. I have given the path to the *data* folder in which it is saved so that Pandas can locate the .csv file.

Once read in with pandas, the data is held in a **dataframe**, which is two-dimensional in structure. It contains the data and its column labels (*Id*, *SepalLengthCm*, *SepalWidthCm*, *PetalLengthCm*, *PetalWidthCm* and *Species*), along with a row label (the index) for each row.

Calling ``df`` then shows the first five and the last five rows of data in the dataframe.

The dataframe rows are indexed from 0 through to 149, as shown in the first 'column' below. The index is not part of the data in the .csv file; it is added by pandas to label the rows. Because the index starts at 0, the 150 rows of data have index values running from 0 to 149.

Similarly, the columns are indexed starting from 0, although these positions are not shown on the dataframe. So in the dataframe below, the *Id* column is index 0, the *SepalLengthCm* column is index 1, and so on up to *Species*, which is index 5.

The dataframe is made up of four input variables: (*SepalLengthCm*, *SepalWidthCm*, *PetalLengthCm*, *PetalWidthCm*) and one output variable: (*Species*). 

The *Id* Column doesn't serve a purpose as such for this study and will be removed from the dataframe later on.

In [2]:
# Read in the .csv dataset from the data folder 
df = pd.read_csv('data/Iris.csv')

# Show the dataframe
df

| |Id|SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|Species|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|0|1|5.1|3.5|1.4|0.2|Iris-setosa|
|1|2|4.9|3.0|1.4|0.2|Iris-setosa|
|2|3|4.7|3.2|1.3|0.2|Iris-setosa|
|3|4|4.6|3.1|1.5|0.2|Iris-setosa|
|4|5|5.0|3.6|1.4|0.2|Iris-setosa|
|...|...|...|...|...|...|...|
|145|146|6.7|3.0|5.2|2.3|Iris-virginica|
|146|147|6.3|2.5|5.0|1.9|Iris-virginica|
|147|148|6.5|3.0|5.2|2.0|Iris-virginica|
|148|149|6.2|3.4|5.4|2.3|Iris-virginica|
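The row and column indexing described in this section can be checked on a small stand-in dataframe. The sketch below builds ``sample_df`` (a hypothetical name) from the first three rows of the data so that it runs without the Iris.csv file; the same calls should work on ``df`` itself. It also shows one possible way of dropping the *Id* column, which is an assumption about how that later step might be done.

```python
import pandas as pd

# Stand-in for df: the first three rows of the data set, typed in by hand
sample_df = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "SepalLengthCm": [5.1, 4.9, 4.7],
        "SepalWidthCm": [3.5, 3.0, 3.2],
        "PetalLengthCm": [1.4, 1.4, 1.3],
        "PetalWidthCm": [0.2, 0.2, 0.2],
        "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
    }
)

# Row labels are added by pandas and start at 0
row_labels = sample_df.index.tolist()

# Column positions also start at 0
id_position = sample_df.columns.get_loc("Id")
species_position = sample_df.columns.get_loc("Species")

# One possible way to remove the Id column (returns a new dataframe)
without_id = sample_df.drop(columns="Id")

print(row_labels)
print(id_position, species_position)
print(without_id.columns.tolist())
```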


## <span style="color:darkred">References</span>
***

1. IBM, (n.d), What is machine learning </br>
https://www.ibm.com/topics/machine-learning

2. Tucci, Linda, (n.d.). What is machine learning and how does it work? In-depth guide </br>
https://www.techtarget.com/searchenterpriseai/definition/machine-learning-ML#:~:text=In%2Ddepth%20guide,-Share%20this%20item&text=Machine%20learning%20(ML)%20is%20a,improve%20their%20performance%20over%20time.

3. IBM, (n.d), What is machine learning </br>
https://www.ibm.com/topics/machine-learning

4. DataCamp, (August 2022). Supervised Machine Learning </br>
https://www.datacamp.com/blog/supervised-machine-learning

5. Shee, Ed, (September 16, 2022). Supervised vs Unsupervised Learning Explained </br>
https://www.seldon.io/supervised-vs-unsupervised-learning-explained

6. DataCamp, (September 2022). Classification in Machine Learning: An Introduction </br>
https://www.datacamp.com/blog/classification-machine-learning

7. Shiksha Online, (January 27, 2023). Predicting Categorical Data Using Classification Algorithms </br>
https://www.shiksha.com/online-courses/articles/predicting-categorical-data-using-classification-algorithms/

8. IBM, (February 02, 2021). Variable Types. </br>
https://www.ibm.com/docs/en/spss-statistics/27.0.0?topic=charts-variable-types

9. Wikipedia, (2023). Ronald Fisher. </br> https://en.wikipedia.org/wiki/Ronald_Fisher

10. Wiley Online Library, (2023). THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS </br> https://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.1936.tb02137.x

11. Wikipedia, (2023). Edgar Anderson. </br> https://en.wikipedia.org/wiki/Edgar_Anderson

12. Wikipedia, (2023). *Iris* flower data set. </br> https://en.wikipedia.org/wiki/Iris_flower_data_set

13. kaggle. UCI Machine Learning. Iris Species </br>
https://www.kaggle.com/datasets/uciml/iris/

14. pandas, (November 10, 2023). pandas documentation </br>
https://pandas.pydata.org/docs/

</br>

### <span style="color:darkred">Markdown References</span>
1. Ramalingam. Aravind. medium.com (June 10, 2021) 7 Advanced Markdown Tips! </br>
https://medium.com/analytics-vidhya/7-advanced-markdown-tips-5a031620bf52

2. Markdown Guide, (2023) Image Captions </br>
https://www.markdownguide.org/hacks/



# <span style="color:darkred">End</span>
***