## Exploring Classification Algorithms with the Iris Dataset

### TABLE OF CONTENTS

### Introduction

### Supervised Learning
- What is supervised learning
- labelled data
- relevance/implementation/importance of supervised learning in machine learning

## Understanding the Iris Dataset


### Overview of the Iris Dataset

Although it is often referred to as the Fisher Iris Dataset, the raw data behind the famous iris data set was in fact collected in the 1930s by [Edgar Shannon Anderson](https://en.wikipedia.org/wiki/Edgar_Anderson) (1897 - 1969). Having secured a fellowship to study at the John Innes Horticultural Institute in Britain, Anderson met the statistician [Sir Ronald Aylmer Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher), who went on to publish a paper in 1936 proposing a methodological framework to delineate 'desirable' traits, based on the measurements originally gathered by Anderson.

There are between 200 and 300 species within the [iris](https://sites.berry.edu/cborer/inventory/iris/) genus, so assigning a plant to this particular genus can be challenging. Most irises do share some characteristics, however, first among them the presence of six 'petals'. The inner three are true petals and are referred to as “standards”, while the outer three are sepals, often mistaken for petals, and are called “falls”. Sepals serve as protection for the flower in bud and as support for the petals when in bloom.

### Attributes of the Iris Data set

The [Iris Flower Data Set](https://en.wikipedia.org/wiki/Iris_flower_data_set) records four measurements of floral morphology on 150 plants - 50 individuals for each of three species (*Iris versicolor*, *Iris setosa*, and *Iris virginica*). The numeric parameters which the dataset contains are sepal width, sepal length, petal width and petal length. Classification accuracy is the ratio of the number of correct predictions to the total number of input samples. It is most informative when each class has an equal number of samples, which is the case here.
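To illustrate why balance matters for accuracy, here is a minimal sketch in plain Python. The `accuracy` helper and the imbalanced label lists are hypothetical and made up purely for illustration. Only the 50/50/50 split reflects the Iris dataset itself.

```python
def accuracy(y_true, y_pred):
    """Ratio of correct predictions to the total number of samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical imbalanced labels: 95 samples of one class, 5 of another.
imbalanced_true = ["setosa"] * 95 + ["virginica"] * 5
always_majority = ["setosa"] * 100
imbalanced_acc = accuracy(imbalanced_true, always_majority)

# Balanced labels, as in Iris: 50 samples per class.
balanced_true = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
always_setosa = ["setosa"] * 150
balanced_acc = accuracy(balanced_true, always_setosa)

print(f"imbalanced: {imbalanced_acc:.2f}, balanced: {balanced_acc:.2f}")
```

A "classifier" that always predicts the same class scores 0.95 on the imbalanced labels but only about 0.33 on the balanced ones, so on balanced data a high accuracy is harder to obtain by accident.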

Each row in the Iris dataset describes one flower, for which there are four separate measurements - the length and width of the sepals and the length and width of the petals. The 5th column is the species of iris: *setosa*, *versicolor*, or *virginica*.

The Fisher data set is often described as the 'Hello World' of machine learning, useful for practicing basic machine learning algorithms. It endures because the data is open source, its accuracy and origin are both known, it is 'real' data and, with three types of flower, it allows for more than just binary classification. Additionally, with an even 50 samples in each class it is balanced, and it has no null or missing values. All measurements are in the same unit (cm), so little normalisation is called for, and the file is neither unwieldy in size nor excessively complicated.

### Data Exploration
- load the Iris dataset - scikit-learn
- basic statistics and information about the dataset
- visualise the dataset using plots (scatter plots, histograms)
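The loading and summary steps in this outline could look something like the sketch below. It assumes scikit-learn and pandas are installed and uses scikit-learn's bundled copy of the data via `load_iris`. The `species` column is a name I have added for readability and is not part of the original data frame.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled Iris data as a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame  # four measurement columns plus an integer 'target' column

# Add a readable species column (an illustrative addition, not in the raw frame).
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.shape)                      # rows and columns
print(df.describe())                 # basic statistics for the numeric columns
print(df["species"].value_counts())  # number of samples per class
print(df.isna().sum().sum())         # total count of missing values
```

The class counts and the missing-value total give a quick check on the claims that the dataset is balanced and complete.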

### Classification Algorithms

#### What is Classification?
Classification is the process of recognising, understanding, and grouping ideas and objects into preset categories. Using pre-categorised training datasets, machine learning programs apply algorithms to classify future datasets into such categories. In machine learning, algorithms use the training data to predict the likelihood that subsequent data will fall into one of the predetermined categories (Dutta, 2022).

- explain goal of classification (assigning labels to data points)

#### Common Classification Algorithms
Below I have chosen six of the most commonly applied ML algorithms, given a brief overview of how they work and listed the advantages and disadvantages of each.

#### Logistic Regression
Logistic regression aims to solve classification problems by predicting categorical outcomes, usually where there are two outcomes (binomial) (W3schools).

Rout, A. (2022) lists the main advantages of logistic regression: it is easy to implement and interpret and efficient to train; it makes no assumptions about the distributions of classes; and it extends to multinomial regression and offers a natural probabilistic view of class predictions. Logistic regression provides a measure of how appropriate a predictor is and also of its direction of association (positive or negative). It demonstrates good accuracy for many simple data sets and performs well where a dataset is linearly separable. Its model coefficients can be interpreted as indicators of feature importance, and it is less inclined to overfitting.

As the author goes on to explain, the main disadvantages of logistic regression lie in its binary nature in the basic form and in its assumption of a linear relationship between the independent variables and the log-odds of the outcome. It can only be used to predict discrete outcomes, it cannot solve non-linear problems, and it requires little or no multicollinearity between independent variables. It can be difficult to identify and explain complex relationships using logistic regression - more powerful algorithms, such as Neural Networks, are better suited to such scenarios.
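As a minimal sketch of the points above, the example below fits scikit-learn's `LogisticRegression` to the three-class Iris data. The split size, `random_state` and `max_iter` values are my own assumptions, chosen only so the example runs reproducibly and converges.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples, keeping the class proportions equal (assumed settings).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# max_iter is raised from the default so the solver converges on unscaled data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# One row of coefficients per class, one column per feature; the sign of each
# coefficient shows the direction of association for that feature.
print(clf.coef_.shape)

# Probabilistic view: class probabilities for the first test flower.
probs = clf.predict_proba(X_test[:1])
print(probs.round(3))

# Accuracy on the held-out samples.
print(clf.score(X_test, y_test))
```

Because Iris has three species, this also shows logistic regression extending beyond the two-outcome case.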

- pick one algorithm for detailed exploration [chosen algorithm]

#### [chosen algorithm]
- explain the concept of [chosen algorithm]
- discuss how [chosen algorithm] works for classification
- Gini impurity and entropy as criteria for splitting nodes?
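If the chosen algorithm turns out to be tree-based (an assumption on my part, as the outline has not yet fixed one), the two splitting criteria named above can be sketched in plain Python. The helper names `gini` and `entropy` are hypothetical.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


pure_node = ["setosa"] * 50
mixed_node = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

print(gini(pure_node), entropy(pure_node))    # both 0.0 for a single-class node
print(gini(mixed_node), entropy(mixed_node))  # about 0.667 and 1.585
```

Both measures are zero for a node containing a single class and largest when the classes are evenly mixed, which is why a tree prefers the split that reduces them the most.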

#### Implementation of [chosen algorithm]
- import necessary libraries
- split dataset into training and testing sets
- create [chosen algorithm] classifier
- train the classifier on the training data
- evaluate the classifier's performance using accuracy, precision, recall, and F1-score
- visualize the [chosen algorithm]
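The steps listed above might be sketched as follows. Since the algorithm has not been chosen yet, I am using scikit-learn's `DecisionTreeClassifier` purely as a stand-in. The split size, `max_depth` and `random_state` values are likewise assumptions, and the text rendering from `export_text` is only one of several ways to visualise a tree.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Split the dataset into training and testing sets (70/30, stratified by species).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=42
)

# Create and train the stand-in classifier; max_depth=3 keeps the tree readable.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out data: accuracy plus per-class precision, recall and F1.
y_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {test_accuracy:.3f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# A plain-text view of the fitted tree's decision rules.
print(export_text(clf, feature_names=list(iris.feature_names)))
```

Swapping in a different classifier should only require changing the import and the line that constructs `clf`, although `export_text` applies to decision trees only.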

### Model Evaluation

#### Performance Metrics
- explain the importance of performance metrics in model evaluation
- define accuracy, precision, recall, and F1-score
- calculate and interpret metrics for the [chosen algorithm] model
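To make the definitions concrete, here is a small sketch that computes precision, recall and F1-score for one class from first principles. The `per_class_metrics` helper and the six toy labels are hypothetical and are not results from any model in this notebook.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the 'positive' label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0  # how many predicted positives were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many actual positives were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: one versicolor flower is misclassified as virginica.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

for species in ("setosa", "versicolor", "virginica"):
    precision, recall, f1 = per_class_metrics(y_true, y_pred, species)
    print(f"{species}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The single mistake lowers recall for *versicolor* (one of its flowers was missed) and precision for *virginica* (one of its predictions was wrong). An overall accuracy figure alone would not show that difference.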

#### Confusion Matrix
- what is a confusion matrix
- create and visualize the confusion matrix for the [chosen algorithm] model
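As a sketch of what a confusion matrix looks like, the example below uses `confusion_matrix` from scikit-learn on six hypothetical toy labels (not model output). Rows correspond to the true species and columns to the predicted species.

```python
from sklearn.metrics import confusion_matrix

labels = ["setosa", "versicolor", "virginica"]

# Toy labels: one versicolor flower is misclassified as virginica.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

# Passing labels= fixes the row and column order explicitly.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]
```

Correct predictions sit on the diagonal, and the single off-diagonal entry shows the *versicolor* flower that was predicted as *virginica*. For a plotted version, scikit-learn also provides `ConfusionMatrixDisplay`, which requires matplotlib.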

### Comparison of Classification Algorithms

#### Implementing Other Algorithms
- choose another classification algorithm for comparison
- implement and train selected algorithm
- evaluate performance using the same metrics as for [chosen algorithm]

#### Model Comparison
- compare the performance of the [chosen algorithm] model and the second algorithm
- discuss the strengths and weaknesses of each algorithm
- consider scenarios - one algorithm preferable over the other?
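One possible way to set up the comparison is sketched below, using 5-fold cross-validation so that both models are scored on the same folds. The pairing of logistic regression with a decision tree is an assumption, as neither algorithm has been fixed in this outline, and the hyperparameters are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The same stratified folds are reused for every model, for a like-for-like comparison.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Assumed candidate models; replace with the two algorithms actually chosen.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# cross_val_score returns one accuracy value per fold for classifiers by default.
results = {name: cross_val_score(model, X, y, cv=cv) for name, model in models.items()}

for name, scores in results.items():
    print(f"{name}: mean accuracy={scores.mean():.3f} (std={scores.std():.3f})")
```

With only 150 samples, small differences in mean accuracy are unlikely to be meaningful, so the spread across folds is worth reporting alongside the mean.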

### Conclusion
