# Classification of Iris Species using Machine Learning algorithms

## About the Iris Dataset

The Iris dataset is one of the best-known and earliest datasets in the literature on classification methods and machine learning. It was used in R.A. Fisher's landmark 1936 paper _"The Use of Multiple Measurements in Taxonomic Problems"_. In that article, Fisher demonstrated how to use multiple features simultaneously to best classify objects belonging to different groups. This statistical method is now known as **Linear Discriminant Analysis (LDA)**.

The Iris dataset includes 150 observations describing three species of irises:

- Iris Setosa
- Iris Versicolor
- Iris Virginica

The data describe four features of these flowers:

- Sepal Length
- Sepal Width
- Petal Length
- Petal Width

## Purpose of the project

- Conducting Exploratory Data Analysis (EDA) to understand the structure of the dataset and the relationships between features
- Assessing the suitability of the dataset for the application of classification algorithms
- Evaluating the quality of classification models using various metrics and identifying the model that best classifies the iris species
- Tuning the hyperparameters of the selected model to increase its effectiveness

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Data Handling

In [2]:
df = pd.read_csv('../data/Iris.csv')

#### Data preview

In [3]:
df.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


The dataset contains six columns:  
- **Id** - unique identifier for each observation  
- **SepalLengthCm**, **SepalWidthCm**, **PetalLengthCm**, **PetalWidthCm** - the four features: length and width of the sepal and petal, in centimeters  
- **Species** - categorical target variable with three classes
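
Because `Id` is only a row identifier, a common next step is to keep it out of the feature matrix and to separate the features from the target. The following is a minimal sketch of that split, not necessarily the approach taken later in this project. It rebuilds the five preview rows shown above so that it runs without the CSV file, and the names `preview`, `feature_cols`, `X`, and `y` are illustrative assumptions.

```python
import pandas as pd

# The five preview rows from df.head(), rebuilt inline so the sketch is self-contained.
preview = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.6],
    "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 1.4],
    "PetalWidthCm": [0.2, 0.2, 0.2, 0.2, 0.2],
    "Species": ["Iris-setosa"] * 5,
})

# Assumption: Id carries no predictive information, so it is excluded from the features.
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
X = preview[feature_cols]  # numeric feature columns
y = preview["Species"]     # categorical target column

print(X.shape)
print(y.unique())
```

On the full dataset the same selection could be applied to `df` in place of `preview`.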

### Exploratory Data Analysis (EDA)

In [4]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    str    
dtypes: float64(4), int64(1), str(1)
memory usage: 7.2 KB


The data contain no missing values: each of the six columns has 150 non-null entries.
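
`df.info()` shows non-null counts, from which the absence of missing values is read off indirectly. A more direct check counts missing cells per column with `isna().sum()`. The sketch below runs on the first three preview rows so that it does not depend on the CSV file; the names `sample`, `missing_per_column`, and `has_missing` are assumptions made for illustration.

```python
import pandas as pd

# First three preview rows, rebuilt inline so the sketch is self-contained.
sample = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 4.9, 4.7],
    "SepalWidthCm": [3.5, 3.0, 3.2],
    "PetalLengthCm": [1.4, 1.4, 1.3],
    "PetalWidthCm": [0.2, 0.2, 0.2],
    "Species": ["Iris-setosa"] * 3,
})

# Number of missing cells in each column.
missing_per_column = sample.isna().sum()

# Single flag: True if at least one cell anywhere in the frame is missing.
has_missing = bool(sample.isna().any().any())

print(missing_per_column)
print(has_missing)
```

Applied to the full `df`, the same call should report zero for every column, consistent with the 150 non-null entries listed by `info()`.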

In [7]:
df.groupby('Species').count()

| Species | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| Iris-setosa | 50 | 50 | 50 | 50 | 50 |
| Iris-versicolor | 50 | 50 | 50 | 50 | 50 |
| Iris-virginica | 50 | 50 | 50 | 50 | 50 |


The three classes are balanced, with 50 observations each.
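
The same conclusion can be reached more compactly with `value_counts()` on the target column, which returns one count per class instead of one per column. The sketch below is an alternative to the `groupby` call, not the notebook's own code. It rebuilds the `Species` labels from the counts reported above so that it runs on its own, and the names `species`, `class_counts`, `class_share`, and `is_balanced` are assumptions.

```python
import pandas as pd

# Species labels rebuilt from the counts reported above (50 per class),
# so the sketch does not depend on the CSV file.
species = pd.Series(
    ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50,
    name="Species",
)

class_counts = species.value_counts()               # absolute count per class
class_share = species.value_counts(normalize=True)  # fraction of rows per class

# Balanced here means every class has the same number of observations.
is_balanced = class_counts.nunique() == 1

print(class_counts)
print(class_share.round(3))
print(is_balanced)
```

On the loaded data, `df['Species'].value_counts()` would take the place of the rebuilt `species` series.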