# Exploratory Data Analysis on Iris Dataset
*By Bhavya Bhargava*

The Iris dataset is a classic dataset in machine learning and statistics. It consists of 150 samples of iris flowers from three species: Iris-setosa, Iris-versicolor, and Iris-virginica. Its simplicity and well-defined structure make it a common choice for classification, clustering, pattern recognition, and EDA tasks.

### About the Dataset:
#### Dataset Source:  
The Iris dataset was introduced by the British statistician Ronald A. Fisher in his 1936 paper on linear discriminant analysis. It is publicly available through libraries such as `scikit-learn` and from the UCI Machine Learning Repository. Here we load it from `scikit-learn`.

#### Features:  
1. **Sepal Length (cm)**: Length of the sepal in centimeters.  
2. **Sepal Width (cm)**: Width of the sepal in centimeters.  
3. **Petal Length (cm)**: Length of the petal in centimeters.  
4. **Petal Width (cm)**: Width of the petal in centimeters.  
5. **Species**: Target variable indicating the iris flower species (Iris-setosa, Iris-versicolor, Iris-virginica).  


In [1]:
# Importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the Iris data from Scikit-Learn

from sklearn.datasets import load_iris

# For getting the plots inline
%matplotlib inline

We start by exploring the structure of the Iris data returned by the scikit-learn package, then convert it to a DataFrame for further processing.

In [7]:
# checking the structure of the data from load_iris() function
iris = load_iris()

iris.keys()


dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
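The `frame` key listed above is only populated on request. As a minimal sketch (the variable names `iris_default` and `iris_as_frame` are illustrative and not part of the original notebook), the `as_frame=True` option of `load_iris()` returns the data already packaged as a pandas DataFrame:

```python
from sklearn.datasets import load_iris

# With the default as_frame=False, the 'frame' entry is left empty (None).
iris_default = load_iris()
print(iris_default.frame is None)

# With as_frame=True, 'frame' is a DataFrame holding the four feature
# columns plus a numeric 'target' column.
iris_as_frame = load_iris(as_frame=True)
print(iris_as_frame.frame.shape)
print(list(iris_as_frame.frame.columns))
```

This walkthrough builds the DataFrame by hand instead, which keeps each step of the conversion visible.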

Now that we know the structure of the dictionary-like object carrying the data, we can inspect the contents of the most relevant keys: `target_names`, `feature_names`, and `target`.

In [8]:
# Checking the target_names
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


In [14]:
# Checking the feature_names
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [15]:
# Checking the values for the target
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
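Printing the values shows what each key contains. To also see the type and shape behind each key, something like the following sketch could be used. It assumes the arrays are NumPy arrays, as returned by `load_iris()` with default settings, and `iris_bunch` is an illustrative name:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_bunch = load_iris()  # illustrative name, separate from the notebook's `iris`

# Type, shape, and dtype of the arrays behind the relevant keys
for key in ("data", "target", "target_names"):
    value = iris_bunch[key]
    print(key, type(value).__name__, value.shape, value.dtype)

# Number of samples carrying each integer label
labels, counts = np.unique(iris_bunch.target, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

The integer labels index into `target_names`, so label `0` corresponds to `setosa`, `1` to `versicolor`, and `2` to `virginica`.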


Based on these observations, we can now create a DataFrame in which each row carries its species name instead of a numeric label, which makes the later analysis easier to read.

In [29]:
# Converting the dataset into a pandas DataFrame
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Adding the target column (numerical labels)
iris_df['iris_type'] = iris.target

# Mapping the target names to their corresponding class names
iris_df['iris_species'] = iris_df['iris_type'].map({i: name for i, name in enumerate(iris.target_names)})

# Once mapped getting rid of the redundant 'iris_type' column
iris_df = iris_df.drop(columns=['iris_type'])

# Displaying the first few rows of the DataFrame
iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | iris_species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
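As an alternative to mapping and then dropping the helper column, the species names can be attached in one step. The sketch below is one possible approach, not the method used in this notebook. It uses `pd.Categorical.from_codes`, and the names `iris_bunch` and `species_df` are illustrative. Note that the resulting column has the `category` dtype rather than `object`:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_bunch = load_iris()
species_df = pd.DataFrame(iris_bunch.data, columns=iris_bunch.feature_names)

# from_codes pairs each integer code with the name at that position
# in target_names, so no intermediate numeric column is needed.
species_df["iris_species"] = pd.Categorical.from_codes(
    iris_bunch.target, categories=iris_bunch.target_names
)

print(species_df["iris_species"].dtype)
print(species_df.iloc[[0, 50, 100]]["iris_species"].tolist())
```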


### Data Quality Assessment

Before analysing the data, let's check the column data types and look for any NULL values in the DataFrame.

In [19]:
# Checking the type of data and NULL values in the dataframe
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   iris_species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


Observations from the above data:
1. There are 4 numerical columns with float values and a single column with categorical data.
2. No columns seem to have any NULL values.
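Both observations can also be confirmed programmatically. The following is a minimal sketch that rebuilds an equivalent DataFrame under the illustrative name `check_df`, so that it runs on its own:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_bunch = load_iris()
check_df = pd.DataFrame(iris_bunch.data, columns=iris_bunch.feature_names)
check_df["iris_species"] = iris_bunch.target_names[iris_bunch.target]

# Missing values per column; every entry should be zero
print(check_df.isna().sum())

# Split the columns into numeric and non-numeric ones
numeric_cols = check_df.select_dtypes(include="number").columns.tolist()
other_cols = check_df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```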

Now, let's check if there are any duplicates in the data and whether they affect the balance of data:

In [20]:
# Checking for any duplicate values in the dataset
iris_df[iris_df.duplicated()]

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | iris_species |
|---|---|---|---|---|---|
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
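By default, `duplicated()` flags only the later copy of a repeated row. To see every row involved in the duplication, `keep=False` can be passed instead. This is a sketch on a rebuilt DataFrame with the illustrative name `dup_df`:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_bunch = load_iris()
dup_df = pd.DataFrame(iris_bunch.data, columns=iris_bunch.feature_names)
dup_df["iris_species"] = iris_bunch.target_names[iris_bunch.target]

# keep=False marks all members of each duplicate group,
# not just the second and later occurrences.
dup_pairs = dup_df[dup_df.duplicated(keep=False)]
print(dup_pairs)
```

Seeing both copies side by side makes it easier to judge whether the repetition is a data-entry artifact or two flowers that happen to share the same measurements.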


Since there is one duplicate record, we should check the class balance of the dataset, because an imbalance could limit the insights we can draw from it.

In [30]:
# Checking the balance of the dataset
iris_df['iris_species'].value_counts()

iris_species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

The dataset is perfectly balanced at 50 samples per species with the duplicate included. A single repeated row out of 150 is too small to skew the analysis, so there is no need to remove it.
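To see what removing the duplicate would do to the balance, the counts can be recomputed on a de-duplicated copy. This is only an illustrative check, and the names `bal_df` and `deduped` are not part of the original notebook:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_bunch = load_iris()
bal_df = pd.DataFrame(iris_bunch.data, columns=iris_bunch.feature_names)
bal_df["iris_species"] = iris_bunch.target_names[iris_bunch.target]

# drop_duplicates returns a new DataFrame; bal_df itself is unchanged
deduped = bal_df.drop_duplicates()

print(len(bal_df) - len(deduped), "row(s) removed")
print(deduped["iris_species"].value_counts())
```

Dropping the row would leave Virginica with 49 samples against 50 for the other two species, a difference small enough that either choice is defensible.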

### Data Visualization

To visualize and assess the data, we can start by looking at how each feature is spread within each iris species, from its mean to its minimum and maximum values.

In [34]:
# Grouping and checking the spread of features for various Iris-Species
iris_df.groupby(['iris_species']).describe().transpose()

| Feature | Statistic | setosa | versicolor | virginica |
|---|---|---|---|---|
| sepal length (cm) | count | 50.0 | 50.0 | 50.0 |
| sepal length (cm) | mean | 5.006 | 5.936 | 6.588 |
| sepal length (cm) | std | 0.35249 | 0.516171 | 0.63588 |
| sepal length (cm) | min | 4.3 | 4.9 | 4.9 |
| sepal length (cm) | 25% | 4.8 | 5.6 | 6.225 |
| sepal length (cm) | 50% | 5.0 | 5.9 | 6.5 |
| sepal length (cm) | 75% | 5.2 | 6.3 | 6.9 |
| sepal length (cm) | max | 5.8 | 7.0 | 7.9 |
| sepal width (cm) | count | 50.0 | 50.0 | 50.0 |
| sepal width (cm) | mean | 3.428 | 2.77 | 2.974 |


A few things to note here:
1. The measurements of the Setosa species differ markedly from those of Versicolor and Virginica, which resemble each other more closely.
2. Setosa has the widest sepals on average (mean sepal width 3.428 cm versus 2.770 cm and 2.974 cm) but is smaller than the other two species on every other dimension, including sepal length (mean 5.006 cm versus 5.936 cm and 6.588 cm).
3. The maximum values also vary considerably across species: Virginica has the largest maximum for every feature except sepal width. The species therefore differ not only in their averages but also in how large individual flowers can get.
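The comparisons in the list above can be read off more directly from a narrower summary than the full `describe()` output. As a sketch, restricting the aggregation to the mean and maximum and asking which species leads on each feature might look like this (`stats_df`, `summary`, `means`, and `maxima` are illustrative names):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_bunch = load_iris()
stats_df = pd.DataFrame(iris_bunch.data, columns=iris_bunch.feature_names)
stats_df["iris_species"] = iris_bunch.target_names[iris_bunch.target]

# One row per species, with (feature, statistic) column pairs
summary = stats_df.groupby("iris_species").agg(["mean", "max"])
print(summary.round(3))

# For each feature, the species with the largest mean and the largest maximum
means = summary.xs("mean", axis=1, level=1)
maxima = summary.xs("max", axis=1, level=1)
print(means.idxmax())
print(maxima.idxmax())
```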

Some graphs will make these differences clearer.