# Exploratory Data Analysis on Iris Dataset
*By Bhavya Bhargava*<br>

The Iris dataset is a classic dataset in machine learning and statistics, consisting of 150 samples of iris flowers, 50 from each of three species: Iris-setosa, Iris-versicolor, and Iris-virginica. It is widely used for classification, clustering, pattern recognition, and EDA tasks because it is small, clean, and well structured.

### About the Dataset:
#### Dataset Source:  
The Iris dataset was first introduced by British statistician Ronald A. Fisher in his 1936 paper on linear discriminant analysis. It is publicly available through libraries like `scikit-learn` and through the UCI Machine Learning Repository. In this notebook we load it from `scikit-learn`.

#### Features:  
1. **Sepal Length (cm)**: Length of the sepal in centimeters.  
2. **Sepal Width (cm)**: Width of the sepal in centimeters.  
3. **Petal Length (cm)**: Length of the petal in centimeters.  
4. **Petal Width (cm)**: Width of the petal in centimeters.  
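5. **Species**: Target variable indicating the iris flower species (Iris-setosa, Iris-versicolor, Iris-virginica). In `scikit-learn` it is stored as the integer codes 0, 1, and 2, with the names `setosa`, `versicolor`, and `virginica`.  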


In [1]:
# Importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the Iris data from Scikit-Learn

from sklearn.datasets import load_iris

# For getting the plots inline
%matplotlib inline

Exploring the structure of the Iris data returned by scikit-learn and converting it to a DataFrame for further processing.

In [7]:
# checking the structure of the data from load_iris() function
iris = load_iris()

iris.keys()


dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
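As a small aside, the object returned by `load_iris()` is dictionary-like, so each key listed above can also be read as an attribute. The sketch below illustrates this and shows that the `frame` entry is only filled in when the data is requested with `as_frame=True`. It assumes a scikit-learn version that supports `as_frame`; the variable names are illustrative only.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Dictionary-style and attribute-style access return the same array
same_object = iris["data"] is iris.data

# 'frame' stays empty with the default arguments...
frame_default = iris.frame

# ...and holds a pandas DataFrame (features plus a 'target' column)
# when the dataset is loaded with as_frame=True
frame_loaded = load_iris(as_frame=True).frame

print(same_object)
print(frame_default is None)
print(frame_loaded.shape)
```

The rest of this notebook keeps the default NumPy arrays and builds the DataFrame by hand, which makes each step of the conversion visible.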

Now that we know the keys of the dictionary-like object holding the data, let's inspect the contents of the most relevant ones: `target_names`, `feature_names`, and `target`.

In [8]:
# Checking the target_names
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


In [14]:
# Checking the feature_names
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
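A quick consistency check, sketched under the assumption that `iris.data` is the default NumPy array: the number of columns in the data should match the number of feature names printed above.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Rows are samples, columns are the measured features
n_samples, n_features = iris.data.shape

# Each column should have exactly one name
names_match = n_features == len(iris.feature_names)

print(n_samples, n_features)
print(names_match)
```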


In [15]:
# Checking the values for the target
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
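The printed array suggests that the samples are grouped by class and that the classes are balanced. One way to confirm the balance, shown here as a minimal sketch, is to count how often each integer label occurs with `np.bincount` and pair the counts with `target_names`. The name `class_counts` is just for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count the occurrences of each integer label (0, 1, 2)
counts = np.bincount(iris.target)

# Pair every species name with its sample count
class_counts = {str(name): int(n) for name, n in zip(iris.target_names, counts)}

print(class_counts)
```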


Based on the above observations, we can now build a DataFrame in which every sample is labeled with its species name, which will make the later analysis easier.

In [18]:
# Converting the dataset into a pandas DataFrame
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Adding the target column (numerical labels)
iris_df['iris_type'] = iris.target

# Mapping each numerical label to its species name
iris_df['iris_name'] = iris_df['iris_type'].map(dict(enumerate(iris.target_names)))

# Dropping the numerical 'iris_type' column, which is redundant once the names are mapped
iris_df = iris_df.drop(columns=['iris_type'])

# Displaying the first few rows of the DataFrame
iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | iris_name |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
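Since `head()` only shows setosa rows, it can help to verify the whole frame before moving on. The following is a minimal sketch of such a sanity check. It rebuilds the frame so the snippet runs on its own, and the names `n_missing` and `species_counts` are illustrative rather than part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Rebuild the frame: four feature columns plus the species name
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["iris_name"] = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))

# Overall size of the frame
n_rows, n_cols = iris_df.shape

# Total number of missing values across all columns
n_missing = int(iris_df.isna().sum().sum())

# Number of samples per species
species_counts = iris_df["iris_name"].value_counts().to_dict()

print(n_rows, n_cols)
print(n_missing)
print(species_counts)
```

If the mapping had missed a label, the affected rows would show up as missing values in `iris_name`, so a zero in `n_missing` also confirms that every sample received a species name.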
