# Exploratory Data Analysis of the Iris Dataset

This notebook walks through an exploratory data analysis (EDA) of the Iris dataset: data loading, cleaning, descriptive statistics, data quality checks, and visualizations.

## Table of Contents
- [Data Loading and Cleaning](#Data-Loading-and-Cleaning)
- [Descriptive Statistics](#Descriptive-Statistics)
- [Data Quality Report](#Data-Quality-Report)
- [Visualizations](#Visualizations)
- [Conclusion](#Conclusion)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Set style
sns.set_style('whitegrid')

# Load data
data_path = os.path.join('data', 'iris_cleaned.csv')
df = pd.read_csv(data_path)

print('Dataset shape:', df.shape)
df.head()

## Data Loading and Cleaning

The Iris dataset was downloaded from the UCI Machine Learning Repository. It contains 150 samples of iris flowers, each with four numeric features (sepal length, sepal width, petal length, petal width) and one target variable (species).

Cleaning steps:
- Checked for duplicate rows (none found)
- Ensured the numerical columns are stored as float
- Checked for missing values (none found)
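
The notebook loads an already cleaned file, so the cleaning code itself is not shown. The following is a minimal sketch of what those steps could look like, assuming the same column names used in the plotting cells. The `raw` frame and the `clean_iris` helper are hypothetical, and the values are illustrative only.

```python
import pandas as pd

# Hypothetical raw sample standing in for the downloaded file.
# Values are illustrative only; features arrive as strings on purpose.
raw = pd.DataFrame({
    "sepal_length": ["5.1", "4.9", "4.9", "7.0"],
    "sepal_width": ["3.5", "3.0", "3.0", "3.2"],
    "petal_length": ["1.4", "1.4", "1.4", "4.7"],
    "petal_width": ["0.2", "0.2", "0.2", "1.4"],
    "species": ["setosa", "setosa", "setosa", "versicolor"],
})

feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def clean_iris(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and cast the feature columns to float."""
    deduplicated = frame.drop_duplicates().reset_index(drop=True)
    return deduplicated.astype({col: float for col in feature_cols})


cleaned = clean_iris(raw)

# Counts that back the bullet points above.
n_duplicates = int(raw.duplicated().sum())
n_missing = int(cleaned.isna().sum().sum())

print("Duplicate rows removed:", n_duplicates)
print("Missing values after cleaning:", n_missing)
print(cleaned.dtypes)
```

On the real file, the same two counts make it easy to confirm the "none found" statements before saving the cleaned CSV.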


## Descriptive Statistics

In [None]:
# Descriptive statistics
print('Descriptive Statistics:')
df.describe()

In [None]:
# Statistics by species
df.groupby('species').describe()

## Data Quality Report

- No missing values
- No duplicates
- Outliers: the IQR method flags some values in a subset of the features
- Distributions are approximately normal for most features
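
The checks behind this report are not shown in the notebook. The sketch below illustrates one way they might be computed, assuming the conventional 1.5 × IQR fences (the notebook does not state the multiplier) and using skewness as a rough proxy for normality. The `sample` frame and the `iqr_outlier_counts` helper are hypothetical, and the values are illustrative only.

```python
import pandas as pd

# Hypothetical sample using the notebook's column names; values are illustrative only.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 5.0, 5.4, 9.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.6, 3.9, 3.4],
    "species": ["setosa"] * 6,
})

numeric_cols = ["sepal_length", "sepal_width"]


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()


missing_per_column = sample.isna().sum()
n_duplicate_rows = int(sample.duplicated().sum())
outliers_per_feature = iqr_outlier_counts(sample[numeric_cols])
# Skewness near 0 is consistent with a roughly symmetric distribution.
skew_per_feature = sample[numeric_cols].skew()

print("Missing values:\n", missing_per_column)
print("Duplicate rows:", n_duplicate_rows)
print("IQR outliers:\n", outliers_per_feature)
print("Skewness:\n", skew_per_feature)
```

Applied to the full `df`, the same calls would give per-feature counts that make the bullets above concrete. Running them per species (for example inside a `groupby`) is a reasonable variant, since pooled distributions can look different from within-species ones.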


## Visualizations

In [None]:
# Histogram of sepal length
plt.figure(figsize=(8,6))
sns.histplot(df['sepal_length'], kde=True)
plt.title('Distribution of Sepal Length')
plt.show()

In [None]:
# Scatter plot
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='sepal_length', y='petal_length', hue='species')
plt.title('Sepal Length vs Petal Length')
plt.show()

In [None]:
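# Box plot of every feature, split by species
plt.figure(figsize=(10, 6))
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
# Reshape to long format: one row per (species, feature, value)
df_melted = df.melt(id_vars='species', value_vars=feature_cols,
                    var_name='feature', value_name='value')
sns.boxplot(data=df_melted, x='feature', y='value', hue='species')
plt.title('Box Plot of Features by Species')
plt.xticks(rotation=45)
plt.tight_layout()  # keep the rotated tick labels inside the figure
plt.show()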

## Conclusion

The Iris dataset is clean and well suited to classification tasks. The plots suggest that the three species separate well on the measured features, especially petal length and petal width.
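
One simple way to check the separability claim numerically is to compare per-species ranges of a feature and see whether they overlap. The sketch below assumes the notebook's column names; the `petals` frame and the `ranges_overlap` helper are hypothetical, and the values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical measurements; illustrative values only, not taken from the dataset.
petals = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "petal_length": [1.4, 1.7, 3.9, 4.8, 4.6, 6.0],
})

# Minimum and maximum petal length for each species.
ranges = petals.groupby("species")["petal_length"].agg(["min", "max"])


def ranges_overlap(a: pd.Series, b: pd.Series) -> bool:
    """Return True if the closed intervals [min, max] of a and b intersect."""
    return bool(a["min"] <= b["max"] and b["min"] <= a["max"])


setosa_vs_versicolor = ranges_overlap(ranges.loc["setosa"], ranges.loc["versicolor"])
versicolor_vs_virginica = ranges_overlap(ranges.loc["versicolor"], ranges.loc["virginica"])

print(ranges)
print("setosa / versicolor overlap:", setosa_vs_versicolor)
print("versicolor / virginica overlap:", versicolor_vs_virginica)
```

Non-overlapping ranges on a single feature mean a simple threshold would separate that pair of species. Overlapping ranges do not rule out separability, but they indicate that more than one feature is needed.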

Next steps could include building a machine learning model to classify the species.
