# 📘 Exercise 2 – Sepal Width Boxplot (Iris Dataset)

**Objective:**  
Compare the distribution of **sepal width** among the three Iris species using a boxplot.  
Follow the **10 steps** as practiced in class.

## **Step 1: Setup Environment**

In [None]:
%pip install pandas seaborn matplotlib

## **Step 2: Import Libraries**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## **Step 3: Load Dataset**

In [None]:
iris = sns.load_dataset('iris')

## **Step 4: Explore Dataset**

In [None]:
iris.head()

In [None]:
iris.groupby('species')['sepal_width'].describe()

## **Step 5: Data Cleaning / Preprocessing**

In [None]:
iris['sepal_width'].isnull().sum()

## **Step 6: Visualization – Boxplot**

In [None]:
plt.figure(figsize=(6,4))
sns.boxplot(data=iris, x='species', y='sepal_width', palette='Set2')
plt.title('Sepal Width Distribution by Species')
plt.show()

## **Step 7: Customization – Add Swarmplot**

In [None]:
plt.figure(figsize=(6,4))
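# Hide the boxplot's own outlier markers; the swarmplot already draws every point
sns.boxplot(data=iris, x='species', y='sepal_width', palette='Set2', showfliers=False)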
sns.swarmplot(data=iris, x='species', y='sepal_width', color='black', alpha=0.6)
plt.title('Sepal Width Distribution with Data Points')
plt.show()

## **Step 8: Save Visualization**

In [None]:
plt.figure(figsize=(6,4))
sns.boxplot(data=iris, x='species', y='sepal_width', palette='Set2')
plt.title('Sepal Width by Species')
plt.savefig('sepal_width_boxplot.png', dpi=300, bbox_inches='tight')
plt.show()

## **Step 9: Analysis**
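- *Setosa* has the highest median sepal width (about 3.4 cm).  
- *Versicolor* has the lowest (about 2.8 cm), with *Virginica* slightly higher (about 3.0 cm).  
- *Versicolor* and *Virginica* overlap heavily, and both sit mostly below *Setosa*.  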
- Outliers appear on both sides of the box for *Setosa* and *Virginica*; *Versicolor* shows none with the default 1.5 × IQR whiskers.
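
Statements like these can be checked against the numbers rather than read off the plot. The sketch below is one possible way to do that: it computes the median per group and counts the points that fall beyond the 1.5 × IQR whiskers. The helper name `summarize_by_group` and the tiny `demo` table are made up for illustration; in the notebook you would pass `iris` instead.

```python
import pandas as pd

def summarize_by_group(df, group_col, value_col, whis=1.5):
    """Return each group's median and the number of points beyond the whiskers."""
    rows = {}
    for name, values in df.groupby(group_col)[value_col]:
        q1, q3 = values.quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - whis * iqr, q3 + whis * iqr
        n_out = int(((values < lower) | (values > upper)).sum())
        rows[name] = {"median": values.median(), "n_outliers": n_out}
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up sample, only to show the call; not real measurements.
demo = pd.DataFrame({
    "species": ["a"] * 5 + ["b"] * 5,
    "sepal_width": [3.0, 3.1, 3.2, 3.3, 5.0, 2.5, 2.6, 2.7, 2.8, 2.9],
})
summary = summarize_by_group(demo, "species", "sepal_width")
print(summary)

# In the notebook:
# summarize_by_group(iris, "species", "sepal_width")
```

The fences use the same 1.5 × IQR rule as the default boxplot whiskers, so the outlier counts should match the points drawn beyond the whiskers.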

## **Step 10: Next Steps**
Try using a **violin plot** instead of a boxplot to compare the same data.
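
As a starting point, here is a minimal sketch of the violin version. The helper name `violin_by_group` is hypothetical, and `cut=0` is an optional choice that stops the density curves from extending past the observed minimum and maximum.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def violin_by_group(df, group_col, value_col, title):
    """Draw one violin per group and return the Axes for further tweaking."""
    fig, ax = plt.subplots(figsize=(6, 4))
    # cut=0 trims each violin at the smallest and largest observed value
    sns.violinplot(data=df, x=group_col, y=value_col, cut=0, ax=ax)
    ax.set_title(title)
    return ax

# In the notebook:
# violin_by_group(iris, "species", "sepal_width", "Sepal Width Distribution by Species")
# plt.show()
```

A violin shows the shape of each distribution (for example, whether it has one peak or two), which a boxplot cannot show.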

# 📊 Case Study – Penguins Dataset

**Objective:**  
Determine which morphological features best separate penguin species.  
We will use the `penguins` dataset in seaborn and follow the **10-step process**.

## **Step 1: Setup Environment**

In [None]:
%pip install pandas seaborn matplotlib

## **Step 2: Import Libraries**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## **Step 3: Load Dataset**

In [None]:
penguins = sns.load_dataset('penguins')
penguins.head()

## **Step 4: Explore Dataset**

In [None]:
penguins.info()

In [None]:
penguins.describe()

In [None]:
penguins['species'].value_counts()

## **Step 5: Data Cleaning / Preprocessing**

In [None]:
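# Count missing values per column (print it, since only a cell's last expression is displayed)
print(penguins.isnull().sum())
# Remove every row that has a missing value in any column
penguins = penguins.dropna()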
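
Called without arguments, `dropna()` removes a row if any column is missing, so a penguin with no recorded `sex` is dropped even from plots that never use that column. The sketch below shows one way to make the loss visible and to restrict the check to chosen columns. The helper name `drop_missing` and the `demo` rows are made up for illustration.

```python
import pandas as pd

def drop_missing(df, columns=None):
    """Drop rows with NaN in `columns` (all columns if None) and report how many were lost."""
    cleaned = df.dropna(subset=columns)
    n_dropped = len(df) - len(cleaned)
    print(f"Dropped {n_dropped} of {len(df)} rows")
    return cleaned

# Made-up rows, only to show the call; in the notebook pass `penguins` instead.
demo = pd.DataFrame({
    "bill_length_mm": [39.1, None, 46.5, 50.0],
    "sex": ["Male", "Female", None, "Female"],
})
all_cols = drop_missing(demo)                       # checks every column
bill_only = drop_missing(demo, ["bill_length_mm"])  # checks one column only
```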

## **Step 6: Visualization – Histograms**

In [None]:
sns.histplot(data=penguins, x='bill_length_mm', hue='species', kde=True, element='step')
plt.title('Distribution of Bill Length by Species')
plt.show()

## **Step 7: Visualization – Boxplots & Scatter**

In [None]:
plt.figure(figsize=(6,4))
sns.boxplot(data=penguins, x='species', y='flipper_length_mm', palette='Set3')
plt.title('Flipper Length by Species')
plt.show()

sns.scatterplot(data=penguins, x='bill_length_mm', y='bill_depth_mm', hue='species', style='sex')
plt.title('Bill Length vs Bill Depth')
plt.show()

## **Step 8: Save Visualization**

In [None]:
plt.figure(figsize=(6,4))
sns.boxplot(data=penguins, x='species', y='body_mass_g', palette='pastel')
plt.title('Body Mass by Penguin Species')
plt.savefig('penguins_boxplot.png', dpi=300, bbox_inches='tight')
plt.show()

## **Step 9: Analysis**
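- **Adelie** penguins tend to have shorter bills and lower body mass.  
- **Gentoo** penguins are heavier with longer flippers.  
- **Chinstrap** penguins resemble Adelie in body mass and bill depth but have noticeably longer bills.  

**Flipper length and body mass** clearly set Gentoo apart, but Adelie and Chinstrap overlap on both. **Bill length** is what separates Adelie from Chinstrap, so bill length combined with flipper length (or bill depth) distinguishes all three species.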
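
One rough way to put numbers on "how well does a feature separate the species" is the ratio of between-group to within-group variance (the one-way ANOVA F statistic). The sketch below is an illustration only, not part of the class's 10 steps. The helper name `separation_ratio` and the `demo` values are made up; in the notebook you would pass `penguins` and its four numeric columns.

```python
import pandas as pd

def separation_ratio(df, group_col, features):
    """Between-group / within-group variance per feature; larger means better separated."""
    groups = df.groupby(group_col)
    n = groups.size()                 # rows per group
    means = groups[features].mean()   # group means, one column per feature
    grand = df[features].mean()       # overall mean per feature
    k, total = len(n), n.sum()
    scores = {}
    for f in features:
        between = (n * (means[f] - grand[f]) ** 2).sum() / (k - 1)
        within = groups[f].apply(lambda s: ((s - s.mean()) ** 2).sum()).sum() / (total - k)
        scores[f] = between / within
    return pd.Series(scores).sort_values(ascending=False)

# Made-up values, only to show the call.
demo = pd.DataFrame({
    "species": ["A"] * 3 + ["B"] * 3,
    "flipper_length_mm": [190, 191, 192, 215, 216, 217],
    "bill_depth_mm": [18, 19, 17, 18, 17, 19],
})
scores = separation_ratio(demo, "species", ["flipper_length_mm", "bill_depth_mm"])
print(scores)
```

A single ratio summarizes all species at once, so a high score can still hide overlap between one particular pair. It works best alongside the plots, not in place of them.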

## **Step 10: Next Steps**
- Try creating a **pairplot** of all numerical features colored by species.  
- Add interactivity using **Plotly** for deeper exploration.