Kevin Hennelly: Exploratory Data Analysis Project, 9/9/2025

# 1. Imports

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot  # binds `matplotlib` and explicitly loads the pyplot submodule used below

# Axes object (basic plot type returned by Seaborn)
from matplotlib.axes import Axes

# 2. Load Data

In [None]:
# Load the Iris dataset into pandas DataFrame
iris_df: pd.DataFrame = sns.load_dataset('iris')

# List column names (printed explicitly, since a notebook cell only displays its last expression)
print(iris_df.columns)

# Inspect first few rows of the DataFrame
iris_df.head()


# 3. Initial Data Inspection

In [None]:
# Display the first 10 rows (printed explicitly, since a cell only displays its last expression)
print(iris_df.head(10))

# Inspect the shape of the DataFrame with the shape attribute
# The shape is a (rows, columns) tuple
print(iris_df.shape)

# Inspect the data types of the columns with the dtypes attribute
# The data types are returned as a pandas Series
print(iris_df.dtypes)

# Summarize column dtypes, non-null counts, and memory usage with the info() method
iris_df.info()

# 4. Initial Descriptive Statistics

In [None]:
# Inspect summary statistics for numerical columns
iris_df.describe()

# 5. Initial Data Distribution for Numerical Columns

In [None]:
# Inspect histogram by one numerical column
iris_df['sepal_length'].hist()

# Inspect histograms for ALL numerical columns
iris_df.hist()

# Show all plots
matplotlib.pyplot.show()

# 6. Initial Data Transformation and Feature Engineering

In [None]:
# Feature Engineering
# Renaming a column
iris_df.rename(columns={'sepal_length': 'Sepal Length'}, inplace=True)

# Adding a new column: length x width treats the sepal as a rectangle, so this is an approximate area
iris_df['Sepal Area'] = iris_df['Sepal Length'] * iris_df['sepal_width']


# 7. Initial Visualizations

In [None]:
# Create a pairplot of the Iris dataset
# A pairplot is a grid of scatter plots for each pair of numerical columns in the dataset
# The hue parameter is used to color the data points 
# by species (a categorical column)
sns.pairplot(iris_df, hue='species')

# Show all plots
matplotlib.pyplot.show()

# A scatter plot is a plot of two numerical variables.
scatter_plt: Axes = sns.scatterplot(
    data=iris_df, x="Sepal Length", y="Sepal Area", hue="species"
)

# Set axis labels using the Matplotlib Axes methods set_xlabel() and set_ylabel()
scatter_plt.set_xlabel("Sepal Length (cm)")
scatter_plt.set_ylabel("Sepal Area (cm squared)")

# Set the title using the Matplotlib Axes set_title() method
scatter_plt.set_title("Chart 1. Iris Sepal Length vs. Sepal Area (by Species)")

matplotlib.pyplot.show()

### Pairplot
The pairplot provides a grid of scatter plots for each pair of numerical columns in the dataset.  
By using the `hue="species"` parameter, we can visually compare how the three iris species differ across sepal and petal dimensions.  
We can already notice some separation between species, especially in petal length and petal width.
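
The separation seen in the pairplot can also be checked numerically. The sketch below is a minimal illustration, not part of the original notebook: the helper names `feature_ranges` and `ranges_overlap` are hypothetical, and the code assumes a DataFrame shaped like `iris_df` with a `species` column. It reports the minimum and maximum of one column per species and tests whether two of those ranges intersect.

```python
import pandas as pd


def feature_ranges(df: pd.DataFrame, feature: str, by: str = "species") -> pd.DataFrame:
    """Return the min and max of `feature` for each group in `by`."""
    # groupby + agg yields one row per group with 'min' and 'max' columns
    return df.groupby(by)[feature].agg(["min", "max"])


def ranges_overlap(ranges: pd.DataFrame, group_a: str, group_b: str) -> bool:
    """Return True if the [min, max] intervals of the two groups intersect."""
    a, b = ranges.loc[group_a], ranges.loc[group_b]
    return bool(a["min"] <= b["max"] and b["min"] <= a["max"])
```

In the notebook, `feature_ranges(iris_df, "petal_length")` would give one row per species, and `ranges_overlap` can then be applied to any pair of species names to see whether that single measurement separates them completely.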

### Scatterplot: Sepal Length vs. Sepal Area
This scatterplot shows the relationship between **Sepal Length** and our engineered feature, **Sepal Area**.  
By coloring points according to species, we can see clustering patterns that suggest species differences in sepal measurements.

**Observations:**
- *Setosa* has the shortest sepals, but its sepals are comparatively wide, so its sepal areas are similar to those of *Versicolor* rather than clearly smaller.  
- *Virginica* generally has the longest sepals and the largest sepal areas.  
- There is some overlap between *Versicolor* and *Virginica*, but the general trend is visible.
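
Observations like these are easier to trust when backed by per-species averages rather than read off the chart alone. The following is a minimal sketch, assuming the renamed `Sepal Length` column and the engineered `Sepal Area` column from section 6; the helper name `species_means` is hypothetical.

```python
import pandas as pd


def species_means(df: pd.DataFrame, columns: list[str], by: str = "species") -> pd.DataFrame:
    """Return the mean of each requested column per group, sorted by the first column."""
    means = df.groupby(by)[columns].mean()
    # Sorting by the first column makes the ordering of the groups easy to read
    return means.sort_values(columns[0])
```

Calling `species_means(iris_df, ["Sepal Length", "Sepal Area"])` returns one row per species, ordered by mean sepal length, which can be compared directly with the clusters in the scatterplot.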

## 8. Initial Insights

From the pairplot and scatterplot visualizations, we can draw several early insights:

- The iris species show clear differences in petal dimensions, which appear to be good predictors for classification.  
- *Setosa* is distinctly separated from *Versicolor* and *Virginica* in most plots.  
- *Versicolor* and *Virginica* overlap somewhat, but differences in petal width and length still help distinguish them.  
- Our engineered feature **Sepal Area** also highlights separation among species, though not as strongly as petal measurements.

These initial insights suggest that the dataset has strong predictive features for distinguishing species, especially when petal measurements are included.
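
One rough way to gauge how predictive a set of features is, without any machine learning library, is a nearest-centroid rule: assign each flower to the species whose mean measurements are closest. The sketch below is an illustration, not part of the original analysis. `nearest_centroid_accuracy` is a hypothetical helper, and because it scores the same rows used to compute the centroids, the result is an optimistic estimate.

```python
import numpy as np
import pandas as pd


def nearest_centroid_accuracy(
    df: pd.DataFrame, features: list[str], label: str = "species"
) -> float:
    """Return the fraction of rows whose nearest class centroid matches their own label."""
    centroids = df.groupby(label)[features].mean()   # one row of feature means per class
    points = df[features].to_numpy(dtype=float)      # shape (n_rows, n_features)
    centers = centroids.to_numpy(dtype=float)        # shape (n_classes, n_features)
    # Euclidean distance from every row to every centroid, shape (n_rows, n_classes)
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # Map the index of the closest centroid back to its class label
    predicted = centroids.index.to_numpy()[distances.argmin(axis=1)]
    return float((predicted == df[label].to_numpy()).mean())
```

Comparing `nearest_centroid_accuracy(iris_df, ["petal_length", "petal_width"])` with the same call on the sepal columns gives a quick, informal sense of which measurements separate the species better.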

## 9. Annotate Your Notebook for Storytelling and Presentation

This notebook demonstrates an initial exploratory data analysis (EDA) project using the classic Iris dataset.  
The goal of this project was to show how pandas, seaborn, and matplotlib can be used to organize, analyze, and visualize data.

### Approach
1. **Data Preparation** – I began by importing the dataset and inspecting its structure.  
2. **Feature Engineering** – I created a new feature, *Sepal Area*, to explore relationships between measurements.  
3. **Visualization** – I used pairplots and scatterplots to show how the iris species differ.  
4. **Insights** – I documented observations at each step to explain patterns in the data.
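
The feature engineering step in this list can be sketched as a small reusable function. This is one possible arrangement, not the notebook's exact code: `add_sepal_area` is a hypothetical name, and it returns a new DataFrame instead of modifying the input in place, so it can be re-run safely. It assumes the original seaborn column names `sepal_length` and `sepal_width`.

```python
import pandas as pd


def add_sepal_area(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with 'sepal_length' renamed and a 'Sepal Area' column added."""
    # rename() without inplace=True returns a new DataFrame and leaves `df` untouched
    out = df.rename(columns={"sepal_length": "Sepal Length"})
    # Length x width treats the sepal as a rectangle, so this is an approximate area
    out["Sepal Area"] = out["Sepal Length"] * out["sepal_width"]
    return out
```

Keeping the transformation in one function also makes it easy to apply the same step to a fresh copy of the data, for example `iris_features = add_sepal_area(sns.load_dataset("iris"))`.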

### Story
The Iris dataset is widely used in data science because it provides a simple but powerful example of classification.  
Through visualizations, we saw how species can be separated by their measurements. *Setosa* is clearly distinct, while *Versicolor* and *Virginica* overlap but can still be distinguished by certain features like petal length and width.  

### Conclusion
This notebook provides a clear narrative: starting with the raw data, building new features, and finishing with visualizations that highlight meaningful patterns.  
By combining code, charts, and explanations, I’ve created a story that both technical and non-technical audiences can follow. This process demonstrates the value of EDA as a foundation for deeper machine learning and decision-making.