# 1.7 Exploring Data Distribution
Imagine you’re an artist sketching a mountain range, capturing its peaks and valleys, or a kid sorting a bag of candies by color to see which ones dominate. That’s what exploring data distribution is like—using tools like histograms, box plots, and skewness to visualize the shape and spread of your data. It’s like drawing a map of your dataset to uncover its hidden story!

## What Is Exploring Data Distribution?
Data distribution shows how values are spread out or clustered. Here’s how we explore it:

- **Histograms**: Bar charts that group data into bins, like counting candies in color piles to see the most common.
- **Box Plots**: Boxes and whiskers that show the middle 50%, extremes, and outliers, like marking the height range of kids.
- **Skewness**: A measure of asymmetry. Right-skewed (positive) data bunches up on the left with a long tail to the right; left-skewed (negative) data bunches up on the right with a long tail to the left, like a lopsided mountain.

Let’s use a small custom dataset of passenger ages. We’ll start with the sample file `titanic_ages.csv`, saved in your `data` folder with these values:

| Age |
|-----|
| 22  |
| 38  |
| 26  |
| 35  |
| 54  |
| 2   |
| 27  |
| 14  |

Load it like this:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Import the sample dataset from the data folder
data = pd.read_csv('data/titanic_ages.csv')

print(data.head())  # Show the first few rows

Let’s create a histogram for `Age`:

In [None]:
# Create a histogram
plt.hist(data['Age'], bins=5, edgecolor='black', color='lightgreen')
plt.title('Distribution of Passenger Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
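
If you want the numbers behind the bars, here is a minimal sketch using `np.histogram`, which applies the same equal-width binning as `plt.hist`. It assumes the eight ages from the table above, typed in directly instead of read from the CSV.

```python
import numpy as np

# The eight sample ages from the table above (typed in, not read from the CSV)
ages = np.array([22, 38, 26, 35, 54, 2, 27, 14])

# Split the range 2..54 into 5 equal-width bins and count the values in each
counts, edges = np.histogram(ages, bins=5)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:5.1f} to {right:5.1f}: {count} passenger(s)")
```

Each bin is 10.4 years wide, and no bin holds more than two ages, so with only eight values the histogram looks fairly flat.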

Now, a box plot for `Age`:

In [None]:
# Create a box plot
data.boxplot(column='Age', patch_artist=True, boxprops=dict(facecolor='lightblue'))
plt.title('Box Plot of Passenger Ages')
plt.ylabel('Age')
plt.show()
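
To see what the box and whiskers are built from, the sketch below computes the quartiles and the 1.5 × IQR fences by hand. It assumes Matplotlib's default whisker setting (`whis=1.5`) and uses the eight ages typed in directly; the names `lower_fence` and `upper_fence` are just illustrative.

```python
import numpy as np

# The eight sample ages from the table above
ages = np.array([22, 38, 26, 35, 54, 2, 27, 14])

# Quartiles: the box spans q1..q3 and the line inside it is the median
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Flagged outliers: {outliers.tolist()}")
```

For this small sample the fences land at about -3.6 and 59.4, so neither age 2 nor age 54 is flagged; the whiskers simply stretch out to reach them.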

To check skewness:

In [None]:
# Calculate skewness
skewness = data['Age'].skew()
print(f"Skewness of ages: {skewness:.2f}")  # A small positive value (about 0.11), indicating a slight right skew
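
If you are curious what `.skew()` does under the hood, here is a rough sketch of the calculation. It assumes pandas uses the sample-size-adjusted (Fisher-Pearson) skewness, and the variable names (`m2`, `m3`, `g1`, `adjusted`) are only for illustration.

```python
import numpy as np
import pandas as pd

# The eight sample ages from the table above
ages = pd.Series([22, 38, 26, 35, 54, 2, 27, 14])
n = len(ages)

# Central moments: average squared and cubed distance from the mean
deviations = ages - ages.mean()
m2 = (deviations ** 2).mean()
m3 = (deviations ** 3).mean()

# Plain skewness, then the small-sample adjustment
g1 = m3 / m2 ** 1.5
adjusted = g1 * np.sqrt(n * (n - 1)) / (n - 2)

print(f"By hand: {adjusted:.2f}")
print(f"pandas:  {ages.skew():.2f}")
```

The cubed deviations are what give skewness its sign: a few values far above the mean push it positive, and a few far below push it negative.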

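The histogram spreads the eight ages fairly evenly between 2 and 54, with most falling between about 12 and 44. The box plot puts the median at 26.5 and flags no outliers: under the default 1.5 × IQR rule, both 2 and 54 sit inside the whiskers. The skewness is small and positive (about 0.11), a slight right tilt pulled by the oldest age (54). Eight values are too few to say much about shape, so for a larger dataset, let’s generate one: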

In [None]:
# Generate a larger custom dataset
import numpy as np
np.random.seed(42)  # For reproducibility
ages = np.concatenate([np.random.normal(30, 10, 800), np.random.normal(60, 5, 200)])  # Mostly 20-40, some older
large_data = pd.DataFrame({'Age': ages})

# Save to CSV
large_data.to_csv('data/titanic_ages_large.csv', index=False)
print("Large dataset saved as data/titanic_ages_large.csv")

Reload and analyze the larger dataset:

In [None]:
# Import the larger dataset
large_data = pd.read_csv('data/titanic_ages_large.csv')

# Histogram for larger dataset
plt.hist(large_data['Age'], bins=10, edgecolor='black', color='lightcoral')
plt.title('Distribution of Larger Passenger Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Box plot for larger dataset
large_data.boxplot(column='Age', patch_artist=True, boxprops=dict(facecolor='lightblue'))
plt.title('Box Plot of Larger Passenger Ages')
plt.ylabel('Age')
plt.show()

# Skewness for larger dataset
skewness_large = large_data['Age'].skew()
print(f"Skewness of larger ages: {skewness_large:.2f}")  # e.g., 0.45, still right-skewed

## Why Is This Necessary?

- **In Mathematics**: Visualizing distribution helps understand data behavior, like spotting if grades cluster around a certain score.
- **In Machine Learning (ML)**: It reveals patterns that affect model performance, like whether data needs transformation.

## Relevance in Machine Learning
Distribution shapes ML outcomes. A model trained on a skewed dataset (e.g., mostly young passengers) sees few examples from the tail and may predict poorly there. Histograms reveal these imbalances, box plots flag outliers to investigate, and skewness indicates whether a transform such as a log is worth trying before training.

## Applications

- **Sales Trends**: Retailers use histograms to see which price ranges sell most.
- **Grade Distributions**: Schools plot grades to identify common performance levels.

## Step-by-Step Example
Let’s explore our age data:

1. **Load the Data**: Import `titanic_ages.csv` or `titanic_ages_large.csv` from `data/`.
2. **Plot a Histogram**: See how ages group: a fairly even spread across the eight values in the small set, and a main cluster around 30 with a smaller group of older ages in the large set.
3. **Draw a Box Plot**: Spot the median (about 26.5 small, low 30s large) and check for points beyond the whiskers (none in the small set, possibly a few extreme ages in the large set).
4. **Measure Skewness**: A positive value (e.g., about 0.11 small, 0.45 large) shows a longer tail toward older ages.

Run the code above to visualize and confirm the right skew!

## Practical Insights

- **Bin Choice**: Too few bins hide details; too many clutter—10 bins suit the larger dataset.
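- **Outlier Flags**: Box plots draw points beyond the whiskers individually, so extremes in the large set stand out. In the small set nothing is flagged, because ages 2 and 54 both fall within the whiskers.
- **Skewness Action**: Strong positive skew sometimes benefits from a log transform before modeling. Recompute skewness afterwards to check that the transform actually moved it toward zero.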
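
To get a feel for the bin-choice trade-off, this sketch compares a few fixed bin counts with two of NumPy's automatic rules (`"sturges"` and `"fd"`). It rebuilds the simulated ages with `np.random.RandomState(42)`, which is assumed to reproduce the same values as the seeded cell earlier.

```python
import numpy as np

# Rebuild the simulated ages with the same seed and draw order as the earlier cell
rs = np.random.RandomState(42)
ages = np.concatenate([rs.normal(30, 10, 800), rs.normal(60, 5, 200)])

# Try fixed bin counts plus two of NumPy's automatic rules
bin_counts = {}
for rule in (3, 10, 50, "sturges", "fd"):
    edges = np.histogram_bin_edges(ages, bins=rule)
    bin_counts[rule] = len(edges) - 1
    width = edges[1] - edges[0]
    print(f"bins={rule!s:>8}: {bin_counts[rule]:3d} bins, each about {width:.1f} years wide")
```

Any string printed here can also be passed straight to `plt.hist(..., bins="sturges")`, so you can let a rule pick a starting point and adjust by eye.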

Let’s apply a log transform to the larger dataset:

In [None]:
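# Apply log transform and recalculate skewness
# The simulated ages can dip below zero, and log1p is undefined below -1, so clip negatives to 0 first
large_data['Log_Age'] = np.log1p(large_data['Age'].clip(lower=0))  # log1p(0) = 0, so zeros are safe
skewness_log = large_data['Log_Age'].skew()
print(f"Skewness of log-transformed ages: {skewness_log:.2f}")  # Compare with skewness_large; a negative value means the log overcorrected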

# Histogram of log-transformed ages
plt.hist(large_data['Log_Age'], bins=10, edgecolor='black', color='lightcoral')
plt.title('Distribution of Log-Transformed Passenger Ages')
plt.xlabel('Log(Age)')
plt.ylabel('Frequency')
plt.show()
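
A log transform works best when the data has one long right tail. As a hedged illustration, the sketch below uses a made-up, strongly right-skewed sample (log-normal values, not the age data) and compares skewness before and after taking the log. The seed, parameters, and variable names are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: log-normal values, all strictly positive
rs = np.random.RandomState(0)
values = pd.Series(rs.lognormal(mean=3.0, sigma=0.5, size=1000))

raw_skew = values.skew()
log_skew = np.log(values).skew()  # plain log is safe here because every value is > 0

print(f"Skewness before log: {raw_skew:.2f}")
print(f"Skewness after log:  {log_skew:.2f}")
```

Here the log pulls skewness close to zero. On a two-cluster dataset like the simulated ages, the result can be quite different, which is why it pays to recheck skewness after any transform.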

## Common Pitfalls to Avoid

- **Misleading Bins**: Uneven bin sizes can distort histograms—keep them equal.
- **Ignoring Outliers**: A box plot flags points beyond the whiskers but does not explain them. Investigate whether they are errors or real values before removing them.
- **Overinterpreting Skewness**: A small skew (e.g., 0.45) might not need fixing—check its impact.

Let’s filter outliers from the larger dataset and replot:

In [None]:
# Filter out implausible ages (e.g., below 1 or above 80); the simulated data can include negative values
large_data_clean = large_data[(large_data['Age'] >= 1) & (large_data['Age'] <= 80)]
large_data_clean.boxplot(column='Age', patch_artist=True, boxprops=dict(facecolor='lightblue'))
plt.title('Box Plot of Cleaned Passenger Ages')
plt.ylabel('Age')
plt.show()
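
Fixed cutoffs like 1 and 80 rely on knowing what a sensible age is. When you don't have that knowledge, one common alternative is to reuse the box plot's own 1.5 × IQR rule. The sketch below is one way to do that, assuming the same simulated ages (rebuilt with `np.random.RandomState(42)`); the names `lower`, `upper`, and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the simulated ages with the same seed and draw order as the earlier cell
rs = np.random.RandomState(42)
ages = pd.Series(
    np.concatenate([rs.normal(30, 10, 800), rs.normal(60, 5, 200)]),
    name="Age",
)

# Fences based on the interquartile range, as in a default box plot
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values inside the fences
cleaned = ages[ages.between(lower, upper)]

print(f"Fences: {lower:.1f} to {upper:.1f}")
print(f"Kept {len(cleaned)} of {len(ages)} rows")
```

Note that the lower fence can fall below zero, so this rule alone will not remove every impossible age. Combining it with a domain check (such as `Age >= 0`) is usually safer.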

## What’s Next?
We’ve sketched our data’s shape. Next, we’ll dive into binary categorical data to explore yes/no patterns—think of it as sorting coins by heads or tails. Ready for the next adventure?