# Module 03: Seaborn Statistical Visualization

**Estimated Time**: 90 minutes  
**Difficulty**: Beginner to Intermediate

## Learning Objectives

By the end of this module, you will:
- Understand Seaborn's advantages over Matplotlib
- Create distribution plots (histograms, KDE, violin plots)
- Build categorical plots (box plots, bar plots, count plots)
- Visualize relationships (scatter plots with regression)
- Use pair plots to explore multiple variables
- Create correlation heatmaps
- Work effectively with pandas DataFrames

---

In [None]:
# Import required libraries
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import warnings

# Set style and context
sns.set_theme(style="whitegrid", context="notebook")
# Suppress warnings to keep the output clean (note: this also hides deprecation notices)
warnings.filterwarnings("ignore")

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")
print(f"Seaborn version: {sns.__version__}")

## Part 1: Why Seaborn?

Seaborn is built on top of Matplotlib and adds a higher-level interface for statistical graphics.

### Advantages
1. **Beautiful defaults** - Attractive color palettes and styles out of the box
2. **Statistical focus** - Built-in statistical estimation and visualization
3. **DataFrame integration** - Works seamlessly with pandas
4. **Less code** - Complex plots with simple function calls
5. **Consistent API** - Similar function signatures across plot types

### When to Use Seaborn
- Exploratory data analysis
- Statistical visualizations
- Working with DataFrames
- Quick, attractive plots

### When to Use Matplotlib
- Fine-grained control needed
- Custom plot types
- Publication-specific requirements

In [None]:
# Example: Same plot in Matplotlib vs Seaborn
# Create sample data
data = pd.DataFrame({"x": np.random.randn(100), "y": np.random.randn(100)})

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Matplotlib version
axes[0].scatter(data["x"], data["y"], alpha=0.6)
axes[0].set_title("Matplotlib", fontsize=14, fontweight="bold")
axes[0].set_xlabel("X values")
axes[0].set_ylabel("Y values")
axes[0].grid(True, alpha=0.3)

# Seaborn version (with regression line)
sns.regplot(data=data, x="x", y="y", ax=axes[1], scatter_kws={"alpha": 0.6})
axes[1].set_title("Seaborn (with regression)", fontsize=14, fontweight="bold")

fig.suptitle("Matplotlib vs Seaborn", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.show()

print("Notice how a single Seaborn call (regplot) adds a fitted regression line and confidence band!")

In [None]:
# Load a sample dataset for demonstrations
# We'll create a realistic customer dataset
n_samples = 500

customers = pd.DataFrame(
    {
        "age": np.random.normal(35, 12, n_samples).clip(18, 80).astype(int),
        "income": np.random.normal(50000, 20000, n_samples).clip(20000, 150000),
        "spending_score": np.random.randint(1, 100, n_samples),
        "satisfaction": np.random.choice(["Low", "Medium", "High"], n_samples, p=[0.2, 0.5, 0.3]),
        "region": np.random.choice(["North", "South", "East", "West"], n_samples),
        "gender": np.random.choice(["Male", "Female"], n_samples),
    }
)

# Add some correlation
customers["income"] = customers["income"] + customers["age"] * 500
customers["spending_score"] = (
    (customers["income"] / 1500 + np.random.randint(-20, 20, n_samples)).clip(1, 100).astype(int)
)

print("Customer Dataset Created!")
print(f"Shape: {customers.shape}")
print("\nFirst few rows:")
customers.head()

## Part 2: Distribution Plots

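Distribution plots show how the values of a variable are spread: where they concentrate, how widely they range, and whether the shape is skewed or has several peaks.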

### Key Functions
- `histplot()` - Histogram with optional KDE
- `kdeplot()` - Kernel Density Estimation (smooth distribution)
- `displot()` - Figure-level distribution plot
- `violinplot()` - Combination of box plot and KDE

In [None]:
# Example 1: Histogram with KDE
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Simple histogram
sns.histplot(data=customers, x="age", bins=20, ax=axes[0], color="steelblue")
axes[0].set_title("Age Distribution (Histogram)", fontsize=14, fontweight="bold")
axes[0].set_xlabel("Age", fontsize=12)
axes[0].set_ylabel("Count", fontsize=12)

# Histogram with KDE overlay
sns.histplot(data=customers, x="age", bins=20, kde=True, ax=axes[1], color="coral")
axes[1].set_title("Age Distribution (with KDE)", fontsize=14, fontweight="bold")
axes[1].set_xlabel("Age", fontsize=12)
axes[1].set_ylabel("Count", fontsize=12)

plt.tight_layout()
plt.show()

print("KDE (Kernel Density Estimation) draws a smooth estimate of the distribution")

In [None]:
# Example 2: Comparing distributions by category
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Multiple histograms
sns.histplot(data=customers, x="income", hue="gender", bins=20, alpha=0.6, ax=axes[0])
axes[0].set_title("Income Distribution by Gender", fontsize=14, fontweight="bold")
axes[0].set_xlabel("Income ($)", fontsize=12)

# Multiple KDE plots
sns.kdeplot(data=customers, x="income", hue="gender", fill=True, alpha=0.5, ax=axes[1])
axes[1].set_title("Income Distribution by Gender (KDE)", fontsize=14, fontweight="bold")
axes[1].set_xlabel("Income ($)", fontsize=12)

plt.tight_layout()
plt.show()

print("The 'hue' parameter automatically creates grouped visualizations!")

In [None]:
# Example 3: Violin plot - Distribution + Box plot
fig, ax = plt.subplots(figsize=(12, 6))

sns.violinplot(data=customers, x="region", y="spending_score", hue="gender", split=True, ax=ax)
ax.set_title("Spending Score Distribution by Region and Gender", fontsize=16, fontweight="bold")
ax.set_xlabel("Region", fontsize=12)
ax.set_ylabel("Spending Score", fontsize=12)

plt.tight_layout()
plt.show()

print("Violin plots show:")
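print("- Width = estimated density (wider means more data near that value)")
print("- Inner box = interquartile range (25th to 75th percentile)")
print("- Marker inside the box = median (its appearance varies by Seaborn version)")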

## Part 3: Categorical Plots

Categorical plots compare data across categories.

### Key Functions
- `boxplot()` - Box-and-whisker plot
- `barplot()` - Bar plot with error bars
- `countplot()` - Count of observations
- `stripplot()` - Scatter plot for categorical data
- `swarmplot()` - Non-overlapping categorical scatter

In [None]:
# Example 1: Box plot
fig, ax = plt.subplots(figsize=(12, 6))

sns.boxplot(data=customers, x="satisfaction", y="income", order=["Low", "Medium", "High"], ax=ax)
ax.set_title("Income by Satisfaction Level", fontsize=16, fontweight="bold")
ax.set_xlabel("Satisfaction Level", fontsize=12)
ax.set_ylabel("Income ($)", fontsize=12)

plt.tight_layout()
plt.show()

print("Box plot shows:")
print("- Box = 25th to 75th percentile (IQR)")
print("- Line in box = median")
print("- Whiskers = most extreme points within 1.5 * IQR of the box edges")
print("- Points = observations beyond the whiskers (potential outliers)")
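
The box plot elements can be reproduced by hand. The sketch below uses a small made-up array and NumPy's default percentile method; it illustrates the usual 1.5 * IQR rule rather than Seaborn's internal code.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # made-up data with one extreme value

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that lie inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outliers}")
```

The whiskers end at actual data points, so they are usually shorter than the full 1.5 * IQR distance.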

In [None]:
# Example 2: Bar plot with confidence intervals
fig, ax = plt.subplots(figsize=(12, 6))

sns.barplot(data=customers, x="region", y="spending_score", hue="satisfaction", ax=ax)
ax.set_title("Average Spending Score by Region and Satisfaction", fontsize=16, fontweight="bold")
ax.set_xlabel("Region", fontsize=12)
ax.set_ylabel("Average Spending Score", fontsize=12)

plt.tight_layout()
plt.show()

print("Bar height = mean, error bars = 95% confidence interval of the mean (bootstrapped by default)")
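
To make the error bars less of a black box, here is a minimal sketch of a percentile bootstrap for the mean, written in plain NumPy. The data and the choice of 1000 resamples are assumptions for illustration; Seaborn's own implementation may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(50, 15, size=120)  # hypothetical spending scores for one group

n_boot = 1000
# Resample with replacement and record the mean of each resample
boot_means = np.array(
    [rng.choice(scores, size=scores.size, replace=True).mean() for _ in range(n_boot)]
)

# The 95% interval spans the 2.5th to 97.5th percentile of the bootstrap means
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {scores.mean():.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval narrows as the group gets larger, which is why bars for small groups tend to have long error bars.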

In [None]:
# Example 3: Count plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Simple count
sns.countplot(data=customers, x="region", ax=axes[0])
axes[0].set_title("Customer Count by Region", fontsize=14, fontweight="bold")
axes[0].set_xlabel("Region", fontsize=12)
axes[0].set_ylabel("Count", fontsize=12)

# Count with hue
sns.countplot(data=customers, x="region", hue="satisfaction", ax=axes[1])
axes[1].set_title("Customer Count by Region and Satisfaction", fontsize=14, fontweight="bold")
axes[1].set_xlabel("Region", fontsize=12)
axes[1].set_ylabel("Count", fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Example 4: Strip plot and Swarm plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Draw one random sample so both plots show the same customers
customer_sample = customers.sample(100, random_state=42)

# Strip plot (points can overlap)
sns.stripplot(data=customer_sample, x="region", y="age", hue="gender", alpha=0.6, ax=axes[0])
axes[0].set_title("Age Distribution by Region (Strip Plot)", fontsize=14, fontweight="bold")
axes[0].legend(title="Gender")

# Swarm plot (points are nudged apart so they do not overlap)
sns.swarmplot(data=customer_sample, x="region", y="age", hue="gender", ax=axes[1])
axes[1].set_title("Age Distribution by Region (Swarm Plot)", fontsize=14, fontweight="bold")
axes[1].legend(title="Gender")

plt.tight_layout()
plt.show()

print("Swarm plots show all individual data points without overlap!")
print("Note: Swarm plots can be slow with large datasets")

## Part 4: Relationship Plots

Relationship plots show how two or more variables relate to each other.

### Key Functions
- `scatterplot()` - Scatter plot with optional hue/size
- `lineplot()` - Line plot with confidence intervals
- `regplot()` - Scatter plot with regression line
- `lmplot()` - Figure-level regression plot
- `residplot()` - Residual plot for regression

In [None]:
# Example 1: Enhanced scatter plot
fig, ax = plt.subplots(figsize=(12, 7))

sns.scatterplot(
    data=customers,
    x="age",
    y="income",
    hue="region",
    size="spending_score",
    sizes=(20, 200),
    alpha=0.6,
    ax=ax,
)
ax.set_title("Income vs Age (sized by Spending Score)", fontsize=16, fontweight="bold")
ax.set_xlabel("Age", fontsize=12)
ax.set_ylabel("Income ($)", fontsize=12)

plt.tight_layout()
plt.show()

print("Seaborn scatter plots can encode multiple dimensions:")
print("- X, Y positions")
print("- Color (hue)")
print("- Size")
print("- Style (not shown)")
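
The `style` encoding mentioned above maps a categorical column to marker shape. A minimal sketch on made-up data:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
n = 120
# Hypothetical data with two categorical columns
points = pd.DataFrame(
    {
        "age": rng.integers(18, 80, n),
        "income": rng.normal(60000, 15000, n),
        "region": rng.choice(["North", "South"], n),
        "gender": rng.choice(["Male", "Female"], n),
    }
)

fig, ax = plt.subplots(figsize=(8, 5))
# hue controls color, style controls the marker shape
sns.scatterplot(data=points, x="age", y="income", hue="region", style="gender", ax=ax)
```

Marker shape remains readable in grayscale print, so it is a useful second channel when color alone is not enough.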

In [None]:
# Example 2: Regression plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Simple regression
sns.regplot(data=customers, x="age", y="income", ax=axes[0], scatter_kws={"alpha": 0.5})
axes[0].set_title("Age vs Income (with regression)", fontsize=14, fontweight="bold")
axes[0].set_xlabel("Age", fontsize=12)
axes[0].set_ylabel("Income ($)", fontsize=12)

# Regression by category
for region in customers["region"].unique():
    subset = customers[customers["region"] == region]
    sns.regplot(
        data=subset, x="age", y="income", ax=axes[1], label=region, scatter_kws={"alpha": 0.4}
    )

axes[1].set_title("Age vs Income by Region", fontsize=14, fontweight="bold")
axes[1].set_xlabel("Age", fontsize=12)
axes[1].set_ylabel("Income ($)", fontsize=12)
axes[1].legend(title="Region")

plt.tight_layout()
plt.show()

print("The regression line shows the fitted linear trend; the shaded band is its 95% confidence interval")
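
`regplot()` draws the fitted line but does not return its coefficients. If you need the numbers, one option is to fit the line separately, for example with `np.polyfit`, and then check the fit with `residplot()`. The data below is synthetic, with a known slope of 500.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(11)
# Hypothetical data with a known linear trend: income = 30000 + 500 * age + noise
age = rng.uniform(18, 80, 200)
income = 30000 + 500 * age + rng.normal(0, 5000, 200)
people = pd.DataFrame({"age": age, "income": income})

# Ordinary least-squares fit of a degree-1 polynomial
slope, intercept = np.polyfit(people["age"], people["income"], deg=1)
print(f"income ~ {intercept:,.0f} + {slope:,.0f} * age")

fig, ax = plt.subplots(figsize=(8, 4))
# Residuals scattered evenly around zero suggest a straight line is adequate
sns.residplot(data=people, x="age", y="income", ax=ax)
ax.set_ylabel("Residual ($)")
```

A curved or funnel-shaped pattern in the residuals would indicate that a straight line is a poor description of the data.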

In [None]:
# Example 3: lmplot for faceted regression
g = sns.lmplot(
    data=customers,
    x="age",
    y="income",
    col="region",
    hue="gender",
    height=4,
    aspect=1.2,
    scatter_kws={"alpha": 0.5},
)
g.fig.suptitle(
    "Income vs Age: Faceted by Region, Colored by Gender", fontsize=16, fontweight="bold", y=1.02
)
plt.tight_layout()
plt.show()

print("lmplot creates separate plots for each category automatically!")

## Part 5: Pair Plots - Exploring Multiple Variables

Pair plots create a grid showing relationships between all pairs of variables.

### When to Use
- Initial data exploration
- Finding correlations
- Identifying patterns across multiple variables
- Understanding variable distributions

In [None]:
# Example 1: Basic pair plot
# Select numeric columns
numeric_cols = ["age", "income", "spending_score"]

g = sns.pairplot(customers[numeric_cols + ["gender"]], hue="gender", diag_kind="kde", height=2.5)
g.fig.suptitle("Pair Plot: Customer Variables by Gender", fontsize=16, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

print("Pair plots show:")
print("- Diagonal: Distribution of each variable")
print("- Off-diagonal: Relationships between variable pairs")

In [None]:
# Example 2: Pair plot with regression
g = sns.pairplot(
    customers[numeric_cols + ["region"]],
    hue="region",
    kind="reg",
    diag_kind="kde",
    height=2.5,
    plot_kws={"scatter_kws": {"alpha": 0.5}},
)
g.fig.suptitle("Pair Plot with Regression Lines", fontsize=16, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

print("Adding regression lines helps identify linear relationships!")

## Part 6: Correlation Heatmaps

Heatmaps visualize correlation matrices, showing which variables are related.

### Correlation Values
- **+1**: Perfect positive correlation
- **0**: No linear correlation (a nonlinear relationship may still exist)
- **-1**: Perfect negative correlation
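
The values above are Pearson correlation coefficients, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. The sketch below computes $r$ by hand on made-up data and shows that a perfectly dependent but nonlinear pair can still have $r$ close to 0. The helper name `pearson` is hypothetical.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 101)
frame = pd.DataFrame({"x": x, "linear": 2 * x + 1, "quadratic": x**2})


def pearson(a, b):
    """Pearson r: co-variation divided by the product of the individual spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered**2).sum() * (b_centered**2).sum()
    )


r_linear = pearson(frame["x"], frame["linear"])
r_quadratic = pearson(frame["x"], frame["quadratic"])

print(f"x vs linear:    r = {r_linear:.2f}")
print(f"x vs quadratic: r = {r_quadratic:.2f}")
print(f"pandas agrees:  {np.isclose(r_linear, frame['x'].corr(frame['linear']))}")
```

This is why a scatter plot is worth checking before reading too much into a correlation value.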

In [None]:
# Example 1: Correlation heatmap
# Calculate correlation matrix
corr_matrix = customers[numeric_cols].corr()

fig, ax = plt.subplots(figsize=(10, 8))

sns.heatmap(
    corr_matrix,
    annot=True,  # Show correlation values
    fmt=".2f",  # Format to 2 decimal places
    cmap="coolwarm",  # Color scheme
    center=0,  # Center colormap at 0
    square=True,  # Square cells
    linewidths=1,  # Lines between cells
    cbar_kws={"shrink": 0.8},
    ax=ax,
)

ax.set_title("Correlation Matrix: Customer Variables", fontsize=16, fontweight="bold", pad=20)

plt.tight_layout()
plt.show()

print("Strong correlations to note:")
for i in range(len(numeric_cols)):
    for j in range(i + 1, len(numeric_cols)):
        corr_val = corr_matrix.iloc[i, j]
        if abs(corr_val) > 0.5:
            print(f"  {numeric_cols[i]} ↔ {numeric_cols[j]}: {corr_val:.2f}")
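
A correlation matrix is symmetric, so half of the cells repeat the other half. A common variation, sketched here on made-up data, hides the diagonal and upper triangle with the `mask` argument of `heatmap()`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(9)
# Hypothetical numeric data
frame = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
corr = frame.corr()

# True marks cells to hide: the diagonal and everything above it
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale comparable between heatmaps of different datasets.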

In [None]:
# Example 2: Heatmap with clustering
# Create a more complex dataset for demonstration
extended_data = customers[numeric_cols].copy()
extended_data["age_squared"] = extended_data["age"] ** 2
extended_data["income_per_age"] = extended_data["income"] / extended_data["age"]
extended_data["total_score"] = extended_data["spending_score"] + np.random.randint(
    -10, 10, len(extended_data)
)

corr_matrix_extended = extended_data.corr()

# Clustered heatmap
g = sns.clustermap(
    corr_matrix_extended,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    center=0,
    figsize=(12, 10),
    linewidths=0.5,
)

g.fig.suptitle("Clustered Correlation Heatmap", fontsize=16, fontweight="bold", y=0.98)
plt.show()

print("Clustered heatmap groups similar variables together!")

In [None]:
# Example 3: Annotated heatmap with custom data
# Create a summary by region and satisfaction
pivot_data = customers.pivot_table(
    values="spending_score", index="region", columns="satisfaction", aggfunc="mean"
)

# Reorder columns
pivot_data = pivot_data[["Low", "Medium", "High"]]

fig, ax = plt.subplots(figsize=(10, 6))

sns.heatmap(
    pivot_data,
    annot=True,
    fmt=".1f",
    cmap="YlGnBu",
    linewidths=1,
    cbar_kws={"label": "Average Spending Score"},
    ax=ax,
)

ax.set_title(
    "Average Spending Score by Region and Satisfaction", fontsize=16, fontweight="bold", pad=20
)
ax.set_xlabel("Satisfaction Level", fontsize=12)
ax.set_ylabel("Region", fontsize=12)

plt.tight_layout()
plt.show()

print("Heatmaps aren't just for correlations - use them for any matrix data!")

## Part 7: Advanced Seaborn Features

### FacetGrid - Create Multiple Subplots by Category

FacetGrid allows you to create a grid of plots based on categorical variables.

In [None]:
# Example 1: FacetGrid with histograms
g = sns.FacetGrid(customers, col="region", hue="gender", height=4, aspect=1.2)
g.map(sns.histplot, "income", bins=15, alpha=0.7)
g.add_legend()
g.fig.suptitle("Income Distribution by Region and Gender", fontsize=16, fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Example 2: FacetGrid with scatter plots
g = sns.FacetGrid(customers, col="region", row="satisfaction", height=3, aspect=1.3)
g.map(sns.scatterplot, "age", "spending_score", alpha=0.6)
g.fig.suptitle(
    "Age vs Spending Score: Faceted by Region and Satisfaction",
    fontsize=16,
    fontweight="bold",
    y=1.01,
)
plt.tight_layout()
plt.show()

print("FacetGrid creates a matrix of plots for easy comparison!")

In [None]:
# Example 3: Joint plot - Combining scatter and distributions
g = sns.jointplot(
    data=customers, x="age", y="income", kind="scatter", hue="gender", height=8, alpha=0.6
)
g.fig.suptitle(
    "Joint Plot: Age vs Income with Marginal Distributions", fontsize=16, fontweight="bold", y=1.01
)
plt.tight_layout()
plt.show()

print("Joint plots show:")
print("- Center: Relationship between variables")
print("- Top: Distribution of X variable")
print("- Right: Distribution of Y variable")

In [None]:
# Example 4: Joint plot variations
# jointplot() is figure-level: it creates its own figure and cannot draw into
# an existing subplot grid, so each variation is shown as a separate figure
for kind in ["scatter", "kde", "reg"]:
    g = sns.jointplot(data=customers, x="age", y="income", kind=kind, height=5)
    g.fig.suptitle(f"Joint Plot: kind='{kind}'", fontsize=14, fontweight="bold", y=1.02)
    plt.show()

# Hexbin variation in more detail
g = sns.jointplot(data=customers, x="age", y="income", kind="hex", height=8, cmap="Blues")
g.fig.suptitle("Joint Plot with Hexbin (shows density)", fontsize=16, fontweight="bold", y=1.01)
plt.show()

print("Different joint plot types:")
print("- scatter: Individual points")
print("- kde: Kernel density estimation")
print("- hex: Hexagonal binning (shows density)")
print("- reg: Regression line")

## Part 8: Customizing Seaborn Plots

While Seaborn has beautiful defaults, you can customize to match your needs.

In [None]:
# Example 1: Custom color palettes
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

palettes = ["deep", "muted", "pastel", "dark"]

for ax, palette in zip(axes.ravel(), palettes):
    sns.barplot(
        data=customers, x="region", y="spending_score", hue="satisfaction", palette=palette, ax=ax
    )
    ax.set_title(f"Palette: {palette}", fontsize=14, fontweight="bold")
    ax.set_xlabel("Region", fontsize=11)
    ax.set_ylabel("Spending Score", fontsize=11)

fig.suptitle("Seaborn Color Palettes", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.show()

print("Available palettes: deep, muted, pastel, bright, dark, colorblind")
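
Besides passing a palette name, you can build a palette explicitly and pin each category to a color. The sketch below uses `sns.color_palette()` and a dict passed as `palette`; the summary data is made up.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# color_palette() returns a list-like of RGB tuples
colors = sns.color_palette("colorblind", n_colors=3)

# Map each category to a fixed color so it stays the same across plots
satisfaction_colors = dict(zip(["Low", "Medium", "High"], colors))

# Hypothetical summary data
scores = pd.DataFrame(
    {
        "region": ["North", "North", "North", "South", "South", "South"],
        "satisfaction": ["Low", "Medium", "High"] * 2,
        "score": [40, 55, 70, 35, 50, 65],
    }
)

fig, ax = plt.subplots(figsize=(7, 4))
sns.barplot(
    data=scores, x="region", y="score", hue="satisfaction", palette=satisfaction_colors, ax=ax
)
```

A fixed mapping keeps a category the same color even when a plot shows only some of the categories.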

In [None]:
# Example 2: Seaborn themes/styles
styles = ["darkgrid", "whitegrid", "dark", "white", "ticks"]
x = np.linspace(0, 10, 100)

fig = plt.figure(figsize=(16, 10))

for idx, style in enumerate(styles, start=1):
    # Style parameters take effect when an Axes is created, so each
    # subplot must be added inside the axes_style() context manager
    with sns.axes_style(style):
        ax = fig.add_subplot(2, 3, idx)
    ax.plot(x, np.sin(x), linewidth=2.5)
    ax.set_title(f"Style: {style}", fontsize=14, fontweight="bold")
    ax.set_xlabel("X values")
    ax.set_ylabel("sin(x)")

fig.suptitle("Seaborn Styles", fontsize=18, fontweight="bold")
plt.tight_layout()
plt.show()

In [None]:
# Example 3: Combining Seaborn with Matplotlib customization
fig, ax = plt.subplots(figsize=(14, 8))

# Create Seaborn plot
sns.violinplot(
    data=customers, x="region", y="income", hue="satisfaction", split=False, palette="Set2", ax=ax
)

# Add Matplotlib customizations
ax.set_title(
    "Income Distribution by Region and Satisfaction\n(with custom styling)",
    fontsize=18,
    fontweight="bold",
    pad=20,
)
ax.set_xlabel("Region", fontsize=14, fontweight="bold")
ax.set_ylabel("Income ($)", fontsize=14, fontweight="bold")

# Add reference line
mean_income = customers["income"].mean()
ax.axhline(
    y=mean_income,
    color="red",
    linestyle="--",
    linewidth=2,
    alpha=0.7,
    label=f"Overall Mean: ${mean_income:,.0f}",
)

# Customize legend
ax.legend(loc="upper left", fontsize=11, framealpha=0.9)

# Add grid
ax.grid(True, alpha=0.3, axis="y", linestyle=":")

plt.tight_layout()
plt.show()

print("You can combine Seaborn's beauty with Matplotlib's flexibility!")

## Part 9: Key Takeaways

### What You've Learned
✓ **Seaborn advantages**: Beautiful defaults, statistical focus, DataFrame integration  
✓ **Distribution plots**: histplot, kdeplot, violinplot  
✓ **Categorical plots**: boxplot, barplot, countplot, swarmplot  
✓ **Relationship plots**: scatterplot, regplot, lmplot  
✓ **Pair plots**: Explore multiple variables simultaneously  
✓ **Heatmaps**: Visualize correlations and matrix data  
✓ **FacetGrid**: Create multi-panel comparative plots  
✓ **Customization**: Palettes, styles, and Matplotlib integration  

### Seaborn Function Categories

**Axes-level** (plot on specific axes):
- `histplot()`, `boxplot()`, `scatterplot()`, etc.
- Use with: `fig, ax = plt.subplots()`

**Figure-level** (create entire figure):
- `displot()`, `catplot()`, `relplot()`, `lmplot()`
- Easier for faceting; customization goes through the returned grid object rather than a single Axes
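
The difference shows up in what each call returns. The sketch below draws the same made-up data with the axes-level `boxplot()` and the figure-level `catplot()`; the column names are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
# Hypothetical data
demo = pd.DataFrame(
    {
        "value": rng.normal(size=200),
        "group": rng.choice(["A", "B"], 200),
        "site": rng.choice(["X", "Y"], 200),
    }
)

# Axes-level: draws on an Axes you create and returns that same Axes
fig, ax = plt.subplots(figsize=(6, 4))
returned = sns.boxplot(data=demo, x="group", y="value", ax=ax)
ax.set_title("Axes-level: boxplot()")

# Figure-level: creates its own figure and returns a FacetGrid
g = sns.catplot(data=demo, x="group", y="value", col="site", kind="box", height=3)
g.set_axis_labels("Group", "Value")
```

Axes-level functions fit into a layout you build yourself. Figure-level functions handle faceting for you in a single call.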

### When to Use Each Plot Type

| Goal | Plot Type |
|----- |-----------|
| Show distribution | histplot, kdeplot, violinplot |
| Compare categories | boxplot, barplot, countplot |
| Show relationship | scatterplot, regplot |
| Explore multiple variables | pairplot |
| Show correlations | heatmap |
| Compare across categories | FacetGrid, catplot |

### What's Next
In **Module 04**, you'll learn time series visualization:
- Handling datetime data
- Temporal trends and patterns
- Seasonality and moving averages
- Multiple time series comparison
- Professional financial and weather visualizations

---

## Exercises

Practice your Seaborn skills with these challenges!

### Exercise 1: Distribution Analysis
Using the customers dataset:
1. Create a figure with 2x2 subplots showing the distributions of age, income, and spending_score (use the fourth panel for a comparison of your choice)
2. Use different plot types (histogram, KDE, violin plot)
3. Color by gender or region
4. Add appropriate titles and labels

In [None]:
# Your code here

### Exercise 2: Categorical Comparison
Create a comprehensive categorical analysis:
1. Count plot showing distribution across regions
2. Box plot comparing income across satisfaction levels
3. Bar plot showing average spending score by region and gender
4. Use a consistent color palette throughout

In [None]:
# Your code here

### Exercise 3: Relationship Exploration
1. Create a scatter plot of age vs spending_score
2. Color by satisfaction level
3. Add regression lines
4. Create a joint plot showing marginal distributions
5. What patterns do you notice?

In [None]:
# Your code here

### Exercise 4: Correlation Investigation
1. Create a correlation matrix for all numeric variables
2. Visualize with a heatmap
3. Identify the strongest positive and negative correlations
4. Create scatter plots for the most correlated pairs

In [None]:
# Your code here

### Challenge: Create Your Own Dataset and Analysis
1. Create a synthetic dataset with at least 4 numeric and 2 categorical variables
2. Add intentional correlations between some variables
3. Perform a complete Seaborn analysis:
   - Distribution plots
   - Categorical comparisons
   - Pair plot
   - Correlation heatmap
   - FacetGrid analysis
4. Write markdown cells explaining your findings

In [None]:
# Your code here

---

**Congratulations!** You've mastered Seaborn for statistical visualization. You can now create beautiful, informative plots that reveal patterns and relationships in your data with minimal code.

**Next**: Module 04 - Time Series Visualization