# Data Visualization with Matplotlib & Seaborn

## Welcome to Notebook 3!

**Why does data visualization matter?** In data science, visualization is how we turn raw numbers into stories. A well-crafted chart can reveal patterns, trends, and outliers that are invisible in a spreadsheet. Whether you're exploring data for the first time (EDA) or presenting findings to others, visualization is an essential skill.

**What this notebook covers:**
- **Matplotlib** — Python's foundational plotting library, giving you full control over every element of a chart
- **Seaborn** — A higher-level library built on top of Matplotlib that makes statistical plots beautiful and easy
- **Pandas plotting** — Quick built-in plotting for rapid exploration
- How to combine these tools to build polished, publication-ready visualizations

**Prerequisites:**
- Notebook 1: Intro to Python (variables, lists, loops, functions)
- Notebook 2: Data Manipulation with Pandas (DataFrames, filtering, grouping)
- Matplotlib and Seaborn installed, e.g. with `pip install matplotlib seaborn` (both are included in Anaconda)

## Table of Contents

1. [Matplotlib Fundamentals](#1.-Matplotlib-Fundamentals) — Line plots, bar charts, scatter plots, histograms
2. [Customizing Matplotlib](#2.-Customizing-Matplotlib) — Subplots, styles, annotations, saving figures
3. [Plotting with Pandas](#3.-Plotting-with-Pandas) — Quick built-in plots for EDA
4. [Seaborn Basics](#4.-Seaborn-Basics) — Distributions, categorical plots, scatter plots
5. [Advanced Seaborn](#5.-Advanced-Seaborn) — Heatmaps, pair plots, FacetGrids, customization
6. [Putting It All Together](#6.-Putting-It-All-Together) — Build a complete dashboard

In [None]:
# --- Imports ---
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Display plots inline in the notebook
%matplotlib inline

# Use a clean style for our plots
plt.style.use('seaborn-v0_8-whitegrid')

# --- Sample Dataset ---
# We'll use this batting data throughout the notebook
batting = pd.DataFrame({
    'Player': ['Aaron Judge', 'Shohei Ohtani', 'Mookie Betts', 'Juan Soto', 'Freddie Freeman',
               'Corey Seager', 'Marcus Semien', 'Rafael Devers', 'Vlad Guerrero Jr.', 'Trea Turner'],
    'Team': ['NYY', 'LAD', 'LAD', 'NYY', 'LAD', 'TEX', 'TEX', 'BOS', 'TOR', 'PHI'],
    'HR': [58, 54, 39, 41, 22, 33, 29, 28, 30, 21],
    'AVG': [.321, .310, .307, .292, .281, .294, .281, .297, .301, .271],
    'RBI': [144, 130, 98, 109, 89, 100, 100, 91, 97, 75],
    'OBP': [.425, .390, .380, .410, .338, .350, .330, .360, .355, .318],
    'SLG': [.701, .646, .579, .569, .461, .530, .490, .521, .540, .440],
    'SO': [175, 162, 105, 141, 113, 136, 133, 127, 117, 134],
    'BB': [78, 81, 65, 129, 62, 55, 50, 61, 57, 41],
    'Age': [32, 30, 32, 26, 35, 30, 34, 28, 25, 31]
})

# OPS = On-base Plus Slugging, a key offensive metric
batting['OPS'] = batting['OBP'] + batting['SLG']

print(f"Sample data loaded: {batting.shape[0]} players")
batting.head()

---

## 1. Matplotlib Fundamentals

**Matplotlib** is the most widely used plotting library in Python. Several other tools, including Seaborn and Pandas' built-in plotting, are built on top of it.

The simplest way to use Matplotlib is through `pyplot`, which we imported as `plt`. The basic pattern is:

```python
plt.plot(x_data, y_data)   # Create the plot
plt.title('My Title')       # Add a title
plt.xlabel('X Axis')        # Label the x-axis
plt.ylabel('Y Axis')        # Label the y-axis
plt.show()                   # Display the plot
```

Let's start with the most common chart types.

### Line Plot

Line plots are perfect for showing trends over time. Let's track fictional monthly home run totals for two players across a season.

In [None]:
# --- Line Plot: Monthly HR Totals ---

# Fictional monthly home run data for two sluggers
months = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']
judge_hrs = [8, 12, 11, 9, 10, 8]       # Aaron Judge
ohtani_hrs = [7, 10, 9, 12, 8, 8]       # Shohei Ohtani

# Create the plot
plt.figure(figsize=(10, 6))
plt.plot(months, judge_hrs, label='Aaron Judge', marker='o')
plt.plot(months, ohtani_hrs, label='Shohei Ohtani', marker='s')

# Add labels and title
plt.title('Monthly Home Run Totals (2024 Season)', fontsize=14)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Home Runs', fontsize=12)
plt.legend(fontsize=11)

plt.show()

### Bar Chart

Bar charts are ideal for comparing values across categories. Let's visualize home run totals for each player.

In [None]:
# --- Vertical Bar Chart: HR by Player ---

# Sort players by HR for a cleaner look
sorted_batting = batting.sort_values('HR', ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(sorted_batting['Player'], sorted_batting['HR'], color='steelblue', edgecolor='black')

plt.title('Home Runs by Player', fontsize=14)
plt.xlabel('Player', fontsize=12)
plt.ylabel('Home Runs', fontsize=12)
plt.xticks(rotation=45, ha='right')  # Rotate labels so they don't overlap
plt.tight_layout()

plt.show()

In [None]:
# --- Horizontal Bar Chart: Much easier to read player names! ---

# Sort ascending so the highest value is at the top
sorted_batting = batting.sort_values('HR', ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(sorted_batting['Player'], sorted_batting['HR'], color='coral', edgecolor='black')

plt.title('Home Runs by Player (Horizontal)', fontsize=14)
plt.xlabel('Home Runs', fontsize=12)
plt.ylabel('Player', fontsize=12)
plt.tight_layout()

plt.show()

# Tip: Horizontal bar charts (barh) are often better when you have long category labels!

### Scatter Plot

Scatter plots reveal relationships between two numeric variables. Let's explore whether players who hit more home runs also tend to have higher batting averages.

In [None]:
# --- Scatter Plot: HR vs AVG ---
# The 's' parameter controls dot size, 'c' controls color
# We'll size the dots by RBI and color them by OPS

plt.figure(figsize=(10, 7))

scatter = plt.scatter(
    batting['HR'],
    batting['AVG'],
    s=batting['RBI'] * 3,      # Size dots by RBI (scaled up for visibility)
    c=batting['OPS'],           # Color dots by OPS
    cmap='YlOrRd',              # Yellow-Orange-Red colormap
    edgecolors='black',
    alpha=0.8
)

# Add a colorbar to explain what the colors mean
plt.colorbar(scatter, label='OPS')

# Label each point with the player's name
for i, player in enumerate(batting['Player']):
    plt.annotate(player, (batting['HR'].iloc[i], batting['AVG'].iloc[i]),
                 fontsize=8, ha='left', va='bottom')

plt.title('Home Runs vs Batting Average\n(dot size = RBI, color = OPS)', fontsize=14)
plt.xlabel('Home Runs', fontsize=12)
plt.ylabel('Batting Average', fontsize=12)
plt.tight_layout()

plt.show()

### Histogram

Histograms show the **distribution** of a single numeric variable — how frequently values fall into different ranges (bins). This is one of the first things you should plot when exploring a new dataset.

In [None]:
# --- Histogram: Distribution of Batting Averages ---

plt.figure(figsize=(10, 6))

# bins controls how many bars the data is divided into
# edgecolor='black' adds borders so bars are distinct
plt.hist(batting['AVG'], bins=6, color='mediumseagreen', edgecolor='black', alpha=0.7)

# Add a vertical line for the mean
mean_avg = batting['AVG'].mean()
plt.axvline(mean_avg, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_avg:.3f}')

plt.title('Distribution of Batting Averages', fontsize=14)
plt.xlabel('Batting Average', fontsize=12)
plt.ylabel('Number of Players', fontsize=12)
plt.legend(fontsize=11)

plt.show()

# With only 10 players this histogram is sparse — histograms shine
# with larger datasets (50+ data points). We'll see better examples with Seaborn!
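
To see what `bins=6` actually does, the same binning can be computed without drawing anything. The sketch below applies NumPy's `np.histogram` to the ten batting averages from the sample table (copied here so the block runs on its own); the name `avg_values` is only illustrative.

```python
import numpy as np

# The ten AVG values from the sample batting table
avg_values = np.array([.321, .310, .307, .292, .281, .294, .281, .297, .301, .271])

# For an integer `bins`, the range [min, max] is split into equal-width bins,
# which is also what plt.hist does by default
counts, edges = np.histogram(avg_values, bins=6)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.3f} to {right:.3f}: {n} player(s)")
```

Six bins need seven edges, and the counts always add up to the number of data points. With so few players, several bins hold only one or two values, which is why the histogram above looks sparse.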

### Quick Reference: Essential Plot Labels

Every good chart needs proper labeling. Here are the key functions you should use on **every** plot:

| Function | What it does | Example |
|---|---|---|
| `plt.title('text')` | Adds a title above the chart | `plt.title('HR Leaders', fontsize=14)` |
| `plt.xlabel('text')` | Labels the x-axis | `plt.xlabel('Home Runs')` |
| `plt.ylabel('text')` | Labels the y-axis | `plt.ylabel('Batting Average')` |
| `plt.legend()` | Shows the legend (for multi-line plots) | `plt.legend(loc='upper right')` |
| `plt.tight_layout()` | Prevents labels from being cut off | `plt.tight_layout()` |
| `plt.figure(figsize=(w,h))` | Sets the figure size in inches | `plt.figure(figsize=(10, 6))` |

**Rule of thumb:** If someone looks at your chart without any surrounding text, they should be able to understand what it shows. Always include a title and axis labels!

---

## 2. Customizing Matplotlib

So far we've used the simple `pyplot` interface (`plt.plot()`, `plt.bar()`, and so on). But Matplotlib also has a more powerful **object-oriented (OO) interface** using `Figure` and `Axes` objects. This gives you much more control, especially when creating multiple plots in one figure.

```python
# The OO approach:
fig, ax = plt.subplots()    # Create a Figure and one Axes
ax.plot(x, y)                # Plot on that specific Axes
ax.set_title('My Title')     # Set title on that Axes
```

Think of it this way:
- **Figure** = the entire canvas (like a piece of paper)
- **Axes** = an individual plot area on that canvas (you can have many on one figure)
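
The canvas/plot-area relationship can be checked directly in code. This is a minimal sketch that creates an empty figure with two plot areas and inspects how the objects refer to each other.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# A Figure keeps a list of every Axes drawn on it
print(len(fig.axes))

# Each Axes knows which Figure it belongs to
print(axes[0].figure is fig)

# Titles belong to an Axes; the Figure has its own overall title
axes[0].set_title('Left Axes')
axes[1].set_title('Right Axes')
fig.suptitle('One Figure, two Axes')
```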

In [None]:
# --- Figure and Axes: The OO Interface ---

# Create a single figure with one axes
fig, ax = plt.subplots(figsize=(10, 6))

# Now use ax.method() instead of plt.method()
sorted_batting = batting.sort_values('HR', ascending=False)
ax.bar(sorted_batting['Player'], sorted_batting['HR'], color='steelblue', edgecolor='black')

# Notice: set_title() instead of title(), set_xlabel() instead of xlabel()
ax.set_title('Home Runs by Player (OO Interface)', fontsize=14)
ax.set_xlabel('Player', fontsize=12)
ax.set_ylabel('Home Runs', fontsize=12)
ax.tick_params(axis='x', rotation=45)

fig.tight_layout()
plt.show()

# The result looks the same, but the OO approach becomes essential
# when you need multiple plots in one figure (subplots)!

### Subplots: Multiple Plots Side by Side

The real power of the OO interface is creating **subplots** — multiple charts in one figure. Use `plt.subplots(nrows, ncols)` to create a grid.

In [None]:
# --- Subplots: HR and AVG Side by Side ---

# Create 1 row, 2 columns of subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sort for both plots
sorted_batting = batting.sort_values('HR', ascending=False)

# Left plot: Home Runs
axes[0].barh(sorted_batting['Player'], sorted_batting['HR'], color='steelblue', edgecolor='black')
axes[0].set_title('Home Runs', fontsize=13)
axes[0].set_xlabel('HR', fontsize=11)
axes[0].invert_yaxis()  # Highest at top

# Right plot: Batting Average
sorted_by_avg = batting.sort_values('AVG', ascending=False)
axes[1].barh(sorted_by_avg['Player'], sorted_by_avg['AVG'], color='coral', edgecolor='black')
axes[1].set_title('Batting Average', fontsize=13)
axes[1].set_xlabel('AVG', fontsize=11)
axes[1].invert_yaxis()

# Add a shared title for the whole figure
fig.suptitle('Offensive Leaders: Power vs Contact', fontsize=15, fontweight='bold', y=1.02)
fig.tight_layout()

plt.show()
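
With more than one row, `plt.subplots` returns a 2D NumPy array of Axes rather than a flat list. The sketch below, using placeholder panels instead of real data, shows the two usual ways to reach each panel.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(9, 5))
print(axes.shape)   # (2, 3)

# Index a single panel with [row, col] ...
axes[0, 2].set_title('row 0, col 2')

# ... or loop over every panel in reading order with .flat
for i, ax in enumerate(axes.flat):
    ax.text(0.5, 0.5, f'panel {i}', ha='center', va='center')

fig.tight_layout()
```

Looping over `axes.flat` is handy when every panel gets the same kind of plot; explicit `[row, col]` indexing is clearer when each panel is different.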

### Colors, Line Styles, and Markers

You can customize every visual element of a line plot. Here are the key parameters.

In [None]:
# --- Customizing Lines: Colors, Styles, Markers ---

months = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']
judge_hrs = [8, 12, 11, 9, 10, 8]
ohtani_hrs = [7, 10, 9, 12, 8, 8]
betts_hrs = [5, 7, 8, 6, 7, 6]

plt.figure(figsize=(10, 6))

# color: named color, hex code, or RGB tuple
# linestyle: '-' solid, '--' dashed, ':' dotted, '-.' dash-dot
# marker: 'o' circle, 's' square, '^' triangle, 'D' diamond, '*' star
# linewidth: thickness of the line
# markersize: size of the markers

plt.plot(months, judge_hrs, color='#1f77b4', linestyle='-', marker='o',
         linewidth=2.5, markersize=10, label='Judge')

plt.plot(months, ohtani_hrs, color='crimson', linestyle='--', marker='s',
         linewidth=2.5, markersize=10, label='Ohtani')

plt.plot(months, betts_hrs, color='forestgreen', linestyle=':', marker='^',
         linewidth=2.5, markersize=10, label='Betts')

plt.title('Customized Line Styles', fontsize=14)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Home Runs', fontsize=12)
plt.legend(fontsize=11)
plt.ylim(0, 15)  # Set y-axis range

plt.show()
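
For quick plots, Matplotlib also accepts a compact format string that combines color, marker, and line style in one argument. A small sketch using the same fictional monthly totals:

```python
import matplotlib.pyplot as plt

months = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']
judge_hrs = [8, 12, 11, 9, 10, 8]

fig, ax = plt.subplots(figsize=(8, 4))

# 'ro--' = red ('r'), circle markers ('o'), dashed line ('--')
(line,) = ax.plot(months, judge_hrs, 'ro--', label='Judge')

ax.set_title('Format-string shorthand')
ax.set_ylabel('Home Runs')
ax.legend()
```

The keyword arguments shown earlier (`color=`, `linestyle=`, `marker=`) are more readable in shared code; the shorthand is mainly useful for quick exploration.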

### Style Themes

Matplotlib comes with many built-in style themes that change the overall look of your plots. You can switch styles with `plt.style.use()`.

In [None]:
# --- Style Themes: Compare Different Looks ---

# We'll use plt.style.context() to temporarily apply a style
# (so it doesn't affect the rest of the notebook)

styles_to_try = ['seaborn-v0_8-whitegrid', 'ggplot', 'dark_background']

# Style settings are read when a Figure and its Axes are created,
# so each figure is built inside its own style context
for style_name in styles_to_try:
    with plt.style.context(style_name):
        fig, ax = plt.subplots(figsize=(6, 3.5))
        ax.bar(batting['Player'][:5], batting['HR'][:5], color='steelblue')
        ax.set_title(f'Same Data, Style: {style_name}', fontsize=11)
        ax.tick_params(axis='x', rotation=45, labelsize=8)
        ax.set_ylabel('HR')
        fig.tight_layout()
        plt.show()

# Reset to our preferred style
plt.style.use('seaborn-v0_8-whitegrid')

# To see ALL available styles:
# print(plt.style.available)
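
Style names can differ between Matplotlib versions (the `seaborn-v0_8-*` names were introduced in Matplotlib 3.6), so a notebook that must run on several installations may want a fallback. A minimal sketch, assuming that falling back to Matplotlib's `'default'` style is acceptable:

```python
import matplotlib.pyplot as plt

preferred = 'seaborn-v0_8-whitegrid'

# Use the preferred style if this Matplotlib version ships it,
# otherwise fall back to the built-in 'default' style
style_name = preferred if preferred in plt.style.available else 'default'

with plt.style.context(style_name):
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot([1, 2, 3], [2, 4, 3], marker='o')
    ax.set_title(f'Style: {style_name}')
```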

### Annotations: Highlighting Key Data Points

Use `plt.annotate()` to draw attention to specific points in your plot. The arrow and text box make it easy to call out outliers or notable values.

In [None]:
# --- Annotations: Call Out the HR Leader ---

plt.figure(figsize=(10, 7))
plt.scatter(batting['HR'], batting['AVG'], s=100, color='steelblue', edgecolors='black')

# Find the top HR hitter
top_hr_idx = batting['HR'].idxmax()
top_player = batting.loc[top_hr_idx]

# Add an annotation with an arrow pointing to the top hitter
plt.annotate(
    f"{top_player['Player']}\n{top_player['HR']} HR, {top_player['AVG']:.3f} AVG",
    xy=(top_player['HR'], top_player['AVG']),          # Point to annotate
    xytext=(top_player['HR'] - 12, top_player['AVG'] - 0.02),  # Text position
    fontsize=11,
    fontweight='bold',
    arrowprops=dict(
        arrowstyle='->',          # Arrow style
        color='red',
        linewidth=2
    ),
    bbox=dict(boxstyle='round,pad=0.3', facecolor='lightyellow', edgecolor='red')
)

plt.title('HR vs AVG with Annotation', fontsize=14)
plt.xlabel('Home Runs', fontsize=12)
plt.ylabel('Batting Average', fontsize=12)
plt.tight_layout()

plt.show()

### Saving Figures

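Use `plt.savefig()` to export your plots as image files. Always call `savefig()` **before** `plt.show()`: once `show()` has displayed the figure, it is typically closed (this is what happens in Jupyter), so a later `savefig()` would write a blank image.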

In [None]:
# --- Saving Figures ---

import os

# Create a 'figures' directory if it doesn't exist
os.makedirs('figures', exist_ok=True)

# Create a plot
plt.figure(figsize=(10, 6))
sorted_batting = batting.sort_values('HR', ascending=True)
plt.barh(sorted_batting['Player'], sorted_batting['HR'], color='steelblue', edgecolor='black')
plt.title('Home Run Leaders', fontsize=14)
plt.xlabel('Home Runs', fontsize=12)
plt.tight_layout()

# Save BEFORE plt.show()!
# dpi = dots per inch (higher = better quality but larger file)
# bbox_inches='tight' prevents labels from being cut off
plt.savefig('figures/hr_leaders.png', dpi=150, bbox_inches='tight')

plt.show()

print("Figure saved to figures/hr_leaders.png")

# Common formats: .png, .jpg, .svg, .pdf
# plt.savefig('figures/hr_leaders.svg')   # Vector format — great for publications
# plt.savefig('figures/hr_leaders.pdf')   # PDF format
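
With the OO interface, the same export can be done through the Figure object itself, which avoids any doubt about which figure is "current". The sketch below writes to a temporary folder so it leaves nothing behind in the project; the file name `hr_demo.png` is just an example.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(['Judge', 'Ohtani'], [58, 54], color='steelblue', edgecolor='black')
ax.set_xlabel('Home Runs')

# Save through the Figure object instead of plt.savefig()
out_dir = tempfile.mkdtemp()                      # throwaway folder for this demo
out_path = os.path.join(out_dir, 'hr_demo.png')
fig.savefig(out_path, dpi=150, bbox_inches='tight')

# Confirm that a non-empty file was written
print(os.path.getsize(out_path) > 0)
```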

---

## 3. Plotting with Pandas

Did you know that Pandas has **built-in plotting** powered by Matplotlib? Every DataFrame and Series has a `.plot()` method that makes quick charts in a single line. This is perfect for fast exploratory data analysis (EDA) when you don't need a polished publication-ready chart.

In [None]:
# --- Pandas Bar Chart: One Line! ---

# Set Player as the index so it becomes the x-axis labels
batting.set_index('Player')['HR'].sort_values().plot(
    kind='barh', figsize=(10, 6), color='steelblue', edgecolor='black',
    title='Home Runs by Player (Pandas .plot())'
)
plt.xlabel('Home Runs')
plt.tight_layout()
plt.show()

In [None]:
# --- Pandas Scatter Plot ---

batting.plot(
    kind='scatter', x='HR', y='AVG', figsize=(10, 6),
    s=80, color='coral', edgecolors='black',
    title='HR vs AVG (Pandas .plot())'
)
plt.tight_layout()
plt.show()

In [None]:
# --- Pandas Box Plot: Quick distribution comparison ---

batting[['HR', 'RBI', 'BB']].plot(
    kind='box', figsize=(8, 6),
    title='Distribution of HR, RBI, and BB'
)
plt.ylabel('Count')
plt.tight_layout()
plt.show()

# Box plots show the median (line), quartiles (box), whiskers (range of typical values),
# and outliers (dots). Great for quickly spotting which stats have the most spread

**When to use Pandas plotting:** Pandas `.plot()` is great for quick EDA — when you want to see the data fast and don't need fine-tuned customization. For polished, presentation-ready charts, use Matplotlib or Seaborn directly. The good news is that since Pandas uses Matplotlib under the hood, you can always add `plt.title()`, `plt.xlabel()`, etc. to customize a Pandas plot further.
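
One detail that makes this customization easier: `.plot()` returns the Matplotlib `Axes` it drew on, so the OO methods from Section 2 can be applied to it directly. A small sketch with three values from the sample table:

```python
import pandas as pd
import matplotlib.pyplot as plt

hr = pd.Series([58, 54, 41], index=['Judge', 'Ohtani', 'Soto'], name='HR')

# .plot() returns the Matplotlib Axes it drew on
ax = hr.plot(kind='bar', color='steelblue', edgecolor='black')

# From here on it is ordinary Matplotlib
ax.set_title('Home Runs (Pandas plot, Matplotlib labels)')
ax.set_ylabel('Home Runs')
ax.tick_params(axis='x', rotation=0)
ax.figure.tight_layout()
```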

---

## 4. Seaborn Basics

**Seaborn** is a statistical visualization library built on top of Matplotlib. It provides:

- **Better default aesthetics** — plots look polished out of the box
- **Statistical plots** — built-in support for distributions, regressions, and categorical comparisons
- **DataFrame integration** — pass column names directly instead of extracting arrays
- **Automatic legends and labels** — less boilerplate code

Let's explore the most useful Seaborn plot types.

In [None]:
# --- Seaborn Theme Setup ---

# sns.set_theme() applies Seaborn's default styling to ALL plots
# (including Matplotlib plots made after this call)
sns.set_theme(style='whitegrid', font_scale=1.1)

print("Seaborn theme applied!")
print("Available styles: white, dark, whitegrid, darkgrid, ticks")

In [None]:
# --- Distribution Plots: histplot and kdeplot ---

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Histogram with KDE overlay
sns.histplot(batting['HR'], bins=6, kde=True, ax=axes[0], color='steelblue')
axes[0].set_title('Distribution of Home Runs')
axes[0].set_xlabel('Home Runs')

# Right: KDE (Kernel Density Estimate) — a smooth version of a histogram
sns.kdeplot(data=batting, x='HR', fill=True, ax=axes[1], color='coral', alpha=0.5, label='HR')
sns.kdeplot(data=batting, x='RBI', fill=True, ax=axes[1], color='steelblue', alpha=0.5, label='RBI')
axes[1].set_title('KDE: HR vs RBI Distributions')
axes[1].set_xlabel('Count')
axes[1].legend()  # Uses the label= given to each kdeplot call

plt.tight_layout()
plt.show()

In [None]:
# --- Box Plot and Violin Plot: Comparing Distributions by Team ---

# Our 10-player dataset is small, so let's generate a larger one
# with ~100 fictional players across 6 teams for better visualization
np.random.seed(42)
teams = ['NYY', 'LAD', 'HOU', 'ATL', 'TEX', 'PHI']
n_players = 100

league_data = pd.DataFrame({
    'Team': np.random.choice(teams, n_players),
    'HR': np.random.normal(25, 10, n_players).clip(0, 60).astype(int),
    'AVG': np.random.normal(.265, .030, n_players).clip(.180, .350).round(3),
    'RBI': np.random.normal(70, 25, n_players).clip(10, 150).astype(int),
})

print(f"Generated {len(league_data)} fictional players across {league_data['Team'].nunique()} teams")

# Box plot: shows median, quartiles, and outliers per team
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

sns.boxplot(data=league_data, x='Team', y='HR', ax=axes[0], palette='Set2')
axes[0].set_title('Home Runs by Team (Box Plot)')

# Violin plot: like a box plot + KDE — shows the full distribution shape
sns.violinplot(data=league_data, x='Team', y='HR', ax=axes[1], palette='Set2')
axes[1].set_title('Home Runs by Team (Violin Plot)')

plt.tight_layout()
plt.show()
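
The numbers behind a box plot are easy to compute by hand. The sketch below uses made-up HR totals (not taken from `league_data`) and the 1.5 × IQR rule that Matplotlib and Seaborn apply by default to decide which points are drawn as outliers.

```python
import numpy as np

# Made-up HR totals with one unusually high value
hr = np.array([14, 18, 21, 22, 24, 25, 27, 29, 33, 60])

# The box spans Q1 to Q3; the line inside it is the median
q1, median, q3 = np.percentile(hr, [25, 50, 75])
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = hr[(hr < lower_fence) | (hr > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("outliers:", outliers)
```

Here only the value 60 falls outside the fences, so it would appear as a single dot above the upper whisker.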

In [None]:
# --- Bar Plot: Mean HR by Team with Error Bars ---
# sns.barplot automatically calculates the mean and shows 95% confidence intervals

plt.figure(figsize=(10, 6))
# errorbar=('ci', 95) replaces the deprecated ci=95 argument (Seaborn 0.12+)
sns.barplot(data=league_data, x='Team', y='HR', palette='Set2', errorbar=('ci', 95), edgecolor='black')

plt.title('Average Home Runs by Team (with 95% CI)', fontsize=14)
plt.xlabel('Team', fontsize=12)
plt.ylabel('Mean Home Runs', fontsize=12)

plt.show()

# Note: The thin black lines on each bar are confidence intervals —
# they show the uncertainty around the mean. Overlapping CIs suggest
# the teams may not be statistically different.
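
Seaborn estimates these intervals by bootstrapping: resampling the data with replacement many times and looking at how much the mean moves. The sketch below imitates that idea with NumPy on made-up values for a single team; it illustrates the technique and is not Seaborn's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical HR totals for one team (made-up numbers for illustration)
team_hr = np.array([12, 18, 25, 31, 22, 27, 15, 35, 20, 29])

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(team_hr, size=team_hr.size, replace=True).mean()
    for _ in range(2000)
])

# The middle 95% of those means is the bootstrap confidence interval
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {team_hr.mean():.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

A small sample gives a wide interval, which is why teams with few players in `league_data` tend to show longer error bars.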

In [None]:
# --- Count Plot: How Many Players Per Team? ---
# countplot is like a histogram for categorical data — it counts occurrences

plt.figure(figsize=(8, 5))
sns.countplot(data=league_data, x='Team', palette='Set2', edgecolor='black',
              order=league_data['Team'].value_counts().index)  # Order by count

plt.title('Number of Players per Team', fontsize=14)
plt.xlabel('Team', fontsize=12)
plt.ylabel('Count', fontsize=12)

plt.show()

# countplot is great for seeing class balance in your data
# Are the teams evenly distributed, or do some have many more players?

In [None]:
# --- Scatter Plot with Hue and Size ---
# Seaborn makes it easy to encode extra dimensions using color (hue) and dot size

plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=batting,
    x='HR', y='AVG',
    hue='Team',       # Color dots by team
    size='RBI',        # Size dots by RBI
    sizes=(50, 400),   # Min and max dot sizes
    alpha=0.8,
    edgecolor='black'
)

plt.title('HR vs AVG by Team (dot size = RBI)', fontsize=14)
plt.xlabel('Home Runs', fontsize=12)
plt.ylabel('Batting Average', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')  # Legend outside the plot
plt.tight_layout()

plt.show()

# Compare this to the Matplotlib scatter plot we made earlier —
# Seaborn automatically creates a legend with meaningful labels!

---

## 5. Advanced Seaborn

Now that we know the basics, let's explore Seaborn's more powerful features:

- **Heatmaps** — visualize correlation matrices and other 2D data
- **Pair plots** — see relationships between all pairs of variables at once
- **catplot / FacetGrid** — create small multiples (the same chart split by a category)
- **Customization** — palettes, contexts, and fine-tuning Seaborn's appearance

### Heatmap: Correlation Matrix

A **correlation matrix** shows how strongly each pair of numeric variables is related. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). A heatmap makes this matrix easy to read at a glance.

In [None]:
# --- Heatmap: Correlation Matrix of Batting Stats ---

# Select numeric columns for correlation
stats_cols = ['HR', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS']
corr_matrix = batting[stats_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(
    corr_matrix,
    annot=True,           # Show correlation values in each cell
    fmt='.2f',            # Format to 2 decimal places
    cmap='RdYlBu_r',     # Red = high positive, Blue = high negative
    center=0,             # Center the colormap at 0
    square=True,          # Make cells square
    linewidths=1,         # Add grid lines
    vmin=-1, vmax=1       # Fix the color scale
)

plt.title('Correlation Matrix of Batting Statistics', fontsize=14)
plt.tight_layout()

plt.show()

# Key insights from this heatmap:
# - HR and SLG are strongly correlated (power hitters slug more)
# - OBP and SLG combine to form OPS, so both correlate highly with it
# - AVG and OBP are related (getting hits helps you get on base)
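
Each cell in the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a worked example, the sketch below recomputes the HR vs SLG cell by hand from the sample values and compares it with Pandas.

```python
import numpy as np
import pandas as pd

# HR and SLG columns from the sample batting table
df = pd.DataFrame({
    'HR':  [58, 54, 39, 41, 22, 33, 29, 28, 30, 21],
    'SLG': [.701, .646, .579, .569, .461, .530, .490, .521, .540, .440],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from each column's mean
dx = df['HR'] - df['HR'].mean()
dy = df['SLG'] - df['SLG'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = df['HR'].corr(df['SLG'])   # Pearson is the default method
print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```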

### Pair Plot: All Pairwise Relationships at Once

A **pair plot** creates a grid of scatter plots for every pair of numeric variables, with each variable's own distribution on the diagonal (a histogram by default, or a KDE when `hue` is set). It's one of the fastest ways to explore relationships in a dataset. Use `hue` to color the points by a categorical variable.

In [None]:
# --- Pair Plot: Explore All Relationships at Once ---

# Select a subset of columns (too many makes the plot unreadable)
pair_cols = ['HR', 'AVG', 'RBI', 'OPS', 'Team']

sns.pairplot(
    batting[pair_cols],
    hue='Team',            # Color points by team
    diag_kind='kde',       # Use KDE on the diagonal instead of histograms
    plot_kws={'alpha': 0.7, 'edgecolor': 'black', 's': 80},
    height=2.5
)

plt.suptitle('Pair Plot: Key Batting Statistics by Team', y=1.02, fontsize=14)

plt.show()

# Pair plots are incredibly useful for initial data exploration!
# Each off-diagonal cell shows a scatter plot between two variables,
# and the diagonal shows the distribution of each variable.

### catplot and FacetGrid: Small Multiples

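**Small multiples** (also called facets) show the same type of chart repeated for each category. This makes it easy to compare patterns across groups. Seaborn's figure-level functions, such as `catplot()` for categorical plots and `displot()` for distributions, are built on `FacetGrid` and make this straightforward.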

In [None]:
# --- displot: Small Multiples Made Easy ---
# Figure-level functions create a separate panel for each value of a categorical variable.
# Histograms are a distribution plot, so they come from displot (catplot has no 'hist' kind).

# Using our league_data with 100 players across 6 teams
g = sns.displot(
    data=league_data,
    x='HR',
    col='Team',           # Create a separate column for each team
    col_wrap=3,           # Wrap after 3 columns (so we get a 2x3 grid)
    kind='hist',          # Type of plot in each panel
    height=3.5,
    aspect=1.2,
    color='steelblue'
)

g.figure.suptitle('HR Distribution by Team (Small Multiples)', y=1.02, fontsize=14)
g.set_axis_labels('Home Runs', 'Count')

plt.show()

# displot supports kind='hist', 'kde', and 'ecdf'.
# For categorical comparisons, try sns.catplot(data=league_data, x='Team', y='HR', kind='box');
# catplot supports kinds: 'strip', 'swarm', 'box', 'violin', 'boxen', 'bar', 'count', 'point'

### Customizing Seaborn: Palettes, Contexts, and More

Seaborn provides several ways to control the look of your plots globally:

- **`palette`** — change the color scheme (e.g., `'Set2'`, `'husl'`, `'coolwarm'`, `'Blues'`)
- **`context`** — scale elements for different output sizes (`'paper'`, `'notebook'`, `'talk'`, `'poster'`)
- **`font_scale`** — adjust all text sizes at once

In [None]:
# --- Comparing Seaborn Palettes ---

palettes_to_show = ['Set2', 'husl', 'coolwarm', 'viridis', 'pastel']

fig, axes = plt.subplots(1, len(palettes_to_show), figsize=(20, 4))

for ax, pal_name in zip(axes, palettes_to_show):
    sns.barplot(data=league_data, x='Team', y='HR', palette=pal_name, ax=ax, errorbar=None)
    ax.set_title(f'palette="{pal_name}"', fontsize=10)
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.tick_params(axis='x', rotation=45, labelsize=8)

fig.suptitle('Same Chart, Different Color Palettes', fontsize=14, fontweight='bold')
fig.tight_layout()
plt.show()

# --- Context: Scaling for Different Outputs ---
# 'paper' = smallest text/lines, 'poster' = largest
# This is useful when preparing plots for presentations vs reports

print("\nSeaborn contexts control text and line scaling:")
print("  'paper'    → Small text, thin lines (for journal figures)")
print("  'notebook' → Default (for Jupyter notebooks)")
print("  'talk'     → Larger text (for slide presentations)")
print("  'poster'   → Largest (for printed posters)")
print("\nUsage: sns.set_theme(context='talk', style='whitegrid', font_scale=1.2)")

# Reset to our default theme
sns.set_theme(style='whitegrid', font_scale=1.1)
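
To see what a context actually changes, `sns.plotting_context(name)` returns the settings it would apply without touching the global theme. A short sketch that prints two of those settings for each context:

```python
import seaborn as sns

context_names = ['paper', 'notebook', 'talk', 'poster']

# plotting_context(name) returns a dict of rc parameters for that context
for name in context_names:
    params = sns.plotting_context(name)
    print(f"{name:>8}: font.size = {params['font.size']}, "
          f"lines.linewidth = {params['lines.linewidth']}")

# Collect the font sizes to confirm they grow from 'paper' to 'poster'
font_sizes = [sns.plotting_context(name)['font.size'] for name in context_names]
```

The same function works as a context manager (`with sns.plotting_context('talk'):`) when only one figure should be scaled up.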

---

## 6. Putting It All Together

Let's combine everything we've learned to build a **complete 4-panel dashboard** that tells a story about our batting data. This is the kind of visualization you'd create for a presentation, report, or blog post.

We'll use:
- Matplotlib's OO interface for the layout (`fig, axes = plt.subplots(2, 2)`)
- Seaborn for polished statistical plots
- Proper titles, labels, and annotations throughout

In [None]:
# ========================================
# MINI PROJECT: Batting Statistics Dashboard
# ========================================

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# --- Top Left: HR Bar Chart ---
sorted_batting = batting.sort_values('HR', ascending=True)
axes[0, 0].barh(sorted_batting['Player'], sorted_batting['HR'],
                 color='steelblue', edgecolor='black')
axes[0, 0].set_title('Home Run Leaders', fontsize=13, fontweight='bold')
axes[0, 0].set_xlabel('Home Runs')

# --- Top Right: HR vs AVG Scatter ---
scatter = axes[0, 1].scatter(
    batting['HR'], batting['AVG'],
    s=batting['RBI'] * 3, c=batting['OPS'],
    cmap='YlOrRd', edgecolors='black', alpha=0.8
)
fig.colorbar(scatter, ax=axes[0, 1], label='OPS', shrink=0.8)
axes[0, 1].set_title('HR vs AVG (size=RBI, color=OPS)', fontsize=13, fontweight='bold')
axes[0, 1].set_xlabel('Home Runs')
axes[0, 1].set_ylabel('Batting Average')

# --- Bottom Left: Correlation Heatmap ---
stats_cols = ['HR', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS']
corr = batting[stats_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdYlBu_r', center=0,
            square=True, linewidths=0.5, ax=axes[1, 0], cbar_kws={'shrink': 0.8})
axes[1, 0].set_title('Stat Correlations', fontsize=13, fontweight='bold')

# --- Bottom Right: OPS Distribution ---
sns.histplot(batting['OPS'], bins=6, kde=True, color='coral',
             edgecolor='black', ax=axes[1, 1])
mean_ops = batting['OPS'].mean()
axes[1, 1].axvline(mean_ops, color='red', linestyle='--', linewidth=2,
                     label=f'Mean: {mean_ops:.3f}')
axes[1, 1].set_title('OPS Distribution', fontsize=13, fontweight='bold')
axes[1, 1].set_xlabel('OPS')
axes[1, 1].set_ylabel('Count')
axes[1, 1].legend()

# --- Overall Title and Layout ---
fig.suptitle('MLB Batting Statistics Dashboard',
             fontsize=16, fontweight='bold', y=1.02)
fig.tight_layout()

plt.show()

print("This dashboard combines bar charts, scatter plots, heatmaps, and histograms")
print("into a single cohesive figure — exactly the kind of output you'd put in a report!")

### When to Use What: Decision Guide

| Situation | Best Tool | Why |
|---|---|---|
| Quick EDA, one-line plots | **Pandas `.plot()`** | Fastest way to see data, minimal code |
| Full control over every element | **Matplotlib (OO)** | Most flexible, handles any custom layout |
| Beautiful statistical plots | **Seaborn** | Built-in stats, great defaults, less code |
| Multi-panel dashboards | **Matplotlib subplots + Seaborn** | Matplotlib for layout, Seaborn for individual plots |
| Correlation matrices | **Seaborn `heatmap`** | Purpose-built with `annot=True` |
| Explore all variable pairs | **Seaborn `pairplot`** | One line, instant overview |
| Compare distributions by group | **Seaborn `boxplot`/`violinplot`** | Statistical comparison built in |
| Publication-quality figures | **Matplotlib + `savefig()`** | Precise control + high-DPI export |

**Pro tip:** In practice, most data scientists use all three together! Use Pandas for quick exploration, Seaborn for statistical plots, and Matplotlib when you need fine-grained control.

### Key Takeaways

1. **Matplotlib** is the foundation -- learn `plt.subplots()` and the OO interface for full control
2. **Seaborn** makes statistical plots beautiful with minimal code -- use it for distributions, comparisons, and correlations
3. **Pandas `.plot()`** is your go-to for quick, one-line EDA plots
4. **Always label your charts** -- title, axis labels, and legends are non-negotiable
5. **Use `plt.tight_layout()`** to prevent overlapping labels
6. **Save with `plt.savefig()`** before `plt.show()` for publication-quality exports
7. **Correlation heatmaps** and **pair plots** are the fastest way to explore relationships in a dataset
8. **Start simple, then customize** -- get the basic chart right first, then add colors, annotations, and styling

---

**Congratulations!** You now have a solid foundation in data visualization with Python. You can create everything from simple line charts to multi-panel dashboards using Matplotlib, Seaborn, and Pandas.

**Next up:** [Extracting Baseball Data](../how-to-get-baseball-data/extracting_baseball_data.ipynb) — Learn how to pull real baseball data from online sources so you can apply these visualization skills to actual MLB statistics!