In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import altair as alt
import seaborn as sns
from vega_datasets import data
from IPython.display import display
from plotnine import ggplot, aes, geom_point, geom_smooth, facet_wrap, theme_light

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams["figure.dpi"] = 130
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
alt.renderers.enable('default')

# --- Utility Functions ---
def note(msg, **kwargs):
    """Prints a formatted message with a notebook icon."""
    formatted_msg = textwrap.fill(msg, width=100, subsequent_indent='   ')
    print(f"\n📝 {formatted_msg}", **kwargs)


def sec(title):
    """Prints a formatted section title for code blocks."""
    print(f"\n{100*'='}\n| {title.upper()} |\n{100*'='}")

note("Environment initialized.")

# Part 1: Foundations
## Chapter 1.15: Data Visualization: A Grammar for Argument and Exploration

### Introduction: From Data to Insight and Argument

Data visualization is a critical tool for exploration, analysis, and, most importantly, **argumentation**. A well-designed plot is a form of non-verbal, quantitative reasoning. It can reveal patterns that tables of summary statistics hide, highlight outliers that might otherwise be missed, and communicate complex findings with an immediacy that words alone cannot achieve.

This notebook introduces the core theories and dominant programming paradigms for visualization in Python. We will explore:
- The foundational principles of effective visualization.
- The **Grammar of Graphics**, a powerful theory for describing plots.
- The imperative paradigm of **Matplotlib**, where you build a plot piece by piece.
- The declarative paradigm of high-level libraries like **Seaborn** and **Altair**, where you specify *what* you want to visualize, not *how* to draw it.

### 1. Why Visualize? Anscombe's Quartet

To understand *why* visualization is essential, consider Anscombe's Quartet. Constructed by statistician Francis Anscombe in 1973, it comprises four datasets that have nearly identical simple descriptive statistics yet are vastly different when graphed. It is the classic demonstration of the limitations of summary statistics and the power of graphical exploration.
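
As a rough, library-free check on that claim, the sketch below hard-codes the quartet's published values and computes the Pearson correlation and least-squares line for each dataset by hand. The helper name `summarize` is an illustrative assumption, not part of any library.

```python
from math import sqrt

# Anscombe's (1973) published values; datasets I-III share the same x.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return (Pearson r, OLS slope, OLS intercept) for paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)   # sum of squares of x
    syy = sum((yi - my) ** 2 for yi in y)   # sum of squares of y
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # cross products
    slope = sxy / sxx
    return sxy / sqrt(sxx * syy), slope, my - slope * mx


results = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, (r, slope, intercept) in results.items():
    print(f"{name:>3}: r = {r:.2f}, fit: y = {intercept:.2f} + {slope:.2f}x")
```

All four rows should agree to two decimal places (r ≈ 0.82, y ≈ 3.00 + 0.50x), so the correlation and the fitted line are no more help than the means and standard deviations in telling the datasets apart.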

In [None]:
sec("Anscombe's Quartet: Identical Statistics, Different Stories")
anscombe = sns.load_dataset("anscombe")

note("The summary statistics are nearly identical for all four datasets:")
print(anscombe.groupby('dataset')[['x', 'y']].agg(['mean', 'std']))

note("Visualization, however, immediately reveals the dramatic structural differences:")
g = sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=anscombe,
               col_wrap=2, ci=None, palette="muted", height=4,
               scatter_kws={"s": 50, "alpha": 1})
g.set_titles("Dataset: {col_name}")
plt.show()

### 2. The Grammar of Graphics: A Theory of Visualization
Leland Wilkinson's *The Grammar of Graphics* provides a formal theory for declarative visualization. The core idea is that any plot is a combination of independent components. This provides a powerful framework for thinking about and constructing graphics.

1.  **Data**: The dataset being plotted.
2.  **Aesthetics (Mappings)**: A mapping from data columns to visual properties. This is the heart of the grammar. Example: `map GDP to the x-axis`, `map Life Expectancy to the y-axis`, `map Continent to color`.
3.  **Geometric Objects (Geoms)**: The shapes used to represent data (e.g., `points`, `lines`, `bars`).
4.  **Statistical Transformations (Stats)**: Optional transformations of the data before plotting (e.g., `binning` for a histogram, `smoothing` for a regression line).
5.  **Scales**: How data values are translated to visual properties (e.g., a linear or log scale for an axis).
6.  **Coordinates**: The coordinate system used (e.g., Cartesian, polar).
7.  **Faceting**: How to create subplots based on a categorical variable (e.g., creating a separate plot for each country).

Libraries like `plotnine` (a Python port of R's famous ggplot2) and `altair` are direct implementations of this grammar.
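
One way to see the components as separable parts is to write a chart as plain data. The sketch below builds a Vega-Lite-style specification (the JSON format that Altair charts compile to) as an ordinary Python dict. The three data rows and the field names `gdp`, `life`, and `region` are made-up placeholders, and the comments mark which grammar component each key corresponds to.

```python
import json

# Hypothetical toy rows; field names and values are placeholders for illustration.
rows = [
    {"gdp": 1200, "life": 58.0, "region": "A"},
    {"gdp": 9800, "life": 71.5, "region": "B"},
    {"gdp": 41000, "life": 80.2, "region": "B"},
]

spec = {
    "data": {"values": rows},                        # 1. Data
    "mark": "point",                                 # 3. Geometric object
    "encoding": {                                    # 2. Aesthetic mappings
        "x": {
            "field": "gdp",
            "type": "quantitative",
            "scale": {"type": "log"},                # 5. Scale
        },
        "y": {"field": "life", "type": "quantitative"},
        "color": {"field": "region", "type": "nominal"},
        "column": {"field": "region", "type": "nominal"},  # 7. Faceting
    },
}
# 4. Stats would appear as, e.g., "bin" or "aggregate" entries inside an encoding.
# 6. Coordinates are left implicit in this sketch (Cartesian by default).

print(json.dumps(spec, indent=2))
```

Swapping `"point"` for `"line"`, or the log scale for a linear one, changes a single entry and leaves the rest of the description untouched, which is the independence the grammar is built around.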

![Grammar of Graphics Components](../images/png/1.16-grammar-of-graphics.png)

### 3. The Python Visualization Landscape

The Python ecosystem offers several excellent libraries, each with different strengths. Choosing the right one depends on the task.

- **Matplotlib (Imperative):** The foundational library. Use it when you need **full, fine-grained control** over every element of a plot. It is the best choice for creating complex, multi-panel, or heavily annotated static figures for publication. Its syntax is often verbose.

- **Seaborn (High-Level Declarative):** Built on Matplotlib, Seaborn provides a high-level interface for creating beautiful and informative **statistical graphics** (e.g., violin plots, heatmaps, regression plots) with very little code. It's excellent for exploration and for plots with a statistical focus.

- **Altair (Strictly Declarative):** A declarative API built on Vega-Lite, a grammar of interactive graphics. Its key strength is producing **interactive charts** (zooming, panning, filtering) from a concise specification that compiles to JSON. It's ideal for web-based dashboards and exploratory analysis where interactivity is key.

- **Plotnine (Strictly Declarative):** A Python implementation of R's `ggplot2`, offering a powerful and intuitive API based on the Grammar of Graphics.

#### 3.1 Matplotlib: The Imperative Workhorse
The key to using Matplotlib effectively is understanding its object hierarchy: a `Figure` is the top-level container, which holds one or more `Axes` (the individual plots). The standard workflow is to create a `Figure` and `Axes` with `fig, ax = plt.subplots()` and then call methods directly on the `ax` object to build the plot piece by piece.
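
A minimal sketch of that workflow, using a small made-up series rather than the chapter's data (the values and labels are placeholders):

```python
import matplotlib.pyplot as plt

# Placeholder data for illustration only.
years = [2000, 2005, 2010, 2015, 2020]
values = [1.0, 1.4, 1.3, 1.9, 2.2]

# One Figure (the container) holding one Axes (the plot itself).
fig, ax = plt.subplots(figsize=(5, 3))

# Each call on `ax` adds or configures one piece of the plot.
ax.plot(years, values, marker="o", label="series")
ax.set_xlabel("Year")
ax.set_ylabel("Value")
ax.set_title("Built piece by piece")
ax.legend()

fig.tight_layout()
plt.show()
```

Working through `ax` rather than the implicit `plt.*` state machine keeps it unambiguous which plot each call modifies, which matters once a figure holds several `Axes`.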

In [None]:
sec("Publication-Quality Plot with Matplotlib")

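# NOTE: the gapminder cells in this chapter assume `continent` and `gdpPercap`
# columns and observations through 2007. The sample returned by vega_datasets
# may not include them; check `gapminder.columns` and `gapminder.year.unique()`
# and load a gapminder source that does if needed.
gapminder = data.gapminder()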
df_eu = gapminder[gapminder.continent == 'Europe']
countries = ['Germany', 'France', 'United Kingdom', 'Italy', 'Spain', 'Poland']
df_subset = df_eu[df_eu.country.isin(countries)]

# Create a 2x3 grid of subplots. sharey=True makes the y-axis consistent.
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharey=True)

# axes.flat provides a simple 1D iterator over the 2D grid of axes
for ax, country in zip(axes.flat, countries):
    country_data = df_subset[df_subset.country == country]
    # Plot the data
    ax.plot(country_data['year'], country_data['gdpPercap'], color='royalblue', lw=2)
    # Add a horizontal line for the mean
    mean_gdp = country_data['gdpPercap'].mean()
    ax.axhline(mean_gdp, color='firebrick', linestyle='--', lw=1.5, label=f'Mean: ${mean_gdp:,.0f}')
    
    ax.set_title(country, fontsize=12, weight='bold')
    ax.grid(True, which='major', linestyle='--', linewidth=0.5)
    ax.tick_params(axis='x', rotation=45)
    ax.legend(fontsize=8)

# Use figure-level methods to set shared labels and a title
fig.suptitle('GDP Per Capita for Major European Economies (1952-2007)', fontsize=16, weight='bold')
fig.supxlabel('Year', fontsize=12)
fig.supylabel('GDP Per Capita (USD)', fontsize=12)

# Adjust layout to prevent titles/labels from overlapping
fig.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

#### 3.2 Seaborn: High-Level Statistical Graphics
Seaborn excels at making complex statistical plots simple. It takes DataFrames directly and handles much of the complexity of mapping data to visual properties automatically.

**Economic Application:** A correlation heatmap is a vital tool for understanding the relationships between variables in a dataset before building a regression model. It visually highlights potential multicollinearity issues.

In [None]:
sec("Correlation Heatmap with Seaborn")
cars = data.cars()
cars_num = cars.select_dtypes(include=np.number)
corr_matrix = cars_num.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(
    corr_matrix, 
    annot=True,     # Annotate cells with the correlation values
    cmap='vlag',    # Use a diverging colormap (blue-white-red)
    vmin=-1, vmax=1,  # Fix the color range so white corresponds to zero correlation
    fmt='.2f',      # Format annotations to 2 decimal places
    linewidths=.5
)
plt.title('Correlation Matrix of Car Features', fontsize=14, weight='bold')
plt.show()

#### 3.3 Altair: Declarative Interactive Visualization
Altair's syntax directly mirrors the Grammar of Graphics. You define a `Chart` object, specify the `data`, and then `mark` it with a geometric object. The core of the library is the `.encode()` method, where you define the aesthetic mappings.

The real power of Altair is its ability to create interactive charts with just a few extra lines of code.

In [None]:
sec("Interactive Scatter Plot with Altair")
gapminder = data.gapminder()
df_2007 = gapminder[gapminder.year == 2007]

# Create an interval selection so points can be brushed (click and drag)
interval = alt.selection_interval()

# Base chart definition
base = alt.Chart(df_2007).mark_point().encode(
    x=alt.X('gdpPercap', scale=alt.Scale(type="log"), title='GDP Per Capita (log scale)'),
    y=alt.Y('life_expect', title='Life Expectancy'),
    tooltip=['country', 'gdpPercap', 'life_expect'],
    # Color changes based on the interactive selection
    color=alt.condition(interval, 'continent', alt.value('lightgray'))
).properties(
    title='Life Expectancy vs. GDP Per Capita (2007)',
    width=600,
    height=400
).add_params(
    interval
)

# Display the chart (in a notebook, this will be interactive)
display(base)

#### 3.4 Plotnine: The Grammar of Graphics in Python
Plotnine brings the power and elegance of R's `ggplot2` to Python, allowing you to build complex plots layer by layer, following the Grammar of Graphics.

In [None]:
sec("Faceted Scatter Plot with Plotnine")
cars = data.cars()
p = (
    ggplot(cars, aes(x='Horsepower', y='Miles_per_Gallon', color='Origin'))
    + geom_point(alpha=0.6)
    + geom_smooth(method='lm', se=False)
    + facet_wrap('~ Origin')
    + theme_light()
)
display(p)

### 4. Choosing the Right Plot for Your Data

Knowing which plot to use for which kind of data is a crucial skill.

| Data Type / Goal | Recommended Plots |
| :--- | :--- |
| **Distribution of one variable** | Histogram (`sns.histplot`), Density Plot (`sns.kdeplot`), Box Plot (`sns.boxplot`) |
| **Relationship between two continuous variables** | Scatter Plot (`sns.scatterplot`, `sns.regplot`) |
| **Relationship between a continuous and a categorical variable** | Box Plot (`sns.boxplot`), Violin Plot (`sns.violinplot`), Bar Plot (`sns.barplot`) |
| **Relationship between two categorical variables** | Heatmap of counts (`sns.heatmap` on a `pd.crosstab`), Stacked Bar Chart |
| **Time Series Data** | Line Plot (`plt.plot`, `sns.lineplot`) |
| **Matrix / High-Dimensional Data** | Heatmap (`sns.heatmap`), Cluster Map (`sns.clustermap`) |

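For the two-categorical row in the table, a hedged sketch of the `pd.crosstab` plus `sns.heatmap` combination might look like the following. The tiny survey frame and its column names (`region`, `sector`) are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented toy data: one row per respondent.
survey = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South", "East"],
    "sector": ["Services", "Industry", "Services", "Services", "Industry", "Services"],
})

# Cross-tabulate the two categorical columns into a table of counts.
counts = pd.crosstab(survey["region"], survey["sector"])

# Draw the counts as a heatmap; fmt="d" prints the integer count in each cell.
fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(counts, annot=True, fmt="d", cmap="Blues", cbar=False, ax=ax)
ax.set_title("Respondents by region and sector")
plt.show()
```

A sequential colormap fits here because counts run from zero upward. A diverging one is better reserved for quantities with a meaningful midpoint, such as correlations.
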
In [None]:
sec("Violin Plot: Continuous vs. Categorical")

note("A violin plot combines a box plot with a kernel density estimate, showing the full distribution of life expectancy for each continent.")
plt.figure(figsize=(10, 6))
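# Assign x to hue as well: seaborn >= 0.13 deprecates `palette` without `hue`.
sns.violinplot(data=gapminder, x='continent', y='life_expect', hue='continent',
               palette='viridis', legend=False)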
plt.title('Distribution of Life Expectancy by Continent', fontsize=14, weight='bold')
plt.xlabel('Continent')
plt.ylabel('Life Expectancy (Years)')
plt.show()

### 5. Exercises

1.  **Theory and Critique:** Find a data visualization in a recent news article or economic report. Analyze it through the lens of the principles discussed. What is its primary message? Does it use encodings effectively according to the hierarchy of perceptual tasks? Deconstruct it using the Grammar of Graphics (what is the data, what are the aesthetics, what are the geoms?).

2.  **Publication-Quality Matplotlib:** Using the `gapminder` dataset, create a 2x2 `matplotlib` subplot grid. Each subplot should show the relationship between `gdpPercap` and `life_expect` for a different continent ('Asia', 'Europe', 'Africa', 'Americas'). Customize the plots with titles, labels, and colors to make them publication-ready.

3.  **Statistical Plot with Seaborn:** Using the `tips` dataset (`sns.load_dataset('tips')`), create a single figure that shows the relationship between the total bill and the tip amount as a scatter plot with a regression line, with the points colored by whether the customer was a smoker. Note that `sns.regplot` has no `hue` parameter, so use `sns.lmplot` or call `sns.regplot` once per group on a shared `Axes`. What does this plot tell you about tipping behavior?

4.  **Interactive Exploration with Altair:** Using the `cars` dataset (`from vega_datasets import data; cars = data.cars()`), create an interactive Altair chart.
    a. The chart should be a scatter plot of `Horsepower` (x) vs. `Miles_per_Gallon` (y).
    b. Add an interactive legend for the `Origin` column, such that clicking on an origin in the legend filters the points shown on the plot.
    c. Add tooltips to show the `Name` of the car on hover.

5.  **Choosing the Right Plot:** You have a dataset of monthly unemployment rates for all 50 U.S. states over the last 20 years. You want to create a visualization that shows both the overall trend in unemployment for the entire country and highlights which states have unusually high or low unemployment in any given month. Describe what kind of plot or combination of plots you would create and which library you would use to do it. Justify your choices.