# 00 Descriptive Statistics

Measures of central tendency, spread, shape, and how to summarize data before modeling.

## Table of Contents
- [Measures of central tendency](#measures-of-central-tendency)
- [Measures of spread](#measures-of-spread)
- [Skewness and kurtosis](#skewness-and-kurtosis)
- [Data types and summary tables](#data-types-and-summary-tables)
- [Visualizing distributions](#visualizing-distributions)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)

## Why This Notebook Matters
Before fitting any model, you need to understand the shape and scale of your data.
Descriptive statistics are the first line of defense against modeling mistakes:
they reveal outliers, skewness, and data quality issues that can silently corrupt downstream results.

## Prerequisites (Quick Self-Check)
- Comfort with basic Python + pandas (reading CSVs, making plots).
- No statistics background required — this is where we start.

## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can compute and interpret mean, median, variance, skewness, and kurtosis.
- You can create informative visualizations of univariate distributions.
- You can articulate when the mean is misleading and why.

## Common Pitfalls
- Confusing sample statistics with population parameters.
- Ignoring skewness and assuming everything is symmetric.
- Using `.describe()` without actually reading the output.
- Forgetting to check for missing values before computing statistics.

## Quick Fixes (When You Get Stuck)
- If a column has unexpected NaNs, check `df.isna().sum()`.
- If a histogram looks wrong, check your bin count and whether you need to drop outliers for visualization.
- If you see `ModuleNotFoundError`, re-run the bootstrap cell.

## Matching Guide
- `docs/guides/00a_statistics_primer/00_descriptive_statistics.md`


## How To Use This Notebook
- Work section-by-section; don't skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2–4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/00a_statistics_primer/00_descriptive_statistics.md`) for the math, assumptions, and deeper context.


<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.


In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT


## Load the sample data

We will use `macro_quarterly_sample.csv` throughout this notebook.
This dataset contains quarterly US macroeconomic indicators including GDP growth,
unemployment rate, the federal funds rate, CPI, and more.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv(SAMPLE_DIR / 'macro_quarterly_sample.csv', index_col=0, parse_dates=True)
print('Shape:', df.shape)
print('Columns:', list(df.columns))
df.head()


<a id="measures-of-central-tendency"></a>
## Measures of Central Tendency

### Goal
Compute the mean, median, and mode of economic time series and understand when each is appropriate.

### Why this matters in economics
The mean GDP growth rate tells you the "average" pace of the economy, but it can be
pulled by extreme quarters (deep recessions or post-recession bounces). The median is
more robust to these outliers. Income data is the classic case: mean household
income exceeds the median because the distribution is right-skewed
(a few very high earners pull the mean up). Knowing which measure to report, and
why, is a fundamental skill.

**Key definitions:**
- **Mean**: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ — sensitive to outliers.
- **Median**: the middle value when sorted — robust to outliers.
- **Mode**: the most frequent value — most useful for discrete/categorical data.
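
A minimal sketch of the outlier sensitivity described above, using Python's standard-library `statistics` module on made-up growth figures (the numbers and variable names are illustrative assumptions, not values from the sample dataset):

```python
import statistics

# Hypothetical quarterly growth rates (%), invented for illustration.
calm_quarters = [0.4, 0.5, 0.6, 0.6, 0.7, 0.8]
with_shock = calm_quarters + [-8.0]  # add one deep-recession quarter

for label, values in [("calm", calm_quarters), ("with shock", with_shock)]:
    print(
        f"{label:>10}: mean={statistics.mean(values):.3f}  "
        f"median={statistics.median(values):.3f}  "
        f"mode={statistics.mode(values):.1f}"
    )

# How far each statistic moves when the shock quarter is added.
mean_shift = statistics.mean(with_shock) - statistics.mean(calm_quarters)
median_shift = statistics.median(with_shock) - statistics.median(calm_quarters)
print(f"mean shift={mean_shift:.3f}, median shift={median_shift:.3f}")
```

One extreme quarter drags the mean below zero while the median and mode stay put, which is why the median is the safer summary when a few observations dominate.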

### Your Turn

In [None]:
# TODO: Compute the mean, median, and mode of GDP growth (quarter-over-quarter)
# Hint: df['gdp_growth_qoq'].mean(), .median(), .mode()
# Note: .mode() returns a Series (there can be ties); use .iloc[0] for the first mode.

gdp = df['gdp_growth_qoq'].dropna()

gdp_mean = ...
gdp_median = ...
gdp_mode = ...

print(f'Mean GDP growth (QoQ):   {gdp_mean}')
print(f'Median GDP growth (QoQ): {gdp_median}')
print(f'Mode GDP growth (QoQ):   {gdp_mode}')


In [None]:
# TODO: Now compute mean and median for the unemployment rate (UNRATE).
# Compare: is the mean above or below the median? What does that tell you
# about the shape of the distribution?

unrate = df['UNRATE'].dropna()

unrate_mean = ...
unrate_median = ...

print(f'Mean unemployment:   {unrate_mean}')
print(f'Median unemployment: {unrate_median}')
print(f'Difference (mean - median): {unrate_mean - unrate_median}')


**Interpretation prompt** (write 2–4 sentences below):
- Is the mean GDP growth above or below the median? What might cause that?
- For unemployment, what does the difference between mean and median suggest about its distribution?
- When would you report the median instead of the mean to a policymaker?


<a id="measures-of-spread"></a>
## Measures of Spread

### Goal
Compute variance, standard deviation, range, and IQR for economic indicators.
Visualize spread using box plots.

### Why this matters in economics
Two economies can have the same average GDP growth but very different volatility.
High variance means more uncertainty for businesses, investors, and policymakers.
The standard deviation of GDP growth is a simple measure of macroeconomic stability.
The IQR (interquartile range) is robust to outliers and useful for detecting unusual quarters.

**Key definitions:**
- **Variance** (sample): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
- **Standard deviation**: $s = \sqrt{s^2}$
- **Range**: $\max(x) - \min(x)$
- **IQR**: $Q_3 - Q_1$ (75th percentile minus 25th percentile)
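
The definitions above can be checked by hand on a small made-up sample. This sketch uses only the standard library; the data and names are illustrative assumptions. It also shows the sample-versus-population distinction: `statistics.variance` divides by $n-1$ (as pandas `.var()` does by default), while `statistics.pvariance` divides by $n$ (as NumPy's `np.var` does by default).

```python
import statistics

# Hypothetical growth rates (%), invented for illustration.
x = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 3.0]

sample_var = statistics.variance(x)       # divides by n - 1
population_var = statistics.pvariance(x)  # divides by n
sample_std = statistics.stdev(x)          # square root of sample_var
value_range = max(x) - min(x)

# method="inclusive" interpolates linearly between data points,
# the same convention as pandas' default .quantile().
q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

# Common box-plot convention: flag points more than 1.5 * IQR outside the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = [v for v in x if v < lower_fence or v > upper_fence]

print(f"sample var={sample_var:.4f}, population var={population_var:.4f}")
print(f"std={sample_std:.4f}, range={value_range:.2f}, IQR={iqr:.3f}")
print(f"flagged as potential outliers: {flagged}")
```

The single 3.0 reading accounts for most of the range and the variance, while the IQR depends only on the middle half of the data. That is what makes the IQR useful for flagging unusual quarters.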

### Your Turn

In [None]:
# TODO: Compute variance, std, range, and IQR for gdp_growth_qoq

gdp = df['gdp_growth_qoq'].dropna()

gdp_var = ...
gdp_std = ...
gdp_range = ...
gdp_iqr = ...

print(f'Variance:  {gdp_var}')
print(f'Std Dev:   {gdp_std}')
print(f'Range:     {gdp_range}')
print(f'IQR:       {gdp_iqr}')


In [None]:
# TODO: Compute the same spread statistics for multiple columns and compare.
# Which indicator is most volatile? Which is most stable?
# Hint: try UNRATE, FEDFUNDS, gdp_growth_qoq, gdp_growth_yoy

cols_to_compare = ['gdp_growth_qoq', 'gdp_growth_yoy', 'UNRATE', 'FEDFUNDS']

spread_summary = pd.DataFrame({
    'mean': ...,
    'std': ...,
    'iqr': ...,
})
spread_summary


In [None]:
# TODO: Create box plots comparing the distributions of multiple economic indicators.
# Hint: loop over cols_to_compare and draw each column on its own axis, e.g.
#       df[col].dropna().plot(kind='box', ax=axes[i]); axes[i].set_title(col)

fig, axes = plt.subplots(1, len(cols_to_compare), figsize=(14, 4))

...

plt.tight_layout()
plt.show()


**Interpretation prompt** (write 2–4 sentences below):
- Which economic indicator has the highest standard deviation? Is that surprising?
- Why might the range be a misleading measure of spread for GDP growth?
- Points drawn beyond the box plot whiskers indicate potential outliers. What economic events might produce outlier quarters?


<a id="skewness-and-kurtosis"></a>
## Skewness and Kurtosis

### Goal
Compute skewness and kurtosis for economic time series and interpret what they mean
about the shape of the distribution.

### Why this matters in economics
Financial returns and GDP growth often depart from a normal distribution. They typically exhibit:
- **Negative skewness**: large negative returns (crashes) are more common than
  equivalently large positive returns.
- **Excess kurtosis** ("fat tails"): extreme events happen more often than a normal
  distribution predicts. This is critical for risk management: if you assume normality,
  you underestimate the probability of crashes.
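
To make the normality baseline concrete, here is a small sketch using the standard library's `statistics.NormalDist`. It computes how rarely a normal distribution expects a quarter three standard deviations below the mean. The three-sigma threshold is an illustrative choice, not something taken from the dataset.

```python
from statistics import NormalDist

# Probability that a normal draw falls 3 or more standard deviations below its mean.
p_three_sigma = NormalDist().cdf(-3.0)

# Expected spacing between such quarters, if quarters were independent normal draws.
quarters_between = 1 / p_three_sigma
years_between = quarters_between / 4

print(f"P(Z <= -3) = {p_three_sigma:.5f}")
print(f"about one quarter in {quarters_between:.0f} (~{years_between:.0f} years)")
```

If a sample contains such quarters noticeably more often than this, its tails are heavier than the normal model allows.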

**Key definitions:**
- **Skewness**: measures asymmetry. Positive = right tail longer; negative = left tail longer.
  A symmetric distribution (like the normal) has skewness = 0.
- **Kurtosis**: measures tail heaviness. The normal distribution has kurtosis = 3.
  **Excess kurtosis** = kurtosis - 3. Positive excess kurtosis means heavier tails than normal.
  `scipy.stats.kurtosis()` returns excess kurtosis by default.
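
As a sketch of what these two numbers measure, the functions below compute skewness and excess kurtosis from central moments in plain Python. The function names are hypothetical; the formulas are the moment-based (biased) versions, which is what `scipy.stats.skew` and `scipy.stats.kurtosis` return with their default arguments.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean)**k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Third central moment scaled by the variance to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (the normal's kurtosis)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Made-up samples for illustration.
symmetric = [-2, -1, 0, 1, 2]
left_tailed = [-8, 0, 1, 1, 2]  # one large negative value

for label, sample in [("symmetric", symmetric), ("left-tailed", left_tailed)]:
    print(f"{label:>11}: skew={skewness(sample):+.3f}, "
          f"excess kurtosis={excess_kurtosis(sample):+.3f}")
```

The symmetric sample has zero skewness, while the single large negative value makes the second sample left-skewed and raises its kurtosis.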

### Your Turn

In [None]:
from scipy import stats

# TODO: Compute skewness and (excess) kurtosis for several columns.
# Hint: stats.skew(arr, nan_policy='omit'), stats.kurtosis(arr, nan_policy='omit')

cols_shape = ['gdp_growth_qoq', 'gdp_growth_yoy', 'UNRATE', 'FEDFUNDS']

shape_summary = pd.DataFrame({
    'skewness': ...,
    'excess_kurtosis': ...,
}, index=cols_shape)

shape_summary


In [None]:
# TODO: Create a histogram of gdp_growth_qoq overlaid with a normal distribution
# having the same mean and std. This visual comparison shows whether the data
# is well-approximated by a normal.

gdp = df['gdp_growth_qoq'].dropna()

fig, ax = plt.subplots(figsize=(8, 4))

# Histogram of actual data
...

# Overlay a normal PDF with the same mean/std
x_grid = np.linspace(gdp.min(), gdp.max(), 200)
...

ax.set_title('GDP Growth (QoQ) vs Normal Distribution')
ax.set_xlabel('GDP Growth (%)')
ax.legend()
plt.show()


**Interpretation prompt** (write 2–4 sentences below):
- Which columns have the most skewness? In which direction?
- Which columns have the highest excess kurtosis? What does that mean practically?
- If you assumed GDP growth was normally distributed, would you overestimate or underestimate the chance of a very bad quarter? Why?


<a id="data-types-and-summary-tables"></a>
## Data Types and Summary Tables

### Goal
Understand the difference between continuous, discrete, and categorical data.
Use `df.describe()` and `value_counts()` to summarize different types.

### Why this matters in economics
Economic datasets mix different variable types:
- **Continuous**: GDP growth, inflation rate, unemployment rate — can take any value in a range.
- **Discrete**: number of rate hikes per year, recession indicator (0/1).
- **Categorical**: industry sector, country code, policy regime label.

Using the wrong summary for the wrong type is a common mistake. Means and standard
deviations are meaningful for continuous data; for a binary recession indicator,
`value_counts()` and proportions are more informative.
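
A short sketch of why counts and proportions suit a binary indicator, using a made-up 0/1 series and the standard library (in pandas the equivalents are `value_counts()` and `.mean()`):

```python
from collections import Counter
from statistics import mean

# Hypothetical recession flags for ten quarters (1 = recession), invented for illustration.
recession_flags = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]

# Tally how many quarters fall in each category.
counts = Counter(recession_flags)

# The mean of a 0/1 variable is the proportion of 1s.
share_in_recession = mean(recession_flags)

print(dict(counts))
print(f"share of quarters in recession: {share_in_recession:.0%}")
```

The proportion is the quantity worth reporting here; quartiles of a 0/1 column are rarely informative.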

### Your Turn

In [None]:
# TODO: Use df.describe() to get summary statistics of all numeric columns.
# Read the output carefully: what do the min, max, and quartiles tell you?

...


In [None]:
# TODO: Use value_counts() on the recession column.
# What fraction of quarters are recessions?

recession_counts = ...
print(recession_counts)
print(f'\nFraction in recession: {df["recession"].mean():.2%} of {recession_counts.sum()} quarters')


In [None]:
# TODO: Classify each variable in the dataset as continuous, discrete, or categorical.
# Fill in the dictionary below.
# Hint: recession and target_recession_next_q are binary (discrete);
#        most others are continuous.

variable_types = {
    'UNRATE':                  ...,  # 'continuous', 'discrete', or 'categorical'
    'FEDFUNDS':                ...,
    'CPIAUCSL':                ...,
    'gdp_growth_qoq':          ...,
    'recession':               ...,
    'target_recession_next_q': ...,
}

for var, vtype in variable_types.items():
    print(f'{var:30s} -> {vtype}')


In [None]:
# TODO: Check for missing values across the dataset.
# Hint: df.isna().sum().sort_values(ascending=False)
# Why might some lag columns have missing values?

...


**Interpretation prompt** (write 2–4 sentences below):
- What did `df.describe()` reveal that you did not expect?
- Why is calling `.mean()` on the `recession` column actually meaningful (even though it is binary)?
- Which columns have missing values and why?


<a id="visualizing-distributions"></a>
## Visualizing Distributions

### Goal
Create histograms, KDE plots, box plots, and violin plots of economic data.
Each visualization type reveals different aspects of the distribution.

### Why this matters in economics
A table of summary statistics can hide important features: bimodality, outliers, or
long tails. Visualization is the fastest way to spot problems before they become
modeling errors. For example, a histogram of unemployment may reveal two modes
(expansion vs recession clusters), which a single mean/std would obscure.
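
To illustrate how summary statistics can hide shape, this sketch builds two made-up samples with the same mean and standard deviation, one with two clusters and one with a single peak plus two extremes, and prints a crude text histogram of each. The data are illustrative assumptions, not drawn from the sample dataset.

```python
from collections import Counter
from statistics import mean, pstdev

# Two invented samples with identical mean (5) and population std (3).
two_clusters = [2, 2, 2, 2, 8, 8, 8, 8]
one_peak = [-1, 5, 5, 5, 5, 5, 5, 11]

for label, values in [("two clusters", two_clusters), ("one peak", one_peak)]:
    print(f"{label}: mean={mean(values)}, std={pstdev(values):.1f}")
    # Text histogram: one '#' per observation at each distinct value.
    for value, count in sorted(Counter(values).items()):
        print(f"  {value:>3} | {'#' * count}")
```

A table reporting only the mean and standard deviation would treat these two samples as identical; the histogram separates them immediately.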

### Your Turn

In [None]:
# TODO: Create histograms for gdp_growth_qoq, UNRATE, and FEDFUNDS.
# Use subplots so they are side by side.
# Hint: fig, axes = plt.subplots(1, 3, figsize=(15, 4))
#        df['col'].hist(ax=axes[i], bins=20)

viz_cols = ['gdp_growth_qoq', 'UNRATE', 'FEDFUNDS']

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

...

plt.tight_layout()
plt.show()


In [None]:
# TODO: Create KDE (kernel density estimation) plots for the same columns.
# KDE estimates a smooth density curve from the data (a continuous analogue of the histogram).
# Hint: df['col'].plot(kind='kde', ax=ax)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

...

plt.tight_layout()
plt.show()


In [None]:
# TODO: Create box plots comparing GDP growth across recession vs non-recession quarters.
# Hint: df.boxplot(column='gdp_growth_qoq', by='recession', ax=ax)

fig, ax = plt.subplots(figsize=(6, 4))

...

ax.set_title('GDP Growth by Recession Status')
ax.set_xlabel('Recession (0 = No, 1 = Yes)')
ax.set_ylabel('GDP Growth (QoQ %)')
plt.show()


In [None]:
# TODO: Create a violin plot showing the distribution of unemployment rate.
# A violin plot mirrors a KDE on each side of an axis and can mark summary statistics such as the median.
# Hint: ax.violinplot(df['UNRATE'].dropna(), showmedians=True)
#   or use: plt.violinplot(dataset, showmedians=True)

fig, ax = plt.subplots(figsize=(6, 4))

...

ax.set_title('Distribution of Unemployment Rate')
ax.set_ylabel('Unemployment Rate (%)')
plt.show()


In [None]:
# TODO: Create a 2x2 figure showing four different views of gdp_growth_yoy:
# (1) histogram, (2) KDE, (3) box plot, (4) time series line plot.
# This demonstrates how different visualizations complement each other.

col = 'gdp_growth_yoy'
series = df[col].dropna()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Top-left: histogram
...

# Top-right: KDE
...

# Bottom-left: box plot
...

# Bottom-right: time series
...

fig.suptitle(f'Four Views of {col}', fontsize=14)
plt.tight_layout()
plt.show()


**Interpretation prompt** (write 2–4 sentences below):
- Which visualization was most informative for spotting outliers? Why?
- Does the KDE of GDP growth look approximately normal, or is there visible skewness?
- How does GDP growth differ between recession and non-recession quarters in the box plot?
- What does the time series view reveal that the histogram does not?


<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run these asserts to verify your work. If any fail, go back and fix the corresponding section.

In [None]:
# ---- Central tendency checks ----
assert isinstance(gdp_mean, float), 'gdp_mean should be a float'
assert isinstance(gdp_median, float), 'gdp_median should be a float'
assert 0 < gdp_mean < 5, f'gdp_mean={gdp_mean} looks out of range for quarterly growth'

# ---- Spread checks ----
assert isinstance(gdp_var, float), 'gdp_var should be a float'
assert gdp_std > 0, 'Standard deviation must be positive'
assert gdp_iqr > 0, 'IQR must be positive'
assert abs(gdp_std**2 - gdp_var) < 1e-10, 'std^2 should equal variance'

# ---- Shape checks ----
assert 'skewness' in shape_summary.columns, 'shape_summary must have a skewness column'
assert 'excess_kurtosis' in shape_summary.columns, 'shape_summary must have an excess_kurtosis column'
assert len(shape_summary) == len(cols_shape), 'shape_summary should have one row per column'

# ---- Data checks ----
assert df.shape[0] > 20, 'Dataset should have more than 20 rows'
assert df.shape[1] >= 3, 'Dataset should have at least 3 columns'
assert df.index.inferred_type in {'datetime64', 'datetime64tz'}, 'Index should be datetime'

print('All checkpoint assertions passed.')


## Extensions (Optional)
- Compute descriptive statistics on the annualized GDP growth (`gdp_growth_qoq_annualized`) and compare to the non-annualized version. How does annualization affect spread and kurtosis?
- Download a longer time series from FRED (e.g., 1960–present) and see how the descriptive statistics change with sample period.
- Compute a rolling mean and rolling standard deviation of GDP growth (e.g., 8-quarter window) and plot them over time. When was the economy most volatile?


## Reflection
- Which summary statistic do you think is most important for an economist to report, and why?
- If you had to summarize the macro dataset in one table for a non-technical audience, which statistics and visualizations would you include?
- What assumptions are you implicitly making when you compute a single mean over the entire sample period?


<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Measures of central tendency</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 00_descriptive_statistics — Measures of central tendency
gdp = df['gdp_growth_qoq'].dropna()

gdp_mean = gdp.mean()
gdp_median = gdp.median()
gdp_mode = gdp.mode().iloc[0]

print(f'Mean GDP growth (QoQ):   {gdp_mean:.4f}')
print(f'Median GDP growth (QoQ): {gdp_median:.4f}')
print(f'Mode GDP growth (QoQ):   {gdp_mode:.4f}')

# Unemployment
unrate = df['UNRATE'].dropna()
unrate_mean = unrate.mean()
unrate_median = unrate.median()
print(f'Mean unemployment:   {unrate_mean:.4f}')
print(f'Median unemployment: {unrate_median:.4f}')
```

</details>

<details><summary>Solution: Measures of spread</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 00_descriptive_statistics — Measures of spread
gdp = df['gdp_growth_qoq'].dropna()

gdp_var = gdp.var()
gdp_std = gdp.std()
gdp_range = gdp.max() - gdp.min()
gdp_iqr = gdp.quantile(0.75) - gdp.quantile(0.25)

# Multi-column comparison
cols_to_compare = ['gdp_growth_qoq', 'gdp_growth_yoy', 'UNRATE', 'FEDFUNDS']
spread_summary = pd.DataFrame({
    'mean': df[cols_to_compare].mean(),
    'std': df[cols_to_compare].std(),
    'iqr': df[cols_to_compare].quantile(0.75) - df[cols_to_compare].quantile(0.25),
})

# Box plots
fig, axes = plt.subplots(1, len(cols_to_compare), figsize=(14, 4))
for i, col in enumerate(cols_to_compare):
    df[col].dropna().plot(kind='box', ax=axes[i])
    axes[i].set_title(col)
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: Skewness and kurtosis</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 00_descriptive_statistics — Skewness and kurtosis
from scipy import stats

cols_shape = ['gdp_growth_qoq', 'gdp_growth_yoy', 'UNRATE', 'FEDFUNDS']

shape_summary = pd.DataFrame({
    'skewness': {c: stats.skew(df[c].dropna()) for c in cols_shape},
    'excess_kurtosis': {c: stats.kurtosis(df[c].dropna()) for c in cols_shape},
}, index=cols_shape)

# Visual: GDP growth vs normal
gdp = df['gdp_growth_qoq'].dropna()
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(gdp, bins=20, density=True, alpha=0.6, label='GDP Growth (QoQ)')
x_grid = np.linspace(gdp.min(), gdp.max(), 200)
ax.plot(x_grid, stats.norm.pdf(x_grid, gdp.mean(), gdp.std()),
        'r-', lw=2, label='Normal fit')
ax.set_title('GDP Growth (QoQ) vs Normal Distribution')
ax.set_xlabel('GDP Growth (%)')
ax.legend()
plt.show()
```

</details>

<details><summary>Solution: Data types and summary tables</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 00_descriptive_statistics — Data types and summary tables

# describe() for numeric summary
df.describe()

# Recession value counts
recession_counts = df['recession'].value_counts()
print(recession_counts)
print(f'Fraction in recession: {df["recession"].mean():.2%}')

# Variable type classification
variable_types = {
    'UNRATE':                  'continuous',
    'FEDFUNDS':                'continuous',
    'CPIAUCSL':                'continuous',
    'gdp_growth_qoq':          'continuous',
    'recession':               'discrete',
    'target_recession_next_q': 'discrete',
}

# Missing values
df.isna().sum().sort_values(ascending=False).head(10)
```

</details>

<details><summary>Solution: Visualizing distributions</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 00_descriptive_statistics — Visualizing distributions
viz_cols = ['gdp_growth_qoq', 'UNRATE', 'FEDFUNDS']

# Histograms
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, col in enumerate(viz_cols):
    df[col].dropna().hist(bins=20, ax=axes[i], edgecolor='black', alpha=0.7)
    axes[i].set_title(col)
plt.tight_layout()
plt.show()

# KDE
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, col in enumerate(viz_cols):
    df[col].dropna().plot(kind='kde', ax=axes[i])
    axes[i].set_title(col)
plt.tight_layout()
plt.show()

# Box plot: GDP growth by recession
fig, ax = plt.subplots(figsize=(6, 4))
df.boxplot(column='gdp_growth_qoq', by='recession', ax=ax)
ax.set_title('GDP Growth by Recession Status')
ax.set_xlabel('Recession (0 = No, 1 = Yes)')
ax.set_ylabel('GDP Growth (QoQ %)')
plt.suptitle('')  # remove auto-title
plt.show()

# Violin plot
fig, ax = plt.subplots(figsize=(6, 4))
ax.violinplot(df['UNRATE'].dropna(), showmedians=True)
ax.set_title('Distribution of Unemployment Rate')
ax.set_ylabel('Unemployment Rate (%)')
plt.show()

# 2x2 combined view
col = 'gdp_growth_yoy'
series = df[col].dropna()
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

series.hist(bins=20, ax=axes[0, 0], edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Histogram')

series.plot(kind='kde', ax=axes[0, 1])
axes[0, 1].set_title('KDE')

axes[1, 0].boxplot(series)
axes[1, 0].set_title('Box Plot')

series.plot(ax=axes[1, 1])
axes[1, 1].set_title('Time Series')

fig.suptitle(f'Four Views of {col}', fontsize=14)
plt.tight_layout()
plt.show()
```

</details>
