# 02a Functional Forms and Interactions

Beyond log-log: level-log, log-level, quadratics, interactions, and dummy variables.

## Table of Contents
- [Log-level and level-log models](#log-level-and-level-log-models)
- [Quadratic terms](#quadratic-terms)
- [Interaction terms](#interaction-terms)
- [Dummy variables](#dummy-variables)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)

## Why This Notebook Matters
Real economic relationships are rarely linear in levels. You will learn:
- how different functional forms (log-level, level-log, quadratic) change coefficient interpretation,
- how interaction terms let the effect of one variable depend on another,
- how to include categorical variables via dummy encoding.

These are the building blocks for any applied regression specification. Getting the functional form wrong can reverse the sign of an estimate or hide nonlinearities.


## Prerequisites (Quick Self-Check)
- Completed notebooks 00 and 01 in this section (single-factor and multi-factor regression on county data).
- Comfort with log-log interpretation (elasticities) from notebook 00.
- Basic algebra: derivatives, partial effects.

## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can explain the coefficient interpretation for log-level, level-log, log-log, and quadratic models.
- You can compute marginal effects from interaction models at different values.
- You can run your work end-to-end without undefined variables.

## Common Pitfalls
- Running cells top-to-bottom without reading the instructions.
- Leaving `...` placeholders in code cells.
- Interpreting a level-log coefficient the same way as a log-log coefficient.
- Forgetting to drop one dummy category (the dummy variable trap).
- Interpreting interaction terms without computing marginal effects at specific values.

## Quick Fixes (When You Get Stuck)
- If you see `ModuleNotFoundError`, re-run the bootstrap cell and restart the kernel; make sure `PROJECT_ROOT` is the repo root.
- If a `data/processed/*` file is missing, either run the matching build script (see guide) or use the notebook's `data/sample/*` fallback.
- If results look "too good," suspect leakage; re-check shifts, rolling windows, and time splits.
- If a model errors, check dtypes (`astype(float)`) and missingness (`dropna()` on required columns).

## Matching Guide
- `docs/guides/02_regression/02a_functional_forms_and_interactions.md`

## How To Use This Notebook
- Work section-by-section; don't skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2–4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/02_regression/02a_functional_forms_and_interactions.md`) for the math, assumptions, and deeper context.

<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.

In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT

## Goal
Explore functional forms beyond log-log and learn how specification choices change coefficient interpretation.

The table below summarizes the four basic functional forms. You will fit the first three on the same county data; level-level is included for reference:

| Model | Equation | Coefficient interpretation |
|-------|----------|---------------------------|
| Log-log | $\log(y) = a + b\,\log(x)$ | Elasticity: 1% $\Delta x$ $\to$ $b$% $\Delta y$ |
| Log-level | $\log(y) = a + b\,x$ | Semi-elasticity: 1-unit $\Delta x$ $\to$ $\approx 100b$% $\Delta y$ |
| Level-log | $y = a + b\,\log(x)$ | 1% $\Delta x$ $\to$ $b/100$ unit $\Delta y$ |
| Level-level | $y = a + b\,x$ | 1-unit $\Delta x$ $\to$ $b$-unit $\Delta y$ |

Then you will add quadratics, interactions, and dummy variables.

## Primer: Functional Form Interpretation Cheat Sheet

When you take logs of the dependent variable, the independent variable, both, or neither, the coefficient means something different each time. This primer gives the four cases.

### The four cases

Let $y$ be the outcome and $x$ the regressor.

**Level-level**: $y = \beta_0 + \beta_1 x + \varepsilon$
- $\beta_1$: a 1-unit increase in $x$ is associated with a $\beta_1$-unit change in $y$.

**Log-level (semi-log)**: $\log(y) = \beta_0 + \beta_1 x + \varepsilon$
- $\beta_1$: a 1-unit increase in $x$ is associated with an approximate $100 \times \beta_1$% change in $y$.
- Exact: $y$ is multiplied by $e^{\beta_1}$ for a 1-unit increase in $x$.

**Level-log**: $y = \beta_0 + \beta_1 \log(x) + \varepsilon$
- $\beta_1$: a 1% increase in $x$ is associated with an approximate $\beta_1 / 100$ unit change in $y$.
- Exact: $y$ changes by $\beta_1 \log(1.01)$ units for a 1% increase in $x$.

**Log-log**: $\log(y) = \beta_0 + \beta_1 \log(x) + \varepsilon$
- $\beta_1$: a 1% increase in $x$ is associated with a $\beta_1$% change in $y$ (elasticity).
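
The gap between the approximate and exact readings grows with the size of the coefficient. The following is a minimal sketch of both conversions; the helper names and the coefficient values are hypothetical and chosen only for illustration.

```python
import math


def log_level_pct_change(beta, delta_x=1.0):
    """Return (approximate, exact) percent change in y for a delta_x change in x."""
    approx = 100.0 * beta * delta_x
    exact = 100.0 * (math.exp(beta * delta_x) - 1.0)
    return approx, exact


def level_log_unit_change(beta, pct_x=1.0):
    """Return (approximate, exact) unit change in y for a pct_x percent change in x."""
    approx = beta * pct_x / 100.0
    exact = beta * math.log(1.0 + pct_x / 100.0)
    return approx, exact


# Hypothetical coefficients, not estimates from the county data.
for beta in (0.02, 0.10, 0.50):
    approx, exact = log_level_pct_change(beta)
    print(f'log-level beta={beta:.2f}: approx {approx:.2f}%, exact {exact:.2f}%')

approx, exact = level_log_unit_change(beta=800.0, pct_x=10.0)
print(f'level-log beta=800, +10% in x: approx {approx:.2f}, exact {exact:.2f}')
```

For small coefficients the two readings are nearly identical, so the approximation is usually fine for reporting; for large ones, prefer the exact version.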

### Why it matters

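Choosing the wrong functional form changes the economic interpretation of your results. A coefficient of 0.5 in a log-log model says that a 1% increase in $x$ goes with a 0.5% increase in $y$; the same 0.5 in a level-level model says that a 1-unit increase in $x$ goes with a 0.5-unit increase in $y$.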

### Quick decision rule

- If both variables span orders of magnitude (e.g., income, population): consider log-log.
- If you want "percent change in $y$ per unit change in $x$": use log-level.
- If you want "unit change in $y$ per percent change in $x$": use level-log.
- Always check residual plots; functional form misspecification shows up as patterns in residuals.
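
One rough way to see what "patterns in residuals" means is to check whether the residuals still correlate with the squared, centered regressor. The sketch below uses synthetic data in which $y$ is truly linear in $\log(x)$; the data-generating values and helper names are assumptions for illustration, not part of the notebook's dataset or API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed values): y is linear in log(x), not in x.
x = rng.uniform(20_000, 150_000, size=500)
y = 200.0 + 90.0 * np.log(x) + rng.normal(0.0, 5.0, size=500)


def residuals_after_linear_fit(regressor, outcome):
    """Residuals from an OLS line of outcome on regressor (with intercept)."""
    slope, intercept = np.polyfit(regressor, outcome, 1)
    return outcome - (intercept + slope * regressor)


def curvature_corr(regressor, resid):
    """Correlation between residuals and the squared, centered regressor."""
    centered_sq = (regressor - regressor.mean()) ** 2
    return float(np.corrcoef(centered_sq, resid)[0, 1])


resid_level = residuals_after_linear_fit(x, y)        # misspecified: y on x
resid_log = residuals_after_linear_fit(np.log(x), y)  # correct: y on log(x)

print(f'curvature left in residuals, y ~ x:      {curvature_corr(x, resid_level):+.2f}')
print(f'curvature left in residuals, y ~ log(x): {curvature_corr(np.log(x), resid_log):+.2f}')
```

A correlation far from zero suggests leftover curvature that the chosen form did not absorb. A scatter plot of residuals against the regressor shows the same thing visually.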

## Load census data

### Goal
Load the county-level dataset and prepare baseline variables.

### Your Turn (1): Load and prepare the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

year = 2022  # TODO: set to the year you fetched
path = PROCESSED_DIR / f'census_county_{year}.csv'

if path.exists():
    df = pd.read_csv(path)
else:
    df = pd.read_csv(SAMPLE_DIR / 'census_county_sample.csv')

# Build clean modeling DataFrame
income = pd.to_numeric(df['B19013_001E'], errors='coerce')
rent = pd.to_numeric(df['B25064_001E'], errors='coerce')

mask = (income > 0) & (rent > 0)
df_m = pd.DataFrame({
    'income': income[mask],
    'rent': rent[mask],
    'state': df.loc[mask, 'state'],
    'poverty_rate': pd.to_numeric(df.loc[mask, 'poverty_rate'], errors='coerce'),
}).dropna().copy()

# Pre-compute log transforms
df_m['log_income'] = np.log(df_m['income'])
df_m['log_rent'] = np.log(df_m['rent'])

print(f'Observations: {len(df_m)}')
df_m.head()

<a id="log-level-and-level-log-models"></a>
## Log-level and level-log models

### Goal
Fit two models that mix levels and logs, and compare their coefficient interpretations to the log-log model from notebook 00.

**Model A (log-level / semi-elasticity)**:
$$\log(rent_i) = \beta_0 + \beta_1 \cdot income_i + \varepsilon_i$$

Interpretation: a \$1 increase in median income is associated with an approximate $100 \times \beta_1$% change in rent.

**Model B (level-log)**:
$$rent_i = \beta_0 + \beta_1 \cdot \log(income_i) + \varepsilon_i$$

Interpretation: a 1% increase in income is associated with a $\beta_1 / 100$ dollar change in rent.

### Your Turn (1): Fit the log-level model (semi-elasticity)

In [None]:
from src import econometrics

# Model A: log(rent) = a + b * income  (level predictor, log outcome)
# TODO: Fit using fit_ols_hc3. y_col is 'log_rent', x_cols is ['income'].
res_log_level = econometrics.fit_ols_hc3(df_m, y_col=..., x_cols=[...])
print(res_log_level.summary())

### Your Turn (2): Interpret the log-level coefficient

In [None]:
beta_log_level = float(res_log_level.params['income'])

# The coefficient is in log-points per dollar of income.
# To get the approximate percent change in rent per $1,000 income increase:
# TODO: Compute the approximate percent change in rent for a $1,000 increase in income.
pct_change_per_1000 = ...

print(f'beta (log-level): {beta_log_level:.6f}')
print(f'A $1,000 income increase is associated with ~{pct_change_per_1000:.2f}% higher rent')

### Your Turn (3): Fit the level-log model

In [None]:
# Model B: rent = a + b * log(income)  (log predictor, level outcome)
# TODO: Fit using fit_ols_hc3. y_col is 'rent', x_cols is ['log_income'].
res_level_log = econometrics.fit_ols_hc3(df_m, y_col=..., x_cols=[...])
print(res_level_log.summary())

### Your Turn (4): Interpret the level-log coefficient

In [None]:
beta_level_log = float(res_level_log.params['log_income'])

# In a level-log model, a 1% increase in income is associated with
# beta/100 dollar change in rent.
# TODO: Compute the dollar change in rent for a 10% increase in income.
dollar_change_10pct = ...

print(f'beta (level-log): {beta_level_log:.2f}')
print(f'A 10% income increase is associated with ~${dollar_change_10pct:.2f} higher rent')

### Your Turn (5): Compare all three specifications (log-log from notebook 00)

In [None]:
# Fit log-log for comparison
res_log_log = econometrics.fit_ols_hc3(df_m, y_col='log_rent', x_cols=['log_income'])

# TODO: Build a comparison table of the three models.
# Include: model name, coefficient, SE, R-squared, and interpretation.
comparison = pd.DataFrame({
    'model': ['log-log', 'log-level', 'level-log'],
    'coef': [
        float(res_log_log.params['log_income']),
        ...,  # TODO: log-level coefficient
        ...,  # TODO: level-log coefficient
    ],
    'se': [
        float(res_log_log.bse['log_income']),
        ...,  # TODO: log-level SE
        ...,  # TODO: level-log SE
    ],
    'r_squared': [
        ...,  # TODO: R-squared for each model
        ...,
        ...,
    ],
})
comparison

### Interpretation prompt

Write 2–4 sentences:
- Which model has the highest R-squared? Does that make it "best"?
- Why can't you directly compare R-squared across models with different dependent variables (log_rent vs rent)?
- Which specification would you choose for a policy report about income and rent?
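
One common workaround for the R-squared comparison is to put every model's predictions on the same scale (dollars) and compute the squared correlation with the observed outcome. The sketch below does this on synthetic data; the variable values and helper names are assumptions, and it uses plain NumPy fits rather than `fit_ols_hc3`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for income and rent (assumed values, not the county data).
income = rng.uniform(30_000, 120_000, size=400)
rent = np.exp(1.0 + 0.55 * np.log(income) + rng.normal(0.0, 0.15, size=400))


def fit_line(x, y):
    """OLS of y on x with an intercept; returns the fitted values."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept + slope * x


def r2_in_levels(y_level, y_hat_level):
    """Squared correlation between the outcome and predictions, both in levels."""
    return float(np.corrcoef(y_level, y_hat_level)[0, 1] ** 2)


# Level-log model: predictions are already in dollars.
rent_hat_level_log = fit_line(np.log(income), rent)

# Log-log model: predictions are in log dollars, so exponentiate them first.
# exp(predicted log) is a biased prediction of the level, but a constant
# rescaling of the predictions does not change the squared correlation.
rent_hat_log_log = np.exp(fit_line(np.log(income), np.log(rent)))

print(f'level-log R2 in levels: {r2_in_levels(rent, rent_hat_level_log):.3f}')
print(f'log-log   R2 in levels: {r2_in_levels(rent, rent_hat_log_log):.3f}')
```

Even on a common scale, a higher value only says which form tracks rent in dollars more closely in this sample; it does not settle which interpretation suits the policy question.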

<a id="quadratic-terms"></a>
## Quadratic terms

### Goal
Add $income^2$ to allow a nonlinear (U-shaped or inverted-U) relationship.

Model:
$$\log(rent_i) = \beta_0 + \beta_1 \cdot income_i + \beta_2 \cdot income_i^2 + \varepsilon_i$$

The marginal effect of income is no longer constant:
$$\frac{\partial \log(rent)}{\partial income} = \beta_1 + 2\beta_2 \cdot income$$

**Turning point** (where the marginal effect equals 0; a maximum if $\beta_2 < 0$, a minimum if $\beta_2 > 0$):
$$income^* = \frac{-\beta_1}{2\beta_2}$$
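
The turning-point formula can be sanity-checked on synthetic data with a known inverted U. The sketch below also shows that measuring income in thousands of dollars (as the exercise does) changes the size of the coefficients but not the implied turning point. All numbers and helper names here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with a known inverted-U in income (assumed values).
income = rng.uniform(20_000, 150_000, size=500)
income_k = income / 1000
log_rent = (6.0 + 0.02 * income_k - 0.00008 * income_k**2
            + rng.normal(0.0, 0.05, size=500))


def turning_point(x, y):
    """Fit y = b0 + b1*x + b2*x^2 by OLS and return (b1, b2, -b1 / (2*b2))."""
    b2, b1, _b0 = np.polyfit(x, y, 2)  # polyfit returns the highest degree first
    return b1, b2, -b1 / (2 * b2)


b1_k, b2_k, tp_k = turning_point(income_k, log_rent)      # income in $1,000s
b1_raw, b2_raw, tp_raw = turning_point(income, log_rent)  # income in dollars

print(f'$1,000s: b1={b1_k:.5f}, b2={b2_k:.8f}, turning point=${tp_k * 1000:,.0f}')
print(f'dollars: b1={b1_raw:.8f}, b2={b2_raw:.3e}, turning point=${tp_raw:,.0f}')

inside = income.min() <= tp_raw <= income.max()
print(f'turning point inside the observed income range: {inside}')
```

A turning point outside the observed range means the fitted curve is monotone over the data, so the "U" is an extrapolation rather than something the sample shows.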

### Your Turn (1): Create the quadratic term and fit the model

In [None]:
# Scale income to $1,000s to avoid tiny coefficients on income^2
df_m['income_k'] = df_m['income'] / 1000

# TODO: Create the squared term.
df_m['income_k_sq'] = ...

# TODO: Fit OLS with HC3: log_rent ~ income_k + income_k_sq
res_quad = econometrics.fit_ols_hc3(
    df_m, y_col='log_rent', x_cols=['income_k', 'income_k_sq']
)
print(res_quad.summary())

### Your Turn (2): Compute the turning point

In [None]:
b1 = float(res_quad.params['income_k'])
b2 = float(res_quad.params['income_k_sq'])

# TODO: Compute the turning point: -b1 / (2 * b2)
# This is in $1,000s because we used income_k.
turning_point_k = ...

print(f'b1 (income_k): {b1:.6f}')
print(f'b2 (income_k_sq): {b2:.8f}')
print(f'Turning point: ${turning_point_k:.1f}k = ${turning_point_k * 1000:,.0f}')

# Interpretation prompt:
# Is the turning point within the data range?
print(f'Income range: ${df_m["income_k"].min():.1f}k to ${df_m["income_k"].max():.1f}k')

### Your Turn (3): Plot the marginal effect across the income range

In [None]:
# TODO: Plot the marginal effect of income_k as a function of income_k.
# The marginal effect is: b1 + 2 * b2 * income_k
income_grid = np.linspace(df_m['income_k'].min(), df_m['income_k'].max(), 200)
marginal_effect = ...

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(income_grid, marginal_effect)
ax.axhline(0, color='red', linestyle='--', alpha=0.5)
ax.set_xlabel('Median Household Income ($1,000s)')
ax.set_ylabel('Marginal effect on log(rent)')
ax.set_title('Marginal Effect of Income (Quadratic Model)')
plt.tight_layout()
plt.show()

### Interpretation prompt

Write 2–4 sentences:
- Is the quadratic term statistically significant?
- Does the marginal effect cross zero within the data range? What does that mean economically?
- Should you always include a quadratic? What is the cost of doing so?

<a id="interaction-terms"></a>
## Interaction terms

### Goal
Add an interaction between income and poverty_rate. This allows the effect of income on rent to vary with the local poverty rate.

Model:
$$\log(rent_i) = \beta_0 + \beta_1 \cdot income\_k_i + \beta_2 \cdot poverty\_rate_i + \beta_3 \cdot (income\_k_i \times poverty\_rate_i) + \varepsilon_i$$

The marginal effect of income now depends on poverty_rate:
$$\frac{\partial \log(rent)}{\partial income\_k} = \beta_1 + \beta_3 \cdot poverty\_rate$$

Key insight: you cannot interpret $\beta_1$ alone when an interaction is present. $\beta_1$ is the effect of income *when poverty_rate = 0*.
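
Because a poverty rate of exactly zero is rarely observed, one option is to center `poverty_rate` before interacting, so that the income coefficient becomes the effect at the average poverty rate. The sketch below checks this identity on synthetic data; the ranges (poverty rate as a share between 0.05 and 0.30), the coefficients, and the helper name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Synthetic regressors (assumed ranges and coefficients).
income_k = rng.uniform(30, 120, size=n)
poverty = rng.uniform(0.05, 0.30, size=n)
log_rent = (6.0 + 0.010 * income_k - 0.5 * poverty
            - 0.02 * income_k * poverty + rng.normal(0.0, 0.05, size=n))


def fit_interaction(x, z, y):
    """OLS of y on [1, x, z, x*z]; returns the four coefficients."""
    X = np.column_stack([np.ones_like(x), x, z, x * z])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


_, b1, _, b3 = fit_interaction(income_k, poverty, log_rent)
_, b1_centered, _, b3_centered = fit_interaction(
    income_k, poverty - poverty.mean(), log_rent
)

print(f'b1 (effect at poverty_rate = 0):   {b1:.5f}')
print(f'b1 + b3 * mean(poverty_rate):      {b1 + b3 * poverty.mean():.5f}')
print(f'b1 after centering poverty_rate:   {b1_centered:.5f}')
```

Centering does not change the fit or the interaction coefficient; it only moves the reference point at which the main effect is evaluated.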

### Your Turn (1): Create the interaction and fit the model

In [None]:
# TODO: Create the interaction term.
df_m['income_x_poverty'] = ...

# TODO: Fit OLS with HC3.
res_interact = econometrics.fit_ols_hc3(
    df_m,
    y_col='log_rent',
    x_cols=['income_k', 'poverty_rate', 'income_x_poverty'],
)
print(res_interact.summary())

### Your Turn (2): Compute marginal effects at different poverty rates

In [None]:
b1_interact = float(res_interact.params['income_k'])
b3_interact = float(res_interact.params['income_x_poverty'])

# Marginal effect of income_k at different poverty rates:
# ME(income_k | poverty_rate=p) = b1 + b3 * p

# TODO: Compute the marginal effect at the 25th, 50th, and 75th percentile of poverty_rate.
pov_p25 = df_m['poverty_rate'].quantile(0.25)
pov_p50 = df_m['poverty_rate'].quantile(0.50)
pov_p75 = df_m['poverty_rate'].quantile(0.75)

me_p25 = ...  # TODO
me_p50 = ...  # TODO
me_p75 = ...  # TODO

print(f'Marginal effect of income ($1k) at poverty_rate={pov_p25:.3f} (p25): {me_p25:.4f}')
print(f'Marginal effect of income ($1k) at poverty_rate={pov_p50:.3f} (p50): {me_p50:.4f}')
print(f'Marginal effect of income ($1k) at poverty_rate={pov_p75:.3f} (p75): {me_p75:.4f}')

### Your Turn (3): Plot the marginal effect of income as a function of poverty_rate

In [None]:
# TODO: Plot the marginal effect of income_k across the poverty_rate range.
pov_grid = np.linspace(df_m['poverty_rate'].min(), df_m['poverty_rate'].max(), 200)
me_grid = ...  # TODO: b1_interact + b3_interact * pov_grid

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(pov_grid, me_grid)
ax.axhline(0, color='red', linestyle='--', alpha=0.5)
ax.set_xlabel('Poverty Rate')
ax.set_ylabel('Marginal effect of income ($1k) on log(rent)')
ax.set_title('How the Effect of Income Depends on Poverty Rate')
plt.tight_layout()
plt.show()

### Interpretation prompt

Write 2–4 sentences:
- Is the interaction term statistically significant?
- Does higher poverty weaken or strengthen the income-rent relationship? Why might that be?
- Why is it misleading to report only $\beta_1$ from an interaction model?
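
A marginal effect at a given poverty rate is a linear combination of two coefficients, so its standard error needs their covariance as well as their individual variances. The sketch below is a generic NumPy helper (a hypothetical name, not part of `src.econometrics`), demonstrated with a classical, non-robust covariance matrix on synthetic data. If the object returned by `fit_ols_hc3` is a statsmodels results object, its `cov_params()` method would supply the robust covariance instead, but that is an assumption about the helper.

```python
import numpy as np


def marginal_effect_with_se(params, cov, p, i_main=1, i_inter=3):
    """Marginal effect b_main + b_inter * p and the standard error of that sum.

    params: coefficient vector; cov: its covariance matrix;
    i_main / i_inter: positions of the main-effect and interaction coefficients.
    """
    me = params[i_main] + params[i_inter] * p
    var = (cov[i_main, i_main]
           + p**2 * cov[i_inter, i_inter]
           + 2 * p * cov[i_main, i_inter])
    return me, np.sqrt(var)


# Synthetic example (assumed values) with a classical OLS covariance matrix.
rng = np.random.default_rng(4)
n = 400
income_k = rng.uniform(30, 120, size=n)
poverty = rng.uniform(0.05, 0.30, size=n)
y = (6.0 + 0.010 * income_k - 0.5 * poverty
     - 0.02 * income_k * poverty + rng.normal(0.0, 0.05, size=n))

X = np.column_stack([np.ones(n), income_k, poverty, income_k * poverty])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)

for p in (0.10, 0.15, 0.20):
    me, se = marginal_effect_with_se(beta, cov, p)
    print(f'poverty_rate={p:.2f}: ME={me:.5f}, SE={se:.5f}')
```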

<a id="dummy-variables"></a>
## Dummy variables

### Goal
Create region dummies from state FIPS codes and include them in a regression. Interpret the dummy coefficients as differences from the base category.

We will map state FIPS codes to Census regions (Northeast, Midwest, South, West) and create indicator columns.
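
The reason for dropping one category can be seen directly in the design matrix: with an intercept, a full set of dummies always sums to the column of ones, so the matrix loses a rank. The sketch below uses a toy list of region labels (illustrative only) and a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy region labels (illustrative only).
region = pd.Series(['Midwest', 'Northeast', 'South', 'West',
                    'South', 'West', 'Midwest', 'Northeast'])

all_dummies = pd.get_dummies(region, prefix='region', dtype=float)
k_minus_1 = pd.get_dummies(region, prefix='region', drop_first=True, dtype=float)


def design_with_intercept(dummies):
    """Stack a column of ones next to the dummy columns."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])


X_trap = design_with_intercept(all_dummies)  # intercept + 4 dummies
X_ok = design_with_intercept(k_minus_1)      # intercept + 3 dummies

print('columns, rank with all dummies:', X_trap.shape[1], np.linalg.matrix_rank(X_trap))
print('columns, rank with drop_first: ', X_ok.shape[1], np.linalg.matrix_rank(X_ok))
print('kept columns:', list(k_minus_1.columns))
```

With `drop_first=True`, pandas drops the first category in sorted order, which becomes the base category that every dummy coefficient is measured against.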

### Your Turn (1): Create region variable from state codes

In [None]:
# Census region mapping (state FIPS -> region)
NORTHEAST = ['09', '23', '25', '33', '34', '36', '42', '44', '50']
MIDWEST = ['17', '18', '19', '20', '26', '27', '29', '31', '38', '39', '46', '55']
SOUTH = ['01', '05', '10', '11', '12', '13', '21', '22', '24', '28',
          '37', '40', '45', '47', '48', '51', '54']
WEST = ['02', '04', '06', '08', '15', '16', '30', '32', '35', '41', '49', '53', '56']

def state_to_region(fips: str) -> str:
    fips = str(fips).zfill(2)
    if fips in NORTHEAST:
        return 'Northeast'
    elif fips in MIDWEST:
        return 'Midwest'
    elif fips in SOUTH:
        return 'South'
    elif fips in WEST:
        return 'West'
    else:
        return 'Other'

# TODO: Apply the mapping to create a 'region' column.
df_m['region'] = ...

print(df_m['region'].value_counts())

### Your Turn (2): Create dummy variables with `pd.get_dummies`

In [None]:
# TODO: Create dummy columns, dropping one category to avoid the dummy variable trap.
# Hint: pd.get_dummies(df_m['region'], prefix='region', drop_first=True, dtype=float)
# (dtype=float keeps the columns numeric; recent pandas versions default to bool.)
region_dummies = ...

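# Join dummies to df_m (drop any earlier region_ columns so the cell can be re-run)
df_m = df_m.drop(columns=[c for c in df_m.columns if c.startswith('region_')])
df_m = df_m.join(region_dummies)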

# Show the dummy columns
dummy_cols = [c for c in df_m.columns if c.startswith('region_')]
print('Dummy columns:', dummy_cols)
print('Dropped (base) category is the one NOT listed above.')
df_m[dummy_cols].head()

### Your Turn (3): Fit a model with region dummies

In [None]:
# TODO: Fit log_rent ~ log_income + region dummies using HC3.
x_cols_dummy = ['log_income'] + dummy_cols
res_dummy = econometrics.fit_ols_hc3(
    df_m, y_col='log_rent', x_cols=...
)
print(res_dummy.summary())

### Your Turn (4): Interpret the dummy coefficients

In [None]:
# TODO: Extract and interpret the region dummy coefficients.
# Each coefficient is the log-point difference from the base region,
# holding income constant.
# Approximate percent difference: 100 * coef (exact: 100 * (exp(coef) - 1)).

for col in dummy_cols:
    coef = float(res_dummy.params[col])
    pval = float(res_dummy.pvalues[col])
    # TODO: Compute the approximate percent difference from the base region.
    pct_diff = ...
    print(f'{col}: coef={coef:.4f}, ~{pct_diff:.1f}% vs base, p={pval:.4f}')

### Interpretation prompt

Write 2–4 sentences:
- Which region has the highest rent premium relative to the base category, controlling for income?
- Are the regional differences statistically significant?
- What does the base category represent in this model?

<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2–3 sentences summarizing what you verified.

In [None]:
# TODO: Validate your results across all four model families.
# Example checks (adjust variable names as needed):

# Data checks
assert df_m.shape[0] > 100, 'Too few observations after filtering'
assert not df_m[['income', 'rent', 'log_income', 'log_rent']].isna().any().any(), 'NaNs in core variables'

# Log-level model check
assert beta_log_level > 0, 'Log-level coefficient should be positive (higher income -> higher rent)'

# Level-log model check
assert beta_level_log > 0, 'Level-log coefficient should be positive'

# Quadratic model: turning point should be positive
assert turning_point_k > 0, 'Turning point should be positive income'

# Interaction model fitted
assert 'income_x_poverty' in res_interact.params.index, 'Interaction term missing'

# Dummy model: at least 2 region dummies
assert len(dummy_cols) >= 2, 'Should have at least 2 region dummies (with one dropped)'

print('All checks passed.')

## Extensions (Optional)
- Try one additional variant beyond the main path (different features, different split, different model).
- Write down what improved, what got worse, and your hypothesis for why.

Ideas:
- Add region dummies to the interaction model. Does the interaction coefficient change?
- Try a cubic term ($income^3$). Is it statistically significant? Does the turning point move?
- Interact region dummies with income to allow different slopes by region.

## Reflection
- What did you assume implicitly (about timing, availability, stationarity, or costs)?
- If you had to ship this model, what would you monitor?
- How does functional form choice affect policy conclusions (e.g., "a $1,000 income increase leads to X% higher rent")?

<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Log-level and level-log models</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a — Log-level and level-log models
from src import econometrics

# Log-level: log(rent) ~ income
res_log_level = econometrics.fit_ols_hc3(df_m, y_col='log_rent', x_cols=['income'])
print(res_log_level.summary())

beta_log_level = float(res_log_level.params['income'])
# Approx % change per $1,000: 100 * beta * 1000
pct_change_per_1000 = 100 * beta_log_level * 1000
print(f'A $1,000 income increase -> ~{pct_change_per_1000:.2f}% higher rent')

# Level-log: rent ~ log(income)
res_level_log = econometrics.fit_ols_hc3(df_m, y_col='rent', x_cols=['log_income'])
print(res_level_log.summary())

beta_level_log = float(res_level_log.params['log_income'])
# Dollar change for 10% income increase: beta * log(1.10) ≈ beta * 0.10
dollar_change_10pct = beta_level_log * np.log(1.10)
print(f'A 10% income increase -> ~${dollar_change_10pct:.2f} higher rent')
```

</details>

<details><summary>Solution: Compare three specifications</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a — Comparison table
comparison = pd.DataFrame({
    'model': ['log-log', 'log-level', 'level-log'],
    'coef': [
        float(res_log_log.params['log_income']),
        float(res_log_level.params['income']),
        float(res_level_log.params['log_income']),
    ],
    'se': [
        float(res_log_log.bse['log_income']),
        float(res_log_level.bse['income']),
        float(res_level_log.bse['log_income']),
    ],
    'r_squared': [
        res_log_log.rsquared,
        res_log_level.rsquared,
        res_level_log.rsquared,
    ],
})
comparison
```

_Note: R-squared is not directly comparable across models with different dependent variables (log_rent vs rent)._

</details>

<details><summary>Solution: Quadratic terms</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a — Quadratic terms
df_m['income_k'] = df_m['income'] / 1000
df_m['income_k_sq'] = df_m['income_k'] ** 2

res_quad = econometrics.fit_ols_hc3(
    df_m, y_col='log_rent', x_cols=['income_k', 'income_k_sq']
)
print(res_quad.summary())

b1 = float(res_quad.params['income_k'])
b2 = float(res_quad.params['income_k_sq'])
turning_point_k = -b1 / (2 * b2)
print(f'Turning point: ${turning_point_k * 1000:,.0f}')

# Marginal effect plot
income_grid = np.linspace(df_m['income_k'].min(), df_m['income_k'].max(), 200)
marginal_effect = b1 + 2 * b2 * income_grid
```

</details>

<details><summary>Solution: Interaction terms</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a — Interaction terms
df_m['income_x_poverty'] = df_m['income_k'] * df_m['poverty_rate']

res_interact = econometrics.fit_ols_hc3(
    df_m,
    y_col='log_rent',
    x_cols=['income_k', 'poverty_rate', 'income_x_poverty'],
)
print(res_interact.summary())

b1_interact = float(res_interact.params['income_k'])
b3_interact = float(res_interact.params['income_x_poverty'])

pov_p25 = df_m['poverty_rate'].quantile(0.25)
pov_p50 = df_m['poverty_rate'].quantile(0.50)
pov_p75 = df_m['poverty_rate'].quantile(0.75)

me_p25 = b1_interact + b3_interact * pov_p25
me_p50 = b1_interact + b3_interact * pov_p50
me_p75 = b1_interact + b3_interact * pov_p75

# Marginal effect plot
pov_grid = np.linspace(df_m['poverty_rate'].min(), df_m['poverty_rate'].max(), 200)
me_grid = b1_interact + b3_interact * pov_grid
```

</details>

<details><summary>Solution: Dummy variables</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a — Dummy variables
df_m['region'] = df_m['state'].apply(state_to_region)
print(df_m['region'].value_counts())

region_dummies = pd.get_dummies(df_m['region'], prefix='region', drop_first=True, dtype=float)
df_m = df_m.join(region_dummies)

dummy_cols = [c for c in df_m.columns if c.startswith('region_')]
x_cols_dummy = ['log_income'] + dummy_cols

res_dummy = econometrics.fit_ols_hc3(
    df_m, y_col='log_rent', x_cols=x_cols_dummy
)
print(res_dummy.summary())

# Interpret coefficients
for col in dummy_cols:
    coef = float(res_dummy.params[col])
    pval = float(res_dummy.pvalues[col])
    pct_diff = 100 * coef
    print(f'{col}: coef={coef:.4f}, ~{pct_diff:.1f}% vs base, p={pval:.4f}')
```

</details>