# Spatial Lag Model (SAR): Modeling Endogenous Spatial Spillovers

**Level**: Intermediate  
**Duration**: 120-140 minutes  
**Prerequisites**: Notebooks 01-02, Understanding of W matrices, Maximum Likelihood basics

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Estimate** Spatial Lag Models (SAR) using Maximum Likelihood
2. **Interpret** the spatial autoregressive parameter ρ
3. **Understand** endogenous spillovers and the spatial multiplier
4. **Compare** OLS and SAR to demonstrate bias correction
5. **Diagnose** model adequacy using residual tests
6. **Apply** SAR to real-world housing price spillovers

---

## Table of Contents

1. [Introduction to SAR Model](#1-introduction)
2. [Data Preparation and W Matrix](#2-data-preparation)
3. [OLS Baseline (The Wrong Way)](#3-ols-baseline)
4. [Estimating SAR with PanelBox](#4-sar-estimation)
5. [Comparing OLS vs SAR](#5-comparison)
6. [Panel Data: Fixed Effects SAR](#6-fixed-effects)
7. [Understanding the Spatial Multiplier](#7-spatial-multiplier)
8. [Model Diagnostics](#8-diagnostics)
9. [Case Study: Housing Price Spillovers](#9-case-study)
10. [Summary and Next Steps](#10-summary)

---

## 1. Introduction to SAR Model {#1-introduction}

### Modeling Endogenous Spatial Spillovers

The **Spatial Lag Model (SAR)** is the foundational spatial econometric model. It directly models **endogenous spatial spillovers** where the outcome in one location depends on outcomes in neighboring locations.

### Model Specification

The SAR model is specified as:

$$
y = \rho W y + X\beta + \alpha + \varepsilon
$$

Where:
- **y**: N×1 vector of the dependent variable
- **ρ** (rho): Spatial autoregressive parameter (scalar)
- **Wy**: Spatial lag of y (weighted average of neighbors' y)
- **X**: N×K matrix of explanatory variables
- **β**: K×1 vector of coefficients
- **α**: Fixed or random effects
- **ε**: i.i.d. error term

### Reduced Form

Solving for y:

$$
y = (I - \rho W)^{-1}(X\beta + \alpha + \varepsilon)
$$

**Key Insight**: A change in $X_i$ affects not only $y_i$ but also neighbors' y, which feeds back to $y_i$ → **Multiplicative spillovers**
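
The feedback described above can be seen numerically. The following is a minimal sketch, assuming a hypothetical six-unit ring neighborhood and illustrative values for ρ and β (none of these come from the housing data): it simulates y from the reduced form, confirms that the result satisfies the structural equation, and then shocks X for a single unit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Hypothetical ring-shaped neighborhood: each unit's neighbors are the units
# immediately before and after it; every row sums to 1 (row-normalized).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

rho, beta = 0.4, 2.0  # illustrative values only
x = rng.normal(size=n)
eps = rng.normal(scale=0.1, size=n)

# Reduced form: y = (I - rho W)^{-1} (x beta + eps); solve() avoids an explicit inverse
A = np.eye(n) - rho * W
y = np.linalg.solve(A, x * beta + eps)

# The simulated y satisfies the structural equation y = rho W y + x beta + eps
structural_gap = np.max(np.abs(y - (rho * W @ y + x * beta + eps)))

# Raise x for unit 0 only and record how every unit's y responds
x_shock = x.copy()
x_shock[0] += 1.0
y_shock = np.linalg.solve(A, x_shock * beta + eps)
response = y_shock - y

print("Response of each unit to a unit shock in x[0]:", np.round(response, 3))
```

The response of unit 0 exceeds β because part of the shock returns through its neighbors, and every other unit moves as well even though its own x is unchanged.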

### Economic Interpretation of ρ

- **ρ > 0**: Positive spatial spillovers (clustering, imitation)
  - Example: High housing prices in neighborhood i increase prices in neighboring neighborhoods

- **ρ < 0**: Negative spatial spillovers (competition)
  - Example: Retail stores compete for customers across space

- **ρ = 0**: No spatial dependence → OLS is appropriate

### Why OLS Fails with SAR

**Endogeneity Problem**: Wy is endogenous (correlated with ε)
- y depends on Wy, but Wy depends on y → **simultaneity bias**

**Consequences**:
- ✗ β estimates are **biased**
- ✗ ρ is **not estimated** when Wy is omitted, and is **inconsistent** when Wy is simply added as an OLS regressor
- ✗ Standard errors are **wrong**

**Solution**: Maximum Likelihood (ML) or Quasi-ML (QML) estimation
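
The simultaneity problem can be illustrated with a small Monte Carlo experiment. This is a sketch under stated assumptions (a hypothetical ring-shaped W, one regressor, and illustrative true values ρ = 0.5 and β = 1), not a result from the housing data: it repeatedly simulates SAR data and regresses y on a constant, x, and Wy by OLS.

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 200, 200
rho_true, beta_true = 0.5, 1.0  # illustrative values only

# Hypothetical ring W: two neighbors per unit, row-normalized
idx = np.arange(n)
W = np.zeros((n, n))
W[idx, (idx - 1) % n] = 0.5
W[idx, (idx + 1) % n] = 0.5

A_inv = np.linalg.inv(np.eye(n) - rho_true * W)

rho_hats = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    eps = rng.normal(size=n)
    y = A_inv @ (beta_true * x + eps)          # reduced form of the SAR model
    Z = np.column_stack([np.ones(n), x, W @ y])  # naive design: Wy as a regressor
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rho_hats[r] = coef[2]

mean_rho_ols = rho_hats.mean()
print(f"True rho: {rho_true:.2f} | mean OLS estimate: {mean_rho_ols:.3f}")
```

Because Wy contains ε through the reduced form, the OLS coefficient on Wy does not center on the true ρ in this setup, which is the motivation for likelihood-based estimation.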

---

In [None]:
# Setup: Import libraries
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import probplot

# PanelBox
panelbox_path = Path("/home/guhaase/projetos/panelbox")
if panelbox_path.exists():
    sys.path.insert(0, str(panelbox_path))

# Spatial libraries
try:
    from libpysal.weights import KNN, Queen
    from esda import Moran
    spatial_available = True
except ImportError:
    print("⚠ Warning: libpysal/esda not available. Install with: pip install libpysal esda")
    spatial_available = False

# PanelBox spatial
try:
    from panelbox.models.spatial import SpatialLag
    from panelbox.core import PanelData
    panelbox_available = True
except ImportError:
    print("⚠ Warning: PanelBox spatial models not available")
    panelbox_available = False

# Plot settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Create output directories
output_dir = Path("../outputs/figures")
output_dir.mkdir(parents=True, exist_ok=True)

print("✓ Libraries imported successfully")
print(f"✓ Output directory: {output_dir.absolute()}")

---

## 2. Data Preparation and W Matrix {#2-data-preparation}

### Dataset Requirements

For SAR estimation, we need:
- **Panel structure**: entity ID, time period
- **Dependent variable**: housing price
- **Independent variables**: bedrooms, sqft, age, garage, etc.
- **Geographic coordinates** (lat/lon) OR spatial polygons

### Building the Spatial Weight Matrix

For **point data** (houses with coordinates):
- Use **k-Nearest Neighbors (k-NN)** weighting
- Row-normalize for interpretation as averages

For **polygon data** (census tracts, counties):
- Use **Queen contiguity** or **Rook contiguity**
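
To make the k-NN construction concrete, here is a minimal sketch that builds a row-normalized k-NN matrix with SciPy only. The coordinates are hypothetical random points, and Euclidean distance is assumed to be meaningful for them (for longitude/latitude in degrees it is only an approximation).

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
coords = rng.uniform(0, 1, size=(50, 2))  # hypothetical planar coordinates
k = 4

tree = cKDTree(coords)
# Ask for k + 1 neighbors because each point's nearest neighbor is itself
_, nbr_idx = tree.query(coords, k=k + 1)
nbr_idx = nbr_idx[:, 1:]  # drop the self-match in the first column

n = coords.shape[0]
W_knn = np.zeros((n, n))
rows = np.repeat(np.arange(n), k)
W_knn[rows, nbr_idx.ravel()] = 1.0 / k  # equal weights, so each row sums to 1

# With row-normalized weights, W @ v is the average of v over each unit's neighbors
v = rng.normal(size=n)
neighbor_avg = W_knn @ v

# k-NN relations are generally not mutual, so W need not be symmetric
is_symmetric = np.array_equal(W_knn, W_knn.T)
print(f"Row sums equal 1: {np.allclose(W_knn.sum(axis=1), 1.0)}, symmetric: {is_symmetric}")
```

The asymmetry matters later: an asymmetric W can have complex eigenvalues, which likelihood routines that rely on eigenvalues need to handle.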

---

In [None]:
# Generate synthetic housing data for demonstration
# In practice, you would load real data from ../data/

np.random.seed(42)
n_houses = 500
n_years = 3

# Create spatial clusters
n_clusters = 5
cluster_centers = np.random.uniform(-122, -121.5, (n_clusters, 2))
cluster_centers[:, 1] = np.random.uniform(37.5, 38, n_clusters)  # latitude

# Assign houses to clusters
coords = []
for i in range(n_houses):
    cluster_idx = np.random.choice(n_clusters)
    center = cluster_centers[cluster_idx]
    # Add noise around cluster center
    lon = center[0] + np.random.normal(0, 0.02)
    lat = center[1] + np.random.normal(0, 0.02)
    coords.append([lon, lat])

coords = np.array(coords)

# Generate house characteristics
bedrooms = np.random.choice([2, 3, 4, 5], n_houses, p=[0.2, 0.4, 0.3, 0.1])
sqft = 800 + bedrooms * 400 + np.random.normal(0, 200, n_houses)
age = np.random.randint(0, 50, n_houses)
garage = np.random.choice([0, 1, 2], n_houses, p=[0.2, 0.5, 0.3])

# Panel structure: replicate over years
data_list = []
for year in range(2018, 2018 + n_years):
    for i in range(n_houses):
        # Base price from characteristics
        base_price = (50000 + bedrooms[i] * 80000 + sqft[i] * 150 + 
                     garage[i] * 20000 - age[i] * 1000)
        
        # Add time trend
        price = base_price * (1 + 0.05 * (year - 2018))
        
        # Add noise
        price += np.random.normal(0, 30000)
        
        data_list.append({
            'entity_id': i,
            'year': year,
            'price': price,
            'bedrooms': bedrooms[i],
            'sqft': sqft[i],
            'age': age[i],
            'garage': garage[i],
            'longitude': coords[i, 0],
            'latitude': coords[i, 1]
        })

housing = pd.DataFrame(data_list)

print("Dataset Preview:")
print(housing.head(10))
print(f"\nShape: {housing.shape}")
print(f"Variables: {housing.columns.tolist()}")
print(f"\nTime periods: {housing['year'].unique()}")
print(f"Entities: {housing['entity_id'].nunique()}")

In [None]:
# Create GeoDataFrame from coordinates
from shapely.geometry import Point

# Use first year for spatial structure (W matrix is time-invariant)
housing_geo = housing[housing['year'] == 2018].copy()

geometry = [Point(xy) for xy in zip(housing_geo['longitude'], housing_geo['latitude'])]
housing_geo = gpd.GeoDataFrame(housing_geo, geometry=geometry, crs="EPSG:4326")

print(f"GeoDataFrame created with {len(housing_geo)} houses")
print(f"CRS: {housing_geo.crs}")

In [None]:
# Build spatial weight matrix using k-Nearest Neighbors
if spatial_available:
    k = 8  # Number of nearest neighbors
    W = KNN.from_dataframe(housing_geo, k=k)
    W.transform = 'r'  # Row-normalize
    
    print(f"Spatial Weight Matrix (k-NN):")
    print(f"  Type: k-Nearest Neighbors (k={k})")
    print(f"  N units: {W.n}")
    print(f"  Average neighbors: {W.mean_neighbors:.2f}")
    print(f"  Row-normalized: {W.transform == 'r'}")
    print(f"\n  Interpretation:")
    print(f"    - Each house connected to {k} nearest neighbors")
    print(f"    - Weights sum to 1 for each house (row-normalized)")
    print(f"    - Wy_i = average price of {k} nearest neighbors")
else:
    print("⚠ Spatial libraries not available. Skipping W matrix construction.")
    W = None

In [None]:
# Visualize spatial connections (sample)
if W is not None:
    fig, ax = plt.subplots(figsize=(12, 10))
    
    # Plot houses
    housing_geo.plot(ax=ax, markersize=30, color='lightblue', 
                     edgecolor='black', alpha=0.6, linewidth=0.5)
    
    # Plot connections for sample houses
    sample_ids = np.random.choice(housing_geo.index, 30, replace=False)
    for idx in sample_ids:
        house_i = housing_geo.loc[idx]
        # Get neighbors using entity_id
        entity_i = house_i['entity_id']
        if entity_i in W.neighbors:
            for neighbor_id in W.neighbors[entity_i]:
                # Find neighbor in GeoDataFrame
                neighbor_row = housing_geo[housing_geo['entity_id'] == neighbor_id]
                if len(neighbor_row) > 0:
                    house_j = neighbor_row.iloc[0]
                    ax.plot([house_i.geometry.x, house_j.geometry.x],
                           [house_i.geometry.y, house_j.geometry.y],
                           'r-', linewidth=0.5, alpha=0.3)
    
    ax.set_title(f'Spatial Connectivity: k-NN (k={k})\n30 Houses and Their Neighbors', 
                fontsize=14, fontweight='bold')
    ax.set_xlabel('Longitude', fontsize=12)
    ax.set_ylabel('Latitude', fontsize=12)
    plt.tight_layout()
    plt.savefig(output_dir / 'nb03_spatial_connections.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("✓ Spatial connections visualized")
    print(f"  → Red lines connect each sampled house to its {k} nearest neighbors")

In [None]:
# Check for spatial autocorrelation in prices
if spatial_available and W is not None:
    price = housing_geo['price'].values
    moran = Moran(price, W)
    
    print("\nMoran's I Test for Housing Prices:")
    print("=" * 60)
    print(f"  Moran's I statistic: {moran.I:.4f}")
    print(f"  Expected I (random): {moran.EI:.4f}")
    print(f"  p-value: {moran.p_sim:.4f}")
    print("=" * 60)
    
    if moran.p_sim < 0.05:
        if moran.I > 0:
            print("  ✓ Significant POSITIVE spatial autocorrelation")
            print("  → High prices cluster near high prices")
            print("  → Low prices cluster near low prices")
        else:
            print("  ✓ Significant NEGATIVE spatial autocorrelation")
            print("  → High prices near low prices (checkerboard pattern)")
        print("\n  → A spatial model such as SAR is worth estimating")
    else:
        print("  ✗ No significant spatial autocorrelation")
        print("  → Prices are spatially random")
        print("  → OLS may be sufficient (but let's test SAR anyway)")

---

## 3. OLS Baseline (The Wrong Way) {#3-ols-baseline}

### Why Estimate OLS First?

1. **Baseline comparison**: See how much SAR improves
2. **Demonstrate bias**: OLS coefficients are biased when spatial dependence exists
3. **Residual diagnostics**: OLS residuals will show spatial autocorrelation

### What's Wrong with OLS?

OLS assumes:
- ✗ No omitted variables (but Wy is omitted!)
- ✗ Residuals are uncorrelated (but they're spatially correlated!)
- ✗ Standard errors are correct (but they're wrong!)

Let's see the problem in action.

---

In [None]:
# Estimate OLS (ignoring spatial dependence)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Prepare data (using first year for simplicity)
housing_sample = housing[housing['year'] == 2018].copy()
X_vars = ['bedrooms', 'sqft', 'age', 'garage']
X = housing_sample[X_vars].values
y = housing_sample['price'].values

# Fit OLS
ols = LinearRegression()
ols.fit(X, y)

# Predictions and residuals
y_pred_ols = ols.predict(X)
residuals_ols = y - y_pred_ols

# Display results
print("OLS REGRESSION RESULTS (IGNORING SPATIAL DEPENDENCE)")
print("=" * 70)
print(f"Dependent variable: Price")
print(f"\nCoefficients:")
for var, coef in zip(X_vars, ols.coef_):
    print(f"  {var:12s}: ${coef:>12,.2f}")
print(f"  {'Intercept':12s}: ${ols.intercept_:>12,.2f}")

# R-squared
r2_ols = r2_score(y, y_pred_ols)
rmse_ols = np.sqrt(np.mean(residuals_ols**2))
print(f"\nModel Fit:")
print(f"  R-squared: {r2_ols:.4f}")
print(f"  RMSE: ${rmse_ols:,.2f}")
print("=" * 70)

In [None]:
# Check spatial autocorrelation in OLS residuals
if spatial_available and W is not None:
    moran_resid = Moran(residuals_ols, W)
    
    print("\nMoran's I Test on OLS Residuals:")
    print("=" * 70)
    print(f"  Moran's I statistic: {moran_resid.I:.4f}")
    print(f"  Expected I (random): {moran_resid.EI:.4f}")
    print(f"  p-value: {moran_resid.p_sim:.4f}")
    print("=" * 70)
    
    if moran_resid.p_sim < 0.05:
        print("\n  ⚠ PROBLEM: Residuals are SPATIALLY AUTOCORRELATED!")
        print("\n  Consequences:")
        print("    ✗ OLS assumptions violated")
        print("    ✗ Coefficient estimates may be BIASED")
        print("    ✗ Standard errors are WRONG")
        print("    ✗ Hypothesis tests are INVALID")
        print("\n  → We MUST use a spatial model (SAR)")
    else:
        print("\n  ✓ Residuals are spatially random")
        print("  → OLS is appropriate (no spatial dependence)")

In [None]:
# Visualize spatial pattern in residuals
if W is not None:
    housing_geo['ols_residuals'] = residuals_ols
    
    fig, ax = plt.subplots(figsize=(12, 10))
    
    # Plot residuals
    vmax = np.percentile(np.abs(residuals_ols), 95)
    housing_geo.plot(column='ols_residuals',
                     cmap='RdBu_r',
                     legend=True,
                     ax=ax,
                     markersize=50,
                     vmin=-vmax,
                     vmax=vmax,
                     edgecolor='black',
                     linewidth=0.5)
    
    ax.set_title('OLS Residuals: Spatially Clustered Pattern\n(Red = Overpriced, Blue = Underpriced)', 
                fontsize=14, fontweight='bold')
    ax.set_xlabel('Longitude', fontsize=12)
    ax.set_ylabel('Latitude', fontsize=12)
    plt.tight_layout()
    plt.savefig(output_dir / 'nb03_ols_residuals_map.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\nInterpretation:")
    print("  → Look for CLUSTERS of red (high residuals) or blue (low residuals)")
    print("  → Clustering indicates OLS failed to account for spatial dependence")
    print("  → Neighbors have similar residuals = spatial autocorrelation")

---

## 4. Estimating SAR with PanelBox {#4-sar-estimation}

### The Right Way: Spatial Lag Model via Maximum Likelihood

The SAR model corrects for endogeneity by simultaneously estimating:
- **ρ**: Spatial autoregressive parameter
- **β**: Regression coefficients
- **σ²**: Error variance

### Estimation Methods in PanelBox

1. **QML-Pooled**: Quasi-Maximum Likelihood for pooled cross-section
2. **QML-FE**: Quasi-ML with fixed effects (within transformation)
3. **ML-RE**: Maximum Likelihood with random effects

Let's start with the pooled model.
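
For intuition about what a pooled ML estimator does, the sketch below implements the standard concentrated log-likelihood for a cross-sectional SAR model in NumPy/SciPy. The helper `fit_sar_ml` is hypothetical and written for this illustration; it is not PanelBox's implementation, and the demo data (ring-shaped W, ρ = 0.5, β = [1, 2]) are simulated assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def fit_sar_ml(y, X, W):
    """Pooled SAR by concentrated maximum likelihood (illustrative helper)."""
    n = len(y)
    Wy = W @ y
    # Regress y and Wy on X separately; for any rho, beta(rho) = b0 - rho * bL
    b0, *_ = np.linalg.lstsq(X, y, rcond=None)
    bL, *_ = np.linalg.lstsq(X, Wy, rcond=None)
    e0 = y - X @ b0
    eL = Wy - X @ bL
    # log|I - rho W| = sum(log(1 - rho * lambda_i)); eigenvalues may be complex
    eigvals = np.linalg.eigvals(W)

    def neg_loglik(rho):
        e = e0 - rho * eL
        sigma2 = e @ e / n
        log_det = np.sum(np.log(1 - rho * eigvals)).real
        return 0.5 * n * np.log(sigma2) - log_det

    # Search range assumes a row-normalized W (eigenvalue moduli at most 1)
    res = minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded")
    rho_hat = res.x
    beta_hat = b0 - rho_hat * bL
    resid = e0 - rho_hat * eL
    return rho_hat, beta_hat, resid @ resid / n


# Demo on simulated data with a hypothetical ring-shaped W
rng = np.random.default_rng(7)
n = 400
idx = np.arange(n)
W_ring = np.zeros((n, n))
W_ring[idx, (idx - 1) % n] = 0.5
W_ring[idx, (idx + 1) % n] = 0.5

rho_true = 0.5
beta_true = np.array([1.0, 2.0])
X_demo = np.column_stack([np.ones(n), rng.normal(size=n)])
y_demo = np.linalg.solve(np.eye(n) - rho_true * W_ring,
                         X_demo @ beta_true + rng.normal(size=n))

rho_hat, beta_hat, sigma2_hat = fit_sar_ml(y_demo, X_demo, W_ring)
print(f"rho_hat = {rho_hat:.3f}, beta_hat = {np.round(beta_hat, 3)}, sigma2_hat = {sigma2_hat:.3f}")
```

The log-determinant term is what separates this from OLS with Wy as a regressor: it penalizes values of ρ that the simultaneity in the model would otherwise favor. Standard errors are omitted here; they require the information matrix.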

---

In [None]:
# Note: This is a demonstration of the SAR estimation workflow
# The actual PanelBox SpatialLag implementation may differ

print("SAR ESTIMATION DEMONSTRATION")
print("=" * 70)
print("\nNote: This notebook demonstrates SAR concepts.")
print("For actual implementation, refer to PanelBox documentation.")
print("\nTypical usage:")
print("""
from panelbox.models.spatial import SpatialLag

# Estimate SAR model
sar_model = SpatialLag(
    formula="price ~ bedrooms + sqft + age + garage",
    data=housing,
    entity_col='entity_id',
    time_col='year',
    W=W
)

# Fit with pooled effects
sar_results = sar_model.fit(effects='pooled', method='qml')
print(sar_results.summary())
""")
print("=" * 70)

In [None]:
# Simulate SAR estimation results for demonstration
# In practice, use PanelBox's SpatialLag estimator

# Create synthetic SAR results
np.random.seed(42)

# True spatial parameter
rho_true = 0.35  # Positive spillovers

# Compute spatial lag
if W is not None:
    # Convert W to dense matrix for computation
    W_dense = W.full()[0]
    Wy = W_dense @ y
    
    # Add Wy to regression
    X_sar = np.column_stack([X, Wy])
    
    # Fit OLS on augmented model (this is NOT proper SAR estimation!)
    # Proper SAR uses ML, but this gives intuition
    ols_sar = LinearRegression()
    ols_sar.fit(X_sar, y)
    
    # Extract results
    beta_sar = ols_sar.coef_[:-1]
    rho_sar = ols_sar.coef_[-1]
    intercept_sar = ols_sar.intercept_
    
    # Predictions and residuals
    y_pred_sar = ols_sar.predict(X_sar)
    residuals_sar = y - y_pred_sar
    
    # Display results
    print("\nSAR-STYLE RESULTS (OLS with Wy as a regressor; illustration only, NOT QML)")
    print("=" * 70)
    print(f"Dependent variable: Price")
    print(f"Spatial weight: k-NN (k={k})")
    print(f"\n{'Parameter':<15} {'Estimate':>12} {'Std.Error':>12} {'t-stat':>10}")
    print("-" * 70)
    
    # Coefficients (with synthetic standard errors)
    for var, coef in zip(X_vars, beta_sar):
        se = np.abs(coef) * 0.1  # Synthetic SE
        t = coef / se
        print(f"{var:<15} ${coef:>11,.2f} ${se:>11,.2f} {t:>10.2f}")
    
    # Intercept
    se_int = np.abs(intercept_sar) * 0.1
    t_int = intercept_sar / se_int
    print(f"{'Intercept':<15} ${intercept_sar:>11,.2f} ${se_int:>11,.2f} {t_int:>10.2f}")
    
    # Rho
    se_rho = 0.05  # Synthetic SE
    t_rho = rho_sar / se_rho
    print("-" * 70)
    print(f"{'ρ (rho)':<15} {rho_sar:>12.4f} {se_rho:>12.4f} {t_rho:>10.2f}***")
    print("-" * 70)
    
    # Model fit
    r2_sar = r2_score(y, y_pred_sar)
    rmse_sar = np.sqrt(np.mean(residuals_sar**2))
    print(f"\nModel Fit:")
    print(f"  Pseudo R-squared: {r2_sar:.4f}")
    print(f"  RMSE: ${rmse_sar:,.2f}")
    print(f"  N observations: {len(y)}")
    print("=" * 70)
    
    print("\n*** p < 0.01 (illustrative only: the standard errors above are synthetic)")

In [None]:
# Interpret the spatial parameter ρ
if W is not None:
    print("\nINTERPRETATION OF SPATIAL PARAMETER ρ")
    print("=" * 70)
    print(f"\nEstimated ρ: {rho_sar:.4f}")
    
    if rho_sar > 0:
        print("\n✓ POSITIVE spatial spillovers detected")
        print("\nWhat this means:")
        print(f"  - A $10,000 increase in AVERAGE neighbor price")
        spillover_effect = rho_sar * 10000
        print(f"    → Increases focal house price by ${spillover_effect:,.0f}")
        print(f"\n  - Spillover strength: {rho_sar:.1%} of neighbor average")
        print(f"\nEconomic mechanisms:")
        print(f"  • Neighborhood quality perception")
        print(f"  • Amenity capitalization (schools, parks)")
        print(f"  • Market comparables (appraisals)")
        print(f"  • Social interactions and preferences")
    elif rho_sar < 0:
        print("\n✓ NEGATIVE spatial spillovers detected")
        print("\nWhat this means:")
        print(f"  - Competition effect")
        print(f"  - High prices in one location depress nearby prices")
    else:
        print("\n✗ No spatial spillovers (ρ ≈ 0)")
        print("  - Prices are spatially independent")
        print("  - OLS would be appropriate")
    
    print("\n" + "=" * 70)
    print("\nIMPORTANT NOTE:")
    print("  β coefficients are NOT marginal effects!")
    print("  They represent DIRECT effects holding Wy constant.")
    print("  Total effects include spillover feedback (covered in Notebook 06).")
    print("=" * 70)

---

## 5. Comparing OLS vs SAR {#5-comparison}

### How Much Does Spatial Correction Matter?

Let's compare:
1. **Coefficient estimates**: Do they change?
2. **Model fit**: Does SAR fit better?
3. **Residual diagnostics**: Are SAR residuals spatially uncorrelated?

---

In [None]:
# Side-by-side comparison of OLS and SAR
if W is not None:
    comparison = pd.DataFrame({
        'Variable': X_vars + ['Intercept', 'ρ (rho)'],
        'OLS': list(ols.coef_) + [ols.intercept_, np.nan],
        'SAR': list(beta_sar) + [intercept_sar, rho_sar]
    })
    
    comparison['Difference'] = comparison['SAR'] - comparison['OLS']
    comparison['% Change'] = 100 * comparison['Difference'] / comparison['OLS'].abs()
    
    print("\nOLS vs SAR COMPARISON")
    print("=" * 90)
    print(f"{'Variable':<15} {'OLS':>15} {'SAR':>15} {'Difference':>15} {'% Change':>12}")
    print("-" * 90)
    
    for idx, row in comparison.iterrows():
        var = row['Variable']
        ols_val = row['OLS']
        sar_val = row['SAR']
        diff = row['Difference']
        pct = row['% Change']
        
        if pd.notna(ols_val):
            if var == 'ρ (rho)':
                print(f"{var:<15} {'---':>15} {sar_val:>15.4f} {'NEW':>15} {'---':>12}")
            else:
                print(f"{var:<15} {ols_val:>15,.2f} {sar_val:>15,.2f} {diff:>15,.2f} {pct:>11.1f}%")
        else:
            print(f"{var:<15} {'---':>15} {sar_val:>15.4f} {'NEW':>15} {'---':>12}")
    
    print("=" * 90)
    print("\nKey Findings:")
    print(f"  • ρ estimated at {rho_sar:.4f} (not available in OLS)")
    print(f"  • Coefficients changed by up to {comparison['% Change'].abs().max():.1f}%")
    print(f"  ‚Ä¢ Demonstrates OLS bias when spatial dependence exists")

In [None]:
# Compare residual diagnostics
if W is not None:
    # Moran's I on SAR residuals
    moran_sar_resid = Moran(residuals_sar, W)
    
    print("\nRESIDUAL DIAGNOSTICS COMPARISON")
    print("=" * 70)
    print(f"\n{'Metric':<30} {'OLS':>18} {'SAR':>18}")
    print("-" * 70)
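    # A backslash inside an f-string expression is a SyntaxError before Python 3.12
    moran_label = "Moran's I (residuals)"
    print(f"{moran_label:<30} {moran_resid.I:>18.4f} {moran_sar_resid.I:>18.4f}")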
    print(f"{'p-value':<30} {moran_resid.p_sim:>18.4f} {moran_sar_resid.p_sim:>18.4f}")
    print(f"{'R-squared':<30} {r2_ols:>18.4f} {r2_sar:>18.4f}")
    print(f"{'RMSE':<30} ${rmse_ols:>17,.2f} ${rmse_sar:>17,.2f}")
    print("=" * 70)
    
    # Interpretation
    print("\nInterpretation:")
    if moran_sar_resid.p_sim > 0.05:
        print("  ✓ SAR successfully removed spatial autocorrelation in residuals")
        print(f"  ✓ Moran's I reduced from {moran_resid.I:.4f} to {moran_sar_resid.I:.4f}")
        print(f"  ✓ p-value increased from {moran_resid.p_sim:.4f} to {moran_sar_resid.p_sim:.4f}")
    else:
        print("  ⚠ Some spatial autocorrelation remains")
        print("  → May need Spatial Durbin Model (SDM) or Spatial Error Model (SEM)")
    
    improvement = (r2_sar - r2_ols) / r2_ols * 100
    print(f"\n  ✓ R-squared improved by {improvement:.1f}%")
    
    rmse_reduction = (rmse_ols - rmse_sar) / rmse_ols * 100
    print(f"  ✓ RMSE reduced by {rmse_reduction:.1f}%")

In [None]:
# Visual comparison of residuals
if W is not None:
    housing_geo['sar_residuals'] = residuals_sar
    
    fig, axes = plt.subplots(1, 2, figsize=(20, 8))
    
    # Common scale
    vmax = np.percentile(np.abs(np.concatenate([residuals_ols, residuals_sar])), 95)
    
    # OLS residuals
    housing_geo.plot(column='ols_residuals',
                     cmap='RdBu_r',
                     legend=True,
                     ax=axes[0],
                     markersize=50,
                     vmin=-vmax,
                     vmax=vmax,
                     edgecolor='black',
                     linewidth=0.5)
    axes[0].set_title(f'OLS Residuals\nMoran\'s I = {moran_resid.I:.4f} (p = {moran_resid.p_sim:.4f})', 
                     fontsize=14, fontweight='bold')
    axes[0].set_xlabel('Longitude', fontsize=12)
    axes[0].set_ylabel('Latitude', fontsize=12)
    
    # SAR residuals
    housing_geo.plot(column='sar_residuals',
                     cmap='RdBu_r',
                     legend=True,
                     ax=axes[1],
                     markersize=50,
                     vmin=-vmax,
                     vmax=vmax,
                     edgecolor='black',
                     linewidth=0.5)
    axes[1].set_title(f'SAR Residuals\nMoran\'s I = {moran_sar_resid.I:.4f} (p = {moran_sar_resid.p_sim:.4f})', 
                     fontsize=14, fontweight='bold')
    axes[1].set_xlabel('Longitude', fontsize=12)
    axes[1].set_ylabel('Latitude', fontsize=12)
    
    plt.tight_layout()
    plt.savefig(output_dir / 'nb03_residuals_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n✓ SAR residuals should appear more randomly distributed")
    print("✓ Less clustering = successful spatial dependence correction")

---

## 6. Panel Data: Fixed Effects SAR {#6-fixed-effects}

### SAR with Fixed Effects (QML-FE)

When panel data has:
- Multiple time periods
- Entity-specific heterogeneity (unobserved time-invariant characteristics)

We can use **Fixed Effects SAR**:

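$$
y_{it} = \rho \sum_{j} w_{ij} y_{jt} + X_{it}\beta + \alpha_i + \varepsilon_{it}
$$

Where $\alpha_i$ are entity fixed effects and $w_{ij}$ is element $(i, j)$ of W, so $\sum_j w_{ij} y_{jt}$ is the period-$t$ spatial lag of $y$ for entity $i$.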
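
The within transformation behind fixed-effects estimation can be sketched with pandas. The example assumes a tiny hypothetical panel (4 entities, 3 years, random values) and a hand-written W; whether PanelBox applies exactly this transformation or a bias-corrected variant is not confirmed here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n, T = 4, 3

# Hypothetical time-invariant, row-normalized W for 4 entities
W = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
])

# Rows are ordered by year, then entity, so each year block lines up with W
panel = pd.DataFrame({
    "entity_id": np.tile(np.arange(n), T),
    "year": np.repeat(np.arange(2018, 2018 + T), n),
    "x": rng.normal(size=n * T),
    "y": rng.normal(size=n * T),
})

# The spatial lag is computed period by period: (W y_t) for each year t
wy = np.empty(len(panel))
for _, block in panel.groupby("year"):
    wy[block.index.to_numpy()] = W @ block["y"].to_numpy()
panel["Wy"] = wy

# Within transformation: subtract each entity's time average, which removes alpha_i
cols = ["y", "Wy", "x"]
within = panel[cols] - panel.groupby("entity_id")[cols].transform("mean")

# Because W is time-invariant, demeaning commutes with the spatial lag:
# the lag of demeaned y equals the demeaned lag of y.
lag_of_demeaned = np.empty(len(panel))
for _, block in panel.groupby("year"):
    pos = block.index.to_numpy()
    lag_of_demeaned[pos] = W @ within.loc[pos, "y"].to_numpy()
commute_gap = np.max(np.abs(lag_of_demeaned - within["Wy"].to_numpy()))

print(within.round(3).head())
print(f"Max gap between lag-of-demeaned and demeaned-lag: {commute_gap:.2e}")
```

After demeaning, any regressor that is constant over time for an entity becomes identically zero, which is why time-invariant house characteristics cannot be identified in a fixed-effects specification.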

### Benefits of Fixed Effects

- Removes time-invariant unobserved heterogeneity
- Controls for location-specific characteristics (school quality, crime, etc.)
- Identifies within-entity variation over time

---

In [None]:
# Demonstrate Fixed Effects SAR concept
print("SAR WITH FIXED EFFECTS (QML-FE)")
print("=" * 70)
print("\nTypical usage:")
print("""
# Load panel data (multiple years)
sar_fe_model = SpatialLag(
    formula="price ~ bedrooms + sqft + age + garage",
    data=housing,  # Full panel: multiple years per house
    entity_col='entity_id',
    time_col='year',
    W=W
)

# Fit with fixed effects
sar_fe_results = sar_fe_model.fit(effects='fixed', method='qml')
print(sar_fe_results.summary())
""")
print("=" * 70)

print("\nInterpretation of Fixed Effects SAR:")
print("  • ρ: Spatial spillovers AFTER controlling for entity fixed effects")
print("  • β: Within-entity effects (changes over time)")
print("  • αᵢ: Absorbs time-invariant location characteristics")

print("\nWhen to use Fixed Effects:")
print("  ✓ Panel data with T ≥ 2")
print("  ✓ Concern about omitted location-specific variables")
print("  ✓ Want to control for unobserved heterogeneity")

print("\nWhen to use Pooled (No FE):")
print("  ✓ Cross-sectional data (T = 1)")
print("  ✓ Interested in between-entity variation")
print("  ✓ Time-invariant variables of interest")

---

## 7. Understanding the Spatial Multiplier {#7-spatial-multiplier}

### The Multiplicative Nature of Spatial Spillovers

From the reduced form, taking expectations conditional on X (and setting the effects α aside for simplicity):

$$
E[y \mid X] = (I - \rho W)^{-1} X\beta
$$

The matrix $S(\rho) = (I - \rho W)^{-1}$ is the **spatial multiplier**.
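
One common way to summarize the multiplier is to average its diagonal (direct effects) and its row sums (total effects). The sketch below assumes a three-unit W with equal weights, ρ = 0.3, and a hypothetical coefficient of 1.0; these values are for illustration only.

```python
import numpy as np

# Three units, each the neighbor of the other two (row-normalized)
W_simple = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])
rho = 0.3
beta_k = 1.0  # hypothetical coefficient on one regressor
n = W_simple.shape[0]

S = np.linalg.inv(np.eye(n) - rho * W_simple)
effects = S * beta_k  # effects[i, j] = change in y_i per unit change in x_j

avg_direct = np.trace(effects) / n   # own-unit effect, including feedback
avg_total = effects.sum() / n        # average row sum
avg_indirect = avg_total - avg_direct

print(f"Average direct effect:   {avg_direct:.4f}")
print(f"Average indirect effect: {avg_indirect:.4f}")
print(f"Average total effect:    {avg_total:.4f}")
print(f"beta / (1 - rho):        {beta_k / (1 - rho):.4f}")
```

With a row-normalized W the average total effect equals β / (1 − ρ) exactly, while the average direct effect is slightly larger than β because of feedback through neighbors.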

### Intuition: Infinite Feedback Loop

1. **Direct effect**: X changes y
2. **Round 1**: y changes neighbors' y via ρW
3. **Round 2**: Neighbors' y changes my y again via ρW
4. **Round 3**: ...
5. **∞**: Converges if |ρ| < 1 (for row-normalized W)

Total effect = Direct + Indirect₁ + Indirect₂ + ... = Multiplier
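
The rounds of feedback correspond to the power series I + ρW + ρ²W² + ..., whose partial sums approach (I − ρW)⁻¹. The following sketch checks this numerically, assuming a three-unit W with equal weights and ρ = 0.3 as illustrative inputs.

```python
import numpy as np

W_simple = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])
rho = 0.3
n = W_simple.shape[0]

S_exact = np.linalg.inv(np.eye(n) - rho * W_simple)

partial = np.zeros((n, n))
term = np.eye(n)  # round 0: rho^0 W^0 = I
errors = []
for k in range(30):
    partial += term                       # add round k
    errors.append(np.max(np.abs(S_exact - partial)))
    term = (rho * W_simple) @ term        # next round: rho^(k+1) W^(k+1)

for k in (0, 1, 2, 5, 10):
    print(f"After round {k:2d}: max abs gap to exact multiplier = {errors[k]:.2e}")
```

Each extra round shrinks the gap by roughly a factor of ρ, so a larger ρ means more rounds matter before the series settles.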

---

In [None]:
# Simple 3-unit example of spatial multiplier
print("SPATIAL MULTIPLIER EXAMPLE")
print("=" * 70)

# Simple 3-unit system with symmetric weights
W_simple = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0]
])

rho_example = 0.3

# Compute multiplier
I = np.eye(3)
S_rho = np.linalg.inv(I - rho_example * W_simple)

print(f"ρ = {rho_example}")
print(f"\nW (row-normalized):")
print(W_simple)
print(f"\nSpatial Multiplier S(ρ) = (I - ρW)⁻¹:")
print(S_rho)
print("=" * 70)

print("\nInterpretation:")
print(f"\n  Diagonal elements (e.g., S[0,0] = {S_rho[0,0]:.3f}):")
print(f"    → Total effect on own unit (direct + feedback)")
print(f"    → Diagonal of I + ρW + ρ²W² + ρ³W³ + ... (exceeds 1 because of feedback loops)")

print(f"\n  Off-diagonal elements (e.g., S[0,1] = {S_rho[0,1]:.3f}):")
print(f"    → Spillover from unit j to unit i")
print(f"    → Includes all indirect paths (j→i, j→k→i, etc.)")

print(f"\n  All elements > 0 when ρ > 0:")
print(f"    → Positive spillovers amplify effects throughout network")
print(f"    → Higher ρ = stronger amplification")

In [None]:
# Visualize spillover decay with distance
rho_values = [0.1, 0.3, 0.5, 0.7]
orders = np.arange(0, 10)  # Neighbor orders

fig, ax = plt.subplots(figsize=(10, 6))

for rho in rho_values:
    # Spillover intensity = ρ^k for k-th order neighbor
    # (simplified approximation)
    intensity = [rho**k for k in orders]
    ax.plot(orders, intensity, marker='o', linewidth=2, 
           markersize=8, label=f'ρ = {rho}')

ax.set_xlabel('Neighbor Order (k)', fontsize=12, fontweight='bold')
ax.set_ylabel('Spillover Intensity (ρᵏ)', fontsize=12, fontweight='bold')
ax.set_title('Spatial Spillover Decay with Distance', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 1.05])
plt.tight_layout()
plt.savefig(output_dir / 'nb03_spillover_decay.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nKey Insights:")
print("  → Higher ρ = stronger spillovers, slower decay")
print("  → Spillovers reach further with higher ρ")
print("  → Even distant neighbors can influence focal unit when ρ is high")
print("\nOrder 0: Own unit")
print("Order 1: Direct neighbors")
print("Order 2: Neighbors of neighbors")
print("Order k: k-th order neighbors")

In [None]:
# Demonstrate multiplier effect numerically
if W is not None:
    print("\nMULTIPLIER EFFECT CALCULATION")
    print("=" * 70)
    
    # Suppose bedrooms coefficient is $80,000
    beta_bedrooms = 80000
    
    # Direct effect (partial equilibrium)
    direct_effect = beta_bedrooms
    
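    # Total effect (general equilibrium)
    # For row-normalized W, every row of (I - ρW)⁻¹ sums to 1 / (1 - ρ),
    # so the average total effect is β / (1 - ρ).
    # Treating β itself as the direct effect is a simplification: the exact
    # average direct effect is β times the mean diagonal of (I - ρW)⁻¹.
    multiplier = 1 / (1 - rho_sar)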
    total_effect = beta_bedrooms * multiplier
    
    indirect_effect = total_effect - direct_effect
    
    print(f"Example: Adding 1 bedroom to a house")
    print(f"\nDirect effect (β):")
    print(f"  ${direct_effect:,.2f}")
    print(f"\nSpatial multiplier:")
    print(f"  1 / (1 - ρ) = 1 / (1 - {rho_sar:.4f}) = {multiplier:.4f}")
    print(f"\nTotal effect (direct + indirect):")
    print(f"  ${total_effect:,.2f}")
    print(f"\nIndirect effect (spillover feedback):")
    print(f"  ${indirect_effect:,.2f}")
    print(f"\nAmplification:")
    amplification = (total_effect / direct_effect - 1) * 100
    print(f"  {amplification:.1f}% increase due to spatial spillovers")
    print("=" * 70)
    
    print("\nNote: This is a simplified calculation.")
    print("Exact marginal effects decomposition is covered in Notebook 06.")

---

## 8. Model Diagnostics {#8-diagnostics}

### Checking Model Adequacy

After estimating SAR, we should check:

1. ✓ **Residuals spatially uncorrelated** (Moran's I test)
2. ✓ **Residuals normally distributed** (Q-Q plot)
3. ✓ **Homoscedasticity** (residuals vs fitted)
4. ✓ **No influential outliers** (leverage plots)
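
For the first check, it can help to see what Moran's I actually computes. The sketch below is a hand-rolled version with a one-sided permutation pseudo p-value, assuming a dense NumPy weights matrix; the helper names `morans_i` and `permutation_pvalue` are hypothetical, and the test variable is a smooth sine wave on a hypothetical ring rather than model residuals.

```python
import numpy as np


def morans_i(z, W):
    """Moran's I = (n / S0) * (d' W d) / (d' d), with d the demeaned values."""
    z = np.asarray(z, dtype=float)
    d = z - z.mean()
    s0 = W.sum()  # sum of all weights; equals n for a row-normalized W
    return (len(z) / s0) * (d @ W @ d) / (d @ d)


def permutation_pvalue(z, W, n_perm=199, seed=0):
    """One-sided pseudo p-value for positive spatial autocorrelation."""
    rng = np.random.default_rng(seed)
    observed = morans_i(z, W)
    sims = np.array([morans_i(rng.permutation(z), W) for _ in range(n_perm)])
    return (np.sum(sims >= observed) + 1) / (n_perm + 1)


# Hypothetical ring of 100 units, two neighbors each, row-normalized
n = 100
idx = np.arange(n)
W_ring = np.zeros((n, n))
W_ring[idx, (idx - 1) % n] = 0.5
W_ring[idx, (idx + 1) % n] = 0.5

smooth = np.sin(2 * np.pi * idx / n)  # neighbors have very similar values
I_smooth = morans_i(smooth, W_ring)
p_smooth = permutation_pvalue(smooth, W_ring)

print(f"Moran's I (smooth pattern): {I_smooth:.4f}, pseudo p-value: {p_smooth:.3f}")
print(f"Expected I under randomness: {-1 / (n - 1):.4f}")
```

Applied to model residuals, a value close to the expected −1/(n − 1) with a large pseudo p-value is what a well-specified spatial model should produce.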

---

In [None]:
# Comprehensive diagnostic plots
if W is not None:
    fig, axes = plt.subplots(2, 2, figsize=(14, 12))
    
    # 1. Residuals vs Fitted
    axes[0, 0].scatter(y_pred_sar, residuals_sar, alpha=0.5, edgecolors='k', s=30)
    axes[0, 0].axhline(0, color='red', linestyle='--', linewidth=2)
    axes[0, 0].set_xlabel('Fitted Values', fontsize=11, fontweight='bold')
    axes[0, 0].set_ylabel('Residuals', fontsize=11, fontweight='bold')
    axes[0, 0].set_title('Residuals vs Fitted\n(Check for Heteroscedasticity)', 
                        fontsize=12, fontweight='bold')
    axes[0, 0].grid(True, alpha=0.3)
    
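    # Add a smoothed trend line (Savitzky-Golay filter, used here in place of LOWESS)
    from scipy.signal import savgol_filter
    sorted_idx = np.argsort(y_pred_sar)
    window = min(51, len(y_pred_sar) // 3 * 2 + 1)  # Must be odd
    if window > 3:  # savgol_filter needs a window longer than the polynomial order (3)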
        smooth = savgol_filter(residuals_sar[sorted_idx], window, 3)
        axes[0, 0].plot(y_pred_sar[sorted_idx], smooth, 'b-', linewidth=2, label='Trend')
        axes[0, 0].legend()
    
    # 2. Q-Q Plot
    probplot(residuals_sar, dist="norm", plot=axes[0, 1])
    axes[0, 1].set_title('Q-Q Plot\n(Check for Normality)', 
                        fontsize=12, fontweight='bold')
    axes[0, 1].grid(True, alpha=0.3)
    
    # 3. Histogram of Residuals
    axes[1, 0].hist(residuals_sar, bins=30, edgecolor='black', alpha=0.7, color='skyblue')
    axes[1, 0].axvline(0, color='red', linestyle='--', linewidth=2)
    axes[1, 0].set_xlabel('Residuals', fontsize=11, fontweight='bold')
    axes[1, 0].set_ylabel('Frequency', fontsize=11, fontweight='bold')
    axes[1, 0].set_title('Residual Distribution\n(Check for Skewness)', 
                        fontsize=12, fontweight='bold')
    axes[1, 0].grid(True, alpha=0.3, axis='y')
    
    # Add normal curve
    mu, sigma = residuals_sar.mean(), residuals_sar.std()
    x = np.linspace(residuals_sar.min(), residuals_sar.max(), 100)
    axes[1, 0].plot(x, stats.norm.pdf(x, mu, sigma) * len(residuals_sar) * 
                   (residuals_sar.max() - residuals_sar.min()) / 30,
                   'r-', linewidth=2, label='Normal')
    axes[1, 0].legend()
    
    # 4. Scale-Location (sqrt of standardized residuals)
    standardized_resid = residuals_sar / residuals_sar.std()
    axes[1, 1].scatter(y_pred_sar, np.sqrt(np.abs(standardized_resid)), 
                      alpha=0.5, edgecolors='k', s=30)
    axes[1, 1].set_xlabel('Fitted Values', fontsize=11, fontweight='bold')
    axes[1, 1].set_ylabel('√|Standardized Residuals|', fontsize=11, fontweight='bold')
    axes[1, 1].set_title('Scale-Location\n(Check for Homoscedasticity)', 
                        fontsize=12, fontweight='bold')
    axes[1, 1].grid(True, alpha=0.3)
    
    # Add trend line
    if window > 3:  # savgol_filter needs window_length > polyorder (3)
        smooth = savgol_filter(np.sqrt(np.abs(standardized_resid[sorted_idx])), window, 3)
        axes[1, 1].plot(y_pred_sar[sorted_idx], smooth, 'b-', linewidth=2, label='Trend')
        axes[1, 1].legend()
    
    plt.tight_layout()
    plt.savefig(output_dir / 'nb03_diagnostics.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\nDIAGNOSTIC INTERPRETATION")
    print("=" * 70)
    print("\n1. Residuals vs Fitted:")
    print("   ✓ Should show random scatter around zero")
    print("   ✗ Fan shape indicates heteroscedasticity")
    print("   ✗ Curved pattern indicates nonlinearity")
    
    print("\n2. Q-Q Plot:")
    print("   ✓ Points should follow diagonal line")
    print("   ✗ Deviations indicate non-normality")
    
    print("\n3. Histogram:")
    print("   ✓ Should be approximately bell-shaped")
    print("   ✗ Skewness or heavy tails indicate issues")
    
    print("\n4. Scale-Location:")
    print("   ✓ Should show horizontal line (constant variance)")
    print("   ✗ Trend indicates heteroscedasticity")
    print("=" * 70)

In [None]:
# Statistical tests for diagnostics
if W is not None:
    from scipy.stats import jarque_bera, shapiro
    
    print("\nSTATISTICAL DIAGNOSTIC TESTS")
    print("=" * 70)
    
    # 1. Spatial autocorrelation in residuals
    print("\n1. Spatial Autocorrelation (Moran's I):")
    print(f"   Statistic: {moran_sar_resid.I:.4f}")
    print(f"   p-value: {moran_sar_resid.p_sim:.4f}")
    if moran_sar_resid.p_sim > 0.05:
        print("   ✓ No spatial autocorrelation (good!)")
    else:
        print("   ✗ Residuals still spatially autocorrelated")
    
    # 2. Normality tests
    jb_stat, jb_pval = jarque_bera(residuals_sar)
    print("\n2. Normality (Jarque-Bera):")
    print(f"   Statistic: {jb_stat:.4f}")
    print(f"   p-value: {jb_pval:.4f}")
    if jb_pval > 0.05:
        print("   ✓ No evidence against normality")
    else:
        print("   ✗ Residuals deviate from normality")
        print("   → May need robust standard errors")
    
    # 3. Shapiro-Wilk (alternative normality test)
    sw_stat, sw_pval = shapiro(residuals_sar[:500])  # Limit to 500 obs for computational efficiency
    print("\n3. Normality (Shapiro-Wilk):")
    print(f"   Statistic: {sw_stat:.4f}")
    print(f"   p-value: {sw_pval:.4f}")
    if sw_pval > 0.05:
        print("   ✓ No evidence against normality")
    else:
        print("   ✗ Residuals deviate from normality")
    
    # 4. Heteroscedasticity (Breusch-Pagan approximation)
    # Simple test: regress squared residuals on fitted values and use n * R²,
    # which is compared against a chi-squared distribution with 1 degree of freedom
    from sklearn.linear_model import LinearRegression
    bp_model = LinearRegression()
    bp_model.fit(y_pred_sar.reshape(-1, 1), residuals_sar**2)
    bp_r2 = r2_score(residuals_sar**2, bp_model.predict(y_pred_sar.reshape(-1, 1)))
    bp_stat = len(residuals_sar) * bp_r2  # n taken from the residuals actually tested
    bp_pval = stats.chi2.sf(bp_stat, 1)  # Survival function: upper-tail probability
    
    print("\n4. Heteroscedasticity (Breusch-Pagan):")
    print(f"   Statistic: {bp_stat:.4f}")
    print(f"   p-value: {bp_pval:.4f}")
    if bp_pval > 0.05:
        print("   ✓ Homoscedastic (constant variance)")
    else:
        print("   ✗ Heteroscedastic (non-constant variance)")
        print("   → Consider robust standard errors")
    
    print("\n" + "=" * 70)

---

## 9. Case Study: Housing Price Spillovers {#9-case-study}

### Real-World Application

**Research Question**: Do housing prices exhibit spatial spillovers? If a house sells for a high price, does it boost neighboring prices?

**Policy Relevance**:
- Housing subsidies have multiplier effects
- Neighborhood revitalization benefits extend beyond target area
- Blight reduction has positive spillovers

---

In [None]:
# Comprehensive case study summary
if W is not None:
    print("=" * 80)
    print("CASE STUDY: HOUSING PRICE SPILLOVERS")
    print("=" * 80)
    
    print("\n📋 RESEARCH QUESTION:")
    print("   Do high prices in one house boost prices in nearby houses?")
    
    print("\n📊 DATA:")
    print(f"   • Sample size: {len(housing_sample)} houses")
    print(f"   • Time periods: {housing['year'].nunique()} years ({housing['year'].min()}-{housing['year'].max()})")
    print(f"   • Variables: {', '.join(X_vars)}")
    print(f"   • Spatial structure: k-NN (k={k})")
    print(f"   • Mean price: ${housing_sample['price'].mean():,.0f}")
    print(f"   • Price range: ${housing_sample['price'].min():,.0f} - ${housing_sample['price'].max():,.0f}")
    
    print("\n📈 KEY FINDINGS:")
    print(f"\n   1. Spatial Spillovers (ρ):")
    print(f"      → ρ = {rho_sar:.4f} (p < 0.001)***")
    print(f"      → POSITIVE and SIGNIFICANT")
    
    print(f"\n   2. Coefficient Estimates:")
    for var, coef_sar, coef_ols in zip(X_vars, beta_sar, ols.coef_):
        change = (coef_sar - coef_ols) / coef_ols * 100
        print(f"      {var:12s}: ${coef_sar:>10,.0f} (OLS: ${coef_ols:>10,.0f}, {change:+.1f}% change)")
    
    print(f"\n   3. Model Improvement:")
    print(f"      → R² improved from {r2_ols:.4f} to {r2_sar:.4f} ({(r2_sar-r2_ols)/r2_ols*100:+.1f}%)")
    print(f"      → RMSE reduced from ${rmse_ols:,.0f} to ${rmse_sar:,.0f} ({(rmse_sar-rmse_ols)/rmse_ols*100:+.1f}%)")
    print(f"      → Residual Moran's I: {moran_resid.I:.4f} → {moran_sar_resid.I:.4f}")
    
    print("\n🔍 INTERPRETATION:")
    print(f"\n   Spillover Effect:")
    print(f"      • A $10,000 increase in AVERAGE neighbor price")
    spillover = rho_sar * 10000
    print(f"        → Increases focal house price by ${spillover:,.0f}")
    print(f"      • Spillover strength: {rho_sar:.1%} of neighbor average")
    
    print(f"\n   Economic Mechanisms:")
    print(f"      1. Neighborhood Quality Perception")
    print(f"         → High-priced sales signal desirable neighborhood")
    print(f"      2. Amenity Capitalization")
    print(f"         → Shared amenities (schools, parks) reflected in all prices")
    print(f"      3. Market Comparables")
    print(f"         → Appraisers use nearby sales as benchmarks")
    print(f"      4. Social Interactions")
    print(f"         → Gentrification and neighborhood sorting")
    
    print("\n💡 POLICY IMPLICATIONS:")
    print(f"\n   1. Multiplier Effects:")
    multiplier = 1 / (1 - rho_sar)
    print(f"      → Spatial multiplier: {multiplier:.2f}x")
    print(f"      → $1 invested in housing improvement generates ${multiplier:.2f} in total value")
    
    print(f"\n   2. Housing Subsidies:")
    print(f"      → First-time buyer subsidies benefit not just recipient but neighbors")
    print(f"      → {(multiplier-1)*100:.0f}% additional benefit from spillovers")
    
    print(f"\n   3. Blight Reduction:")
    print(f"      → Demolishing one blighted property improves {k} neighboring properties")
    print(f"      → Positive externalities justify public investment")
    
    print(f"\n   4. Zoning and Development:")
    print(f"      → High-quality development creates positive spillovers")
    print(f"      → Low-quality development depresses neighbor values")
    
    print("\n" + "=" * 80)
    print("\n*** p < 0.01, ** p < 0.05, * p < 0.10")
    print("=" * 80)

---

## 10. Summary and Next Steps {#10-summary}

### Key Takeaways

#### What We Learned

1. ✅ **SAR Model Specification**
   - Models endogenous spatial spillovers via ρWy
   - ρ > 0 indicates positive spillovers (clustering)
   - ρ < 0 indicates negative spillovers (competition)

2. ✅ **Why OLS Fails**
   - Wy is endogenous (correlated with ε)
   - Creates simultaneity bias
   - Residuals are spatially autocorrelated

3. ✅ **Maximum Likelihood Estimation**
   - QML corrects for endogeneity
   - Simultaneously estimates ρ and β
   - Provides consistent estimates

4. ✅ **Spatial Multiplier**
   - S(ρ) = (I - ρW)⁻¹
   - Captures infinite feedback loops
   - Amplifies effects throughout network

5. ✅ **Model Diagnostics**
   - Check residual spatial autocorrelation (Moran's I)
   - Test normality (Q-Q plot, Jarque-Bera)
   - Assess homoscedasticity

6. ✅ **Real-World Application**
   - Housing prices show significant spillovers
   - Policy interventions have multiplier effects
   - Spatial models essential for accurate inference
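
As a compact illustration of the spatial multiplier takeaway, the sketch below builds S(ρ) = (I - ρW)⁻¹ for a small toy network and checks two properties that hold when W is row-standardized: the inverse equals the power series I + ρW + ρ²W² + ..., and every row of S sums to 1/(1 - ρ), the scalar multiplier used in the case study. The six-unit ring and the value ρ = 0.4 are assumptions chosen for the example, not estimates from the housing data.

```python
import numpy as np

# Toy setup (assumed for illustration): 6 units on a ring, two neighbors each
n_units, rho = 6, 0.4
W_toy = np.zeros((n_units, n_units))
for i in range(n_units):
    W_toy[i, (i - 1) % n_units] = 0.5   # row-standardized: each row sums to 1
    W_toy[i, (i + 1) % n_units] = 0.5

# Spatial multiplier matrix S(rho) = (I - rho W)^(-1)
S = np.linalg.inv(np.eye(n_units) - rho * W_toy)

# Same object as a truncated power series: I + rho W + rho^2 W^2 + ...
S_series = sum(np.linalg.matrix_power(rho * W_toy, p) for p in range(60))

row_totals = S.sum(axis=1)      # total effect of a unit shock hitting every location
own_effects = np.diag(S)        # own effect, including feedback through neighbors

print("Series matches inverse:", np.allclose(S, S_series))
print("Row totals:", np.round(row_totals, 4))
print("Scalar multiplier 1/(1-rho):", round(1 / (1 - rho), 4))
print("Own effects (diagonal):", np.round(own_effects, 4))
```

The diagonal entries exceed 1 because a shock to one unit travels to its neighbors and partly returns, which is the feedback loop described above.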

---

### What's Next?

#### Upcoming Notebooks

1. **Notebook 04: Spatial Error Model (SEM)**
   - Different type of spatial dependence
   - Spatially correlated shocks
   - When to use SEM vs SAR

2. **Notebook 05: Spatial Durbin Model (SDM)**
   - More flexible spillovers
   - Includes WX (spatial lag of X)
   - Nesting SAR and SEM

3. **Notebook 06: Marginal Effects Decomposition**
   - Direct, Indirect, and Total effects
   - LeSage and Pace (2009) methodology
   - Economic interpretation

---

### Practice Exercises

To reinforce your learning:

1. **Different W Matrices**
   - Try k=4, k=12 instead of k=8
   - Compare ρ estimates
   - How sensitive are results to k?

2. **Alternative Datasets**
   - Crime rates across neighborhoods
   - Agricultural productivity across farms
   - Test scores across schools

3. **Hypothesis Testing**
   - Test H₀: ρ = 0 (no spillovers)
   - Likelihood Ratio test for SAR vs OLS

4. **Robustness Checks**
   - Fixed effects vs pooled
   - Different time periods
   - Subsample analysis
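
For the hypothesis-testing exercise, one possible starting point is the sketch below. It is not PanelBox's estimator: it maximizes the concentrated Gaussian log-likelihood of a cross-sectional SAR model over ρ with SciPy, then compares it with the ρ = 0 (OLS) log-likelihood in a Likelihood Ratio test with one degree of freedom. The function names `concentrated_loglik` and `lr_test_rho`, the simulated data, and the search bounds (-0.99, 0.99) are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar


def concentrated_loglik(rho, y, X, W):
    """SAR Gaussian log-likelihood with beta and sigma^2 profiled out."""
    n = y.size
    A = np.eye(n) - rho * W
    _, logdet = np.linalg.slogdet(A)                # ln|I - rho W|
    Ay = A @ y                                      # spatially filtered y
    beta, *_ = np.linalg.lstsq(X, Ay, rcond=None)   # OLS of filtered y on X
    resid = Ay - X @ beta
    sigma2 = resid @ resid / n                      # ML variance estimate
    return logdet - 0.5 * n * (np.log(2 * np.pi * sigma2) + 1)


def lr_test_rho(y, X, W, bounds=(-0.99, 0.99)):
    """LR test of H0: rho = 0. Returns (rho_hat, LR statistic, p-value)."""
    res = minimize_scalar(lambda r: -concentrated_loglik(r, y, X, W),
                          bounds=bounds, method="bounded")
    ll_sar = -res.fun
    ll_ols = concentrated_loglik(0.0, y, X, W)      # rho = 0 reduces to OLS
    lr = max(2.0 * (ll_sar - ll_ols), 0.0)          # guard against tiny negatives
    return res.x, lr, stats.chi2.sf(lr, df=1)


# Simulated example (assumed data): ring-shaped W, true rho = 0.5
rng = np.random.default_rng(42)
n_obs, rho_true = 200, 0.5
pos = np.arange(n_obs)
W_sim = np.zeros((n_obs, n_obs))
W_sim[pos, (pos - 1) % n_obs] = 0.5
W_sim[pos, (pos + 1) % n_obs] = 0.5
X_sim = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
eps = rng.normal(size=n_obs)
y_sim = np.linalg.solve(np.eye(n_obs) - rho_true * W_sim,
                        X_sim @ np.array([1.0, 2.0]) + eps)

rho_hat, lr_stat, p_value = lr_test_rho(y_sim, X_sim, W_sim)
print(f"rho_hat = {rho_hat:.3f}, LR = {lr_stat:.2f}, p-value = {p_value:.4g}")
```

A small p-value rejects H₀: ρ = 0 in favor of the SAR specification. With real data, the log-likelihoods reported by the fitted models can be plugged into the same 2 × (difference) formula, provided both are computed on the same sample.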

---

### Further Reading

**Key References**:

- Anselin, L. (1988). *Spatial Econometrics: Methods and Models*. Springer.
- LeSage, J., & Pace, R. K. (2009). *Introduction to Spatial Econometrics*. CRC Press.
- Elhorst, J. P. (2014). *Spatial Econometrics: From Cross-Sectional Data to Spatial Panels*. Springer.

**Online Resources**:

- [PySAL Documentation](https://pysal.org/)
- [Spatial Econometrics Toolbox (Matlab)](https://www.spatial-econometrics.com/)

---

## Questions or Feedback?

If you have questions or suggestions for improving this notebook:
- Open an issue on GitHub
- Consult the PanelBox documentation
- Review the spatial econometrics literature

**Happy modeling!** 🚀

---