# Introduction to Spatial Econometrics

**Duration**: 60-90 minutes  
**Level**: Beginner  
**Prerequisites**: Basic panel data econometrics, OLS regression, Python fundamentals

---

## Learning Objectives

After completing this notebook, you will be able to:

1. **Explain** why OLS is inappropriate when spatial dependence exists
2. **Identify** spatial patterns visually using choropleth maps
3. **Compute** Moran's I statistic and interpret its significance
4. **Describe** what a spatial weight matrix represents
5. **Recognize** when spatial econometric methods are needed
6. **Distinguish** between positive and negative spatial autocorrelation
7. **Create** basic spatial visualizations with GeoPandas

---

## Setup and Package Verification

First, let's verify that all required packages are installed and import the necessary libraries.

In [None]:
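# Check required packages
import importlib
import sys

# Import name -> minimum version (informational; only importability is checked below)
required_packages = {
    'pandas': '>=1.3.0',
    'numpy': '>=1.21.0',
    'scipy': 'any',           # used later for gaussian_filter and sparse diags
    'matplotlib': '>=3.6.0',  # the 'seaborn-v0_8-darkgrid' style name needs 3.6+
    'seaborn': '>=0.11.0',
    'geopandas': '>=0.10.0',
    'libpysal': '>=4.6.0',
    'esda': 'any',            # provides Moran's I
    'sklearn': 'any',         # scikit-learn, used later for LinearRegression
}

print("Checking required packages...\n")
all_installed = True

for package, version in required_packages.items():
    try:
        # importlib avoids running arbitrary strings through exec()
        importlib.import_module(package)
        print(f"✓ {package} installed")
    except ImportError:
        pip_name = 'scikit-learn' if package == 'sklearn' else package
        print(f"✗ {package} NOT installed - run: pip install {pip_name}")
        all_installed = False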

if all_installed:
    print("\n✓ All required packages are installed!")
else:
    print("\n✗ Please install missing packages before proceeding.")

In [None]:
# Import libraries
from pathlib import Path
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd

# PySAL libraries
import libpysal
from libpysal.weights import Queen
from esda.moran import Moran

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

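# Set up paths (machine-specific: point this at your own PanelBox checkout)
# `sys` was imported in the package-check cell above
panelbox_path = Path("/home/guhaase/projetos/panelbox")
sys.path.insert(0, str(panelbox_path))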

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['font.size'] = 10

# Set random seed for reproducibility
np.random.seed(42)

print("✓ Libraries imported successfully!")
print(f"✓ Working directory: {Path.cwd()}")

In [None]:
# Verify PanelBox accessibility (for future notebooks)
try:
    from panelbox.models.spatial import SpatialLag
    print("✓ PanelBox spatial models accessible")
    print("  (Note: We won't use PanelBox models in this introductory notebook)")
except ImportError as e:
    print("✗ PanelBox not found - check path configuration")
    print(f"  Error: {e}")

---

# 1. Introduction and Motivation

## Why Spatial Econometrics?

### Tobler's First Law of Geography (1970)

> "Everything is related to everything else, but near things are more related than distant things."

This fundamental principle of geography has profound implications for econometric analysis. When we analyze cross-sectional or panel data that has a spatial dimension (e.g., countries, states, counties, cities), we often find that observations are **not independent**.

### Real-World Examples of Spatial Dependence

Spatial dependence matters in many economic phenomena:

1. **Regional Economic Growth**: Technological spillovers and knowledge diffusion across neighboring regions
2. **Housing Prices**: Neighborhood effects and local amenities affect nearby property values
3. **Crime Rates**: Criminal activity often exhibits spatial diffusion patterns
4. **Disease Spread**: Epidemics follow geographic contagion processes
5. **Technology Adoption**: Firms imitate nearby competitors' innovations
6. **Agricultural Productivity**: Pest infestations and weather shocks cross borders
7. **Environmental Quality**: Air and water pollution don't respect political boundaries

### The Problem with Standard Methods

**Standard econometric methods (like OLS) assume observations are independent:**
- E(εᵢεⱼ) = 0 for all i ≠ j
- This assumption is **violated** when spatial correlation exists

**Consequences of ignoring spatial dependence:**
1. **Biased parameter estimates** (if spatial lag of y is omitted)
2. **Inefficient estimates** (if spatial correlation in errors)
3. **Incorrect standard errors** (usually underestimated)
4. **Invalid hypothesis tests** (t-statistics and F-statistics are wrong)
5. **Wrong policy conclusions** (incorrect inference about causal effects)

**The solution:** Spatial econometric models that explicitly account for geographic relationships between observations.

### Visualization: Spatial Patterns

Let's visualize the difference between random spatial patterns and spatially correlated patterns.

In [None]:
# Create synthetic spatial data to illustrate the concept
# We'll create two 10x10 grids: one random, one spatially correlated

grid_size = 10

# Random pattern (no spatial correlation)
random_pattern = np.random.randn(grid_size, grid_size)

# Spatially correlated pattern (using spatial smoothing)
from scipy.ndimage import gaussian_filter
correlated_pattern = gaussian_filter(np.random.randn(grid_size, grid_size), sigma=2)

# Create side-by-side comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Random pattern
im1 = axes[0].imshow(random_pattern, cmap='RdYlBu_r', interpolation='nearest')
axes[0].set_title('Random Spatial Pattern\n(No Spatial Correlation)', 
                   fontsize=14, fontweight='bold')
axes[0].axis('off')
plt.colorbar(im1, ax=axes[0], fraction=0.046, pad=0.04)

# Correlated pattern
im2 = axes[1].imshow(correlated_pattern, cmap='RdYlBu_r', interpolation='nearest')
axes[1].set_title('Spatially Clustered Pattern\n(High Spatial Correlation)', 
                   fontsize=14, fontweight='bold')
axes[1].axis('off')
plt.colorbar(im2, ax=axes[1], fraction=0.046, pad=0.04)

plt.tight_layout()
plt.show()

print("Interpretation:")
print("  Left: Values are randomly scattered with no pattern")
print("  Right: High values cluster together (red), low values cluster together (blue)")
print("         → This is positive spatial autocorrelation!")

**Key Takeaway:**

In the real world, economic variables often exhibit patterns like the right panel—nearby regions tend to have similar values. This violates the independence assumption of OLS and requires spatial econometric methods.

---

# 2. What is Spatial Dependence?

## Understanding Spatial Autocorrelation

**Spatial autocorrelation** (or spatial dependence) is the correlation of a variable with itself across space.

### Types of Spatial Autocorrelation

1. **Positive Spatial Autocorrelation** (most common):
   - High values tend to be located near other high values
   - Low values tend to be located near other low values
   - Results in **clustering** or **agglomeration**
   - Example: Rich neighborhoods near rich neighborhoods

2. **Negative Spatial Autocorrelation** (rare):
   - High values tend to be located near low values
   - Results in a **checkerboard pattern**
   - Example: Competitive retail locations avoiding clusters

3. **No Spatial Autocorrelation** (random):
   - No systematic spatial pattern
   - Values are randomly distributed across space

### Difference from Temporal Autocorrelation

| Feature | Temporal Autocorrelation | Spatial Autocorrelation |
|---------|-------------------------|-------------------------|
| **Direction** | One-directional (past → present → future) | Multidirectional (all neighbors) |
| **Natural ordering** | Yes (time flows forward) | No (no natural ordering of space) |
| **Neighbor structure** | Fixed (t-1, t-2, etc.) | Must be specified (who is a neighbor?) |
| **Feedback loops** | Limited | Strong (simultaneous) |
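
The contrast in the table can be made concrete with two small matrices. The sketch below is illustrative only: the five-unit "line" of regions and all variable names are assumptions made for this example, not part of the notebook's dataset.

```python
import numpy as np

n = 5
y = np.array([1.0, 2.0, 4.0, 7.0, 11.0])

# Temporal lag: a strictly lower-triangular shift matrix.
# Row t picks up only y[t-1], so influence flows one way (past -> present).
L_time = np.eye(n, k=-1)
y_lag_time = L_time @ y          # first entry is 0 because there is no earlier period

# Spatial lag on a line of regions: each unit is linked to both sides,
# so the adjacency matrix is symmetric and influence flows in both directions.
A_space = np.eye(n, k=-1) + np.eye(n, k=1)
W_space = A_space / A_space.sum(axis=1, keepdims=True)   # row-normalize
y_lag_space = W_space @ y

print("temporal lag:", y_lag_time)
print("spatial lag: ", y_lag_space)
print("temporal matrix is lower triangular:", np.allclose(L_time, np.tril(L_time)))
print("spatial adjacency is symmetric:     ", np.allclose(A_space, A_space.T))
```

Because the spatial matrix has entries on both sides of the diagonal, unit i affects unit j and vice versa at the same time, which is the source of the simultaneous feedback noted in the last table row.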

### Why Spatial Autocorrelation Occurs

1. **Spillover effects**: Economic activities in one region affect neighbors
2. **Common shocks**: Nearby regions experience similar unobserved shocks
3. **Measurement error**: Administrative boundaries rarely match the true extent of the underlying process, so errors spill across adjacent units
4. **Omitted variables**: Unobserved factors vary smoothly over space

## Loading Real Spatial Data: US Counties

Now let's work with real data. We'll use US county-level socioeconomic data to explore spatial patterns.

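**Note**: This notebook assumes you have the US counties dataset in `../data/us_counties/`. If it is missing, the next cell falls back to the `South` example shapefile that ships with PySAL (Southern US counties) and attaches synthetic attribute values for demonstration purposes.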

In [None]:
# Define data paths
data_path = Path("../data/us_counties/us_counties.csv")
shapefile_path = Path("../data/us_counties/us_counties.shp")

# Check if real data exists
if data_path.exists() and shapefile_path.exists():
    # Load real data
    counties_data = pd.read_csv(data_path)
    counties_geo = gpd.read_file(shapefile_path)
    
    # Merge spatial and attribute data
    counties = counties_geo.merge(counties_data, on='county_id')
    
    print("✓ Real US counties data loaded successfully!")
    print(f"  Number of counties: {len(counties)}")
    
else:
    # Create synthetic data for demonstration
    print("⚠ Real data not found. Creating synthetic US counties data...")
    
    # We'll use a built-in dataset from libpysal for demonstration
    from libpysal import examples
    
    # Load sample spatial data (US South counties)
    south = examples.load_example('South')
    counties = gpd.read_file(south.get_path('south.shp'))
    
    # Rename columns to match our specification
    counties = counties.rename(columns={
        'FIPS': 'county_id',
        'NAME': 'county_name',
        'STATE_NAME': 'state',
        'HR90': 'crime_rate',
        'DV90': 'divorce_rate'
    })
    
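    # Create synthetic variables with a smooth geographic trend plus noise,
    # so that nearby counties get similar values (positive spatial autocorrelation).
    # Purely i.i.d. draws would show no spatial pattern at all.
    np.random.seed(42)
    n = len(counties)
    centroids = counties.geometry.centroid
    trend = ((centroids.x - centroids.x.mean()) / centroids.x.std()
             + (centroids.y - centroids.y.mean()) / centroids.y.std()).to_numpy()
    counties['income_percapita'] = 32500 + 3000 * trend + 2500 * np.random.randn(n)
    counties['population'] = np.random.randint(10000, 500000, n)
    counties['education'] = 25 + 3 * trend + 3 * np.random.randn(n)
    counties['unemployment_rate'] = 7 - 0.8 * trend + 1.0 * np.random.randn(n)
    counties['year'] = 2020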
    
    print(f"✓ Synthetic data created with {len(counties)} counties")
    print("  (Note: Using Southern US counties from PySAL examples)")

# Display basic information
print(f"\nDataset shape: {counties.shape}")
print(f"Coordinate reference system: {counties.crs}")

In [None]:
# Display first few rows
display_columns = ['county_name', 'state', 'income_percapita', 'population', 
                   'education', 'unemployment_rate']

# Check which columns exist
existing_cols = [col for col in display_columns if col in counties.columns]

print("Sample of counties data:\n")
print(counties[existing_cols].head(10).to_string(index=False))

print("\n" + "="*70)
print("Summary Statistics:")
print("="*70)
print(counties[['income_percapita', 'education', 'unemployment_rate']].describe())

### Exploratory Data Analysis: Choropleth Maps

A **choropleth map** is a thematic map where areas are shaded or patterned according to a variable's values. These maps are the first tool for identifying spatial patterns.

In [None]:
# Create choropleth map for income per capita
fig, ax = plt.subplots(1, 1, figsize=(15, 10))

counties.plot(column='income_percapita',
              cmap='YlOrRd',
              legend=True,
              ax=ax,
              edgecolor='black',
              linewidth=0.3,
              legend_kwds={'label': 'Income Per Capita (USD)',
                          'orientation': 'horizontal',
                          'shrink': 0.6})

ax.set_title('Per Capita Income across US Counties', 
             fontsize=16, fontweight='bold', pad=20)
ax.axis('off')

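plt.tight_layout()
# Create the output folder if needed so savefig does not fail
Path('../outputs/figures').mkdir(parents=True, exist_ok=True)
plt.savefig('../outputs/figures/nb01_income_choropleth.png', 
            dpi=300, bbox_inches='tight')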
plt.show()

print("Interpretation Guide:")
print("  - Red/dark regions: High income counties")
print("  - Yellow/light regions: Low income counties")
print("  - Look for clusters: Do high-income counties group together?")
print("\n✓ Figure saved to: ../outputs/figures/nb01_income_choropleth.png")

In [None]:
# Create multiple choropleth maps for different variables
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

variables = [
    ('income_percapita', 'Income Per Capita', 'YlOrRd'),
    ('education', 'Education (% College)', 'YlGnBu'),
    ('unemployment_rate', 'Unemployment Rate (%)', 'RdPu')
]

for idx, (var, title, cmap) in enumerate(variables):
    counties.plot(column=var,
                  cmap=cmap,
                  legend=True,
                  ax=axes[idx],
                  edgecolor='black',
                  linewidth=0.2,
                  legend_kwds={'shrink': 0.6})
    
    axes[idx].set_title(title, fontsize=12, fontweight='bold')
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

print("Visual Inspection:")
print("  Do you see clustering in these maps?")
print("  Are high-income counties near other high-income counties?")
print("  Are there regional patterns in education and unemployment?")
print("\n→ If yes, we have evidence of spatial autocorrelation!")

**Key Takeaway:**

Visual inspection reveals spatial patterns, but we need **formal statistical tests** to quantify and test for spatial autocorrelation. That's where Moran's I comes in (Section 5).

---

# 3. Why OLS Fails with Spatial Dependence

## The Problem with Ignoring Space

### OLS Assumptions Violated

Ordinary Least Squares (OLS) regression assumes:

1. **Independence of observations**: E(εᵢεⱼ) = 0 for all i ≠ j
2. **Homoscedasticity**: Var(εᵢ) = σ² (constant)
3. **No omitted variables**: All relevant variables are included

When spatial dependence exists:
- **E(εᵢεⱼ) ≠ 0** for neighboring observations i and j
- Errors are correlated across space
- Or the dependent variable y depends on neighboring y values (Wy)

### Consequences of Ignoring Spatial Dependence

1. **Omitted Variable Bias** (if ρWy is omitted):
   - True model: y = ρWy + Xβ + ε
   - OLS estimates: y = Xβ + u (where u = ρWy + ε)
   - Result: β̂ₒₗₛ is **biased and inconsistent**

2. **Inefficient Estimates** (if spatial correlation in errors):
   - True model: y = Xβ + u, where E(uu') = Ω ≠ σ²I
   - OLS is still unbiased but **not BLUE** (not efficient)
   - Better estimators exist (Spatial Error Model)

3. **Incorrect Standard Errors**:
   - OLS formula Var(β̂) = σ²(X'X)⁻¹ is **wrong**
   - Usually **underestimates** true variance
   - Results in **overconfident inference** (Type I errors)

4. **Invalid Hypothesis Tests**:
   - t-statistics and F-statistics are **invalid**
   - p-values are **misleading**
   - Confidence intervals have **wrong coverage**
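
The third consequence (standard errors that are too small) can be checked with a small Monte Carlo experiment. This is a minimal sketch under stated assumptions: a one-dimensional "line" of units, a spatial error process with λ = 0.8, and a regressor that is itself spatially smooth. None of these settings come from the notebook's data; they are chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, beta, reps = 200, 0.8, 1.0, 500

# Row-normalized "line" weights: each unit's neighbors are the units on either side
A = np.eye(n, k=-1) + np.eye(n, k=1)
W = A / A.sum(axis=1, keepdims=True)
M = np.linalg.inv(np.eye(n) - lam * W)    # spatial multiplier (I - lam*W)^(-1)

# The regressor is spatially smooth too; this is what makes naive SEs too small
x = M @ rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

slopes, naive_se = [], []
for _ in range(reps):
    u = M @ rng.standard_normal(n)        # spatially correlated errors
    y = beta * x + u
    b = XtX_inv @ X.T @ y                 # OLS coefficients
    resid = y - X @ b
    sigma2 = resid @ resid / (n - 2)
    slopes.append(b[1])
    naive_se.append(np.sqrt(sigma2 * XtX_inv[1, 1]))  # textbook OLS standard error

empirical_sd = np.std(slopes, ddof=1)
mean_naive_se = np.mean(naive_se)
print(f"mean slope estimate:     {np.mean(slopes):.3f} (true value {beta})")
print(f"empirical SD of slope:   {empirical_sd:.3f}")
print(f"average textbook OLS SE: {mean_naive_se:.3f}")
```

In this setup the slope estimates stay centered on the true value, while their actual spread across replications exceeds the average textbook standard error. That gap is what produces overconfident t-tests.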

### Simulation Demonstration

Let's use a simulation to demonstrate how OLS fails when spatial dependence exists.

In [None]:
# Simulation: Compare OLS on independent vs. spatially dependent data

print("="*70)
print("SIMULATION: OLS with Spatial Dependence")
print("="*70)

# Parameters
N = 500  # Number of observations
rho = 0.5  # Spatial autocorrelation parameter (moderate spillover)
beta_true = 2.5  # True coefficient on X

# Create simple spatial weight matrix (1D neighbors for simplicity)
from scipy.sparse import diags

# Create adjacency matrix: interior units have 2 neighbors (left and right),
# the two boundary units have only 1
W_sparse = diags([1, 1], [-1, 1], shape=(N, N)).toarray()  # dense array despite the name

# Row-normalize: each row sums to 1
row_sums = W_sparse.sum(axis=1)
row_sums[row_sums == 0] = 1  # Guard against isolated units (none here: every unit has a neighbor)
W = W_sparse / row_sums[:, None]

print(f"\nSimulation setup:")
print(f"  N = {N} observations")
print(f"  ρ (spatial correlation) = {rho}")
print(f"  β (true effect of X on y) = {beta_true}")
print(f"  Spatial weight matrix: {N}×{N} (row-normalized)")

In [None]:
# Generate data
np.random.seed(42)

# Generate X (independent variable)
X = np.random.normal(10, 2, N)

# Generate error term
epsilon = np.random.normal(0, 1, N)

# Identity matrix
I = np.eye(N)

# Generate y WITH spatial dependence (Spatial Lag Model)
# y = ρWy + Xβ + ε
# → (I - ρW)y = Xβ + ε
# → y = (I - ρW)⁻¹(Xβ + ε)
y_spatial = np.linalg.solve(I - rho * W, X * beta_true + epsilon)

# Generate y WITHOUT spatial dependence (for comparison)
# y = Xβ + ε
y_independent = X * beta_true + epsilon

print("\nData generated:")
print(f"  y_independent: No spatial dependence (OLS is correct)")
print(f"  y_spatial: Spatial lag dependence (OLS is biased)")

In [None]:
# Estimate OLS on both datasets
from sklearn.linear_model import LinearRegression

# OLS on independent data (CORRECT - OLS is appropriate here)
model_correct = LinearRegression()
model_correct.fit(X.reshape(-1, 1), y_independent)
beta_hat_correct = model_correct.coef_[0]
residuals_correct = y_independent - model_correct.predict(X.reshape(-1, 1))

# OLS on spatially dependent data (BIASED - OLS is inappropriate here)
model_biased = LinearRegression()
model_biased.fit(X.reshape(-1, 1), y_spatial)
beta_hat_biased = model_biased.coef_[0]
residuals_biased = y_spatial - model_biased.predict(X.reshape(-1, 1))

# Calculate bias
bias_correct = beta_hat_correct - beta_true
bias_biased = beta_hat_biased - beta_true

print("\n" + "="*70)
print("RESULTS: OLS Estimation")
print("="*70)
print(f"\nTrue β: {beta_true:.4f}")
print(f"\n1. OLS on independent data (NO spatial dependence):")
print(f"   β̂ = {beta_hat_correct:.4f}")
print(f"   Bias = {bias_correct:.4f}")
print(f"   ✓ OLS works well (small bias due to sampling variation)")
print(f"\n2. OLS on spatial data (WITH spatial dependence):")
print(f"   β̂ = {beta_hat_biased:.4f}")
print(f"   Bias = {bias_biased:.4f}")
print(f"   ✗ OLS is BIASED! (omitted variable: ρWy)")
print(f"\nConclusion: Ignoring spatial dependence leads to wrong estimates!")
print("="*70)

### Visualizing the Problem: Spatial Pattern in OLS Residuals

One diagnostic for spatial dependence is to check if OLS residuals exhibit spatial patterns.

In [None]:
# Plot residuals to show spatial clustering
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Left panel: Residuals from independent data (should be random)
axes[0].scatter(range(N), residuals_correct, alpha=0.5, s=20, color='steelblue')
axes[0].axhline(0, color='red', linestyle='--', linewidth=2)
axes[0].set_title('OLS Residuals: Independent Data\n(No Spatial Pattern - Good!)', 
                   fontsize=12, fontweight='bold')
axes[0].set_xlabel('Observation (Spatial Order)', fontsize=11)
axes[0].set_ylabel('Residual', fontsize=11)
axes[0].grid(True, alpha=0.3)

# Right panel: Residuals from spatial data (should show clustering)
axes[1].scatter(range(N), residuals_biased, alpha=0.5, s=20, color='firebrick')
axes[1].axhline(0, color='red', linestyle='--', linewidth=2)
axes[1].set_title('OLS Residuals: Spatial Data\n(Clear Spatial Clustering - Bad!)', 
                   fontsize=12, fontweight='bold')
axes[1].set_xlabel('Observation (Spatial Order)', fontsize=11)
axes[1].set_ylabel('Residual', fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Interpretation:")
print("  Left: Residuals are randomly scattered (white noise) → OLS is fine")
print("  Right: Residuals cluster (positive runs, negative runs) → OLS assumption violated")
print("\n→ Clustered residuals suggest spatial autocorrelation!")
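
The visual impression can be backed by a number. The sketch below is a self-contained re-creation of the simulation setup (same N, ρ, and β, but its own random generator, so the draws differ from the cells above). It computes Moran's I of the OLS residuals directly from the formula. The helper names `ols_residuals` and `morans_i` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n, rho, beta = 500, 0.5, 2.5

# Row-normalized weights for units arranged on a line
A = np.eye(n, k=-1) + np.eye(n, k=1)
W = A / A.sum(axis=1, keepdims=True)

x = rng.normal(10, 2, n)
eps = rng.normal(0, 1, n)
y_indep = beta * x + eps
y_spatial = np.linalg.solve(np.eye(n) - rho * W, beta * x + eps)

def ols_residuals(x, y):
    """Residuals from an OLS fit of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

def morans_i(values, W):
    """Moran's I: (N / S0) * z'Wz / z'z, with z the demeaned values."""
    z = values - values.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

i_indep = morans_i(ols_residuals(x, y_indep), W)
i_spatial = morans_i(ols_residuals(x, y_spatial), W)
print(f"Moran's I of residuals, independent data: {i_indep:.3f}")
print(f"Moran's I of residuals, spatial-lag data: {i_spatial:.3f}")
```

A value near zero for the independent data and a clearly positive value for the spatial-lag data is the pattern one would expect here.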

### Key Takeaway

**Ignoring spatial dependence leads to:**
1. Biased coefficient estimates (if spatial lag is omitted)
2. Incorrect standard errors (too small → overconfident)
3. Invalid hypothesis tests
4. Wrong policy conclusions

**Solution:** Use spatial econometric models that explicitly account for geographic relationships:
- Spatial Lag Model (SAR)
- Spatial Error Model (SEM)
- Spatial Durbin Model (SDM)

We'll learn these models in upcoming notebooks!

---

# 4. Visualizing Spatial Patterns

## Exploratory Spatial Data Analysis (ESDA)

Before fitting models, it's crucial to **explore spatial patterns visually**. This is called **Exploratory Spatial Data Analysis (ESDA)**.

### Tools for ESDA

1. **Choropleth maps**: We've already seen these (Section 2)
2. **Moran scatterplot**: Plots variable vs. spatial lag
3. **LISA cluster maps**: Local Indicators of Spatial Association
4. **Spatial correlograms**: Correlation at different distance bands

In this section, we'll focus on the **Moran scatterplot**, which provides intuition for Moran's I (Section 5).

### Understanding the Spatial Lag

The **spatial lag** of a variable is the weighted average of neighboring values:

**(Wy)ᵢ = Σⱼ wᵢⱼ yⱼ**

Where:
- wᵢⱼ is the (i,j) element of the spatial weight matrix W
- Typically, W is row-normalized so Σⱼ wᵢⱼ = 1
- Therefore, (Wy)ᵢ is the average value of y in i's neighborhood

**Example:**
- If county i has income = $30,000
- Its 3 neighbors have incomes = $28,000, $32,000, $29,000
- Spatial lag (Wy)ᵢ = (28000 + 32000 + 29000) / 3 = $29,667

**Interpretation:**
- If yᵢ and (Wy)ᵢ are positively correlated → Positive spatial autocorrelation
- If yᵢ and (Wy)ᵢ are negatively correlated → Negative spatial autocorrelation
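
The worked example above can be reproduced in a few lines. This is a minimal sketch with a hypothetical four-county layout (county 0 borders counties 1, 2, and 3); the adjacency pattern is an assumption made for illustration.

```python
import numpy as np

# Hypothetical layout: county 0 borders counties 1, 2 and 3
income = np.array([30000.0, 28000.0, 32000.0, 29000.0])
adjacency = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# Row-normalize so each row sums to 1, then the lag is a neighborhood average
W = adjacency / adjacency.sum(axis=1, keepdims=True)
income_lag = W @ income

print(f"Own income of county 0:      ${income[0]:,.0f}")
print(f"Spatial lag of county 0:     ${income_lag[0]:,.0f}")
print(f"Spatial lag of counties 1-3: {income_lag[1:]}")
```

County 0's lag is the mean of its three neighbors ($29,667), while counties 1 to 3 each have county 0 as their only neighbor, so their lag is simply $30,000.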

In [None]:
# Build spatial weight matrix for our counties data
print("Building spatial weight matrix (Queen contiguity)...\n")

# Queen contiguity: neighbors if they share border OR vertex
w = Queen.from_dataframe(counties)
w.transform = 'r'  # Row-normalize

print(f"Spatial Weight Matrix Statistics:")
print(f"  Number of observations: {w.n}")
print(f"  Average neighbors per county: {w.mean_neighbors:.2f}")
print(f"  Min neighbors: {w.min_neighbors}")
print(f"  Max neighbors: {w.max_neighbors}")
print(f"  Number of islands (0 neighbors): {len(w.islands)}")  # w.islands is a list of ids

print("\n✓ Spatial weight matrix created successfully!")

In [None]:
# Compute spatial lag of income
income = counties['income_percapita'].values
income_lag = libpysal.weights.lag_spatial(w, income)

# Add to dataframe for reference
counties['income_lag'] = income_lag

# Show example
print("Example: Income and Spatial Lag\n")
sample_idx = 5
print(f"County: {counties.iloc[sample_idx]['county_name']}")
print(f"  Own income: ${income[sample_idx]:,.0f}")
print(f"  Neighbors' average income: ${income_lag[sample_idx]:,.0f}")
print(f"  Number of neighbors: {len(w.neighbors[sample_idx])}")

### Moran Scatterplot

The **Moran scatterplot** plots each observation's value (horizontal axis) against the spatial lag (vertical axis).

**Interpretation:**
- **Positive slope**: Positive spatial autocorrelation (clustering)
- **Negative slope**: Negative spatial autocorrelation (dispersion)
- **Flat (no slope)**: No spatial autocorrelation

**Four quadrants:**
- **HH (High-High)**: Upper-right quadrant - high value surrounded by high values
- **LL (Low-Low)**: Lower-left quadrant - low value surrounded by low values
- **HL (High-Low)**: Lower-right quadrant - high value surrounded by low values (spatial outlier)
- **LH (Low-High)**: Upper-left quadrant - low value surrounded by high values (spatial outlier)
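
Assigning observations to the four quadrants only requires the sign of the standardized value and the sign of its spatial lag. The sketch below uses six hypothetical units on a line; the data and variable names are illustrative assumptions.

```python
import numpy as np

# Six hypothetical units on a line: three high values followed by three low values
y = np.array([10.0, 9.0, 8.0, 1.0, 2.0, 3.0])
n = len(y)

A = np.eye(n, k=-1) + np.eye(n, k=1)
W = A / A.sum(axis=1, keepdims=True)     # row-normalized weights

z = (y - y.mean()) / y.std()             # standardized values
z_lag = W @ z                            # average standardized value of neighbors

# First letter: own value (H/L); second letter: neighbors' average (H/L)
quadrant = np.where(z >= 0,
                    np.where(z_lag >= 0, "HH", "HL"),
                    np.where(z_lag >= 0, "LH", "LL"))

for i in range(n):
    print(f"unit {i}: y={y[i]:>4.0f}  z={z[i]:+.2f}  lag={z_lag[i]:+.2f}  -> {quadrant[i]}")
```

Unit 2 sits at the edge of the high-value block next to a low-value neighbor, so it lands in the HL quadrant; the remaining units fall in HH or LL.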

In [None]:
# Create Moran scatterplot
fig, ax = plt.subplots(figsize=(10, 10))

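# Standardize income (mean=0, sd=1), then take the spatial lag of the standardized values.
# With a row-normalized W, the slope of the fitted line then equals Moran's I;
# standardizing the lag separately would give a correlation coefficient instead.
income_std = (income - income.mean()) / income.std()
income_lag_std = libpysal.weights.lag_spatial(w, income_std)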

# Scatter plot
ax.scatter(income_std, income_lag_std, alpha=0.6, s=50, 
           edgecolors='black', linewidth=0.5, color='steelblue')

# Add regression line
z = np.polyfit(income_std, income_lag_std, 1)
p = np.poly1d(z)
ax.plot(income_std, p(income_std), "r-", linewidth=2, label=f'Slope = {z[0]:.3f}')

# Add reference lines at means
ax.axhline(0, color='black', linestyle='--', linewidth=1, alpha=0.5)
ax.axvline(0, color='black', linestyle='--', linewidth=1, alpha=0.5)

# Quadrant labels
ax.text(0.7, 0.7, 'HH\n(High-High)', transform=ax.transAxes, 
        fontsize=11, ha='center', va='center', 
        bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.5))
ax.text(0.3, 0.3, 'LL\n(Low-Low)', transform=ax.transAxes, 
        fontsize=11, ha='center', va='center',
        bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))
ax.text(0.7, 0.3, 'HL\n(Outlier)', transform=ax.transAxes, 
        fontsize=11, ha='center', va='center',
        bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.5))
ax.text(0.3, 0.7, 'LH\n(Outlier)', transform=ax.transAxes, 
        fontsize=11, ha='center', va='center',
        bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.5))

ax.set_xlabel('Income Per Capita (standardized)', fontsize=13, fontweight='bold')
ax.set_ylabel('Spatial Lag of Income (standardized)', fontsize=13, fontweight='bold')
ax.set_title('Moran Scatterplot: Income Per Capita', fontsize=15, fontweight='bold', pad=20)
ax.legend(fontsize=12, loc='upper left')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/figures/nb01_moran_scatterplot.png', 
            dpi=300, bbox_inches='tight')
plt.show()

print("Interpretation:")
print(f"  Slope = {z[0]:.3f} (positive slope suggests positive spatial autocorrelation)")
print("  Points in upper-right (HH): High income counties near high income counties")
print("  Points in lower-left (LL): Low income counties near low income counties")
print("  → Evidence of spatial clustering!")
print("\n✓ Figure saved to: ../outputs/figures/nb01_moran_scatterplot.png")

**Key Takeaway:**

The Moran scatterplot provides visual evidence of spatial autocorrelation:
- **Positive slope** → Positive spatial autocorrelation (clustering)
- With a row-normalized W and the lag computed from the standardized variable, the slope of the regression line equals **Moran's I** (next section)
- Most points in HH and LL quadrants → Clustering pattern
- Points in HL and LH quadrants → Spatial outliers

---

# 5. First Spatial Statistic: Moran's I

## Measuring Spatial Autocorrelation with Moran's I

**Moran's I** is the most widely used global measure of spatial autocorrelation. It's analogous to a correlation coefficient, but for spatial relationships.

### Formula

**I = (N / S₀) × [Σᵢ Σⱼ wᵢⱼ(xᵢ - x̄)(xⱼ - x̄)] / [Σᵢ(xᵢ - x̄)²]**

Where:
- **N**: Number of observations
- **wᵢⱼ**: Element (i,j) of spatial weight matrix W
- **S₀**: Sum of all weights = Σᵢ Σⱼ wᵢⱼ
- **x̄**: Mean of variable x
- **xᵢ, xⱼ**: Values of variable x at locations i and j
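
The formula translates almost line by line into NumPy. The sketch below is illustrative: it builds a binary rook-contiguity matrix for a 4×4 grid and evaluates I for a clustered pattern and a checkerboard pattern. The helper names `rook_weights` and `morans_i` are introduced here and are not part of any library.

```python
import numpy as np

def rook_weights(nrows, ncols):
    """Binary rook-contiguity matrix for a regular grid (cells sharing an edge)."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            if c + 1 < ncols:                      # neighbor to the right
                W[i, i + 1] = W[i + 1, i] = 1
            if r + 1 < nrows:                      # neighbor below
                W[i, i + ncols] = W[i + ncols, i] = 1
    return W

def morans_i(x, W):
    """I = (N / S0) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2."""
    z = x - x.mean()
    n, s0 = len(x), W.sum()
    return (n / s0) * (z @ W @ z) / (z @ z)

W = rook_weights(4, 4)
rows, cols = np.indices((4, 4))
clustered = (cols >= 2).astype(float).ravel()            # left half 0, right half 1
checkerboard = ((rows + cols) % 2).astype(float).ravel() # alternating 0/1

i_clustered = morans_i(clustered, W)
i_checker = morans_i(checkerboard, W)
print(f"Clustered pattern:    I = {i_clustered:.3f}")
print(f"Checkerboard pattern: I = {i_checker:.3f}")
```

On this grid the clustered pattern gives I = 2/3 and the checkerboard gives I = -1, matching the positive and negative cases described in the interpretation that follows.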

### Interpretation

- **I > E(I)**: Positive spatial autocorrelation (clustering)
  - High values near high values, low values near low values
  
- **I ≈ E(I)**: Random spatial pattern (no autocorrelation)
  - E(I) = -1/(N-1) ≈ 0 for large N
  
- **I < E(I)**: Negative spatial autocorrelation (dispersion)
  - High values near low values (checkerboard pattern)

**Range**: Approximately [-1, 1], but depends on spatial structure

### Hypothesis Test

- **H₀**: Spatial randomness (no spatial autocorrelation)
- **H₁**: Spatial autocorrelation exists
- **Test statistic**: Under randomization, I approximately follows a normal distribution
- **p-value**: Computed via permutation test or analytical approximation
- **Decision rule**: If p < 0.05, reject H₀ → Significant spatial autocorrelation
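
A permutation test can be sketched in a few lines: shuffle the values across locations many times, recompute I each time, and see how often the shuffled statistic is at least as large as the observed one. The code below is a simplified one-sided version for positive autocorrelation, written for illustration; it is not the exact procedure implemented in `esda`, and the function names are assumptions.

```python
import numpy as np

def morans_i(x, W):
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

def permutation_p_value(x, W, permutations=999, seed=0):
    """One-sided pseudo p-value: share of shuffles with I >= observed I."""
    rng = np.random.default_rng(seed)
    observed = morans_i(x, W)
    simulated = np.array([morans_i(rng.permutation(x), W)
                          for _ in range(permutations)])
    p = (np.sum(simulated >= observed) + 1) / (permutations + 1)
    return observed, p

# Units on a line with row-normalized weights
n = 100
A = np.eye(n, k=-1) + np.eye(n, k=1)
W = A / A.sum(axis=1, keepdims=True)

trend = np.arange(n, dtype=float)                     # smooth gradient: strong clustering
noise = np.random.default_rng(1).standard_normal(n)   # no spatial structure by design

i_trend, p_trend = permutation_p_value(trend, W)
i_noise, p_noise = permutation_p_value(noise, W)
print(f"Trend: I = {i_trend:.3f}, pseudo p-value = {p_trend:.3f}")
print(f"Noise: I = {i_noise:.3f}, pseudo p-value = {p_noise:.3f}")
```

With 999 permutations the smallest attainable pseudo p-value is 0.001, which is why software reports values such as 0.001 rather than 0 for strongly clustered data.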

### Computing Moran's I for Income

Let's compute Moran's I for income per capita and interpret the results.

In [None]:
# Compute Moran's I
moran_income = Moran(counties['income_percapita'], w)

print("="*70)
print("MORAN'S I TEST FOR SPATIAL AUTOCORRELATION")
print("="*70)
print(f"\nVariable: Income Per Capita")
print(f"\nMoran's I: {moran_income.I:.4f}")
print(f"Expected I under randomness: {moran_income.EI:.4f}")
print(f"Variance of I: {moran_income.VI_rand:.6f}")
print(f"\nz-score: {moran_income.z_rand:.4f}")
print(f"p-value (randomization): {moran_income.p_rand:.4f}")
print(f"p-value (simulation): {moran_income.p_sim:.4f}")

print(f"\n" + "-"*70)
# p_sim is a one-sided pseudo p-value, so read the direction from I relative to E(I)
direction = "positive" if moran_income.I > moran_income.EI else "negative"
if moran_income.p_sim < 0.01:
    print(f"✓ CONCLUSION: Highly significant {direction} spatial autocorrelation!")
    if direction == "positive":
        print("  → Income exhibits strong spatial clustering")
        print("  → High-income counties cluster together")
        print("  → Low-income counties cluster together")
    else:
        print("  → Income exhibits spatial dispersion (high values next to low values)")
    print("  → OLS inference is likely unreliable for this data")
    print("  → Spatial econometric models should be considered")
elif moran_income.p_sim < 0.05:
    print(f"✓ CONCLUSION: Significant {direction} spatial autocorrelation detected")
    print("  → Income exhibits a non-random spatial pattern")
    print("  → OLS may be inappropriate; consider spatial models")
elif moran_income.p_sim < 0.10:
    print("⚠ CONCLUSION: Weak evidence of spatial autocorrelation")
    print("  → Borderline case; investigate further")
else:
    print("✗ CONCLUSION: No significant spatial autocorrelation")
    print("  → Spatial pattern appears random")
    print("  → OLS may be appropriate")
print("="*70)

### Testing Multiple Variables

Let's compute Moran's I for several variables to see which ones exhibit spatial autocorrelation.

In [None]:
# Test multiple variables
variables = ['income_percapita', 'education', 'unemployment_rate']

moran_results = []
for var in variables:
    m = Moran(counties[var], w)
    moran_results.append({
        'Variable': var.replace('_', ' ').title(),
        'Moran_I': m.I,
        'Expected_I': m.EI,
        'z_score': m.z_sim,
        'p_value': m.p_sim,
        'Significant': '***' if m.p_sim < 0.01 else ('**' if m.p_sim < 0.05 else ('*' if m.p_sim < 0.10 else 'No'))
    })

moran_df = pd.DataFrame(moran_results)

print("\n" + "="*70)
print("MORAN'S I FOR MULTIPLE VARIABLES")
print("="*70)
print(moran_df.to_string(index=False))
print("\nSignificance: *** p<0.01, ** p<0.05, * p<0.10")
print("="*70)

### Visualizing Moran's I Results

A bar plot helps compare spatial autocorrelation across variables.

In [None]:
# Bar plot of Moran's I values
fig, ax = plt.subplots(figsize=(12, 7))

# Color bars by significance
colors = ['darkgreen' if p < 0.01 else ('green' if p < 0.05 else ('orange' if p < 0.10 else 'gray')) 
          for p in moran_df['p_value']]

bars = ax.bar(moran_df['Variable'], moran_df['Moran_I'], 
              color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)

# Add expected I reference line
ax.axhline(moran_df['Expected_I'].iloc[0], color='red', 
           linestyle='--', linewidth=2, label=f'Expected I (randomness) = {moran_df["Expected_I"].iloc[0]:.4f}')

ax.set_ylabel("Moran's I", fontsize=13, fontweight='bold')
ax.set_xlabel("Variable", fontsize=13, fontweight='bold')
ax.set_title("Spatial Autocorrelation Test Results", fontsize=15, fontweight='bold', pad=20)
ax.legend(fontsize=11)
ax.grid(True, axis='y', alpha=0.3)

# Add significance asterisks
for i, (idx, row) in enumerate(moran_df.iterrows()):
    if row['Significant'] != 'No':
        ax.text(i, row['Moran_I'] + 0.01, row['Significant'], 
                ha='center', fontsize=14, fontweight='bold')

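# Add legend for colors. A second ax.legend() call replaces the first one,
# so the reference line is listed here again to keep it in the legend.
from matplotlib.lines import Line2D
from matplotlib.patches import Patch
legend_elements = [
    Line2D([0], [0], color='red', linestyle='--', linewidth=2,
           label=f'Expected I (randomness) = {moran_df["Expected_I"].iloc[0]:.4f}'),
    Patch(facecolor='darkgreen', alpha=0.7, label='p < 0.01 (***)'),
    Patch(facecolor='green', alpha=0.7, label='p < 0.05 (**)'),
    Patch(facecolor='orange', alpha=0.7, label='p < 0.10 (*)'),
    Patch(facecolor='gray', alpha=0.7, label='Not significant')
]
ax.legend(handles=legend_elements, loc='upper right', fontsize=10)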

plt.tight_layout()
plt.savefig('../outputs/figures/nb01_morans_i_comparison.png', 
            dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Figure saved to: ../outputs/figures/nb01_morans_i_comparison.png")

**Key Takeaway:**

**Moran's I provides formal statistical evidence of spatial dependence:**
- If I is significantly positive → Spatial clustering exists
- If I is significantly negative → Spatial dispersion exists
- If I is not significant → Random spatial pattern (OLS may be OK)

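**Decision rule for modeling (a first-pass heuristic):**
- **If Moran's I is significant** → Consider spatial econometric models (SAR, SEM, SDM), and confirm by testing the OLS residuals for spatial autocorrelation
- **If Moran's I is not significant** → Standard OLS may be sufficient

**In our example**, a significantly positive Moran's I for income indicates that spatial models are needed; check the output above for the dataset you loaded.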

---

# 6. Introduction to Spatial Weight Matrices (W)

## The Fundamental Tool: Spatial Weight Matrix W

The **spatial weight matrix W** is the foundation of all spatial econometric models. It encodes our assumptions about which units influence each other.

### What is W?

W is an N×N matrix defining neighborhood relationships:

- **wᵢⱼ > 0** if units i and j are neighbors
- **wᵢⱼ = 0** if units i and j are not neighbors
- **wᵢᵢ = 0** (diagonal elements = 0, unit is not its own neighbor)
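
To make these rules concrete, here is a minimal sketch that builds W by hand for a made-up map of four regions arranged in a row (A - B - C - D). The `neighbors` dictionary is a hypothetical toy input, not the county data used in this notebook.

```python
import numpy as np

# Hypothetical toy map: four regions in a row, A - B - C - D (IDs 0..3)
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

n = len(neighbors)
W = np.zeros((n, n))
for i, js in neighbors.items():
    for j in js:
        W[i, j] = 1.0  # w_ij > 0 marks i and j as neighbors

print(W)
print("Zero diagonal:", np.all(np.diag(W) == 0))
print("Symmetric:", np.array_equal(W, W.T))
```

Contiguity is a mutual relationship, so this W is symmetric. Other definitions of "neighbor" (for example k-nearest neighbors) need not be.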

### Role in Spatial Models

W operationalizes the concept of "nearness" or "connectivity":

- **Wy**: Spatial lag (weighted average of neighbors' y values)
- **WX**: Spatial lag of explanatory variables
- **Wε**: Spatially lagged errors

**Important**: W is **specified**, not estimated. We choose W based on:
1. Geographic proximity (contiguity, distance)
2. Economic linkages (trade, migration)
3. Social networks (kinship, institutional ties)
4. Theoretical considerations

### Preview of W Types (Full Tutorial in Notebook 02)

1. **Contiguity-Based**:
   - **Queen**: Neighbors if share border OR vertex (corner)
   - **Rook**: Neighbors if share border (not corner)

2. **Distance-Based**:
   - **Inverse distance**: wᵢⱼ = 1/dᵢⱼ or 1/dᵢⱼ²
   - **Distance threshold**: wᵢⱼ = 1 if dᵢⱼ < d*, else 0

3. **k-Nearest Neighbors**:
   - **k-NN**: Each unit has exactly k neighbors (closest ones)

4. **Economic/Social**:
   - **Trade flows**: wᵢⱼ = trade volume between i and j
   - **Migration**: wᵢⱼ = migration flows
   - **Network**: Based on institutional or social connections
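
As a rough preview, the distance-based and k-NN definitions above can be sketched with plain NumPy. The coordinates, the threshold `d_star`, and `k` are arbitrary illustrative choices. In practice you would normally use the `libpysal.weights` classes covered in Notebook 02 rather than hand-rolled matrices.

```python
import numpy as np

# Hypothetical centroid coordinates for five units (arbitrary planar units)
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [3.0, 4.0]])
n = len(coords)

# Pairwise Euclidean distances d_ij
diff = coords[:, None, :] - coords[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=2))
off_diag = ~np.eye(n, dtype=bool)

# Inverse distance: w_ij = 1/d_ij off the diagonal, 0 on the diagonal
W_inv = np.zeros((n, n))
W_inv[off_diag] = 1.0 / d[off_diag]

# Distance threshold: w_ij = 1 if d_ij < d_star, else 0
d_star = 2.5
W_thr = ((d < d_star) & off_diag).astype(float)

# k-nearest neighbors: each unit gets exactly k neighbors
k = 2
d_no_self = d + np.diag(np.full(n, np.inf))  # keep a unit from picking itself
nearest = np.argsort(d_no_self, axis=1)[:, :k]
W_knn = np.zeros((n, n))
W_knn[np.arange(n)[:, None], nearest] = 1.0

print("Neighbors per unit (threshold):", W_thr.sum(axis=1))
print("Neighbors per unit (k-NN):     ", W_knn.sum(axis=1))
print("k-NN symmetric?", np.array_equal(W_knn, W_knn.T))
```

Two design consequences show up even in this toy case: a distance threshold can leave a remote unit with no neighbors at all (an island), while k-NN guarantees k neighbors per unit at the cost of an asymmetric W.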

### Row Normalization

Typically, W is **row-normalized**: Each row sums to 1.

- **Before**: wᵢⱼ = 1 if neighbor, 0 otherwise
- **After**: wᵢⱼ = 1/#{neighbors of i} if neighbor, 0 otherwise

**Interpretation**: Wyᵢ becomes the **average** (not sum) of neighbors' y values.
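
A small sketch of the before/after difference, using a made-up binary contiguity matrix for four regions in a row and illustrative `y` values (both are assumptions for demonstration only):

```python
import numpy as np

# Hypothetical binary contiguity matrix for four regions in a row: A - B - C - D
W_binary = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
y = np.array([10.0, 20.0, 30.0, 40.0])

# Divide each row by its number of neighbors so that the row sums to 1
row_sums = W_binary.sum(axis=1, keepdims=True)
W_row = W_binary / row_sums

print("Row sums after normalization:", W_row.sum(axis=1))
print("Binary W:         Wy =", W_binary @ y)  # sum of neighbors' y
print("Row-normalized W: Wy =", W_row @ y)     # mean of neighbors' y
```

For region B (neighbors A and C), the binary lag is 10 + 30 = 40, while the row-normalized lag is their average, 20. Note that this simple division breaks down for islands, whose row sum is zero, so they need separate handling.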

### Example: Queen Contiguity Matrix

We already built a Queen contiguity matrix earlier. Let's examine its properties.

In [None]:
# Examine the spatial weight matrix we created earlier
print("="*70)
print("SPATIAL WEIGHT MATRIX: QUEEN CONTIGUITY")
print("="*70)

print(f"\nMatrix dimensions: {w.n} × {w.n}")
# Note: w.s0 is the SUM of all weights (about n once W is row-normalized),
# so use w.nonzero to count links and w.pct_nonzero for density
print(f"Number of nonzero weights (directed links): {w.nonzero}")
print(f"Average number of neighbors: {w.mean_neighbors:.2f}")
print(f"Min neighbors: {w.min_neighbors}")
print(f"Max neighbors: {w.max_neighbors}")
print(f"Number of islands (units with 0 neighbors): {len(w.islands)}")
print(f"Sparsity: {100 - w.pct_nonzero:.2f}% of elements are zero")

# Show example neighborhood
example_id = 10
neighbors = w.neighbors[example_id]
weights = w.weights[example_id]

print(f"\nExample: County {example_id} ({counties.iloc[example_id]['county_name']})")
print(f"  Number of neighbors: {len(neighbors)}")
print(f"  Neighbor IDs: {neighbors}")
print(f"  Weights (row-normalized): {weights}")
print(f"  Sum of weights: {sum(weights):.4f} (should be 1.0 for row-normalized)")
print("="*70)

In [None]:
# Histogram of neighbor counts
neighbor_counts = [len(w.neighbors[i]) for i in w.neighbors]

fig, ax = plt.subplots(figsize=(12, 7))

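# Neighbor counts are integers, so use one bin per count
bins = np.arange(min(neighbor_counts) - 0.5, max(neighbor_counts) + 1.5, 1)
ax.hist(neighbor_counts, bins=bins, edgecolor='black', 
        alpha=0.7, color='steelblue')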
ax.axvline(w.mean_neighbors, color='red', linestyle='--', 
           linewidth=2, label=f'Mean = {w.mean_neighbors:.2f}')

ax.set_xlabel('Number of Neighbors', fontsize=13, fontweight='bold')
ax.set_ylabel('Frequency (Number of Counties)', fontsize=13, fontweight='bold')
ax.set_title('Distribution of Neighbor Counts (Queen Contiguity)', 
             fontsize=15, fontweight='bold', pad=20)
ax.legend(fontsize=12)
ax.grid(True, axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/figures/nb01_neighbor_distribution.png', 
            dpi=300, bbox_inches='tight')
plt.show()

print("Interpretation:")
print(f"  Counties have {w.mean_neighbors:.1f} neighbors on average")
print("  Distribution shows natural variation in geographic connectivity")
print("  Counties with few neighbors: Coastal or isolated regions")
print("  Counties with many neighbors: Centrally located regions")
print("\n✓ Figure saved to: ../outputs/figures/nb01_neighbor_distribution.png")

### Visualizing Spatial Connections

Let's visualize the spatial connectivity structure by plotting links between neighboring counties.

In [None]:
# Plot spatial connections on map
fig, ax = plt.subplots(1, 1, figsize=(15, 10))

# Plot counties as base map
counties.plot(ax=ax, facecolor='lightgray', edgecolor='black', linewidth=0.5, alpha=0.5)

# Sample counties to avoid overcrowding (plot connections for random sample)
np.random.seed(123)
sample_size = min(50, len(counties))  # Plot max 50 counties' connections
sample_ids = np.random.choice(counties.index, sample_size, replace=False)

# Plot connections
for i in sample_ids:
    if i not in w.neighbors:  # Skip IDs missing from W (islands have an empty neighbor list)
        continue
    
    county_i = counties.loc[i]
    centroid_i = county_i.geometry.centroid
    
    for j in w.neighbors[i]:
        county_j = counties.loc[j]
        centroid_j = county_j.geometry.centroid
        
        # Draw line between centroids
        ax.plot([centroid_i.x, centroid_j.x],
                [centroid_i.y, centroid_j.y],
                'r-', linewidth=0.8, alpha=0.4)

# Highlight sampled counties
counties.loc[sample_ids].plot(ax=ax, facecolor='yellow', edgecolor='black', 
                               linewidth=1, alpha=0.6, label='Sampled counties')

ax.set_title(f'Spatial Connectivity Structure (Sample of {sample_size} Counties)', 
             fontsize=16, fontweight='bold', pad=20)
ax.axis('off')
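# Polygon layers drawn by GeoPandas are not picked up by ax.legend(),
# so pass an explicit proxy handle for the sampled counties
from matplotlib.patches import Patch
ax.legend(handles=[Patch(facecolor='yellow', edgecolor='black', alpha=0.6,
                         label='Sampled counties')],
          fontsize=12, loc='lower right')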

plt.tight_layout()
plt.savefig('../outputs/figures/nb01_spatial_connections.png', 
            dpi=300, bbox_inches='tight')
plt.show()

print("Interpretation:")
print("  Red lines connect neighboring counties (Queen contiguity)")
print("  Yellow counties are sampled for visualization")
print("  Dense network shows high spatial connectivity")
print("\n✓ Figure saved to: ../outputs/figures/nb01_spatial_connections.png")

**Key Takeaway:**

**The spatial weight matrix W is fundamental to spatial econometrics:**
1. W defines who influences whom
2. W is **specified** (chosen), not estimated
3. Choice of W affects results (robustness checks needed!)
4. Common choices: Contiguity (Queen/Rook), Distance, k-NN
5. Row-normalization makes Wy interpretable as neighbor average

**Next step**: Notebook 02 will cover W specification in depth, including:
- How to choose W
- Different W types and when to use them
- Sensitivity analysis across different W specifications

---

# 7. Summary and Preview of Next Steps

## What We've Learned

### Key Concepts Covered

1. ✓ **Spatial dependence** exists in many economic phenomena (Tobler's Law)
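2. ✓ **OLS is unreliable** when observations are spatially correlated
3. ✓ **Consequences**: Biased and inconsistent estimates when a spatial lag of y is omitted; inefficient estimates and wrong standard errors when errors are spatially correlated; invalid tests in both cases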
4. ✓ **Choropleth maps** visualize spatial patterns effectively
5. ✓ **Moran's I** provides formal statistical test for spatial autocorrelation
6. ✓ **Spatial weight matrix W** defines neighborhood structure
7. ✓ **Moran scatterplot** shows relationship between y and neighbors' y

### Critical Insights

**When should you use spatial econometric methods?**

1. **Theoretical reasons**:
   - Spillover effects are plausible (technology diffusion, crime contagion)
   - Common shocks affect nearby regions similarly
   - Unobserved factors vary smoothly over space

2. **Empirical evidence**:
   - Choropleth maps show clustering
   - Moran's I test is significant (p < 0.05)
   - OLS residuals exhibit spatial pattern

3. **Consequences of ignoring**:
   - Incorrect inference → Wrong policy decisions
   - Missing important spillover effects
   - Underestimated uncertainty

**Decision tree:**

```
Is spatial dependence plausible?
│
├─ Yes → Test with Moran's I
│         │
│         ├─ Significant → Use spatial models (SAR, SEM, SDM)
│         └─ Not significant → OLS may be OK (but check residuals)
│
└─ No → Standard non-spatial methods (OLS, panel, time series) may suffice
```

## Typical Spatial Econometric Workflow

This is the workflow we'll develop across the tutorial series:

### Step 1: Data Preparation
- Load spatial data + geographic boundaries
- Ensure consistent identifiers (county_id, region_id, etc.)
- Check for missing values and outliers

### Step 2: Exploratory Spatial Data Analysis (ESDA)
- Create **choropleth maps** (visual inspection)
- Compute **Moran's I** (formal test)
- Create **Moran scatterplot** (understand clustering)
- Identify **spatial outliers** (LISA analysis)

### Step 3: Specify Spatial Weight Matrix
- Choose W based on theory and data structure
- Test sensitivity to different W specifications
- Row-normalize W for interpretability

### Step 4: Model Selection
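- Use specification tests (LM tests, LR tests) to choose among models
- If only the spatial lag test is significant → **Spatial Lag Model (SAR)**
- If only the spatial error test is significant → **Spatial Error Model (SEM)**
- If both are significant → compare the robust LM tests, or start from the **Spatial Durbin Model (SDM)**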

### Step 5: Estimation
- Estimate chosen model (ML or GMM)
- Check convergence and diagnostics
- Compute robust standard errors if needed

### Step 6: Interpretation
- Compute **marginal effects** (direct, indirect, total)
- Interpret spillover effects
- Visualize spatial effects on map
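
As a preview of what "direct, indirect, total" means, here is a minimal sketch for a SAR model with a single regressor. It uses the summary measures popularized by LeSage & Pace (2009). The matrix `W` and the values of `rho` and `beta` are illustrative assumptions, not estimates from this notebook.

```python
import numpy as np

# Hypothetical row-normalized W for four regions in a row: A - B - C - D
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
rho, beta = 0.4, 2.0  # illustrative values, not estimates
n = W.shape[0]

# S[i, j] = dy_i / dx_j in the SAR model y = rho*W*y + x*beta + eps
S = np.linalg.inv(np.eye(n) - rho * W) * beta

direct = np.trace(S) / n    # average own-region effect (diagonal of S)
total = S.sum() / n         # average row sum of S
indirect = total - direct   # average spillover (off-diagonal part)

print(f"beta     = {beta:.3f}")
print(f"direct   = {direct:.3f}")
print(f"indirect = {indirect:.3f}")
print(f"total    = {total:.3f}")
```

The direct effect exceeds β because a change in region i feeds back to i through its neighbors. With a row-normalized W the total effect equals β / (1 − ρ), which is why reading β alone as "the" effect is misleading in SAR models.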

### Step 7: Validation
- Check residual spatial autocorrelation (Moran's I of the residuals should no longer be significant)
- Robustness checks (different W, outliers, etc.)
- Compare with non-spatial models
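
The residual check can be sketched from scratch as follows. This is a simplified stand-in for what `esda.moran.Moran` does with its permutations, run on simulated data. The ring layout, the smooth error term `u`, and the one-sided pseudo p-value are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layout: 40 regions on a ring, each with two neighbors (row-normalized W)
n = 40
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5

def morans_i(v, W):
    """Global Moran's I of vector v under weights W."""
    z = v - v.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

# Simulated data whose error term varies smoothly around the ring
x = rng.normal(size=n)
u = 3.0 * np.sin(2 * np.pi * np.arange(n) / n)
y = 1.0 + 2.0 * x + u

# Non-spatial OLS fit and its residuals
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Permutation test: shuffle residuals across regions to build a reference distribution
observed = morans_i(resid, W)
sims = np.array([morans_i(rng.permutation(resid), W) for _ in range(999)])
p_value = (np.sum(sims >= observed) + 1) / (len(sims) + 1)

print(f"Residual Moran's I = {observed:.3f}, one-sided pseudo p-value = {p_value:.3f}")
```

Here the OLS residuals inherit the spatial pattern of `u`, so the test flags the non-spatial model as misspecified. After fitting an adequate spatial model, the same check on its residuals should no longer reject.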

## Preview of Upcoming Notebooks

### Notebook 02: Spatial Weight Matrices (W)
**Topics**:
- Deep dive into W matrix construction
- Contiguity matrices (Queen, Rook, higher-order)
- Distance-based matrices (inverse distance, threshold)
- k-Nearest Neighbors
- Economic/social weight matrices
- Sensitivity analysis across W specifications
- Best practices for choosing W

### Notebook 03: Spatial Lag Model (SAR)
**Topics**:
- Theory: Endogenous spatial interaction
- Model: y = ρWy + Xβ + ε
- Estimation: Maximum Likelihood and GMM
- Interpretation: Direct and indirect effects
- Hypothesis testing
- Applications: Regional growth, technology diffusion
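
Because y appears on both sides of the SAR equation, data are generated from its reduced form, y = (I − ρW)⁻¹(Xβ + ε). The sketch below simulates one draw and checks the spatial multiplier expansion numerically. The ring layout and the parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical layout: n regions on a ring, each with two neighbors (row-normalized W)
n = 50
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

rho, beta = 0.6, 1.5  # illustrative values, not estimates
x = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)

# Reduced form: y = (I - rho*W)^{-1} (x*beta + eps); solve() avoids forming the inverse
A = np.eye(n) - rho * W
y = np.linalg.solve(A, x * beta + eps)

# y satisfies the structural equation y = rho*W*y + x*beta + eps
print("Structural equation holds:", np.allclose(y, rho * (W @ y) + x * beta + eps))

# Spatial multiplier: (I - rho*W)^{-1} = I + rho*W + rho^2*W^2 + ...
series = sum(np.linalg.matrix_power(rho * W, k) for k in range(60))
print("Series matches inverse:   ", np.allclose(np.linalg.inv(A), series))
```

The series view is the intuition behind "endogenous interaction": a shock in one region reaches its neighbors (ρW), their neighbors (ρ²W²), and so on, with the impact shrinking at each step when |ρ| < 1 and W is row-normalized.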

### Notebook 04: Spatial Error Model (SEM)
**Topics**:
- Theory: Spatial correlation in unobservables
- Model: y = Xβ + u, u = λWu + ε
- When to use SEM vs. SAR
- Estimation and inference
- Efficiency gains over OLS

### Notebook 05: Spatial Durbin Model (SDM)
**Topics**:
- Theory: Most flexible spatial model
- Model: y = ρWy + Xβ + WXθ + ε
- Nesting SAR and SEM
- Spatial marginal effects (crucial!)
- Model selection tests
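
To see what "nesting" means, the following short derivation (a standard result, written with the same symbols as the model equations in these previews) shows how restrictions on the SDM parameters recover the other two models:

$$
\begin{aligned}
\text{SDM:}\quad & y = \rho W y + X\beta + W X\theta + \varepsilon \\
\theta = 0:\quad & y = \rho W y + X\beta + \varepsilon \qquad \text{(SAR)} \\
\text{SEM:}\quad & y = X\beta + u,\quad u = \lambda W u + \varepsilon \\
\Rightarrow\quad & (I - \lambda W)\,y = (I - \lambda W)\,X\beta + \varepsilon \\
\Rightarrow\quad & y = \lambda W y + X\beta - \lambda W X\beta + \varepsilon
\end{aligned}
$$

The last line is an SDM with ρ = λ and θ = −ρβ, known as the common factor restriction. Testing θ = 0 and θ = −ρβ is therefore one way to decide whether the SDM can be simplified to SAR or SEM.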

### Notebook 06: Spatial Marginal Effects
**Topics**:
- Why standard β interpretation fails
- Direct effects (own effect)
- Indirect effects (spillover to others)
- Total effects (sum of direct + indirect)
- Computing and visualizing effects

### Notebook 07: Dynamic Spatial Panels
**Topics**:
- Combining time and space
- Space-time weight matrices
- Dynamic spatial panel models
- GMM estimation for dynamic panels

### Notebook 08: Model Selection and Diagnostics
**Topics**:
- LM tests (Lagrange Multiplier)
- Robust LM tests
- Likelihood ratio tests
- Residual diagnostics
- Specification search strategy

## Exercises for Further Practice

To reinforce your understanding, try these exercises:

### Exercise 1: Explore Another Variable
Choose a different variable from the dataset (e.g., education, unemployment) and:
- Create a choropleth map
- Compute Moran's I
- Create a Moran scatterplot
- Interpret the results

### Exercise 2: Simulate Negative Spatial Autocorrelation
Modify the simulation in Section 3 to generate data with **negative** spatial autocorrelation (ρ < 0). Visualize the pattern and compute Moran's I.
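
As a reference point for this exercise, the sketch below builds the most extreme case of negative spatial autocorrelation: a checkerboard on a regular grid. This is a deterministic illustration, not the Section 3 simulation, and the helper `path_adjacency` and the 4×4 grid are hypothetical choices.

```python
import numpy as np

def path_adjacency(m):
    """Adjacency matrix of m cells in a line (each cell touches the next one)."""
    A = np.zeros((m, m))
    idx = np.arange(m - 1)
    A[idx, idx + 1] = A[idx + 1, idx] = 1.0
    return A

rows, cols = 4, 4

# Rook contiguity on a grid: neighbors within a row plus neighbors within a column
A = (np.kron(np.eye(rows), path_adjacency(cols))
     + np.kron(path_adjacency(rows), np.eye(cols)))
W = A / A.sum(axis=1, keepdims=True)  # row-normalize

# Checkerboard: every cell's rook neighbors take the opposite value
r, c = np.indices((rows, cols))
y = np.where((r + c) % 2 == 0, 1.0, -1.0).ravel()

z = y - y.mean()
moran_i = (len(z) / W.sum()) * (z @ W @ z) / (z @ z)
print(f"Moran's I for a checkerboard: {moran_i:.2f}")
```

Every neighbor average is the exact opposite of the cell's own value, so Moran's I reaches −1 here. Simulated data with ρ < 0 should show a noisier version of the same alternating pattern.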

### Exercise 3: Identify Spatial Outliers
From the Moran scatterplot, identify counties in the HL and LH quadrants (spatial outliers). Can you explain why these counties might be outliers?
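
One possible way to label the quadrants, sketched on made-up data (six regions in a row with illustrative `y` values). For the county data you would substitute the notebook's own variable and weight matrix.

```python
import numpy as np

# Hypothetical row-normalized W for six regions in a row
n = 6
A = np.zeros((n, n))
idx = np.arange(n - 1)
A[idx, idx + 1] = A[idx + 1, idx] = 1.0
W = A / A.sum(axis=1, keepdims=True)

y = np.array([50.0, 48.0, 10.0, 46.0, 12.0, 8.0])  # illustrative values

z = (y - y.mean()) / y.std()  # standardized variable (x-axis of the Moran scatterplot)
lag = W @ z                   # spatial lag (y-axis of the Moran scatterplot)

# First letter: own value (High/Low); second letter: neighbors' average (High/Low)
labels = np.where(z > 0,
                  np.where(lag > 0, "HH", "HL"),
                  np.where(lag > 0, "LH", "LL"))

for i, (zi, li, lab) in enumerate(zip(z, lag, labels)):
    print(f"Region {i}: z = {zi:+.2f}, lag = {li:+.2f} → {lab}")
```

Regions 2 (LH) and 3 (HL) are the candidate spatial outliers in this toy example. Quadrant labels alone say nothing about significance, which is what the LISA analysis mentioned in the workflow is for.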

### Exercise 4: Compare W Specifications
Build a different spatial weight matrix (e.g., k-NN with k=5) and compute Moran's I for income. How does the result compare to Queen contiguity?

### Exercise 5: Reading Assignment
Read Chapter 1 of LeSage & Pace (2009) *Introduction to Spatial Econometrics* for deeper theoretical background.

## Save Workspace for Future Notebooks

Let's save our processed data for use in subsequent notebooks.

In [None]:
# Save processed counties data
output_data_path = Path('../data/us_counties/')
output_data_path.mkdir(parents=True, exist_ok=True)

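# Save as shapefile (includes geometry)
# Note: the shapefile format truncates column names to 10 characters
# (e.g., 'county_name' is stored as 'county_nam')
counties.to_file(output_data_path / 'us_counties_processed.shp')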

# Also save as CSV for easy loading
counties_df = pd.DataFrame(counties.drop(columns='geometry'))
counties_df.to_csv(output_data_path / 'us_counties_processed.csv', index=False)

print("✓ Processed data saved:")
print(f"  Shapefile: {output_data_path / 'us_counties_processed.shp'}")
print(f"  CSV: {output_data_path / 'us_counties_processed.csv'}")
print("\nReady to proceed to Notebook 02: Spatial Weight Matrices!")

---

## References

### Essential Citations

1. **Tobler, W. R. (1970)**. "A Computer Movie Simulating Urban Growth in the Detroit Region". *Economic Geography*, 46(sup1), 234-240.

2. **Anselin, L. (1988)**. *Spatial Econometrics: Methods and Models*. Dordrecht: Kluwer Academic Publishers.

3. **LeSage, J. P., & Pace, R. K. (2009)**. *Introduction to Spatial Econometrics*. Boca Raton: CRC Press.
   - Chapter 1: Introduction (motivation and overview)
   - Chapter 2: Motivating and Interpreting Spatial Econometric Models

4. **Anselin, L. (1995)**. "Local Indicators of Spatial Association—LISA". *Geographical Analysis*, 27(2), 93-115.

5. **Anselin, L., & Bera, A. K. (1998)**. "Spatial Dependence in Linear Regression Models with an Introduction to Spatial Econometrics". In A. Ullah & D. E. A. Giles (Eds.), *Handbook of Applied Economic Statistics* (pp. 237-289). New York: Marcel Dekker.

### Software Documentation

- **PySAL (Python Spatial Analysis Library)**: https://pysal.org/
- **GeoPandas**: https://geopandas.org/
- **libpysal.weights**: https://pysal.org/libpysal/api.html#spatial-weights
- **esda.moran**: https://pysal.org/esda/generated/esda.Moran.html

### Additional Resources

- Anselin, L. (2005). "Exploring Spatial Data with GeoDa: A Workbook". Available at: https://geodacenter.github.io/
- Bivand, R. S., Pebesma, E., & Gómez-Rubio, V. (2013). *Applied Spatial Data Analysis with R* (2nd ed.). Springer.

---

**Notebook completed!** You should now have a solid understanding of:
- Why spatial econometrics matters
- How to detect spatial autocorrelation
- Why OLS fails with spatial dependence
- The role of spatial weight matrices

**Next**: [Notebook 02: Spatial Weight Matrices](02_spatial_weights_matrices.ipynb)