# Spatial Weight Matrices (W): The Foundation of Spatial Econometrics

**Notebook 02 | Tutorial Series: Spatial Econometrics with PanelBox**

**Author**: PanelBox Development Team  
**Level**: Beginner to Intermediate  
**Duration**: 90-120 minutes  
**Prerequisites**: Notebook 01 (Introduction to Spatial Econometrics)

---

## Objectives

This notebook provides a comprehensive understanding of **spatial weight matrices (W)**, the foundational tool of spatial econometrics. You will learn to:

1. Construct various types of W matrices (contiguity, distance, k-NN)
2. Understand their mathematical properties and implications
3. Perform and interpret row normalization
4. Assess sensitivity of results to different W specifications
5. Choose appropriate W for your research question

---

## 1. Introduction: The Role of W in Spatial Models

### Why W is the Foundation of Spatial Econometrics

Recall from Notebook 01 that **spatial weight matrices (W)** encode neighborhood relationships between spatial units. W is central to all spatial econometric models:

**Key Spatial Models**:
- **SAR (Spatial Lag)**: $y = \rho Wy + X\beta + \varepsilon$
- **SEM (Spatial Error)**: $y = X\beta + u$, where $u = \lambda Wu + \varepsilon$
- **SDM (Spatial Durbin)**: $y = \rho Wy + X\beta + WX\theta + \varepsilon$
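
To make the SAR equation concrete, here is a minimal sketch that simulates $y$ from its reduced form $y = (I - \rho W)^{-1}(X\beta + \varepsilon)$. The 4-unit line layout, the value $\rho = 0.5$, the coefficient, and the random seed are all illustrative assumptions, not values from the tutorial data.

```python
import numpy as np

# Toy layout (assumed): 4 units on a line, neighbors are the adjacent units
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
W = W / W.sum(axis=1, keepdims=True)  # row-normalize so each row sums to 1

rng = np.random.default_rng(0)
n = W.shape[0]
X = rng.normal(size=(n, 1))           # one hypothetical regressor
beta = np.array([2.0])                # assumed coefficient
eps = rng.normal(scale=0.1, size=n)   # assumed noise
rho = 0.5                             # assumed spatial lag parameter

# Reduced form of the SAR model: solve (I - rho W) y = X beta + eps
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)

# The structural equation y = rho W y + X beta + eps then holds by construction
residual = y - (rho * (W @ y) + X @ beta + eps)
print("Structural equation satisfied:", np.allclose(residual, 0))
```

Because $W$ enters through $(I - \rho W)^{-1}$, a shock to one unit propagates to its neighbors, their neighbors, and so on, which is why the choice of $W$ shapes every downstream result.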

### Critical Properties of W

1. **W is specified, not estimated**: You choose the structure
2. **W encodes theoretical assumptions** about spatial relationships
3. **Different W specifications → different interpretations**
4. **Diagonal elements $w_{ii} = 0$**: No self-influence

> **Key Insight**: "W is the most important modeling choice in spatial econometrics. Everything else follows from how we define neighbors."

---

In [None]:
# Import required libraries
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
from scipy.spatial import distance_matrix
import warnings
warnings.filterwarnings('ignore')

# PySAL libraries
import libpysal
from libpysal import weights
from libpysal.weights import Queen, Rook, KNN, DistanceBand
import esda
from esda import Moran

# Set paths
panelbox_path = Path("/home/guhaase/projetos/panelbox")
sys.path.insert(0, str(panelbox_path))
data_path = Path("../data/us_counties/")
output_path = Path("../outputs/figures/")
output_path.mkdir(parents=True, exist_ok=True)

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['savefig.bbox'] = 'tight'

print("✓ Libraries loaded successfully")
print(f"✓ Working directory: {Path.cwd()}")
print(f"✓ Data path: {data_path}")
print(f"✓ Output path: {output_path}")

---

## 2. Contiguity-Based W Matrices

### Neighbors by Shared Borders: Queen and Rook

Contiguity-based weights define neighbors by geographic adjacency.

#### Queen Contiguity
- **Definition**: Units are neighbors if they share a border OR vertex
- **Analogy**: Chess queen movement (any direction including diagonal)
- **Use case**: General geographic analysis

#### Rook Contiguity
- **Definition**: Units are neighbors if they share a border (not just vertex)
- **Analogy**: Chess rook movement (only cardinal directions)
- **Use case**: When diagonal connections are less relevant

**Visual Comparison**:
```
Queen:          Rook:
  X X X           . X .
  X O X           X O X
  X X X           . X .

O = focal unit
X = neighbors
. = not neighbors
```
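
The diagram can be reproduced numerically. The sketch below builds both matrices for a regular grid using only NumPy; `grid_contiguity` is a hypothetical helper written for this illustration, not a libpysal function.

```python
import numpy as np

def grid_contiguity(rows, cols, rule="queen"):
    """Binary contiguity matrix for a regular rows x cols grid (illustrative helper)."""
    n = rows * cols
    W = np.zeros((n, n), dtype=int)
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a unit is not its own neighbor
                    if rule == "rook" and dr != 0 and dc != 0:
                        continue  # rook ignores diagonal (vertex-only) contacts
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        W[r * cols + c, rr * cols + cc] = 1
    return W

W_queen = grid_contiguity(3, 3, "queen")
W_rook = grid_contiguity(3, 3, "rook")

center = 4  # middle cell of the 3x3 grid
print("Queen neighbors of center:", W_queen[center].sum())  # 8
print("Rook neighbors of center:", W_rook[center].sum())    # 4
```

Corner cells have 3 queen neighbors and 2 rook neighbors, so the gap between the two rules depends on where a unit sits in the layout.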

---

In [None]:
# Load US counties data
print("Loading US counties data...")

# Check if data exists
if not data_path.exists():
    print(f"⚠ Data path does not exist: {data_path}")
    print("Please ensure the data files are in the correct location.")
else:
    print(f"✓ Data directory found: {data_path}")
    
# For this example, we'll create synthetic county data if real data is unavailable
# In production, replace with actual data loading

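try:
    shp_file = data_path / "us_counties.shp"
    csv_file = data_path / "us_counties.csv"
    # gpd.read_file may raise a driver-specific error rather than
    # FileNotFoundError for a missing file, so check existence explicitly
    if not (shp_file.exists() and csv_file.exists()):
        raise FileNotFoundError(f"Missing {shp_file} or {csv_file}")
    counties_geo = gpd.read_file(shp_file)
    counties_data = pd.read_csv(csv_file)
    counties = counties_geo.merge(counties_data, on='county_id')
    print(f"✓ Loaded {len(counties)} counties")
except FileNotFoundError:
    print("⚠ Real data not found. Creating synthetic example...")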
    # Create synthetic grid of counties for demonstration
    from shapely.geometry import Polygon
    
    grid_size = 10
    polygons = []
    county_ids = []
    incomes = []
    
    np.random.seed(42)
    for i in range(grid_size):
        for j in range(grid_size):
            x0, y0 = i, j
            poly = Polygon([(x0, y0), (x0+1, y0), (x0+1, y0+1), (x0, y0+1)])
            polygons.append(poly)
            county_ids.append(f"C{i:02d}{j:02d}")
            incomes.append(np.random.lognormal(10, 0.3))
    
    counties = gpd.GeoDataFrame({
        'county_id': county_ids,
        'income_percapita': incomes,
        'geometry': polygons
    }, crs="EPSG:4326")
    
    print(f"✓ Created synthetic dataset with {len(counties)} units")

print(f"\nDataset summary:")
print(f"  Shape: {counties.shape}")
print(f"  CRS: {counties.crs}")
print(f"\nFirst few rows:")
print(counties.head())

In [None]:
# Build Queen contiguity weight matrix
print("="*60)
print("QUEEN CONTIGUITY WEIGHT MATRIX")
print("="*60)

w_queen = Queen.from_dataframe(counties)

print(f"Number of units: {w_queen.n}")
print(f"Number of non-zero weights: {w_queen.s0:.0f}")
print(f"Average neighbors: {w_queen.mean_neighbors:.2f}")
print(f"Min neighbors: {w_queen.min_neighbors}")
print(f"Max neighbors: {w_queen.max_neighbors}")
# pct_nonzero is already expressed as a percentage, so no extra scaling is needed
print(f"Percent nonzero: {w_queen.pct_nonzero:.2f}%")
print(f"Islands (units with no neighbors): {len(w_queen.islands)}")
print("="*60)

In [None]:
# Build Rook contiguity weight matrix
print("\nROOK CONTIGUITY WEIGHT MATRIX")
print("="*60)

w_rook = Rook.from_dataframe(counties)

print(f"Number of units: {w_rook.n}")
print(f"Number of non-zero weights: {w_rook.s0:.0f}")
print(f"Average neighbors: {w_rook.mean_neighbors:.2f}")
print(f"Min neighbors: {w_rook.min_neighbors}")
print(f"Max neighbors: {w_rook.max_neighbors}")
# pct_nonzero is already expressed as a percentage, so no extra scaling is needed
print(f"Percent nonzero: {w_rook.pct_nonzero:.2f}%")
print("="*60)

# Compare
diff = w_queen.mean_neighbors - w_rook.mean_neighbors
print(f"\nDifference in average neighbors: {diff:.2f}")
print("→ Queen typically has more neighbors (includes diagonal connections)")

In [None]:
# Visualize neighbor distribution: Queen vs Rook
queen_neighbors = [len(w_queen.neighbors[i]) for i in w_queen.neighbors]
rook_neighbors = [len(w_rook.neighbors[i]) for i in w_rook.neighbors]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Queen
axes[0].hist(queen_neighbors, bins=20, alpha=0.7, color='blue', edgecolor='black')
axes[0].set_title('Queen Contiguity', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Number of Neighbors', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].axvline(np.mean(queen_neighbors), color='red', linestyle='--', linewidth=2,
                label=f'Mean: {np.mean(queen_neighbors):.2f}')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Rook
axes[1].hist(rook_neighbors, bins=20, alpha=0.7, color='green', edgecolor='black')
axes[1].set_title('Rook Contiguity', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Number of Neighbors', fontsize=12)
axes[1].axvline(np.mean(rook_neighbors), color='red', linestyle='--', linewidth=2,
                label=f'Mean: {np.mean(rook_neighbors):.2f}')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(output_path / 'nb02_queen_vs_rook.png', dpi=300, bbox_inches='tight')
plt.show()

print("→ Queen contiguity typically produces slightly more neighbors per unit")
print("→ Units on the edge of the study area have fewer neighbors than interior units")

In [None]:
# Visualize connections on map
# Select one unit to highlight its neighbors
example_idx = min(len(counties) // 2, 50)  # Example unit picked by index (not necessarily geographically central)

fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Queen neighbors
ax = axes[0]
counties.plot(ax=ax, facecolor='lightgray', edgecolor='black', linewidth=0.5)
counties.iloc[[example_idx]].plot(ax=ax, facecolor='red', edgecolor='black', linewidth=2)

neighbor_indices_queen = list(w_queen.neighbors[example_idx])
if neighbor_indices_queen:
    counties.iloc[neighbor_indices_queen].plot(ax=ax, facecolor='yellow',
                                                edgecolor='black', linewidth=1)
ax.set_title(f'Queen Neighbors (n={len(neighbor_indices_queen)})',
             fontsize=14, fontweight='bold')
ax.axis('off')

# Rook neighbors
ax = axes[1]
counties.plot(ax=ax, facecolor='lightgray', edgecolor='black', linewidth=0.5)
counties.iloc[[example_idx]].plot(ax=ax, facecolor='red', edgecolor='black', linewidth=2)

neighbor_indices_rook = list(w_rook.neighbors[example_idx])
if neighbor_indices_rook:
    counties.iloc[neighbor_indices_rook].plot(ax=ax, facecolor='yellow',
                                               edgecolor='black', linewidth=1)
ax.set_title(f'Rook Neighbors (n={len(neighbor_indices_rook)})',
             fontsize=14, fontweight='bold')
ax.axis('off')

plt.tight_layout()
plt.savefig(output_path / 'nb02_queen_rook_example.png', dpi=300, bbox_inches='tight')
plt.show()

print("Legend:")
print("  Red = Focal unit")
print("  Yellow = Neighbors")
print("  Gray = Non-neighbors")
print(f"\n→ Queen includes {len(neighbor_indices_queen) - len(neighbor_indices_rook)} additional diagonal neighbors")

---

## 3. Distance-Based W Matrices

### Defining Neighbors by Geographic Distance

Distance-based weights are useful when:
- Contiguity is too restrictive (islands, irregular shapes)
- You want to model distance decay
- Point data (no polygons)

#### Distance Band (Threshold Distance)
- **Definition**: Units are neighbors if they lie within a threshold distance $d^*$ of each other
- **Binary weights**: $w_{ij} = 1$ if $d_{ij} \le d^*$ and $i \ne j$, else $w_{ij} = 0$
- **Challenge**: Choosing appropriate threshold $d^*$

#### Inverse Distance Weighting
- **Definition**: $w_{ij} = 1/d_{ij}^{\alpha}$ if within threshold, 0 otherwise
- **Rationale**: Closer neighbors have stronger influence
- **Variations**: $1/d_{ij}$, $1/d_{ij}^2$, $\exp(-d_{ij})$
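
A minimal sketch of both schemes on four hypothetical points follows. The coordinates and the threshold `d_star = 2.0` are assumptions chosen for illustration, and the cutoff is treated as inclusive here.

```python
import numpy as np

# Four hypothetical points on a line at x = 0, 1, 2, 4
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
diff = coords[:, None, :] - coords[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))        # pairwise Euclidean distances

d_star = 2.0                                 # assumed threshold
neighbor_mask = (D > 0) & (D <= d_star)      # drop self-pairs, apply the cutoff

W_binary = neighbor_mask.astype(float)       # distance band: equal weights
W_inverse = np.zeros_like(D)
W_inverse[neighbor_mask] = 1.0 / D[neighbor_mask]  # inverse distance, alpha = 1

print(W_binary[0])    # unit 0 is linked to units 1 and 2
print(W_inverse[0])   # unit 1 (distance 1) gets 1.0, unit 2 (distance 2) gets 0.5
print(W_binary[3])    # unit 3 only reaches unit 2
```

Both matrices share the same neighbor set; they differ only in how much each neighbor counts.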

---

In [None]:
# Extract centroids for distance calculations
centroids = counties.geometry.centroid
coords = np.array([[pt.x, pt.y] for pt in centroids])

print(f"Extracted {len(coords)} centroids")
print(f"Coordinate range:")
print(f"  X: [{coords[:, 0].min():.4f}, {coords[:, 0].max():.4f}]")
print(f"  Y: [{coords[:, 1].min():.4f}, {coords[:, 1].max():.4f}]")

In [None]:
# Compute pairwise distances
from scipy.spatial.distance import cdist

print("Computing pairwise distance matrix...")
dist_matrix = cdist(coords, coords, metric='euclidean')

print(f"Distance matrix shape: {dist_matrix.shape}")
print(f"Distance range: [{dist_matrix[dist_matrix > 0].min():.4f}, {dist_matrix.max():.4f}]")

# Determine a threshold from nearest-neighbor distances
# Strategy: the maximum nearest-neighbor distance would guarantee every unit
# at least one neighbor; a lower percentile gives a tighter band but can leave islands
nn_dists = []
for i in range(len(dist_matrix)):
    # Exclude self (distance = 0)
    non_zero_dists = dist_matrix[i][dist_matrix[i] > 0]
    if len(non_zero_dists) > 0:
        nn_dists.append(non_zero_dists.min())

# Use the 75th percentile so that roughly 75% of units have a neighbor within the band
threshold = np.percentile(nn_dists, 75)

print(f"\nDistance threshold determination:")
print(f"  Minimum nearest-neighbor distance: {np.min(nn_dists):.4f}")
print(f"  25th percentile: {np.percentile(nn_dists, 25):.4f}")
print(f"  50th percentile (median): {np.percentile(nn_dists, 50):.4f}")
print(f"  75th percentile: {threshold:.4f} ← Selected threshold")
print(f"  Maximum nearest-neighbor distance: {np.max(nn_dists):.4f}")
print(f"\n→ Threshold = {threshold:.4f} gives roughly 75% of units at least one neighbor")

In [None]:
# Build distance band W
print("\nDISTANCE BAND WEIGHT MATRIX")
print("="*60)

w_dist = DistanceBand.from_dataframe(counties, threshold=threshold)

print(f"Threshold: {threshold:.4f}")
print(f"Number of units: {w_dist.n}")
print(f"Average neighbors: {w_dist.mean_neighbors:.2f}")
print(f"Min neighbors: {w_dist.min_neighbors}")
print(f"Max neighbors: {w_dist.max_neighbors}")
print(f"Islands (units with no neighbors): {len(w_dist.islands)}")
print("="*60)

if len(w_dist.islands) > 0:
    print(f"\n⚠ Warning: {len(w_dist.islands)} islands detected!")
    print("  Consider increasing threshold or using k-NN")
else:
    print("\n✓ No islands - all units have at least one neighbor")

In [None]:
# Create inverse distance weight matrix
def inverse_distance_weights(dist_matrix, threshold, power=1):
    """
    Create inverse distance weight matrix.
    
    w_ij = 1 / d_ij^power if 0 < d_ij <= threshold and i != j
    w_ij = 0 otherwise
    
    The cutoff is inclusive so that units lying exactly at the threshold
    distance (as happens on a regular grid) are kept as neighbors.
    
    Parameters:
    -----------
    dist_matrix : np.ndarray
        Pairwise distance matrix
    threshold : float
        Maximum distance for neighbors
    power : float
        Exponent for distance decay (default=1)
    
    Returns:
    --------
    W : np.ndarray
        Inverse distance weight matrix
    """
    d = np.asarray(dist_matrix, dtype=float)
    W = np.zeros_like(d)
    
    # d > 0 drops the diagonal (and coincident points, avoiding division by zero)
    mask = (d > 0) & (d <= threshold)
    W[mask] = 1.0 / d[mask] ** power
    
    return W

# Create inverse distance W
W_inv_dist = inverse_distance_weights(dist_matrix, threshold, power=1)

print("INVERSE DISTANCE WEIGHT MATRIX")
print("="*60)
print(f"Power: 1 (weights proportional to 1/d)")
print(f"Threshold: {threshold:.4f}")
print(f"Non-zero elements: {(W_inv_dist > 0).sum()}")
print(f"Average weight (non-zero): {W_inv_dist[W_inv_dist > 0].mean():.4f}")
print("="*60)

In [None]:
# Compare binary vs inverse distance weights
print("\nCOMPARISON: Binary Distance Band vs Inverse Distance")
print("="*70)
print("Binary distance band: All neighbors weighted equally")
print("Inverse distance: Closer neighbors weighted more\n")

# Example for unit 0
unit_idx = 0
binary_weights = list(w_dist.weights[unit_idx])[:5]
inv_weights = W_inv_dist[unit_idx][W_inv_dist[unit_idx] > 0][:5]

print(f"Example: Unit {unit_idx}'s first 5 neighbors")
print(f"  Binary weights:          {binary_weights}")
print(f"  Inverse distance weights: {inv_weights}")
print("\n→ Inverse distance gives higher weight to closer neighbors")
print("="*70)

In [None]:
# Visualize distance decay functions
distances = np.linspace(0.01, threshold, 100)
weights_power1 = 1 / distances
weights_power2 = 1 / (distances ** 2)
weights_exp = np.exp(-distances)

plt.figure(figsize=(10, 6))
plt.plot(distances, weights_power1, label='$1/d$ (power=1)', linewidth=2)
plt.plot(distances, weights_power2, label='$1/d^2$ (power=2)', linewidth=2)
plt.plot(distances, weights_exp, label=r'$\exp(-d)$', linewidth=2, linestyle='--')
plt.xlabel('Distance', fontsize=12)
plt.ylabel('Weight', fontsize=12)
plt.title('Distance Decay Functions', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_path / 'nb02_distance_decay.png', dpi=300, bbox_inches='tight')
plt.show()

print("Key observations:")
print("→ Higher power = stronger decay with distance")
print("→ Closer neighbors dominate the spatial lag")
print("→ Exponential decay provides smooth, bounded weights")

---

## 4. k-Nearest Neighbors (k-NN)

### Fixed Number of Neighbors

k-NN weights define each unit's $k$ closest neighbors.

**Advantages**:
- Ensures all units have exactly $k$ neighbors (no islands)
- Works well with irregular spatial distributions
- Handles varying spatial densities

**Disadvantages**:
- **Asymmetric**: $i$ may be neighbor of $j$, but $j$ may not be neighbor of $i$
- Choice of $k$ is somewhat arbitrary

**Use cases**:
- Point data (cities, stores, households)
- Irregular spatial distributions
- When you want to guarantee connectivity
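
The asymmetry is easy to see on a tiny example. The sketch below builds a k-NN matrix by hand for three hypothetical points with `k = 1`; the coordinates are assumptions chosen so that one unit is relatively isolated.

```python
import numpy as np

# Three hypothetical points on a line; unit 2 sits farther away
x = np.array([0.0, 1.0, 3.0])
D = np.abs(x[:, None] - x[None, :])
np.fill_diagonal(D, np.inf)               # a unit cannot be its own neighbor

k = 1
nearest = np.argsort(D, axis=1)[:, :k]    # indices of the k closest units per row

W_knn = np.zeros((len(x), len(x)), dtype=int)
rows = np.repeat(np.arange(len(x)), k)
W_knn[rows, nearest.ravel()] = 1

print(W_knn)
print("Outgoing links per unit:", W_knn.sum(axis=1))  # always k
print("Incoming links per unit:", W_knn.sum(axis=0))  # varies across units
print("Symmetric:", np.array_equal(W_knn, W_knn.T))
```

Unit 2 lists unit 1 as its nearest neighbor, but unit 1's nearest neighbor is unit 0, so the link runs in one direction only.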

---

In [None]:
# Build k-NN weight matrices for different k values
k_values = [4, 8, 12]
w_knn_list = []

print("k-NEAREST NEIGHBORS WEIGHT MATRICES")
print("="*70)

for k in k_values:
    w_knn = KNN.from_dataframe(counties, k=k)
    w_knn_list.append(w_knn)
    
    print(f"\nk = {k}:")
    print(f"  Number of units: {w_knn.n}")
    print(f"  Average neighbors: {w_knn.mean_neighbors:.2f}")
    print(f"  Islands: {len(w_knn.islands)}")

print("\n" + "="*70)
print("→ k-NN guarantees exactly k neighbors for each unit (no islands!)")

In [None]:
# Visualize neighbor distribution for different k values
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (k, w_knn) in enumerate(zip(k_values, w_knn_list)):
    # Every unit has exactly k outgoing neighbors, so the out-degree is constant.
    # Because k-NN is asymmetric, the number of incoming links varies:
    # count how often each unit appears as a column index in the sparse W.
    neighbor_counts = np.bincount(w_knn.sparse.indices, minlength=w_knn.n)
    
    axes[idx].hist(neighbor_counts, bins=15, alpha=0.7, color='purple', edgecolor='black')
    axes[idx].set_title(f'k-NN (k={k})', fontsize=14, fontweight='bold')
    axes[idx].set_xlabel('Number of Incoming Neighbor Links', fontsize=12)
    axes[idx].set_ylabel('Frequency', fontsize=12)
    axes[idx].axvline(k, color='red', linestyle='--', linewidth=2,
                     label=f'k={k}')
    axes[idx].legend(fontsize=11)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(output_path / 'nb02_knn_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("Observations:")
print("→ All units have exactly k outgoing neighbors")
print("→ Distribution shows number of incoming neighbor links")
print("→ Choice of k is subjective - try multiple values in sensitivity analysis")

In [None]:
# Demonstrate k-NN asymmetry
w_knn_8 = w_knn_list[1]  # k=8

# Find an asymmetric relationship
asymmetric_found = False
for example_i in range(min(100, len(counties))):
    # Check every neighbor of i, not just the first one
    for example_j in w_knn_8.neighbors[example_i]:
        i_neighbors_j = True  # j is in i's neighbor list by construction
        j_neighbors_i = example_i in w_knn_8.neighbors[example_j]
        if not j_neighbors_i:
            asymmetric_found = True
            break
    if asymmetric_found:
        break

print("k-NN ASYMMETRY DEMONSTRATION")
print("="*60)

if asymmetric_found:
    print(f"Unit {example_i} considers Unit {example_j} a neighbor: {i_neighbors_j}")
    print(f"Unit {example_j} considers Unit {example_i} a neighbor: {j_neighbors_i}")
    print("\n→ Asymmetric relationship detected!")
    print("→ This is normal for k-NN but not for contiguity-based W")
    print("→ Unit j can be among i's k nearest while i is not among j's k nearest")
else:
    print("Note: In this spatial configuration, asymmetry may be minimal")
    print("→ Asymmetry more pronounced with irregular spatial distributions")

print("="*60)

---

## 5. Row Normalization

### Why Normalize W? Interpretation and Stability

**Row normalization** divides each row by its sum, so rows sum to 1:

$$w_{ij}^{\text{norm}} = \frac{w_{ij}}{\sum_k w_{ik}}$$

**Why normalize?**

1. **Interpretation**: $Wy$ becomes weighted average of neighbors' $y$ values
2. **Mathematical stability**: Keeps spatial parameter $\rho$ in bounded range
3. **Comparability**: Makes results comparable across different $W$ specifications

**Trade-offs**:
- Lose absolute distance information in inverse distance weights
- Units with many neighbors get smaller individual weights

**Common transformations**:
- `'b'`: Binary (un-normalized)
- `'r'`: Row-normalized (most common)
- `'v'`: Variance-stabilizing
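
As a worked example of the formula above, the sketch below row-normalizes a small binary matrix with NumPy and compares the two spatial lags. The matrix and the `y` values are made up for illustration, and the division assumes every unit has at least one neighbor.

```python
import numpy as np

# Hypothetical binary contiguity matrix for 4 units
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
], dtype=float)
y = np.array([10.0, 20.0, 30.0, 40.0])

row_sums = W.sum(axis=1, keepdims=True)  # number of neighbors per unit
W_r = W / row_sums                        # each row now sums to 1

print("Un-normalized lag: ", W @ y)       # sums of neighbors' values
print("Row-normalized lag:", W_r @ y)     # averages of neighbors' values
```

Unit 0 has neighbors with values 20 and 30, so its un-normalized lag is 50 and its row-normalized lag is 25, which is on the same scale as `y` itself.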

---

In [None]:
# Compare un-normalized and row-normalized W
w_queen_unnorm = Queen.from_dataframe(counties)
w_queen_unnorm.transform = 'b'  # Binary (un-normalized)

w_queen_norm = Queen.from_dataframe(counties)
w_queen_norm.transform = 'r'  # Row-normalized

# Extract weights for unit 0
unit_idx = 0
row_unnorm = np.array(list(w_queen_unnorm.weights[unit_idx]))
row_norm = np.array(list(w_queen_norm.weights[unit_idx]))

print("ROW NORMALIZATION COMPARISON")
print("="*60)
print(f"Unit {unit_idx} neighbor weights:")
print(f"\n  Un-normalized:")
print(f"    Weights: {row_unnorm[:5]} ... (showing first 5)")
print(f"    Sum: {row_unnorm.sum():.1f}")
print(f"\n  Row-normalized:")
print(f"    Weights: {row_norm[:5]} ... (showing first 5)")
print(f"    Sum: {row_norm.sum():.4f}")
print("\nInterpretation:")
print("  Un-normalized → Spatial lag = SUM of neighbors' values")
print("  Row-normalized → Spatial lag = WEIGHTED AVERAGE of neighbors' values")
print("="*60)

In [None]:
# Compute spatial lag with both normalizations
income = counties['income_percapita'].values

income_lag_unnorm = weights.lag_spatial(w_queen_unnorm, income)
income_lag_norm = weights.lag_spatial(w_queen_norm, income)

print("SPATIAL LAG COMPARISON")
print("="*60)
print(f"Original income:")
print(f"  Mean: {income.mean():.2f}")
print(f"  Std: {income.std():.2f}")
print(f"  Range: [{income.min():.2f}, {income.max():.2f}]")

print(f"\nUn-normalized spatial lag:")
print(f"  Mean: {income_lag_unnorm.mean():.2f}")
print(f"  Std: {income_lag_unnorm.std():.2f}")
print(f"  Range: [{income_lag_unnorm.min():.2f}, {income_lag_unnorm.max():.2f}]")

print(f"\nRow-normalized spatial lag:")
print(f"  Mean: {income_lag_norm.mean():.2f}")
print(f"  Std: {income_lag_norm.std():.2f}")
print(f"  Range: [{income_lag_norm.min():.2f}, {income_lag_norm.max():.2f}]")
print("="*60)
print("\n→ Row-normalized spatial lag has same scale as original variable")
print("→ Un-normalized spatial lag increases with number of neighbors")

In [None]:
# Visualize normalization effect
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Un-normalized
axes[0].scatter(income, income_lag_unnorm, alpha=0.5, edgecolors='k', s=50)
axes[0].set_xlabel('Income per Capita', fontsize=12)
axes[0].set_ylabel('Spatial Lag (Un-normalized)', fontsize=12)
axes[0].set_title('Un-normalized W: Sum of Neighbors', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Row-normalized
axes[1].scatter(income, income_lag_norm, alpha=0.5, edgecolors='k', s=50, color='orange')
axes[1].set_xlabel('Income per Capita', fontsize=12)
axes[1].set_ylabel('Spatial Lag (Row-normalized)', fontsize=12)
axes[1].set_title('Row-normalized W: Average of Neighbors', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# Add 45-degree line to normalized plot
lims = [min(income.min(), income_lag_norm.min()),
        max(income.max(), income_lag_norm.max())]
axes[1].plot(lims, lims, 'r--', alpha=0.5, linewidth=2, label='45° line')
axes[1].legend()

plt.tight_layout()
plt.savefig(output_path / 'nb02_normalization_effect.png', dpi=300, bbox_inches='tight')
plt.show()

print("Visual insights:")
print("→ Right plot shows spatial lag on same scale as original variable")
print("→ Points near 45° line indicate strong positive spatial autocorrelation")
print("→ Row-normalized lag has clear interpretation as neighborhood average")

---

## 6. W Properties: Eigenvalues and Bounds

### Mathematical Properties of W

Understanding the eigenvalues of $W$ is crucial for spatial econometrics:

**Eigenvalues** ($\lambda_1, \lambda_2, ..., \lambda_n$):
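- Determine bounds for the spatial parameters ($\rho$ in SAR/SDM, $\lambda$ in SEM; the SEM parameter $\lambda$ is distinct from the eigenvalues $\lambda_i$ of $W$)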
- For row-normalized $W$: largest eigenvalue $\lambda_{\max} = 1$

**Parameter Bounds**:
$$\frac{1}{\lambda_{\min}} < \rho < \frac{1}{\lambda_{\max}}$$
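
The bounds can be checked on a small example. This sketch uses a hypothetical 4-unit layout and plain NumPy; it confirms that $I - \rho W$ is invertible for a value of $\rho$ chosen inside the interval.

```python
import numpy as np

# Hypothetical layout: units 0, 1, 2 are mutual neighbors; unit 3 touches only unit 1
W_binary = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
], dtype=float)
W = W_binary / W_binary.sum(axis=1, keepdims=True)  # row-normalize

# Eigenvalues are real here because the underlying binary matrix is symmetric
eigenvalues = np.linalg.eigvals(W).real
lam_min, lam_max = eigenvalues.min(), eigenvalues.max()
lower, upper = 1 / lam_min, 1 / lam_max

print(f"lambda_min = {lam_min:.4f}, lambda_max = {lam_max:.4f}")
print(f"rho must lie in ({lower:.4f}, {upper:.4f})")

rho = 0.5 * upper                 # an assumed value inside the interval
A = np.eye(4) - rho * W
print("I - rho W invertible:", abs(np.linalg.det(A)) > 1e-12)
```

Since $\lambda_{\min}$ is negative, the lower bound $1/\lambda_{\min}$ is negative too, and for a row-normalized $W$ it is never greater than $-1$.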

**Sparsity**:
- Most spatial weight matrices are **sparse** (many zero elements)
- Sparse methods enable efficient computation for large $n$

**Summary Statistics**:
- $s_0 = \sum_i \sum_j w_{ij}$: Sum of all weights
- $s_1 = \frac{1}{2} \sum_i \sum_j (w_{ij} + w_{ji})^2$
- $s_2 = \sum_i (\sum_j w_{ij} + \sum_j w_{ji})^2$
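
These three statistics follow directly from the formulas. Here is a minimal NumPy sketch on a hypothetical binary symmetric matrix with four links, so each link appears twice in $W$.

```python
import numpy as np

# Hypothetical binary, symmetric W for 4 units (4 undirected links)
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

s0 = W.sum()                                        # sum of all weights
s1 = 0.5 * ((W + W.T) ** 2).sum()                   # pairwise term
s2 = ((W.sum(axis=1) + W.sum(axis=0)) ** 2).sum()   # row-plus-column sums, squared

print(f"s0 = {s0:.0f}, s1 = {s1:.0f}, s2 = {s2:.0f}")  # s0 = 8, s1 = 16, s2 = 72
```

For a binary symmetric matrix $s_1 = 2 s_0$, because every non-zero pair contributes $(1 + 1)^2 = 4$ before the factor $\tfrac{1}{2}$.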

---

In [None]:
# Compute eigenvalues of row-normalized Queen W
w_queen_norm = Queen.from_dataframe(counties)
w_queen_norm.transform = 'r'

print("Computing eigenvalues...")
print("(This may take a moment for large N)\n")

# Extract as dense array (be careful with large N!)
if w_queen_norm.n <= 5000:  # Only for reasonably sized matrices
    W_dense = w_queen_norm.full()[0]  # Returns (array, ids)
    eigenvalues = np.linalg.eigvals(W_dense).real
    lambda_max = eigenvalues.max()
    lambda_min = eigenvalues.min()
    
    print("EIGENVALUE ANALYSIS")
    print("="*60)
    print(f"Matrix dimension: {w_queen_norm.n} × {w_queen_norm.n}")
    print(f"\nEigenvalues:")
    print(f"  Maximum: λ_max = {lambda_max:.6f}")
    print(f"  Minimum: λ_min = {lambda_min:.6f}")
    print(f"  Range: [{lambda_min:.6f}, {lambda_max:.6f}]")
    print(f"\nBounds for spatial autoregressive parameter ρ:")
    print(f"  Lower bound: 1/λ_min = {1/lambda_min:.4f}")
    print(f"  Upper bound: 1/λ_max = {1/lambda_max:.4f}")
    print(f"\n→ Estimated ρ must lie in the open interval ({1/lambda_min:.4f}, {1/lambda_max:.4f})")
    print("→ For row-normalized W, λ_max ≈ 1")
    print("="*60)
else:
    print(f"Matrix too large ({w_queen_norm.n} × {w_queen_norm.n}) for dense eigenvalue computation")
    print("Skipping eigenvalue analysis (requires sparse methods for large N)")
    eigenvalues = None

In [None]:
# Plot eigenvalue distribution
if eigenvalues is not None:
    plt.figure(figsize=(10, 6))
    plt.hist(eigenvalues, bins=50, alpha=0.7, color='teal', edgecolor='black')
    plt.axvline(lambda_max, color='red', linestyle='--', linewidth=2,
                label=f'λ_max = {lambda_max:.3f}')
    plt.axvline(lambda_min, color='blue', linestyle='--', linewidth=2,
                label=f'λ_min = {lambda_min:.3f}')
    plt.axvline(0, color='black', linestyle='-', linewidth=1, alpha=0.5)
    plt.xlabel('Eigenvalue', fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
    plt.title('Eigenvalue Distribution of Row-Normalized W', fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(output_path / 'nb02_eigenvalues.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("Key observations:")
    print("→ Largest eigenvalue ≈ 1 for row-normalized W")
    print("→ Eigenvalue distribution reflects spatial structure")
    print("→ Bounds ensure stationarity of spatial process")

In [None]:
# Sparsity analysis
n = w_queen_norm.n
max_possible_links = n * (n - 1)  # Exclude diagonal
# s0 (the sum of weights) equals n after row normalization, so count non-zero weights instead
actual_links = w_queen_norm.nonzero
sparsity = 1 - (actual_links / max_possible_links)

print("SPARSITY ANALYSIS")
print("="*60)
print(f"Number of units: {n}")
print(f"Maximum possible links: {max_possible_links:,}")
print(f"Actual non-zero links: {actual_links:.0f}")
print(f"Sparsity: {sparsity * 100:.2f}%")
print(f"\n→ {sparsity * 100:.1f}% of matrix elements are zero")
print("→ Sparse matrix methods enable efficient computation")
print("→ This is why spatial econometrics scales to large datasets")
print("="*60)

---

## 7. Comparing Different W Specifications

### Does Choice of W Matter? Sensitivity Analysis

An important question: **Are our results sensitive to W specification?**

**Best practice**: Test multiple reasonable W specifications and assess robustness.

**What to compare**:
1. Spatial autocorrelation statistics (Moran's I)
2. Model parameter estimates (in regression context)
3. Statistical significance
4. Qualitative conclusions

**Interpretation**:
- Results **robust** across W → Strong evidence of spatial effects
- Results **sensitive** to W → Need theoretical justification for W choice
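
To illustrate how a statistic shifts with $W$, the sketch below computes Moran's I by hand, $I = \frac{n}{s_0} \frac{z'Wz}{z'z}$, under two row-normalized matrices. The six-unit line, the trending `y`, and the helper names `row_normalize` and `morans_i` are assumptions for this example; inference (p-values) is left out.

```python
import numpy as np

def row_normalize(W):
    """Divide each row by its sum (assumes every unit has at least one neighbor)."""
    return W / W.sum(axis=1, keepdims=True)

def morans_i(y, W):
    """Moran's I = (n / s0) * (z' W z) / (z' z), with z the demeaned variable."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

# Six hypothetical units on a line with a smooth trend in y
pos = np.arange(6, dtype=float)
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
D = np.abs(pos[:, None] - pos[None, :])

W_adjacent = row_normalize(((D > 0) & (D <= 1)).astype(float))  # immediate neighbors
W_wider = row_normalize(((D > 0) & (D <= 2)).astype(float))     # neighbors within 2 steps

I_adjacent = morans_i(y, W_adjacent)
I_wider = morans_i(y, W_wider)
print(f"Moran's I, adjacent-only W: {I_adjacent:.3f}")  # 0.714
print(f"Moran's I, wider-band W:    {I_wider:.3f}")     # 0.457
```

Both specifications agree on the sign, but the magnitude drops as the neighborhood widens, which is the kind of pattern a sensitivity analysis is meant to surface.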

---

In [None]:
# Create multiple W specifications
print("Creating multiple W specifications for comparison...\n")

w_specs = {
    'Queen': Queen.from_dataframe(counties),
    'Rook': Rook.from_dataframe(counties),
    'k-NN (k=8)': KNN.from_dataframe(counties, k=8),
}

# Add distance band if no islands
if len(w_dist.islands) == 0:
    w_specs['Distance Band'] = w_dist

# Row-normalize all
for w in w_specs.values():
    w.transform = 'r'

print(f"Comparing {len(w_specs)} W specifications:")
for name in w_specs.keys():
    print(f"  ✓ {name}")

In [None]:
# Compute Moran's I for each W specification
print("\nComputing Moran's I for income per capita...\n")

morans_i_results = []
income = counties['income_percapita']

for name, w in w_specs.items():
    mi = Moran(income, w)
    morans_i_results.append({
        'W Specification': name,
        'Moran I': mi.I,
        'E[I]': mi.EI,
        'p-value': mi.p_sim,
        'z-score': mi.z_sim
    })

mi_df = pd.DataFrame(morans_i_results)

print("MORAN'S I SENSITIVITY ANALYSIS")
print("="*80)
print(mi_df.to_string(index=False))
print("="*80)

# Statistical summary
print(f"\nStatistical Summary:")
print(f"  Moran's I range: [{mi_df['Moran I'].min():.4f}, {mi_df['Moran I'].max():.4f}]")
print(f"  Mean Moran's I: {mi_df['Moran I'].mean():.4f}")
print(f"  Std. Dev.: {mi_df['Moran I'].std():.4f}")
print(f"  All significant (p < 0.05): {(mi_df['p-value'] < 0.05).all()}")

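print(f"\nObservations:")
all_significant = (mi_df['p-value'] < 0.05).all()
if all_significant:
    print("→ Spatial autocorrelation is significant under every W specification")
    print("→ Conclusions are qualitatively robust to the choice of W")
else:
    print("→ Significance is absent or varies across W specifications")
    print("→ Note: the synthetic fallback data are independent draws, so little autocorrelation is expected")
print("→ Compare magnitudes above: Moran's I is rarely identical across W specifications")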

In [None]:
# Visualize sensitivity
fig, ax = plt.subplots(figsize=(10, 6))

# Color by significance
colors = ['green' if p < 0.05 else 'gray' for p in mi_df['p-value']]
bars = ax.bar(mi_df['W Specification'], mi_df['Moran I'],
              color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)

# Add expected value line
ax.axhline(mi_df['E[I]'].iloc[0], color='red', linestyle='--', linewidth=2,
           label=f"E[I] = {mi_df['E[I]'].iloc[0]:.4f} (null hypothesis)")

ax.set_ylabel("Moran's I", fontsize=12)
ax.set_xlabel("W Specification", fontsize=12)
ax.set_title("Sensitivity of Moran's I to W Specification", fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, axis='y', alpha=0.3)
plt.xticks(rotation=15, ha='right')

# Add value labels on bars
for i, (bar, val) in enumerate(zip(bars, mi_df['Moran I'])):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{val:.4f}',
            ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.savefig(output_path / 'nb02_morans_i_sensitivity.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nGreen bars = Statistically significant (p < 0.05)")
print("Gray bars = Not significant")

---

## 8. Custom Economic W Matrices

### Beyond Geography: Economic Distance

Not all spatial relationships are geographic! You can also define **economic weight matrices** from non-geographic linkages between units.

**Examples of Economic W**:
1. **Trade flows**: $w_{ij}$ = volume of trade between regions $i$ and $j$
2. **Migration**: $w_{ij}$ = migration flow from $j$ to $i$
3. **Input-output linkages**: $w_{ij}$ = sector interdependence
4. **Technological similarity**: Based on patent citations, R&D networks
5. **Institutional similarity**: Similar governance, policies, regulations

**When to use economic W**:
- Trade spillovers (gravity models)
- Technology diffusion
- Financial contagion (interbank networks)
- Policy imitation (similar jurisdictions)

**Key insight**: "Neighbors" in spatial econometrics can be defined by ANY meaningful connection, not just geography.
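
As one possible illustration, the sketch below turns a bilateral trade-flow table into a row-normalized $W$. The flow values are invented, and reading rows as the receiving region and columns as the partner is an assumption made for this example.

```python
import numpy as np

# Hypothetical trade flows: entry [i, j] is the flow linking region i to partner j
flows = np.array([
    [50.0, 30.0, 10.0],
    [20.0, 80.0,  0.0],
    [ 5.0, 15.0, 40.0],
])

W_trade = flows.copy()
np.fill_diagonal(W_trade, 0.0)   # no self-influence: w_ii = 0

# Row-normalize, leaving all-zero rows untouched to avoid division by zero
row_sums = W_trade.sum(axis=1, keepdims=True)
W_trade = np.divide(W_trade, row_sums,
                    out=np.zeros_like(W_trade), where=row_sums > 0)

print(W_trade)
print("Symmetric:", np.allclose(W_trade, W_trade.T))
```

Flow-based matrices are generally asymmetric: region 1 depends entirely on region 0 in this toy table, while region 0 spreads its weight over two partners.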

---

In [None]:
# Simulate economic similarity W matrix
# (In practice, use actual data: trade, migration, etc.)

np.random.seed(42)
n_regions = min(50, len(counties))  # Use subset for demonstration

print("CREATING ECONOMIC SIMILARITY WEIGHT MATRIX")
print("="*60)
print("Simulating economic characteristics...\n")

# Simulate economic characteristics
gdp = np.random.lognormal(10, 0.5, n_regions)
industry_mix = np.random.dirichlet(np.ones(5), n_regions)  # 5 industries

print(f"Created {n_regions} regions with:")
print(f"  • GDP (log-normal distribution)")
print(f"  • Industry mix (5 sectors: Manufacturing, Services, Agriculture, Tech, Energy)")
print(f"\nComputing similarity based on industry structure...")

In [None]:
# Compute similarity based on industry structure
from scipy.spatial.distance import cosine

W_econ = np.zeros((n_regions, n_regions))
similarity_threshold = 0.7  # Only consider highly similar regions

for i in range(n_regions):
    for j in range(n_regions):
        if i != j:
            # Cosine similarity of industry vectors
            similarity = 1 - cosine(industry_mix[i], industry_mix[j])
            W_econ[i, j] = similarity if similarity > similarity_threshold else 0

# Row-normalize
row_sums = W_econ.sum(axis=1)
row_sums[row_sums == 0] = 1  # Avoid division by zero
W_econ_norm = W_econ / row_sums[:, None]

print("\nECONOMIC SIMILARITY W MATRIX")
print("="*60)
print(f"Based on: Industry structure similarity (cosine similarity)")
print(f"Similarity threshold: {similarity_threshold}")
print(f"Dimension: {W_econ_norm.shape}")
print(f"Average neighbors: {(W_econ_norm > 0).sum(axis=1).mean():.2f}")
print(f"Min neighbors: {(W_econ_norm > 0).sum(axis=1).min():.0f}")
print(f"Max neighbors: {(W_econ_norm > 0).sum(axis=1).max():.0f}")
print(f"Sparsity: {100 * (W_econ_norm == 0).sum() / W_econ_norm.size:.1f}%")
print("="*60)
print("\nInterpretation:")
print("→ Regions with similar industry structure are 'neighbors'")
print("→ Geography may be irrelevant for technology/policy spillovers")
print("→ Useful for studying inter-industry linkages and structural shocks")

In [None]:
# Visualize economic similarity network

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: Heatmap of W_econ
im = axes[0].imshow(W_econ_norm, cmap='YlOrRd', aspect='auto')
axes[0].set_title('Economic Similarity Matrix (Row-Normalized)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Region j', fontsize=12)
axes[0].set_ylabel('Region i', fontsize=12)
plt.colorbar(im, ax=axes[0], label='Weight')

# Right: Distribution of neighbor counts
neighbor_counts_econ = (W_econ_norm > 0).sum(axis=1)
axes[1].hist(neighbor_counts_econ, bins=15, alpha=0.7, color='orange', edgecolor='black')
axes[1].set_title('Distribution of Economic Neighbors', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Number of Economic Neighbors', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].axvline(neighbor_counts_econ.mean(), color='red', linestyle='--', linewidth=2,
                label=f'Mean: {neighbor_counts_econ.mean():.2f}')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(output_path / 'nb02_economic_w.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n→ Left: Darker red = larger weight (pale yellow = zero or weak link)")
print("→ Right: Some regions highly connected, others isolated")
print("→ This reflects heterogeneity in economic structure")

---

## 9. Practical Recommendations

### How to Choose the Right W for Your Research

**Decision Tree for W Selection**:

```
1. Is your research question fundamentally geographic?
   YES → Use contiguity or distance-based W
   NO  → Consider economic W (trade, similarity, networks)

2. If geographic:
   a. Regular grid or administrative boundaries?
      → Queen or Rook contiguity
   
   b. Irregular spacing (cities, houses, points)?
      → k-NN or distance band
   
   c. Island units possible?
      → k-NN (guarantees neighbors)

3. How much do distant neighbors matter?
   a. Only immediate neighbors → Contiguity
   b. Decay with distance → Inverse distance
   c. Fixed influence range → Distance band

4. Is sensitivity analysis feasible?
   → Test 2-3 different W specifications
   → Check if results are qualitatively robust
```
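
Step 3 of the decision tree can be illustrated with plain NumPy and SciPy. The sketch below builds a distance-band, an inverse-distance, and a k-NN matrix from five made-up point coordinates. The threshold (1.5) and k (2) are arbitrary illustrative choices, and a dedicated spatial library would normally handle this in practice.

```python
import numpy as np
from scipy.spatial import distance_matrix

# Five made-up points (x, y); purely illustrative coordinates
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]])
n = len(coords)
D = distance_matrix(coords, coords)

# Distance band: binary weight for pairs within the threshold
threshold = 1.5
W_band = ((D > 0) & (D <= threshold)).astype(float)

# Inverse distance: weight decays as 1/d (diagonal kept at zero)
W_inv = np.zeros_like(D)
off_diag = ~np.eye(n, dtype=bool)
W_inv[off_diag] = 1.0 / D[off_diag]

# k-NN: each unit is linked to its k closest units
k = 2
D_no_self = D.copy()
np.fill_diagonal(D_no_self, np.inf)          # exclude self from the ranking
nearest = np.argsort(D_no_self, axis=1)[:, :k]
W_knn = np.zeros((n, n))
W_knn[np.arange(n)[:, None], nearest] = 1.0

print("Neighbors per unit (band):", W_band.sum(axis=1))
print("Neighbors per unit (k-NN):", W_knn.sum(axis=1))
print("Band symmetric:", np.array_equal(W_band, W_band.T))
print("k-NN symmetric:", np.array_equal(W_knn, W_knn.T))
```

The k-NN matrix gives every unit exactly k neighbors but is not symmetric here: the two remote points link to the nearby cluster, while no point in that cluster links back to them.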

### Best Practices

1. **Justify theoretically**: Why should these units influence each other?
2. **Normalize (usually)**: Row-normalization aids interpretation
3. **Check for islands**: A unit with zero neighbors has an all-zero row in W, so its spatial lag is zero and its row cannot be normalized
4. **Sensitivity analysis**: Try multiple W, report robustness
5. **Document clearly**: Always report W specification in papers
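
Practices 2 and 3 interact, because an island row has nothing to normalize. A minimal sketch of checking for islands while row-normalizing, using a hypothetical helper `row_normalize_with_check` and a toy binary matrix:

```python
import numpy as np

def row_normalize_with_check(W):
    """Row-normalize W and report islands (rows with no neighbors).

    Hypothetical helper for illustration only.
    """
    W = np.asarray(W, dtype=float)
    row_sums = W.sum(axis=1)
    islands = np.flatnonzero(row_sums == 0)
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)  # island rows stay all-zero
    return W / safe_sums[:, None], islands

# Toy binary contiguity matrix; unit 3 touches nobody
W_bin = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
])

W_std, islands = row_normalize_with_check(W_bin)
y = np.array([10.0, 20.0, 30.0, 40.0])

print("Islands:", islands)
print("Spatial lag Wy:", W_std @ y)
```

For connected units the spatial lag is the average of their neighbors' values. For the island it is 0, which is an artifact of the empty row and not a meaningful average, so islands should be handled explicitly (for example by switching to k-NN).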

---

In [None]:
# Automated W recommendation function
def recommend_weight_matrix(gdf):
    """
    Suggest appropriate W matrix based on data characteristics.
    
    Parameters:
    -----------
    gdf : GeoDataFrame
        Spatial data
    
    Returns:
    --------
    dict : Recommendations and diagnostics
    """
    n = len(gdf)
    
    # Check for islands with Queen
    w_test = Queen.from_dataframe(gdf)
    islands = w_test.islands
    
    recommendations = []
    warnings = []
    
    # Island check
    if len(islands) == 0:
        recommendations.append("✓ Queen/Rook contiguity (no islands detected)")
    else:
        warnings.append(f"⚠ {len(islands)} islands with contiguity")
        recommendations.append("→ Recommend k-NN to ensure all units have neighbors")
    
    # Dataset size
    if n < 500:
        recommendations.append("✓ Dataset small enough for any W type")
    elif n < 5000:
        recommendations.append("✓ Medium dataset → All W types feasible")
    else:
        recommendations.append("⚠ Large dataset → Prefer sparse W for computational efficiency")
    
    # Geometry type
    geom_type = gdf.geometry.iloc[0].geom_type
    if geom_type == 'Point':
        recommendations.append("→ Point data detected: k-NN or distance band recommended")
    elif geom_type in ['Polygon', 'MultiPolygon']:
        recommendations.append("→ Polygon data: Contiguity-based W appropriate")
    
    return {
        'n_units': n,
        'islands': len(islands),
        'geometry_type': geom_type,
        'warnings': warnings,
        'recommendations': recommendations
    }

# Run recommendation
rec = recommend_weight_matrix(counties)

print("\nW MATRIX RECOMMENDATION SYSTEM")
print("="*60)
print(f"Dataset characteristics:")
print(f"  Number of units: {rec['n_units']}")
print(f"  Geometry type: {rec['geometry_type']}")
print(f"  Islands (contiguity): {rec['islands']}")

if rec['warnings']:
    print(f"\nWarnings:")
    for w in rec['warnings']:
        print(f"  {w}")

print(f"\nRecommendations:")
for r in rec['recommendations']:
    print(f"  {r}")

print("="*60)

---

## 10. Summary and Next Steps

### What We've Learned

**Key Takeaways**:

1. ✓ **W matrix is specified, not estimated** - it embodies your theoretical assumptions
2. ✓ **Multiple valid specifications exist**: contiguity, distance, k-NN, economic
3. ✓ **Row normalization**: Makes rows sum to 1 → interpretable as weighted averages
4. ✓ **Eigenvalues of W determine bounds** for the spatial autoregressive parameters ($\rho$ in SAR, $\lambda$ in SEM)
5. ✓ **Results should be robust** to reasonable W specifications (sensitivity analysis)
6. ✓ **Always justify W choice theoretically** in your research
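
Takeaway 4 can be checked numerically. For a row-normalized W built from a symmetric binary matrix, the eigenvalues are real and the largest equals 1, so $\rho$ is commonly restricted to $1/\omega_{\min} < \rho < 1$, where $\omega_{\min}$ is the most negative eigenvalue. The sketch below uses a made-up 4-unit layout (a triangle with one extra unit attached), not the county data from this notebook.

```python
import numpy as np

# Made-up symmetric binary W: units 0, 1, 2 are mutual neighbors; unit 3 touches unit 2
W_bin = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Row-normalize (every unit has at least one neighbor here)
W_std = W_bin / W_bin.sum(axis=1, keepdims=True)

# Eigenvalues are real in this case; .real drops any numerical imaginary noise
omega = np.linalg.eigvals(W_std).real
omega_min, omega_max = omega.min(), omega.max()

rho_lower = 1.0 / omega_min
rho_upper = 1.0 / omega_max

print(f"Largest eigenvalue:  {omega_max:.4f}")
print(f"Smallest eigenvalue: {omega_min:.4f}")
print(f"Admissible range for rho: ({rho_lower:.4f}, {rho_upper:.4f})")
```

Within this range the matrix $I - \rho W$ is invertible, which the SAR model needs. The lower bound can fall below -1, so restricting $\rho$ to $(-1, 1)$ is a conservative convention, not a mathematical requirement.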

---

In [None]:
# W matrix comparison table
comparison_data = {
    'W Type': ['Queen', 'Rook', 'k-NN', 'Distance Band', 'Inverse Distance'],
    'Symmetric': ['Yes', 'Yes', 'No', 'Yes', 'Yes'],
    'Guarantees Neighbors': ['No', 'No', 'Yes', 'No', 'No'],
    'Geographic': ['Yes', 'Yes', 'Yes', 'Yes', 'Yes'],
    'Weights Vary': ['No*', 'No*', 'No*', 'No*', 'Yes'],
    'Primary Use Case': [
        'General polygons',
        'Cardinal adjacency',
        'Irregular/points',
        'Fixed radius',
        'Distance decay'
    ]
}

comp_df = pd.DataFrame(comparison_data)

print("\nW MATRIX COMPARISON SUMMARY")
print("="*90)
print(comp_df.to_string(index=False))
print("="*90)
print("* Binary weights before normalization")
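print("\nNote: All can be row-normalized for interpretation as weighted averages")
print("      'Symmetric' refers to the unnormalized matrix; row normalization generally breaks symmetry")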

### Learning Outcomes Achieved

After completing this notebook, you should be able to:

1. ✓ **Construct** Queen, Rook, k-NN, and distance-based W matrices
2. ✓ **Explain** when to use each W type and justify your choice
3. ✓ **Apply** row normalization and interpret spatial lags
4. ✓ **Compute** eigenvalues and understand parameter bounds
5. ✓ **Assess** sensitivity of results to W specification
6. ✓ **Design** custom economic W matrices for non-geographic relationships

---

### Next Steps: Notebook 03

**Spatial Lag Model (SAR)**

Now that you understand W matrices, you're ready to:
- Estimate Spatial Lag Models (SAR): $y = \rho Wy + X\beta + \varepsilon$
- Interpret the spatial autoregressive parameter $\rho$
- Understand spillover effects and spatial multipliers
- Perform model diagnostics

---

In [None]:
# Final summary
print("\n" + "="*60)
print("NOTEBOOK 02 COMPLETE: SPATIAL WEIGHT MATRICES")
print("="*60)
print("\nYou've mastered the foundation of spatial econometrics!")
print("\nGenerated outputs:")

output_files = list(output_path.glob('nb02_*.png'))
for f in output_files:
    print(f"  ✓ {f.name}")

print(f"\nAll figures saved to: {output_path}")
print("\n" + "="*60)
print("READY FOR NOTEBOOK 03: SPATIAL LAG MODEL (SAR)")
print("="*60)
print("\n→ Now you can build spatial regression models!")
print("→ Proceed to: 03_spatial_lag_model.ipynb")