## 10: Spatial Statistical Hotspot Detection and Cluster Analysis

**Goal:** To apply formal spatial statistics to identify statistically significant patterns in geographic data. We want to move beyond just visualizing data on a map to asking: "Is this cluster of high values I see real, or could it have occurred by random chance?"

This notebook will cover:
1.  **Spatial Weights Matrices:** The critical first step of defining "who is a neighbor" to whom.
2.  **Global Spatial Autocorrelation (Moran's I):** Calculating a single statistic to test if the overall data pattern is clustered, dispersed, or random.
3.  **Local Spatial Autocorrelation (LISA):** Identifying the specific locations of significant **hot spots** (clusters of high values), **cold spots** (clusters of low values), and **spatial outliers**.

We will use the `esda` (Exploratory Spatial Data Analysis) library, part of the `PySAL` ecosystem.

### 1. Setup and Library Imports

You will likely need to install `esda` and `libpysal`:
`pip install esda libpysal`

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as cx
import esda
from libpysal.weights import Queen

### 2. Load Data

We'll use the LSOA dataset from notebook 06 again. This time, we'll add a synthetic "crime rate" to each LSOA to have a variable to analyze.

In [None]:
# Load LSOA data (using placeholder creation from notebook 06 if not available)
try:
    lsoa_gdf = gpd.read_file('data/LSOA_2021_Boundaries_Full_Clipped.shp')
except Exception:
    print("Could not load local LSOA data. Using a placeholder GeoDataFrame.")
    from shapely.geometry import box
    import numpy as np
    xmin, ymin, xmax, ymax = -3.58, 50.68, -3.42, 50.78
    # 16 edges per axis define a 15 x 15 grid. Reusing the same edge values for
    # adjacent cells guarantees shared vertices, which Queen contiguity requires.
    x_edges = np.linspace(xmin, xmax, 16)
    y_edges = np.linspace(ymin, ymax, 16)
    grid_cells = []
    lsoa_codes = []
    for i in range(15):
        for j in range(15):
            grid_cells.append(box(x_edges[i], y_edges[j], x_edges[i + 1], y_edges[j + 1]))
            lsoa_codes.append(f'E0101{i:02d}{j:02d}')
    lsoa_gdf = gpd.GeoDataFrame({'LSOA21CD': lsoa_codes}, geometry=grid_cells, crs="EPSG:4326")

# Create a synthetic 'crime_rate' with a spatial pattern
np.random.seed(123)
lsoa_gdf['crime_rate'] = np.random.rand(len(lsoa_gdf)) * 50
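# Create a hotspot in the center. Distances are measured in a projected CRS
# (British National Grid, EPSG:27700) because centroids and distances computed
# in degrees are distorted and trigger a GeoPandas warning.
lsoa_proj = lsoa_gdf.to_crs(epsg=27700)
center = lsoa_proj.unary_union.centroid  # GeoPandas >= 1.0 prefers union_all()
distances = lsoa_proj.centroid.distance(center)
lsoa_gdf['crime_rate'] += (1 - distances / distances.max()) * 100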

lsoa_gdf.head()

### 3. Create Spatial Weights Matrix

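We need to define neighborhood relationships. **Queen Contiguity** is a common choice for polygons: two polygons are neighbors if they share at least one boundary point, whether that is a full edge or just a single corner vertex. Row-standardizing the weights makes each polygon's neighbor weights sum to 1, so its spatial lag is simply the average of its neighbors' values.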

In [None]:
w = Queen.from_dataframe(lsoa_gdf)
w.transform = 'r' # Row-standardize the weights

print(f"Created a Queen contiguity weights matrix for {w.n} polygons.")
print(f"Polygon 0 has {w.neighbors[0]} as neighbors.")
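
To make the two steps above concrete, here is a minimal NumPy-only sketch that builds Queen contiguity by hand for a small regular grid and then row-standardizes it. The helper `queen_weights_grid` and the 3 x 3 grid are illustrative assumptions, not part of `libpysal`, which works with arbitrary polygons rather than grid indices.

```python
import numpy as np

def queen_weights_grid(nrows, ncols):
    """Dense binary Queen-contiguity matrix for a regular grid (hypothetical helper)."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            # Visit all eight surrounding cells: edges and corners both count.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a cell is not its own neighbor
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < nrows and 0 <= cc < ncols:
                        W[i, rr * ncols + cc] = 1.0
    return W

W_binary = queen_weights_grid(3, 3)

# Row-standardize: divide each row by its neighbor count so the row sums to 1.
W_row = W_binary / W_binary.sum(axis=1, keepdims=True)

# With row-standardized weights, the spatial lag is the mean of the neighbors' values.
values = np.arange(9, dtype=float)
spatial_lag = W_row @ values

print("Neighbors of the center cell:", np.flatnonzero(W_binary[4]).tolist())
print("Neighbor counts per cell:", W_binary.sum(axis=1).astype(int).tolist())
print("Spatial lag of the center cell:", spatial_lag[4])
```

Corner cells end up with 3 neighbors, edge cells with 5, and the center cell with 8, which is why row standardization matters: without it, cells with more neighbors would contribute more to the statistics that follow.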

### 4. Global Spatial Autocorrelation (Moran's I)

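Moran's I tells us about the overall pattern. A positive value indicates clustering (similar values sit near each other), a negative value indicates dispersion (high values sit next to low values), and a value close to its expectation under spatial randomness, `-1/(n-1)`, indicates no detectable pattern. For a large number of polygons that expectation is effectively zero.
We test statistical significance with a permutation test: the observed values are shuffled across the polygons many times, and the observed Moran's I is compared with the distribution of values from the shuffles.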

In [None]:
y = lsoa_gdf['crime_rate']
moran = esda.Moran(y, w)

print(f"Moran's I: {moran.I:.4f}")
print(f"P-value: {moran.p_sim:.4f}")

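# A low pseudo p-value (e.g., < 0.05) suggests that the observed spatial pattern is unlikely to be the result of random chance.
# esda derives p_sim from 999 random permutations by default, so the smallest value it can report is 0.001.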
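
The statistic and its permutation test can also be written out by hand. The sketch below is an illustration under stated assumptions, not `esda`'s implementation: it uses rook contiguity on a 6 x 6 grid (chosen so that a checkerboard gives a clean dispersed pattern), and the helpers `rook_weights`, `morans_i`, and `permutation_test` are hypothetical names introduced here.

```python
import numpy as np

def rook_weights(nrows, ncols):
    """Row-standardized rook weights for a regular grid (illustrative helper)."""
    rows, cols = np.divmod(np.arange(nrows * ncols), ncols)
    # Rook neighbors differ by exactly one step in a single direction.
    steps = np.abs(rows[:, None] - rows[None, :]) + np.abs(cols[:, None] - cols[None, :])
    W = (steps == 1).astype(float)
    return W / W.sum(axis=1, keepdims=True)

def morans_i(y, W):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z), with z the mean-centered values."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

def permutation_test(y, W, permutations=999, seed=0):
    """Pseudo p-value obtained by shuffling the values across locations."""
    rng = np.random.default_rng(seed)
    observed = morans_i(y, W)
    simulated = np.array([morans_i(rng.permutation(y), W) for _ in range(permutations)])
    # Count shuffles at least as extreme as the observed value in the nearer tail,
    # so strong clustering and strong dispersion both give small p-values.
    # (This tail convention is a choice made for this sketch.)
    larger = int(np.sum(simulated >= observed))
    extreme = min(larger, permutations - larger)
    return observed, (extreme + 1) / (permutations + 1)

W = rook_weights(6, 6)
rows, cols = np.divmod(np.arange(36), 6)
patterns = {
    "gradient (clustered)": (rows + cols).astype(float),
    "checkerboard (dispersed)": ((rows + cols) % 2).astype(float),
}

results = {name: permutation_test(y, W) for name, y in patterns.items()}
for name, (i_obs, p) in results.items():
    print(f"{name}: I = {i_obs:.3f}, pseudo p = {p:.3f}")
```

The smooth gradient yields a strongly positive Moran's I, while the checkerboard, where every rook neighbor has the opposite value, sits at the negative extreme.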

### 5. Local Spatial Autocorrelation (LISA)

Now we drill down to find *where* the clusters are. LISA (Local Indicators of Spatial Association) calculates a local Moran's I statistic for each individual polygon, telling us how its value relates to the values of its neighbors.

In [None]:
lisa = esda.Moran_Local(y, w)

# Get the cluster classifications (1=HH, 2=LH, 3=LL, 4=HL)
lsoa_gdf['lisa_q'] = lisa.q

# Filter for only the significant clusters (p < 0.05)
significant = lisa.p_sim < 0.05
lsoa_gdf['lisa_sig'] = 'Not significant'
lsoa_gdf.loc[significant & (lisa.q == 1), 'lisa_sig'] = 'High-High (Hot Spot)'
lsoa_gdf.loc[significant & (lisa.q == 2), 'lisa_sig'] = 'Low-High (Spatial Outlier)'
lsoa_gdf.loc[significant & (lisa.q == 3), 'lisa_sig'] = 'Low-Low (Cold Spot)'
lsoa_gdf.loc[significant & (lisa.q == 4), 'lisa_sig'] = 'High-Low (Spatial Outlier)'

print("LISA results added to GeoDataFrame:")
lsoa_gdf['lisa_sig'].value_counts()
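
The quadrant codes used above come from comparing each polygon's deviation from the mean with the average deviation of its neighbors. The following NumPy-only sketch shows that logic on a 5 x 5 grid. It is a simplified illustration: the helpers `queen_weights` and `local_moran` are hypothetical, the significance test is omitted, and libraries may scale the local values by a slightly different constant.

```python
import numpy as np

def queen_weights(nrows, ncols):
    """Row-standardized Queen weights for a regular grid (illustrative helper)."""
    rows, cols = np.divmod(np.arange(nrows * ncols), ncols)
    near = (np.abs(rows[:, None] - rows[None, :]) <= 1) & \
           (np.abs(cols[:, None] - cols[None, :]) <= 1)
    W = near.astype(float)
    np.fill_diagonal(W, 0.0)  # a cell is not its own neighbor
    return W / W.sum(axis=1, keepdims=True)

def local_moran(y, W):
    """Local Moran's I_i = z_i * lag_i / m2, plus quadrant codes 1=HH, 2=LH, 3=LL, 4=HL."""
    z = y - y.mean()
    lag = W @ z                # average deviation of each cell's neighbors
    m2 = (z @ z) / len(y)      # variance of the values
    local_i = z * lag / m2
    quadrant = np.select(
        [(z > 0) & (lag > 0), (z <= 0) & (lag > 0), (z <= 0) & (lag <= 0)],
        [1, 2, 3],
        default=4,             # remaining case: high value, low neighbors
    )
    return local_i, quadrant

# Toy surface: a block of high values in one corner and one isolated spike.
grid = np.ones((5, 5))
grid[:2, :2] = 10.0
grid[4, 4] = 10.0
y = grid.ravel()

W = queen_weights(5, 5)
local_i, quadrant = local_moran(y, W)

# For row-standardized weights, the local values average to the global Moran's I.
z = y - y.mean()
global_i = (z @ W @ z) / (z @ z)

print(quadrant.reshape(5, 5))
print(f"Mean of local I: {local_i.mean():.4f}, global I: {global_i:.4f}")
```

The corner block is classified High-High, the isolated spike High-Low, and the low cells bordering the block Low-High, which mirrors the hot spot and spatial outlier labels assigned in the cell above.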

### 6. Visualize the Hotspot Map

In [None]:
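from matplotlib.colors import ListedColormap

# Project for plotting
lsoa_plot = lsoa_gdf.to_crs(epsg=3857)

fig, ax = plt.subplots(figsize=(15, 12))

lisa_colors = {
    'Not significant': 'lightgrey',
    'High-High (Hot Spot)': 'red',
    'Low-Low (Cold Spot)': 'blue',
    'Low-High (Spatial Outlier)': 'skyblue',
    'High-Low (Spatial Outlier)': 'pink'
}

# GeoPandas assigns colormap entries to categories in sorted order, so build the
# colormap from the categories actually present, in that same order. Otherwise
# the colors would not match the labels in the legend.
present = sorted(lsoa_plot['lisa_sig'].unique())
lisa_cmap = ListedColormap([lisa_colors[c] for c in present])

lsoa_plot.plot(
    column='lisa_sig',
    categorical=True,
    cmap=lisa_cmap,
    linewidth=0.5,
    ax=ax,
    edgecolor='white',
    legend=True,
    legend_kwds={'title': 'LISA Cluster Type'}
)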

ax.set_title('Local Spatial Autocorrelation (LISA) of Crime Rate')
ax.set_xticks([])
ax.set_yticks([])
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)
plt.show()

# Discussion:
# - The red areas are 'Hot Spots': locations with high crime rates that are also surrounded by other locations with high crime rates.
# - The blue areas are 'Cold Spots': locations with low crime rates surrounded by other low-crime locations.
# - Spatial outliers (if any) would represent anomalies, like a low-crime area surrounded by high-crime neighbors.
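# - Most importantly, these patterns pass a permutation test at p < 0.05, giving us confidence that they are not just random noise.
# - Because one test is run per polygon, a few false positives are still expected, so treat isolated significant polygons with caution.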
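
Because the significance filter runs one test per polygon, some polygons will pass `p < 0.05` by chance alone. One common response is to control the false discovery rate. The sketch below implements the Benjamini-Hochberg procedure in plain NumPy. The function name and the p-values are made up for illustration, and this step is an optional extension, not something the notebook's workflow applies.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of tests that remain significant under Benjamini-Hochberg FDR control."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # The k-th smallest p-value is compared with alpha * k / n.
    thresholds = alpha * np.arange(1, n + 1) / n
    passed = p[order] <= thresholds
    keep = np.zeros(n, dtype=bool)
    if passed.any():
        # Keep every test up to the largest rank that meets its threshold.
        cutoff = np.max(np.flatnonzero(passed))
        keep[order[:cutoff + 1]] = True
    return keep

# Illustrative p-values, standing in for something like lisa.p_sim.
p_sim = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216])

naive = p_sim < 0.05
adjusted = benjamini_hochberg(p_sim, alpha=0.05)

print("Significant at p < 0.05:", int(naive.sum()))
print("Significant after FDR control:", int(adjusted.sum()))
```

In this toy example, five of the ten tests pass the naive threshold but only two survive the correction. Applied to a LISA map, the same mask could replace the `significant` filter to produce a more conservative set of hot and cold spots.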