<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center>






# Neighborhood and Spatial Correlation



## Getting ready


### Libraries needed

Let's verify:

In [None]:
!pip show pysal pandas geopandas

In [None]:
## needed in Colab
# !pip install pysal

### Data to use

Let me get two maps:

1. The USA map at the state level, taken directly from census.gov, which offers good quality.

In [None]:
import geopandas as gpd

url = "https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_state_500k.zip"
us_states = gpd.read_file(url)
us_states.info(),us_states.crs.to_epsg(),us_states.crs.is_projected

Notice this map has basic information per state. Also, notice how it looks when plotted with its current CRS:

In [None]:
us_states.plot()

Let's reproject this map:

In [None]:
us_states=us_states.to_crs(5070)
us_states.plot()

Let's use the state name as the index; that makes places easier to identify in most outputs (otherwise we would see just numerical indexes):

In [None]:
us_states.set_index('NAME', inplace=True)
us_states.head()

Let's subset the us_states for some examples:

In [None]:
someStates=['Utah','Colorado','Arizona','New Mexico', 'Florida','Georgia','Alabama']
sub_us=us_states[us_states.index.isin(someStates)]
sub_us

2. A map of Peru, at the 'distrito' level (similar to a municipality in the USA, though not exactly the same). The map comes from an unofficial [website](https://www.geogpsperu.com/p/descargas.html). Some columns have been added.

In [None]:
peruDataLink="https://github.com/CienciaDeDatosEspacial/dataSets/raw/refs/heads/main/PERU/PeruMaps.gpkg"
peru_distritos=gpd.read_file(peruDataLink,layer='distritos')

# some basic info
peru_distritos.info(),peru_distritos.crs.to_epsg(),peru_distritos.crs.is_projected

Let's reproject and plot:

In [None]:
peru_distritos=peru_distritos.to_crs(5387)
peru_distritos.plot()

Besides the spatial units (DEPARTAMEN, PROVINCIA, DISTRITO, and Ubigeo - "Ubigeo" is a code ), you have:
 - **Poblacion**: Population (2017)
 - **Superficie**: Area               
 - **IDH2019**: Human Development Index for DISTRITO (2019)                   
 - **Educ_sec_comp2019_pct**: Share of Population that finished High-School (2019)     
 - **NBI2017_pct**: Share of Population with poverty at the household level aggregated by DISTRITO. This index ("Unsatisfied Basic Needs") uses observable living conditions rather than income alone (2017).
 - **Viv_sin_serv_hig2017_pct**: Share of housing units that have no sanitation infrastructure aggregated by  DISTRITO (2017)

Notice we should not use the 'distrito' name as index, because several of them are repeated:

In [None]:
peru_distritos[peru_distritos['DISTRITO'].duplicated()]

Let's use 'Ubigeo', although it is not the best solution:

In [None]:
# first confirm that 'Ubigeo' has no duplicates (an empty result is expected)
peru_distritos[peru_distritos['Ubigeo'].duplicated()]

In [None]:
#then
peru_distritos.set_index('Ubigeo', inplace=True)

## I. Who is my neighbor?

In spatial analysis, the intuitive concept of a “neighbor” can be operationalized in multiple ways.

So far, we have identified neighbors using geometric operations such as buffering, spatial joins, and overlays. Now, let’s consider the distance matrix:


In [None]:
sub_us.geometry.apply\
(lambda state: sub_us.distance(state)/1000)

In [None]:
sub_us.explore(zoom_start=6)

From this matrix and plot, you’ll notice that neighboring features have a distance of zero—this occurs when two polygons share a boundary (i.e., they are contiguous). This observation helps illustrate the different approaches to defining neighbors:

1. Binary Relationships:

* Contiguity: Two polygons are considered neighbors if they share any portion of their boundary—whether at a point, along a line segment, or more extensively. In the distance matrix, such pairs show a distance of zero, reflecting direct spatial adjacency.
* Ranked Proximity: Each unit ranks all the other units by their distance to it. You then request K neighbors and get the K closest ones from that ranking.
* Band Proximity: You accept as neighbors all the units that lie within a threshold distance.


2. Continuous Relationships

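* Proximity: Two features are considered neighbors if the distance between them falls below a specified threshold (e.g., within 100 km), and the strength of the relationship shrinks as that distance grows (for example, an inverse-distance weight). This approach is especially useful for point data or non-contiguous regions.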
* Shared Border Length: The strength of the neighbor relationship is weighted by the length of the shared boundary. Longer shared borders imply stronger spatial interaction—a common assumption in models of spatial diffusion or economic spillovers.

Using matrices—rather than raw geometries—is essential for the mathematical representation and numerical computation required in upcoming spatial analytics techniques.
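To make the step from a distance matrix to a neighbor definition concrete, here is a minimal NumPy sketch. The four units (A to D), their distances, and the 150 km threshold are all made up for illustration; they are not taken from `sub_us`.

```python
import numpy as np

# Hypothetical pairwise distances (km) among four made-up units.
# A zero off the diagonal means the two polygons touch.
names = ["A", "B", "C", "D"]
dist = np.array([
    [0.0,     0.0, 120.0, 400.0],
    [0.0,     0.0,   0.0, 250.0],
    [120.0,   0.0,   0.0,   0.0],
    [400.0, 250.0,   0.0,   0.0],
])

not_self = ~np.eye(len(names), dtype=bool)  # a unit is not its own neighbor

# Contiguity: distance of zero between two different units.
contiguity = (dist == 0) & not_self

# Distance band: any other unit within 150 km (assumed threshold).
band_150 = (dist <= 150) & not_self

# Readable view of the contiguity matrix: focal -> list of neighbors.
neighbors = {
    focal: [names[j] for j in np.flatnonzero(row)]
    for focal, row in zip(names, contiguity)
}
print(neighbors)
```

The same distance matrix yields different neighbor sets depending on the rule: here A gains C as a neighbor once the 150 km band replaces strict contiguity.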

PySAL (libpysal) is designed to handle spatial relationship matrices and integrates seamlessly with GeoPandas. Rather than relying solely on matrices, modern PySAL uses graph-based representations of spatial relationships. This approach is not only more memory-efficient but also speeds up computation and simplifies visualization—especially for large or sparse spatial datasets.  A key concept in these graphs is the distinction between:

- Focal unit: the spatial feature (e.g., a state, county, or census tract) for which we are identifying neighbors.
- Neighbor(s): the other spatial units that are considered related to the focal unit based on a chosen criterion (e.g., contiguity, distance, or shared border length).

Let’s become familiar with the ```graph``` module in PySAL/libpysal, which provides a modern and flexible framework for constructing and working with spatial neighbor graphs.


In [None]:
from libpysal.graph import Graph

### I. Binary matrices

#### I. 1 Contiguity and Binary matrices

Take a look at the **queen** and **rook** relationship:

<center><img src="https://github.com/CienciaDeDatosEspacial/spatial_autoCorr/raw/main/rookQueen.png" width="700"></center>

From the image above:
- Your **rook** neighbor is any polygon that shares a border with you (a borderline of at least two points). It is also known as the _Von Neumann_ neighbor.

- Your **queen** neighbor is any polygon that shares a border or a corner with you (at least one point). It is also known as the _Moore_ neighbor.
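A regular grid is the easiest place to see the difference. The sketch below is a toy example in plain Python (the 3x3 grid and the helper `grid_neighbors` are hypothetical, not part of libpysal): it counts rook and queen neighbors for the center cell and for a corner cell.

```python
# Cells of a 3x3 grid, addressed as (row, col).
cells = [(r, c) for r in range(3) for c in range(3)]

def grid_neighbors(cell, queen=False):
    """Return the cells adjacent to `cell` on the 3x3 grid."""
    r, c = cell
    found = []
    for other in cells:
        dr, dc = abs(other[0] - r), abs(other[1] - c)
        if (dr, dc) == (0, 0):
            continue  # a cell is not its own neighbor
        shares_edge = dr + dc == 1            # rook: side by side
        shares_corner = dr == 1 and dc == 1   # extra queen case: diagonal
        if shares_edge or (queen and shares_corner):
            found.append(other)
    return found

center, corner = (1, 1), (0, 0)
rook_counts = (len(grid_neighbors(center)), len(grid_neighbors(corner)))
queen_counts = (len(grid_neighbors(center, queen=True)),
                len(grid_neighbors(corner, queen=True)))
print("rook :", rook_counts)
print("queen:", queen_counts)
```

With irregular polygons such as states the two criteria often agree, because corner-only contacts are rare; on a grid they differ for every cell.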

Let's see how to get each set of neighbors

##### I.1.a Rook

* The key idea: A focal polygon considers another polygon a neighbor if they share a common edge—that is, two or more connected points forming a line segment of positive length.
* The input: A GeoDataFrame (GDF) containing polygon geometries (e.g., states, counties, census tracts). Each row represents a spatial unit that can serve as a focal observation.
* The process: For each polygon (focal unit), the algorithm checks all other polygons to determine whether their boundaries intersect along a line segment (not just at a point).
* The output: A graph built from a binary adjacency matrix where each node represents a polygon (focal unit). An edge exists between two nodes only if they share a boundary segment. The corresponding adjacency matrix is binary.

Given our input for the examples is **sub_us**, let's run...

In [None]:
sub_us_rook=Graph.build_contiguity(sub_us,rook=True)

Now let's check the output:

**a. adjacency**

In [None]:
sub_us_rook.adjacency

The previous result lists only the neighbors of each focal unit (a long format). To recreate a wide format (a matrix):

In [None]:
import pandas as pd
pd.DataFrame(sub_us_rook.adjacency).unstack()

We generally fill those missing values (not a neighbor) with zero.

In [None]:
sub_us_rook_Matrix=pd.DataFrame(sub_us_rook.adjacency).unstack().fillna(0)
sub_us_rook_Matrix

**b. adjacency graph**

As we have a `GRAPH`, we can identify these neighborhood relationships via edges:

In [None]:
sub_us_rook.explore(sub_us,node_kws=dict(color='red'), edge_kws=dict(alpha=0.4,color='blue'),zoom_start = 6)

##### I.1.b Queen

* The key idea: A focal polygon considers another polygon a neighbor if they share a common edge or a vertex (at least one point).
* The input: A GeoDataFrame (GDF) containing polygon geometries (e.g., states, counties, census tracts). Each row represents a spatial unit that can serve as a focal observation.
* The process: For each polygon (focal unit), the algorithm checks all other polygons to determine whether their boundaries intersect at any point or along a line segment.
* The output: A graph built from a binary adjacency matrix where each node represents a polygon (focal unit). An edge exists between two nodes only if they share a point or boundary segment. The corresponding adjacency matrix is binary.

Let's see what we get:

In [None]:
# first
sub_us_queen=Graph.build_contiguity(sub_us,rook=False)

**a. adjacency matrix**

In [None]:
sub_us_queen_Matrix=pd.DataFrame(sub_us_queen.adjacency).unstack().fillna(0)
sub_us_queen_Matrix

**b. adjacency plot**

In [None]:
sub_us_queen.explore(sub_us,node_kws=dict(color='red'), edge_kws=dict(alpha=0.4,color='blue'),zoom_start = 6)




#### I. 2 Ranked  proximity and Binary matrices: KNN

* The key idea: A focal polygon considers another polygon a neighbor if that second polygon is among the K closest polygons to the focal polygon, where “closeness” is measured by the distance between their geometric centroids (or any other user-supplied distance metric).  
* The input: A GeoDataFrame (GDF) containing polygon geometries. Each row represents a spatial unit that can serve as a focal observation.  
* The process: For each polygon (focal unit), the algorithm (1) computes the distance between its centroid and every other centroid, (2) ranks these distances, and (3) flags the K smallest distances as neighbors. Ties can be broken by random selection, ID order, or by including all tied candidates.  
* The output: A graph built from a binary adjacency matrix where each node represents a polygon (focal unit). An edge exists between two nodes whenever one polygon is among the K nearest neighbors of the other. The corresponding adjacency matrix is binary and, by construction, typically asymmetric unless a symmetric constraint is explicitly enforced.
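The asymmetry mentioned in the output is easy to reproduce by hand. The following is a minimal NumPy sketch of the rank-and-keep-K steps, assuming three made-up centroids on a line and K = 1; it is an illustration of the idea, not libpysal's implementation.

```python
import numpy as np

# Hypothetical centroids (x, y) in km for three made-up units.
names = ["A", "B", "C"]
pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
k = 1

# Pairwise Euclidean distances; the diagonal is set to infinity so that
# no unit is picked as its own neighbor.
diff = pts[:, None, :] - pts[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))
np.fill_diagonal(dist, np.inf)

# For each row (focal unit), flag the k smallest distances with a 1.
nearest = np.argsort(dist, axis=1)[:, :k]
knn = np.zeros(dist.shape, dtype=int)
np.put_along_axis(knn, nearest, 1, axis=1)

print(knn)
```

C picks B as its nearest neighbor, but B picks A, so the matrix is not symmetric even though every row sums to K.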

If we assume K is 3:

**a. adjacency matrix**

In [None]:
sub_us_knn3 = Graph.build_knn(sub_us.representative_point(), # one point per polygon
                                 k=3) # desired k

sub_us_knn3_Matrix=pd.DataFrame(sub_us_knn3.adjacency).unstack().fillna(0)
sub_us_knn3_Matrix

**b. adjacency plot**

In [None]:
sub_us_knn3.explore(sub_us, edge_kws=dict(alpha=0.4),zoom_start = 6)

#### I. 3 Band-based proximity and Binary matrices


* The key idea: A focal polygon considers another polygon a neighbor if the distance between them (usually centroid-to-centroid, or between a pair of representative points) lies within a user-defined band.
* The input: A GeoDataFrame (GDF) containing polygon geometries for which a representative point is supplied. Each row represents a spatial unit that can serve as a focal observation, plus (optionally) an attribute column to use as a weight or ID.  
* The process: For each polygon (focal unit) the algorithm computes the distance to every other polygon and flags those whose distance falls inside the specified band.
* The output: A binary spatial graph (or adjacency structure) where each node represents a polygon. An edge exists between two nodes only if their mutual distance sits inside the chosen band. The corresponding adjacency matrix is binary and symmetric by construction.

Let's assume a 750 km distance band:

In [None]:
sub_us_band750k_Bi=Graph.build_distance_band(sub_us.representative_point(), threshold=750000)

sub_us_band750k_Bi_Matrix=pd.DataFrame(sub_us_band750k_Bi.adjacency).unstack().fillna(0)
sub_us_band750k_Bi_Matrix

In [None]:
sub_us_band750k_Bi.explore(sub_us, edge_kws=dict(alpha=0.4), zoom_start=6)

### II. Continuous matrices

#### II. 1 Continuous Distance band-based Proximity

* The key idea: A focal polygon considers another polygon a neighbor if the distance between them (usually centroid-to-centroid, or between a pair of representative points) lies within a user-defined band. Unlike the binary case, each tie carries a weight that shrinks with distance.
* The input: A GeoDataFrame (GDF) containing polygon geometries for which a representative point is supplied. Each row represents a spatial unit that can serve as a focal observation, plus (optionally) an attribute column to use as a weight or ID.  
* The process: For each polygon (focal unit) the algorithm computes the distance to every other polygon, flags those whose distance falls inside the specified band, and assigns each flagged pair a distance-based weight.
* The output: A graph where each node has edges to all nodes ('neighbors') whose representative points lie within the radius. The adjacency matrix is continuous: by default, each value is the inverse of the distance between the two nodes of the edge.
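As a rough illustration of how the binary and the continuous band differ, here is a NumPy sketch under stated assumptions: four made-up points with coordinates in km, a 750 km threshold, and a plain `1 / distance` weight.

```python
import numpy as np

# Hypothetical representative points (x, y) in km; the last one is far away.
pts = np.array([[0.0, 0.0], [300.0, 0.0], [0.0, 400.0], [2000.0, 0.0]])
threshold = 750.0  # km, assumed

diff = pts[:, None, :] - pts[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# Pairs inside the band (dist > 0 drops each point's pairing with itself).
in_band = (dist <= threshold) & (dist > 0)

# Binary version: 1 inside the band, 0 outside.
binary_w = in_band.astype(int)

# Continuous version: inverse distance inside the band, 0 outside.
inverse_w = np.zeros_like(dist)
inverse_w[in_band] = 1.0 / dist[in_band]

print(binary_w)
print(inverse_w.round(5))
```

Two things are worth noticing: the distant point has no neighbors at all (an island), and inverse-distance weights depend on the distance unit. With a CRS in metres, as in this notebook, the values are very small numbers.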

Here it is:

In [None]:
sub_us_band750k_C=Graph.build_distance_band(sub_us.representative_point(), threshold=750000,binary=False)
sub_us_band750k_C_Matrix=pd.DataFrame(sub_us_band750k_C.adjacency).unstack().fillna(0)
sub_us_band750k_C_Matrix


In [None]:
sub_us_band750k_C.explore(
        sub_us, edge_kws=dict(alpha=0.4)
    )

#### II.2 Kernel K-Nearest Neighbors (KNN)

* The key idea: Each polygon keeps only its K closest buddies, but instead of a blunt 0 / 1 the tie is a gentle “friendliness” score that shrinks with distance.  
* The input: A GeoDataFrame of polygons + their representative points.  
* The process:  
  - Find the K nearest points.  
  - Turn distance into a Gaussian bell value (big when close, small when far).  

* The output: A weighted graph where every node has exactly K neighbours carrying continuous “soft” weights; no islands, no huge clumps, just K tidy friendship levels per row.

Let's use K = 3 again:

In [None]:
sub_us_kernel3 = Graph.build_kernel(sub_us.representative_point(), k=3)
sub_us_kernel3_Matrix=pd.DataFrame(sub_us_kernel3.adjacency).unstack().fillna(0)
sub_us_kernel3_Matrix

Notice we have bigger values than in the continuous band-based result. Let's use those values to color the edges (the darker the further):

In [None]:
sub_us_kernel3.explore(
        sub_us, edge_kws=dict(column="weight",
        style_kwds=dict(weight=6)    )
    )

#### II. 3 Perimeter-based Proximity


* **The key idea**: A focal polygon treats another polygon as a neighbour only if the two share a common border.  
* **The input**: A GeoDataFrame of polygons; no extra points needed.  
* **The process**:  
  - Compute the exact length of every shared boundary segment.  
  - Set w_ij = shared_perimeter_ij (zero if no border).  
  
* **The output**: A weighted graph encoded in a sparse weights matrix; each non-zero entry is the **metres (or map-units) of shared border**, giving large, jagged polygons more influence over their neighbours than small, compact ones.
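Real polygons need a geometry library to measure shared borders, but the idea can be shown with a toy case in plain Python. The sketch below assumes four made-up, non-overlapping, axis-aligned rectangles; the helpers `overlap` and `shared_border` are hypothetical and only valid for that simplified setting.

```python
from itertools import combinations

# Hypothetical rectangles as (xmin, ymin, xmax, ymax), in km.
rects = {
    "A": (0, 0, 2, 2),
    "B": (2, 0, 4, 1),  # touches A along x = 2 for 1 km
    "C": (0, 2, 4, 3),  # sits on top of A along y = 2 for 2 km
    "D": (4, 3, 5, 4),  # touches C only at the corner (4, 3)
}

def overlap(lo1, hi1, lo2, hi2):
    """Length of the overlap between intervals [lo1, hi1] and [lo2, hi2]."""
    return max(0, min(hi1, hi2) - max(lo1, lo2))

def shared_border(r1, r2):
    """Shared border length of two non-overlapping axis-aligned rectangles."""
    xmin1, ymin1, xmax1, ymax1 = r1
    xmin2, ymin2, xmax2, ymax2 = r2
    if xmax1 == xmin2 or xmax2 == xmin1:   # side by side: vertical contact
        return overlap(ymin1, ymax1, ymin2, ymax2)
    if ymax1 == ymin2 or ymax2 == ymin1:   # stacked: horizontal contact
        return overlap(xmin1, xmax1, xmin2, xmax2)
    return 0

# w_ij = shared border length; pairs with no shared border are dropped.
weights = {}
for a, b in combinations(rects, 2):
    length = shared_border(rects[a], rects[b])
    if length > 0:
        weights[(a, b)] = length

print(weights)
```

C and D meet only at a corner, so their shared length is zero: a queen criterion would call them neighbors, yet a perimeter weight gives that tie no strength.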

In [None]:
sub_us_perimeter = Graph.build_contiguity(sub_us, by_perimeter=True)
sub_us_perimeter_Matrix=pd.DataFrame(sub_us_perimeter.adjacency).unstack().fillna(0)
sub_us_perimeter_Matrix


In [None]:
sub_us_perimeter.explore(
        sub_us, edge_kws=dict(column="weight",
        style_kwds=dict(weight=5)    )
    )

In summary:

| Technique | Pros | Cons | Typical use-case example |
|-----------|------|------|--------------------------|
| **Rook / Queen contiguity** | Simple, law- or admin-based, no distance parameter | Islands get 0 neighbours; ignores “near but not touching” units | State-to-state policy diffusion, local tax spill-overs |
| **K-NN (binary)** | Guarantees every unit has same #neighbours; no islands | Can link far-away units in sparse areas; ignores true borders | Comparative politics with same-size legislatures, small-N samples |
| **Distance band (binary)** | Hard cap on geographic reach; easy to interpret radius | Sparse → islands; dense → huge cliques; radius must be tuned | Housing-price spill-overs within 30 km commuting range |
| **Continuous distance-band** | Smooth weights, still respects max range | Same radius-tuning and island issues as binary band | Environmental exposure declining with distance, gravity models |
| **Kernel K-NN** | Fixed neighbour count, adaptive bandwidth, no islands | Far neighbours possible in deserts/oceans; parameter k to pick | Spatial error regression on counties, GWR, ML feature smoothing |
| **Perimeter-weighted contiguity** | Uses real boundary, good for irregular shapes | Needs clean topology; computation heavier | Ecological edge effects, river-basin pollution, crop pest spread |

## II. From neighborhood to weights

Our matrices tell us which units are neighbors, and some also tell us how 'far' each identified neighbor is. Let me compute the marginal values by row (sum):

In [None]:
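# row sums (marginals) of every matrix built so far
allMx=[sub_us_rook_Matrix.sum(axis=1),
       sub_us_queen_Matrix.sum(axis=1),
       sub_us_knn3_Matrix.sum(axis=1),
       sub_us_band750k_Bi_Matrix.sum(axis=1),
       sub_us_band750k_C_Matrix.sum(axis=1),
       sub_us_kernel3_Matrix.sum(axis=1),
       sub_us_perimeter_Matrix.sum(axis=1)]
# label each column with the technique that produced it
mxNames=['rook','queen','knn3','band750k_Bi','band750k_C','kernel3','perimeter']
pd.concat(allMx,axis=1,keys=mxNames)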

We know that non-neighbors have a value of zero in the matrix, so any non-zero value (binary or not) carries some weight.
However, the weights are currently not normalized (they do not add up to 1 by row), which may complicate further computations.

For example, compare each unnormalized matrix below with its row-standardized version, obtained with `transform("r")`:

In [None]:
sub_us_queen_Matrix

In [None]:
sub_us_queen.transform("r").adjacency.unstack().fillna(0)


In [None]:
sub_us_perimeter_Matrix

In [None]:
sub_us_perimeter.transform("r").adjacency.unstack().fillna(0)

These new row-standardized matrices will serve a greater purpose: the computing of spatial lags.
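Row standardization itself is a one-line operation: divide every row by its sum. The sketch below shows it with NumPy on a made-up binary matrix for four hypothetical units; the guard for all-zero rows (islands) is an assumption about how one might handle them by hand.

```python
import numpy as np

# Hypothetical binary contiguity matrix for four made-up units.
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

row_sums = W.sum(axis=1, keepdims=True)

# Divide each row by its sum; rows that sum to zero (islands) stay at zero.
W_row = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

print(W_row.round(3))
print(W_row.sum(axis=1))  # every non-island row now adds up to 1
```

Note that a symmetric matrix usually stops being symmetric after this step: the second unit gives the fourth a weight of 1/3, while the fourth gives the second a weight of 1.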

## III. Spatial Lag: Is my situation independent from my neighbors?

This is a crucial moment in statistics: traditionally, we assume our situation is independent of others', so sampling may reveal unbiased population insights. But what if neighbors are affecting one another?

Of course, by situation we mean a variable, for example, the share of the population that completed high school (HS) in my 'distrito':

In [None]:
peru_distritos.Educ_sec_comp2019_pct.describe()

See the choropleth:

In [None]:
peru_distritos.plot(
    "Educ_sec_comp2019_pct",
    scheme="quantiles",
    cmap="Reds_r",
    legend=True,figsize=(12, 10))

Traditionally we treat each distrito as an isolated coin-flip:  
*“If I randomly sample 10 % of them, the average HS-completed I compute is an unbiased picture of the whole country.”*

But what if the share of adults who finished high school in **my** distrito is **pushed up or down by the share already existing next door**?  
Then the coin-flips are **not independent**—they are **spatially autocorrelated**—and ignoring that:

- inflates t-values (false precision)  
- masks hidden variables that spill over borders  
- makes policy impact look local when it is really regional  

The spatial lag is the **simplest diagnostic**:  

```
lag_HS_i = Σⱼ w_ij HS_j      with Σⱼ w_ij = 1
```
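With row-standardized weights the formula is just a matrix-vector product. Here is a minimal NumPy sketch with made-up numbers (four hypothetical units and invented HS shares), so each lag can be checked by hand.

```python
import numpy as np

# Hypothetical row-standardized weights for four made-up units.
W_row = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [1/3, 0.0, 1/3, 1/3],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])

# Invented shares of population with HS completed (percent).
hs = np.array([40.0, 60.0, 50.0, 90.0])

# lag_HS_i = sum_j w_ij * HS_j, i.e. the average among unit i's neighbors.
lag_hs = W_row @ hs
print(lag_hs)
```

For the first unit the lag is the mean of its two neighbors, (60 + 50) / 2 = 55; its own value of 40 plays no role in its lag.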



You can compute the adjacency with the weights now:

In [None]:
peru_distritos_queen=Graph.build_contiguity(peru_distritos,rook=False)
peru_distritos_queen=peru_distritos_queen.transform("r")

In [None]:
y = peru_distritos["Educ_sec_comp2019_pct"]
ylag = peru_distritos_queen.lag(y)

`ylag` holds, for each distrito i, the **average HS-completed in its neighbourhood**.  
If you plot lag_HS against own_HS:

- Scatter hugs the 45° line → **strong positive spatial dependence** (high-high, low-low clusters).  
- Cloud is circular → **approximate independence**, classical i.i.d. assumption holds.
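One way to turn that visual reading into a number is to fit a straight line to the scatter: a slope close to 1 means the cloud hugs the diagonal, a slope close to 0 means no linear relation. The NumPy sketch below assumes a made-up chain of six units (each one a neighbor of the units next to it) with invented values.

```python
import numpy as np

# Hypothetical chain of six units: unit i touches units i-1 and i+1.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W_row = W / W.sum(axis=1, keepdims=True)  # row-standardize

# Invented values: three low units followed by three high units.
own = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])
lag = W_row @ own

# Least-squares line through the (own, lag) scatter.
slope, intercept = np.polyfit(own, lag, deg=1)
print(round(slope, 3))
```

With row-standardized weights, this slope is the same quantity that the global Moran's I summarizes, so the scatter and the index tell the same story.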



Let me add it to the GDF:

In [None]:
peru_distritos=peru_distritos.assign(Educ_sec_comp2019_pct_lagged=ylag)

Plot both to compare:

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(15, 8))

commonParams=dict(scheme="quantiles",cmap="Reds_r",legend=True)
# --- MAP 1
peru_distritos.plot("Educ_sec_comp2019_pct",ax=axes[0],**commonParams)
axes[0].set_title('Share Population HS Completed (original)', fontsize=14)
axes[0].set_axis_off()

# --- MAP 2
peru_distritos.plot("Educ_sec_comp2019_pct_lagged",ax=axes[1],**commonParams)
axes[1].set_title('Share Population HS Completed (lagged)', fontsize=14)
axes[1].set_axis_off()

plt.tight_layout()
plt.show()


In [None]:
peru_distritos.plot.scatter("Educ_sec_comp2019_pct","Educ_sec_comp2019_pct_lagged")

## Global spatial correlation

If the value of a variable in a spatial unit (a row) is correlated with the values of its neighbors, you know that proximity is interfering with the interpretation.

We need the neighborhood matrix (the weight matrix) to compute spatial correlation.
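For intuition, the global Moran's I can be written in a few lines of NumPy as I = (n / S0) · (zᵀWz) / (zᵀz), where z holds the deviations from the mean and S0 is the sum of all weights. The sketch below is an illustration on a made-up chain of six units; the helper `morans_i` and the simple one-sided permutation p-value are assumptions for teaching purposes, not esda's code.

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I for values y and weights matrix W."""
    z = np.asarray(y, dtype=float)
    z = z - z.mean()                      # deviations from the mean
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

# Hypothetical chain of six units, row-standardized.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)

clustered = np.array([10, 12, 11, 50, 52, 51])    # similar values side by side
alternating = np.array([10, 50, 10, 50, 10, 50])  # opposite values side by side

i_clustered = morans_i(clustered, W)
i_alternating = morans_i(alternating, W)

# Pseudo p-value: how often does shuffling the values across units
# produce an index at least as large as the observed one?
rng = np.random.default_rng(0)
n_perm = 999
sims = np.array([morans_i(rng.permutation(clustered), W)
                 for _ in range(n_perm)])
p_sim = (np.sum(sims >= i_clustered) + 1) / (n_perm + 1)

print(round(i_clustered, 3), round(i_alternating, 3))
```

A clearly positive index signals that similar values sit next to each other, while the alternating pattern yields a negative index; the permutation step gives a rough sense of what a pseudo p-value such as `p_sim` measures.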

In [None]:
import esda

mi = esda.Moran(peru_distritos['Educ_sec_comp2019_pct'], peru_distritos_queen)

In [None]:
mi.I,mi.p_sim

## Local Spatial Correlation

We can compute a Local Index of Spatial Association (LISA -local Moran) for each map object. That will help us find spatial clusters (spots) and spatial outliers:

* A **hotSpot** is a polygon whose value in the variable is high AND is surrounded with polygons with also high values.

* A **coldSpot** is a polygon whose value in the variable is low AND is surrounded with polygons with also low values.

* A **coldOutlier** is a polygon whose value in the variable is low BUT is surrounded with polygons with  high values.

* A **hotOutlier** is a polygon whose value in the variable is high BUT is surrounded with polygons with  low values.


In terms of the quadrant labels used in the output:

* High-High (HH): values above average surrounded by values above average (a hotSpot).
* Low-Low (LL): values below average surrounded by values below average (a coldSpot).
* High-Low (HL): values above average surrounded by values below average (a hotOutlier).
* Low-High (LH): values below average surrounded by values above average (a coldOutlier).
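The quadrant of each unit follows from two signs: its own deviation from the mean and the deviation of its spatial lag. The NumPy sketch below assigns the four labels on a made-up chain of six units; the helper `moran_quadrants` is hypothetical, and it deliberately skips the significance test that decides which units are finally reported as clusters or outliers.

```python
import numpy as np

def moran_quadrants(y, W_row):
    """Label each unit by the sign of its deviation and of its lagged deviation."""
    z = np.asarray(y, dtype=float)
    z = z - z.mean()
    lag_z = W_row @ z
    high, lag_high = z > 0, lag_z > 0
    return np.where(high & lag_high, "High-High",
           np.where(~high & ~lag_high, "Low-Low",
           np.where(high & ~lag_high, "High-Low", "Low-High")))

# Hypothetical chain of six units, row-standardized.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W_row = W / W.sum(axis=1, keepdims=True)

# Invented values chosen so that all four quadrants appear.
y = np.array([80.0, 90.0, 30.0, 85.0, 10.0, 25.0])
quadrants = moran_quadrants(y, W_row)
print(quadrants.tolist())
```

In a real analysis only the units whose local statistic passes the chosen significance threshold keep their quadrant label; the rest are reported as not significant.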

It is also possible that no significant correlation is detected. Let's see those values:

In [None]:
lisa = esda.Moran_Local(peru_distritos['Educ_sec_comp2019_pct'], peru_distritos_queen)

In [None]:
peru_distritos['cluster'] = lisa.get_cluster_labels(crit_value=0.05)

In [None]:
peru_distritos['cluster'].value_counts()

In [None]:
lisa.explore(peru_distritos,crit_value=0.05,
  prefer_canvas=True,
  tiles="CartoDB Positron",
)

In [None]:

oldLabels=['Insignificant', 'High-High','Low-High', 'Low-Low', 'High-Low']
newLabels = [ '0 no_sig', '1 hotSpot', '2 coldOutlier', '3 coldSpot', '4 hotOutlier']

labels = dict(zip(oldLabels, newLabels))


peru_distritos['HS_lisa_quadrant']=peru_distritos['cluster'].map(labels)

peru_distritos['HS_lisa_quadrant'].value_counts()

In [None]:
import matplotlib.pyplot as plt
# custom colors
from matplotlib import colors
myColMap = colors.ListedColormap([ 'snow','pink','blue', 'lightblue','red'])

peru_distritos.plot(column='HS_lisa_quadrant',
                categorical=True,
                cmap=myColMap,
                linewidth=0.1,
                edgecolor='k',
                legend=True,
                legend_kwds={'bbox_to_anchor': (0.3, 0.3)},
                figsize=(12,12))
