<center><img src="https://github.com/DACSS-CSSmeths/guidelines/blob/main/pics/small_logo_ccs_meths.jpg?raw=true" width="700"></center>









# Insight from spatial data

Let me bring a file previously prepared in [Colab](https://colab.research.google.com/drive/1poKKGEsOkTTjwi5ildq2-kMet29HYfpQ?usp=sharing) using FuzzyMerge:

In [None]:
# map
import geopandas as gpd

peru_hdi_map_link="https://github.com/DACSS-CSSmeths/Spatial-Analytics/raw/refs/heads/main/map/perudata.gpkg"

peru_hdi_map=gpd.read_file(peru_hdi_map_link,layer='hdi')

peru_hdi_map.info()

This is a GeoDF with data on Human Development at the municipal (_municipality_) level, including this relevant social info:


* **hdi**: The human development index of the municipality

* **graduated_HS**: Percent of population that finished High School

* **No_basicNeeds**: Percent of households without basic needs.

* **No_sanitaryServ**: Percent of households without sanitary services.

Take a quick look:

In [None]:
peru_hdi_map.head()

## Mining one variable beyond choropleths

Let me use the *Fisher_Jenks* scheme on one variable:

In [None]:
peru_hdi_map.plot(
    column="hdi", 
    scheme="fisherjenks",
    legend=True, figsize=(6,10)
)
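To make the idea behind this scheme concrete, here is a minimal brute-force sketch. It is not the algorithm the library runs: it simply tries every way to cut a small sorted list into `k` contiguous classes and keeps the cut with the smallest within-class sum of squared deviations, which is the quantity this kind of "natural breaks" scheme aims to minimize. The values and the helper names (`sum_sq_dev`, `best_breaks`) are made up for illustration.

```python
from itertools import combinations

def sum_sq_dev(values):
    """Sum of squared deviations from the mean of one class."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_breaks(values, k):
    """Brute-force search for the k contiguous classes with least internal variation."""
    data = sorted(values)
    best_cost, best_classes = None, None
    # choose k-1 cut positions between consecutive sorted values
    for cuts in combinations(range(1, len(data)), k - 1):
        edges = (0, *cuts, len(data))
        classes = [data[a:b] for a, b in zip(edges, edges[1:])]
        cost = sum(sum_sq_dev(c) for c in classes)
        if best_cost is None or cost < best_cost:
            best_cost, best_classes = cost, classes
    return best_classes

toy_hdi = [0.21, 0.22, 0.25, 0.48, 0.50, 0.52, 0.80, 0.83]  # hypothetical values
classes = best_breaks(toy_hdi, k=3)
print(classes)
```

The three groups it returns follow the gaps in the data, not equal-width or equal-count bins. Brute force only works for tiny lists; real implementations use dynamic programming.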

The last [tutorial](https://dacss-cssmeths.github.io/Spatial-Exploring/) showed ways to highlight some relevant values on a map; now we will work to get statistically significant patterns, which are easier to read.

We are using maps to relate values to location: **Are the values of my unit of analysis affected by their location?**

Let's follow these steps to propose an answer:

- Identify the **neighborhood**: One location should be able to **see** how their neighbors "behave".
- Compute a measure of neighborhood effect: find out statistically whether proximity affects the values or not.
- If proximity does affect them, find the neighborhoods that statistically show some spatial pattern.


## 1. Identify the _neighborhood_

The neighborhood is the set of objects around one object. The problem is the meaning of "around". The figure below shows two ways we can identify a neighborhood (from [Vilella et al.](https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-020-00228-9)):



<center><img src="https://github.com/DACSS-CSSmeths/Spatial-Analytics/blob/main/neighborhood.jpg?raw=true" width="700"></center>



In maps, the **QUEEN** approach considers that two spatial objects (i.e. polygons) are neighbors if their borders share at least one coordinate (a point); the **ROOK** approach considers that two spatial objects are neighbors only if their borders share an edge (a line, that is, at least two consecutive coordinates).
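A toy grid makes the difference easy to see. The sketch below does not use real polygons: it assumes a regular grid of square cells, numbered row by row, where cells that share a side are rook neighbors and cells that share a side or only a corner are queen neighbors. The function name `grid_neighbors` is hypothetical.

```python
def grid_neighbors(n_rows, n_cols, rule="queen"):
    """Neighbors of each cell in a regular grid; cells are numbered row by row."""
    rook_steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # cells sharing an edge
    corner_steps = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # cells sharing only a corner
    steps = (rook_steps + corner_steps) if rule == "queen" else rook_steps
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            found = []
            for dr, dc in steps:
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:  # stay inside the grid
                    found.append(rr * n_cols + cc)
            neighbors[r * n_cols + c] = sorted(found)
    return neighbors

rook = grid_neighbors(3, 3, rule="rook")
queen = grid_neighbors(3, 3, rule="queen")
print("center cell, rook :", rook[4])
print("center cell, queen:", queen[4])
```

In this 3x3 grid the center cell has four rook neighbors and eight queen neighbors, so queen neighborhoods are never smaller than rook ones.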

Of course, you can have neighbors that are not contiguous (touching your borders), particularly if they are not polygons (lines or points). In that case you have a different approach: the k nearest neighbors or **KNN** (you choose how many neighbors, "k", each object gets).
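The nearest-neighbor idea can be sketched with plain points and Euclidean distance. This is only an illustration with made-up coordinates and a hypothetical helper `knn_neighbors`; for polygons, a library would typically measure distances between representative points such as centroids.

```python
from math import dist

def knn_neighbors(points, k):
    """For each point, the ids of its k closest points (Euclidean distance)."""
    neighbors = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))  # closest first
        neighbors[i] = others[:k]
    return neighbors

toy_points = [(0, 0), (1, 0), (0, 2), (5, 5), (6, 5)]  # hypothetical coordinates
knn2 = knn_neighbors(toy_points, k=2)
print(knn2)
```

Two things are worth noticing: every object gets exactly `k` neighbors, so nobody is left without neighbors, and the relation is not always mutual (point 2 is a neighbor of point 3, but point 3 is not a neighbor of point 2).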

Let me call the functions I need:

In [None]:
from libpysal.weights import Queen, Rook, KNN

Now, I will find all the neighbors:

In [None]:
w_rook = Rook.from_dataframe(peru_hdi_map,use_index=False) 
w_queen = Queen.from_dataframe(peru_hdi_map,use_index=False)
w_knn8 = KNN.from_dataframe(peru_hdi_map, k=8) # k=8: each unit gets its 8 nearest neighbors

Building the QUEEN and ROOK weights raised a warning, because some polygons have no neighbors (islands).
Here they are:

In [None]:
#the rows:
w_queen.islands

In [None]:
#the municipalities
peru_hdi_map.iloc[w_queen.islands,:]

In [None]:
# just plotting
peru_hdi_map.iloc[w_queen.islands,:].explore()

Notice the inventory of neighbors:

In [None]:
import pandas as pd # using pandas to ease analysis of data

# I am turning "w_queen.neighbors" into a pandas column (a 'Series')

pd.Series(w_queen.neighbors).head()

Above, you see the indexes (row numbers); that is, polygon 0 (the first row) has polygons "1" and "3" as neighbors.
You can also find out how many neighbors each one has:

In [None]:
pd.Series(w_queen.cardinalities).head()

This is simply the count of the neighbors listed above: polygon 0 (the first row) has two neighbors.
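The link between the inventory of neighbors, the counts, and the islands can be sketched with a plain dictionary. The dictionary below is hypothetical (it is not the Peruvian data); it only mimics the shape of the neighbors inventory.

```python
# hypothetical inventory: unit id -> ids of its neighbors
toy_neighbors = {0: [1, 3], 1: [0, 2], 2: [1], 3: [0], 4: []}

# the cardinality of a unit is just the length of its neighbor list
cardinalities = {unit: len(ids) for unit, ids in toy_neighbors.items()}

# an island is a unit whose neighbor list is empty
islands = [unit for unit, count in cardinalities.items() if count == 0]

print(cardinalities)
print(islands)
```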

Do not expect the same results from both techniques. The next figure shows you that.

In [None]:
Queen_Rook={'queen':w_queen.cardinalities,'rook':w_rook.cardinalities}

import seaborn as sea
import pandas as pd
from matplotlib import pyplot as plt


sea.histplot(pd.DataFrame(Queen_Rook),multiple="dodge")
plt.xlabel('Number of neighbors of a municipality')

## 2. Compute a measure of neighborhood effect: Global spatial correlation

### The adjacency matrix
We were able to compute all the previous information because the algorithms _Queen_, _Rook_ and _KNN_ created an **adjacency matrix**. Take a look:

In [None]:
pd.DataFrame(*w_queen.full()) # 1 means both are neighbors

The presence of **1** means two units are neighbors; for example, 1871 and 1870 are neighbors, but neither of them is connected to 1869:

In [None]:
peru_hdi_map.loc[[1869,1870,1871],:].plot(color='w', edgecolor='k')
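How an inventory of neighbors becomes an adjacency matrix can be sketched in a few lines. The neighbors below are hypothetical and the helper `adjacency_matrix` is only for illustration; the library builds and stores this matrix for you.

```python
def adjacency_matrix(neighbors):
    """Square 0/1 matrix: a 1 in row i, column j means j is a neighbor of i."""
    ids = sorted(neighbors)
    position = {unit: k for k, unit in enumerate(ids)}  # unit id -> row/column
    matrix = [[0] * len(ids) for _ in ids]
    for unit, its_neighbors in neighbors.items():
        for other in its_neighbors:
            matrix[position[unit]][position[other]] = 1
    return matrix

toy_neighbors = {0: [1, 3], 1: [0, 2], 2: [1], 3: [0]}  # hypothetical
A = adjacency_matrix(toy_neighbors)
for row in A:
    print(row)
```

With contiguity rules such as Queen or Rook the matrix is symmetric (if A touches B, B touches A), and the sum of each row is the number of neighbors of that unit.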

### The spatial weight matrix

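Spatial correlation needs the adjacency matrix turned into a **weight matrix**. Here we row-standardize the KNN weights (`'R'`), so that the weight of each neighbor is 1 divided by the number of neighbors. You get it this way: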

In [None]:
w_knn8.transform = 'R'

Now, the sum of every row is ONE:

In [None]:
# after transformation
pd.DataFrame(*w_knn8.full())

You see the sum by rows (axis=1) here:

In [None]:
pd.DataFrame(*w_knn8.full()).sum(axis=1)
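What the row transformation does can be sketched on a small hypothetical adjacency matrix: each row is divided by its own sum. A useful consequence is that the "spatial lag" of a variable (the weighted sum over neighbors) becomes the average value of the neighbors. The helper names and the toy values are assumptions for illustration.

```python
def row_standardize(matrix):
    """Divide each row by its sum, so every non-empty row adds up to 1."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([v / total if total else 0.0 for v in row])
    return result

def spatial_lag(W, values):
    """Weighted sum of the neighbors' values for each unit."""
    return [sum(w * v for w, v in zip(row, values)) for row in W]

A = [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0]]            # hypothetical adjacency matrix
W = row_standardize(A)
y = [10, 20, 30, 40]          # hypothetical values of a variable
lag = spatial_lag(W, y)

print(W[0])   # weights of unit 0: one half for each of its two neighbors
print(lag)    # average value among each unit's neighbors
```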

### The Global Moran

Most people are familiar with the Pearson correlation coefficient (_Pearson's r_). The r coefficient measures the linear relationship between two variables. We will do something similar now.

Take the variable _hdi_ from the GeoDF *peru_hdi_map*. In this case we will not use another variable; we will use the same variable twice! That is, we check whether the _hdi_ of a municipality is correlated with the _hdi_ of its neighbors. The result is known as the **Moran's I** statistic. Let's get it:

In [None]:
from esda.moran import Moran

# we use the variable 'hdi' and the weight matrix
moranHDI = Moran(y=peru_hdi_map['hdi'], 
                 w=w_knn8)

# results
moranHDI.I, moranHDI.p_sim

Now we know that there is a significant (p = 0.001) positive spatial correlation (I = 0.63): **when the value of hdi is high in one location, it is usually high in the neighbors (or if it is low, it is usually low in the neighbors)**.
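To see where those two numbers come from, here is a minimal sketch of Moran's I computed by hand on a toy example, with a simplified one-sided permutation test for the pseudo p-value. It assumes six hypothetical units in a line (each one touching the previous and the next) and row-standardized weights; the helper names are made up, and the library's own test may count extreme values differently.

```python
import random

def morans_i(y, W):
    """Moran's I for values y and a weight matrix W (list of rows)."""
    n = len(y)
    mean = sum(y) / n
    z = [v - mean for v in y]                    # deviations from the mean
    s0 = sum(sum(row) for row in W)              # sum of all weights
    cross = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(v * v for v in z)

def pseudo_p_value(y, W, permutations=999, seed=2022):
    """Share of random reshuffles with an I at least as large as the observed one."""
    rng = random.Random(seed)
    observed = morans_i(y, W)
    shuffled = list(y)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)                    # break the link between value and place
        if morans_i(shuffled, W) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (permutations + 1)

def chain_weights(n):
    """Row-standardized weights for n units in a line."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        touching = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in touching:
            W[i][j] = 1 / len(touching)
    return W

W = chain_weights(6)
clustered = [1, 2, 3, 4, 5, 6]      # similar values sit next to each other
alternating = [1, 6, 1, 6, 1, 6]    # neighbors are always different

i_clustered = morans_i(clustered, W)
i_alternating = morans_i(alternating, W)
p_clustered = pseudo_p_value(clustered, W)
print(i_clustered, i_alternating, p_clustered)
```

The clustered values give a positive I and the alternating values give a negative one. The permutation step is the reason a seed matters: different reshuffles give slightly different pseudo p-values.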

So far we have the GLOBAL Moran's I, which tells us the global tendency. But it would be even more interesting to know where spatial correlation (the neighborhood effect!) is actually happening. For that you need the local Moran!

## 3. Find spatial patterns in neighborhoods

Here we need to compute the **Local Indicators of Spatial Association** (LISA, the local Moran) for each map object. That will help us assign each object to one of four **quadrants**, which represent spatial clusters (spots) or spatial outliers:

* A **hotSpot (HH)** is a geometry with a high value in a particular variable that is surrounded by geometries with high values.

* A **coldSpot (LL)** is a geometry with a low value in a particular variable that is surrounded by geometries with low values.

* A **coldOutlier (LH)** is a geometry whose value in the variable is low BUT is surrounded by polygons with high values.

* A **hotOutlier (HL)** is a geometry whose value in the variable is high BUT is surrounded by polygons with low values.

It is also possible that no significant correlation is detected. Let's compute the LISA:

In [None]:
# A LISA for each district using hdi
from esda.moran import Moran_Local

lisa_HDI = Moran_Local(y=peru_hdi_map['hdi'], 
                       w=w_knn8,
                       seed=2022) # use this seed if you want to get the same results.

You have this information in lisa_HDI:

In [None]:
# quadrant, # significance
ResultLISA=pd.DataFrame({'lisa_Qlabel':lisa_HDI.q, 'lisa_Qsig':lisa_HDI.p_sim})
ResultLISA

The first column **lisa_Qlabel** tells you which quadrant the municipality is in; this is the legend:

* 1 HH
* 2 LH
* 3 LL
* 4 HL
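The legend above can be read as a rule on two signs: whether the unit is above or below the mean, and whether the average of its neighbors (the spatial lag) is above or below the mean. The sketch below applies that rule to four hypothetical pairs of centered values; the function `quadrant` and the treatment of an exact zero as "low" are assumptions for illustration.

```python
def quadrant(z_value, z_lag):
    """Quadrant code from the centered value of a unit and of its neighbors."""
    if z_value > 0 and z_lag > 0:
        return 1   # HH: high value, high neighbors
    if z_value <= 0 and z_lag > 0:
        return 2   # LH: low value, high neighbors
    if z_value <= 0 and z_lag <= 0:
        return 3   # LL: low value, low neighbors
    return 4       # HL: high value, low neighbors

# hypothetical (centered value, centered spatial lag) pairs
examples = {"A": (1.2, 0.8), "B": (-0.9, 0.7), "C": (-1.1, -0.6), "D": (1.5, -0.4)}
labels = {name: quadrant(z, lag) for name, (z, lag) in examples.items()}
print(labels)
```

Every unit always lands in one of the four quadrants, which is why the significance of each local relationship has to be checked separately.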

The LISA also gives you a pseudo p-value for each local relationship: the smaller it is, the less likely it is that the local pattern is due to chance. The second column **lisa_Qsig** holds it. We need to identify which quadrants are not statistically significant, so let me relabel the non-significant ones as **0**. Let's follow these steps:

1. Identify which QUADRANT LABELs are NOT significant:

In [None]:
# keep q if p is less than 0.05; otherwise relabel it as 0 (not significant)
peru_hdi_map['HDI_quadrant']=[q if p <0.05 else 0 for q,p in zip(ResultLISA.lisa_Qlabel,ResultLISA.lisa_Qsig)  ]

Now, we know:

In [None]:
# quadrant: 0:No_sig 1 HH,  2 LH,  3 LL,  4 HL
peru_hdi_map['HDI_quadrant'].value_counts()

We have 1005 districts (polygons) where the local correlation is not statistically significant. We have 422 districts in neighborhoods of the kind LL (a low HDI surrounded by low HDI values), and so on.


2. Rename **HDI_quadrant**

   Instead of numbers, we can have labels:

In [None]:
# the dictionary to make changes
newLabels = {0: '0_NoSig',1: '1_HotSpot',2: '2_ColdOutlier',3: '3_ColdSpot',4: '4_HotOutlier'}
newLabels

Here we recode:

In [None]:
peru_hdi_map.replace({'HDI_quadrant':newLabels},inplace=True)


# now
peru_hdi_map['HDI_quadrant'].value_counts()

We have the data ready.

In [None]:
peru_hdi_map.info()

Let's save this data:

In [None]:
peru_hdi_map.to_file("peru_hdi_map.gpkg", layer='spatial', driver="GPKG")

We can now plot this last column:

In [None]:
from matplotlib import colors

# custom colors
myColMap = colors.ListedColormap([ 'gold', 'pink', 'k', 'cyan','red'])

peru_hdi_map.explore(
    column="HDI_quadrant",  
    tooltip=["municipality","hdi"],  
    tiles="CartoDB positron",  # use "CartoDB positron" tiles
    cmap=myColMap,  # colormap
    style_kwds=dict(stroke=False),  # no borders
    legend_kwds={'caption':'Quadrant type'}
)

One of the most important insights you get is the discovery of the outliers (HL or LH). For instance, these ones:

In [None]:
peru_hdi_map[peru_hdi_map.HDI_quadrant=='4_HotOutlier'].explore(color='red')

From here, you can propose other queries:

In [None]:
# the mean and median per quadrant:
peru_hdi_map.groupby('HDI_quadrant').agg({'hdi': ['mean','median']})

In [None]:
# which are the min value per quadrant:
whichInfo=['municipality','HDI_quadrant','hdi']
theVar='hdi'
theGroups='HDI_quadrant'
peru_hdi_map.loc[peru_hdi_map.groupby(theGroups)[theVar].transform("min") == peru_hdi_map[theVar]][whichInfo]

These are the positions:

In [None]:
theMins=peru_hdi_map.loc[peru_hdi_map.groupby(theGroups)[theVar].transform("min") == peru_hdi_map[theVar]][whichInfo]
theMins.index

Use those positions to plot:

In [None]:
peru_hdi_map.loc[theMins.index,:].explore(tiles='CartoDB dark_matter',
    column="HDI_quadrant",
    cmap=myColMap, legend=True, style_kwds={'weight':4,'color':'white'}
)

To tell which is which in the plot above, you need to zoom in.

# Neighborhoods based on two variables

Remember we have these variables:

In [None]:
peru_hdi_map.columns

Let me see a classical correlation between two variables:

In [None]:
peru_hdi_map[['No_sanitaryServ','graduated_HS']].corr()

In [None]:
sea.regplot(data=peru_hdi_map, x='No_sanitaryServ',y='graduated_HS',line_kws=dict(color="r"))

In these two variables, higher values do not mean the same thing: a high value in _No_sanitaryServ_ is a bad thing, while a high value in *graduated_HS* is a good thing. So let me reverse *graduated_HS*:

In [None]:
peru_hdi_map.graduated_HS.describe()

In [None]:
# then
peru_hdi_map['Non_graduated_HS']=100-peru_hdi_map.graduated_HS

Now you should expect the sign of the correlation to flip:

In [None]:
peru_hdi_map[['No_sanitaryServ','Non_graduated_HS']].corr()

In [None]:

sea.regplot(data=peru_hdi_map, x='No_sanitaryServ',y='Non_graduated_HS',line_kws=dict(color="r"))
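Reversing a percentage only changes the direction of the relationship, not its strength. The small sketch below checks this with made-up percentages and a hand-written Pearson correlation (the helper `pearson` and the numbers are hypothetical, not the Peruvian data).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

no_sanitary = [80, 65, 50, 30, 10]            # hypothetical percentages
graduated = [20, 35, 45, 60, 85]              # hypothetical percentages
non_graduated = [100 - g for g in graduated]  # the reversed variable

r_original = pearson(no_sanitary, graduated)
r_reversed = pearson(no_sanitary, non_graduated)
print(r_original, r_reversed)
```

Both coefficients have the same size and opposite signs, so after the reversal "high" means "bad" in both variables and the quadrants (HH, LL, and so on) are easier to interpret.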

We can use TWO variables to get the **Bivariate Moran**:

In [None]:
from esda.moran import Moran_BV
noToilet_HS=Moran_BV(x=peru_hdi_map['No_sanitaryServ'],y=peru_hdi_map['Non_graduated_HS'], w=w_knn8)
noToilet_HS.I,noToilet_HS.p_sim

The global **Bivariate** Moran tells us that there are neighborhoods where a geometry with a high value in *No_sanitaryServ* is surrounded by high values in *Non_graduated_HS* (the same holds for low-low, as you know).
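The bivariate version can be sketched as the average product between the standardized value of one variable in a unit and the average standardized value of the other variable among its neighbors. The code below is a simplified illustration with five hypothetical units in a line and row-standardized weights; the helper names are made up, and a library may normalize slightly differently (for example, dividing by n - 1).

```python
from math import sqrt

def standardize(values):
    """Center on the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def spatial_lag(W, values):
    """Weighted sum of the neighbors' values for each unit."""
    return [sum(w * v for w, v in zip(row, values)) for row in W]

def bivariate_moran(x, y, W):
    """x of each unit against the neighbors' average of y (W row-standardized)."""
    zx, zy = standardize(x), standardize(y)
    lag_y = spatial_lag(W, zy)
    return sum(a * b for a, b in zip(zx, lag_y)) / len(x)

# five units in a line, row-standardized weights (hypothetical)
W = [[0.0, 1.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.5, 0.0, 0.5],
     [0.0, 0.0, 0.0, 1.0, 0.0]]
x = [1, 2, 3, 4, 5]
y_same = [2, 4, 6, 8, 10]       # rises with x
y_opposite = [10, 8, 6, 4, 2]   # falls as x rises

i_same = bivariate_moran(x, y_same, W)
i_opposite = bivariate_moran(x, y_opposite, W)
print(i_same, i_opposite)
```

A positive value means high values of the first variable tend to sit next to high values of the second one; a negative value means they tend to sit next to low values.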

Let's find those neighborhoods as before:

In [None]:
from esda.moran import Moran_Local_BV

#HH=1, LH=2, LL=3, HL=4

# this is for the local neighborhood:
moran_loc_bv = Moran_Local_BV(x=peru_hdi_map['No_sanitaryServ'],y=peru_hdi_map['Non_graduated_HS'], w=w_knn8,seed=2022)

# results as a dataframe
ResultLISA_BV=pd.DataFrame({'BV_lisa_Qlabel':moran_loc_bv.q, 'BV_lisa_Qsig':moran_loc_bv.p_sim})

# identifying the non significant relationships
peru_hdi_map['HDI_quadrant_BV']=[q if p <0.05 else 0 for q,p in zip(ResultLISA_BV.BV_lisa_Qlabel,ResultLISA_BV.BV_lisa_Qsig)  ]

# relabelling them
peru_hdi_map.replace({'HDI_quadrant_BV':newLabels},inplace=True)

# we have
peru_hdi_map['HDI_quadrant_BV'].value_counts()

For example, we can state that we have 380 municipalities with a low percent of households that lack sanitary services, surrounded by municipalities that also have a low percent of people who did not finish high school.
Notice that several spatial outliers are present; for instance, there are 159 municipalities with a high percent of households that lack sanitary services, surrounded by municipalities with a low percent of people who did not finish high school.

Here we can see them:

In [None]:
peru_hdi_map.explore(
    column="HDI_quadrant_BV",  
    tooltip=["municipality","No_sanitaryServ","graduated_HS","HDI_quadrant_BV"],  
    tiles="CartoDB positron",  
    cmap=myColMap,  # colormap
    style_kwds=dict(stroke=False),  
    legend_kwds={'caption':'BV Quadrant type'}
)

Let me overwrite the map file:

In [None]:
peru_hdi_map.to_file("peru_hdi_map.gpkg", layer='spatial', driver="GPKG")



<div class="alert-success">

<header>
    <h1>Homework 1 (alternative)</h1>
    
  </header>
    

    
</div>

MAP [JSON](https://gist.github.com/sdwfrost/d1c73f91dd9d175998ed166eb216994a)
DATA EXC