In this section we run a first analysis of the data we have collected, focusing on the following:

- Spatial Regression Analysis:
We can use spatial regression to understand the relationships between tree distribution, ecological benefits, and environmental factors. This can help identify areas where planting specific tree species would have the greatest impact on reducing pollution or increasing oxygen levels.

Model documentation examples: 
- https://deepforest.readthedocs.io/en/latest/getting_started.html#sample-data
- https://github.com/kodujdlapolski/tree-research/blob/master/model.ipynb
- https://treeco.netlify.app

In [16]:
import geopandas as gpd
import libpysal as lp
import numpy as np
from spreg import OLS

# Note: we are working with a sample of the data.
# Load the tree dataset with coordinates and ecological benefits
trees = gpd.read_file('../data/geojson/geo_data_trees.geojson')

# Load the district (circoscrizioni) boundaries used as environmental data
environment = gpd.read_file('../data/geojson/circoscrizioni.geojson')

# Create a binary distance-band spatial weights matrix.
# The threshold is expressed in the units of the layer's CRS
# (degrees, not metres, if the GeoJSON uses longitude/latitude).
w = lp.weights.DistanceBand.from_dataframe(trees, threshold=100, binary=True)

# Convert the 'area' column to float (if needed)
environment['area'] = environment['area'].astype(float)

# Dependent variable: convert decimal commas to points and cast to float
trees['Total Annual Benefits (eur/yr)'] = trees['Total Annual Benefits (eur/yr)'].str.replace(',', '.').astype(float)
y = trees['Total Annual Benefits (eur/yr)']
# Optional: spatial lag of the dependent variable (not used by the model below)
#ylag = lp.weights.lag_spatial(w, y)

# Independent variable: carbon storage, converted the same way
trees['Carbon Storage (eur)'] = trees['Carbon Storage (eur)'].str.replace(',', '.').astype(float)
X = trees[['Carbon Storage (eur)']]

# Note: y and X must have the same number of rows; if they differ, trim both to the shorter length

# Fit OLS with spatial diagnostics (spreg adds the constant term itself)
# Alternative tried with district attributes (needs arrays of matching length):
#model = OLS(y.values.reshape(-1, 1), environment[['area', 'perimetro']].values, w=w, spat_diag=True)
model = OLS(y.values.reshape(-1, 1), X.values, w=w, spat_diag=True)
print(model.summary)


# Store the residuals for inspection (they can also be plotted on a map)
trees['residuals'] = model.u.flatten()  # model.u has shape (n, 1)
trees['residuals'].describe()


REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:         266
Mean dependent var  :      1.8805                Number of Variables   :           2
S.D. dependent var  :      3.5714                Degrees of Freedom    :         264
R-squared           :      0.9147
Adjusted R-squared  :      0.9144
Sum squared residual:     288.312                F-statistic           :   2830.9633
Sigma-square        :       1.092                Prob(F-statistic)     :  3.937e-143
S.E. of regression  :       1.045                Log likelihood        :    -388.150
Sigma-square ML     :       1.084                Akaike info criterion :     780.301
S.E of regression ML:      1.0411                Schwarz criterion     :     787.468

-----------------------------------------------------------------------------

count    2.660000e+02
mean    -1.843137e-15
std      1.043058e+00
min     -7.346083e+00
25%     -5.038463e-01
50%     -3.729156e-01
75%      1.623720e-01
max      4.751881e+00
Name: residuals, dtype: float64
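
The residual summary above says nothing about where the large residuals are located. One common check is global Moran's I, which measures whether similar residuals cluster in space. The following is a minimal NumPy sketch of that statistic on hypothetical toy data (the coordinates, values, and helper names `distance_band_weights` and `morans_i` are illustrative assumptions, not part of the notebook); in practice a tested implementation from the PySAL ecosystem would be preferable.

```python
import numpy as np

def distance_band_weights(coords, threshold):
    """Binary weights: 1 where two distinct points lie within `threshold`."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    W = (dist <= threshold).astype(float)
    np.fill_diagonal(W, 0.0)  # a point is not its own neighbor
    return W

def morans_i(values, W):
    """Global Moran's I for a 1-D array and a dense (n, n) weights matrix."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    n = z.size
    s0 = W.sum()  # sum of all weights
    return (n / s0) * (z @ W @ z) / (z @ z)

# Hypothetical toy data: six points on a line
coords = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0], [5, 0]])
W = distance_band_weights(coords, threshold=1.0)

# Residuals that are clustered in space (positive on the left, negative on the right)
clustered = np.array([1.0, 1.2, 0.9, -1.1, -0.8, -1.2])
# Residuals that alternate in sign between neighbors
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

I_clustered = morans_i(clustered, W)
I_alternating = morans_i(alternating, W)
print(I_clustered, I_alternating)
```

A clearly positive value suggests that neighboring trees have similar residuals, which would be a reason to consider an explicitly spatial model; a negative value suggests that neighbors tend to differ.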

Tutorial: 
- https://sustainability-gis.readthedocs.io/en/latest/lessons/L4/spatial_regression.html

In [21]:
from pysal.model import spreg
from pysal.lib import weights
from scipy import stats
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns
import osmnx as ox
sns.set(style="whitegrid")

# Read OSM data - get administrative boundaries

# define the place query
query = {'city': 'Bologna'}

# get the boundaries of the place (add additional buffer around the query)
boundaries = ox.geocode_to_gdf(query, buffer_dist=5000)

# Let's check the boundaries on a map
boundaries.explore()



In [22]:
# Filter data geographically
trees_filtered = gpd.sjoin(trees, boundaries[["geometry"]])
trees_filtered = trees_filtered.reset_index(drop=True)

In [23]:
trees_filtered["Total Annual Benefits (eur/yr)"].head()

0    0.16
1    0.80
2    0.16
3    0.90
4    0.98
Name: Total Annual Benefits (eur/yr), dtype: float64

In [27]:
# The tooltip parameter specifies which attributes are shown when hovering over the points
# The vmax parameter sets the maximum value for the colormap (here, all values of 1000 eur/yr and above are combined)
trees_filtered.explore(column="Total Annual Benefits (eur/yr)", cmap="Reds", scheme="quantiles", k=4, tooltip=["Species Name", "Total Annual Benefits (eur/yr)"], vmax=1000, tiles="CartoDB Positron")

# Baseline (nonspatial) regression

Before introducing explicitly spatial methods, we run a simple linear regression model. This serves two purposes. First, it lets us set out the main principles of the model and how to interpret its coefficients, which the spatial models will build on. Second, it provides a baseline that we can use to evaluate how meaningful the spatial extensions are.
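
To make the interpretation of coefficients concrete, here is a minimal sketch of a log-linear baseline fitted with NumPy on hypothetical synthetic data (the variable names, sample size, and true coefficients are assumptions chosen for illustration, not values from the tree dataset). When the dependent variable is a logarithm, a one-unit increase in the explanatory variable multiplies the original outcome by `exp(slope)`.

```python
import numpy as np

# Hypothetical synthetic data, not taken from the tree dataset
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
true_intercept, true_slope = -1.0, 0.25
log_y = true_intercept + true_slope * x + rng.normal(0.0, 0.1, size=200)

# Design matrix with an explicit constant column
# (spreg.OLS adds the constant itself, plain least squares does not)
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, log_y, rcond=None)
intercept, slope = beta

# In a log-linear model, one extra unit of x multiplies y by exp(slope)
pct_change = (np.exp(slope) - 1.0) * 100.0
print(f"slope = {slope:.3f}, implied change in y per unit of x = {pct_change:.1f}%")
```

For small slopes the percentage change is close to `100 * slope`, which is why log-linear coefficients are often read directly as approximate percentage effects.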

In [34]:
# explanatory_vars = ['crown_height', 'crown_width', 'dbh', 'age', 'Leaf Area (m2)']
explanatory_vars = ['Carbon Storage (eur)']

In [29]:
trees_filtered["log_benefit"] = np.log(trees_filtered["Total Annual Benefits (eur/yr)"] + 0.000001)

In [31]:
# Build a spatial weights matrix that connects every observation to its 8 nearest neighbors.
# Passing it to the model (w=w, spat_diag=True) yields extra spatial diagnostics for the baseline.

w = weights.KNN.from_dataframe(trees_filtered, k=8)
w.transform = 'R'  # row-standardize so that each row of weights sums to 1
w

 There are 2 disconnected components.


<libpysal.weights.distance.KNN at 0x1818c02e0>
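
As an illustration of what a row-standardized k-nearest-neighbor weights matrix represents, the sketch below builds one with plain NumPy on hypothetical toy coordinates and uses it to compute a spatial lag, that is, the average value among each point's neighbors. The helper name `knn_row_standardized` and the data are assumptions for illustration; this is not how libpysal implements `KNN` internally.

```python
import numpy as np

def knn_row_standardized(coords, k):
    """Dense (n, n) weights: each row has k equal weights that sum to 1."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest points
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, nearest.ravel()] = 1.0
    return W / W.sum(axis=1, keepdims=True)  # row-standardization

# Hypothetical toy data: five points on a line with one attribute each
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0], [10.0, 0.0], [11.0, 0.0]])
values = np.array([1.0, 2.0, 3.0, 10.0, 20.0])

W = knn_row_standardized(coords, k=2)
lag = W @ values  # spatial lag: mean of each point's 2 nearest neighbors
print(lag)
```

Note that k-nearest-neighbor weights are generally not symmetric: in this toy example the point at x = 10 counts the point at x = 2.5 as a neighbor, but not the other way round.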

In [36]:
m1 = spreg.OLS(trees_filtered[['log_benefit']].values, trees_filtered[explanatory_vars].values, 
                  name_y = 'log_benefit', name_x = explanatory_vars)

In [37]:
print(m1.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  : log_benefit                Number of Observations:         266
Mean dependent var  :     -0.4021                Number of Variables   :           2
S.D. dependent var  :      1.3419                Degrees of Freedom    :         264
R-squared           :      0.4342
Adjusted R-squared  :      0.4320
Sum squared residual:     269.996                F-statistic           :    202.5879
Sigma-square        :       1.023                Prob(F-statistic)     :   1.668e-34
S.E. of regression  :       1.011                Log likelihood        :    -379.421
Sigma-square ML     :       1.015                Akaike info criterion :     762.842
S.E of regression ML:      1.0075                Schwarz criterion     :     770.009

-----------------------------------------------------------------------------