GWR (Geographically Weighted Regression) is a spatially varying coefficients model that explores how relationships between predictors (independent variables) and an outcome (dependent variable) change across space.

In this project we want to investigate how mortality counts from pulmonary embolism change over space (counties).
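To make the idea of spatially varying coefficients concrete, here is a minimal NumPy sketch of what a GWR fit does conceptually: one weighted least-squares regression per location, with weights that decay with distance. It is an illustration on synthetic data, not the mgwr implementation. The function name `gwr_local_coefficients`, the fixed Gaussian kernel, and the bandwidth value are assumptions made for this example.

```python
import numpy as np

def gwr_local_coefficients(coords, X, y, bandwidth):
    """Fit one weighted least-squares regression per location.

    Illustrative only: fixed Gaussian kernel, Euclidean distances.
    Returns an (n, k + 1) array with the local intercept in column 0.
    """
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # add intercept column
    betas = np.empty((n, design.shape[1]))
    for i in range(n):
        # Observations closer to location i get larger weights.
        dist = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)
        sw = np.sqrt(w)
        # Weighted least squares via ordinary least squares on scaled rows.
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        betas[i] = beta
    return betas

# Synthetic data (made up): the slope of x grows from west to east.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))
x = rng.normal(size=200)
true_slope = 1.0 + 0.5 * coords[:, 0]
y = 2.0 + true_slope * x + 0.1 * rng.normal(size=200)

betas = gwr_local_coefficients(coords, x[:, None], y, bandwidth=1.5)
```

Each row of `betas` holds the intercept and slope estimated around one location, so mapping a column of `betas` shows how that relationship changes across space.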

In [1]:
import os
import configparser
import time
import warnings
import numpy as np
import pandas as pd
from tqdm import tqdm
import geopandas as gpd
import statsmodels.api as sm
from mgwr.gwr import GWR, MGWR
from mgwr.sel_bw import Sel_BW
from mgwr.utils import shift_colormap
from shapely.geometry import Point
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
warnings.filterwarnings('ignore')

In [2]:
BASE_DIR = os.getcwd()
CONFIG = configparser.ConfigParser()
CONFIG.read(os.path.join(BASE_DIR, 'script_config.ini'))

BASE_PATH = os.path.abspath(os.path.join(os.getcwd(), '..', 'data'))

DATA_RAW = os.path.join(BASE_PATH, 'raw')
DATA_RESULTS = os.path.join(BASE_PATH, '..', 'results')

In [44]:
data_path = os.path.join(DATA_RESULTS, 'final', 
                   'pulmonary_air_quality_data.csv')

df = pd.read_csv(data_path)
df = df[df['geometry'].notna()]
df = gpd.GeoDataFrame(df, geometry = 
     gpd.GeoSeries.from_wkt(df['geometry']), crs = 'EPSG:4326')

df = df[['geometry', 'sex', 'race_recode3', 'age_cat', 
         'Daily Mean PM2.5 Concentration', 'Daily AQI Value', 
         'mort_per_100k']]

GWR requires numeric predictors, so we have to convert the categorical variables in our dataset into numeric dummy variables (one-hot encoding). Before we do that, we first have to handle the missing values.

We fill the missing values in each categorical column with that column's most frequent value (the mode).

In [45]:
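cat_cols = ['sex', 'race_recode3', 'age_cat']
for col in cat_cols:
    # Assign the result back rather than calling fillna(inplace=True) on a
    # column selection, which may not modify df under pandas copy-on-write.
    df[col] = df[col].fillna(df[col].mode()[0])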
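As a quick illustration of mode imputation, the sketch below applies the same idea to a tiny made-up frame. The `toy` data are hypothetical and only show what the fill does.

```python
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "sex": ["F", None, "F", "M"],
    "age_cat": ["30 - 49 years", "10 - 29 years", None, "10 - 29 years"],
})

for col in ["sex", "age_cat"]:
    # mode() returns a Series of the most frequent values; take the first.
    toy[col] = toy[col].fillna(toy[col].mode()[0])
```

Note that mode imputation pulls every missing record toward the majority category, which can understate variation in small groups.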

Next, we encode the categorical columns as dummy variables.

In [46]:
df_encoded = pd.get_dummies(df, columns = 
            ["sex", "race_recode3", "age_cat"], 
             dtype = int, drop_first = True)
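The effect of `drop_first = True` is easiest to see on a small example. The sketch below uses a made-up frame (`toy` and its values are hypothetical): one level per categorical column is dropped and becomes the reference category, which avoids perfect collinearity with the intercept.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numeric column.
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "pm25": [8.1, 12.4, 9.7],
})

# "F" is the first level alphabetically, so it is dropped and becomes
# the reference category; only the sex_M indicator remains.
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int, drop_first=True)
```

Coefficients on the remaining dummy columns are then read as differences relative to the dropped level.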

Now we define the dependent variable (y), mort_per_100k, and the independent variables (X): 'Daily Mean PM2.5 Concentration', 'Daily AQI Value', and the dummy columns created by the encoding step.

In [47]:
y = df_encoded["mort_per_100k"].values.reshape((-1, 1))

In [48]:
X = df_encoded[["Daily Mean PM2.5 Concentration", "Daily AQI Value"] + 
    [col for col in df_encoded.columns if col.startswith(("sex_", 
    "race_recode3_", "age_cat_"))]].values

Now we extract individual (x, y) coordinates from the geometry column.

In [49]:
u = df_encoded.geometry.x.values
v = df_encoded.geometry.y.values
coords = np.column_stack((u, v))

In [52]:
df_encoded.shape

(727225, 10)

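We now fit the GWR model. The run recorded here stopped with the error below, which means a string-valued category label (it looks like an age_cat value) was still present in the predictors passed to the model:

ValueError: could not convert string to float: '10 - 29 years'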
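One way to fit the model while catching this kind of error early is sketched below. The helper names `as_float_matrix` and `fit_gwr` are hypothetical, and the fit uses mgwr's default kernel and bandwidth search criterion, which may not match the settings intended for this analysis. mgwr is imported inside `fit_gwr` so that the validation helper can be used on its own.

```python
import numpy as np

def as_float_matrix(X):
    """Convert X to a 2-D float array, failing early with a clear message."""
    try:
        arr = np.asarray(X, dtype=float)
    except (ValueError, TypeError) as exc:
        raise ValueError(
            "X contains non-numeric values; encode categorical columns first"
        ) from exc
    if arr.ndim != 2:
        raise ValueError("X must be 2-dimensional (n_samples, n_features)")
    return arr

def fit_gwr(coords, y, X):
    """Select a bandwidth and fit a GWR model (hypothetical helper).

    coords: (n, 2) array of point coordinates
    y:      (n, 1) array of the dependent variable
    X:      (n, k) array of numeric predictors
    """
    # Imported here so the validation above does not require mgwr.
    from mgwr.gwr import GWR
    from mgwr.sel_bw import Sel_BW

    X = as_float_matrix(X)
    bw = Sel_BW(coords, y, X).search()  # bandwidth search with mgwr defaults
    return GWR(coords, y, X, bw).fit()
```

With the arrays defined earlier, a call would look like `results = fit_gwr(coords, y, X)`. GWR builds a separate weighted regression for every observation, so on a table of this size it may be necessary to fit on aggregated or sampled data first.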

Now we extract the local parameter estimates (coefficients), local R², and residuals.

In [34]:
local_coefs = results.params 
local_R2 = results.localR2  
residuals = results.resid_response

We then combine these arrays with our original data and geometry:

In [36]:
predictor_names = ["Intercept"] + list(
    ["Daily Mean PM2.5 Concentration", "Daily AQI Value"]
    + [col for col in df_encoded.columns if col.startswith((
        "sex_", "race_recode3_", "age_cat_"))])

coef_df = pd.DataFrame(local_coefs, columns = predictor_names)

coef_df["local_R2"] = local_R2
coef_df["residuals"] = residuals

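# Reset the index so rows line up with coef_df, and reuse the 'geometry'
# column already present in the concatenated frame. Passing
# df_encoded.geometry separately would keep the old, non-contiguous index
# and could misalign geometries with rows.
gwr_df = gpd.GeoDataFrame(
    pd.concat([df_encoded.reset_index(drop = True), coef_df], axis = 1),
    geometry = 'geometry', crs = df_encoded.crs)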

In [38]:
folder_out = os.path.join(DATA_RESULTS, 'final')

filename = 'GWR_results.csv'
path_out = os.path.join(folder_out, filename)
gwr_df.to_csv(path_out, index = False)