# Modeling Urban Growth 2010-2020

Having collected all the predictors, we built a model to "predict" urban growth from 2010 to 2020 using only information available by 2010, and then tested its predictions against the urban growth actually observed over that period. In the next section, this "experience" is "borrowed" to predict urban growth from 2020 to 2030.

The model is a binary classifier: for each cell that was not developed in 2010, it predicts whether the cell turns into a developed cell ("True") or not ("False") by 2020. Because urban growth in the Denver MSA was modest, the resulting dataset is highly imbalanced: the majority of cells were not developed between 2010 and 2020. We therefore downsampled non-developed cells during training, using the `BalancedRandomForestClassifier` from the `imblearn` package.
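To make the idea of downsampling concrete, here is a minimal standard-library sketch of random undersampling: every class is cut down to the size of the rarest class. It illustrates the general idea only and is not the internal code of `imblearn`; the function name `undersample_majority` and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter


def undersample_majority(labels, seed=0):
    """Return sorted row indices with every class cut down to the minority class size."""
    rng = random.Random(seed)
    indices_by_class = {}
    for i, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(i)
    # Size of the rarest class (here: cells that became developed)
    n_minority = min(len(idx) for idx in indices_by_class.values())
    kept = []
    for idx in indices_by_class.values():
        # Draw without replacement so no cell is kept twice
        kept.extend(rng.sample(idx, n_minority))
    return sorted(kept)


# Hypothetical toy labels: 5 cells became developed, 95 did not
labels = [True] * 5 + [False] * 95
kept = undersample_majority(labels)
balanced_counts = Counter(labels[i] for i in kept)
print(balanced_counts)
```

In a balanced random forest this kind of resampling is applied separately to the sample drawn for each tree, so different trees see different subsets of the non-developed cells.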

In [8]:
import numpy as np
import pandas as pd
import geopandas as gpd
import altair as alt
import matplotlib.pyplot as plt

from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import train_test_split

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.metrics import RocCurveDisplay

from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import roc_curve, auc, confusion_matrix

from sklearn import metrics

crs = "EPSG:2232"

# Load data where we left off the last time
fishnet = gpd.read_file("../data/fishnets/with_counties.geojson")
fishnet.crs = crs

In [4]:
# Determine variables to use for predicting for 2010-2020
existing_vars = {
    "population_2010": "population",
    "land_cover_type_2009": "land_cover_type",
    "highway_distance": "highway_distance",
    "rail_station_distance": "rail_station_distance",
    "facility_distance": "facility_distance",
    "lag_developed_2009": "lag_developed",
    "lag_population_2010": "lag_population",
    "county": "county",
    "developed_2009": "init_developed",
    "developed_2019": "developed",
}

# Dataset used for prediction
past = (
    fishnet[list(existing_vars.keys()) + ["geometry"]]
    .query("developed_2009 == False") # Use only cells undeveloped in 2010
    .copy()
    .rename(columns=existing_vars)
)

def make_classifier(
    by_county=False,
    use_facilities=False,
    use_highways=False,
    max_depth=20,
    min_samples_split=2,
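):
    """Build a preprocessing + balanced random forest pipeline.

    The boolean flags add optional predictors: `county` (one-hot encoded),
    `facility_distance`, and `highway_distance`.
    """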
    log_pipeline = Pipeline(
        [("log", FunctionTransformer(np.log1p)), ("scaler", StandardScaler())]
    )

    num_predictors = ["population", "lag_developed", "lag_population"]

    num_predictors_to_log = [
        "rail_station_distance",
    ]

    if use_facilities:
        num_predictors += ["facility_distance"]
    if use_highways:
        num_predictors += ["highway_distance"]

    cat_predictors = [
        "land_cover_type",
    ]

    if by_county:
        cat_predictors += ["county"]

    transformer = ColumnTransformer(
        transformers=[
            ("log", log_pipeline, num_predictors_to_log),
            ("scale", StandardScaler(), num_predictors),
            ("onehot", OneHotEncoder(), cat_predictors),
        ]
    )

    return make_pipeline(
        transformer,
        BalancedRandomForestClassifier(
            n_estimators=200,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
        ),
    )

baseline_model = make_classifier(
    by_county=False, use_facilities=False, use_highways=False
)
use_highways_model = make_classifier(
    by_county=False, use_facilities=False, use_highways=True
)

use_facilities_model = make_classifier(
    by_county=False, use_facilities=True, use_highways=False
)
use_highways_facilities_model = make_classifier(
    by_county=False, use_facilities=True, use_highways=True
)

We tested four different models.

- **Model 1**: Using population density, lag population density, land cover, lag land cover, and distances to rail stations.
- **Model 2**: All the predictors of Model 1, plus distances to highways.
- **Model 3**: All the predictors of Model 1, plus distances to park-and-ride facilities.
- **Model 4**: All the predictors above.

To check that the model generalizes, we tested it on each county using a model trained on all the other counties (only for counties with more than 5 cells of urban growth from 2010 to 2020). The metric shown below is the AUC (area under the ROC curve), which summarizes how well the model keeps two types of errors low at the same time:

- **Type 1 Error: Over Predicted**: Falsely predicting a cell to be developed when it is not.
- **Type 2 Error: Under Predicted**: Falsely predicting a cell to be not developed when it is.

The higher the AUC, the better the model balances the two types of errors. An AUC of 50% means the model is no better than random guessing, whereas an AUC of 100% means the model is perfect.
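One way to build intuition for the AUC is that it equals the share of (developed, not developed) cell pairs in which the developed cell receives the higher score, with ties counted as half. The sketch below computes it that way in plain Python. The helper `pairwise_auc` and the six scores are made up for illustration; the notebook itself computes the AUC with scikit-learn.

```python
def pairwise_auc(observed, scores):
    """AUC as the share of (developed, not developed) pairs ranked correctly."""
    positives = [s for o, s in zip(observed, scores) if o]
    negatives = [s for o, s in zip(observed, scores) if not o]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one cell of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # developed cell ranked above the undeveloped one
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for six cells
observed = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
example_auc = pairwise_auc(observed, scores)
print(f"AUC = {example_auc:.3f}")
```

In this toy case, 8 of the 9 pairs are ranked correctly. A model that gave every cell the same score would get 0.5, which is the "random guessing" baseline.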

In [None]:
def get_auc(observed, predicted):
    fpr, tpr, _ = metrics.roc_curve(observed, predicted)
    return metrics.auc(fpr, tpr)

def auc_from_split(pipe, train, test):
    pipe.fit(train, train["developed"])
    test_observed = test["developed"]
    test_predicted = pipe.predict_proba(test)[:, 1]
    return get_auc(test_observed, test_predicted)

development_by_county = past.groupby("county")["developed"].sum()
counties_to_test = development_by_county[development_by_county > 5].index.tolist()

generalizability_by_county = pd.DataFrame()

for model_name, model in zip(
    ["Model 1", "Model 2", "Model 3", "Model 4"],
    [baseline_model, use_highways_model, use_facilities_model, use_highways_facilities_model],
):
    for county in counties_to_test:
        train = past.query("county != @county")
        test = past.query("county == @county")
        generalizability_by_county.loc[county, model_name] = auc_from_split(
            model, train, test
        )

In [41]:
generalizability_by_county["county"] = generalizability_by_county.index
generalizabilities = pd.melt(
    generalizability_by_county,
    id_vars=["county"],
    var_name="model",
    value_name="auc",
)
from assets.colors import (
    palette_hero,
    palette_primary,
    palette_green,
    palette_highlight
)

alt.Chart(generalizabilities).mark_bar().encode(
    x=alt.X("model:N", title="Model"),
    y=alt.Y("auc:Q", title="AUC"),
    color=alt.Color(
        "model:N",
        title="Model",
        scale=alt.Scale(range=[palette_hero, palette_primary, palette_green, palette_highlight]),
    ),
    column=alt.Column("county:N", title="County"),
).properties(
    width=100,
    height=200,
    title="Generalizability of the models by county",
)

We can see that the models perform well and generalize reasonably: urban growth in each county is generally well predicted by models trained on the other counties. The models generalize less well for Denver County, which differs from the other counties in that it is the center of the MSA and was already largely urbanized by 2010. Nevertheless, the AUCs are still above 80% for all the models.

The best-performing model also differs from county to county. For example, Model 2 is the best for Jefferson County, whereas Model 4 is the best for Adams, Arapahoe, Denver, and Douglas counties.

Next, let's include "County" itself as a predictor, train the model on 70% of the cells from all the counties, and see how it performs on the remaining 30%.

In [None]:
comprehensive_model = make_classifier(
    by_county=True, use_facilities=True, use_highways=True
)
train, test = train_test_split(
    past, stratify=past["developed"], random_state=42,
    test_size=0.3
)
comprehensive_model.fit(train, train["developed"])

observed = test["developed"]
predicted = comprehensive_model.predict_proba(test)[:, 1]

In [77]:
fpr, tpr, thresholds = roc_curve(observed, predicted)
roc_auc = auc(fpr, tpr)

left = alt.Chart(pd.DataFrame({"FPR": fpr, "TPR": tpr})).mark_line().encode(
    x=alt.X("FPR", title="False Positive Rate"),
    y=alt.Y("TPR", title="True Positive Rate"),
    color=alt.value(palette_hero),
).properties(
    title=f"ROC Curve (AUC = {roc_auc:.2f})",
    width=200,
    height=200,
)

metric_by_threshold = pd.DataFrame({
    "threshold": thresholds,
    "sensitivity": tpr,
    "specificity": 1 - fpr,
}).melt(
    id_vars="threshold",
    value_vars=["sensitivity", "specificity"],
    var_name="metric",
).query("0 <= threshold <= 1").copy()

right = alt.Chart(metric_by_threshold).mark_line().encode(
    x=alt.X("threshold", title="Threshold", scale=alt.Scale(domain=[0, 1])),
    y=alt.Y("value", title="Value"),
    color=alt.Color(
        "metric", title="Metric",
        scale=alt.Scale(range=[palette_hero, palette_primary]),
    ),
).properties(
    title="Sensitivity and specificity by threshold",
    width=200,
    height=200,
)

left | right

On the left, the ROC curve reflects the trade-off between the two types of errors mentioned above. The closer the curve is to the top-left corner, the better the model is able to keep both errors minimal.

The outcome of the model is a "score" for the likelihood of being developed: the higher the score, the more likely a cell is to be developed. The "cut-off" threshold is key, as it determines the trade-off between the two types of errors. The higher the threshold, the fewer cells are predicted as developed, which reduces over-prediction but increases under-prediction. The lower the threshold, the more cells are predicted as developed, with the opposite effect.

The plot on the right shows how the sensitivity (avoidance of under-predicting) and specificity (avoidance of over-predicting) change as we increase the threshold. We can see that when the threshold is around 0.75, the model is able to keep both types of errors minimal.
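The relationship between the threshold and the two metrics can be sketched with a few lines of plain Python. A cell is classified as developed when its score exceeds the threshold, matching the `>` comparison used later in this notebook. The function name and the eight toy scores are assumptions for illustration only, not values from the Denver data.

```python
def sensitivity_specificity(observed, scores, threshold):
    """Classify a cell as developed when its score exceeds the threshold."""
    tp = fn = tn = fp = 0
    for is_developed, score in zip(observed, scores):
        predicted = score > threshold
        if is_developed and predicted:
            tp += 1  # correctly predicted development
        elif is_developed:
            fn += 1  # missed development (under-prediction)
        elif predicted:
            fp += 1  # over-prediction
        else:
            tn += 1  # correctly predicted no development
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores: 3 developed cells and 5 undeveloped cells
observed = [True, True, True, False, False, False, False, False]
scores = [0.95, 0.85, 0.60, 0.80, 0.55, 0.40, 0.20, 0.10]

rows = [(t, *sensitivity_specificity(observed, scores, t)) for t in (0.25, 0.5, 0.75, 0.9)]
for t, sens, spec in rows:
    print(f"threshold={t:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

As the threshold rises, sensitivity can only fall or stay flat while specificity can only rise or stay flat, which is why the two curves cross at some intermediate threshold.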

Next, let's look at how the model performs spatially.

In [None]:
# Fit the model on the entire dataset
comprehensive_model.fit(past, past["developed"])
past["predicted_proba"] = comprehensive_model.predict_proba(past)[:, 1]

threshold = 0.82
past["outcome"] = "No Development"
past.loc[
    (past["developed"] == True) & (past["predicted_proba"] > threshold),
    "outcome",
] = "Correctly Predicted to Be Developed"
past.loc[
    (past["developed"] == False) & (past["predicted_proba"] > threshold),
    "outcome",
] = "Overly Predicted to Be Developed"
past.loc[
    (past["developed"] == True) & (past["predicted_proba"] <= threshold),
    "outcome",
] = "Missed Development"

In [87]:
def altair_fishnet(fishnet, column, color_dict, legend_title, title):
    chart = alt.Chart(
        fishnet.to_crs(4326)[["geometry", column]]
    ).mark_geoshape().encode(
        color=alt.Color(
            f"{column}:N",
            title=legend_title,
            scale=alt.Scale(
                domain=list(color_dict.keys()), range=list(color_dict.values())
            ),
        )
    ).properties(
        width=400, height=330, title=title
    )
    return chart

color_dict = {
    "No Development": "#eeeeee",
    "Correctly Predicted to Be Developed": palette_hero,
    "Overly Predicted to Be Developed": palette_green,
    "Missed Development": palette_primary,
    "Already Developed in 2010": "#cccccc"
}

# Join with already developed areas in 2010
developed_in_10 = (
    fishnet[list(existing_vars.keys()) + ["geometry"]]
    .query("developed_2009 == True")
    .copy()
    .rename(columns=existing_vars)
)

developed_in_10["outcome"] = "Already Developed in 2010"
developed_in_10["predicted_proba"] = 1

alt.data_transformers.disable_max_rows()

altair_fishnet(
    fishnet=pd.concat([past, developed_in_10]),
    column="outcome",
    color_dict=color_dict,
    legend_title="Outcome",
    title="Model outcome by location",
)

The model captured most of the new development. However, as the Denver MSA has seen very modest urban growth, the model tends to over-predict urban growth.

Nevertheless, the over-predicted cells are still close to the actual urban growth, and are therefore still indicators of where urban growth happens. In the next section, we use this model to predict urban growth from 2020 to 2030.
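One possible way to put numbers on "captured most" and "over-predicts" is to tabulate the outcome labels. The sketch below does this with the standard library on ten made-up labels that reuse the category names from the map; the counts are purely hypothetical and are not results from this analysis.

```python
from collections import Counter

# Hypothetical outcome labels for ten cells (not real results)
outcomes = (
    ["Correctly Predicted to Be Developed"] * 3
    + ["Missed Development"] * 1
    + ["Overly Predicted to Be Developed"] * 2
    + ["No Development"] * 4
)
counts = Counter(outcomes)

hits = counts["Correctly Predicted to Be Developed"]
missed = counts["Missed Development"]
over = counts["Overly Predicted to Be Developed"]

# Share of actual new development that the model caught
share_caught = hits / (hits + missed)
# Share of predicted development that did not materialize
share_over = over / (hits + over)

print(f"caught: {share_caught:.0%}, over-predicted: {share_over:.0%}")
```

On the real data, the same tabulation could be obtained from the `outcome` column built above, for example with `past["outcome"].value_counts()`.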

In [None]:
from joblib import dump

dump(comprehensive_model, "../data/model.joblib")