<a href="https://colab.research.google.com/github/EmilyHong77/gentrification_in_montreal/blob/main/notebooks/ding_measures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Information**<br>
This notebook applies the Ding gentrification methodology to Montréal CMA census tracts (2001-2021). It constructs mapping and modeling datasets, identifies gentrifiable and gentrified tracts using CMA-level weighted benchmarks, and classifies gentrification intensity based on housing and rent change quantiles. The resulting indicators support visualization, temporal comparison, and machine-learning analysis. <br>

**Ding Measures Source**:<br>
https://www150.statcan.gc.ca/

**CMA Source**:<br>
https://www12.statcan.gc.ca/






**Mount Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Libraries**

In [None]:
import numpy as np
import pandas as pd

# Merge Datasets
1. Mapping Dataset
2. XGBoost Dataset

**Mapping Dataset**



**Description**:<br>
This dataset contains all 1004 tracts, including the 61 tracts with missing values.
The full dataset is kept for mapping and visualization. Geographic completeness is important for spatial analysis. Missing values do not prevent tracts from being displayed on maps.
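
As a minimal sketch of why an outer merge keeps every tract, the toy example below merges two invented per-year tables. The tract IDs and the `pop_2016` / `pop_2021` columns are hypothetical, and `functools.reduce` is just one way to chain the merges.

```python
from functools import reduce

import pandas as pd

# Toy per-year tables; tract IDs and values are invented for illustration.
toy_2016 = pd.DataFrame({"ctuid": ["A", "B"], "pop_2016": [100, 200]})
toy_2021 = pd.DataFrame({"ctuid": ["A", "B", "C"], "pop_2021": [110, 210, 50]})

# Outer-merge every table on the tract ID so that no tract is lost.
toy_mapping = reduce(
    lambda left, right: pd.merge(left, right, on="ctuid", how="outer"),
    [toy_2016, toy_2021],
)

# Tract "C" is kept; its 2016 value is NaN because it has no 2016 record.
print(toy_mapping)
```

A tract that is absent from one year still appears in the merged table, with NaN in that year's columns, which is what the map needs.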

In [None]:
# Read in the data
standardized_1996 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/4_standardized_data/1996_standardized.csv')
standardized_2001 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/4_standardized_data/2001_standardized.csv')
standardized_2006 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/4_standardized_data/2006_standardized.csv')
standardized_2011 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/4_standardized_data/2011_standardized.csv')
standardized_2016 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/4_standardized_data/2016_standardized.csv')
standardized_2021 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/2_clean_data/2021_clean.csv')

In [None]:
# GeoUID_2021 to ctuid
for df in [standardized_1996, standardized_2001, standardized_2006, standardized_2011, standardized_2016, standardized_2021]:
    df.rename(columns={'GeoUID_2021': 'ctuid'}, inplace=True)

In [None]:
# Clean 1996-2016: drop tracts with any NaN (2021 is left as-is)
standardized_1996 = standardized_1996.dropna()
standardized_2001 = standardized_2001.dropna()
standardized_2006 = standardized_2006.dropna()
standardized_2011 = standardized_2011.dropna()
standardized_2016 = standardized_2016.dropna()

print(len(standardized_1996.columns))
print(standardized_1996.head())
print(len(standardized_2001.columns))
print(standardized_2001.head())
print(len(standardized_2006.columns))
print(standardized_2006.head())
print(len(standardized_2011.columns))
print(standardized_2011.head())
print(len(standardized_2016.columns))
print(standardized_2016.head())
print(len(standardized_2021.columns))
print(standardized_2021.head())

In [None]:
# Merge csvs to a single file
mapping_df = pd.merge(standardized_1996, standardized_2001, on='ctuid', how='outer')
mapping_df = pd.merge(mapping_df, standardized_2006, on='ctuid', how='outer')
mapping_df = pd.merge(mapping_df, standardized_2011, on='ctuid', how='outer')
mapping_df = pd.merge(mapping_df, standardized_2016, on='ctuid', how='outer')
mapping_df = pd.merge(mapping_df, standardized_2021, on='ctuid', how='outer')
mapping_df.head()

In [None]:
for col in mapping_df.columns:
    print(col)

print(mapping_df.shape)

In [None]:
# Export to drive
mapping_df.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/5_ding_measurements/merged_data_mapping.csv', index=False)

**XGBoost Training Dataset** <br>

**Description:**
This dataset includes only tracts with complete information for the variables used in modeling.
The 61 NA tracts are removed to ensure:

- model stability

- proper feature alignment

- no missing-value bias

- cleaner training and validation splits
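
The filtering step can be sketched on toy data as follows. The values and the `other_column` name are invented; only `Migrants_2001` matches a column used in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ctuid": ["A", "B", "C"],
    "Migrants_2001": [10.0, np.nan, 30.0],
    "other_column": [1.0, 2.0, np.nan],  # hypothetical column, not used for filtering
})

# Only rows missing a value in the listed columns are dropped.
toy_model = toy.dropna(subset=["Migrants_2001"])
dropped_ids = sorted(set(toy["ctuid"]) - set(toy_model["ctuid"]))
print("Rows dropped:", len(toy) - len(toy_model), dropped_ids)
```

Tract C survives even though `other_column` is missing, so NaNs outside the `subset` columns can still reach later steps.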

In [None]:
# Columns to check for NaN tracts
cols_2001 = [
    "Non-migrants_2001",
    "Migrants_2001",
    "Internal migrants_2001",
    "External migrants_2001"
]

# Drop NaN tracts
ding_df = mapping_df.dropna(subset=cols_2001)

# Summary
print("Original rows:", len(mapping_df))
print("Cleaned rows:", len(ding_df))
print("Rows dropped:", len(mapping_df) - len(ding_df))

In [None]:
# Export to drive
ding_df.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/5_ding_measurements/merged_data_ding.csv', index=False)

# Ding Measurements

1. Gentrifiable Measure
2. Gentrified Measure
3. Gentrification Levels

## Gentrifiable Measure

**Formula**:<br>
median household income < median of CMA
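
A minimal sketch of the same comparison written as a loop over census years. The tract incomes are invented; the two CMA medians are the 2016 and 2021 constants used in the cell below.

```python
import pandas as pd

# CMA median household income by census year (two of the notebook's constants).
cma_median_income = {2016: 61790, 2021: 63600}

# Invented tract incomes for illustration.
toy = pd.DataFrame({
    "ctuid": ["A", "B"],
    "Median household income ($)_2016": [50000, 70000],
    "Median household income ($)_2021": [64000, 60000],
})

# A tract is gentrifiable in a year when its median income is below the CMA median.
for year, cma_median in cma_median_income.items():
    toy[f"Gentrifiable Ding {year}"] = (
        toy[f"Median household income ($)_{year}"] < cma_median
    )

print(toy[["ctuid", "Gentrifiable Ding 2016", "Gentrifiable Ding 2021"]])
```

A missing income compares as `False`, so a tract without an income value is flagged as not gentrifiable rather than as missing.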


In [None]:
# Median income for Montreal for each year
MEDIAN_INCOME_CMA_1996 = 43856
MEDIAN_INCOME_CMA_2001 = 47267
MEDIAN_INCOME_CMA_2006 = 47979
MEDIAN_INCOME_CMA_2011 = 53024
MEDIAN_INCOME_CMA_2016 = 61790
MEDIAN_INCOME_CMA_2021 = 63600

# Create gentrifiable df from ding_df
ding_gentrifiable = ding_df.copy()

# Median income for each tract
median_household_income_1996 = ding_df["Median household income ($)_1996"]
median_household_income_2001 = ding_df["Median household income ($)_2001"]
median_household_income_2006 = ding_df["Median household income ($)_2006"]
median_household_income_2011 = ding_df["Median household income ($)_2011"]
median_household_income_2016 = ding_df["Median household income ($)_2016"]
median_household_income_2021 = ding_df["Median household income ($)_2021"]

# Calculate gentrifiable measure
ding_gentrifiable['Gentrifiable Ding 1996'] = median_household_income_1996 < MEDIAN_INCOME_CMA_1996
ding_gentrifiable['Gentrifiable Ding 2001'] = median_household_income_2001 < MEDIAN_INCOME_CMA_2001
ding_gentrifiable['Gentrifiable Ding 2006'] = median_household_income_2006 < MEDIAN_INCOME_CMA_2006
ding_gentrifiable['Gentrifiable Ding 2011'] = median_household_income_2011 < MEDIAN_INCOME_CMA_2011
ding_gentrifiable['Gentrifiable Ding 2016'] = median_household_income_2016 < MEDIAN_INCOME_CMA_2016
ding_gentrifiable['Gentrifiable Ding 2021'] = median_household_income_2021 < MEDIAN_INCOME_CMA_2021

In [None]:
# Export to drive
ding_gentrifiable.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/5_ding_measurements/gentrifiable_measure.csv', index=False)

## Gentrified Measure

**Formula**<br>
gentrifiable at the start of the period, and <br>
(increase in housing value > CMA weighted median, or increase in rent > CMA weighted median), and <br>
increase in share of residents with a university degree > CMA weighted median
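
As a sketch of how the code cells in this section combine the conditions: a tract must be gentrifiable at the start of the period, must beat the CMA benchmark on housing value or rent, and must also beat it on education. The data, column names, and benchmark values below are all invented.

```python
import pandas as pd

# Invented tract-level changes for one census period.
toy = pd.DataFrame({
    "gentrifiable_start": [True, True, False, True],
    "home_value_change": [30.0, 5.0, 40.0, 25.0],  # percent
    "rent_change": [2.0, 12.0, 15.0, 3.0],         # percent
    "educ_change": [4.0, 3.0, 6.0, 0.5],           # percentage points
})
cma_home, cma_rent, cma_educ = 20.0, 10.0, 2.0     # hypothetical CMA benchmarks

toy["gentrified"] = (
    toy["gentrifiable_start"]
    & ((toy["home_value_change"] > cma_home) | (toy["rent_change"] > cma_rent))
    & (toy["educ_change"] > cma_educ)
)
print(toy["gentrified"].tolist())
```

The third tract fails because it was not gentrifiable at the start, and the fourth because its education gain is below the benchmark.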

**Population-weighted median helper (CMA benchmark)**

In [None]:
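def weighted_quantile(values, weights, quantiles):
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Ignore pairs with a missing or infinite value or weight; a single NaN
    # would otherwise propagate through the cumulative sum below
    valid = np.isfinite(values) & np.isfinite(weights)
    values = values[valid]
    weights = weights[valid]

    # Sort by values
    sorter = np.argsort(values)
    values = values[sorter]
    weights = weights[sorter]

    # Cumulative weight share, evaluated at the midpoint of each observation's weight
    cum_weights = np.cumsum(weights) - 0.5 * weights
    cum_weights /= np.sum(weights)

    return np.interp(quantiles, cum_weights, values)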

def weighted_median(values, weights):
    return weighted_quantile(values, weights, 0.5)
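
A small worked example may help show what the helper returns. The function below is a compact restatement of the same midpoint-interpolation approach, and the values and weights are invented.

```python
import numpy as np

def weighted_quantile(values, weights, quantiles):
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    sorter = np.argsort(values)
    values, weights = values[sorter], weights[sorter]
    # Cumulative weight share at the midpoint of each observation's weight
    cum_weights = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)
    return np.interp(quantiles, cum_weights, values)

changes = [10.0, 20.0, 30.0]  # invented tract-level percent changes

# Equal weights: the weighted median equals the ordinary median.
unweighted = weighted_quantile(changes, [1, 1, 1], 0.5)

# Doubling the weight of the largest tract pulls the median upward.
weighted = weighted_quantile(changes, [1, 1, 2], 0.5)

print(unweighted, weighted)
```

With weights `[1, 1, 2]` the cumulative shares are 0.125, 0.375 and 0.75, so the 0.5 quantile is interpolated between 20 and 30, giving about 23.3.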

**Median Housing Value**

In [None]:
# Owner households for each tract (weights for the CMA benchmark)
owner_households_2001 = ding_df["Owner households_2001"]
owner_households_2006 = ding_df["Owner households_2006"]
owner_households_2011 = ding_df["Owner households_2011"]
owner_households_2016 = ding_df["Owner households_2016"]
owner_households_2021 = ding_df["Owner households_2021"]

# Avg home value for each tract
avg_home_value_1996 = ding_df["Average value dwelling ($)_1996"]
avg_home_value_2001 = ding_df["Average value dwelling ($)_2001"]
avg_home_value_2006 = ding_df["Average value dwelling ($)_2006"]
avg_home_value_2011 = ding_df["Average value dwelling ($)_2011"]
avg_home_value_2016 = ding_df["Average value dwelling ($)_2016"]
avg_home_value_2021 = ding_df["Average value dwelling ($)_2021"]

# Compute tract-level percentage change in average home value between census periods
avg_home_value_change_1996_2001 = (avg_home_value_2001 - avg_home_value_1996) / avg_home_value_1996 * 100
avg_home_value_change_2001_2006 = (avg_home_value_2006 - avg_home_value_2001) / avg_home_value_2001 * 100
avg_home_value_change_2006_2011 = (avg_home_value_2011 - avg_home_value_2006) / avg_home_value_2006 * 100
avg_home_value_change_2011_2016 = (avg_home_value_2016 - avg_home_value_2011) / avg_home_value_2011 * 100
avg_home_value_change_2016_2021 = (avg_home_value_2021 - avg_home_value_2016) / avg_home_value_2016 * 100

# Calculate the weighted median using the weighted_median function
cma_median_dwelling_value_change_2001 = weighted_median(avg_home_value_change_1996_2001, owner_households_2001)
cma_median_dwelling_value_change_2006 = weighted_median(avg_home_value_change_2001_2006, owner_households_2006)
cma_median_dwelling_value_change_2011 = weighted_median(avg_home_value_change_2006_2011, owner_households_2011)
cma_median_dwelling_value_change_2016 = weighted_median(avg_home_value_change_2011_2016, owner_households_2016)
cma_median_dwelling_value_change_2021 = weighted_median(avg_home_value_change_2016_2021, owner_households_2021)
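
The percent-change expressions above divide by the start-of-period value, so a base of zero would produce an infinite change. One possible guard is sketched below; `pct_change` is a hypothetical helper name and the dwelling values are invented.

```python
import pandas as pd

def pct_change(new, old):
    """Percent change from old to new; NaN where the base is zero or missing."""
    old = old.where(old != 0)  # zero base becomes NaN instead of +/-inf
    return (new - old) / old * 100

old = pd.Series([100000.0, 0.0, 200000.0])
new = pd.Series([150000.0, 90000.0, 180000.0])
change = pct_change(new, old)
print(change.tolist())
```

Whether a zero base should become NaN or be handled some other way is a modeling decision; this sketch only shows one option.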

**Median Rent**

In [None]:
# Renter households
renter_households_2001 = ding_df["Renter households_2001"]
renter_households_2006 = ding_df["Renter households_2006"]
renter_households_2011 = ding_df["Renter households_2011"]
renter_households_2016 = ding_df["Renter households_2016"]
renter_households_2021 = ding_df["Renter households_2021"]

# Average gross rent per tract (the source column is an average, despite the median_rent_* variable names)
median_rent_1996 = ding_df["Average gross rent ($)_1996"]
median_rent_2001 = ding_df["Average gross rent ($)_2001"]
median_rent_2006 = ding_df["Average gross rent ($)_2006"]
median_rent_2011 = ding_df["Average gross rent ($)_2011"]
median_rent_2016 = ding_df["Average gross rent ($)_2016"]
median_rent_2021 = ding_df["Average gross rent ($)_2021"]

# Compute tract-level percentage change in rent between census periods
rent_change_1996_2001 = (median_rent_2001 - median_rent_1996) / median_rent_1996 * 100
rent_change_2001_2006 = (median_rent_2006 - median_rent_2001) / median_rent_2001 * 100
rent_change_2006_2011 = (median_rent_2011 - median_rent_2006) / median_rent_2006 * 100
rent_change_2011_2016 = (median_rent_2016 - median_rent_2011) / median_rent_2011 * 100
rent_change_2016_2021 = (median_rent_2021 - median_rent_2016) / median_rent_2016 * 100

# Calculate the weighted median using the weighted_median function
cma_median_rent_change_2001 = weighted_median(rent_change_1996_2001, renter_households_2001)
cma_median_rent_change_2006 = weighted_median(rent_change_2001_2006, renter_households_2006)
cma_median_rent_change_2011 = weighted_median(rent_change_2006_2011, renter_households_2011)
cma_median_rent_change_2016 = weighted_median(rent_change_2011_2016, renter_households_2016)
cma_median_rent_change_2021 = weighted_median(rent_change_2016_2021, renter_households_2021)

**Percent Higher Education Change**

In [None]:
# Education base population
educ_base_pop_1996 = ding_df["Education base_1996"]
educ_base_pop_2001 = ding_df["Education base_2001"]
educ_base_pop_2006 = ding_df["Education base_2006"]
educ_base_pop_2011 = ding_df["Education base_2011"]
educ_base_pop_2016 = ding_df["Education base_2016"]
educ_base_pop_2021 = ding_df["Education base_2021"]

# Percentage of adults with bachelor's degree or higher
pct_bachelors_plus_1996 = ding_df["Bachelors degree or higher_1996"] / educ_base_pop_1996 * 100
pct_bachelors_plus_2001 = ding_df["Bachelors degree or higher_2001"] / educ_base_pop_2001 * 100
pct_bachelors_plus_2006 = ding_df["Bachelors degree or higher_2006"] / educ_base_pop_2006 * 100
pct_bachelors_plus_2011 = ding_df["Bachelors degree or higher_2011"] / educ_base_pop_2011 * 100
pct_bachelors_plus_2016 = ding_df["Bachelors degree or higher_2016"] / educ_base_pop_2016 * 100
pct_bachelors_plus_2021 = ding_df["Bachelors degree or higher_2021"] / educ_base_pop_2021 * 100

# Compute tract-level change in bachelor's degree attainment between census periods (percentage points)
educ_change_1996_2001 = pct_bachelors_plus_2001 - pct_bachelors_plus_1996
educ_change_2001_2006 = pct_bachelors_plus_2006 - pct_bachelors_plus_2001
educ_change_2006_2011 = pct_bachelors_plus_2011 - pct_bachelors_plus_2006
educ_change_2011_2016 = pct_bachelors_plus_2016 - pct_bachelors_plus_2011
educ_change_2016_2021 = pct_bachelors_plus_2021 - pct_bachelors_plus_2016

# Calculate the weighted median using the weighted_median function
cma_median_educ_change_2001 = weighted_median(educ_change_1996_2001, educ_base_pop_2001)
cma_median_educ_change_2006 = weighted_median(educ_change_2001_2006, educ_base_pop_2006)
cma_median_educ_change_2011 = weighted_median(educ_change_2006_2011, educ_base_pop_2011)
cma_median_educ_change_2016 = weighted_median(educ_change_2011_2016, educ_base_pop_2016)
cma_median_educ_change_2021 = weighted_median(educ_change_2016_2021, educ_base_pop_2021)

In [None]:
ding_gentrified = ding_gentrifiable.copy()

# Calculate Gentrified Measure
ding_gentrified["Gentrified Ding_2001"] = (
    ding_gentrified["Gentrifiable Ding 1996"]
    & (
        (avg_home_value_change_1996_2001 > cma_median_dwelling_value_change_2001)
        | (rent_change_1996_2001 > cma_median_rent_change_2001)
    )
    & (educ_change_1996_2001 > cma_median_educ_change_2001)
)

ding_gentrified["Gentrified Ding_2006"] = (
    ding_gentrified["Gentrifiable Ding 2001"]
    & (
        (avg_home_value_change_2001_2006 > cma_median_dwelling_value_change_2006)
        | (rent_change_2001_2006 > cma_median_rent_change_2006)
    )
    & (educ_change_2001_2006 > cma_median_educ_change_2006)
)

ding_gentrified["Gentrified Ding_2011"] = (
    ding_gentrified["Gentrifiable Ding 2006"]
    & (
        (avg_home_value_change_2006_2011 > cma_median_dwelling_value_change_2011)
        | (rent_change_2006_2011 > cma_median_rent_change_2011)
    )
    & (educ_change_2006_2011 > cma_median_educ_change_2011)
)

ding_gentrified["Gentrified Ding_2016"] = (
    ding_gentrified["Gentrifiable Ding 2011"]
    & (
        (avg_home_value_change_2011_2016 > cma_median_dwelling_value_change_2016)
        | (rent_change_2011_2016 > cma_median_rent_change_2016)
    )
    & (educ_change_2011_2016 > cma_median_educ_change_2016)
)

ding_gentrified["Gentrified Ding_2021"] = (
    ding_gentrified["Gentrifiable Ding 2016"]
    & (
        (avg_home_value_change_2016_2021 > cma_median_dwelling_value_change_2021)
        | (rent_change_2016_2021 > cma_median_rent_change_2021)
    )
    & (educ_change_2016_2021 > cma_median_educ_change_2021)
)

In [None]:
# Export to drive
ding_gentrified.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/5_ding_measurements/gentrified_measure.csv', index=False)

## Gentrification Levels

**Formula**<br>
Weak: housing value and rent increases both < 25th percentile<br>
Moderate: all other gentrified tracts (neither weak nor intense)<br>
Intense: housing value or rent increase > 75th percentile
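
The three levels can be sketched on toy data as follows. The tract values are invented, and the cut-offs here are unweighted quartiles taken over the gentrified tracts only. That choice is an assumption made for this illustration; the notebook's cells use household-weighted quartiles over all tracts.

```python
import numpy as np
import pandas as pd

# Invented tract-level changes (percent) for one census period.
toy = pd.DataFrame({
    "gentrified": [True, True, True, True, False],
    "home_value_change": [10.0, 20.0, 30.0, 40.0, 50.0],
    "rent_change": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Assumption: unweighted quartiles over gentrified tracts only.
g = toy[toy["gentrified"]]
home_q25, home_q75 = np.quantile(g["home_value_change"], [0.25, 0.75])
rent_q25, rent_q75 = np.quantile(g["rent_change"], [0.25, 0.75])

weak = (
    toy["gentrified"]
    & (toy["home_value_change"] < home_q25)
    & (toy["rent_change"] < rent_q25)
)
intense = toy["gentrified"] & (
    (toy["home_value_change"] > home_q75) | (toy["rent_change"] > rent_q75)
)
moderate = toy["gentrified"] & ~weak & ~intense

toy["level"] = np.select(
    [intense, moderate, weak], ["intense", "moderate", "weak"], default="none"
)
print(toy["level"].tolist())
```

Non-gentrified tracts receive no level, and every gentrified tract falls into exactly one of the three groups.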


In [None]:
# Housing quartiles
housing_q25_2001, housing_q75_2001 = weighted_quantile(
    avg_home_value_change_1996_2001, owner_households_2001, [0.25, 0.75]
)
housing_q25_2006, housing_q75_2006 = weighted_quantile(
    avg_home_value_change_2001_2006, owner_households_2006, [0.25, 0.75]
)
housing_q25_2011, housing_q75_2011 = weighted_quantile(
    avg_home_value_change_2006_2011, owner_households_2011, [0.25, 0.75]
)
housing_q25_2016, housing_q75_2016 = weighted_quantile(
    avg_home_value_change_2011_2016, owner_households_2016, [0.25, 0.75]
)
housing_q25_2021, housing_q75_2021 = weighted_quantile(
    avg_home_value_change_2016_2021, owner_households_2021, [0.25, 0.75]
)

# Rent quartiles
rent_q25_2001, rent_q75_2001 = weighted_quantile(
    rent_change_1996_2001, renter_households_2001, [0.25, 0.75]
)
rent_q25_2006, rent_q75_2006 = weighted_quantile(
    rent_change_2001_2006, renter_households_2006, [0.25, 0.75]
)
rent_q25_2011, rent_q75_2011 = weighted_quantile(
    rent_change_2006_2011, renter_households_2011, [0.25, 0.75]
)
rent_q25_2016, rent_q75_2016 = weighted_quantile(
    rent_change_2011_2016, renter_households_2016, [0.25, 0.75]
)
rent_q25_2021, rent_q75_2021 = weighted_quantile(
    rent_change_2016_2021, renter_households_2021, [0.25, 0.75]
)

**Weak | Intense | Moderate (Boolean)**

In [None]:
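# Weak: gentrified tracts whose housing value AND rent changes are both below the 25th percentile.
# Caveat: a gentrified tract already has one of these changes above the CMA weighted median,
# which is never below the 25th percentile of the same distribution, so this mask is expected to be all False.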
ding_gentrified["Weak Gentrified_2001"] = (
    ding_gentrified["Gentrified Ding_2001"]
    & (avg_home_value_change_1996_2001 < housing_q25_2001)
    & (rent_change_1996_2001 < rent_q25_2001)
)

ding_gentrified["Weak Gentrified_2006"] = (
    ding_gentrified["Gentrified Ding_2006"]
    & (avg_home_value_change_2001_2006 < housing_q25_2006)
    & (rent_change_2001_2006 < rent_q25_2006)
)

ding_gentrified["Weak Gentrified_2011"] = (
    ding_gentrified["Gentrified Ding_2011"]
    & (avg_home_value_change_2006_2011 < housing_q25_2011)
    & (rent_change_2006_2011 < rent_q25_2011)
)

ding_gentrified["Weak Gentrified_2016"] = (
    ding_gentrified["Gentrified Ding_2016"]
    & (avg_home_value_change_2011_2016 < housing_q25_2016)
    & (rent_change_2011_2016 < rent_q25_2016)
)

ding_gentrified["Weak Gentrified_2021"] = (
    ding_gentrified["Gentrified Ding_2021"]
    & (avg_home_value_change_2016_2021 < housing_q25_2021)
    & (rent_change_2016_2021 < rent_q25_2021)
)

In [None]:
# Intense
ding_gentrified["Intense Gentrified_2001"] = (
    ding_gentrified["Gentrified Ding_2001"]
    & (
        (avg_home_value_change_1996_2001 > housing_q75_2001)
        | (rent_change_1996_2001 > rent_q75_2001)
    )
)

ding_gentrified["Intense Gentrified_2006"] = (
    ding_gentrified["Gentrified Ding_2006"]
    & (
        (avg_home_value_change_2001_2006 > housing_q75_2006)
        | (rent_change_2001_2006 > rent_q75_2006)
    )
)

ding_gentrified["Intense Gentrified_2011"] = (
    ding_gentrified["Gentrified Ding_2011"]
    & (
        (avg_home_value_change_2006_2011 > housing_q75_2011)
        | (rent_change_2006_2011 > rent_q75_2011)
    )
)

ding_gentrified["Intense Gentrified_2016"] = (
    ding_gentrified["Gentrified Ding_2016"]
    & (
        (avg_home_value_change_2011_2016 > housing_q75_2016)
        | (rent_change_2011_2016 > rent_q75_2016)
    )
)

ding_gentrified["Intense Gentrified_2021"] = (
    ding_gentrified["Gentrified Ding_2021"]
    & (
        (avg_home_value_change_2016_2021 > housing_q75_2021)
        | (rent_change_2016_2021 > rent_q75_2021)
    )
)

In [None]:
# Moderate
for year in [2001, 2006, 2011, 2016, 2021]:
    ding_gentrified[f"Moderate Gentrified_{year}"] = (
        ding_gentrified[f"Gentrified Ding_{year}"]
        & ~ding_gentrified[f"Weak Gentrified_{year}"]
        & ~ding_gentrified[f"Intense Gentrified_{year}"]
    )

**Gentrification intensity to ordinal index (0-3)**

In [None]:
# Assign numeric level indicators
def get_gentrification_level(row, year):
    if row[f"Intense Gentrified_{year}"]:
        return 3
    elif row[f"Moderate Gentrified_{year}"]:
        return 2
    elif row[f"Weak Gentrified_{year}"]:
        return 1
    else:
        return 0

In [None]:
for year in [2001, 2006, 2011, 2016, 2021]:
    ding_gentrified[f"Gentrification Level Ding_{year}"] = (
        ding_gentrified.apply(get_gentrification_level, axis=1, year=year)
    )
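
For larger tables, a vectorized alternative to the row-wise `apply` is `np.select`. The sketch below uses invented boolean columns for a single year; it is one possible way to build the index, not a required change.

```python
import numpy as np
import pandas as pd

year = 2021
toy = pd.DataFrame({
    f"Weak Gentrified_{year}": [True, False, False, False],
    f"Moderate Gentrified_{year}": [False, True, False, False],
    f"Intense Gentrified_{year}": [False, False, True, False],
})

# Conditions are checked in order and the first match wins, as in an if/elif chain.
toy[f"Gentrification Level Ding_{year}"] = np.select(
    [
        toy[f"Intense Gentrified_{year}"],
        toy[f"Moderate Gentrified_{year}"],
        toy[f"Weak Gentrified_{year}"],
    ],
    [3, 2, 1],
    default=0,
)
print(toy[f"Gentrification Level Ding_{year}"].tolist())
```

Rows that match none of the three conditions fall through to the default level of 0.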

**Drop Intermediate Columns**

In [None]:
cols_to_drop = []
for year in [2001, 2006, 2011, 2016, 2021]:
    cols_to_drop.extend([
        f"Weak Gentrified_{year}",
        f"Moderate Gentrified_{year}",
        f"Intense Gentrified_{year}",
    ])

ding_gentrified.drop(columns=cols_to_drop, inplace=True)

ding_gentrification_level = ding_gentrified.copy()

In [None]:
# Export to drive
ding_gentrification_level.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/5_ding_measurements/gentrification_level.csv', index=False)

In [None]:
# Print all column names
print("Column names:")
for col in ding_gentrification_level.columns:
    print(col)

# Main Ding Dataset 2001 - 2021

In [None]:
# Drop 1996 variables + Area (sq km)_2021
ding_gentrification_level.drop(columns=['Population_1996',
                                        'Education base_1996',
                                        'Bachelors degree or higher_1996',
                                        'Median household income ($)_1996',
                                        'Average gross rent ($)_1996',
                                        'Average value dwelling ($)_1996',
                                        'Gentrifiable Ding 1996',
                                        'Area (sq km)_2021'], inplace=True)

In [None]:
main_ding_dataset = ding_gentrification_level.copy()

# Export to drive
main_ding_dataset.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/5_ding_measurements/main_ding_dataset.csv', index=False)