<a href="https://colab.research.google.com/github/EmilyHong77/gentrification_in_montreal/blob/main/notebooks/ding_measures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Information**<br>
Ding Measures Source:<br>
CMA Source:<br>
- Statistics Canada: https://www12.statcan.gc.ca/






**Mount Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Libraries**

In [None]:
import numpy as np
import pandas as pd

# Merge Datasets
1. Mapping Dataset
2. XGBoost Dataset

**Mapping Dataset**



**Description**:<br>
This dataset contains all 1003 tracts, including the 61 tracts with missing values.
The full dataset is kept for mapping and visualization because geographic completeness is important for spatial analysis. Missing values do not prevent tracts from being displayed on maps.
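
The tract counts quoted above can be checked directly once a frame is loaded. A minimal sketch on a toy frame; the tract IDs and values are made up for illustration and are not taken from the project files:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a census frame (hypothetical tract IDs and values).
toy = pd.DataFrame({
    "ctuid": ["4620001.00", "4620002.00", "4620003.00"],
    "Median household income ($)_2001": [41000.0, np.nan, 52000.0],
    "Migrants_2001": [120.0, 80.0, np.nan],
})

# A tract counts as incomplete if any of its columns is missing.
incomplete = toy.isna().any(axis=1)

print("Total tracts:", len(toy))
print("Tracts with missing values:", int(incomplete.sum()))
print(toy.loc[incomplete, "ctuid"].tolist())
```

Running the same three lines on the merged frame would show whether the 1003 and 61 figures still hold after any upstream change.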

In [None]:
# read in the data
standardized_2001 = pd.read_csv('/content/drive/MyDrive/Ai4Good_Project/GentrificAItion/montreal_data_processing/data_cleaning/4_standardized_data/2001_standardized.csv')
standardized_2006 = pd.read_csv('/content/drive/MyDrive/Ai4Good_Project/GentrificAItion/montreal_data_processing/data_cleaning/4_standardized_data/2006_standardized.csv')
standardized_2011 = pd.read_csv('/content/drive/MyDrive/Ai4Good_Project/GentrificAItion/montreal_data_processing/data_cleaning/4_standardized_data/2011_standardized.csv')
standardized_2016 = pd.read_csv('/content/drive/MyDrive/Ai4Good_Project/GentrificAItion/montreal_data_processing/data_cleaning/4_standardized_data/2016_standardized.csv')
standardized_2021 = pd.read_csv('/content/drive/MyDrive/Ai4Good_Project/GentrificAItion/montreal_data_processing/data_cleaning/2_clean_data/2021_clean.csv')

In [None]:
# GeoUID_2021 to ctuid
for df in [standardized_2001, standardized_2006, standardized_2011, standardized_2016, standardized_2021]:
    df.rename(columns={'GeoUID_2021': 'ctuid'}, inplace=True)

In [None]:
# clean 2001-2016 drop NaNs
standardized_2001 = standardized_2001.dropna()
standardized_2006 = standardized_2006.dropna()
standardized_2011 = standardized_2011.dropna()
standardized_2016 = standardized_2016.dropna()


print(len(standardized_2001.columns))
print(standardized_2001.head())
print(len(standardized_2006.columns))
print(standardized_2006.head())
print(len(standardized_2011.columns))
print(standardized_2011.head())
print(len(standardized_2016.columns))
print(standardized_2016.head())
print(len(standardized_2021.columns))
print(standardized_2021.head())


In [None]:
# merge csvs to a single file
mapping_df = pd.merge(standardized_2001, standardized_2006, on='ctuid', how='outer')
mapping_df = pd.merge(mapping_df, standardized_2011, on='ctuid', how='outer')
mapping_df = pd.merge(mapping_df, standardized_2016, on='ctuid', how='outer')
mapping_df = pd.merge(mapping_df, standardized_2021, on='ctuid', how='outer')
mapping_df.head()
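
Because every frame is joined on `ctuid` with an outer merge, two quick checks can catch a mis-loaded or duplicated input: a one-to-one key validation and a scan for `_x`/`_y` suffixed columns. A minimal sketch on toy frames; the column names and values are hypothetical:

```python
import pandas as pd

# Toy frames standing in for two census years (hypothetical values).
left = pd.DataFrame({"ctuid": ["A", "B"], "Population_2001": [100, 200]})
right = pd.DataFrame({"ctuid": ["B", "C"], "Population_2006": [210, 310]})

# validate="one_to_one" raises pandas.errors.MergeError if ctuid repeats in either frame.
merged = pd.merge(left, right, on="ctuid", how="outer", validate="one_to_one")

# Columns ending in _x or _y mean two inputs shared a column name,
# for example when the same file was loaded for two different years.
suffixed = [c for c in merged.columns if c.endswith(("_x", "_y"))]

print(merged.shape)
print("Suffixed columns:", suffixed)
```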

In [None]:
# export to drive
mapping_df.to_csv('/content/drive/MyDrive/Ai4Good_Project/GentrificAItion/montreal_data_processing/data_cleaning/5_ding_measurements/merged_data_mapping.csv', index=False)

**XGBoost Training Dataset** <br>

**Description:**
This dataset includes only tracts with complete information for the variables used in modeling.
The 61 tracts with missing values are removed to support:

- model stability
- proper feature alignment
- training without bias from missing values
- cleaner training and validation splits

In [None]:
# columns to check for NaN tracts
cols_2001 = [
    "Non-migrants_2001",
    "Migrants_2001",
    "Internal migrants_2001",
    "External migrants_2001"
]

# drop NaN tracts
ding_df = mapping_df.dropna(subset=cols_2001)

# summary
print("Original rows:", len(mapping_df))
print("Cleaned rows:", len(ding_df))
print("Rows dropped:", len(mapping_df) - len(ding_df))

In [None]:
# export to drive
ding_df.to_csv('/content/drive/MyDrive/Ai4Good_Project/GentrificAItion/montreal_data_processing/data_cleaning/5_ding_measurements/merged_data_ding.csv', index=False)

In [None]:
for col in ding_df.columns:
    print(col)

print(ding_df.shape)

# Ding Measurements

1. Gentrifiable Measure
2. Gentrified Measure
3. Gentrification Levels

## Gentrifiable Measure

**Formula**:<br>
A tract is gentrifiable in a given census year if its median household income is below the CMA median for that year.


In [None]:
# median income for the Montreal CMA in each census year
median_income_CMA = {
    2001: 47267,
    2006: 47979,
    2011: 53024,
    2016: 61790,
    2021: 63600,
}

# create gentrifiable df from ding_df
ding_gentrifiable = ding_df.copy()

# calculate gentrifiable measure: tract median income below the CMA median
for year, cma_median in median_income_CMA.items():
    tract_income = ding_df[f"Median household income ($)_{year}"]
    ding_gentrifiable[f"Gentrifiable Ding {year}"] = tract_income < cma_median
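
One caveat with the comparisons above: in pandas, `NaN < x` evaluates to `False`, so a tract with a missing income is labelled not gentrifiable rather than unknown. If that distinction matters, one option is to keep missing incomes as missing. A minimal sketch on toy values; the threshold is the 2001 CMA median listed in the cell above, and the incomes are made up:

```python
import numpy as np
import pandas as pd

cma_median_2001 = 47267  # 2001 CMA median used in this notebook

# Hypothetical tract incomes, including one missing value.
income = pd.Series([41000.0, np.nan, 52000.0])

# Plain comparison: the missing tract silently becomes False.
plain = income < cma_median_2001

# Nullable variant: compare, then mask missing incomes back to <NA>.
nullable = plain.astype("boolean").mask(income.isna())

print(plain.tolist())
print(nullable.tolist())
```

The nullable `boolean` dtype is written to CSV with an empty field for `<NA>`, so downstream readers would need to expect three states instead of two.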

In [None]:
# export to drive
ding_gentrifiable.to_csv('/content/drive/MyDrive/Ai4Good_Project/GentrificAItion/montreal_data_processing/data_cleaning/5_ding_measurements/gentrifiable_measure.csv', index=False)

## Gentrified Measure

**Formula**<br>
increase in university degrees > median increase for CMA <br>
increase in housing value > median increase for CMA <br>
increase in rental costs > median increase for CMA
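
The criteria above can be sketched in pandas. This is an illustration under stated assumptions, not the project's implementation. The column names (`pct_university_*`, `home_value_*`, `rent_*`, `gentrifiable_start`) and the toy values are hypothetical. The median across tracts is used as a stand-in for the CMA median change. The rule for combining the three conditions (education increase together with an increase in either housing value or rent) is also an assumption, since the formula above does not say how they combine.

```python
import pandas as pd

# Toy tract data for two census years (hypothetical names and values).
tracts = pd.DataFrame({
    "ctuid": ["A", "B", "C", "D"],
    "gentrifiable_start": [True, True, True, False],
    "pct_university_start": [10.0, 12.0, 15.0, 30.0],
    "pct_university_end": [20.0, 13.0, 22.0, 33.0],
    "home_value_start": [200000.0, 210000.0, 190000.0, 400000.0],
    "home_value_end": [300000.0, 220000.0, 200000.0, 440000.0],
    "rent_start": [700.0, 720.0, 680.0, 1200.0],
    "rent_end": [800.0, 730.0, 900.0, 1260.0],
})

# Change in university share (percentage points) and relative change in prices.
edu_change = tracts["pct_university_end"] - tracts["pct_university_start"]
value_change = tracts["home_value_end"] / tracts["home_value_start"] - 1
rent_change = tracts["rent_end"] / tracts["rent_start"] - 1

# Assumption: the median across tracts stands in for the CMA-wide median change.
edu_up = edu_change > edu_change.median()
value_up = value_change > value_change.median()
rent_up = rent_change > rent_change.median()

# Assumption: education AND (housing value OR rent), applied only to gentrifiable tracts.
tracts["gentrified"] = tracts["gentrifiable_start"] & edu_up & (value_up | rent_up)

print(tracts[["ctuid", "gentrified"]])
```

In this toy data tract D clears the housing-value threshold but is excluded because it was not gentrifiable at the start of the period.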

## Gentrification Levels

**Formula**<br>
Weak: rent or housing value increased ≤ 25th percentile<br>
Moderate: rent or housing value increased within the 25th to 75th percentile<br>
Intense: rent or housing value increased > 75th percentile
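
One way to turn these thresholds into labels is to compare each tract's increase with the 25th and 75th percentiles of all increases. A minimal sketch, assuming the percentiles are computed over tracts already flagged as gentrified and that a single increase column is used; both choices, and all names and values, are hypothetical:

```python
import pandas as pd

# Hypothetical relative increases in rent for tracts already flagged as gentrified.
increase = pd.Series([0.02, 0.06, 0.10, 0.30], index=["A", "B", "C", "D"])

q25 = increase.quantile(0.25)
q75 = increase.quantile(0.75)


def level(x: float) -> str:
    """Label an increase as weak, moderate, or intense."""
    if x <= q25:
        return "weak"
    if x <= q75:
        return "moderate"
    return "intense"


levels = increase.map(level)
print(levels)
```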
