<a href="https://colab.research.google.com/github/EmilyHong77/gentrification_in_montreal/blob/main/notebooks/data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Information** <br>
Variables Source: <br>
https://censusmapper.ca/ <br>

This notebook cleans census data across multiple census years (1996 to 2021).
Key steps include merging related demographic, education, housing, and transportation variables
(e.g., population by gender and age, household education levels, dwelling condition, and commuting patterns)
and renaming columns so that variable names are consistent across datasets. <br>
The steps vary by census year, depending on which variables are available and how they are formatted.
The cleaned datasets are then ready for standardized analysis and for applying the Ding measurement.

**Mount Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Library**

In [None]:
import pandas as pd

# **1996**
- keep population variable
- add gentrified variables
- merge automatic variables

In [None]:
census_1996 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/1996_data_variables.csv')

# Print all column names
print("Column names:")
for col in census_1996.columns:
    print(col)

In [None]:
census_1996_gentrified = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/1996_gentrified_variables.csv')

# Print all column names
print("Column names:")
for col in census_1996_gentrified.columns:
    print(col)

In [None]:
# Remove white spaces
census_1996.columns = census_1996.columns.str.strip()
census_1996_gentrified.columns = census_1996_gentrified.columns.str.strip()

In [None]:
# Drop useless variables
census_1996.drop(columns=['Type', 'Region Name', 'Area (sq km)', 'Dwellings', 'Households', 'v_CA1996_2: Population, 1996'], inplace=True)
census_1996_gentrified.drop(columns=['Type', 'Region Name', 'Area (sq km)', 'Dwellings', 'Households', 'Population'], inplace=True)


# Check column names
print("Column names:")
for col in census_1996.columns:
    print(col)

# Check column names
print("Column names:")
for col in census_1996_gentrified.columns:
    print(col)

In [None]:
# Merge original with additional variables
census_1996_merged = census_1996.merge(
    census_1996_gentrified,
    on="GeoUID",
    how="left"
)
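
A left merge keeps every row of the base table, so tracts missing from the additional table end up with empty values without any warning. The sketch below uses made-up toy frames (not the real census files) to show how the `validate` and `indicator` options of `DataFrame.merge` could surface duplicated or unmatched `GeoUID` values. The variable names are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two census tables (values are made up for illustration)
base = pd.DataFrame({"GeoUID": ["4620001.00", "4620002.00", "4620003.00"],
                     "Population": [1200, 3400, 2100]})
extra = pd.DataFrame({"GeoUID": ["4620001.00", "4620002.00"],
                      "Average gross rent ($)": [510, 640]})

# validate raises MergeError if GeoUID is duplicated on either side;
# indicator adds a _merge column recording where each row was found
checked = base.merge(extra, on="GeoUID", how="left",
                     validate="one_to_one", indicator=True)

# Tracts present in the base table but missing from the additional table
unmatched = checked.loc[checked["_merge"] == "left_only", "GeoUID"].tolist()
print(unmatched)
```

If `unmatched` is not empty, the rent and dwelling-value columns for those tracts will be missing after the merge.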

In [None]:
# Rename variables
census_1996_merged = census_1996_merged.rename(columns=
 {"v_CA1996_1681: Average value of dwelling $": "Average value dwelling ($)",
   "v_CA1996_1701: Average gross rent $": "Average gross rent ($)",
   "v_CA1996_1347: Total population 15 years and over by highest level of schooling": "Education base",
   "v_CA1996_1360: With bachelor's degree or higher": "Bachelors degree or higher"})

In [None]:
# Add year
census_1996_merged = census_1996_merged.add_suffix('_1996')

# Check column names
print("Column names:")
for col in census_1996_merged.columns:
    print(col)
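
`add_suffix` renames every column, including the identifier, so `GeoUID` becomes `GeoUID_1996`. A small sketch on a toy frame shows the effect. Renaming the key back is a hypothetical variation, not something this notebook does.

```python
import pandas as pd

# Toy frame standing in for a cleaned census table
toy = pd.DataFrame({"GeoUID": ["4620001.00"], "Population": [1200]})

# Every column label gets the year suffix, the identifier included
suffixed = toy.add_suffix("_1996")
print(suffixed.columns.tolist())

# Hypothetical variation: restore the plain key so all years share one join column
keyed = suffixed.rename(columns={"GeoUID_1996": "GeoUID"})
print(keyed.columns.tolist())
```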

In [None]:
# Export the merged, renamed, and suffixed table to drive
census_1996_merged.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/2_clean_data/1996_clean.csv', index=False)

# **2001**
- drop employment and unemployment rate variables
- add employed and unemployed variables
- merge population gender
- merge population age buckets
- merge no certificate or diploma
- merge regular maintenance
- merge transportation gender
- merge transportation others
- rename variables!
- add owner and tenant variables


In [None]:
census_2001 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2001_data_variables.csv')

# Print all column names
print("Column names:")
for col in census_2001.columns:
    print(col)

In [None]:
census_2001_gentrified = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2001_gentrified_variables.csv')

# Print all column names
print("Column names:")
for col in census_2001_gentrified.columns:
    print(col)

In [None]:
# Drop employment rate and unemployment rate
census_2001.drop(columns=['v_CA01_741: Employment rate', 'v_CA01_742: Unemployment rate'], inplace=True)

# Drop owned and rented
census_2001.drop(columns=['v_CA01_99: Owned','v_CA01_100: Rented'], inplace=True)

# Check column names
print("Column names:")
for col in census_2001.columns:
    print(col)

In [None]:
# Add employed and unemployed
emp_unemp = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/employed_unemployed_2001.csv')
census_2001 = pd.merge(census_2001, emp_unemp, on='GeoUID')

# Check column names
print("Column names:")
for col in census_2001.columns:
    print(col)

**Owner and Tenant**

In [None]:
# Add owner and tenant
owner_tenant = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2001_housing_variables.csv')
owner_tenant = owner_tenant[['GeoUID', 'v_CA01_1670: Owner households in non-farm, non-reserve private dwellings', 'v_CA01_1666: Tenant households in non-farm, non-reserve private dwellings']]
census_2001 = pd.merge(census_2001, owner_tenant, on='GeoUID')

# Check column names
print("Column names:")
for col in census_2001.columns:
    print(col)

In [None]:
# Remove white spaces
census_2001.columns = census_2001.columns.str.strip()
census_2001_gentrified.columns = census_2001_gentrified.columns.str.strip()

In [None]:
# Drop extra variables
census_2001 = census_2001.drop(columns=['Type_y', 'Region Name_y', 'Area (sq km)_y', 'Population _y', 'Dwellings _y', 'Households _y'])

print(census_2001.columns.tolist())
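
The `_x` and `_y` names come from `pd.merge`: when both tables share a non-key column, pandas appends these suffixes to tell the copies apart. The toy example below (invented values) reproduces names such as `Population _y`. It assumes the raw headers carry a trailing space, which is consistent with the names dropped in the cell above.

```python
import pandas as pd

# Toy frames; the trailing space in "Population " mimics the raw export headers
left = pd.DataFrame({"GeoUID": ["a", "b"], "Population ": [10, 20], "Employed": [5, 9]})
right = pd.DataFrame({"GeoUID": ["a", "b"], "Population ": [10, 20], "Unemployed": [1, 2]})

# Shared non-key columns receive the default suffixes _x (left) and _y (right)
merged = pd.merge(left, right, on="GeoUID")
print(merged.columns.tolist())

# Stripping after the merge only trims the ends of each name,
# so the inner space in 'Population _x' survives
merged.columns = merged.columns.str.strip()
print(merged.columns.tolist())
```

Stripping the headers before merging would give `Population_x` and `Population_y` instead, so the order of these two steps determines which names later cells need to reference.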

In [None]:
# Drop extra variables
census_2001_gentrified = census_2001_gentrified.drop(columns=['Type', 'Region Name', 'Area (sq km)', 'Population', 'Dwellings', 'Households'])

print(census_2001_gentrified.columns.tolist())

In [None]:
# Convert all columns except the identifier to numeric
ctuidcol = 'GeoUID'  # identifier column
cols_to_convert = census_2001.columns.difference([ctuidcol])
census_2001[cols_to_convert] = census_2001[cols_to_convert].apply(pd.to_numeric, errors='coerce')
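
With `errors='coerce'`, any cell that cannot be parsed as a number becomes `NaN` instead of raising an error. A minimal sketch with toy data shows one way to count how many values the conversion turned into missing values. The frame and its contents are invented for illustration.

```python
import pandas as pd

# Toy frame: "x" is a non-numeric placeholder, None is already missing
toy = pd.DataFrame({"GeoUID": ["a", "b", "c"],
                    "Employed": ["120", "x", "95"],
                    "Unemployed": ["10", "7", None]})

value_cols = toy.columns.difference(["GeoUID"])
missing_before = toy[value_cols].isna().sum()
converted = toy[value_cols].apply(pd.to_numeric, errors="coerce")

# Cells that held text and became NaN only because of the coercion
newly_missing = converted.isna().sum() - missing_before
print(newly_missing.to_dict())
```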

In [None]:
# Merge population gender
census_2001['0-4'] = census_2001['v_CA01_7: 0-4'] + census_2001['v_CA01_26: 0-4']
census_2001 = census_2001.drop(columns=['v_CA01_7: 0-4', 'v_CA01_26: 0-4'])

census_2001['5-9'] = census_2001['v_CA01_8: 5-9'] + census_2001['v_CA01_27: 5-9']
census_2001 = census_2001.drop(columns=['v_CA01_8: 5-9', 'v_CA01_27: 5-9'])

census_2001['10-14'] = census_2001['v_CA01_9: 10-14'] + census_2001['v_CA01_28: 10-14']
census_2001 = census_2001.drop(columns=['v_CA01_9: 10-14', 'v_CA01_28: 10-14'])

census_2001['15-19'] = census_2001['v_CA01_10: 15-19'] + census_2001['v_CA01_29: 15-19']
census_2001 = census_2001.drop(columns=['v_CA01_10: 15-19', 'v_CA01_29: 15-19'])

census_2001['20-24'] = census_2001['v_CA01_11: 20-24'] + census_2001['v_CA01_30: 20-24']
census_2001 = census_2001.drop(columns=['v_CA01_11: 20-24', 'v_CA01_30: 20-24'])

census_2001['25-29'] = census_2001['v_CA01_12: 25-29'] + census_2001['v_CA01_31: 25-29']
census_2001 = census_2001.drop(columns=['v_CA01_12: 25-29', 'v_CA01_31: 25-29'])

census_2001['30-34'] = census_2001['v_CA01_13: 30-34'] + census_2001['v_CA01_32: 30-34']
census_2001 = census_2001.drop(columns=['v_CA01_13: 30-34', 'v_CA01_32: 30-34'])

census_2001['35-39'] = census_2001['v_CA01_14: 35-39'] + census_2001['v_CA01_33: 35-39']
census_2001 = census_2001.drop(columns=['v_CA01_14: 35-39', 'v_CA01_33: 35-39'])

census_2001['40-44'] = census_2001['v_CA01_15: 40-44'] + census_2001['v_CA01_34: 40-44']
census_2001 = census_2001.drop(columns=['v_CA01_15: 40-44', 'v_CA01_34: 40-44'])

census_2001['45-49'] = census_2001['v_CA01_16: 45-49'] + census_2001['v_CA01_35: 45-49']
census_2001 = census_2001.drop(columns=['v_CA01_16: 45-49', 'v_CA01_35: 45-49'])

census_2001['50-54'] = census_2001['v_CA01_17: 50-54'] + census_2001['v_CA01_36: 50-54']
census_2001 = census_2001.drop(columns=['v_CA01_17: 50-54', 'v_CA01_36: 50-54'])

census_2001['55-59'] = census_2001['v_CA01_18: 55-59'] + census_2001['v_CA01_37: 55-59']
census_2001 = census_2001.drop(columns=['v_CA01_18: 55-59', 'v_CA01_37: 55-59'])

census_2001['60-64'] = census_2001['v_CA01_19: 60-64'] + census_2001['v_CA01_38: 60-64']
census_2001 = census_2001.drop(columns=['v_CA01_19: 60-64', 'v_CA01_38: 60-64'])

census_2001['65-69'] = census_2001['v_CA01_20: 65-69'] + census_2001['v_CA01_39: 65-69']
census_2001 = census_2001.drop(columns=['v_CA01_20: 65-69', 'v_CA01_39: 65-69'])

census_2001['70-74'] = census_2001['v_CA01_21: 70-74'] + census_2001['v_CA01_40: 70-74']
census_2001 = census_2001.drop(columns=['v_CA01_21: 70-74', 'v_CA01_40: 70-74'])

census_2001['75-79'] = census_2001['v_CA01_22: 75-79'] + census_2001['v_CA01_41: 75-79']
census_2001 = census_2001.drop(columns=['v_CA01_22: 75-79', 'v_CA01_41: 75-79'])

census_2001['80-84'] = census_2001['v_CA01_23: 80-84'] + census_2001['v_CA01_42: 80-84']
census_2001 = census_2001.drop(columns=['v_CA01_23: 80-84', 'v_CA01_42: 80-84'])

census_2001['85+'] = census_2001['v_CA01_24: 85+'] + census_2001['v_CA01_43: 85+']
census_2001 = census_2001.drop(columns=['v_CA01_24: 85+', 'v_CA01_43: 85+'])

# Check column names
print("Column names:")
for col in census_2001.columns:
    print(col)

In [None]:
# Merge population age buckets
census_2001['Population 0 to 9'] = census_2001['0-4'] + census_2001['5-9']
census_2001 = census_2001.drop(columns=['0-4', '5-9'])

census_2001['Population 10 to 19'] = census_2001['10-14'] + census_2001['15-19']
census_2001 = census_2001.drop(columns=['10-14', '15-19'])

census_2001['Population 20 to 29'] = census_2001['20-24'] + census_2001['25-29']
census_2001 = census_2001.drop(columns=['20-24', '25-29'])

census_2001['Population 30 to 39'] = census_2001['30-34'] + census_2001['35-39']
census_2001 = census_2001.drop(columns=['30-34', '35-39'])

census_2001['Population 40 to 49'] = census_2001['40-44'] + census_2001['45-49']
census_2001 = census_2001.drop(columns=['40-44', '45-49'])

census_2001['Population 50 to 59'] = census_2001['50-54'] + census_2001['55-59']
census_2001 = census_2001.drop(columns=['50-54', '55-59'])

census_2001['Population 60 plus'] = census_2001['60-64'] + census_2001['65-69'] + census_2001['70-74'] + census_2001['75-79'] + census_2001['80-84'] + census_2001['85+']
census_2001 = census_2001.drop(columns=['60-64', '65-69', '70-74', '75-79', '80-84', '85+'])

# Check column names
print("Column names:")
for col in census_2001.columns:
    print(col)
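
The sum-then-drop pattern used for the age buckets repeats for every group of columns. As a rough sketch, the same steps could be driven by a mapping from the new column name to its source columns. The helper `sum_and_drop` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def sum_and_drop(df, new_name, source_cols):
    """Return a copy with source_cols replaced by their row-wise sum."""
    out = df.copy()
    # skipna=False mirrors the + operator: any missing input gives a missing total
    out[new_name] = out[source_cols].sum(axis=1, skipna=False)
    return out.drop(columns=source_cols)

# Toy data with one missing value in the 5-9 column
toy = pd.DataFrame({"GeoUID": ["a", "b"],
                    "0-4": [30.0, 12.0], "5-9": [25.0, None],
                    "10-14": [20.0, 18.0], "15-19": [22.0, 16.0]})

buckets = {"Population 0 to 9": ["0-4", "5-9"],
           "Population 10 to 19": ["10-14", "15-19"]}
for name, cols in buckets.items():
    toy = sum_and_drop(toy, name, cols)
print(toy)
```

Keeping `skipna=False` matters here: with the default, a tract with one missing age group would get a partial total that looks valid.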

In [None]:
# Merge no certificate or diploma
census_2001['No certificate or diploma'] = census_2001['v_CA01_1387: Without high school graduation certificate']+ census_2001['v_CA01_1391: Without certificate or diploma'] + census_2001['v_CA01_1394: Without degree']
census_2001 = census_2001.drop(columns=['v_CA01_1387: Without high school graduation certificate', 'v_CA01_1391: Without certificate or diploma', 'v_CA01_1394: Without degree'])

# Merge regular maintenance
census_2001['Regular maintenance'] = census_2001['v_CA01_102: Regular maintenance only']+ census_2001['v_CA01_103: Minor repairs']
census_2001 = census_2001.drop(columns=['v_CA01_102: Regular maintenance only', 'v_CA01_103: Minor repairs'])

# Merge transportation gender and other methods
census_2001['Public transit'] = census_2001['v_CA01_1257: Public transit'] + census_2001['v_CA01_1266: Public transit']
census_2001 = census_2001.drop(columns=['v_CA01_1257: Public transit', 'v_CA01_1266: Public transit'])

census_2001['Walked'] = census_2001['v_CA01_1258: Walked'] + census_2001['v_CA01_1267: Walked']
census_2001 = census_2001.drop(columns=['v_CA01_1258: Walked', 'v_CA01_1267: Walked'])

census_2001['Bicycle'] = census_2001['v_CA01_1259: Bicycle'] + census_2001['v_CA01_1268: Bicycle']
census_2001 = census_2001.drop(columns=['v_CA01_1259: Bicycle', 'v_CA01_1268: Bicycle'])

census_2001['Transportation other methods'] = census_2001['v_CA01_1260: Motorcycle'] + census_2001['v_CA01_1261: Taxicab'] + census_2001['v_CA01_1262: Other method'] + census_2001['v_CA01_1269: Motorcycle'] + census_2001['v_CA01_1270: Taxicab'] + census_2001['v_CA01_1271: Other method']
census_2001 = census_2001.drop(columns=['v_CA01_1260: Motorcycle', 'v_CA01_1261: Taxicab', 'v_CA01_1262: Other method', 'v_CA01_1269: Motorcycle', 'v_CA01_1270: Taxicab', 'v_CA01_1271: Other method'])

census_2001['Transportation vehicle driver'] = census_2001['v_CA01_1255: Car, truck, van, as driver'] + census_2001['v_CA01_1264: Car, truck, van, as driver']
census_2001 = census_2001.drop(columns=['v_CA01_1255: Car, truck, van, as driver', 'v_CA01_1264: Car, truck, van, as driver'])

census_2001['Transportation vehicle non-driver'] = census_2001['v_CA01_1256: Car, truck, van, as passenger'] + census_2001['v_CA01_1265: Car, truck, van, as passenger']
census_2001 = census_2001.drop(columns=['v_CA01_1256: Car, truck, van, as passenger', 'v_CA01_1265: Car, truck, van, as passenger'])

# Check column names
print("Column names:")
for col in census_2001.columns:
    print(col)

In [None]:
# Drop useless variables
census_2001 = census_2001.drop(columns=['Type_x', 'Region Name_x'])

# Check column names
print("Column names:")
for col in census_2001.columns:
    print(col)

In [None]:
# Rename: drop apostrophe
census_2001["v_CA01_1397: With bachelors degree or higher"] = census_2001["v_CA01_1397: With bachelor's degree or higher"]
census_2001 = census_2001.drop(columns=["v_CA01_1397: With bachelor's degree or higher"])

print(census_2001.columns.tolist())

In [None]:
print(census_2001['v_CA01_1397: With bachelors degree or higher'])
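
Copying the column under a new name and dropping the old one moves it to the end of the table. As an alternative sketch, `rename` with a function changes the label in place and leaves the column order untouched. The toy frame below is illustrative.

```python
import pandas as pd

# Toy frame with the apostrophe-bearing column in the middle
toy = pd.DataFrame({"GeoUID": ["a"],
                    "v_CA01_1397: With bachelor's degree or higher": [40],
                    "Population": [100]})

# Remove apostrophes from every column label without reordering columns
renamed = toy.rename(columns=lambda name: name.replace("'", ""))
print(renamed.columns.tolist())
```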

In [None]:
# Rename variables
census_2001 = census_2001.rename(columns=
  {'Area (sq km)_x': 'Area (sq km)', 'Population _x': 'Population', 'Dwellings _x': 'Dwellings',
   'Households _x': 'Households', 'v_CA01_6: Male, total': 'Total male population', 'v_CA01_25: Female, total': 'Total female population',
   'v_CA01_3: Population percentage change, 1996-2001': 'Population change (%)', 'v_CA01_45: Never legally married (single)': 'Never married', 'v_CA01_48: Divorced': 'Divorced',
   'v_CA01_49: Widowed': 'Widowed', 'v_CA01_383: Movers': 'Movers', 'v_CA01_382: Non-movers': 'Non-movers',
   'v_CA01_400: Canadian citizenship': 'Canadian citizens', 'v_CA01_385: Migrants': 'Migrants', 'v_CA01_384: Non-migrants': 'Non-migrants',
   'v_CA01_386: Internal migrants': 'Internal migrants', 'v_CA01_389: External migrants': 'External migrants', 'v_CA01_703: Total visible minority population': 'Visible minority',
   'v_CA01_716: All others': 'Non-visible minority', 'v_CA01_401: Citizenship other than Canadian': 'Non-Canadian citizens', 'v_CA01_403: Non-immigrant population': 'Non-immigrants',
   'v_CA01_406: Total immigrants by selected places of birth': 'Immigrants', 'v_CA01_458: Non-permanent residents': 'Non-permanent residents', 'v_CA01_55: Married couples': 'Married couples',
   'v_CA01_61: Common-law couples': 'Common-law couples',
   'v_CA01_1634: Median household income $': 'Median household income ($)', 'v_CA01_1388: With high school graduation certificate': 'High school or Secondary degree', 'v_CA01_1392: With certificate or diploma': 'College or CEGEP degree',
   'v_CA01_1389: Trades certificate or diploma': 'Trades certificate, diploma or apprenticeship', 'v_CA01_1397: With bachelors degree or higher': 'Bachelors degree or higher', 'v_CA01_118: Apartment, building that has fewer than five storeys': 'Apartment with fewer than five stories',
   'v_CA01_117: Apartment, building that has five or more storeys': 'Apartment with five or more storeys', 'v_CA01_115: Row house': 'Row house', 'v_CA01_113: Single-detached house': 'Single-detached house',
   'v_CA01_114: Semi-detached house': 'Semi-detached house', 'v_CA01_120: Movable dwelling': 'Movable dwelling', 'v_CA01_119: Other single-attached house': 'Other single-attached house',
   'v_CA01_104: Major repairs': 'Major repairs', 'v_CA01_101: Band housing': 'Band housing', 'v_CA01_135: English': 'English only',
   'v_CA01_136: French': 'French only', 'v_CA01_137: Non-official languages': 'Allophone', 'v_CA01_1667: Average gross rent $': 'Average gross rent ($)',
   'v_CA01_1674: Average value of dwelling $': 'Average value dwelling ($)', 'v_CA01_209: English and French': 'English and French', 'v_CA01_210: English and non-official language': 'English and non-official language(s)',
   'v_CA01_211: French and non-official language': 'French and non-official language(s)', 'v_CA01_212: English, French and non-official language': 'English, French, and non-official language(s)', 'v_CA01_737: Employed': 'Employed',
   'v_CA01_738: Unemployed': 'Unemployed',
   'v_CA01_1670: Owner households in non-farm, non-reserve private dwellings': 'Owner households',
   'v_CA01_1666: Tenant households in non-farm, non-reserve private dwellings': 'Renter households' })

# Check column names and length
print("Column names:")
for col in census_2001.columns:
    print(col)

print(len(census_2001.columns))
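
`rename` ignores mapping keys that do not match any column, so a mistyped source name leaves the original label in place without any message. The sketch below, on a toy frame with a deliberately misspelled key, shows two ways such cases might be caught: listing the unmatched keys, or passing `errors="raise"`.

```python
import pandas as pd

toy = pd.DataFrame({"v_CA01_737: Employed": [5], "v_CA01_738: Unemployed": [1]})
mapping = {"v_CA01_737: Employed": "Employed",
           "v_CA01_738: Unemployd": "Unemployed"}  # deliberate typo in the key

# rename skips keys it cannot find, so the typo goes unnoticed by default
missing_keys = [key for key in mapping if key not in toy.columns]
print(missing_keys)

# errors="raise" turns the same situation into a KeyError
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```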

In [None]:
# Rename variables
census_2001_gentrified = census_2001_gentrified.rename(columns=
  { 'v_CA01_1384: Total population 20 years and over by highest level of schooling': 'Education base'})

In [None]:
# Merge original with additional variables
census_2001_merged = census_2001.merge(
    census_2001_gentrified,
    on="GeoUID",
    how="left"
)

In [None]:
# Add year
census_2001_merged = census_2001_merged.add_suffix('_2001')

# Check column names
print("Column names:")
for col in census_2001_merged.columns:
    print(col)

print(len(census_2001_merged.columns))


In [None]:
# Check values for each column
census_2001_merged.iloc[:60, 0:20]

In [None]:
# Export file to drive
census_2001_merged.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/2_clean_data/2001_clean.csv', index=False)

# **2006**

- merge population gender + merge population age buckets
- merge regular maintenance
- merge transportation others
- calculate population percentage change and add
- rename variables!
- add owner and tenant variables

In [None]:
census_2006 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2006_data_variables.csv')

# Print all column names
print("Column names:")
for col in census_2006.columns:
    print(col)

**Owner and Tenant**<br>
**Bachelors degree or higher**<br>
**Education base**

In [None]:
# Remove white spaces
census_2006.columns = census_2006.columns.str.strip()

In [None]:
# Add owner and tenant
owner_tenant = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2006_housing_variables.csv')
owner_tenant = owner_tenant[['GeoUID', 'v_CA06_2053: Owner-occupied private non-farm, non-reserve dwellings', 'v_CA06_2049: Tenant-occupied private non-farm, non-reserve dwellings']]
census_2006 = pd.merge(census_2006, owner_tenant, on='GeoUID')

# Add missing gentrified variables
education_bachelors = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2006_gentrified_variables.csv')
education_bachelors = education_bachelors[['GeoUID', 'v_CA06_1234: Total population 15 to 24 years by highest certificate, diploma or degree - 20% sample data',
                                          'v_CA06_1248: Total population 25 to 64 years by highest certificate, diploma or degree - 20% sample data', 'v_CA06_1242: University certificate or degree',
                                          'v_CA06_1262: Total population 65 years and over by highest certificate, diploma or degree - 20% sample data', 'v_CA06_1270: University certificate or degree']]
census_2006 = pd.merge(census_2006, education_bachelors, on='GeoUID')

# Check column names
print("Column names:")
for col in census_2006.columns:
    print(col)

In [None]:
# Convert all columns except the identifier to numeric
ctuidcol = 'GeoUID'  # identifier column
cols_to_convert = census_2006.columns.difference([ctuidcol])
census_2006[cols_to_convert] = census_2006[cols_to_convert].apply(pd.to_numeric, errors='coerce')

In [None]:
# Merge population gender and age buckets
census_2006['Population 0 to 9'] = census_2006['v_CA06_4: 0 to 4 years'] + census_2006['v_CA06_23: 0 to 4 years'] + census_2006['v_CA06_5: 5 to 9 years'] + census_2006['v_CA06_24: 5 to 9 years']
census_2006 = census_2006.drop(columns=['v_CA06_4: 0 to 4 years', 'v_CA06_23: 0 to 4 years', 'v_CA06_5: 5 to 9 years', 'v_CA06_24: 5 to 9 years'])

census_2006['Population 10 to 19'] = census_2006['v_CA06_6: 10 to 14 years'] + census_2006['v_CA06_25: 10 to 14 years'] + census_2006['v_CA06_7: 15 to 19 years'] + census_2006['v_CA06_26: 15 to 19 years']
census_2006 = census_2006.drop(columns=['v_CA06_6: 10 to 14 years', 'v_CA06_25: 10 to 14 years', 'v_CA06_7: 15 to 19 years', 'v_CA06_26: 15 to 19 years'])

census_2006['Population 20 to 29'] = census_2006['v_CA06_8: 20 to 24 years'] + census_2006['v_CA06_27: 20 to 24 years'] + census_2006['v_CA06_9: 25 to 29 years'] + census_2006['v_CA06_28: 25 to 29 years']
census_2006 = census_2006.drop(columns=['v_CA06_8: 20 to 24 years', 'v_CA06_27: 20 to 24 years', 'v_CA06_9: 25 to 29 years', 'v_CA06_28: 25 to 29 years'])

census_2006['Population 30 to 39'] = census_2006['v_CA06_10: 30 to 34 years'] + census_2006['v_CA06_29: 30 to 34 years'] + census_2006['v_CA06_11: 35 to 39 years'] + census_2006['v_CA06_30: 35 to 39 years']
census_2006 = census_2006.drop(columns=['v_CA06_10: 30 to 34 years', 'v_CA06_29: 30 to 34 years', 'v_CA06_11: 35 to 39 years', 'v_CA06_30: 35 to 39 years'])

census_2006['Population 40 to 49'] = census_2006['v_CA06_12: 40 to 44 years'] + census_2006['v_CA06_31: 40 to 44 years'] + census_2006['v_CA06_13: 45 to 49 years'] + census_2006['v_CA06_32: 45 to 49 years']
census_2006 = census_2006.drop(columns=['v_CA06_12: 40 to 44 years', 'v_CA06_31: 40 to 44 years', 'v_CA06_13: 45 to 49 years', 'v_CA06_32: 45 to 49 years'])

census_2006['Population 50 to 59'] = census_2006['v_CA06_14: 50 to 54 years'] + census_2006['v_CA06_33: 50 to 54 years'] + census_2006['v_CA06_15: 55 to 59 years'] + census_2006['v_CA06_34: 55 to 59 years']
census_2006 = census_2006.drop(columns=['v_CA06_14: 50 to 54 years', 'v_CA06_33: 50 to 54 years', 'v_CA06_15: 55 to 59 years', 'v_CA06_34: 55 to 59 years'])

census_2006['Population 60 plus'] = census_2006['v_CA06_16: 60 to 64 years'] + census_2006['v_CA06_35: 60 to 64 years'] + census_2006['v_CA06_17: 65 to 69 years'] + census_2006['v_CA06_36: 65 to 69 years'] + census_2006['v_CA06_18: 70 to 74 years'] + census_2006['v_CA06_37: 70 to 74 years'] + census_2006['v_CA06_19: 75 to 79 years'] + census_2006['v_CA06_38: 75 to 79 years'] + census_2006['v_CA06_20: 80 to 84 years'] + census_2006['v_CA06_39: 80 to 84 years'] + census_2006['v_CA06_21: 85 years and over'] + census_2006['v_CA06_40: 85 years and over']
census_2006 = census_2006.drop(columns=['v_CA06_16: 60 to 64 years', 'v_CA06_35: 60 to 64 years', 'v_CA06_17: 65 to 69 years', 'v_CA06_36: 65 to 69 years', 'v_CA06_18: 70 to 74 years', 'v_CA06_37: 70 to 74 years', 'v_CA06_19: 75 to 79 years', 'v_CA06_38: 75 to 79 years', 'v_CA06_20: 80 to 84 years', 'v_CA06_39: 80 to 84 years', 'v_CA06_21: 85 years and over', 'v_CA06_40: 85 years and over'])

# Check column names
print("Column names:")
for col in census_2006.columns:
    print(col)

In [None]:
# Merge regular maintenance
census_2006['Regular maintenance'] = census_2006['v_CA06_106: Regular maintenance only']+ census_2006['v_CA06_107: Minor repairs']
census_2006 = census_2006.drop(columns=['v_CA06_106: Regular maintenance only', 'v_CA06_107: Minor repairs'])

# Merge transportation other methods
census_2006['Transportation other methods'] = census_2006['v_CA06_1106: Motorcycle'] + census_2006['v_CA06_1107: Taxicab'] + census_2006['v_CA06_1108: Other method']
census_2006 = census_2006.drop(columns=['v_CA06_1106: Motorcycle', 'v_CA06_1107: Taxicab', 'v_CA06_1108: Other method'])

# Check all column names
print("Column names:")
for col in census_2006.columns:
    print(col)

In [None]:
# Merge bachelors degree or higher
census_2006['Bachelors degree or higher'] = census_2006['v_CA06_1242: University certificate or degree'] + census_2006['v_CA06_1256: University certificate or degree'] + census_2006['v_CA06_1270: University certificate or degree']
census_2006 = census_2006.drop(columns=['v_CA06_1242: University certificate or degree', 'v_CA06_1256: University certificate or degree', 'v_CA06_1270: University certificate or degree'])

# Merge education base
census_2006['Education base'] = census_2006['v_CA06_1234: Total population 15 to 24 years by highest certificate, diploma or degree - 20% sample data'] + census_2006['v_CA06_1248: Total population 25 to 64 years by highest certificate, diploma or degree - 20% sample data'] + census_2006['v_CA06_1262: Total population 65 years and over by highest certificate, diploma or degree - 20% sample data']
census_2006 = census_2006.drop(columns=['v_CA06_1234: Total population 15 to 24 years by highest certificate, diploma or degree - 20% sample data', 'v_CA06_1248: Total population 25 to 64 years by highest certificate, diploma or degree - 20% sample data', 'v_CA06_1262: Total population 65 years and over by highest certificate, diploma or degree - 20% sample data'])

In [None]:
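# Calculate population percentage change and add
c_2001 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/2_clean_data/2001_clean.csv')
# Match each 2006 tract to its 2001 population by GeoUID; plain subtraction would pair rows by position
pop_2001 = c_2001.drop_duplicates('GeoUID_2001').set_index('GeoUID_2001')['Population_2001']
pop_2001 = census_2006['GeoUID'].map(pop_2001)
census_2006['Population change (%)'] = ((census_2006['Population'] - pop_2001) / pop_2001 * 100).round(1)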

print(census_2006['Population change (%)'])

# Check column names
print("Column names:")
for col in census_2006.columns:
    print(col)
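
The percentage change is (later population - earlier population) / earlier population * 100. When two census years do not list the same tracts in the same order, the earlier value has to be looked up by identifier. A minimal sketch with invented numbers, assuming the earlier table uses the suffixed names `GeoUID_2001` and `Population_2001`:

```python
import pandas as pd

# Toy tables; tract "c" exists only in the later year and the row order differs
earlier = pd.DataFrame({"GeoUID_2001": ["a", "b"], "Population_2001": [1000, 2000]})
later = pd.DataFrame({"GeoUID": ["b", "c", "a"], "Population": [2100, 500, 900]})

# Look up each later tract's earlier population by identifier
earlier_pop = later["GeoUID"].map(earlier.set_index("GeoUID_2001")["Population_2001"])
later["Population change (%)"] = ((later["Population"] - earlier_pop) / earlier_pop * 100).round(1)
print(later)
```

Tracts with no counterpart in the earlier year get a missing value instead of a change computed against an unrelated tract.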

In [None]:
# Drop useless variables
census_2006 = census_2006.drop(columns=['Type', 'Region Name', 'v_CA06_102: Owned', 'v_CA06_103: Rented'])

In [None]:
# Rename variables
census_2006 = census_2006.rename(columns=
  {'v_CA06_3: Male, total': 'Total male population', 'v_CA06_22: Female, total': 'Total female population',
   'v_CA06_42: Never legally married (single)': 'Never married', 'v_CA06_45: Divorced': 'Divorced',
   'v_CA06_46: Widowed': 'Widowed', 'v_CA06_453: Movers': 'Movers', 'v_CA06_452: Non-movers': 'Non-movers',
   'v_CA06_470: Canadian citizens': 'Canadian citizens',
   'v_CA06_455: Migrants': 'Migrants',
   'v_CA06_454: Non-migrants': 'Non-migrants',
   'v_CA06_456: Internal migrants': 'Internal migrants',
   'v_CA06_459: External migrants': 'External migrants',
   'v_CA06_1303: Total visible minority population': 'Visible minority',
   'v_CA06_1316: Not a visible minority': 'Non-visible minority',
   'v_CA06_473: Not Canadian citizens': 'Non-Canadian citizens',
   'v_CA06_475: Non-immigrants': 'Non-immigrants',
   'v_CA06_478: Immigrants': 'Immigrants',
   'v_CA06_511: Non-permanent residents': 'Non-permanent residents',
   'v_CA06_57: Married couples': 'Married couples',
   'v_CA06_63: Common-law couples': 'Common-law couples',
  #  'v_CA06_102: Owned': 'Owned housing',
  #  'v_CA06_103: Rented': 'Rented housing',
   'v_CA06_1249: No certificate, diploma or degree': 'No certificate or diploma',
   'v_CA06_2000: Median household income $': 'Median household income ($)',
   'v_CA06_1251: High school certificate or equivalent': 'High school or Secondary degree',
   'v_CA06_1253: College, CEGEP or other non-university certificate or diploma': 'College or CEGEP degree',
   'v_CA06_1252: Apprenticeship or trades certificate or diploma': 'Trades certificate, diploma or apprenticeship',
  #  'v_CA06_1256: University certificate or degree': 'Bachelors degree or higher',
   'v_CA06_125: Apartment, building that has fewer than five storeys': 'Apartment with fewer than five stories',
   'v_CA06_124: Apartment, building that has five or more storeys': 'Apartment with five or more storeys',
   'v_CA06_122: Row house': 'Row house',
   'v_CA06_120: Single-detached house': 'Single-detached house',
   'v_CA06_121: Semi-detached house': 'Semi-detached house',
   'v_CA06_127: Movable dwelling': 'Movable dwelling',
   'v_CA06_126: Other single-attached house': 'Other single-attached house',
   'v_CA06_108: Major repairs': 'Major repairs',
   'v_CA06_104: Band housing': 'Band housing',
   'v_CA06_142: English': 'English only',
   'v_CA06_143: French': 'French only',
   'v_CA06_144: Non-official languages': 'Allophone',
   'v_CA06_2050: Average gross rent $': 'Average gross rent ($)',
   'v_CA06_2054: Average value of dwelling $': 'Average value dwelling ($)',
   'v_CA06_1103: Public transit' : 'Public transit',
   'v_CA06_1104: Walked' : 'Walked',
   'v_CA06_1105: Bicycle' : 'Bicycle',
   'v_CA06_1101: Car, truck, van, as driver': 'Transportation vehicle driver',
   'v_CA06_1102: Car, truck, van, as passenger' : 'Transportation vehicle non-driver',
   'v_CA06_239: English and French': 'English and French',
   'v_CA06_240: English and non-official language': 'English and non-official language(s)',
   'v_CA06_241: French and non-official language': 'French and non-official language(s)',
   'v_CA06_242: English, French and non-official language': 'English, French, and non-official language(s)',
   'v_CA06_577: Employed': 'Employed',
   'v_CA06_578: Unemployed': 'Unemployed',
   'v_CA06_2053: Owner-occupied private non-farm, non-reserve dwellings': 'Owner households',
   'v_CA06_2049: Tenant-occupied private non-farm, non-reserve dwellings': 'Tenant households'})

# Check column names and length
print("Column names:")
for col in census_2006.columns:
    print(col)

print(len(census_2006.columns))

In [None]:
# Add year
census_2006 = census_2006.add_suffix('_2006')

# Check column names
print("Column names:")
for col in census_2006.columns:
    print(col)

print(len(census_2006.columns))

In [None]:
# Check values
census_2006.iloc[:60, 0:20]

In [None]:
# Export file to drive
census_2006.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/2_clean_data/2006_clean.csv', index=False)

# **2011**
- merge population age buckets
- rename variables!
- drop "NHS Non Return Rate" column
- add owner and tenant variables

In [None]:
census_2011 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2011_data_variables.csv')

# Print all column names
print("Column names:")
for col in census_2011.columns:
    print(col)

**Owner and Tenant**<br>
**Bachelors degree or higher**<br>
**Education base**

In [None]:
# Add owner and tenant
owner_tenant = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2011_housing_variables.csv')
owner_tenant = owner_tenant[['GeoUID', 'v_CA11N_2281: Number of owner households in non-farm, non-reserve private dwellings', 'v_CA11N_2288: Number of tenant households in non-farm, non-reserve private dwellings']]
census_2011 = pd.merge(census_2011, owner_tenant, on='GeoUID')

# Add missing gentrified variables
education_bachelors = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2011_gentrified_variables.csv')
education_bachelors = education_bachelors[['GeoUID', 'v_CA11N_1771: Total population aged 15 years and over by highest certificate, diploma or degree',
                                           'v_CA11N_1792: University certificate, diploma or degree at bachelor level or above', 'v_CA11N_1801: Total population aged 25 to 64 years by highest certificate, diploma or degree']]
census_2011 = pd.merge(census_2011, education_bachelors, on='GeoUID')

# Check column names
print("Column names:")
for col in census_2011.columns:
    print(col)

In [None]:
# Remove white spaces
census_2011.columns = census_2011.columns.str.strip()

In [None]:
# Convert all columns except the identifier to numeric
ctuidcol = 'GeoUID'  # identifier column
cols_to_convert = census_2011.columns.difference([ctuidcol])
census_2011[cols_to_convert] = census_2011[cols_to_convert].apply(pd.to_numeric, errors='coerce')
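With `errors='coerce'`, any entry that cannot be parsed as a number becomes `NaN`. A minimal sketch of how the number of coerced values could be counted per column, assuming a toy frame in which `'x'` and `'F'` stand in for whatever non-numeric entries the raw files may contain:

```python
import pandas as pd

# Toy frame: 'x' and 'F' are made-up placeholders for non-numeric entries
df = pd.DataFrame({'GeoUID': ['4620001.00', '4620002.00', '4620003.00'],
                   'Employed': ['1200', 'x', '950'],
                   'Unemployed': ['80', '75', 'F']})

id_col = 'GeoUID'
value_cols = df.columns.difference([id_col])

missing_before = df[value_cols].isna().sum()
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors='coerce')
missing_after = df[value_cols].isna().sum()

# Number of values per column that the conversion turned into NaN
coerced = missing_after - missing_before
print(coerced[coerced > 0])
```

Columns with a non-zero count are worth inspecting in the raw file before the sums that follow, since `NaN` propagates through `+`.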

In [None]:
# Merge population age buckets
census_2011['Population 0 to 9'] = census_2011['v_CA11F_8: 0 to 4 years'] + census_2011['v_CA11F_11: 5 to 9 years']
census_2011 = census_2011.drop(columns=['v_CA11F_8: 0 to 4 years', 'v_CA11F_11: 5 to 9 years'])

census_2011['Population 10 to 19'] = census_2011['v_CA11F_14: 10 to 14 years'] + census_2011['v_CA11F_17: 15 to 19 years']
census_2011 = census_2011.drop(columns=['v_CA11F_14: 10 to 14 years', 'v_CA11F_17: 15 to 19 years'])

census_2011['Population 20 to 29'] = census_2011['v_CA11F_35: 20 to 24 years'] + census_2011['v_CA11F_38: 25 to 29 years']
census_2011 = census_2011.drop(columns=['v_CA11F_35: 20 to 24 years', 'v_CA11F_38: 25 to 29 years'])

census_2011['Population 30 to 39'] = census_2011['v_CA11F_41: 30 to 34 years'] + census_2011['v_CA11F_44: 35 to 39 years']
census_2011 = census_2011.drop(columns=['v_CA11F_41: 30 to 34 years', 'v_CA11F_44: 35 to 39 years'])

census_2011['Population 40 to 49'] = census_2011['v_CA11F_47: 40 to 44 years'] + census_2011['v_CA11F_50: 45 to 49 years']
census_2011 = census_2011.drop(columns=['v_CA11F_47: 40 to 44 years', 'v_CA11F_50: 45 to 49 years'])

census_2011['Population 50 to 59'] = census_2011['v_CA11F_53: 50 to 54 years'] + census_2011['v_CA11F_56: 55 to 59 years']
census_2011 = census_2011.drop(columns=['v_CA11F_53: 50 to 54 years', 'v_CA11F_56: 55 to 59 years'])

census_2011['Population 60 plus'] = census_2011['v_CA11F_59: 60 to 64 years'] + census_2011['v_CA11F_62: 65 to 69 years'] + census_2011['v_CA11F_65: 70 to 74 years'] + census_2011['v_CA11F_68: 75 to 79 years'] + census_2011['v_CA11F_71: 80 to 84 years'] + census_2011['v_CA11F_74: 85 years and over']
census_2011 = census_2011.drop(columns=['v_CA11F_59: 60 to 64 years', 'v_CA11F_62: 65 to 69 years', 'v_CA11F_65: 70 to 74 years', 'v_CA11F_68: 75 to 79 years', 'v_CA11F_71: 80 to 84 years', 'v_CA11F_74: 85 years and over'])

# Check column names
print("Column names:")
for col in census_2011.columns:
    print(col)
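The bucket merging above repeats the same add-then-drop pattern for each age range. One possible way to express it more compactly is a dictionary that maps each new column to its source columns. The sketch uses shortened, hypothetical column names and made-up counts.

```python
import pandas as pd

# Toy frame with shortened, hypothetical column names
df = pd.DataFrame({'0 to 4 years': [10, 20], '5 to 9 years': [15, 25],
                   '10 to 14 years': [12, 18], '15 to 19 years': [11, 19]})

# Map each merged bucket to the source columns it absorbs
buckets = {
    'Population 0 to 9': ['0 to 4 years', '5 to 9 years'],
    'Population 10 to 19': ['10 to 14 years', '15 to 19 years'],
}

for new_col, parts in buckets.items():
    # min_count=len(parts) gives NaN when any part is missing, like `+` does
    df[new_col] = df[parts].sum(axis=1, min_count=len(parts))
    df = df.drop(columns=parts)

print(df)
```

Keeping the mapping in one place also makes it easier to compare bucket definitions across census years, where only the variable codes change.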

In [None]:
# Merge bachelors degree or higher
census_2011['Bachelors degree or higher'] = census_2011['v_CA11N_1792: University certificate, diploma or degree at bachelor level or above'] + census_2011['v_CA11N_1822: University certificate, diploma or degree at bachelor level or above']
census_2011 = census_2011.drop(columns=['v_CA11N_1792: University certificate, diploma or degree at bachelor level or above', 'v_CA11N_1822: University certificate, diploma or degree at bachelor level or above'])

# Merge education base (total aged 15 and over plus total aged 25 to 64)
census_2011['Education_base'] = census_2011['v_CA11N_1771: Total population aged 15 years and over by highest certificate, diploma or degree'] + census_2011['v_CA11N_1801: Total population aged 25 to 64 years by highest certificate, diploma or degree']
census_2011 = census_2011.drop(columns=['v_CA11N_1771: Total population aged 15 years and over by highest certificate, diploma or degree', 'v_CA11N_1801: Total population aged 25 to 64 years by highest certificate, diploma or degree'])

In [None]:
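# Calculate the 2006 to 2011 population percentage change and add it as a column
# (rows are matched by index position, so both tables must list the same tracts in the same order)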
c_2006 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/2_clean_data/2006_clean.csv')
census_2011['Population change (%)'] = ((census_2011['Population'] - c_2006['Population_2006']) / c_2006['Population_2006'] * 100).round(1)

print(census_2011['Population change (%)'])
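The subtraction above pairs the 2011 and 2006 rows by their index position, which gives correct results only when both tables contain the same tracts in the same order. A hedged alternative is to look up the 2006 population by tract identifier. The sketch uses made-up tables. The `GeoUID_2006` column name assumes the 2006 file was suffixed the same way as the later years, and the identifiers are assumed to share the same format in both tables.

```python
import pandas as pd

# Toy tables (made-up values); note the different row order and the extra tract
c_2011 = pd.DataFrame({'GeoUID': ['A', 'B', 'C'],
                       'Population': [1100, 2000, 500]})
c_2006 = pd.DataFrame({'GeoUID_2006': ['B', 'A'],
                       'Population_2006': [1600, 1000]})

# Look up each tract's 2006 population by identifier, not by row position
pop_2006 = c_2011['GeoUID'].map(
    c_2006.set_index('GeoUID_2006')['Population_2006'])

c_2011['Population change (%)'] = (
    (c_2011['Population'] - pop_2006) / pop_2006 * 100).round(1)
print(c_2011)
```

Tracts without a 2006 counterpart (tract `C` here) receive `NaN` instead of a value computed from an unrelated row.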

In [None]:
# Drop useless variables
census_2011 = census_2011.drop(columns=['NHS Non Return Rate', 'Type', 'Region Name', 'v_CA11N_2253: Owner', 'v_CA11N_2254: Renter'])

# Check column names
print("Column names:")
for col in census_2011.columns:
    print(col)

In [None]:
# Rename variables
census_2011 = census_2011.rename(columns=
  {'v_CA11F_6: Total population by age groups': 'Total male population', 'v_CA11F_7: Total population by age groups': 'Total female population',
   'v_CA11N_1720: Non-movers': 'Non-movers', 'v_CA11N_1723: Movers': 'Movers', 'v_CA11N_1726: Non-migrants': 'Non-migrants', 'v_CA11N_1729: Migrants': 'Migrants',
   'v_CA11N_1732: Internal migrants': 'Internal migrants', 'v_CA11N_1741: External migrants': 'External migrants', 'v_CA11N_499: Not a visible minority': 'Non-visible minority',
   'v_CA11N_460: Total visible minority population': 'Visible minority', 'v_CA11N_13: Not Canadian citizens': 'Non-Canadian citizens', 'v_CA11N_4: Canadian citizens': 'Canadian citizens',
   'v_CA11N_19: Non-immigrants': 'Non-immigrants', 'v_CA11N_22: Immigrants': 'Immigrants', 'v_CA11N_46: Non-permanent residents': 'Non-permanent residents',
   'v_CA11F_98: Single (never legally married)': 'Never married', 'v_CA11F_104: Divorced': 'Divorced', 'v_CA11F_107: Widowed': 'Widowed', 'v_CA11F_123: Common-law couples': 'Common-law couples', 'v_CA11F_117: Married couples': 'Married couples',
   'v_CA11N_2562: Median household total income $': 'Median household income ($)', 'v_CA11N_1993: Employed': 'Employed', 'v_CA11N_1996: Unemployed': 'Unemployed',
   'v_CA11N_1804: No certificate, diploma or degree': 'No certificate or diploma', 'v_CA11N_1807: High school diploma or equivalent': 'High school or Secondary degree',
   'v_CA11N_1816: College, CEGEP or other non-university certificate or diploma': 'College or CEGEP degree', 'v_CA11N_1813: Apprenticeship or trades certificate or diploma': 'Trades certificate, diploma or apprenticeship',
  #  'v_CA11N_1822: University certificate, diploma or degree at bachelor level or above': 'Bachelors degree or higher',
   'v_CA11F_207: Apartment, building that has fewer than five storeys': 'Apartment with fewer than five stories', 'v_CA11F_201: Apartment, building that has five or more storeys': 'Apartment with five or more storeys',
   'v_CA11F_205: Row house': 'Row house', 'v_CA11F_200: Single-detached house': 'Single-detached house', 'v_CA11F_204: Semi-detached house': 'Semi-detached house', 'v_CA11F_202: Movable dwelling': 'Movable dwelling', 'v_CA11F_208: Other single-attached house': 'Other single-attached house',
   'v_CA11N_2292: Average monthly shelter costs for rented dwellings ($)': 'Average gross rent ($)', 'v_CA11N_2287: Average value of dwellings ($)': 'Average value dwelling ($)',
   'v_CA11N_2232: Major repairs needed': 'Major repairs', 'v_CA11N_2231: Only regular maintenance or minor repairs needed': 'Regular maintenance',
  #  'v_CA11N_2253: Owner': 'Owned housing', 'v_CA11N_2254: Renter': 'Rented housing',
   'v_CA11N_2255: Band housing': 'Band housing',
   'v_CA11N_2200: Public transit': 'Public transit', 'v_CA11N_2203: Walked': 'Walked', 'v_CA11N_2206: Bicycle': 'Bicycle',
   'v_CA11N_2209: Other methods': 'Transportation other methods', 'v_CA11N_2194: Car, truck or van - as a driver': 'Transportation vehicle driver', 'v_CA11N_2197: Car, truck or van - as a passenger': 'Transportation vehicle non-driver',
   'v_CA11F_224: English': 'English only', 'v_CA11F_227: French': 'French only', 'v_CA11F_539: English and French': 'English and French',
   'v_CA11F_542: English and non-official language': 'English and non-official language(s)', 'v_CA11F_545: French and non-official language': 'French and non-official language(s)',
   'v_CA11F_548: English, French and non-official language': 'English, French, and non-official language(s)', 'v_CA11F_230: Non-official languages': 'Allophone',
   'v_CA11N_2281: Number of owner households in non-farm, non-reserve private dwellings': 'Owner households', 'v_CA11N_2288: Number of tenant households in non-farm, non-reserve private dwellings': 'Renter households'
   })

# Check column names and length
print("Column names:")
for col in census_2011.columns:
    print(col)

print(len(census_2011.columns))
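`DataFrame.rename` ignores dictionary keys that match no column, so a typo in one of the long variable names would leave that column unrenamed without any error. A small sketch of a check that could be run before renaming, using a toy frame and a deliberately misspelled key (all names here are illustrative):

```python
import pandas as pd

# Toy frame and mapping; the second key contains a deliberate typo
df = pd.DataFrame({'v_1: Employed': [1], 'v_2: Unemployed': [2]})
rename_map = {'v_1: Employed': 'Employed',
              'v_2: Unemployd': 'Unemployed'}

# Keys that match no column would be skipped without any warning
missing_keys = sorted(set(rename_map) - set(df.columns))
print(missing_keys)

df = df.rename(columns=rename_map)
print(list(df.columns))
```

Passing `errors='raise'` to `rename` turns the same situation into a `KeyError`, which may be preferable once the mapping is known to be complete.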

In [None]:
# Add year
census_2011 = census_2011.add_suffix('_2011')

# Check column names
print("Column names:")
for col in census_2011.columns:
    print(col)
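`add_suffix` renames every column, including the identifier, so the exported file carries `GeoUID_2011` rather than `GeoUID`. When the yearly files are joined later, the key columns therefore have different names on each side. A minimal sketch of such a join on made-up tables, assuming the identifiers are comparable across the two years:

```python
import pandas as pd

# Toy yearly tables with suffixed identifiers (values are made up)
c_2011 = pd.DataFrame({'GeoUID_2011': ['A', 'B'],
                       'Population_2011': [10, 20]})
c_2016 = pd.DataFrame({'GeoUID_2016': ['B', 'A'],
                       'Population_2016': [22, 11]})

# left_on / right_on join on differently named key columns
both = pd.merge(c_2011, c_2016,
                left_on='GeoUID_2011', right_on='GeoUID_2016')
print(both)
```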

In [None]:
# Check values
census_2011.iloc[:60, 0:20]

In [None]:
# Export file to drive
census_2011.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/2_clean_data/2011_clean.csv', index=False)

# **2016**
- merge population age buckets
- rename variables!
- add owner and tenant variables


In [220]:
census_2016 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2016_data_variables.csv')

# Print all column names
print("Column names:")
for col in census_2016.columns:
    print(col)

Column names:
GeoUID
Type
Region Name
Area (sq km)
Population 
Dwellings 
Households 
v_CA16_7: 0 to 4 years
v_CA16_25: 5 to 9 years
v_CA16_43: 10 to 14 years
v_CA16_2: Age Stats
v_CA16_3: Age Stats
v_CA16_64: 15 to 19 years
v_CA16_82: 20 to 24 years
v_CA16_100: 25 to 29 years
v_CA16_118: 30 to 34 years
v_CA16_136: 35 to 39 years
v_CA16_154: 40 to 44 years
v_CA16_172: 45 to 49 years
v_CA16_190: 50 to 54 years
v_CA16_208: 55 to 59 years
v_CA16_226: 60 to 64 years
v_CA16_247: 65 to 69 years
v_CA16_265: 70 to 74 years
v_CA16_283: 75 to 79 years
v_CA16_301: 80 to 84 years
v_CA16_319: 85 years and over
v_CA16_403: Population percentage change, 2011 to 2016
v_CA16_415: Apartment in a building that has fewer than five storeys
v_CA16_410: Apartment in a building that has five or more storeys
v_CA16_413: Row house
v_CA16_409: Single-detached house
v_CA16_412: Semi-detached house
v_CA16_417: Movable dwelling
v_CA16_416: Other single-attached house
v_CA16_6695: Non-movers
v_CA16_6698: Movers
v_CA

**Owner and Tenant**<br>
**Bachelors degree or higher**<br>
**Education base**

In [221]:
# Add owner and tenant
owner_tenant = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2016_housing_variables.csv')
owner_tenant = owner_tenant[['GeoUID', 'v_CA16_4890: Total - Owner households in non-farm, non-reserve private dwellings - 25% sample data', 'v_CA16_4897: Total - Tenant households in non-farm, non-reserve private dwellings - 25% sample data']]
census_2016 = pd.merge(census_2016, owner_tenant, on='GeoUID')

# Add missing gentrified variables
education_bachelors = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2016_gentrified_variables.csv')
education_bachelors = education_bachelors[['GeoUID', 'v_CA16_5051: Total - Highest certificate, diploma or degree for the population aged 15 years and over in private households - 25% sample data',
                                           'v_CA16_5096: Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data', 'v_CA16_5078: University certificate, diploma or degree at bachelor level or above']]
census_2016 = pd.merge(census_2016, education_bachelors, on='GeoUID')

# Check column names
print("Column names:")
for col in census_2016.columns:
    print(col)

In [222]:
# Remove white spaces
census_2016.columns = census_2016.columns.str.strip()

In [223]:
# Convert all columns except the identifier to numeric
ctuidcol = 'GeoUID'  # identifier column
cols_to_convert = census_2016.columns.difference([ctuidcol])
census_2016[cols_to_convert] = census_2016[cols_to_convert].apply(pd.to_numeric, errors='coerce')

In [224]:
# Merge population age buckets
census_2016['Population 0 to 9'] = census_2016['v_CA16_7: 0 to 4 years'] + census_2016['v_CA16_25: 5 to 9 years']
census_2016 = census_2016.drop(columns=['v_CA16_7: 0 to 4 years', 'v_CA16_25: 5 to 9 years'])

census_2016['Population 10 to 19'] = census_2016['v_CA16_43: 10 to 14 years'] + census_2016['v_CA16_64: 15 to 19 years']
census_2016 = census_2016.drop(columns=['v_CA16_43: 10 to 14 years', 'v_CA16_64: 15 to 19 years'])

census_2016['Population 20 to 29'] = census_2016['v_CA16_82: 20 to 24 years'] + census_2016['v_CA16_100: 25 to 29 years']
census_2016 = census_2016.drop(columns=['v_CA16_82: 20 to 24 years', 'v_CA16_100: 25 to 29 years'])

census_2016['Population 30 to 39'] = census_2016['v_CA16_118: 30 to 34 years'] + census_2016['v_CA16_136: 35 to 39 years']
census_2016 = census_2016.drop(columns=['v_CA16_118: 30 to 34 years', 'v_CA16_136: 35 to 39 years'])

census_2016['Population 40 to 49'] = census_2016['v_CA16_154: 40 to 44 years'] + census_2016['v_CA16_172: 45 to 49 years']
census_2016 = census_2016.drop(columns=['v_CA16_154: 40 to 44 years', 'v_CA16_172: 45 to 49 years'])

census_2016['Population 50 to 59'] = census_2016['v_CA16_190: 50 to 54 years'] + census_2016['v_CA16_208: 55 to 59 years']
census_2016 = census_2016.drop(columns=['v_CA16_190: 50 to 54 years', 'v_CA16_208: 55 to 59 years'])

census_2016['Population 60 plus'] = census_2016['v_CA16_226: 60 to 64 years'] + census_2016['v_CA16_247: 65 to 69 years'] + census_2016['v_CA16_265: 70 to 74 years'] + census_2016['v_CA16_283: 75 to 79 years'] + census_2016['v_CA16_301: 80 to 84 years'] + census_2016['v_CA16_319: 85 years and over']
census_2016 = census_2016.drop(columns=['v_CA16_226: 60 to 64 years', 'v_CA16_247: 65 to 69 years', 'v_CA16_265: 70 to 74 years', 'v_CA16_283: 75 to 79 years', 'v_CA16_301: 80 to 84 years', 'v_CA16_319: 85 years and over'])

# Check column names
print("Column names:")
for col in census_2016.columns:
    print(col)

Column names:
GeoUID
Type
Region Name
Area (sq km)
Population
Dwellings
Households
v_CA16_2: Age Stats
v_CA16_3: Age Stats
v_CA16_403: Population percentage change, 2011 to 2016
v_CA16_415: Apartment in a building that has fewer than five storeys
v_CA16_410: Apartment in a building that has five or more storeys
v_CA16_413: Row house
v_CA16_409: Single-detached house
v_CA16_412: Semi-detached house
v_CA16_417: Movable dwelling
v_CA16_416: Other single-attached house
v_CA16_6695: Non-movers
v_CA16_6698: Movers
v_CA16_3957: Total visible minority population
v_CA16_3996: Not a visible minority
v_CA16_3402: Not Canadian citizens
v_CA16_3408: Non-immigrants
v_CA16_3411: Immigrants
v_CA16_3435: Non-permanent residents
v_CA16_3393: Canadian citizens
v_CA16_6701: Non-migrants
v_CA16_6704: Migrants
v_CA16_6707: Internal migrants
v_CA16_6716: External migrants
v_CA16_466: Never married
v_CA16_472: Divorced
v_CA16_475: Widowed
v_CA16_487: Common-law couples
v_CA16_486: Married couples
v_CA16_2397:

In [225]:
# Merge bachelors degree or higher
census_2016['Bachelors degree or higher'] = census_2016['v_CA16_5078: University certificate, diploma or degree at bachelor level or above'] + census_2016['v_CA16_5123: University certificate, diploma or degree at bachelor level or above']
census_2016 = census_2016.drop(columns=['v_CA16_5078: University certificate, diploma or degree at bachelor level or above', 'v_CA16_5123: University certificate, diploma or degree at bachelor level or above'])

# Merge education base (total aged 15 and over plus total aged 25 to 64)
census_2016['Education_base'] = census_2016['v_CA16_5051: Total - Highest certificate, diploma or degree for the population aged 15 years and over in private households - 25% sample data'] + census_2016['v_CA16_5096: Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data']
census_2016 = census_2016.drop(columns=['v_CA16_5051: Total - Highest certificate, diploma or degree for the population aged 15 years and over in private households - 25% sample data', 'v_CA16_5096: Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data'])

In [226]:
# Drop useless variables
census_2016 = census_2016.drop(columns=['Type', 'Region Name', 'v_CA16_4837: Owner', 'v_CA16_4838: Renter'])

# Check column names
print("Column names:")
for col in census_2016.columns:
    print(col)

Column names:
GeoUID
Area (sq km)
Population
Dwellings
Households
v_CA16_2: Age Stats
v_CA16_3: Age Stats
v_CA16_403: Population percentage change, 2011 to 2016
v_CA16_415: Apartment in a building that has fewer than five storeys
v_CA16_410: Apartment in a building that has five or more storeys
v_CA16_413: Row house
v_CA16_409: Single-detached house
v_CA16_412: Semi-detached house
v_CA16_417: Movable dwelling
v_CA16_416: Other single-attached house
v_CA16_6695: Non-movers
v_CA16_6698: Movers
v_CA16_3957: Total visible minority population
v_CA16_3996: Not a visible minority
v_CA16_3402: Not Canadian citizens
v_CA16_3408: Non-immigrants
v_CA16_3411: Immigrants
v_CA16_3435: Non-permanent residents
v_CA16_3393: Canadian citizens
v_CA16_6701: Non-migrants
v_CA16_6704: Migrants
v_CA16_6707: Internal migrants
v_CA16_6716: External migrants
v_CA16_466: Never married
v_CA16_472: Divorced
v_CA16_475: Widowed
v_CA16_487: Common-law couples
v_CA16_486: Married couples
v_CA16_2397: Median total inc

In [227]:
# Rename variables
census_2016 = census_2016.rename(columns=
  {'v_CA16_2: Age Stats': 'Total male population', 'v_CA16_3: Age Stats': 'Total female population', 'v_CA16_403: Population percentage change, 2011 to 2016': 'Population change (%)',
   'v_CA16_6695: Non-movers': 'Non-movers', 'v_CA16_6698: Movers': 'Movers', 'v_CA16_6701: Non-migrants': 'Non-migrants', 'v_CA16_6704: Migrants': 'Migrants',
   'v_CA16_6707: Internal migrants': 'Internal migrants', 'v_CA16_6716: External migrants': 'External migrants', 'v_CA16_3996: Not a visible minority': 'Non-visible minority',
   'v_CA16_3957: Total visible minority population': 'Visible minority', 'v_CA16_3402: Not Canadian citizens': 'Non-Canadian citizens', 'v_CA16_3393: Canadian citizens': 'Canadian citizens',
   'v_CA16_3408: Non-immigrants': 'Non-immigrants', 'v_CA16_3411: Immigrants': 'Immigrants', 'v_CA16_3435: Non-permanent residents': 'Non-permanent residents',
   'v_CA16_466: Never married': 'Never married', 'v_CA16_472: Divorced': 'Divorced', 'v_CA16_475: Widowed': 'Widowed', 'v_CA16_487: Common-law couples': 'Common-law couples', 'v_CA16_486: Married couples': 'Married couples',
   'v_CA16_2397: Median total income of households in 2015 ($)': 'Median household income ($)', 'v_CA16_5603: Employed': 'Employed', 'v_CA16_5606: Unemployed': 'Unemployed',
   'v_CA16_5099: No certificate, diploma or degree': 'No certificate or diploma', 'v_CA16_5102: Secondary (high) school diploma or equivalency certificate': 'High school or Secondary degree',
   'v_CA16_5117: College, CEGEP or other non-university certificate or diploma': 'College or CEGEP degree', 'v_CA16_5108: Apprenticeship or trades certificate or diploma': 'Trades certificate, diploma or apprenticeship',
  #  'v_CA16_5123: University certificate, diploma or degree at bachelor level or above': 'Bachelors degree or higher',
   'v_CA16_415: Apartment in a building that has fewer than five storeys': 'Apartment with fewer than five stories', 'v_CA16_410: Apartment in a building that has five or more storeys': 'Apartment with five or more storeys',
   'v_CA16_413: Row house': 'Row house', 'v_CA16_409: Single-detached house': 'Single-detached house', 'v_CA16_412: Semi-detached house': 'Semi-detached house', 'v_CA16_417: Movable dwelling': 'Movable dwelling', 'v_CA16_416: Other single-attached house': 'Other single-attached house',
   'v_CA16_4901: Average monthly shelter costs for rented dwellings ($)': 'Average gross rent ($)', 'v_CA16_4896: Average value of dwellings ($)': 'Average value dwelling ($)',
   'v_CA16_4872: Major repairs needed': 'Major repairs', 'v_CA16_4871: Only regular maintenance or minor repairs needed': 'Regular maintenance',
  #  'v_CA16_4837: Owner': 'Owned housing', 'v_CA16_4838: Renter': 'Rented housing',
   'v_CA16_4839: Band housing': 'Band housing',
   'v_CA16_5801: Public transit': 'Public transit', 'v_CA16_5804: Walked': 'Walked', 'v_CA16_5807: Bicycle': 'Bicycle',
   'v_CA16_5810: Other method': 'Transportation other methods', 'v_CA16_5795: Car, truck, van - as a driver': 'Transportation vehicle driver', 'v_CA16_5798: Car, truck, van - as a passenger': 'Transportation vehicle non-driver',
   'v_CA16_557: English': 'English only', 'v_CA16_560: French': 'French only', 'v_CA16_1343: English and French': 'English and French',
   'v_CA16_1346: English and non-official language': 'English and non-official language(s)', 'v_CA16_1349: French and non-official language': 'French and non-official language(s)',
   'v_CA16_1352: English, French and non-official language': 'English, French, and non-official language(s)', 'v_CA16_563: Non-official languages': 'Allophone',
   'v_CA16_4890: Total - Owner households in non-farm, non-reserve private dwellings - 25% sample data': 'Owner households', 'v_CA16_4897: Total - Tenant households in non-farm, non-reserve private dwellings - 25% sample data': 'Renter households',
   })

# Check column names and length
print("Column names:")
for col in census_2016.columns:
    print(col)

print(len(census_2016.columns))

Column names:
GeoUID
Area (sq km)
Population
Dwellings
Households
Total male population
Total female population
Population change (%)
Apartment with fewer than five stories
Apartment with five or more storeys
Row house
Single-detached house
Semi-detached house
Movable dwelling
Other single-attached house
Non-movers
Movers
Visible minority
Non-visible minority
Non-Canadian citizens
Non-immigrants
Immigrants
Non-permanent residents
Canadian citizens
Non-migrants
Migrants
Internal migrants
External migrants
Never married
Divorced
Widowed
Common-law couples
Married couples
Median household income ($)
Employed
Unemployed
No certificate or diploma
High school or Secondary degree
College or CEGEP degree
Trades certificate, diploma or apprenticeship
Average gross rent ($)
Average value dwelling ($)
Major repairs
Regular maintenance
Band housing
Public transit
Walked
Bicycle
Transportation other methods
Transportation vehicle driver
Transportation vehicle non-driver
English only
French only
All

In [228]:
# Add year
census_2016 = census_2016.add_suffix('_2016')

# Check column names
print("Column names:")
for col in census_2016.columns:
    print(col)

Column names:
GeoUID_2016
Area (sq km)_2016
Population_2016
Dwellings_2016
Households_2016
Total male population_2016
Total female population_2016
Population change (%)_2016
Apartment with fewer than five stories_2016
Apartment with five or more storeys_2016
Row house_2016
Single-detached house_2016
Semi-detached house_2016
Movable dwelling_2016
Other single-attached house_2016
Non-movers_2016
Movers_2016
Visible minority_2016
Non-visible minority_2016
Non-Canadian citizens_2016
Non-immigrants_2016
Immigrants_2016
Non-permanent residents_2016
Canadian citizens_2016
Non-migrants_2016
Migrants_2016
Internal migrants_2016
External migrants_2016
Never married_2016
Divorced_2016
Widowed_2016
Common-law couples_2016
Married couples_2016
Median household income ($)_2016
Employed_2016
Unemployed_2016
No certificate or diploma_2016
High school or Secondary degree_2016
College or CEGEP degree_2016
Trades certificate, diploma or apprenticeship_2016
Average gross rent ($)_2016
Average value dwelli

In [229]:
# Check values
census_2016.iloc[:60, 0:20]

| | GeoUID_2016 | Area (sq km)_2016 | Population_2016 | Dwellings_2016 | Households_2016 | Total male population_2016 | Total female population_2016 | Population change (%)_2016 | Apartment with fewer than five stories_2016 | Apartment with five or more storeys_2016 | Row house_2016 | Single-detached house_2016 | Semi-detached house_2016 | Movable dwelling_2016 | Other single-attached house_2016 | Non-movers_2016 | Movers_2016 | Visible minority_2016 | Non-visible minority_2016 | Non-Canadian citizens_2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4620001.0 | 0.46188 | 2638 | 1452 | 1328 | 1340.0 | 1295.0 | 1.2 | 795.0 | 5.0 | 15.0 | 25.0 | 25.0 | 0.0 | 15.0 | 2300.0 | 365.0 | 415.0 | 2285.0 | 85.0 |
| 1 | 4620002.0 | 0.38927 | 3516 | 1902 | 1762 | 1670.0 | 1845.0 | 15.7 | 1015.0 | 155.0 | 85.0 | 80.0 | 20.0 | 0.0 | 20.0 | 2615.0 | 690.0 | 560.0 | 2780.0 | 100.0 |
| 2 | 4620003.0 | 0.7401 | 6373 | 3103 | 2937 | 3090.0 | 3280.0 | 2.2 | 1685.0 | 15.0 | 95.0 | 225.0 | 100.0 | 0.0 | 15.0 | 5480.0 | 715.0 | 1380.0 | 4870.0 | 430.0 |
| 3 | 4620004.0 | 0.44828 | 3176 | 1704 | 1603 | 1545.0 | 1635.0 | -2.6 | 900.0 | 0.0 | 125.0 | 115.0 | 20.0 | 0.0 | 90.0 | 2700.0 | 405.0 | 485.0 | 2655.0 | 160.0 |
| 4 | 4620005.0 | 0.56464 | 3060 | 1749 | 1593 | 1515.0 | 1540.0 | -3.3 | 930.0 | 0.0 | 145.0 | 45.0 | 30.0 | 0.0 | 30.0 | 2485.0 | 520.0 | 415.0 | 2620.0 | 150.0 |
| 5 | 4620006.0 | 0.64815 | 4467 | 2149 | 1993 | 2215.0 | 2255.0 | 0.2 | 1445.0 | 0.0 | 100.0 | 65.0 | 30.0 | 0.0 | 0.0 | 3540.0 | 675.0 | 1110.0 | 3170.0 | 345.0 |
| 6 | 4620007.0 | 1.09059 | 5682 | 2946 | 2809 | 2760.0 | 2925.0 | -0.7 | 1470.0 | 105.0 | 15.0 | 680.0 | 55.0 | 0.0 | 30.0 | 4840.0 | 780.0 | 1210.0 | 4445.0 | 425.0 |
| 7 | 4620008.0 | 0.6229 | 3016 | 1586 | 1465 | 1520.0 | 1495.0 | 2.5 | 705.0 | 30.0 | 10.0 | 150.0 | 190.0 | 0.0 | 10.0 | 2700.0 | 415.0 | 465.0 | 2690.0 | 120.0 |
| 8 | 4620009.0 | 4.26975 | 3470 | 1770 | 1640 | 1800.0 | 1675.0 | -0.5 | 920.0 | 5.0 | 30.0 | 205.0 | 20.0 | 0.0 | 10.0 | 2795.0 | 665.0 | 735.0 | 2760.0 | 255.0 |
| 9 | 4620010.0 | 0.92115 | 1479 | 754 | 701 | 750.0 | 725.0 | 3.6 | 390.0 | 5.0 | 30.0 | 30.0 | 0.0 | 0.0 | 5.0 | 1300.0 | 165.0 | 400.0 | 1075.0 | 130.0 |
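In the preview above, `GeoUID_2016` is displayed as a float (`4620001.0`), which suggests that pandas parsed the identifier column as a number. If the raw identifiers are written with two decimals (for example `4620001.00`, which is an assumption about the source files), reading the column as text would keep them exactly as written. A minimal sketch on an in-memory CSV with made-up rows:

```python
import io
import pandas as pd

# In-memory stand-in for a raw census CSV (made-up rows)
raw = io.StringIO("GeoUID,Population\n4620001.00,2638\n4620290.10,3516\n")

# Default parsing turns the identifier into a float
as_float = pd.read_csv(raw)
print(as_float['GeoUID'].tolist())

raw.seek(0)
# dtype={'GeoUID': str} keeps the identifier exactly as written
as_text = pd.read_csv(raw, dtype={'GeoUID': str})
print(as_text['GeoUID'].tolist())
```

Whichever representation is chosen, it would need to be the same in every yearly file so that the identifiers still match when the years are joined.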


In [230]:
# Export file to drive
census_2016.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/2_clean_data/2016_clean.csv', index=False)

# **2021**
- merge population age buckets
- merge "Multiple non-official languages" column and "Non-official languages" column
- rename variables!
- add owner and tenant variables


In [244]:
census_2021 = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2021_data_variables.csv')

# Print all column names
print("Column names:")
for col in census_2021.columns:
    print(col)

Column names:
GeoUID
Type
Region Name
Area (sq km)
Population 
Dwellings 
Households 
v_CA21_14: 0 to 4 years
v_CA21_32: 5 to 9 years
v_CA21_50: 10 to 14 years
v_CA21_9: Total - Age
v_CA21_10: Total - Age
v_CA21_3: Population percentage change, 2016 to 2021
v_CA21_71: 15 to 19 years
v_CA21_89: 20 to 24 years
v_CA21_107: 25 to 29 years
v_CA21_125: 30 to 34 years
v_CA21_143: 35 to 39 years
v_CA21_161: 40 to 44 years
v_CA21_179: 45 to 49 years
v_CA21_197: 50 to 54 years
v_CA21_215: 55 to 59 years
v_CA21_233: 60 to 64 years
v_CA21_254: 65 to 69 years
v_CA21_272: 70 to 74 years
v_CA21_290: 75 to 79 years
v_CA21_308: 80 to 84 years
v_CA21_326: 85 years and over
v_CA21_5748: Non-movers
v_CA21_5751: Movers
v_CA21_5754: Non-migrants
v_CA21_5757: Migrants
v_CA21_5760: Internal migrants
v_CA21_5769: External migrants
v_CA21_4875: Total visible minority population
v_CA21_4914: Not a visible minority
v_CA21_4401: Not Canadian citizens
v_CA21_4407: Non-immigrants
v_CA21_4410: Immigrants
v_CA21_4434:

**Owner and Tenant**<br>
**Bachelors degree or higher**<br>
**Education base**


In [245]:
# Add owner and tenant
owner_tenant = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2021_housing_variables.csv')
owner_tenant = owner_tenant[['GeoUID', 'v_CA21_4305: Total - Owner households in non-farm, non-reserve private dwellings', 'v_CA21_4313: Total - Tenant households in non-farm, non-reserve private dwellings']]
census_2021 = pd.merge(census_2021, owner_tenant, on='GeoUID')

# Add missing gentrified variables
education_bachelors = pd.read_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/1_raw_census_data/2021_gentrified_variables.csv')
education_bachelors = education_bachelors[['GeoUID',
                                           "v_CA21_5847: Bachelor's degree or higher",
                                           'v_CA21_5817: Total - Highest certificate, diploma or degree for the population aged 15 years and over in private households',
                                           'v_CA21_5865: Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households']]
census_2021 = pd.merge(census_2021, education_bachelors, on='GeoUID')

# Check column names
print("Column names:")
for col in census_2021.columns:
    print(col)

In [246]:
# Remove white spaces
census_2021.columns = census_2021.columns.str.strip()

In [247]:
# Convert all columns except the identifier to numeric
ctuidcol = 'GeoUID'  # identifier column
cols_to_convert = census_2021.columns.difference([ctuidcol])
census_2021[cols_to_convert] = census_2021[cols_to_convert].apply(pd.to_numeric, errors='coerce')

In [248]:
# Merge population age buckets
census_2021['Population 0 to 9'] = census_2021['v_CA21_14: 0 to 4 years'] + census_2021['v_CA21_32: 5 to 9 years']
census_2021 = census_2021.drop(columns=['v_CA21_14: 0 to 4 years', 'v_CA21_32: 5 to 9 years'])

census_2021['Population 10 to 19'] = census_2021['v_CA21_50: 10 to 14 years'] + census_2021['v_CA21_71: 15 to 19 years']
census_2021 = census_2021.drop(columns=['v_CA21_50: 10 to 14 years', 'v_CA21_71: 15 to 19 years'])

census_2021['Population 20 to 29'] = census_2021['v_CA21_89: 20 to 24 years'] + census_2021['v_CA21_107: 25 to 29 years']
census_2021 = census_2021.drop(columns=['v_CA21_89: 20 to 24 years', 'v_CA21_107: 25 to 29 years'])

census_2021['Population 30 to 39'] = census_2021['v_CA21_125: 30 to 34 years'] + census_2021['v_CA21_143: 35 to 39 years']
census_2021 = census_2021.drop(columns=['v_CA21_125: 30 to 34 years', 'v_CA21_143: 35 to 39 years'])

census_2021['Population 40 to 49'] = census_2021['v_CA21_161: 40 to 44 years'] + census_2021['v_CA21_179: 45 to 49 years']
census_2021 = census_2021.drop(columns=['v_CA21_161: 40 to 44 years', 'v_CA21_179: 45 to 49 years'])

census_2021['Population 50 to 59'] = census_2021['v_CA21_197: 50 to 54 years'] + census_2021['v_CA21_215: 55 to 59 years']
census_2021 = census_2021.drop(columns=['v_CA21_197: 50 to 54 years', 'v_CA21_215: 55 to 59 years'])

census_2021['Population 60 plus'] = census_2021['v_CA21_233: 60 to 64 years'] + census_2021['v_CA21_254: 65 to 69 years'] + census_2021['v_CA21_272: 70 to 74 years'] + census_2021['v_CA21_290: 75 to 79 years'] + census_2021['v_CA21_308: 80 to 84 years'] + census_2021['v_CA21_326: 85 years and over']
census_2021 = census_2021.drop(columns=['v_CA21_233: 60 to 64 years', 'v_CA21_254: 65 to 69 years', 'v_CA21_272: 70 to 74 years', 'v_CA21_290: 75 to 79 years', 'v_CA21_308: 80 to 84 years', 'v_CA21_326: 85 years and over'])

# Check column names
print("Column names:")
for col in census_2021.columns:
    print(col)

Column names:
GeoUID
Type
Region Name
Area (sq km)
Population
Dwellings
Households
v_CA21_9: Total - Age
v_CA21_10: Total - Age
v_CA21_3: Population percentage change, 2016 to 2021
v_CA21_5748: Non-movers
v_CA21_5751: Movers
v_CA21_5754: Non-migrants
v_CA21_5757: Migrants
v_CA21_5760: Internal migrants
v_CA21_5769: External migrants
v_CA21_4875: Total visible minority population
v_CA21_4914: Not a visible minority
v_CA21_4401: Not Canadian citizens
v_CA21_4407: Non-immigrants
v_CA21_4410: Immigrants
v_CA21_4434: Non-permanent residents
v_CA21_4392: Canadian citizens
v_CA21_480: Never married
v_CA21_486: Divorced
v_CA21_489: Widowed
v_CA21_504: Common-law couples
v_CA21_501: Married couples
v_CA21_906: Median total income of household in 2020 ($)
v_CA21_6498: Employed
v_CA21_6501: Unemployed
v_CA21_5868: No certificate, diploma or degree
v_CA21_5889: College, CEGEP or other non-university certificate or diploma
v_CA21_5880: Apprenticeship or trades certificate or diploma
v_CA21_5895: Ba
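
The sum-then-drop pattern used for the age buckets repeats for every group. As a sketch of one way to express it, the helper below takes a mapping from new column name to source columns; `merge_columns` and the toy frame are assumptions for illustration, not the notebook's own code. Setting `min_count` to the number of parts keeps the behaviour of `+`, where a missing part makes the whole bucket `NaN`.

```python
import pandas as pd

def merge_columns(df: pd.DataFrame, groups: dict) -> pd.DataFrame:
    """Sum each list of source columns into a new column, then drop the sources."""
    out = df.copy()
    for new_name, parts in groups.items():
        # min_count=len(parts): result is NaN unless every part is present,
        # which matches adding the columns with `+`
        out[new_name] = out[parts].sum(axis=1, min_count=len(parts))
        out = out.drop(columns=parts)
    return out

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "GeoUID": ["a", "b"],
    "v_CA21_14: 0 to 4 years": [10.0, 5.0],
    "v_CA21_32: 5 to 9 years": [20.0, float("nan")],
})
groups = {
    "Population 0 to 9": ["v_CA21_14: 0 to 4 years", "v_CA21_32: 5 to 9 years"],
}
merged = merge_columns(toy, groups)
print(merged)
```

The same mapping could list the remaining buckets (for example the six columns behind `Population 60 plus`), which keeps the column names in one place.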

In [249]:
# Merge non-official languages and multiple non-official languages
census_2021['Allophone'] = census_2021['v_CA21_2215: Non-official languages'] + census_2021['v_CA21_3190: Multiple non-official languages']
census_2021 = census_2021.drop(columns=['v_CA21_2215: Non-official languages', 'v_CA21_3190: Multiple non-official languages'])

In [250]:
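# Merge bachelors degree or higher
# Note: this sums the count for the population aged 15+ with the count for ages 25 to 64,
# so people aged 25 to 64 are counted in both terms (Education_base below is built the same way)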
census_2021['Bachelors degree or higher'] = census_2021["v_CA21_5847: Bachelor's degree or higher"] + census_2021["v_CA21_5895: Bachelor's degree or higher"]
census_2021 = census_2021.drop(columns=["v_CA21_5847: Bachelor's degree or higher", "v_CA21_5895: Bachelor's degree or higher"])

# Merge education base
census_2021['Education_base'] = census_2021['v_CA21_5817: Total - Highest certificate, diploma or degree for the population aged 15 years and over in private households'] + census_2021['v_CA21_5865: Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households']
census_2021 = census_2021.drop(columns=['v_CA21_5817: Total - Highest certificate, diploma or degree for the population aged 15 years and over in private households', 'v_CA21_5865: Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households'])

In [251]:
# Drop unused variables
census_2021 = census_2021.drop(columns=['Type','Region Name', 'v_CA21_4238: Owner', 'v_CA21_4239: Renter'])

# Check column names
print("Column names:")
for col in census_2021.columns:
    print(col)

Column names:
GeoUID
Area (sq km)
Population
Dwellings
Households
v_CA21_9: Total - Age
v_CA21_10: Total - Age
v_CA21_3: Population percentage change, 2016 to 2021
v_CA21_5748: Non-movers
v_CA21_5751: Movers
v_CA21_5754: Non-migrants
v_CA21_5757: Migrants
v_CA21_5760: Internal migrants
v_CA21_5769: External migrants
v_CA21_4875: Total visible minority population
v_CA21_4914: Not a visible minority
v_CA21_4401: Not Canadian citizens
v_CA21_4407: Non-immigrants
v_CA21_4410: Immigrants
v_CA21_4434: Non-permanent residents
v_CA21_4392: Canadian citizens
v_CA21_480: Never married
v_CA21_486: Divorced
v_CA21_489: Widowed
v_CA21_504: Common-law couples
v_CA21_501: Married couples
v_CA21_906: Median total income of household in 2020 ($)
v_CA21_6498: Employed
v_CA21_6501: Unemployed
v_CA21_5868: No certificate, diploma or degree
v_CA21_5889: College, CEGEP or other non-university certificate or diploma
v_CA21_5880: Apprenticeship or trades certificate or diploma
v_CA21_5814: With high school di

In [252]:
# Rename variables
census_2021 = census_2021.rename(columns=
  {'v_CA21_9: Total - Age': 'Total male population', 'v_CA21_10: Total - Age': 'Total female population', 'v_CA21_3: Population percentage change, 2016 to 2021': 'Population change (%)',
   'v_CA21_5748: Non-movers': 'Non-movers', 'v_CA21_5751: Movers': 'Movers', 'v_CA21_5754: Non-migrants': 'Non-migrants', 'v_CA21_5757: Migrants': 'Migrants',
   'v_CA21_5760: Internal migrants': 'Internal migrants', 'v_CA21_5769: External migrants': 'External migrants',
   'v_CA21_4914: Not a visible minority': 'Non-visible minority', 'v_CA21_4875: Total visible minority population': 'Visible minority',
   'v_CA21_4401: Not Canadian citizens': 'Non-Canadian citizens', 'v_CA21_4392: Canadian citizens': 'Canadian citizens',
   'v_CA21_4407: Non-immigrants': 'Non-immigrants', 'v_CA21_4410: Immigrants': 'Immigrants', 'v_CA21_4434: Non-permanent residents': 'Non-permanent residents',
   'v_CA21_480: Never married': 'Never married', 'v_CA21_486: Divorced': 'Divorced', 'v_CA21_489: Widowed': 'Widowed', 'v_CA21_504: Common-law couples': 'Common-law couples', 'v_CA21_501: Married couples': 'Married couples',
   'v_CA21_906: Median total income of household in 2020 ($)': 'Median household income ($)', 'v_CA21_6498: Employed': 'Employed', 'v_CA21_6501: Unemployed': 'Unemployed',
   'v_CA21_5868: No certificate, diploma or degree': 'No certificate or diploma', 'v_CA21_5814: With high school diploma or equivalency certificate': 'High school or Secondary degree',
   'v_CA21_5889: College, CEGEP or other non-university certificate or diploma': 'College or CEGEP degree', 'v_CA21_5880: Apprenticeship or trades certificate or diploma': 'Trades certificate, diploma or apprenticeship',
  #  "v_CA21_5895: Bachelor's degree or higher": "Bachelors degree or higher",
   'v_CA21_439: Apartment in a building that has fewer than five storeys': 'Apartment with fewer than five stories', 'v_CA21_440: Apartment in a building that has five or more storeys': 'Apartment with five or more storeys',
   'v_CA21_437: Row house': 'Row house', 'v_CA21_435: Single-detached house': 'Single-detached house', 'v_CA21_436: Semi-detached house': 'Semi-detached house', 'v_CA21_442: Movable dwelling': 'Movable dwelling', 'v_CA21_441: Other single-attached house': 'Other single-attached house',
   'v_CA21_4318: Average monthly shelter costs for rented dwellings ($) (59)': 'Average gross rent ($)', 'v_CA21_4312: Average value of dwellings ($) (60)': 'Average value dwelling ($)',
   'v_CA21_4274: Major repairs needed': 'Major repairs', 'v_CA21_4273: Only regular maintenance and minor repairs needed': 'Regular maintenance',
  #  'v_CA21_4238: Owner': 'Owned housing', 'v_CA21_4239: Renter': 'Rented housing',
   'v_CA21_4240: Dwelling provided by the local government, First Nation or Indian band': 'Band housing',
   'v_CA21_7644: Public transit': 'Public transit', 'v_CA21_7647: Walked': 'Walked', 'v_CA21_7650: Bicycle': 'Bicycle',
   'v_CA21_7653: Other method': 'Transportation other methods', 'v_CA21_7638: Car, truck or van - as a driver': 'Transportation vehicle driver', 'v_CA21_7641: Car, truck or van - as a passenger': 'Transportation vehicle non-driver',
   'v_CA21_2209: English': 'English only', 'v_CA21_2212: French': 'French only', 'v_CA21_3178: English and French': 'English and French',
   'v_CA21_3181: English and non-official language(s)': 'English and non-official language(s)', 'v_CA21_3184: French and non-official language(s)': 'French and non-official language(s)',
   'v_CA21_3187: English, French and non-official language(s)': 'English, French, and non-official language(s)',
   'v_CA21_4305: Total - Owner households in non-farm, non-reserve private dwellings': 'Owner households', 'v_CA21_4313: Total - Tenant households in non-farm, non-reserve private dwellings': 'Renter households'
   })

# Check column names and length
print("Column names:")
for col in census_2021.columns:
    print(col)

print(len(census_2021.columns))

Column names:
GeoUID
Area (sq km)
Population
Dwellings
Households
Total male population
Total female population
Population change (%)
Non-movers
Movers
Non-migrants
Migrants
Internal migrants
External migrants
Visible minority
Non-visible minority
Non-Canadian citizens
Non-immigrants
Immigrants
Non-permanent residents
Canadian citizens
Never married
Divorced
Widowed
Common-law couples
Married couples
Median household income ($)
Employed
Unemployed
No certificate or diploma
College or CEGEP degree
Trades certificate, diploma or apprenticeship
High school or Secondary degree
Apartment with fewer than five stories
Apartment with five or more storeys
Row house
Single-detached house
Semi-detached house
Movable dwelling
Other single-attached house
Average gross rent ($)
Average value dwelling ($)
Major repairs
Regular maintenance
Band housing
Public transit
Walked
Bicycle
Transportation vehicle driver
Transportation vehicle non-driver
Transportation other methods
English only
French only
Eng
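
By default `DataFrame.rename` ignores mapping keys that do not match any column, so a typo in a long variable label goes unnoticed. The sketch below lists such keys before renaming; the helper `missing_rename_keys` and the toy frame are illustrative assumptions.

```python
import pandas as pd

def missing_rename_keys(df: pd.DataFrame, mapping: dict) -> list:
    """Return the mapping keys that are not columns of df."""
    return [old for old in mapping if old not in df.columns]

# Toy frame with only a few of the 2021 labels (no rows needed for this check)
toy = pd.DataFrame(columns=["GeoUID", "v_CA21_5748: Non-movers", "v_CA21_5751: Movers"])
mapping = {
    "v_CA21_5748: Non-movers": "Non-movers",
    "v_CA21_5751: Movers": "Movers",
    "v_CA21_4238: Owner": "Owned housing",  # not present in the toy frame
}

missing = missing_rename_keys(toy, mapping)
print(missing)  # ['v_CA21_4238: Owner']

# The default rename skips the missing key without raising
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))
```

Passing `errors='raise'` to `rename` is an alternative that stops with a `KeyError` when a key is missing.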

In [253]:
# Add year suffix to every column (this also renames GeoUID to GeoUID_2021)
census_2021 = census_2021.add_suffix('_2021')

# Check column names
print("Column names:")
for col in census_2021.columns:
    print(col)

Column names:
GeoUID_2021
Area (sq km)_2021
Population_2021
Dwellings_2021
Households_2021
Total male population_2021
Total female population_2021
Population change (%)_2021
Non-movers_2021
Movers_2021
Non-migrants_2021
Migrants_2021
Internal migrants_2021
External migrants_2021
Visible minority_2021
Non-visible minority_2021
Non-Canadian citizens_2021
Non-immigrants_2021
Immigrants_2021
Non-permanent residents_2021
Canadian citizens_2021
Never married_2021
Divorced_2021
Widowed_2021
Common-law couples_2021
Married couples_2021
Median household income ($)_2021
Employed_2021
Unemployed_2021
No certificate or diploma_2021
College or CEGEP degree_2021
Trades certificate, diploma or apprenticeship_2021
High school or Secondary degree_2021
Apartment with fewer than five stories_2021
Apartment with five or more storeys_2021
Row house_2021
Single-detached house_2021
Semi-detached house_2021
Movable dwelling_2021
Other single-attached house_2021
Average gross rent ($)_2021
Average value dwelli
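
`add_suffix` renames every column, including the identifier. If a later step expects an unsuffixed key for joining years, one option is to suffix only the non-key columns. This is a sketch under that assumption; the helper `add_year_suffix` is hypothetical, and the notebook itself keeps `GeoUID_2021`.

```python
import pandas as pd

def add_year_suffix(df: pd.DataFrame, year: int, keep=("GeoUID",)) -> pd.DataFrame:
    """Append _<year> to every column name except those listed in keep."""
    return df.rename(columns={c: f"{c}_{year}" for c in df.columns if c not in keep})

# Toy data (hypothetical values)
toy = pd.DataFrame({"GeoUID": ["a"], "Population": [100], "Movers": [10]})
suffixed = add_year_suffix(toy, 2021)
print(list(suffixed.columns))
```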

In [254]:
# Check values
census_2021.iloc[:60, 0:20]

Unnamed: 0,GeoUID_2021,Area (sq km)_2021,Population_2021,Dwellings_2021,Households_2021,Total male population_2021,Total female population_2021,Population change (%)_2021,Non-movers_2021,Movers_2021,Non-migrants_2021,Migrants_2021,Internal migrants_2021,External migrants_2021,Visible minority_2021,Non-visible minority_2021,Non-Canadian citizens_2021,Non-immigrants_2021,Immigrants_2021,Non-permanent residents_2021
0,4620001.0,0.4643,2684,1463,1402,1350.0,1335.0,1.7,2205.0,340.0,275.0,65.0,55.0,10.0,595.0,1975.0,170.0,1995.0,520.0,55.0
1,4620002.0,0.3858,3780,2022,1944,1825.0,1955.0,7.5,3205.0,535.0,445.0,90.0,55.0,35.0,865.0,2925.0,305.0,3045.0,645.0,100.0
2,4620003.0,0.7401,6600,3174,3016,3210.0,3390.0,3.6,5770.0,735.0,595.0,140.0,115.0,30.0,1840.0,4800.0,570.0,4920.0,1530.0,190.0
3,4620004.0,0.447,3355,1744,1646,1690.0,1660.0,5.6,2670.0,470.0,390.0,80.0,50.0,30.0,710.0,2455.0,310.0,2445.0,640.0,75.0
4,4620005.0,0.5652,3175,1774,1636,1585.0,1590.0,3.8,2710.0,455.0,365.0,90.0,75.0,15.0,430.0,2750.0,230.0,2720.0,370.0,90.0
5,4620006.0,0.646,4611,2162,2045,2320.0,2295.0,3.2,3930.0,615.0,525.0,90.0,75.0,15.0,1645.0,2950.0,655.0,3200.0,1100.0,295.0
6,4620007.0,1.0901,5655,2997,2867,2770.0,2880.0,-0.5,5125.0,460.0,360.0,100.0,65.0,30.0,1400.0,4240.0,440.0,4315.0,1165.0,160.0
7,4620008.0,0.6213,3104,1605,1532,1585.0,1520.0,2.9,2675.0,355.0,300.0,60.0,60.0,0.0,535.0,2535.0,235.0,2590.0,340.0,140.0
8,4620009.0,4.2912,3554,1807,1702,1835.0,1725.0,2.4,2840.0,585.0,390.0,195.0,180.0,15.0,1080.0,2385.0,435.0,2570.0,670.0,225.0
9,4620010.0,0.9271,1494,781,742,775.0,720.0,1.0,1295.0,220.0,180.0,40.0,30.0,0.0,500.0,1040.0,165.0,1125.0,350.0,60.0


In [255]:
# Export file to drive
census_2021.to_csv('/content/drive/MyDrive/GentrificAItion/montreal_data_processing/data_cleaning/2_clean_data/2021_clean.csv', index=False)
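
The preview above shows `GeoUID_2021` stored as a float (`4620001.0`). If the identifiers carry decimal digits that matter for matching, reading the exported file back with the ID column as text keeps them intact. The sketch below uses a temporary file and made-up identifiers, so the path and values are assumptions and not the project's data.

```python
import os
import tempfile

import pandas as pd

# Toy data with hypothetical identifiers stored as text
toy = pd.DataFrame({
    "GeoUID_2021": ["4620001.00", "4620010.10"],
    "Population_2021": [2684, 1494],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "2021_clean.csv")
    toy.to_csv(path, index=False)

    # Default parsing infers a float column for the identifier
    as_float = pd.read_csv(path)
    # Forcing text keeps the identifier exactly as written
    as_text = pd.read_csv(path, dtype={"GeoUID_2021": str})

print(as_float["GeoUID_2021"].dtype)
print(as_text["GeoUID_2021"].tolist())
```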