# Toronto 2016 Census Choropleth - Reported Places of Birth of Immigrants

This notebook outlines the process I used to transform the data into a form suitable for an interactive Bokeh plot.

In [1]:
# To import our necessary libraries
import numpy as np
import pandas as pd
import geopandas as gpd
import json

The data we will be using comes from the 2016 Canadian census. It can be downloaded [here](https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/page_dl-tc.cfm?Lang=E) under the title "Census metropolitan areas (CMAs), tracted census agglomerations (CAs) and census tracts (CTs)".

I renamed the file to `CensusTract_2016.csv` for ease of use.

In [2]:
# Importing our data set, the Census Profile, 2016 Census
df_2016 = pd.read_csv("CensusData/CensusTract_2016.csv")



In [3]:
#To check
df_2016.head()

Unnamed: 0,CENSUS_YEAR,GEO_CODE (POR),GEO_LEVEL,GEO_NAME,GNR,GNR_LF,DATA_QUALITY_FLAG,ALT_GEO_CODE,DIM: Profile of Census Tracts (2247),Member ID: Profile of Census Tracts (2247),Notes: Profile of Census Tracts (2247),Dim: Sex (3): Member ID: [1]: Total - Sex,Dim: Sex (3): Member ID: [2]: Male,Dim: Sex (3): Member ID: [3]: Female
0,2016,1.0,1,St. John's,3.5,5.3,0,1,"Population, 2016",1,1.0,205955.0,...,...
1,2016,1.0,1,St. John's,3.5,5.3,0,1,"Population, 2011",2,2.0,196954.0,...,...
2,2016,1.0,1,St. John's,3.5,5.3,0,1,"Population percentage change, 2011 to 2016",3,,4.6,...,...
3,2016,1.0,1,St. John's,3.5,5.3,0,1,Total private dwellings,4,3.0,92353.0,...,...
4,2016,1.0,1,St. John's,3.5,5.3,0,1,Private dwellings occupied by usual residents,5,4.0,85015.0,...,...


This data set contains all of the 2016 Census Profile data at the census tract level. A census tract is a small geographic area within a metropolitan area, delineated to contain a population within a set range. A detailed explanation can be read [here](https://www150.statcan.gc.ca/n1/pub/92-195-x/2011001/geo/ct-sr/def-eng.htm). We are looking specifically for the census tracts that make up the neighbourhoods in Toronto. Fortunately, Mat Krepicz at the City of Toronto was able to help and supplied this information. The Excel file, `CTtoNHOOD-2006to2016.xlsx`, is included in the repository.

I created a new .csv file from `CTtoNHOOD-2006to2016.xlsx` for our purposes.

In [4]:
# A dataframe of neighbourhoods and the census tracts that they consist of.
df_neighbourhoodsCTs = pd.read_csv("CensusData/2016_Neighbourhood_CensusTracts.csv")

In [5]:
# To check
df_neighbourhoodsCTs.head()

Unnamed: 0,Census Tract,Neighbourhood,Neighbourhood #
0,5350330.0,Guildwood,140
1,5350331.01,Guildwood,140
2,5350331.03,Scarborough Village,139
3,5350331.04,Scarborough Village,139
4,5350332.0,Scarborough Village,139


In [6]:
# We can rename the columns for ease of use and consistency
df_neighbourhoodsCTs.columns = ["CensusTract", "Neighbourhood", "Neighbourhood#"]
df_neighbourhoodsCTs.head()

Unnamed: 0,CensusTract,Neighbourhood,Neighbourhood#
0,5350330.0,Guildwood,140
1,5350331.01,Guildwood,140
2,5350331.03,Scarborough Village,139
3,5350331.04,Scarborough Village,139
4,5350332.0,Scarborough Village,139


In [7]:
# Great! So now we need a list of the census tracts
wanted_CTS = df_neighbourhoodsCTs["CensusTract"].tolist()

# To check
print(wanted_CTS[0:10])
print(len(wanted_CTS))

[5350330.0, 5350331.01, 5350331.03, 5350331.04, 5350332.0, 5350353.03, 5350353.04, 5350355.02, 5350355.03, 5350355.04]
599


With these census tracts we can reduce our dataframe to just those for the Toronto neighbourhoods. Through previous exploration, we found the range of codes in `Member ID: Profile of Census Tracts (2247)` for the immigration data we want. 

In [None]:
# With our list of desired census tracts, we can now make our smaller dataframe
NeighbourhoodCT_df = df_2016.where((df_2016["GEO_CODE (POR)"].isin(wanted_CTS)) & \
                                   (df_2016["Member ID: Profile of Census Tracts (2247)"].isin(np.arange(1157, 1217)))).copy()
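
The filter above relies on `DataFrame.where`, which keeps every row and fills the non-matching ones with `NaN`; that is why the later cells count and drop missing values. As a minimal sketch on a made-up toy frame (the `MemberID` column name and all values are assumptions for illustration only), the same mask can be compared with plain boolean indexing:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the census frame; the values are invented for illustration.
toy = pd.DataFrame({
    "GEO_CODE (POR)": [1.0, 5350330.0, 5350330.0, 5350331.01],
    "MemberID": [1157, 1157, 5, 1160],
})
wanted = [5350330.0, 5350331.01]

# Same style of mask as in the cell above: wanted tract AND wanted member ID.
mask = toy["GEO_CODE (POR)"].isin(wanted) & toy["MemberID"].isin(np.arange(1157, 1217))

# .where keeps the original shape and fills non-matching rows with NaN.
kept_shape = toy.where(mask)

# Boolean indexing drops the non-matching rows directly.
filtered = toy[mask]

print(len(kept_shape), int(kept_shape["GEO_CODE (POR)"].isna().sum()))
print(len(filtered))
```

Either route ends with the same rows once the `NaN` rows are dropped; boolean indexing simply skips the intermediate step.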

In [None]:
# To give it a check
NeighbourhoodCT_df["DIM: Profile of Census Tracts (2247)"].unique()

In [None]:
# To keep only the columns we would like.
NeighbourhoodCT_df = NeighbourhoodCT_df[["GEO_CODE (POR)", "DIM: Profile of Census Tracts (2247)", "Member ID: Profile of Census Tracts (2247)", "Dim: Sex (3): Member ID: [1]: Total - Sex"]]

# To check
print(NeighbourhoodCT_df.info())
print(NeighbourhoodCT_df.shape)
NeighbourhoodCT_df["GEO_CODE (POR)"].isna().sum()

In [None]:
# To remove all of the N/As.
NeighbourhoodCT_df.dropna(subset=["GEO_CODE (POR)"], inplace=True)

# And to check
print(NeighbourhoodCT_df.isna().sum())
print(NeighbourhoodCT_df.info())

Hmmm? `Dim: Sex (3): Member ID: [1]: Total - Sex` should be a `float` or an `int` type. We need to investigate. But first we will clean up the titles, names, and descriptions to make working with the data more intuitive.

In [None]:
# To make life easier, we will rename the columns
NeighbourhoodCT_df.columns = ["CensusTract", "ReportedOrigin", "ReportedOriginCode", "Population"]

In [None]:
# What categories are we working with?
NeighbourhoodCT_df["ReportedOrigin"].unique()

In [None]:
# The total seems a bit wordy, we can make that easier
# (assigning the result back avoids relying on an in-place replace on a column selection)
NeighbourhoodCT_df["ReportedOrigin"] = NeighbourhoodCT_df["ReportedOrigin"].replace(
    "Total - Selected places of birth for the immigrant population in private households - 25% sample data",
    "Total_Immigrant_Population")

# To check
NeighbourhoodCT_df["ReportedOrigin"].unique()

The whitespace in the category names may cause trouble later, so we replace the spaces with underscores and remove the commas.

In [None]:
# To remove spaces and punctuation in category titles, for graphing reasons
# (vectorised string methods do the same job as looping over each category)
NeighbourhoodCT_df["ReportedOrigin"] = (
    NeighbourhoodCT_df["ReportedOrigin"]
    .str.replace(" ", "_", regex=False)
    .str.replace(",", "", regex=False)
)
    
# To check
NeighbourhoodCT_df["ReportedOrigin"].unique()

Excellent, now to check what was wrong with our `Population` column.

In [None]:
# Now we need to know what is wrong with `Population`
NeighbourhoodCT_df[NeighbourhoodCT_df["Population"].str.isnumeric() == False]

The "x" refers to suppressed data. The population was either too small to report, the response rate was not high enough, or there were technical issues. More information can be found [here](https://www12.statcan.gc.ca/census-recensement/2011/dp-pd/prof/help-aide/N3.cfm). We should see how many tracts are affected.

In [None]:
# To check affected census tracts
print(NeighbourhoodCT_df[NeighbourhoodCT_df["Population"].str.isnumeric() == False]["CensusTract"].unique())

It looks like only two census tracts in our dataframe are flagged with "Area and data suppression". Before removing them, it is worth checking which neighbourhoods they affect.

In [None]:
# To check the suppressed CTs
df_neighbourhoodsCTs[df_neighbourhoodsCTs["CensusTract"].isin([5350006.00, 5350205.00])]

In [None]:
# To check how many CTs are in each neighbourhood
df_neighbourhoodsCTs[df_neighbourhoodsCTs["Neighbourhood#"].isin([85, 18])]

After removing the census tracts subject to data suppression, the listed neighbourhoods will still have CTs with usable data, so we can feel more comfortable removing these census tracts.

In [None]:
# Make a list of what we want to remove
rmvCT = list(NeighbourhoodCT_df[NeighbourhoodCT_df["Population"].str.isnumeric() == False]["CensusTract"].unique())

# Removing the CTs that don't contain information
NeighbourhoodCT_df = NeighbourhoodCT_df.where(~NeighbourhoodCT_df["CensusTract"].isin(rmvCT)).dropna()

#Changing our data from object to int
NeighbourhoodCT_df["Population"] = NeighbourhoodCT_df["Population"].astype(int)

# To check
NeighbourhoodCT_df.info()
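
As an alternative to `str.isnumeric`, the suppression handling above could be sketched with `pd.to_numeric(..., errors="coerce")`, which turns markers such as "x" into `NaN`. The toy frame and its values below are assumptions for illustration, not the real census data:

```python
import pandas as pd

# Toy stand-in: one tract with real counts, one tract with suppressed "x" values.
toy = pd.DataFrame({
    "CensusTract": [5350001.0, 5350001.0, 5350006.0, 5350006.0],
    "Population": ["120", "35", "x", "x"],
})

# Non-numeric entries such as "x" become NaN instead of raising an error.
as_number = pd.to_numeric(toy["Population"], errors="coerce")

# Tracts that have at least one suppressed value.
suppressed_tracts = toy.loc[as_number.isna(), "CensusTract"].unique().tolist()

# Drop every row belonging to a suppressed tract, then store integer counts.
clean = toy[~toy["CensusTract"].isin(suppressed_tracts)].copy()
clean["Population"] = pd.to_numeric(clean["Population"]).astype(int)

print(suppressed_tracts)
print(clean)
```

This variant also copes with values that are numeric but not made up of digits only (for example a decimal string), which `str.isnumeric` would flag as non-numeric.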

Now that our census tract data is in order, we can attach the neighbourhood that each census tract belongs to.

In [None]:
# To combine our data
Merged_df_2016 = NeighbourhoodCT_df.merge(df_neighbourhoodsCTs, how = "left", on = "CensusTract").copy()

# To take a look
Merged_df_2016
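
A left merge silently leaves `NaN` wherever a census tract has no match in the lookup table. A minimal sketch of how that could be checked, using pandas' `indicator` and `validate` options on made-up frames (tract `5359999.0` is hypothetical, and `validate="many_to_one"` assumes each tract appears once in the lookup):

```python
import pandas as pd

# Toy per-tract counts; the last tract is invented and has no neighbourhood.
tract_counts = pd.DataFrame({
    "CensusTract": [5350330.0, 5350331.01, 5359999.0],
    "Population": [100, 250, 40],
})

# Toy lookup in the same layout as df_neighbourhoodsCTs.
lookup = pd.DataFrame({
    "CensusTract": [5350330.0, 5350331.01],
    "Neighbourhood": ["Guildwood", "Guildwood"],
    "Neighbourhood#": [140, 140],
})

# indicator=True adds a "_merge" column; validate raises if a tract is duplicated in the lookup.
merged = tract_counts.merge(lookup, how="left", on="CensusTract",
                            validate="many_to_one", indicator=True)

# Rows that found no neighbourhood in the lookup.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```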

In [None]:
# Removing the columns we don't need.
Merged_df_2016.drop(["CensusTract", "Neighbourhood", "ReportedOriginCode"], axis=1, inplace=True)

# To save (index=False avoids an extra "Unnamed: 0" column when the file is read back)
Merged_df_2016.to_csv("Merged_df_2016.csv", index=False)

We save `Merged_df_2016.csv` so the cleaned data can be reloaded without repeating the steps above, which matters when we deploy the plot as a Bokeh application.

In [None]:
# To check
Merged_df_2016.head()

The key to creating our choropleth is the `shapefile`. Fortunately the City of Toronto provides a shapefile for the neighbourhoods in their OpenData portal, which can be found [here](https://open.toronto.ca/dataset/neighbourhoods/).

In [None]:
# Now we can create a choropleth in bokeh!
shapefile = "CensusData/Neighbourhoods/Neighbourhoods.shp"

# Read shapefile using Geopandas
# From previous exploration we determined that the listed fields contain the data we want.
gdf_neighbourhoods = gpd.read_file(shapefile)[["FIELD_5", "FIELD_7", "geometry"]]

# Rename Columns
gdf_neighbourhoods.columns = ["Neighbourhood#", "Neighbourhood", "geometry"]
gdf_neighbourhoods.head()

Now we have everything we need to create our Bokeh choropleth! An important step in the following cell is converting our GeoDataFrame into a GeoJSON string, which Bokeh's `GeoJSONDataSource` requires.
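
To give a feel for the string involved, here is a minimal sketch that builds a one-feature GeoJSON `FeatureCollection` by hand with the standard `json` module. The square polygon and its property values are made up; in the notebook itself the string comes from the GeoDataFrame's `to_json()` method rather than from code like this:

```python
import json

# One invented square standing in for a neighbourhood polygon.
feature = {
    "type": "Feature",
    "properties": {"Neighbourhood": "Example", "TotalPopulation": 1234},
    "geometry": {
        "type": "Polygon",
        # A single closed ring: the first and last points are identical.
        "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]],
    },
}

# The whole collection serialised to a single JSON string.
geojson_string = json.dumps({"type": "FeatureCollection", "features": [feature]})

# Round-trip to confirm the string is valid JSON with the expected layout.
parsed = json.loads(geojson_string)
print(parsed["features"][0]["properties"])
```

Each feature's `properties` carry the per-neighbourhood values, which is where fields such as `TotalPopulation` used by the hover tool and colour mapper come from.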

Note that the `Select` widget shown here will not update the plot inside the notebook: its Python callback (`origindata`) only runs when the app is served by a Bokeh server.

In [None]:
from bokeh.io import show, output_notebook
from bokeh.plotting import figure, curdoc
from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar, HoverTool
from bokeh.palettes import brewer
from bokeh.models.widgets import Select
from bokeh.layouts import widgetbox, column

###

## To make sure all of our data is here

# read in our dataframe
Merged_df_2016 = pd.read_csv("Merged_df_2016.csv")

# Read shapefile using Geopandas
shapefile = "./CensusData/Neighbourhoods/Neighbourhoods.shp"

gdf_neighbourhoods = gpd.read_file(shapefile)[["FIELD_5", "FIELD_7", "geometry"]]

# Rename Columns
gdf_neighbourhoods.columns = ["Neighbourhood#", "Neighbourhood", "geometry"]

###

def update_data(selectedOrigin):
    """
    This function updates our data source to be only the data from our selected origin.
    """
    # Gather the data about the selected origin we want
    df = Merged_df_2016[Merged_df_2016["ReportedOrigin"] == selectedOrigin].copy()
    
    # Combine the population data from all the CTs into neighbourhoods
    df["TotalPopulation"] = df.groupby("Neighbourhood#")["Population"].transform("sum")
    
    # Grab the first instance in our dataframe for each neighbourhood (removing duplicate information)
    df = df.groupby("Neighbourhood#").first().reset_index()
    
    # Combining our shapefile data with our dataframe
    df = gdf_neighbourhoods.merge(df, how = "left", on = "Neighbourhood#")
    
    # Restructuring our dataframe
    df = df[["Neighbourhood", "geometry", "ReportedOrigin", "TotalPopulation"]]
    
    return df

def data_to_json(df):
    """
    We use this to convert our data to `.json` so that we can graph in bokeh.
    This step is separate so that we can access data from the dataframe before conversion.
    """
    # Convert dataframe to a json string
    tojson = df.to_json()

    return tojson

def origindata(attr, old, new):
    """
    This function updates our plot with our new origin selection
    """
    # To get the index of our value selected in the menu
    choice = menu.index(select.value)

    # To change the user facing category to the dataframe recognized category
    selectedOrigin = menulist[choice]

    # Changing our title to represent the data being shown
    p.title.text = f'Population of Immigrants who reported {select.value} as their origin by Neighbourhood'

    # Update our dataframe to our selected data
    df_to_use = update_data(selectedOrigin)

    # Change the tick labels on our colorbar
    color_map.high = df_to_use.iloc[:,3].max()
    color_map.low = df_to_use.iloc[:,3].min()

    # Convert our data to be used.
    new_data = data_to_json(df_to_use)
    geosource.geojson = new_data
    return

# Our initial starting point
selectedOrigin = "Total_Immigrant_Population"

#Getting our data into a starting position
df_to_use = update_data(selectedOrigin)
geosource = GeoJSONDataSource(geojson = data_to_json(df_to_use))

# Choosing our colorblind friendly colours
palette = brewer['YlGn'][7]

# Making the darker areas represent the higher population
palette = palette[::-1]

# Map our colours from the palette to our data. High and low set the ticks.
color_map = LinearColorMapper(palette = palette, high = df_to_use.iloc[:,3].max(), low = df_to_use.iloc[:,3].min())

# Add hover tool with defined information
hover = HoverTool(tooltips = [('Neighbourhood','@Neighbourhood'),('Population', '@TotalPopulation')])

# Create color bar.
color_bar = ColorBar(color_mapper=color_map, label_standoff= 8, width = 20, height = 500,
                     border_line_color=None, location = (0,0), orientation = 'vertical')

# Our plot!
p = figure(title = 'Population of Immigrants from Total Immigrant Population by Neighbourhood', plot_height = 600 , plot_width = 950, tools = [hover])

# For aesthetic reasons, getting rid of excess lines and borders
p.axis.visible = False
p.grid.visible = False
p.outline_line_color = None

#Add patch renderer to figure.
p.patches('xs','ys', source = geosource, fill_color = {'field' : 'TotalPopulation', 'transform' : color_map},
              line_color = 'black', line_width = 0.25, fill_alpha = 1)

# Put the colorbar on the left.
p.add_layout(color_bar, "left")


### Create our widget, and what is in it.

# Take a list of categories
menulist = list(Merged_df_2016["ReportedOrigin"].unique())

# Sort the list alphabetically, but put the total population as choice one.
menulist[1:] = sorted(menulist[1:])

# Repeat process above, but make the categories more user/reader friendly
menu = list(map(lambda x : x.replace("_", " "), menulist))
menu[1:] = sorted(menu[1:])
menu[31] = "Korea, South"
menu[51] = 'South Africa, Republic of'

# This is our Select widget.
select = Select(title="Reported Origin:", options=menu)
select.on_change('value', origindata)

###

#Display figure inline in Jupyter Notebook.
output_notebook()

layout = column(p, widgetbox(select))
curdoc().add_root(layout)

#Display figure.
show(layout)
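
The widget code above pairs `menu` and `menulist` by position and patches two labels by hard-coded index, which depends on both lists sorting into the same order. A minimal sketch of a possible alternative keeps an explicit label-to-key dictionary instead; the short `keys` list and the names `special_labels`, `to_label` and `label_to_key` are hypothetical, not part of the original notebook:

```python
# Hypothetical subset of the cleaned category keys from "ReportedOrigin".
keys = ["Total_Immigrant_Population", "Korea_South", "Jamaica", "South_Africa_Republic_of"]

# Display labels that cannot be derived by swapping underscores for spaces.
special_labels = {
    "Korea_South": "Korea, South",
    "South_Africa_Republic_of": "South Africa, Republic of",
}

def to_label(key):
    """Return the reader-friendly label for a dataframe category key."""
    return special_labels.get(key, key.replace("_", " "))

# Lookup from what the user sees to what the dataframe contains.
label_to_key = {to_label(k): k for k in keys}

# Total first, the remaining labels alphabetically.
total_label = to_label(keys[0])
menu = [total_label] + sorted(label for label in label_to_key if label != total_label)

print(menu)
print(label_to_key["Korea, South"])
```

With such a mapping, the callback could look up `label_to_key[select.value]` directly rather than going through a list index.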

