# 2021 Census of Population, Census Profile

The census profile presents information from the 2021 Census of Population for various levels of geography, including provinces and territories, census metropolitan areas, communities and census tracts.

## Reading the CSV file

This notebook reads a subset of the data from the 2021 Census CSV file, which is about 2.5 GB in size.

In [1]:
# import libraries
import pandas as pd
import sqlite3

In [2]:
# show all columns
pd.set_option('display.max_columns', None)

## Exploring a sample

The **Census Profile, 2021 Census of Population** is a large dataset. To identify the proper settings for importing the full dataset, a sample is explored first. Some basic `read_csv` configuration was identified beforehand to make the import work.
Exploring a sample also helps to capture the `dtypes` and build a proper variable dictionary.

In [3]:
# reading the first n rows of the gzip-compressed csv
n = 2631*3
df = pd.read_csv("c2021_dataset.csv.gz", nrows=n, sep=',',
                 encoding='utf-8', encoding_errors='replace'
                )

In [4]:
df.head()

Unnamed: 0,CENSUS_YEAR,DGUID,ALT_GEO_CODE,GEO_LEVEL,GEO_NAME,TNR_SF,TNR_LF,DATA_QUALITY_FLAG,CHARACTERISTIC_ID,CHARACTERISTIC_NAME,CHARACTERISTIC_NOTE,C1_COUNT_TOTAL,SYMBOL,C2_COUNT_MEN+,SYMBOL.1,C3_COUNT_WOMEN+,SYMBOL.2,C10_RATE_TOTAL,SYMBOL.3,C11_RATE_MEN+,SYMBOL.4,C12_RATE_WOMEN+,SYMBOL.5
0,2021,2021S0503932,932.0,Census metropolitan area,Abbotsford - Mission,2.7,3.7,0,1,"Population, 2021",1.0,195726.0,,,...,,...,,...,,...,,...
1,2021,2021S0503932,932.0,Census metropolitan area,Abbotsford - Mission,2.7,3.7,0,2,"Population, 2016",1.0,180518.0,,,...,,...,,...,,...,,...
2,2021,2021S0503932,932.0,Census metropolitan area,Abbotsford - Mission,2.7,3.7,0,3,"Population percentage change, 2016 to 2021",,8.4,,,...,,...,8.4,,,...,,...
3,2021,2021S0503932,932.0,Census metropolitan area,Abbotsford - Mission,2.7,3.7,0,4,Total private dwellings,2.0,70648.0,,,...,,...,,...,,...,,...
4,2021,2021S0503932,932.0,Census metropolitan area,Abbotsford - Mission,2.7,3.7,0,5,Private dwellings occupied by usual residents,3.0,67613.0,,,...,,...,,...,,...,,...


In [5]:
df.shape

(7893, 23)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7893 entries, 0 to 7892
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   CENSUS_YEAR          7893 non-null   int64  
 1   DGUID                7893 non-null   object 
 2   ALT_GEO_CODE         7893 non-null   float64
 3   GEO_LEVEL            7893 non-null   object 
 4   GEO_NAME             7893 non-null   object 
 5   TNR_SF               7893 non-null   float64
 6   TNR_LF               7893 non-null   float64
 7   DATA_QUALITY_FLAG    7893 non-null   int64  
 8   CHARACTERISTIC_ID    7893 non-null   int64  
 9   CHARACTERISTIC_NAME  7893 non-null   object 
 10  CHARACTERISTIC_NOTE  837 non-null    float64
 11  C1_COUNT_TOTAL       7884 non-null   float64
 12  SYMBOL               9 non-null      object 
 13  C2_COUNT_MEN+        7199 non-null   float64
 14  SYMBOL.1             694 non-null    object 
 15  C3_COUNT_WOMEN+      7196 non-null   f

In [7]:
# dictionary with data types from the csv
column_types=df.dtypes.to_dict()
column_types

{'CENSUS_YEAR': dtype('int64'),
 'DGUID': dtype('O'),
 'ALT_GEO_CODE': dtype('float64'),
 'GEO_LEVEL': dtype('O'),
 'GEO_NAME': dtype('O'),
 'TNR_SF': dtype('float64'),
 'TNR_LF': dtype('float64'),
 'DATA_QUALITY_FLAG': dtype('int64'),
 'CHARACTERISTIC_ID': dtype('int64'),
 'CHARACTERISTIC_NAME': dtype('O'),
 'CHARACTERISTIC_NOTE': dtype('float64'),
 'C1_COUNT_TOTAL': dtype('float64'),
 'SYMBOL': dtype('O'),
 'C2_COUNT_MEN+': dtype('float64'),
 'SYMBOL.1': dtype('O'),
 'C3_COUNT_WOMEN+': dtype('float64'),
 'SYMBOL.2': dtype('O'),
 'C10_RATE_TOTAL': dtype('float64'),
 'SYMBOL.3': dtype('O'),
 'C11_RATE_MEN+': dtype('float64'),
 'SYMBOL.4': dtype('O'),
 'C12_RATE_WOMEN+': dtype('float64'),
 'SYMBOL.5': dtype('O')}

## Defining the variables dictionary

To build a proper variable dictionary, information was pulled from the *metadata file* included in the downloaded zip file. Definitions of the geographic variables were gathered from the [Full Table Download (CSV) User Guide](https://www.statcan.gc.ca/en/developers/csv/user-guide). In other words, a complete variable dictionary does not exist as such; the information is available, but scattered across sources.

Variable            | Description | Dtype
--------------------|-------------|----------
CENSUS_YEAR         | year of census | int64  
DGUID               | Dissemination Geography Unique Identifier - DGUID. <br>It is an alphanumeric code, composed of four components.<br>It varies from 10 to 20 characters in length. <br>The first 9 characters are fixed in composition and length.<br>Vintage (4) + Type (1) + Schema (4) + Geographic Unique Identifier (2-11) :<br>VVVV T SSSS  GGGGGGGGGGG <br> Further information at the [Dictionary, Census of Population, 2021: Dissemination Geography Unique Identifier (DGUID)](https://www12.statcan.gc.ca/census-recensement/2021/ref/dict/az/definition-eng.cfm?ID=geo055) | object
ALT_GEO_CODE        | is an alternate geographic code that is usually equal to the ending digits of the DGUID.<br> This code is often the geographic code found in previous census cycle products | object  
GEO_LEVEL           | this is the level of the geography, e.g. Census Subdivision | object 
GEO_NAME            | the name of the geographic area | object 
TNR_SF              | this value would be the total non-response rate to the short-form questionnaire | float64
TNR_LF              | this value would be the total non-response rate to the long-form questionnaire (*) | float64
DATA_QUALITY_FLAG   | a 5-character code that describes data quality | object
CHARACTERISTIC_ID   | identifier of each one of the 2631 characteristics,<br> e.g. *Total - Age groups of the population - 100% data*, *Total private dwellings*, etc. | int64  
CHARACTERISTIC_NAME | is a descriptive name for each characteristic associated with an identifier number. <br>Characteristics are organized by topic and subtopic. <br>More info at [Characteristics by topic and subtopic](https://www12.statcan.gc.ca/census-recensement/2021/dp-pd/prof/about-apropos/metadata-metadonnees-eng.cfm) | object
CHARACTERISTIC_NOTE | is a reference to the footnotes listed in the Census Profile metadata linked above | int64
C1_COUNT_TOTAL      | is the count of total population | float64
SYMBOL              | field holding any quality symbols associated with `C1_COUNT_TOTAL` | object
C2_COUNT_MEN+       | includes men and boys, as well as some non-binary persons | float64
SYMBOL              | field holding any quality symbols associated with `C2_COUNT_MEN+` (read by pandas as `SYMBOL.1`) | object
C3_COUNT_WOMEN+     | includes women and girls, as well as some non-binary persons | float64
SYMBOL              | field holding any quality symbols associated with `C3_COUNT_WOMEN+` (read by pandas as `SYMBOL.2`) | object
C10_RATE_TOTAL      | corresponding rate for `C1_COUNT_TOTAL` | float64
SYMBOL              | field holding any quality symbols associated with `C10_RATE_TOTAL` (read by pandas as `SYMBOL.3`) | object
C11_RATE_MEN+       | corresponding rate for `C2_COUNT_MEN+` | float64
SYMBOL              | field holding any quality symbols associated with `C11_RATE_MEN+` (read by pandas as `SYMBOL.4`) | object
C12_RATE_WOMEN+     | corresponding rate for `C3_COUNT_WOMEN+` | float64
SYMBOL              | field holding any quality symbols associated with `C12_RATE_WOMEN+` (read by pandas as `SYMBOL.5`) | object
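
The `DGUID` layout described in the table (vintage, type, schema, then the geographic identifier) can be split with plain string slicing. The following is only a minimal sketch: the helper name `split_dguid` and the dictionary keys are hypothetical, and the slice positions assume the fixed 9-character prefix stated in the table.

```python
def split_dguid(dguid: str) -> dict:
    """Split a DGUID into its four components (hypothetical helper)."""
    return {
        "vintage": dguid[:4],   # VVVV
        "type": dguid[4],       # T
        "schema": dguid[5:9],   # SSSS
        "geo_id": dguid[9:],    # geographic unique identifier, 2 to 11 characters
    }


# DGUID values taken from the sample shown in this notebook
tract = split_dguid("2021S05079320002.00")
cma = split_dguid("2021S0503932")
print(tract)
print(cma)
```

Under these assumptions, the schema component is what distinguishes the census metropolitan area row (`0503`) from the census tract rows (`0507`) in the sample.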

## Improving quality of import

### Re-defining the data types

In [8]:
# Column types according to the dictionary
column_types = {
    'CENSUS_YEAR': 'int64',
    'DGUID': 'O',
    'ALT_GEO_CODE': 'O', # this dtype was 'float64'
    'GEO_LEVEL': 'O',
    'GEO_NAME': 'O',
    'TNR_SF': 'float64',
    'TNR_LF': 'float64',
    'DATA_QUALITY_FLAG': 'O', # this dtype was 'int64'
    'CHARACTERISTIC_ID': 'int64',
    'CHARACTERISTIC_NAME': 'O',
    'CHARACTERISTIC_NOTE': 'O', # this dtype was 'float64'
    'C1_COUNT_TOTAL': 'float64',
    'SYMBOL': 'O',
    'C2_COUNT_MEN+': 'float64',
    'SYMBOL.1': 'O',
    'C3_COUNT_WOMEN+': 'float64',
    'SYMBOL.2': 'O',
    'C10_RATE_TOTAL': 'float64',
    'SYMBOL.3': 'O',
    'C11_RATE_MEN+': 'float64',
    'SYMBOL.4': 'O',
    'C12_RATE_WOMEN+': 'float64',
    'SYMBOL.5': 'O'
}

### Finding char errors

In [9]:
# get all unique Characteristic Name
unique_values=df['CHARACTERISTIC_NAME'].unique().tolist()

In [10]:
# print all in one go (not affected by pandas row truncation)
print("\n".join(map(str, unique_values)))

Population, 2021
Population, 2016
Population percentage change, 2016 to 2021
Total private dwellings
Private dwellings occupied by usual residents
Population density per square kilometre
Land area in square kilometres
Total - Age groups of the population - 100% data
  0 to 14 years
    0 to 4 years
    5 to 9 years
    10 to 14 years
  15 to 64 years
    15 to 19 years
    20 to 24 years
    25 to 29 years
    30 to 34 years
    35 to 39 years
    40 to 44 years
    45 to 49 years
    50 to 54 years
    55 to 59 years
    60 to 64 years
  65 years and over
    65 to 69 years
    70 to 74 years
    75 to 79 years
    80 to 84 years
    85 years and over
      85 to 89 years
      90 to 94 years
      95 to 99 years
      100 years and over
Total - Distribution (%) of the population by broad age groups - 100% data
Average age of the population
Median age of the population
Total - Occupied private dwellings by structural type of dwelling - 100% data
  Single-detached house
  Semi-detached
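
Because the file is read with `encoding_errors='replace'`, any byte that could not be decoded shows up as U+FFFD in the text. Instead of scanning the printed list by eye, the affected names can be isolated directly. This is a sketch on a toy series standing in for `df['CHARACTERISTIC_NAME']`; the sample values are illustrative only.

```python
import pandas as pd

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, inserted by encoding_errors='replace'

# toy stand-in for df['CHARACTERISTIC_NAME']; values are illustrative only
names = pd.Series([
    "Population, 2021",
    "Eligible children\ufffdwho have been instructed",
    "Total private dwellings",
])

# keep only the names that contain at least one replacement character
damaged = names[names.str.contains(REPLACEMENT_CHAR, regex=False)]
print(damaged.tolist())
```

On the real sample, the same mask could be applied to the list of unique names to see how many characteristics are affected.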

### Testing new read_csv config

In [11]:
# reading the first n rows of the gzip-compressed csv
n = 2631*3 # three geographic units (DGUID), each with its 2631 characteristics
cols=list(range(0,4))+list(range(7,17)) # exclude geo name, total non-response rates and rate variables

df = pd.read_csv("c2021_dataset.csv.gz", nrows=n, sep=',',
                 encoding='utf-8', encoding_errors='replace',
                 dtype= column_types,
                 usecols=cols
                )

In [12]:
# filling NaN with empty strings
df=df.fillna('')

In [13]:
df.tail()

Unnamed: 0,CENSUS_YEAR,DGUID,ALT_GEO_CODE,GEO_LEVEL,DATA_QUALITY_FLAG,CHARACTERISTIC_ID,CHARACTERISTIC_NAME,CHARACTERISTIC_NOTE,C1_COUNT_TOTAL,SYMBOL,C2_COUNT_MEN+,SYMBOL.1,C3_COUNT_WOMEN+,SYMBOL.2
7888,2021,2021S05079320002.00,9320002.0,Census tract,0,2627,Total - Eligibility and instruction in the min...,204.0,165.0,,85.0,,80.0,
7889,2021,2021S05079320002.00,9320002.0,Census tract,0,2628,Children eligible for instruction in the min...,,10.0,,5.0,,0.0,
7890,2021,2021S05079320002.00,9320002.0,Census tract,0,2629,Eligible children�who have been instructed...,,5.0,,0.0,,0.0,
7891,2021,2021S05079320002.00,9320002.0,Census tract,0,2630,Eligible children�who have not been instru...,,5.0,,5.0,,5.0,
7892,2021,2021S05079320002.00,9320002.0,Census tract,0,2631,Children not eligible for instruction in the...,,155.0,,80.0,,75.0,


### Handling `CENSUS_YEAR` constant variable

`CENSUS_YEAR` is a constant variable, so there is no reason to keep it as a repeated value in the dataframe: it consumes resources unnecessarily and adds nothing to further analysis. For that reason, the value will be stored as an attribute rather than as a variable.

In [14]:
# storing Census Year as an attribute
df.attrs["CENSUS_YEAR"] = int(df["CENSUS_YEAR"].unique()[0])

In [15]:
df.attrs

{'CENSUS_YEAR': 2021}

In [16]:
# dropping Census Year column
df=df.drop(columns=["CENSUS_YEAR"])

In [17]:
df.tail()

Unnamed: 0,DGUID,ALT_GEO_CODE,GEO_LEVEL,DATA_QUALITY_FLAG,CHARACTERISTIC_ID,CHARACTERISTIC_NAME,CHARACTERISTIC_NOTE,C1_COUNT_TOTAL,SYMBOL,C2_COUNT_MEN+,SYMBOL.1,C3_COUNT_WOMEN+,SYMBOL.2
7888,2021S05079320002.00,9320002.0,Census tract,0,2627,Total - Eligibility and instruction in the min...,204.0,165.0,,85.0,,80.0,
7889,2021S05079320002.00,9320002.0,Census tract,0,2628,Children eligible for instruction in the min...,,10.0,,5.0,,0.0,
7890,2021S05079320002.00,9320002.0,Census tract,0,2629,Eligible children�who have been instructed...,,5.0,,0.0,,0.0,
7891,2021S05079320002.00,9320002.0,Census tract,0,2630,Eligible children�who have not been instru...,,5.0,,5.0,,5.0,
7892,2021S05079320002.00,9320002.0,Census tract,0,2631,Children not eligible for instruction in the...,,155.0,,80.0,,75.0,


## Filtering Census by city, geographical level and variables of interest

### By city and geographical level

The city of Toronto is the area of interest for this project. The complementary csv file *Geo_starting_now* has a list of all `DGUID`, listed as `Geo Code` in the dataset, along with their corresponding `Geo Names` and the starting line number of each geographical unit. The `Line Number` variable can be useful for improving the `read_csv` configuration by defining the start and end lines for import and enabling early-stage filtering.

Geo Code | Geo Name | Line Number
---------|----------|------------
2021S0503535 | Toronto | 10816043
2021S05075350001.00 | 5350001 | 10818674
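
The idea of using `Line Number` to limit the import can be sketched with the `skiprows` and `nrows` parameters of `read_csv`. The helper `read_block` is hypothetical, and the sketch assumes that line numbers are 1-based with the header on line 1, which the source files do not confirm. A later cell in this notebook notes that this kind of early-stage filtering seemed inefficient on the full file, so this is an illustration of the mechanism rather than a recommendation.

```python
import io

import pandas as pd


def read_block(source, start_line: int, n_rows: int) -> pd.DataFrame:
    """Read n_rows data rows starting at a 1-based file line (hypothetical helper).

    Assumes line 1 of the file is the header.
    """
    return pd.read_csv(
        source,
        header=0,
        skiprows=range(1, start_line - 1),  # skip the data lines before start_line
        nrows=n_rows,
    )


# toy file: header on line 1, rows 'r0'..'r9' on lines 2..11
toy = "id,value\n" + "".join(f"r{i},{i}\n" for i in range(10))
block = read_block(io.StringIO(toy), start_line=5, n_rows=3)
print(block)
```

For a single geographic unit, `n_rows` would be the number of characteristics per unit (2631 in this dataset).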

The dataset includes information at multiple geographical levels. This project will focus exclusively on the `Census tract` level. The `Geo Code` structure varies depending on the geographical level. According to the [Dictionary, Census of Population, 2021: Census Tract (CT)](https://www12.statcan.gc.ca/census-recensement/2021/ref/dict/az/Definition-eng.cfm?ID=geo013), the coding structure for census tracts follows a specific naming convention:

>Each CT is assigned a seven‑character numeric "name" (including leading zeros, decimal point and trailing zeros). To uniquely identify each CT in its corresponding CMA or tracted CA, the three‑digit CMA or CA code must precede the CT name. For example:

CMA/CA code and CT name	| CMA/CA name
------------------------|------------
562 0005.00	| Sarnia CA (Ont.)
933 0005.00	| Vancouver CMA (B.C.)

To filter Toronto's census tracts, it is necessary to retain only those rows with Toronto's CMA code, **535**, which appears in the `DGUID` right after the Vintage + Type + Schema sequence, in this case **2021S0507**.

In [18]:
# filtering by DGUID
df[df["DGUID"].str.startswith("2021S0507535")] 

Unnamed: 0,DGUID,ALT_GEO_CODE,GEO_LEVEL,DATA_QUALITY_FLAG,CHARACTERISTIC_ID,CHARACTERISTIC_NAME,CHARACTERISTIC_NOTE,C1_COUNT_TOTAL,SYMBOL,C2_COUNT_MEN+,SYMBOL.1,C3_COUNT_WOMEN+,SYMBOL.2


The filter returns an empty set because the sample dataset has rows only starting with **2021S0503932** and **2021S0507932**.

### By variables of interest

The variables of interest belong to the topic *Ethnocultural and religious diversity*; within this topic, the subtopic of interest is *Ethnic or cultural origin*. Additionally, from the topic *Immigration, place of birth, and citizenship*, the subtopic *Selected places of birth for the immigrant population* will also be included among the selected variables.

The goal of the EDA is to identify which *characteristics* will be selected for further analysis. For now, all characteristics within the listed subtopics will be analyzed.

A quick search in the *General Information* txt file shows that the characteristics of interest and their ID numbers are:

Characteristic | ID range
---------------|----------
Place of birth for the immigrant population in private households | 1544-1603
Place of birth for the recent immigrant population in private households | 1604-1664
Ethnic or cultural origin for the population in private households | 1698-1948

In [19]:
# filtering the characteristics by subtopic
#  group 1: Place of birth for the immigrant population in private households
df1=df[df["CHARACTERISTIC_ID"].between(1544,1603)]

In [20]:
# filtering the characteristics by subtopic
#  group 2: Place of birth for the recent immigrant population in private households
df2=df[df["CHARACTERISTIC_ID"].between(1604,1664)]

In [21]:
# filtering the characteristics by subtopic
#  group 3: Ethnic or cultural origin for the population in private households
df3=df[df["CHARACTERISTIC_ID"].between(1698,1948)]
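
The three filters above share the same pattern, so they can also be driven from a single mapping of ID ranges. This is a sketch on a toy frame; the group labels are hypothetical names chosen for illustration, while the ranges come from the table of characteristics.

```python
import pandas as pd

# ID ranges taken from the table of characteristics; labels are hypothetical
ID_RANGES = {
    "birthplace_immigrants": (1544, 1603),
    "birthplace_recent_immigrants": (1604, 1664),
    "ethnic_cultural_origin": (1698, 1948),
}

# toy frame standing in for the census sample
toy = pd.DataFrame({"CHARACTERISTIC_ID": [1, 1544, 1603, 1604, 1700, 1949]})

# one filtered frame per subtopic; between() is inclusive on both ends
groups = {
    label: toy[toy["CHARACTERISTIC_ID"].between(lo, hi)]
    for label, (lo, hi) in ID_RANGES.items()
}
print({label: len(group) for label, group in groups.items()})
```

Keeping the ranges in one place makes it easier to adjust them if the selection of characteristics changes after the EDA.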

## Reading the full dataset

In [None]:
%%time
# reading the full gzip-compressed csv
cols=list(range(0,4))+list(range(7,17)) # filter geo name, total non-response rate and rate variables

df = pd.read_csv("c2021_dataset.csv.gz", sep=',',
                 encoding='utf-8', 
                 encoding_errors='replace', # it uses 'U+FFFD', the official REPLACEMENT CHARACTER 
                 dtype= column_types,# dtypes are stated explicitly to avoid mixed-type inference
                 usecols=cols,
                 #header=0, skiprows= list(range(1,(10816043-1))) # early-stage filtering by line number seems inefficient
                 low_memory=False # read the whole file at once instead of in internal chunks (uses more memory)
                )

In [None]:
# storing Census Year as an attribute
df.attrs["CENSUS_YEAR"] = int(df["CENSUS_YEAR"].unique()[0])

In [None]:
# dropping Census Year column
df=df.drop(columns=["CENSUS_YEAR"])

In [None]:
# filtering by DGUID
df=df[df["DGUID"].str.startswith("2021S0507535")] 
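
An alternative to loading the whole file and filtering afterwards is to read it in chunks and keep only the matching rows of each chunk, which limits peak memory. The sketch below is an assumption-laden illustration, not the approach used in this notebook: `read_filtered` is a hypothetical helper, the toy data is made up, and the real call would also need the `encoding`, `dtype` and `usecols` arguments used above. It has not been timed against the full dataset.

```python
import io

import pandas as pd


def read_filtered(source, prefix: str, chunksize: int = 100_000) -> pd.DataFrame:
    """Read a CSV in chunks, keeping rows whose DGUID starts with prefix."""
    kept = []
    for chunk in pd.read_csv(source, dtype={"DGUID": "O"}, chunksize=chunksize):
        kept.append(chunk[chunk["DGUID"].str.startswith(prefix)])
    return pd.concat(kept, ignore_index=True)


# toy file with made-up rows for two CMAs
toy = (
    "DGUID,C1_COUNT_TOTAL\n"
    "2021S0503932,10\n"
    "2021S05075350001.00,20\n"
    "2021S05079320002.00,30\n"
    "2021S05075350002.00,40\n"
)
toronto = read_filtered(io.StringIO(toy), "2021S0507535", chunksize=2)
print(toronto)
```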

In [None]:
# storing Geo Level as an attribute
df.attrs["GEO_LEVEL"] = str(df["GEO_LEVEL"].unique()[0])

In [None]:
# dropping Geo Level column
df=df.drop(columns=["GEO_LEVEL"])

In [None]:
df.attrs

In [None]:
# filling NaN with empty strings for object-dtype columns
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].fillna('')

In [None]:
# filling NaN with 0.0 for float64-dtype columns
float_cols = df.select_dtypes(include='float64').columns
df[float_cols] = df[float_cols].fillna(0.0)

In [None]:
df.dtypes

In [None]:
df

In [None]:
# export the dataset
df.to_csv('df.csv', index=False)
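
One thing to keep in mind with this export is that `to_csv` writes only the table, so the values stored in `df.attrs` (`CENSUS_YEAR` and `GEO_LEVEL`) are not part of the CSV. A possible workaround, sketched here on a toy frame, is to save them in a small JSON file next to the CSV. The sidecar file name `df_attrs.json` is hypothetical.

```python
import json
import os
import tempfile

import pandas as pd

# toy frame standing in for the filtered census data
toy = pd.DataFrame({"DGUID": ["2021S05075350001.00"], "C1_COUNT_TOTAL": [1.0]})
toy.attrs["CENSUS_YEAR"] = 2021
toy.attrs["GEO_LEVEL"] = "Census tract"

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "df.csv")
attrs_path = os.path.join(out_dir, "df_attrs.json")  # hypothetical sidecar file

# write the table and its attributes side by side
toy.to_csv(csv_path, index=False)
with open(attrs_path, "w", encoding="utf-8") as fh:
    json.dump(toy.attrs, fh)

# reload: the CSV alone comes back with empty attrs, so restore them
restored = pd.read_csv(csv_path, dtype={"DGUID": "O"})
with open(attrs_path, encoding="utf-8") as fh:
    restored.attrs.update(json.load(fh))
print(restored.attrs)
```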

Because importing the data turned out to be a lengthy process, the EDA will be performed in a separate notebook.

## Resources

- [Census Profile, 2021 Census of Population](https://www12.statcan.gc.ca/census-recensement/2021/dp-pd/prof/index.cfm?Lang=E)
- [About the Census Profile, 2021 Census of Population](https://www12.statcan.gc.ca/census-recensement/2021/dp-pd/prof/about-apropos/about-apropos.cfm?Lang=E#aa1)
- [Guide to the Census of Population, 2021](https://www12.statcan.gc.ca/census-recensement/2021/ref/98-304/index-eng.cfm), provides an overview of the Census of Population content determination, collection, processing, data quality assessment and data dissemination. It may be useful to both new and experienced users who wish to familiarize themselves with and find specific information about the 2021 Census.
- [Filling the gaps: Information on gender in the 2021 Census](https://www12.statcan.gc.ca/census-recensement/2021/ref/98-20-0001/982000012021001-eng.cfm), defines gender, sex at birth, and relevant concepts as the Census 2021 disseminates census information on gender
- [Census Profile metadata](https://www12.statcan.gc.ca/census-recensement/2021/dp-pd/prof/about-apropos/metadata-metadonnees-eng.cfm), lists characteristics by topic and subtopic, and lists all footnotes
- [Full Table Download (CSV) User Guide](https://www.statcan.gc.ca/en/developers/csv/user-guide), provides users with a guide to the full table downloadable output files available from the Statistics Canada website
- [Dictionary, Census of Population 2021, PDF version](https://www12.statcan.gc.ca/census-recensement/2021/ref/dict/98-301-x2021001-eng.pdf), is a reference document which contains detailed definitions of Census of Population concepts, variables and geographic terms, as well as historical information. The PDF version organizes the concepts by topics, which is not the case for the [web version](https://www12.statcan.gc.ca/census-recensement/2021/ref/dict/index-eng.cfm).

## Reference

Statistics Canada. 2023. Census Profile. 2021 Census of Population. Statistics Canada Catalogue number 98-316-X2021001. Ottawa. Released November 15, 2023.
*https://www12.statcan.gc.ca/census-recensement/2021/dp-pd/prof/index.cfm?Lang=E (accessed August 4, 2025).*