# CANSIM API - Get MetaData and Download CANSIM Tables
A Jupyter Notebook that explores three functions from Statistics Canada's Web Data Service (WDS), namely `getAllCubesList`, `getCodeSets`, and `getCubeMetadata`, and demonstrates how to download full-table CSV files. For more information about the WDS, visit: [https://www.statcan.gc.ca/eng/developers/wds/user-guide](https://www.statcan.gc.ca/eng/developers/wds/user-guide)

In [2]:
# Import packages
import urllib3
import urllib.request  # urlretrieve lives in the request submodule
import json
import zipfile
import pandas as pd
import numpy as np
import os
from IPython.display import display

# Instantiate poolmanager
http = urllib3.PoolManager()

## WDS - getAllCubesList
This function returns a list of JSON objects, one per table (cube). Notice that `dimensions` holds a nested list of JSON objects.

In [2]:
# getAllCubesList
r = http.request('GET', "https://www150.statcan.gc.ca/t1/wds/rest/getAllCubesList")

# Returns a list of JSON metadata
response = json.loads(r.data.decode('utf-8'))

# Print length
print(len(response))

# Check to see the contents
print(response[0])



5849
{'productId': 10100001, 'cansimId': '183-0021', 'cubeTitleEn': 'Federal public sector employment reconciliation of Treasury Board of Canada Secretariat, Public Service Commission of Canada and Statistics Canada statistical universes, as at December 31', 'cubeTitleFr': 'Emploi du secteur public fédéral rapprochement des univers statistiques du Secrétariat du Conseil du Trésor du Canada, de la Commission de la fonction publique du Canada et de Statistique Canada, au 31 décembre', 'cubeStartDate': '1999-01-01', 'cubeEndDate': '2011-01-01', 'releaseTime': '2012-08-01T12:30', 'archived': '1', 'subjectCode': ['10'], 'surveyCode': ['1713'], 'frequencyCode': 12, 'corrections': [], 'dimensions': [{'dimensionNameEn': 'Geography', 'dimensionNameFr': 'Géographie', 'dimensionPositionId': 1, 'hasUOM': False}, {'dimensionNameEn': 'Federal public sector employment', 'dimensionNameFr': "Groupes d'emplois du secteur public fédéral", 'dimensionPositionId': 2, 'hasUOM': True}]}
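
The request above assumes the call succeeds. As a minimal sketch, a guard could check the HTTP status before decoding. The helper name `parse_json_body` is hypothetical (it is not part of urllib3 or the WDS); with urllib3 it would be called as `parse_json_body(r.status, r.data)`. The body used here is hand-written so the example runs offline.

```python
import json


def parse_json_body(status: int, body: bytes):
    """Decode a JSON response body, failing loudly on a non-200 status.

    Hypothetical helper for illustration; `status` and `body` correspond to
    the `status` and `data` attributes of a urllib3 response.
    """
    if status != 200:
        raise RuntimeError(f"Request failed with HTTP status {status}")
    return json.loads(body.decode("utf-8"))


# Offline demonstration with a hand-written body shaped like one cube record
sample_body = b'[{"productId": 10100001, "archived": "1"}]'
cubes = parse_json_body(200, sample_body)
print(len(cubes), cubes[0]["productId"])
```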


### Converting to Pandas Data Frame
It is difficult to read the metadata as JSON, especially when the output doesn't print each attribute on its own line. Let's transform the JSON into a Pandas data frame, which puts the data into a legible tabular format.

First, extract the dimensions and the corrections into their own data frames, keeping the `productId` column in each. Then, convert the rest of the JSON to a data frame. Since all three data frames have a `productId` column, we can merge them into one. Do this for every CANSIM table and then concatenate the results to get one large data frame. The original `dimensions` and `corrections` columns can then be dropped, as we've already extracted them.

In [3]:
# Empty list to store dataframes of each metadata table
mds = []

# Transform each json to dataframe
for i in response:

    # pd.json_normalize flattens the json to a 1-row data frame,
    # otherwise a value error occurs: "arrays must all be same length"
    # Flatten just dimensions and keep productId
    dims = pd.DataFrame.from_records(pd.json_normalize(i, 'dimensions', ['productId']))

    # Flatten just corrections and keep productId
    cor = pd.DataFrame.from_records(pd.json_normalize(i, 'corrections', ['productId']))

    # Flatten the rest of the json
    mdNoDims = pd.DataFrame.from_records(pd.json_normalize(i))

    # Merge dims and mdNoDims
    mdDF = pd.merge(mdNoDims, dims, how="left", on="productId")

    # merge cor and mdDF
    mdDF = pd.merge(mdDF, cor, how="left", on="productId")

    # Add each dataframe into the mds list
    mds.append(mdDF)

# Concatenate all the dataframes into one
md = pd.concat(mds, ignore_index=True)

md.head()

print(len(md.index))

19907
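
The flatten-and-merge step can be checked on a single record before running it over the whole list. This is a minimal sketch on a hand-written record that mimics the shape of one `getAllCubesList` entry (trimmed to a few fields); it is an illustration, not a replacement for the loop above.

```python
import pandas as pd

# Hand-written record with the same shape as one getAllCubesList entry
record = {
    "productId": 10100001,
    "subjectCode": ["10"],
    "corrections": [],
    "dimensions": [
        {"dimensionNameEn": "Geography", "dimensionPositionId": 1, "hasUOM": False},
        {"dimensionNameEn": "Federal public sector employment",
         "dimensionPositionId": 2, "hasUOM": True},
    ],
}

# One row per dimension, with productId carried along as the merge key
dims = pd.json_normalize(record, record_path="dimensions", meta=["productId"])

# One row for the table itself; drop the nested columns already handled
table = pd.json_normalize(record).drop(columns=["dimensions", "corrections"])

# Meta columns come back as object dtype, so align the key types first
dims["productId"] = dims["productId"].astype("int64")
flat = table.merge(dims, how="left", on="productId")
print(flat[["productId", "dimensionPositionId", "dimensionNameEn"]])
```

The table-level row is repeated once per dimension, which is why the merged frame has more rows than there are tables.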


Notice how some columns hold lists, which means a cell can contain multiple values. Let's unlist them so that each value in a list gets its own row. We can do that with `DataFrame.explode()`.

In [4]:
# Drop dimensions and corrections column
md = md[['productId', 'cansimId', 'cubeTitleEn', 'cubeTitleFr', 'cubeStartDate',
         'cubeEndDate', 'releaseTime', 'archived', 'subjectCode', 'surveyCode',
         'frequencyCode', 'dimensionNameEn', 'dimensionNameFr',
         'dimensionPositionId', 'hasUOM', "correctionDate", "correctionNoteEn",
         "correctionNoteFr"]]

# Unlist subjectCode, surveyCode
md = md.explode("subjectCode", ignore_index=True)
md = md.explode("surveyCode", ignore_index=True)
md.head()

Unnamed: 0,productId,cansimId,cubeTitleEn,cubeTitleFr,cubeStartDate,cubeEndDate,releaseTime,archived,subjectCode,surveyCode,frequencyCode,dimensionNameEn,dimensionNameFr,dimensionPositionId,hasUOM,correctionDate,correctionNoteEn,correctionNoteFr
0,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,2012-08-01T12:30,1,10,1713,12,Geography,Géographie,1,False,,,
1,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,2012-08-01T12:30,1,10,1713,12,Federal public sector employment,Groupes d'emplois du secteur public fédéral,2,True,,,
2,10100002,191-0002,Central government debt,Dette du gouvernement central,2009-04-01,2021-03-01,2021-05-31T12:30,2,10,7514,6,Geography,Géographie,1,False,,,
3,10100002,191-0002,Central government debt,Dette du gouvernement central,2009-04-01,2021-03-01,2021-05-31T12:30,2,10,7514,6,Central government debt,Dette du gouvernement central,2,True,,,
4,10100003,176-0079,Government of Canada debt securities: Gross ne...,Titres d’emprunt du gouvernement du Canada : é...,1975-01-01,2021-05-01,2021-06-21T12:30,2,10,7502,6,Geography,Géographie,1,False,,,
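
To see what `explode` does to a list column, here is a small sketch on a toy frame. The second row's extra subject code is made up for illustration.

```python
import pandas as pd

# Toy frame: the second table is filed under two subjects (made-up example)
df = pd.DataFrame({
    "productId": [10100001, 10100002],
    "subjectCode": [["10"], ["10", "36"]],
})

# Each list element gets its own row; the other columns are repeated
exploded = df.explode("subjectCode", ignore_index=True)
print(exploded)
```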


I then check how many NaNs are in each column to see if there are any columns that can be removed. My condition for removing columns is when the entire column is NaNs.

In [5]:
# Count how many NaNs are in each column
print(md.isna().sum())

productId                  0
cansimId                4459
cubeTitleEn                0
cubeTitleFr                0
cubeStartDate              0
cubeEndDate                0
releaseTime                0
archived                   0
subjectCode                0
surveyCode                96
frequencyCode              0
dimensionNameEn            0
dimensionNameFr            0
dimensionPositionId        0
hasUOM                     0
correctionDate         21483
correctionNoteEn       21483
correctionNoteFr       21483
dtype: int64
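
The removal condition described above (a column made up entirely of NaNs) can be expressed with `dropna`. A minimal sketch on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "productId": [1, 2, 3],
    "cansimId": ["183-0021", None, "176-0079"],   # partly missing: kept
    "correctionDate": [np.nan, np.nan, np.nan],   # entirely missing: dropped
})

# how="all" drops a column only when every value in it is missing
trimmed = df.dropna(axis="columns", how="all")
print(trimmed.columns.tolist())
```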


Then, I convert the columns to their proper types (for example, date columns become dates). Notice that some columns that look like integers are stored as `object`: the codes arrive as strings, and a column containing NaNs cannot be a plain integer column. To handle this, convert those columns to float.

In [6]:
# Check types
print(md.dtypes)

# Convert column types
md[['productId', 'cansimId', 'cubeTitleEn', 'cubeTitleFr', 'releaseTime', 'dimensionNameEn', 'dimensionNameFr']] = md[['productId', 'cansimId', 'cubeTitleEn', 'cubeTitleFr', 'releaseTime', 'dimensionNameEn', 'dimensionNameFr']].astype(str)

# Has to be float to handle NaNs
md[['archived', 'subjectCode', 'surveyCode', 'frequencyCode']] = md[['archived', 'subjectCode', 'surveyCode', 'frequencyCode']].astype(float)

# Remove Time stamps on Release Time
md['releaseTime'] = md['releaseTime'].map(lambda x: str(x)[:-6])

# Convert string to date
md['cubeStartDate'] = pd.to_datetime(md['cubeStartDate'], format='%Y-%m-%d')
md['cubeEndDate'] = pd.to_datetime(md['cubeEndDate'], format='%Y-%m-%d')
md['releaseTime'] = pd.to_datetime(md['releaseTime'], format='%Y-%m-%d')

# Check types again
print(md.dtypes)

productId              object
cansimId               object
cubeTitleEn            object
cubeTitleFr            object
cubeStartDate          object
cubeEndDate            object
releaseTime            object
archived               object
subjectCode            object
surveyCode             object
frequencyCode           int64
dimensionNameEn        object
dimensionNameFr        object
dimensionPositionId     int64
hasUOM                   bool
correctionDate         object
correctionNoteEn       object
correctionNoteFr       object
dtype: object
productId                      object
cansimId                       object
cubeTitleEn                    object
cubeTitleFr                    object
cubeStartDate          datetime64[ns]
cubeEndDate            datetime64[ns]
releaseTime            datetime64[ns]
archived                      float64
subjectCode                   float64
surveyCode                    float64
frequencyCode                 float64
dimensionNameEn             
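
Converting to float is one way to cope with missing codes. Another option, sketched here as an alternative rather than what this notebook does, is pandas' nullable `Int64` dtype, which keeps whole numbers and marks gaps as `<NA>`. The sample values are made up; if this dtype were used, the code-set keys would need the same dtype before merging.

```python
import pandas as pd

# Survey codes arrive as strings, and some rows have none at all
codes = pd.Series(["1713", None, "7514"], name="surveyCode")

# to_numeric parses the strings; Int64 (capital I) keeps integers plus <NA>
as_int = pd.to_numeric(codes).astype("Int64")
print(as_int.dtype, as_int.tolist())
```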

Once the column types are converted, drop duplicate rows. Note that pandas cannot drop duplicates when cells hold unhashable objects such as lists or dicts (the JSON format), and mixed object types within a column can also get in the way. Hence, it was necessary to explode the list columns, convert the columns to their appropriate types, and account for NaNs first.

The output is a clean and easy-to-read data frame, much better than the JSON format.

In [7]:
# Drop all duplicates
# Would not drop duplicates without changing type
# Threw an error: dict is unhashable
md.drop_duplicates(inplace=True, ignore_index=True)

# Print
md.head()

Unnamed: 0,productId,cansimId,cubeTitleEn,cubeTitleFr,cubeStartDate,cubeEndDate,releaseTime,archived,subjectCode,surveyCode,frequencyCode,dimensionNameEn,dimensionNameFr,dimensionPositionId,hasUOM,correctionDate,correctionNoteEn,correctionNoteFr
0,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,2012-08-01,1.0,10.0,1713.0,12.0,Geography,Géographie,1,False,,,
1,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,2012-08-01,1.0,10.0,1713.0,12.0,Federal public sector employment,Groupes d'emplois du secteur public fédéral,2,True,,,
2,10100002,191-0002,Central government debt,Dette du gouvernement central,2009-04-01,2021-03-01,2021-05-31,2.0,10.0,7514.0,6.0,Geography,Géographie,1,False,,,
3,10100002,191-0002,Central government debt,Dette du gouvernement central,2009-04-01,2021-03-01,2021-05-31,2.0,10.0,7514.0,6.0,Central government debt,Dette du gouvernement central,2,True,,,
4,10100003,176-0079,Government of Canada debt securities: Gross ne...,Titres d’emprunt du gouvernement du Canada : é...,1975-01-01,2021-05-01,2021-06-21,2.0,10.0,7502.0,6.0,Geography,Géographie,1,False,,,
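
The unhashable-object problem is easy to reproduce on a toy frame. A minimal sketch with made-up rows: `drop_duplicates` raises a `TypeError` while a column still holds lists, and works once the lists have been exploded into scalars.

```python
import pandas as pd

df = pd.DataFrame({
    "productId": [10100001, 10100001],
    "subjectCode": [["10"], ["10"]],   # list cells are unhashable
})

try:
    df.drop_duplicates()
    failed = False
except TypeError as err:
    failed = True
    print("drop_duplicates failed:", err)

# After exploding, every cell is a scalar and duplicates can be dropped
deduped = (
    df.explode("subjectCode", ignore_index=True)
      .drop_duplicates(ignore_index=True)
)
print(deduped)
```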


## Collect Code Sets
The current dataframe has subject, survey, and frequency as codes. This doesn't tell the user much without a legend, and with so many codes, the user would have to spend a lot of time looking up codes. Thus, let's collect the code sets and merge them with the data frame.

As before, let's retrieve the code sets and load them into a Pandas data frame. Notice that there is only one row of values, and each code set is stored in its own cell as a list of JSON objects.

In [8]:
r = http.request('GET', "https://www150.statcan.gc.ca/t1/wds/rest/getCodeSets")

# Returns a JSON object with one entry per code set
response = json.loads(r.data.decode('utf-8'))

# Code Sets
cs = pd.DataFrame.from_records(pd.json_normalize(response))

cs



Unnamed: 0,status,object.scalar,object.frequency,object.symbol,object.status,object.uom,object.survey,object.subject,object.classificationType,object.securityLevel,object.terminated,object.wdsResponseStatus
0,SUCCESS,"[{'scalarFactorCode': 0, 'scalarFactorDescEn':...","[{'frequencyCode': 1, 'frequencyDescEn': 'Dail...","[{'symbolCode': '0', 'symbolRepresentationEn':...","[{'statusCode': '0', 'statusRepresentationEn':...","[{'memberUomCode': 0, 'memberUomEn': None, 'me...","[{'surveyCode': '1105', 'surveyEn': 'Business ...","[{'subjectCode': '10', 'subjectEn': 'Governmen...","[{'classificationTypeCode': 1, 'classification...","[{'securityLevelCode': 0, 'securityLevelRepres...","[{'codeId': 0, 'codeTextEn': 'active', 'codeTe...","[{'codeId': 0, 'codeTextEn': 'Success', 'codeT..."


We can load each code set in the same way as we've retrieved them. The most useful ones are the subject, survey, and frequency code sets.

In [9]:
# Get individual code sets
scalar = pd.DataFrame.from_records(pd.json_normalize(cs['object.scalar'][0]))
freq = pd.DataFrame.from_records(pd.json_normalize(cs['object.frequency'][0]))
symb = pd.DataFrame.from_records(pd.json_normalize(cs['object.symbol'][0]))
stat = pd.DataFrame.from_records(pd.json_normalize(cs['object.status'][0]))
uom = pd.DataFrame.from_records(pd.json_normalize(cs['object.uom'][0]))
surv = pd.DataFrame.from_records(pd.json_normalize(cs['object.survey'][0]))
subj = pd.DataFrame.from_records(pd.json_normalize(cs['object.subject'][0]))
classType = pd.DataFrame.from_records(pd.json_normalize(cs['object.classificationType'][0]))
secLvl = pd.DataFrame.from_records(pd.json_normalize(cs['object.securityLevel'][0]))
term = pd.DataFrame.from_records(pd.json_normalize(cs['object.terminated'][0]))

# Check
display(subj.head())
display(surv.head())
display(freq.head())
display(uom.head())
display(term.head())

Unnamed: 0,subjectCode,subjectEn,subjectFr
0,10,Government,Gouvernement
1,11,"Income, pensions, spending and wealth","Revenu, pensions, dépenses et richesse"
2,12,International trade,Commerce international
3,13,Health,Santé
4,14,Labour,Travail


Unnamed: 0,surveyCode,surveyEn,surveyFr
0,1105,Business Register,Registre des entreprises
1,1141,Average Fair Market Value/Purchase Price for N...,Juste valeur marchande/prix d'achat pour les h...
2,1209,Survey of Environmental Goods and Services,Enquête sur les biens et services environnemen...
3,1301,Gross Domestic Product by Industry - National ...,Produit intérieur brut par industrie - Nationa...
4,1302,Gross Domestic Product by Industry - Annual,Produit intérieur brut par industrie - annuelle


Unnamed: 0,frequencyCode,frequencyDescEn,frequencyDescFr
0,1,Daily,Quotidienne
1,11,Semi-annual,Semestrielle
2,12,Annual,Annuelle
3,13,Every 2 years,Aux 2 ans
4,14,Every 3 years,Aux 3 ans


Unnamed: 0,memberUomCode,memberUomEn,memberUomFr
0,0,,
1,1,1981=100,1981=100
2,10,199712=100,199712=100
3,100,Dollars per 8 litres,Dollars par 8 litres
4,101,Dollars per 9 litres,Dollars par 9 litres


Unnamed: 0,codeId,codeTextEn,codeTextFr,displayCodeEn,displayCodeFr
0,0,active,actif,,
1,1,terminated,terminé,t,t


Then, I merge these code sets with the metadata, and select and reorder the columns so that each code sits next to its definition. Note that for the `archived` code, I manually looked up a `productId` with each value (1 and 2) on the website and checked whether it carried an archived tag.

In [10]:
# 1 means archived, 2 means not archived
print(md['archived'].unique())

[1. 2.]
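
Since the `archived` code has no code set of its own here, a readable label can be added by hand. A minimal sketch: the meaning of 1 and 2 comes from the manual check described above, while the label text (`"archived"`, `"current"`) and the name `ARCHIVED_LABELS` are my own choices.

```python
import pandas as pd

# 1 = archived, 2 = not archived (from the manual check); label text is illustrative
ARCHIVED_LABELS = {1.0: "archived", 2.0: "current"}

archived = pd.Series([1.0, 2.0, 2.0], name="archived")
labels = archived.map(ARCHIVED_LABELS)
print(labels.tolist())
```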


In [11]:
# Convert types to match with md
subj['subjectCode'] = subj['subjectCode'].astype(float)
surv['surveyCode'] = surv['surveyCode'].astype(float)
freq['frequencyCode'] = freq['frequencyCode'].astype(float)

# Merge code sets with metadata
md = pd.merge(md, subj, how="left", on="subjectCode")
md = pd.merge(md, surv, how="left", on="surveyCode")
md = pd.merge(md, freq, how="left", on="frequencyCode")

md.head()

Unnamed: 0,productId,cansimId,cubeTitleEn,cubeTitleFr,cubeStartDate,cubeEndDate,releaseTime,archived,subjectCode,surveyCode,...,hasUOM,correctionDate,correctionNoteEn,correctionNoteFr,subjectEn,subjectFr,surveyEn,surveyFr,frequencyDescEn,frequencyDescFr
0,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,2012-08-01,1.0,10.0,1713.0,...,False,,,,Government,Gouvernement,Public Sector Employment,Emploi dans le secteur public,Annual,Annuelle
1,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,2012-08-01,1.0,10.0,1713.0,...,True,,,,Government,Gouvernement,Public Sector Employment,Emploi dans le secteur public,Annual,Annuelle
2,10100002,191-0002,Central government debt,Dette du gouvernement central,2009-04-01,2021-03-01,2021-05-31,2.0,10.0,7514.0,...,False,,,,Government,Gouvernement,Department of Finance,Ministère des Finances,Monthly,Mensuelle
3,10100002,191-0002,Central government debt,Dette du gouvernement central,2009-04-01,2021-03-01,2021-05-31,2.0,10.0,7514.0,...,True,,,,Government,Gouvernement,Department of Finance,Ministère des Finances,Monthly,Mensuelle
4,10100003,176-0079,Government of Canada debt securities: Gross ne...,Titres d’emprunt du gouvernement du Canada : é...,1975-01-01,2021-05-01,2021-06-21,2.0,10.0,7502.0,...,False,,,,Government,Gouvernement,Bank of Canada,Banque du Canada,Monthly,Mensuelle
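
A left merge silently leaves NaNs where a code has no match. As a sketch, `pd.merge` can be asked to report such rows and to check that each code appears only once in the code set. The frames are toy data, and code 99 is made up to force a miss.

```python
import pandas as pd

md_small = pd.DataFrame({"productId": ["A", "B", "C"],
                         "frequencyCode": [12.0, 6.0, 99.0]})  # 99 is a made-up code
freq_small = pd.DataFrame({"frequencyCode": [6.0, 12.0],
                           "frequencyDescEn": ["Monthly", "Annual"]})

merged = pd.merge(
    md_small, freq_small, how="left", on="frequencyCode",
    validate="many_to_one",   # raises MergeError if a code repeats in freq_small
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

# Rows whose code was not found in the code set
unmatched = merged.loc[merged["_merge"] == "left_only", "productId"].tolist()
print(unmatched)
```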


In [12]:
# Select and reorder the columns
md = md[["productId", "cansimId", "cubeTitleEn", "cubeTitleFr", "cubeStartDate",
         "cubeEndDate", "releaseTime", "archived", "subjectCode", "subjectEn",
         "subjectFr", "surveyCode", "surveyEn", "surveyFr", "frequencyCode",
         "frequencyDescEn", "frequencyDescFr", "dimensionNameEn",
         "dimensionNameFr", "correctionDate", "correctionNoteEn",
         "correctionNoteFr", "hasUOM"]]

md.head()

Unnamed: 0,productId,cansimId,cubeTitleEn,cubeTitleFr,cubeStartDate,cubeEndDate,releaseTime,archived,subjectCode,subjectEn,...,surveyFr,frequencyCode,frequencyDescEn,frequencyDescFr,dimensionNameEn,dimensionNameFr,correctionDate,correctionNoteEn,correctionNoteFr,hasUOM
0,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,2012-08-01,1.0,10.0,Government,...,Emploi dans le secteur public,12.0,Annual,Annuelle,Geography,Géographie,,,,False
1,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,2012-08-01,1.0,10.0,Government,...,Emploi dans le secteur public,12.0,Annual,Annuelle,Federal public sector employment,Groupes d'emplois du secteur public fédéral,,,,True
2,10100002,191-0002,Central government debt,Dette du gouvernement central,2009-04-01,2021-03-01,2021-05-31,2.0,10.0,Government,...,Ministère des Finances,6.0,Monthly,Mensuelle,Geography,Géographie,,,,False
3,10100002,191-0002,Central government debt,Dette du gouvernement central,2009-04-01,2021-03-01,2021-05-31,2.0,10.0,Government,...,Ministère des Finances,6.0,Monthly,Mensuelle,Central government debt,Dette du gouvernement central,,,,True
4,10100003,176-0079,Government of Canada debt securities: Gross ne...,Titres d’emprunt du gouvernement du Canada : é...,1975-01-01,2021-05-01,2021-06-21,2.0,10.0,Government,...,Banque du Canada,6.0,Monthly,Mensuelle,Geography,Géographie,,,,False


Finally, the user has the option to filter the data frame before saving it.

In [13]:
# Optional: Filter dataframe

# Print unique productIds
print(md['productId'].unique())

# Save this file
md.to_csv("all_metadata.csv", index=False)

['10100001' '10100002' '10100003' ... '46100054' '46100055' '46100056']
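
The optional filter step is left empty in the cell above. As one possible sketch, a boolean mask can keep only non-archived tables for a given subject. The frame is toy data with the same column names, and the third productId is made up.

```python
import pandas as pd

md_small = pd.DataFrame({
    "productId": ["10100001", "10100002", "13100001"],  # last ID is made up
    "archived": [1.0, 2.0, 2.0],
    "subjectEn": ["Government", "Government", "Health"],
})

# Keep non-archived Government tables (archived == 2 means not archived)
mask = (md_small["archived"] == 2) & (md_small["subjectEn"] == "Government")
subset = md_small.loc[mask]
print(subset["productId"].tolist())
```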


## Download Complete CANSIM Tables
To download complete CANSIM tables, the user needs a list of productIds. In this example, I use the `productId` column from the compiled metadata data frame saved in the previous section. The `unique()` function removes duplicates and returns an array of unique productIds, of which I use just the first 5.

In [14]:
"""Get Complete CANSIM Tables"""

# The URL for downloading a CSV file is static;
# only the productId in it changes

# Load productIds
md = pd.read_csv("all_metadata.csv", usecols=['productId'])

# List of unique pids (Take first 5)
pids = md['productId'].unique()[0:5]

print(pids)

[10100001 10100002 10100003 10100004 10100005]


It would be inefficient to download tables that have already been downloaded, so let's add a check that only downloads tables not yet present in a specified file path. First, scan for all directories (a.k.a. folders) in the specified file path and remove hidden directories; the output is a list of directory names as strings. The productIds are integers, though, so to compare the two lists, one of them has to be converted. In this case, I convert the directory names from strings to integers if the list is non-empty. To keep only the productIds that have not been downloaded, convert both lists to sets and take the set difference (`-`), which returns the elements of the first set that are not in the second. Note that the XOR operator (`^`) is not the right tool here: it returns the elements that are in exactly one of the two sets, so it would also return tables that are already on disk but not in the requested list.

In [15]:
# Check which tables have already been downloaded

# If the cansim_tables folder does not exist, create it
if not os.path.exists("./cansim_tables/"):
    os.mkdir("./cansim_tables/")

# List of downloaded cansim tables
dled_pids = [f.name for f in os.scandir("./cansim_tables/") if f.is_dir()]

# Remove hidden directories
dled_pids = [d for d in dled_pids if not d.startswith('.')]

# Convert to integers if list is not empty
if (len(dled_pids) > 0):
    dled_pids = [int(i) for i in dled_pids]

print(dled_pids)

# Get pids that haven't been downloaded, as a sorted list of plain ints
new_pids = sorted(set(pids.tolist()) - set(dled_pids))

print(new_pids)

[10100005, 10100038, 10100026, 10100075, 10100001]
[10100002, 10100003, 10100004]
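
To see how the two set operators behave on lists like the ones above, here is a small sketch with made-up IDs. Only the difference (`-`) answers the question "which requested tables are still missing?".

```python
wanted = [10100001, 10100002, 10100003]   # IDs we would like to have
downloaded = [10100001, 10100038]         # IDs already on disk (made up)

# Symmetric difference: elements that are in exactly one of the two sets
either_only = sorted(set(wanted) ^ set(downloaded))

# Difference: wanted but not yet downloaded
still_needed = sorted(set(wanted) - set(downloaded))

print(either_only)   # [10100002, 10100003, 10100038]
print(still_needed)  # [10100002, 10100003]
```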


Finally, download each new CANSIM table. Referring to the `getFullTableDownloadCSV` function from the WDS, notice that the object in the returned JSON is a static link in which only the productId changes (e.g. "https://www150.statcan.gc.ca/n1/tbl/csv/14100287-eng.zip"), so instead of calling the WDS, let's build this link directly for each new CANSIM table. `urllib.request.urlretrieve(url)` downloads the archive to a temporary file; passing a file path as the second parameter would save the zip file at that location instead. Since the data is quicker to read once extracted, I unzip each download into its own folder and keep that rather than the zip.

In [16]:
# Download a small subset of zip files and unzip them
for i in new_pids:

    # Update url per each productId
    url = "https://www150.statcan.gc.ca/n1/tbl/csv/{}-eng.zip".format(i)

    # Create a folder per productId
    extract_dir = r'./cansim_tables/{}'.format(i)

    # Retrieve the zip file into a temporary file
    # Pass a file path as the second argument to keep the zip itself
    zip_path, _ = urllib.request.urlretrieve(url)

    # Unzip each file
    with zipfile.ZipFile(zip_path, "r") as f:
        f.extractall(extract_dir)
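
If the archive is fetched with the `urllib3` pool manager instead, it can be unpacked straight from memory without a temporary file. A minimal sketch, assuming a hypothetical helper `extract_zip_bytes`; the demonstration builds a tiny archive locally so that it runs offline, and the CSV content is made up.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_zip_bytes(payload: bytes, extract_dir: str) -> list:
    """Unpack an in-memory zip archive into extract_dir and return member names.

    Hypothetical helper; `payload` would be the raw bytes of a downloaded
    archive, for example `http.request('GET', url).data` with urllib3.
    """
    Path(extract_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(extract_dir)
        return archive.namelist()


# Offline demonstration: build a tiny archive in memory, then extract it
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "REF_DATE,VALUE\n2011,1\n")

target = Path(tempfile.mkdtemp()) / "example_table"
names = extract_zip_bytes(buffer.getvalue(), str(target))
print(names)
```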

## Get More Detailed MetaData
Previously, we used `getAllCubesList` to construct the metadata, but it is missing detailed information on dimensions, members, and footnotes. However, `getCubeMetadata` retrieves the metadata of an individual productId, including these attributes. The caveat is that you must already have a list of productIds, which can be retrieved using `getAllCubesList` or `getAllCubesListLite` (which returns even fewer attributes). In this example, I use the `productId` column from the compiled metadata data frame saved in the first section. The `unique()` function removes duplicates and returns an array of unique productIds, of which I use just the first 3.

Note: Because there are many columns, Pandas hides some of them when displaying the data frame. To check that all columns were added, I change the display setting to show every column.

In [17]:
# Change the display option to show all columns
pd.set_option('display.max_columns', None)

# Load all_metadata.csv for its productId column
md = pd.read_csv("all_metadata.csv", usecols=['productId'])

# List of pids
pids = md['productId'].unique()

# Change the size of the list to test
pids = pids[0:3]

For each productId, retrieve the metadata and load it into a data frame as in the first section (Converting to Pandas Data Frame). The dimensions and footnotes are still in JSON format, so extract them in the same way. Note that `member` is itself stored as JSON inside the extracted dimensions data frame, which can have more than one row. Therefore, a second for loop over the rows of the dimensions data frame extracts the members as data frames, which are then concatenated into one.

To be able to merge the dimensions, members, and footnotes with the retrieved metadata data frame, I add a `productId` column to each of them as the common key. Then, I select which columns to keep (removing unnecessary ones such as status and response status), explode all columns that store lists, and drop duplicate rows. The output is a clean data frame for the productId, which is then stored in a list.

In [18]:
# Empty list to store each dataframe
mds = []

# For each productId
for i in pids:

    # Update post body
    data = [{"productId": int(i)}]

    # Encoding the data in JSON format.
    encoded_data = json.dumps(data).encode("utf-8")

    # Retrieve the metadata
    r = http.request(
        "POST",
        "https://www150.statcan.gc.ca/t1/wds/rest/getCubeMetadata",
        body=encoded_data, # Embedding JSON data into request body.
        headers={"Content-Type": "application/json"}
    )

    # Load it as json
    r = json.loads(r.data.decode("utf-8"))

    # Flatten the json into a dataframe
    pid = pd.DataFrame.from_records(pd.json_normalize(r))

    # Keep only the metadata columns (drops status, response status, etc.)
    pid = pid[["object.productId", "object.cansimId", "object.cubeTitleEn",
               "object.cubeTitleFr", "object.cubeStartDate", "object.cubeEndDate",
               "object.frequencyCode", "object.nbSeriesCube", "object.archiveStatusCode",
               "object.archiveStatusEn", "object.archiveStatusFr", "object.subjectCode",
               "object.surveyCode", "object.dimension", "object.footnote",
               "object.correctionFootnote", "object.geoAttribute", "object.correction"]]

    # Strip the "object." prefix from the column names (literal match, not a regex)
    pid.columns = pid.columns.str.replace("object.", "", regex=False)

    # Extract dims
    dims = pd.DataFrame.from_records(pd.json_normalize(pid['dimension'][0]))

    # Add a column for productId
    dims['productId'] = pid['productId'][0]

    # Empty list to store members
    mems = []

    # For each member in dims
    for j in dims['member']:

        # Add the flattened dataframe into mems
        mems.append(pd.DataFrame.from_records(pd.json_normalize(j)))

    # Concatenate the members
    members = pd.concat(mems, ignore_index=True)

    # Add a column for productId
    members['productId'] = pid['productId'][0]

    # Extract Footnotes
    fn = pd.DataFrame.from_records(pd.json_normalize(pid['footnote'][0]))

    # Add a column for productId
    fn['productId'] = pid['productId'][0]

    # Merge pid, dims, members, fn
    pid = pd.merge(pid, dims, on='productId')
    pid = pd.merge(pid, members, on='productId')
    pid = pd.merge(pid, fn, on='productId')

    # Select columns
    pid = pid[["productId", "cansimId", "cubeTitleEn", "cubeTitleFr", "cubeStartDate", "cubeEndDate",
               "frequencyCode", "nbSeriesCube", "archiveStatusCode", "archiveStatusEn", "archiveStatusFr",
               "subjectCode", "surveyCode", "dimensionPositionId", "dimensionNameEn", "dimensionNameFr",
               "hasUom", "memberId", "parentMemberId", "memberNameEn", "memberNameFr", "memberUomCode",
               "footnoteId", "footnotesEn", "footnotesFr", "link.dimensionPositionId", "link.memberId",
               "correctionFootnote", "geoAttribute", "correction", "classificationCode",
               "classificationTypeCode", "geoLevel", "vintage"]]

    # Explode all columns with lists
    pid = pid.explode("subjectCode", ignore_index=True)
    pid = pid.explode("surveyCode", ignore_index=True)
    pid = pid.explode("correctionFootnote", ignore_index=True)
    pid = pid.explode("geoAttribute", ignore_index=True)
    pid = pid.explode("correction", ignore_index=True)

    # Convert types to float
    pid['subjectCode'] = pid['subjectCode'].astype(float)
    pid['surveyCode'] = pid['surveyCode'].astype(float)
    pid['frequencyCode'] = pid['frequencyCode'].astype(float)
    pid['memberUomCode'] = pid['memberUomCode'].astype(float)
    pid['classificationTypeCode'] = pid['classificationTypeCode'].astype(float)

    # Merge with code sets
    pid = pd.merge(pid, subj, on="subjectCode")
    pid = pd.merge(pid, surv, on="surveyCode")
    pid = pd.merge(pid, freq, on="frequencyCode")
    pid = pd.merge(pid, uom, on="memberUomCode")
    pid = pd.merge(pid, classType, how='left', on="classificationTypeCode")

    # Drop duplicates
    pid.drop_duplicates(inplace=True, ignore_index=True)

    mds.append(pid)
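
In the loop above, `members` is joined to `dims` on `productId` alone, so within one table every member is paired with every dimension. If that is not wanted, one option is to tag each member with the `dimensionPositionId` it came from and join on that instead. This is a minimal sketch on hand-written data shaped like the `dimension` list; the member names are illustrative.

```python
import pandas as pd

# Hand-written stand-in for pid['dimension'][0]
dimensions = [
    {"dimensionPositionId": 1, "dimensionNameEn": "Geography",
     "member": [{"memberId": 1, "memberNameEn": "Canada"}]},
    {"dimensionPositionId": 2, "dimensionNameEn": "Federal public sector employment",
     "member": [{"memberId": 1, "memberNameEn": "Member A"},
                {"memberId": 2, "memberNameEn": "Member B"}]},
]

frames = []
for dim in dimensions:
    # Flatten this dimension's members and record which dimension they belong to
    frame = pd.json_normalize(dim["member"])
    frame["dimensionPositionId"] = dim["dimensionPositionId"]
    frames.append(frame)

members = pd.concat(frames, ignore_index=True)

# Joining on dimensionPositionId keeps each member with its own dimension
dims = pd.DataFrame(dimensions).drop(columns="member")
linked = dims.merge(members, on="dimensionPositionId")
print(linked)
```

With three members across two dimensions, the linked frame has three rows rather than the six a join on `productId` alone would give.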



The list of metadata data frames is concatenated into a single data frame. Once again, I drop duplicate rows in case there is overlap between the CANSIM tables' metadata. I then count the missing values (NaNs) in each column and remove the columns that consist entirely of NaNs; while doing so, I reorder the columns so that the code columns precede their definitions. Note that the `correctionFootnote` and `correction` columns could still be in JSON format when non-empty. If so, they will need to be handled in the same way as the members in the dimensions data frame.

In [19]:
# Concatenate all dataframes
md = pd.concat(mds, ignore_index=True)

# Drop duplicates
md.drop_duplicates(inplace=True, ignore_index=True)

# Replace the string "None" with NaN
md.replace("None", np.nan, inplace=True)

print(len(md.index))
print(md.isna().sum())

5840
productId                      0
cansimId                       0
cubeTitleEn                    0
cubeTitleFr                    0
cubeStartDate                  0
cubeEndDate                    0
frequencyCode                  0
nbSeriesCube                   0
archiveStatusCode              0
archiveStatusEn                0
archiveStatusFr                0
subjectCode                    0
surveyCode                     0
dimensionPositionId            0
dimensionNameEn                0
dimensionNameFr                0
hasUom                         0
memberId                       0
parentMemberId              3000
memberNameEn                   0
memberNameFr                   0
memberUomCode                  0
footnoteId                     0
footnotesEn                    0
footnotesFr                    0
link.dimensionPositionId       0
link.memberId                  0
correctionFootnote          5840
geoAttribute                5840
correction                  5840
class

In [20]:
# Save to csv
md.to_csv("all_detailed_metadata_CHECK.csv", index=False)

In [21]:
# Print columns
md.columns

Index(['productId', 'cansimId', 'cubeTitleEn', 'cubeTitleFr', 'cubeStartDate',
       'cubeEndDate', 'frequencyCode', 'nbSeriesCube', 'archiveStatusCode',
       'archiveStatusEn', 'archiveStatusFr', 'subjectCode', 'surveyCode',
       'dimensionPositionId', 'dimensionNameEn', 'dimensionNameFr', 'hasUom',
       'memberId', 'parentMemberId', 'memberNameEn', 'memberNameFr',
       'memberUomCode', 'footnoteId', 'footnotesEn', 'footnotesFr',
       'link.dimensionPositionId', 'link.memberId', 'correctionFootnote',
       'geoAttribute', 'correction', 'classificationCode',
       'classificationTypeCode', 'geoLevel', 'vintage', 'subjectEn',
       'subjectFr', 'surveyEn', 'surveyFr', 'frequencyDescEn',
       'frequencyDescFr', 'memberUomEn', 'memberUomFr', 'classificationTypeEn',
       'classificationTypeFr'],
      dtype='object')

In [22]:
# Select and reorder columns
md = md[['productId', 'cansimId', 'cubeTitleEn', 'cubeTitleFr', 'cubeStartDate',
         'cubeEndDate', 'frequencyCode', 'frequencyDescEn', 'frequencyDescFr',
         'nbSeriesCube', 'archiveStatusCode', 'archiveStatusEn', 'archiveStatusFr',
         'subjectCode', 'subjectEn', 'subjectFr', 'surveyCode', 'surveyEn', 'surveyFr',
         'dimensionPositionId', 'dimensionNameEn', 'dimensionNameFr', 'hasUom',
         'memberId', 'parentMemberId', 'memberNameEn', 'memberNameFr', 'memberUomCode',
         'memberUomEn', 'memberUomFr', 'link.dimensionPositionId', 'link.memberId',
         'footnoteId', 'footnotesEn', 'footnotesFr']]

md.head()

Unnamed: 0,productId,cansimId,cubeTitleEn,cubeTitleFr,cubeStartDate,cubeEndDate,frequencyCode,frequencyDescEn,frequencyDescFr,nbSeriesCube,archiveStatusCode,archiveStatusEn,archiveStatusFr,subjectCode,subjectEn,subjectFr,surveyCode,surveyEn,surveyFr,dimensionPositionId,dimensionNameEn,dimensionNameFr,hasUom,memberId,parentMemberId,memberNameEn,memberNameFr,memberUomCode,memberUomEn,memberUomFr,link.dimensionPositionId,link.memberId,footnoteId,footnotesEn,footnotesFr
0,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,12.0,Annual,Annuelle,14,1,ARCHIVED - a cube publicly available but no l...,ARCHIVÉ - un cube qui est disponible au public...,1002.0,Government/Employment and remuneration,Gouvernement/Emploi et rémunération,1713.0,Public Sector Employment,Emploi dans le secteur public,1,Geography,Géographie,False,1,,"Federal public sector employees, as per Statis...","Employé(e)s du secteur public fédéral, selon l...",249.0,Persons,Personnes,0,0,1,This reconciliation statement provides data as...,Cet état des concordances fourni des données a...
1,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,12.0,Annual,Annuelle,14,1,ARCHIVED - a cube publicly available but no l...,ARCHIVÉ - un cube qui est disponible au public...,1002.0,Government/Employment and remuneration,Gouvernement/Emploi et rémunération,1713.0,Public Sector Employment,Emploi dans le secteur public,1,Geography,Géographie,False,1,,"Federal public sector employees, as per Statis...","Employé(e)s du secteur public fédéral, selon l...",249.0,Persons,Personnes,0,0,6,"In this reconciliation table, the Treasury Boa...","Dans ce tableau de rapprochement, l'univers du..."
2,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,12.0,Annual,Annuelle,14,1,ARCHIVED - a cube publicly available but no l...,ARCHIVÉ - un cube qui est disponible au public...,1002.0,Government/Employment and remuneration,Gouvernement/Emploi et rémunération,1713.0,Public Sector Employment,Emploi dans le secteur public,1,Geography,Géographie,False,1,,"Federal public sector employees, as per Statis...","Employé(e)s du secteur public fédéral, selon l...",249.0,Persons,Personnes,2,5,2,Included are employees of entities such as Can...,Comprend les employés d'entités telles que l'A...
3,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,12.0,Annual,Annuelle,14,1,ARCHIVED - a cube publicly available but no l...,ARCHIVÉ - un cube qui est disponible au public...,1002.0,Government/Employment and remuneration,Gouvernement/Emploi et rémunération,1713.0,Public Sector Employment,Emploi dans le secteur public,1,Geography,Géographie,False,1,,"Federal public sector employees, as per Statis...","Employé(e)s du secteur public fédéral, selon l...",249.0,Persons,Personnes,2,8,3,Included are employees of entities such as Can...,Comprend les employés d'entités telles que l'A...
4,10100001,183-0021,Federal public sector employment reconciliatio...,Emploi du secteur public fédéral rapprochement...,1999-01-01,2011-01-01,12.0,Annual,Annuelle,14,1,ARCHIVED - a cube publicly available but no l...,ARCHIVÉ - un cube qui est disponible au public...,1002.0,Government/Employment and remuneration,Gouvernement/Emploi et rémunération,1713.0,Public Sector Employment,Emploi dans le secteur public,1,Geography,Géographie,False,1,,"Federal public sector employees, as per Statis...","Employé(e)s du secteur public fédéral, selon l...",249.0,Persons,Personnes,2,13,4,Treasury Board of Canada Secretariat federal g...,Du point de vue du Secrétariat du Conseil du T...


Finally, the cleaned and complete metadata data frame can be filtered as needed and saved to CSV.

In [23]:
# Save to csv
md.to_csv("all_detailed_metadata.csv", index=False)
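
As an illustration of the filtering step, here is a minimal sketch of a helper that narrows the metadata by archive status and by a keyword in the English title. The function name `filter_metadata` and its parameters are hypothetical and not part of the notebook. The small `sample` frame only stands in for `md`: its first row reuses values shown in the output above, and its second row is a made-up placeholder.

```python
import pandas as pd


def filter_metadata(md, archive_status_code=None, title_keyword=None):
    """Return the rows of md that match the optional filters (hypothetical helper)."""
    mask = pd.Series(True, index=md.index)
    if archive_status_code is not None:
        # Compare as strings so the filter works whether the code was read as int or str
        mask &= md["archiveStatusCode"].astype(str) == str(archive_status_code)
    if title_keyword is not None:
        # Case-insensitive plain-text match; missing titles never match
        mask &= md["cubeTitleEn"].str.contains(
            title_keyword, case=False, na=False, regex=False
        )
    return md[mask]


# Stand-in for md: row 0 uses values from the output above, row 1 is a placeholder
sample = pd.DataFrame({
    "productId": [10100001, 99999999],
    "cubeTitleEn": ["Federal public sector employment reconciliation", "Placeholder title"],
    "archiveStatusCode": [1, 2],
})

archived = filter_metadata(sample, archive_status_code=1)
employment = filter_metadata(sample, title_keyword="employment")
```

With the real data frame, the same call would be `filter_metadata(md, archive_status_code=1)`, and the result can be saved with `to_csv` as in the cell above.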

## Yet Another Way to Concatenate All Metadata

Note: This approach requires a list of productIds (`pid_list`).

1. Download the full data csv files
2. Read and load all their metadata files as dataframes into a list
    - `for i in pid_list: pd.read_csv("path/to/cansim_tables/{0}/{0}_MetaData.csv".format(i))`
3. Clean the metadata data frame
4. Concatenate all the dataframes into one with `pd.concat(list_df, ignore_index=True)`
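
The steps above can be sketched roughly as follows. This is a sketch under several assumptions, not a definitive implementation. It assumes the full data files have already been downloaded and unzipped into one folder per productId, and that each `_MetaData.csv` file parses as a single rectangular CSV, which may not hold for every table. The function name `concat_metadata`, the `base_dir` parameter, and the `sourcePid` column are hypothetical. The cleaning step is reduced to dropping fully empty rows, since the list does not say what cleaning is needed.

```python
import os

import pandas as pd


def concat_metadata(pid_list, base_dir):
    """Read each productId's metadata CSV and concatenate them (hypothetical helper)."""
    list_df = []
    for pid in pid_list:
        # Assumed layout: <base_dir>/<pid>/<pid>_MetaData.csv
        path = os.path.join(base_dir, str(pid), "{}_MetaData.csv".format(pid))
        if not os.path.exists(path):
            print("Skipping {}: {} not found".format(pid, path))
            continue
        df = pd.read_csv(path)
        # Placeholder cleaning step: drop rows that are entirely empty
        df = df.dropna(how="all")
        # Hypothetical extra column recording which table each row came from
        df["sourcePid"] = pid
        list_df.append(df)
    if not list_df:
        return pd.DataFrame()
    return pd.concat(list_df, ignore_index=True)
```

A call such as `concat_metadata(pid_list, "path/to/cansim_tables")` returns one data frame covering every table whose metadata file was found. Missing files are reported and skipped rather than raising an error.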