# StatsCan Data Access via Python

This notebook uses the `stats_can` Python package to retrieve data through the Statistics Canada Web Data Services (WDS), documented at https://www.statcan.gc.ca/eng/developers/wds.

In [1]:
!conda install -y --use-local -c ian.e.preston stats_can

Collecting package metadata (current_repodata.json): done
Solving environment: done



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - stats_can


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2021.5.30  |       ha878542_0         136 KB  conda-forge
    certifi-2021.5.30          |   py38h578d9bd_0         141 KB  conda-forge
    stats_can-2.2.3            |             py_1          64 KB  ian.e.preston
    ------------------------------------------------------------
                                           Total:         341 KB

The following NEW packages will be INSTALLED:

  stats_can          ian.e.preston/noarch::stats_can-2.2.3-py_1

The following packages will be UPDATED:

  ca-certificates              

In [2]:
from stats_can import StatsCan
from stats_can.sc import code_sets_to_df_dict

In [3]:
import os
os.makedirs("data", exist_ok=True)
sc = StatsCan(data_folder="data")

Specify the product ID (PID) to retrieve the table displayed at https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=2710002201

> Product Identification number (PID) is a unique product identifier for all Statistics Canada products, including large multidimensional tables. The first two digits refer to a subject, the next two digits refer to product type, the last four digits refer to the product itself.
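The PID structure quoted above can be illustrated with a small helper. This is only a sketch: `split_pid` is a hypothetical function, not part of `stats_can`, and it assumes that any digits after the first eight (such as the trailing `01` in the URL) are a suffix that is not part of the table PID, which is what the `table_27100022` download message below suggests.

```python
def split_pid(pid):
    """Split a PID into the parts named in the StatsCan definition.

    Hypothetical helper for illustration; not part of stats_can.
    """
    # Keep digits only, so "271-000-22-01" and "2710002201" behave the same.
    digits = "".join(ch for ch in str(pid) if ch.isdigit())
    if len(digits) < 8:
        raise ValueError(f"expected at least 8 digits, got {pid!r}")
    table = digits[:8]
    return {
        "pid": table,                 # 8-digit table identifier
        "subject": table[:2],         # first two digits
        "product_type": table[2:4],   # next two digits
        "product": table[4:8],        # last four digits
        "suffix": digits[8:],         # assumption: extra digits are not part of the PID
    }


parts = split_pid("271-000-22-01")
print(parts["pid"], parts["subject"], parts["product_type"], parts["product"])
# 27100022 27 10 0022
```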

In [4]:
df = sc.table_to_df("271-000-22-01")

Downloading and loading table_27100022


In [5]:
df.columns

Index(['REF_DATE', 'GEO', 'DGUID', 'Performing sector',
       'Occupational category', 'Type of science', 'UOM', 'UOM_ID',
       'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'VALUE', 'STATUS',
       'SYMBOL', 'TERMINATED', 'DECIMALS'],
      dtype='object')

In [6]:
df.head()

,REF_DATE,GEO,DGUID,Performing sector,Occupational category,Type of science,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2000-01-01,Canada,2016A000011124,Total performing sector,Total personnel,Total sciences,Number,223,units,0,v54329358,1.1.1.1,167940.0,,,,0
1,2000-01-01,Canada,2016A000011124,Total performing sector,Total personnel,Natural sciences and engineering,Number,223,units,0,v54329359,1.1.1.2,147290.0,,,t,0
2,2000-01-01,Canada,2016A000011124,Total performing sector,Total personnel,Social sciences and humanities,Number,223,units,0,v54329360,1.1.1.3,20650.0,,,t,0
3,2000-01-01,Canada,2016A000011124,Total performing sector,Researchers,Total sciences,Number,223,units,0,v54329361,1.1.2.1,107900.0,,,,0
4,2000-01-01,Canada,2016A000011124,Total performing sector,Researchers,Natural sciences and engineering,Number,223,units,0,v54329362,1.1.2.2,91590.0,,,t,0


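Explore the first time series. A `VECTOR` identifies a single series, that is, the values of one data point along the time axis. The cell below selects that vector and reverses the row order so that the most recent reference periods come first.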

In [7]:
df[df['VECTOR']=='v54329358'][::-1].head()

,REF_DATE,GEO,DGUID,Performing sector,Occupational category,Type of science,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
1283,2018-01-01,Canada,2016A000011124,Total performing sector,Total personnel,Total sciences,Number,223,units,0,v54329358,1.1.1.1,244450.0,,,,0
1207,2017-01-01,Canada,2016A000011124,Total performing sector,Total personnel,Total sciences,Number,223,units,0,v54329358,1.1.1.1,237710.0,,,,0
1122,2016-01-01,Canada,2016A000011124,Total performing sector,Total personnel,Total sciences,Number,223,units,0,v54329358,1.1.1.1,235450.0,,,,0
1037,2015-01-01,Canada,2016A000011124,Total performing sector,Total personnel,Total sciences,Number,223,units,0,v54329358,1.1.1.1,251960.0,,,,0
952,2014-01-01,Canada,2016A000011124,Total performing sector,Total personnel,Total sciences,Number,223,units,0,v54329358,1.1.1.1,246100.0,,,,0


> Coordinate is a concatenation of the member ID values for each dimension. One value per dimension. (i.e. "1.3.1.1.1.1.0.0.0.0" ) A table PID number combined with a coordinate will identify a unique time series of data points.
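To make the coordinate definition concrete, the sketch below pairs each member ID in a coordinate with a dimension name. `coordinate_to_members` is a hypothetical helper, and the dimension order is an assumption taken from the column order of the table used in this notebook (`GEO`, `Performing sector`, `Occupational category`, `Type of science`). Trailing zeros are treated as padding for unused dimensions, which is also an assumption.

```python
def coordinate_to_members(coordinate, dimensions):
    """Map each member ID in a coordinate to its dimension name.

    Hypothetical helper for illustration; not part of stats_can.
    """
    member_ids = [int(part) for part in coordinate.split(".")]
    # Assumption: trailing zeros only pad the coordinate to a fixed width.
    while member_ids and member_ids[-1] == 0:
        member_ids.pop()
    if len(member_ids) != len(dimensions):
        raise ValueError(
            f"{coordinate!r} has {len(member_ids)} member IDs, "
            f"expected {len(dimensions)}"
        )
    return dict(zip(dimensions, member_ids))


# Assumed dimension order, taken from the columns of the table in this notebook.
DIMENSIONS = ["GEO", "Performing sector", "Occupational category", "Type of science"]

print(coordinate_to_members("1.3.5.3", DIMENSIONS))
# {'GEO': 1, 'Performing sector': 3, 'Occupational category': 5, 'Type of science': 3}
```

The member IDs are positions within each dimension, so the labels they stand for still have to be looked up in the table itself.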

In [8]:
df[df['COORDINATE']=='1.3.5.3']

,REF_DATE,GEO,DGUID,Performing sector,Occupational category,Type of science,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
996,2014-01-01,Canada,2016A000011124,Provincial government,On-site research consultants,Social sciences and humanities,Number,223,units,0,v1039066470,1.3.5.3,,..,,t,0
1081,2015-01-01,Canada,2016A000011124,Provincial government,On-site research consultants,Social sciences and humanities,Number,223,units,0,v1039066470,1.3.5.3,,..,,t,0
1166,2016-01-01,Canada,2016A000011124,Provincial government,On-site research consultants,Social sciences and humanities,Number,223,units,0,v1039066470,1.3.5.3,,..,,t,0


Retrieving a specific vector

In [9]:
df_vector = sc.vectors_to_df(["v54329358"])

Get the metadata for this vector.

In [10]:
sc.vector_metadata('v54329358')

[{'responseStatusCode': 0,
  'productId': 27100022,
  'coordinate': '1.1.1.1.0.0.0.0.0.0',
  'vectorId': 54329358,
  'frequencyCode': 12,
  'scalarFactorCode': 0,
  'decimals': 0,
  'terminated': 0,
  'SeriesTitleEn': 'Canada;Total performing sector;Total personnel;Total sciences',
  'SeriesTitleFr': "Canada;Total, secteurs d'exécution;Total de personnel;Total des sciences",
  'memberUomCode': 223}]
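The metadata record can be related back to the table columns with plain Python. The sketch below copies a few fields from the output above into a literal dictionary, so it runs without a network call. `pad_coordinate` is a hypothetical helper, and the width of ten positions is an assumption based on the coordinate shown in this record.

```python
# Fields copied from the vector_metadata output shown above.
metadata = {
    "productId": 27100022,
    "coordinate": "1.1.1.1.0.0.0.0.0.0",
    "vectorId": 54329358,
    "SeriesTitleEn": "Canada;Total performing sector;Total personnel;Total sciences",
}


def pad_coordinate(coordinate, width=10):
    """Pad a short coordinate with zeros (hypothetical helper; width is assumed)."""
    parts = coordinate.split(".")
    if len(parts) > width:
        raise ValueError(f"{coordinate!r} has more than {width} positions")
    return ".".join(parts + ["0"] * (width - len(parts)))


# The VECTOR column prefixes the numeric vectorId with "v".
vector_name = f"v{metadata['vectorId']}"

# The series title lists one member label per dimension, separated by ";".
member_labels = metadata["SeriesTitleEn"].split(";")

print(vector_name)
print(member_labels)
# The COORDINATE column holds "1.1.1.1"; the metadata holds the padded form.
print(pad_coordinate("1.1.1.1") == metadata["coordinate"])
```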

Get the list of tables that were updated today.

In [11]:
sc.tables_updated_today()[:5]

[{'responseStatusCode': 0,
  'productId': 11100076,
  'releaseTime': '2021-06-16T08:30'},
 {'responseStatusCode': 0,
  'productId': 11100077,
  'releaseTime': '2021-06-16T08:30'},
 {'responseStatusCode': 0,
  'productId': 11100079,
  'releaseTime': '2021-06-16T08:30'},
 {'responseStatusCode': 0,
  'productId': 10100006,
  'releaseTime': '2021-06-16T08:30'},
 {'responseStatusCode': 0,
  'productId': 18100001,
  'releaseTime': '2021-06-16T08:30'}]

Or get the tables updated on a specific date.

In [12]:
sc.tables_updated_on_date('2020-04-29')[:5]

[{'responseStatusCode': 0,
  'productId': 36100583,
  'releaseTime': '2020-04-29T08:30'},
 {'responseStatusCode': 0,
  'productId': 36100584,
  'releaseTime': '2020-04-29T08:30'},
 {'responseStatusCode': 0,
  'productId': 23100006,
  'releaseTime': '2020-04-29T08:30'},
 {'responseStatusCode': 0,
  'productId': 23100007,
  'releaseTime': '2020-04-29T08:30'},
 {'responseStatusCode': 0,
  'productId': 23100008,
  'releaseTime': '2020-04-29T08:30'}]
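Both calls return a list of plain dictionaries, so the records can be processed without any further library support. The sketch below uses three records copied from the output above. `released_product_ids` is a hypothetical helper, and it assumes that a `responseStatusCode` of `0` indicates success (the `wdsResponseStatus` code set retrieved later in this notebook should confirm the meaning).

```python
from datetime import datetime

# Records copied from the tables_updated_on_date output shown above.
updates = [
    {"responseStatusCode": 0, "productId": 36100583, "releaseTime": "2020-04-29T08:30"},
    {"responseStatusCode": 0, "productId": 36100584, "releaseTime": "2020-04-29T08:30"},
    {"responseStatusCode": 0, "productId": 23100006, "releaseTime": "2020-04-29T08:30"},
]


def released_product_ids(records):
    """Map each product ID to its release time as a datetime.

    Hypothetical helper; assumes responseStatusCode 0 means success.
    """
    released = {}
    for record in records:
        if record.get("responseStatusCode") != 0:
            continue  # skip records that did not report success
        released[record["productId"]] = datetime.strptime(
            record["releaseTime"], "%Y-%m-%dT%H:%M"
        )
    return released


releases = released_product_ids(updates)
print(sorted(releases))
# [23100006, 36100583, 36100584]
```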

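Retrieve the data points of a vector that were released within a date range. The range appears to filter on the release date rather than the reference period, which would explain why the `refPer` values returned below fall before the start of the range.

In [13]:
sc.vectors_to_df_remote('v54329358', start_release_date='2015-01-01', end_release_date='2019-01-01')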

refPer,v54329358
2011-01-01,239920.0
2012-01-01,231230.0
2013-01-01,232910.0


> Gets all code sets which provide additional information to describe information and are grouped into scales, frequencies, symbols etc.

In [14]:
codes = code_sets_to_df_dict()

In [15]:
codes.keys()

dict_keys(['scalar', 'frequency', 'symbol', 'status', 'uom', 'survey', 'subject', 'classificationType', 'securityLevel', 'terminated', 'wdsResponseStatus'])

In [16]:
codes['scalar']

,scalarFactorCode,scalarFactorDescEn,scalarFactorDescFr
0,0,units,unités
1,1,tens,dizaines
2,2,hundreds,centaines
3,3,thousands,milliers
4,4,tens of thousands,dizaines de milliers
5,5,hundreds of thousands,centaines de milliers
6,6,millions,millions
7,7,tens of millions,dizaines de millions
8,8,hundreds of millions,centaines de millions
9,9,billions,milliards
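The scalar code set reads as a power-of-ten table: code `0` is units, code `3` is thousands, code `9` is billions. Under that reading, a reported value can be converted to base units by multiplying by `10 ** scalarFactorCode`. The sketch below is a minimal illustration; `apply_scalar_factor` is a hypothetical helper, not part of `stats_can`.

```python
def apply_scalar_factor(value, scalar_factor_code):
    """Convert a reported value to base units.

    Hypothetical helper; assumes code n means a factor of 10**n,
    as the scalar code set above suggests (0 = units ... 9 = billions).
    """
    if not 0 <= scalar_factor_code <= 9:
        raise ValueError(f"unexpected scalar factor code: {scalar_factor_code}")
    return value * 10 ** scalar_factor_code


# The table in this notebook reports units (code 0), so values are unchanged.
print(apply_scalar_factor(167940.0, 0))
# 167940.0

# A value reported in thousands (code 3) is scaled by 1000.
print(apply_scalar_factor(1.5, 3))
# 1500.0
```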
