## Imports and setup

First, let's make the standard imports.

In [1]:
import requests 
import pandas as pd
import json
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
from matplotlib import cm

token = "" # Your TOKEN goes here
url = 'http://api.tukanmx.com/v1/retrieve/'

headers = {
"Content-Type": "application/json",
"Authorization": "Token " + token
}
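
A token pasted into a notebook is easy to leak when the notebook is shared. As a minimal sketch, the token could instead be read from an environment variable. The variable name `TUKAN_TOKEN` and the helper `build_headers` are assumptions made for this example; they are not part of TUKAN's API.

```python
import os


def build_headers(token):
    """Return the request headers used throughout this notebook."""
    return {
        "Content-Type": "application/json",
        "Authorization": "Token " + token,
    }


# TUKAN_TOKEN is a hypothetical variable name; fall back to an empty string
# so the cell still runs when the variable is not set.
token = os.environ.get("TUKAN_TOKEN", "")
headers = build_headers(token)
```

The resulting `headers` dictionary has the same two keys as the one defined above, so the rest of the notebook works unchanged.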

## Structure of TUKAN's data model

Our data repository works as follows:

*   **Institutions** - the source of the information, and the highest level in our repository.
      *   **Tables** - the dataset that contains the information. Each dataset is associated with an institution or source.
          * **Variables** - the indicators contained in the dataset.

### Institutions

Institutions are the source of the information. These include government and non-government entities that publish the original or raw data, such as INEGI, Banco de México, and CONSAR.

You can easily query which institutions are available in our data catalog with the following query:


In [2]:
response = requests.request("GET", url = "http://api.tukanmx.com/v1/institutions/", headers=headers)
institutions = pd.DataFrame(response.json())
institutions

Unnamed: 0,id,name,acronym,description,description_en,website,country
0,mex_banxico,Banco de México,Banxico,Banco central mexicano. Las finalidades sustan...,"Mexico's central bank, monetary authority and ...",https://www.banxico.org.mx/,mex
1,mex_grupo_bmv,Bolsa Mexicana de Valores,BMV,La Bolsa de Valores de México es una entidad f...,The BMV is a private financial entity that ope...,https://www.bmv.com.mx/,mex
2,mex_cnbv,Comisión Nacional Bancaria y de Valores,CNBV,Un órgano desconcentrado de la Secretaría de H...,A decentralized body of the Ministry of Financ...,https://www.gob.mx/cnbv,mex
3,mex_cnsf,Comisión Nacional de Seguros y Fianzas,CNSF,La Comisión Nacional de Seguros y Fianzas es u...,The National Insurance and Surety Commission i...,https://www.gob.mx/cnsf,mex
4,mex_consar,Comisión Nacional del Sistema de Ahorro para e...,CONSAR,CONSAR es la Comisión Nacional del Sistema de ...,CONSAR is the regulator in charge of managing ...,https://www.gob.mx/consar/,mex
5,mex_condusef,Comisión Nacional para la Protección y Defensa...,CONDUSEF,Es la encargada de promover y difundir la educ...,Is in charge of promoting transparency and res...,https://www.gob.mx/condusef,mex
6,mex_inegi,Instituto Nacional de Estadística y Geografía,INEGI,Organismo público autónomo responsable de norm...,The National Statistical and Geographic Inform...,https://www.inegi.org.mx/default.html,mex
7,mex_sct,Secretaría de Comunicaciones y Transportes,SCT,Una de las secretarías de Estado que integran ...,One of the state secretariats that make up the...,https://www.gob.mx/sct,mex
8,mex_segob,Secretaría de Gobernación,SEGOB,La Secretaría de Gobernación atiende el desarr...,SEGOB attends to the political devlopment of t...,https://www.gob.mx/segob,mex
9,mex_shcp,Secretaría de Hacienda y Crédito Público,SHCP,La Secretaría de Hacienda y Crédito Público ti...,The SHCP's (Ministry of Finance) mission is to...,https://www.gob.mx/shcp,mex


The columns represent:

|column| description|
|--|--|
|id| the TUKAN institution id|
|name| the official name of the institution|
|acronym| the common acronym associated with the institution|
|description| a brief overview of the institution's main functions in Spanish|
|description_en| a brief overview of the institution's main functions in English|
|website| the official website of the institution|
|country*| the 3-letter ISO code of where the institution is based|

*If `country == wd` then the institution is an international organization.
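
The `country` column can be used to separate national institutions from international organizations. The sketch below runs on a small hand-built frame rather than the live API response. The `wd` row is a made-up placeholder, and `split_by_scope` is a hypothetical helper name.

```python
import pandas as pd

# Hand-built sample with a subset of the institutions columns.
# The "wd" row is a placeholder for illustration, not a real catalog entry.
sample_institutions = pd.DataFrame(
    {
        "id": ["mex_banxico", "mex_inegi", "example_intl_org"],
        "acronym": ["Banxico", "INEGI", "EXAMPLE"],
        "country": ["mex", "mex", "wd"],
    }
)


def split_by_scope(institutions):
    """Return (national, international) frames based on the country code."""
    is_international = institutions["country"] == "wd"
    return institutions[~is_international], institutions[is_international]


national, international = split_by_scope(sample_institutions)
```

Passing the `institutions` frame from the query above in place of `sample_institutions` should work the same way, as long as the `country` column is present.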

### Tables

These are the datasets that contain the information. Each table is associated with an institution or source, and has a unique structure depending on the data it contains (don't worry, we'll explain this in more detail later).

First, we need to understand which tables are associated with each source. There are two ways to do this: 1) use the Explore component on the [web-application](https://dashboard.tukanmx.com/), or 2) directly query the tables associated with a particular institution.

For example, to find out which tables are associated with INEGI, **you will need the institution's id**; then run the following code:

In [21]:
# Define a function for future use
def get_institution_tables(inst_id):
    """Return the metadata of every table published by one institution."""
    # url and headers are the module-level values defined in the setup cell
    payload = {
        "type": "institution",
        "institution": inst_id,
        "operation": "data_tables_info",
    }

    response = requests.post(url, headers=headers, data=json.dumps(payload))
    tables = pd.DataFrame(response.json()["data_tables"])

    return tables

inegi_tables = get_institution_tables("mex_inegi")
inegi_tables.head(5)

Unnamed: 0,id,name,name_en,description,description_en,mode,website,last_updated,data_updated,institution_id,frequency_id,tag_id,categories
0,mex_inegi_api_employment,Estadísticas de Ocupación y Empleo,Employment Statistics,Con base en la Encuesta Nacional de Ocupación ...,Based on the National Occupation and Employmen...,standard,https://www.inegi.org.mx/temas/empleo/#Tabulados,2021-01-22,2021-08-19T06:45:05Z,mex_inegi,quarterly,,[adjustment_type]
1,mex_inegi_api_unemployment,Tasa de Desocupación,Unemployment Rate,Tasa de desocupación en series desestacionaliz...,"Unemployment rate, seasonally adjusted and tre...",standard,https://www.inegi.org.mx/temas/empleo/,2021-01-22,2021-10-25T09:19:54Z,mex_inegi,monthly,,[adjustment_type]
2,mex_inegi_census_households,Censo Población y Vivienda - Indicadores de Vi...,Census - Household Indicators,Proporciona la cuenta y características princi...,Provides information on the main characteristi...,standard,https://www.inegi.org.mx/programas/ccpv/2020/#...,2021-03-25,1977-06-08T05:20:00Z,mex_inegi,decennially,,[geography]
3,mex_inegi_census_people,Censo Población y Vivienda - Indicadores Pobla...,Census - Population Indicators,Proporciona la cuenta y características princi...,Provides information on the main characteristi...,standard,https://www.inegi.org.mx/programas/ccpv/2020/#...,2021-03-25,1977-06-08T05:20:00Z,mex_inegi,decennially,,"[geography, sex]"
4,mex_inegi_econ_census,Censo Económico,Economic Census,Los censos económicos contienen información ec...,The economic census contains economic informat...,standard,https://www.inegi.org.mx/programas/ce/2019/#In...,2021-08-27,2021-08-27T15:23:29Z,mex_inegi,quinquennial,,"[company_size, economic_activity, geography]"


This DataFrame shows each table's metadata, such as its name, description, and frequency.

In essence, the columns represent:

|column| description|
|--|--|
|id| the TUKAN table id|
|name| the name of the table in Spanish|
|name_en| the name of the table in English|
|description| a brief description of the table in Spanish|
|description_en| a brief description of the table in English|
|mode| internal TUKAN metadata|
|website| the url from where the table was obtained|
|last_updated| the date when the data was last updated|
|institution_id| the TUKAN institution id|
|frequency_id| the frequency of the data|
|categories| a list of TUKAN categories associated to the table|

There are four main metadata items that you need to be aware of: `id`, `website`, `frequency_id` and `categories`. Together, these four attributes tell you how the table is structured and how to extract it properly. In summary:

*   The `id` is required to query the table's data and variable dictionary.
*   The `website` allows you to validate the data against the original source.
*   The `frequency_id` tells you the table's periodicity, e.g., monthly, quarterly, or daily.
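
As a small illustration of using this metadata, the sketch below filters tables by `frequency_id`. It uses a few rows copied from the `inegi_tables` output above instead of calling the API; `tables_with_frequency` is a hypothetical helper name.

```python
import pandas as pd

# A few rows copied from the inegi_tables output (subset of columns).
sample_tables = pd.DataFrame(
    {
        "id": [
            "mex_inegi_api_employment",
            "mex_inegi_api_unemployment",
            "mex_inegi_census_households",
            "mex_inegi_econ_census",
        ],
        "frequency_id": ["quarterly", "monthly", "decennially", "quinquennial"],
    }
)


def tables_with_frequency(tables, frequency_id):
    """Return the ids of the tables published at the given frequency."""
    matches = tables.loc[tables["frequency_id"] == frequency_id, "id"]
    return matches.tolist()


monthly_tables = tables_with_frequency(sample_tables, "monthly")
```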

**Categories**, on the other hand, deserve a section of their own.

### Categories

Categories are attributes, such as geographic location, economic activity, or products, that appear in a variety of tables and have been standardized across TUKAN's data model. In essence, they are the table's columns, and they play a fundamental role in the data model.

The main thing you need to know about categories is that they follow a **hierarchical structure**, that **they are unique**, and that they appear in different datasets depending on the structure of the table.

For example, in the previous chunk we queried all of the tables associated to INEGI. Let's take a look at the row which contains the metadata for the Economic Census dataset.



In [22]:
inegi_tables[inegi_tables['id'] == 'mex_inegi_econ_census']

Unnamed: 0,id,name,name_en,description,description_en,mode,website,last_updated,data_updated,institution_id,frequency_id,tag_id,categories
4,mex_inegi_econ_census,Censo Económico,Economic Census,Los censos económicos contienen información ec...,The economic census contains economic informat...,standard,https://www.inegi.org.mx/programas/ce/2019/#In...,2021-08-27,2021-08-27T15:23:29Z,mex_inegi,quinquennial,,"[company_size, economic_activity, geography]"


As you can see, the `categories` column contains the list: `[company_size, economic_activity, geography]`, which means that the Economic Census dataset contains the `company_size, economic_activity` and `geography` categories. This tells you that you can extract the data from that table at each of these levels.
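
The same check can be run across all tables to find every dataset that carries a given category. The sketch below assumes the `categories` column holds Python lists, as the output above suggests, and uses hand-copied rows. `tables_with_category` is a hypothetical helper name.

```python
import pandas as pd

# Rows copied from the inegi_tables output (subset of columns).
sample_tables = pd.DataFrame(
    {
        "id": [
            "mex_inegi_api_employment",
            "mex_inegi_census_people",
            "mex_inegi_econ_census",
        ],
        "categories": [
            ["adjustment_type"],
            ["geography", "sex"],
            ["company_size", "economic_activity", "geography"],
        ],
    }
)


def tables_with_category(tables, category_id):
    """Return ids of tables whose categories list contains category_id."""
    has_category = tables["categories"].apply(lambda cats: category_id in cats)
    return tables.loc[has_category, "id"].tolist()


geo_tables = tables_with_category(sample_tables, "geography")
```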

In order to view the structure of each of these categories, you can query the category's dictionary in full with the following function:


In [23]:
# Function for future use
def get_category_dictionary(category_id):
    """Return the full dictionary (all values and levels) of one category."""
    # url and headers are the module-level values defined in the setup cell
    payload = {
        "type": "category_dict",
        "category": category_id,
        "operation": "all",
    }

    response = requests.post(url, headers=headers, data=json.dumps(payload))
    category_dict = pd.DataFrame(response.json()[category_id + "_dictionary"])

    return category_dict

For example, if we wanted to explore the `economic_activity` category, we would simply pass the category id to the function we just defined.

In [28]:
economic_activity_dic = get_category_dictionary('economic_activity')
economic_activity_dic.sort_values(by = "level")

Unnamed: 0,id,category_id,ref,name,name_en,parent_id,level
1889,73727,economic_activity,dfeefc621d16d0c,Actividad económica,Economic activity,,0
988,73729,economic_activity,761bc00426e1c48,Actividades secundarias,Secondary activities,dfeefc621d16d0c,1
1217,73730,economic_activity,8fd5b02b9f891fb,Actividades terciarias,Tertiary activities,dfeefc621d16d0c,1
967,73728,economic_activity,7460634ca523beb,Actividades primarias,Primary activities,dfeefc621d16d0c,1
1497,73753,economic_activity,afaceb85ed568ca,"Minería, gas, agua, y generación de energía el...",Mining and utilities,761bc00426e1c48,2
...,...,...,...,...,...,...,...
1097,75717,economic_activity,83207d4ec8f0115,Comercio al por mayor de miel,Wholesale trade honey,dcc7384f9641f5a,7
345,75777,economic_activity,26bd6acb1209b04,Comercio al por menor de carne de aves,Retail trade of poultry,abdf76ba61f3d5b,7
343,75742,economic_activity,2686ff7def558fd,Comercio al por mayor de otros materiales para...,Wholesale trade of other construction material...,c4ea9872661d023,7
1105,75786,economic_activity,83aa3369763b70d,Comercio al por menor de cerveza,Retail trade of beer,bd573630caa17a1,7


The columns represent:

|column| description|
|--|--|
|id| TUKAN's internal metadata|
|category_id| TUKAN's unique category id|
|ref| TUKAN's unique category value id|
|name| the name of the category value in Spanish|
|name_en| the name of the category value in English|
|parent_id| the category value's parent TUKAN id|
|level| the category value's position in the tree|

The above example showcases the hierarchical structure of the `economic_activity` category. As you can see, at the highest level (starting at 0) we have `Economic activity`, which is the parent of `Primary activities`, `Secondary activities` and `Tertiary activities`, which in turn have children of their own.
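
The `ref` and `parent_id` columns are enough to walk the tree in either direction. The sketch below uses five rows copied from the output above. The helper names `children_of` and `path_to_root` are hypothetical, and the loop assumes a top-level value has a missing or empty `parent_id`.

```python
import pandas as pd

# Rows copied from the economic_activity dictionary output above.
sample_dic = pd.DataFrame(
    {
        "ref": [
            "dfeefc621d16d0c",
            "761bc00426e1c48",
            "8fd5b02b9f891fb",
            "7460634ca523beb",
            "afaceb85ed568ca",
        ],
        "name_en": [
            "Economic activity",
            "Secondary activities",
            "Tertiary activities",
            "Primary activities",
            "Mining and utilities",
        ],
        "parent_id": [
            None,
            "dfeefc621d16d0c",
            "dfeefc621d16d0c",
            "dfeefc621d16d0c",
            "761bc00426e1c48",
        ],
        "level": [0, 1, 1, 1, 2],
    }
)


def children_of(category_dict, ref):
    """Return the English names of the direct children of ref."""
    rows = category_dict[category_dict["parent_id"] == ref]
    return rows["name_en"].tolist()


def path_to_root(category_dict, ref):
    """Return the names from ref up to its top-level ancestor."""
    by_ref = category_dict.set_index("ref")
    path = []
    # Stop when parent_id is missing (None/NaN) or an empty string.
    while isinstance(ref, str) and ref:
        path.append(by_ref.loc[ref, "name_en"])
        ref = by_ref.loc[ref, "parent_id"]
    return path


level_one = children_of(sample_dic, "dfeefc621d16d0c")
lineage = path_to_root(sample_dic, "afaceb85ed568ca")
```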

Let's look at a final example in which we query the dictionary for the `company_size` category.

In [29]:
company_size_dic = get_category_dictionary('company_size')
company_size_dic.sort_values(by = "level")

Unnamed: 0,id,category_id,ref,name,name_en,parent_id,level
0,887252,company_size,428fd6823cfdd4f,Pequeña y mediana (PyMe),Small and medium (SME),,0
1,887253,company_size,68441ecf39fb506,Fideicomiso,Trust,,0
2,941293,company_size,69c9fee53fa5b60,Desconocido,Unknown,,0
3,887251,company_size,7d24f374aa8e1da,Grande,Large,,0
6,941292,company_size,c5c2ef05ef0e112,Micro,Micro,,0
4,941290,company_size,995739f4f83bd0d,Pequeña,Small,428fd6823cfdd4f,1
5,941291,company_size,ad5c253664365db,Mediana,Medium,428fd6823cfdd4f,1


### Frequencies

Although the frequency concept is pretty straightforward, there are some important points to mention about how we structure dates when dealing with non-daily data.

TUKAN will always present dates in the following format: YYYY-MM-DD.

If the date associated with a data point is non-daily (e.g. monthly, quarterly, etc.), we always assign the first date of the period as its reference.

| Frequency | Sample Period          | Sample Period Representation in TUKAN |
|-----------|------------------------|---------------------------------------|
| Bi-weekly | 1st week of April 2021 | 2021-04-01                            |
| Monthly   | February 2021          | 2021-02-01                            |
| Quarterly | 4th quarter 2020       | 2020-10-01                            |
| Yearly    | 2019                   | 2019-01-01                            |
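
Because each date marks the start of its period, it can be turned back into a period label with pandas. The sketch below covers three of the frequencies in the table. The `PERIOD_ALIASES` mapping and both helper names are assumptions made for this example, and the keys are illustrative labels rather than confirmed `frequency_id` values.

```python
import pandas as pd

# Illustrative frequency labels mapped to pandas period aliases (an assumption).
PERIOD_ALIASES = {"monthly": "M", "quarterly": "Q", "yearly": "Y"}


def period_label(date_string, frequency):
    """Convert a period-start date into a readable period label."""
    period = pd.Period(date_string, freq=PERIOD_ALIASES[frequency])
    return str(period)


def period_start(label, frequency):
    """Convert a period label back into its YYYY-MM-DD start date."""
    period = pd.Period(label, freq=PERIOD_ALIASES[frequency])
    return period.start_time.strftime("%Y-%m-%d")


quarter = period_label("2020-10-01", "quarterly")
start = period_start("2020Q4", "quarterly")
```

Here `quarter` is the string `2020Q4` and `start` is `2020-10-01`, matching the quarterly row of the table.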

### Variables

Now that we know how to explore the datasets available in TUKAN's data environment, we need to explore the indicators or variables available for each particular dataset.

As in the previous example, there are two ways to do this: 1) use the Explore component on the [web-application](https://dashboard.tukanmx.com/), or 2) directly query the variable dictionary associated with a particular dataset.

For example, to find out which variables are associated with the Economic Census, **you will need the table's id**; then run the following code:

In [31]:
# Function for future use
def get_table_dictionary(table_id):
    """Return the variable dictionary of one table."""
    # url and headers are the module-level values defined in the setup cell
    payload = {
        "type": "variable_dict",
        "data_table": table_id,
        "operation": "all",
    }

    response = requests.post(url, headers=headers, data=json.dumps(payload))
    var_dict = pd.DataFrame(response.json()[table_id + "_variables_dictionary"])

    return var_dict

get_table_dictionary('mex_inegi_econ_census').head(10)

Unnamed: 0,id,data_table_id,ref,popularity,display_name,display_name_en,description,description_en,unit_id
0,3986,mex_inegi_econ_census,0737c3fa9290985,0,Acervo total de maquinaria y equipo de producción,Stock of machinery and production equipment,Es el valor actualizado o a costo de reposició...,It is the updated value or replacement cost on...,mxn
1,3954,mex_inegi_econ_census,091c91b271bafe9,0,Consumo de otros bienes y servicios,Consumption of other goods and services,Son los gastos de operación normal de la unida...,They are the normal operating expenses of the ...,mxn
2,3923,mex_inegi_econ_census,0ab8f80c9c48dca,0,Valor agregado censal bruto,Gross census added value,Es el valor de la producción que se añade dura...,It is the value of the production that is adde...,mxn
3,3960,mex_inegi_econ_census,0bb10a983242e8e,0,Gastos por servicios de comunicación,Communication services expenses,Es el valor de los gastos a costo de adquisici...,It is the value of the expenses at acquisition...,mxn
4,3962,mex_inegi_econ_census,0ed567d0fb48356,0,Gastos por reparaciones y refacciones para man...,Expenses for repairs and spare parts for curre...,Comprende los gastos por servicios de terceros...,It includes the expenses for third-party servi...,mxn
5,3945,mex_inegi_econ_census,17bb747b7e3b685,0,Gasto por consumo de bienes y servicios,Goods and services expenses,Es el valor de todos los bienes y servicios co...,It is the value of all the goods and services ...,mxn
6,3977,mex_inegi_econ_census,19589442c1b7c9f,0,Total de inventario inicial de productos en pr...,Initial inventory of products in process,Es el valor en libros (saldo en el inventario ...,It is the book value (balance in the initial i...,mxn
7,3937,mex_inegi_econ_census,1c4e186c6acb267,0,Horas trabajadas por personal por honorarios o...,Hours worked by freelance personnel,Es el total de horas normales y extraordinaria...,It is the total of normal and overtime hours d...,hours
8,3990,mex_inegi_econ_census,231119ea2d2188e,0,"Acervo total de mobiliario, equipo de oficina ...","Furniture, office equipment and other fixed as...",Es el valor actualizado o a costo de reposició...,It is the updated value or replacement cost of...,mxn
9,3973,mex_inegi_econ_census,2454315149b21dc,0,Activos fijos producidos para uso propio,Fixed assets produced for own use,Es el valor de la producción de los bienes mue...,It is the value of the production of movable a...,mxn


The columns represent:

|column| description|
|--|--|
|id| internal TUKAN metadata|
|data_table_id| TUKAN's unique data table id|
|ref| the unique id of the variable|
|popularity| internal TUKAN metadata|
|display_name| the name of the variable in Spanish|
|display_name_en| the name of the variable in English|
|description| a brief description of the variable in Spanish|
|description_en| a brief description of the variable in English|
|unit_id| the unit in which the variable is presented|

With the dictionary we get a full picture of the variables contained in the dataset, including each variable's name and description, as well as the unit in which it is presented.

Let's look at an example.

Suppose we are interested in analyzing the `Gross census added value` from the `mex_inegi_econ_census` dataset. We can query this variable in particular by passing the variable's id (in this case, `'0ab8f80c9c48dca'`) to the payload of the request.
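
One way to find that id without scanning the dictionary by eye is a keyword search on `display_name_en`. The sketch below runs on three rows copied from the output above; `find_variables` is a hypothetical helper name.

```python
import pandas as pd

# Rows copied from the variable dictionary output (subset of columns).
sample_vars = pd.DataFrame(
    {
        "ref": ["0737c3fa9290985", "091c91b271bafe9", "0ab8f80c9c48dca"],
        "display_name_en": [
            "Stock of machinery and production equipment",
            "Consumption of other goods and services",
            "Gross census added value",
        ],
        "unit_id": ["mxn", "mxn", "mxn"],
    }
)


def find_variables(var_dict, keyword):
    """Return rows whose English name contains keyword (case-insensitive)."""
    mask = var_dict["display_name_en"].str.contains(
        keyword, case=False, regex=False, na=False
    )
    return var_dict.loc[mask, ["ref", "display_name_en", "unit_id"]]


matches = find_variables(sample_vars, "added value")
```

The `ref` column of `matches` holds the value to pass in the `variables` list of the request.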

Remember that we previously observed that the Economic Census has the following categories: `[company_size, economic_activity, geography]`, so we will need to specify these in the payload of the request as well. For this particular example, we will also filter the request on the `economic_activity` category to showcase data only for `Retail trade` companies (i.e. the category value whose `ref` is `'23cf92d98dd7c11'`).

In [39]:
payload = {
    "type": "data_table",
    "operation": "sum",
    "language": "en",
    "categories": {
        "geography": "all",
        "economic_activity": ["23cf92d98dd7c11"],
        "company_size": "all"
    },
    "request": [
        {
            "table": "mex_inegi_econ_census",
            "variables": [
                "0ab8f80c9c48dca"
            ]
        }
    ]
}

response = requests.post(url, headers = headers, data = json.dumps(payload))
census_data = pd.DataFrame(response.json()['data'])
census_data['date'] = pd.to_datetime(census_data['date'])
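
The payload above follows a fixed shape, so it can be assembled by a small helper when several requests are needed. This is only a sketch: `build_payload` is a hypothetical name, and it reproduces just the keys that appear in the request above, without assuming anything else about the API.

```python
import json


def build_payload(table_id, variable_refs, categories, operation="sum", language="en"):
    """Assemble a data_table payload with the same keys as the request above."""
    return {
        "type": "data_table",
        "operation": operation,
        "language": language,
        "categories": categories,
        "request": [{"table": table_id, "variables": list(variable_refs)}],
    }


rebuilt_payload = build_payload(
    "mex_inegi_econ_census",
    ["0ab8f80c9c48dca"],
    {
        "geography": "all",
        "economic_activity": ["23cf92d98dd7c11"],
        "company_size": "all",
    },
)
body = json.dumps(rebuilt_payload)  # string to pass as data= in requests.post
```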

In [40]:
census_data

Unnamed: 0,date,economic_activity__ref,economic_activity,geography__ref,geography,company_size__ref,company_size,0ab8f80c9c48dca
0,2019-01-01,23cf92d98dd7c11,Retail trade,001f860c459c018,Texcoco,995739f4f83bd0d,Small,7.160580e+08
1,2019-01-01,23cf92d98dd7c11,Retail trade,001f860c459c018,Texcoco,ad5c253664365db,Medium,1.098086e+09
2,2019-01-01,23cf92d98dd7c11,Retail trade,001f860c459c018,Texcoco,c5c2ef05ef0e112,Micro,1.008958e+09
3,2019-01-01,23cf92d98dd7c11,Retail trade,00341dc0d8812f8,Cosolapa,69c9fee53fa5b60,Unknown,5.667100e+07
4,2019-01-01,23cf92d98dd7c11,Retail trade,003ed897a27e3ca,Sucilá,69c9fee53fa5b60,Unknown,1.097000e+07
...,...,...,...,...,...,...,...,...
3121,2019-01-01,23cf92d98dd7c11,Retail trade,ffc395b30ae92d9,Tezoyuca,69c9fee53fa5b60,Unknown,1.737130e+08
3122,2019-01-01,23cf92d98dd7c11,Retail trade,ffc93354918a89a,Uruachi,c5c2ef05ef0e112,Micro,7.848000e+06
3123,2019-01-01,23cf92d98dd7c11,Retail trade,ffda8cad00e3ca3,Juchitepec,995739f4f83bd0d,Small,3.848800e+07
3124,2019-01-01,23cf92d98dd7c11,Retail trade,ffda8cad00e3ca3,Juchitepec,c5c2ef05ef0e112,Micro,7.995900e+07
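
Since each row carries both the category name and its `ref`, the result can be aggregated at any level of a category's hierarchy. The sketch below sums the variable by company size and rolls `Small` and `Medium` up into their parent, `Small and medium (SME)`, using the refs from the `company_size` dictionary shown earlier. It runs on four rows copied from the output above, whose values are rounded as displayed, and `total_by_top_level_size` is a hypothetical helper name.

```python
import pandas as pd

# Rows copied from the census_data output (values as displayed, i.e. rounded).
sample_census = pd.DataFrame(
    {
        "geography": ["Texcoco", "Texcoco", "Texcoco", "Cosolapa"],
        "company_size__ref": [
            "995739f4f83bd0d",
            "ad5c253664365db",
            "c5c2ef05ef0e112",
            "69c9fee53fa5b60",
        ],
        "company_size": ["Small", "Medium", "Micro", "Unknown"],
        "0ab8f80c9c48dca": [7.160580e08, 1.098086e09, 1.008958e09, 5.667100e07],
    }
)

# Child ref -> parent ref, taken from the company_size dictionary.
PARENT_REF = {
    "995739f4f83bd0d": "428fd6823cfdd4f",  # Small  -> SME
    "ad5c253664365db": "428fd6823cfdd4f",  # Medium -> SME
}
TOP_LEVEL_NAME = {"428fd6823cfdd4f": "Small and medium (SME)"}


def total_by_top_level_size(census, value_column):
    """Sum value_column by company size, rolling children up to their parent."""

    def rollup_label(row):
        parent = PARENT_REF.get(row["company_size__ref"])
        return TOP_LEVEL_NAME[parent] if parent else row["company_size"]

    labels = census.apply(rollup_label, axis=1)
    return census.groupby(labels)[value_column].sum()


totals = total_by_top_level_size(sample_census, "0ab8f80c9c48dca")
```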


## Next Section

Now that we know how to explore TUKAN's data catalog and how our data model is structured, the next step is to explore the ins and outs of the payload you'll need to build in order to make a request to our API.

In the [following notebook](01%20-%20tukan_model.ipynb), we'll show how to get set up and cover the basics behind our data model and catalog.