# Pipelines workshop

In this workshop we are going to build a pipeline that reads in some external data and processes it so it can be used for further data science work.

We will use data on household incomes and household expenditures, both provided by CBS, the Dutch national statistics bureau.

In this workshop we will be using two libraries you may not have seen before:
- loguru (for logging)
- cbsodata (for interfacing with the CBS StatLine service)


In [1]:
import cbsodata
import pandas as pd
from lib.utils import *
from loguru import logger



## Reading in external data and storing it

First we are going to read some data from the internet. The first data set contains information on household incomes. 

When we read data from the internet, we should expect things to go wrong (there may not be an internet connection, the CBS server may be down, etc.). A robust pipeline detects this and complains loudly: things should never fail silently.

For each potential error, you should decide whether the pipeline can continue after the error has occurred or not.

In this case, the CBS household income data is crucial to our data science project, so if we can't read it in, we should abort the pipeline.

In [2]:
# First read in CBS household income data.
cbs_code_household_income = '84493NED'

try:
    hhi_raw = cbsodata.get_data(cbs_code_household_income)
except Exception as e:
    logger.critical(f"Unable to read CBS household income data. Error was: {e}. Aborting")
    exit(0) # Exit kills the entire kernel, which in this case is what we want.
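Some of the failures mentioned above (a dropped connection, a server that is briefly down) are temporary, so it can be worth trying a few times before aborting. The following is a minimal sketch of that idea, not part of the workshop code: the `fetch_with_retries` helper, its parameters, and the waiting policy are all assumptions made for illustration. It uses the standard `logging` module so it runs without extra dependencies; loguru's `logger.warning` could be used in the same place.

```python
import logging
import time

logger = logging.getLogger("pipeline")


def fetch_with_retries(fetch, attempts=3, delay_seconds=2.0):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable (hypothetical helper, not part of
    cbsodata), for example: lambda: cbsodata.get_data('84493NED')
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as e:
            logger.warning(f"Attempt {attempt} of {attempts} failed: {e}")
            if attempt == attempts:
                # Out of attempts: re-raise so the caller can log and abort.
                raise
            # Wait a little longer after each failure (assumed policy).
            time.sleep(delay_seconds * attempt)
```

In the cell above, the call inside the `try` block would then become `hhi_raw = fetch_with_retries(lambda: cbsodata.get_data(cbs_code_household_income))`. The surrounding `except` still decides that the pipeline aborts once all attempts have failed.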


Once we have read in the data, we want to convert it to a Pandas dataframe and then store it in a CSV file.

Since the data comes from the internet, it may not actually be what we expect it to be. The CBS server may be compromised, CBS may have changed its data format, etc.

This means we should also wrap reading and converting the data in a try / except block. Here too, when this fails, we should crash the kernel instead of continuing.

In [16]:
try:
    df_hhi = pd.DataFrame(hhi_raw)
    # If you want to see what happens when things go wrong, raise an exception yourself
    # raise Exception("Exception to test exception handling.")
except Exception as e:
    logger.critical(f"Unable to parse CBS household income data. Exception was: {e}")
    # Dump the unparsed data so we can inspect it later and find out what went wrong.
    dump_file_name = "cbs_household_income_" + timestamp() + ".dump"
    # Writing the dump file itself could also go wrong (disk full, wrong filename, etc.),
    # so log that failure too instead of letting it hide the original error.
    try:
        with open(dump_file_name, "w") as f:
            f.write(str(hhi_raw))
    except OSError as dump_error:
        logger.error(f"Unable to write dump file {dump_file_name}. Error was: {dump_error}")
    exit(0) # As before, abort: the pipeline cannot continue without this data.
df_hhi

| | ID | Inkomensbestanddelen | KenmerkenVanHuishoudens | Perioden | ParticuliereHuishoudens_1 | ParticuliereHuishoudensRelatief_2 | TotaalInkomen_3 | GemiddeldInkomen_4 | MediaanInkomen_5 | AandeelVanBrutoInkomen_6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 Inkomen als werknemer | Particuliere huishoudens | 2011 | 4847.2 | 66.0 | 290670.0 | 60.0 | 51.7 | 65.5 |
| 1 | 1 | 1 Inkomen als werknemer | Particuliere huishoudens | 2012 | 4850.8 | 65.4 | 295084.0 | 60.8 | 52.2 | 65.2 |
| 2 | 2 | 1 Inkomen als werknemer | Particuliere huishoudens | 2013 | 4817.2 | 64.5 | 296314.0 | 61.5 | 52.4 | 64.3 |
| 3 | 3 | 1 Inkomen als werknemer | Particuliere huishoudens | 2014 | 4819.7 | 64.3 | 300571.0 | 62.4 | 52.8 | 62.5 |
| 4 | 4 | 1 Inkomen als werknemer | Particuliere huishoudens | 2015 | 4839.3 | 63.9 | 300893.0 | 62.2 | 52.7 | 63.3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 77527 | 77527 | 14 BESTEEDBAAR INKOMEN | Vermogen: 10e 10%-groep | 2019 | 782.7 | 100.0 | 71988.0 | 92.0 | 61.9 | 62.5 |
| 77528 | 77528 | 14 BESTEEDBAAR INKOMEN | Vermogen: 10e 10%-groep | 2020 | 789.4 | 100.0 | 67064.0 | 85.0 | 63.4 | 61.5 |
| 77529 | 77529 | 14 BESTEEDBAAR INKOMEN | Vermogen: 10e 10%-groep | 2021 | 795.1 | 100.0 | 71383.0 | 89.8 | 67.7 | 61.7 |
| 77530 | 77530 | 14 BESTEEDBAAR INKOMEN | Vermogen: 10e 10%-groep | 2022 | 804.1 | 100.0 | 77624.0 | 96.5 | 72.0 | 61.1 |
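`pd.DataFrame` accepts many shapes of input, so a changed data format will not necessarily raise an exception on its own. An explicit check of the raw records can catch that case. The sketch below assumes the raw data is a list of dicts, one per row, and that the column names shown in the output above are the ones the rest of the pipeline needs. Both the `check_records` helper and the choice of required columns are assumptions for illustration.

```python
# Column names taken from the output above; which ones are truly required
# is an assumption made for this sketch.
EXPECTED_COLUMNS = {
    "ID",
    "Inkomensbestanddelen",
    "KenmerkenVanHuishoudens",
    "Perioden",
    "GemiddeldInkomen_4",
}


def check_records(records, expected_columns=EXPECTED_COLUMNS):
    """Raise ValueError if records do not look like the expected table."""
    if not isinstance(records, list) or not records:
        raise ValueError("Expected a non-empty list of records.")
    for position, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"Record {position} is not a dict.")
        # Set difference: the expected columns this record does not have.
        missing = expected_columns - record.keys()
        if missing:
            raise ValueError(
                f"Record {position} is missing columns: {sorted(missing)}"
            )
```

Calling `check_records(hhi_raw)` inside the same `try` block as the conversion would send a format change down the existing logging and dump path.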


We now have a dataframe we can use in the rest of our pipeline. Store it in its original form in the data/raw directory. Storing data as you received it, without further processing, makes your work more transparent and more easily reproducible.

In [None]:
df_hhi.to_csv("data/raw/cbs_hhi_" + timestamp() + ".csv", index=False)
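The `timestamp()` helper comes from `lib.utils`, which is not shown here, and writing the CSV fails if the `data/raw` directory does not exist yet. As a rough sketch of how both could be handled, the code below defines a hypothetical stand-in for `timestamp()` and a hypothetical `raw_file_path` helper that creates the directory before returning a timestamped path. The timestamp format is an assumption; the real helper may differ.

```python
from datetime import datetime
from pathlib import Path


def timestamp():
    """Hypothetical stand-in for lib.utils.timestamp, e.g. '20240131_154500'."""
    return datetime.now().strftime("%Y%m%d_%H%M%S")


def raw_file_path(prefix, directory="data/raw"):
    """Build a timestamped CSV path and make sure its directory exists."""
    target_dir = Path(directory)
    # parents=True creates 'data' as well; exist_ok=True makes reruns harmless.
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / f"{prefix}_{timestamp()}.csv"
```

With this in place, the cell above could be written as `df_hhi.to_csv(raw_file_path("cbs_hhi"), index=False)`. Because the file name includes the time of the run, earlier raw files are kept rather than overwritten, provided two runs do not start within the same second.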

### Now do it yourself

Read in data on household expenditure, also from CBS. The code for this data set is 83676NED. Make sure to use try / except blocks to log errors if they occur, and decide for each error whether the pipeline should continue or abort.