# Pipelines workshop 2: creating an ABT

In the first notebook for this workshop, "importing and preparing the data", you read in data from an external source and processed it into two CSV files: `data/processed/income.csv` and `data/processed/expenditure.csv`.

This data consists of information on household incomes and household expenditures, both provided by the Dutch national statistics bureau CBS. 

In this notebook we are going to combine the two CSV files into one large table so that we can analyze the link between household expenses and household income. A table in which multiple data sets are combined so they can be analyzed together is known as an Analytical Base Table (ABT).

As this notebook, too, is part of our pipeline, we will be rigorously logging all the unexpected situations we encounter in our work.




In [None]:
import pandas as pd
from lib.utils import *
from loguru import logger
from datetime import datetime

## Read in our data

Reading in our data should not cause problems, but just in case it does: log errors and halt execution if one of our two data files does not exist.

- Load income data into a dataframe `dfi`
- Load expenditure data into a dataframe `dfe`

In [5]:
# Your code goes here
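
One possible way to approach this, as a sketch rather than the reference solution: a small helper checks that the file exists, logs a critical error if it does not, and raises an exception so that execution halts. The helper name `load_csv_or_abort` is hypothetical, and the fallback to the standard `logging` module is only there so the sketch also runs where loguru is not installed.

```python
from pathlib import Path

import pandas as pd

try:
    from loguru import logger
except ImportError:  # assumption: fall back to stdlib logging if loguru is missing
    import logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("abt")


def load_csv_or_abort(path):
    """Hypothetical helper: read a CSV file, or log a critical error and halt."""
    path = Path(path)
    if not path.is_file():
        logger.critical(f"Data file not found: {path}")
        # Raising an exception stops execution of the current cell.
        raise FileNotFoundError(path)
    return pd.read_csv(path)
```

With this helper, the two frames could be loaded as `dfi = load_csv_or_abort('data/processed/income.csv')` and `dfe = load_csv_or_abort('data/processed/expenditure.csv')`.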

## Performing sanity checks

Before we proceed, we need to check our data to see if it contains the type of values we expect.

Let's do this for the "expenditure" file:

- Year should be a reasonable value (between 2010 and the current year, say)
- The value in the "all" column should be at least as high as the values in the other columns combined (it does not need to be *equal* to their sum because we left out a number of expenditure categories during the data import).
- The same is true for the "housing_energy" column: its value should be at least as high as that of the "energy" column.

Before we get started, we need to think about what we want to do when the files turn out to be incorrect. In the first notebook, we logged a critical error and aborted the import. This made sense, because we don't want to create invalid data files for the rest of the process.

We could argue we are doing something similar here, because we're creating an ABT. However, if we halt the process as soon as we encounter the first error, we will not know whether the rest of the process is error-free. What we want is to let the entire process run, log any errors we encounter, and only at the end, before we save the final ABT file, exit if any errors were encountered.

So what we will do is store the number of errors encountered in a variable. Then, at the end, when that number is greater than zero, we will report it and exit before combining the two data sets into a single ABT file.

Meanwhile, every time we encounter an error we will either log it or raise an exception.
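
The "run everything, decide at the end" approach described above can be sketched as follows. This is an illustration under stated assumptions: the helper `record_error` and the toy checks are hypothetical, and the sketch collects messages in a list, whereas the notebook itself uses a plain `num_errors` counter.

```python
try:
    from loguru import logger
except ImportError:  # assumption: fall back to stdlib logging if loguru is missing
    import logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("abt")

# Every problem we find ends up here; its length is the error count.
errors = []


def record_error(message):
    """Hypothetical helper: log the problem and remember it, without halting."""
    errors.append(message)
    logger.error(message)


# Toy checks standing in for the real validation steps (made-up values).
checks = [
    (2015 >= 2010, "year is out of range"),
    (100 >= 250, "'all' is smaller than the other columns combined"),
]
for passed, message in checks:
    if not passed:
        record_error(message)

# Only at the very end do we decide whether the pipeline may continue.
pipeline_ok = len(errors) == 0
```

Keeping the messages rather than only a count makes it possible to print a summary of everything that went wrong just before aborting.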


In [30]:
# Global variable counting the number of errors encountered.
num_errors = 0

In [None]:
# Check if the year column contains reasonable values
# (from 2010 up to and including the current year; between() is inclusive)
if not dfe['year'].between(2010, datetime.now().year).all():
    num_errors += 1
    # Raising an exception halts execution in the current cell.
    raise Exception("Invalid expenditure file: contains invalid year values")



In [32]:
# Check that the value in the 'all' column is at least as large as the other columns combined
# (except 'energy' as that is part of 'housing_energy')
dfe_validate_num_cols = pd.DataFrame(dfe['all'])
dfe_validate_num_cols['other'] = dfe[['food','alc_tobacco','clothes','housing_energy', 'transportation']].sum(axis = 1)
if len(dfe_validate_num_cols[dfe_validate_num_cols['all'] < dfe_validate_num_cols['other']]) > 0:
    num_errors += 1
    # Don't raise an exception because we want to continue executing code in this cell.
    logger.error("Invalid expenditure file: other columns combined are larger than the 'all' column")

# Check that 'energy' does not exceed 'housing_energy'
if len(dfe[dfe['housing_energy'] < dfe['energy']]) > 0:
    num_errors += 1
    logger.error("Invalid expenditure file: energy is larger than housing and energy combined.")


Now that we're done with the expenses data, we can start validating the income data. You can do this yourself, below.

In [None]:
# Your code goes here.

Finally, we need to check if we can join our two data frames. For this to be possible, the values in the 'year' columns for both files must match. Since we know the incomes file has more data than the expenses file, we are satisfied if the years in the expenses file are present in the years column in the income file.

In [47]:
# To check if values from the expenses file occur in the incomes file we use Python's built-in "set" type
smallest_set = set(dfe['year'])
largest_set = set(dfi['year'])
# Subtracting sets like this yields all values that are in smallest_set that are not in largest_set
if smallest_set - largest_set:
    num_errors += 1
    logger.error('Not all years in the smallest data set occur in the largest data set')
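
The set subtraction used above can be tried out on its own. The year values below are toy data chosen for illustration; in the notebook they come from `dfe['year']` and `dfi['year']`.

```python
# Toy year values; the real ones come from the two data frames.
expense_years = {2015, 2020}
income_years = {2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020}

# Set difference: years present in the expenses data but missing from incomes.
missing_years = expense_years - income_years

# An empty set is falsy, so `if missing_years:` only fires when something is missing.
all_present = not missing_years

# issubset() expresses the same check directly.
all_present_alt = expense_years.issubset(income_years)

# A year without income data shows up in the difference.
missing_with_2025 = {2015, 2025} - income_years
```

Logging the contents of the difference, not just the fact that it is non-empty, would tell you exactly which years are missing.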



## Joining data frames together

We can now join the two data sets together - *if* we haven't encountered any errors.



In [48]:
if num_errors > 0:
    logger.critical("Errors were encountered in CSV files. Aborting.")
    # Exit with a non-zero status so the calling process can tell the run failed.
    exit(1)

If we're still here, we can start joining.

The problem is that the incomes file has much more data than the expenses file. This means we need to filter out all the rows from the incomes data for which no matching expenses row can be found. We could simply drop them, but this would cause us to lose information on changes in incomes in the years for which no data is available in the expenses file.

Let's add an extra column to the incomes file that holds the change in total income compared to the previous year.

In [56]:
dfi['total_inc_prev_mill'] = dfi['total_inc_mill'].shift(1)
dfi['total_inc_delta_mill'] = dfi['total_inc_mill'] - dfi['total_inc_prev_mill']

Do the same for the columns `mean_inc_k` and `median_inc_k`.

In [None]:
# Your code goes here.
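
As a sketch of how the "previous year" and "delta" columns behave, here is the same pattern applied to toy data in a loop. The figures are made up (not CBS data), and the naming scheme for the new columns (`mean_inc_prev_k`, `mean_inc_delta_k`) is an assumption modeled on `total_inc_prev_mill`.

```python
import pandas as pd

# Toy income figures, deliberately out of order.
toy = pd.DataFrame({
    "year": [2013, 2011, 2012],
    "total_inc_mill": [120.0, 100.0, 110.0],
    "mean_inc_k": [32.0, 30.0, 31.5],
})

# shift(1) takes the value from the previous *row*, so sort by year first.
toy = toy.sort_values("year").reset_index(drop=True)

for col in ["total_inc_mill", "mean_inc_k"]:
    # Split 'total_inc_mill' into ('total_inc', 'mill') to build the new names.
    stem, unit = col.rsplit("_", 1)
    toy[f"{stem}_prev_{unit}"] = toy[col].shift(1)
    toy[f"{stem}_delta_{unit}"] = toy[col] - toy[f"{stem}_prev_{unit}"]
```

The first row has no previous year, so its "prev" and "delta" values are `NaN`. If you only need the delta, `toy[col].diff()` computes the same difference in one step.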

Then, drop the rows in the income data frame for which no matching row exists in the expenses data frame and perform the join.

Note that while Pandas has a `join` method, it joins on the index by default, which makes it a poor fit for joining two data frames on a column like this. Use the `merge` function instead.

In [None]:
# Name the join key explicitly instead of relying on all shared column names.
df_joined = pd.merge(dfi[dfi['year'].isin(dfe['year'])], dfe, on='year')
df_joined

|   | year | num_hh | total_inc_mill | mean_inc_k | median_inc_k | total_inc_mill_prev | total_inc_prev_mill | total_inc_delta_mill | all | food | alc_tobacco | clothes | housing_energy | energy | transportation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 7568.5 | 292094.0 | 38.6 | 32.5 | 306084.0 | 292108.0 | -14.0 | 33763.0 | 3721.0 | 1014.0 | 1591.0 | 10646.0 | 1582.0 | 4351.0 |
| 1 | 2020 | 7894.5 | 366657.0 | 46.4 | 38.1 | 385822.0 | 357843.0 | 8814.0 | 35211.0 | 4458.0 | 1222.0 | 1394.0 | 11577.0 | 1540.0 | 3955.0 |
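
To see what `merge` does with years that only occur in one of the two frames, the following sketch uses toy data (made-up values, not the CBS figures). The `validate` and `indicator` options are optional extras and are not part of the notebook's own join.

```python
import pandas as pd

# Toy frames: incomes cover more years than expenses.
incomes = pd.DataFrame({"year": [2014, 2015, 2016], "mean_inc_k": [37.0, 38.0, 39.0]})
expenses = pd.DataFrame({"year": [2015], "food": [3700.0]})

# An inner merge on 'year' keeps only the years present in both frames.
# validate='one_to_one' raises a MergeError if a year occurs twice on either side.
joined = pd.merge(incomes, expenses, on="year", how="inner", validate="one_to_one")

# indicator=True adds a '_merge' column showing where each row came from,
# which makes the income years that get dropped visible.
audit = pd.merge(incomes, expenses, on="year", how="left", indicator=True)
dropped_years = audit.loc[audit["_merge"] == "left_only", "year"].tolist()
```

Because an inner merge already discards unmatched rows, filtering with `isin` beforehand is not strictly required, though it makes the intent explicit.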


Write the joined data frame to a CSV file (don't forget to log the timestamp of when you did this!)

In [None]:
csv_logfile = 'data/processed/abt.csv.log'
csv_file = 'data/processed/abt.csv'
# Your code goes here.
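
One way this step could look, as a sketch and not the reference solution: the helper name `write_csv_with_timestamp` is hypothetical, as is the choice to append a plain text line to the log file and to write the CSV without the index column.

```python
from datetime import datetime
from pathlib import Path

import pandas as pd


def write_csv_with_timestamp(df, csv_file, csv_logfile):
    """Hypothetical helper: write df to csv_file and note when that happened."""
    csv_file, csv_logfile = Path(csv_file), Path(csv_logfile)
    csv_file.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: leave out the index so no extra unnamed column appears on re-import.
    df.to_csv(csv_file, index=False)
    timestamp = datetime.now().isoformat(timespec="seconds")
    # Append, so that earlier runs stay visible in the log file.
    with csv_logfile.open("a", encoding="utf-8") as fh:
        fh.write(f"{timestamp} wrote {len(df)} rows to {csv_file}\n")
    return timestamp
```

In the notebook this would be called as `write_csv_with_timestamp(df_joined, csv_file, csv_logfile)`.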


## Done

Congratulations! You have now created a robust pipeline for importing data from the internet, cleaning it up and validating its contents.

A pipeline like this can be handed over to an ML engineer or a DevOps engineer for deployment.
