# Pipelines workshop 2: creating an ABT

In the first notebook of this workshop, "Importing and preparing the data", you read in data from an external source and processed it into two CSV files: `data/processed/income.csv` and `data/processed/expenditure.csv`.

This data consists of information on household incomes and household expenditures, both provided by the Dutch national statistics bureau CBS. 

In this notebook we are going to combine the two CSV files into one large table so that we can analyze the link between household expenditure and household income. A table in which multiple data sets are combined so they can be analyzed together is known as an Analytical Base Table (ABT).
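To give a rough idea of what combining two tables into an ABT can look like, here is a minimal sketch using `pd.merge` on two tiny, made-up tables. The `income` column and the choice of `year` as the join key are assumptions for illustration only; the real files may need a different key or more columns.

```python
import pandas as pd

# Toy stand-ins for the two processed files (all values are made up).
income = pd.DataFrame({"year": [2015, 2016, 2017], "income": [30.0, 31.5, 32.0]})
expenditure = pd.DataFrame({"year": [2015, 2016, 2018], "all": [28.0, 29.0, 30.5]})

# An inner join keeps only the years that occur in both tables.
abt = pd.merge(income, expenditure, on="year", how="inner")
print(abt)
```

An inner join silently drops years that appear in only one table, so in a pipeline it is worth comparing row counts before and after the merge.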

Because this notebook is also part of our pipeline, we will rigorously log every unexpected situation we encounter in our work.




In [2]:
import pandas as pd
from lib.utils import *
from loguru import logger
from datetime import datetime

## Read in our data

Reading in our data should not cause problems, but just in case it does: log errors and halt execution if one of our two data files does not exist.

- Load income data into a dataframe `dfi`
- Load expenditure data into a dataframe `dfe`

In [2]:
# Your code goes here

In [None]:
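try:
    dfi = pd.read_csv("data/processed/income.csv")
except Exception as e:
    logger.critical(f"Unable to read income csv file: {e}")
    # Exit with a non-zero status so the pipeline registers the failure.
    raise SystemExit(1)

try:
    dfe = pd.read_csv("data/processed/expenditure.csv")
except Exception as e:
    logger.critical(f"Unable to read expenditure csv file: {e}")
    # Exit with a non-zero status so the pipeline registers the failure.
    raise SystemExit(1)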

## Performing sanity checks

Before we proceed, we need to check our data to see if it contains the type of values we expect.

Let's do this for the "expenditure" file:

- Year should be a reasonable value (between 2010 and the current year, say)
- The value in the "all" column should be at least as high as the values in the other columns combined (it does not need to be *equal* to their sum because we left out a number of expenditure categories during the data import).
- The same is true for the "housing_energy" column: its value should be at least as high as that of the "energy" column, which it includes.

Before we get started, we need to think about what we want to do when the files turn out to be incorrect. In the first notebook, we logged a critical error and aborted the import. This made sense, because we don't want to create invalid data files for the rest of the process.

We could argue that we are doing something similar here, because we're creating an ABT. However, if we halt the process as soon as we encounter our first error, we will not know whether the rest of the process is error-free. Instead, we want to let the entire process run, log any errors we encounter, and only at the end, just before we save the final ABT file, exit if any errors were encountered.

So we will store the number of errors encountered in a variable. At the end, if that number is greater than zero, we will report it and exit before writing the ABT file.

Meanwhile, every time we encounter an error we will either log it or raise an exception.
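The final step of this approach, checking the counter before the ABT is written, could look roughly like the sketch below. The helper name `finish_pipeline` is hypothetical, and `print` stands in for the notebook's logger so that the snippet runs on its own.

```python
import sys


def finish_pipeline(num_errors: int) -> None:
    """Stop with a non-zero exit code if any check failed (hypothetical helper)."""
    if num_errors > 0:
        # In the notebook this message would go to the logger instead.
        print(f"{num_errors} error(s) encountered; not writing the ABT file")
        sys.exit(1)
    print("No errors encountered; safe to write the ABT file")


# With zero errors the function returns normally and the pipeline continues.
finish_pipeline(0)
```

Exiting with a non-zero status lets whatever runs the pipeline detect that this step failed.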


In [30]:
# Global variable counting the number of errors encountered.
num_errors = 0

In [None]:
# Check that every year lies between 2010 and the current year (inclusive)
if not dfe['year'].between(2010, datetime.now().year).all():
    num_errors += 1
    # Raising an exception halts execution in the current cell.
    raise Exception("Invalid expenditure file: contains invalid year values")



In [33]:
# Check if the values in the 'all' column are larger than those in the other columns
# (except 'energy' as that is part of 'housing_energy')
dfe_validate_num_cols = pd.DataFrame(dfe['all'])
dfe_validate_num_cols['other'] = dfe[['food','alc_tobacco','clothes','housing_energy', 'transportation']].sum(axis = 1)
if len(dfe_validate_num_cols[dfe_validate_num_cols['all'] < dfe_validate_num_cols['other']]) > 0:
    num_errors += 1
    # Don't raise an exception because we want to continue executing code in this cell.
    logger.error("Invalid expenditure file: other columns combined are larger than the 'all' column")

# Check that 'energy' is not larger than 'housing_energy'
if len(dfe[dfe['housing_energy'] < dfe['energy']]) > 0:
    num_errors += 1
    logger.error("Invalid expenditure file: energy is larger than housing and energy combined.")


Now that we're done with the expenditure data, we can start validating the income data. You can do this yourself, below.

In [None]:
# Your code goes here.
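
One possible starting point, offered only as a sketch: the helper below counts failed checks on an income table. The function name `count_income_errors`, the `income` column, and the "no missing values" rule are all assumptions; the actual income file may have different columns and may call for different checks.

```python
import pandas as pd
from datetime import datetime


def count_income_errors(df: pd.DataFrame, value_col: str = "income") -> int:
    """Count failed sanity checks on an income table (hypothetical helper)."""
    errors = 0

    # Same year rule as for the expenditure data: 2010 up to the current year.
    if not df["year"].between(2010, datetime.now().year).all():
        errors += 1

    # Assumed rule: every row should have an income value.
    if df[value_col].isna().any():
        errors += 1

    return errors


# Toy data (made up): one year is out of range and one value is missing.
toy = pd.DataFrame({"year": [2015, 1999], "income": [30.0, None]})
print(count_income_errors(toy))
```

In the notebook, the returned count could be added to `num_errors`, with each failed check also written to the log.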