# Final Blows to the Dirty Data
## Cross field validation with Pandas
<img src='images/cross.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@ian-beckley-1278367?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Ian Beckley</a>
        on 
        <a href='https://www.pexels.com/photo/top-view-photography-of-roads-2440013/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Introduction <small id='intro'></small>

Today, data rarely comes from a single source. More often than not, it is collected in different locations and merged together. A common challenge when merging data is data integrity: making sure our data is correct by using multiple fields to check the validity of another. In fancier terms, this process is called **Cross Field Validation**.

Sanity checking your dataset for data integrity is essential for accurate analysis and for training machine learning models. Cross field validation should come after you have dealt with most of the other cleaning issues, such as imputing missing values and ensuring field constraints are in place.

I wrote the code snippets for this post with execution time in mind. Since cross field validation may involve performing operations across multiple columns for millions of observations, execution speed is very important. The solutions suggested here are meant to scale better than explicit Python loops on larger datasets.

### Overview
1. [Introduction](#intro)
1. [Setup](#setup)
1. [Cross Field Validation, Example 1](#1)
1. [Cross Field Validation, Example 2](#2)
1. [Cross Field Validation, Speed Comparison](#speed)

### Setup <small id='setup'></small>

In [1]:
# Scientific libraries
import numpy as np
import pandas as pd
import datetime as dt

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

I generated a fake dataset to perform cross field validation on:

In [2]:
# Load data
people = pd.read_csv('data/people.csv', parse_dates=['birthday'])
people.sample(10)

Unnamed: 0,first_name,last_name,birthday,weight,height,bmi,age
3750,Jameson,Fry,2002-10-06,85,164,31,18
980,King,Ramos,1987-04-02,67,191,18,33
3032,Jon,Powell,1981-01-20,52,179,16,39
9265,Abdul,Ross,1985-04-21,80,181,24,35
5202,Leo,Perez,1999-04-29,65,166,23,21
5598,Teresa,Jenkins,1985-07-13,94,181,28,35
9067,Parker,Martin,1984-12-17,72,176,23,36
9785,Brooke,Foster,2000-12-15,57,176,18,20
2791,Gunther,Cook,1994-04-03,73,182,22,26
4029,Landon,Taylor,1999-07-07,65,161,25,21


In [3]:
people.shape

(10000, 7)

In [4]:
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   first_name  10000 non-null  object        
 1   last_name   10000 non-null  object        
 2   birthday    10000 non-null  datetime64[ns]
 3   weight      10000 non-null  int64         
 4   height      10000 non-null  int64         
 5   bmi         10000 non-null  int64         
 6   age         10000 non-null  int64         
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 547.0+ KB


### Cross Field Validation, Example 1 <small id='1'></small>

In the setup section, we loaded a fake `people` dataset. For the purposes of this example, we will assume the data was collected from two sources: census data that gives each individual's full name and birthday, which was later merged with their hospital records.

To have an accurate analysis, we should make sure our data is valid. In this case, we can check the validity of two fields: age and BMI (Body Mass Index).

Let's start with age. We have to ensure that when we subtract each person's birth year from the current year, the result matches the `age` column.

When you are performing cross field validation, speed should be a main concern. Unlike our little example, you may have to deal with millions of observations. A convenient method for cross field validation, and one that is considerably faster than an explicit Python loop, is the `apply` function of `pandas`.

Here is a simple example of `apply`:

In [5]:
def stupid_function(x):
    return f"Hello {x}!"

people['last_name'].apply(stupid_function).sample(5)

5498    Hello Edwards!
6493     Hello Howard!
7782        Hello Fry!
9010      Hello Brown!
7286    Hello Collins!
Name: last_name, dtype: object

The above was an example of element-wise execution on a single column: called on a column, `apply` takes a function as an argument and calls it on each element of that column. When `apply` is called on a whole DataFrame, it has an extra argument, `axis`, which is set to `0` by default (documented as `'index'`; `'rows'` is an accepted alias) and passes each column to the function. If we set it to `1` or `'columns'`, the function receives each row instead, which gives us row-wise execution.

> Note on the `axis` argument: `axis='rows'` means performing an operation along the row axis, which runs vertically because rows are stacked vertically. `axis='columns'` means performing an operation along the column axis, which runs horizontally because columns are laid out side by side. These two terms confuse a lot of people because they seem to do the opposite of what their names suggest. In reality, it just takes a shift in perspective: the name tells you the direction the operation moves along, not the slice the function receives.
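
To make the two directions concrete, here is a minimal sketch on a hypothetical three-row frame. The frame `toy` and its columns `a` and `b` are made up for illustration and are not part of the `people` dataset:

```python
import pandas as pd

# Hypothetical toy frame, not the `people` dataset
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_sums = toy.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_sums.tolist())  # [6, 60]
print(row_sums.tolist())  # [11, 22, 33]
```

With the default axis, the function sees whole columns and returns one value per column. With `axis=1`, it sees one row at a time and returns one value per row, which is the shape we need for cross field validation.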

Let's create a function that validates a person's age:

In [6]:
def validate_age(row):
    """
    A function to validate
    person's age by subtracting
    birth year from the current year
    and comparing the result 
    to the given age.
    """
    # Store today's date
    today = dt.date.today()
    # Calculate age
    age = today.year - row['birthday'].year
    
    return age == row['age']

Since we are going to use `apply` for a row-wise operation, its input will be each row of the dataset, passed as a Series. That's why we can access each column's value by its label.

Using the `datetime` module we imported earlier, we store today's date. Then, we calculate the age by subtracting the birth year from the current year. For this to work, the `birthday` column has to have a `datetime` data type. Note that this check compares years only, so it does not account for whether the birthday has already passed this year.
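
If `birthday` had been loaded as plain strings instead (for example, without `parse_dates`), one way to convert it is `pd.to_datetime`. The sketch below uses a small made-up frame, `toy`, rather than the `people` data:

```python
import pandas as pd

# Hypothetical data: birthdays stored as ISO-formatted strings
toy = pd.DataFrame({"birthday": ["1990-05-17", "2001-11-02"]})

# Convert the strings to datetime64 values
toy["birthday"] = pd.to_datetime(toy["birthday"])

# The .dt accessor is only available on datetime-like columns
birth_years = toy["birthday"].dt.year
print(birth_years.tolist())  # [1990, 2001]
```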

In the `return` statement, we compare the calculated age with the given age, which gives `True` if they match and `False` otherwise:

In [7]:
people['age_valid'] = people.apply(validate_age, axis=1)
people.sample(5)

Unnamed: 0,first_name,last_name,birthday,weight,height,bmi,age,age_valid
9305,Spencer,Johnson,2003-05-21,93,182,28,17,True
8946,Desirae,Wood,1989-06-26,58,191,15,31,True
4880,Piper,Carter,1994-04-14,63,164,23,26,True
7860,Evelyn,Ortiz,1981-07-31,59,190,16,39,True
3191,Garett,Fry,1994-11-01,84,188,23,26,True


The function works as expected. Now we can subset the data for invalid ages if any:

In [8]:
people[~people['age_valid']]

Unnamed: 0,first_name,last_name,birthday,weight,height,bmi,age,age_valid
46,Avery,Price,1983-08-24,90,181,27,52,False
167,Christian,Stewart,1996-01-21,92,165,33,27,False
394,Desirae,Edwards,1980-02-07,98,190,27,32,False
453,Evelyn,Johnson,1988-01-03,81,176,26,22,False
1047,Gunther,Gibson,1994-03-29,86,163,32,38,False
...,...,...,...,...,...,...,...,...
9599,Bradley,Colon,2003-07-23,55,174,18,43,False
9662,Julian,Miller,1998-09-03,55,190,15,79,False
9672,Easton,Chavez,1987-05-30,67,172,22,38,False
9867,Garett,Peterson,1983-08-07,79,180,24,76,False


There are 75 rows with an invalid age: if you do the math, the birth year and the `age` column do not agree. To correct these values, we could write a new function, but that would involve code repetition. Instead, we can update `validate_age` to replace every age with the calculated one:

In [9]:
def validate_age(row):
    """
    A function to correct a
    person's age by subtracting
    the birth year from the
    current year and returning
    the result.
    """
    # Store today's date
    today = dt.date.today()
    # Calculate age
    age = today.year - row['birthday'].year

    # Return the calculated age for every row,
    # whether the stored age was correct or not
    return age

In [10]:
# Undo the last operation
people.drop('age_valid', axis=1, inplace=True)
# Modify the age column instead of creating a new one
people['age'] = people.apply(validate_age, axis=1)

We can make sure the operation was successful with an `assert` statement:

In [11]:
today = dt.date.today()
# Check with assert, no output means success
assert (today.year - people['birthday'].dt.year == people['age']).all()
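
The comparison inside the `assert` can also be used on its own to flag mismatches, without a row-wise function. This is a sketch on made-up data: `toy` and the fixed `reference_year` are assumptions chosen so the result is reproducible, whereas the post uses today's year:

```python
import pandas as pd

# Hypothetical records; the second age disagrees with the birth year
toy = pd.DataFrame({
    "birthday": pd.to_datetime(["1990-05-17", "2001-11-02", "1985-01-30"]),
    "age": [30, 45, 35],
})

# Assumed reference year, fixed here for reproducibility
reference_year = 2020

# Column-wise arithmetic: one boolean per row, True where the age matches
age_valid = (reference_year - toy["birthday"].dt.year) == toy["age"]

# Rows whose stored age does not match the calculated one
invalid_rows = toy[~age_valid]
print(invalid_rows["age"].tolist())  # [45]
```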

### Cross Field Validation, Example 2 <small id='2'></small>

Next, we will validate the Body Mass Index column.
> Body mass index is a value derived from the mass and height of a person.

A quick Google search gives us the formula to calculate BMI:
<img src='images/1.png'></img>
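
As a quick arithmetic check of the formula, take the first row of the sample shown in the setup section (weight 85 kg, height 164 cm). The helper name `bmi_from_cm` is made up for this sketch:

```python
def bmi_from_cm(weight_kg, height_cm):
    """BMI = weight (kg) divided by the square of the height (m)."""
    # The dataset stores height in centimeters, so convert to meters first
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

print(round(bmi_from_cm(85, 164), 1))  # 31.6
```

The result, about 31.6, is consistent with the stored integer value of 31 for that row if the decimals were dropped.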

Using the ideas from the first example, we will create a function for BMI that replaces every stored BMI with the calculated value:

In [12]:
def validate_bmi(row):
    """
    A function to validate BMI
    of a person. Calculates the BMI 
    by dividing the mass (kg)
    by the square of the height (m).
    """
    # Actual BMI
    bmi = row['weight'] / (row['height'] / 100) ** 2
    # Replace with bmi, correct or not
    row['bmi'] = bmi
    
    return row['bmi']

In [13]:
# Call the function
people['bmi'] = people.apply(validate_bmi, axis=1)

In [14]:
# Check the results, again no output means success
# np.isclose avoids relying on exact floating-point equality
assert np.isclose(people['weight'] / (people['height'] / 100) ** 2, people['bmi']).all()

It is good practice to start the names of all your validation functions with `validate_`. This signals to readers of your code that you are performing a validation.

### Cross Field Validation, Speed Comparison <small id ='speed'></small>

In this section, we will compare the speed of different methods for cross field validation. We will start with the `apply` function for validating `bmi`:

In [15]:
def validate_bmi(row):
    """
    A function to validate BMI
    of a person. Calculates the BMI 
    by dividing the mass (kg)
    by the square of the height (m).
    """
    # Actual BMI
    bmi = row['weight'] / (row['height'] / 100) ** 2
    # Replace with bmi, correct or not
    row['bmi'] = bmi
    
    return row['bmi']

In [16]:
%%timeit
people['bmi'] = people.apply(validate_bmi, axis=1)

319 ms ± 29.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


It took around 0.3 seconds for a dataset of 10,000 rows. Next, we will use a for loop with the pandas `iterrows()` method:

In [17]:
def validate_bmi(df):
    """
    A function to validate
    the bmi using iterrows()
    """
    for index, row in df.iterrows():
        # Calculate bmi
        bmi = row['weight'] / (row['height'] / 100) ** 2
        # Replace the old value
        df.loc[index, 'bmi'] = bmi
    return df

In [18]:
%%timeit
df = validate_bmi(people)

3.22 s ± 35.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


It took about 10 times longer! In absolute terms, this gap grows with the size of the dataset. I don't expect an explicit for loop to beat the `apply` function, but let's also try `itertuples()`, which is generally faster than `iterrows()`:

In [19]:
def validate_bmi(df):
    """
    A function to validate
    the bmi using itertuples()
    """
    for row in df.itertuples():
        # Calculate bmi using the named fields of each tuple
        bmi = row.weight / (row.height / 100) ** 2
        # Replace the old value; row.Index holds the index label
        df.loc[row.Index, 'bmi'] = bmi
    return df

In [20]:
%%timeit
df = validate_bmi(people)

2.18 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


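Still much slower than `apply`. So, the general rule of thumb for cross field validation is to prefer the `apply` function over explicit loops. When a check can be written as plain column arithmetic, as in the `assert` statements above, that vectorized form is worth timing as well.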
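
For checks that reduce to column arithmetic, such as BMI, the whole-column expression already used in the `assert` statements can compute the corrected column directly. The sketch below only shows that it gives the same values as the row-wise `apply` on a small made-up frame (`toy` is hypothetical). It makes no timing claim, so measure both with `%%timeit` on your own data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with weight in kg and height in cm
toy = pd.DataFrame({"weight": [85, 67, 52], "height": [164, 191, 179]})

def validate_bmi(row):
    """Row-wise BMI, as in the post: weight (kg) / height (m) squared."""
    return row["weight"] / (row["height"] / 100) ** 2

# Row-wise: the function is called once per row
bmi_apply = toy.apply(validate_bmi, axis=1)

# Column-wise: the same formula applied to whole columns at once
bmi_vectorized = toy["weight"] / (toy["height"] / 100) ** 2

print(np.isclose(bmi_apply, bmi_vectorized).all())  # True
```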