# Final Blows to the Dirty Data
## Cross field validation with Pandas
<img src='images/cross.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@ian-beckley-1278367?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Ian Beckley</a>
        on 
        <a href='https://www.pexels.com/photo/top-view-photography-of-roads-2440013/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Introduction <small id='intro'></small>

### Setup <small id='setup'></small>

In [1]:
# Scientific libraries
import numpy as np
import pandas as pd
import datetime as dt

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

I generated a fake dataset to perform cross field validation:

In [2]:
# Load data
people = pd.read_csv('data/people.csv', parse_dates=['birthday'])
people.sample(10)

Unnamed: 0,first_name,last_name,birthday,weight,height,bmi,age
200,Tucker,Ross,2008-03-02,56,173,18,12
8344,Nash,Palmer,1989-02-09,74,188,20,31
7392,Beckett,Stewart,1999-07-04,63,185,18,21
3848,Andrea,Roberts,1997-08-14,94,172,31,23
6205,Avery,Ortiz,1987-10-03,54,178,17,33
1888,Hailey,Morgan,2002-01-25,72,188,20,18
4797,Weston,Cooper,2009-05-15,65,181,19,11
2243,Hudson,Carter,2003-07-11,87,178,27,17
7065,Beckham,Rodriguez,1993-05-11,85,160,33,27
2874,Jaxon,Sancherz,1996-02-17,53,173,17,24


In [3]:
people.shape

(10000, 7)

In [4]:
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   first_name  10000 non-null  object        
 1   last_name   10000 non-null  object        
 2   birthday    10000 non-null  datetime64[ns]
 3   weight      10000 non-null  int64         
 4   height      10000 non-null  int64         
 5   bmi         10000 non-null  int64         
 6   age         10000 non-null  int64         
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 547.0+ KB


### Cross Field Validation, Example 1 <small id='1'></small>

In the setup section, we loaded a fake `people` dataset. For the sake of the example, we are going to assume that this data came from two sources: census data that gives each individual's full name and birthday, which was later merged with their hospital records.

For an accurate analysis later, we should make sure that our data is valid. In this case, we can check the validity of two fields: age and BMI (Body Mass Index).

Let's start with age. We have to ensure that subtracting each person's birth year from the current year gives the value in the `age` column.

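When you are performing cross field validation, speed should be a main concern. Unlike our little example, you may have to deal with millions of observations. A flexible tool for cross field validation is the `apply` method of `pandas`. Keep in mind that row-wise `apply` calls a Python function once per row, so on large datasets a vectorized column operation is usually faster whenever the check can be written as one.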

Here is a simple example of `apply`:

In [5]:
def stupid_function(x):
    return f"Hello {x}!"

people['last_name'].apply(stupid_function).sample(5)

674        Hello Fry!
6623     Hello Myers!
1247    Hello Murphy!
2622    Hello Turner!
106     Hello Walker!
Name: last_name, dtype: object

The above was an example of element-wise execution on a single column. Called on a Series, `apply` takes a function as an argument and calls it on each element of that column. Called on a DataFrame with `axis` set to `1`, it instead passes each row to the function.
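To make the difference concrete, here is a minimal sketch on a hypothetical two-row frame (the `toy` frame and its values are made up for illustration and are not part of `people`). It contrasts calling `apply` on a single column with calling it on the whole DataFrame with `axis=1`:

```python
import pandas as pd

# Hypothetical toy frame, not the `people` dataset
toy = pd.DataFrame({
    "first_name": ["Ada", "Alan"],
    "last_name": ["Lovelace", "Turing"],
})

# Series.apply: the function receives one scalar value at a time
greetings = toy["last_name"].apply(lambda name: f"Hello {name}!")

# DataFrame.apply with axis=1: the function receives one row (a Series) at a time
full_names = toy.apply(
    lambda row: f"{row['first_name']} {row['last_name']}", axis=1
)

print(greetings.tolist())
print(full_names.tolist())
```

In the row-wise call, `row` behaves like a small labeled record, which is what lets a validation function read several columns at once.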

Let's create a function that validates a person's age:

In [6]:
def validate_age(row):
    """
    A function to validate
    person's age by subtracting
    birth year from the current year
    and comparing the result 
    to the given age.
    """
    # Store today's date
    today = dt.date.today()
    # Calculate age
    age = today.year - row['birthday'].year
    
    return age == row['age']

Since we are going to use `apply` for a row-wise operation, the function receives one row at a time. That's why we can access each column's value by its name, just like we would with a dictionary.

Using the `datetime` module we imported earlier, we store today's date. Then, we calculate the age by subtracting one year component from the other. For this to work, the `birthday` column must have a `datetime` data type, which is why we passed `parse_dates=['birthday']` when loading the data.
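If the dates had arrived as plain strings instead, a conversion along the following lines could be used before validating. This is only a sketch on a hypothetical `raw` frame with made-up values; `errors="coerce"` is one possible choice for handling unparseable entries, not the only one:

```python
import pandas as pd

# Hypothetical raw column where birthdays arrived as strings
raw = pd.DataFrame({"birthday": ["1989-02-09", "2008-03-02", "not a date"]})

# errors="coerce" turns unparseable values into NaT instead of raising
raw["birthday"] = pd.to_datetime(raw["birthday"], errors="coerce")

# Confirm the dtype and count the values that failed to parse
is_datetime = pd.api.types.is_datetime64_any_dtype(raw["birthday"])
n_unparsed = int(raw["birthday"].isna().sum())

# The .dt accessor exposes the year component for the whole column
birth_years = raw["birthday"].dt.year

print(is_datetime, n_unparsed)
```

Rows that end up as `NaT` would need to be inspected separately, since no age can be derived from them.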

In the `return` statement, we compare the calculated age with the given age, which yields `True` if they match and `False` otherwise. Let's call it on the dataset:

In [7]:
people['age_valid'] = people.apply(validate_age, axis=1)
people.sample(5)

Unnamed: 0,first_name,last_name,birthday,weight,height,bmi,age,age_valid
6197,Leo,Carter,1986-10-09,98,166,35,34,True
6510,Andrea,Roberts,1988-11-29,88,170,30,32,True
193,Avery,Gray,1980-11-02,86,169,30,40,True
6821,Hunter,Davis,1992-10-18,96,187,27,33,False
1085,Natasha,Peterson,1980-03-23,94,184,27,40,True


The function works as expected. Now we can subset the data for invalid ages, if there are any:

In [8]:
people[~people['age_valid']]

Unnamed: 0,first_name,last_name,birthday,weight,height,bmi,age,age_valid
46,Avery,Price,1983-08-24,90,181,27,52,False
167,Christian,Stewart,1996-01-21,92,165,33,27,False
394,Desirae,Edwards,1980-02-07,98,190,27,32,False
453,Evelyn,Johnson,1988-01-03,81,176,26,22,False
1047,Gunther,Gibson,1994-03-29,86,163,32,38,False
...,...,...,...,...,...,...,...,...
9599,Bradley,Colon,2003-07-23,55,174,18,43,False
9662,Julian,Miller,1998-09-03,55,190,15,79,False
9672,Easton,Chavez,1987-05-30,67,172,22,38,False
9867,Garett,Peterson,1983-08-07,79,180,24,76,False
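Rather than reading the number of mismatches off the truncated display, it can be computed directly from the boolean column. A minimal sketch, using a hypothetical `flags` frame that stands in for the `age_valid` column:

```python
import pandas as pd

# Hypothetical validity flags standing in for the `age_valid` column
flags = pd.DataFrame({"age_valid": [True, False, True, False, False]})

# ~ inverts the boolean mask; summing booleans counts the True values
n_invalid = int((~flags["age_valid"]).sum())

# The same inverted mask selects the offending rows
invalid_rows = flags[~flags["age_valid"]]

# Share of invalid rows, handy for judging how widespread the problem is
share_invalid = n_invalid / len(flags)

print(n_invalid, share_invalid)
```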


There were 75 rows with an invalid age. If you do the math, the ages do not match the birthdays. To correct these values, we could write a new function, but that would involve code repetition. Instead, we can update `validate_age` to replace any invalid values with valid ones:

In [9]:
def validate_age(row):
    """
    Compute a person's age by subtracting
    the birth year from the current year.
    The result replaces the recorded age,
    whether it was correct or not.
    """
    # Store today's date
    today = dt.date.today()
    # Calculate age from the year components
    age = today.year - row['birthday'].year

    # Return the computed age for every row, correct or not
    return age

In [10]:
# Undo the last operation
people.drop('age_valid', axis=1, inplace=True)
# Modify the age column instead of creating a new one
people['age'] = people.apply(validate_age, axis=1)

We can make sure the operation was successful with an `assert` statement:

In [26]:
today = dt.date.today()
# Check with assert, no output means successful
assert (today.year - people['birthday'].dt.year == people['age']).all()
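For larger datasets, the same check and correction can usually be written as whole-column arithmetic, without calling a Python function for every row. The sketch below assumes the same year-difference definition of age used above and runs on a hypothetical `toy` frame with made-up birthdays:

```python
import datetime as dt

import pandas as pd

current_year = dt.date.today().year

# Hypothetical toy frame mirroring the `birthday` and `age` columns;
# the second recorded age is deliberately off by five
toy = pd.DataFrame({
    "birthday": pd.to_datetime(["1990-05-01", "2000-12-31", "1985-07-15"]),
    "age": [current_year - 1990, current_year - 2000 + 5, current_year - 1985],
})

# Whole-column arithmetic: no Python-level loop over rows
expected_age = current_year - toy["birthday"].dt.year
age_valid = expected_age == toy["age"]

# Overwrite the recorded ages with the computed ones
toy["age"] = expected_age

print(age_valid.tolist())
```

The `.dt.year` accessor is the same one used in the `assert` above, so the validation and the final check share a single definition of age.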

### Cross Field Validation, Example 2 <small id='2'></small>

Next, we will validate the Body Mass Index (`bmi`) column.