**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
  - [Unit uniformity](#toc1_2_)    
  - [Cross field validation](#toc1_3_)    
  - [Missing values](#toc1_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pyreadr
import datetime as pydt
import missingno as msno

### <a id='toc1_2_'></a>[Unit uniformity](#toc0_)

> When we collect data in the real world, the values of a column often don't maintain unit uniformity. For example, we can have temperature data with values in both Fahrenheit and Celsius, weight data in kilograms and in stones, dates in multiple formats, and so on. Verifying unit uniformity is imperative for accurate analysis.

In [2]:
df_accounts = pyreadr.read_r("../datasets/accounts.rds")[None]

In [3]:
df_accounts.head()

Unnamed: 0,id,date_opened,total
0,A880C79F,2003-10-19,169305.0
1,BE8222DF,"October 05, 2018",107460.0
2,19F9E113,2008-07-29,15297152.0
3,A2FE52A3,2005-06-09,14897272.0
4,F6DC2C08,2012-03-31,124568.0


In [4]:
df_accounts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   id           98 non-null     category
 1   date_opened  98 non-null     object  
 2   total        98 non-null     float64 
dtypes: category(1), float64(1), object(1)
memory usage: 4.6+ KB


We can see that the "date_opened" column has dates in different formats, and that the column is of "object" dtype. One way of handling such differently formatted dates is to call `pd.to_datetime()` with `format="mixed"` and `errors="coerce"`. With `format="mixed"` (available from pandas 2.0), the format is inferred for each value individually. With `errors="coerce"`, a value that cannot be parsed does not raise an error; it is converted to `NaT` (Not a Time) instead.

**Note:** Inferring the format value by value can be slow and resource intensive, depending on the number of observations in the dataframe. If only a few observations have an incorrect format, consider setting them to missing values and then imputing them (if applicable), or dropping them altogether.

In [5]:
df_accounts["date_opened"] = pd.to_datetime(
    df_accounts.date_opened, errors="coerce", format="mixed"
)

In [6]:
df_accounts.head()

Unnamed: 0,id,date_opened,total
0,A880C79F,2003-10-19,169305.0
1,BE8222DF,2018-10-05,107460.0
2,19F9E113,2008-07-29,15297152.0
3,A2FE52A3,2005-06-09,14897272.0
4,F6DC2C08,2012-03-31,124568.0
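
Since `errors="coerce"` silently turns unparseable values into `NaT`, it can be worth counting how many values were lost in the conversion. The following is a minimal sketch on a toy series: the first two values come from the `head()` output above, while `"not a date"` is a hypothetical bad value added for illustration. It assumes pandas 2.0 or later for `format="mixed"`.

```python
import pandas as pd

# Two formats seen in the accounts data, plus one hypothetical unparseable value
raw = pd.Series(["2003-10-19", "October 05, 2018", "not a date"])

# Each value is parsed individually; failures become NaT instead of raising
parsed = pd.to_datetime(raw, format="mixed", errors="coerce")

# Count values that were present before parsing but are NaT afterwards
n_failed = (parsed.isna() & raw.notna()).sum()

print(parsed)
print("Values that failed to parse:", n_failed)
```

Comparing against `raw.notna()` keeps values that were already missing before the conversion out of the failure count.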


### <a id='toc1_3_'></a>[Cross field validation](#toc0_)

> Cross field validation is the use of multiple fields in a dataset to sanity check the integrity of the data. For example, if we collected data on people's ages and dates of birth, we can check whether each age is correct by comparing the date of birth against the date on which the observation was collected.

- In the "divorce" dataset, we can cross validate the "marriage_duration" by comparing the "marriage_date" and "divorce_date" columns.

In [7]:
divorce_df = pd.read_csv(
    "../datasets/divorce.csv",
    parse_dates=["divorce_date", "marriage_date", "dob_man", "dob_woman"],
)

In [8]:
divorce_df.head()

Unnamed: 0,divorce_date,dob_man,education_man,income_man,dob_woman,education_woman,income_woman,marriage_date,marriage_duration,num_kids
0,2006-09-06,1975-12-18,Secondary,2000.0,1983-08-01,Secondary,1800.0,2000-06-26,5.0,1.0
1,2008-01-02,1976-11-17,Professional,6000.0,1977-03-13,Professional,6000.0,2001-09-02,7.0,
2,2011-01-02,1969-04-06,Preparatory,5000.0,1970-02-16,Professional,5000.0,2000-02-02,2.0,2.0
3,2011-01-02,1979-11-13,Secondary,12000.0,1981-05-13,Secondary,12000.0,2006-05-13,2.0,
4,2011-01-02,1982-09-20,Professional,6000.0,1988-01-30,Professional,10000.0,2007-08-06,3.0,


In [9]:
divorce_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2209 entries, 0 to 2208
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   divorce_date       2209 non-null   datetime64[ns]
 1   dob_man            2209 non-null   datetime64[ns]
 2   education_man      2205 non-null   object        
 3   income_man         2209 non-null   float64       
 4   dob_woman          2209 non-null   datetime64[ns]
 5   education_woman    2209 non-null   object        
 6   income_woman       2209 non-null   float64       
 7   marriage_date      2209 non-null   datetime64[ns]
 8   marriage_duration  2209 non-null   float64       
 9   num_kids           1333 non-null   float64       
dtypes: datetime64[ns](4), float64(4), object(2)
memory usage: 172.7+ KB


In [10]:
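# Approximate the duration in whole years (365-day years, truncated toward zero)
calc_mar_dur = (
    (divorce_df.divorce_date - divorce_df.marriage_date) / np.timedelta64(365, "D")
).astype("int")
divorce_df["calculated_marriage_duration"] = calc_mar_dur

# Flag rows where the recorded duration differs from the calculated one
val = divorce_df.marriage_duration == calc_mar_dur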
divorce_df.loc[
    ~val,
    [
        "divorce_date",
        "marriage_date",
        "marriage_duration",
        "calculated_marriage_duration",
    ],
]

Unnamed: 0,divorce_date,marriage_date,marriage_duration,calculated_marriage_duration
0,2006-09-06,2000-06-26,5.0,6
1,2008-01-02,2001-09-02,7.0,6
2,2011-01-02,2000-02-02,2.0,10
3,2011-01-02,2006-05-13,2.0,4
6,2005-01-03,1991-10-09,10.0,13
...,...,...,...,...
2189,2007-07-31,2003-09-05,4.0,3
2193,2000-08-31,1989-08-19,10.0,11
2195,2010-08-31,1995-01-27,14.0,15
2200,2001-10-31,1988-04-30,12.0,13
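
The calculation above divides by 365-day years, so leap days can shift the result when the divorce date falls close to a wedding anniversary. A possible alternative is to count completed calendar years instead. The sketch below is one way to do that; `whole_years_between` is a hypothetical helper name, and the toy rows are the first three rows of the divorce data shown earlier.

```python
import pandas as pd


def whole_years_between(start: pd.Series, end: pd.Series) -> pd.Series:
    """Count completed calendar years between two datetime columns."""
    years = end.dt.year - start.dt.year
    # Subtract one year when the anniversary has not yet occurred in the end year
    anniversary_not_reached = (end.dt.month < start.dt.month) | (
        (end.dt.month == start.dt.month) & (end.dt.day < start.dt.day)
    )
    return years - anniversary_not_reached.astype(int)


toy = pd.DataFrame(
    {
        "marriage_date": pd.to_datetime(["2000-06-26", "2001-09-02", "2000-02-02"]),
        "divorce_date": pd.to_datetime(["2006-09-06", "2008-01-02", "2011-01-02"]),
    }
)
toy["calculated_marriage_duration"] = whole_years_between(
    toy["marriage_date"], toy["divorce_date"]
)
print(toy)
```

For these three rows the result matches the 365-day approximation (6, 6 and 10 years); the two approaches can only disagree for dates within a few days of an anniversary.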


- In the "banking" dataset, we can cross validate the "inv_amount" by summing the "fund_A", "fund_B", "fund_C" and "fund_D" columns.

In [11]:
df_banking = pd.read_csv(
    "../datasets/banking.csv", index_col=0, parse_dates=["birth_date"]
)

In [12]:
df_banking.head()

Unnamed: 0,cust_id,birth_date,Age,acct_amount,inv_amount,fund_A,fund_B,fund_C,fund_D,account_opened,last_transaction
0,870A9281,1962-06-09,58,63523.31,51295,30105.0,4138.0,1420.0,15632.0,02-09-18,22-02-19
1,166B05B0,1962-12-16,58,38175.46,15050,4995.0,938.0,6696.0,2421.0,28-02-19,31-10-18
2,BFC13E88,1990-09-12,34,59863.77,24567,10323.0,4590.0,8469.0,1185.0,25-04-18,02-04-18
3,F2158F66,1985-11-03,35,84132.1,23712,3908.0,492.0,6482.0,12830.0,07-11-17,08-11-18
4,7A73F334,1990-05-17,30,120512.0,93230,12158.4,51281.0,13434.0,18383.0,14-05-18,19-07-18


In [13]:
# Store fund columns to sum against
fund_columns = ["fund_A", "fund_B", "fund_C", "fund_D"]

# Find rows where fund_columns row sum == inv_amount
inv_equ = df_banking[fund_columns].sum(axis=1) == df_banking["inv_amount"]

# Store consistent and inconsistent data
consistent_inv = df_banking[inv_equ]
inconsistent_inv = df_banking[~inv_equ]

# Print the number of inconsistent rows
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])

Number of inconsistent investments:  8
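
Because the fund columns are floats, an exact `==` comparison can flag rows that differ only by floating-point rounding. A tolerance-based comparison with `np.isclose` is one way to guard against that. The sketch below uses three rows from the `head()` output above; the tolerance of 0.01 is an assumption and should be chosen to suit the currency precision of the data.

```python
import numpy as np
import pandas as pd

# Rows 0, 1 and 4 of the banking data shown above
toy = pd.DataFrame(
    {
        "inv_amount": [51295, 15050, 93230],
        "fund_A": [30105.0, 4995.0, 12158.4],
        "fund_B": [4138.0, 938.0, 51281.0],
        "fund_C": [1420.0, 6696.0, 13434.0],
        "fund_D": [15632.0, 2421.0, 18383.0],
    }
)
fund_columns = ["fund_A", "fund_B", "fund_C", "fund_D"]

# Treat totals within an absolute tolerance of 0.01 as equal (assumed tolerance)
fund_total = toy[fund_columns].sum(axis=1)
inv_equ = np.isclose(fund_total, toy["inv_amount"], rtol=0, atol=0.01)

inconsistent_inv = toy[~inv_equ]
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])
```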


### <a id='toc1_4_'></a>[Missing values](#toc0_)

Missing data is one of the most common and most important data cleaning problems. Essentially, data is missing when no value is stored for a variable in an observation. Missing data is most commonly represented as NA or NaN, but it can also be encoded with arbitrary placeholder values such as 0, dot (.), underscore (_) or hyphen (-). There are several types of missingness. As a reminder, they can be described as follows:

1. Missing Completely at Random (`MCAR`): There is no systematic relationship between a column's missing values and any other values, whether in other columns or in the column itself.
2. Missing at Random (`MAR`): There is a systematic relationship between a column's missing values and other observed values.
3. Missing Not at Random (`MNAR`): There is a systematic relationship between a column's missing values and unobserved values.
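
To make the three types more concrete, the sketch below simulates each kind of missingness on synthetic data. Everything in it is hypothetical: the `age` and `income` columns, the thresholds and the probabilities are chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Synthetic data, for illustration only
df = pd.DataFrame(
    {
        "age": rng.integers(18, 90, size=n),
        "income": rng.normal(50_000, 15_000, size=n),
    }
)

# MCAR: every income value has the same 20% chance of being missing
mcar = rng.random(n) < 0.2

# MAR: missingness depends on another observed column (age), not on income itself
mar = (df["age"] > 60) & (rng.random(n) < 0.5)

# MNAR: missingness depends on the income value that ends up unobserved
mnar = (df["income"] > 70_000) & (rng.random(n) < 0.5)

# Under MAR, the share of missing incomes differs between the observed age groups
income_mar = df["income"].mask(mar)
print(income_mar.isna().groupby(df["age"] > 60).mean())
```

In the MAR case the pattern is visible from the observed `age` column, whereas in the MNAR case the values that explain the missingness are exactly the ones that were removed.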


<u>Useful functions and methods</u>

- Use `<ser|df>.isna().sum()` to count the missing values (per column for a dataframe). Use `<ser|df>.isna().mean()` to see the proportion of missing values in each column. To convert it to a percentage, multiply by 100, i.e., `.isna().mean().mul(100)`.
- Strategies for handling missing data
  - Drop the rows with missing values using `<ser|df>.dropna()` if 5% or less of the total observations are missing.
  - Impute the missing values with `mean/median/mode` (depending on the distribution and context) using `<ser|df>.fillna()`.
  - We can also impute the missing values by sub-group if the trend of the data varies between groups.

- A useful package for visualizing missing data is `missingno`. See the [documentation](https://github.com/ResidentMario/missingno) for more details.
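
The pandas methods listed above can be sketched together on a small example. The dataframe below is hypothetical toy data (not taken from the datasets in this notebook), and using the median for imputation is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
df = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b", "b"],
        "value": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
    }
)

# Count and percentage of missing values per column
missing_count = df.isna().sum()
missing_pct = df.isna().mean().mul(100)

# Strategy 1: drop the rows where "value" is missing
dropped = df.dropna(subset=["value"])

# Strategy 2: impute with the overall median
overall = df["value"].fillna(df["value"].median())

# Strategy 3: impute with the median of each sub-group
by_group = df["value"].fillna(df.groupby("group")["value"].transform("median"))

print(missing_count)
print(missing_pct)
print(by_group)
```

Here the overall median (6.5) sits between the two groups, while the group-wise medians (2.0 for `a`, 15.0 for `b`) keep each imputed value in line with its own group.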