## Pandas DataFrame Validation with Engarde

In this notebook, we'll look at how to validate data within `pandas.DataFrame` objects. Tom Augspurger has created the library [engarde](https://github.com/TomAugspurger/engarde), which lets you test a DataFrame against specific validation rules, either by decorating your functions or by calling its built-in check functions directly.

In [2]:
import pandas as pd
import engarde.decorators as ed
from datetime import datetime

In [3]:
sales = pd.read_csv('../data/sales_data_duped_with_nulls.csv')

## Data Quality Check

In [4]:
sales.head()

Unnamed: 0,timestamp,city,store_id,sale_number,sale_amount,associate
0,2017-03-24T12:00:00,Hammondborough,2,,-309.0,Emily Gregory
1,2017-03-05T14:00:00,Anthonystad,5,1196.0,249.0,Carol Cannon
2,2017-04-22T05:00:00,South Kennethville,11,2865.0,1338.0,Eric Mills
3,2017-05-11T02:00:00,New Andrea,9,833.0,1432.0,Kristen Smith
4,2017-02-21T10:00:00,East Lisa,9,,1584.0,Linda Atkinson


In [5]:
sales.dtypes

timestamp       object
city            object
store_id         int64
sale_number    float64
sale_amount    float64
associate       object
dtype: object
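
The `object` dtype on `timestamp` tells us the values are still plain strings. As a minimal sketch of what parsing them looks like, here are a few timestamp strings copied from the preview above, converted with `pd.to_datetime`. It assumes every value follows the same `%Y-%m-%dT%H:%M:%S` layout, which we have only seen in the first five rows.

```python
import pandas as pd

# Timestamp strings copied from the first rows of the preview above.
raw = pd.Series([
    '2017-03-24T12:00:00',
    '2017-03-05T14:00:00',
    '2017-04-22T05:00:00',
])

# An explicit format makes parsing fail loudly if a value does not match.
parsed = pd.to_datetime(raw, format='%Y-%m-%dT%H:%M:%S')

print(raw.dtype)     # string data before parsing
print(parsed.dtype)  # a datetime64 dtype after parsing
```

Once the column holds real datetimes, accessors such as `parsed.dt.hour` become available.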

### Engarde lets us track datatypes, so first we record the dtypes we expect after our first function runs, including the changes that function will make

In [6]:
new_dtypes = {
    'timestamp': 'datetime64[ns]',
    'city': object,
    'store_id': int,
    'sale_number': float,
    'sale_amount': float,
    'associate': object
}

In [7]:
@ed.has_dtypes(new_dtypes)
@ed.is_shape((None, 6))
def update_dtypes(sales):
    # Assign with [] so the parsed values replace the string column.
    sales['timestamp'] = sales.timestamp.map(lambda x: datetime.strptime(
        x, '%Y-%m-%dT%H:%M:%S'))
    return sales

In [8]:
sales = update_dtypes(sales)
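
To make the decorator pattern less magical, here is a rough sketch of how a dtype-checking decorator could be written with plain pandas. `expect_dtypes`, `load_amounts` and `load_bad_amounts` are hypothetical names invented for this illustration, and the code is not engarde's implementation. It assumes the check runs on the DataFrame the function returns.

```python
from functools import wraps

import pandas as pd


def expect_dtypes(expected):
    """Hypothetical stand-in for a dtype-checking decorator (not engarde's code)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Validate the DataFrame the function returns, not its input.
            for column, dtype in expected.items():
                actual = result[column].dtype
                if actual != dtype:
                    raise AssertionError(
                        f'{column} has the wrong dtype: '
                        f'expected {dtype}, got {actual}')
            return result
        return wrapper
    return decorator


@expect_dtypes({'sale_amount': 'float64'})
def load_amounts():
    return pd.DataFrame({'sale_amount': [249.0, 1338.0]})


@expect_dtypes({'sale_amount': 'float64'})
def load_bad_amounts():
    # The amounts are strings here, so the check should fail.
    return pd.DataFrame({'sale_amount': ['249.0', '1338.0']})


amounts = load_amounts()  # passes the check and returns the DataFrame

try:
    load_bad_amounts()
    check_failed = False
except AssertionError:
    check_failed = True
```

Because the check lives in the decorator, the function body stays focused on the transformation, and a bad result stops the pipeline at the step that produced it.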

## Now we want to remove poor-quality data: let's drop duplicate rows and any rows missing values in important columns we might need later

In [9]:
@ed.has_dtypes(new_dtypes)
@ed.is_shape((None, 6))
@ed.none_missing()
def remove_poor_quality_data(sales):
    sales = sales.drop_duplicates()
    sales = sales.dropna(subset=['sale_amount', 'store_id', 'sale_number', 
                                 'city', 'associate'])
    return sales

In [10]:
sales = remove_poor_quality_data(sales)
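
Here is a small sketch of what the two cleaning calls do, using three columns of the five preview rows from earlier. The sixth row is a repeat we added for illustration and is not taken from the real file.

```python
import numpy as np
import pandas as pd

# First five rows from the preview above, plus one repeated row
# (the repeat is an assumption added for illustration).
sample = pd.DataFrame({
    'store_id': [2, 5, 11, 9, 9, 5],
    'sale_number': [np.nan, 1196.0, 2865.0, 833.0, np.nan, 1196.0],
    'sale_amount': [-309.0, 249.0, 1338.0, 1432.0, 1584.0, 249.0],
})

# Drop rows that exactly repeat an earlier row.
deduped = sample.drop_duplicates()

# Drop rows with a missing value in any of the listed columns.
cleaned = deduped.dropna(subset=['sale_number'])

print(len(sample), len(deduped), len(cleaned))  # 6 5 3
```

Note that `dropna(subset=...)` only looks at the listed columns, so a stricter check such as `none_missing` can still fail if another column contains nulls.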

In [11]:
final_types = new_dtypes.copy()
final_types.update({
    'store_total': float,
    'associate_total': float,
    'city_total': float
})

In [12]:
@ed.has_dtypes(final_types)
@ed.none_missing()
def calculate_store_sales(sales):
    sales['store_total'] = sales.groupby('store_id').transform(sum)['sale_amount']
    sales['associate_total'] = sales.groupby('associate').transform(sum)['sale_amount']
    sales['city_total'] = sales.groupby('city').transform(sum)['sale_amount']
    return sales

### The issue is that we need to convert the columns to floats specifically

In [13]:
sales = calculate_store_sales(sales)

AssertionError: store_total has the wrong dtype (<class 'float'>)

## Exercise: Can you fix the above error?

In [None]:
# %load ../solutions/engarde.py
@ed.has_dtypes(final_types)
@ed.none_missing()
def calculate_store_sales(sales):
    sales['store_total'] = sales.groupby('store_id').transform(sum)['sale_amount']
    sales['associate_total'] = sales.groupby('associate').transform(sum)['sale_amount']
    sales['city_total'] = sales.groupby('city').transform(sum)['sale_amount']
    # Coerce each total to a numeric dtype so the float check passes.
    sales['store_total'] = pd.to_numeric(sales['store_total'])
    sales['city_total'] = pd.to_numeric(sales['city_total'])
    sales['associate_total'] = pd.to_numeric(sales['associate_total'])
    return sales


In [None]:
sales = calculate_store_sales(sales)
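
For comparison, here is a sketch of the two building blocks used in the solution, run on a few rows from the preview. It selects `sale_amount` before calling `transform`, which is a different ordering from the solution cell and is shown only as an alternative. The `astype(object)` step is there purely to simulate a column that has lost its numeric dtype.

```python
import pandas as pd

# Rows taken from the preview above; store 9 appears twice.
sample = pd.DataFrame({
    'store_id': [5, 11, 9, 9],
    'sale_amount': [249.0, 1338.0, 1432.0, 1584.0],
})

# Selecting the column first keeps the aggregation on numeric data only.
sample['store_total'] = (
    sample.groupby('store_id')['sale_amount'].transform('sum'))

# Simulate a totals column that ended up with the generic object dtype.
as_object = sample['store_total'].astype(object)

# pd.to_numeric turns an object column holding numbers back into floats.
as_float = pd.to_numeric(as_object)

print(sample['store_total'].dtype, as_object.dtype, as_float.dtype)
```

`transform` returns one value per original row, so each row for store 9 receives the same store total.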

In [None]:
@ed.is_shape((None, 9))
def save_report(sales):
    sales.to_csv('../data/sales_summary.csv')
    # Return the DataFrame so the shape check has something to validate.
    return sales
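
The shape `(None, 9)` reads as "any number of rows, exactly nine columns": the six original columns plus the three totals. Here is a minimal sketch of that comparison, assuming `None` acts as a wildcard. `shape_matches` is a hypothetical helper written for this illustration, not part of engarde.

```python
import pandas as pd


def shape_matches(actual, expected):
    """Hypothetical helper: compare shapes, treating None as 'any size'."""
    if len(actual) != len(expected):
        return False
    return all(want is None or want == got
               for got, want in zip(actual, expected))


# A stand-in frame with two rows and nine columns.
frame = pd.DataFrame({f'col_{i}': [0.0, 1.0] for i in range(9)})

print(shape_matches(frame.shape, (None, 9)))
print(shape_matches(frame.shape, (None, 6)))
```

Leaving the row count open is useful here because cleaning steps change how many rows survive, while the set of columns should stay fixed.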

In [None]:
sales.dtypes