## Example of using Pure to check data quality

This example shows how the Pure framework can be used to check data quality on several datasets.

In [1]:
import sys

sys.path.append("../")

In [2]:
import pandas as pd
import numpy as np

from pure.report import Report

In [3]:
pd.set_option("max_colwidth", 40)

#### For the first example, let's take three datasets from Kaggle: the House Prices dataset, the Context Ad Clicks dataset and the Sales dataset:

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

https://www.kaggle.com/datasets/arashnic/ctrtest

https://www.kaggle.com/datasets/abhishekrp1517/sales-data-for-economic-data-analysis

In [4]:
house_prices_data = pd.read_csv("./example_data/house_prices.csv")
clicks_data = pd.read_csv("./example_data/clicks.csv")
sales_data = pd.read_csv("./example_data/sales_for_course.csv")

In [5]:
house_prices_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [6]:
clicks_data.head()

Unnamed: 0,impression_id,impression_time,user_id,app_code,os_version,is_4G,is_click
0,c4ca4238a0b923820dcc509a6f75849b,2018-11-15 00:00:00,87862,422,old,0,0
1,45c48cce2e2d7fbdea1afc51c7c6ad26,2018-11-15 00:01:00,63410,467,latest,1,1
2,70efdf2ec9b086079795c442636b55fb,2018-11-15 00:02:00,71748,259,intermediate,1,0
3,8e296a067a37563370ded05f5a3bf3ec,2018-11-15 00:02:00,69209,244,latest,1,0
4,182be0c5cdcd5072bb1864cdee4d3d6e,2018-11-15 00:02:00,62873,473,latest,0,0


In [7]:
sales_data.head()

Unnamed: 0,index,Date,Year,Month,Customer Age,Customer Gender,Country,State,Product Category,Sub Category,Quantity,Unit Cost,Unit Price,Cost,Revenue,Column1
0,0,2/19/2016,2016.0,February,29.0,F,United States,Washington,Accessories,Tires and Tubes,1.0,80.0,109.0,80.0,109.0,
1,1,2/20/2016,2016.0,February,29.0,F,United States,Washington,Clothing,Gloves,2.0,24.5,28.5,49.0,57.0,
2,2,2/27/2016,2016.0,February,29.0,F,United States,Washington,Accessories,Tires and Tubes,3.0,3.67,5.0,11.0,15.0,
3,3,3/12/2016,2016.0,March,29.0,F,United States,Washington,Accessories,Tires and Tubes,2.0,87.5,116.5,175.0,233.0,
4,4,3/12/2016,2016.0,March,29.0,F,United States,Washington,Accessories,Tires and Tubes,3.0,35.0,41.666667,105.0,125.0,


### Report

The main purpose of the Pure project is to run a list of metrics and report which checks passed, which failed, and which stopped with an error.

For this purpose there is a Report class that takes a dictionary of datasets and applies the metrics from the checklist (explained later) to them.

First, create a dictionary with the tables that we want to apply metrics to.

In [8]:
tables = {"house_prices": house_prices_data, "clicks": clicks_data, "sales": sales_data}

Import the metrics we want to check for our datasets.

In [9]:
import pure.metrics as m

Create a checklist for the Report class. It lists the metrics we want to check and the restrictions (limits) we want to impose on the resulting values of these metrics.

The checklist is a list of tuples, each containing:

'name of the table': str,

'metric class': Metric,

'limits dict': Dict[str, Tuple[float, float]] (defines limits for the field of the metric result we want to control).

It's needed for initialization of the Report class.
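As a rough illustration of how a limits dict could be applied to a metric result, here is a minimal sketch. The `Limits` alias and the `within_limits` helper are hypothetical names, not part of Pure. The inclusive comparison is an assumption based on the `low <= field <= high` form that the report prints.

```python
from typing import Dict, Tuple

# Hypothetical alias for the third element of a checklist tuple.
Limits = Dict[str, Tuple[float, float]]


def within_limits(values: Dict[str, float], limits: Limits) -> bool:
    """Return True when every limited field lies in its inclusive [low, high] range."""
    return all(low <= values[key] <= high for key, (low, high) in limits.items())


# Values taken from the report output shown in this example.
print(within_limits({"total": 1460}, {"total": (1, 1e6)}))  # True
print(within_limits({"total": 1460, "count": 1408, "delta": 0.964}, {"delta": (0, 0.3)}))  # False
print(within_limits({"lcb": 80000.0, "ucb": 384510.75}, {}))  # True
```

An empty limits dict imposes no restriction, which matches the CountCB row that passes with `{}` as its limits.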

In [10]:
checklist = [
    ("house_prices", m.CountTotal(), {"total": (1, 1e6)}),
    ("house_prices", m.CountZeros("MiscVal"), {"delta": (0, 0.3)}),
    ("house_prices", m.CountNull("LotFrontage"), {"delta": (0, 0.2)}),
    ("house_prices", m.CountValue("RoofStyle", "Mansard"), {"delta": (0.1, 0.2)}),
    ("house_prices", m.CountBelowValue("SalePrice", 200000), {"delta": (0.4, 0.8)}),
    ("house_prices", m.CountAboveValue("SalePrice", 300000), {"delta": (0, 0.1)}),
    (
        "house_prices",
        m.CountBelowColumn("YearBuilt", "YearRemodAdd", True),
        {"count": (100, 1000)},
    ),
    ("house_prices", m.CountCB("SalePrice"), {}),
    ("house_prices", m.CountExtremeValuesFormula("LotArea", 4), {"delta": (0, 0.01)}),
    ("house_prices", m.CountExtremeValuesQuantile("LotArea", 0.95), {"delta": (0, 0.05)}),
    ("sales", m.CountRatioBelow("Revenue", "Unit Price", "Quantity"), {}),
    ("sales", m.CountValueInBounds("Customer Age", 18, 85), {"delta": (0, 0)}),
    ("clicks", m.CountDuplicates(["impression_time", "user_id"]), {"count": (0, 10)}),
    ("clicks", m.CountTotal(), {"total": (1, 1e6)}),
    ("clicks", m.CountZeros("is_4G"), {"delta": (0, 0.1)}),
    ("clicks", m.CountLag("impression_time"), {}),
    (
        "clicks",
        m.CountValueInSet("os_version", ["intermediate", "latest", "old"]),
        {"delta": (0, 0.01)},
    ),
    ("clicks", m.CountLastDayRows("impression_time", percent=80), {}),
]

The Report class has a ``fit`` method:

``fit``: applies the metrics to the datasets and checks whether the resulting values fall within the limits defined in the checklist;
returns a dictionary with the result dataframe and some meta fields.

Let's create an instance and set the engine we want ('pandas' or 'pyspark'). Here we use 'pandas'.

In [11]:
report = Report(tables=tables, checklist=checklist, engine="pandas", table_max_col_width=64)

Running checks: 100%|██████████| 18/18 [00:00<00:00, 63.76check/s, clicks: CountLastDayRows(column='impression_time', percent=80)]                               


Statistics of performed checks

In [12]:
report.stats

{'tables': ['house_prices', 'clicks', 'sales'],
 'total': 18,
 'passed': 12,
 'failed': 6,
 'errors': 0}

In [13]:
pd.set_option('display.max_colwidth', 0)
report.df

Unnamed: 0,table,metric,values,limits,status,error
0,house_prices,CountTotal(),{'total': 1460},1 <= total <= 1000000.0,.,
1,house_prices,CountZeros(column='MiscVal'),"{'total': 1460, 'count': 1408, 'delta': 0.964}",0 <= delta <= 0.3,F,
2,house_prices,"CountNull(columns=['LotFrontage'], aggregation='any')","{'total': 1460, 'count': 259, 'delta': 0.177}",0 <= delta <= 0.2,.,
3,house_prices,"CountValue(column='RoofStyle', value='Mansard')","{'total': 1460, 'count': 7, 'delta': 0.005}",0.1 <= delta <= 0.2,F,
4,house_prices,"CountBelowValue(column='SalePrice', value=200000, strict=True)","{'total': 1460, 'count': 1025, 'delta': 0.702}",0.4 <= delta <= 0.8,.,
5,house_prices,"CountAboveValue(column='SalePrice', value=300000, strict=False)","{'total': 1460, 'count': 115, 'delta': 0.079}",0 <= delta <= 0.1,.,
6,house_prices,"CountBelowColumn(column_x='YearBuilt', column_y='YearRemodAdd', strict=True)","{'total': 1460, 'count': 696, 'delta': 0.477}",100 <= count <= 1000,.,
7,house_prices,"CountCB(column='SalePrice', conf=0.95)","{'lcb': 80000.0, 'ucb': 384510.75}",{},.,
8,house_prices,"CountExtremeValuesFormula(column='LotArea', std_coef=4, style='greater')","{'total': 1460, 'count': 10, 'delta': 0.007}",0 <= delta <= 0.01,.,
9,house_prices,"CountExtremeValuesQuantile(column='LotArea', q=0.95, style='greater')","{'total': 1460, 'count': 73, 'delta': 0.05}",0 <= delta <= 0.05,.,
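The `total`, `count` and `delta` fields in the rows above are consistent with `delta = count / total` rounded to three decimals. The sketch below reproduces that relationship for a zero-count check on a toy frame. `count_zeros` is a hypothetical helper written for illustration and is not Pure's implementation.

```python
import pandas as pd


def count_zeros(df: pd.DataFrame, column: str) -> dict:
    """Hypothetical re-implementation: count rows where `column` equals zero."""
    total = len(df)
    count = int((df[column] == 0).sum())
    # Assumes a non-empty frame; delta is rounded to three decimals as in the report.
    return {"total": total, "count": count, "delta": round(count / total, 3)}


toy = pd.DataFrame({"MiscVal": [0, 0, 700, 0]})
result = count_zeros(toy, "MiscVal")
print(result)  # {'total': 4, 'count': 3, 'delta': 0.75}

# The same ratio for the MiscVal row above: 1408 zeros out of 1460 rows.
print(round(1408 / 1460, 3))  # 0.964
```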


Let's print the resulting report.

In [14]:
print(report)

DQ Report for tables ['house_prices', 'clicks', 'sales'], engine: `pandas`.
+--------------+------------------------------------------------------------------+----------------------------------------------------------------+-------------------------+----------+---------+
| table        | metric                                                           | values                                                         | limits                  | status   | error   |
|--------------+------------------------------------------------------------------+----------------------------------------------------------------+-------------------------+----------+---------|
| house_prices | CountTotal()                                                     | {'total': 1460}                                                | 1 <= total <= 1000000.0 | .        |         |
| house_prices | CountZeros(column='MiscVal')                                     | {'total': 1460, 'count': 1408, 'delta': 0.964}          

#### Explanation of the report result

'status' column:

'.' - if the check is successful;

'F' - if the check is not successful; 

'E' - if an error occurred during execution
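The three statuses can be pictured with the following minimal sketch. `run_check` is a hypothetical helper, not Pure's API; it assumes that a metric is a callable returning a dict of values and that any exception raised during the computation maps to 'E'.

```python
from typing import Callable, Dict, Tuple


def run_check(
    compute: Callable[[], Dict[str, float]],
    limits: Dict[str, Tuple[float, float]],
) -> Tuple[str, str]:
    """Return (status, error_message) for one check. Hypothetical helper."""
    try:
        values = compute()
        ok = all(low <= values[key] <= high for key, (low, high) in limits.items())
    except Exception as exc:  # any failure during execution becomes 'E'
        return "E", repr(exc)
    return ("." if ok else "F"), ""


def broken_metric() -> Dict[str, float]:
    # Simulates a metric that refers to a column missing from the table.
    raise KeyError("NoSuchColumn")


print(run_check(lambda: {"total": 1460}, {"total": (1, 1e6)}))  # ('.', '')
print(run_check(lambda: {"delta": 0.964}, {"delta": (0, 0.3)}))  # ('F', '')
print(run_check(broken_metric, {"delta": (0, 0.3)})[0])  # E
```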

In the report result, six metrics have the 'F' status: the field of the metric result on which we imposed limits falls outside them.

No metric got the 'E' status in this run (`report.stats` shows `'errors': 0`). An 'E' status means that something went wrong during metric execution, for example a KeyError raised while the metric was being computed.

All other metrics passed.

### Metrics

Metrics can also be used on their own, when we just want to apply a metric to a dataset and see the resulting value.

Import the metrics we want to check.

For example, take the CountCB metric, which calculates confidence bounds for the chosen column; CountLag, which calculates the lag between the last date and today; and CountLastDayRows, which checks whether the number of rows on the last day is at least 'percent'% of the average.

In [15]:
from pure.metrics import CountCB, CountLag, CountLastDayRows

Create instances of the metrics

In [16]:
cb_metric = CountCB(column="LotArea", conf=0.95)
lag_metric = CountLag(column="impression_time")
last_day_rows_metric = CountLastDayRows(column="impression_time")

Calculate the results

In [17]:
bounds = cb_metric('pandas', house_prices_data)
print(bounds)

{'lcb': 2298.0250000000005, 'ucb': 22698.249999999927}
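The fractional bounds printed above look like interpolated empirical quantiles. One plausible reading, which is an assumption and not confirmed by the source, is that the bounds are the `(1 - conf) / 2` and `(1 + conf) / 2` quantiles of the column. The sketch below shows that computation with NumPy; `quantile_bounds` is a hypothetical name.

```python
import numpy as np


def quantile_bounds(values, conf: float = 0.95) -> dict:
    """Empirical two-sided bounds; assumed, not taken from Pure's source."""
    alpha = (1 - conf) / 2
    lcb, ucb = np.quantile(values, [alpha, 1 - alpha])
    return {"lcb": float(lcb), "ucb": float(ucb)}


# Toy data: the integers 0..100, so the 2.5% and 97.5% quantiles are easy to check.
bounds_toy = quantile_bounds(np.arange(101), conf=0.95)
print(bounds_toy)
```

On this toy input the bounds come out at about 2.5 and 97.5, up to floating-point rounding.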


In [18]:
lag_result = lag_metric('pandas', clicks_data)
print(lag_result)

{'today': '2023-08-19', 'last_day': '2018-12-13', 'lag': 1710}
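The `lag` value can be checked with plain date arithmetic. The sketch below assumes that the lag is the number of whole days between `last_day` and `today`; `lag_in_days` is a hypothetical helper.

```python
from datetime import date


def lag_in_days(last_day: date, today: date) -> int:
    """Number of whole days between the last observed date and today."""
    return (today - last_day).days


# Dates taken from the output above.
lag = lag_in_days(date(2018, 12, 13), date(2023, 8, 19))
print(lag)  # 1710
```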


In [19]:
metric_result = last_day_rows_metric('pandas', clicks_data)
print(metric_result)

{'average': 8472.142857142857, 'last_date_count': 389, 'percentage': 4.591518421718237, 'at_least_80%': False}
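The `percentage` field is consistent with `last_date_count / average * 100`, compared against the `percent` threshold. The sketch below redoes that arithmetic with the numbers printed above. `percent_of_average` is a hypothetical helper, and how Pure computes the average itself (which days it covers) is not stated in this example.

```python
def percent_of_average(last_date_count: int, average: float, percent: float = 80) -> dict:
    """Compare the last day's row count with an average daily count. Hypothetical helper."""
    percentage = last_date_count / average * 100
    return {
        "average": average,
        "last_date_count": last_date_count,
        "percentage": percentage,
        f"at_least_{percent}%": percentage >= percent,
    }


# Numbers taken from the output above.
check = percent_of_average(389, 8472.142857142857, percent=80)
print(round(check["percentage"], 3))  # 4.592
print(check["at_least_80%"])  # False
```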
