## TDDA: Test-Driven Data Analysis

In this notebook, we'll review [TDDA](https://github.com/tdda/tdda), a Python library that takes data inputs (such as NumPy arrays or Pandas DataFrames) and discovers a set of constraints that the data satisfies. You can then save those constraints as JSON and verify new data against them.

In [2]:
import pandas as pd
import numpy as np
from tdda.constraints.pdconstraints import discover_constraints, verify_df

In [3]:
df = pd.read_csv('../data/iot_example.csv')

## Basic Data Quality Check

In [4]:
df.head()

|   | timestamp | username | temperature | heartrate | build | latest | note |
|---|---|---|---|---|---|---|---|
| 0 | 2017-01-01T12:00:23 | michaelsmith | 12 | 67 | 4e6a7805-8faa-2768-6ef6-eb3198b483ac | 0 | interval |
| 1 | 2017-01-01T12:01:09 | kharrison | 6 | 78 | 7256b7b0-e502-f576-62ec-ed73533c9c84 | 0 | wake |
| 2 | 2017-01-01T12:01:34 | smithadam | 5 | 89 | 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 | 0 |  |
| 3 | 2017-01-01T12:02:09 | eddierodriguez | 28 | 76 | 2599ac79-e5e0-5117-b8e1-57e5ced036f7 | 0 | update |
| 4 | 2017-01-01T12:02:36 | kenneth94 | 29 | 62 | 122f1c6a-403c-2221-6ed1-b5caa08f11e0 | 0 | user |


In [5]:
df.dtypes

timestamp      object
username       object
temperature     int64
heartrate       int64
build          object
latest          int64
note           object
dtype: object

## Use `discover_constraints` to build the constraint object

In [6]:
constraints = discover_constraints(df)

In [7]:
constraints

<tdda.constraints.base.DatasetConstraints at 0x119945828>

In [8]:
constraints.fields

Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x119945ba8>),
        ('username', <tdda.constraints.base.FieldConstraints at 0x1199457f0>),
        ('temperature',
         <tdda.constraints.base.FieldConstraints at 0x119945d68>),
        ('heartrate', <tdda.constraints.base.FieldConstraints at 0x119945c18>),
        ('build', <tdda.constraints.base.FieldConstraints at 0x119ac93c8>),
        ('latest', <tdda.constraints.base.FieldConstraints at 0x119ac9320>),
        ('note', <tdda.constraints.base.FieldConstraints at 0x119ac94e0>)])
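
To make the idea of constraint discovery concrete, here is a minimal pure-Python sketch of how constraints like these could be derived from a single column. It is not TDDA's implementation: the function `discover_field_constraints` is a hypothetical helper, it handles only integers and strings, and it uses just the five rows shown by `df.head()`, so its results differ from what TDDA finds on the full dataset.

```python
def discover_field_constraints(values):
    """Derive simple constraints from observed values (None stands for a null).

    Hypothetical helper for illustration only; not part of TDDA.
    """
    non_null = [v for v in values if v is not None]
    null_count = len(values) - len(non_null)
    constraints = {}

    if non_null and all(isinstance(v, int) for v in non_null):
        constraints["type"] = "int"
        constraints["min"] = min(non_null)
        constraints["max"] = max(non_null)
        if min(non_null) > 0:
            constraints["sign"] = "positive"
    elif non_null and all(isinstance(v, str) for v in non_null):
        constraints["type"] = "string"
        lengths = [len(v) for v in non_null]
        constraints["min_length"] = min(lengths)
        constraints["max_length"] = max(lengths)
        # Simplification: only check uniqueness for string columns.
        if len(set(non_null)) == len(non_null):
            constraints["no_duplicates"] = True

    # Only record a null limit when no nulls were observed.
    if null_count == 0:
        constraints["max_nulls"] = 0

    return constraints


# Values copied from the five rows displayed by df.head() above.
temperature = [12, 6, 5, 28, 29]
usernames = ["michaelsmith", "kharrison", "smithadam", "eddierodriguez", "kenneth94"]

temperature_constraints = discover_field_constraints(temperature)
username_constraints = discover_field_constraints(usernames)

print(temperature_constraints)
print(username_constraints)
```

The constraints are simply a summary of what was observed, which is why the sample you discover them from matters: five unique usernames yield `no_duplicates` here, while a larger sample with repeated usernames would not.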

## Now write the constraints to a file - a .tdda file is essentially just JSON

In [9]:
with open('../data/ignore-iot_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

In [10]:
cat ../data/ignore-iot_constraints.tdda

{
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "username": {
            "type": "string",
            "min_length": 3,
            "max_length": 21,
            "max_nulls": 0
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        },
        "heartrate": {
            "type": "int",
            "min": 60,
            "max": 89,
            "sign": "positive",
            "max_nulls": 0
        },
        "build": {
            "type": "string",
            "min_length": 36,
            "max_length": 36,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "latest": {
            "type": "int",
            "min": 0,
            "max": 1,
 

## Exercise: what types of constraints are being extracted? How does this compare with defining your own schema?
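
One way to approach this exercise is to load the constraints with Python's standard `json` module and list the constraint kinds recorded for each field. The sketch below parses an abbreviated fragment (two fields copied from the file contents printed in this notebook) rather than opening the file, so it runs on its own; the variable names are illustrative.

```python
import json

# Abbreviated fragment of the .tdda file: two fields copied from this notebook.
tdda_text = """
{
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        }
    }
}
"""

constraints_dict = json.loads(tdda_text)

# Map each field name to the kinds of constraints recorded for it.
kinds_by_field = {
    field_name: sorted(field_constraints)
    for field_name, field_constraints in constraints_dict["fields"].items()
}

for field_name, kinds in kinds_by_field.items():
    print(field_name, kinds)
```

To inspect the real file, you could replace `json.loads(tdda_text)` with `json.load(f)` on the opened `.tdda` file.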

### Now, let's read in our other IoT dataset :D (can anyone guess what will happen?)

In [11]:
new_df = pd.read_csv('../data/iot_example_with_nulls.csv')

## We use `verify_df` to pass in the new dataframe, along with the file path to our saved constraints.

In [12]:
v = verify_df(new_df, '../data/ignore-iot_constraints.tdda')

## We can now check the number of passes and failures, and look at the output

In [13]:
v.passes

30

In [14]:
v.failures

4

In [15]:
print(str(v))

FIELDS:

timestamp: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓

username: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

temperature: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

heartrate: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

build: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  no_duplicates ✓

latest: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

note: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  allowed_values ✓

SUMMARY:

Passes: 30
Failures: 4
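
To see why a single null produces exactly one failure for a field, here is a minimal pure-Python sketch of checking one integer column against its saved constraints. It is an illustration under stated assumptions, not TDDA's verification code: `verify_int_field` is a hypothetical helper, it only understands the `"positive"` sign, and the `new_temperature` values are made up (the notebook does not show the contents of the second dataset).

```python
def verify_int_field(values, constraints):
    """Check an integer column against constraints (None stands for a null).

    Hypothetical helper for illustration only; not part of TDDA.
    Returns a dict mapping each constraint kind to True (pass) or False (fail).
    """
    non_null = [v for v in values if v is not None]
    null_count = len(values) - len(non_null)

    checks = {
        "type": lambda: all(isinstance(v, int) for v in non_null),
        "min": lambda: all(v >= constraints["min"] for v in non_null),
        "max": lambda: all(v <= constraints["max"] for v in non_null),
        # Simplification: only the "positive" sign is handled here.
        "sign": lambda: all(v > 0 for v in non_null),
        "max_nulls": lambda: null_count <= constraints["max_nulls"],
    }
    return {kind: checks[kind]() for kind in constraints if kind in checks}


# Constraints for "temperature" as saved in the .tdda file above.
temperature_constraints = {
    "type": "int",
    "min": 5,
    "max": 29,
    "sign": "positive",
    "max_nulls": 0,
}

# Made-up new data containing one null.
new_temperature = [12, None, 5, 28, 29]

results = verify_int_field(new_temperature, temperature_constraints)
passes = sum(results.values())
failures = len(results) - passes

print(results)
print(f"{failures} failure(s), {passes} pass(es)")
```

Each constraint kind counts once per field, regardless of how many rows violate it, which matches the per-field report above where `temperature` shows 1 failure and 4 passes.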


## In addition, we can take a look at the passes and failures in a dataframe

In [16]:
v.to_frame()

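|   | field | failures | passes | type | min | min_length | max | max_length | sign | max_nulls | no_duplicates | allowed_values |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | timestamp | 0 | 5 | True |  | True |  | True |  | True | True |  |
| 1 | username | 0 | 4 | True |  | True |  | True |  | True |  |  |
| 2 | temperature | 1 | 4 | True | True |  | True |  | True | False |  |  |
| 3 | heartrate | 0 | 5 | True | True |  | True |  | True | True |  |  |
| 4 | build | 1 | 4 | True |  | True |  | True |  | False | True |  |
| 5 | latest | 1 | 4 | True | True |  | True |  | True | False |  |  |
| 6 | note | 1 | 4 | True |  | True |  | True |  | False |  | True |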


## Exercise: How could we fix the schema or separate data so all tests pass?