## TFX Data Validation 

This notebook illustrates the use of TFX for data validation. A TFX pipeline arranges components into a logical 
workflow for scalable, high-performance machine learning data validation. TFX components run on Apache Beam as the execution engine.
***

The following TFX components are included in this pipeline:
- #### <font color='blue'> StatisticsGen </font>: calculates statistics for the dataset.
- #### <font color='blue'> SchemaGen </font>: examines the statistics and creates a data schema.
- #### <font color='blue'> ExampleValidator </font>: looks for anomalies and missing values in the dataset.


In [1]:
import os
import urllib
import zipfile
import tempfile 
import pandas as pd
import tensorflow as tf
import tensorflow_data_validation as tfdv


print('TF version:', tf.__version__)
print('TFDV version:', tfdv.version.__version__)

TF version: 2.8.0
TFDV version: 1.7.0


In [2]:
DATA_DIR = os.path.join('data')                                # data directory
TRAIN_DATA = os.path.join(DATA_DIR, 'train', 'data.csv')       # train data 
EVAL_DATA = os.path.join(DATA_DIR, 'eval', 'data.csv')         # eval data
SERVING_DATA = os.path.join(DATA_DIR, 'serving', 'data.csv')   # serving data
OUTPUT_DIR = os.path.join('chicago_taxi_output')               # write out schema file

#### View data with Pandas

In [4]:
df = pd.read_csv(TRAIN_DATA)

df.head()

,pickup_community_area,fare,trip_start_month,trip_start_hour,trip_start_day,trip_start_timestamp,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,trip_miles,pickup_census_tract,dropoff_census_tract,payment_type,company,trip_seconds,dropoff_community_area,tips
0,22,12.85,3,11,7,1393673400,41.920452,-87.679955,41.877406,-87.621972,0.0,,17031320000.0,Cash,Taxi Affiliation Services,720,32.0,0.0
1,22,5.45,8,21,7,1439675100,41.920452,-87.679955,41.906771,-87.681025,1.2,,17031240000.0,Cash,Dispatch Taxi Affiliation,360,24.0,0.0
2,33,0.0,5,10,4,1432118700,41.849247,-87.624135,41.849247,-87.624135,0.0,,17031840000.0,Cash,Northwest Management LLC,0,33.0,0.0
3,33,11.05,3,15,1,1427037300,41.849247,-87.624135,41.892508,-87.626215,0.0,,17031080000.0,Cash,Taxi Affiliation Services,900,8.0,0.0
4,33,11.05,5,15,6,1401464700,41.849247,-87.624135,41.892508,-87.626215,3.2,,17031080000.0,Cash,,960,8.0,0.0


### <font color='blue'> StatisticsGen </font>
- The StatisticsGen TFX pipeline component generates feature statistics over both training and serving data.
- The resulting statistics are used by downstream pipeline components.
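To make "feature statistics" concrete, here is a minimal standard-library sketch of the kind of per-feature summary involved. It is not how TFDV computes statistics: the helper name `summarize_feature` is hypothetical, and the sample values are the `fare` and `company` entries from the five rows shown above.

```python
from collections import Counter
from statistics import mean


def summarize_feature(values):
    """Return a small summary for one feature column.

    `None` marks a missing value. Hypothetical helper, for illustration only.
    """
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric feature: report range and mean.
        summary.update(min=min(present), max=max(present), mean=mean(present))
    else:
        # Categorical feature: report the most frequent values.
        summary["top_values"] = Counter(present).most_common(3)
    return summary


fare = [12.85, 5.45, 0.0, 11.05, 11.05]
company = [
    "Taxi Affiliation Services",
    "Dispatch Taxi Affiliation",
    "Northwest Management LLC",
    "Taxi Affiliation Services",
    None,  # the fifth row has no company
]

fare_summary = summarize_feature(fare)
company_summary = summarize_feature(company)
print(fare_summary)
print(company_summary)
```

TFDV produces a much richer set of statistics (histograms, quantiles, string lengths and so on); the sketch only shows the idea of summarizing each column separately.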



In [5]:
train_stats = tfdv.generate_statistics_from_csv(data_location=TRAIN_DATA)


tfdv.visualize_statistics(train_stats)




Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`




#### Compute `StatisticsGen`  stats for Eval data

In [6]:
# for eval data
eval_stats = tfdv.generate_statistics_from_csv(data_location=EVAL_DATA)

tfdv.visualize_statistics(eval_stats)




#### Compare `Train` & `Eval` data


In [7]:
tfdv.visualize_statistics(lhs_statistics=train_stats, rhs_statistics=eval_stats,
                          lhs_name='TRAIN_DATASET', rhs_name='EVAL_DATASET')


### <font color='blue'> SchemaGen </font>
- It generates a description of the input data: the schema.
- Automatically generates a schema by inferring types, categories, and ranges from the training data.
- The schema is an instance of schema.proto (Data schema proto). 
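The inference described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not TFDV's inference logic: the helper `infer_feature_schema` is hypothetical, `None` stands for a missing value, and the sample values come from the training rows and `payment_type` values shown in this notebook.

```python
def infer_feature_schema(values):
    """Infer a type, presence and (for strings) a domain from observed values.

    Hypothetical helper for illustration; `None` marks a missing value.
    """
    present = [v for v in values if v is not None]
    if all(isinstance(v, int) for v in present):
        feature_type = "INT"
    elif all(isinstance(v, (int, float)) for v in present):
        feature_type = "FLOAT"
    else:
        feature_type = "STRING"
    feature_schema = {
        "type": feature_type,
        # A feature seen in every example is treated as required.
        "presence": "required" if len(present) == len(values) else "optional",
    }
    if feature_type == "STRING":
        # The domain is the set of values observed in the training data.
        feature_schema["domain"] = sorted(set(present))
    return feature_schema


hour_schema = infer_feature_schema([11, 21, 10, 15, 15])
payment_schema = infer_feature_schema(
    ["Cash", "Credit Card", "No Charge", "Unknown", "Dispute", "Pcard"]
)
company_schema = infer_feature_schema(
    ["Taxi Affiliation Services", "Dispatch Taxi Affiliation", None]
)
print(hour_schema)
print(payment_schema)
print(company_schema)
```

Because the domain is built only from values seen in training, any new category in later data will surface as an anomaly, which is what the validation step relies on.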

In [8]:
schema = tfdv.infer_schema(statistics=train_stats)

tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'pickup_community_area',INT,required,,-
'fare',FLOAT,required,,-
'trip_start_month',INT,required,,-
'trip_start_hour',INT,required,,-
'trip_start_day',INT,required,,-
'trip_start_timestamp',INT,required,,-
'pickup_latitude',FLOAT,required,,-
'pickup_longitude',FLOAT,required,,-
'dropoff_latitude',FLOAT,optional,single,-
'dropoff_longitude',FLOAT,optional,single,-


Domain,Values
'payment_type',"'Cash', 'Credit Card', 'Dispute', 'No Charge', 'Pcard', 'Unknown'"
'company',"'0118 - 42111 Godfrey S.Awir', '0694 - 59280 Chinesco Trans Inc', '1085 - 72312 N and W Cab Co', '2733 - 74600 Benny Jona', '2809 - 95474 C & D Cab Co Inc.', '3011 - 66308 JBL Cab Inc.', '3152 - 97284 Crystal Abernathy', '3201 - C&D Cab Co Inc', '3201 - CID Cab Co Inc', '3253 - 91138 Gaither Cab Co.', '3385 - 23210 Eman Cab', '3623 - 72222 Arrington Enterprises', '3897 - Ilie Malec', '4053 - Adwar H. Nikola', '4197 - 41842 Royal Star', '4615 - 83503 Tyrone Henderson', '4615 - Tyrone Henderson', '4623 - Jay Kim', '5006 - 39261 Salifu Bawa', '5006 - Salifu Bawa', '5074 - 54002 Ahzmi Inc', '5074 - Ahzmi Inc', '5129 - 87128', '5129 - 98755 Mengisti Taxi', '5129 - Mengisti Taxi', '5724 - KYVI Cab Inc', '585 - Valley Cab Co', '5864 - 73614 Thomas Owusu', '5864 - Thomas Owusu', '5874 - 73628 Sergey Cab Corp.', '5997 - 65283 AW Services Inc.', '5997 - AW Services Inc.', '6488 - 83287 Zuha Taxi', '6743 - Luhak Corp', 'Blue Ribbon Taxi Association Inc.', 'C & D Cab Co Inc', 'Chicago Elite Cab Corp.', 'Chicago Elite Cab Corp. (Chicago Carriag', 'Chicago Medallion Leasing INC', 'Chicago Medallion Management', 'Choice Taxi Association', 'Dispatch Taxi Affiliation', 'KOAM Taxi Association', 'Northwest Management LLC', 'Taxi Affiliation Services', 'Top Cab Affiliation'"


### <font color='blue'> ExampleValidator </font>
- The ExampleValidator component detects anomalies in data, based on the expectations defined by the schema.
- ExampleValidator takes as input the statistics from StatisticsGen, and the schema from SchemaGen.
- It also checks whether the evaluation dataset matches the schema inferred from the training dataset.
- This is especially important for categorical features, where we want to identify the range of acceptable values.


In [9]:
# Check eval data for errors by validating the eval data stats using the previously inferred schema.
# Anomaly detected - some new values in eval data, not in training data. (payment_type: Prcard) 

anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

tfdv.display_anomalies(anomalies)

Feature name,Anomaly short description,Anomaly long description
'payment_type',Unexpected string values,Examples contain values missing from the schema: Prcard (<1%).
'company',Unexpected string values,"Examples contain values missing from the schema: 2092 - 61288 Sbeih company (<1%), 2192 - 73487 Zeymane Corp (<1%), 2192 - Zeymane Corp (<1%), 2823 - 73307 Seung Lee (<1%), 3094 - 24059 G.L.B. Cab Co (<1%), 3319 - CD Cab Co (<1%), 3385 - Eman Cab (<1%), 3897 - 57856 Ilie Malec (<1%), 4053 - 40193 Adwar H. Nikola (<1%), 4197 - Royal Star (<1%), 585 - 88805 Valley Cab Co (<1%), 5874 - Sergey Cab Corp. (<1%), 6057 - 24657 Richard Addo (<1%), 6574 - Babylon Express Inc. (<1%), 6742 - 83735 Tasha ride inc (<1%)."


#### Examine differences in training data vs eval data

In [10]:
# train data
training_data_df = pd.read_csv(TRAIN_DATA)

# Find unique values of 'payment_type' column
print(training_data_df['payment_type'].unique())

['Cash' 'Credit Card' 'No Charge' 'Unknown' 'Dispute' 'Pcard']


In [12]:
# eval data
eval_data_df = pd.read_csv(EVAL_DATA)

# Find unique values of 'payment_type' column
print(eval_data_df['payment_type'].unique())

['Cash' 'Credit Card' 'Dispute' 'No Charge' 'Unknown' 'Prcard' 'Pcard']
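The two printed lists can also be compared programmatically. The sketch below uses plain Python sets on the values printed above; the variable names are chosen for this example only.

```python
# Unique 'payment_type' values as printed for the train and eval data above.
train_values = ["Cash", "Credit Card", "No Charge", "Unknown", "Dispute", "Pcard"]
eval_values = ["Cash", "Credit Card", "Dispute", "No Charge", "Unknown", "Prcard", "Pcard"]

# Values present in eval but never seen in training.
unexpected = sorted(set(eval_values) - set(train_values))
# Values seen in training that do not occur in eval.
missing_from_eval = sorted(set(train_values) - set(eval_values))

print(unexpected)         # ['Prcard']
print(missing_from_eval)  # []
```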


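## Correcting anomalies found by `ExampleValidator`
Problem:
- The eval data contains a `payment_type` value (`Prcard`) that is missing from the schema, as well as several unseen `company` values.
***
Solution:
- Add the new value to the domain of the feature `payment_type`.
- Relax the minimum fraction of values that must come from the domain for the feature `company`.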

In [14]:
# Add new value to the domain of feature 'payment_type'.
payment_type_domain = tfdv.get_domain(schema, 'payment_type')
payment_type_domain.value.append('Prcard')


# Relax the minimum fraction of values that must come from the domain for the feature 'company'.
company = tfdv.get_feature(schema, 'company')
company.distribution_constraints.min_domain_mass = 0.9
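As a rough illustration of what a `min_domain_mass` of 0.9 means, the sketch below computes the fraction of values that fall inside a domain and compares it with the threshold. It is not TFDV code: the helper names and the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def domain_mass(value_counts, domain):
    """Fraction of all observed values that belong to `domain`."""
    total = sum(value_counts.values())
    in_domain = sum(count for value, count in value_counts.items() if value in domain)
    return in_domain / total


def passes_min_domain_mass(value_counts, domain, min_domain_mass):
    """True if enough of the observed values come from the domain."""
    return domain_mass(value_counts, domain) >= min_domain_mass


# Hypothetical counts: 95 of 100 values come from the known domain.
company_domain = {"Taxi Affiliation Services", "Dispatch Taxi Affiliation"}
company_counts = {
    "Taxi Affiliation Services": 60,
    "Dispatch Taxi Affiliation": 35,
    "3385 - Eman Cab": 5,  # not in the domain
}

mass = domain_mass(company_counts, company_domain)
print(mass)
print(passes_min_domain_mass(company_counts, company_domain, 0.9))  # relaxed threshold
print(passes_min_domain_mass(company_counts, company_domain, 1.0))  # every value must match
```

With a threshold of 1.0 a single unseen company name is enough to raise an anomaly; lowering it tolerates a small share of new names, which suits a high-cardinality feature such as `company`.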

In [15]:
# Validate eval stats after updating the schema 
updated_anomalies = tfdv.validate_statistics(eval_stats, schema)
tfdv.display_anomalies(updated_anomalies)

## Check for anomaly in Train/Serving Environments

In [16]:
# Generate statistics for serving data with StatisticsGen
serving_stats = tfdv.generate_statistics_from_csv(SERVING_DATA)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)



Feature name,Anomaly short description,Anomaly long description
'tips',Column dropped,Column is completely missing


#### Serving Data Problem
- The label column (`tips`) shows up as an anomaly because it is absent from the serving data.

#### Solution:
- Tell TFDV to ignore that column in the serving environment.

In [17]:
# All features are by default in both TRAINING and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')


# Specify that 'tips' feature is not in SERVING environment.
tfdv.get_feature(schema, 'tips').not_in_environment.append('SERVING')


serving_anomalies_with_env = tfdv.validate_statistics(
            serving_stats, schema, environment='SERVING')


tfdv.display_anomalies(serving_anomalies_with_env)
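The effect of environments can be pictured with a small sketch in plain Python. This models the idea only and is not how TFDV stores or evaluates environments: the dictionary layout and both helper functions are hypothetical.

```python
def expected_features(features, environment):
    """Names of the features that should be present in `environment`.

    `features` maps a feature name to the set of environments it is excluded from.
    """
    return [name for name, excluded in features.items() if environment not in excluded]


def missing_columns(features, observed_columns, environment):
    """Expected features that are absent from the observed columns."""
    return [
        name
        for name in expected_features(features, environment)
        if name not in observed_columns
    ]


# 'tips' is the label, so it is excluded from the SERVING environment.
features = {"fare": set(), "payment_type": set(), "tips": {"SERVING"}}
serving_columns = {"fare", "payment_type"}

print(missing_columns(features, serving_columns, "TRAINING"))  # ['tips']
print(missing_columns(features, serving_columns, "SERVING"))   # []
```

Validating the serving columns against the training expectations reports `tips` as missing, while validating them in the serving environment does not.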

## Check for `drift` and `skew`
- TFDV performs this check by comparing the statistics of the different datasets based on the drift/skew comparators specified in the schema.
***
##### Drift: 
- Drift detection is supported for categorical features and compares consecutive spans of data (i.e., span N and span N+1), such as different days of training data.
- Drift is expressed as an L-infinity distance. You set a threshold on that distance in the schema and receive a warning when the drift exceeds it.
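For a categorical feature, the L-infinity distance between two datasets is the largest absolute difference between the relative frequencies of any single value. The sketch below computes it with the standard library only; the function name and the two sample spans are hypothetical, and this is an illustration of the metric rather than TFDV's code.

```python
from collections import Counter


def l_infinity_distance(values_a, values_b):
    """Largest absolute difference in relative frequency over all values."""
    freq_a, freq_b = Counter(values_a), Counter(values_b)
    n_a, n_b = len(values_a), len(values_b)
    all_values = set(freq_a) | set(freq_b)
    # Counter returns 0 for values that never occur in one of the datasets.
    return max(abs(freq_a[v] / n_a - freq_b[v] / n_b) for v in all_values)


# Hypothetical 'payment_type' values for two consecutive spans.
span_n = ["Cash"] * 6 + ["Credit Card"] * 4
span_n_plus_1 = ["Cash"] * 3 + ["Credit Card"] * 6 + ["Prcard"] * 1

distance = l_infinity_distance(span_n, span_n_plus_1)
threshold = 0.25  # illustrative threshold, not a recommended value
print(distance)
print(distance > threshold)
```

Here the share of `Cash` falls from 0.6 to 0.3, which is the largest change of any value, so the distance is about 0.3 and exceeds the illustrative threshold.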

##### Skew:
- TFDV can detect three different kinds of skew in your data - `schema skew`, `feature skew`, and `distribution skew`.

 

In [18]:
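# Compare the statistics of all 3 datasets (train, eval & serving) now that the schema is corrected.
# Skew and drift are only reported for features that have a comparator set in the schema.
# The thresholds below are illustrative values, not tuned ones.
tfdv.get_feature(schema, 'payment_type').skew_comparator.infinity_norm.threshold = 0.01
tfdv.get_feature(schema, 'company').drift_comparator.infinity_norm.threshold = 0.001

# eval_stats stands in for the previous span (drift); serving_stats is used for the skew check.
skew_anomalies = tfdv.validate_statistics(train_stats, schema,
                                          previous_statistics=eval_stats,
                                          serving_statistics=serving_stats)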



tfdv.display_anomalies(skew_anomalies)

### Freeze the schema

- Now that the schema has been reviewed and curated, we will store it in a file to reflect its "frozen" state.

In [19]:
from tensorflow.python.lib.io import file_io
from google.protobuf import text_format

In [20]:
# Create the output directory if needed, then write the schema as a text-format protobuf.
file_io.recursive_create_dir(OUTPUT_DIR)
schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')
tfdv.write_schema_text(schema, schema_file)