# Data Validation

Once we have data, the next step is to validate it. This step usually differs between typical research projects and production ML systems. In both scenarios we need to clean the data: handle data types, fill in missing values, and so on. Production systems face additional problems, such as changes in the data structure or unexpected extra values being sent. It is therefore important to be prepared for such scenarios in production ML systems.

In the data validation step we:
 - check for data anomalies
 - check the data schema
 - check the data statistics (mean, std, max/min values) against the baseline values

If any issue is found, we need to manually inspect the dataset and fix it.
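To make the statistics check concrete, here is a minimal sketch in plain Python. It is not how TFDV does it internally; the helper names (`summarize`, `compare_to_baseline`), the 10% tolerance, and the sample numbers are all assumptions made up for illustration.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def compare_to_baseline(baseline, current, rel_tol=0.1):
    """List the statistics that moved more than rel_tol relative to the baseline."""
    flagged = []
    for name, base_value in baseline.items():
        # Fall back to a scale of 1.0 when the baseline value is exactly 0.
        scale = abs(base_value) or 1.0
        if abs(current[name] - base_value) / scale > rel_tol:
            flagged.append(name)
    return flagged


# Hypothetical feature values: the new batch contains one extreme value.
baseline_stats = summarize([9.0, 12.0, 11.0, 13.0, 10.0])
new_stats = summarize([9.0, 12.0, 11.0, 13.0, 40.0])

flagged_stats = compare_to_baseline(baseline_stats, new_stats)
print(flagged_stats)
```

The minimum is unchanged here, so only the statistics affected by the extreme value get flagged for manual inspection.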

Such validation helps us identify data drift and feature deprecations early, so that we can act upon them (for example, by retraining the model with different features).

TensorFlow Extended (TFX) lets us validate data with the `TensorFlow Data Validation` component, or simply `TFDV`. This package is normally installed along with TFX, but if you need it as a standalone package, install it with the command below.

<center>pip install tensorflow-data-validation</center>

(TFDV accepts two data formats, CSV and TFRecord files, and processes them via Apache Beam.)

In [None]:
import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_csv(data_location='data/consumer_complaints_with_narrative.csv',
                                          delimiter=',')

In [4]:
stats = tfdv.generate_statistics_from_tfrecord(data_location='data/consumer_complaints.tfrecord')

In [None]:
stats

In the TFDV summary statistics, the following values are given for every numerical feature:
- overall count of records
- number of records with missing data
- mean and standard deviation
- minimum and maximum
- percentage of zero values

TFDV also generates a histogram of values for each feature.

For categorical features, it gives:
- count of data records
- percentage of missing values
- number of unique values
- average string length of the values
- for each category, its sample count and its rank
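To get a feel for what the categorical statistics listed above mean, here is a small sketch in plain Python. The function `categorical_summary` and the sample values are hypothetical, and this is only an approximation of the idea, not what TFDV computes internally.

```python
from collections import Counter


def categorical_summary(values):
    """Summarize a categorical column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(values),
        "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
        "unique": len(counts),
        "avg_str_len": sum(len(v) for v in present) / len(present) if present else 0.0,
        # most_common() orders categories by frequency, which gives the rank.
        "ranked": counts.most_common(),
    }


# Hypothetical column with one missing value.
summary = categorical_summary(["loan", "card", "loan", None, "loan"])
print(summary)
```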

These values are useful as a baseline for later validation cycles. Their usage is shown later.

Before that, we need to generate our data schema, which defines the next validation steps.


### Generate Schema

A data schema defines what our data should look like and what types its features have. We can also specify constraints, such as minimum and maximum thresholds for the allowed fraction of records with missing values. We can generate schema information for our data as shown below.

In [6]:
schema = tfdv.infer_schema(stats)

This generates a schema stored as a TensorFlow `protocol buffer` message.

In [None]:
tfdv.display_schema(schema)

In the output above, presence indicates whether the feature is required or optional. Valency defines the number of values required per training example for that feature. For example, a categorical feature should have only one value per record.
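The sketch below illustrates presence and valency in plain Python. It is only a rough illustration under my own assumptions: `check_feature`, its parameters, and the sample records are made up, and TFDV's real checks work on the schema and statistics objects rather than on raw dicts.

```python
def check_feature(records, name, min_fraction=1.0, min_values=1, max_values=1):
    """Check one feature against presence and valency constraints.

    records: list of dicts mapping a feature name to its list of values.
    """
    problems = []
    # A feature counts as present only if it has at least one value.
    present = [r[name] for r in records if r.get(name)]
    fraction = len(present) / len(records)
    if fraction < min_fraction:
        problems.append(f"presence {fraction:.2f} < required {min_fraction:.2f}")
    if any(not (min_values <= len(vals) <= max_values) for vals in present):
        problems.append(f"valency outside [{min_values}, {max_values}]")
    return problems


# Hypothetical records: the third one has two products and no sub_issue.
records = [
    {"product": ["Mortgage"], "sub_issue": ["Late fee"]},
    {"product": ["Credit card"], "sub_issue": []},
    {"product": ["Mortgage", "Loan"]},
]

product_problems = check_feature(records, "product")
sub_issue_problems = check_feature(records, "sub_issue", min_fraction=0.9)
print(product_problems)
print(sub_issue_problems)
```

Here `product` is always present but breaks the single-value valency rule, while `sub_issue` is present in too few records.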

The generated schema might not be exactly what we need. In those cases we update the schema to fit our needs.

Before that, below is a demonstration of how TFDV can be used to spot data issues.

Assume we have two datasets (say, train and validation). We need to verify that the two datasets have similar characteristics. We can do so as shown below.

In [10]:
# Both calls read the same file here; in practice these would be two different datasets
train_stats = tfdv.generate_statistics_from_tfrecord(data_location='data/consumer_complaints.tfrecord')
valid_stats = tfdv.generate_statistics_from_tfrecord(data_location='data/consumer_complaints.tfrecord')

# Show the two sets of statistics side by side
tfdv.visualize_statistics(lhs_statistics=valid_stats, rhs_statistics=train_stats,
                          lhs_name='VAL_DATASET', rhs_name='TRAIN_DATASET')


Very cool, huh! Since my two datasets are the same, the diagrams tell us nothing here, but they show how to use TFDV to compare datasets.

We can also use TFDV to detect anomalies in the data.

In [11]:
anomalies = tfdv.validate_statistics(statistics=train_stats, schema=schema)
tfdv.display_anomalies(anomalies=anomalies)

### Update Schema

Sometimes we need to change our schema to match our business needs. We can do that as shown below.

In [12]:
considering_feature = tfdv.get_feature(schema, 'sub_issue')
considering_feature.presence.min_fraction = 0.9

As you can see, we access the feature object in the schema and update its values using attribute notation. Here we require `sub_issue` to be present in at least 90% of the records.
Once we are done with all schema-related tasks, we can write the schema to disk and load it back as shown below.

In [15]:
tfdv.write_schema_text(schema, output_path='data/schema_output.schema')
schema = tfdv.load_schema_text('data/schema_output.schema')

### Data Skew and Drift

Data skew and drift are concerns we usually encounter in long-running machine learning systems. As the data distribution changes, our models start to perform poorly. To identify such cases, TFDV provides the `skew_comparator`, which detects large differences between the statistics of two datasets. The difference is measured as the L-infinity norm of the difference between the two datasets' statistics.

> The L-infinity norm of a vector is the maximum absolute value of its entries (for example, [3, -10, -5] --> 10). So when comparing two datasets, we first take the difference of their statistics vectors and then compute its L-infinity norm. If it is larger than a predefined threshold, we consider it an anomaly.
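The calculation described above can be sketched in plain Python. Treat it as an illustration only: I am assuming the compared vectors are the normalized value counts of a categorical feature, and the helper names and sample values are hypothetical rather than part of TFDV.

```python
from collections import Counter


def l_infinity_norm(vector):
    """Maximum absolute value among the entries of a vector."""
    return max(abs(x) for x in vector)


def normalized_counts(values):
    """Map each category to its share of the total count."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def l_infinity_distance(values_a, values_b):
    """L-infinity norm of the difference between two category distributions."""
    dist_a = normalized_counts(values_a)
    dist_b = normalized_counts(values_b)
    # A category missing from one dataset contributes a share of 0 there.
    categories = set(dist_a) | set(dist_b)
    return l_infinity_norm(
        [dist_a.get(c, 0.0) - dist_b.get(c, 0.0) for c in categories]
    )


# Hypothetical values of one categorical feature in two datasets.
train_values = ["A", "A", "B", "C"]
serving_values = ["A", "B", "B", "B"]

distance = l_infinity_distance(train_values, serving_values)
is_anomaly = distance > 0.01  # same threshold value as in the TFDV cell
print(distance, is_anomaly)
```

The share of "B" moves from 0.25 to 0.75, which is the largest change across categories, so that single category decides the distance.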

Below is a code sample that sets a skew threshold in TFDV and validates against it.

In [17]:
tfdv.get_feature(schema, 'company').skew_comparator.infinity_norm.threshold = 0.01

skew_anomalies = tfdv.validate_statistics(statistics=train_stats,
                                          schema=schema,
                                          serving_statistics=valid_stats)

We can visualize the result with `tfdv.display_anomalies(skew_anomalies)` to see the anomalies (if any!).

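Similar to `skew_comparator`, there is a `drift_comparator` to identify dataset drift issues. It is used in the same way: set a threshold on the feature, then pass the earlier statistics to `validate_statistics` through the `previous_statistics` argument.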