# Data Validation

Once we have data, the next step is to validate it. This step usually differs between typical research projects and production ML systems. In both scenarios we need to clean the data, for example by handling data types and filling in missing values. Production systems face additional problems, such as changes in the data structure or unexpected extra values being sent, so a production ML system has to be ready for these cases.

In the data validation step we:
 - check for data anomalies
 - check the data schema
 - check the data statistics (mean, standard deviation, max/min values) against baseline values

If any issue is found, we need to inspect the dataset manually and fix it.
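To make the three checks concrete, here is a minimal hand-rolled sketch in plain Python. It is not how TFDV works internally; the schema, the baseline mean, the 25% threshold and the `validate` helper are all hypothetical names and values chosen for illustration.

```python
from statistics import mean

# Hypothetical schema: feature name -> expected Python type.
EXPECTED_SCHEMA = {"product": str, "zip_code": str, "amount": float}

# Hypothetical baseline statistics captured from an earlier, trusted dataset.
BASELINE_MEAN = {"amount": 100.0}
MAX_RELATIVE_SHIFT = 0.25  # assumption: flag a mean that moves by more than 25%


def validate(records):
    """Return a list of human-readable issues found in `records`."""
    issues = []
    for i, row in enumerate(records):
        # Schema check: unexpected or missing features.
        extra = set(row) - set(EXPECTED_SCHEMA)
        missing = set(EXPECTED_SCHEMA) - set(row)
        if extra:
            issues.append(f"row {i}: unexpected features {sorted(extra)}")
        if missing:
            issues.append(f"row {i}: missing features {sorted(missing)}")
        # Anomaly check: value present but of the wrong type.
        for name, expected_type in EXPECTED_SCHEMA.items():
            value = row.get(name)
            if value is not None and not isinstance(value, expected_type):
                issues.append(f"row {i}: {name} is not {expected_type.__name__}")
    # Statistics check: compare the current mean with the baseline mean.
    for name, baseline in BASELINE_MEAN.items():
        values = [r[name] for r in records if isinstance(r.get(name), float)]
        if values:
            shift = abs(mean(values) - baseline) / abs(baseline)
            if shift > MAX_RELATIVE_SHIFT:
                issues.append(f"{name}: mean shifted by {shift:.0%}")
    return issues
```

An empty list means the batch passed all three checks; anything else is a prompt for the manual inspection described above.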

Such validation helps us identify data drift and feature deprecation early and act on them (for example, by retraining the model with different features).
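As an illustration of what "drift" can mean in practice, the sketch below measures how far the category frequencies of a feature have moved between a baseline dataset and a newer one, using the L-infinity distance (the largest change in any single category's share). This is a hand-written example, not TFDV's own implementation; the function names, the sample data and the 0.1 threshold are assumptions.

```python
from collections import Counter


def category_distribution(values):
    """Map each category to its relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def l_infinity_distance(baseline_values, current_values):
    """Largest absolute change in relative frequency across all categories."""
    baseline = category_distribution(baseline_values)
    current = category_distribution(current_values)
    categories = set(baseline) | set(current)
    return max(
        (abs(baseline.get(c, 0.0) - current.get(c, 0.0)) for c in categories),
        default=0.0,
    )


DRIFT_THRESHOLD = 0.1  # hypothetical threshold

# Made-up example data: the share of "loan" grows from 50% to 80%.
baseline_products = ["loan"] * 50 + ["card"] * 50
current_products = ["loan"] * 80 + ["card"] * 20

distance = l_infinity_distance(baseline_products, current_products)
drift_detected = distance > DRIFT_THRESHOLD
```

A feature whose distance exceeds the threshold is a candidate for closer inspection, and possibly for retraining.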

TensorFlow Extended (TFX) lets us validate data with the `TensorFlow Data Validation` component, or simply `TFDV`. The package is normally installed along with TFX, but if you need it as a standalone package, install it with the command below.

<center>pip install tensorflow-data-validation</center>

(TFDV accepts two data formats, CSV and TFRecord files, and processes them via Apache Beam.)

In [None]:
import tensorflow_data_validation as tfdv

# Compute summary statistics directly from a CSV file.
stats = tfdv.generate_statistics_from_csv(
    data_location='data/consumer_complaints_with_narrative.csv',
    delimiter=',')

In [4]:
# Alternatively, compute the statistics from a TFRecord file.
stats = tfdv.generate_statistics_from_tfrecord(data_location='data/consumer_complaints.tfrecord')

In [5]:
stats

datasets {
  num_examples: 66799
  features {
    type: STRING
    string_stats {
      common_stats {
        num_non_missing: 66799
        min_num_values: 1
        max_num_values: 1
        avg_num_values: 1.0
        num_values_histogram {
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 6679.9
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 6679.9
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 6679.9
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 6679.9
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 6679.9
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 6679.9
          }
          buckets {
            low

In the TFDV summary statistics, the following values are reported for every numerical feature:
- overall count of records
- number of missing records
- mean and standard deviation
- min and max
- percentage of zero values

TFDV also generates a histogram of values for each feature.
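To show what the numerical summary values mean, the sketch below computes them for a single column in plain Python. It only illustrates the quantities listed above and does not use TFDV; the `numeric_summary` helper is hypothetical, `None` is assumed to mark a missing value, the population standard deviation is an assumption, and the zero percentage is taken over non-missing values. The histogram is omitted.

```python
from statistics import mean, pstdev


def numeric_summary(values):
    """Summarise a numeric column; `None` marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": mean(present),
        "std": pstdev(present),  # population standard deviation (assumption)
        "min": min(present),
        "max": max(present),
        # Share of zeros among the non-missing values, in percent.
        "zeros_pct": 100.0 * sum(1 for v in present if v == 0) / len(present),
    }


# Made-up column with one missing value and one zero.
summary = numeric_summary([0, 2, 4, None, 2])
```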

For categorical features, it reports:
- count of data records
- percentage of missing values
- number of unique values
- average string length of the feature's values
- the sample count for each category (label) and its rank
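The categorical summary values can be illustrated the same way. The sketch below is plain Python rather than TFDV; the `categorical_summary` helper is hypothetical, `None` is assumed to mark a missing value, and ranks start at 1 for the most frequent category, which is a choice made for this example.

```python
from collections import Counter


def categorical_summary(values):
    """Summarise a string column; `None` marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    # most_common() sorts by descending count; rank 1 is the most frequent label.
    ranked = [
        {"rank": rank, "label": label, "sample_count": n}
        for rank, (label, n) in enumerate(counts.most_common(), start=1)
    ]
    return {
        "count": len(values),
        "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
        "unique": len(counts),
        "avg_length": sum(len(v) for v in present) / len(present),
        "ranked": ranked,
    }


# Made-up column with one missing value.
summary = categorical_summary(["loan", "card", "loan", None])
```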

These values serve as a baseline for later validation cycles; their usage is shown later.

Before that, we need to generate a data schema, which defines the next validation steps.
