# Introduction to TensorFlow Data Validation


## Learning objectives

1. Review TFDV methods.
2. Generate statistics.
3. Visualize statistics.
4. Infer a schema.
5. Update a schema.



## Introduction 
This lab is an introduction to TensorFlow Data Validation (TFDV), a key component of TensorFlow Extended.  This lab serves as a foundation for understanding the features of TFDV and how it can help you understand, validate, and monitor your data. 

TFDV can be used for generating schemas and statistics about the distribution of every feature in the dataset. Such information is useful for comparing multiple datasets (e.g. training vs. inference datasets) and reporting:

* Statistical differences in the feature distributions

TFDV also offers visualization capabilities for comparing datasets based on the Google PAIR Facets project.

Each learning objective will correspond to a __#TODO__ in the [student lab notebook](../labs/tfdv_basic_spending.ipynb) -- try to complete that notebook first before reviewing this solution notebook.

### Import Libraries

In [1]:
!pip install pyarrow==10.0.1
!pip install numpy==1.19.2
!pip install tensorflow-data-validation


Collecting pyarrow==2.0.0
  Using cached pyarrow-2.0.0-cp37-cp37m-manylinux2014_x86_64.whl (17.7 MB)
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 5.0.0
    Uninstalling pyarrow-5.0.0:
      Successfully uninstalled pyarrow-5.0.0
Successfully installed pyarrow-2.0.0


**Restart the kernel (Kernel > Restart kernel > Restart).**

**Re-run the above cell and proceed further.**

**Note: Please ignore any incompatibility warnings and errors.**

In [2]:
import pandas as pd
import tensorflow_data_validation as tfdv
import sys
import warnings
warnings.filterwarnings('ignore')

print('Installing TensorFlow Data Validation')
!pip install -q tensorflow_data_validation[visualization]

print('TFDV version: {}'.format(tfdv.version.__version__))
# Confirm that we're using Python 3
assert sys.version_info.major == 3, 'Oops, not running Python 3. Use Runtime > Change runtime type'


Installing TensorFlow Data Validation
TFDV version: 1.5.0


###  Load the Consumer Spending Dataset

We will download our dataset from Google Cloud Storage. The columns in the dataset are:

* `Graduated`: Whether or not the person is a college graduate
* `Profession`: The person's profession
* `Work_Experience`: The number of years in the workforce
* `Family_Size`: The size of the family unit
* `Spending_Score`: The spending score for consumer spending

In [3]:
# TODO
score_train = pd.read_csv('data/score_train.csv')
score_train.head() 

| | Graduated | Profession | Work_Experience | Family_Size | Spending_Score |
|---|---|---|---|---|---|
| 0 | No | Healthcare | 1.0 | 4.0 | Low |
| 1 | Yes | Engineer | NaN | 3.0 | Average |
| 2 | Yes | Engineer | 1.0 | 1.0 | Low |
| 3 | Yes | Lawyer | 0.0 | 2.0 | High |
| 4 | Yes | Entertainment | NaN | 6.0 | High |


In [5]:
# TODO
score_test = pd.read_csv('data/score_test.csv')
score_test.head()

| | Graduated | Profession | Work_Experience | Family_Size | Spending_Score |
|---|---|---|---|---|---|
| 0 | No | Doctor | 0.0 | 5.0 | Average |
| 1 | Yes | Entertainment | 1.0 | 4.0 | Average |
| 2 | No | Lawyer | 0.0 | 5.0 | Low |
| 3 | Yes | Executive | 1.0 | 5.0 | High |
| 4 | Yes | Artist | 1.0 | 2.0 | Average |


In [7]:
score_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 5 columns):
Graduated          3964 non-null object
Profession         3944 non-null object
Work_Experience    3589 non-null float64
Family_Size        3831 non-null float64
Spending_Score     4000 non-null object
dtypes: float64(2), object(3)
memory usage: 156.4+ KB
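
The non-null counts above already hint at which features have missing values. As a quick sanity check, the arithmetic can be sketched in plain Python. The counts are copied from the `info()` output above, and `missing_fraction` is a hypothetical helper name used only for illustration.

```python
# Non-null counts copied from the score_train.info() output above.
TOTAL_ROWS = 4000
non_null_counts = {
    "Graduated": 3964,
    "Profession": 3944,
    "Work_Experience": 3589,
    "Family_Size": 3831,
    "Spending_Score": 4000,
}


def missing_fraction(non_null, total):
    """Return the fraction of rows in which the value is missing."""
    return (total - non_null) / total


missing = {
    name: missing_fraction(count, TOTAL_ROWS)
    for name, count in non_null_counts.items()
}

for name, fraction in missing.items():
    print(f"{name:<16} {fraction:.2%} missing")
```

`Work_Experience` has by far the largest share of missing values, while `Spending_Score` has none. That fits the decision later in this lab to keep `Work_Experience` optional.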


#### Review the methods present in TFDV

In [8]:
# List the names (functions, classes, and submodules) that tfdv exposes.
# TODO
dir(tfdv)

['CombinerStatsGenerator',
 'DecodeCSV',
 'DecodeTFExample',
 'FeaturePath',
 'GenerateStatistics',
 'LiftStatsGenerator',
 'NonStreamingCustomStatsGenerator',
 'StatsOptions',
 'TFDV_ACCEPT_RECORD_BATCH',
 'TransformStatsGenerator',
 'WriteStatisticsToTFRecord',
 'WriteStatisticsToText',
 '_',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'anomalies',
 'api',
 'arrow',
 'coders',
 'compare_slices',
 'constants',
 'display_anomalies',
 'display_schema',
 'generate_statistics_from_csv',
 'generate_statistics_from_dataframe',
 'generate_statistics_from_tfrecord',
 'get_domain',
 'get_feature',
 'get_feature_value_slicer',
 'get_slice_stats',
 'infer_schema',
 'load_anomalies_text',
 'load_schema_text',
 'load_statistics',
 'load_stats_text',
 'pywrap',
 'set_domain',
 'statistics',
 'types',
 'update_schema',
 'utils',
 'validate_examples_in_csv',
 'validate_examples_in_tfrecord',
 'validate

### Describing data with TFDV
The usual workflow when using TFDV during training is as follows:


1.   Generate statistics for the data
2.   Use those statistics to generate a schema for each feature
3.   Visualize the schema and statistics and manually inspect them
4.   Update the schema if needed


### Compute and visualize statistics

First we'll compute statistics for our training data. Because the data is already loaded into a pandas DataFrame, we'll use [`tfdv.generate_statistics_from_dataframe`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe). (You can ignore any snappy warnings.)

TFDV can compute descriptive [statistics](https://github.com/tensorflow/metadata/blob/v0.6.0/tensorflow_metadata/proto/v0/statistics.proto) that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.

Internally, TFDV uses [Apache Beam](https://beam.apache.org/)'s data-parallel processing framework to scale the computation of statistics over large datasets. For applications that wish to integrate deeper with TFDV (e.g., attach statistics generation at the end of a data-generation pipeline), the API also exposes a Beam PTransform for statistics generation.

**NOTE: TFDV provides several functions for computing statistics:**
* [tfdv.generate_statistics_from_csv](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_csv)
* [tfdv.generate_statistics_from_dataframe](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe)
* [tfdv.generate_statistics_from_tfrecord](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_tfrecord)

#### Generate Statistics from a Pandas DataFrame

In [10]:
# Compute data statistics for the input pandas DataFrame.
# TODO
stats = tfdv.generate_statistics_from_dataframe(dataframe=score_train)

Now let's use [`tfdv.visualize_statistics`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics), which uses [Facets](https://pair-code.github.io/facets/) to create a succinct visualization of our training data:

* Notice that numeric features and categorical features are visualized separately, and that charts are displayed showing the distributions for each feature.
* Notice that features with missing or zero values display a percentage in red as a visual indicator that there may be issues with examples in those features.  The percentage is the percentage of examples that have missing or zero values for that feature.
* Notice that `Work_Experience` has the most missing values: only 3,589 of the 4,000 examples have a value for it.
* Try clicking "expand" above the charts to change the display
* Try hovering over bars in the charts to display bucket ranges and counts
* Try switching between the log and linear scales, and notice how the log scale can reveal more detail about a categorical feature such as `Profession`
* Try selecting "quantiles" from the "Chart to show" menu, and hover over the markers to show the quantile percentages

In [11]:
# Visualize the input statistics using Facets.
# TODO
tfdv.visualize_statistics(stats)

#### TFDV generates different types of statistics based on the type of features.

**For numerical features, TFDV computes for every feature:**
* Count of records
* Number of missing (i.e., null) values
* Histogram of values
* Mean and standard deviation
* Minimum and maximum values
* Percentage of zero values

**For categorical features, TFDV provides:**
* Count of values
* Percentage of missing values
* Number of unique values
* Average string length
* Count for each label and its rank
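
To make the two lists above concrete, here is a minimal sketch that computes the same kinds of statistics by hand for two tiny columns. It is not how TFDV is implemented. `numeric_summary` and `categorical_summary` are hypothetical helper names, the sample values are the first five `Work_Experience` and `Profession` entries of the training data, and the choice of population standard deviation and of measuring zeros against present values are assumptions made for this illustration.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Summarize a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "num_missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "std_dev": statistics.pstdev(present),  # population standard deviation
        "min": min(present),
        "max": max(present),
        "pct_zeros": 100.0 * sum(1 for v in present if v == 0) / len(present),
    }


def categorical_summary(values):
    """Summarize a string column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
        "num_unique": len(set(present)),
        "avg_length": statistics.mean(len(v) for v in present),
        # (label, count) pairs ordered from most to least frequent.
        "ranked_counts": Counter(present).most_common(),
    }


work_experience = [1.0, None, 1.0, 0.0, None]
profession = ["Healthcare", "Engineer", "Engineer", "Lawyer", "Entertainment"]

numeric = numeric_summary(work_experience)
categorical = categorical_summary(profession)
print(numeric)
print(categorical)
```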

### Let's compare the score_train and the score_test datasets

In [12]:
train_stats = tfdv.generate_statistics_from_dataframe(dataframe=score_train)
test_stats = tfdv.generate_statistics_from_dataframe(dataframe=score_test)

tfdv.visualize_statistics(
  lhs_statistics=train_stats, lhs_name='TRAIN_DATASET',
  rhs_statistics=test_stats, rhs_name='NEW_DATASET')
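
One simple way to put a number on the differences that the side-by-side charts show is to compare the relative frequency of each category in the two datasets. The sketch below does this in plain Python for `Spending_Score`, using only the five rows shown by `head()` for each dataset. `relative_frequencies` and `max_frequency_gap` are hypothetical helper names, and this illustrates the idea rather than TFDV's own comparison logic.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its share of the column."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def max_frequency_gap(left, right):
    """Return (category, gap) for the largest absolute frequency difference."""
    left_freq = relative_frequencies(left)
    right_freq = relative_frequencies(right)
    categories = sorted(set(left_freq) | set(right_freq))
    gaps = {
        c: abs(left_freq.get(c, 0.0) - right_freq.get(c, 0.0)) for c in categories
    }
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]


# Spending_Score values from the head() outputs of score_train and score_test.
train_scores = ["Low", "Average", "Low", "High", "High"]
test_scores = ["Average", "Average", "Low", "High", "Average"]

category, gap = max_frequency_gap(train_scores, test_scores)
print(category, round(gap, 2))
```

Five rows are far too few to draw conclusions from; on the full datasets the same comparison would be made over all 4,000 training examples.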


### Infer a schema

Now let's use [`tfdv.infer_schema`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema) to create a schema for our data.  A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it's numerical or categorical, or the frequency of its presence in the data.  For categorical features the schema also defines the domain - the list of acceptable values.  Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.

Getting the schema right is important because the rest of our production pipeline will be relying on the schema that TFDV generates to be correct.  
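
To give a feel for what inferring a schema from data means, here is a deliberately simplified sketch in plain Python. It is not TFDV's inference algorithm. `infer_simple_schema` is a hypothetical helper, the rows are the first four training examples restricted to three columns, and the rules (strings get a domain, a feature is required only if it is never missing) are assumptions chosen to mirror the description above.

```python
def infer_simple_schema(rows):
    """Infer a toy schema (type, presence, domain) from a list of dict rows."""
    schema = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        present = [v for v in values if v is not None]
        is_string = all(isinstance(v, str) for v in present)
        schema[name] = {
            "type": "BYTES" if is_string else "FLOAT",
            "presence": "required" if len(present) == len(values) else "optional",
            # Only string (categorical) features get a domain of accepted values.
            "domain": sorted(set(present)) if is_string else None,
        }
    return schema


rows = [
    {"Graduated": "No", "Work_Experience": 1.0, "Spending_Score": "Low"},
    {"Graduated": "Yes", "Work_Experience": None, "Spending_Score": "Average"},
    {"Graduated": "Yes", "Work_Experience": 1.0, "Spending_Score": "Low"},
    {"Graduated": "Yes", "Work_Experience": 0.0, "Spending_Score": "High"},
]

toy_schema = infer_simple_schema(rows)
for name, spec in toy_schema.items():
    print(name, spec)
```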

#### Generating Schema
Once statistics are generated, the next step is to generate a schema for our dataset. The schema maps each feature in the dataset to a type (float, bytes, etc.) and defines feature boundaries (min, max, distribution of values, missing values, etc.).

With TFDV, we generate a schema from statistics using [`tfdv.infer_schema`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema):

In [13]:
# Infers schema from the input statistics.
# TODO
schema = tfdv.infer_schema(statistics=stats)
print(schema)

feature {
  name: "Graduated"
  value_count {
    min: 1
    max: 1
  }
  type: BYTES
  domain: "Graduated"
  presence {
    min_count: 1
  }
}
feature {
  name: "Profession"
  value_count {
    min: 1
    max: 1
  }
  type: BYTES
  domain: "Profession"
  presence {
    min_count: 1
  }
}
feature {
  name: "Work_Experience"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_count: 1
  }
}
feature {
  name: "Family_Size"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_count: 1
  }
}
feature {
  name: "Spending_Score"
  type: BYTES
  domain: "Spending_Score"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
string_domain {
  name: "Graduated"
  value: "No"
  value: "Yes"
}
string_domain {
  name: "Profession"
  value: "Artist"
  value: "Doctor"
  value: "Engineer"
  value: "Entertainment"
  value: "Executive"
  value: "Healthcare"
  value: "Homemaker"
  value: "Lawyer"
  value: "Mar

The schema also provides documentation for the data, and so is useful when different developers work on the same data.  Let's use [`tfdv.display_schema`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_schema) to display the inferred schema so that we can review it.

In [14]:
tfdv.display_schema(schema=schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'Graduated' | STRING | optional | single | 'Graduated' |
| 'Profession' | STRING | optional | single | 'Profession' |
| 'Work_Experience' | FLOAT | optional | single | - |
| 'Family_Size' | FLOAT | optional | single | - |
| 'Spending_Score' | STRING | required | | 'Spending_Score' |


| Domain | Values |
|---|---|
| 'Graduated' | 'No', 'Yes' |
| 'Profession' | 'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Homemaker', 'Lawyer', 'Marketing' |
| 'Spending_Score' | 'Average', 'High', 'Low' |


#### TFDV provides an API to print a summary of each feature's schema: `tfdv.display_schema`

In this visualization, the columns stand for:

**Type** indicates the feature datatype.

**Presence** indicates whether the feature must be present in 100% of examples (required) or not (optional).

**Valency** indicates the number of values required per training example. 

**Domain and Values** indicate the feature domain and its accepted values.

In the case of categorical features, `single` indicates that each training example must have exactly one category for the feature.

### Updating the Schema 
As stated above, **Presence** indicates whether the feature must be present in 100% of examples (required) or not (optional).  Currently, all of our features except for our target label are shown as "optional". We want all of our features to be required except for `Work_Experience`, so we need to update the schema.

TFDV lets you update the schema according to your domain knowledge of the data if you are not satisfied with the auto-generated schema.  We will walk through three kinds of update: making a feature required, adding a value to (or removing one from) a feature's domain, and changing a feature's type from float to integer.

#### Change optional features to required.

In [15]:
# Mark Graduated, Profession, and Family_Size as required (present in 100% of examples).
Graduated_feature = tfdv.get_feature(schema, 'Graduated')
Graduated_feature.presence.min_fraction = 1.0
Profession_feature = tfdv.get_feature(schema, 'Profession')
Profession_feature.presence.min_fraction = 1.0
Family_Size_feature = tfdv.get_feature(schema, 'Family_Size')
Family_Size_feature.presence.min_fraction = 1.0


In [16]:
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'Graduated' | STRING | required | single | 'Graduated' |
| 'Profession' | STRING | required | single | 'Profession' |
| 'Work_Experience' | FLOAT | optional | single | - |
| 'Family_Size' | FLOAT | required | single | - |
| 'Spending_Score' | STRING | required | | 'Spending_Score' |


| Domain | Values |
|---|---|
| 'Graduated' | 'No', 'Yes' |
| 'Profession' | 'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Homemaker', 'Lawyer', 'Marketing' |
| 'Spending_Score' | 'Average', 'High', 'Low' |


#### Update a feature with a new value

Let's add "Self-Employed" to the domain of the `Profession` feature.

In [17]:
Profession_domain = tfdv.get_domain(schema, 'Profession')
Profession_domain.value.insert(0, 'Self-Employed')
Profession_domain.value
# The index 0 inserts 'Self-Employed' as the first value; an index of 3
# would place it after the third value.

['Self-Employed', 'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Homemaker', 'Lawyer', 'Marketing']
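
The `insert` call above follows the same index convention as a plain Python list, so the effect of the index argument can be tried without TFDV. This is only a sketch with a shortened, illustrative list of professions:

```python
professions = ["Artist", "Doctor", "Engineer", "Entertainment"]

# Index 0 places the new value before every existing value.
inserted_first = list(professions)
inserted_first.insert(0, "Self-Employed")

# Index 3 places the new value after the third existing value.
inserted_fourth = list(professions)
inserted_fourth.insert(3, "Self-Employed")

print(inserted_first)
print(inserted_fourth)
```

The position of a value in the domain does not change which values are accepted; it only changes the order in which they are listed.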

#### Let's remove "Homemaker" from "Profession"

In [19]:
Profession_domain = tfdv.get_domain(schema, 'Profession')
Profession_domain.value.remove('Homemaker')

In [20]:
Profession_domain.value

['Self-Employed', 'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Lawyer', 'Marketing']

#### Change a feature from a float to an integer

In [22]:
# Update Family_Size to Int
size = tfdv.get_feature(schema, 'Family_Size')

In [23]:
# 2 is the numeric value of INT in the schema's FeatureType enum.
size.type = 2
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'Graduated' | STRING | required | single | 'Graduated' |
| 'Profession' | STRING | required | single | 'Profession' |
| 'Work_Experience' | FLOAT | optional | single | - |
| 'Family_Size' | INT | required | single | - |
| 'Spending_Score' | STRING | required | | 'Spending_Score' |


| Domain | Values |
|---|---|
| 'Graduated' | 'No', 'Yes' |
| 'Profession' | 'Self-Employed', 'Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Lawyer', 'Marketing' |
| 'Spending_Score' | 'Average', 'High', 'Low' |


In the next lab, you will compare two datasets and check for anomalies.

## When to use TFDV

It's easy to think of TFDV as only applying to the start of your training pipeline, as we did here, but in fact it has many uses. Here are a few more:

* Validating new data for inference to make sure that we haven't suddenly started receiving bad features
* Validating new data for inference to make sure that our model has trained on that part of the decision surface
* Validating our data after we've transformed it and done feature engineering (probably using [TensorFlow Transform](https://www.tensorflow.org/tfx/guide/transform)) to make sure we haven't done something wrong
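
As a rough illustration of the first use case, the sketch below checks incoming records against the `Profession` domain from the updated schema. It uses plain Python rather than TFDV's own validation functions. `find_out_of_domain` is a hypothetical helper, and the incoming records are made up for the example.

```python
# Domain values copied from the updated schema shown earlier in this lab.
PROFESSION_DOMAIN = {
    "Self-Employed", "Artist", "Doctor", "Engineer", "Entertainment",
    "Executive", "Healthcare", "Lawyer", "Marketing",
}


def find_out_of_domain(records, feature, domain):
    """Return the sorted values of `feature` that are not in `domain`."""
    seen = {r[feature] for r in records if r.get(feature) is not None}
    return sorted(seen - domain)


# Hypothetical incoming records.
incoming = [
    {"Profession": "Doctor"},
    {"Profession": "Homemaker"},  # removed from the domain earlier in this lab
    {"Profession": None},         # missing value, not a domain violation
    {"Profession": "Pilot"},      # never seen during training
]

unexpected = find_out_of_domain(incoming, "Profession", PROFESSION_DOMAIN)
print(unexpected)
```

A check like this flags values the model never saw during training, which is the kind of problem the schema is meant to catch before bad features reach the model.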

https://github.com/GoogleCloudPlatform/mlops-on-gcp/blob/master/examples/tfdv-structured-data/tfdv-covertype.ipynb