# TensorFlow Data Validation

This example Colab notebook illustrates how TensorFlow Data Validation (TFDV) can be used to investigate and visualize a dataset. That includes looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift and skew in the dataset. It's important to understand a dataset's characteristics, including how it might change over time in a production pipeline. It's also important to look for anomalies in the data and to compare the training, evaluation, and serving datasets to make sure that they're consistent.

The dataset utilized in this example is the [Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew) provided by the City of Chicago.

Note: The data for these applications has been adapted from its original form on www.cityofchicago.org, the City of Chicago's official website. The City of Chicago does not guarantee the data's accuracy, timeliness, or completeness presented here. The data is subject to change and should be used at one's own risk.

To learn more about the dataset, visit [Google BigQuery](https://cloud.google.com/bigquery/public-data/chicago-taxi) and explore the dataset in detail via the [BigQuery UI](https://bigquery.cloud.google.com/dataset/bigquery-public-data:chicago_taxi_trips).

Key Point: As developers and modelers, it's important to consider how this data is used, along with the possible benefits and harms of a model's predictions. A model could reinforce existing societal biases and inequalities. Evaluate whether a feature is truly relevant to the problem you want to solve or whether it might introduce bias. For more information, read about [ML fairness](https://developers.google.com/machine-learning/fairness-overview/).

The columns in the dataset are:
<table>
<tr><td>pickup_community_area</td><td>fare</td><td>trip_start_month</td></tr>
<tr><td>trip_start_hour</td><td>trip_start_day</td><td>trip_start_timestamp</td></tr>
<tr><td>pickup_latitude</td><td>pickup_longitude</td><td>dropoff_latitude</td></tr>
<tr><td>dropoff_longitude</td><td>trip_miles</td><td>pickup_census_tract</td></tr>
<tr><td>dropoff_census_tract</td><td>payment_type</td><td>company</td></tr>
<tr><td>trip_seconds</td><td>dropoff_community_area</td><td>tips</td></tr>
</table>

## Install and import packages

Install the packages for TensorFlow Data Validation.

### Upgrade Pip

To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab.  Local systems can of course be upgraded separately.

In [None]:
try:
  import colab
  !pip install --upgrade pip
except:
  pass

### Install Data Validation packages

Install the TensorFlow Data Validation packages and dependencies, which takes a few minutes. You may see warnings and errors regarding incompatible dependency versions, which you will resolve in the next section.

In [None]:
print('Installing TensorFlow Data Validation')
!pip install --upgrade 'tensorflow_data_validation[visualization]<2'

### Import TensorFlow and reload updated packages

The previous step updates the default packages in the Google Colab environment, so you must reload the package resources to resolve the new dependencies.

Note: This step resolves the dependency errors from the installation. If you still run into code execution problems after running this code, restart the runtime (Runtime > Restart runtime ...).

In [None]:
import pkg_resources
import importlib
importlib.reload(pkg_resources)

Check the versions of TensorFlow and TensorFlow Data Validation before proceeding.

In [None]:
import tensorflow as tf
import tensorflow_data_validation as tfdv
print('TF version:', tf.__version__)
print('TFDV version:', tfdv.version.__version__)

## Load the dataset
We will download our dataset from Google Cloud Storage.

In [None]:
import os
import tempfile
import urllib.request  # importing only `urllib` does not guarantee `urllib.request` is loaded
import zipfile

# Set up some globals for our file paths
BASE_DIR = tempfile.mkdtemp()
DATA_DIR = os.path.join(BASE_DIR, 'data')
OUTPUT_DIR = os.path.join(BASE_DIR, 'chicago_taxi_output')
TRAIN_DATA = os.path.join(DATA_DIR, 'train', 'data.csv')
EVAL_DATA = os.path.join(DATA_DIR, 'eval', 'data.csv')
SERVING_DATA = os.path.join(DATA_DIR, 'serving', 'data.csv')

# Download the zip file from Google Cloud Storage and unzip it
zip_path, headers = urllib.request.urlretrieve('https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/chicago_data.zip')
# The context manager closes the archive once extraction is done
with zipfile.ZipFile(zip_path) as zip_file:
  zip_file.extractall(BASE_DIR)

print("Here's what we downloaded:")
!ls -R {os.path.join(BASE_DIR, 'data')}

## Compute and visualize statistics

First, we'll use [`tfdv.generate_statistics_from_csv`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_csv) to compute statistics for our training data. (Ignore any snappy warnings.)

TFDV is capable of generating [statistics](https://github.com/tensorflow/metadata/blob/v0.6.0/tensorflow_metadata/proto/v0/statistics.proto) that offer a snapshot overview of the dataset, highlighting available features and the distribution of their values.

For the computation of statistics across potentially vast datasets, TFDV leverages the [Apache Beam](https://beam.apache.org/) data-parallel processing framework, ensuring scalability. Additionally, for those looking to deeply integrate TFDV into their workflows (for instance, generating statistics as part of a data-production pipeline), TFDV provides access to a Beam PTransform specifically for the creation of statistics.

In [None]:
train_stats = tfdv.generate_statistics_from_csv(data_location=TRAIN_DATA)

Let's proceed to utilize [`tfdv.visualize_statistics`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics), which employs [Facets](https://pair-code.github.io/facets/) for generating a clear visual representation of our training dataset:

* Observe that numeric and categorical features are depicted separately, with individual charts illustrating the distribution of each feature.
* Pay attention to features with missing or null values. They are marked with a red percentage to signal a possible data quality problem. The percentage is the share of examples in which that feature is missing or null.
* Note the absence of data for `pickup_census_tract`, suggesting a chance to reduce the dataset's dimensionality.
* Experiment by clicking "expand" above the charts to alter the visualization.
* Explore the data further by hovering over the bars within the charts to reveal the ranges and counts of each bucket.
* Toggle between log and linear scales to observe how the log scale can unveil more intricate details about the `payment_type` categorical feature.
* Consider selecting "quantiles" from the "Chart to show" dropdown and hover over the markers to view the corresponding quantile percentages.

In [None]:
# docs-infra: no-execute
tfdv.visualize_statistics(train_stats)
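The missing-value percentages shown in the visualization can be approximated by hand. The following is a minimal sketch using only the standard library, not TFDV's implementation. `missing_fraction` is a hypothetical helper, the three-row sample is invented, and treating an empty CSV cell as "missing" is an assumption.

```python
import csv
import io


def missing_fraction(csv_text):
    """Return {column: fraction of rows whose value is empty or absent}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in missing:
            # An empty string (or None for a short row) counts as missing.
            if not row.get(name):
                missing[name] += 1
    return {name: (count / total if total else 0.0)
            for name, count in missing.items()}


# Hypothetical sample rows, not taken from the real dataset.
SAMPLE = (
    "fare,pickup_census_tract,payment_type\n"
    "5.5,,Cash\n"
    "12.0,,Credit Card\n"
    ",,Cash\n"
)
fractions = missing_fraction(SAMPLE)
print(fractions)
```

In this sample `pickup_census_tract` is empty in every row, which mirrors the kind of fully missing column pointed out in the list above.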


## Infer a schema

Next, we'll use [`tfdv.infer_schema`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema) to create a schema for our data. A schema defines constraints on the data that matter for machine learning, such as the data type of each feature (numerical or categorical), how frequently the feature must be present, and, for categorical features, the domain: the set of permissible values. Because writing a schema by hand is tedious, especially for datasets with many features, TFDV generates an initial version of the schema from the descriptive statistics computed above.

Ensuring the accuracy of the schema is critical, as it underpins the integrity of the entire production pipeline. Beyond serving as a cornerstone for data validation, the schema acts as a form of documentation, facilitating collaboration among developers working with the dataset. To review and validate the inferred schema, let's visualize it using [`tfdv.display_schema`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_schema).

In [None]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
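To make the idea of an inferred schema concrete, here is a deliberately simplified sketch in plain Python. It is not how TFDV infers a schema: `infer_simple_schema` is a hypothetical helper, the rows are invented, and the rule "any float promotes the column to FLOAT" is an assumption made for illustration.

```python
def infer_simple_schema(rows):
    """Infer a toy schema: a type per column, plus a domain for string columns."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                continue  # a missing value says nothing about type or domain
            entry = schema.setdefault(name, {"type": None, "domain": set()})
            if isinstance(value, str):
                entry["type"] = "STRING"
                entry["domain"].add(value)  # collect the permissible values
            elif isinstance(value, float):
                entry["type"] = "FLOAT"  # a single float promotes the column
            elif entry["type"] is None:
                entry["type"] = "INT"
    return schema


# Hypothetical rows, not taken from the real dataset.
rows = [
    {"payment_type": "Cash", "trip_seconds": 300, "fare": 5.5},
    {"payment_type": "Credit Card", "trip_seconds": 720, "fare": 12.0},
    {"payment_type": "Cash", "trip_seconds": None, "fare": 7},
]
toy_schema = infer_simple_schema(rows)
for name, entry in toy_schema.items():
    print(name, entry["type"], sorted(entry["domain"]))
```

A schema inferred this way only reflects the rows it has seen, which is why the inferred draft needs a human review before it is relied on.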

## Check evaluation data for errors

So far we've only looked at the training data. Our evaluation data must be consistent with our training data, including using the same schema. It's also important that the evaluation data covers roughly the same ranges of values for our numerical features as the training data, so that evaluation explores roughly the same part of the loss surface as training. The same applies to categorical features. Otherwise, training issues may go undetected during evaluation because part of the loss surface was never evaluated.

* Observe that statistics now cover both training and evaluation data for each feature.
* Charts are updated to overlay data from both training and evaluation sets, simplifying comparison.
* A new percentages view has been added to the charts, which works with either log or the standard linear scales.
* Differences in the mean and median values for `trip_miles` between training and evaluation data raise questions about potential issues.
* A significant discrepancy in the maximum `tips` between training and evaluation datasets prompts concerns over possible problems.
* Expanding the Numeric Features chart and switching to the log scale reveals a notable difference in the maximum for `trip_seconds`. Does this mean the evaluation might overlook certain areas of the loss surface?

In [None]:
# Compute stats for evaluation data
eval_stats = tfdv.generate_statistics_from_csv(data_location=EVAL_DATA)

In [None]:
# docs-infra: no-execute
# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')
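The range comparison described in this section can also be checked numerically. The following is a minimal sketch, not part of TFDV: `range_gaps` is a hypothetical helper and the sample values are invented.

```python
def range_gaps(train_values, eval_values):
    """Compare the [min, max] range of a numeric feature across two datasets."""
    train_min, train_max = min(train_values), max(train_values)
    eval_min, eval_max = min(eval_values), max(eval_values)
    return {
        # Parts of the training range that evaluation never visits.
        "train_only_low": train_min < eval_min,
        "train_only_high": train_max > eval_max,
        # Evaluation values outside anything seen during training.
        "eval_only_low": eval_min < train_min,
        "eval_only_high": eval_max > train_max,
    }


# Hypothetical values for a feature such as trip_seconds.
train_trip_seconds = [60, 300, 720, 86400]
eval_trip_seconds = [45, 280, 900, 7200]
gaps = range_gaps(train_trip_seconds, eval_trip_seconds)
print(gaps)
```

A `True` under `train_only_high` corresponds to the situation raised for `trip_seconds`: the model saw large values during training that evaluation never exercises.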


## Check for evaluation anomalies

Is the evaluation dataset consistent with the schema of our training dataset? This consideration is particularly critical for categorical features, as it's essential to ascertain the spectrum of permissible values.

Crucial Insight: Consider the implications of evaluating with data that includes categorical feature values absent from our training dataset. Similarly, ponder the effects of numeric features possessing values beyond the scopes observed in our training data.

In [None]:
# Check eval data for errors by validating the eval data stats using the previously inferred schema.
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)

## Fix evaluation anomalies in the schema

It appears we've encountered new values for `company` and `payment_type` in our evaluation data that weren't present in our training dataset. These should be flagged as anomalies, but the approach to addressing them hinges on our understanding of the data. If these anomalies signify data inaccuracies, corrections should be made to the data itself. Alternatively, if they are valid within the context of the data, updating the schema to encompass these new evaluation dataset values is a viable solution.

Critical Consideration: How might our evaluation outcomes be impacted if we overlook these discrepancies?

Not every issue can be fixed by changing the evaluation dataset, but we can update the schema for anomalies we are willing to accept. That may mean relaxing what counts as an anomaly for a particular feature, or adding previously missing values to the domain of a categorical feature. TFDV has shown us what needs to be fixed.

Now, let's proceed with those corrections and conduct a final review.

In [None]:
# Relax the minimum fraction of values that must come from the domain for feature company.
company = tfdv.get_feature(schema, 'company')
company.distribution_constraints.min_domain_mass = 0.9

# Add new value to the domain of feature payment_type.
payment_type_domain = tfdv.get_domain(schema, 'payment_type')
payment_type_domain.value.append('Prcard')

# Validate eval stats after updating the schema
updated_anomalies = tfdv.validate_statistics(eval_stats, schema)
tfdv.display_anomalies(updated_anomalies)

Hey, look at that!  We verified that the training and evaluation data are now consistent!  Thanks TFDV ;)
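The effect of relaxing `min_domain_mass` can be illustrated without TFDV. The sketch below assumes, following the comment in the cell above, that the constraint is the minimum fraction of values that must come from the domain. `domain_mass` and `check_domain` are hypothetical helpers, and the company names and counts are invented.

```python
def domain_mass(values, domain):
    """Fraction of values that belong to the allowed domain."""
    if not values:
        return 1.0  # nothing observed, so nothing falls outside the domain
    in_domain = sum(1 for value in values if value in domain)
    return in_domain / len(values)


def check_domain(values, domain, min_domain_mass=1.0):
    """Return (passes, mass) for a categorical feature."""
    mass = domain_mass(values, domain)
    return mass >= min_domain_mass, mass


# Hypothetical data: 95 values from known companies, 5 from an unseen one.
known_companies = {"Company A", "Company B"}
observed = ["Company A"] * 60 + ["Company B"] * 35 + ["Company C"] * 5

strict_ok, mass = check_domain(observed, known_companies)
relaxed_ok, _ = check_domain(observed, known_companies, min_domain_mass=0.9)
print(f"mass={mass:.2f} strict={strict_ok} relaxed={relaxed_ok}")
```

With the strict default every value must be in the domain, so a handful of unseen companies is enough to fail. Lowering the threshold to 0.9 accepts up to 10% of values from outside the domain.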

## Schema Environments

In this scenario, we've also allocated a 'serving' dataset for demonstration purposes, which warrants its own verification. Typically, all datasets within a pipeline are expected to adhere to the same schema, though exceptions are common. For instance, supervised learning models include labels in their datasets for training, but these labels are omitted during model inference in serving. Sometimes, slight modifications to the schema are necessary to accommodate these differences.

**Environments** offer a way to manage such specific needs. They allow for the association of features within the schema to different sets of conditions using `default_environment`, `in_environment`, and `not_in_environment`.

Taking this dataset as an example, the `tips` feature serves as the label during training but is absent in the serving data. Without defining an appropriate environment, this discrepancy would be flagged as an anomaly.

In [None]:
serving_stats = tfdv.generate_statistics_from_csv(SERVING_DATA)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)

We'll deal with the `tips` feature below. We also have an INT value in `trip_seconds`, where our schema expected a FLOAT. By flagging that difference, TFDV helps uncover inconsistencies in the way the data is generated for training and serving. Problems like this can easily go unnoticed until model performance suffers, sometimes catastrophically. Whether this particular mismatch is a significant issue warrants further investigation.

In this case, we can safely convert INT values to FLOATs, so we want to tell TFDV to use our schema to infer the type. Let's do that now.

In [None]:
options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)
serving_stats = tfdv.generate_statistics_from_csv(SERVING_DATA, stats_options=options)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)

Now we just have the `tips` feature (which is our label) showing up as an anomaly ('Column dropped').  Of course we don't expect to have labels in our serving data, so let's tell TFDV to ignore that.

In [None]:
# All features are by default in both TRAINING and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# Specify that 'tips' feature is not in SERVING environment.
tfdv.get_feature(schema, 'tips').not_in_environment.append('SERVING')

serving_anomalies_with_env = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')

tfdv.display_anomalies(serving_anomalies_with_env)
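The way the three environment fields interact can be sketched in plain Python. This is an illustration under an assumed precedence (an explicit `in_environment` entry wins, then `not_in_environment`, then the schema-wide default), not TFDV's actual code. `expected_in` is a hypothetical helper, and features are modeled as plain dictionaries.

```python
def expected_in(feature, environment, default_environments):
    """Return True if the feature is expected in the given environment."""
    if environment in feature.get("in_environment", ()):
        return True
    if environment in feature.get("not_in_environment", ()):
        return False
    # Otherwise the feature follows the schema-wide default.
    return environment in default_environments


default_envs = ["TRAINING", "SERVING"]
features = {
    "fare": {},
    "tips": {"not_in_environment": ["SERVING"]},  # the label is absent at serving
}

for name, feature in features.items():
    for env in default_envs:
        print(name, env, expected_in(feature, env, default_envs))
```

Under these assumptions `tips` is expected in TRAINING but not in SERVING, so its absence from the serving data is no longer reported.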

## Check for drift and skew

TFDV goes beyond merely verifying dataset compliance with the predefined schema; it is also equipped to identify drift and skew across datasets. This is accomplished by contrasting the statistical properties of various datasets, guided by drift/skew comparators integrated into the schema.


### Drift

For categorical features, TFDV facilitates drift detection across sequential data spans (e.g., from span N to span N+1), which might represent successive days of training data. Drift is quantified using the [L-infinity distance](https://en.wikipedia.org/wiki/Chebyshev_distance), allowing users to establish a threshold beyond which an alert for excessive drift would be triggered. Determining an appropriate threshold typically demands a blend of domain expertise and iterative experimentation.
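As a worked illustration of the L-infinity distance, the sketch below normalizes the value counts of two spans into distributions and takes the largest absolute difference for any single value. It is a minimal standard-library sketch rather than TFDV's implementation. `linf_distance` is a hypothetical helper and the two spans are invented.

```python
from collections import Counter


def linf_distance(values_a, values_b):
    """L-infinity distance between two empirical categorical distributions."""
    if not values_a or not values_b:
        raise ValueError("both spans must contain at least one value")
    counts_a, counts_b = Counter(values_a), Counter(values_b)
    total_a, total_b = len(values_a), len(values_b)
    categories = set(counts_a) | set(counts_b)
    # A Counter returns 0 for categories that one span never saw.
    return max(
        abs(counts_a[c] / total_a - counts_b[c] / total_b) for c in categories
    )


# Hypothetical spans of a categorical feature such as payment_type.
span_n = ["Cash"] * 6 + ["Credit Card"] * 4
span_n_plus_1 = ["Cash"] * 5 + ["Credit Card"] * 4 + ["Prcard"] * 1

distance = linf_distance(span_n, span_n_plus_1)
threshold = 0.01  # hypothetical alert threshold
print(f"distance={distance:.3f} drift_alert={distance > threshold}")
```

Here the share of `Cash` falls from 0.6 to 0.5 and `Prcard` rises from 0 to 0.1, so the distance is about 0.1, which would exceed a threshold of 0.01.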

### Skew

TFDV is adept at identifying three distinct types of skew in data: schema skew, feature skew, and distribution skew.

#### Schema Skew

Schema skew arises when there's a discrepancy in schema adherence between the training and serving datasets. Both sets are expected to follow the same schema. Any deviations, such as the presence of a label feature in training data but its absence in serving data, should be clearly defined using the environments field within the schema.

#### Feature Skew

Feature skew is noted when there's a difference between the features on which a model was trained and the features it encounters during serving. This could occur under circumstances like:

* Alterations in a data source that supplies certain feature values from the training phase to the serving phase.
* Discrepancies in feature generation logic across training and serving, such as applying a transformation exclusively in one environment.

#### Distribution Skew

Distribution skew is observed when the training dataset's distribution diverges significantly from that of the serving dataset. This skew can result from different methodologies or data sources used in creating the training set or from a flawed sampling method that selects a non-representative subset of the serving data for training purposes.

In [None]:
# Add skew comparator for 'payment_type' feature.
payment_type = tfdv.get_feature(schema, 'payment_type')
payment_type.skew_comparator.infinity_norm.threshold = 0.01

# Add drift comparator for 'company' feature.
company = tfdv.get_feature(schema, 'company')
company.drift_comparator.infinity_norm.threshold = 0.001

skew_anomalies = tfdv.validate_statistics(train_stats, schema,
                                          previous_statistics=eval_stats,
                                          serving_statistics=serving_stats)

tfdv.display_anomalies(skew_anomalies)

In this example we do see some drift, but it is well below the threshold that we've set.

## Freeze the schema

Now that the schema has been reviewed and curated, we will store it in a file to reflect its "frozen" state.

In [None]:
# Create the output directory (and any missing parents) with the public tf.io API
tf.io.gfile.makedirs(OUTPUT_DIR)
schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')
tfdv.write_schema_text(schema, schema_file)

!cat {schema_file}

## When to use TFDV

While it might seem like the role of TFDV is confined to the initial stages of your training pipeline, as illustrated in this guide, its utility extends far beyond that. Here are a few additional applications:

* Ensuring the quality of new data entering the inference pipeline by verifying that no incorrect features are introduced.
* Validating new data for inference to make sure that the model was trained on the part of the decision surface where that data falls.
* Checking the integrity of data post-transformation and feature engineering (often conducted using [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started)) to prevent and identify any errors in the process.