# Introduction to TensorFlow Data Validation (TFDV)

This notebook demonstrates how to use TensorFlow Data Validation (TFDV) to analyze and validate structured data. In addition to testing code, an ML pipeline must also test its data: look for anomalies, compare the training and evaluation datasets, and make sure they are consistent. TFDV helps by generating descriptive statistics, inferring a schema, and detecting drift and skew.

This lab shows you how to use TFDV during the data exploration phase of your model development. The goals are to:

- Extract data from BigQuery.
- Compute the summary statistics.
- Explore the computed statistics to understand information about the data.
- Infer an initial schema.
- Validate and update the schema based on a new dataset from BigQuery.
- Save the updated schema to be used as a contract during inference.

### Dataset

This notebook uses [Chicago crime data](https://data.cityofchicago.org/) published as a public dataset in BigQuery. This dataset reflects reported incidents of crime that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. The data will be extracted with the following columns:

- **date**: Date when the incident occurred. This is sometimes a best estimate.
- **iucr**: The Illinois Uniform Crime Reporting code.
- **primary_type**: The primary description of the IUCR code.
- **location_description**: Description of the location where the incident occurred.
- **arrest**: Indicates whether an arrest was made.
- **domestic**: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
- **district**: Indicates the police district where the incident occurred. 
- **ward**: The ward (City Council district) where the incident occurred.
- **fbi_code**: Indicates the crime classification.
- **year**: Year the incident occurred.

### Installing dependencies

In [None]:
!pip install tensorflow tensorflow_data_validation google-cloud-bigquery

### Imports

In [None]:
from google.cloud import bigquery
import tensorflow_data_validation as tfdv
import pandas as pd
from tensorflow.python.lib.io import file_io
from google.protobuf import text_format
from tensorflow_metadata.proto.v0 import schema_pb2

CHICAGO_CRIME_TABLE = 'bigquery-public-data.chicago_crime.crime'
bq_client = bigquery.Client()

## Extract data from BigQuery

Our dataset is public in BigQuery. If you have not done so yet, make sure your environment is set up to access GCP (for example, by exporting `GOOGLE_APPLICATION_CREDENTIALS`). First, let's fetch 5 records to confirm we can query the table.

In [None]:
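def execute_query(client: bigquery.Client, query: str) -> pd.DataFrame:
    # Use the client passed as an argument rather than the global bq_client.
    query_job = client.query(query)
    results = query_job.result()  # Blocks until the query has finished.
    return results.to_dataframe()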

In [None]:
EXPLORATION_QUERY = f"""
    SELECT
        date,
        iucr,
        primary_type,
        location_description,
        arrest,
        domestic,
        district,
        ward,
        fbi_code
    FROM
      {CHICAGO_CRIME_TABLE}
    LIMIT 5
"""
results = execute_query(bq_client, EXPLORATION_QUERY)
results.head()

Feel free to explore the dataset if you want to, as not all attributes have been included. In the next sections, we will use crime data from 2019 to generate the statistics and a reference schema, then we will validate 2020 data against it.

Now, let's extract data from 2019.

In [None]:
def generate_query(year_from: int = None, year_to: int = None, limit: int = None) -> str:
    query = f"""
        SELECT 
            FORMAT_DATE('%Y',  CAST(date AS DATE)) AS crime_year,
            FORMAT_DATE('%b',  CAST(date AS DATE)) AS crime_month,
            FORMAT_DATE('%d',  CAST(date AS DATE)) AS crime_day, 
            FORMAT_DATE('%a',  CAST(date AS DATE)) AS crime_day_of_week, 
            iucr,
            primary_type,
            location_description,
            CAST(domestic AS INT64) AS domestic,
            district,
            ward,
            fbi_code,
            CAST(arrest AS INT64) AS arrest,
        FROM 
          {CHICAGO_CRIME_TABLE}
        """
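    # Collect the year filters so that each one works alone or combined.
    conditions = []
    if year_from:
        conditions.append(f"year >= {year_from}")
    if year_to:
        conditions.append(f"year <= {year_to}")
    if conditions:
        # The trailing newline keeps WHERE and LIMIT from running together.
        query += "WHERE " + " AND ".join(conditions) + "\n"
    if limit:
        query += f"LIMIT {limit}"

    return query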

In [None]:
crime_df = execute_query(bq_client, generate_query(2019, 2019))
crime_df.count()

## Compute summary statistics

If we want to use this data to build a model, we need baseline statistics that we can compare against more recent data to check for skew or drift. Our data is currently in a pandas DataFrame, so we can use [tfdv.generate_statistics_from_dataframe](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe) to generate the statistics. Similar functions exist to compute statistics from TFRecord and CSV datasets.

In [None]:
crime_2019_stats = tfdv.generate_statistics_from_dataframe(crime_df)
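
To make the idea of per-feature summary statistics more concrete, here is a minimal sketch that uses only the Python standard library. It is not how TFDV computes its statistics; the function `summarize_column` and the toy rows are hypothetical and only illustrate the kind of information (counts, missing values, zeros, most frequent value) that you will find in the real output.

```python
from collections import Counter
from statistics import mean


def summarize_column(values):
    """Return a small dict of summary statistics for one column.

    None is treated as a missing value.
    """
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric column: report range, mean and the number of zeros.
        summary.update(
            kind="numeric",
            mean=mean(present),
            min=min(present),
            max=max(present),
            zeros=sum(1 for v in present if v == 0),
        )
    else:
        # Categorical column: report distinct values and the most frequent one.
        counts = Counter(present)
        summary.update(
            kind="categorical",
            unique=len(counts),
            top=counts.most_common(1)[0] if counts else None,
        )
    return summary


# Hypothetical toy rows, not taken from the Chicago dataset.
toy_rows = {
    "arrest": [0, 1, 0, 0, None],
    "primary_type": ["THEFT", "BATTERY", "THEFT", None, "THEFT"],
}
toy_stats = {name: summarize_column(col) for name, col in toy_rows.items()}
for name, stats in toy_stats.items():
    print(name, stats)
```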

We can visualize the statistics using [tfdv.visualize_statistics](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics). It uses Facets to create a succinct visualization of our data and helps to identify common bugs like unbalanced datasets. Feel free to explore the filters and other features this tool offers.

In [None]:
tfdv.visualize_statistics(crime_2019_stats)

Using Facets, you can quickly and easily spot issues, identify data ranges, categorical attribute values, etc. For example, you could use "Sort by missing/zeroes" to quickly identify attributes with a lot of null or 0 values, and decide if it's expected or if something needs to be fixed in your data.

## Generate Schema

After deploying your pipeline to production, you may not be aware of changes in the data source. For example, an attribute used by your model could be dropped by the source system, or the data type could be converted from integer to string. If you don't detect these changes, the downstream steps of your pipeline may not succeed, or the performance of your model may decrease. Generating a schema and ensuring all new datasets going through your ML pipeline follow the same structure make your solution more robust and reliable.
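
The two failure modes mentioned above (a dropped attribute and a changed data type) can be sketched in plain Python. This is only an illustration of the idea behind a schema check, not TFDV's implementation; `check_record`, `expected_types` and the sample records are hypothetical.

```python
def check_record(record, expected_types):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for name, expected in expected_types.items():
        if name not in record:
            problems.append(f"missing attribute: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


# Hypothetical expected structure and records, for illustration only.
expected_types = {"district": int, "primary_type": str}
good = {"district": 12, "primary_type": "THEFT"}
dropped = {"primary_type": "THEFT"}  # the source dropped 'district'
retyped = {"district": "12", "primary_type": "THEFT"}  # integer became string

for label, record in [("good", good), ("dropped", dropped), ("retyped", retyped)]:
    print(label, check_record(record, expected_types))
```

A real schema captures much more than names and types (presence, domains, environments), which is what the following sections build up step by step.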


Using the statistics generated earlier, let's infer the schema with [tfdv.infer_schema](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema) and display it with [tfdv.display_schema](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_schema).

In [None]:
crime_2019_schema = tfdv.infer_schema(statistics=crime_2019_stats)
tfdv.display_schema(schema=crime_2019_schema)

This schema is only inferred from the data, so treat it as a starting point that you can refine. Refining it is strongly encouraged if you want to be able to detect data skew and data drift.

## Updating initial schema

The feature `domestic` was converted from a boolean to an integer in the query, so its values should be 0 or 1. We also know that district numbers in Chicago range from 1 to 31, so we can set the domains accordingly.

In [None]:
tfdv.set_domain(crime_2019_schema, "domestic", schema_pb2.IntDomain(min=0, max=1))
tfdv.set_domain(crime_2019_schema, "district", schema_pb2.IntDomain(min=1, max=31))

If you display the new schema, you should see the domain has been updated as expected.

In [None]:
tfdv.display_schema(schema=crime_2019_schema)
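
As a rough illustration of what an integer domain expresses, the sketch below flags values outside a range, assuming both bounds are inclusive. The helper `out_of_range` and the sample values are hypothetical and are not part of TFDV.

```python
def out_of_range(values, minimum, maximum):
    """Return the values outside the inclusive range [minimum, maximum].

    Missing values (None) are ignored here; presence is a separate concern.
    """
    return [v for v in values if v is not None and not (minimum <= v <= maximum)]


# Hypothetical values, for illustration only.
district_values = [1, 7, 31, 0, 45, None]
violations = out_of_range(district_values, minimum=1, maximum=31)
print(violations)
```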

## Validating schema

We have used data from 2019 to generate the schema. Let's now validate 2020 data against it. We extract the data from BigQuery and generate the statistics.

In [None]:
crime_2020_df = execute_query(bq_client, generate_query(2020, 2020))

In [None]:
crime_2020_stats = tfdv.generate_statistics_from_dataframe(crime_2020_df)

First, let's see how you can visually compare the statistics using `tfdv.visualize_statistics`.

In [None]:
tfdv.visualize_statistics(
    lhs_statistics=crime_2019_stats,
    rhs_statistics=crime_2020_stats,
    lhs_name='2019',
    rhs_name='2020'
)

This is an easy way to compare the values. You can quickly see that the total number of crimes is lower in 2020 than in 2019, and that the percentage of cases where an arrest was made is also lower.

Let's do a programmatic comparison now:

In [None]:
anomalies = tfdv.validate_statistics(
    statistics=crime_2020_stats, 
    schema=crime_2019_schema,
    previous_statistics=crime_2019_stats
)
tfdv.display_anomalies(anomalies)

We can see that one anomaly is detected. `primary_type` is a categorical feature, and the 2020 data contains a value that wasn't in the 2019 dataset. This value shouldn't be flagged as an anomaly, because based on our business knowledge we know RITUALISM is a valid `primary_type`. So let's update the feature's domain in our schema.

In [None]:
primary_types = tfdv.get_domain(crime_2019_schema, 'primary_type')
primary_types.value.append('RITUALISM')
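
Conceptually, this kind of anomaly is a set difference between the values observed in the new data and the values listed in the domain. The sketch below illustrates that idea in plain Python; `unexpected_values` and the shortened lists are hypothetical, and the real domain contains many more values.

```python
def unexpected_values(observed, domain):
    """Return observed values absent from the domain, sorted for stable output."""
    return sorted(set(observed) - set(domain))


# Hypothetical, shortened domain and observations, for illustration only.
domain_2019 = ["THEFT", "BATTERY", "NARCOTICS"]
observed_2020 = ["THEFT", "RITUALISM", "BATTERY", "THEFT"]

print(unexpected_values(observed_2020, domain_2019))

# Extending the domain, as done on the schema above, clears the finding.
domain_2019.append("RITUALISM")
print(unexpected_values(observed_2020, domain_2019))
```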

Let's recompute the anomalies to see whether this has fixed the problem.

In [None]:
anomalies = tfdv.validate_statistics(
    statistics=crime_2020_stats, 
    schema=crime_2019_schema,
    previous_statistics=crime_2019_stats
)
tfdv.display_anomalies(anomalies)

Looks good! There are many more ways to update your schema and apply more constraints, especially for detecting skew and drift. Have a look at [the list of anomalies](https://www.tensorflow.org/tfx/data_validation/anomalies) that TFDV can identify.

### Additional constraint examples

We can see `location_description` has 0.57% missing values in 2020 vs 0.45% in 2019. Let's say you want to allow at most 0.5% missing values. You can express this by requiring the feature to be present in at least 99.5% of the examples:

In [None]:
tfdv.get_feature(crime_2019_schema, 'location_description').presence.min_fraction = 0.995
anomalies = tfdv.validate_statistics(
    statistics=crime_2020_stats, 
    schema=crime_2019_schema,
    previous_statistics=crime_2019_stats
)
tfdv.display_anomalies(anomalies)
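
The arithmetic behind this constraint is simple: a missing rate of at most 0.5% is the same as a presence fraction of at least 0.995. The sketch below applies that rule to the two missing rates quoted above; the helper `check_presence` is hypothetical and only mirrors the comparison, not TFDV's internals.

```python
MIN_FRACTION = 0.995  # same threshold as set on the schema above


def check_presence(missing_fraction, min_fraction=MIN_FRACTION):
    """Return True when enough examples contain the feature."""
    return (1.0 - missing_fraction) >= min_fraction


# Missing rates for location_description quoted in the text.
for year, missing in [(2019, 0.0045), (2020, 0.0057)]:
    status = "ok" if check_presence(missing) else "anomaly"
    print(f"{year}: present={1.0 - missing:.4f} -> {status}")
```

With these numbers, 2019 satisfies the threshold and 2020 does not, which matches the anomaly reported for the 2020 statistics.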

Let's also try to add drift detection. For categorical features, TFDV uses the [L-infinity norm](https://en.wikipedia.org/wiki/L-infinity) to measure drift, so we just need to set the maximum distance we are willing to accept.

In [None]:
tfdv.get_feature(crime_2019_schema, 'primary_type').drift_comparator.infinity_norm.threshold = 0.01
anomalies = tfdv.validate_statistics(
    statistics=crime_2020_stats, 
    schema=crime_2019_schema,
    previous_statistics=crime_2019_stats
)
tfdv.display_anomalies(anomalies)

In this example, we can see there is a drift for the "THEFT" type of crime between 2019 and 2020.
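
To see what the L-infinity distance measures, the sketch below computes it by hand for two small categorical samples: each sample is turned into relative frequencies, and the distance is the largest absolute difference for any single value. The helper names and the toy samples are hypothetical; this is a simplified illustration rather than TFDV's code.

```python
from collections import Counter


def normalized_counts(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def l_infinity_distance(previous, current):
    """Largest absolute difference in relative frequency over all values."""
    p = normalized_counts(previous)
    q = normalized_counts(current)
    return max(
        (abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q)),
        default=0.0,
    )


# Hypothetical toy samples, not taken from the Chicago dataset.
previous = ["THEFT"] * 5 + ["BATTERY"] * 5
current = ["THEFT"] * 3 + ["BATTERY"] * 7

distance = l_infinity_distance(previous, current)
print(f"distance={distance:.2f}, drift={distance > 0.01}")
```

Because a single value with a large shift in frequency is enough to exceed the threshold, the reported anomaly names the value that moved the most.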

Let's relax these two constraints for now by resetting them to values that can never trigger an anomaly.

In [None]:
tfdv.get_feature(crime_2019_schema, 'primary_type').drift_comparator.infinity_norm.threshold = 1.0
tfdv.get_feature(crime_2019_schema, 'location_description').presence.min_fraction = 0.0
anomalies = tfdv.validate_statistics(
    statistics=crime_2020_stats, 
    schema=crime_2019_schema,
    previous_statistics=crime_2019_stats
)
tfdv.display_anomalies(anomalies)

## Handling different environments

If you want to use this dataset to predict whether an arrest will be made, the `arrest` flag will be available during training but not at serving time. The schema needs to be updated so that it is aware of this difference between environments.

For example, let's say 2020 data is our serving dataset. Let's drop the `arrest` attribute, and check for anomalies. 

In [None]:
crime_serving_df = crime_2020_df.drop(["arrest"], axis=1)
crime_serving_stats = tfdv.generate_statistics_from_dataframe(crime_serving_df)
anomalies = tfdv.validate_statistics(
    statistics=crime_serving_stats, 
    schema=crime_2019_schema,
    previous_statistics=crime_2019_stats
)
tfdv.display_anomalies(anomalies)

As expected, we see a feature is missing. Let's indicate this is expected for serving.

In [None]:
crime_2019_schema.default_environment.append('TRAINING')
crime_2019_schema.default_environment.append('SERVING')
tfdv.get_feature(crime_2019_schema, 'arrest').not_in_environment.append('SERVING')

In [None]:
anomalies = tfdv.validate_statistics(crime_serving_stats, crime_2019_schema, environment='SERVING')
tfdv.display_anomalies(anomalies)

And now it's ok!
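
One way to picture what happened: every feature is expected in all default environments unless it is explicitly excluded from one. The sketch below models that rule with plain Python data structures; `expected_features` and the shortened feature list are hypothetical, and the model is a simplification of how the schema is interpreted.

```python
def expected_features(features, environment):
    """Return the feature names expected in the given environment.

    `features` maps a feature name to the set of environments it is
    excluded from (a simplified stand-in for `not_in_environment`).
    """
    return [name for name, excluded in features.items() if environment not in excluded]


# Hypothetical, shortened feature list mirroring the schema edits above.
features = {
    "primary_type": set(),
    "district": set(),
    "arrest": {"SERVING"},  # the label is not available at serving time
}

print(expected_features(features, "TRAINING"))
print(expected_features(features, "SERVING"))
```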

## Saving your schema

Once you are happy with your schema, you can save it so that you can reuse it in your pipeline. 

In [None]:
schema_file = 'schema.pbtxt'
tfdv.write_schema_text(crime_2019_schema, schema_file)

## End of lab

In this lab, we have seen how to generate statistics from a dataset, and how to visually explore them using Facets. We have seen how to generate and update a schema, and then how to apply it to identify anomalies in the data. 

There are other TFDV features that we haven't covered, for example, how to slice the data by a specific feature before extracting the statistics. You can check out the official documentation for more details on this topic.

We have used data from 2019 to generate the initial schema, but if your dataset is bigger, you may need to run this code on cloud infrastructure. TFDV runs on Apache Beam, so in the next lab, we will see how you could do the same steps using Dataflow.