# Introduction to TensorFlow Data Validation (TFDV)

This notebook demonstrates how to use TensorFlow Data Validation (TFDV) to analyze and validate structured data. In addition to testing code, a CI/CD ML pipeline must also unit test its data: look for anomalies, compare the training and evaluation datasets, and make sure they are consistent. TFDV helps with this by generating descriptive statistics, inferring a schema, and detecting drift and skew.

This lab shows you how to use TFDV during the data exploration phase of your model deployment. The goal is to:

- Extract data from BigQuery.
- Compute the summary statistics.
- Explore the computed statistics visually to understand information about the data.
- Infer an initial schema.
- Validate and update the schema based on a new dataset from BigQuery.
- Save the updated schema to be used as a contract during inference.

### Installing dependencies

In [41]:
!pip install tensorflow tensorflow_data_validation google-cloud-bigquery



### Dataset

This notebook uses [Chicago crime data](https://data.cityofchicago.org/), published as a public dataset in BigQuery. This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. The data will be extracted with the following columns:

- **date**: Date when the incident occurred. This is sometimes a best estimate.
- **iucr**: The Illinois Uniform Crime Reporting code.
- **primary_type**: The primary description of the IUCR code.
- **location_description**: Description of the location where the incident occurred.
- **arrest**: Indicates whether an arrest was made.
- **domestic**: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
- **district**: Indicates the police district where the incident occurred. 
- **ward**: The ward (City Council district) where the incident occurred.
- **fbi_code**: Indicates the crime classification.
- **year**: Year the incident occurred.


### Imports

In [119]:
from google.cloud import bigquery
import tensorflow_data_validation as tfdv
import pandas as pd
from tensorflow.python.lib.io import file_io
from google.protobuf import text_format

## Extract data from BigQuery

Our dataset is part of the public data in BigQuery. I assume you have already set up your environment to query BigQuery; if not, follow the [official documentation](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas). First, let's get 5 records to confirm we can query the table.

In [120]:
CHICAGO_CRIME_TABLE = 'bigquery-public-data.chicago_crime.crime'

bq_client = bigquery.Client()

In [121]:
def execute_query(client: bigquery.Client, query: str) -> pd.DataFrame:
    # Use the client passed as an argument rather than the global bq_client.
    query_job = client.query(query)
    results = query_job.result()
    return results.to_dataframe()

In [123]:
EXPLORATION_QUERY = f"""
    SELECT
        date,
        iucr,
        primary_type,
        location_description,
        arrest,
        domestic,
        district,
        ward,
        fbi_code
    FROM
      {CHICAGO_CRIME_TABLE}
    LIMIT 5
"""
results = execute_query(bq_client, EXPLORATION_QUERY)
results.head()



| | date | iucr | primary_type | location_description | arrest | domestic | district | ward | fbi_code |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-10-03 19:20:00+00:00 | 470 | PUBLIC PEACE VIOLATION | SIDEWALK | True | False | 9 | 25 | 24 |
| 1 | 2011-03-01 00:00:00+00:00 | 3960 | INTIMIDATION | RESIDENCE | True | True | 4 | 8 | 26 |
| 2 | 2015-05-15 15:00:00+00:00 | 918 | MOTOR VEHICLE THEFT | RESIDENCE-GARAGE | False | False | 4 | 10 | 7 |
| 3 | 2015-10-30 15:35:00+00:00 | 470 | PUBLIC PEACE VIOLATION | CTA STATION | True | False | 8 | 23 | 24 |
| 4 | 2015-11-04 08:00:00+00:00 | 266 | CRIM SEXUAL ASSAULT | RESIDENCE | False | False | 4 | 10 | 2 |


Feel free to further explore the dataset if you want to. In the next sections, we will use crime data from 2019 to generate the statistics and validate them against 2020 data. Then, we will see how you can perform the same data validation at scale with bigger datasets.

Now, let's extract the data.

In [124]:
def generate_query(year_from: int = None, year_to: int = None, limit: int = None) -> str:
    query = f"""
        SELECT 
            FORMAT_DATE('%Y',  CAST(date AS DATE)) AS crime_year,
            FORMAT_DATE('%b',  CAST(date AS DATE)) AS crime_month,
            FORMAT_DATE('%d',  CAST(date AS DATE)) AS crime_day, 
            FORMAT_DATE('%a',  CAST(date AS DATE)) AS crime_day_of_week, 
            iucr,
            primary_type,
            location_description,
            CAST(domestic AS INT64) AS domestic,
            district,
            ward,
            fbi_code,
            CAST(arrest AS INT64) AS arrest,
        FROM 
          {CHICAGO_CRIME_TABLE}
        """
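    # Collect the year filters so that either bound can be used on its own.
    conditions = []
    if year_from:
        conditions.append(f"year >= {year_from}")
    if year_to:
        conditions.append(f"year <= {year_to}")
    if conditions:
        # The trailing newline keeps WHERE and LIMIT from running together.
        query += "WHERE " + " AND ".join(conditions) + "\n"
    if limit:
        query += f"LIMIT {limit}"

    return query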

In [125]:
crime_df = execute_query(bq_client, generate_query(2019, 2019))



In [126]:
crime_df.count()

crime_year              260673
crime_month             260673
crime_day               260673
crime_day_of_week       260673
iucr                    260673
primary_type            260673
location_description    259512
domestic                260673
district                260673
ward                    260658
fbi_code                260673
arrest                  260673
dtype: int64
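
The counts differ between columns because `count()` only counts non-null values. As a small worked example, the share of missing values per column can be derived from the figures printed above. The sketch below copies those figures by hand and assumes the largest count (260673) is the total number of rows; the helper name `missing_share` is made up for illustration.

```python
# Non-null counts copied from the crime_df.count() output above.
# Assumption: 260673 is the total row count (columns at that count have no nulls).
TOTAL_ROWS = 260673
non_null_counts = {
    "location_description": 259512,
    "ward": 260658,
}


def missing_share(non_null: int, total: int) -> float:
    """Return the percentage of rows in which the column is null."""
    return 100.0 * (total - non_null) / total


missing_pct = {
    name: missing_share(count, TOTAL_ROWS)
    for name, count in non_null_counts.items()
}

for name, pct in missing_pct.items():
    missing_rows = TOTAL_ROWS - non_null_counts[name]
    print(f"{name}: {missing_rows} missing rows ({pct:.3f}%)")
```

Only `location_description` and `ward` contain nulls in the 2019 extract, which is consistent with the schema inferred later marking those two features as optional.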

## Generate statistics

We have extracted the 2019 Chicago crime data and can now generate the statistics. The function `execute_query` returns a pandas DataFrame, so we can use [tfdv.generate_statistics_from_dataframe](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe) to generate them. Note that similar functions exist for TFRecord and CSV datasets.

In [127]:
crime_2019_stats = tfdv.generate_statistics_from_dataframe(crime_df)

Let's visualize the statistics using [tfdv.visualize_statistics](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics), which uses Facets to create a succinct visualization of our data and help to identify common bugs like unbalanced datasets. Feel free to explore the filters and other features this tool offers.

In [128]:
tfdv.visualize_statistics(crime_2019_stats)

Using this tool, you can quickly spot issues and identify data ranges, categorical attribute values, and so on in your datasets. For example, you could use "Sort by missing/zeroes" to find attributes with a lot of null or 0 values, then decide whether that is expected or whether something needs to be fixed in your data.


## Generate Schema

After deploying your pipeline to production, you may not be aware of changes in the data source. For example, an attribute used by your model could be added or dropped, or the type could be converted from integer to string. If you don't detect these changes, the downstream steps of your pipeline may not succeed or the performance of your model may decrease. Generating a schema and ensuring all new datasets going through your ML pipeline follow the same one makes your solution more robust and reliable.
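
To make the idea of a schema as a contract concrete, here is a minimal sketch in plain Python. It is not how TFDV works internally; it only illustrates the three kinds of change mentioned above (a dropped attribute, an added attribute, and a type change). The function name `compare_to_schema`, the type labels, and the extra `beat` column are all hypothetical.

```python
from typing import Dict, List


def compare_to_schema(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """List the differences between an expected and an observed column layout."""
    problems = []
    for name, expected_type in expected.items():
        if name not in observed:
            problems.append(f"missing column: {name}")
        elif observed[name] != expected_type:
            problems.append(f"type change: {name} {expected_type} -> {observed[name]}")
    for name in observed:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


# Expected layout (the "contract") versus a hypothetical new batch.
expected = {"district": "INT", "domestic": "INT", "primary_type": "STRING"}
observed = {"district": "STRING", "primary_type": "STRING", "beat": "INT"}

problems = compare_to_schema(expected, observed)
for problem in problems:
    print(problem)
```

A real schema also carries presence, valency, and domain constraints, which is what the inferred TFDV schema adds on top of names and types.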


So now that we have the statistics, let's infer the schema using [tfdv.infer_schema](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema) and display it with [tfdv.display_schema](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_schema).

In [129]:
crime_2019_schema = tfdv.infer_schema(statistics=crime_2019_stats)
tfdv.display_schema(schema=crime_2019_schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'crime_year' | BYTES | required | | - |
| 'crime_month' | STRING | required | | 'crime_month' |
| 'crime_day' | BYTES | required | | - |
| 'crime_day_of_week' | STRING | required | | 'crime_day_of_week' |
| 'iucr' | BYTES | required | | - |
| 'primary_type' | STRING | required | | 'primary_type' |
| 'location_description' | BYTES | optional | single | - |
| 'domestic' | INT | required | | - |
| 'district' | INT | required | | - |
| 'ward' | FLOAT | optional | single | - |




| Domain | Values |
|---|---|
| 'crime_month' | 'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep' |
| 'crime_day_of_week' | 'Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed' |
| 'primary_type' | 'ARSON', 'ASSAULT', 'BATTERY', 'BURGLARY', 'CONCEALED CARRY LICENSE VIOLATION', 'CRIM SEXUAL ASSAULT', 'CRIMINAL DAMAGE', 'CRIMINAL SEXUAL ASSAULT', 'CRIMINAL TRESPASS', 'DECEPTIVE PRACTICE', 'GAMBLING', 'HOMICIDE', 'HUMAN TRAFFICKING', 'INTERFERENCE WITH PUBLIC OFFICER', 'INTIMIDATION', 'KIDNAPPING', 'LIQUOR LAW VIOLATION', 'MOTOR VEHICLE THEFT', 'NARCOTICS', 'NON-CRIMINAL', 'OBSCENITY', 'OFFENSE INVOLVING CHILDREN', 'OTHER NARCOTIC VIOLATION', 'OTHER OFFENSE', 'PROSTITUTION', 'PUBLIC INDECENCY', 'PUBLIC PEACE VIOLATION', 'ROBBERY', 'SEX OFFENSE', 'STALKING', 'THEFT', 'WEAPONS VIOLATION' |
| 'fbi_code' | '01A', '01B', '02', '03', '04A', '04B', '05', '06', '07', '08A', '08B', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '22', '24', '26' |


You can see the list of features and their type. Some of them have been detected as categorical features and the domain has been extracted.
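
As a rough illustration of what a domain is, the sketch below collects the distinct values seen in a reference sample and then reports values in a new batch that fall outside that set. This is a simplified stand-in written for this explanation, not TFDV's inference logic; the helper names and the tiny sample lists are hypothetical.

```python
from typing import Iterable, List, Optional, Set


def infer_domain(values: Iterable[Optional[str]]) -> Set[str]:
    """Collect the distinct non-null values observed for a categorical feature."""
    return {value for value in values if value is not None}


def unexpected_values(domain: Set[str], values: Iterable[Optional[str]]) -> List[str]:
    """Return the sorted non-null values that are not part of the domain."""
    return sorted({v for v in values if v is not None and v not in domain})


# Hypothetical samples: the reference data never contains 'Thu'.
reference = ["Mon", "Tue", "Wed", "Mon", None]
new_batch = ["Mon", "Thu", "Tue", "Thu"]

domain = infer_domain(reference)
outside_domain = unexpected_values(domain, new_batch)
print(outside_domain)
```

A domain inferred this way is only as complete as the data it was inferred from, so values that are legitimate but rare can be missing from it.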

## Validating using crimes from 2020

We used data from 2019 to generate the schema. However, what if a specific type of crime did not occur in 2019 but does in 2020? Let's see what happens by extracting the 2020 data and validating it against our schema.

In [130]:
crime_2020_df = execute_query(bq_client, generate_query(2020, 2020))



In [131]:
crime_2020_stats = tfdv.generate_statistics_from_dataframe(crime_2020_df)

We have generated the statistics for 2020. You can visualize them as done previously, or compare them directly with the 2019 statistics using [tfdv.visualize_statistics](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics):

In [132]:
tfdv.visualize_statistics(
    lhs_statistics=crime_2019_stats,
    rhs_statistics=crime_2020_stats,
    lhs_name='2019',
    rhs_name='2020'
)

This is an easy way to compare the values. For example, you can see that the number of crimes in 2020 is lower than in 2019, but the percentage of cases where an arrest was made is also lower.

Let's validate the 2020 statistics against the schema and statistics from 2019 using [tfdv.validate_statistics](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/validate_statistics) and [tfdv.display_anomalies](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_anomalies).

In [133]:

anomalies = tfdv.validate_statistics(
    statistics=crime_2020_stats, 
    schema=crime_2019_schema,
    previous_statistics=crime_2019_stats
)

In [134]:
tfdv.display_anomalies(anomalies)



| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'primary_type' | Unexpected string values | Examples contain values missing from the schema: RITUALISM (<1%). |


We can see that one anomaly was detected. `primary_type` is a categorical feature, and the 2020 data contains a value that was not in the 2019 dataset. This is not really a data error: the problem is that our schema was not generated from all the values we can expect for this attribute. Let's update the schema to include this new `primary_type` value.

In [135]:
primary_types = tfdv.get_domain(crime_2019_schema, 'primary_type')
primary_types.value.append('RITUALISM')

Let's recompute the anomalies to see whether this fixed the problem.

In [136]:
anomalies = tfdv.validate_statistics(
    statistics=crime_2020_stats, 
    schema=crime_2019_schema,
    previous_statistics=crime_2019_stats
)
tfdv.display_anomalies(anomalies)



Looks good! There are many more ways to update your schema and apply more constraints, especially for detecting skew and drift. One of the best ways is to have a look at [the list of anomalies](https://www.tensorflow.org/tfx/data_validation/anomalies) that can be detected by tfdv, then apply them to your schema.

For example, we can see that `location_description` has 0.57% missing values in 2020 vs 0.45% in 2019. Let's say you want to allow at most 0.5% missing values. You could set that threshold as below:

In [139]:
tfdv.get_feature(crime_2019_schema, 'location_description').presence.min_fraction = 0.995

In [140]:
anomalies = tfdv.validate_statistics(
    statistics=crime_2020_stats, 
    schema=crime_2019_schema,
    previous_statistics=crime_2019_stats
)
tfdv.display_anomalies(anomalies)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'location_description' | Column dropped | The feature was present in fewer examples than expected: minimum fraction = 0.995000, actual = 0.994332 |
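
The arithmetic behind this anomaly is short enough to check by hand. The sketch below converts the "at most 0.5% missing" rule into a minimum presence fraction and applies it to both years. The 2019 fraction comes from the `crime_df.count()` output earlier, and the 2020 fraction is copied from the anomaly message; the helper `presence_ok` is a hypothetical name, not a TFDV function.

```python
MAX_MISSING_FRACTION = 0.005                 # at most 0.5% missing values
MIN_FRACTION = 1.0 - MAX_MISSING_FRACTION    # the value set on presence.min_fraction


def presence_ok(present_fraction: float, min_fraction: float) -> bool:
    """Return True when the feature is present in enough examples."""
    return present_fraction >= min_fraction


# 2019: non-null location_description rows divided by total rows.
fraction_2019 = 259512 / 260673
# 2020: value reported in the anomaly message.
fraction_2020 = 0.994332

print(f"2019: {fraction_2019:.6f} ok={presence_ok(fraction_2019, MIN_FRACTION)}")
print(f"2020: {fraction_2020:.6f} ok={presence_ok(fraction_2020, MIN_FRACTION)}")
```

The 2019 data stays just above the threshold while the 2020 data falls just below it, which is why only the 2020 statistics trigger the anomaly.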


And now, let's add a drift example. For categorical features, TFDV uses the [L-infinity norm](https://en.wikipedia.org/wiki/L-infinity) to detect drift, so we just need to set the maximum threshold we are willing to accept.

In [141]:
tfdv.get_feature(crime_2019_schema, 'primary_type').drift_comparator.infinity_norm.threshold = 0.01

In [142]:
anomalies = tfdv.validate_statistics(
    statistics=crime_2020_stats, 
    schema=crime_2019_schema,
    previous_statistics=crime_2019_stats
)
tfdv.display_anomalies(anomalies)



| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'primary_type' | High Linfty distance between current and previous | The Linfty distance between current and previous is 0.0443245 (up to six significant digits), above the threshold 0.01. The feature value with maximum difference is: THEFT |
| 'location_description' | Column dropped | The feature was present in fewer examples than expected: minimum fraction = 0.995000, actual = 0.994332 |


In this example, we can see there is a drift for the "THEFT" type of crime between 2019 and 2020.
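
To see what this distance measures, here is a minimal sketch that computes the L-infinity distance between two categorical distributions in plain Python: each dataset is turned into relative frequencies, and the distance is the largest absolute difference for any single value. This is written for illustration and is not TFDV's implementation; the function name `linf_distance` and the toy samples are assumptions.

```python
from collections import Counter
from typing import Sequence, Tuple


def linf_distance(previous: Sequence[str], current: Sequence[str]) -> Tuple[float, str]:
    """Return the largest per-value frequency gap and the value responsible for it."""
    if not previous or not current:
        raise ValueError("both samples must be non-empty")
    prev_counts = Counter(previous)
    curr_counts = Counter(current)
    gaps = {
        value: abs(prev_counts[value] / len(previous) - curr_counts[value] / len(current))
        for value in set(prev_counts) | set(curr_counts)
    }
    worst = max(gaps, key=gaps.get)
    return gaps[worst], worst


# Toy samples: THEFT drops from 50% to 30% of the records.
previous = ["THEFT"] * 5 + ["BATTERY"] * 3 + ["ASSAULT"] * 2
current = ["THEFT"] * 3 + ["BATTERY"] * 4 + ["ASSAULT"] * 3

distance, value = linf_distance(previous, current)
print(f"L-infinity distance: {distance:.2f} (largest gap on {value})")
```

With a threshold of 0.01, a shift of one percentage point in the share of any single value is enough to raise a drift anomaly, so the threshold should reflect how much year-to-year variation you consider normal.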

One last thing about this schema. Let's say we want to predict whether an arrest will be made. During training we will have the `arrest` feature, but in production, at inference time, we won't. To handle this case, TFDV lets you declare that a feature is available in some environments but not in others. Let's add this information to our schema.

In [144]:
crime_2019_schema.default_environment.append('TRAINING')
crime_2019_schema.default_environment.append('SERVING')
tfdv.get_feature(crime_2019_schema, 'arrest').not_in_environment.append('SERVING')

When you validate statistics, you can specify the environment as below:

In [145]:
anomalies = tfdv.validate_statistics(crime_2020_stats, crime_2019_schema, environment='SERVING')
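
The effect of environments can be pictured with a small plain-Python sketch: every feature is expected in every environment unless it is explicitly excluded from one. This only mirrors the idea behind the schema fields used above and is not TFDV code; `expected_features` is a hypothetical helper and the feature list is shortened.

```python
from typing import Dict, List, Set


def expected_features(
    features: List[str],
    not_in_environment: Dict[str, List[str]],
    environment: str,
) -> Set[str]:
    """Return the features that should be present in the given environment."""
    return {
        name
        for name in features
        if environment not in not_in_environment.get(name, [])
    }


features = ["primary_type", "district", "arrest"]
# 'arrest' is the label, so it is excluded from the SERVING environment.
not_in_environment = {"arrest": ["SERVING"]}

training = expected_features(features, not_in_environment, "TRAINING")
serving = expected_features(features, not_in_environment, "SERVING")
print(sorted(training))
print(sorted(serving))
```

Validating serving data against the `SERVING` environment therefore does not report the absent `arrest` column as an anomaly.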

## Saving your schema

Once you have identified the constraints you want to set, you can save your schema to be able to reuse it later.

In [143]:
schema_file = 'schema.pbtxt'
tfdv.write_schema_text(crime_2019_schema, schema_file)


## End of lab

In this lab we have seen how to generate statistics from a dataset and explore them easily using Facets. We have seen how to generate and update a schema, and then how to apply it to new datasets to detect changes, skew, or drift in the data. There are other TFDV features we haven't covered, for example how to slice the data by a specific attribute before extracting the statistics. You can check out the official documentation for more details on this topic.

We have used data from 2019 to generate the initial schema, but if your dataset is bigger, you may need to run this code on cloud compute resources. TFDV runs on Apache Beam, so in the next lab, we will see how you could do the same steps using Dataflow.