# Analyzing and Validating Data using TensorFlow Data Validation (TFDV)

Building a successful machine learning (ML) system involves more than training a model. This two-part article discusses the role of the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) (TFDV) library in performing data exploration and descriptive analytics during experimentation, as well as in validating incoming data for training or prediction in production.

This tutorial shows you step-by-step how to use TFDV to analyze and validate data for ML on Google Cloud Platform (GCP). 

The objectives of this tutorial are to:
1. Extract data from BigQuery to GCS.
2. Generate statistics from the data using TFDV.
3. Explore and visualize the data statistics.
4. Generate a schema for the data using TFDV.
5. Extract new data from BigQuery.
6. Validate the new data using the generated schema.

### Install requirements

Install the required packages for the tutorial:

In [None]:
!pip install -U tensorflow==1.14.0
!pip install -U tensorflow-data-validation==0.11.0
!pip install -U apache-beam[gcp]==2.8.0
!pip install -U google_cloud_bigquery==1.18.0
!pip install -U python-snappy==0.5.4

In [79]:
import os
import tensorflow as tf 
import apache_beam as beam 
import tensorflow_data_validation as tfdv
from google.cloud import bigquery
from datetime import datetime

Verify the version of the installed packages:

In [2]:
print("TF version: {}".format(tf.__version__))
print("TFDV version: {}".format(tfdv.__version__))
print("Beam version: {}".format(beam.__version__))
print("BQ SDK version: {}".format(bigquery.__version__))

TF version: 1.14.0
TFDV version: 0.11.0
Beam version: 2.8.0
BQ SDK version: 1.18.0


### Setup
To get started, assign your GCP **PROJECT_ID**, **BUCKET_NAME**, and **REGION** to the following variables. [Create a GCP Project](https://console.cloud.google.com/projectcreate) if you don't have one. [Create a regional Cloud Storage bucket](https://console.cloud.google.com/storage/create-bucket) if you don't have one.

In [3]:
LOCAL = True # Change to False to run on GCP

PROJECT_ID = 'validateflow' # Set your GCP Project Id
BUCKET_NAME = 'validateflow' # Set your Bucket name
REGION = 'europe-west1' # Set the region for Dataflow jobs

ROOT = '.' if LOCAL else 'gs://{}'.format(BUCKET_NAME)

DATA_DIR = ROOT + '/tfdv/data/' # Location to store data
SCHEMA_DIR = ROOT + '/tfdv/schema/' # Location to store data schema 
STATS_DIR = ROOT +'/tfdv/stats/' # Location to store stats 
STAGING_DIR = ROOT + '/tfdv/job/staging/' # Dataflow staging directory on GCP
TEMP_DIR =  ROOT + '/tfdv/job/temp/' # Dataflow temporary directory on GCP

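Clean up the working directory:

In [None]:
WORKING_DIR = ROOT + '/tfdv/' # Remove only the tutorial's own directory, never ROOT itself
if tf.gfile.Exists(WORKING_DIR):
    print("Removing {} contents...".format(WORKING_DIR))
    tf.gfile.DeleteRecursively(WORKING_DIR)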

### Dataset

In this tutorial, we will use the [flights](https://bigquery.cloud.google.com/table/bigquery-samples:airline_ontime_data.flights?pli=1&tab=schema) data table, which is a publicly available sample dataset in [BigQuery](https://bigquery.cloud.google.com/dataset/bigquery-samples:airline_ontime_data?pli=1).

The table has more than 70 million records on domestic US flights, including information on date, airline, departure airport, arrival airport, departure schedule, actual departure time, arrival schedule, and actual arrival time.

You can use the [BigQuery console](https://console.cloud.google.com/bigquery) to explore the data, or you can run the following cell, which counts the number of flights by year.


In [None]:
bq_client = bigquery.Client() 

query ="""
    SELECT 
        EXTRACT(YEAR FROM CAST(date as DATE)) as year,
        COUNT(*) as flight_count
    FROM 
        `bigquery-samples.airline_ontime_data.flights`
    GROUP BY
        year
    ORDER BY 
        year DESC
    """

query_job = bq_client.query(query=query, location='US') 
data_frame = query_job.to_dataframe() 

In [None]:
display(data_frame)

We have data from 2002 to 2012. The dataset is ~8GB, which might be too big to load into memory for exploration. However, you can use TFDV to perform the data crunching on GCP at scale using Cloud Dataflow, producing statistics that can easily be loaded into memory, visualized, and analyzed.

## 1. Extract the data from BigQuery to GCS
In this step, we will extract the data we want to analyze from BigQuery, convert it to TFRecord files, and store the data files in Cloud Storage (GCS). TFDV will then read these data files from GCS. We are going to use Apache Beam to accomplish this.

Let's say that you use this dataset to estimate the arrival delay of a particular flight using ML. Note that, in this tutorial, we are not focusing on building the model; rather, we are focusing on analyzing and validating the data as it changes over time. We are going to use **data in 2010-2011 to generate the schema**, while validating **data in 2012 to identify anomalies**.

Note that, in more realistic scenarios, new flight data arrives in your data warehouse on a daily or weekly basis, and you would validate each day's worth of data against the schema. The purpose of this example is to show how this can be done at scale (using a year's worth of data) to identify anomalies.

The data will be extracted with the following columns:
* **flight_date**: The scheduled flight date
* **flight_month**: The abbreviated month name of the scheduled flight date
* **flight_day**: The day of the month of the scheduled flight date
* **flight_day_of_week**: The abbreviated weekday name of the scheduled flight date
* **airline**: Abbreviated airline name
* **departure_airport**: Abbreviated departure airport
* **departure_schedule_hour**: Scheduled departure hour
* **departure_schedule_minute**: Scheduled departure minute
* **departure_time_slot**: (6am - 9am), (9am - 12pm), (12pm - 3pm), (3pm - 6pm), (6pm - 9pm), (9pm - 12am), (12am - 6am)
* **departure_delay**: Departure delay (in minutes)
* **arrival_delay**: Arrival delay (in minutes)
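
The derived departure columns can be sketched in plain Python. This is only an illustration of the bucketing rules, not part of the tutorial code: the names `TIME_SLOTS`, `split_schedule`, and `departure_time_slot` are hypothetical, and the sketch assumes the scheduled departure is an integer in HHMM form (for example, 930 for 9:30am).

```python
# Each entry is (low, high, label). Bounds are inclusive, like SQL BETWEEN,
# and the first matching slot wins, so 900 falls into "6am - 9am".
TIME_SLOTS = [
    (600, 900, "6am - 9am"),
    (900, 1200, "9am - 12pm"),
    (1200, 1500, "12pm - 3pm"),
    (1500, 1800, "3pm - 6pm"),
    (1800, 2100, "6pm - 9pm"),
    (2100, 2400, "9pm - 12am"),
]


def split_schedule(departure_schedule):
    """Split an HHMM integer such as 930 into (hour, minute)."""
    return divmod(departure_schedule, 100)


def departure_time_slot(departure_schedule):
    """Return the time slot label for an HHMM integer."""
    for low, high, label in TIME_SLOTS:
        if low <= departure_schedule <= high:
            return label
    # Anything outside the listed slots is an overnight departure.
    return "12am - 6am"
```

Because the bounds are inclusive and overlap at the edges, a departure scheduled exactly on a boundary is assigned to the earlier slot.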

### Implementing the source query

In [6]:
def generate_query(date_from=None, date_to=None, limit=None):
    query ="""
        SELECT 
          CAST(date AS DATE) AS flight_date, 
          FORMAT_DATE('%b',  CAST(date AS DATE)) AS flight_month, 
          EXTRACT(DAY FROM CAST(date AS DATE)) AS flight_day, 
          FORMAT_DATE('%a',  CAST(date AS DATE)) AS flight_day_of_week, 
          airline,
          departure_airport,
          CAST(SUBSTR(LPAD(CAST(departure_schedule AS STRING), 4, '0'), 0, 2) AS INT64) AS departure_schedule_hour, 
          CAST(SUBSTR(LPAD(CAST(departure_schedule AS STRING), 4, '0'), 3, 2) AS INT64) AS departure_schedule_minute, 
          CASE 
            WHEN departure_schedule BETWEEN 600 AND 900 THEN '[6:00am - 9:00am]'
            WHEN departure_schedule BETWEEN 900 AND 1200 THEN '[9:00am - 12:pm]'
            WHEN departure_schedule BETWEEN 1200 AND 1500 THEN '[12:00pm - 3:00pm]'
            WHEN departure_schedule BETWEEN 1500 AND 1800 THEN '[3:00pm - 6:00pm]'
            WHEN departure_schedule BETWEEN 1800 AND 2100 THEN '[6:00pm - 9:00pm]'
            WHEN departure_schedule BETWEEN 2100 AND 2400 THEN '[9:00pm - 12:00am]'
            ELSE '[12:00am - 6:00am]'
          END AS departure_time_slot,
          departure_delay,
          arrival_delay
        FROM 
          `bigquery-samples.airline_ontime_data.flights`
        """
    if date_from:
        query += "WHERE CAST(date as DATE) >= CAST('{}' as DATE) \n".format(date_from)
        if date_to:
            query += "AND CAST(date as DATE) < CAST('{}' as DATE) \n".format(date_to)
    elif date_to:
        query += "WHERE CAST(date as DATE) < CAST('{}' as DATE) \n".format(date_to)
    
    if limit:
        query  += "LIMIT {}".format(limit)
        
    return query
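
The date filter built by `generate_query` is half-open: it keeps rows on or after `date_from` and strictly before `date_to`. The sketch below restates that rule in plain Python; `in_extract_range` is a hypothetical helper used only for illustration.

```python
from datetime import date


def in_extract_range(flight_date, date_from=None, date_to=None):
    """Mirror the WHERE clause: date_from <= flight_date < date_to."""
    if date_from is not None and flight_date < date_from:
        return False
    if date_to is not None and flight_date >= date_to:
        return False
    return True


# The end date itself is excluded from the extract.
start, end = date(2010, 1, 1), date(2011, 12, 31)
print(in_extract_range(date(2011, 12, 30), start, end))
print(in_extract_range(date(2011, 12, 31), start, end))
```

To include the last day of a period, pass the first day of the following period as `date_to`.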

You can run the following cell to see a sample of the data to be extracted:

In [None]:
bq_client = bigquery.Client() 
query_job = bq_client.query(query=generate_query(limit=5), location='US') 
query_job.to_dataframe() 

### Implementing helper functions

In [80]:
def get_type_map(query):
    bq_client = bigquery.Client()
    query_job = bq_client.query("SELECT * FROM ({}) LIMIT 0".format(query))
    results = query_job.result()
    type_map = {}
    for field in results.schema:
        type_map[field.name] = field.field_type
    
    return type_map

def row_to_example(instance, type_map):
    feature = {}
    for key, value in instance.items():
        data_type = type_map[key]
        if value is None:
            feature[key] = tf.train.Feature()
        elif data_type == 'INTEGER':
            feature[key] = tf.train.Feature(
                int64_list=tf.train.Int64List(value=[value]))
        elif data_type == 'FLOAT':
            feature[key] = tf.train.Feature(
                float_list=tf.train.FloatList(value=[value]))
        else:
            feature[key] = tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[tf.compat.as_bytes(value)]))
            
    return tf.train.Example(features=tf.train.Features(feature=feature))
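
The dispatch rule in `row_to_example` can be summarized without TensorFlow. The sketch below is a simplified restatement, not the tutorial's implementation: `feature_kind` is a hypothetical function that only names which list of a `tf.train.Feature` would be filled for a given value and BigQuery type.

```python
def feature_kind(value, bq_type):
    """Name the tf.train.Feature list that would hold the value."""
    if value is None:
        return None  # an empty Feature marks a missing value
    if bq_type == "INTEGER":
        return "int64_list"
    if bq_type == "FLOAT":
        return "float_list"
    # Every other type, for example STRING, is stored as bytes.
    return "bytes_list"


print(feature_kind(17, "INTEGER"))
print(feature_kind(None, "FLOAT"))
print(feature_kind("AA", "STRING"))
```

Keeping missing values as empty features, rather than dropping the key, lets the statistics step count them as missing.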

### Implementing the pipeline

In [21]:
def run_pipeline(args):

    source_query = args.pop('source_query')
    sink_data_location = args.pop('sink_data_location')
    runner = args.pop('runner')
    
    pipeline_options = beam.options.pipeline_options.GoogleCloudOptions(**args)
    print(pipeline_options)
    
    with beam.Pipeline(runner, options=pipeline_options) as pipeline:
        (pipeline 
         | "Read from BigQuery">> beam.io.Read(beam.io.BigQuerySource(query = source_query, use_standard_sql = True))
         | 'Convert to tf Example' >> beam.Map(lambda instance: row_to_example(instance, type_map))
         | 'Serialize to String' >> beam.Map(lambda example: example.SerializeToString(deterministic=True))
         | "Write as TFRecords to GCS" >> beam.io.WriteToTFRecord(
                    file_path_prefix = sink_data_location+"extract", 
                    file_name_suffix=".tfrecords")
        )
        

### Run the pipeline

In [7]:
runner = 'DirectRunner' if LOCAL else 'DataflowRunner'
job_name = 'tfdv-flights-data-extraction-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S'))
date_from =  '2010-01-01'
date_to = '2011-12-31'
data_location = os.path.join(DATA_DIR, 
        "{}-{}/".format(date_from.replace('-',''), date_to.replace('-','')))
print("Data will be extracted to: {}".format(data_location))

print("Generating source query...")
source_query = generate_query(date_from, date_to, 10000)

print("Retrieving data type...")
type_map = get_type_map(source_query)

args = {
    'job_name': job_name,
    'runner': runner,
    'source_query': source_query,
    'type_map': type_map,
    'sink_data_location': data_location,
    'project': PROJECT_ID,
    'region': REGION,
    'staging_location': STAGING_DIR,
    'temp_location': TEMP_DIR,
    'save_main_session': True,
    'setup_file': './setup.py'
}
print("Pipeline args are set.")

Data will be extracted to: ./tfdv/data/20100101-20111231/
Generating source query...
Retrieving data type...
Pipeline args are set.


In [None]:
print("Running data extraction pipeline...")
run_pipeline(args)
print("Pipeline is done.")

You can list the extracted data files...

In [8]:
#!gsutil ls {DATA_DIR}/*
!ls {DATA_DIR}/*

20100101-20111231
extract-00009-of-00010.tfrecords extract-00004-of-00010.tfrecords
extract-00008-of-00010.tfrecords extract-00003-of-00010.tfrecords
extract-00007-of-00010.tfrecords extract-00002-of-00010.tfrecords
extract-00006-of-00010.tfrecords extract-00001-of-00010.tfrecords
extract-00005-of-00010.tfrecords extract-00000-of-00010.tfrecords


## 2. Generate Statistics from the Data using TFDV
In this step, we will use TFDV to analyze the data in GCS and compute various statistics from it. This operation requires one or more full passes over the data to compute the mean, max, min, etc., so it needs to run at scale to analyze a large dataset.

If we run the analysis on a sample of the data, we can use TFDV to compute the statistics locally. For the full dataset, however, we can run the TFDV process on Cloud Dataflow for scalability. The generated statistics are stored as a protocol buffer in GCS.
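
To make the kind of output concrete, here is a minimal in-memory sketch of a few per-feature statistics for one numeric column. It is not how TFDV computes them (TFDV runs as a Beam pipeline and produces many more statistics); `summarize_numeric` and the sample values are hypothetical.

```python
def summarize_numeric(values):
    """Compute a few descriptive statistics, treating None as missing."""
    present = [v for v in values if v is not None]
    total = len(values)
    num_missing = total - len(present)
    summary = {
        "num_missing": num_missing,
        "missing_fraction": num_missing / float(total) if total else 0.0,
    }
    if present:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = sum(present) / float(len(present))
    return summary


# Hypothetical departure delays in minutes, with one missing value.
print(summarize_numeric([10.0, None, -5.0, 25.0]))
```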

In [10]:
job_name = 'tfdv-flights-stats-gen-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S'))
args['job_name'] = job_name
stats_location = os.path.join(STATS_DIR, 'stats.pb')

print("Computing statistics...")
_ = tfdv.generate_statistics_from_tfrecord(
    data_location=data_location, 
    output_path=stats_location,
    stats_options=tfdv.StatsOptions(
        sample_rate=.3
    ),
    pipeline_options = beam.options.pipeline_options.GoogleCloudOptions(**args)
)

print("Statistics are computed and saved to: {}".format(stats_location))

Computing statistics...
Statistics are computed and saved to: ./tfdv/stats/stats.pb


You can list the saved statistics file...

In [11]:
!ls {stats_location}
#!gsutil ls {stats_location}

./tfdv/stats/stats.pb


## 3. Explore and Visualize the Statistics
In this step, we use TFDV's visualization capabilities to explore and analyze the data, in order to identify value ranges, the vocabularies of categorical columns, the percentages of missing values, etc. This exploration informs the expected schema of the data. TFDV uses [Facets](https://pair-code.github.io/facets/) for visualization.

In [12]:
stats = tfdv.load_statistics(stats_location)
tfdv.visualize_statistics(stats)

## 4. Generate Schema for the Data
In this step, we generate a schema for the data based on the statistics. The schema describes the data types, ranges, etc., which will be used for validating incoming new data. Before storing the generated schema in GCS, we can alter and extend it manually.
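
As a rough intuition for what a schema captures, the sketch below infers the type, presence, and string domain of a single column from raw values. This is a deliberately simplified, hypothetical `infer_feature` function: TFDV infers the schema from the computed statistics, not from raw rows, and applies more heuristics than shown here.

```python
def infer_feature(values):
    """Infer presence, type, and (for strings) the set of allowed values."""
    present = [v for v in values if v is not None]
    feature = {
        "presence": "required" if len(present) == len(values) else "optional"
    }
    if not present:
        feature["type"] = "UNKNOWN"
    elif all(isinstance(v, str) for v in present):
        feature["type"] = "STRING"
        feature["domain"] = sorted(set(present))  # the observed vocabulary
    elif all(isinstance(v, int) for v in present):
        feature["type"] = "INT"
    else:
        feature["type"] = "FLOAT"
    return feature


print(infer_feature(["OO", "AA", "OO"]))
print(infer_feature([1.5, None]))
```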

In [58]:
schema = tfdv.infer_schema(statistics=stats)
tfdv.display_schema(schema=schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'departure_time_slot' | STRING | required | | 'departure_time_slot' |
| 'departure_schedule_hour' | INT | required | | - |
| 'departure_airport' | STRING | required | | 'departure_airport' |
| 'arrival_delay' | FLOAT | required | | - |
| 'departure_schedule_minute' | INT | required | | - |
| 'airline' | STRING | required | | 'airline' |
| 'flight_day' | INT | required | | - |
| 'flight_day_of_week' | STRING | required | | 'flight_day_of_week' |
| 'flight_month' | STRING | required | | 'flight_month' |
| 'departure_delay' | FLOAT | required | | - |

| Domain | Values |
|---|---|
| 'departure_time_slot' | '[12:00am - 6:00am]', '[12:00pm - 3:00pm]', '[3:00pm - 6:00pm]', '[6:00am - 9:00am]', '[6:00pm - 9:00pm]', '[9:00am - 12:pm]', '[9:00pm - 12:00am]' |
| 'departure_airport' | 'AVL', 'BGR', 'BHM', 'CAE', 'CHS', 'CLT', 'CMH', 'CVG', 'DEN', 'DTW', 'GSP', 'IAD', 'IAH', 'IND', 'LEX', 'LWB', 'MEM', 'MIA', 'MSP', 'ORF', 'PWM', 'RDU', 'RIC', 'SAV', 'SDF', 'TVC', 'TYS' |
| 'airline' | '9E', 'AA', 'OO', 'YV' |
| 'flight_day_of_week' | 'Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed' |
| 'flight_month' | 'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep' |


### Fix and save the schema

In [91]:
from tensorflow_metadata.proto.v0 import schema_pb2

# Constrain departure_delay (in minutes) to a plausible range.
departure_delay_domain = schema_pb2.FloatDomain(
    min=-60, # a flight can depart up to 1 hour early
    max=480 # maximum departure delay is 8 hours; beyond that, the flight is cancelled
)
tfdv.set_domain(schema, 'departure_delay', departure_delay_domain)

# Flag drift in the flight_month distribution between the previous and current data.
flight_month_feature = tfdv.get_feature(schema, 'flight_month')
flight_month_feature.drift_comparator.infinity_norm.threshold = 0.001
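
The effect of a numeric domain can be sketched as a simple range check. The example below is illustrative only: `out_of_range` and the sample delays are hypothetical, and the bounds reuse the -60 and 480 minute limits set above. It shows how values accidentally reported in seconds instead of minutes would fall outside the domain.

```python
def out_of_range(values, min_value=-60, max_value=480):
    """Return the values that violate the [min_value, max_value] domain."""
    return [v for v in values if v < min_value or v > max_value]


delays_in_minutes = [-5.0, 0.0, 35.0, 120.0]
delays_in_seconds = [minutes * 60 for minutes in delays_in_minutes]

print(out_of_range(delays_in_minutes))   # nothing violates the domain
print(out_of_range(delays_in_seconds))   # most values are now too low or too high
```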



In [92]:
tf.gfile.MakeDirs(SCHEMA_DIR) # Also creates missing parent directories
schema_location = os.path.join(SCHEMA_DIR, 'schema.pb')
tfdv.write_schema_text(schema, schema_location)
print("Schema file saved to:{}".format(schema_location))

Schema file saved to:./tfdv/schema/schema.pb


You can list the saved schema file...

In [16]:
!ls {schema_location}
#!gsutil ls {schema_location}

./tfdv/schema/schema.pb


## 5. Extract New Data

In this step, we are going to extract new data from BigQuery and store it in GCS as TFRecord files. This will be flight data from **2012**; however, we are going to introduce the following alterations to the data schema and content to demonstrate the types of anomalies that TFDV can detect:
1. Skip February data
2. Introduce missing values in **airline**
3. Add an **is_weekend** column
4. Convert the time slot (12:00am - 6:00am) into two time slots: (12:00am - 3:00am), (3:00am - 6:00am)
5. Change the **departure_delay** values from minutes to seconds
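
Most of these alterations can be pictured as a per-row transformation. The sketch below is a hypothetical in-memory version covering alterations 1, 2, 3, and 5 (the time slot split is omitted for brevity); `alter_row` and the sample row are assumptions, and the airline code 'MQ' is the one nulled out by the altered source query.

```python
from datetime import date


def alter_row(row):
    """Return an altered copy of the row, or None if the row is dropped."""
    if row["flight_date"].month == 2:
        return None  # 1. skip February data
    altered = dict(row)
    if altered["airline"] == "MQ":
        altered["airline"] = None  # 2. introduce missing airline values
    # 3. isoweekday() returns 6 for Saturday and 7 for Sunday.
    is_weekend = row["flight_date"].isoweekday() in (6, 7)
    altered["is_weekend"] = "Yes" if is_weekend else "No"
    altered["departure_delay"] = row["departure_delay"] * 60  # 5. minutes to seconds
    return altered


sample = {"flight_date": date(2012, 1, 1), "airline": "MQ", "departure_delay": 10.0}
print(alter_row(sample))
```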

### Implementing the "altered" source query

In [71]:
def generate_altered_query(date_from=None, date_to=None, limit=None):
    query ="""
        SELECT * FROM (
            SELECT 
              CAST(date AS DATE) AS flight_date, 
              FORMAT_DATE('%b',  CAST(date AS DATE)) AS flight_month, 
              EXTRACT(DAY FROM CAST(date AS DATE)) AS flight_day, 
              FORMAT_DATE('%a',  CAST(date AS DATE)) AS flight_day_of_week, 
              CASE WHEN EXTRACT(DAYOFWEEK FROM CAST(date AS DATE)) IN (1 , 7) THEN 'Yes' ELSE 'No' END AS is_weekend,
              CASE WHEN airline = 'MQ' THEN NULL ELSE airline END airline,
              departure_airport,
              CAST(SUBSTR(LPAD(CAST(departure_schedule AS STRING), 4, '0'), 0, 2) AS INT64) AS departure_schedule_hour, 
              CAST(SUBSTR(LPAD(CAST(departure_schedule AS STRING), 4, '0'), 3, 2) AS INT64) AS departure_schedule_minute, 
              CASE 
                WHEN departure_schedule BETWEEN 600 AND 900 THEN '[6:00am - 9:00am]'
                WHEN departure_schedule BETWEEN 900 AND 1200 THEN '[9:00am - 12:pm]'
                WHEN departure_schedule BETWEEN 1200 AND 1500 THEN '[12:00pm - 3:00pm]'
                WHEN departure_schedule BETWEEN 1500 AND 1800 THEN '[3:00pm - 6:00pm]'
                WHEN departure_schedule BETWEEN 1800 AND 2100 THEN '[6:00pm - 9:00pm]'
                WHEN departure_schedule BETWEEN 2100 AND 2400 THEN '[9:00pm - 12:00am]'
                WHEN departure_schedule BETWEEN 0000 AND 300 THEN '[12:00am - 3:00am]'
                ELSE '[3:00am - 6:00am]'
              END AS departure_time_slot,
              departure_delay * 60 AS departure_delay,
              arrival_delay
            FROM 
              `bigquery-samples.airline_ontime_data.flights`
            WHERE 
              EXTRACT(MONTH FROM CAST(date AS DATE)) != 2
        )
        """
    if date_from:
        query += "WHERE flight_date >= CAST('{}' as DATE) \n".format(date_from)
        if date_to:
            query += "AND flight_date < CAST('{}' as DATE) \n".format(date_to)
    elif date_to:
        query += "WHERE flight_date < CAST('{}' as DATE) \n".format(date_to)
    
    if limit:
        query  += "LIMIT {}".format(limit)
        
    return query

You can run the following cell to see a sample of the data to be extracted:

In [None]:
bq_client = bigquery.Client() 
query_job = bq_client.query(query=generate_altered_query(limit=5), location='US') 
query_job.to_dataframe() 

### Run the pipeline

In [82]:
runner = 'DirectRunner' if LOCAL else 'DataflowRunner'
job_name = 'tfdv-flights-data-extraction-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S'))
date_from =  '2012-01-01'
date_to = '2012-12-31'
data_location = os.path.join(DATA_DIR, 
        "{}-{}/".format(date_from.replace('-',''), date_to.replace('-','')))
print("Data will be extracted to: {}".format(data_location))

print("Generating altered source query...")
source_query = generate_altered_query(date_from, date_to, 10000)

print("Retrieving data type...")
type_map = get_type_map(source_query)

args = {
    'job_name': job_name,
    'runner': runner,
    'source_query': source_query,
    'type_map': type_map,
    'sink_data_location': data_location,
    'project': PROJECT_ID,
    'region': REGION,
    'staging_location': STAGING_DIR,
    'temp_location': TEMP_DIR,
    'save_main_session': True,
    'setup_file': './setup.py'
}
print("Pipeline args are set.")

Data will be extracted to: ./tfdv/data/20120101-20121231/
Generating altered source query...
Retrieving data type...
Pipeline args are set.


In [83]:
print("Running data extraction pipeline...")
run_pipeline(args)
print("Pipeline is done.")

Running data extraction pipeline...
GoogleCloudOptions(dataflow_endpoint=https://dataflow.googleapis.com, job_name=tfdv-flights-data-extraction-190906-174407, labels=None, no_auth=False, project=validateflow, region=europe-west1, service_account_email=None, staging_location=./tfdv/job/staging/, temp_location=./tfdv/job/temp/, template_location=None)




Pipeline is done.


You can list the extracted data files...

In [84]:
#!gsutil ls {DATA_DIR}/*
!ls {DATA_DIR}/*

./tfdv/data//20100101-20111231:
extract-00000-of-00010.tfrecords extract-00005-of-00010.tfrecords
extract-00001-of-00010.tfrecords extract-00006-of-00010.tfrecords
extract-00002-of-00010.tfrecords extract-00007-of-00010.tfrecords
extract-00003-of-00010.tfrecords extract-00008-of-00010.tfrecords
extract-00004-of-00010.tfrecords extract-00009-of-00010.tfrecords

./tfdv/data//20120101-20121231:
beam-temp-extract-79b6dcb3d0cd11e98a37784f439392c6
extract-00000-of-00010.tfrecords
extract-00001-of-00010.tfrecords
extract-00002-of-00010.tfrecords
extract-00003-of-00010.tfrecords
extract-00004-of-00010.tfrecords
extract-00005-of-00010.tfrecords
extract-00006-of-00010.tfrecords
extract-00007-of-00010.tfrecords
extract-00008-of-00010.tfrecords
extract-00009-of-00010.tfrecords


### Generate statistics for the new data

In [85]:
job_name = 'tfdv-flights-stats-gen-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S'))
args['job_name'] = job_name
new_stats_location = os.path.join(STATS_DIR, 'new_stats.pb')

print("Computing statistics...")
_ = tfdv.generate_statistics_from_tfrecord(
    data_location=data_location, 
    output_path=new_stats_location,
    stats_options=tfdv.StatsOptions(
        sample_rate=1.
    ),
    pipeline_options = beam.options.pipeline_options.GoogleCloudOptions(**args)
)

print("Statistics are computed and saved to: {}".format(new_stats_location))

Computing statistics...




Statistics are computed and saved to: ./tfdv/stats/new_stats.pb


## 6. Validate the New Data and Identify Anomalies
In this step, we are going to use the generated schema to validate the newly extracted data, checking whether the data complies with the schema or contains anomalies that need to be handled.
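
Two of the checks involved, new columns and unexpected string values, can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not TFDV's implementation: `find_anomalies` is hypothetical, and the schema is reduced to a dict that maps each column name to its set of allowed values (or `None` when the column has no domain).

```python
def find_anomalies(schema_domains, rows):
    """Return (new_columns, unexpected_values) found in the rows."""
    new_columns = set()
    unexpected = {}
    for row in rows:
        for name, value in row.items():
            if name not in schema_domains:
                new_columns.add(name)  # column in the data but not in the schema
                continue
            domain = schema_domains[name]
            if domain is not None and value is not None and value not in domain:
                unexpected.setdefault(name, set()).add(value)
    return sorted(new_columns), {k: sorted(v) for k, v in unexpected.items()}


schema_domains = {"airline": {"AA", "OO"}, "flight_day": None}
rows = [
    {"airline": "AA", "flight_day": 3, "is_weekend": "No"},
    {"airline": "ZZ", "flight_day": 4, "is_weekend": "Yes"},
]
print(find_anomalies(schema_domains, rows))
```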

### Load Schema

In [86]:
schema = tfdv.utils.schema_util.load_schema_text(schema_location)

### Load data statistics

In [87]:
stats = tfdv.load_statistics(stats_location)
new_stats = tfdv.load_statistics(new_stats_location)

### Validate new statistics against schema 

In [93]:
anomalies = tfdv.validate_statistics(
    statistics=new_stats, 
    schema=schema,
    previous_statistics=stats
)

### Display anomalies (if any)

In [94]:
tfdv.display_anomalies(anomalies)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'departure_airport' | Unexpected string values | Examples contain values missing from the schema: DFW (~17%), EGE (<1%), MCO (<1%), ORD (~34%), STL (~3%). |
| 'departure_delay' | Multiple errors | Unexpectedly low values: -1080<-1080(upto six significant digits) Unexpectedly high value: 60660>480(upto six significant digits) |
| 'is_weekend' | New column | New column (column in data but not in schema) |
| 'flight_month' | High Linfty distance between current and previous | The Linfty distance between current and previous is 0.0681896 (up to six significant digits), above the threshold 0.001. The feature value with maximum difference is: Feb |
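
The Linfty (L-infinity) distance reported for `flight_month` can be illustrated with a small worked example. The sketch assumes the comparison is made between the normalized value frequencies of the previous and current data, as the anomaly description suggests; `linf_distance` and the toy month lists are hypothetical.

```python
from collections import Counter


def linf_distance(previous_values, current_values):
    """Return (distance, value) for the largest gap in relative frequency."""
    if not previous_values or not current_values:
        raise ValueError("both datasets must be non-empty")
    prev_counts = Counter(previous_values)
    curr_counts = Counter(current_values)
    prev_total = float(len(previous_values))
    curr_total = float(len(current_values))
    gaps = {
        value: abs(prev_counts[value] / prev_total - curr_counts[value] / curr_total)
        for value in set(prev_counts) | set(curr_counts)
    }
    worst = max(gaps, key=gaps.get)
    return gaps[worst], worst


previous = ["Jan", "Feb", "Mar", "Feb"]
current = ["Jan", "Mar", "Mar", "Jan"]  # February is missing entirely
print(linf_distance(previous, current))
```

In this toy data, February accounts for half of the previous values and none of the current ones, so it is the value with the largest difference, analogous to the `Feb` result in the table above.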


## License

Authors: Khalid Salama and Eric Evn der Knaap

---

**Disclaimer**: This is not an official Google product. The sample code is provided for educational purposes.

---

Copyright 2019 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.