# Drift Detection with TensorFlow Data Validation

This tutorial shows how to use [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) (TFDV) to identify and analyze different data skews in request-response serving data logged by AI Platform Prediction in BigQuery.

The tutorial has two parts:

* **Part 1**: Detecting data skews
  1. Download training data
  2. Generate baseline statistics and a reference schema using TFDV
  3. Implement an Apache Beam pipeline to validate serving data logged in BigQuery. This pipeline stores the generated statistics and anomalies to disk.

* **Part 2**: Analyzing statistics and anomalies
  1. Load the generated serving statistics and anomalies from disk
  2. Use TFDV to visualize and display the statistics and anomalies
  3. Analyze how statistics change over time.
 



## Setup

### Install packages and dependencies

In [0]:
!pip install -U -q tensorflow
!pip install -U -q tensorflow_data_validation
!pip install -U -q apache_beam
!pip install -U -q pandas

In [0]:
# Automatically restart kernel after installs
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)  

### Configure GCP environment settings

In [0]:
PROJECT_ID = "sa-data-validation"
BQ_DATASET_NAME = 'prediction_logs'
BQ_TABLE_NAME = 'covertype_classifier_logs'  
MODEL_NAME = 'covertype_classifier'
MODEL_VERSION = 'v1'
!gcloud config set project $PROJECT_ID

### Authenticate your GCP account

This step is required only if you run the notebook in Colab.

In [0]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  # Not running in Colab; assume credentials are already configured.
  pass

### Import libraries

In [0]:
import os
import tensorflow as tf
import tensorflow_data_validation as tfdv
import apache_beam as beam
import pandas as pd
from datetime import datetime
import json
import numpy as np

print("TF version: {}".format(tf.__version__))
print("TFDV version: {}".format(tfdv.__version__))
print("Beam version: {}".format(beam.__version__))

### Create a local workspace

In [0]:
WORKSPACE = './workspace'
DATA_DIR = os.path.join(WORKSPACE, 'data')
TRAIN_DATA = os.path.join(DATA_DIR, 'train.csv') 
ARTIFACTS_DIR = os.path.join(WORKSPACE, 'artifacts')

if tf.io.gfile.exists(WORKSPACE):
  print("Removing previous workspace artifacts...")
  tf.io.gfile.rmtree(WORKSPACE)

print("Creating a new workspace...")
tf.io.gfile.makedirs(WORKSPACE)
tf.io.gfile.makedirs(DATA_DIR)
tf.io.gfile.makedirs(ARTIFACTS_DIR)

# Part 1: Detecting Serving Data Skews

## 1. Download data

We use the [covertype](https://archive.ics.uci.edu/ml/datasets/covertype) dataset from the UCI Machine Learning Repository.

The dataset is preprocessed, split, and uploaded to the `gs://workshop-datasets/covertype` public GCS location. We use this version of the preprocessed dataset in this notebook. For more information, see [Cover Type Dataset](https://github.com/GoogleCloudPlatform/mlops-on-gcp/tree/master/datasets/covertype).

We use the training split to generate the reference schema and baseline statistics, which are then used to validate the serving data.

In [0]:
!gsutil cp gs://workshop-datasets/covertype/data_validation/training/dataset.csv {TRAIN_DATA}
!wc -l {TRAIN_DATA}

In [0]:
sample = pd.read_csv(TRAIN_DATA).head()
sample.T

## 2. Generate Baseline Statistics and Reference Schema with TFDV

### 2.1. Generate statistics

In [0]:
baseline_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAIN_DATA,
    stats_options = tfdv.StatsOptions(
        sample_count=1000
    )
)

### 2.2. Visualize statistics

In [0]:
tfdv.visualize_statistics(baseline_stats)

### 2.3. Generate a schema 

In [0]:
from tensorflow_metadata.proto.v0 import schema_pb2, statistics_pb2

reference_schema = tfdv.infer_schema(baseline_stats)

tfdv.set_domain(reference_schema, 'Soil_Type', schema_pb2.IntDomain(
    name='Soil_Type', is_categorical=True))

baseline_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAIN_DATA,
    stats_options=tfdv.StatsOptions(
        schema=reference_schema,
        sample_count=1000
        )
    )

reference_schema = tfdv.infer_schema(baseline_stats)

# Treat 'Soil_Type' as a categorical feature: BYTES is enum value 1 in the schema.
tfdv.get_feature(reference_schema, 'Soil_Type').type = schema_pb2.BYTES


reference_schema.default_environment.append('TRAINING')
reference_schema.default_environment.append('SERVING')

# Specify that 'Cover_Type' feature is not in SERVING environment.
tfdv.get_feature(
    reference_schema, 'Cover_Type').not_in_environment.append('SERVING')

tfdv.display_schema(schema=reference_schema)

In [0]:
def fix_stats(stats):
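  # Mark 'Soil_Type' as a categorical feature in the statistics by setting
  # its type to STRING (enum value 2). The stats object is modified in place.
  for feature in stats.datasets[0].features:
    if feature.path.step == ['Soil_Type']:
      feature.type = statistics_pb2.FeatureNameStatistics.STRING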
      break
  return stats

In [0]:
fix_stats(baseline_stats)

tfdv.display_anomalies(
    tfdv.validate_statistics(
        baseline_stats, 
        reference_schema, 
        environment='TRAINING'
        )
    )

### 2.4. Store reference schema and statistics

In [0]:
REFERENCE_SCHEMA_FILE = os.path.join(ARTIFACTS_DIR, 'reference_schema.pbtxt')

tfdv.write_schema_text(
    reference_schema,
    REFERENCE_SCHEMA_FILE 
    
)

In [0]:
BASELINE_STATS_FILE = os.path.join(ARTIFACTS_DIR, 'baseline_statistics.pbtxt')

tfdv.utils.stats_util.write_stats_text(
    baseline_stats, 
    BASELINE_STATS_FILE
)

## 3. Implementing Apache Beam Pipeline for Skew Detection

The Beam pipeline performs the following steps:
1. Read the raw serving request-response data in the BigQuery logs table
2. Parse the data to BeamExamples
3. Generate statistics for the serving data
4. Validate the serving statistics against the reference_schema
5. Store the serving statistics and any detected anomalies for analysis

### 3.1. Implement helper functions

In [0]:
def generate_query(bq_table, model_name, model_version, start_time, end_time):
  query = """
    SELECT *
    FROM `{}`
    WHERE model = '{}' AND model_version = '{}'
    AND time BETWEEN '{}' AND '{}';
  """.format(bq_table, model_name, model_version, start_time, end_time)

  return query

In [0]:
_RAW_DATA_COLUMN = 'raw_data'
_INSTANCES_KEY = 'instances'

class JSONObjectCoder(beam.DoFn):

  def __init__(self):
    self._example_size = beam.metrics.Metrics.counter(
      tfdv.constants.METRICS_NAMESPACE, "example_size")
      
  def process(self, log_record):
    
    raw_data = json.loads(log_record[_RAW_DATA_COLUMN])
    for instance in raw_data[_INSTANCES_KEY]:
        for key, value in instance.items():
            instance[key] = np.array(value)
        yield instance
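
To make the parsing step concrete, the following Beam-free sketch applies the same logic to a hand-written log record. The record layout (a `raw_data` column holding a JSON request body with an `instances` list) follows the constants above, but the feature names and values are hypothetical and chosen only for illustration. The sketch keeps plain Python lists where `JSONObjectCoder` converts each value to a NumPy array.

```python
import json

RAW_DATA_COLUMN = 'raw_data'
INSTANCES_KEY = 'instances'

# Hypothetical log record: one BigQuery row whose raw_data column holds the
# JSON request body with two instances. The feature values are made up.
log_record = {
    RAW_DATA_COLUMN: json.dumps({
        INSTANCES_KEY: [
            {'Elevation': [3100], 'Soil_Type': [7201]},
            {'Elevation': [2850], 'Soil_Type': [4703]},
        ]
    })
}


def parse_instances(record):
  """Yields one feature dictionary per instance in the logged request."""
  raw_data = json.loads(record[RAW_DATA_COLUMN])
  for instance in raw_data[INSTANCES_KEY]:
    yield dict(instance)


parsed = list(parse_instances(log_record))
print(parsed)
```

Each logged request can contain several instances, so a single BigQuery row may produce more than one example.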
        

In [0]:
def format_datetime(datetime_str):
  return datetime.strptime(datetime_str, '%Y-%m-%d %H:%M:%S').strftime('%Y%m%d%H%M%S')
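
As a quick check of the helper, the sketch below formats the start and end times used later in this tutorial and joins them into the name of the per-window output directory. The `outputs` directory is a placeholder used only for this illustration.

```python
import os
from datetime import datetime


def format_datetime(datetime_str):
  return datetime.strptime(
      datetime_str, '%Y-%m-%d %H:%M:%S').strftime('%Y%m%d%H%M%S')


start_time = '2020-05-23 10:30:00'
end_time = '2020-05-23 11:30:00'

# The window name is '20200523103000-20200523113000' for these inputs.
window_name = format_datetime(start_time) + '-' + format_datetime(end_time)

# 'outputs' is a placeholder directory, not the path used by the pipeline.
print(os.path.join('outputs', window_name))
```

Because the names start with the year and end with the seconds, sorting the directory names alphabetically also sorts the windows chronologically.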

### 3.2. Implement skew detection pipeline

In [0]:
from collections import namedtuple

_STATS_FILENAME = 'stats.pb'
_ANOMALIES_FILENAME = 'anomalies.pbtxt'

def run_pipeline(args):
  
  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())
  
  query = generate_query(
      args.request_response_log_table, args.model_name, args.model_version, 
      args.start_time, args.end_time)    

  output_directory = os.path.join(
      args.output_path, format_datetime(args.start_time)+"-"+format_datetime(args.end_time))
  stats_output_path = os.path.join(output_directory, _STATS_FILENAME)
  anomalies_output_path = os.path.join(output_directory, _ANOMALIES_FILENAME)
  reference_schema = tfdv.load_schema_text(args.reference_schema_path)
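  # The baseline statistics were written in text format (step 2.4),
  # so they are loaded with the matching text reader.
  baseline_stats = tfdv.utils.stats_util.load_stats_text(args.baseline_stats_path)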
  
  with beam.Pipeline(options=options) as pipeline:
    
    raw_examples = (pipeline
                   | 'ReadBigQueryData' >> beam.io.Read(beam.io.BigQuerySource(
                       query=query, use_standard_sql=True)))
    examples = (raw_examples
                | 'JSONObjectInstancesToBeamExamples' >> beam.ParDo(
                    JSONObjectCoder()))  
    
    stats = (examples
             | 'BeamExamplesToArrow' >> tfdv.utils.batch_util.BatchExamplesToArrowRecordBatches() 
             | 'GenerateStatistics' >> tfdv.GenerateStatistics())
        
    _ = (stats       
         | 'WriteStatsOutput' >> beam.io.WriteToTFRecord(
             file_path_prefix=stats_output_path,
             shard_name_template='',
             coder=beam.coders.ProtoCoder(
                 statistics_pb2.DatasetFeatureStatisticsList)))
        
    _ = (stats
         | 'ValidateStatistics' >> beam.Map(
             lambda new_stats: tfdv.validate_statistics(
                 new_stats, schema=reference_schema, 
                 previous_statistics=baseline_stats, 
                 environment='SERVING'))
         | 'WriteAnomaliesOutput' >> beam.io.textio.WriteToText(
             file_path_prefix=anomalies_output_path,
             shard_name_template='',
             append_trailing_newlines=False))
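
The pipeline passes `previous_statistics=baseline_stats` so that TFDV can compare the serving statistics with the baseline. To our understanding of TFDV, such comparisons are reported only for features whose schema entry sets a `drift_comparator` threshold, and for categorical features the comparison is based on the L-infinity distance between the two normalized value distributions. The following pure-Python sketch illustrates that distance. The value counts and the threshold are hypothetical, and the function is an illustration of the metric, not TFDV's implementation.

```python
def l_infinity_distance(baseline_counts, serving_counts):
  """Largest absolute difference between two normalized value distributions."""
  baseline_total = sum(baseline_counts.values())
  serving_total = sum(serving_counts.values())
  values = set(baseline_counts) | set(serving_counts)
  return max(
      abs(baseline_counts.get(v, 0) / baseline_total
          - serving_counts.get(v, 0) / serving_total)
      for v in values)


# Hypothetical value counts for a categorical feature.
baseline = {'Rawah': 500, 'Commanche': 300, 'Cache': 200}
serving = {'Rawah': 350, 'Commanche': 300, 'Cache': 350}

distance = l_infinity_distance(baseline, serving)
threshold = 0.1  # Hypothetical threshold, chosen only for this example.
print(distance, distance > threshold)
```

With these counts the largest shift in relative frequency is about 0.15, which exceeds the example threshold, so the feature would be flagged.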



### 3.3. Set pipeline parameters

In [0]:
request_response_log_table = "{}.{}.{}".format(
    PROJECT_ID, BQ_DATASET_NAME, BQ_TABLE_NAME
)

start_time = '2020-05-23 10:30:00'
end_time = '2020-05-23 11:30:00'

output_dir = os.path.join(ARTIFACTS_DIR, 'outputs')

args = {
    'job_name': 'tfdv-skew-detection-{}'.format(datetime.utcnow().strftime('%Y%m%d-%H%M%S')),
    'request_response_log_table': request_response_log_table,
    'model_name': MODEL_NAME,
    'model_version': MODEL_VERSION,
    'start_time': start_time,
    'end_time': end_time,
    'reference_schema_path': REFERENCE_SCHEMA_FILE,
    'baseline_stats_path': BASELINE_STATS_FILE,
    'output_path': output_dir,
    'project': PROJECT_ID,
}

args

### 3.4. Run pipeline



In [0]:
!rm -rf {output_dir}
!mkdir -p {output_dir}

We run the pipeline once for each day found in the request-response BigQuery logs table, generating serving statistics and detecting skews for that day.

In [0]:
query = """
  SELECT DISTINCT FORMAT_TIMESTAMP("%Y-%m-%d", time) AS date
  FROM `{}`
  WHERE model = '{}' AND model_version = '{}'
  ORDER BY date
""".format(request_response_log_table, MODEL_NAME, MODEL_VERSION)

results = pd.io.gbq.read_gbq(
    query, project_id=PROJECT_ID)
dates = list(results.date)
dates

In [0]:
for d in dates:
  start_time = d + ' 00:00:00'
  end_time = d + ' 23:59:59'
  args['start_time'] = start_time
  args['end_time'] = end_time
  print("Running pipeline: {} - {}".format(start_time, end_time))
  run_pipeline(args)
  print("Pipeline is done.")
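
The loop above uses an inclusive window that ends at `23:59:59`. If the `time` column stores sub-second precision, a record logged at, say, `23:59:59.500` would fall outside every window when compared with `BETWEEN`. One possible alternative, sketched below, is to build half-open windows `[start, next day)`. Using them would also require changing the query predicate to `time >= start AND time < end`, which is an assumption and not part of the original pipeline. The `daily_window` helper and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def daily_window(date_str):
  """Returns (start, end) strings for the half-open window [start, end)."""
  start = datetime.strptime(date_str, '%Y-%m-%d')
  end = start + timedelta(days=1)
  fmt = '%Y-%m-%d %H:%M:%S'
  return start.strftime(fmt), end.strftime(fmt)


# Hypothetical dates, in the same format as the `dates` list above.
for d in ['2020-05-23', '2020-05-31']:
  print(daily_window(d))
```

Both strings keep the `%Y-%m-%d %H:%M:%S` format, so they remain compatible with `format_datetime`.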

In [0]:
!ls {output_dir}

# Part 2: Analyzing Serving Data Statistics and Anomalies

First, load the serving statistics and anomalies that the pipeline stored on disk.

In [0]:
serving_stats = []
serving_anomalies = []
# Sort the window directories so that the statistics follow the order of `dates`.
for directory in sorted(os.listdir(output_dir)):
  stats_path = os.path.join(output_dir, directory, _STATS_FILENAME)
  stats = tfdv.load_statistics(stats_path)
  serving_stats.append(stats)
  print("Stats loaded: {}".format(stats_path))

  anomalies_path = os.path.join(output_dir, directory, _ANOMALIES_FILENAME)
  anomalies = tfdv.load_anomalies_text(anomalies_path)
  print("Anomalies loaded: {}".format(anomalies_path))
  serving_anomalies.append(anomalies)

## 1. Visualize Statistics

In [0]:
for stats in serving_stats:
  tfdv.visualize_statistics(
    baseline_stats, stats, 'baseline', 'current')

## 2. Display Anomalies



In [0]:
for anomalies in serving_anomalies:
  tfdv.display_anomalies(anomalies)

## 3. Analyze Statistics Change Over Time

In [0]:
from collections import defaultdict
feature_means = defaultdict(list)
for stats in serving_stats:
  for feature in stats.datasets[0].features:
    # Keep numerical features only: INT (0) and FLOAT (1).
    if feature.type in [0, 1]:
      mean = feature.num_stats.mean
      feature_means[feature.path.step[0]].append(mean)

feature_means
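
Raw means are hard to compare across features with different scales. One simple option, sketched below, is to express each day's mean as a relative change from the first day. The feature names and values are hypothetical, and `relative_change_from_first` is an illustrative helper rather than a TFDV function. It assumes the first mean is non-zero.

```python
# Hypothetical per-day means for two numerical features, in date order.
example_means = {
    'Elevation': [2950.0, 2960.0, 3250.0],
    'Slope': [14.0, 14.0, 14.7],
}


def relative_change_from_first(values):
  """Relative change of each value with respect to the first one."""
  first = values[0]
  return [(v - first) / first for v in values]


for name, means in example_means.items():
  changes = relative_change_from_first(means)
  print(name, ['{:+.1%}'.format(c) for c in changes])
```

A change that grows steadily from day to day suggests a gradual drift, whereas a single jump points to a sudden change in the serving data.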

In [0]:
import math
import matplotlib.pyplot as plt

dataframe = pd.DataFrame(feature_means, index=dates)
num_features = len(feature_means)
ncolumns = 3
# Round up so that a partially filled last row still gets its own subplot row.
nrows = math.ceil(num_features / ncolumns)

# squeeze=False keeps `axes` two-dimensional even when there is a single row.
fig, axes = plt.subplots(
    nrows=nrows, ncols=ncolumns, figsize=(25, 30), squeeze=False)
for i, col in enumerate(dataframe.columns):
  r = i // ncolumns
  c = i % ncolumns
  dataframe[col].plot.bar(ax=axes[r][c], title=col)