# Detecting Training-Serving Data Skews using Novelty Detection Modeling

This tutorial shows how to use novelty detection models to detect skews between data splits (e.g. training and serving). A novelty detection model can identify whether an instance belongs to a population or should be considered an outlier.

The tutorial also shows that feature-by-feature analysis, while able to identify a distribution skew in an individual feature, cannot identify outlier instances whose feature values each lie in the expected range but whose combination is odd.
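To make this point concrete, here is a minimal sketch on synthetic data (not the covertype dataset). It assumes two strongly correlated features and a hand-picked point; the helper `squared_mahalanobis` is a hypothetical name introduced only for this illustration. Each value of the point is ordinary on its own, yet its distance from the population is far larger than that of any sampled point.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Two strongly correlated synthetic features (illustrative data only).
x = rng.normal(size=5000)
y = x + rng.normal(scale=0.1, size=5000)
population = np.column_stack([x, y])

# Each value is ordinary on its own, but the combination is unusual.
odd_point = np.array([1.5, -1.5])

# Per-feature check: both values lie inside the observed ranges.
within_ranges = bool(np.all(
    (odd_point >= population.min(axis=0)) &
    (odd_point <= population.max(axis=0))))


def squared_mahalanobis(points, mean, inverse_covariance):
  """Squared Mahalanobis distance of each row in `points` (hypothetical helper)."""
  centered = np.atleast_2d(points) - mean
  return np.einsum('ij,jk,ik->i', centered, inverse_covariance, centered)


mean = population.mean(axis=0)
inverse_covariance = np.linalg.inv(np.cov(population, rowvar=False))

population_distances = squared_mahalanobis(population, mean, inverse_covariance)
odd_distance = squared_mahalanobis(odd_point, mean, inverse_covariance)[0]

print("Within per-feature ranges:", within_ranges)
print("Largest distance in the population:", round(float(population_distances.max()), 1))
print("Distance of the odd point:", round(float(odd_distance), 1))
```

A range or histogram check on each feature would accept the point, while a distance that accounts for the covariance between features flags it.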

The tutorial consists of the following parts:

* **Part 1: Producing baseline statistics and reference schema**
  1. Download training and serving data splits
  2. Compute baseline statistics and reference schema using training data
  3. Validate the serving data against the reference schema and statistics

* **Part 2: Generating mutated data points**
  1. Generate mutated data with random feature value combinations
  2. Validate the mutated data against the reference schema and statistics

* **Part 3: Detecting data skews with Elliptic Envelope**
  1. Train Elliptic Envelope using the training data
  2. Validate the normal and mutated data against the model

* **Part 4: Detecting skews in BigQuery request-response data**
  1. Implement Apache Beam pipeline for model-based drift detection
  2. Run pipeline and display drift detection output

## Setup

### Install required packages

In [0]:
!pip install -U -q apache-beam[interactive]
!pip install -U -q tensorflow_data_validation
!pip install -U -q pandas
!pip install -U -q scikit-learn

In [0]:
# Automatically restart kernel after installs
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)  

### Import libraries

In [0]:
import os
from tensorflow import io as tf_io
import tensorflow_data_validation as tfdv
import apache_beam as beam
import pandas as pd
import numpy as np
import warnings

print("TFDV version: {}".format(tfdv.__version__))
print("Apache Beam version: {}".format(beam.__version__))

### Create a local workspace

In [0]:
GCS_DATA_LOCATION = 'gs://workshop-datasets/covertype/data_validation'
WORKSPACE = './workspace'
DATA_DIR = os.path.join(WORKSPACE, 'data')
TRAIN_DATA = os.path.join(DATA_DIR, 'train.csv') 
EVAL_DATA = os.path.join(DATA_DIR, 'eval.csv') 
MODELS_DIR = os.path.join(WORKSPACE, 'models')

In [0]:
if tf_io.gfile.exists(WORKSPACE):
  print("Removing previous workspace artifacts...")
  tf_io.gfile.rmtree(WORKSPACE)

print("Creating a new workspace...")
tf_io.gfile.makedirs(WORKSPACE)
tf_io.gfile.makedirs(DATA_DIR)
tf_io.gfile.makedirs(MODELS_DIR)

# Part 1: Producing baseline statistics and reference schema

  1. Download training and serving data splits
  2. Compute baseline statistics and reference schema using training data
  3. Validate the serving data against the reference schema and statistics


## 1. Download Data Splits

We use the [covertype](https://archive.ics.uci.edu/ml/datasets/covertype) dataset from the UCI Machine Learning Repository. The task is to predict forest cover type from cartographic variables only.

The dataset is preprocessed, split, and uploaded to the `gs://workshop-datasets/covertype` public GCS location. 

We use this version of the preprocessed dataset in this notebook. For more information, see [Cover Type Dataset](https://github.com/GoogleCloudPlatform/mlops-on-gcp/tree/master/datasets/covertype).

In [0]:
!gsutil cp gs://workshop-datasets/covertype/data_validation/training/dataset.csv {TRAIN_DATA}
!gsutil cp gs://workshop-datasets/covertype/data_validation/evaluation/dataset.csv {EVAL_DATA}
!wc -l {TRAIN_DATA}
!wc -l {EVAL_DATA}

In [0]:
sample = pd.read_csv(TRAIN_DATA).head()
sample.T

## 2. Create Schema and Statistics with TFDV

### 2.1. Compute and visualize the statistics

In [0]:
baseline_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAIN_DATA
)

# tfdv.visualize_statistics(baseline_stats)

### 2.2. Generate and display schema 

In [0]:
from tensorflow_metadata.proto.v0 import schema_pb2, statistics_pb2

# Infer schema
reference_schema = tfdv.infer_schema(baseline_stats)

# Set Soil_Type to be categorical
tfdv.set_domain(reference_schema, 'Soil_Type', schema_pb2.IntDomain(
    name='Soil_Type', is_categorical=True))

# Set Cover_Type to be categorical
tfdv.set_domain(reference_schema, 'Cover_Type', schema_pb2.IntDomain(
    name='Cover_Type', is_categorical=True))

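# Recompute the statistics using the schema, so that Soil_Type and Cover_Type
# are treated as categorical features.
baseline_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAIN_DATA,
    stats_options=tfdv.StatsOptions(
        schema=reference_schema,
        sample_count=10000
        )
    )

# Re-infer the schema from the recomputed statistics
reference_schema = tfdv.infer_schema(baseline_stats)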

# Set Soil_Type to be categorical
tfdv.set_domain(reference_schema, 'Soil_Type', schema_pb2.IntDomain(
    name='Soil_Type', is_categorical=True))

# Set Cover_Type to be categorical
tfdv.set_domain(reference_schema, 'Cover_Type', schema_pb2.IntDomain(
    name='Cover_Type', is_categorical=True))

reference_schema.default_environment.append('TRAINING')
reference_schema.default_environment.append('SERVING')

# Specify that 'Cover_Type' feature is not in SERVING environment.
tfdv.get_feature(reference_schema, 'Cover_Type').not_in_environment.append('SERVING')

In [0]:
tfdv.display_schema(
    schema=reference_schema)

## 3. Validate Serving data against the reference schema and stats

In [0]:
serving_stats = tfdv.generate_statistics_from_csv(
    data_location=EVAL_DATA,
    stats_options = tfdv.StatsOptions(
        schema=reference_schema
    )
)

In [0]:
anomalies = tfdv.validate_statistics(
    serving_stats, 
    schema=reference_schema,
    previous_statistics=baseline_stats,
    environment='TRAINING'
)

In [0]:
tfdv.display_anomalies(anomalies)

# Part 2: Generating mutated data points

  1. Generate mutated data with random feature value combinations
  2. Validate the mutated data against the reference schema and statistics


## 1. Generate Mutated Serving Data
We generate a dataset of mutated data points by shuffling the values of each column across the rows, which creates rows with random combinations of feature values.

This method ensures that the values of each feature, taken independently, follow the distribution of the original serving data. However, the joint distribution is completely different, since the feature values are generated independently.
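The effect of independent column shuffling can be sketched on a toy frame. The column names `a` and `b` and the synthetic values are assumptions made for this illustration only; they are not part of the covertype data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Toy frame with two strongly correlated columns (hypothetical names).
a = rng.normal(size=2000)
toy = pd.DataFrame({'a': a, 'b': 2 * a + rng.normal(scale=0.1, size=2000)})

# Shuffle each column independently of the others.
shuffled = toy.copy()
for column_name in toy.columns:
  shuffled[column_name] = rng.permutation(toy[column_name].to_numpy())

# Marginals are unchanged: each column still holds exactly the same values.
same_marginals = all(
    np.array_equal(np.sort(toy[c].to_numpy()), np.sort(shuffled[c].to_numpy()))
    for c in toy.columns)

# The relationship between the columns, however, is destroyed.
original_corr = toy['a'].corr(toy['b'])
shuffled_corr = shuffled['a'].corr(shuffled['b'])

print("Same per-column values:", same_marginals)
print("Correlation before shuffling:", round(original_corr, 3))
print("Correlation after shuffling:", round(shuffled_corr, 3))
```

Per-feature statistics computed on `toy` and `shuffled` would be identical, which is why a feature-by-feature comparison is not expected to flag data mutated this way.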

In [0]:
serving_data = pd.read_csv(EVAL_DATA).drop('Cover_Type', axis=1)
serving_data.head().T

In [0]:
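def shuffle_values(dataframe):
  shuffled_dataframe = dataframe.copy()
  for column_name in dataframe.columns:
    # Shuffle each column independently; resetting the index prevents pandas
    # from realigning the shuffled values to their original rows on assignment.
    shuffled_dataframe[column_name] = shuffled_dataframe[column_name].sample(frac=1.0).reset_index(drop=True)

  return shuffled_dataframe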

mutated_serving_data = shuffle_values(serving_data)
mutated_serving_data.head().T

In [0]:
MUTATED_DATA_FILE = os.path.join(DATA_DIR, 'mutated_data.csv')
mutated_serving_data.to_csv(MUTATED_DATA_FILE, index=False)


## 2. Validate the mutated serving data against the reference schema

Notice that the individual feature distributions are the same as the original data.

In [0]:
mutated_stats = tfdv.generate_statistics_from_csv(
    data_location=MUTATED_DATA_FILE)

sample_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAIN_DATA)

tfdv.visualize_statistics(
    sample_stats, mutated_stats, 'Original', 'Mutated')

In [0]:
anomalies = tfdv.validate_statistics(
    mutated_stats, 
    schema=reference_schema,
    previous_statistics=baseline_stats,
    environment = 'SERVING'
)

tfdv.display_anomalies(anomalies)

# Part 3: Detecting data skews with Elliptic Envelope

  1. Train Elliptic Envelope using the training data
  2. Validate the normal and mutated data against the model


## 1. Train an Elliptic Envelope Model using Training Data

### 1.1. Define metadata

In [0]:
TARGET_FEATURE_NAME = 'Cover_Type'

CATEGORICAL_FEATURE_NAMES = [
    'Soil_Type',
    'Wilderness_Area'
]

### 1.2. Prepare the data

In [0]:
train_data = pd.read_csv(TRAIN_DATA).drop(TARGET_FEATURE_NAME, axis=1)

In [0]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

encoders = dict()

for feature_name in CATEGORICAL_FEATURE_NAMES:
  encoder = OneHotEncoder(handle_unknown='ignore')
  encoder.fit(train_data[[feature_name]])
  encoders[feature_name] = encoder

In [0]:
def prepare_data(data_frame):

  # Accept anything a DataFrame can be built from (e.g. a dict of columns).
  if not isinstance(data_frame, pd.DataFrame):
    data_frame = pd.DataFrame(data_frame)
  
  # Drop the old index so that it is not added as an extra feature column.
  data_frame = data_frame.reset_index(drop=True)
  for feature_name, encoder in encoders.items():
    # Replace each categorical column with its one-hot encoded columns.
    encoded_feature = pd.DataFrame(
      encoder.transform(data_frame[[feature_name]]).toarray()
    )
    data_frame = data_frame.drop(feature_name, axis=1)
    encoded_feature.columns = [feature_name+"_"+str(column) 
                               for column in encoded_feature.columns]
    data_frame = data_frame.join(encoded_feature)
  
  return data_frame
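
As a small illustration of what `handle_unknown='ignore'` implies for serving data, the sketch below fits a `OneHotEncoder` on a toy column and transforms a value that was not seen during fitting. The category values `area_a`, `area_b`, and `area_unseen` are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data with made-up category values (not the real dataset values).
toy_train = pd.DataFrame({'Wilderness_Area': ['area_b', 'area_a', 'area_b']})
toy_serving = pd.DataFrame({'Wilderness_Area': ['area_a', 'area_unseen']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(toy_train[['Wilderness_Area']])

# Categories are learned from the training data only.
learned_categories = list(encoder.categories_[0])

# A category unseen during fitting is encoded as an all-zero row
# instead of raising an error.
encoded = encoder.transform(toy_serving[['Wilderness_Area']]).toarray()

print(learned_categories)
print(encoded)
```

An unseen category therefore does not break `prepare_data`; it produces a row of zeros for that feature, which the distance-based model then scores like any other row.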

### 1.3. Fit the model

In [0]:
prepared_training_data = prepare_data(train_data)

In [0]:
import time
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor

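# Note: contamination only sets the decision offset used by predict(); this
# tutorial thresholds the Mahalanobis distances directly instead. Recent
# scikit-learn releases may reject 0.0 (the documented range is (0, 0.5]).
model = EllipticEnvelope(contamination=0.)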

print("Fitting...")
t0 = time.time()
model.fit(prepared_training_data)
t1 = time.time()
print("Model is fitted in {} seconds.".format(round(t1-t0)))


In [0]:
import statistics

training_distances = model.mahalanobis(prepared_training_data)
model._mean = statistics.mean(training_distances)
model._stdv = statistics.stdev(training_distances)
print("training distance mean: {}".format(round(model._mean, 5))) 
print("training distance stdv: {}".format(round(model._stdv, 5)))


## 2. Use the Elliptic Envelope Model to Validate Serving Data


In [0]:
def validate_data(model, data_frame, stdv_units=2):

  distances = model.mahalanobis(data_frame)
  distance = statistics.mean(distances)
  threshold = model._mean + (stdv_units * model._stdv)
  ratio = len([v for v in distances if v >= threshold]) / len(data_frame.index)
  
  return distance, ratio
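
The thresholding rule used by `validate_data` can be checked by hand on a few made-up distances. The numbers below are hypothetical; in the notebook the distances come from `model.mahalanobis(...)`.

```python
import statistics

# Hypothetical distances, standing in for model.mahalanobis(...) outputs.
training_distances = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0, 2.0, 2.0]
serving_distances = [1.5, 2.5, 9.0, 2.0, 12.0]

stdv_units = 2

# The threshold is the training mean plus `stdv_units` standard deviations.
training_mean = statistics.mean(training_distances)
training_stdv = statistics.stdev(training_distances)
threshold = training_mean + stdv_units * training_stdv

# A serving point counts as an outlier when its distance reaches the threshold.
outliers = [d for d in serving_distances if d >= threshold]
ratio = len(outliers) / len(serving_distances)

print("Threshold:", round(threshold, 3))
print("Outliers:", outliers)
print("Outlier ratio:", ratio)
```

Raising `stdv_units` makes the check more tolerant, and lowering it flags more points, so the value is a sensitivity setting rather than a fixed constant.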

### 2.1. Validate normal serving data

In [0]:
stdv_units = 2
prepared_serving_data = prepare_data(serving_data)
results = validate_data(model, prepared_serving_data, stdv_units)
score = round(results[1]*100, 2)
print("{}% of the data points have a distance more than {} standard deviations above the mean training distance".format(score, stdv_units))

### 2.2. Validate mutated serving data

In [0]:
prepared_mutated_data = prepare_data(mutated_serving_data)
results = validate_data(model, prepared_mutated_data, stdv_units)
score = round(results[1]*100, 2)
print("{}% of the data points have a distance more than {} standard deviations above the mean training distance".format(score, stdv_units))

# Part 4: Detecting skews in BigQuery request-response data

  1. Implement Apache Beam pipeline for model-based drift detection
  2. Run pipeline and display drift detection output

## Setup

In [0]:
import json
import numpy as np
from collections import namedtuple

### Configure GCP environment settings

In [0]:
PROJECT_ID = "sa-data-validation"
BUCKET = "sa-data-validation"
BQ_DATASET_NAME = 'prediction_logs'
BQ_TABLE_NAME = 'covertype_classifier_logs'  
MODEL_NAME = 'covertype_classifier'
MODEL_VERSION = 'v1'
!gcloud config set project $PROJECT_ID

### Authenticate your GCP account
This is required if you run the notebook in Colab.

In [0]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  # Not running in Colab; skip authentication.
  pass

## 1. Implement Apache Beam pipeline for model-based drift detection

In [0]:
from collections import defaultdict

def parse_batch_data(log_records):
  data_dict = defaultdict(list)

  for log_record in log_records:
    raw_data = json.loads(log_record['raw_data'])
    for raw_instance in raw_data['instances']:
      for name, value in raw_instance.items():
        data_dict[name].append(value[0])

  return data_dict


def score_data(data, model, stdv_units=2):
  distances = model.mahalanobis(data)
  threshold = model._mean + (stdv_units * model._stdv)
  outlier_count = len([v for v in distances if v >= threshold])
  records_count = len(data)
  return {'outlier_count': outlier_count, 'records_count': records_count}


def aggregate_scores(items):
  outlier_count = 0 
  records_count = 0
  for item in items:
    outlier_count += item['outlier_count']
    records_count += item['records_count']
  return {'outlier_count': outlier_count, 'records_count': records_count}
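
The sketch below illustrates two assumptions these helpers rely on. The record layout is inferred from `parse_batch_data` (each feature value arrives wrapped in a list), and the feature values and counts are made up. The functions `parse` and `merge` are compact stand-ins written for this example, not the notebook's own definitions. The second half shows that merging partial counts in groups gives the same totals as merging them all at once, which matters when the function is used as a combiner.

```python
import json
from collections import defaultdict

# Hypothetical log records: each feature value is wrapped in a list.
log_records = [
    {'raw_data': json.dumps({'instances': [
        {'Elevation': [3000], 'Soil_Type': ['type_a']},
        {'Elevation': [2500], 'Soil_Type': ['type_b']},
    ]})},
    {'raw_data': json.dumps({'instances': [
        {'Elevation': [2800], 'Soil_Type': ['type_a']},
    ]})},
]


def parse(records):
  """Collects instance values into one list per feature (stand-in helper)."""
  columns = defaultdict(list)
  for record in records:
    for instance in json.loads(record['raw_data'])['instances']:
      for name, value in instance.items():
        columns[name].append(value[0])
  return dict(columns)


def merge(items):
  """Sums outlier and record counts over partial results (stand-in helper)."""
  items = list(items)
  return {
      'outlier_count': sum(item['outlier_count'] for item in items),
      'records_count': sum(item['records_count'] for item in items),
  }


columns = parse(log_records)
print(columns)

# Made-up partial results, e.g. one per batch.
partials = [
    {'outlier_count': 3, 'records_count': 100},
    {'outlier_count': 1, 'records_count': 250},
    {'outlier_count': 0, 'records_count': 50},
]

all_at_once = merge(partials)
in_groups = merge([merge(partials[:2]), merge(partials[2:])])
print(all_at_once, in_groups)
```

Because the output of the merge has the same shape as its inputs, partial results can be combined in any grouping without changing the final counts.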


In [0]:
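def get_query(bq_table_fullname, model_name, model_version, start_time, end_time):
  # Assumption: the log table stores the request timestamp in a column named
  # `time`; adjust the column name if your table differs.
  query = """
  SELECT raw_data
  FROM {}
  WHERE model = '{}'
  AND model_version = '{}'
  AND time BETWEEN '{}' AND '{}'
  """.format(bq_table_fullname, model_name, model_version, start_time, end_time)

  return query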

### 1.1. Beam pipeline implementation

In [0]:
def run_pipeline(args):

  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())
  query = get_query(
      args.bq_table_fullname, args.model_name, 
      args.model_version, 
      args.start_time, 
      args.end_time
  )

  print("Starting the Beam pipeline...")
  with beam.Pipeline(options=options) as pipeline:
    (
        pipeline 
        | 'ReadBigQueryData' >> beam.io.Read(
            beam.io.BigQuerySource(query=query, use_standard_sql=True))
        | 'BatchRecords' >> beam.BatchElements(
            min_batch_size=100, max_batch_size=1000)
        | 'InstancesToBeamExamples' >> beam.Map(parse_batch_data)
        | 'PrepareData' >> beam.Map(prepare_data)
        | 'ScoreData' >> beam.Map(
            lambda data: score_data(data, args.drift_model, stdv_units=1))
        | 'CombineResults' >> beam.CombineGlobally(aggregate_scores)
        | 'ComputeRatio' >> beam.Map(
            # Serialize as JSON so that the output file can be read back
            # with json.loads.
            lambda result: json.dumps({
                "outlier_count": result['outlier_count'], 
                "records_count": result['records_count'],
                "drift_ratio": result['outlier_count'] / result['records_count']
                }))
        | 'WriteOutput' >> beam.io.WriteToText(
            file_path_prefix=args.output_file_path, num_shards=1, shard_name_template='')
    )
    

### 1.2. Pipeline parameter settings

In [0]:
from datetime import datetime

job_name = 'drift-detection-{}'.format(
    datetime.utcnow().strftime('%y%m%d-%H%M%S'))
bq_table_fullname = "{}.{}.{}".format(
    PROJECT_ID, BQ_DATASET_NAME, BQ_TABLE_NAME)
runner = 'InteractiveRunner'
output_dir = os.path.join(WORKSPACE, 'output')
output_path = os.path.join(output_dir, 'drift_output.json')
start_time = '2020-06-05 00:00:00 UTC'
end_time = '2020-06-06 23:59:59 UTC'

args = {
    'job_name': job_name,
    'runner': runner,
    'bq_table_fullname': bq_table_fullname,
    'model_name': MODEL_NAME,
    'model_version': MODEL_VERSION,
    'start_time': start_time,
    'end_time': end_time,
    'output_file_path': output_path,
    'project': PROJECT_ID,
    'reference_schema': reference_schema,
    'drift_model': model
}


## 2. Run pipeline and display drift detection output

In [0]:
!rm -rf {output_dir}

print("Running pipeline...")
%time run_pipeline(args)
print("Pipeline is done.")

In [0]:
!ls {output_dir}

In [0]:
with open(output_path) as output_file:
  drift_results = json.loads(output_file.read()).items()
for key, value in drift_results:
  if key == 'drift_ratio':
    value = str(round(value * 100, 2)) +'%'
  print(key,':', value)