# Slicing AutoML Tables Evaluation Results with BigQuery

This Colab assumes that you've created a dataset with AutoML Tables and used it to train a classification model. Once the model has finished training, you also need to export its evaluation results table to BigQuery. More detailed setup instructions appear below.

This Colab walks you through using BigQuery to visualize data slices, showing you one simple way to evaluate your model for bias.

## Setup

To use this Colab, copy it to your own Google Drive or open it in Playground mode. Follow the instructions in the [AutoML Tables Product docs](https://cloud.google.com/automl-tables/docs/) to create a GCP project, enable the API, create and download a service account private key, and set up the required permissions. You'll also need to use the AutoML Tables frontend or service to create a model and export its evaluation results to BigQuery. Once your model has finished training, the Evaluate tab shows a link to view your evaluation results in BigQuery. Then navigate to BigQuery in your GCP console, where the new results table appears in the list of tables your project can access.

For demo purposes, we'll be using the [Default of Credit Card Clients](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) dataset for analysis. This dataset was collected to help compare different methods of predicting credit card default. Using this colab to analyze your own dataset may require a little adaptation.

The code below samples roughly `sample_count` rows from the table. To analyze the whole dataset instead, set `sample_count` to a value at least as large as the number of rows in the table.
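
As a rough sketch of the sampling logic described above: each row is kept independently with probability `sample_count / row_count`, so the number of rows returned is only approximately `sample_count`, and any `sample_count` at least as large as the table keeps every row. The helper names `sampling_fraction` and `build_sample_query` and the table name in the usage line are hypothetical and are not part of this notebook.

```python
def sampling_fraction(sample_count, row_count):
  """Returns the probability with which each row is kept."""
  if row_count <= 0:
    raise ValueError('row_count must be positive')
  # Cap at 1.0 so that an oversized sample_count selects every row.
  return min(1.0, sample_count / row_count)


def build_sample_query(table_name, sample_count, row_count):
  """Builds a legacy SQL query that keeps each row with the given probability."""
  fraction = sampling_fraction(sample_count, row_count)
  # RAND() returns a value in [0, 1), so a fraction of 1.0 keeps all rows.
  return 'SELECT * FROM [{}] WHERE RAND() < {}'.format(table_name, fraction)


# Hypothetical table name, for illustration only.
print(build_sample_query('my-project:my_dataset.my_table', 3000, 30000))
```

Because the filter is random, two runs with the same parameters generally return different rows and slightly different row counts.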

Note also that although the data we use in this demo is public, you'll need to enter your own Google Cloud project ID in the parameter below to authenticate to it.



In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from google.colab import auth
import numpy as np
import os
import pandas as pd
import sys
sys.path.append('./python')
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_curve
# For facets
from IPython.core.display import display, HTML
import base64
!pip install --upgrade tf-nightly witwidget
import witwidget.notebook.visualization as visualization
!pip install apache-beam
!pip install --upgrade tensorflow_model_analysis
!pip install --upgrade tensorflow

import tensorflow as tf
import tensorflow_model_analysis as tfma
print('TFMA version: {}'.format(tfma.version.VERSION_STRING))

# Authenticate this Colab session so that the BigQuery queries below can run.
auth.authenticate_user()

# https://cloud.google.com/resource-manager/docs/creating-managing-projects
project_id = '[YOUR PROJECT ID HERE]' #@param {type:"string"}
table_name = 'bigquery-public-data:ml_datasets.credit_card_default' #@param {type:"string"}
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id
sample_count = 3000 #@param
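# The [project:dataset.table] syntax is BigQuery legacy SQL, so request that
# dialect explicitly.
row_count = pd.io.gbq.read_gbq('''
  SELECT
    COUNT(*) as total
  FROM [%s]''' % (table_name), project_id=project_id, dialect='legacy',
  verbose=False).total[0]
# Keep each row with probability sample_count / row_count.
df = pd.io.gbq.read_gbq('''
  SELECT
    *
  FROM
    [%s]
  WHERE RAND() < %d/%d
''' % (table_name, sample_count, row_count), project_id=project_id,
  dialect='legacy', verbose=False)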
print('Full dataset has %d rows' % row_count)
df.describe()

## Data Preprocessing

Many of the tools we use to analyze models and data expect to find their inputs in the [tensorflow.Example](https://www.tensorflow.org/tutorials/load_data/tf_records) format. Here, we'll preprocess our data into tf.Examples, and also extract the predicted class from our classifier, which is binary.
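
To illustrate the "extract the predicted class" step: the exported results carry one (value, score) pair per class for each example, and the predicted class is the pair with the highest score. The following is a minimal sketch of that selection in plain Python. The function name `top_class` and the sample pairs are made up for illustration; the notebook's own helper appears in the next cell.

```python
def top_class(prediction_pairs):
  """Returns the (value, score) pair with the highest score, or None if empty."""
  pairs = list(prediction_pairs)
  if not pairs:
    return None
  return max(pairs, key=lambda pair: pair[1])


# Made-up scores for a binary classifier: class '0' is the more likely one here.
pairs = [('0', 0.82), ('1', 0.18)]
best_value, best_score = top_class(pairs)
print(best_value, best_score)
```

Using `max` with a key avoids seeding the search with a score of zero, so a pair is still returned when every score is zero.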

In [0]:
unique_id_field = 'ID' #@param
prediction_field_score = 'predicted_default_payment_next_month_tables_score'  #@param
prediction_field_value = 'predicted_default_payment_next_month_tables_value'  #@param


def extract_top_class(prediction_tuples):
  # values from Tables show up as a CSV of individual json (prediction, confidence) objects.
  best_score = 0
  best_class = u''
  for val, sco in prediction_tuples:
    if sco > best_score:
      best_score = sco
      best_class = val
  return (best_class, best_score)

def df_to_examples(df, columns=None):
  examples = []
  if columns is None:
    columns = df.columns.values.tolist()
  for example_id in df[unique_id_field].unique():
    example = tf.train.Example()
    # All rows for this ID; each row carries one (value, score) prediction pair.
    rows = df.loc[df[unique_id_field] == example_id]
    prediction_tuples = zip(rows[prediction_field_value], rows[prediction_field_score])
    row = rows.iloc[0]
    for col in columns:
      if col == prediction_field_score or col == prediction_field_value:
        # Deal with prediction fields separately
        continue
      elif df[col].dtype == np.dtype(np.int64):
        example.features.feature[col].int64_list.value.append(int(row[col]))
      elif df[col].dtype == np.dtype(np.float64):
        example.features.feature[col].float_list.value.append(row[col])
      elif row[col] is None:
        continue
      elif row[col] == row[col]:
        # NaN never equals itself, so this check skips missing values.
        example.features.feature[col].bytes_list.value.append(row[col].encode('utf-8'))
    cla, sco = extract_top_class(prediction_tuples)
    example.features.feature['predicted_class'].int64_list.value.append(cla)
    example.features.feature['predicted_class_score'].float_list.value.append(sco)
    examples.append(example)
  return examples

# Fix up some types so analysis is consistent. This code is specific to the dataset.
df = df.astype({"PAY_5": float, "PAY_6": float})

# Converts a dataframe column into a column of 0's and 1's based on the provided test.
def make_label_column_numeric(df, label_column, test):
  df[label_column] = np.where(test(df[label_column]), 1, 0)
  
# Convert label types to numeric. This code is specific to the dataset.
make_label_column_numeric(df, 'predicted_default_payment_next_month_tables_value', lambda val: val == '1')
make_label_column_numeric(df, 'default_payment_next_month', lambda val:  val == '1')

examples = df_to_examples(df)
print("Preprocessing complete!")

## What-If Tool

First, we'll explore the data and predictions using the [What-If Tool](https://pair-code.github.io/what-if-tool/). The What-If Tool is a powerful visual interface for exploring data, models, and predictions. Because we're reading our results from BigQuery, we can't use the features of the What-If Tool that query the model directly. But we can still learn a lot about this dataset from the exploration that the What-If Tool enables.

Imagine that you're curious whether the predictive power of your model differs depending on the marital status of the person whose credit history is being analyzed. You can use the What-If Tool to see at a glance the relative sizes of the data samples for each class. In this dataset, the marital statuses are encoded as 1 = married; 2 = single; 3 = divorced; 0 = others. The What-If Tool shows that there are very few samples for classes other than married or single, which might indicate that performance on those classes could be compromised. If this lack of representation concerns you, you could consider collecting more data for underrepresented classes, downsampling overrepresented classes, or upweighting underrepresented classes as you train, depending on your use case and data availability.
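
The same check can be sketched outside the tool. The snippet below counts how often each marital status code appears and derives one possible set of upweighting factors, using the common "balanced" heuristic of total / (number of classes × class count). The function names and the sample codes are made up for illustration; on real data you would pass in the values of the `MARRIAGE` column, and whether this heuristic suits your training setup is an assumption to verify.

```python
from collections import Counter


def class_shares(values):
  """Returns the fraction of samples that fall in each class."""
  counts = Counter(values)
  total = sum(counts.values())
  return {cls: count / total for cls, count in counts.items()}


def balanced_weights(values):
  """Returns per-class weights that are inversely proportional to class size."""
  counts = Counter(values)
  total = sum(counts.values())
  return {cls: total / (len(counts) * count) for cls, count in counts.items()}


# Made-up marital status codes: 1 = married, 2 = single, 3 = divorced.
marriage_codes = [1] * 5 + [2] * 4 + [3] * 1
print(class_shares(marriage_codes))
print(balanced_weights(marriage_codes))
```

With these weights, every class contributes the same total weight, so a rare class is not drowned out by the common ones.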


In [0]:
WitWidget = visualization.WitWidget
WitConfigBuilder = visualization.WitConfigBuilder

num_datapoints = 2965  #@param {type: "number"}
tool_height_in_px = 700  #@param {type: "number"}

# Set up the tool with the examples; no live classifier is attached
config_builder = WitConfigBuilder(examples[:num_datapoints])
# Need to call this so we have inference_address and model_name initialized
config_builder = config_builder.set_estimator_and_feature_spec('', '')
config_builder = config_builder.set_compare_estimator_and_feature_spec('', '')
wv = WitWidget(config_builder, height=tool_height_in_px)

## TensorFlow Model Analysis

Next, let's examine some sliced metrics. This section of the tutorial uses the model-agnostic analysis capabilities of [TFMA](https://github.com/tensorflow/model-analysis).

TFMA generates sliced metrics graphs and confusion matrices. We can use these to dig deeper into how well this model performs for each marital status class. The model was built to optimize the AUC-ROC metric, and it does fairly well for all of the classes, though there is a small performance gap for the "divorced" category. But when we look at the AUC-PR metric slices, we can see that the "divorced" and "other" classes are very poorly served by the model compared to the more common classes. AUC-PR summarizes the tradeoff between precision and recall across decision thresholds. If we're concerned about this gap, we could consider retraining with AUC-PR as the optimization metric and see whether that model makes more equitable predictions.
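
To make the two metrics concrete, here is a small pure-Python sketch that computes them per slice. It uses the pairwise-ranking definition of AUC-ROC and average precision as one common estimate of AUC-PR; TFMA computes its AUC values with its own numerical approximation, so its numbers may differ. The function names and the rows are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def roc_auc(labels, scores):
  """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
  pos = [s for y, s in zip(labels, scores) if y == 1]
  neg = [s for y, s in zip(labels, scores) if y == 0]
  if not pos or not neg:
    return None  # Undefined when a slice contains only one class.
  wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
  return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
  """Mean of the precision values measured at the rank of each positive."""
  n_pos = sum(labels)
  if n_pos == 0:
    return None
  ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
  hits, total = 0, 0.0
  for rank, (_, label) in enumerate(ranked, start=1):
    if label == 1:
      hits += 1
      total += hits / rank
  return total / n_pos


def metrics_by_slice(rows):
  """Groups (slice_value, label, score) rows and computes metrics per slice."""
  groups = defaultdict(lambda: ([], []))
  for slice_value, label, score in rows:
    groups[slice_value][0].append(label)
    groups[slice_value][1].append(score)
  return {
      key: {'count': len(labels),
            'auc_roc': roc_auc(labels, scores),
            'auc_pr': average_precision(labels, scores)}
      for key, (labels, scores) in groups.items()
  }


# Made-up rows: (MARRIAGE code, actual label, predicted score for class 1).
rows = [(1, 1, 0.9), (1, 0, 0.2), (1, 1, 0.7), (1, 0, 0.4),
        (3, 1, 0.3), (3, 0, 0.6), (3, 0, 0.1)]
for key, metrics in sorted(metrics_by_slice(rows).items()):
  print(key, metrics)
```

Small slices make both metrics noisy, which is worth keeping in mind when comparing a rare class against a common one.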

In [0]:
import apache_beam as beam
import tempfile

from collections import OrderedDict
from google.protobuf import text_format
from tensorflow_model_analysis import post_export_metrics
from tensorflow_model_analysis import types
from tensorflow_model_analysis.api import model_eval_lib
from tensorflow_model_analysis.evaluators import aggregate
from tensorflow_model_analysis.extractors import slice_key_extractor
from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_evaluate_graph
from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_extractor
from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_predict
from tensorflow_model_analysis.proto import metrics_for_slice_pb2
from tensorflow_model_analysis.slicer import slicer
from tensorflow_model_analysis.view.widget_view import render_slicing_metrics

# To set up model agnostic extraction, need to specify features and labels of
# interest in a feature map.
feature_map = OrderedDict()

for column, dtype in df.dtypes.items():
  if column == prediction_field_score or column == prediction_field_value:
    continue
  elif dtype == np.dtype(np.float64):
    feature_map[column] = tf.io.FixedLenFeature([], tf.float32)
  elif dtype == np.dtype(object):
    feature_map[column] = tf.io.FixedLenFeature([], tf.string)
  elif dtype == np.dtype(np.int64):
    feature_map[column] = tf.io.FixedLenFeature([], tf.int64)
  # Columns of any other dtype (for example bool or datetime) are skipped:
  # df_to_examples() only writes int64, float64 and string columns, and a
  # tf.Example can only hold int64, float and bytes values.

feature_map['predicted_class'] = tf.io.FixedLenFeature([], tf.int64)
feature_map['predicted_class_score'] = tf.io.FixedLenFeature([], tf.float32)

serialized_examples = [e.SerializeToString() for e in examples]

BASE_DIR = tempfile.gettempdir()
OUTPUT_DIR = os.path.join(BASE_DIR, 'output')

slice_column = 'MARRIAGE' #@param
predicted_labels = 'predicted_class' #@param
actual_labels = 'default_payment_next_month' #@param
predicted_class_score = 'predicted_class_score' #@param

with beam.Pipeline() as pipeline:
  model_agnostic_config = model_agnostic_predict.ModelAgnosticConfig(
      label_keys=[actual_labels],
      prediction_keys=[predicted_labels],
      feature_spec=feature_map)

  extractors = [
      model_agnostic_extractor.ModelAgnosticExtractor(
          model_agnostic_config=model_agnostic_config,
          desired_batch_size=3),
      slice_key_extractor.SliceKeyExtractor([
          slicer.SingleSliceSpec(columns=[slice_column])
      ])
  ]

  auc_roc_callback = post_export_metrics.auc(
      labels_key=actual_labels,
      target_prediction_keys=[predicted_labels])
  
  auc_pr_callback = post_export_metrics.auc(
      curve='PR',
      labels_key=actual_labels,
      target_prediction_keys=[predicted_labels])
  
  confusion_matrix_callback = post_export_metrics.confusion_matrix_at_thresholds(
      labels_key=actual_labels,
      target_prediction_keys=[predicted_labels],
      example_weight_key=predicted_class_score,
      thresholds=[0.0, 0.5, 0.8, 1.0])

  # Create our model agnostic aggregator.
  eval_shared_model = types.EvalSharedModel(
      construct_fn=model_agnostic_evaluate_graph.make_construct_fn(
          add_metrics_callbacks=[confusion_matrix_callback,
                                 auc_roc_callback,
                                 auc_pr_callback,
                                 post_export_metrics.example_count()],
          fpl_feed_config=model_agnostic_extractor
          .ModelAgnosticGetFPLFeedConfig(model_agnostic_config)))

  # Run Model Agnostic Eval.
  _ = (
      pipeline
      | beam.Create(serialized_examples)
      | 'ExtractEvaluateAndWriteResults' >>
        model_eval_lib.ExtractEvaluateAndWriteResults(
            eval_shared_model=eval_shared_model,
            output_path=OUTPUT_DIR,
            extractors=extractors))
    

eval_result = tfma.load_eval_result(output_path=OUTPUT_DIR)
render_slicing_metrics(eval_result, slicing_column=slice_column)