# Analyzing and Validating ML models

After training a model, the most important next step is to analyze and validate it. This helps us identify issues in the model we built and improve it. To support this in production systems, TFX provides TensorFlow Model Analysis (TFMA).

Model analysis starts with our choice of metrics. We choose the metrics that match our requirements and evaluate the model against them.

These metrics include precision, recall, F1 score, mean absolute error, mean absolute percentage error, mean squared error, etc. (these are provided by TFMA).
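
As a rough illustration of what some of these metrics measure, the sketch below computes precision, recall, F1, mean absolute error, and mean squared error in plain Python. The helper names (`classification_metrics`, `regression_errors`) and the toy data are assumptions made for this example; it is not how TFMA implements these metrics.

```python
def classification_metrics(labels, predictions):
    """Precision, recall and F1 for binary labels and predictions (0 or 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


def regression_errors(targets, predictions):
    """Mean absolute error and mean squared error."""
    n = len(targets)
    mae = sum(abs(t - p) for t, p in zip(targets, predictions)) / n
    mse = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n
    return {"mae": mae, "mse": mse}


# Toy data, made up for illustration only.
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predictions = [1, 0, 0, 1, 1, 0, 1, 0]
print(classification_metrics(toy_labels, toy_predictions))
print(regression_errors([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))
```

Which of these matters most depends on the requirement: for example, recall matters more when missing a positive case is costly.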

In a TFX pipeline, TFMA calculates the metrics we define, based on the saved model exported by the Trainer component. If we rely on TensorBoard, we only get approximations extrapolated from measurements on mini-batches, whereas TFMA calculates the metrics over the whole evaluation set.
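
To see why a mini-batch average can differ from a metric computed over the whole evaluation set, consider precision. The sketch below uses made-up counts and a hypothetical `precision` helper; it only illustrates the arithmetic, not how TensorBoard or TFMA compute their values.

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP); 0.0 when nothing was predicted positive."""
    predicted_positive = true_positives + false_positives
    return true_positives / predicted_positive if predicted_positive else 0.0


# (true positives, false positives) for each mini-batch; toy numbers.
batches = [(9, 1), (1, 4)]

# Averaging the per-batch values weights every batch equally.
mean_of_batches = sum(precision(tp, fp) for tp, fp in batches) / len(batches)

# Pooling the counts first weights every example equally.
total_tp = sum(tp for tp, _ in batches)
total_fp = sum(fp for _, fp in batches)
whole_set = precision(total_tp, total_fp)

print(f"mean of per-batch precision: {mean_of_batches:.3f}")
print(f"precision over all examples: {whole_set:.3f}")
```

The two numbers disagree because the batches contain different numbers of positive predictions, so a plain average of batch values is only an approximation.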

<center>


`pip install tensorflow-model-analysis`
</center>

For model analysis, TFMA expects two inputs: a saved model and an evaluation dataset. Below is an example of using TFMA with our previously built model.


In [1]:
import tensorflow_model_analysis as tfma
import tensorflow as tf

# Suppress TensorFlow warnings
import logging
logger = tf.get_logger()
logger.setLevel(logging.ERROR)

# Load the serving model exported by the Trainer component
eval_model = tfma.default_eval_shared_model(
    eval_saved_model_path='data/tfx/Trainer/model/6/Format-Serving',
    tags=[tf.saved_model.SERVING])


Before running the analysis, we need to tell TFMA what to measure, provide any extra specifications, and, importantly, name the target label.

In [2]:
from google.protobuf import text_format

# Setup tfma.EvalConfig settings
eval_config = text_format.Parse("""
                ## Model information
                model_specs {
                    # For keras (and serving models) we need to add a `label_key`.
                    label_key: "consumer_disputed"
                }

                metrics_specs {
                    metrics { class_name: "BinaryAccuracy" }
                    metrics { class_name: "Precision" }
                    metrics { class_name: "Recall" }
                    metrics { class_name: "ExampleCount" }
                    metrics { class_name: "FalsePositives" }
                    metrics { class_name: "TruePositives" }
                    metrics { class_name: "FalseNegatives" }
                    metrics { class_name: "TrueNegatives" }
                }

                ## Slicing information
                slicing_specs {}  # overall slice
                
                """, tfma.EvalConfig())

In [3]:
eval_result = tfma.run_model_analysis(
    eval_shared_model=eval_model,
    eval_config=eval_config,
    data_location='data/eval_inputs/data_tfrecord-00000-of-00001',
    output_path='data/eval_outputs',
    file_format='tfrecords')





Note that rendering TFMA visualizations in Jupyter notebooks requires some additional setup commands. Please refer to the [documentation](https://www.tensorflow.org/tfx/model_analysis/install).

In [4]:
tfma.view.render_slicing_metrics(eval_result)


The cells above show some example usage. More details can be found in the documentation.

A major concern in many commercial production ML systems is fairness. This covers issues related to attributes such as race and gender that could have a negative impact on both the ML system and its users. We should therefore recognize such problems early and fix them.

To do that, we can use the slicing option in TFMA. It lets us separate out the groups we are interested in and then check the metrics on those slices.
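
Conceptually, slicing means grouping the evaluation examples by a feature value and computing the same metric on each group. The plain-Python sketch below shows the idea; `accuracy_by_slice`, the dictionary layout, and the example rows are assumptions made for illustration and do not reflect TFMA's internals or the real dataset.

```python
from collections import defaultdict


def accuracy_by_slice(examples, slice_key):
    """Group examples by the value of `slice_key` and compute accuracy per group."""
    counts = defaultdict(lambda: [0, 0])  # slice value -> [correct, total]
    for ex in examples:
        bucket = counts[ex[slice_key]]
        bucket[0] += int(ex["label"] == ex["prediction"])
        bucket[1] += 1
    return {value: correct / total for value, (correct, total) in counts.items()}


# Made-up rows; the real evaluation set is not reproduced here.
toy_examples = [
    {"product": "Mortgage", "label": 1, "prediction": 1},
    {"product": "Mortgage", "label": 0, "prediction": 0},
    {"product": "Credit card", "label": 1, "prediction": 0},
    {"product": "Credit card", "label": 0, "prediction": 0},
    {"product": "Credit card", "label": 1, "prediction": 0},
    {"product": "Student loan", "label": 0, "prediction": 1},
]

overall = sum(ex["label"] == ex["prediction"] for ex in toy_examples) / len(toy_examples)
per_product = accuracy_by_slice(toy_examples, "product")
print("overall accuracy:", overall)
print("per-slice accuracy:", per_product)
```

In this toy data the overall accuracy hides large differences between groups, which is exactly the kind of problem slicing is meant to expose.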

For example, below we define a slice specification on the `product` column of our dataset.

In [5]:
slice = [
    # An empty spec selects the whole dataset (the overall slice).
    tfma.slicer.SingleSliceSpec(),
    # One slice per distinct value of the `product` column.
    tfma.slicer.SingleSliceSpec(columns=['product']),
]

eval_result = tfma.run_model_analysis(
    eval_shared_model=eval_model,
    eval_config=eval_config,
    data_location='data/eval_inputs/data_tfrecord-00000-of-00001',
    output_path='data/eval_outputs',
    file_format='tfrecords',
    slice_spec=slice)

tfma.view.render_slicing_metrics(eval_result)




### Using Fairness Indicators for decisions

Fairness Indicators is a useful tool for model analysis whose capabilities overlap with those of TFMA. Its ability to show metrics sliced on features at various decision thresholds helps us assess model fairness at different thresholds.
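
The idea of comparing groups at several decision thresholds can be sketched in plain Python. The example below computes the false positive rate per group at three thresholds; the function names, the `group` key, and the scores are all hypothetical and chosen only to illustrate the computation, not to mirror the Fairness Indicators implementation.

```python
def false_positive_rate(examples, threshold):
    """FPR = FP / (FP + TN), predicting positive when score >= threshold."""
    negatives = [ex for ex in examples if ex["label"] == 0]
    if not negatives:
        return 0.0
    false_positives = sum(1 for ex in negatives if ex["score"] >= threshold)
    return false_positives / len(negatives)


def fpr_table(examples, slice_key, thresholds):
    """Map each group to {threshold: false positive rate}."""
    groups = {}
    for ex in examples:
        groups.setdefault(ex[slice_key], []).append(ex)
    return {
        group: {t: false_positive_rate(members, t) for t in thresholds}
        for group, members in groups.items()
    }


# Made-up model scores for two groups.
toy_scored = [
    {"group": "A", "label": 0, "score": 0.10},
    {"group": "A", "label": 0, "score": 0.30},
    {"group": "A", "label": 1, "score": 0.80},
    {"group": "B", "label": 0, "score": 0.60},
    {"group": "B", "label": 0, "score": 0.80},
    {"group": "B", "label": 1, "score": 0.90},
]

table = fpr_table(toy_scored, "group", [0.25, 0.5, 0.75])
for group, rates in table.items():
    print(group, rates)
```

A gap between groups that persists across thresholds, as in this toy data, suggests the disparity cannot be fixed simply by moving the decision threshold.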

There are several ways to use the Fairness Indicators tool; one of them is through TensorBoard. To do that, we first need to install the plugin as follows.

<center> 

`pip install tensorboard_plugin_fairness_indicators`
</center>

Next, we can use TFMA to evaluate the model and calculate metrics for a set of decision thresholds that we supply. The thresholds are passed to TFMA through the `metrics_specs` argument of `tfma.EvalConfig`. Below is an example.

In [None]:
eval_config_fairness=tfma.EvalConfig(
        model_specs=[tfma.ModelSpec(label_key='consumer_disputed')],
        slicing_specs=[tfma.SlicingSpec(), tfma.SlicingSpec(feature_keys=['product'])],
        metrics_specs=[
              tfma.MetricsSpec(metrics=[
                  tfma.MetricConfig(class_name='BinaryAccuracy'),
                  tfma.MetricConfig(class_name='ExampleCount'),
                  tfma.MetricConfig(class_name='FalsePositives'),
                  tfma.MetricConfig(class_name='TruePositives'),
                  tfma.MetricConfig(class_name='FalseNegatives'),
                  tfma.MetricConfig(class_name='TrueNegatives'),
                  tfma.MetricConfig(class_name='FairnessIndicators', config='{"thresholds":[0.25, 0.5, 0.75]}')
              ])])

eval_result = tfma.run_model_analysis(
    eval_shared_model=eval_model,
    eval_config=eval_config_fairness,
    data_location='data/eval_inputs/data_tfrecord-00000-of-00001',
    output_path="./data/eval_outputs/",
    file_format='tfrecords',
    slice_spec=slice)

Once the evaluation results have been written to a TensorBoard log directory, we can view them in TensorBoard as shown below.

In [None]:
%load_ext tensorboard
%tensorboard --logdir=data/eval_logs/fairness_logs

### What-If Tool

Despite its odd name, this tool does exactly what the name suggests: it shows how the model affects individual data points. Beyond what TFMA offers, it provides extra visualizations and lets us investigate individual data points.

There are several ways to use the What-If Tool; below is one such method.
First, we need to install it as follows.

<center> 

`pip install witwidget`
</center>

Then we need to load the data as a `TFRecordDataset`.


In [7]:
eval_data = tf.data.TFRecordDataset('data/eval_inputs/data_tfrecord-00000-of-00001')
subset = eval_data.take(1000)
eval_examples = [tf.train.Example.FromString(d.numpy()) for d in subset]

Next, we need to load the model and define a prediction function that takes a list of TFExamples and returns predictions.

In [8]:

model = tf.saved_model.load(export_dir='data/tfx/Trainer/model/6/Format-Serving')
predict_fn = model.signatures['serving_default']


def predict(examples):
    test_examples = tf.constant([example.SerializeToString() for example in examples])
    preds = predict_fn(examples=test_examples)
    return preds['outputs'].numpy()

Below is the WIT configuration.

In [9]:
from witwidget.notebook.visualization import WitConfigBuilder

config_builder = WitConfigBuilder(eval_examples).set_custom_predict_fn(predict)

Below we visualize the data with our configuration.

In [12]:
from witwidget.notebook.visualization import WitWidget
WitWidget(config_builder)


The What-If Tool provides many interesting features for identifying various model behaviours, such as counterfactuals and partial dependence plots. More details about the What-If Tool can be found in the [documentation](https://pair-code.github.io/what-if-tool/index.html).

WIT can also be used for model explainability tasks. We can use features provided by WIT, such as counterfactuals and partial dependence plots (PDPs), to explain model behaviour in various situations.
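
To make the idea behind a partial dependence plot concrete, here is a minimal sketch in plain Python: one feature is forced to each value on a grid while the other features keep their observed values, and the predictions are averaged. The scoring function `toy_model`, the feature names, and the data are hypothetical stand-ins; WIT computes and draws these curves for us against the real model.

```python
def toy_model(example):
    """Hypothetical scoring function standing in for a trained model."""
    score = 0.2 + 0.05 * example["num_prior_complaints"]
    if example["timely_response"] == 0:
        score += 0.3
    return min(score, 1.0)


def partial_dependence(model, examples, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = {}
    for value in grid:
        # Copy each example, overriding only the feature under study.
        modified = [{**ex, feature: value} for ex in examples]
        curve[value] = sum(model(ex) for ex in modified) / len(modified)
    return curve


# Made-up data points with made-up feature names.
toy_data = [
    {"num_prior_complaints": 0, "timely_response": 1},
    {"num_prior_complaints": 2, "timely_response": 0},
    {"num_prior_complaints": 4, "timely_response": 1},
    {"num_prior_complaints": 1, "timely_response": 0},
]

pdp = partial_dependence(toy_model, toy_data, "num_prior_complaints", [0, 2, 4])
for value, avg_score in pdp.items():
    print(value, round(avg_score, 3))
```

Plotting the averaged score against the grid values gives the partial dependence curve for that feature.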

Beyond that, we can use techniques such as LIME and Shapley values to explain model behaviour.
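
As a rough sketch of what Shapley values measure, the example below computes them exactly for a tiny hypothetical model by averaging each feature's marginal contribution over every ordering of the features. The names `shapley_values` and `toy_score`, the baseline of zeros, and the numbers are assumptions made for illustration. The exact computation is only feasible for a handful of features, which is why practical libraries rely on approximations.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings.

    Features that have not been "added" yet keep their baseline value.
    """
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous_output = model(current)
        for f in order:
            current[f] = instance[f]
            output = model(current)
            totals[f] += output - previous_output
            previous_output = output
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_score(x):
    """Hypothetical model with an interaction between features a and c."""
    return 2.0 * x["a"] + 1.0 * x["b"] + 0.5 * x["a"] * x["c"]


instance = {"a": 1.0, "b": 2.0, "c": 4.0}
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}
values = shapley_values(toy_score, instance, baseline)
print(values)
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved.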

### Analysis and Validation in TFX

All the techniques above help us identify good models. In production environments, however, we need to automate this process. For that, TFX provides the Resolver, Evaluator, and Pusher components. These components check model performance on an evaluation dataset and send the model to the serving phase if it performs better than the baseline.

> TFX uses a concept called `blessing` to describe the gating process for deciding whether or not to deploy a model.

### Resolver

If we need to compare a new model against a previous version, the Resolver can be used. It checks the TFX metadata store and sends the latest best-performing model to the Evaluator as a baseline, so that we can compare it with the new model.

### Evaluator

The Evaluator uses the TFMA library to evaluate model predictions on a validation set. It takes the data from the ExampleGen component, the trained model from the Trainer component, and an `EvalConfig` for TFMA as inputs.

### Pusher

The Pusher takes a saved model and a file path for the model's serving location as inputs. Based on its configuration, it checks whether the model has been blessed by the Evaluator (that is, whether it performs better than the baseline). If so, the new model is pushed to the serving location.
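
The gating logic described in this section can be sketched in a few lines of plain Python. This is only an illustration of the idea under stated assumptions: `is_blessed`, `push_if_blessed`, the metric name, and the threshold values are hypothetical, and the real TFX components read their thresholds from the `EvalConfig` and track blessings in the metadata store.

```python
def is_blessed(candidate_metrics, baseline_metrics, metric="binary_accuracy",
               min_value=0.5, min_improvement=0.0):
    """Return True if the candidate clears an absolute bar and beats the baseline."""
    candidate = candidate_metrics[metric]
    if candidate < min_value:
        return False
    if baseline_metrics is None:
        # No earlier model to compare against: the absolute bar is enough.
        return True
    return candidate - baseline_metrics[metric] > min_improvement


def push_if_blessed(candidate_metrics, baseline_metrics, push):
    """Call `push()` only when the candidate model is blessed."""
    blessed = is_blessed(candidate_metrics, baseline_metrics)
    if blessed:
        push()
    return blessed


# Made-up metric values for a new model and the current baseline.
pushed_models = []
result = push_if_blessed(
    candidate_metrics={"binary_accuracy": 0.82},
    baseline_metrics={"binary_accuracy": 0.80},
    push=lambda: pushed_models.append("candidate"),
)
print("blessed:", result, "pushed:", pushed_models)
```

Separating the decision (`is_blessed`) from the action (`push_if_blessed`) mirrors the split between the Evaluator, which decides, and the Pusher, which deploys.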