# Hands-On Tabular Workshop: *Detecting Fraud in Transaction Data*

This workshop is focused on the creation, deployment and monitoring of machine learning models for performing fraud detection. 

In this notebook you will explore the data and train the machine learning models themselves: an XGBoost classifier and a Scikit-learn Random Forest classifier. You will then begin to add the advanced monitoring and explainability which Seldon Alibi is famed for. 

The EDA and model training within this notebook are heavily inspired by the fantastic work of Arjun Joshua; you can find the original [here](https://www.kaggle.com/arjunjoshua/predicting-fraud-in-financial-payment-services/notebook). 

-----------------------------------
First, install and import the packages which you will use throughout the exploration and training process. 

In [None]:
!pip install alibi==0.7.0
!pip install alibi_detect==0.8.1
!pip install dill
!pip install seaborn
!pip install seldon_deploy_sdk
!pip install xgboost==1.5.2

In [None]:
import pandas as pd
import numpy as np
import joblib

%matplotlib inline
import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import average_precision_score
from sklearn.utils import class_weight
from sklearn.ensemble import RandomForestClassifier

from xgboost.sklearn import XGBClassifier
from xgboost import plot_importance, to_graphviz

from seldon_deploy_sdk import Configuration, ApiClient, SeldonDeploymentsApi, ModelMetadataServiceApi, DriftDetectorApi, BatchJobsApi, BatchJobDefinition
from seldon_deploy_sdk.auth import OIDCAuthenticator

from alibi.explainers import AnchorTabular
from alibi_detect.cd import MMDDrift
from alibi_detect.utils.saving import save_detector, load_detector

import dill

# For repeatability
randomState = 5
np.random.seed(randomState)

You then download the dataset which you will be using for the workshop, and load it into a Pandas DataFrame. The dataset is stored in the public Google Cloud Storage bucket `kelly-seldon`.

In [None]:
!gsutil cp gs://kelly-seldon/fraud-detection/transaction-data.csv data/transaction-data.csv

In [None]:
df = pd.read_csv('data/transaction-data.csv')
df = df.rename(columns={'oldbalanceOrg':'oldBalanceOrig', 'newbalanceOrig':'newBalanceOrig', \
                        'oldbalanceDest':'oldBalanceDest', 'newbalanceDest':'newBalanceDest'})
df.head()

It is worth taking a second to understand the features (columns of the table) within the dataset:
* `step`: This is a time series data set, i.e. money transfers occur over time. 1 step represents 1 hour, with a total of 744 steps equivalent to 31 days. 
* `type`: The type of transaction: CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER.
* `amount`: Amount of the transaction in local currency.
* `nameOrig`: Customer name who started the transaction.
* `oldBalanceOrig`: Initial balance before the transaction.
* `newBalanceOrig`: New balance after the transaction.
* `nameDest`: Customer name who is the recipient of the transaction.
* `oldBalanceDest`: Initial balance of the recipient before the transaction.
* `newBalanceDest`: New balance of the recipient after the transaction.
* `isFraud`: This is the transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent behavior of the agents aims to profit by taking control of customers accounts and trying to empty the funds by transferring to another account and then cashing out of the system.
* `isFlaggedFraud`: The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.

It is worth noting that this is a synthetically generated dataset and so does not represent real world transactions, but is based upon the behaviour of a supplied real world dataset. You can read more about the data used [here](https://www.kaggle.com/ntnu-testimon/paysim1).

## Data Preparation

There are a number of data preparation steps which need to be performed prior to visualisation and model training. The first is to remove all transaction types apart from TRANSFER and CASH_OUT. These are the only transaction types in which fraud occurs, so the other types can be dropped. 

In [None]:
X = df.loc[(df.type == 'TRANSFER') | (df.type == 'CASH_OUT')]

Next, you can remove the feature columns which have no predictive power. These are the account name fields, as well as the `isFlaggedFraud` field, which has no clear relation to the other features. 

In [None]:
X = X.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

You can then encode the categorical transaction type feature as a binary value. Transactions of type TRANSFER will be 0, while CASH_OUT transactions will be 1. 

In [None]:
X.loc[X.type == 'TRANSFER', 'type'] = 0
X.loc[X.type == 'CASH_OUT', 'type'] = 1
X.type = X.type.astype(int) # convert dtype('O') to dtype(int)

You now create the labels. This will simply be the `isFraud` field, and will be what your machine learning model attempts to predict based on the remaining transaction features. 

In [None]:
Y = X['isFraud']
del X['isFraud']

#### Working with Zero Balances

The data has several transactions with zero balances in the destination account both before and after a non-zero amount is transacted. The fraction of such transactions, where zero likely denotes a missing value, is much larger in fraudulent (50%) compared to genuine transactions (0.06%).


Since a zero destination balance is a strong indicator of fraud, we replace the values of `oldBalanceDest` and `newBalanceDest` with -1 where both are originally 0 but a non-zero amount was transacted. 

In [None]:
X.loc[(X.oldBalanceDest == 0) & (X.newBalanceDest == 0) & (X.amount != 0), \
      ['oldBalanceDest', 'newBalanceDest']] = -1
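
As a sanity check, the per-class fractions quoted above for destination accounts could be computed along the following lines, before the zeros are replaced. This is a minimal sketch in plain Python on a handful of made-up rows; the tuple layout and the helper name `zero_dest_fraction` are assumptions for illustration, not part of the workshop code.

```python
# Toy rows (hypothetical values): (oldBalanceDest, newBalanceDest, amount, isFraud)
rows = [
    (0.0, 0.0, 181.0, 1),
    (0.0, 0.0, 2500.0, 1),
    (21182.0, 0.0, 181.0, 1),
    (5000.0, 5200.0, 200.0, 0),
    (0.0, 0.0, 75.0, 0),
    (1000.0, 4000.0, 3000.0, 0),
    (300.0, 900.0, 600.0, 0),
]


def zero_dest_fraction(rows, label):
    """Fraction of transactions with the given label whose destination balance
    is zero both before and after a non-zero amount is transacted."""
    in_class = [r for r in rows if r[3] == label]
    suspicious = [r for r in in_class if r[0] == 0 and r[1] == 0 and r[2] != 0]
    return len(suspicious) / len(in_class)


fraud_fraction = zero_dest_fraction(rows, label=1)
genuine_fraction = zero_dest_fraction(rows, label=0)
print(f"fraudulent: {fraud_fraction:.2f}, genuine: {genuine_fraction:.2f}")
```

On the real data the same condition is the boolean mask used in the cell above, evaluated separately for `Y == 1` and `Y == 0`.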

The data also has several transactions with zero balances in the originating account both before and after a non-zero amount is transacted. In this case, the fraction of such transactions is much smaller in fraudulent (0.3%) compared to genuine transactions (47%). Following similar reasoning as above, instead of imputing a statistic such as the mean we replace the value of 0 with -1.

In [None]:
X.loc[(X.oldBalanceOrig == 0) & (X.newBalanceOrig == 0) & (X.amount != 0), \
      ['oldBalanceOrig', 'newBalanceOrig']] = -1

Motivated by the possibility of zero-balances serving to differentiate between fraudulent and genuine transactions, create 2 new features (columns) recording errors in the originating and destination accounts for each transaction. 

In [None]:
X['errorBalanceOrig'] = X.newBalanceOrig + X.amount - X.oldBalanceOrig
X['errorBalanceDest'] = X.oldBalanceDest + X.amount - X.newBalanceDest
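
To make the two formulas concrete, the sketch below applies them to the two sample rows used in the request payloads later in this notebook. The helper name `balance_errors` is an assumption for illustration; the arithmetic is the same as in the cell above.

```python
import math


def balance_errors(amount, old_orig, new_orig, old_dest, new_dest):
    """Return (errorBalanceOrig, errorBalanceDest) using the formulas above."""
    error_orig = new_orig + amount - old_orig
    error_dest = old_dest + amount - new_dest
    return error_orig, error_dest


# Normal example: the originating balances carry the -1 sentinel.
normal = balance_errors(63243.44, -1.0, -1.0, 1853683.32, 1916926.76)
# Fraudulent example: the originating account is emptied by the transaction.
fraud = balance_errors(2433009.28, 2433009.28, 0.0, 0.0, 2433009.28)

print(normal)
print(fraud)
# Floating-point rounding can leave a tiny residue, so compare with a tolerance.
print(math.isclose(normal[0], 63243.44, abs_tol=1e-6))
```

Both rows agree with the `errorBalanceOrig` and `errorBalanceDest` values in the sample payloads: the sentinel cancels out, so the originating error of the normal row equals the transacted amount.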

## Data Visualisation

Let's explore the data by generating a series of plots. 

First create a function which allows you to generate strip plots readily.

In [None]:
limit = len(X)

def plotStrip(x, y, hue, figsize = (14, 9)):
    
    fig = plt.figure(figsize = figsize)
    colours = plt.cm.tab10(np.linspace(0, 1, 9))
    with sns.axes_style('ticks'):
        ax = sns.stripplot(x = x, y = y, \
             hue = hue, jitter = 0.4, marker = '.', \
             size = 4, palette = colours)
        ax.set_xlabel('')
        ax.set_xticklabels(['genuine', 'fraudulent'], size = 14)
        for axis in ['top','bottom','left','right']:
            ax.spines[axis].set_linewidth(2)

        handles, labels = ax.get_legend_handles_labels()
        plt.legend(handles, ['Transfer', 'Cash out'], bbox_to_anchor=(1, 1), \
               loc=2, borderaxespad=0, fontsize = 14);
    return ax

Let's compare how genuine and fraudulent transactions are distributed over time. 

In [None]:
ax = plotStrip(Y[:limit], X.step[:limit], X.type[:limit])
ax.set_ylabel('time [hour]', size = 14)

You can see that genuine transactions follow a more regular pattern, occurring at intervals separated by periods in which no genuine transactions take place. These periods could represent weekends or holidays when businesses are closed. Meanwhile, the fraudulent transactions are far more evenly distributed, with no discernible pattern. 

Furthermore, it's clear that the majority of genuine transactions are of type CASH_OUT, whereas fraudulent transactions feature TRANSFER types far more prominently. 

-----

Next, compare the transfer amount distributions for genuine and fraudulent transactions. 

In [None]:
limit = len(X)
ax = plotStrip(Y[:limit], X.amount[:limit], X.type[:limit], figsize = (14, 9))
ax.set_ylabel('amount', size = 16)

There is no clear pattern between genuine and fraudulent transactions by simply considering the amount. However, it appears there is a ceiling on the limit of a fraudulent transaction (10,000,000).

-----

Finally, you visualise the feature you created earlier, `errorBalanceDest`, which is calculated by taking the previous balance of the destination account, adding the amount which was transferred, and subtracting the new balance of the account. 

Remember that many of the fraudulent transactions we observed had a destination balance of 0 both before and after a non-zero sum of money was transferred. Therefore, the `errorBalanceDest` of these transactions will be a positive number equal to the value of the transfer.

In [None]:
limit = len(X)
ax = plotStrip(Y[:limit], X.errorBalanceDest[:limit], X.type[:limit], \
              figsize = (14, 9))
ax.set_ylabel('errorBalanceDest', size = 16)

From this figure we can see a clear distinction between genuine and fraudulent transactions: a positive `errorBalanceDest` is recorded far more often for fraudulent transactions than for genuine ones. 

# Model Training

Next you will train your predictor, to determine in an automated fashion whether a new transaction is fraudulent or not. 

You will be using an XGBoost classifier as it is naturally suited to handling such an imbalanced dataset, whereby only 0.3% of the transactions are fraudulent. 

In [None]:
Xfraud = X.loc[Y == 1]
XnonFraud = X.loc[Y == 0]

print('skew = {}'.format( len(Xfraud) / float(len(X)) ))

You split your data into training and testing sets. 

In [None]:
trainX, testX, trainY, testY = train_test_split(X, Y, test_size = 0.2, random_state = randomState)

You also weight the positive class (fraudulent) more than the negative class (genuine) to help account for the overrepresentation of genuine transactions in the dataset. 

In [None]:
weights = (Y == 0).sum() / (1.0 * (Y == 1).sum())
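
The weight computed above is simply the ratio of genuine to fraudulent examples. As a worked example, the sketch below reproduces the arithmetic on toy labels with the 0.3% fraud rate mentioned earlier; the helper name `scale_pos_weight` is borrowed from the XGBoost parameter it feeds, and the label list is made up.

```python
def scale_pos_weight(labels):
    """Ratio of negative (genuine) to positive (fraudulent) examples."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    return negatives / positives


# Toy labels with a 0.3% fraud rate, mirroring the skew mentioned above.
labels = [1] * 3 + [0] * 997
weight = scale_pos_weight(labels)
print(round(weight, 1))  # 332.3
```

With this weighting each fraudulent example counts roughly as much as 332 genuine ones when the loss is computed.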

You then train and score an XGBoost classifier. 

In [None]:
# Long computation in this cell (~3 minutes)

clf = XGBClassifier(max_depth = 3, scale_pos_weight = weights, n_jobs = 4, use_label_encoder=False)
probabilities = clf.fit(trainX, trainY).predict_proba(testX)
print('AUPRC = {}'.format(average_precision_score(testY, probabilities[:, 1])))
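
For intuition about the metric, average precision sums the precision at each score threshold, weighted by the increase in recall at that threshold. The following is a minimal from-scratch sketch on four made-up predictions; it is intended to illustrate the definition rather than replace `average_precision_score`.

```python
from itertools import groupby


def average_precision(y_true, scores):
    """AP = sum over thresholds of (R_n - R_{n-1}) * P_n, for binary labels."""
    total_pos = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    # Tied scores share one threshold, so they are processed as a group.
    for _, group in groupby(ranked, key=lambda p: p[0]):
        for _, label in group:
            if label == 1:
                tp += 1
            else:
                fp += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(round(average_precision(y_true, scores), 4))  # 0.8333
```

Unlike accuracy, this score is not inflated by the large number of genuine transactions, which is why it suits a heavily imbalanced problem.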

A very impressive 0.99 AUPRC, which means your classifier is accurately distinguishing between fraudulent and genuine transactions. 

You can visualise the features which are most important to your new XGBoost classifier as shown below. 

In [None]:
fig = plt.figure(figsize = (14, 9))
ax = fig.add_subplot(111)

colours = plt.cm.Set1(np.linspace(0, 1, 9))

ax = plot_importance(clf, height = 1, color = colours, grid = False, \
                     show_values = False, importance_type = 'cover', ax = ax);
for axis in ['top','bottom','left','right']:
    ax.spines[axis].set_linewidth(2)

ax.set_xlabel('importance score', size = 16);
ax.set_ylabel('features', size = 16);
ax.set_yticklabels(ax.get_yticklabels(), size = 12);
ax.set_title("Plotting the model's most important features", size = 16);

You can now save your model, and upload it to an artefact store (in this case a Google storage bucket) ready for deployment.

You will be making use of the pre-packaged XGBoost model server, and therefore Seldon expects your classifier to be saved as `model.bst`. 

In [None]:
clf.save_model('model.bst')

You will now upload your saved model file to a Google storage bucket. 

### ⚠️ IMPORTANT ⚠️
Make sure you fill in the YOUR_NAME variable to ensure you're not overwriting existing artefacts.

In [None]:
YOUR_NAME = ""

!gsutil cp model.bst gs://kelly-seldon/fraud-detection/models/{YOUR_NAME}/model.bst

## Model Deployment

We can now deploy our models to the dedicated Seldon Deploy cluster that we have configured for this workshop. To do this, we will interact with the Seldon Deploy SDK.

Firstly, we need to set up the configuration and authentication required to access the cluster.

⚠️ Make sure the following variables are filled in in the cell below:

- `SD_DOM`: the domain name of the cluster you are using.
- `CLIENT_SECRET`: either `sd-api-secret` or an alternative secret provided during the workshop.

In [None]:
SD_DOM = ""
CLIENT_SECRET = ""

config = Configuration()
config.host = f"https://{SD_DOM}/seldon-deploy/api/v1alpha1"
config.oidc_client_id = "sd-api"
config.oidc_server = f"https://{SD_DOM}/auth/realms/deploy-realm"
config.oidc_client_secret = f"{CLIENT_SECRET}"
config.auth_method = "auth_code"

def auth():
    auth = OIDCAuthenticator(config)
    config.id_token = auth.authenticate()
    api_client = ApiClient(configuration=config, authenticator=auth)
    return api_client

Now that you have configured the cluster domain and set up your authentication function, you can describe the deployment you would like to create. 

For the `MODEL_LOCATION` you do not need to specify the path all the way up to `model.bst` e.g. if you saved your classifier under `gs://kelly-seldon/fraud-detection/models/kelly-spry/model.bst` your `MODEL_LOCATION` should be `gs://kelly-seldon/fraud-detection/models/kelly-spry` and Seldon will automatically pick up the classifier artifact stored there. 
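
If you only have the full path to the saved artefact, the folder can be derived by dropping the final path component. This is a small sketch; the helper name `model_location` is an assumption for illustration.

```python
def model_location(artifact_uri):
    """Drop the final path component (the file name) from a storage URI."""
    return artifact_uri.rstrip("/").rsplit("/", 1)[0]


uri = "gs://kelly-seldon/fraud-detection/models/kelly-spry/model.bst"
print(model_location(uri))
```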

In [None]:
DEPLOYMENT_NAME = f"{YOUR_NAME}-fraud"
NAMESPACE = "seldon-gitops"
MODEL_LOCATION = f"gs://kelly-seldon/fraud-detection/models/{YOUR_NAME}"

PREPACKAGED_SERVER = "XGBOOST_SERVER"

CPU_REQUESTS = "1"
MEMORY_REQUESTS = "1Gi"

CPU_LIMITS = "1"
MEMORY_LIMITS = "1Gi"

mldeployment = {
    "kind": "SeldonDeployment",
    "metadata": {
        "name": DEPLOYMENT_NAME,
        "namespace": NAMESPACE,
        "labels": {
            "fluentd": "true"
        }
    },
    "apiVersion": "machinelearning.seldon.io/v1alpha2",
    "spec": {
        "name": DEPLOYMENT_NAME,
        "annotations": {
            "seldon.io/engine-seldon-log-messages-externally": "true"
        },
        "protocol": "seldon",
        "transport": "rest",
        "predictors": [
            {
                "componentSpecs": [
                    {
                        "spec": {
                            "containers": [
                                {
                                    "name": f"{DEPLOYMENT_NAME}-container",
                                    "resources": {
                                        "requests": {
                                            "cpu": CPU_REQUESTS,
                                            "memory": MEMORY_REQUESTS
                                        },
                                        "limits": {
                                            "cpu": CPU_LIMITS,
                                            "memory": MEMORY_LIMITS
                                        }
                                    }
                                }
                            ]
                        }
                    }
                ],
                "name": "default",
                "replicas": 1,
                "traffic": 100,
                "graph": {
                    "implementation": PREPACKAGED_SERVER,
                    "modelUri": MODEL_LOCATION,
                    "name": f"{DEPLOYMENT_NAME}-container",
                    "endpoint": {
                        "type": "REST"
                    },
                    "parameters": [],
                    "children": [],
                    "logger": {
                        "mode": "all"
                    }
                }
            }
        ]
    },
    "status": {}
}

You can now invoke the `SeldonDeploymentsApi` and create a new Seldon Deployment. 

Time for you to get your hands dirty. You will use the Seldon Deploy SDK to create a new Seldon deployment. You can find the reference documentation [here](https://github.com/SeldonIO/seldon-deploy-sdk/blob/master/python/README.md). 

In [None]:
deployment_api = SeldonDeploymentsApi(auth())
deployment_api.create_seldon_deployment(namespace=NAMESPACE, mldeployment=mldeployment)

You can access the Seldon Deploy cluster and view your freshly created deployment here. Ensure you replace the `XXXXX` with the IP for the cluster you're using:

* URL: http://XXXXX/seldon-deploy/

## Adding a Prediction Schema & Metadata
Seldon Deploy has a model catalog where all deployed models are automatically registered. The model catalog can store custom metadata as well as prediction schemas for your models. 

Metadata promotes lineage from across different machine learning systems, aids knowledge transfer between teams, and allows for faster deployment. Meanwhile, prediction schemas allow Seldon Deploy to automatically profile tabular data into histograms, allowing for filtering on features to explore trends. 

In order to effectively construct a prediction schema Seldon has the [ML Prediction Schema](https://github.com/SeldonIO/ml-prediction-schema) project. The first step is to determine your datatypes. 

In [None]:
trainX.dtypes

From this you can construct the prediction schema object below, which maps the data types to the requests and responses which the model returns. 

In [None]:
prediction_schema = {
    "requests": [
        {
            "name": "step",
            "type": "REAL",
            "data_type": "INT"
        },
        {
            "name": "type",
            "type": "REAL",
            "data_type": "INT"
        },
        {
            "name": "amount",
            "type": "REAL",
            "data_type": "FLOAT"
        },
        {
            "name": "oldBalanceOrig",
            "type": "REAL",
            "data_type": "FLOAT"
        },
        {
            "name": "newBalanceOrig",
            "type": "REAL",
            "data_type": "FLOAT"
        },
        {
            "name": "oldBalanceDest",
            "type": "REAL",
            "data_type": "FLOAT"
        },
        {
            "name": "newBalanceDest",
            "type": "REAL",
            "data_type": "FLOAT"
        },
        {
            "name": "errorBalanceOrig",
            "type": "REAL",
            "data_type": "FLOAT"
        },
        {
            "name": "errorBalanceDest",
            "type": "REAL",
            "data_type": "FLOAT"
        }
    ],
    "responses": [
        {
            "name": "Likelihood of Fraud",
            "type": "REAL",
            "data_type": "FLOAT"
        }
    ]
}
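
The `requests` part of the schema follows mechanically from the column data types, so it could also be generated rather than written by hand. The sketch below assumes a mapping from pandas dtype names to the `INT` and `FLOAT` data types used above; the mapping and the helper name `build_requests_schema` are illustrative assumptions, and only three columns are shown.

```python
# Assumed mapping from pandas dtype names to schema data types.
DTYPE_TO_SCHEMA = {"int64": "INT", "float64": "FLOAT"}


def build_requests_schema(dtypes):
    """Build the "requests" list from a {column name: dtype name} mapping."""
    return [
        {"name": name, "type": "REAL", "data_type": DTYPE_TO_SCHEMA[dtype]}
        for name, dtype in dtypes.items()
    ]


# With pandas this mapping could come from trainX.dtypes.astype(str).to_dict().
dtypes = {"step": "int64", "type": "int64", "amount": "float64"}
for entry in build_requests_schema(dtypes):
    print(entry)
```

A dtype that is missing from the mapping raises a `KeyError`, which is a useful prompt to decide how that column should be described.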

You then add the prediction schema to the wider model catalog metadata. This includes information such as the model storage location, the name, who authored the model etc. The metadata tags and metrics which can be associated with a model are freeform and can therefore be determined based upon the use case which is being developed. 

In [None]:
model_catalog_metadata = {
      "URI": MODEL_LOCATION,
      "name": f"{DEPLOYMENT_NAME}-model",
      "version": "v1.0",
      "artifactType": "XGBOOST",
      "taskType": "fraud classification",
      "tags": {
        "auto_created": "true",
        "author": f"{YOUR_NAME}"
      },
      "metrics": {},
      "project": "default",
      "prediction_schema": prediction_schema
    }

model_catalog_metadata

Next, using the metadata API you can add this to the model which you have just created in Seldon.

In [None]:
metadata_api = ModelMetadataServiceApi(auth())
metadata_api.model_metadata_service_update_model_metadata(model_catalog_metadata)

You can then list the metadata via the API, or view it in the UI, to confirm that it has been successfully added to the model. 

In [None]:
metadata_response = metadata_api.model_metadata_service_list_model_metadata(uri=MODEL_LOCATION)
metadata_response

You can now send requests to your model. As an example of a normal request:
```
{
    "data": {
        "names": ["step", "type", "amount", "oldBalanceOrig", "newBalanceOrig",
                  "oldBalanceDest", "newBalanceDest", "errorBalanceOrig", "errorBalanceDest"],
        "ndarray": [
            [205, 1, 63243.44, -1.00, -1.00, 1853683.32, 1916926.76, 63243.44, 0]
        ]
    }
}
```
And a fraudulent transaction too:

```
{
    "data": {
        "names": ["step", "type", "amount", "oldBalanceOrig", "newBalanceOrig",
                  "oldBalanceDest", "newBalanceDest", "errorBalanceOrig", "errorBalanceDest"],
        "ndarray": [
            [629, 1, 2433009.28, 2433009.28, 0.00, 0.00, 2433009.28, 0.00, 0.00]
        ]
    }
}
```
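
Payloads of this shape can be built programmatically from feature rows. The following is a minimal sketch using only the standard library; the helper name `to_payload` is an assumption, and the row is the fraudulent example shown above.

```python
import json

FEATURE_NAMES = [
    "step", "type", "amount", "oldBalanceOrig", "newBalanceOrig",
    "oldBalanceDest", "newBalanceDest", "errorBalanceOrig", "errorBalanceDest",
]


def to_payload(rows, names=FEATURE_NAMES):
    """Wrap feature rows in the {"data": {"names", "ndarray"}} structure."""
    for row in rows:
        if len(row) != len(names):
            raise ValueError(f"expected {len(names)} values, got {len(row)}")
    return {"data": {"names": list(names), "ndarray": [list(r) for r in rows]}}


row = [629, 1, 2433009.28, 2433009.28, 0.00, 0.00, 2433009.28, 0.00, 0.00]
payload = to_payload([row])
print(json.dumps(payload, indent=4))
```

The length check guards against sending a row whose values no longer line up with the feature names the model was trained on.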

# Drift Detection

In this example you will use Alibi Detect to train a custom drift detector which can flag when the underlying input data distribution has shifted. This can inform decisions about re-training or prompt deeper investigation into data/model behaviours.

Seldon Deploy also allows you to setup alerts when drift is detected.

In this example you will use the Maximum Mean Discrepancy method. Covariate or input drift detection relies on creating a distance measure between two distributions; a reference distribution and a new distribution. The MMD drift detector is no different; the mean embeddings of your features are used to generate the distributions and then the distance between them is measured. The training data is used to calculate the reference distribution, while the new distribution comes from your inference data.

More technically, a reproducing kernel Hilbert space is used to generate the mean embeddings, by mapping the highly complex feature space within which most machine learning models operate to a linear Euclidean space. A radial basis function kernel is then used to measure the distance between the two embeddings, and the significance of the drift is calculated as a p-value using permutation/resampling tests. More details can be found in the Alibi Detect documentation.
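
To make the idea concrete, here is a dependency-free sketch of a squared MMD estimate with a radial basis function kernel, followed by a permutation test for the p-value. It is a simplified illustration under stated assumptions (a fixed kernel bandwidth of 1, the biased estimator, 200 permutations, synthetic two-dimensional data); Alibi Detect's implementation differs in its estimator, bandwidth selection and backend.

```python
import math
import random


def rbf(a, b, sigma=1.0):
    """Radial basis function kernel between two points."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def mmd2(kernel, idx_x, idx_y):
    """Biased estimate of squared MMD from a precomputed kernel matrix."""
    def mean_k(rows, cols):
        return sum(kernel[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    return mean_k(idx_x, idx_x) + mean_k(idx_y, idx_y) - 2 * mean_k(idx_x, idx_y)


def mmd_permutation_test(x, y, n_permutations=200, seed=0):
    """Return (observed squared MMD, permutation p-value)."""
    rng = random.Random(seed)
    pooled = x + y
    kernel = [[rbf(a, b) for b in pooled] for a in pooled]
    indices = list(range(len(pooled)))
    observed = mmd2(kernel, indices[:len(x)], indices[len(x):])
    exceed = 0
    for _ in range(n_permutations):
        # Randomly reassign points to the two samples and recompute the distance.
        rng.shuffle(indices)
        if mmd2(kernel, indices[:len(x)], indices[len(x):]) >= observed:
            exceed += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_permutations + 1)


rng = random.Random(1)
reference = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
shifted = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(30)]
distance, p_value = mmd_permutation_test(reference, shifted)
print(f"MMD^2 = {distance:.3f}, p-value = {p_value:.3f}")
```

Drift is flagged when the p-value falls below the chosen threshold, which corresponds to the `p_val` argument passed to the detector in the next cell.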

In this case you will train your drift detector on a sample of 5000 instances from the training set. This has been picked for convenience and speed of training in the workshop; in a production setting you would likely want to use the entirety of the training set, or a representative sample of it.

In [None]:
cd = MMDDrift(trainX.iloc[:5000].to_numpy(), backend='tensorflow', p_val=.05)

Once you have successfully trained your drift detector you can send it batches of data from the test set and test whether or not drift has occurred. 

In [None]:
preds = cd.predict(testX.iloc[100:200].to_numpy(), return_p_val=True, return_distance=True)
preds

You can then save the drift detector, and upload it to the GS bucket. 

In [None]:
save_detector(cd, "fraud-drift-detector")

In [None]:
!gsutil cp -r fraud-drift-detector gs://kelly-seldon/fraud-detection/models/{YOUR_NAME}/fraud-drift-detector

Then, we can use the Seldon Deploy SDK to deploy our newly configured drift detector. We can define the config for the drift detector and then call the `DriftDetectorApi` to create the drift detector seldon deployment.

In [None]:
DD_URI = f"gs://kelly-seldon/fraud-detection/models/{YOUR_NAME}/fraud-drift-detector"
DD_NAME = "ee-dd"

dd_config = {'config': {'basic': 
                        {'drift_batch_size': '5',
                         'storage_uri': DD_URI},
                        'deployment': {'protocol': 'seldon.http'}
                        },
             'deployment_name': DEPLOYMENT_NAME,
             'detector_type': 'drift',
             'name': DD_NAME,
             'namespace': NAMESPACE
            }

In [None]:
dd_api = DriftDetectorApi(auth())
dd_api.create_drift_detector_seldon_deployment(name=DEPLOYMENT_NAME, namespace=NAMESPACE, detector_data=dd_config)

We could test our drift detector by running a batch job, but for ease here, we will just run the same request through our model around 20 times. We will then be able to view drift data on the Seldon Deploy UI.

```
{
    "data": {
        "names": ["step", "type", "amount", "oldBalanceOrig", "newBalanceOrig",
                  "oldBalanceDest", "newBalanceDest", "errorBalanceOrig", "errorBalanceDest"],
        "ndarray": [
            [629, 1, 2433009.28, 2433009.28, 0.00, 0.00, 2433009.28, 0.00, 0.00]
        ]
    }
}
```

# Explainer
Next, you will train an explainer to glean deeper insights into the decisions being made by your model. 

You will make use of the Anchors algorithm, which has a [production grade implementation available](https://docs.seldon.io/projects/alibi/en/stable/methods/Anchors.html) using the Seldon Alibi Explain library. 

The first step will be to write a simple prediction function which the explainer can call in order to query your XGBoost model. 

In [None]:
def predict_fn(x):
    return clf.predict_proba(x)

You then initialise your Anchor explainer, using the AnchorTabular flavour provided by Alibi due to your data modality. 

The AnchorTabular class expects the prediction function which you defined above, as well as a list of the feature names. You can find a sample notebook in the Alibi docs [here](https://docs.seldon.io/projects/alibi/en/stable/examples/anchor_tabular_adult.html). 

In [None]:
columns = list(trainX.columns)
explainer = AnchorTabular(predict_fn, columns)

You now need to fit your explainer object around some data so that it can learn to generate explanations based upon said data. 

As the training set is highly imbalanced (only a tiny fraction of the datapoints are fraudulent transactions) you create a new balanced set which is 50/50 normal/fraud transactions. This helps you to generate descriptive and useful explanations for both fraudulent and normal transactions.*

In the code block below you generate the new balanced set, and convert it to a numpy array as this is the type which Alibi expects. 

\*It is possible to generate a working explainer based upon the original dataset, but the anchors it identifies are not specific when considering normal transactions. The empty anchor is only ever returned due to the skew in the dataset. 

In [None]:
balanced_set = pd.concat([Xfraud, XnonFraud.iloc[:len(Xfraud)]]).to_numpy()

You then fit your explainer to your newly balanced data set. 

In [None]:
explainer.fit(balanced_set, disc_perc=(25, 50, 75)) 
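
The `disc_perc` argument controls how numerical features are split into bins at the given percentiles, which is what allows anchors to be expressed as threshold rules on a feature. The sketch below illustrates the idea of quartile binning with the standard library only. It is an assumption-laden illustration on toy values, not Alibi's own discretiser, and the helper names `fit_bins` and `bin_index` are hypothetical.

```python
import bisect
import statistics


def fit_bins(values):
    """Return the 25th, 50th and 75th percentile cut points of a feature."""
    return statistics.quantiles(values, n=4, method="inclusive")


def bin_index(value, cuts):
    """Index of the bin containing value: 0 is (-inf, q25], 3 is (q75, inf)."""
    return bisect.bisect_left(cuts, value)


amounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # toy feature values
cuts = fit_bins(amounts)
print(cuts)                 # [3.0, 5.0, 7.0]
print(bin_index(3, cuts))   # 0, because 3 <= q25
print(bin_index(6, cuts))   # 2, because q50 < 6 <= q75
```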

You can now test your explainer on the test set, and view the explanations it begins to generate. Feel free to change the value of `idx` to see how it impacts the explanation generated. 

In [None]:
idx = 10

testX_array = testX.to_numpy()

class_names = ["Normal", "Fraudulent"]
print('Prediction: ', class_names[explainer.predictor(testX_array[idx].reshape(1, -1))[0]])

explanation = explainer.explain(testX_array[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation.anchor)))
print('Precision: %.2f' % explanation.precision)
print('Coverage: %.2f' % explanation.coverage)
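
To help read these two numbers: coverage is the share of instances to which the anchor applies, and precision is the share of those instances that receive the same prediction as the instance being explained. The sketch below computes both empirically for a hypothetical rule on toy data. Note that this is a simplification, since Alibi estimates precision from sampled perturbations rather than from a fixed set of rows.

```python
def anchor_metrics(rows, predictions, rule, target):
    """Empirical precision and coverage of a rule over rows with predictions.

    coverage:  share of rows the rule applies to.
    precision: share of covered rows predicted as `target`.
    """
    covered = [pred for row, pred in zip(rows, predictions) if rule(row)]
    coverage = len(covered) / len(rows)
    precision = sum(1 for pred in covered if pred == target) / len(covered)
    return precision, coverage


# Toy data: each row is (type, errorBalanceDest); predictions are hypothetical.
rows = [(0, 5000.0), (0, 12000.0), (1, 0.0), (1, 0.0), (0, 0.0), (1, 8000.0)]
predictions = [1, 1, 0, 0, 0, 1]


def rule(row):
    # Hypothetical anchor: "errorBalanceDest > 0".
    return row[1] > 0


precision, coverage = anchor_metrics(rows, predictions, rule, target=1)
print(f"Precision: {precision:.2f}")
print(f"Coverage: {coverage:.2f}")
```

The `threshold=0.95` argument used above asks for an anchor whose precision is at least 0.95.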

Explicitly testing a fraudulent transaction. 

In [None]:
print('Prediction: ', class_names[explainer.predictor(testX.loc[6272989].to_numpy().reshape(1, -1))[0]])

explanation = explainer.explain(testX.loc[6272989].to_numpy(), threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation.anchor)))
print('Precision: %.2f' % explanation.precision)
print('Coverage: %.2f' % explanation.coverage)

You now save your explainer, and upload it to the GS bucket. You can use the explainer's built-in save method to do this easily and reproducibly. 

In [None]:
explainer.save("fraud-explainer")

In [None]:
!gsutil cp -r fraud-explainer gs://kelly-seldon/fraud-detection/models/{YOUR_NAME}/fraud-explainer

## Deploying the Explainer

You can now deploy your explainer alongside your model. First you define the explainer configuration. 

In [None]:
EXPLAINER_TYPE = "AnchorTabular"
EXPLAINER_URI = f"gs://kelly-seldon/fraud-detection/models/{YOUR_NAME}/fraud-explainer"

explainer_spec = {
                    "type": EXPLAINER_TYPE,
                    "modelUri": EXPLAINER_URI,
                    "containerSpec": {
                        "name": "",
                        "resources": {}
                    }
                }

You can then insert this additional configuration into your original `mldeployment` specification which you defined earlier. 

In [None]:
mldeployment['spec']['predictors'][0]['explainer'] = explainer_spec
mldeployment

You then deploy the explainer to the Seldon Deploy cluster!

In [None]:
deployment_api = SeldonDeploymentsApi(auth())
deployment_api.create_seldon_deployment(namespace=NAMESPACE, mldeployment=mldeployment)

You can now use an example request to generate both a prediction and then a subsequent explanation. 


A fraudulent transaction:

```
{
    "data": {
        "names": ["step", "type", "amount", "oldBalanceOrig", "newBalanceOrig",
                  "oldBalanceDest", "newBalanceDest", "errorBalanceOrig", "errorBalanceDest"],
        "ndarray": [
            [629.0, 1.0, 2433009.28, 2433009.28, 0.0, 0.0, 2433009.28, 0.0, 0.0]
        ]
    }
}
```

# A/B Testing
## Train a second model

We will train a second fraud detection model using Sklearn. The Sklearn server is another of Seldon's prepackaged servers, so again we only need to save our model in object storage to be able to deploy it with Seldon. 

Firstly, we can compute weights using sklearn's `compute_class_weight` function and format them into the dictionary required as input to our Random Forest Classifier. 

In [None]:
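classes = np.unique(Y)  # sorted class labels, so weights are returned in the order [0, 1]
sk_weights = class_weight.compute_class_weight(class_weight="balanced", classes=classes, y=Y)
# Pair each class label with its weight instead of relying on positional order
sk_weights = {int(c): float(w) for c, w in zip(classes, sk_weights)}
sk_weights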
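
For reference, the "balanced" heuristic assigns each class the weight `n_samples / (n_classes * count_of_class)`. The sketch below works through that formula on toy labels with a 0.3% fraud rate; the helper name `balanced_class_weights` and the label list are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c) for every class c."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 997 + [1] * 3  # toy labels with a 0.3% fraud rate
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in sorted(weights.items())})
```

The rare fraudulent class receives a weight several hundred times larger than the genuine class, which plays the same role as `scale_pos_weight` did for the XGBoost model.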

In [None]:
# Long computation in this cell (~ 5 mins)

sk_clf = RandomForestClassifier(n_estimators=100, max_depth=3, class_weight=sk_weights, random_state=randomState)  # seeded for repeatability
probs = sk_clf.fit(trainX, trainY).predict_proba(testX)

In [None]:
print('AUPRC = {}'.format(average_precision_score(testY, probs[:, 1])))

In [None]:
joblib.dump(sk_clf, "model.joblib")

In [None]:
!gsutil cp model.joblib gs://kelly-seldon/fraud-detection/models/{YOUR_NAME}/model.joblib

## Deploy second model as a Canary

Seldon supports a number of advanced deployment patterns, including Multi-armed bandits, A/B testing, Canary deployments and Shadow deployments.

Here, we will deploy our Sklearn Random Forest Classifier as a Canary Deployment and we will direct 30% of traffic to this model.
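
To get a feel for what a 30% split means in practice, the sketch below simulates weighted routing of requests between the two models. It is purely an illustration with a seeded random number generator; the function name `route` is hypothetical and this is not how Seldon routes traffic internally.

```python
import random
from collections import Counter


def route(rng, canary_percent=30):
    """Pick which predictor serves a request under a weighted traffic split."""
    weights = [100 - canary_percent, canary_percent]
    return rng.choices(["default", "canary"], weights=weights, k=1)[0]


rng = random.Random(0)
served = Counter(route(rng) for _ in range(10_000))
canary_share = served["canary"] / sum(served.values())
print(f"canary share: {canary_share:.3f}")
```

Over a small number of requests the observed share can deviate noticeably from 30%, which is worth keeping in mind when comparing the two models on limited traffic.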

We will deploy our Canary model using the Deployment Wizard on the UI.

We need to set the following field values:

**'Add a Canary'** tab:

* Runtime - "SciKit Learn"
* Model URI - gs://kelly-seldon/fraud-detection/models/{YOUR_NAME}/model.joblib
* Canary Traffic Percentage - 30%

Now you can click through the remaining optional tabs and launch the updated Seldon Deployment.

Once our canary model is available, we can make predictions and then view live requests and resource monitoring on the "Dashboard" tab in the Deployment UI for both models running in production.

We will assume that this model has been running in production for some time and it is performing better than our default model.

We can then finally go ahead and promote the canary model to be the default model in the deployment.

# Congratulations!

Thank you for sticking it out to the end of the workshop! 

As a recap you have done the following: 
1. Cleaned and explored a set of transaction data.
2. Trained an XGBoost model to distinguish between normal and fraudulent payments. 
3. Added metadata and a prediction schema. 
4. Trained and deployed a drift detector to understand when your data changes. 
5. Added an explainer to gain deeper insights into the model's behaviour.
6. Trained an Sklearn model and deployed this as a canary.

Not a bad list. Well done, you!