In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI Explanations with TabNet models

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/tabnet/ai-explanations-tabnet-algorithm.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/tabnet/ai-explanations-tabnet-algorithm.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/tabnet/ai-explanations-tabnet-algorithm.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

Vertex AI provides an algorithm based on [TabNet](https://arxiv.org/abs/1908.07442). TabNet is an interpretable deep learning architecture for tabular (structured) data, the most common data type among enterprises. TabNet combines the best of two worlds: it is explainable, like simpler tree-based models, and it can achieve the high accuracy of complex black-box models and ensembles, so it is precise without obscuring how the model works. This makes TabNet well-suited for a wide range of tabular data tasks where model explainability is just as important as accuracy.

The goal of the tutorial is to provide a sample plotting tool to visualize the output of TabNet, which is helpful in explaining the algorithm.

Learn more about [Tabular Workflow for TabNet](https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/tabnet).

### Objective

In this tutorial, you learn how to use a sample plotting tool to visualize the output of TabNet, which is helpful in explaining the algorithm.



This tutorial uses the following Google Cloud ML services and resources:

- Vertex Explainable AI
- TabNet built-in algorithm

The steps performed are:
* Set up the project.
* Download the prediction data of a pretrained model on Syn2 data.
* Visualize and understand the feature importance based on the mask output.
* Clean up the resources created by this tutorial.

### Dataset

This tutorial uses Synthetic_2 (Syn2) data, described in Section 4.1 of the [Learning to Explain](https://arxiv.org/pdf/1802.07814.pdf) paper. The input feature X is generated from a 10-dimensional standard Gaussian. The response variable Y is generated from feature X[3:6] only. The data has been split into training and prediction sets and has been uploaded to Google Cloud Storage:
* Training data: `gs://cloud-samples-data/ai-platform-unified/datasets/tabnet/tab_net_input/syn2_train.csv`
* Prediction output data: `gs://cloud-samples-data/ai-platform-unified/datasets/tabnet/tab_net_output/syn2`

At this time, the TabNet pre-trained model file is not publicly available.

### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook. 

In [None]:
! pip3 install --upgrade --quiet google-cloud-aiplatform tensorflow

### Colab only: Uncomment the following cell to restart the kernel.


In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the following APIs: Vertex AI API, Cloud Resource Manager API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,cloudresourcemanager.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a UUID for each session and append it to the names of the resources you create in this tutorial.

In [None]:
import random
import string


# Generate a UUID of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

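As a minimal sketch of how the UUID might be used, the following code appends it to a resource name. The `tabnet-explain` prefix and the `make_resource_name` helper are hypothetical and are not used elsewhere in this tutorial.

```python
import random
import string


def generate_uuid(length: int = 8) -> str:
    """Return a random lowercase alphanumeric suffix of the given length."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


def make_resource_name(prefix: str, suffix: str) -> str:
    """Join a human-readable prefix and a unique suffix with a hyphen."""
    return f"{prefix}-{suffix}"


# "tabnet-explain" is an illustrative prefix, not a name required by Vertex AI.
example_name = make_resource_name("tabnet-explain", generate_uuid())
print(example_name)
```

Because the suffix is random, each session gets a different name, which is what avoids collisions in a shared project.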
### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

When you submit a training job using the Cloud SDK, you upload a Python package
containing your training code to a Cloud Storage bucket. Vertex AI runs
the code from this package. In this tutorial, Vertex AI also saves the
trained model that results from your job in the same bucket. Using this model artifact, you can then
create a Vertex AI Model resource and use it for prediction.

In [None]:
BUCKET_URI = "gs://your-bucket-name-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

### Import libraries and define constants

In [None]:
import json

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
from google.cloud import storage

%matplotlib inline

## Reading a sample TabNet prediction on syn2 data

After training and serving your model, you upload the output to Google Cloud Storage. 

Sample prediction data is stored on Google Cloud at gs://cloud-samples-data/ai-platform-unified/datasets/tabnet/tab_net_output/syn2. You can use your own set of prediction data, but you must ensure that the format of the prediction data is the same as the format of the training data.

Each prediction in TabNet contains a mask that is used to explain the predictions. The mask is stored in an **aggregated_mask_values** field.

For information about the training set and the model, see the Dataset section earlier in this notebook.

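To make the role of the mask concrete, the sketch below parses a single prediction line. The sample line is invented for illustration: only the `aggregated_mask_values` field name comes from this tutorial, while the `predicted_label` field and all numeric values are assumptions rather than real TabNet output.

```python
import json

MASK_KEY = "aggregated_mask_values"

# Invented example line: the values and the "predicted_label" field are
# assumptions for illustration, not real TabNet output.
sample_line = (
    '{"predicted_label": "1", '
    '"aggregated_mask_values": [0.0, 0.1, 0.4, 0.3, 0.2, 0.0]}'
)

record = json.loads(sample_line)
mask = record[MASK_KEY]

# Find the position of the largest mask value in this single prediction.
top_feature_index = max(range(len(mask)), key=lambda i: mask[i])
print(f"mask length: {len(mask)}, largest mask value at index: {top_feature_index}")
```

Each mask holds one value per input feature, so a larger value at a given position suggests that the corresponding feature contributed more to that prediction.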

In [None]:
!gsutil cp gs://cloud-samples-data/ai-platform-unified/datasets/tabnet/tab_net_output/syn2 $BUCKET_URI

# To use your own predictions, replace BUCKET_NAME and PREDICTION_FILE:
# BUCKET_NAME = "[<your-bucket-name>]"
# PREDICTION_FILE = "[<your-prediction-file>]"

BUCKET_NAME = BUCKET_URI[5:]
PREDICTION_FILE = "syn2"

MASK_KEY = "aggregated_mask_values"

HEADER = [("feat_" + str(i)) for i in range(1, 12)]
HEADER

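The slice `BUCKET_URI[5:]` assumes that the URI begins with `gs://` and contains no object path. A slightly more defensive sketch is shown below; the `split_gcs_uri` helper is hypothetical and is not part of the Cloud Storage client library.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket_name, object_path).

    Hypothetical helper for illustration; object_path is "" when the URI
    names only a bucket.
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}, got {uri!r}")
    bucket_name, _, object_path = uri[len(prefix):].partition("/")
    if not bucket_name:
        raise ValueError(f"No bucket name found in {uri!r}")
    return bucket_name, object_path


# Examples with the placeholder bucket URI used earlier in this notebook.
print(split_gcs_uri("gs://your-bucket-name-unique"))
print(split_gcs_uri("gs://your-bucket-name-unique/syn2"))
```

Failing early on a malformed URI is easier to debug than a bucket lookup that fails later with a truncated name.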
### Download and preprocess the predictions.

In [None]:
storage_client = storage.Client()
bucket = storage_client.get_bucket(BUCKET_NAME)
blob = bucket.blob(PREDICTION_FILE)
# download_as_string is deprecated in google-cloud-storage; download_as_bytes replaces it.
content = blob.download_as_bytes().decode("utf-8").strip()
predictions = content.split("\n")
predictions[:1]

## Parse and concatenate the mask values
Parse the mask values in each prediction, then concatenate them. The output is an N x k matrix, where N is the number of outputs and k is the size of each mask. The concatenated mask values are used to visualize the feature importance.

In [None]:
masks = []
for prediction in predictions:
    prediction = json.loads(prediction)
    masks.append(prediction[MASK_KEY])
# np.matrix is no longer recommended by NumPy; a regular 2-D array supports the same slicing.
masks = np.array(masks)
masks.shape

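Converting the list of masks to a matrix only yields an N x k shape when every mask has the same length. A minimal sketch of that check, written in plain Python with a hypothetical `collect_masks` helper and made-up input lines, could look like this:

```python
import json
from typing import List

MASK_KEY = "aggregated_mask_values"


def collect_masks(lines: List[str], mask_key: str = MASK_KEY) -> List[List[float]]:
    """Parse JSON lines and return their masks, checking that all have equal length."""
    masks = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Skip blank lines, for example a trailing newline.
        mask = json.loads(line)[mask_key]
        if masks and len(mask) != len(masks[0]):
            raise ValueError(
                f"Line {line_number}: mask has {len(mask)} values, "
                f"expected {len(masks[0])}"
            )
        masks.append(mask)
    return masks


# Made-up lines for illustration only.
example_lines = [
    '{"aggregated_mask_values": [0.0, 0.5, 0.5]}',
    '{"aggregated_mask_values": [0.2, 0.3, 0.5]}',
    "",
]
example_masks = collect_masks(example_lines)
print(len(example_masks), "x", len(example_masks[0]))
```

Raising an error on a ragged input points at the offending line, rather than letting the array conversion fail or produce an unexpected shape.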
## Visualize the mask value matrix
Lighter colors indicate more important features. For example, only features 3-6 are meaningful for the prediction output on Syn2 data, so columns 3-6 appear light in the plot.

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(121)
ax.imshow(masks[:50, :], interpolation="bilinear", cmap=cm.Greys_r)
ax.set_xlabel("Features")
ax.set_ylabel("Sample index")
ax.xaxis.set_ticks(np.arange(len(HEADER)))
ax.set_xticklabels(HEADER, rotation="vertical")
plt.show()

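A numeric summary can complement the heatmap. The sketch below averages each mask column and ranks the features by that mean. The `rank_features` helper and the small `example_masks` matrix are made up for illustration; with the notebook's data you could pass `masks.tolist()` and `HEADER` instead, assuming the mask width matches the number of names in `HEADER`.

```python
from statistics import fmean
from typing import List, Sequence, Tuple


def rank_features(
    masks: Sequence[Sequence[float]], names: Sequence[str]
) -> List[Tuple[str, float]]:
    """Return (feature name, mean mask value) pairs, most important first."""
    # zip(*masks) transposes the rows so that each item is one feature column.
    column_means = [fmean(column) for column in zip(*masks)]
    if len(column_means) != len(names):
        raise ValueError("Number of feature names must match the mask width")
    return sorted(zip(names, column_means), key=lambda pair: pair[1], reverse=True)


# Made-up 3 x 4 mask matrix for illustration only.
example_masks = [
    [0.0, 0.6, 0.4, 0.0],
    [0.1, 0.5, 0.4, 0.0],
    [0.0, 0.7, 0.1, 0.2],
]
example_names = ["feat_1", "feat_2", "feat_3", "feat_4"]

ranking = rank_features(example_masks, example_names)
for name, mean_value in ranking:
    print(f"{name}: {mean_value:.3f}")
```

Features whose columns look light in the plot should appear near the top of this ranking.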
## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
import os

# Delete the Cloud Storage bucket that was created
if os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI

## What's next?

To learn more about TabNet, check out the resources here.

* [TabNet: Attentive Interpretable Tabular Learning](https://arxiv.org/abs/1908.07442)