## Explainability for Tabular Data

This tutorial shows how to use the thirdai library to explain the predictions of tabular models.

## Download Training Data

For this tutorial, we use the Census Income dataset. The task is to predict whether a person's income exceeds $50K/yr based on census data.

See here for more info about the dataset: https://archive.ics.uci.edu/ml/datasets/census+income

In [None]:
from thirdai import bolt
import os
import pandas as pd

CENSUS_INCOME_BASE_DOWNLOAD_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/"
)

TABULAR_TRAIN_FILE = "./census_income_train.csv"

column_names = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status", 
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "label"
]

def download_census_income_dataset():
    if not os.path.exists(TABULAR_TRAIN_FILE):
        os.system(
            f"curl {CENSUS_INCOME_BASE_DOWNLOAD_URL}adult.data --output {TABULAR_TRAIN_FILE}"
        )

def reformat_train_csv():
    with open(TABULAR_TRAIN_FILE, "r") as file:
        data = file.readlines()
    with open(TABULAR_TRAIN_FILE, "w") as file:
        file.write(",".join(column_names) + "\n")
        file.writelines([line.replace(", ", ",") for line in data])


download_census_income_dataset()
reformat_train_csv()
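
Before training, it can help to confirm what the reformatting step produces. The sketch below applies the same transformation as `reformat_train_csv` to a single in-memory row (the sample row used later in this tutorial, re-spaced into the raw `, `-separated layout) and checks the header and whitespace with the standard `csv` module. The names `raw_lines`, `header_ok`, and `whitespace_free` are illustrative and not part of thirdai.

```python
import csv
import io

column_names = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "label",
]

# One illustrative row in the raw layout, where fields are separated by ", ".
raw_lines = [
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
    "Own-child, Black, Male, 0, 0, 40, United-States, <=50K\n",
]

# Same transformation as reformat_train_csv, applied in memory:
# prepend the header and collapse ", " into ",".
reformatted = ",".join(column_names) + "\n" + "".join(
    line.replace(", ", ",") for line in raw_lines
)

reader = csv.DictReader(io.StringIO(reformatted))
rows = list(reader)

# The header matches the expected column names, in order.
header_ok = reader.fieldnames == column_names
# No value keeps a leading or trailing space.
whitespace_free = all(
    value == value.strip() for row in rows for value in row.values()
)
print(header_ok, whitespace_free)
```

The same two checks can be run against the real file by opening `TABULAR_TRAIN_FILE` instead of the in-memory string.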

## Training Tabular Model

Here, we train ThirdAI's Universal Deep Transformer (UDT). We need to specify the type of every column and the target column (the column UDT learns to predict). This returns a Bolt model.
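
The numerical ranges and category counts in the column types do not have to be typed by hand. As a minimal sketch, assuming the data fits in memory, they can be computed from the CSV with the standard library. The three-row sample and the names `numerical_ranges` and `unique_counts` below are hypothetical and only illustrate the idea; they are not part of thirdai.

```python
import csv
import io

# Hypothetical three-row sample with a subset of the columns (illustration only).
sample_csv = """age,workclass,hours-per-week,label
25,Private,40,<=50K
38,Private,50,<=50K
52,Self-emp-inc,45,>50K
"""

numerical_columns = ["age", "hours-per-week"]
categorical_columns = ["workclass", "label"]

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# (min, max) per numerical column, usable as the `range` argument.
numerical_ranges = {
    col: (min(int(r[col]) for r in rows), max(int(r[col]) for r in rows))
    for col in numerical_columns
}

# Number of distinct values per categorical column,
# usable as the `n_unique_classes` argument.
unique_counts = {
    col: len({r[col] for r in rows}) for col in categorical_columns
}

print(numerical_ranges)
print(unique_counts)
```

Running the same computation over the full training file gives values that can be compared against the ones passed to the model.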

For training, we need to specify a training config. Once it is defined, we pass it to the train function along with the training data, as specified below.

In [None]:
tabular_model = bolt.UniversalDeepTransformer(
    data_types={
        "age": bolt.types.numerical(range=(17, 90)),
        "workclass": bolt.types.categorical(n_unique_classes=9),
        "fnlwgt": bolt.types.numerical(range=(12285, 1484705)),
        "education": bolt.types.categorical(n_unique_classes=16),
        "education-num": bolt.types.categorical(n_unique_classes=16),
        "marital-status": bolt.types.categorical(n_unique_classes=7),
        "occupation": bolt.types.categorical(n_unique_classes=15),
        "relationship": bolt.types.categorical(n_unique_classes=6),
        "race": bolt.types.categorical(n_unique_classes=5),
        "sex": bolt.types.categorical(n_unique_classes=2),
        "capital-gain": bolt.types.numerical(range=(0, 99999)),
        "capital-loss": bolt.types.numerical(range=(0, 4356)),
        "hours-per-week": bolt.types.numerical(range=(1, 99)),
        "native-country": bolt.types.categorical(n_unique_classes=42),
        "label": bolt.types.categorical(n_unique_classes=2),
    },
    target="label"
)

train_config = (bolt.TrainConfig(epochs=5, learning_rate=0.01)
                    .with_metrics(["categorical_accuracy"]))

tabular_model.train(TABULAR_TRAIN_FILE, train_config)

## Explanations

Below is how we can identify the columns that are most responsible for a predicted outcome, along with a value that shows each column's percentage significance.

In [None]:
#25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K

dict_for_explainability = {
    "age": "25",
    "workclass": "Private",
    "fnlwgt": "226802",
    "education": "11th",
    "education-num": "7",
    "marital-status": "Never-married",
    "occupation": "Machine-op-inspct",
    "relationship": "Own-child",
    "race": "Black",
    "sex": "Male",
    "capital-gain": "0",
    "capital-loss": "0",
    "hours-per-week": "40",
    "native-country": "United-States",
    "label": "<=50K",
}



explanations = tabular_model.explain(input_sample=dict_for_explainability)
for explanation in explanations:
    print(explanation)
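
The input sample can also be built from a raw CSV row instead of being typed field by field. The sketch below pairs each value with its column name and strips surrounding whitespace so the values match the reformatted training file. The helper `row_to_sample` is a hypothetical name introduced here, not a thirdai function.

```python
column_names = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "label",
]


def row_to_sample(raw_row, names):
    """Map a comma-separated row to a {column name: string value} dict."""
    values = [value.strip() for value in raw_row.split(",")]
    if len(values) != len(names):
        raise ValueError(
            f"expected {len(names)} fields, got {len(values)}"
        )
    return dict(zip(names, values))


raw_row = (
    "25,Private,226802,11th,7,Never-married,Machine-op-inspct,"
    "Own-child,Black,Male,0,0,40,United-States,<=50K"
)
sample = row_to_sample(raw_row, column_names)
print(sample["sex"], sample["label"])
```

The resulting dictionary has the same shape as `dict_for_explainability`, with every value kept as a string.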

Suppose we want to know which factors (columns, in our case) we need to change, and by how much, to get a salary above 50K.

We can use the explain function for that too. We need to set target_class to the class we want, which in our case is ">50K".

In [None]:
#25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K



explanations = tabular_model.explain(input_sample=dict_for_explainability, target_class=">50K")
for explanation in explanations:
    print(explanation)

## Delete Training Data

In [None]:
os.remove(TABULAR_TRAIN_FILE)