# Tackling Tabular Data with BOLT

Tabular data is among the most common kinds of data in real-world applications. Consisting mostly of categorical and numeric columns, it is used in financial services, fraud detection, healthcare, climate science, and more. For these problems, gradient boosted decision trees such as XGBoost have long been the algorithms of choice because they quickly learn effective decision boundaries on the data. However, deep learning has made significant strides in this area with developments such as TabNet and our very own BOLT engine.

Our deep learning engine BOLT is an efficient neural framework that has proven effective on a wide array of problems. Most recently, initial experiments have shown that BOLT can beat top methods such as XGBoost and TabNet on several benchmark tabular datasets. BOLT also works well as an out-of-the-box classifier, requiring much less effort to train than other methods. In this notebook, we'll showcase some of our results on several standard tabular datasets and walk you through how to try out BOLT for yourself.

# Results

First, we'll start with some results. We compared our BOLT tabular classifier with XGBoost and TabNet on several benchmark tabular classification datasets to produce the following numbers. We hope these numbers help counter the narrative that gradient boosted trees are the best solution for tabular data and inspire discussion around our emerging BOLT technology.

| Dataset      | BOLT Accuracy | XGBoost Accuracy | TabNet Accuracy | 
| ----------- | ----------- | ----------- | ----------- |
| Census Income      |   85%    | 84% | 78% |
| Poker Hand Induction   |    96%     | 62% | 54% |
| Eye Movements      |   60%    | 37% | 30% |
| BNP Paribas Cardif Claims Management  |    76%     | 75% | 76% |
| Churn Modeling   |    84%     | 84% | 77% |
| Higgs Boson   |    65%     | 50% | 74% |

*Results use default parameters, with no specialized tuning or feature engineering, and with early stopping based on validation accuracy.
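
For readers unfamiliar with early stopping, the idea can be sketched generically as follows. This is a minimal illustration, not the code used to produce the table: `train_one_epoch`, `validation_accuracy`, `max_epochs`, and `patience` are hypothetical names introduced here, and nothing in it is part of the BOLT, XGBoost, or TabNet APIs.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    train_one_epoch: Callable[[], None],      # hypothetical: runs one training epoch
    validation_accuracy: Callable[[], float],  # hypothetical: scores a held-out split
    max_epochs: int = 20,
    patience: int = 2,
) -> Tuple[int, float]:
    """Stop once validation accuracy has not improved for `patience` epochs.

    Returns (best_epoch, best_accuracy), with epochs counted from 1.
    """
    best_epoch, best_accuracy = 0, float("-inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        accuracy = validation_accuracy()
        if accuracy > best_accuracy:
            # New best score: remember when it happened.
            best_epoch, best_accuracy = epoch, accuracy
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop training.
            break
    return best_epoch, best_accuracy
```

Note that this sketch only reports the best epoch; it does not restore the model weights from that epoch, which a full implementation would typically do.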

# Extensions

Using deep learning with tabular data opens up the possibility of transfer learning, self-supervised learning, semi-supervised learning, and many other influential paradigms, all of which are otherwise hard with only boosted trees. We hope these extensions open up research in this area and lead to new innovations.

# Try out the BOLT TabularClassifier

Next, we'll move on to a demo. Once you have received a valid license file from our website and followed our package installation instructions (https://www.thirdai.com/installation-instructions/), follow the steps below to try out the BOLT TabularClassifier for yourself!

## Step 1: Downloading and setting up your dataset

You're welcome to download and set up any tabular dataset of your choosing. For this demo, however, we'll use the publicly available Census Income dataset, which is a common benchmark for tabular classification problems.

In [None]:
from thirdai import bolt
import os
import pandas as pd

CENSUS_INCOME_BASE_DOWNLOAD_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/"
)

TRAIN_FILE = "./census_income_train.csv"
TEST_FILE = "./census_income_test.csv"
PREDICTION_FILE = "./census_income_predictions.txt"


def download_census_income_dataset():
    if not os.path.exists(TRAIN_FILE):
        os.system(
            f"curl {CENSUS_INCOME_BASE_DOWNLOAD_URL}adult.data --output {TRAIN_FILE}"
        )
    if not os.path.exists(TEST_FILE):
        os.system(
            f"curl {CENSUS_INCOME_BASE_DOWNLOAD_URL}adult.test --output {TEST_FILE}"
        )

def reformat_test_csv():
    with open(TEST_FILE, "r") as fin:
        data = fin.read().splitlines(True)
    with open(TEST_FILE, "w") as fout:
        # Each label in the test set ends with a "." that the training labels
        # lack, so strip it from the end of every line. Start at data[1:]
        # because the first line of the file is not a data row.
        fout.writelines([line.rstrip().rstrip(".") + "\n" for line in data[1:]])

def get_dataset_metadata():
    df = pd.read_csv(TEST_FILE)
    n_classes = df[df.columns[-1]].nunique()
    column_datatypes = []
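    for col_type in df.dtypes[:-1]:
        # Integer and floating point columns are numeric; anything else
        # (e.g. strings stored as "object") is treated as categorical, so
        # every feature column gets exactly one entry in the list.
        if pd.api.types.is_numeric_dtype(col_type):
            column_datatypes.append("numeric")
        else:
            column_datatypes.append("categorical")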

    # the last column is the label
    column_datatypes.append("label")

    # TabularClassifier assumes no header so we add the first label 
    # (considered a column name in pandas) as part of the test labels
    test_labels = [df.columns[-1]] + list(df[df.columns[-1]])

    return n_classes, column_datatypes, test_labels


download_census_income_dataset()
reformat_test_csv()
(n_classes, column_datatypes, test_labels) = get_dataset_metadata()


## Step 2: Defining the TabularClassifier model

### Arguments
- `model_size`: Controls how big the model is. Options are "small", "medium", "large", or a target model size in gigabytes, e.g. "4 Gb" or "4Gb".
- `n_classes`: How many output classes are in the dataset.

In [None]:
classifier = bolt.TabularClassifier(model_size="medium", n_classes=n_classes)

## Step 3: Training the TabularClassifier

### Arguments
- `train_file`: The dataset to train on. The expected format is a CSV file with comma-separated columns and no header. The TabularClassifier does not shuffle the dataset, so shuffle it yourself before calling `train` if necessary.
- `column_datatypes`: List of column types. Should be a list of strings with the types of the columns, which are one of: "numeric", "categorical", or "label". Numeric columns will be interpreted as integers or floating point values. Categorical columns and label columns will be interpreted as strings. Empty values are supported for categorical and numeric columns. Finally, there must be one and only one "label" column. 
- `epochs`: The number of epochs to train for.
- `learning_rate`: The learning rate used for training.
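
Since the TabularClassifier does not shuffle the training file, one way to do it yourself is sketched below using only the Python standard library. The helper `shuffle_csv_rows` and its `seed` parameter are hypothetical names introduced for illustration; the sketch assumes the file has no header, fits in memory, and contains no quoted fields that span multiple lines.

```python
import random


def shuffle_csv_rows(input_path: str, output_path: str, seed: int = 0) -> int:
    """Write the rows of a header-less CSV to output_path in random order.

    Blank lines are dropped. Returns the number of rows written.
    """
    with open(input_path, "r") as fin:
        rows = [line.rstrip("\n") for line in fin if line.strip()]

    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(rows)

    with open(output_path, "w") as fout:
        fout.writelines(row + "\n" for row in rows)
    return len(rows)
```

Because the input is read completely before anything is written, the same path can be passed for both arguments to shuffle a file in place, for example `shuffle_csv_rows(TRAIN_FILE, TRAIN_FILE)` before calling `train`.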

In [None]:
classifier.train(
    train_file=TRAIN_FILE,
    column_datatypes=column_datatypes,
    epochs=1,
    learning_rate=0.01,
)

## Step 4: Testing the TabularClassifier

### Arguments
- `test_file`: The test dataset to run. The file should follow the same format as the training file; column datatypes are assumed to be identical.
- `output_file`: An optional parameter. If specified, the names of the predicted classes are written to this file as the data is processed, one predicted class name per line.
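
Because the output file holds one predicted class name per line, a quick sanity check is to count how often each class was predicted. The sketch below uses only the standard library; `count_predicted_classes` is a hypothetical helper name, and it assumes the one-name-per-line layout described above.

```python
from collections import Counter


def count_predicted_classes(prediction_file: str) -> Counter:
    """Count how many times each class name appears in a prediction file."""
    with open(prediction_file) as fin:
        # Remove only the line terminator so class names are kept as written;
        # blank lines are ignored.
        return Counter(line.rstrip("\n") for line in fin if line.strip())
```

After running `predict`, calling `count_predicted_classes(PREDICTION_FILE)` shows, for instance, whether the model is predicting a single class for every row.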

In [None]:
classifier.predict(test_file=TEST_FILE, output_file=PREDICTION_FILE)

## Step 5: Verifying results

In [None]:
def compute_accuracy(test_labels, pred_file):
    with open(pred_file) as pred:
        # Strip only the line terminator, so a final line without a trailing
        # newline does not lose its last character.
        predictions = [line.rstrip("\n") for line in pred]

    assert len(predictions) == len(test_labels)
    correct = sum(
        prediction == answer
        for prediction, answer in zip(predictions, test_labels)
    )

    return correct / len(test_labels)

acc = compute_accuracy(test_labels, PREDICTION_FILE)

print("Computed Accuracy: ", acc)

In [None]:
os.remove(TRAIN_FILE)
os.remove(TEST_FILE)
os.remove(PREDICTION_FILE)