# Click-Through Prediction with ThirdAI's Universal Deep Transformer

This notebook shows how to build a click-through prediction model with ThirdAI's
Universal Deep Transformer (UDT), our general-purpose classifier.
In this demo, we train and evaluate the model on the Kaggle Display Advertising (Criteo) dataset,
but you can easily substitute your own dataset.

To run this notebook, you need a ThirdAI license. If you do not already have one, you can obtain it at the following link:
https://www.thirdai.com/try-bolt/



In [None]:
!pip3 install thirdai==0.5.1
!pip3 install pandas # We need Pandas in the utils module to convert downloaded datasets to CSV format
!pip3 install numpy # We use numpy to analyze UDT performance in this notebook
!pip3 install scikit-learn # We use scikit-learn (imported as sklearn) to calculate AUC

# Dataset Download

We will use the utils module in this repo to download the Kaggle Display Advertising dataset (if you have only copied this notebook rather than cloning the entire repo, you will need to copy the utils.py file as well). You can replace
this step with your own downloader and preprocessor specific to your dataset. The file that you pass into UDT's train and evaluate methods must be in CSV format with a header row that contains all column names.
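
If you bring your own data, the sketch below shows one way to produce a CSV file with a header row. The column names match those used later in this notebook (`num_1` to `num_13`, `cat_1` to `cat_26`, and `label`). The column order and the two rows are made up for illustration and are not taken from the Criteo files.

```python
import csv
import io

# Column names follow the naming used later in this notebook:
# the label, 13 numerical columns, and 26 categorical columns.
# The order shown here is an assumption.
header = (
    ["label"]
    + [f"num_{i}" for i in range(1, 14)]
    + [f"cat_{i}" for i in range(1, 27)]
)

# Two made-up rows, purely for illustration (not real Criteo data).
rows = [
    ["0"] + [str(i) for i in range(13)] + [f"c{i}" for i in range(26)],
    ["1"] + [str(2 * i) for i in range(13)] + [f"d{i}" for i in range(26)],
]

# Written to an in-memory buffer here; for a real file, use
# open("my_train.csv", "w", newline="") instead (hypothetical filename).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)   # the header row comes first
writer.writerows(rows)    # then one line per example

csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])
```

Every row should have exactly as many fields as the header, so a quick length check before writing can catch malformed rows early.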


In [None]:
import utils

(
    train_filename,
    test_filename,
    y_train,
    y_test,
    min_vals_of_numeric_cols,
    max_vals_of_numeric_cols,
    n_unique_classes,
    sample_batch,
) = utils.download_criteo()

# UDT Initialization

We can now create a UDT model by passing in the types of each column in the dataset
and the target column we want to be able to predict.

In [None]:
from thirdai import bolt
from sklearn.metrics import roc_auc_score
import numpy as np

columns = {}

# Add numerical columns
for i in range(1, 14):
    columns[f"num_{i}"] = bolt.types.numerical(
        range=(min_vals_of_numeric_cols[i - 1], max_vals_of_numeric_cols[i - 1])
    )

# Add categorical columns
for i in range(1, 27):
    columns[f"cat_{i}"] = bolt.types.categorical(
        n_unique_classes=n_unique_classes[i - 1]
    )

# Add label column
columns["label"] = bolt.types.categorical(n_unique_classes=2)

# Define UDT
model = bolt.UniversalDeepTransformer(columns, target="label")
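
The column definitions above need a value range for each numerical column and a class count for each categorical column. For your own dataset, those statistics could be gathered roughly as follows. This is a minimal sketch on made-up in-memory rows. It is not necessarily how `utils.download_criteo` computes them, and it ignores missing values.

```python
# Hypothetical rows; real data would be read from your CSV file.
rows = [
    {"num_1": "3", "num_2": "10.5", "cat_1": "a"},
    {"num_1": "7", "num_2": "-2.0", "cat_1": "b"},
    {"num_1": "5", "num_2": "4.25", "cat_1": "a"},
]

numeric_cols = ["num_1", "num_2"]
categorical_cols = ["cat_1"]

# Per-column minimum and maximum, used for the numerical range.
min_vals = [min(float(r[c]) for r in rows) for c in numeric_cols]
max_vals = [max(float(r[c]) for r in rows) for c in numeric_cols]

# Number of distinct values seen in each categorical column.
n_unique = [len({r[c] for r in rows}) for c in categorical_cols]

print(min_vals, max_vals, n_unique)
```

Computing these values on the training split only, and reusing them for the test split, keeps the column definitions consistent across both files.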

# Training

We can now train our UDT model with just two lines! You can customize the
number of epochs and the learning rate through the *train_config* object. If you do not pass a train_config, the *train()* method will auto-tune one for you.

This task has N=40M examples and is thus a reasonably large problem. Training time depends on your machine: a typical 16-core machine on AWS takes around 15 minutes, while a less powerful machine may take an hour or more.

In [None]:
train_config = bolt.TrainConfig(epochs=1, learning_rate=0.001)

model.train(train_filename, train_config)

# Evaluation

Evaluating the performance of the UDT model is also just two lines! For this specific application, AUC is the preferred metric. We will first get the activations and then compute the AUC.

This step may also take a while because of the size of the test dataset. We expect it to take about 10% as long as the training step above.

In [None]:
eval_config = bolt.EvalConfig().return_activations()

activations = model.evaluate(test_filename, eval_config)

roc_auc = roc_auc_score(y_test, activations[:, 1])

print("ROC AUC:", roc_auc)
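
To make the metric concrete, here is a small pure-Python sketch of what AUC measures: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The `pairwise_auc` helper and the toy data are illustrative only. In the notebook, `roc_auc_score` does the real work, and this quadratic-time version is not suitable for the full test set.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(P * N) time, so it is
    only meant for small illustrations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy activations: one row per example, columns = [score for class 0, score for class 1].
toy_activations = [[0.9, 0.1], [0.6, 0.4], [0.65, 0.35], [0.2, 0.8]]
toy_labels = [0, 0, 1, 1]

# Column 1 holds the score for the "click" class, mirroring activations[:, 1].
click_scores = [row[1] for row in toy_activations]

auc = pairwise_auc(toy_labels, click_scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because AUC depends only on how examples are ranked, it is unaffected by the decision threshold, which is one reason it suits imbalanced click data.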

# Saving and Loading

Saving and loading a trained UDT model to disk is also extremely straightforward.

In [None]:
save_location = "click_through.model"

# Saving
model.save(save_location)

# Loading
model = bolt.UniversalDeepTransformer.load(save_location)

# Testing Predictions

The evaluation method is great for testing, but it requires labels, which don't
exist in a production setting. We also have a predict method that can take in an 
in-memory batch of rows or a single row (without the target column), allowing 
easy integration into production pipelines.

In [None]:
import numpy as np

print(sample_batch[0])

prediction = model.predict(sample_batch[0])

print("Predicted class:", np.argmax(prediction))
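
In a production pipeline, you would typically build such rows yourself instead of using `sample_batch`. The sketch below turns CSV text into a list of rows with the target column removed. It assumes each row is represented as a dict mapping column names to string values, which is an assumption about the input format, so check it against the output of `print(sample_batch[0])` above. The `rows_without_target` helper and the data are hypothetical.

```python
import csv
import io

# Made-up excerpt with two feature columns plus the label;
# real rows would carry all 39 feature columns.
csv_text = "label,num_1,cat_1\n1,3,a\n0,7,b\n"


def rows_without_target(text, target="label"):
    """Parse CSV text into one dict per row, dropping the target column."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: v for k, v in row.items() if k != target} for row in reader]


batch = rows_without_target(csv_text)
print(batch[0])
```

Each element of `batch` then plays the role of a single row without the target column, as described above.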