# Criteo Click Through Prediction using ThirdAI's Universal Deep Transformer (UDT) APIs
This notebook shows how to build a Click Through Prediction model using ThirdAI's UDT.

In [14]:
!pip3 install thirdai --upgrade

Requirement already up-to-date: thirdai in /home/pratik/.local/lib/python3.8/site-packages (0.5.10)


# Download and Process Dataset

We will now create a sample dataset for training. The Criteo click logs dataset can be downloaded from https://ailab.criteo.com. Once the files are downloaded, convert them to CSV format.

Column headers: the `<label>` column is titled `label`, `<integer feature 1>` is titled `numeric_1`, and `<categorical feature 1>` is titled `cat_1`, and so on for the remaining features.

<span style="color:red">Disclaimer: The following model is trained on randomly generated sample data. To achieve a better AUC, train on the original Criteo dataset.</span>
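For the real dataset, the conversion step described above might look like the following sketch. It assumes the raw Criteo files are headerless and tab-separated, with the label first, then 13 integer features, then 26 categorical features. The function name `criteo_tsv_to_csv` and the `chunksize` default are illustrative choices, not part of ThirdAI or the Criteo tooling.

```python
import pandas as pd

# Column names following the header convention used in this notebook.
COLUMNS = (
    ["label"]
    + [f"numeric_{i}" for i in range(1, 14)]
    + [f"cat_{i}" for i in range(1, 27)]
)


def criteo_tsv_to_csv(tsv_path, csv_path, chunksize=100_000):
    """Convert a headerless, tab-separated file to a CSV with a header row.

    Hypothetical helper: reads the input in chunks so that large files
    do not have to fit in memory all at once.
    """
    reader = pd.read_csv(
        tsv_path,
        sep="\t",
        header=None,
        names=COLUMNS,
        dtype=str,  # keep values exactly as written; missing fields stay empty
        chunksize=chunksize,
    )
    for i, chunk in enumerate(reader):
        chunk.to_csv(
            csv_path,
            mode="w" if i == 0 else "a",  # overwrite first, then append
            header=(i == 0),              # write the header only once
            index=False,
        )
```

Calling `criteo_tsv_to_csv("train.txt", "train.csv")` (file names assumed) would then produce a file that can be passed to the training cell in place of the sample data.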

In [15]:
import numpy as np
import pandas as pd

TRAIN_FILE = 'train_file.txt'
TEST_FILE = 'test_file.txt'


header = 'label,numeric_1,numeric_2,numeric_3,numeric_4,numeric_5,numeric_6,numeric_7,numeric_8,numeric_9,numeric_10,numeric_11,numeric_12,numeric_13,'
header += 'cat_1,cat_2,cat_3,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,cat_10,cat_11,cat_12,cat_13,cat_14,cat_15,cat_16,cat_17,cat_18,cat_19,cat_20,cat_21,'
header += 'cat_22,cat_23,cat_24,cat_25,cat_26'

column_header=header.split(",")


def generate_data():
    num_training_data=1200
    num_numeric_features=13
    num_categorical_features=26

    numeric_data = np.random.randint(1500, size=(num_training_data, num_numeric_features))
    categorical_data = np.random.randint(10000, size=(num_training_data, num_categorical_features))
    label = np.random.randint(2, size=(num_training_data, 1))

    return np.concatenate((numeric_data, categorical_data), axis=1), label

data_X , data_Y = generate_data()

# Saving CSV for train data
train_data = np.concatenate((data_Y[:1000], data_X[:1000]), axis=1)
train_df = pd.DataFrame(train_data, columns=column_header)
train_df.to_csv(TRAIN_FILE, index=False)

# Saving CSV for test data
test_data = np.concatenate((data_Y[1000:], data_X[1000:]), axis=1)
test_df = pd.DataFrame(test_data, columns=column_header)
test_df.to_csv(TEST_FILE, index=False)

# Preview the first rows of the test split
test_df.head(10)

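|  | label | numeric_1 | numeric_2 | numeric_3 | numeric_4 | numeric_5 | numeric_6 | numeric_7 | numeric_8 | numeric_9 | ... | cat_17 | cat_18 | cat_19 | cat_20 | cat_21 | cat_22 | cat_23 | cat_24 | cat_25 | cat_26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 592 | 705 | 105 | 753 | 1237 | 1010 | 594 | 1343 | 276 | ... | 4768 | 9631 | 9205 | 5841 | 744 | 5123 | 7816 | 8312 | 2153 | 7272 |
| 1 | 1 | 160 | 314 | 1160 | 1269 | 1004 | 1317 | 546 | 193 | 223 | ... | 9387 | 6668 | 3496 | 3965 | 7090 | 7440 | 1256 | 9550 | 5252 | 8862 |
| 2 | 1 | 1432 | 1115 | 898 | 1204 | 233 | 596 | 771 | 359 | 153 | ... | 3222 | 4999 | 9008 | 3040 | 4554 | 9367 | 6920 | 4470 | 370 | 530 |
| 3 | 0 | 1113 | 449 | 139 | 537 | 597 | 101 | 1103 | 786 | 114 | ... | 870 | 7576 | 1388 | 203 | 2006 | 5484 | 658 | 8703 | 5885 | 10 |
| 4 | 0 | 1496 | 732 | 408 | 845 | 232 | 762 | 85 | 108 | 208 | ... | 6863 | 9594 | 2187 | 6210 | 2807 | 8295 | 7723 | 9145 | 1024 | 7033 |
| 5 | 1 | 1262 | 887 | 779 | 537 | 282 | 480 | 422 | 808 | 951 | ... | 3472 | 6151 | 7603 | 2362 | 1254 | 419 | 3398 | 8524 | 1310 | 458 |
| 6 | 0 | 534 | 17 | 1264 | 126 | 498 | 1497 | 232 | 771 | 927 | ... | 2699 | 9719 | 2718 | 8249 | 8956 | 8826 | 629 | 4285 | 3106 | 7750 |
| 7 | 0 | 1345 | 1352 | 943 | 1292 | 1277 | 781 | 586 | 517 | 425 | ... | 7840 | 5009 | 6443 | 5532 | 1509 | 3247 | 3814 | 6611 | 3724 | 626 |
| 8 | 1 | 857 | 780 | 284 | 922 | 653 | 1091 | 898 | 1152 | 182 | ... | 6406 | 9644 | 1812 | 7734 | 5496 | 3294 | 373 | 7068 | 7338 | 1843 |
| 9 | 1 | 1318 | 998 | 909 | 1069 | 962 | 197 | 245 | 1411 | 245 | ... | 6288 | 5427 | 1556 | 3282 | 9396 | 7379 | 1800 | 688 | 8318 | 8951 |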


# UDT Initialization


We can create a UDT model for Criteo click-through prediction as follows. Here we define the Bolt data type for every column and specify the target column, the number of target classes (`n_target_classes`), and the embedding dimension for UDT.

In [16]:
from thirdai import bolt
import numpy as np
from sklearn.metrics import roc_auc_score


tabular_model = bolt.UniversalDeepTransformer(
    data_types={
        # 13 numeric features, numeric_1 .. numeric_13, each in [0, 1500)
        **{f"numeric_{i}": bolt.types.numerical(range=(0, 1500)) for i in range(1, 14)},
        # 26 categorical features, cat_1 .. cat_26
        **{f"cat_{i}": bolt.types.categorical() for i in range(1, 27)},
        # Binary click label
        "label": bolt.types.categorical(),
    },
    target="label",
    n_target_classes=2,
    options={"embedding_dimension": 512},
)

# Training

We will now train the UDT model with just one line of code. Here we specify the training file name, the number of epochs, and the maximum number of batches to load into memory at once. `max_in_memory_batches` supports loading data in a streaming fashion. You can change `max_in_memory_batches` to suit your machine's available memory.

In [17]:


tabular_model.train(filename=TRAIN_FILE, epochs=1, max_in_memory_batches=1000)



loading data | source 'train_file.txt' | vectors 1000 | batches 1 | time 0s | complete

train | epoch 0 | train_steps 1 | {} | train_batches 1 | time 0s | complete     
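To pick a value for `max_in_memory_batches`, it can help to estimate how many rows are held in memory per load and how many loads one epoch needs. The sketch below only models that arithmetic; it is not ThirdAI's implementation. The `batch_size` argument and the helper name `streaming_plan` are assumptions introduced for illustration, and the batch size your installed version actually uses may differ.

```python
import math


def streaming_plan(num_rows, batch_size, max_in_memory_batches):
    """Estimate rows held in memory per load and loads needed per epoch.

    Hypothetical helper: batch_size is an assumed value, not read from UDT.
    """
    rows_per_load = batch_size * max_in_memory_batches
    num_loads = math.ceil(num_rows / rows_per_load)
    return rows_per_load, num_loads


# The 1,000-row sample file fits in a single load under these assumptions.
rows_per_load, num_loads = streaming_plan(
    num_rows=1000, batch_size=2048, max_in_memory_batches=1000
)
print(rows_per_load, num_loads)
```

Lowering `max_in_memory_batches` reduces `rows_per_load`, and with it peak memory, at the cost of more loads per epoch.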



# Evaluation

Evaluating the UDT model is also just one line of code. By default, the `evaluate` function returns activations. We then use sklearn's `roc_auc_score` to compute the ROC AUC of the trained model.

In [18]:
activations = tabular_model.evaluate(filename=TEST_FILE, metrics=["categorical_accuracy"])

# Read the ground-truth labels (first CSV column), one per activation row
true_labels = np.zeros(activations.shape[0], dtype=np.float32)
with open(TEST_FILE) as f:
    next(f)  # skip the header row
    for i, line in enumerate(f):
        true_labels[i] = np.float32(line.split(",")[0])

roc_auc = roc_auc_score(true_labels, activations[:, 1])

print("ROC_AUC:", roc_auc)



loading data | source 'test_file.txt' | vectors 200 | batches 1 | time 0s | complete

evaluate | epoch 1 | train_steps 1 | {categorical_accuracy: 0.555} | eval_batches 1 | time 9ms

ROC_AUC: 0.48109243697478987
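An AUC close to 0.5 is what one would expect here, because the sample labels are generated independently of the features. The following sketch, which uses only NumPy and scikit-learn on synthetic scores (no UDT model involved), illustrates how `roc_auc_score` behaves for uninformative versus informative scores. All variable names and the noise level are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 10_000

# Binary labels drawn at random, as in the sample dataset above.
labels = rng.integers(0, 2, size=n)

# Scores that carry no information about the labels.
random_scores = rng.random(n)

# Scores correlated with the labels: the label plus Gaussian noise.
informative_scores = labels + rng.normal(scale=1.0, size=n)

auc_random = roc_auc_score(labels, random_scores)
auc_informative = roc_auc_score(labels, informative_scores)

print("AUC, uninformative scores:", auc_random)
print("AUC, informative scores:", auc_informative)
```

The first value should land near 0.5 and the second clearly above it, which is the kind of gap to look for when training on the original Criteo data.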


In [19]:
import os
os.remove(TRAIN_FILE)
os.remove(TEST_FILE)