### Neural Network for Criteo Clickthrough Data
We will implement a neural network to predict ad clickthrough probabilities with Keras. 
We start by importing some modules. 

In [57]:
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
import random

The neural network consists of 3 dense hidden layers, each with 500 neurons, and a single-neuron final layer with a sigmoid activation function, representing the probability that the ad is clicked. We stack them together using `keras.Sequential`.

In [58]:
hidden_layer_1 = keras.layers.Dense(500, activation=tf.nn.relu)
hidden_layer_2 = keras.layers.Dense(500, activation=tf.nn.relu)
hidden_layer_3 = keras.layers.Dense(500, activation=tf.nn.relu)
final_layer = keras.layers.Dense(1, activation=tf.nn.sigmoid)

layers = keras.Sequential([
    hidden_layer_1,
    hidden_layer_2,
    hidden_layer_3,
    final_layer
])

This model will be trained with binary cross-entropy loss, since the dataset has two classes (1 = clicked, 0 = not clicked). For the optimizer, we will use Adam. To evaluate the model, we use the binary accuracy metric, which shows how often the model predicts correctly: a predicted probability above 0.5 counts as a click, and anything else counts as no click.

We do not set `from_logits=True` in the loss function because the final layer already applies a sigmoid activation, so its output is a probability rather than a logit.
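To make the logits-versus-probabilities distinction concrete, here is a minimal pure-Python sketch (not the Keras implementation). The helper names `sigmoid`, `bce_from_prob`, and `bce_from_logit` are hypothetical and exist only for this illustration. It shows that applying a sigmoid and then computing cross entropy gives the same value as computing the loss directly from the logit.

```python
import math

def sigmoid(z: float) -> float:
    """Squash a raw logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_prob(y: int, p: float) -> float:
    """Binary cross entropy for a label y in {0, 1} and a predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def bce_from_logit(y: int, z: float) -> float:
    """The same loss computed from the raw logit z, in a numerically stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

logit = 1.5  # hypothetical raw output of a final layer without an activation
for label in (0, 1):
    # Both columns should agree up to floating-point error.
    print(label, bce_from_prob(label, sigmoid(logit)), bce_from_logit(label, logit))
```

Passing probabilities to a loss that expects logits (or the reverse) would apply the sigmoid twice or not at all, which is why the flag has to match the final layer.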


In [59]:
loss_fn = keras.losses.BinaryCrossentropy()
optimizer = keras.optimizers.Adam()
metric = keras.metrics.BinaryAccuracy()

Our model is almost done! However, we still haven't prepared our datasets, and the preparation is specific to the Criteo dataset.
Specifically, each example has 13 integer features (count features) and 26 categorical features hashed to 32 bits. While we can stack the 13 integer features into a vector, the categorical features must be treated differently, since it does not make sense to treat category labels as scalars (e.g. if category car = 1 and category apple = 2, it does not make sense that apple is 2 x car). At the same time, a feature's vocabulary may be too large to be one-hot encoded. Thus, we will use embedding tables, one for each categorical feature. The vocabulary size of each feature is not readily available, so we need to analyze the dataset to find out. For now, we use the small Criteo dataset with only 1,000,000 entries.

Here is the number of distinct categories for each of the 26 categorical features (feature index : count):

0 : 1261    | 1 : 531       | 2 : 321438    | 3 : 120964    | 4 : 267   | 5 : 15        | 6 : 10863 | 7 : 563   | 8 : 3     | 9 : 30792 

10 : 4731   | 11 : 268487   | 12 : 3068     | 13 : 26       | 14 : 8934 | 15 : 205923   | 16 : 10   | 17 : 3881 | 18 : 1854 | 19 : 3

20 : 240747 | 21 : 15       | 22 : 15       | 23 : 41282    | 24 : 69   | 25 : 30956

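Since some features have very few categories, we only use embeddings for features with more than 100 categories and one-hot encode the rest. Note that most of the embedding tables below are smaller than the vocabularies they serve (for example, feature 0 has 1261 categories but a table with `input_dim=100`). Category IDs are folded into the table with a modulo in `concat_helper`, so several categories can share one embedding row.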

In [60]:
embedding_layers = [
    keras.layers.Embedding(input_dim=100, output_dim=70, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.Embedding(input_dim=531, output_dim=50, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.Embedding(input_dim=600, output_dim=100, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.Embedding(input_dim=400, output_dim=100, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.Embedding(input_dim=267, output_dim=50, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.CategoryEncoding(num_tokens=15, output_mode='one_hot'),

    keras.layers.Embedding(input_dim=110, output_dim=100, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),
                            
    keras.layers.Embedding(input_dim=563, output_dim=50, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.CategoryEncoding(num_tokens=3, output_mode='one_hot'),

    keras.layers.Embedding(input_dim=200, output_dim=100, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.Embedding(input_dim=200, output_dim=70, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.Embedding(input_dim=600, output_dim=100, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.Embedding(input_dim=150, output_dim=70, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.CategoryEncoding(num_tokens=26, output_mode='one_hot'),

    keras.layers.Embedding(input_dim=300, output_dim=70, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.Embedding(input_dim=500, output_dim=100, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.CategoryEncoding(num_tokens=10, output_mode='one_hot'),

    keras.layers.Embedding(input_dim=250, output_dim=70, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.Embedding(input_dim=100, output_dim=70, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.CategoryEncoding(num_tokens=3, output_mode='one_hot'),

    keras.layers.Embedding(input_dim=600, output_dim=100, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.CategoryEncoding(num_tokens=15, output_mode='one_hot'),

    keras.layers.CategoryEncoding(num_tokens=15, output_mode='one_hot'),

    keras.layers.Embedding(input_dim=200, output_dim=100, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),

    keras.layers.CategoryEncoding(num_tokens=69, output_mode='one_hot'),
    
    keras.layers.Embedding(input_dim=200, output_dim=100, 
                            embeddings_initializer=tf.keras.initializers.RandomNormal(stddev=1.)),
]

In [61]:
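def concat_helper(layer, inputs, index):
    """Apply `layer` to column `index` of `inputs` and return a (batch, width) tensor."""
    # Select one categorical column, keeping shape (batch, 1).
    layer_inputs = tf.gather(inputs, [index], axis=1)
    if isinstance(layer, tf.keras.layers.Embedding):
        # Fold raw category IDs into the embedding table's index range.
        layer_inputs = tf.math.mod(layer_inputs, layer.input_dim)
        output_dim = layer.output_dim
    else:
        # CategoryEncoding: fold IDs into [0, num_tokens) before one-hot encoding.
        layer_inputs = tf.math.mod(layer_inputs, layer.num_tokens)
        output_dim = layer.num_tokens
    # Drop the singleton axis so every feature has shape (batch, output_dim).
    final_shape = (-1, output_dim)
    return tf.reshape(layer(layer_inputs), final_shape)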
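The helper above folds each raw category ID into the layer's index range with a modulo. As a rough illustration of what that implies, the pure-Python sketch below uses a hypothetical `fold_ids` function and made-up IDs. It shows that distinct categories collide whenever they differ by a multiple of the table size, and that Python's `%` always returns a non-negative slot for a positive table size.

```python
from collections import Counter

def fold_ids(raw_ids, table_size):
    """Map arbitrary integer IDs into [0, table_size) with a modulo."""
    return [raw_id % table_size for raw_id in raw_ids]

raw_ids = [7, 107, 1260, 42, 142]  # hypothetical category IDs
table_size = 100                   # same size as the first embedding table above

folded = fold_ids(raw_ids, table_size)
print(folded)

# Slots that received more than one distinct raw ID share an embedding row.
collisions = {slot: n for slot, n in Counter(folded).items() if n > 1}
print(collisions)
```

Sharing rows trades some modeling capacity for a much smaller table, which is the usual motivation for this kind of folding.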

Let's now combine everything into the model.

In [62]:
inputs = keras.Input(shape=(39,), dtype=tf.float32)
x = keras.layers.Concatenate()([
    tf.gather(inputs, range(0,13), axis=1),
    *[concat_helper(layer, inputs, i + 13) for i, layer in enumerate(embedding_layers)]
])

outputs = layers(x)
model = keras.Model(inputs, outputs)
model.compile(loss=loss_fn, optimizer=optimizer, metrics=[metric])
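As a sanity check on the concatenation above, the width of the vector fed to the first dense layer can be worked out by hand. The sketch below copies the output widths from the `embedding_layers` list (the `output_dim` of each `Embedding`, the `num_tokens` of each `CategoryEncoding`); the variable names are hypothetical.

```python
# (kind, width) for each of the 26 categorical features, in the order used above.
# "emb" width is the Embedding output_dim; "onehot" width is num_tokens.
categorical_widths = [
    ("emb", 70), ("emb", 50), ("emb", 100), ("emb", 100), ("emb", 50),
    ("onehot", 15), ("emb", 100), ("emb", 50), ("onehot", 3), ("emb", 100),
    ("emb", 70), ("emb", 100), ("emb", 70), ("onehot", 26), ("emb", 70),
    ("emb", 100), ("onehot", 10), ("emb", 70), ("emb", 70), ("onehot", 3),
    ("emb", 100), ("onehot", 15), ("onehot", 15), ("emb", 100), ("onehot", 69),
    ("emb", 100),
]
n_integer_features = 13
hidden_units = 500

# Width of the concatenated input: integer features plus all categorical outputs.
concat_width = n_integer_features + sum(width for _, width in categorical_widths)
print(concat_width)  # 1639

# Weights plus biases of the first dense layer.
first_layer_params = concat_width * hidden_units + hidden_units
print(first_layer_params)  # 820000
```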

To set up the experiments, we first need to parse the dataset and package it as a TensorFlow dataset. We load everything into memory as NumPy arrays, then wrap them in a `tf.data.Dataset`.

In [63]:
data = np.load('/Users/benitogeordie/Desktop/thirdai_datasets/criteo/kaggleAdDisplayChallenge_processed.npz')

X_cat = data['X_cat'].astype(np.int32)
X_int = data['X_int'].astype(np.int32)
y = data['y']
counts = data['counts']

start_idx = np.zeros(len(counts)+1, dtype=np.int32)
start_idx[1:] = np.cumsum(counts)

idxs = np.arange(y.shape[0])
np.random.shuffle(idxs)

n_train = int(len(idxs)*0.8)
n_test = y.shape[0]-n_train

train_idxs = idxs[:n_train]
test_idxs = idxs[n_train:]

X_cat_train = X_cat[train_idxs]
X_cat_test = X_cat[test_idxs]

X_int_train = X_int[train_idxs]
X_int_test = X_int[test_idxs]

y_train = y[train_idxs]
y_test = y[test_idxs]
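The split above shuffles with NumPy's global random state, so the train/test partition changes from run to run. A minimal sketch of a seeded alternative follows, assuming a hypothetical helper `split_indices` and an arbitrary seed. It uses `np.random.default_rng`, which keeps the randomness local to one generator.

```python
import numpy as np

def split_indices(n_examples, train_fraction=0.8, seed=0):
    """Return (train_idxs, test_idxs) from a seeded permutation of range(n_examples)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_examples)
    n_train = int(n_examples * train_fraction)
    return perm[:n_train], perm[n_train:]

# Tiny demonstration on 10 examples; the same seed always gives the same split.
train_idxs_demo, test_idxs_demo = split_indices(10)
print(train_idxs_demo, test_idxs_demo)
```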

In [145]:
x_train = np.concatenate((X_int_train, X_cat_train), axis=1)
x_test = np.concatenate((X_int_test, X_cat_test), axis=1)


In [65]:
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))

In [67]:
# train_path = '/Users/benitogeordie/Desktop/thirdai_datasets/criteo/train_shuf.txt' # TODO: Always make sure this is correct before running
# test_path = '/Users/benitogeordie/Desktop/thirdai_datasets/criteo/test_shuf.txt' # TODO: Always make sure this is correct before running

# def load_examples_and_labels(criteo_path):
#     # Accumulate rows in lists: np.append returns a new array instead of modifying in place.
#     examples = []
#     labels = []

#     with open(criteo_path) as f:
#         for line in f:
#             itms = line.rstrip('\n').split(' ')
#             labels.append([np.int32(itms[0])])
#             examples.append([np.int32(itm) if itm != '' else 0 for itm in itms[1:]])

#     return (np.array(examples, dtype=np.int32), np.array(labels, dtype=np.int32))

# train_dataset = tf.data.Dataset.from_tensor_slices(load_examples_and_labels(train_path))
# test_dataset = tf.data.Dataset.from_tensor_slices(load_examples_and_labels(test_path))

Since the data is unbalanced, we need to check the distribution of positive vs. negative examples. Suppose 75% of the examples are negative. Then a model that predicts "not clicked" for everything already gets 75% accuracy, so a 70% accuracy would be worse than this trivial baseline. To be useful, the model must beat the majority-class rate.

In [68]:
def get_percent_neg(labels):
    n_examples = labels.shape[0]
    negatives = n_examples - np.count_nonzero(labels)
    percent_negative = 100 * negatives / n_examples
    return percent_negative

print(f"Train: {get_percent_neg(y_train)}% negative.")
print(f"Test: {get_percent_neg(y_test)}% negative.")


Train: 74.37850761877574% negative.
Test: 74.37427766029343% negative.
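To turn the class balance into concrete targets, the sketch below computes two trivial baselines: the accuracy of always predicting the majority class, and the cross entropy of always predicting the base click rate. The function name and the counts are hypothetical; the counts were picked only to roughly match the ~74.4% negative rate printed above, and the function assumes both classes are present.

```python
import math

def majority_baseline(n_negative: int, n_total: int):
    """Return (accuracy, log loss) of constant predictors for a binary label."""
    p_pos = 1 - n_negative / n_total
    # Always predicting the more frequent class.
    accuracy = max(p_pos, 1 - p_pos)
    # Always predicting the base rate p_pos as the click probability.
    log_loss = -(p_pos * math.log(p_pos) + (1 - p_pos) * math.log(1 - p_pos))
    return accuracy, log_loss

# Hypothetical counts, not taken from the dataset.
baseline_acc, baseline_loss = majority_baseline(n_negative=744, n_total=1000)
print(baseline_acc, baseline_loss)
```

A trained model is only informative if its accuracy is above the first number and its binary cross entropy is below the second.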


Let's give it a run.

In [69]:
# batch_size = 256
# train_batches = train_dataset.batch(batch_size)
# test_batches = test_dataset.batch(batch_size)

# for i in range(10):
#     print(f"Epoch {i + 1}/10")
#     model.fit(train_batches)
#     model.evaluate(test_batches)


Now what if we instead diversity-sampled the input?

In [70]:
class MinHash:
    def __init__(self, r_repetitions: int, h_hashes_per_table: int, b_buckets: int, seed: int=314152):
        num_hashes = r_repetitions * h_hashes_per_table
        g = tf.random.Generator.from_seed(seed)
        self.a = g.uniform(shape=(1, num_hashes), dtype=tf.int64, minval=None)
        self.b = g.uniform(shape=(num_hashes,), dtype=tf.int64, minval=None)

        self.b_buckets = b_buckets
        self.table_shape = (r_repetitions, h_hashes_per_table)
        self.r_start_idxs = tf.constant([i * b_buckets for i in range(r_repetitions)], dtype=tf.int64)
    
    @tf.function
    def hash(self, tensor: tf.Tensor):
        vertical_tensor = tf.reshape(tensor, (-1, 1))
        hashes = (tf.matmul(vertical_tensor, self.a) + self.b) % self.b_buckets
        hashes = tf.reduce_min(hashes, axis=0)
        hashes = tf.reshape(hashes, self.table_shape)
        hashes = tf.as_string(hashes)
        hashes = tf.strings.reduce_join(hashes, axis=-1)
        return tf.cast(tf.strings.to_hash_bucket_fast(hashes, self.b_buckets), dtype=tf.int64) + self.r_start_idxs
    
    def summary(self):
        return
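For intuition about what the class above is built on, here is a minimal pure-Python sketch of textbook MinHash. It is not the same pipeline as the `MinHash` class (which also reduces hashes modulo the bucket count, groups several hashes per table, and rehashes the group). All names, the modulus, and the seed are assumptions made for this illustration. The idea is that two sets agree on a given min-hash with probability roughly equal to their Jaccard similarity.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, used as the modulus of the linear hash functions

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining hash functions h(x) = (a * x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]

def minhash_signature(items, params):
    """For each hash function, keep the minimum hash value over the set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)

params = make_hash_params(num_hashes=256)
set_a = {1, 2, 3, 8}
set_b = {7, 2, 3, 8}
sig_a = minhash_signature(set_a, params)
sig_b = minhash_signature(set_b, params)

# The true Jaccard similarity of these sets is 3/5; the estimate should land near it.
estimate = estimated_jaccard(sig_a, sig_b)
print(estimate)
```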





In [122]:
class Race:
    def __init__(self, r_repetitions: int, b_buckets: int, h_hashes_per_table: int):
        self.arrays = tf.Variable(np.zeros(shape=(r_repetitions * b_buckets)), dtype=tf.float64)
        self.hash = MinHash(r_repetitions, h_hashes_per_table, b_buckets).hash #tf.function()

    @tf.function
    def query(self, tensor: tf.Tensor):
        hashes = self.hash(tensor)
        return tf.reduce_mean(tf.gather(self.arrays, hashes))

    @tf.function
    def index(self, tensor: tf.Tensor):
        hashes = tf.reshape(self.hash(tensor), (-1, 1))
        self.arrays.assign(tf.tensor_scatter_nd_add(self.arrays, hashes, tf.ones(shape=hashes.shape[0], dtype=tf.float64)))

    @tf.function
    def index_and_query(self, tensor: tf.Tensor):
        hashes = tf.reshape(self.hash(tensor), (-1, 1))
        self.arrays.assign(tf.tensor_scatter_nd_add(self.arrays, hashes, tf.ones(shape=hashes.shape[0], dtype=tf.float64)))
        return tf.reduce_mean(tf.gather(self.arrays, hashes))
    
    @tf.function
    def query_and_index(self, tensor: tf.Tensor):
        hashes = tf.reshape(self.hash(tensor), (-1, 1))
        result = tf.reduce_mean(tf.gather(self.arrays, hashes))
        self.arrays.assign(tf.tensor_scatter_nd_add(self.arrays, hashes, tf.ones(shape=hashes.shape[0], dtype=tf.float64)))
        return result
    
    def summary(self):
        # Mean
        # Stdev
        # Num zeros
        # Nonzero min
        # Max
        return

race = Race(10, 1000, 2)
tensor1 = tf.constant([1,2,3,8], dtype=tf.int64)
tensor2 = tf.constant([7,2,3,8], dtype=tf.int64)
# index = tf.function(race.index)
# query = tf.function(race.query)
race.index(tensor1)


In [111]:
race.index(tensor1)
race.index(tensor1)
race.index(tensor1)
race.index(tensor1)
race.index(tensor2)
race.index(tensor2)
race.index(tensor2)
race.index(tensor2)
race.index(tensor2)
race.index(tensor2)
race.index(tensor2)
print(race.query(tensor1))
print(race.query(tensor2))

tf.Tensor(6.8, shape=(), dtype=float64)
tf.Tensor(8.6, shape=(), dtype=float64)
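The counting logic behind `index` and `query` can be sketched without TensorFlow. The toy class below is an assumption-laden simplification, not the `Race` class above: it uses one min-hash per repetition, plain Python lists for the counters, and hypothetical names throughout. Indexing increments one counter per repetition, and a query returns the mean of the counters the item hashes to.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, modulus of the linear hash functions

class TinyRace:
    """Toy RACE-style sketch: one row of counters and one min-hash per repetition."""

    def __init__(self, repetitions, buckets, seed=0):
        rng = random.Random(seed)
        self.buckets = buckets
        self.params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
                       for _ in range(repetitions)]
        self.counts = [[0] * buckets for _ in range(repetitions)]

    def _slots(self, items):
        # One min-hash per repetition, folded into the row's bucket range.
        return [min((a * x + b) % PRIME for x in items) % self.buckets
                for a, b in self.params]

    def index(self, items):
        """Increment the counter this item hashes to in every repetition."""
        for row, slot in zip(self.counts, self._slots(items)):
            row[slot] += 1

    def query(self, items):
        """Mean counter value over repetitions: a density estimate around the item."""
        slots = self._slots(items)
        return sum(row[slot] for row, slot in zip(self.counts, slots)) / len(slots)

toy_race = TinyRace(repetitions=10, buckets=1000)
for _ in range(4):
    toy_race.index([1, 2, 3, 8])
print(toy_race.query([1, 2, 3, 8]))  # 4.0
```

Once a second, similar item is indexed, it shares counters with the first wherever their min-hashes collide, so a query can exceed the number of times that exact item was inserted.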


In [146]:
# Quantize the continuous features
# Make the mapping function
def quantize(columns: np.ndarray, bin_widths: np.ndarray, n_bins: int):
    """
    Quantize into bins but keep some notion of locality sensitivity.

    Each column is binned n_bins times, with the k-th copy shifted by
    k * bin_width // n_bins, so nearby values tend to share bin indices.
    Returns an array of shape (n_rows, n_cols * n_bins).
    """
    # TODO: Should I allow arrays of bin_widths and n_bins because each feature has a different range?
    assert len(columns.shape) == 2
    n_cols = columns.shape[1]
    # Repeat every value n_bins times along a new last axis: (n_rows, n_cols, n_bins).
    new_arr = np.reshape(columns, columns.shape + (1,))
    new_arr = np.repeat(new_arr, n_bins, axis=2)

    bin_idxs = np.arange(n_bins)
    # Per-column, per-copy offsets of shape (n_cols, n_bins).
    add = np.reshape(bin_idxs, (1, n_bins)) * np.reshape(bin_widths, (n_cols, 1)) // n_bins
    new_arr = new_arr + add
    new_arr = new_arr // np.reshape(bin_widths, (n_cols, 1))
    new_arr[new_arr < 0] = 0
    # Return a NumPy array so the result can be concatenated with other NumPy columns.
    return np.reshape(new_arr, (-1, n_cols * n_bins))

def separate_domains(columns: np.ndarray):
    """
    Separate domains by interleaving them (instead of shifting by domain ranges).
    This makes it range-agnostic: column j only produces values congruent to j modulo n_domains.
    """
    assert len(columns.shape) == 2
    n_domains = columns.shape[1]
    idx_shifts = np.arange(n_domains)
    return columns * n_domains + idx_shifts

# print(quantize(np.array([[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]), np.array([2, 2]), 2))
# print(separate_domains(quantize(np.array([[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]), np.array([2, 2]), 2)))
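To see what the two functions above do to individual values, here is a small pure-Python sketch of the same arithmetic on scalars. The helper names `overlapping_bins` and `interleave` are hypothetical stand-ins, not part of the notebook.

```python
def overlapping_bins(value, bin_width, n_bins):
    """Bin `value` n_bins times, shifting the k-th copy by k * bin_width // n_bins."""
    return [max((value + k * bin_width // n_bins) // bin_width, 0) for k in range(n_bins)]

def interleave(row):
    """Give column j the values congruent to j modulo the number of columns."""
    n_domains = len(row)
    return [value * n_domains + j for j, value in enumerate(row)]

# With bin_width=2 and n_bins=2, adjacent integers always share one of their two bins.
for value in range(6):
    bins = overlapping_bins(value, bin_width=2, n_bins=2)
    print(value, bins, interleave(bins))
```

Because the shifted copies overlap, two values that straddle a bin boundary in one copy still land in the same bin in another, which is what preserves locality. After interleaving, equal bin indices from different columns no longer map to the same integer.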

In [151]:
int_cols = x_train[:, :13]
print(int_cols.shape)  # (36672493, 13) in the recorded run
# Bin widths come from the 13 integer columns only; clamp to 1 so a small column maximum never yields a zero width.
bin_widths = np.maximum(np.max(int_cols, axis=0) // 100, 1)
x_train_int_binned = quantize(int_cols, bin_widths=bin_widths, n_bins=5)
print(x_train_int_binned.shape)  # one row per example, 13 * 5 = 65 columns
# Join the binned integer features and the categorical features column-wise.
x_train = separate_domains(np.concatenate((x_train_int_binned, x_train[:, 13:]), axis=1))

## Future Directions
Consider:
- Reweighting vs. no reweighting vs. variable reweighting
- Even distribution across clusters vs. prioritizing hard samples (use for max-likelihood)

- Diversification of results
- Proving something about how RACE maximizes a diversity metric
- Expert selection
- Other anomaly detection ideas
- RACE in place of convolution filters. How complex is a convolution filter? Instead of multiplying with each filter, we can use RACE to match with the most relevant filters using just a few hash computations, allowing efficient inference with many filters. Could this allow larger patches or kernels, or fewer convolutions?

Experiments should be reproducible by running an executable:
- It's OK to write a separate executable for each dataset.
- One bash script for each experiment.