# Click-prediction with XDeepFM

(PROPOSAL: Start with what is the problem we are addressing and why the user should care)

In this notebook we are going to analyze an example of collaborative filtering using the Microsoft Research algorithm xDeepFM ([Paper](https://arxiv.org/abs/1803.05170)). For it, we are going to use the [dataset CRITEO](https://www.kaggle.com/c/criteo-display-ad-challenge/data), which contains:

- Label - Target variable that indicates if an ad was clicked (1) or not (0).
- I1-I13 - A total of 13 columns of integer features (mostly count features).
- C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes. 

The training set consists of a portion of Criteo's traffic over a period of 7 days. Each row corresponds to a display ad served by Criteo. Positive (clicked) and negatives (non-clicked) examples have both been subsampled at different rates in order to reduce the dataset size. The examples are chronologically ordered. Label - Target variable that indicates if an ad was clicked (1) or not (0).

An algorithm like xDeepFM can be beneficial for problems of click optimization, where the objective is to maximize the CTR (TODO: link to CTR info). The evaluation metrics that we are going to use are regression metrics like RMSE, AUC or logloss.

## xDeepFM

Combinatorial features are essential for the success of many com- mercial models. Manually crafting these features usually
comes with high cost due to the variety, volume and velocity of raw data in web-scale systems. Factorization based models,
which measure interactions in terms of vector product, can learn patterns of com- binatorial features automatically and 
generalize to unseen features as well. With the great success of deep neural works (DNNs) in various fields, recently 
researchers have proposed several DNN- based factorization model to learn both low- and high-order feature interactions.
Despite the powerful ability of learning an arbitrary function from data, plain DNNs generate feature interactions im- 
plicitly and at the bit-wise level. In this paper, we propose a novel Compressed Interaction Network (CIN), which aims 
to generate feature interactions in an explicit fashion and at the vector-wise level. We show that the CIN share some 
functionalities with con- volutional neural networks (CNNs) and recurrent neural networks (RNNs). We further combine a 
CIN and a classical DNN into one unified model, and named this new model eXtreme Deep Factor- ization Machine (xDeepFM). 
On one hand, the xDeepFM is able to learn certain bounded-degree feature interactions explicitly; on the other hand, it
can learn arbitrary low- and high-order feature interactions implicitly. We conduct comprehensive experiments on three
real-world datasets. Our results demonstrate that xDeepFM outperforms state-of-the-art models.

In [34]:
%load_ext blackcellmagic

The blackcellmagic extension is already loaded. To reload it, use:
  %reload_ext blackcellmagic


In [None]:
max_rows = 1000000
data_url = "https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz"

In [42]:
import os
import sys
import collections
import csv
import math
import random
import time
from collections import defaultdict
from machine_utils import (
    get_gpu_name,
    get_number_processors,
    get_gpu_memory,
    get_cuda_version,
)

import numpy as np
import pandas as pd
from urllib.request import urlretrieve
import tarfile
import tensorflow as tf

sys.path.append("..")
sys.path.append("xDeepFM/exdeepfm")

import config_utils
import utils.util as util
import utils.metric as metric
import train

from train import cache_data, run_eval, run_infer, create_train_model
from utils.log import Log
from src.exDeepFM import ExtremeDeepFMModel

import utilities

print("OS: ", sys.platform)
print("Python: ", sys.version)
print("Numpy: ", np.__version__)
print("Number of CPU processors: ", get_number_processors())
# breaks built on CPU/mac
# print("GPU: ", get_gpu_name())
# print("GPU memory: ", get_gpu_memory())
# print("CUDA: ", get_cuda_version())

# runtime checks
util.check_tensorflow_version()
util.check_and_mkdir()

%matplotlib inline
%load_ext autoreload
%autoreload 2

OS:  linux
Python:  3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) 
[GCC 7.2.0]
Numpy:  1.15.1
Number of CPU processors:  24
GPU:  ['TITAN V', 'TITAN V', 'TITAN V']
GPU memory:  ['12065 MiB', '12066 MiB', '12066 MiB']
CUDA:  CUDA Version 9.1.85
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [43]:
# Parameters
# TODO
# T=4 # cut-off for minimum counts
# nrows=10000 # limit the data to reduce runtime

# supplied files come without header
fieldnames = ['Label', \
             'I1', 'I2', 'I3', 'I4', 'I5', 'I6', 'I7', 'I8', 'I9', 'I10', 'I11', 'I12', 'I13', 'C1', 'C2', 'C3', 'C4', \
             'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', \
             'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26']

### Dataset retrieval

See [Criteo](https://www.kaggle.com/c/criteo-display-ad-challenge/data) for licencing. 

In [None]:
# Download dataset
urlretrieve(
   data_url,
   "dac.tar.gz",
   lambda count, blockSize, totalSize: sys.stdout.write(
       "\rDownloading...%d%%" % int(count * blockSize * 100 / totalSize)
   )
)

print("\n\nExtracing data")
with tarfile.open("dac.tar.gz", "r:gz") as tar:
    tar.extractall("data")

train_head = pd.read_csv("data/train.txt", names=fieldnames, sep="\t", nrows=5)
train_head


### Data preparation

In [55]:
# create data staging directory
utilities.mkdir_safe("data_prep")

full_data = "data_prep/full.txt"
full_data_ffm = "data_prep/full.ffm"

utilities.split_files("data/train.txt", [full_data], [1], max_rows=max_rows)

feat_cnt = defaultdict(lambda: 0)

for row in csv.DictReader(open(full_data), fieldnames=fieldnames, delimiter="\t"):
    for key, val in row.items():
        if "C" in key:
            if val == "":
                feat_cnt[str(key) + "#" + "absence"] += 1
            else:
                feat_cnt[str(key) + "#" + str(val)] += 1

print("Found %d features" % len(feat_cnt))


Found 43591 features


### Feature engineering

* Handle missing values
* Integers > 2: logarithmic transform (discriminate the small values [details](https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf))  
* Integers $\le$ 2: convert to categorical
* Categoricals below minimum threshold (=T): replace with categorical floor feature (**column name** # **feature count**)


In [56]:
def get_feature(key, val):
    if "I" in key and key != "Id":
        if val == "":
            # handle missing values
            return str(key) + "#" + "absence"
        else:
            val = int(val)
            if val > 2:
                # log transform + ^2 to discriminate small values
                val = int(math.log(float(val)) ** 2)
            else:
                # convert to categorical
                val = "SP" + str(val)
            return str(key) + "#" + str(val)

        if "C" in key:
            if val == "":
                # handle missing values
                return str(key) + "#" + "absence"
            else:
                return str(key) + "#" + str(val)
            if feat_cnt[feat] <= T:
                # group values with small frequencies
                return str(key) + "#" + str(feat_cnt[feat])

        raise ValueError("Unsupported key: '%s'" % key)


### Discover all categorical features

In [57]:
featSet = set()
label_cnt = defaultdict(lambda: 0)

for row in csv.DictReader(open(full_data), fieldnames=fieldnames, delimiter="\t"):
    for key, val in row.items():
        if key == "Label":
            label_cnt[str(val)] += 1
            continue

        feat = get_feature(key, val)
        featSet.add(feat)

rows = sum(label_cnt.values())
print(
    "%25s 0: %8d (%.1f%%) 1: %8d (%.1f%%)"
    % (
        full_data,
        label_cnt["0"],
        label_cnt["0"] * 100 / rows,
        label_cnt["1"],
        label_cnt["1"] * 100 / rows,
    )
)


       data_prep/full.txt 0:     7818 (78.2%) 1:     2182 (21.8%)


### Calculate feature and column statistics

In [58]:
featIndex = dict()
for index, feat in enumerate(featSet, start=1):
    featIndex[feat] = index
print("Categorical features count:", len(featIndex))

fieldIndex = dict()
fieldList = fieldnames[1:]

for index, field in enumerate(fieldList, start=1):
    fieldIndex[field] = index
print("Field count:", len(fieldIndex))


Categorical features count: 710
Field count: 39


### Convert to [ffm format](https://github.com/guestwalk/libffm) 

In [59]:
with open(full_data_ffm, "w") as out:
    for row in csv.DictReader(open(full_data), fieldnames=fieldnames, delimiter="\t"):
        feats = []

        for key, val in row.items():
            if key == "Label":
                feats.append(val)
                continue

            feat = get_feature(key, val)
            # lookup field index + lookup feature index
            feats.append(str(fieldIndex[key]) + ":" + str(featIndex[feat]) + ":1")

        out.write(" ".join(feats) + "\n")


#### FFM Format
One example entry "1:552:1" can be split into 

* field: 1
* feature index: 552
* feature value: 1

In [60]:
train_ffm_head = pd.read_csv(full_data_ffm, names=fieldnames, sep=' ', nrows=5)
train_ffm_head

Unnamed: 0,Label,I1,I2,I3,I4,I5,I6,I7,I8,I9,...,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26
0,0,1:352:1,2:151:1,3:167:1,4:455:1,5:273:1,6:535:1,7:646:1,8:111:1,9:38:1,...,30:607:1,31:607:1,32:607:1,33:607:1,34:607:1,35:607:1,36:607:1,37:607:1,38:607:1,39:607:1
1,0,1:264:1,2:304:1,3:416:1,4:300:1,5:475:1,6:190:1,7:281:1,8:111:1,9:669:1,...,30:607:1,31:607:1,32:607:1,33:607:1,34:607:1,35:607:1,36:607:1,37:607:1,38:607:1,39:607:1
2,0,1:264:1,2:304:1,3:700:1,4:220:1,5:633:1,6:461:1,7:233:1,8:111:1,9:320:1,...,30:607:1,31:607:1,32:607:1,33:607:1,34:607:1,35:607:1,36:607:1,37:607:1,38:607:1,39:607:1
3,0,1:133:1,2:394:1,3:21:1,4:32:1,5:464:1,6:53:1,7:63:1,8:359:1,9:351:1,...,30:607:1,31:607:1,32:607:1,33:607:1,34:607:1,35:607:1,36:607:1,37:607:1,38:607:1,39:607:1
4,0,1:518:1,2:602:1,3:21:1,4:455:1,5:616:1,6:37:1,7:233:1,8:359:1,9:351:1,...,30:607:1,31:607:1,32:607:1,33:607:1,34:607:1,35:607:1,36:607:1,37:607:1,38:607:1,39:607:1


### Split data into train, test and eval

In [61]:
train_file_ffm = "data_prep/train.ffm"
eval_file_ffm = "data_prep/eval.ffm"
test_file_ffm = "data_prep/test.ffm"

utilities.split_files(
    full_data_ffm, [train_file_ffm, test_file_ffm, eval_file_ffm], [0.8, 0.1, 0.1]
)


### Update configuration

In [62]:
# load network hyper-parameter
config = config_utils.load_yaml("config/exDeepFM.yaml")

# patch config to reflect the current data set
config["data"]["FEATURE_COUNT"] = len(featIndex)
config["data"]["FIELD_COUNT"] = len(fieldIndex)

config["data"]["train_file"] = train_file_ffm
config["data"]["eval_file"] = eval_file_ffm
config["data"]["test_file"] = test_file_ffm
del config["data"]["infer_file"]

# setup hparams
hparams = config_utils.create_hparams(config)
log = Log(hparams)
hparams.logger = log.logger


training network configuration file is config/exDeepFM.yaml


In [54]:
hparams

HParams([('DNN_FIELD_NUM', None), ('FEATURE_COUNT', 710), ('FIELD_COUNT', 39), ('PAIR_NUM', None), ('activation', ['relu', 'relu', 'relu', 'relu']), ('attention_activation', None), ('attention_layer_sizes', None), ('batch_size', 4096), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0), ('cross_layer_sizes', [100, 100, 50]), ('cross_layers', None), ('data_format', 'ffm'), ('dim', 10), ('dropout', [0.0, 0.0, 0.0, 0.0]), ('embed_l1', 0.0), ('embed_l2', 0.001), ('epochs', 10), ('eval_file', 'data_prep/eval.ffm'), ('infer_file', None), ('init_method', 'tnormal'), ('init_value', 0.1), ('layer_l1', 0.0), ('layer_l2', 0.001), ('layer_sizes', [400, 400, 400, 400]), ('learning_rate', 0.001), ('load_model_name', None), ('log', 'log'), ('logger', <Logger utils.log (INFO)>), ('loss', 'log_loss'), ('method', 'classification'), ('metrics', ['auc', 'logloss']), ('model_type', 'exDeepFM'), ('mu', None), ('n_item', None), ('n_item_attr', None), ('n_user', None), ('n_user_attr', None

### Create extreme deep FM network

In [64]:
cache_data(hparams, hparams.train_file, flag='train')
cache_data(hparams, hparams.eval_file, flag='eval')
cache_data(hparams, hparams.test_file, flag='test')

train_model = create_train_model(ExtremeDeepFMModel, hparams)

gpuconfig = tf.ConfigProto()
gpuconfig.gpu_options.allow_growth = True

tf.set_random_seed(1234)

train_sess = tf.Session(target='', graph=train_model.graph, config=gpuconfig)
train_sess.run(train_model.model.init_op)

print('Epochs: %d' % hparams.epochs)
print('total_loss = data_loss+regularization_loss, data_loss = logloss\n')

with tf.summary.FileWriter(util.SUMMARIES_DIR, train_sess.graph) as writer:
    last_eval = 0
    
    for epoch in range(hparams.epochs):
        print('Epoch %d' % epoch)
        step = 0
        train_sess.run(train_model.iterator.initializer, feed_dict={train_model.filenames: [hparams.train_file_cache]})
        epoch_loss = 0

        # TODO: collect timing infomration
        while True:
            try:
                # TODO: collect timing information 
                (_, step_loss, step_data_loss, summary) = train_model.model.train(train_sess)
                writer.add_summary(summary, step)
                
                epoch_loss += step_loss
                step += 1
                
                if step % hparams.show_step == 0:
                    print('Step {0:d}: total_loss: {1:.4f} data_loss: {2:.4f}' \
                          .format(step, step_loss, step_data_loss))
            except tf.errors.OutOfRangeError as e:
                break
    
        # TODO: do we need  model saving in between?
        if epoch % hparams.save_epoch == 0:
            checkpoint_path = train_model.model.saver.save(
                sess=train_sess,
                save_path=util.MODEL_DIR + 'epoch_' + str(epoch))

        eval_res = run_eval(train_model, train_sess, hparams.eval_file_cache, util.EVAL_NUM, hparams, flag='eval')
        test_res = run_eval(train_model, train_sess, hparams.test_file_cache, util.TEST_NUM, hparams, flag='test')
        
        print ('Train loss: %1.3f; Test loss: %1.3f; Eval loss: %1.3f auc: %0.3f\n' 
              % (epoch_loss / step, test_res['logloss'], eval_res['logloss'], eval_res['auc']))

        # early stopping
        if eval_res["auc"] - last_eval < - 0.003:
            break
        if eval_res["auc"] > last_eval:
            last_eval = eval_res["auc"]

cache filename: data_prep/train.ffm
has not cached file, begin cached...
caced file used time, 2s, Fri Sep 28 14:47:31 2018.
data sample num:7943
cache filename: data_prep/eval.ffm
has not cached file, begin cached...
caced file used time, 0s, Fri Sep 28 14:47:31 2018.
data sample num:1033
cache filename: data_prep/test.ffm
has not cached file, begin cached...
caced file used time, 0s, Fri Sep 28 14:47:32 2018.
data sample num:1024


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epochs: 10
total_loss = data_loss+regularization_loss, data_loss = logloss

Epoch 0
Train loss: 4.946; Test loss: 0.530; Eval loss: 0.537 auc: 0.582

Epoch 1
Train loss: 4.643; Test loss: 0.582; Eval loss: 0.596 auc: 0.596

Epoch 2
Train loss: 4.613; Test loss: 0.554; Eval loss: 0.565 auc: 0.605

Epoch 3
Train loss: 4.479; Test loss: 0.506; Eval loss: 0.512 auc: 0.607

Epoch 4
Train loss: 4.358; Test loss: 0.528; Eval loss: 0.532 auc: 0.610

Epoch 5
Train loss: 4.289; Test loss: 0.514; Eval loss: 0.519 auc: 0.614

Epoch 6
Train loss: 4.187; Test loss: 0.502; Eval loss: 0.510 auc: 0.618

Epoch 7
Train loss: 4.098; Test loss: 0.506; Eval loss: 0.515 auc: 0.621

Epoch 8
Train loss: 4.016; Test loss: 0.501; Eval loss: 0.510 auc: 0.624

Epoch 9
Train loss: 3.926; Test loss: 0.499; Eval loss: 0.507 auc: 0.628

