## xDeepFM: the eXtreme Deep Factorization Machine

This notebook gives a quick example of how to train an xDeepFM model. xDeepFM [1] is a deep learning-based model that aims to capture both lower- and higher-order feature interactions for precise recommender systems. Because it learns feature interactions directly from the data, it can substantially reduce manual feature engineering effort. To summarize, xDeepFM has the following key properties:

- It contains a component, named CIN (Compressed Interaction Network), that learns feature interactions explicitly and at the vector-wise level.
- It contains a traditional DNN component that learns feature interactions implicitly and at the bit-wise level.
- The implementation makes this model quite configurable. We can enable different subsets of components by setting hyperparameters such as `use_Linear_part`, `use_FM_part`, `use_CIN_part`, and `use_DNN_part`. For example, enabling only `use_Linear_part` and `use_FM_part` yields a classical FM model.
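
To make the vector-wise interaction of the CIN component more concrete, here is a minimal pure-Python sketch of a single CIN layer, following the formulation in the xDeepFM paper [1]. It is an illustration only, not the `reco_utils` implementation; the function name `cin_layer` and the nested-list layout of the inputs are assumptions chosen for this example.

```python
def cin_layer(x0, x_prev, weights):
    """One CIN layer (illustrative sketch, not the reco_utils code).

    x0:      m field embeddings, each a list of D floats
    x_prev:  H_prev feature maps from the previous layer, each of length D
    weights: H_next matrices of shape H_prev x m, one per output feature map
    returns: H_next feature maps, each of length D
    """
    dim = len(x0[0])
    out = []
    for w_h in weights:
        row = [0.0] * dim
        for i, prev_vec in enumerate(x_prev):
            for j, field_vec in enumerate(x0):
                w = w_h[i][j]
                # Hadamard product of two whole vectors, scaled by one weight.
                # Positions along the embedding dimension are never mixed,
                # which is why the interaction is called "vector-wise".
                for d in range(dim):
                    row[d] += w * prev_vec[d] * field_vec[d]
        out.append(row)
    return out


# Two fields with embedding size 2; the first layer uses x0 as x_prev.
x0 = [[1.0, 2.0], [3.0, 4.0]]
weights = [[[1.0, 0.0], [0.0, 1.0]]]   # a single output feature map
x1 = cin_layer(x0, x0, weights)
# Sum pooling over the embedding dimension gives one scalar per feature map.
pooled = [sum(feature_map) for feature_map in x1]
print(x1, pooled)
```

Stacking several such layers, each taking the previous layer's output as `x_prev`, raises the interaction order by one per layer.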


In this notebook, we test xDeepFM on two datasets: 1) a small synthetic dataset and 2) the Criteo dataset.

In [2]:
import sys
sys.path.append("../../")
import os
import papermill as pm
from tempfile import TemporaryDirectory

import tensorflow as tf

from reco_utils.common.constants import SEED
from reco_utils.recommender.deeprec.deeprec_utils import (
    download_deeprec_resources, prepare_hparams
)
from reco_utils.recommender.deeprec.models.xDeepFM import XDeepFMModel
from reco_utils.recommender.deeprec.IO.iterator import FFMTextIterator

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

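# Temporary working directory for the downloaded data and model outputs
tmpdir = TemporaryDirectory()

# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')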

System version: 3.6.10 |Anaconda, Inc.| (default, Jan  7 2020, 15:18:16) [MSC v.1916 64 bit (AMD64)]
Tensorflow version: 1.12.0


In [3]:
EPOCHS_FOR_SYNTHETIC_RUN = 15
EPOCHS_FOR_CRITEO_RUN = 30
BATCH_SIZE_SYNTHETIC = 128
BATCH_SIZE_CRITEO = 4096
RANDOM_SEED = SEED  # Set None for non-deterministic result

## 1. Synthetic data

Let's start with a small synthetic dataset. It has 10 fields and 1000 features, and each label is generated from the result of a set of preset pair-wise feature interactions.
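
The notebook does not show how the synthetic files were produced. As a rough illustration of the idea, the sketch below generates one example in FFM text format (`<label> <field>:<feature>:<value> ...`). Everything specific here is an assumption made for the example: features are split evenly across fields, each field has exactly one active feature, ids are 1-based, and the label is 1 when any preset feature pair co-occurs. The names `make_example` and `positive_pairs` are hypothetical.

```python
import random

NUM_FIELDS = 10
NUM_FEATURES = 1000
FEATURES_PER_FIELD = NUM_FEATURES // NUM_FIELDS  # assumption: even split


def make_example(rng, positive_pairs):
    """Return one line in FFM text format."""
    # Assumption: exactly one active feature per field, with 1-based ids.
    active = [
        field * FEATURES_PER_FIELD + rng.randrange(FEATURES_PER_FIELD) + 1
        for field in range(NUM_FIELDS)
    ]
    active_set = set(active)
    # Assumption: the label is 1 when any preset feature pair co-occurs.
    label = int(any(a in active_set and b in active_set
                    for a, b in positive_pairs))
    tokens = ["{}:{}:1".format(field + 1, feat)
              for field, feat in enumerate(active)]
    return "{} {}".format(label, " ".join(tokens))


rng = random.Random(0)
print(make_example(rng, positive_pairs=[(5, 150), (320, 990)]))
```

Because the label depends only on pairs of features, a model has to learn second-order interactions to do better than chance on data of this kind.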

In [4]:
data_path = tmpdir.name
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
train_file = os.path.join(data_path, r'synthetic_part_0')
valid_file = os.path.join(data_path, r'synthetic_part_1')
test_file = os.path.join(data_path, r'synthetic_part_2')
output_file = os.path.join(data_path, r'output.txt')

if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')

100%|██████████| 10.3k/10.3k [00:04<00:00, 2.25kKB/s]


### 1.1 Prepare hyper-parameters

`prepare_hparams()` creates a full set of hyper-parameters for model training, such as learning rate, feature number, and dropout ratio. We can put those parameters in a yaml file, or pass them as keyword arguments to the function (which overrides the yaml settings).
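
The precedence rule (keyword arguments win over file settings) can be pictured with a small stand-alone sketch. This is not how `prepare_hparams` is implemented; `merge_hparams` is a hypothetical helper, and the dictionary values are made-up stand-ins for what a yaml file might contain.

```python
def merge_hparams(file_settings, **overrides):
    """Combine settings loaded from a config file with keyword overrides."""
    merged = dict(file_settings)   # copy so the input is left untouched
    merged.update(overrides)       # keyword arguments win on conflicts
    return merged


# Hypothetical values standing in for the contents of a yaml file.
file_settings = {"learning_rate": 0.01, "batch_size": 256, "dropout": [0.0, 0.0]}
hparams_sketch = merge_hparams(file_settings, learning_rate=0.001, epochs=15)
print(hparams_sketch)
```

Settings that appear only in the file are kept, settings that appear only as keyword arguments are added, and settings that appear in both take the keyword value.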

In [5]:
hparams = prepare_hparams(yaml_file, 
                          FEATURE_COUNT=1000, 
                          FIELD_COUNT=10, 
                          cross_l2=0.0001, 
                          embed_l2=0.0001, 
                          learning_rate=0.001, 
                          epochs=EPOCHS_FOR_SYNTHETIC_RUN,
                          batch_size=BATCH_SIZE_SYNTHETIC)
print(hparams)

[('DNN_FIELD_NUM', None), ('EARLY_STOP', 100), ('FEATURE_COUNT', 1000), ('FIELD_COUNT', 10), ('L', None), ('MODEL_DIR', None), ('PAIR_NUM', None), ('SUMMARIES_DIR', None), ('T', None), ('activation', ['relu', 'relu']), ('att_fcn_layer_sizes', None), ('attention_activation', None), ('attention_dropout', 0.0), ('attention_layer_sizes', None), ('attention_size', None), ('batch_size', 128), ('cate_embedding_dim', None), ('cate_vocab', None), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0001), ('cross_layer_sizes', [1]), ('cross_layers', None), ('data_format', 'ffm'), ('dim', 10), ('doc_size', None), ('dropout', [0.0, 0.0]), ('dtype', 32), ('embed_l1', 0.0), ('embed_l2', 0.0001), ('embedding_dropout', 0.3), ('enable_BN', False), ('entityEmb_file', None), ('entity_dim', None), ('entity_embedding_method', None), ('entity_size', None), ('epochs', 15), ('fast_CIN_d', 0), ('filter_sizes', None), ('hidden_size', None), ('init_method', 'tnormal'), ('init_value', 0.3), ('is_c

In [6]:
input_creator = FFMTextIterator

In [7]:
model = XDeepFMModel(hparams, input_creator, seed=RANDOM_SEED)

# Sometimes we don't want to train a model from scratch;
# in that case we can load a pre-trained model like this:
# model.load_model(r'your_model_path')


Add CIN part.


In [8]:
print(model.run_eval(test_file))

{'auc': 0.5043, 'logloss': 0.7515}
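
An untrained model scores an AUC close to 0.5, which is what a random ranking gives. For readers unfamiliar with the two metrics, the following is a minimal pure-Python sketch of how AUC and logloss can be computed for binary labels. It is written for clarity, not speed, and is not the code that `run_eval` uses; the function names are chosen for this example.

```python
import math


def logloss(labels, preds, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def auc(labels, preds):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(labels, preds) if y == 1]
    neg = [p for y, p in zip(labels, preds) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
preds = [0.1, 0.4, 0.35, 0.8]
print("auc:", auc(labels, preds))          # 3 of 4 pairs are ordered correctly
print("logloss:", round(logloss(labels, preds), 4))
```

The pairwise loop is quadratic in the number of examples, so this form is only suitable for small inputs; sort-based implementations are the usual choice for real evaluation sets.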


In [9]:
model.fit(train_file, valid_file)

at epoch 1
train info: auc:0.5189, logloss:0.7006
eval info: auc:0.504, logloss:0.7042
at epoch 1 , train time: 7.5 eval time: 7.0
at epoch 2
train info: auc:0.5365, logloss:0.6919
eval info: auc:0.5066, logloss:0.6973
at epoch 2 , train time: 6.9 eval time: 6.9
at epoch 3
train info: auc:0.5552, logloss:0.6884
eval info: auc:0.5099, logloss:0.6953
at epoch 3 , train time: 6.7 eval time: 7.2
at epoch 4
train info: auc:0.5784, logloss:0.6848
eval info: auc:0.5147, logloss:0.6946
at epoch 4 , train time: 6.9 eval time: 7.4
at epoch 5
train info: auc:0.6104, logloss:0.6783
eval info: auc:0.523, logloss:0.6941
at epoch 5 , train time: 6.9 eval time: 6.9
at epoch 6
train info: auc:0.6582, logloss:0.6625
eval info: auc:0.5416, logloss:0.6929
at epoch 6 , train time: 7.2 eval time: 7.0
at epoch 7
train info: auc:0.7318, logloss:0.6204
eval info: auc:0.5916, logloss:0.6831
at epoch 7 , train time: 7.1 eval time: 6.7
at epoch 8
train info: auc:0.83, logloss:0.5231
eval info: auc:0.7024, logloss

<reco_utils.recommender.deeprec.models.xDeepFM.XDeepFMModel at 0x234e54c7dd8>

In [10]:
res_syn = model.run_eval(test_file)
print(res_syn)
pm.record("res_syn", res_syn)

{'auc': 0.9716, 'logloss': 0.2278}


In [11]:
model.predict(test_file, output_file)

<reco_utils.recommender.deeprec.models.xDeepFM.XDeepFMModel at 0x234e54c7dd8>
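
To inspect the predictions afterwards, the scores can be read back from disk. The sketch below assumes the output file holds one predicted score per line, which is an assumption about the file layout and not something shown in this notebook. `read_scores` is a hypothetical helper, and the demo writes its own stand-in file so that it runs without the model.

```python
import os
import tempfile


def read_scores(path):
    """Read one float per non-empty line."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


# Self-contained demo using a stand-in file, not the real model output.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "output.txt")
with open(demo_path, "w") as f:
    f.write("0.91\n0.08\n0.55\n")

scores = read_scores(demo_path)
# The 0.5 threshold is an arbitrary choice for illustration.
predicted_labels = [int(s >= 0.5) for s in scores]
print(scores, predicted_labels)
```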

## 2. Criteo data

In [13]:
print('demo with Criteo dataset')
hparams = prepare_hparams(yaml_file, 
                          FEATURE_COUNT=2300000, 
                          FIELD_COUNT=39, 
                          cross_l2=0.01, 
                          embed_l2=0.01, 
                          layer_l2=0.01,
                          learning_rate=0.002, 
                          batch_size=BATCH_SIZE_CRITEO, 
                          epochs=EPOCHS_FOR_CRITEO_RUN, 
                          cross_layer_sizes=[20, 10], 
                          init_value=0.1, 
                          layer_sizes=[20,20],
                          use_Linear_part=True, 
                          use_CIN_part=True, 
                          use_DNN_part=True)

train_file = os.path.join(data_path, r'cretio_tiny_train')
valid_file = os.path.join(data_path, r'cretio_tiny_valid')
test_file = os.path.join(data_path, r'cretio_tiny_test')

demo with Criteo dataset


In [14]:
model = XDeepFMModel(hparams, FFMTextIterator, seed=RANDOM_SEED)

# check the predictive performance before the model is trained
print(model.run_eval(test_file)) 
model.fit(train_file, valid_file)
# check the predictive performance after the model is trained
res_real = model.run_eval(test_file)
print(res_real)
pm.record("res_real", res_real)

Add linear part.
Add CIN part.
Add DNN part.
{'auc': 0.4728, 'logloss': 0.7113}
at epoch 1
train info: auc:0.6648, logloss:0.5347
eval info: auc:0.6637, logloss:0.5342
at epoch 1 , train time: 107.1 eval time: 58.4
at epoch 2
train info: auc:0.7155, logloss:0.51
eval info: auc:0.7137, logloss:0.5109
at epoch 2 , train time: 94.8 eval time: 58.5
at epoch 3
train info: auc:0.7331, logloss:0.5012
eval info: auc:0.7283, logloss:0.5037
at epoch 3 , train time: 94.4 eval time: 58.4
at epoch 4
train info: auc:0.7412, logloss:0.4964
eval info: auc:0.7359, logloss:0.4991
at epoch 4 , train time: 96.8 eval time: 59.3
at epoch 5
train info: auc:0.7456, logloss:0.4937
eval info: auc:0.74, logloss:0.4963
at epoch 5 , train time: 98.3 eval time: 59.5
at epoch 6
train info: auc:0.7487, logloss:0.4919
eval info: auc:0.7426, logloss:0.4946
at epoch 6 , train time: 97.8 eval time: 59.1
at epoch 7
train info: auc:0.7505, logloss:0.4907
eval info: auc:0.7441, logloss:0.4934
at epoch 7 , train time: 100.0 

In [None]:
tmpdir.cleanup()