Atomu2014/product-nets

TensorFlow implementation of Product-based Neural Networks. An extended version is at https://github.com/Atomu2014/product-nets-distributed.

Product-based Neural Networks for User Response Prediction

Note: An extended version of the conference paper is available at https://arxiv.org/abs/1807.00311 and has been accepted by TOIS. Compared with this simple demo, a more detailed implementation of the journal paper is at https://github.com/Atomu2014/product-nets-distributed , which supports large-scale data access, multi-GPU training, and distributed training.

Note: If you have any problems, contact me at kevinqu@apex.sjtu.edu.cn or kevinqu16@gmail.com. I respond quickly to email.

This repository maintains the demo code for the paper Product-based Neural Network for User Response Prediction, along with other baseline models, implemented with TensorFlow. The paper was published at ICDM 2016.

Introduction to User Response Prediction

User response prediction plays a fundamental and crucial role in today's business, especially in personalized recommender systems and online display advertising. Unlike traditional machine learning tasks, user response prediction typically involves categorical features grouped by fields, which we call multi-field categorical data, e.g.:

ad. request={
'weekday': 3,
'hour': 18,
'IP': '255.255.255.255',
'domain': 'xxx.com',
'click': 1
}


In practice, these categorical features are usually one-hot encoded for training. However, this representation is very sparse. To cope with data sparsity, linear models (e.g., LR), latent factor-based models (e.g., FM, FFM), tree models (e.g., GBDT), and DNN models (e.g., FNN, DeepFM) have been proposed.
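
As a rough illustration of the one-hot encoding described above, the sketch below maps a multi-field record to the indices of its active positions in one global sparse vector. The field vocabularies, the `FIELDS` table, and the helper names are hypothetical and are not taken from this repository.

```python
# Hypothetical per-field vocabularies; a real pipeline would build these from data.
FIELDS = {
    "weekday": list(range(7)),
    "hour": list(range(24)),
    "domain": ["xxx.com", "yyy.com"],
}


def build_offsets(fields):
    """Give each field a contiguous block of positions in the global one-hot vector."""
    offsets, start = {}, 0
    for name, values in fields.items():
        offsets[name] = start
        start += len(values)
    return offsets, start


def encode(record, fields, offsets):
    """Return the index of the single active position for each field."""
    return [offsets[name] + fields[name].index(record[name]) for name in fields]


offsets, dim = build_offsets(FIELDS)
request = {"weekday": 3, "hour": 18, "domain": "xxx.com"}
active = encode(request, FIELDS, offsets)
print(dim, active)  # 33 [3, 25, 31]
```

Only 3 of the 33 positions are non-zero here, and the ratio shrinks further as fields such as IP or domain grow to millions of values.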

A core problem in user response prediction is how to represent complex feature interactions. Industrial applications prefer feature engineering and simple models. With GPU servers becoming more and more popular, it is promising to design complex models that explore feature interactions automatically. Through our analysis and experiments, we find a coupled gradient issue in latent factor-based models and an insensitive gradient issue in DNN models.

Take FM as an example: the gradient of each feature vector is the sum of the other feature vectors. Even if two features are independent, FM can hardly learn orthogonal vectors for them. The gradient issue of DNNs is discussed in the paper Failures of Gradient-based Deep Learning.
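
The coupling can be seen in a small sketch. Considering only the second-order FM term, the sum over i<j of <v_i, v_j> x_i x_j, the gradient with respect to v_i is x_i times the sum of x_j v_j over all other active features. The function names and toy values below are illustrative only.

```python
def fm_pairwise(V, x):
    """Second-order FM term: sum over i<j of <v_i, v_j> * x_i * x_j."""
    total = 0.0
    for i in range(len(V)):
        for j in range(i + 1, len(V)):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


def fm_grad(V, x, i):
    """Analytic gradient w.r.t. v_i: x_i * sum over j != i of x_j * v_j."""
    grad = [0.0] * len(V[0])
    for j in range(len(V)):
        if j != i:
            for d in range(len(grad)):
                grad[d] += x[i] * x[j] * V[j][d]
    return grad


def numeric_grad(V, x, i, d, eps=1e-4):
    """Central finite difference on coordinate d of v_i, as a sanity check."""
    plus = [row[:] for row in V]
    minus = [row[:] for row in V]
    plus[i][d] += eps
    minus[i][d] -= eps
    return (fm_pairwise(plus, x) - fm_pairwise(minus, x)) / (2 * eps)


V = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]]  # toy feature vectors
x = [1.0, 1.0, 1.0]                         # all three features active
g0 = fm_grad(V, x, 0)                       # equals v_1 + v_2, independent of v_0
print(g0, numeric_grad(V, x, 0, 0))
```

The update for v_0 depends only on the vectors of the co-occurring features, so every active feature pulls it in its own direction regardless of whether the features are actually related.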

To address these issues, we propose using product operators in DNNs to help explore feature interactions. We discuss these issues in an extended paper, which was submitted to TOIS in Sep. 2017 and will be released later. Any discussion is welcome; please contact Yanru Qu.

Product-based Neural Networks

From our review of previous work, we think a good predictor should have a good feature extractor (to convert sparse features into dense representations) as well as a powerful classifier (e.g., a DNN as a universal approximator). Since FM is good at representing feature interactions, we introduce product operators into DNNs. The proposed PNN models follow this architecture: an embedding layer to represent sparse features, a product layer to explore feature interactions, and a DNN classifier.

For the product layer, we propose 2 types of product operators in the paper: inner product and outer product. For $n$ fields, these operators output $n(n-1)/2$ feature interactions, which are concatenated with the embeddings and fed to the following fully connected layers.
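
A minimal sketch of the inner-product variant, assuming one embedding vector per field: it computes every pairwise inner product and appends the results to the flattened embeddings. The function name is hypothetical, and the repository's TensorFlow code operates on batched tensors rather than Python lists.

```python
from itertools import combinations


def inner_product_layer(embeddings):
    """Concatenate the field embeddings with all pairwise inner products."""
    flat = [value for emb in embeddings for value in emb]
    products = [
        sum(a * b for a, b in zip(p, q))
        for p, q in combinations(embeddings, 2)  # n(n-1)/2 unordered pairs
    ]
    return flat + products


# Three fields (n = 3) with embedding size 2.
embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
features = inner_product_layer(embeddings)
print(len(features), features[6:])  # 9 [0.0, 3.0, 2.0]
```

With n = 3 and embedding size 2, the first hidden layer receives 3 * 2 = 6 embedding values plus 3 interaction values.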

The inner product is easy to understand. The outer product is actually equivalent to projecting the embeddings into a hidden space and computing the inner product of the projected embeddings:

$\sum_{i,j}(uv^T\odot w)_{ij} = u^Twv$
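
The identity can be checked numerically, reading the left-hand side as the sum of all entries of the element-wise product between the outer product and the kernel $w$. The helper names and toy values below are illustrative.

```python
def outer_kernel_sum(u, v, w):
    """Sum of all entries of (u v^T) multiplied element-wise by the kernel w."""
    return sum(
        u[i] * v[j] * w[i][j]
        for i in range(len(u))
        for j in range(len(v))
    )


def bilinear(u, v, w):
    """u^T w v: project v with w, then take the inner product with u."""
    wv = [sum(w[i][j] * v[j] for j in range(len(v))) for i in range(len(w))]
    return sum(u[i] * wv[i] for i in range(len(u)))


u = [1.0, 2.0]
v = [3.0, 4.0]
w = [[0.5, -1.0], [2.0, 0.0]]
print(outer_kernel_sum(u, v, w), bilinear(u, v, w))  # 9.5 9.5
```

The second form never materializes the outer product matrix, which is why the projection view is cheaper in memory.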

Since there are $n(n-1)/2$ feature interactions, the paper proposes some tricks to reduce complexity. However, we find that these tricks restrict model capacity and are unnecessary. In a recent update of the code, we removed the tricks for better performance.

In our implementation, we add the parameter kernel_type: {mat, vec, num} for outer product. The default type is mat, and you can switch to other types to save time and memory.
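
To give a feel for the time and memory trade-off, the sketch below counts kernel parameters under an assumed reading of the three types: mat as a full matrix per feature pair, vec as one weight per embedding dimension, and num as a single scalar per pair. This interpretation, the helper names, and the example sizes are assumptions; check the repository code for the actual kernel shapes.

```python
def kernel_shape(kernel_type, embed_size):
    """Per-pair kernel shape under the ASSUMED meaning of each kernel type."""
    if kernel_type == "mat":
        return (embed_size, embed_size)  # assumed: full projection matrix
    if kernel_type == "vec":
        return (embed_size,)             # assumed: one weight per dimension
    if kernel_type == "num":
        return (1,)                      # assumed: a single scalar weight
    raise ValueError("unknown kernel_type: %r" % kernel_type)


def kernel_params(kernel_type, num_fields, embed_size):
    """Total kernel parameters over all n(n-1)/2 feature pairs."""
    num_pairs = num_fields * (num_fields - 1) // 2
    size = 1
    for dim in kernel_shape(kernel_type, embed_size):
        size *= dim
    return num_pairs * size


# Hypothetical setting: 16 fields, embedding size 10.
for kernel_type in ("mat", "vec", "num"):
    print(kernel_type, kernel_params(kernel_type, 16, 10))
```

Under these assumptions the counts are 12000, 1200, and 120, so each step down shrinks the kernel by a factor equal to the embedding size.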

A potential risk arises when training the first hidden layer. Feature embeddings and interactions are concatenated and fed to the first hidden layer, but the embeddings and interactions have different distributions. A simple remedy is to add a linear transformation to the embeddings to balance the distributions. Layer norm is also worth trying.
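
One way to try the layer norm suggestion is sketched below: the embedding block and the interaction block are normalized separately before concatenation. Where the normalization is applied, the omission of a learned gain and bias, and the function names are all assumptions made for illustration, not the repository's implementation.

```python
import math


def layer_norm(vector, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


def balanced_input(embeddings, interactions):
    """Normalize the two blocks separately (an assumed placement), then concatenate."""
    flat = [value for emb in embeddings for value in emb]
    return layer_norm(flat) + layer_norm(interactions)


# Toy values: small embeddings next to much larger interaction terms.
embeddings = [[0.01, -0.02], [0.03, 0.00]]
interactions = [5.0, 7.0, 9.0]
hidden_input = balanced_input(embeddings, interactions)
print(hidden_input)
```

After normalization both blocks sit on a comparable scale, so neither dominates the first hidden layer purely because of its magnitude.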

How to Use

For simplicity, we provide the iPinYou dataset at make-ipinyou-data. Follow the instructions there and update the soft link data: