# Description
This notebook implements a Factorization Machine (FM) for CTR prediction with TensorFlow 1.5, using the Criteo dataset from Kaggle. Download link: [http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/](http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/)

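Dataset overview:   
This is a subset of the data from the Criteo Display Advertising Challenge. It contains two files, train.csv and test.csv:
* train.csv: the training set consists of a portion of Criteo's traffic over a period of 7 days. Each row corresponds to a display ad served by Criteo. To reduce the dataset size, positive (clicked) and negative (not clicked) examples were both subsampled, at different rates. The examples are ordered chronologically.
* test.csv: the test set is computed in the same way as the training set, but for events on the day following the training period.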

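Fields:
* Label: the target variable; 0 means not clicked, 1 means clicked
* I1-I13: 13 numerical feature columns, mostly count features
* C1-C26: 26 categorical feature columns; for anonymization, their values have been hashed onto 32 bits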

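The task in this competition is to build a model that predicts ad click-through rate (CTR): given a user and the page they are visiting, what is the probability that they click on a given ad? Competition page: [https://www.kaggle.com/c/criteo-display-ad-challenge/overview](https://www.kaggle.com/c/criteo-display-ad-challenge/overview)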

# FM
FM model equation:   
$$y=w_0+\sum_{i=1}^nw_ix_i+\sum_{i=1}^{n- 1}\sum_{j=i+1}^{n}\langle v_i,v_j \rangle x_ix_j$$
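Here $v_i$ is the latent vector of the $i$-th feature, $\langle \cdot, \cdot \rangle$ denotes the dot product, and $\langle v_i, v_j \rangle=\sum_{f=1}^k v_{i,f}v_{j,f}$. Each latent vector has length $k$ ($k\ll n$) and holds $k$ factors describing the feature.   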

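Evaluated directly, the FM equation costs $O(kn^2)$, because every pairwise feature interaction has to be computed. Rewriting the second-order term algebraically reduces this to $O(kn)$; the result is:   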
$$y=w_0+\sum_{i=1}^nw_ix_i+\frac{1}{2}\sum_{f=1}^k\Bigg[\bigg(\sum_{i=1}^n v_{i,f}x_i\bigg)^2-\sum_{i=1}^n (v_{i,f})^2x_i^2\Bigg]$$
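
The equivalence of the two forms of the second-order term can be checked numerically. The following is a minimal NumPy sketch, not part of the original notebook; the sizes `n` and `k` and the random inputs are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                      # illustrative sizes, not the notebook's
x = rng.normal(size=n)           # feature vector x_1..x_n
V = rng.normal(size=(n, k))      # row i is the latent vector v_i

# Direct form: sum over all pairs i < j of <v_i, v_j> * x_i * x_j, O(k n^2)
naive = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Simplified form: 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2], O(k n)
xv = V * x[:, None]              # shape (n, k), entry (i, f) is v_if * x_i
fast = 0.5 * np.sum(xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0))

print(np.isclose(naive, fast))
```

The simplified form only needs one pass over the `n` features for each of the `k` factors, which is where the $O(kn)$ cost comes from.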

In [37]:
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

In [38]:
# Fill missing dense features with 0 and apply a log transform; fill missing sparse features with '-1'
def process_feat(data, dense_feats, sparse_feats):
    df = data.copy()
    # dense
    df[dense_feats] = df[dense_feats].fillna(0.0)
    for f in tqdm(dense_feats):
        df[f] = df[f].apply(lambda x: np.log(1 + x) if x > -1 else -1)
    # sparse
    df[sparse_feats] = df[sparse_feats].fillna('-1')
    return df
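
To make the effect of this preprocessing concrete, here is a small sketch on a toy frame. It is an illustration only: the toy data is made up, and it uses a vectorized `np.log1p` in place of the per-element `apply` above, which should give the same values under the same rule (log(1 + x) for x > -1, otherwise -1).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one dense column and one sparse column
toy = pd.DataFrame({
    "I1": [0.0, np.nan, 9.0, -2.0],
    "C1": ["a", None, "b", "a"],
})

dense = toy["I1"].fillna(0.0)
# Replace out-of-range values by 0 before the log so log1p never sees x <= -1
safe = dense.where(dense > -1, 0.0)
toy["I1"] = np.where(dense > -1, np.log1p(safe), -1.0)
toy["C1"] = toy["C1"].fillna("-1")

print(toy)
```

The missing dense value becomes log(1 + 0) = 0, the value 9 becomes log(10), the out-of-range value -2 becomes -1, and the missing category becomes the string '-1'.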

In [39]:
# Load the data
file = '../dataset/criteo_sampled_data.csv'
data = pd.read_csv(file, sep=',')
# Dense feature names start with 'I', sparse feature names start with 'C', and 'label' is the target
cols = data.columns.values
dense_feats = [f for f in cols if f[0] == 'I']
sparse_feats = [f for f in cols if f[0] == 'C']
ignore_feats = ['label']
# Preprocess the data
data_new = process_feat(data, dense_feats, sparse_feats)
# Split into training and validation sets
train, test = train_test_split(data_new, test_size=0.2, random_state=1)

100%|██████████| 13/13 [00:08<00:00,  1.47it/s]


In [40]:
# Walk the columns to build feature_dict and count total_feature
def get_feature_dict(data, ignore_feats, dense_feats):
    feature_dict = {}
    total_feature = 0
    for col in tqdm(data.columns):
        if col in ignore_feats:
            continue
        elif col in dense_feats:
            feature_dict[col] = total_feature
            total_feature += 1
        else:
            unique_val = data[col].unique()
            feature_dict[col] = dict(
                zip(unique_val,
                    range(total_feature,
                        len(unique_val) + total_feature)))
            total_feature += len(unique_val)
    return feature_dict, total_feature
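
The indexing scheme is easier to see on a tiny example. The sketch below re-implements the same logic without `tqdm` under the hypothetical name `build_feature_dict` and runs it on made-up data: each dense column gets a single index, and each distinct value of a sparse column gets its own index.

```python
import pandas as pd

def build_feature_dict(df, ignore_feats, dense_feats):
    """Same scheme as get_feature_dict above, restated for a standalone demo."""
    feature_dict, total = {}, 0
    for col in df.columns:
        if col in ignore_feats:
            continue
        if col in dense_feats:
            feature_dict[col] = total          # one index per dense column
            total += 1
        else:
            uniques = df[col].unique()         # one index per distinct value
            feature_dict[col] = {v: total + i for i, v in enumerate(uniques)}
            total += len(uniques)
    return feature_dict, total

toy = pd.DataFrame({
    "label": [1, 0, 0],
    "I1": [0.0, 1.1, 2.3],
    "C1": ["a", "b", "a"],
    "C2": ["x", "x", "y"],
})
feature_dict, total_feature = build_feature_dict(toy, ["label"], ["I1"])
print(feature_dict)    # {'I1': 0, 'C1': {'a': 1, 'b': 2}, 'C2': {'x': 3, 'y': 4}}
print(total_feature)   # 5
```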

In [103]:
def data_tran(data, feature_dict, ignore_feats, dense_feats):
    labels = data['label']
    # Holds, for every cell, its index in feature_dict: each row becomes a list of feature indices
    feature_index = data.copy()
    # Holds the value of every cell: each row becomes a list of feature values
    feature_value = data.copy()
    for col in tqdm(feature_index.columns):
        if col in ignore_feats:
            feature_index.drop(col, axis=1, inplace=True)
            feature_value.drop(col, axis=1, inplace=True)
        elif col in dense_feats:
            feature_index[col] = feature_dict[col]
        else:
            feature_index[col] = feature_index[col].map(feature_dict[col])
            feature_value[col] = 1
    return feature_index, feature_value, labels
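
As an illustration of what this transformation produces, the sketch below applies the same steps by hand to a made-up two-row frame with a hand-written `feature_dict`. Dense columns keep their numeric value and share one index; sparse columns get the index of their category and the value 1.

```python
import pandas as pd

# Hypothetical toy inputs
toy = pd.DataFrame({"label": [1, 0], "I1": [0.5, 1.2], "C1": ["a", "b"]})
feature_dict = {"I1": 0, "C1": {"a": 1, "b": 2}}

feats = toy.drop(columns=["label"])
feature_index = feats.copy()
feature_value = feats.copy()

feature_index["I1"] = feature_dict["I1"]                   # dense: fixed index
feature_index["C1"] = feats["C1"].map(feature_dict["C1"])  # sparse: index of the category
feature_value["C1"] = 1                                    # sparse: value is always 1

print(feature_index)
print(feature_value)
```

Note that `map` yields NaN for a category that is missing from the dictionary. The notebook avoids this by building `feature_dict` from the full dataset before splitting it.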

In [104]:
feature_dict, total_feature = get_feature_dict(data_new, ignore_feats, dense_feats)
print('total_feature:', total_feature)
print('feature_dict size:', len(feature_dict))
# Produce the arrays used for training
train_feature_index, train_feature_value, train_labels = data_tran(
    train, feature_dict, ignore_feats, dense_feats)
test_feature_index, test_feature_value, test_labels = data_tran(
    test, feature_dict, ignore_feats, dense_feats)

100%|██████████| 40/40 [00:01<00:00, 26.56it/s] 


total_feature: 885697
feature_dict size: 39


100%|██████████| 40/40 [00:07<00:00,  5.00it/s]
100%|██████████| 40/40 [00:03<00:00, 12.92it/s]


In [244]:
fm_params = {
    'embedding_size': 8,
    'batch_size': 4000,
    'learning_rate': 0.001,
    'epoch': 20,
    'optimizer': 'adagrad'
}
fm_params['feature_size'] = total_feature
fm_params['field_size'] = len(train_feature_index.columns)
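
For a sense of scale, the number of trainable parameters follows directly from these settings and from the variable shapes used in the model definition that follows. A small arithmetic sketch, using the `total_feature` value printed above and the 39 feature columns:

```python
feature_size = 885697    # total_feature printed above
field_size = 39          # 13 dense + 26 sparse columns
embedding_size = 8

n_embedding = feature_size * embedding_size   # embedding table [feature_size, embedding_size]
n_first_order = field_size                    # first-order weights [field_size, 1]
n_bias = 1

total_params = n_embedding + n_first_order + n_bias
print(n_embedding, total_params)              # 7085576 7085616
```

Almost all of the parameters sit in the embedding table, so `embedding_size` is the setting that drives model size.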

In [245]:
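# Build the model
tf.reset_default_graph()  # reset the default graph so this cell can be re-run
# Define the model inputs
# Training takes three inputs: the feature indices and feature values produced above, plus the labels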
feat_index = tf.placeholder(tf.int32,
                            shape=[None, fm_params['field_size']],
                            name='feat_index')
feat_value = tf.placeholder(tf.float32,
                            shape=[None, fm_params['field_size']],
                            name='feat_value')
labels = tf.placeholder(tf.int32, shape=[None], name='labels')
# tf fm weights
weights = dict()
weights_initializer = tf.glorot_normal_initializer()
bias_initializer = tf.constant_initializer(0.0)
weights["feature_embeddings"] = tf.get_variable(
    name='weights',
    dtype=tf.float32,
    initializer=weights_initializer,
    regularizer=tf.contrib.layers.l2_regularizer(scale=1e-5),
    shape=[fm_params['feature_size'], fm_params['embedding_size']])
weights["weights_first_order"] = tf.get_variable(
    name='vectors',
    dtype=tf.float32,
    initializer=weights_initializer,
    regularizer=tf.contrib.layers.l2_regularizer(1e-5),
    shape=[fm_params['field_size'], 1])
weights["fm_bias"] = tf.get_variable(name='bias',
                                     dtype=tf.float32,
                                     initializer=bias_initializer,
                                     shape=[1])
embeddings = tf.nn.embedding_lookup(weights["feature_embeddings"], feat_index) # shape=(?, 39, 8)
bias = weights['fm_bias']
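# Build the model function
## First-order term
# NOTE: weights_first_order has shape [field_size, 1], so this applies one linear
# weight per field rather than one per feature id as in the FM equation.
first_order = tf.matmul(feat_value,
                        weights["weights_first_order"])  # shape=(?, 1)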

##second order
### feature * embeddings
reshaped_feat_value = tf.reshape(feat_value,
                                 shape=[-1, fm_params['field_size'], 1])
# tf.multiply is element-wise multiplication (with broadcasting), not matrix multiplication
f_e_m = tf.multiply(reshaped_feat_value, embeddings)
###  square(sum(feature * embedding))
f_e_m_sum = tf.reduce_sum(f_e_m, 1)
f_e_m_sum_square = tf.square(f_e_m_sum)
###  sum(square(feature * embedding))
f_e_m_square = tf.square(f_e_m)
f_e_m_square_sum = tf.reduce_sum(f_e_m_square, 1)
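# NOTE: the 1/2 factor from the simplified equation is omitted here; the learned
# embeddings absorb it as a constant scale.
second_order = f_e_m_sum_square - f_e_m_square_sum
second_order = tf.reduce_sum(second_order, 1, keepdims=True)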

##final objective function
logits = second_order + first_order + bias
predicts = tf.sigmoid(logits)

##loss function
new_labels = tf.cast(tf.reshape(labels, shape=[-1, 1]), dtype=tf.float32)
sigmoid_loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits,
                                                       labels=new_labels)
sigmoid_loss = tf.reduce_mean(sigmoid_loss)
l2_loss = tf.losses.get_regularization_loss()
loss = sigmoid_loss + l2_loss

# train op
if fm_params['optimizer'] == 'adagrad':
    optimizer = tf.train.AdagradOptimizer(
        learning_rate=fm_params['learning_rate'],
        initial_accumulator_value=1e-8)
elif fm_params['optimizer'] == 'adam':
    optimizer = tf.train.AdamOptimizer(
        learning_rate=fm_params['learning_rate'])
else:
    raise ValueError('unknown optimizer: {}'.format(fm_params['optimizer']))
train_op = optimizer.minimize(loss)

# accuracy
one_tensor = tf.ones_like(predicts)
neg_predicts = tf.subtract(one_tensor, predicts)
prediction = tf.concat([neg_predicts, predicts], axis=1)
# new_labels = tf.cast(tf.reshape(labels, shape=[-1]), dtype=tf.int32)
# If labels were fed with shape [None, 1] they would have to be reshaped to [None] before in_top_k can use them
correct_prediction = tf.nn.in_top_k(prediction, labels, 1)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
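
The second-order block above works on `field_size` looked-up embeddings per row instead of the full `feature_size`-dimensional vector $x$. The NumPy sketch below, with made-up sizes and values, suggests why this is enough: features that do not appear in a row have $x_i = 0$ and drop out of both sums. It assumes each row uses distinct feature indices, and it keeps the 1/2 factor from the simplified equation.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_size, k = 10, 4                       # illustrative sizes
V = rng.normal(size=(feature_size, k))        # embedding table

# Two rows, three fields each: feature indices and the matching values
feat_index = np.array([[0, 3, 7], [1, 3, 9]])
feat_value = np.array([[0.5, 1.0, 1.0], [2.0, 1.0, 1.0]])

# Lookup form, mirroring the TensorFlow ops above
emb = V[feat_index]                           # (batch, field, k)
xv = feat_value[:, :, None] * emb             # (batch, field, k)
lookup = 0.5 * (np.square(xv.sum(axis=1))
                - np.square(xv).sum(axis=1)).sum(axis=1, keepdims=True)

# Dense form: scatter the values into the full x vector, then apply the equation
x = np.zeros((2, feature_size))
np.put_along_axis(x, feat_index, feat_value, axis=1)
dense = 0.5 * (np.square(x @ V)
               - np.square(x) @ np.square(V)).sum(axis=1, keepdims=True)

print(np.allclose(lookup, dense))
```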

In [246]:
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()
# Start training
with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch in range(fm_params['epoch']):
        avg_cost = 0.
        avg_acc = 0.
        # Integer division: a final partial batch, if any, is skipped
        total_batch = train.shape[0] // fm_params['batch_size']
        # Loop over all batches
        for i in range(total_batch):
            start_idx = i * fm_params['batch_size']
            end_idx = (i + 1) * fm_params['batch_size']
            batch_index = train_feature_index[start_idx:end_idx]
            batch_value = train_feature_value[start_idx:end_idx]
            batch_labels = train_labels[start_idx:end_idx]
            # Fit training using batch data
            _, c, acc = sess.run(
                [train_op, loss, accuracy],
                feed_dict={
                    feat_index: batch_index,
                    feat_value: batch_value,
                    labels: batch_labels
                })
            # Compute average loss
            avg_cost += c / total_batch
            avg_acc += acc / total_batch
        # Display logs per epoch step
        if (epoch + 1) % 1 == 0:
            vloss, pred1, pred2, cprediction, vacc = sess.run(
                [loss, predicts, prediction, correct_prediction, accuracy],
                feed_dict={
                    feat_index: test_feature_index,
                    feat_value: test_feature_value,
                    labels: test_labels
                })
            print('Epoch:', '%04d' % (epoch + 1), 'cost=',
                  '{:.9f}'.format(avg_cost), 'acc=', '{:.9f}'.format(avg_acc),
                  'valid_loss=', '{:.9f}'.format(vloss), 'valid_acc=',
                  '{:.9f}'.format(vacc))

    print('Optimization Finished!')


Epoch: 0001 cost= 0.756251098 acc= 0.481272918 valid_loss= 0.613053322 valid_acc= 0.701250017
Epoch: 0002 cost= 0.570809954 acc= 0.741264583 valid_loss= 0.547638834 valid_acc= 0.752933323
Epoch: 0003 cost= 0.532172237 acc= 0.761231250 valid_loss= 0.526202083 valid_acc= 0.761108339
Epoch: 0004 cost= 0.514010576 acc= 0.768241668 valid_loss= 0.513945758 valid_acc= 0.766366661
Epoch: 0005 cost= 0.501640464 acc= 0.773610419 valid_loss= 0.505695403 valid_acc= 0.769391656
Epoch: 0006 cost= 0.492207383 acc= 0.778025000 valid_loss= 0.499745578 valid_acc= 0.772000015
Epoch: 0007 cost= 0.484651676 acc= 0.781760417 valid_loss= 0.495292366 valid_acc= 0.773783326
Epoch: 0008 cost= 0.478405937 acc= 0.784772916 valid_loss= 0.491876841 valid_acc= 0.774824977
Epoch: 0009 cost= 0.473113014 acc= 0.787108332 valid_loss= 0.489202827 valid_acc= 0.775550008
Epoch: 0010 cost= 0.468529478 acc= 0.789437503 valid_loss= 0.487067133 valid_acc= 0.775908351
Epoch: 0011 cost= 0.464482887 acc= 0.791356252 valid_loss= 0