# Motivation
Original paper: 《》

References for further study:
* [速览 DeepFM: 使用 FM 取代 Wide & Deep 中的 LR](https://zhuanlan.zhihu.com/p/57158486)
* [推荐系统系列（一）：FM理论与实践](https://zhuanlan.zhihu.com/p/89639306)

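Compared with plain LR (Logistic Regression), adding feature crosses improves CTR prediction, but with sparse features the crosses cause a curse of dimensionality (crossing $n$ features typically adds $\frac{n(n-1)}{2}$ parameters). FM applies a low-rank factorization to the second-order weight matrix, so that each cross weight becomes an inner product of low-dimensional vectors. This resembles an embedding, which maps high-dimensional sparse inputs to low-dimensional dense ones. Because the latent vectors are shared across cross terms, FM offers a better trade-off between expressiveness and parameter count.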

 <img style="display: block; margin: 0 auto;" src="../../../assets/images/fm.png" width = "600" height = "300" alt="FM" align=center />

## Latent vectors
$$
\begin{align}
y=w_0+\sum_{i=1}^{n}w_ix_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}w_{ij}x_ix_j
\end{align}
$$

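FM's formulation introduces second-order cross terms. After one-hot encoding, however, the features are highly sparse, so for some pairs $x_ix_j$ there are few or no training samples in which both are non-zero. The corresponding $w_{ij}$ then cannot be learned well and tends to overfit. FM therefore introduces an auxiliary (latent) vector $V_i=(v_{i1},v_{i2},\dots,v_{ik})$ for each feature such that $w_{ij}\approx V_iV_j^T$. The benefits are:
* The number of second-order parameters drops from $\frac{n(n-1)}{2}$ to $kn$.
* Sharing latent vectors ties the parameters together: even when $w_{ij}$ has no sample in which $x_i$ and $x_j$ co-occur, samples containing only $x_i$ or only $x_j$ still update $V_i$ or $V_j$, and therefore $\langle V_i, V_j\rangle$ and the implied $w_{ij}$.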
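
To make the parameter-count bullet concrete, the sketch below compares the two counts for a few feature dimensions. The values of `n` and `k` are illustrative assumptions, not figures from this notebook.

```python
def pairwise_param_count(n: int) -> int:
    """Number of independent cross weights w_ij with i < j."""
    return n * (n - 1) // 2


def fm_param_count(n: int, k: int) -> int:
    """Number of latent-vector entries: one k-dimensional vector per feature."""
    return k * n


k = 8  # illustrative latent dimension (assumption)
for n in (100, 10_000, 1_000_000):
    # The explicit cross weights grow quadratically in n, the FM ones linearly.
    print(n, pairwise_param_count(n), fm_param_count(n, k))
```

The gap widens quickly: the explicit form grows quadratically in `n`, while the factorized form grows linearly.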

$$
\begin{align}
y=w_0+\sum_{i=1}^{n}w_ix_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\langle V_i, V_j\rangle x_ix_j
\end{align}
$$

### Tips
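The second-order term is computed with <font color="red">a small trick based on summing the entries of a symmetric matrix, namely $$\Big(\sum_i a_i\Big)^2 = \sum_i a_i^2 + 2\sum_{i<j} a_i a_j$$</font>Setting $a_i=V_ix_i$ in this identity (and reading the square of a vector as its inner product with itself) gives $$\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\langle V_i, V_j\rangle x_ix_j=\frac{1}{2}\Big[\Big(\sum_{i=1}^{n} V_i x_i\Big)^2 - \sum_{i=1}^{n} (V_i x_i)^2\Big]$$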

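Substituting this back into the model and expanding the inner products over the $k$ latent dimensions:
$$
\begin{align}
y&=w_0+\sum_{i=1}^{n}w_ix_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\langle V_i, V_j\rangle x_ix_j\\
&=w_0+\sum_{i=1}^{n}w_ix_i+\frac{1}{2}\sum_{f=1}^{k}\Big\{\Big(\sum_{i=1}^{n}v_{if}x_i\Big)^2-\sum_{i=1}^{n}v_{if}^2x_i^2\Big\}
\end{align}
$$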

The rewritten form captures the pairwise interactions without ever enumerating the cross terms $x_ix_j$, so the cost of computing $y$ drops from $O(kn^2)$ to $O(kn)$.
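
The identity can be checked numerically. The following NumPy sketch computes the second-order term both ways on random data. The function names and the sizes `n = 6`, `k = 3` are chosen here for illustration only.

```python
import numpy as np


def pairwise_naive(x, V):
    """Sum over i < j of <V_i, V_j> x_i x_j, computed directly in O(k n^2)."""
    n = len(x)
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += float(V[i] @ V[j]) * x[i] * x[j]
    return total


def pairwise_fast(x, V):
    """Same quantity via the square-of-sum trick in O(k n)."""
    vx = V * x[:, None]             # row i holds V_i * x_i, shape (n, k)
    sum_sq = vx.sum(axis=0) ** 2    # (sum_i v_if x_i)^2 for each factor f
    sq_sum = (vx ** 2).sum(axis=0)  # sum_i v_if^2 x_i^2 for each factor f
    return 0.5 * float((sum_sq - sq_sum).sum())


rng = np.random.default_rng(0)
n, k = 6, 3                         # illustrative sizes (assumption)
x = rng.normal(size=n)
V = rng.normal(size=(n, k))

print(pairwise_naive(x, V), pairwise_fast(x, V))
```

Both values agree up to floating-point error. Only the second form scales to the feature counts seen in CTR data.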

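**FM can be applied to data of any numeric type, but the input feature values should be preprocessed: prefer feature normalization first, and only then apply sample normalization.**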
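
The two kinds of normalization differ only in the axis they operate on. The sketch below is a minimal illustration and assumes min-max scaling for features and L2 scaling for samples. The source does not say which scheme it has in mind, and the toy matrix `X` is made up.

```python
import numpy as np


def normalize_features(X, eps=1e-12):
    """Feature normalization: min-max scale each column to [0, 1]."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    return (X - lo) / np.maximum(span, eps)  # eps guards constant columns


def normalize_samples(X, eps=1e-12):
    """Sample normalization: scale each row to unit L2 norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)        # eps guards all-zero rows


# Toy data (assumption): two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 1000.0]])
X_feat = normalize_features(X)
X_samp = normalize_samples(X)
print(X_feat)
print(X_samp)
```

In a real pipeline the column statistics would be computed on the training split only and reused for validation and test data.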

A comparison with later derivative work:
 <img style="display: block; margin: 0 auto;" src="../../../assets/images/fm-comparison.png" width = "1000" height = "300" alt="FM" align=center />

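# FM for ad click-through rate prediction
FM stands for Factorization Machine. The idea is to augment a linear regression model with second-order feature interactions, which makes it well suited to capturing interactions among large-scale sparse (categorical) features.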
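
For one-hot encoded categorical data, every non-zero input equals 1, so an FM prediction only needs the latent vectors of the active features. The sketch below shows this under that assumption. `fm_predict_onehot`, the random parameters, and the index set `active` are all hypothetical, and this is not the `FMModel` used later in the notebook.

```python
import numpy as np


def fm_predict_onehot(active, w0, w, V):
    """Click probability for a sample whose non-zero features all equal 1.

    active: indices of the features that are 1 (e.g. one per categorical field).
    """
    linear = w[active].sum()
    Va = V[active]  # (m, k): latent vectors of the m active features
    # Square-of-sum trick restricted to the active rows: O(k m), not O(k n).
    pair = 0.5 * ((Va.sum(axis=0) ** 2) - (Va ** 2).sum(axis=0)).sum()
    logit = w0 + linear + pair
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid turns the score into a CTR


rng = np.random.default_rng(0)
n, k = 10, 4                   # illustrative sizes (assumption)
w0 = 0.1
w = rng.normal(scale=0.1, size=n)
V = rng.normal(scale=0.1, size=(n, k))
active = np.array([1, 5, 8])   # hypothetical active one-hot indices

p = fm_predict_onehot(active, w0, w, V)
print(p)
```

The cost depends on the number of active features rather than on the full vocabulary size, which is why FM handles large sparse inputs well.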

In [1]:
import os
import sys
from pathlib import Path
DIR_PATH = str(Path(os.getcwd()).parent.parent.parent.parent)
sys.path.append(DIR_PATH)

print(DIR_PATH)

/Users/colin/Desktop


# 1. Prepare the data

In [7]:
print(DIR_PATH + "/merlin/assets/data/criteo-small/")

/Users/colin/Desktop/merlin/assets/data/criteo-small/


In [8]:
#!/bin/bash
!kaggle datasets download leonerd/criteo-small -p /Users/colin/Desktop/merlin/assets/data/criteo-small/ --unzip

Dataset URL: https://www.kaggle.com/datasets/leonerd/criteo-small
License(s): copyright-authors
Downloading criteo-small.zip to /Users/colin/Desktop/merlin/assets/data/criteo-small
 99%|█████████████████████████████████████▌| 83.0M/83.9M [00:12<00:00, 9.22MB/s]
100%|██████████████████████████████████████| 83.9M/83.9M [00:12<00:00, 7.08MB/s]


In [9]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

np.random.seed(42)

In [12]:
dfdata = pd.read_csv(
    DIR_PATH + "/merlin/assets/data/criteo-small/train_1m.txt", sep='\t', header=None)
dfdata.columns = ["label"] + ["I"+str(x) for x in range(1,14)] + [
    "C"+str(x) for x in range(14,40)]

target_col = 'label'
cat_cols = [x for x in dfdata.columns if x.startswith('C')]
num_cols = [x for x in dfdata.columns if x.startswith('I')]

In [13]:
dftrain_val, dftest_raw = train_test_split(dfdata, test_size=0.2, random_state=42)
dftrain_raw, dfval_raw = train_test_split(dftrain_val, test_size=0.2, random_state=42)

In [16]:
dftrain_raw.shape

(640000, 40)

In [17]:
len(cat_cols)

26

In [15]:
from merlin.charms.datasets.preprocess import TabularPreprocessor
from sklearn.preprocessing import OrdinalEncoder

# Feature engineering
pipe = TabularPreprocessor(cat_features=cat_cols, onehot_max_cat_num=3)
encoder = OrdinalEncoder()

dftrain = pipe.fit_transform(dftrain_raw.drop(target_col, axis=1))
dftrain[target_col] = encoder.fit_transform(
    dftrain_raw[target_col].values.reshape(-1, 1)).astype(np.int32)

dfval = pipe.transform(dfval_raw.drop(target_col, axis=1))
dfval[target_col] = encoder.transform(
    dfval_raw[target_col].values.reshape(-1, 1)).astype(np.int32)

dftest = pipe.transform(dftest_raw.drop(target_col, axis=1))
dftest[target_col] = encoder.transform(
    dftest_raw[target_col].values.reshape(-1, 1)).astype(np.int32)

100%|██████████| 24/24 [00:02<00:00,  9.61it/s]


ValueError: Shape of passed values is (640000, 44), indices imply (640000, 39)

In [None]:
from torch.utils.data import Dataset, DataLoader
from merlin.charms.datasets.dataset import TabularDataset

def get_dataset(dfdata):
    return TabularDataset(
        data = dfdata,
        task = "binary",
        target = [target_col],
        continuous_cols = pipe.get_numeric_features(),
        categorical_cols = pipe.get_embedding_features(),
    )

def get_dataloader(ds, batch_size=512, num_workers=0, shuffle=False):
    return DataLoader(
        dataset=ds,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        pin_memory=False,
    )

ds_train = get_dataset(dftrain)
ds_val = get_dataset(dfval)
ds_test = get_dataset(dftest)

dl_train = get_dataloader(ds_train, batch_size=2048, shuffle=True)
dl_val = get_dataloader(ds_val, shuffle=False)
dl_test = get_dataloader(ds_test, shuffle=False)

# 2. Define the model

In [None]:
from merlin.charms.models.fm import FMModel, FMConfig

model_config = FMConfig(task="binary")
config = model_config.merge_dataset_config(ds_train)

print('input_embed_dim = ', config.input_embed_dim)
print('\n categorical_cardinality = ',config.categorical_cardinality)
print('\n embedding_dims = ' , config.embedding_dims)

In [None]:
net = FMModel(config)

# Initialize parameters
net.reset_weights()
net.data_aware_initialization(dl_train)

print(net.backbone.output_dim)

In [None]:
# Grab a single batch for a quick forward-pass check
batch = next(iter(dl_train))

In [None]:
output = net.forward(batch)
loss = net.compute_loss(output,batch['target'])
print(loss)

# 3. Train the model

In [None]:
import torch  # needed for torch.optim, torch.no_grad and torch.load below
from merlin.tools import WandModel
from merlin.tools.metrics import AUC

optimizer = torch.optim.AdamW(net.parameters(), lr=1e-3, weight_decay=1e-5)
wand_model = WandModel(
    model=net,
    optimizer=optimizer,
    metrics_dict={"auc": AUC()}
)

In [None]:
wand_model.fit(
    train_data=dl_train,
    val_data=dl_val,
    ckpt_path=DIR_PATH + "/merlin/assets/checkpoints/fm_criteo_binary",
    epochs=30,
    patience=5,
    monitor="val_auc",
    mode="max"
)
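
The `patience`, `monitor` and `mode` arguments suggest early stopping on the validation AUC. How `WandModel.fit` implements it is not shown here, so the sketch below only illustrates the conventional meaning of those arguments. `early_stop_epoch` and the metric history are hypothetical.

```python
def early_stop_epoch(history, patience=5, mode="max"):
    """Return the 1-based epoch at which training would stop.

    history: per-epoch values of the monitored metric (e.g. val_auc).
    Training stops once `patience` consecutive epochs fail to beat the best value.
    """
    best = None
    bad_epochs = 0
    for epoch, value in enumerate(history, start=1):
        if best is None:
            improved = True
        elif mode == "max":
            improved = value > best
        else:
            improved = value < best
        if improved:
            best, bad_epochs = value, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(history)  # never triggered: all epochs were run


# Made-up validation AUC values, for illustration only.
val_auc = [0.70, 0.72, 0.73, 0.73, 0.72, 0.73, 0.729, 0.728, 0.75]
print(early_stop_epoch(val_auc, patience=5, mode="max"))
```

With `mode="max"` a higher metric counts as an improvement. For a loss-like metric, `mode="min"` flips the comparison.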

# 4. Evaluate the model

In [None]:
wand_model.evaluate(dl_train)

In [None]:
wand_model.evaluate(dl_val)

In [None]:
wand_model.evaluate(dl_test)

# 5. Use the model

In [None]:
from tqdm import tqdm

net, dl_test = wand_model.accelerator.prepare(net, dl_test)
net.eval()
preds = []
with torch.no_grad():
    for batch in tqdm(dl_test):
        preds.append(net.predict(batch))

In [None]:
yhat_list = [yd.sigmoid().reshape(-1).tolist() for yd in preds]
yhat = []
for yd in yhat_list:
    yhat.extend(yd)

In [None]:
dftest_raw = dftest_raw.rename(columns={target_col: 'y'})
dftest_raw['yhat'] = yhat

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(dftest_raw['y'], dftest_raw['yhat'])

# 6. Save the model

In [None]:
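# Reload the trained weights; assumes `fit` wrote a state_dict to the ckpt_path passed above
net.load_state_dict(torch.load(
    DIR_PATH + "/merlin/assets/checkpoints/fm_criteo_binary", weights_only=True))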