# Mentored Enterprise Training Program, Business Intelligence Track, Cohort 004, Lesson 5

### Thinking 1: What is the principle behind using GBDT+LR for CTR (click-through rate) prediction?

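In CTR prediction, GBDT+LR uses GBDT for feature extraction and feature generation: each sample is routed through every tree, the index of the leaf it lands in is one-hot encoded, and the resulting sparse vector is fed to LR as its input features. Because each root-to-leaf path corresponds to a combination of feature conditions, the feature combinations are learned by GBDT; LR then learns a weight for each combination and outputs the predicted click-through rate.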
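The pipeline described above can be sketched with scikit-learn. This is a minimal illustration, not the course's reference implementation: the synthetic dataset, the hyperparameters, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic binary "click / no click" data (an assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: GBDT learns the feature combinations (one per root-to-leaf path).
gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)

# Step 2: apply() returns the leaf index each sample reaches in every tree.
# For binary classification the shape is (n_samples, n_estimators, 1).
train_leaves = gbdt.apply(X_train)[:, :, 0]
test_leaves = gbdt.apply(X_test)[:, :, 0]

# Step 3: one-hot encode the leaf indices, one group of columns per tree.
encoder = OneHotEncoder(handle_unknown="ignore")
train_onehot = encoder.fit_transform(train_leaves)
test_onehot = encoder.transform(test_leaves)

# Step 4: LR learns one weight per leaf and outputs the click probability.
lr = LogisticRegression(max_iter=1000)
lr.fit(train_onehot, y_train)
ctr_pred = lr.predict_proba(test_onehot)[:, 1]
print("predicted CTR for the first 5 test samples:", ctr_pred[:5])
```

Each row of `train_onehot` has exactly one active column per tree, so LR sees a sparse binary vector with 30 ones in this configuration.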

### Thinking 2: What is the structure of the Wide & Deep model, and why does it offer both memorization and generalization?

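The Wide & Deep model combines a single-layer Wide part with a multi-layer Deep part. The single-layer linear Wide part lets the model learn feature co-occurrences that appear frequently in the training data, which acts like memorization. The multi-layer nonlinear Deep part learns latent relationships among feature combinations, so the model can make reasonable inferences about combinations that never appeared in the training data, which gives it generalization ability.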
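To make the two parts concrete, here is a minimal NumPy sketch of a Wide & Deep forward pass. It is only an illustration under stated assumptions: the two toy fields, the crossed feature used as the wide input, the layer sizes, and the random weights are all hypothetical, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wide_deep_forward(x_wide, x_ids, params):
    # Wide part: one linear layer over sparse (crossed) features -> memorization.
    wide_logit = x_wide @ params["w_wide"]
    # Deep part: embedding lookup per field, concatenation, then an MLP
    # with ReLU activations -> generalization to unseen combinations.
    embedded = [params["emb"][f][x_ids[:, f]] for f in range(x_ids.shape[1])]
    h = np.concatenate(embedded, axis=1)
    for W, b in params["mlp"]:
        h = np.maximum(h @ W + b, 0.0)
    deep_logit = h @ params["w_deep"]
    # Both logits are summed and passed through one sigmoid.
    return sigmoid(wide_logit + deep_logit + params["bias"])

# Two hypothetical categorical fields with 5 and 3 distinct values.
vocab_sizes, emb_dim = [5, 3], 4
params = {
    "w_wide": rng.normal(scale=0.1, size=vocab_sizes[0] * vocab_sizes[1]),
    "emb": [rng.normal(scale=0.1, size=(v, emb_dim)) for v in vocab_sizes],
    "mlp": [
        (rng.normal(scale=0.1, size=(2 * emb_dim, 16)), np.zeros(16)),
        (rng.normal(scale=0.1, size=(16, 8)), np.zeros(8)),
    ],
    "w_deep": rng.normal(scale=0.1, size=8),
    "bias": 0.0,
}

x_ids = np.array([[0, 1], [4, 2], [2, 0]])          # raw category ids
cross_id = x_ids[:, 0] * vocab_sizes[1] + x_ids[:, 1]
x_wide = np.eye(vocab_sizes[0] * vocab_sizes[1])[cross_id]  # one-hot cross feature

probs = wide_deep_forward(x_wide, x_ids, params)
print(probs)
```

The wide weight for a crossed feature is only updated when that exact pair occurs in training, while the embeddings are shared across all pairs that involve the same category.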

### Thinking 3: In CTR prediction, in what ways can FM and DNN be combined, and which models represent each approach?

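1. Parallel structure, e.g. DeepFM: the FM and DNN components are computed side by side, and their outputs are combined at the end to produce the CTR prediction.
2. Serial structure, e.g. NFM: the output of the FM-style interaction layer is used as the input of the DNN. The notes also list NeuMF here as a model that feeds the DNN output into FM; in its published form, however, NeuMF runs a generalized matrix factorization branch and an MLP branch side by side and merges their outputs, which is closer to the parallel pattern.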
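The two wiring patterns in the list above can be sketched in NumPy. This is a simplified illustration, not the DeepFM or NFM reference code: the first-order terms and biases are omitted, the weights are random, and the function names and layer sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, k = 6, 4
V = rng.normal(size=(n_features, k))           # one k-dim latent vector per feature
x = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])   # sparse, one-hot style input

def bi_interaction(x, V):
    # 0.5 * ((sum_i x_i v_i)^2 - sum_i (x_i v_i)^2), kept as a k-dim vector.
    weighted = V * x[:, None]
    return 0.5 * (weighted.sum(axis=0) ** 2 - (weighted ** 2).sum(axis=0))

def fm_second_order(x, V):
    # Summing the vector gives sum_{i<j} <v_i, v_j> x_i x_j in O(n * k).
    return bi_interaction(x, V).sum()

def fm_second_order_naive(x, V):
    # Brute-force pairwise version, used only to check the identity above.
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            total += (V[i] @ V[j]) * x[i] * x[j]
    return total

def mlp(h, W1, w2):
    # One hidden ReLU layer followed by a linear output unit.
    return np.maximum(h @ W1, 0.0) @ w2

# Parallel (DeepFM style): FM and DNN share embeddings, logits are added.
deep_input = (V * x[:, None]).ravel()
W1_par, w2_par = rng.normal(size=(n_features * k, 8)), rng.normal(size=8)
parallel_logit = fm_second_order(x, V) + mlp(deep_input, W1_par, w2_par)

# Serial (NFM style): the pooled interaction vector is the DNN's input.
W1_ser, w2_ser = rng.normal(size=(k, 8)), rng.normal(size=8)
serial_logit = mlp(bi_interaction(x, V), W1_ser, w2_ser)

print(parallel_logit, serial_logit)
```

In the parallel case the DNN never sees the explicit pairwise products; in the serial case the DNN operates only on them.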

### Thinking 4: GBDT and random forest are both tree-based algorithms. How do they differ?

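<table border="1">
<tr>
    <td><b>GBDT</b></td><td><b>Random forest</b></td>
</tr>
<tr>
<td>Based on boosting</td><td>Based on bagging</td>
</tr>
<tr>
<td>Built from regression trees</td><td>Can be built from classification trees or regression trees</td>
</tr>
<tr>
<td>Trees can only be grown sequentially</td><td>Trees can be grown in parallel</td>
</tr>
<tr>
<td>Output is the sum of the outputs of all trees</td><td>Output is aggregated across trees, e.g. by voting (classification) or averaging (regression)</td>
</tr>
<tr>
<td>Sensitive to outliers</td><td>Not sensitive to outliers</td>
</tr>
<tr>
<td>An ensemble of weak learners; each new tree focuses on the errors left by the previous trees</td><td>Treats all training samples equally</td>
</tr>
<tr>
<td>Improves performance by reducing model bias</td><td>Improves performance by reducing model variance</td>
</tr>
</table>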
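The "sum of trees" versus "average of trees" rows can be checked directly with scikit-learn's regressors. The sketch below is illustrative only; the synthetic data and hyperparameters are assumptions, and it relies on the default squared-error loss of `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Random forest (bagging): trees are independent, so they can be fit in
# parallel (n_jobs), and the prediction is the plain average over trees.
rf = RandomForestRegressor(n_estimators=20, random_state=0, n_jobs=2).fit(X, y)
rf_manual = np.mean([tree.predict(X) for tree in rf.estimators_], axis=0)

# GBDT (boosting): trees are fit one after another on the remaining error,
# and the prediction is the initial guess plus a scaled sum of all trees.
gbdt = GradientBoostingRegressor(
    n_estimators=20, learning_rate=0.1, random_state=0
).fit(X, y)
gbdt_manual = gbdt.init_.predict(X).astype(float)
for stage in gbdt.estimators_:          # each stage holds one regression tree
    gbdt_manual += gbdt.learning_rate * stage[0].predict(X)

print("RF matches average of trees:", np.allclose(rf_manual, rf.predict(X)))
print("GBDT matches init + sum of trees:", np.allclose(gbdt_manual, gbdt.predict(X)))
```

Note that `GradientBoostingRegressor` has no `n_jobs` parameter for tree construction, which reflects the sequential dependency between its trees.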

### Thinking 5: How is item popularity used in recommender systems?

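1. Cold start: popularity partly addresses the cold-start problem. When a new user joins the system, popular items can be recommended to them.
2. Personalized recommendation: to better uncover a user's own interests, the influence of highly popular items on the personalized prediction can be reduced.
3. Different application scenarios: if the goal is to create best-sellers (e.g. Vipshop), popular items should be recommended; if the product emphasizes personalization or even deliberately avoids mainstream choices (e.g. dating and matchmaking sites), less popular items need to be surfaced.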
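The first two uses above can be sketched with pandas. This is a minimal illustration: the toy interaction log, the function names, and the logarithmic penalty are assumptions chosen for the example, not a prescribed formula.

```python
import numpy as np
import pandas as pd

# Toy interaction log (hypothetical data).
interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3, 4],
    "item_id": ["a", "b", "a", "c", "a", "b", "d", "a"],
})

# Popularity = number of distinct users who interacted with each item.
popularity = (
    interactions.groupby("item_id")["user_id"]
    .nunique()
    .sort_values(ascending=False)
)

def recommend_popular(top_n):
    # Cold start: with no history for a new user, fall back to popular items.
    return popularity.head(top_n).index.tolist()

def penalize_popularity(scores, alpha=1.0):
    # Personalization: damp the scores of popular items so that niche items
    # matching the user's taste can rank higher. alpha controls the strength.
    pop = popularity.reindex(scores.index).fillna(0)
    return scores / (1.0 + np.log1p(pop)) ** alpha

raw_scores = pd.Series({"a": 0.9, "b": 0.8, "d": 0.7})  # hypothetical model scores
adjusted = penalize_popularity(raw_scores)

print(recommend_popular(2))
print(adjusted.sort_values(ascending=False))
```

With this toy data, item `a` has the highest raw score, but after the penalty the least popular candidate `d` ranks first; setting `alpha=0` disables the penalty.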

### Action 1: Rating prediction on MovieLens with the Wide & Deep model

In [3]:
# Imports
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# deepctr
from deepctr.models import WDL
from deepctr.feature_column import SparseFeat,get_feature_names

In [4]:
# Load the data
data = pd.read_csv("data/movielens_sample.txt")
sparse_features = ["movie_id", "user_id", "gender", "age", "occupation", "zip"]
target = ['rating']

In [5]:
# Label-encode each sparse feature into integer ids
for feature in sparse_features:
    lbe = LabelEncoder()
    data[feature] = lbe.fit_transform(data[feature])
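As a small aside on the encoding step: `LabelEncoder` maps the distinct values of a column to consecutive integers starting at 0, which is why the number of distinct values can later serve as the embedding vocabulary size. The toy column below is a hypothetical example, not part of the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical zip codes, stored as strings.
toy = pd.DataFrame({"zip": ["94110", "10001", "94110", "60614"]})

lbe = LabelEncoder()
toy["zip_id"] = lbe.fit_transform(toy["zip"])

# classes_ holds the sorted distinct values; the id is the position in it.
vocabulary_size = toy["zip_id"].nunique()
print(list(lbe.classes_), toy["zip_id"].tolist(), vocabulary_size)
```

Because the ids run from 0 to `vocabulary_size - 1`, every id is a valid row index into an embedding table of that size.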

In [6]:
# Build one SparseFeat per feature, using the number of distinct values as its vocabulary size
fixlen_feature_columns = [SparseFeat(feature, data[feature].nunique()) for feature in sparse_features]
linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

In [7]:
# Split the dataset into a training set and a test set
train, test = train_test_split(data, test_size=0.2)
train_model_input = {name:train[name].values for name in feature_names}
test_model_input = {name:test[name].values for name in feature_names}

In [8]:
# Train the WDL model
model = WDL(linear_feature_columns, dnn_feature_columns, task='regression')
model.compile("adam", "mse", metrics=['mse'], )
history = model.fit(train_model_input, train[target].values, batch_size=256, epochs=1, verbose=True, validation_split=0.2, )

Instructions for updating:
keep_dims is deprecated, use keepdims instead
Train on 128 samples, validate on 32 samples
Epoch 1/1


In [9]:
# Predict with the WDL model
pred_ans = model.predict(test_model_input, batch_size=256)

In [11]:
# Print RMSE and MSE
mse = round(mean_squared_error(test[target].values, pred_ans), 4)
rmse = mse ** 0.5
print("test RMSE", rmse)
print("test MSE", mse)

test RMSE 3.9082860693659565
test MSE 15.2747


### The movielens-1m dataset
These files contain 1,000,209 anonymous ratings of approximately 3,900 movies 
made by 6,040 MovieLens users who joined MovieLens in 2000.

In [12]:
# Preprocess the MovieLens dataset: read the three "::"-separated files and join them
def movie_lens_preprocess(ratings_file, users_file, movies_file, 
                          rating_col="UserID::MovieID::Rating::Timestamp",
                          user_col="UserID::Gender::Age::Occupation::Zip-code", 
                          movie_col="MovieID::Title::Genres"):
    ratings = pd.read_csv(ratings_file, header=None, sep="::", engine='python')
    ratings.columns=rating_col.split("::")
    movies = pd.read_csv(movies_file, header=None, sep="::", engine='python')
    movies.columns=movie_col.split("::")
    users = pd.read_csv(users_file, header=None, sep="::", engine='python')
    users.columns=user_col.split("::")
    data = pd.merge(ratings, movies,how="left", on="MovieID")
    data = pd.merge(data, users, how="left", on="UserID")
    return data

In [13]:
# Train a Wide & Deep model and predict with it
def WDL_train_predict(data, sparse_features, target):
    # Label-encode each sparse feature into integer ids (modifies data in place)
    for feature in sparse_features:
        lbe = LabelEncoder()
        data[feature] = lbe.fit_transform(data[feature])
    # Build one SparseFeat per feature, using the number of distinct values as its vocabulary size
    fixlen_feature_columns = [SparseFeat(feature, data[feature].nunique()) for feature in sparse_features]
    print(fixlen_feature_columns)
    linear_feature_columns = fixlen_feature_columns
    dnn_feature_columns = fixlen_feature_columns
    feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
    # Split the dataset into a training set and a test set
    train, test = train_test_split(data, test_size=0.2)
    train_model_input = {name:train[name].values for name in feature_names}
    test_model_input = {name:test[name].values for name in feature_names}
    # Train the Wide & Deep model
    model = WDL(linear_feature_columns, dnn_feature_columns, task='regression')
    model.compile("adam", "mse", metrics=['mse'], )
    history = model.fit(train_model_input, train[target].values, batch_size=256, epochs=1, verbose=True, validation_split=0.2, )
    # Predict with the Wide & Deep model
    pred_ans = model.predict(test_model_input, batch_size=256)
    # Print RMSE and MSE
    mse = round(mean_squared_error(test[target].values, pred_ans), 4)
    rmse = mse ** 0.5
    print("\n\n","*"*150)
    print("test RMSE", rmse)
    print("test MSE", mse)

In [15]:
data = movie_lens_preprocess("data/ml-1m/ratings.dat", 
                             "data/ml-1m/users.dat", 
                             "data/ml-1m/movies.dat")
WDL_train_predict(data, ["MovieID", "UserID", "Gender", "Age", "Occupation", "Zip-code"], ['Rating'])

[SparseFeat(name='MovieID', vocabulary_size=3706, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x00000211D05E89B0>, embedding_name='MovieID', group_name='default_group', trainable=True), SparseFeat(name='UserID', vocabulary_size=6040, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x00000211D2F80EB8>, embedding_name='UserID', group_name='default_group', trainable=True), SparseFeat(name='Gender', vocabulary_size=2, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x00000211D2F80518>, embedding_name='Gender', group_name='default_group', trainable=True), SparseFeat(name='Age', vocabulary_size=7, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x00000211D2F