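Wide & Deep is a class of models for classification and regression problems, released with TensorFlow around June 2016 and applied to app recommendation in Google Play. Its core idea is to combine the memorization ability of a linear model with the generalization ability of a DNN. The parameters of both parts are optimized jointly during training, so that the predictive performance of the combined model is as good as possible.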

<img src="./imgs/wide & deep.png" width = "550" height = "550" />

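- Wide part
Corresponds to the linear model. Its inputs can be continuous features or sparse categorical features; categorical features can be crossed to form higher-dimensional categorical features. With L1 regularization the linear model converges quickly and keeps only the effective features.

- Deep part
Corresponds to the DNN model. Each feature is mapped to a low-dimensional dense vector, which we call the feature's embedding. The output of the whole Wide & Deep model is the sum of the linear model's output and the DNN's output.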
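The combination described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in this notebook: the feature names, weights and embeddings are made up, the deep part is reduced to a single ReLU followed by a linear readout, and the sigmoid corresponds to the classification setting of the original paper rather than to the price regression below.

```python
import math

def cross_product(features, pair):
    """Crossed feature: 1.0 only when both binary features in `pair` are active."""
    a, b = pair
    return 1.0 if features.get(a, 0) and features.get(b, 0) else 0.0

def wide_deep_probability(features, wide_weights, cross_pairs, cross_weights,
                          embeddings, deep_weights, bias):
    # Wide part: linear model over the raw features plus the crossed features.
    wide_logit = sum(wide_weights.get(name, 0.0) * value
                     for name, value in features.items())
    wide_logit += sum(w * cross_product(features, pair)
                      for pair, w in zip(cross_pairs, cross_weights))

    # Deep part (stand-in for a real DNN): look up one dense embedding per
    # active feature, concatenate them, apply ReLU, then a linear readout.
    hidden = []
    for name, value in features.items():
        if value:
            hidden.extend(max(0.0, x) for x in embeddings[name])
    deep_logit = sum(w * h for w, h in zip(deep_weights, hidden))

    # Joint prediction: both logits are summed before the sigmoid.
    return 1.0 / (1.0 + math.exp(-(wide_logit + deep_logit + bias)))

# Hypothetical toy inputs, chosen only to make the arithmetic easy to follow.
features = {"lang=en": 1, "app=game": 1}
p = wide_deep_probability(
    features,
    wide_weights={"lang=en": 0.2, "app=game": -0.1},
    cross_pairs=[("lang=en", "app=game")],
    cross_weights=[0.5],
    embeddings={"lang=en": [0.1, -0.3], "app=game": [0.4, 0.2]},
    deep_weights=[1.0, 1.0, 1.0, 1.0],
    bias=0.0,
)
```

Here the wide logit is 0.2 - 0.1 + 0.5 = 0.6 and the deep logit is 0.1 + 0.0 + 0.4 + 0.2 = 0.7, so `p` is the sigmoid of 1.3. Because the two logits share one loss, a gradient step updates both parts at the same time, which is what joint training means here.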


Using a Wide & Deep model: predicting wine price from its description and variety

In [80]:
import os 
import math
import itertools
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

import tensorflow as tf
from tensorflow import keras
layers = keras.layers

# TensorFlow version
print("tf.__version__: ", tf.__version__)

tf.__version__:  1.12.0


In [105]:
URL = "https://storage.googleapis.com/sara-cloud-ml/wine_data.csv"
path = tf.keras.utils.get_file(URL.split('/')[-1], URL)

In [106]:
#file_path = './data/wine-reviews/winemag-data_first150k.csv'#'./data/wine-reviews/winemag-data-130k-v2.csv'
data = pd.read_csv(path)
data = data.sample(frac=1)
print('data.shape: ', data.shape)

data.head(2)

data.shape:  (150929, 11)


| Unnamed: 0.1 | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 14964 | 14964 | France | Full, soft and rounded, this shows flavors of ... | | 89 | 35.0 | Provence | Côtes de Provence | | Rosé | Château d'Esclans |
| 135556 | 135556 | Italy | Sangiovese with 25% Merlot for added softness ... | Poggio alla Badiola | 87 | 15.0 | Tuscany | Toscana | | Sangiovese | Mazzei |


Data preprocessing

variety: the grape variety used to make the wine

In [107]:
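# Basic preprocessing: drop rows without a country or a price, then drop the first column
data = data[pd.notnull(data['country'])]
data = data[pd.notnull(data['price'])]
data = data.drop(data.columns[0], axis=1)

variety_threshold = 500  # varieties with 500 or fewer rows are removed
value_counts = data['variety'].value_counts()
remove_indexes = value_counts[value_counts <= variety_threshold].index
# Filter on the variety column only, so matching strings in other columns are left untouched
data = data[~data['variety'].isin(remove_indexes)]
data = data[pd.notnull(data['variety'])]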
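The rare-variety filter above boils down to "count each category, keep rows whose category occurs more than the threshold". A library-free sketch of the same logic, using a hypothetical helper `drop_rare` and made-up rows, may make the `<=` boundary easier to see:

```python
from collections import Counter

def drop_rare(rows, key, threshold):
    """Keep only rows whose `key` value occurs more than `threshold` times."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] > threshold]

# Made-up rows for illustration; the real data has roughly 150k rows.
rows = [
    {"variety": "Pinot Noir", "price": 35.0},
    {"variety": "Pinot Noir", "price": 20.0},
    {"variety": "Pinot Noir", "price": 48.0},
    {"variety": "Furmint", "price": 15.0},
]
kept = drop_rare(rows, "variety", threshold=2)
```

With `threshold=2`, the single "Furmint" row is dropped and the three "Pinot Noir" rows are kept. A category that occurs exactly `threshold` times is removed as well, matching the `value_counts <= variety_threshold` condition in the cell above.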

In [108]:
data.head(2)

| Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|
| 14964 | France | Full, soft and rounded, this shows flavors of ... | | 89 | 35.0 | Provence | Côtes de Provence | | Rosé | Château d'Esclans |
| 135556 | Italy | Sangiovese with 25% Merlot for added softness ... | Poggio alla Badiola | 87 | 15.0 | Tuscany | Toscana | | Sangiovese | Mazzei |


In [109]:
# Split into training and test sets: the first 80% of the shuffled rows are used for training
train_size = int(len(data) * 0.8)
print("train size: %d" % train_size)
print("Test size: %d" % (len(data) - train_size))

train size: 95646
Test size: 23912


In [111]:
# train features, train labels
description_train = data['description'][:train_size]
variety_train = data['variety'][:train_size]
labels_train = data['price'][:train_size]

# test features, test labels
description_test = data['description'][train_size:]
variety_test = data['variety'][train_size:]
labels_test = data['price'][train_size:]
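The split above relies on `data.sample(frac=1)` having shuffled the rows earlier, and that call uses no fixed seed, so each run produces a different train/test split. The sketch below shows the same shuffle-then-slice idea with a fixed seed in plain Python; `shuffled_split` and its default seed are illustrative names, not part of the notebook. In pandas, passing `random_state` to `DataFrame.sample` should serve the same purpose.

```python
import random

def shuffled_split(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of `items` with a fixed seed, then cut it at train_fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train_size = int(len(shuffled) * train_fraction)
    return shuffled[:train_size], shuffled[train_size:]

train, test = shuffled_split(range(10))
```

Calling `shuffled_split(range(10))` again returns exactly the same two lists, which makes results comparable between runs.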

In [112]:
# Fit the tokenizer on the training descriptions to build the vocabulary
vocab_size = 12000
tokenizer = keras.preprocessing.text.Tokenizer(num_words=vocab_size, char_level=False)
tokenizer.fit_on_texts(description_train)

```
description_train[0:2]
1    This is ripe and fruity, a wine that is smooth...
2    Tart and snappy, the flavors of lime flesh and...
Name: description, dtype: object
```

In [113]:
# wide feature 01: bag-of-words (BoW) vector
description_bow_train = tokenizer.texts_to_matrix(description_train)
description_bow_test = tokenizer.texts_to_matrix(description_test)

```
description_bow_train[0:2]
array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])
```
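To see where a matrix like the one above comes from, here is a simplified re-implementation of the idea behind `Tokenizer.texts_to_matrix` in its default binary mode. It is only a sketch: the regex tokenization and the tie-breaking between equally frequent words are assumptions and will not match Keras in every detail, and the two example sentences are made up.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenization: lowercase, keep letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 stays unused)."""
    counts = Counter(word for text in texts for word in tokenize(text))
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ranked)}

def texts_to_binary_matrix(texts, word_index, num_words):
    """One row per text; column i is 1.0 if the word with index i occurs in it."""
    rows = []
    for text in texts:
        row = [0.0] * num_words
        for word in tokenize(text):
            idx = word_index.get(word)
            if idx is not None and idx < num_words:  # only the top words are kept
                row[idx] = 1.0
        rows.append(row)
    return rows

texts = ["ripe and fruity", "tart and snappy and dry"]
word_index = fit_word_index(texts)
matrix = texts_to_binary_matrix(texts, word_index, num_words=4)
```

With `num_words=4` only indices 1 to 3 survive ("and", "ripe", "fruity"), so `matrix` is `[[0.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 0.0]]`. Column 0 is always zero because word indices start at 1, which is also why the first column of `description_bow_train` is 0.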

In [114]:
# wide feature 02: one-hot vector of variety categories
encoder = LabelEncoder()
encoder.fit(variety_train)
variety_train = encoder.transform(variety_train)  # integer class ids, starting at 0
variety_test = encoder.transform(variety_test)

In [115]:
num_classes = np.max(variety_train) + 1  # class ids start at 0, so the class count is max id + 1

# Convert the integer class ids to one-hot vectors
variety_train = keras.utils.to_categorical(variety_train, num_classes)
variety_test = keras.utils.to_categorical(variety_test, num_classes)

```
variety_train
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
```
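The two steps above (label encoding, then one-hot encoding) can be illustrated without any library. The helper names and the three varieties below are made up for the example; the sorted ordering mirrors how `LabelEncoder` assigns ids, as far as I know.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer id, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def one_hot(ids, num_classes):
    """Turn each integer id into a vector with a single 1.0 at that position."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in ids]

varieties = ["Merlot", "Sangiovese", "Chardonnay", "Merlot"]
mapping = fit_label_encoder(varieties)
ids = [mapping[v] for v in varieties]
num_classes = max(ids) + 1  # ids start at 0, hence the + 1
encoded = one_hot(ids, num_classes)
```

Here `ids` is `[1, 2, 0, 1]` and `encoded` has one row of length 3 per wine. Note that the mapping is built from the training labels only: a variety that appears only in the test set has no id, and `LabelEncoder.transform` raises an error in that situation.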

In [116]:
# wide model
bow_inputs = layers.Input(shape=(vocab_size,))
variety_inputs = layers.Input(shape=(num_classes,))
merged_layer = layers.concatenate([bow_inputs, variety_inputs])
merged_layer = layers.Dense(256, activation='relu')(merged_layer)
predictions = layers.Dense(1)(merged_layer)

wide_model = keras.Model(inputs=[bow_inputs, variety_inputs], outputs=predictions)

In [117]:
# Compile the model
wide_model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(wide_model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_17 (InputLayer)           (None, 12000)        0                                            
__________________________________________________________________________________________________
input_18 (InputLayer)           (None, 40)           0                                            
__________________________________________________________________________________________________
concatenate_9 (Concatenate)     (None, 12040)        0           input_17[0][0]                   
                                                                 input_18[0][0]                   
__________________________________________________________________________________________________
dense_19 (Dense)                (None, 256)          3082496     concatenate_9[0][0]              
__________

```
description_train.head()
1    This is ripe and fruity, a wine that is smooth...
2    Tart and snappy, the flavors of lime flesh and...
3    Pineapple rind, lemon pith and orange blossom ...
4    Much like the regular bottling from 2012, this...
7    This dry and restrained wine offers spice in p...
Name: description, dtype: object
```

In [118]:
# deep model feature: word embedding of wine description
train_embed = tokenizer.texts_to_sequences(description_train)
test_embed = tokenizer.texts_to_sequences(description_test)

texts_to_sequences returns one list of word indices per description, e.g. [[6, 7, 27, 1,...]..]

In [119]:
max_seq_length = 170
train_embed = keras.preprocessing.sequence.pad_sequences(
    train_embed, maxlen=max_seq_length, padding="post"  # padding="post" appends zeros at the end of each sequence
)
test_embed = keras.preprocessing.sequence.pad_sequences(
    test_embed, maxlen=max_seq_length, padding="post"
)

```
train_embed
array([[  6,   7,  27, ...,   0,   0,   0],
       [104,   1, 904, ...,   0,   0,   0],
       [201, 607,  77, ...,   0,   0,   0],
       ...,
```
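What `pad_sequences(..., padding="post")` does to each sequence can be sketched in plain Python. The function `pad_post` is a hypothetical stand-in, not the Keras code; it assumes the Keras default `truncating="pre"`, which the cell above does not override, so sequences longer than `maxlen` lose their leading tokens.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; drop leading items when a sequence is too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items (truncating="pre")
        padded.append(seq + [value] * (maxlen - len(seq)))  # zeros go at the end
    return padded

padded = pad_post([[6, 7, 27], [104, 1, 904, 5, 9, 12]], maxlen=4)
```

The short sequence becomes `[6, 7, 27, 0]` and the long one becomes `[904, 5, 9, 12]`. If the beginning of a description matters more than its end, passing `truncating="post"` to `pad_sequences` would be the option to look at.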

In [120]:
# deep model
deep_inputs = layers.Input(shape=(max_seq_length,))
# input_dim: vocab_size(12000) output_dim: 8
embedding = layers.Embedding(vocab_size, 8, input_length=max_seq_length)(deep_inputs)
embedding = layers.Flatten()(embedding)
embed_out = layers.Dense(1)(embedding)
deep_model = keras.Model(inputs=deep_inputs, outputs=embed_out)
print(deep_model.summary())

# Compile the model
deep_model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_19 (InputLayer)        (None, 170)               0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 170, 8)            96000     
_________________________________________________________________
flatten_4 (Flatten)          (None, 1360)              0         
_________________________________________________________________
dense_21 (Dense)             (None, 1)                 1361      
Total params: 97,361
Trainable params: 97,361
Non-trainable params: 0
_________________________________________________________________
None
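The parameter counts in the summaries can be checked by hand, which is a quick way to confirm that the layers are wired as intended. The numbers below are taken from the cells and summaries above (the 40 variety classes come from the `input_18` shape); only the variable names are new.

```python
vocab_size, embed_dim, max_seq_length = 12000, 8, 170

# Deep model
embedding_params = vocab_size * embed_dim   # one 8-dim vector per vocabulary entry
flatten_units = max_seq_length * embed_dim  # 170 positions x 8 dims
dense_params = flatten_units * 1 + 1        # weights plus one bias
deep_total = embedding_params + dense_params

# Wide model, first Dense layer: (BoW columns + variety classes) x 256 units + biases
wide_dense_params = (vocab_size + 40) * 256 + 256
```

These evaluate to 96000, 1360, 1361 and 97361 for the deep model, and 3082496 for the wide model's first dense layer, matching the summaries.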


In [121]:
# Merge the wide and deep models
merged_out = layers.concatenate([wide_model.output, deep_model.output])
merged_out = layers.Dense(1)(merged_out)

combined_model = keras.Model(wide_model.input + [deep_model.input], merged_out)
print(combined_model.summary())

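# Compile. Note that 'accuracy' is not meaningful for a regression target such as price;
# the MSE loss is the number to watch.
combined_model.compile(loss='mse',
                      optimizer='adam',
                      metrics=['accuracy'])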

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_17 (InputLayer)           (None, 12000)        0                                            
__________________________________________________________________________________________________
input_18 (InputLayer)           (None, 40)           0                                            
__________________________________________________________________________________________________
input_19 (InputLayer)           (None, 170)          0                                            
__________________________________________________________________________________________________
concatenate_9 (Concatenate)     (None, 12040)        0           input_17[0][0]                   
                                                                 input_18[0][0]                   
__________

In [122]:
# Train
combined_model.fit([description_bow_train, variety_train] + [train_embed], labels_train, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1a34219208>

In [123]:
# Evaluate
combined_model.evaluate([description_bow_test, variety_test] + [test_embed], labels_test, batch_size=128)



[753.425784522293, 0.0]
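The two numbers returned by `evaluate` are the loss (MSE) and the `accuracy` metric. MSE is in squared price units, so taking the square root gives an error on the same scale as the prices. A short check using the values printed above:

```python
import math

mse, accuracy = 753.425784522293, 0.0  # values copied from the evaluate() output
rmse = math.sqrt(mse)                  # root mean squared error, in price units
print(round(rmse, 2))
```

This gives an RMSE of roughly 27.45 in the units of the `price` column. The accuracy of 0.0 is expected rather than alarming: roughly speaking, the metric counts exact matches between predictions and labels, which essentially never happen for a continuous target, so it carries no information for this regression task.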

In [125]:
# Predict
predictions = combined_model.predict([description_bow_test, variety_test] + [test_embed])

In [126]:
predictions

array([[ 14.329004],
       [ 17.920763],
       [166.70659 ],
       ...,
       [ 42.538563],
       [ 20.129736],
       [ 12.247935]], dtype=float32)

### Related reading

Building a Wide & Deep neural network with Keras to predict wine prices from descriptions

https://cloud.tencent.com/developer/news/197412

TensorFlow Wide And Deep model explained, with applications

https://cloud.tencent.com/developer/article/1143316

[Paper] Wide & Deep Learning for Recommender Systems

https://arxiv.org/pdf/1606.07792.pdf
