# Using Pre-trained Word Embeddings

Author: [Dongyang Yan] (623320480@qq.com, github.com/fiyen)

Date created: 2020/11/23

Last modified: 2020/11/24

Description: Tutorial on classifying the Imdb dataset with pre-trained word embeddings in PaddlePaddle 2.0

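## Abstract

In this example, we use PaddlePaddle 2.0 to train and test a classifier on the Imdb dataset (a binary sentiment classification dataset of movie reviews). The Imdb data is loaded directly through PaddlePaddle 2.0, and the task is carried out
with pre-trained word embeddings ([GloVe embedding](http://nlp.stanford.edu/projects/glove/)).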

## Environment Setup

In [None]:
import paddle as pd
from paddle.io import Dataset
import numpy as np
import paddle.text as pt
import random

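## Loading the Imdb dataset with PaddlePaddle 2.0
PaddlePaddle 2.0 ships a preprocessed version of the Imdb dataset, so we can load the samples we need directly and skip the data preprocessing. The high-quality datasets already built into PaddlePaddle 2.0
include Conll05st, Imdb, Imikolov, Movielens, UCIHousing, WMT14 and WMT16, and interfaces for more commonly used datasets are planned.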

In [None]:
imdb_train = pt.Imdb(mode='train', cutoff=150)
imdb_test = pt.Imdb(mode='test', cutoff=150)

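Loading Imdb returns encoded content. Each sample represents one document and is stored as a list; each element of the list is an integer that stands for the word at that position in the document,
and words map one-to-one to integer codes. The mapping can be inspected through imdb_train.word_idx. Let's check the data generated above: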

In [None]:
print("Training samples: %d; test samples: %d" % (len(imdb_train), len(imdb_test)))
print(f"Labels: {set(imdb_train.labels)}")
print(f"Vocabulary: {list(imdb_train.word_idx.items())[:10]}")
print(f"One sample: {imdb_train.docs[0]}")
print(f"Min sample length: {min([len(x) for x in imdb_train.docs])}; max sample length: {max([len(x) for x in imdb_train.docs])}")

Training samples: 25000; test samples: 25000
Labels: {0, 1}
Vocabulary: [(b'the', 0), (b'and', 1), (b'a', 2), (b'of', 3), (b'to', 4), (b'is', 5), (b'in', 6), (b'it', 7), (b'i', 8), (b'this', 9)]
One sample: [5146, 43, 71, 6, 1092, 14, 0, 878, 130, 151, 5146, 18, 281, 747, 0, 5146, 3, 5146, 2165, 37, 5146, 46, 5, 71, 4089, 377, 162, 46, 5, 32, 1287, 300, 35, 203, 2136, 565, 14, 2, 253, 26, 146, 61, 372, 1, 615, 5146, 5, 30, 0, 50, 3290, 6, 2148, 14, 0, 5146, 11, 17, 451, 24, 4, 127, 10, 0, 878, 130, 43, 2, 50, 5146, 751, 5146, 5, 2, 221, 3727, 6, 9, 1167, 373, 9, 5, 5146, 7, 5, 1343, 13, 2, 5146, 1, 250, 7, 98, 4270, 56, 2316, 0, 928, 11, 11, 9, 16, 5, 5146, 5146, 6, 50, 69, 27, 280, 27, 108, 1045, 0, 2633, 4177, 3180, 17, 1675, 1, 2571]
Min sample length: 10; max sample length: 2469
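
To read an encoded sample back as text, the mapping can be inverted. The sketch below uses a tiny made-up `toy_word_idx` shaped like the dictionary printed above (bytes keys plus a plain-string '<unk>' entry). The helper name `decode_sample` is hypothetical and not part of paddle.

```python
def decode_sample(sample, word_idx):
    """Map a list of integer codes back to a space-separated string."""
    idx_to_word = {idx: word for word, idx in word_idx.items()}
    words = []
    for idx in sample:
        word = idx_to_word[idx]
        # Keys are bytes except for the plain-string '<unk>' entry.
        words.append(word.decode() if isinstance(word, bytes) else word)
    return " ".join(words)


# Toy mapping for illustration only; the real one is imdb_train.word_idx.
toy_word_idx = {b'the': 0, b'and': 1, b'a': 2, b'movie': 3, '<unk>': 4}
print(decode_sample([0, 3, 1, 4], toy_word_idx))
```

With the real data, the same call would presumably be `decode_sample(imdb_train.docs[0], imdb_train.word_idx)`.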


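In the parameters above, cutoff sets the frequency threshold used when building the vocabulary: words whose frequency in the dataset falls below cutoff are left out. mode selects which split is returned (test:
test set, train: training set). For the training set, we shuffle the order of the samples, which helps the training of the classifier that follows.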

In [None]:
shuffle_index = list(range(len(imdb_train)))
random.shuffle(shuffle_index)
train_x = [imdb_train.docs[i] for i in shuffle_index]
train_y = [imdb_train.labels[i] for i in shuffle_index]

test_x = imdb_test.docs
test_y = imdb_test.labels
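
To make the role of `cutoff` more concrete, here is a minimal sketch of how a frequency-ordered vocabulary with a trailing '<unk>' entry could be built. It is not paddle's actual implementation: the function name `build_word_idx`, the tie-breaking rule, and the inclusive boundary (`>= cutoff`) are assumptions made for illustration.

```python
from collections import Counter


def build_word_idx(docs, cutoff):
    """Build a word -> index mapping, most frequent word first."""
    freq = Counter(word for doc in docs for word in doc)
    # Assumption: a word is kept when it occurs at least `cutoff` times.
    kept = [(w, c) for w, c in freq.items() if c >= cutoff]
    # Sort by descending frequency, breaking ties alphabetically.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    word_idx = {w: i for i, (w, _) in enumerate(kept)}
    # Everything outside the vocabulary shares one final index.
    word_idx['<unk>'] = len(word_idx)
    return word_idx


toy_docs = [["a", "good", "film"], ["a", "bad", "film"], ["a", "film"]]
toy_idx = build_word_idx(toy_docs, cutoff=2)
unk = toy_idx['<unk>']
encoded = [toy_idx.get(w, unk) for w in ["a", "great", "film"]]
print(toy_idx, encoded)
```

Rare words ("good", "bad") and unseen words ("great") all end up encoded as the '<unk>' index.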

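The sample lengths show that the samples are not all the same length. During training, however, every sample must have the same length so that the data can be stacked into matrices for batched computation.
We therefore pad or truncate all samples first so that their lengths match.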

In [None]:
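def vectorizer(input, label=None, length=2000):
    # Pad each sample with zeros, then truncate it to exactly `length` tokens.
    # Note: index 0 is also the code of b'the' in word_idx.
    if label is not None:
        for x, y in zip(input, label):
            yield np.array((x + [0]*length)[:length]).astype('int64'), np.array([y]).astype('int64')
    else:
        for x in input:
            yield np.array((x + [0]*length)[:length]).astype('int64')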
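
The padding and truncation step can be checked on small inputs. The sketch below isolates it in a hypothetical helper, `pad_or_truncate`, and uses a short target length so the effect is easy to see.

```python
import numpy as np


def pad_or_truncate(sample, length, pad_value=0):
    """Return `sample` as an int64 array of exactly `length` elements."""
    padded = sample + [pad_value] * length  # always at least `length` long
    return np.array(padded[:length], dtype='int64')


short = pad_or_truncate([7, 8, 9], length=5)            # padded on the right
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], length=5)  # tail is cut off
print(short, long)
```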

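## Loading the pre-trained vectors
The file used below is small enough to load into memory in full. Large pre-trained vector files that do not fit in memory at once can be loaded in batches and matched
in parallel. We skip that here; see [this link](https://aistudio.baidu.com/aistudio/projectdetail/496368) if you are interested.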
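
A simpler way to limit memory use, different from the batched and parallel approach mentioned above, is to scan the file once and keep only the vectors for words that are actually needed. The sketch below illustrates this under stated assumptions: `load_needed_vectors` is a hypothetical helper, and an in-memory `io.StringIO` with two made-up 3-dimensional vectors stands in for the real GloVe file.

```python
import io

import numpy as np


def load_needed_vectors(lines, needed_words):
    """Scan GloVe-style lines once, keeping only words in `needed_words`."""
    found = {}
    for line in lines:
        word, values = line.split(maxsplit=1)
        if word in needed_words:
            found[word] = np.array([float(v) for v in values.split()],
                                   dtype='float32')
    return found


# Two made-up vectors stand in for the real file.
fake_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_needed_vectors(fake_file, needed_words={"cat"})
print(vectors)
```

Because the file is read line by line, only the retained vectors are held in memory at any time.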

In [26]:
# Download the pre-trained vectors. This link is slow; a faster download is available at https://aistudio.baidu.com/aistudio/datasetdetail/42051
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

glove_path = "./glove.6B.100d.txt"  # change this to the location of glove.6B.100d.txt
embeddings = {}

Let's first look at one line of the GloVe pre-trained vector file:

In [None]:
# Decode the file as UTF-8
with open(glove_path, encoding='utf-8') as gf:
    line = gf.readline()
    print("One line of GloVe data: '%s'" % line)

One line of GloVe data: 'the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062
'


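As shown, each line starts with a word, followed by the values of that word's vector, separated by spaces. Based on this, a dictionary of all word vectors can be built as follows.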

In [None]:
with open(glove_path, encoding='utf-8') as gf:
    for glove in gf:
        word, embedding = glove.split(maxsplit=1)
        embedding = [float(s) for s in embedding.split(' ')]
        embeddings[word] = embedding
print("Number of pre-trained word vectors: %d" % len(embeddings))
print(f"Vector for the word 'the': {embeddings['the']}")

Number of pre-trained word vectors: 400000
Vector for the word 'the': [-0.038194, -0.24487, 0.72812, -0.39961, 0.083172, 0.043953, -0.39141, 0.3344, -0.57545, 0.087459, 0.28787, -0.06731, 0.30906, -0.26384, -0.13231, -0.20757, 0.33395, -0.33848, -0.31743, -0.48336, 0.1464, -0.37304, 0.34577, 0.052041, 0.44946, -0.46971, 0.02628, -0.54155, -0.15518, -0.14107, -0.039722, 0.28277, 0.14393, 0.23464, -0.31021, 0.086173, 0.20397, 0.52624, 0.17164, -0.082378, -0.71787, -0.41531, 0.20335, -0.12763, 0.41367, 0.55187, 0.57908, -0.33477, -0.36559, -0.54857, -0.062892, 0.26584, 0.30205, 0.99775, -0.80481, -3.0243, 0.01254, -0.36942, 2.2167, 0.72201, -0.24978, 0.92136, 0.034514, 0.46745, 1.1079, -0.19358, -0.074575, 0.23353, -0.052062, -0.22044, 0.057162, -0.15806, -0.30798, -0.41625, 0.37972, 0.15006, -0.53212, -0.2055, -1.2526, 0.071624, 0.70565, 0.49744, -0.42063, 0.26148, -1.538, -0.30223, -0.073438, -0.28312, 0.37104, -0.25217, 0.016215, -0.017099, -0.38984, 0.87424, -0.72569, -0.51058, -0.52028, -0.1459, 0.8278, 0.27062]


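## Matching word vectors to the dataset vocabulary
Next, we extract the vocabulary of the dataset. Note that word codes are assigned in order of word frequency: the more frequent a word, the smaller its code.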

In [None]:
word_idx = imdb_train.word_idx
vocab = [w for w in word_idx.keys()]
print(f"First 5 words of the vocabulary: {vocab[:5]}")
print(f"Last 5 words of the vocabulary: {vocab[-5:]}")

First 5 words of the vocabulary: [b'the', b'and', b'a', b'of', b'to']
Last 5 words of the vocabulary: [b'troubles', b'virtual', b'warriors', b'widely', '<unk>']


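Looking at the last 5 words, we see that the final entry is "<unk>", a symbol that stands for every word outside the vocabulary. Also, b'the' is the bytes form of the string 'the';
use b'the'.decode() to convert it when needed ('<unk>' is a plain string rather than bytes, so take care to distinguish the two).
Next, we match each word in the vocabulary to its word vector. The pre-trained vectors may not cover every word in the dataset vocabulary; for the missing words, we set the
vector to the zero vector.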

In [None]:
# Dimension of the word vectors; must match the pre-trained vectors
dim = 100

vocab_embeddings = np.zeros((len(vocab), dim))
for ind, word in enumerate(vocab):
    if word != '<unk>':
        word = word.decode()
    embedding = embeddings.get(word, np.zeros((dim,)))
    vocab_embeddings[ind, :] = embedding
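
Since uncovered words fall back to zero vectors, it can be useful to know how much of the vocabulary the pre-trained vectors actually cover. The sketch below computes that fraction on toy data; `embedding_coverage` is a hypothetical helper, and the toy vocabulary and vectors are made up.

```python
import numpy as np


def embedding_coverage(vocab, embeddings):
    """Return the fraction of vocabulary words that have a pre-trained vector."""
    hits = 0
    for word in vocab:
        # Vocabulary keys are bytes, except for the plain-string '<unk>'.
        key = word.decode() if isinstance(word, bytes) else word
        if key in embeddings:
            hits += 1
    return hits / len(vocab)


toy_vocab = [b'the', b'film', b'zzyzx', '<unk>']
toy_embeddings = {'the': np.zeros(3), 'film': np.ones(3)}
print(embedding_coverage(toy_vocab, toy_embeddings))
```

With the real data, the call would presumably be `embedding_coverage(vocab, embeddings)`.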

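## Building an Embedding from the pre-trained vectors
For an Embedding built from pre-trained vectors, we usually want its parameters to stay fixed, so we set trainable=False. To continue training the parameters from this starting point,
set trainable=True instead.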

In [None]:
pretrained_attr = pd.ParamAttr(name='embedding',
                               initializer=pd.nn.initializer.Assign(vocab_embeddings),
                               trainable=False)
embedding_layer = pd.nn.Embedding(num_embeddings=len(vocab),
                                  embedding_dim=dim,
                                  padding_idx=word_idx['<unk>'],
                                  weight_attr=pretrained_attr)
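
Conceptually, an embedding layer is a table lookup: each integer code selects one row of the weight matrix. The NumPy sketch below illustrates the idea with a made-up 4x3 table. Returning zero vectors at positions equal to `padding_idx` is our reading of how the layer behaves and should be treated as an assumption; `embedding_lookup` is a hypothetical helper, not a paddle function.

```python
import numpy as np


def embedding_lookup(table, ids, padding_idx=None):
    """Return the rows of `table` selected by `ids`."""
    out = table[ids]  # fancy indexing copies rows: shape ids.shape + (dim,)
    if padding_idx is not None:
        # Assumption: positions equal to padding_idx map to zero vectors.
        out[ids == padding_idx] = 0.0
    return out


toy_table = np.arange(12, dtype='float32').reshape(4, 3)  # 4 words, dim 3
toy_ids = np.array([[0, 2, 3]])                           # batch of 1 sample
looked_up = embedding_lookup(toy_table, toy_ids, padding_idx=3)
print(looked_up.shape)
print(looked_up)
```

A batch of shape (batch, length) becomes (batch, length, dim), which is the shape passed on to the next layer.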

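## Building the classifier
Here we build a simple classifier based on one-dimensional convolution, with the structure Embedding->Conv1D->Pool1D->Linear. Defining the Linear layer requires the
dimension of its input, which can be computed with the formula given in the [official documentation](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-beta/api/paddle/nn/layer/conv/Conv2d_cn.html).
The function below performs the calculation: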

In [None]:
def cal_output_shape(input_shape, out_channels, kernel_size, stride, padding=0, dilation=1):
    return out_channels, int((input_shape + 2*padding - (dilation*(kernel_size - 1) + 1)) / stride) + 1


# Length of each sample
length = 2000

# Convolution layer parameters
kernel_size = 5
out_channels = 10
stride = 2
padding = 0

output_shape = cal_output_shape(length, out_channels, kernel_size, stride, padding)
output_shape = cal_output_shape(output_shape[1], output_shape[0], 2, 2, 0)
sim_model = pd.nn.Sequential(embedding_layer,
                         pd.nn.Conv1D(in_channels=dim, out_channels=out_channels, kernel_size=kernel_size,
                                      stride=stride, padding=padding, data_format='NLC', bias_attr=True),
                         pd.nn.ReLU(),
                         pd.nn.MaxPool1D(kernel_size=2, stride=2),
                         pd.nn.Flatten(),
                         pd.nn.Linear(in_features=np.prod(output_shape), out_features=2, bias_attr=True),
                         pd.nn.Softmax())

pd.summary(sim_model, input_size=(-1, length), dtypes='int64')

---------------------------------------------------------------------------
 Layer (type)       Input Shape          Output Shape         Param #    
  Embedding-1       [[1, 2000]]         [1, 2000, 100]        514,700    
   Conv1D-1       [[1, 2000, 100]]       [1, 998, 10]          5,010     
    ReLU-1         [[1, 998, 10]]        [1, 998, 10]            0       
  MaxPool1D-1      [[1, 998, 10]]        [1, 998, 5]             0       
   Flatten-1       [[1, 998, 5]]          [1, 4990]              0       
   Linear-1         [[1, 4990]]             [1, 2]             9,982     
   Softmax-1          [[1, 2]]              [1, 2]               0       
Total params: 529,692
Trainable params: 529,692
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 1.75
Params size (MB): 2.02
Estimated Total Size (MB): 3.78
---------------------------------------------------------------------

{'total_params': 529692, 'trainable_params': 529692}
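
As a sanity check, the size of the Linear layer's input can be worked out by hand with the same formula. The sketch below repeats the calculation for the parameters used above (length 2000, kernel size 5, stride 2, then a window of size 2 with stride 2, and 10 output channels); `conv_out_len` is a hypothetical helper that mirrors `cal_output_shape` for the length dimension only.

```python
def conv_out_len(in_len, kernel_size, stride, padding=0, dilation=1):
    """Output length of a 1-D convolution or pooling window."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (in_len + 2 * padding - effective_kernel) // stride + 1


conv_len = conv_out_len(2000, kernel_size=5, stride=2)       # after Conv1D
pool_len = conv_out_len(conv_len, kernel_size=2, stride=2)   # after pooling
flat = 10 * pool_len                                         # out_channels * length
print(conv_len, pool_len, flat)
```

The result, 4990, agrees with the input size of the Linear layer reported in the summary.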

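## Reading the data and training
We can use the io.Dataset module of PaddlePaddle 2.0 to build a data reader, which makes it easy to feed the data to training in batches.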

In [None]:
class DataReader(Dataset):
    def __init__(self, input, label, length):
        self.data = list(vectorizer(input, label, length=length))

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)


# Select the training device
device = pd.set_device('gpu')  # or: cpu

# Enable dynamic graph mode
pd.disable_static(device)

# Define the input specs
input_form = pd.static.InputSpec(shape=[None, length], dtype='int64', name='input')
label_form = pd.static.InputSpec(shape=[None, 1], dtype='int64', name='label')

model = pd.Model(sim_model, input_form, label_form)
model.prepare(optimizer=pd.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
              loss=pd.nn.loss.CrossEntropyLoss(),
              metrics=pd.metric.Accuracy())

# Split into training and validation sets
eval_length = int(len(train_x) * 1/4)
model.fit(train_data=DataReader(train_x[:-eval_length], train_y[:-eval_length], length),
          eval_data=DataReader(train_x[-eval_length:], train_y[-eval_length:], length),
          batch_size=32, epochs=10)

Epoch 1/10
step  10/586 - loss: 0.8757 - acc: 0.4813 - 18ms/step
step  20/586 - loss: 0.8331 - acc: 0.4828 - 13ms/step
step  30/586 - loss: 0.6944 - acc: 0.5042 - 11ms/step
step  40/586 - loss: 0.7220 - acc: 0.5070 - 10ms/step
step  50/586 - loss: 0.6808 - acc: 0.4981 - 9ms/step
step  60/586 - loss: 0.7056 - acc: 0.5010 - 9ms/step
step  70/586 - loss: 0.6920 - acc: 0.5004 - 8ms/step
step  80/586 - loss: 0.6837 - acc: 0.5035 - 8ms/step
step  90/586 - loss: 0.6995 - acc: 0.4997 - 8ms/step
step 100/586 - loss: 0.6805 - acc: 0.5056 - 8ms/step
step 110/586 - loss: 0.6981 - acc: 0.5051 - 8ms/step
step 120/586 - loss: 0.7033 - acc: 0.5070 - 8ms/step
step 130/586 - loss: 0.7437 - acc: 0.5108 - 8ms/step
step 140/586 - loss: 0.6721 - acc: 0.5109 - 8ms/step
step 150/586 - loss: 0.6856 - acc: 0.5083 - 7ms/step
step 160/586 - loss: 0.6862 - acc: 0.5119 - 7ms/step
step 170/586 - loss: 0.6881 - acc: 0.5132 - 7ms/step
step 180/586 - loss: 0.6655 - acc: 0.5141 - 7ms/step
step 190/586 - loss: 0.6620 - acc: 0.5155 - 7ms/step
step 200/586 - loss: 0.6299 - acc: 0.5219 - 7ms/step
step 210/586 - loss: 0.7355 - acc: 0.5228 - 7ms/step
step 220/586 - loss: 0.6562 - acc: 0.5267 - 7
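
The step count in the log can be reproduced from the split sizes. The short calculation below assumes the last, partially filled batch is still used, which is consistent with the 586 steps per epoch shown in the log.

```python
import math

n_samples = 25000            # training samples reported earlier
batch_size = 32

eval_length = int(n_samples * 1 / 4)        # held-out validation samples
train_length = n_samples - eval_length      # samples left for training
steps_per_epoch = math.ceil(train_length / batch_size)
print(eval_length, train_length, steps_per_epoch)
```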

## Evaluating the model and making predictions

In [None]:
# Evaluate
model.evaluate(eval_data=DataReader(test_x, test_y, length), batch_size=32)

# Predict
true_y = test_y[100:105] + test_y[-110:-105]
pred_y = model.predict(DataReader(test_x[100:105] + test_x[-110:-105], None, length), batch_size=1)

for index, y in enumerate(pred_y[0]):
    print("Predicted label: %d, true label: %d" % (np.argmax(y), true_y[index]))

Eval begin...
step  10/782 - loss: 0.4515 - acc: 0.8531 - 3ms/step
step  20/782 - loss: 0.5053 - acc: 0.8656 - 3ms/step
step  30/782 - loss: 0.4896 - acc: 0.8406 - 3ms/step
step  40/782 - loss: 0.3849 - acc: 0.8469 - 3ms/step
step  50/782 - loss: 0.5705 - acc: 0.8331 - 3ms/step
step  60/782 - loss: 0.3480 - acc: 0.8370 - 3ms/step
step  70/782 - loss: 0.3403 - acc: 0.8460 - 3ms/step
step  80/782 - loss: 0.3370 - acc: 0.8473 - 3ms/step
step  90/782 - loss: 0.5180 - acc: 0.8462 - 3ms/step
step 100/782 - loss: 0.4266 - acc: 0.8481 - 3ms/step
step 110/782 - loss: 0.4605 - acc: 0.8486 - 3ms/step
step 120/782 - loss: 0.3836 - acc: 0.8477 - 3ms/step
step 130/782 - loss: 0.4657 - acc: 0.8474 - 3ms/step
step 140/782 - loss: 0.4203 - acc: 0.8462 - 3ms/step
step 150/782 - loss: 0.4735 - acc: 0.8408 - 3ms/step
step 160/782 - loss: 0.4959 - acc: 0.8412 - 3ms/step
step 170/782 - loss: 0.3490 - acc: 0.8419 - 3ms/step
step 180/782 - loss: 0.6037 - acc: 0.8415 - 3ms/step
step 190/782 - loss: 0.4110 - ac