# Training Chinese Word Embeddings

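This demo shows how to train a Chinese word-embedding model with Paddle. The training corpus is a preprocessed Chinese Wikipedia dump. All training files are already word-segmented and stored in the `wiki_data/data` directory.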

First, we read all the files, build the vocabulary (each word mapped to its frequency rank), and cache it on local disk.

In [1]:
import cPickle
import os
import collections

try:  # load word dict from disk
    with open("word_dict.pkl", "rb") as f:  # pickle data is binary
        word_dict = cPickle.load(f)
except:  # generate word dict in the first time
    print 'Generating word dictionary in the first time.'
    word_dict = collections.defaultdict(int)
    for dirpath, dirnames, filenames in os.walk("data/"):
        if len(filenames) != 0:
            for fn in filenames:
                with open(os.path.join(dirpath, fn)) as f:
                    for line in f:
                        for w in line.strip().split():
                            word_dict[w] += 1
                            
    items = list(word_dict.items())
    items.sort(key=lambda x: x[1], reverse=True)
    
    word_dict = dict()
    for i in xrange(len(items)):
        word_dict[items[i][0]] = i
    
    print 'Saving to word_dict.pkl'
    with open("word_dict.pkl", "wb") as f:  # protocol -1 is a binary format
        cPickle.dump(word_dict, f, -1)

Generating word dictionary in the first time.
Saving to word_dict.pkl
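
The ranking step can be illustrated on a toy corpus. The block below is only a minimal sketch in plain Python, not the notebook's code: `build_word_dict` and the sample sentences are hypothetical, it uses `collections.Counter` in place of the `defaultdict` loop, and it breaks ties alphabetically, which the original leaves unspecified.

```python
import collections


def build_word_dict(lines):
    """Map each word to its frequency rank (0 = most frequent)."""
    counts = collections.Counter()
    for line in lines:
        counts.update(line.strip().split())
    # Sort by descending count; break ties alphabetically so the result is
    # deterministic (an assumption, the notebook leaves tie order unspecified).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked)}


toy_corpus = ["a b a c", "a b"]  # hypothetical pre-segmented sentences
toy_dict = build_word_dict(toy_corpus)
print(toy_dict)
```

Because the value stored for each word is its rank, a later check such as `rank < WORD_LIMIT` keeps exactly the most frequent words.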


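Next, we read the data. While reading, we convert each word into its word ID. Since the dataset is small, we simply load all of it into memory.

We also discard low-frequency words to speed up training.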

In [2]:
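WORD_LIMIT=2000   # train on only the 2000 most frequent words
WINDOW_SIZE=11    # training window size is 11
EMB_SIZE=32       # word-embedding width
NUM_PASSES = 20   # number of training passes

START_ID = WORD_LIMIT  # sentence-start marker
END_ID = START_ID + 1  # sentence-end marker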

try:
    with open("all_data.pkl", "rb") as f:  # pickle data is binary
        all_data = cPickle.load(f)
except:
    print 'Converting words to word ids in the first time'
    all_data = []

    for dirpath, dirnames, filenames in os.walk("data/"):
        for fn in filenames:
            with open(os.path.join(dirpath, fn)) as f:
                for line in f:
                    line = [word_dict[w] for w in line.strip().split() if word_dict[w] < WORD_LIMIT]
                    line = [START_ID] + line + [END_ID]
                    if len(line) >= WINDOW_SIZE:
                        all_data.append(line)
    
    print 'Saving to all_data.pkl'
    with open("all_data.pkl", 'wb') as f:  # protocol -1 is a binary format
        cPickle.dump(all_data, f, -1)

Converting words to word ids in the first time
Saving to all_data.pkl
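
The conversion of a single line can be sketched on toy data. This is an illustrative example under assumptions, not the notebook's code: `encode_line` and the toy dictionary are hypothetical, and unlike the cell above it tolerates words missing from the dictionary by using `dict.get`.

```python
def encode_line(line, word_dict, word_limit, window_size):
    """Return [START] + ids + [END], or None if the result is too short."""
    start_id, end_id = word_limit, word_limit + 1
    ids = []
    for w in line.strip().split():
        rank = word_dict.get(w)  # None for out-of-vocabulary words
        if rank is not None and rank < word_limit:
            ids.append(rank)  # keep only the word_limit most frequent words
    ids = [start_id] + ids + [end_id]
    # Lines shorter than one training window cannot produce any sample.
    return ids if len(ids) >= window_size else None


toy_dict = {"a": 0, "b": 1, "c": 2, "d": 3}  # hypothetical rank dictionary
encoded = encode_line("a d b c a", toy_dict, word_limit=3, window_size=5)
too_short = encode_line("d d", toy_dict, word_limit=3, window_size=5)
print(encoded, too_short)
```

Here `d` has rank 3, which is not below `word_limit=3`, so it is dropped, and the markers take the IDs 3 and 4, just as `START_ID` and `END_ID` take `WORD_LIMIT` and `WORD_LIMIT + 1` in the notebook.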


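Next we configure the reader_creator. reader_creator is a Paddle concept: users define Paddle's input data by writing a custom reader_creator. A reader_creator is a function that returns a reader function, and a reader function returns an iterable that yields one data sample at a time. A simple example: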

In [3]:
import random
def word_reader_creator():
    def reader():
        global all_data  # access all data below
        random.shuffle(all_data)
        for line in all_data:
            for i in xrange(len(line) - WINDOW_SIZE + 1):
                yield line[i:i+WINDOW_SIZE]  # one window of WINDOW_SIZE consecutive word ids
    
    return reader
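
To see what such a reader yields, here is a small sketch that runs without Paddle. `window_reader_creator` is a hypothetical variant of the function above: it takes the sentences and window size as arguments instead of reading globals, and it skips the shuffle so the output is deterministic.

```python
def window_reader_creator(sentences, window_size):
    """Return a reader; each call to the reader starts a fresh iteration."""
    def reader():
        for ids in sentences:
            # Slide a fixed-size window over the sentence, one step at a time.
            for i in range(len(ids) - window_size + 1):
                yield ids[i:i + window_size]

    return reader


# One toy sentence: 9 and 10 stand in for the start and end markers.
reader = window_reader_creator([[9, 0, 1, 2, 10]], window_size=3)
windows = list(reader())
print(windows)
```

Returning a function rather than a generator lets the trainer call `reader()` again at the start of every pass and get the data from the beginning.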

Next we configure the neural network, here a simple CBOW network. `trainer_count` is the number of physical CPU cores. TODO: install on Ubuntu later and use htop to check resource usage.

In [4]:
import paddle.v2 as paddle
paddle.init(use_gpu=False)
words = [paddle.layer.data(name="word_%d"%i, type=paddle.data_type.integer_value(WORD_LIMIT + 2)) 
         for i in xrange(WINDOW_SIZE)]

embs = []
# context = the words before and after the center word (5 on each side for WINDOW_SIZE=11)
for w in words[:WINDOW_SIZE // 2] + words[-WINDOW_SIZE // 2 + 1:]:
    embs.append(paddle.layer.embedding(input=w, size=EMB_SIZE, param_attr=
                                       paddle.attr.Param(name='emb', sparse_update=True)))

with paddle.layer.mixed(size=EMB_SIZE) as sum_emb:
    for emb in embs:
        sum_emb += paddle.layer.identity_projection(input=emb)

label = words[WINDOW_SIZE // 2]  # the center word is the prediction target

cost = paddle.layer.hsigmoid(input=sum_emb, label=label, num_classes=WORD_LIMIT+2)
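
The CBOW data flow configured above (split a window into context and center word, then sum the context embeddings) can be sketched in plain Python. This is only an illustration under assumptions: `split_window` and the toy embedding table are hypothetical, and the elementwise sum stands in for what the `mixed` layer with identity projections computes.

```python
def split_window(window):
    """Split an odd-length window into (context ids, center id)."""
    half = len(window) // 2
    context = window[:half] + window[half + 1:]
    label = window[half]
    return context, label


# Hypothetical 2-dimensional embedding table for three word ids.
toy_emb = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}

context, label = split_window([0, 2, 1])
# Elementwise sum of the context embeddings: the input to the classifier.
hidden = [sum(column) for column in zip(*(toy_emb[i] for i in context))]
print(context, label, hidden)
```

With a window of 11 the same split gives 10 context words and the word at index 5 as the label, matching the slices used in the cell above.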

Next we create the training parameters, the optimizer, and the trainer.

In [5]:
parameters = paddle.parameters.create(cost)
optimizer = paddle.optimizer.RMSProp(learning_rate=1e-3)
trainer = paddle.trainer.SGD(cost, parameters, optimizer)

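Next we write the event_handler. A Paddle event handler is a callback that responds to training events; here the user can monitor the training cost, save the model, and so on.

Then we start training.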

In [6]:
!mkdir -p output
import sys
import gzip

total_cost = 0.0
counter = 0
prefix="./output"
def event_handler(event):
    global total_cost
    global counter
    if isinstance(event, paddle.event.EndIteration):
        total_cost += event.cost
        counter += 1
        sys.stdout.write('.')
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, AvgCost %f" % (event.pass_id, event.batch_id, total_cost / counter)
        if event.batch_id % 10000 == 0:
            with gzip.open(os.path.join(prefix, "model_%d_%d.tar.gz" % (event.pass_id, event.batch_id)), 'w') as f:
                parameters.to_tar(f)
    if isinstance(event, paddle.event.EndPass):
        print "Pass %d" % event.pass_id
        with gzip.open(os.path.join(prefix, "model_%d.tar.gz" % event.pass_id), 'w') as f:
            parameters.to_tar(f)

trainer.train(paddle.batch(paddle.reader.buffered(word_reader_creator(), 16 * 4000), 3000),
        num_passes=NUM_PASSES,
        event_handler=event_handler,
        feeding=[w.name for w in words])

.Pass 0, Batch 0, AvgCost 7.624405
..............................Pass 0
.Pass 1, Batch 0, AvgCost 7.305836
..............................Pass 1
.Pass 2, Batch 0, AvgCost 7.054216
..............................Pass 2
.Pass 3, Batch 0, AvgCost 6.944893
..............................Pass 3
.Pass 4, Batch 0, AvgCost 6.885181
..............................Pass 4
.Pass 5, Batch 0, AvgCost 6.844780
..............................Pass 5
.Pass 6, Batch 0, AvgCost 6.814500
..............................Pass 6
.Pass 7, Batch 0, AvgCost 6.787237
..............................Pass 7
.Pass 8, Batch 0, AvgCost 6.761542
..............................Pass 8
.Pass 9, Batch 0, AvgCost 6.737207
..............................Pass 9
.Pass 10, Batch 0, AvgCost 6.713035
..............................Pass 10
.Pass 11, Batch 0, AvgCost 6.689688
..............................Pass 11
.Pass 12, Batch 0, AvgCost 6.666462
..............................Pass 12
.Pass 13, Batch 0, AvgCost 6.643713
......................
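
Note that the handler above never resets `total_cost` and `counter`, so the reported AvgCost is averaged over every batch since training began. If a per-pass average is preferred, one option is a small tracker that is reset at the end of each pass. The class below is a hypothetical sketch that does not call Paddle; in the handler, `update` would be called on `EndIteration` and `reset` on `EndPass`.

```python
class CostTracker(object):
    """Running mean of batch costs, resettable between passes."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, cost):
        """Add one batch cost and return the current running mean."""
        self.total += cost
        self.count += 1
        return self.total / self.count

    def reset(self):
        """Forget everything seen so far (call at the end of a pass)."""
        self.total = 0.0
        self.count = 0


tracker = CostTracker()
first = tracker.update(4.0)
second = tracker.update(2.0)
tracker.reset()
after_reset = tracker.update(1.0)
print(first, second, after_reset)
```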

At this point, after 20 passes of training, all models have been saved under the output path for later use.

In [7]:
!ls -l ./output

total 18956
-rw-r--r-- 1 root root 482709 May 14 15:03 model_0.tar.gz
-rw-r--r-- 1 root root 478947 May 14 15:03 model_0_0.tar.gz
-rw-r--r-- 1 root root 481346 May 14 15:03 model_1.tar.gz
-rw-r--r-- 1 root root 483481 May 14 15:04 model_10.tar.gz
-rw-r--r-- 1 root root 483489 May 14 15:04 model_10_0.tar.gz
-rw-r--r-- 1 root root 483718 May 14 15:04 model_11.tar.gz
-rw-r--r-- 1 root root 483617 May 14 15:04 model_11_0.tar.gz
-rw-r--r-- 1 root root 483693 May 14 15:04 model_12.tar.gz
-rw-r--r-- 1 root root 483760 May 14 15:04 model_12_0.tar.gz
-rw-r--r-- 1 root root 483955 May 14 15:05 model_13.tar.gz
-rw-r--r-- 1 root root 483827 May 14 15:05 model_13_0.tar.gz
-rw-r--r-- 1 root root 483917 May 14 15:05 model_14.tar.gz
-rw-r--r-- 1 root root 483837 May 14 15:05 model_14_0.tar.gz
-rw-r--r-- 1 root root 483978 May 14 15:05 model_15.tar.gz
-rw-r--r-- 1 root root 483899 May 14 15:05 model_15_0.tar.gz
-rw-r--r-- 1 root root 484058 May 14 15:05 model_16.tar.gz
-rw-r--r-- 1 roo

Save the dictionary and the embeddings.

In [11]:
embeddings = parameters.get("emb").reshape(WORD_LIMIT + 2, EMB_SIZE)
print("数学的词向量为%s" % str(embeddings[word_dict['数学']]))
import numpy
numpy.save("emb",embeddings)

数学的词向量为[-9.7721918e-03 -2.5700396e-01  6.3699834e-02  2.0338324e-01
 -1.4195548e-01 -1.4174160e-01  7.2824247e-03 -8.0354570e-05
 -2.9269570e-01 -4.3650303e-02  3.1950042e-02 -2.6742721e-01
 -4.4773249e-03  3.1809303e-01 -1.5773760e-01 -3.4221902e-02
 -1.5549351e-01 -1.0023664e-01  8.4953494e-02  4.0698282e-02
 -3.3146757e-01 -1.5535420e-01 -4.2131353e-02 -5.6917582e-02
 -1.3943800e-01  9.1001265e-02  5.5853993e-02 -1.1030980e-02
 -3.3745095e-01  5.5707268e-02 -3.9042065e-01 -2.4750319e-01]
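
A common sanity check for trained embeddings is to look up a word's nearest neighbours by cosine similarity. The block below is a minimal sketch on toy vectors in plain Python; `cosine`, `nearest`, and the toy data are hypothetical. With the real data, the matrix could be loaded with `numpy.load("emb.npy")`, and `limit` would be `WORD_LIMIT` so that discarded words and the two sentence-marker rows are excluded.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, word_dict, embeddings, limit, k=1):
    """Return the k words most similar to `word` among ids below `limit`."""
    target = embeddings[word_dict[word]]
    scored = [(cosine(target, embeddings[i]), w)
              for w, i in word_dict.items()
              if w != word and i < limit]
    scored.sort(reverse=True)  # highest similarity first
    return [w for _, w in scored[:k]]


toy_dict = {"x": 0, "y": 1, "z": 2}             # hypothetical rank dictionary
toy_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # one row per word id
neighbours = nearest("x", toy_dict, toy_emb, limit=3)
print(neighbours)
```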
