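### Downloading the Chinese Wikipedia Corpus and Training Simple Word Embeddings<br />
Main steps
- Download the Chinese Wikipedia corpus
- Extract the Chinese text
- Convert Traditional Chinese to Simplified Chinese
- Train word embeddings<br />

Required libraries
- jieba for word segmentation
- wikiextractor for extracting the Chinese text
- opencc-python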

In [0]:
# Download the Chinese Wikipedia dump
! wget -c https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

In [0]:
# Install the extraction and conversion tools
! pip install jieba
! git clone https://github.com/attardi/wikiextractor.git
! cd wikiextractor && python setup.py install
# On macOS, OpenCC can be installed directly with Homebrew; for other systems, see the original project: https://github.com/BYVoid/OpenCC
! brew install OpenCC

In [0]:
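# Extract the wiki text from inside the wikiextractor directory; see http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
# Note: the input path below must point to the dump file you actually downloaded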
! cd wikiextractor && python WikiExtractor.py -b 500M -o ../ ../zhwiki-latest-pages-articles1.xml-p1p162886.bz2
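
In its default text output, WikiExtractor wraps every article in a `<doc ...>` line and a closing `</doc>` line. Before segmentation these wrapper lines usually need to be dropped. The following is a minimal sketch of that cleanup, assuming the default (non-JSON) output format; the helper name `strip_doc_tags` and the sample lines (including the article id) are made up for illustration and are not part of the original notebook.

```python
import re

# Assumption: WikiExtractor's default text output, where each article is
# wrapped in a "<doc id=... url=... title=...>" line and a "</doc>" line.
DOC_OPEN = re.compile(r'^<doc\s[^>]*>$')
DOC_CLOSE = '</doc>'


def strip_doc_tags(lines):
    """Yield article text lines, skipping <doc> wrappers and empty lines."""
    for line in lines:
        line = line.strip()
        if not line or line == DOC_CLOSE or DOC_OPEN.match(line):
            continue
        yield line


# Hypothetical sample that mimics the shape of an extracted file.
sample = [
    '<doc id="13" url="https://zh.wikipedia.org/wiki?curid=13" title="数学">\n',
    '数学\n',
    '\n',
    '数学是利用符号语言研究数量、结构、变化以及空间等概念的一门学科。\n',
    '</doc>\n',
]
cleaned = list(strip_doc_tags(sample))
print(cleaned)
```

For a real run, the same generator can be fed an open file handle instead of the in-memory list.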

In [0]:
# Use opencc to convert Traditional Chinese to Simplified Chinese
! opencc -i input_filename -o output_filename -c t2s.json
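
When the extraction step produces several files, it can be convenient to drive the same `opencc` command from Python. The sketch below only wraps the command-line call shown above with `subprocess`; it assumes the `opencc` executable is on the `PATH`, and the function names and file names are hypothetical placeholders.

```python
import subprocess


def build_opencc_command(input_path, output_path, config='t2s.json'):
    """Build the same `opencc` invocation as the shell command above."""
    return ['opencc', '-i', input_path, '-o', output_path, '-c', config]


def convert_file(input_path, output_path):
    """Run OpenCC; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_opencc_command(input_path, output_path), check=True)


# Hypothetical file names; substitute the files WikiExtractor actually produced.
cmd = build_opencc_command('wiki_00', 'wiki_00_simplified.txt')
print(' '.join(cmd))
```

Calling `convert_file` in a loop over the extracted files would then convert each one in turn.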

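## Demo on a Subset of the Text
You can also directly download a subset of the text that has already been converted (180MB/1.6GB): https://drive.google.com/open?id=1ORNDviCeIIiosE_XlEEJfJHp0F01mtsQ

The demo below uses this subset: [Colab link](https://colab.research.google.com/drive/11qGq-rqv-tnvATUmcGhH0rqMSBqCWVxz)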

In [0]:
!pip install -U -q PyDrive 

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# Google Drive file id of the pre-converted text subset
extracted_id = '1ORNDviCeIIiosE_XlEEJfJHp0F01mtsQ'
txt = drive.CreateFile({'id': extracted_id})
txt.GetContentFile('wiki_text.txt')

In [0]:
! pip install jieba

Collecting jieba
  Downloading https://files.pythonhosted.org/packages/71/46/c6f9179f73b818d5827202ad1c4a94e371a29473b7f043b736b4dab6b8cd/jieba-0.39.zip (7.3MB)
    100% |████████████████████████████████| 7.3MB 5.5MB/s
Building wheels for collected packages: jieba
  Running setup.py bdist_wheel for jieba ... done
  Stored in directory: /root/.cache/pip/wheels/c9/c7/63/a9ec0322ccc7c365fd51e475942a82395807186e94f0522243
Successfully built jieba
Installing collected packages: jieba
Successfully installed jieba-0.39


In [0]:
! wget -c http://horatio-jsy.oss-cn-beijing.aliyuncs.com/seg_dict.txt

--2018-12-05 05:07:14--  http://horatio-jsy.oss-cn-beijing.aliyuncs.com/seg_dict.txt
Resolving horatio-jsy.oss-cn-beijing.aliyuncs.com (horatio-jsy.oss-cn-beijing.aliyuncs.com)... 59.110.190.32, 59.110.190.36
Connecting to horatio-jsy.oss-cn-beijing.aliyuncs.com (horatio-jsy.oss-cn-beijing.aliyuncs.com)|59.110.190.32|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11167 (11K) [text/plain]
Saving to: ‘seg_dict.txt’


2018-12-05 05:07:14 (213 MB/s) - ‘seg_dict.txt’ saved [11167/11167]



In [0]:
import collections
import os
import random
import jieba
import time
import re
import numpy as np
import tensorflow as tf

with open('./wiki_text.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
jieba.load_userdict("./seg_dict.txt")

def read_data(txt):
    seg_list = []
    start = time.time()
    cut_list = jieba.lcut(txt)
    print('segment time:', time.time()-start)
    print('total number of words:', len(cut_list))

    # Keep only tokens that contain at least two consecutive Chinese characters
    except_no = re.compile(r'[\u4e00-\u9fa5]{2,}')
    for i in cut_list:
        if except_no.search(i) is not None:
            seg_list.append(i)
    print('total number of words after preprocess:', len(seg_list))
    return seg_list



# words is a list in which every element is one segmented word (a string)
words = read_data(txt)
print('data size', len(words))

# Cache the segmented words so segmentation does not have to be rerun
import pickle
with open('words_data_saving.p', 'wb') as f:
    pickle.dump(words, f)

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.104 seconds.
Prefix dict has been built succesfully.


segment time: 361.9946744441986
total number of words: 37541365
total number of words after preprocess: 17703257
data size 17703257
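
The filter in `read_data` is what reduces 37,541,365 tokens to 17,703,257 (about 47%). Because it uses `search` rather than `fullmatch`, a token is kept as soon as it contains two consecutive Chinese characters anywhere, so mixed tokens survive. A small sketch with hand-written tokens (an assumption; these are not actual jieba output) shows the effect:

```python
import re

# Same pattern as in read_data: at least two consecutive Chinese characters.
chinese_word = re.compile(r'[\u4e00-\u9fa5]{2,}')

# Hand-written tokens for illustration only.
tokens = ['数学', '是', '2018', '，', 'Python', '符号语言', '的', 'A型血']
kept = [t for t in tokens if chinese_word.search(t) is not None]
print(kept)
```

Single-character words, digits, punctuation, and purely Latin tokens are dropped, while `A型血` is kept because `型血` matches. Using `fullmatch` instead would be the stricter choice if only pure Chinese tokens are wanted.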


In [0]:
import pickle
with open('./words_data_saving.p', 'rb') as f:
    words = pickle.load(f)
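
The training cell that follows first maps every word to an integer id (id 0 is reserved for out-of-vocabulary words) and then turns the id sequence into skip-gram (center word, context word) pairs. The toy sketch below illustrates both ideas on a six-word list. It is a simplification, not the notebook's `generate_batch`: it enumerates every context word inside the window instead of randomly sampling `num_skips` of them, and the names `build_vocab`, `skip_gram_pairs`, and the `toy_*` variables are hypothetical.

```python
import collections


def build_vocab(words, vocabulary_size):
    """Map the most frequent words to ids 1..vocabulary_size-1; id 0 is 'UNK'."""
    dictionary = {'UNK': 0}
    for word, _ in collections.Counter(words).most_common(vocabulary_size - 1):
        dictionary[word] = len(dictionary)
    return dictionary


def skip_gram_pairs(data, skip_window):
    """Return every (center, context) id pair within skip_window positions."""
    pairs = []
    for i, center in enumerate(data):
        lo = max(0, i - skip_window)
        hi = min(len(data), i + skip_window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, data[j]))
    return pairs


toy_words = ['数学', '利用', '符号语言', '研究', '数量', '数学']
toy_dictionary = build_vocab(toy_words, vocabulary_size=4)
# Words outside the 3 most frequent fall back to id 0 (UNK).
toy_data = [toy_dictionary.get(w, 0) for w in toy_words]
toy_pairs = skip_gram_pairs(toy_data, skip_window=1)
print(toy_dictionary)
print(toy_data)
print(toy_pairs)
```

With a vocabulary size of 4, only three real words receive their own id, so the remaining words collapse onto `UNK`, which mirrors how rare words are handled in the full dataset.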

In [0]:
import collections
import os
import random
import jieba
import time
import re
import numpy as np
import tensorflow as tf

vocabulary_size = 150000

# with open('./data.txt', 'w+') as f:
#     f.write(str(words[:800]))


def build_dataset(words):
    count = [['UNK', -1]]
    # Counter tallies how often each word occurs (a dict subclass mapping word -> count);
    # most_common returns a list of (word, count) tuples sorted by descending count
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count += 1
        data.append(index)
    # Replace the placeholder -1 with the actual count of out-of-vocabulary words
    count[0][1] = unk_count
    # Invert the mapping: id -> word
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary


data, count, dictionary, reverse_dictionary = build_dataset(words)
del words
print('Most common words(+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
data_index = 0


def generate_batch(batch_size, num_skips, skip_windows):
    """
    num_skips: number of samples generated per word; skip_windows: maximum distance between a word and its context word
    Sample: (word1, word2_nearby_word1)
    batch_size / num_skips = num_words
    """
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_windows
    batch = np.zeros(shape=[batch_size], dtype=np.int32)
    labels = np.zeros([batch_size, 1], np.int32)
    span = 2 * skip_windows + 1
    # deque is a double-ended queue; with maxlen=span, append keeps only the last span items
    buffer = collections.deque(maxlen=span)

    # Read the first span values into the buffer
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    for i in range(batch_size // num_skips):
        # Generate samples for each center word in the batch
        target = skip_windows
        target_to_avoid = [skip_windows]
        # Each center word yields num_skips samples
        for j in range(num_skips):
            # Resample while the position has already been used
            while target in target_to_avoid:
                # randint is inclusive on both ends: 0 <= target <= span - 1
                target = random.randint(0, span - 1)
            target_to_avoid.append(target)
            # Input word; each center word is repeated num_skips times
            batch[i * num_skips + j] = buffer[skip_windows]
            # Label word: a neighbor to the left or right of the input word
            labels[i * num_skips + j, 0] = buffer[target]
        # Read the next word; the deque drops its oldest element
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels


# batch, labels = generate_batch(batch_size=8, num_skips=2, skip_windows=1)
# for i in range(8):
#     print(batch[i], reverse_dictionary[batch[i]], '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

batch_size = 512
embedding_size = 300
skip_window = 8
num_skips = 16

valid_size = 32
valid_window = 10000
# If a is an int, samples are drawn from np.arange(a); if a is an array, from its elements. size values are drawn at once.
valid_examples = np.random.choice(a=valid_window, size=valid_size, replace=False)
num_sampled = 5000

tf.reset_default_graph()
with tf.Graph().as_default() as g:
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_inputs = tf.constant(valid_examples, tf.int32)
    
    with tf.device('/gpu:0'):
        embeddings = tf.get_variable('embeddings', shape=[vocabulary_size, embedding_size],
                                     initializer=tf.random_uniform_initializer(-1.0, 1.0))
        # Look up the embedding vectors for train_inputs
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)
        
        loss_weights = tf.get_variable('loss_weights', [vocabulary_size, embedding_size],
                                      initializer=tf.truncated_normal_initializer(stddev=1.0))
        loss_biases = tf.get_variable('loss_bias', [vocabulary_size], initializer=tf.constant_initializer(0))

    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights=loss_weights,
                                         biases=loss_biases,
                                         labels=train_labels,
                                         inputs=embed,
                                         num_sampled=num_sampled,
                                         num_classes=vocabulary_size))
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
#     optimizer = tf.train.AdagradOptimizer(0.01).minimize(loss)
#     optimizer = tf.train.RMSPropOptimizer(learning_rate=0.01).minimize(loss)
    
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_inputs)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

num_steps = 300001
start = time.time()
with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())
    print('initialized')

    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}
        _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 4000 == 0 and step > 0:
            average_loss /= 4000
            print('Average loss at %d is %f' % (step, average_loss))
            average_loss = 0

        if step % 50000 == 0:
            print('training time: %f' % (time.time()-start))
            start = time.time()
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = '%s %s' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()

Most common words(+UNK) [['UNK', 734256], ('一个', 65764), ('中国', 61702), ('可以', 47089), ('成为', 40047)]
Sample data [426, 426, 426, 242, 145465, 30, 430, 224, 340, 10] ['数学', '数学', '数学', '利用', '符号语言', '研究', '数量', '结构', '变化', '以及']
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
initialized
training time: 2.323299
Nearest to 发射: 欧尼尔 赵人 关中地区 六项 水准 收分 细化 矛尾鱼
Nearest to 汉中: 蒋村 于夫罗 柴咲幸 留长 巨宅 王世杰 由班 洛矶
Nearest to 而来: 各个方面 劫材 要少 巨港 驼峰 菁英 赛金花 独子
Nearest to 飞机: 吴玉章 施南生 最早 僧伽罗语 王莽篡汉 再论 西套 同床异梦
Nearest to 大规模: 云雾茶 六普 王岐山 调和 断袖 脚踏实地 冀州市 寿永
Nearest to 诺贝尔奖: 蒙帕纳斯 大洋河 每过 李总统 词中 帝尧 衔接 刺毛
Nearest to 宁波: 穆德 女护士 中国矿业 正典 过路 左至 帝女花 缺钱
Nearest to 做法: 想像 李等 代领 战死 葵花 议局 剪应力 身高
Nearest to 福建: 愈演愈烈 闭眼 甲子日 乔达诺 西平 帝皇 苏美英 细胞膜
Nearest to 收看: 常见病 横切面 实习生 默写 行装 回龙观 曲尺形 诉讼案
Nearest to 付费: 探病 诸地 哈利波 晋铎 起才 王司马亮 女权主义 拉丁舞
Nearest to 科举: 西北工业大学 荚果 石库门 出于 回放 负离子 检票 责骂
Nearest to 撰写: 秦公 观测员 巴罗 重内 巴伯 加班费 暂存器 王屋山
Nearest to 距离: 三弟 齐氏 两乡 道森 伦敦桥 西斯科 新历 思想进步
Nearest to 正确: 江岸区 忠次 费