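### Download the Chinese Wikipedia corpus and train simple word embeddings<br />
Main steps
- Download the Chinese Wikipedia corpus
- Extract the Chinese text
- Convert Traditional Chinese to Simplified Chinese
- Train word embeddings<br />

Required libraries
- jieba (word segmentation)
- wikiextractor (extracts the Chinese text)
- opencc-python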

In [0]:
# Download the Wikipedia dump
! wget -c https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

In [0]:
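# Install the extraction and conversion tools
! pip install jieba
! git clone https://github.com/attardi/wikiextractor.git
! cd wikiextractor && python setup.py install
# On macOS, OpenCC can be installed directly with Homebrew; for other systems see the original project: https://github.com/BYVoid/OpenCC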
! brew install OpenCC

In [0]:
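# Extract the wiki text from inside the wikiextractor directory; see http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
# Note: the input below is a partial dump file, not the full dump downloaded above; point it at the file you actually have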
! cd wikiextractor && python WikiExtractor.py -b 500M -o ../ ../zhwiki-latest-pages-articles1.xml-p1p162886.bz2
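
WikiExtractor writes each article wrapped in `<doc id=... url=... title=...>` and `</doc>` lines, so the text usually needs one more pass before conversion and segmentation. The following is a minimal sketch of that pass, not part of the original notebook: `strip_doc_tags` is a hypothetical helper name, and the sample lines are made up to mimic the output format.

```python
import re

# WikiExtractor wraps each article as <doc id=... url=... title=...> ... </doc>.
DOC_TAG = re.compile(r'^</?doc\b[^>]*>\s*$')


def strip_doc_tags(lines):
    """Yield article text lines, dropping <doc ...> / </doc> wrappers and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or DOC_TAG.match(line):
            continue
        yield line


# Made-up sample in the shape of WikiExtractor output (not taken from a real dump).
sample = [
    '<doc id="13" url="https://zh.wikipedia.org/wiki?curid=13" title="數學">',
    '數學',
    '',
    '數學是研究數量、結構以及空間等概念的一門學科。',
    '</doc>',
]
cleaned = list(strip_doc_tags(sample))
print(cleaned)
```

In practice the same generator can be fed an open file handle for each extracted file, writing the surviving lines to a single plain-text file.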

In [0]:
# Convert Traditional Chinese to Simplified Chinese with opencc
! opencc -i input_filename -o output_filename -c t2s.json
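
To make the conversion step concrete, here is a toy sketch of what a Traditional-to-Simplified pass does to the text. It is only an illustration: `TOY_T2S` is a hand-picked table of five character pairs, not the OpenCC dictionaries, and `to_simplified` and `convert_file` are hypothetical helper names. Real conversion should go through OpenCC itself, which also handles phrase-level rules that a per-character table cannot.

```python
# Toy stand-in for OpenCC's t2s conversion: a handful of hand-picked
# character pairs, NOT the real OpenCC dictionaries.
TOY_T2S = str.maketrans({'數': '数', '學': '学', '結': '结', '構': '构', '門': '门'})


def to_simplified(text, table=TOY_T2S):
    """Map each Traditional character in the table to its Simplified form."""
    return text.translate(table)


def convert_file(src_path, dst_path, convert=to_simplified):
    """Convert a file line by line, so a large corpus never sits fully in memory."""
    with open(src_path, encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(convert(line))


print(to_simplified('數學是一門學科'))  # 数学是一门学科
```

If the opencc-python package is installed, its converter's `convert` method could presumably be passed as the `convert` argument in place of the toy function; check the package README for the exact configuration name.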

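## Demo on a subset of the text

An already-converted subset of the text (180MB/1.6GB) can also be downloaded directly: https://drive.google.com/open?id=1ORNDviCeIIiosE_XlEEJfJHp0F01mtsQ

The demo below uses this subset: [Colab notebook](https://colab.research.google.com/drive/1uFnqsHyIn5C84pVkYW_dpoGB9NG7eFme)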

In [0]:
!pip install -U -q PyDrive 

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
extraed = '1ORNDviCeIIiosE_XlEEJfJHp0F01mtsQ'
txt = drive.CreateFile({'id':extraed})
txt.GetContentFile('wiki_text.txt')

In [0]:
! pip install jieba
! wget -c http://horatio-jsy.oss-cn-beijing.aliyuncs.com/seg_dict.txt

--2018-12-15 09:19:30--  http://horatio-jsy.oss-cn-beijing.aliyuncs.com/seg_dict.txt
Resolving horatio-jsy.oss-cn-beijing.aliyuncs.com (horatio-jsy.oss-cn-beijing.aliyuncs.com)... 59.110.185.122
Connecting to horatio-jsy.oss-cn-beijing.aliyuncs.com (horatio-jsy.oss-cn-beijing.aliyuncs.com)|59.110.185.122|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11167 (11K) [text/plain]
Saving to: ‘seg_dict.txt’


2018-12-15 09:19:31 (165 MB/s) - ‘seg_dict.txt’ saved [11167/11167]



In [0]:
import jieba
import time
import re
import pickle

with open('./wiki_text.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
jieba.load_userdict("./seg_dict.txt")

def read_data(txt):
    seg_list = []
    start = time.time()
    cut_list = jieba.lcut(txt)
    print('sgement time:', time.time()-start)
    print('total number of words:', len(cut_list))

    # Keep only tokens that contain at least two consecutive Chinese characters
    except_no = re.compile(r'[\u4e00-\u9fa5]{2,}')
    for i in cut_list:
        if except_no.search(i) is not None:
            seg_list.append(i)
    print('total number of words after preprocess:', len(seg_list))
    return seg_list


# The returned `words` is a list; each element is one segmented word (a string)
words = read_data(txt)
print('data size', len(words))

# Cache the segmented words so the slow segmentation step does not have to be repeated
with open('words_data_saving.p', 'wb') as f:
    pickle.dump(words, f)

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.043 seconds.
Prefix dict has been built succesfully.


sgement time: 344.09616708755493
total number of words: 37541365
total number of words after preprocess: 17703257
data size 17703257
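
The drop from about 37.5 million to 17.7 million tokens comes from the regular-expression filter in `read_data`. The sketch below applies the same pattern to a made-up token list (the tokens are illustrative, not taken from the corpus) to show what survives: single characters, digits, Latin words and punctuation are dropped, while any token containing two or more consecutive Chinese characters is kept.

```python
import re

# Same pattern as in read_data: two or more consecutive CJK characters.
except_no = re.compile(r'[\u4e00-\u9fa5]{2,}')

# Made-up tokens for illustration only.
tokens = ['数学', '是', '2018', '年', 'Python', '，', '词嵌入', 'iPhone手机']
kept = [t for t in tokens if except_no.search(t) is not None]
print(kept)  # ['数学', '词嵌入', 'iPhone手机']
```

Because the code uses `search` rather than `fullmatch`, a mixed token such as `iPhone手机` passes the filter as a whole, Latin letters included.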

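The `fit_corpus` method in the GloVe cell further down weights each co-occurrence by the inverse of the distance between the center word and the context word. As a rough illustration of that counting rule, here is a pure-Python sketch on a four-word toy corpus; `cooccurrence_counts` is a hypothetical standalone function written for this example, not a method of the model.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, left_size=2, right_size=2):
    """Distance-weighted co-occurrence counts: a neighbour at distance d adds 1 / d."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        left = tokens[max(i - left_size, 0):i]
        right = tokens[i + 1:i + 1 + right_size]
        # Walk the left context outwards from the center word.
        for d, ctx in enumerate(reversed(left), start=1):
            counts[(word, ctx)] += 1 / d
        for d, ctx in enumerate(right, start=1):
            counts[(word, ctx)] += 1 / d
    return dict(counts)


toy = ['我', '喜欢', '自然', '语言']
counts = cooccurrence_counts(toy)
print(counts[('我', '喜欢')], counts[('我', '自然')], counts[('语言', '喜欢')])  # 1.0 0.5 0.5
```

With equal left and right window sizes the result is symmetric: every pair `(a, b)` receives the same weight as `(b, a)`.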

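The loss assembled in `build_graph` in the next cell is the weighted least-squares objective from the GloVe paper. For a single (center, context) pair it can be sketched in plain Python as follows. The defaults `cap=100.0` and `alpha=0.75` mirror the `cooccurrence_cap` and `scaling_factor` defaults of the model; the function names and the example vectors are made up for illustration.

```python
import math


def weighting(x, cap=100.0, alpha=0.75):
    """GloVe weighting f(x) = min(1, (x / cap) ** alpha)."""
    return min(1.0, (x / cap) ** alpha)


def pair_loss(center_vec, context_vec, center_bias, context_bias, x,
              cap=100.0, alpha=0.75):
    """f(x) * (w . c + b_w + b_c - log x) ** 2 for one co-occurrence count x."""
    dot = sum(a * b for a, b in zip(center_vec, context_vec))
    residual = dot + center_bias + context_bias - math.log(x)
    return weighting(x, cap, alpha) * residual ** 2


# Rare pairs get a small weight; pairs at or above the cap get weight 1.
print(weighting(1.0), weighting(100.0), weighting(500.0))
# Made-up vectors and biases, only to show the call.
print(pair_loss([1.0, 0.0], [0.5, 2.0], 0.1, -0.1, x=100.0))
```

The model sums this quantity over every pair in a batch, which is what `tf.reduce_sum(single_losses)` computes.
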
In [1]:
from collections import Counter, defaultdict
from random import shuffle
import tensorflow as tf
import numpy as np
import pickle
import time


with open('./words_data_saving.p', 'rb') as f:
    words = pickle.load(f)


class GloVeModel:
    def __init__(self, embedding_size=300, context_size=2, max_vocab_size=100000, min_occurrences=1,
                 scaling_factor=3 / 4, cooccurrence_cap=100, batch_size=512, learning_rate=0.1, valid_size=32):
        self.embedding_size = embedding_size
        if isinstance(context_size, tuple):
            self.left_context, self.right_context = context_size
        elif isinstance(context_size, int):
            self.left_context = self.right_context = context_size
        else:
            raise ValueError("`context_size` should be an int or a tuple of two ints")
        self.max_vocab_size = max_vocab_size
        self.min_occurrences = min_occurrences
        self.scaling_factor = scaling_factor
        self.cooccurrence_cap = cooccurrence_cap
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.valid_size = valid_size
        self.__words = None
        self.__word_to_id = None
        self.__cooccurrence_matrix = None
        self.__embeddings = None

    def corpus_to_graph(self, corpus):
        self.fit_corpus(corpus, self.max_vocab_size, self.min_occurrences,
                        self.left_context, self.right_context)
        self.build_graph()

    def fit_corpus(self, corpus, vocab_size, min_occurrences, left_size, right_size):

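        # defaultdict(float): an existing key returns its value; a missing key is created with the default 0.0
        cooccurrence_counts = defaultdict(float)
        # Structure of corpus: list[word 1, ..., word n]
        word_counts = Counter(corpus)
        # word_counts.update(corpus)
        for l_context, word, r_context in _context_windows(corpus, left_size, right_size):
            # Count co-occurrences within one window, over the left and right contexts
            for i, context_word in enumerate(l_context[::-1]):
                # Immediate neighbours of the center word add 1; words further away add less (1 / distance).
                # A missing (word, context_word) key is created on first use.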
                cooccurrence_counts[(word, context_word)] += 1 / (i + 1)
            for i, context_word in enumerate(r_context):
                cooccurrence_counts[(word, context_word)] += 1 / (i + 1)

        if len(cooccurrence_counts) == 0:
            raise ValueError("No cooccurrences in corpus. Did you try to reuse a generator?")

        # Keep the vocab_size most common words
        self.__words = [word for word, count in word_counts.most_common(vocab_size)
                        if count >= min_occurrences]
        self.__word_to_id = {word: i for i, word in enumerate(self.__words)}

        # Build the co-occurrence matrix: map {(word1, word2): count} to {(ID1, ID2): count}
        self.__cooccurrence_matrix = {
            (self.__word_to_id[words[0]], self.__word_to_id[words[1]]): count
            for words, count in cooccurrence_counts.items()
            if words[0] in self.__word_to_id and words[1] in self.__word_to_id}

    def build_graph(self):
        self.graph = tf.Graph()
        with self.graph.as_default(), self.graph.device(device_for_node):
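            # Cap the co-occurrence count to damp very frequent pairs; a scaling factor between 0 and 1 raises the relative weight of low-frequency pairs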
            count_max = tf.constant([self.cooccurrence_cap], dtype=tf.float32,
                                    name='max_cooccurrence_count')
            scaling_factor = tf.constant([self.scaling_factor], dtype=tf.float32,
                                         name="scaling_factor")

            # Pick the validation samples (word ids drawn from the 1000 most frequent words)
            self.valid_examples = np.random.choice(a=1000, size=self.valid_size, replace=False)

            self.center_input = tf.placeholder(tf.int32, shape=[self.batch_size],
                                               name="center_words")
            self.context_input = tf.placeholder(tf.int32, shape=[self.batch_size],
                                                name="context_words")
            self.cooccurrences = tf.placeholder(tf.float32, shape=[self.batch_size],
                                                name="co-occurrence_count")

            # Initialise uniformly in [-1, 1); tf.random_uniform takes (shape, minval, maxval)
            focal_embeddings = tf.Variable(
                tf.random_uniform([self.vocab_size, self.embedding_size], -1.0, 1.0),
                name="center_embeddings")
            context_embeddings = tf.Variable(
                tf.random_uniform([self.vocab_size, self.embedding_size], -1.0, 1.0),
                name="context_embeddings")

            center_biases = tf.Variable(tf.random_uniform([self.vocab_size], -1.0, 1.0),
                                       name='center_biases')
            context_biases = tf.Variable(tf.random_uniform([self.vocab_size], -1.0, 1.0),
                                         name="context_biases")

            focal_embedding = tf.nn.embedding_lookup([focal_embeddings], self.center_input)
            context_embedding = tf.nn.embedding_lookup([context_embeddings], self.context_input)
            focal_bias = tf.nn.embedding_lookup([center_biases], self.center_input)
            context_bias = tf.nn.embedding_lookup([context_biases], self.context_input)

            # Weighting function f(x) from the original GloVe paper
            weighting_factor = tf.minimum(
                1.0,
                tf.pow(
                    tf.div(self.cooccurrences, count_max),
                    scaling_factor))

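            # Each center vector is multiplied with exactly one context vector, so tf.matmul does not apply; an element-wise product followed by a row-wise sum gives the per-pair dot product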
            embedding_product = tf.reduce_sum(tf.multiply(focal_embedding, context_embedding), 1)
            log_cooccurrences = tf.log(tf.to_float(self.cooccurrences))

            # Element-wise sum of the tensors in the list
            distance_expr = tf.square(tf.add_n([
                embedding_product,
                focal_bias,
                context_bias,
                tf.negative(log_cooccurrences)]))

            single_losses = tf.multiply(weighting_factor, distance_expr)
            self.total_loss = tf.reduce_sum(single_losses)
            self.optimizer = tf.train.AdagradOptimizer(self.learning_rate).minimize(
                self.total_loss)

            self.combined_embeddings = tf.add(focal_embeddings, context_embeddings,
                                              name="combined_embeddings")
            norm = tf.sqrt(tf.reduce_sum(tf.square(self.combined_embeddings), 1, keep_dims=True))
            self.combined_embeddings = self.combined_embeddings/norm
            valid_embeddings = tf.nn.embedding_lookup(self.combined_embeddings, self.valid_examples)
            self.similarity = tf.matmul(valid_embeddings, self.combined_embeddings, transpose_b=True)

    def train(self, num_epochs, summary_interval=5000):

        batches = self.prepare_batches()
        total_steps = 0
        average_loss = 0

        with tf.Session(graph=self.graph) as session:
            tf.global_variables_initializer().run()
            start = time.time()
            for epoch in range(num_epochs):
                shuffle(batches)

                # One pass over the list of batches is one epoch
                for batch_index, batch in enumerate(batches):
                    i_s, j_s, counts = batch
                    # Skip the final batch if it is smaller than batch_size
                    if len(counts) != self.batch_size:
                        continue
                    feed_dict = {
                        self.center_input: i_s,
                        self.context_input: j_s,
                        self.cooccurrences: counts}
                    _, iter_loss = session.run([self.optimizer, self.total_loss], feed_dict=feed_dict)

                    average_loss += iter_loss
                    total_steps += 1
                    if total_steps % summary_interval == 0:
                        average_loss /= summary_interval
                        print('Average loss at %d is %f' % (total_steps, average_loss))
                        average_loss = 0
                print('the training time in one epoch: %f' % (time.time() - start))
                start = time.time()
                
                if epoch % 2 ==0:
                    sim = self.similarity.eval()
                    for i in range(self.valid_size):
                        valid_word = self.__words[self.valid_examples[i]]
                        top_k = 8
                        nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                        log_str = 'Nearest to %s:' % valid_word
                        for k in range(top_k):
                            close_word = self.__words[nearest[k]]
                            log_str = '%s %s' % (log_str, close_word)
                        print(log_str)

            self.__embeddings = self.combined_embeddings.eval()

    def prepare_batches(self):
        if self.__cooccurrence_matrix is None:
            raise NotFitToCorpusError(
                "Need to fit model to corpus before preparing training batches.")
        cooccurrences = [(word_ids[0], word_ids[1], count)
                         for word_ids, count in self.__cooccurrence_matrix.items()]

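        # zip() pairs up elements at the same position; zip(*...) does the reverse ("unzips"),
        # here returning three tuples: row ids, column ids and counts
        # Open question from the original notes: are i_indices, j_indices and the counts symmetric?
        # With equal left and right context sizes, fit_corpus adds the same weight to (a, b) and (b, a)
        i_indices, j_indices, counts = zip(*cooccurrences)
        # Returns a list in which each element is one batch of (i, j, counts)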
        return list(batchify(self.batch_size, i_indices, j_indices, counts))

    @property
    def vocab_size(self):
        return len(self.__words)

    @property
    def words(self):
        if self.__words is None:
            raise NotFitToCorpusError("Need to fit model to corpus before accessing words.")
        return self.__words

    @property
    def embeddings(self):
        if self.__embeddings is None:
            raise NotTrainedError("Need to train model before accessing embeddings")
        return self.__embeddings

    def id_for_word(self, word):
        if self.__word_to_id is None:
            raise NotFitToCorpusError("Need to fit model to corpus before looking up word ids.")
        return self.__word_to_id[word]


def _context_windows(region, left_size, right_size):
    """Build one context window for every word in the corpus."""
    for i, word in enumerate(region):
        start_index = i - left_size
        end_index = i + right_size
        left_context = window(region, start_index, i - 1)
        right_context = window(region, i + 1, end_index)

        # This generator yields one (left_context, word, right_context) tuple per word
        yield (left_context, word, right_context)


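def window(region, start_index, end_index):
    """
    Return the list of tokens from `start_index` to `end_index` (inclusive).
    Indices past either end of `region` are clipped, so windows at the edges
    are shorter; no `NULL_WORD` padding is added.
    """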
    last_index = len(region) + 1
    selected_tokens = region[max(start_index, 0):min(end_index, last_index) + 1]
    return selected_tokens


def device_for_node(n):
    # Place matrix-multiplication ops on the GPU
    if n.type == "MatMul":
        return "/gpu:0"
    else:
        return "/cpu:0"


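# Variadic arguments: *sequences is received as a tuple
def batchify(batch_size, *sequences):
    for i in range(0, len(sequences[0]), batch_size):
        # Each batch packs batch_size items of i, j and counts into a tuple; all batches together make up one epoch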
        yield tuple(sequence[i:i + batch_size] for sequence in sequences)


class NotTrainedError(Exception):
    pass


class NotFitToCorpusError(Exception):
    pass


def main():
    start = time.time()
    model = GloVeModel(batch_size=512)
    print('Modelling time:', time.time()-start)
    start = time.time()
    model.corpus_to_graph(corpus=words)
    print('fit corpus time:', time.time() - start)
    start = time.time()
    model.train(num_epochs=15)
    print('total training time:', time.time() - start)


if __name__ == "__main__":
    main()

Modelling time: 1.1920928955078125e-05
Instructions for updating:
keep_dims is deprecated, use keepdims instead
fit corpus time: 165.28611016273499
Average loss at 5000 is 538.168232
Average loss at 10000 is 393.484393
Average loss at 15000 is 324.043227
Average loss at 20000 is 279.659004
Average loss at 25000 is 251.959477
Average loss at 30000 is 225.221584
Average loss at 35000 is 209.419566
Average loss at 40000 is 192.015937
Average loss at 45000 is 179.629480
Average loss at 50000 is 167.615812
Average loss at 55000 is 159.700317
the training time in one epoch: 206.182901
Nearest to 实际上: 情况 但是 一般 虽然 没有 矛盾 富庶 他们
Nearest to 出任: 横轴 韬光 音讯 法子 画院 通海 加加林 硬朗
Nearest to 网络: 香港 萨伏伊 玛格丽特 每隔 密奏 安插 富康 中国气象局
Nearest to 正在: 解决方案 齐心 第九 飞机 爆炸性 主角奖 开始 其他
Nearest to 能量: 松烟 本条 日生 转换 林旭 平衡态 光滑 廿六
Nearest to 关系: 以及 一个 因此 然而 之后 他们 其他 可以
Nearest to 知道: 因为 我们 当时 已经 他们 李光前 但是 魏博
Nearest to 组成: 以及 包括 因此 一个 称为 部分 作为 由于
Nearest to 电视台: 持家 康隆 上溯 自如 赈灾 主题 包公 庐江县
Nearest to 四个: 其中 联大 一个 博德 因为 故须 三个 分别
Nearest 