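# 1. Data Acquisition

The word vectors are trained on the Chinese Wikipedia corpus.

Corpus download: [Link](https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2) (about 1.16 GB)
It can also be downloaded from my Baidu Pan:
Link: [Pan](https://pan.baidu.com/s/16eS2730jyIZuLvpO0ZLV_w), extraction code: ihu4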
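
If you prefer to fetch the dump from a script, the following is a minimal sketch using only the standard library. The helper name `download` and the 1 MiB chunk size are our own illustrative choices, not part of the original workflow; the URL is the one given above.

```python
import shutil
import urllib.request
from pathlib import Path

DUMP_URL = "https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2"


def download(url, dest, chunk_size=1 << 20):
    """Stream `url` to `dest` in chunks and return the number of bytes written.

    `download` is a hypothetical helper; copying in chunks avoids holding
    the whole (over 1 GB) dump in memory.
    """
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return Path(dest).stat().st_size
```

Calling `download(DUMP_URL, 'zhwiki-latest-pages-articles.xml.bz2')` would save the file under the name the conversion step expects. The "latest" dump changes over time, so the size may differ from the figure quoted above.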

# 2. Data Conversion

The raw data is in XML format, so we need to convert it to plain text.
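
Before converting, it can help to look at the first few lines of the compressed dump to confirm that the download is intact and really is XML. This is a small sketch with the standard `bz2` module; the helper name `peek` is hypothetical.

```python
import bz2
from itertools import islice


def peek(path, n=5):
    """Return the first `n` decoded lines of a bz2-compressed text file.

    The file is decompressed lazily, so only the beginning is read.
    """
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, `peek('zhwiki-latest-pages-articles.xml.bz2')` should return a handful of XML header lines without decompressing the whole archive.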

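Here we use the `WikiCorpus` class that ships with `gensim`. We first load the XML dump into `input_file`; its `get_texts` method then returns an iterator in which each item is one article (a list of tokens), so we can write the articles, one per line, to a new text file.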

In [5]:
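from gensim.corpora import WikiCorpus

input_path = 'zhwiki-latest-pages-articles.xml.bz2'
output_path = 'zhwiki.txt'
print('Chinese Wiki data reading...')
# Written against gensim 3.x; gensim 4.0 removed the `lemmatize` argument, so drop it there.
# Passing dictionary={} skips building a vocabulary, which plain-text extraction does not need.
input_file = WikiCorpus(input_path, lemmatize=False, dictionary={})
print('Chinese Wiki data reading finishes.')
with open(output_path, 'w', encoding='utf-8') as output_file:
    print('Transformation begins...')
    count = 0
    # Each item from get_texts() is one article as a list of tokens.
    for text in input_file.get_texts():
        # Write one article per line, tokens separated by spaces.
        output_file.write(' '.join(text) + '\n')
        count += 1
        if count % 10000 == 0:
            print(f"#{count} of texts have been processed.")
    print('Transformation finished.')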

Chinese Wiki data reading...
Chinese Wiki data reading finishes.
Transformation begins...
#10000 of texts have been processed.
#20000 of texts have been processed.
#30000 of texts have been processed.
#40000 of texts have been processed.
#50000 of texts have been processed.
#60000 of texts have been processed.
#70000 of texts have been processed.
#80000 of texts have been processed.
#90000 of texts have been processed.
#100000 of texts have been processed.
#110000 of texts have been processed.
#120000 of texts have been processed.
#130000 of texts have been processed.
#140000 of texts have been processed.
#150000 of texts have been processed.
#160000 of texts have been processed.
#170000 of texts have been processed.
#180000 of texts have been processed.
#190000 of texts have been processed.
#200000 of texts have been processed.
#210000 of texts have been processed.
#220000 of texts have been processed.
#230000 of texts have been processed.
#240000 of texts have been processed.
#250000

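# 3. Converting Traditional Chinese to Simplified Chinese

The Wiki data contains Traditional Chinese text (Chinese Wikipedia mixes Traditional and Simplified), so we convert all of it to Simplified Chinese.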

In [8]:
import zhconv

print('Traditional Chinese to Simplified Chinese.')
input_path = 'zhwiki.txt'
output_path = 'zhwiki.simplify.txt'
with open(input_path, 'r', encoding='utf-8') as input_file:
    print('Traditional Chinese file reading...')
    # readlines() loads the whole corpus into memory, one list entry per article.
    lines = input_file.readlines()
    print('Traditional Chinese file reading finishes...')
print('Tradition to simplified begins...')
count = 0
with open(output_path, 'w', encoding='utf-8') as output_file:
    for line in lines:
        # 'zh-hans' is the Simplified Chinese target; the trailing newline is preserved.
        output_file.write(zhconv.convert(line, 'zh-hans'))
        count += 1
        if count % 10000 == 0:
            print(f"#{count} of texts have been transformed.")
print('Tradition to simplified finished.')

Traditional Chinese to Simplified Chinese.
Traditional Chinese file reading...
Traditional Chinese file reading finishes...
Tradition to simplified begins...
#10000 of texts have been transformed.
#20000 of texts have been transformed.
#30000 of texts have been transformed.
#40000 of texts have been transformed.
#50000 of texts have been transformed.
#60000 of texts have been transformed.
#70000 of texts have been transformed.
#80000 of texts have been transformed.
#90000 of texts have been transformed.
#100000 of texts have been transformed.
#110000 of texts have been transformed.
#120000 of texts have been transformed.
#130000 of texts have been transformed.
#140000 of texts have been transformed.
#150000 of texts have been transformed.
#160000 of texts have been transformed.
#170000 of texts have been transformed.
#180000 of texts have been transformed.
#190000 of texts have been transformed.
#200000 of texts have been transformed.
#210000 of texts have been transformed.
#220000 of 
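
The cell above reads the whole file with `readlines()` before converting it. If memory is tight, the same step can be written to stream one line at a time. The sketch below is one possible way to do that; the helper `convert_file` and its `convert` and `report_every` parameters are our own names, and the converter is passed in as a plain function so the helper itself does not depend on `zhconv`.

```python
from typing import Callable


def convert_file(src: str, dst: str, convert: Callable[[str], str],
                 report_every: int = 10000) -> int:
    """Apply `convert` to each line of `src` and write the result to `dst`.

    Lines are read lazily, so only one line is held in memory at a time.
    Returns the number of lines processed.
    """
    count = 0
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:  # iterating the file object streams it line by line
            fout.write(convert(line))
            count += 1
            if count % report_every == 0:
                print(f"#{count} of texts have been transformed.")
    return count
```

With `zhconv` installed, the conversion would then be `convert_file('zhwiki.txt', 'zhwiki.simplify.txt', lambda s: zhconv.convert(s, 'zh-hans'))`. The same pattern fits any other step that transforms the corpus line by line.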

# 4. Word Segmentation

We segment the text into words with jieba.

In [9]:
import jieba

input_path = 'zhwiki.simplify.txt'
output_path = 'zhwiki.simplify.tok.txt'
with open(input_path, 'r', encoding='utf-8') as input_file:
    print('Simplified Chinese wiki data reading...')
    lines = input_file.readlines()
    print('Simplified Chinese wiki data reading finishes.')
print('Tokenization begins.')
with open(output_path, 'w', encoding='utf-8') as output_file:
    count = 0
    for line in lines:
        # Drop the trailing newline and the spaces left by the previous steps,
        # then let jieba re-segment the article and join the words with spaces.
        output_file.write(' '.join(jieba.cut(line.split('\n')[0].replace(' ', ''))) + '\n')
        count += 1
        if count % 10000 == 0:
            print(f"#{count} of texts have been tokenized.")
print('Tokenization finished.')

Simplified Chinese wiki data reading...


Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache


Simplified Chinese wiki data reading finishes.
Tokenization begins.


Loading model cost 0.747 seconds.
Prefix dict has been built successfully.


#10000 of texts have been tokenized.
#20000 of texts have been tokenized.
#30000 of texts have been tokenized.
#40000 of texts have been tokenized.
#50000 of texts have been tokenized.
#60000 of texts have been tokenized.
#70000 of texts have been tokenized.
#80000 of texts have been tokenized.
#90000 of texts have been tokenized.
#100000 of texts have been tokenized.
#110000 of texts have been tokenized.
#120000 of texts have been tokenized.
#130000 of texts have been tokenized.
#140000 of texts have been tokenized.
#150000 of texts have been tokenized.
#160000 of texts have been tokenized.
#170000 of texts have been tokenized.
#180000 of texts have been tokenized.
#190000 of texts have been tokenized.
#200000 of texts have been tokenized.
#210000 of texts have been tokenized.
#220000 of texts have been tokenized.
#230000 of texts have been tokenized.
#240000 of texts have been tokenized.
#250000 of texts have been tokenized.
Tokenization finished.


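# 5. Removing Non-Chinese Tokens

Some tokens contain non-Chinese characters. We use a regular expression to drop every token that is not made up entirely of Chinese characters.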
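
To see what this filter keeps, here is a minimal sketch on a made-up token list. The names `ZH_ONLY`, `keep_chinese`, and `sample` are illustrative; the pattern is the same one used in the cell for this step.

```python
import re

# Matches tokens made up entirely of characters in U+4E00..U+9FA5.
ZH_ONLY = re.compile(r"^[\u4e00-\u9fa5]+$")


def keep_chinese(tokens):
    """Return only the tokens that consist solely of Chinese characters."""
    return [t for t in tokens if ZH_ONLY.match(t)]


# Made-up tokens: pure Chinese, Latin, mixed, punctuation, and an empty string.
sample = ["数学", "abc", "2019年", "，", "维基百科", "x轴", ""]
print(keep_chinese(sample))  # ['数学', '维基百科']
```

Note that mixed tokens such as `2019年` are dropped as a whole rather than trimmed, and that the range stops at U+9FA5, so rarer characters outside it are also filtered out.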

In [11]:
import re

input_path = 'zhwiki.simplify.tok.txt'
output_path = 'zhwiki.data.txt'
with open(input_path, 'r', encoding='utf-8') as input_file:
    print('Simplified Chinese wiki data reading...')
    lines = input_file.readlines()
    print('Simplified Chinese wiki data reading finishes.')
print('Remove Non-zh begins...')
with open(output_path, 'w', encoding='utf-8') as output_file:
    count = 0
    # Tokens to keep: those made up entirely of characters in U+4E00..U+9FA5.
    zh_only = re.compile(r'^[\u4e00-\u9fa5]+$')
    for line in lines:
        # Strip the trailing newline and split the article back into tokens.
        line_list = line.split('\n')[0].split(' ')
        new_line = []
        for word in line_list:
            if zh_only.search(word):
                new_line.append(word)
        output_file.write(' '.join(new_line) + '\n')
        count += 1
        if count % 10000 == 0:
            print(f"#{count} of texts have been processed.")
print('Remove Non-zh finishes.')

Simplified Chinese wiki data reading...
Simplified Chinese wiki data reading finishes.
Remove Non-zh begins...
#10000 of texts have been processed.
#20000 of texts have been processed.
#30000 of texts have been processed.
#40000 of texts have been processed.
#50000 of texts have been processed.
#60000 of texts have been processed.
#70000 of texts have been processed.
#80000 of texts have been processed.
#90000 of texts have been processed.
#100000 of texts have been processed.
#110000 of texts have been processed.
#120000 of texts have been processed.
#130000 of texts have been processed.
#140000 of texts have been processed.
#150000 of texts have been processed.
#160000 of texts have been processed.
#170000 of texts have been processed.
#180000 of texts have been processed.
#190000 of texts have been processed.
#200000 of texts have been processed.
#210000 of texts have been processed.
#220000 of texts have been processed.
#230000 of texts have been processed.
#240000 of texts have be

# 6. Training the Word Vectors

In [13]:
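import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

input_path = 'zhwiki.data.txt'
output_path = 'zhwiki.model'
print('Word2Vec Generation begin...')
# LineSentence streams the corpus: one article per line, tokens separated by spaces.
# Written against gensim 3.x; in gensim 4.0+ the `size` argument is named `vector_size`.
model = Word2Vec(LineSentence(input_path),
                 size=300,      # dimensionality of the word vectors
                 window=5,      # context window size
                 min_count=5,   # ignore words that occur fewer than 5 times
                 workers=multiprocessing.cpu_count())
print('Word2Vec Generation finishes.')
print('Model Saving...')
model.save(output_path)
print('Model Saved.')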

Word2Vec Generation begin...
Word2Vec Generation finishes.
Model Saving...
Model Saved.


Save the word vectors:

In [17]:
model.wv.save_word2vec_format('zhwiki.model.vector', binary=False)

Load the word vectors:

In [19]:
import gensim
new_model = gensim.models.KeyedVectors.load_word2vec_format('zhwiki.model.vector', binary=False)

In [20]:
new_model.similar_by_word('汽车', topn=5)

[('轿车', 0.6613024473190308),
 ('卡车', 0.6038083434104919),
 ('商用车', 0.6000862717628479),
 ('摩托车', 0.599767804145813),
 ('雪佛兰', 0.5972775220870972)]
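
The scores returned by `similar_by_word` are cosine similarities between word vectors. As a rough illustration of what that ranking does, here is a small pure-Python sketch. The two-dimensional vectors in `toy` are made up for the example and are not taken from the trained model; `cosine_similarity` and `most_similar` are our own helper names, not gensim functions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=5):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy, made-up vectors: the first two point in nearly the same direction.
toy = {"汽车": [1.0, 0.0], "轿车": [0.9, 0.1], "苹果": [0.0, 1.0]}
print(most_similar("汽车", toy, topn=2))
```

In this toy setup `轿车` ranks first because its vector is almost parallel to that of `汽车`, while `苹果` is orthogonal and scores zero. The real model does the same comparison over 300-dimensional vectors.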