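## After later learning about ELMo and BERT, which produce dynamic (context-dependent) word vectors and can handle polysemy, it became clear that word vectors deserve a deeper look. This project is a chance to implement them.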

## Corpus preparation

#### Chinese Wikipedia corpus: https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/zhwiki/20191120/zhwiki-20191120-pages-articles-multistream.xml.bz2

Official data processing tool (wikiextractor):
https://github.com/attardi/wikiextractor
Install directly: pip install wikiextractor

In [1]:
# Import the installed package to locate the WikiExtractor directory; its path is needed for the commands below
from wikiextractor import WikiExtractor

In [11]:
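# Run from the directory containing WikiExtractor.py, or copy that directory to a new location and run it there.
# Running it directly in this cell does not work, for reasons unknown: nothing is printed,
# the cell stays in the running state, and the output directory does not grow, so the process appears to hang.
# The commands below are for reference only.
# -c compresses the output files (bzip2); -b 10M caps each output file at 10 MB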
# !python WikiExtractor.py -cb10M -o extracted OriginalData/zhwiki-20191120-pages-articles-multistream.xml.bz2
# !python wikiextractor/WikiExtractor.py -cb10M -o extracted zhwiki-20191120-pages-articles-multistream.xml.bz2
# !python wikiextractor/WikiExtractor.py -o extracted zhwiki-20191120-pages-articles-multistream.xml.bz2 
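
Since running the extractor inside the notebook cell appeared to hang, one possible workaround is to launch it as a separate process. The sketch below only builds the command line. The helper name `build_extract_command` and its defaults are hypothetical, and it assumes the pip-installed package can be started with `python -m wikiextractor.WikiExtractor`. It uses only the `-o`, `-b`, and `-c` options already shown in the commands above.

```python
import subprocess
import sys

def build_extract_command(dump_path, out_dir="extracted", bytes_per_file="10M", compress=True):
    """Build the WikiExtractor command line as an argument list (hypothetical helper)."""
    cmd = [sys.executable, "-m", "wikiextractor.WikiExtractor",
           "-o", out_dir, "-b", bytes_per_file]
    if compress:
        cmd.append("-c")  # compress the output files
    cmd.append(dump_path)
    return cmd

cmd = build_extract_command("zhwiki-20191120-pages-articles-multistream.xml.bz2")
print(" ".join(cmd))

# To actually run the extraction (it may take a long time on the full dump):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` raises an exception if the extractor exits with an error instead of failing silently.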

In [2]:
import re
import os
import thulac

In [4]:
def save(wordList, fileName, endSplit = "\n"):
    with open(fileName, "w", encoding='utf-8') as f:
        for a in wordList:
            f.write(a+endSplit)

In [5]:
line="<doc sfffff>this hdr-biz 123 model server 456</doc><doc sfffff>this hdr-biz 123 model server 456</doc>"
pattern = "<.+?>(.+?)</.+?>"
matchObj = re.findall( pattern, line)
matchObj


['this hdr-biz 123 model server 456', 'this hdr-biz 123 model server 456']
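
The pattern above works on a single line, but `.` does not match newlines by default, and extracted articles usually span several lines inside each `<doc ...>` element. A minimal sketch of a multi-line variant follows. The sample text, the attribute names in it, and the helper name `extract_doc_bodies` are made up for illustration.

```python
import re

# re.DOTALL lets "." match newlines, so a body may span several lines.
DOC_PATTERN = re.compile(r"<doc\b[^>]*>(.*?)</doc>", re.DOTALL)

def extract_doc_bodies(text):
    """Return the stripped text between each <doc ...> and </doc> pair."""
    return [body.strip() for body in DOC_PATTERN.findall(text)]

# Made-up sample in the general shape of extractor output
sample = (
    '<doc id="1" title="A">\nA\n\nfirst line\nsecond line\n</doc>\n'
    '<doc id="2" title="B">\nB\n\nonly line\n</doc>\n'
)
bodies = extract_doc_bodies(sample)
print(bodies)
```

Anchoring the pattern on `doc` rather than on any tag also avoids accidentally matching other markup that may remain inside an article body.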

In [6]:
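# Word segmentation

def cutSentences_thulac(sentences: list):
    # seg_only=True: segment only, without part-of-speech tagging
    thulac_ = thulac.thulac(user_dict=None, model_path=None, T2S=False, seg_only=True)
    for i, sentence in enumerate(sentences):
        sentences[i] = thulac_.cut(sentence)
    return sentences

def fileCutWord_Thulac(filePath, toPath):
    # Build the segmenter locally so the function does not depend on a global thulac_
    thulac_ = thulac.thulac(user_dict=None, model_path=None, T2S=False, seg_only=True)
    thulac_.cut_f(filePath, toPath)
    return thulac_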
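
For word2vec training, each segmented sentence eventually needs to be a single line of space-separated words. A small sketch of that conversion is below, assuming `cut()` returns a list of `[word, tag]` pairs (with empty tags when `seg_only=True`). The helper name `tokens_to_line` is hypothetical and the input is a hand-written stand-in, not real thulac output.

```python
def tokens_to_line(pairs):
    """Join the words of [word, tag] pairs into one space-separated line."""
    # Skip tokens that are only whitespace so the line splits back cleanly
    return " ".join(pair[0] for pair in pairs if pair[0].strip())

# Hand-written stand-in for the result of thulac_.cut(...) with seg_only=True
segmented = [["我", ""], ["爱", ""], ["北京", ""], ["天安门", ""]]
line = tokens_to_line(segmented)
print(line)
```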

In [7]:
wiki_all_path = []
extractedPath = "OriginalData/extracted/"
for folderName in os.listdir(extractedPath):
    first_path = extractedPath + folderName
    for fileName in os.listdir(first_path):
        wiki_all_path.append(first_path + "/" + fileName)

len(wiki_all_path)

1327
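
`os.listdir` does not guarantee any particular order, and the output files further down are numbered by position in this list, so a sorted listing makes the numbering reproducible between runs. A possible variant using `pathlib` is sketched here. The helper name `list_extracted_files` is hypothetical, and the demo runs against a temporary directory rather than the real `OriginalData/extracted/` folder.

```python
import tempfile
from pathlib import Path

def list_extracted_files(root):
    """Return every file exactly one folder below root (e.g. AA/wiki_00), sorted by path."""
    return sorted(str(p) for p in Path(root).glob("*/*") if p.is_file())

# Demo on a throwaway directory that mimics the extractor's folder layout
with tempfile.TemporaryDirectory() as tmp:
    for folder, name in [("AB", "wiki_00"), ("AA", "wiki_01"), ("AA", "wiki_00")]:
        (Path(tmp) / folder).mkdir(exist_ok=True)
        (Path(tmp) / folder / name).write_text("", encoding="utf-8")
    files = list_extracted_files(tmp)
    relative = [Path(f).relative_to(tmp).as_posix() for f in files]

print(relative)
```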

In [9]:
wiki_articles_data_path = 'WikiArticlesData/'
os.makedirs(wiki_articles_data_path, exist_ok=True)  # make sure the output directory exists
# T2S=True converts Traditional Chinese to Simplified; seg_only=True skips part-of-speech tagging
thulac_ = thulac.thulac(user_dict=None, model_path=None, T2S=True, seg_only=True)
for i, filePath in enumerate(wiki_all_path):
    thulac_.cut_f(filePath, wiki_articles_data_path + 'wiki_article_' + str(i) + '.txt',
                  input_file_encoding='utf-8', output_file_encoding='utf-8')
    if i % 10 == 9:
        print('Done:', i + 1, '/', len(wiki_all_path))
print("All done!")

Model loaded succeed
successfully cut file OriginalData/extracted/AA/wiki_00!
successfully cut file OriginalData/extracted/AA/wiki_01!
successfully cut file OriginalData/extracted/AA/wiki_02!
successfully cut file OriginalData/extracted/AA/wiki_03!
successfully cut file OriginalData/extracted/AA/wiki_04!
successfully cut file OriginalData/extracted/AA/wiki_05!
successfully cut file OriginalData/extracted/AA/wiki_06!
successfully cut file OriginalData/extracted/AA/wiki_07!
successfully cut file OriginalData/extracted/AA/wiki_08!
successfully cut file OriginalData/extracted/AA/wiki_09!
Done: 10 / 1327
successfully cut file OriginalData/extracted/AA/wiki_10!
successfully cut file OriginalData/extracted/AA/wiki_11!
successfully cut file OriginalData/extracted/AA/wiki_12!
successfully cut file OriginalData/extracted/AA/wiki_13!
successfully cut file OriginalData/extracted/AA/wiki_14!
successfully cut file OriginalData/extracted/AA/wiki_15!
successfully cut file OriginalData/extracted/AA/wiki

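## Why use gensim?

Gensim makes training Word2vec very convenient. The library is mainly used for topic modeling and document similarity, and it supports a range of algorithms including TF-IDF, LSA, LDA, and word2vec.
Gensim is especially useful for tasks such as obtaining the word vector of a given word.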

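## Generating word vectors with gensim
Training Word2vec with Gensim is straightforward. The steps are:

1) Preprocess the corpus: one document or sentence per line, with each document or sentence segmented into space-separated words. English text needs no segmentation because its words are already separated by spaces; a Chinese corpus must be segmented with a word segmentation tool. Common ones include Stanford NLP, ICTCLAS, Ansj, FudanNLP, HanLP, and jieba.

2) Convert the raw training corpus into an iterator of sentences, where each iteration returns one sentence as a list of words (UTF-8 strings). The LineSentence() class in Gensim's word2vec.py does this.

3) Feed the result into Gensim's built-in Word2Vec class to train the model: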

In [None]:
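# https://blog.csdn.net/qq_27586341/article/details/90025288
from gensim.models import word2vec, Word2Vec

# in_the_name_of_people_segment.txt is the corpus after word segmentation
sentences = word2vec.LineSentence('./in_the_name_of_people_segment.txt')

# Pass the sentence iterator to the model; an empty list would leave nothing to train on.
# Note: in gensim >= 4.0 the `size` parameter is named `vector_size`.
model = Word2Vec(sentences, size=50, window=5, min_count=1, workers=4)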
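
`LineSentence` reads a single file, whereas the segmentation step earlier wrote many `wiki_article_*.txt` files. Step 2 can also be covered by a small restartable iterable over a whole directory, sketched below in plain Python. The class name `SegmentedCorpus` is hypothetical and the demo uses a temporary directory with made-up sentences. Gensim also ships `PathLineSentences` in `gensim.models.word2vec` for a similar purpose.

```python
import os
import tempfile

class SegmentedCorpus:
    """Yield one list of tokens per non-empty line, across all files in a directory."""

    def __init__(self, directory):
        self.directory = directory

    def __iter__(self):
        # A fresh pass starts on every call, so the corpus can be read more than once
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            with open(path, encoding="utf-8") as f:
                for line in f:
                    tokens = line.split()
                    if tokens:  # skip blank lines
                        yield tokens

# Demo with two tiny, made-up segmented files
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "wiki_article_0.txt"), "w", encoding="utf-8") as f:
        f.write("数学 是 科学\n\n")
    with open(os.path.join(tmp, "wiki_article_1.txt"), "w", encoding="utf-8") as f:
        f.write("哲学 是 学科\n")
    corpus = SegmentedCorpus(tmp)
    first_pass = list(corpus)
    second_pass = list(corpus)

print(first_pass)
```

A class with `__iter__` is used instead of a plain generator because training reads the corpus several times (once to build the vocabulary, then once per epoch), and a generator would be exhausted after the first pass.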

 

## Testing
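
A common sanity check for trained word vectors is to look at nearest neighbours by cosine similarity. With a trained gensim model, the usual queries are `model.wv['word']` and `model.wv.most_similar('word')`. The sketch below illustrates the underlying computation in plain Python. The three-dimensional vectors are invented toy values, not output from the model above, and the local `most_similar` function is a hypothetical stand-in rather than gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

# Invented toy vectors, for illustration only
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}
ranking = most_similar("king", toy_vectors)
print(ranking)
```

If related words rank near the top for a handful of hand-picked queries, that is a reasonable first sign that training worked.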