<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1-读入数据" data-toc-modified-id="1-读入数据-1">1 Load the data</a></span></li><li><span><a href="#2-计算相似度" data-toc-modified-id="2-计算相似度-2">2 Compute similarity</a></span><ul class="toc-item"><li><span><a href="#2.1-建立词向量模型" data-toc-modified-id="2.1-建立词向量模型-2.1">2.1 Build the word vector model</a></span></li><li><span><a href="#2.2-计算相似度" data-toc-modified-id="2.2-计算相似度-2.2">2.2 Compute similarity</a></span><ul class="toc-item"><li><span><a href="#2.2.1-计算与曹操最相似的前10个词" data-toc-modified-id="2.2.1-计算与曹操最相似的前10个词-2.2.1">2.2.1 Top 10 words most similar to 曹操 (Cao Cao)</a></span></li><li><span><a href="#2.2.2-曹操+刘备-张飞=？" data-toc-modified-id="2.2.2-曹操+刘备-张飞=？-2.2.2">2.2.2 曹操 + 刘备 - 张飞 = ? (Cao Cao + Liu Bei - Zhang Fei)</a></span></li></ul></li></ul></li><li><span><a href="#3-修改参数计算相似度" data-toc-modified-id="3-修改参数计算相似度-3">3 Compute similarity with modified parameters</a></span><ul class="toc-item"><li><span><a href="#3.1-构建词向量模型" data-toc-modified-id="3.1-构建词向量模型-3.1">3.1 Build the word vector model</a></span></li><li><span><a href="#3.2-计算相似度" data-toc-modified-id="3.2-计算相似度-3.2">3.2 Compute similarity</a></span><ul class="toc-item"><li><span><a href="#3.2.1-计算与曹操最相似的前10个词" data-toc-modified-id="3.2.1-计算与曹操最相似的前10个词-3.2.1">3.2.1 Top 10 words most similar to 曹操 (Cao Cao)</a></span></li><li><span><a href="#3.2.2-曹操+刘备-张飞=？" data-toc-modified-id="3.2.2-曹操+刘备-张飞=？-3.2.2">3.2.2 曹操 + 刘备 - 张飞 = ? (Cao Cao + Liu Bei - Zhang Fei)</a></span></li></ul></li></ul></li><li><span><a href="#4-保存模型" data-toc-modified-id="4-保存模型-4">4 Save the model</a></span></li></ul></div>

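Action 3 requirements:

Use Word2Vec from Gensim to train word embeddings on Romance of the Three Kingdoms (三国演义), find the words closest to 曹操 (Cao Cao), and evaluate 曹操 + 刘备 - 张飞 = ? (Cao Cao + Liu Bei - Zhang Fei).

Dataset: three_kingdoms.txt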

In [1]:
# !which python

/usr/local/anaconda3/envs/envpy37/bin/python


In [1]:
import os
import jieba
import pandas as pd
from utils import files_processing # utils is a small collection of Python helper functions and classes

In [9]:
import multiprocessing
from gensim.models import word2vec

# 1 Load the data

In [2]:
# Directories for the source text and for the segmented output
source_folder = './three_kingdoms/source'
segment_folder = './three_kingdoms/segment'

```files_processing.get_files_list(file_dir, postfix='ALL')```
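- Returns the list of all files under file_dir, including subdirectories, whose suffix matches postfix
- Parameters:
    - file_dir: parent directory that contains the files
    - postfix: file suffix to match; if ALL, every file is returned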
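
The `utils` module itself is not shown in this notebook, so the following is only a minimal sketch of how a helper with this behavior could be written. It assumes that `postfix` is treated as a glob pattern (which matches the `'*.txt'` call used in the next cell); the real `files_processing.get_files_list` may differ.

```python
import fnmatch
import os


def get_files_list(file_dir, postfix='ALL'):
    """Hypothetical stand-in for utils.files_processing.get_files_list."""
    matched = []
    # os.walk also descends into subdirectories
    for root, _dirs, files in os.walk(file_dir):
        for name in files:
            # 'ALL' keeps every file; otherwise postfix is assumed to be a glob pattern
            if postfix == 'ALL' or fnmatch.fnmatch(name, postfix):
                matched.append(os.path.join(root, name))
    # Sort so the result does not depend on directory traversal order
    return sorted(matched)
```

Sorting the result keeps `file_list[0]` stable across runs, which matters because later cells index into the list.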

In [3]:
# Get the files under source_folder whose suffix is txt; relative paths are returned
file_list = files_processing.get_files_list(source_folder, postfix='*.txt')

In [4]:
file_list

['./three_kingdoms/source/three_kingdoms.txt']

In [5]:
with open(file_list[0], 'r', encoding='utf-8') as f:  # explicit encoding so the read does not depend on the platform default
    document = f.read()

In [6]:
document[:10]

'三国演义\n作者：罗贯'

In [7]:
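# Word segmentation: tokenize the full content of each file
def segment_lines(file_list, segment_out_dir, stopwords=None):
    """Read each source file, segment it with jieba, and save the result to segment_out_dir."""
    stopwords = set(stopwords or [])  # avoid a mutable default argument
    os.makedirs(segment_out_dir, exist_ok=True)  # make sure the output directory exists
    for i, file in enumerate(file_list):
        # Path where the segmented text is stored
        segment_out_name = os.path.join(segment_out_dir, 'segment_{}.txt'.format(i))
        with open(file, 'r', encoding='utf-8') as f:
            document = f.read()
        # jieba segmentation; drop stopwords and join the tokens with spaces
        sentence_segment = [word for word in jieba.cut(document) if word not in stopwords]
        result = ' '.join(sentence_segment)
        with open(segment_out_name, 'w', encoding='utf-8') as f2:
            f2.write(result)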

In [8]:
segment_lines(file_list, segment_folder)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/dk/4d4_y1kn1js69jdc1n1xyjtr0000gn/T/jieba.cache
Loading model cost 0.710 seconds.
Prefix dict has been built successfully.
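
`segment_lines` accepts a `stopwords` argument, but the call above passes none, so every token (including punctuation and function words) ends up in the segmented file. As a hedged sketch, a stopword list could be loaded from a plain-text file with one word per line. The file path in the comment is hypothetical; no such file is part of this notebook.

```python
def load_stopwords(path, encoding='utf-8'):
    """Read one stopword per line into a set, skipping blank lines."""
    with open(path, 'r', encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}


def drop_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stopwords (the same test segment_lines applies)."""
    return [tok for tok in tokens if tok not in stopwords]


# Possible use with the notebook's function (hypothetical file name, not executed here):
#   segment_lines(file_list, segment_folder, stopwords=load_stopwords('./stopwords.txt'))
```

A set is used because the membership test runs once per token over the whole novel.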


# 2 Compute similarity

In [10]:
sentences = word2vec.PathLineSentences(segment_folder)
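
`PathLineSentences` streams every file in a directory and treats each line as one sentence whose tokens are separated by whitespace, which is why the segmentation step wrote space-joined tokens. The following is a simplified pure-Python sketch of that reading behavior, not gensim's implementation; in particular it omits gensim's splitting of very long lines into chunks.

```python
import os


def iter_line_sentences(folder, encoding='utf-8'):
    """Yield one token list per non-empty line, over every file in folder (sorted by name)."""
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'r', encoding=encoding) as f:
            for line in f:
                tokens = line.split()  # whitespace-separated tokens
                if tokens:
                    yield tokens
```

Because the line breaks of the source text survive segmentation, each line of the novel becomes one training sentence.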

## 2.1 Build the word vector model

In [12]:
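model = word2vec.Word2Vec(sentences, 
                          size=100,  # vector dimensionality, default 100 (this parameter is named vector_size in gensim >= 4.0)
                          window=3,  # maximum distance between the current word and the predicted word within a sentence
                          min_count=1) # ignore words that occur fewer times than this, default 5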
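
To make the effect of `min_count` concrete, here is a small sketch of frequency-based vocabulary pruning. The helper name `build_vocab` and the toy sentences are made up for illustration; gensim performs this filtering internally when it builds its vocabulary.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, freq in counts.items() if freq >= min_count}


# Toy corpus, invented for illustration only
toy_sentences = [['曹操', '大笑'], ['曹操', '引兵'], ['刘备', '大笑']]
print(sorted(build_vocab(toy_sentences, min_count=1)))  # every word is kept
print(sorted(build_vocab(toy_sentences, min_count=2)))  # words seen only once are dropped
```

With `min_count=1`, words that occur a single time also receive a vector, although such vectors are trained on very little context.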

## 2.2 Compute similarity

### 2.2.1 Top 10 words most similar to 曹操 (Cao Cao)

In [16]:
similars10 = model.wv.most_similar(positive=['曹操'], topn=10)

In [17]:
similars10

[('关公', 0.9948543906211853),
 ('孔明亦', 0.9942571520805359),
 ('众官', 0.9936116933822632),
 ('彰', 0.992612361907959),
 ('孔明', 0.9916199445724487),
 ('先主', 0.9904195070266724),
 ('相害', 0.9903540015220642),
 ('请', 0.9899086952209473),
 ('维问', 0.9895586371421814),
 ('哀痛', 0.9894794225692749)]
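
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that ranking in plain Python, using toy 2-dimensional vectors invented for illustration (they are not taken from the trained model):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration only
toy_vectors = {'a': [1.0, 0.0], 'b': [0.9, 0.1], 'c': [0.0, 1.0]}
print(most_similar('a', toy_vectors, topn=2))
```

Cosine similarity depends only on direction, so a list in which every score is close to 0.99 indicates that the vectors all point in nearly the same direction.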

### 2.2.2 曹操 + 刘备 - 张飞 = ? (Cao Cao + Liu Bei - Zhang Fei)

In [13]:
model.wv.similarity('曹操', '刘备')

0.98375

In [19]:
model.wv.similarity('曹操', '张飞')

0.9841915

In [20]:
model.wv.similarity('刘备', '张飞')

0.9587526

In [18]:
model.wv.most_similar(positive=['曹操', '刘备'], negative=['张飞'])

[('丞相', 0.9932718276977539),
 ('主公', 0.9924185872077942),
 ('臣', 0.992241382598877),
 ('今番', 0.991971492767334),
 ('此', 0.9919576644897461),
 ('商议', 0.9906771183013916),
 ('哀告', 0.9903850555419922),
 ('吾', 0.9900258779525757),
 ('今', 0.9899616837501526),
 ('玄德公', 0.9898769855499268)]
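
The `positive`/`negative` query can be read as vector arithmetic: the unit-normalized vectors of the positive words are added, those of the negative words are subtracted, and the remaining vocabulary is ranked by cosine similarity to the result, with the input words excluded. The sketch below follows that description in plain Python; it is a simplified illustration rather than gensim's code, and the toy vectors and word names are invented.

```python
import math


def unit(v):
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def analogy(positive, negative, vectors, topn=10):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)  # the scale of the query does not change the ranking
    exclude = set(positive) | set(negative)
    scores = [(w, sum(q * x for q, x in zip(query, unit(vec))))
              for w, vec in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration only
toy_vectors = {'a': [1.0, 0.0], 'b': [0.0, 1.0], 'c': [1.0, 1.0],
               'd': [1.0, 1.2], 'e': [-1.0, 0.0]}
print(analogy(['a', 'b'], ['c'], toy_vectors, topn=2))
```

Excluding the input words explains why neither 曹操, 刘备, nor 张飞 can appear in the result list above.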

# 3 Compute similarity with modified parameters

## 3.1 Build the word vector model

In [39]:
# Read the segmented local file directly
sentences = word2vec.LineSentence('./three_kingdoms/segment/segment_0.txt')

In [34]:
# Number of CPU cores (4 on this machine)
multiprocessing.cpu_count()

4

In [40]:
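model2 = word2vec.Word2Vec(sentences, 
                           size=128,  # vector dimensionality (this parameter is named vector_size in gensim >= 4.0)
                           window=5, # maximum distance between the current word and the predicted word within a sentence
                           min_count=5, # ignore words that occur fewer times than this, default 5
                           workers=multiprocessing.cpu_count() # number of worker threads used for training (gensim's default is 3)
                          ) 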

## 3.2 Compute similarity

In [41]:
model2.wv.similarity('曹操', '刘备')

0.8655579

In [42]:
model2.wv.similarity('曹操', '张飞')

0.86310124

In [43]:
model2.wv.similarity('刘备', '张飞')

0.5358451

In [49]:
model2.wv.similarity('刘备', '关羽')

0.845196

In [50]:
model2.wv.similarity('曹操', '关羽')

0.77187425

### 3.2.1 Top 10 words most similar to 曹操 (Cao Cao)

In [44]:
model2.wv.most_similar('曹操', topn=10)

[('孙权', 0.988555371761322),
 ('先主', 0.9860959053039551),
 ('众将', 0.9845090508460999),
 ('故人', 0.9833407998085022),
 ('书', 0.9832016229629517),
 ('周瑜', 0.9823917150497437),
 ('鲁肃', 0.9820099472999573),
 ('报', 0.9817290306091309),
 ('刘璋', 0.9816986322402954),
 ('中堂', 0.9806970357894897)]

### 3.2.2 曹操 + 刘备 - 张飞 = ? (Cao Cao + Liu Bei - Zhang Fei)

In [46]:
# Words closest to 刘备 (Liu Bei)
model2.wv.most_similar('刘备', topn=10)

[('愿', 0.9965015649795532),
 ('何', 0.9954298734664917),
 ('此人', 0.9951705932617188),
 ('公', 0.9947369694709778),
 ('奈何', 0.9936156272888184),
 ('某', 0.9933364391326904),
 ('之论', 0.9933246374130249),
 ('陆伯言', 0.9932084679603577),
 ('亦', 0.9930257201194763),
 ('大笑', 0.9929901361465454)]

In [47]:
# Words closest to 张飞 (Zhang Fei)
model2.wv.most_similar('张飞', topn=10)

[('望', 0.9899088144302368),
 ('正', 0.9861788749694824),
 ('赵云', 0.9857125282287598),
 ('吕布', 0.9856086373329163),
 ('孙峻', 0.9852096438407898),
 ('杨锋', 0.9831941723823547),
 ('上马', 0.9831820726394653),
 ('投', 0.9825877547264099),
 ('逢', 0.9824584722518921),
 ('次日', 0.9823741912841797)]

In [48]:
model2.wv.most_similar(positive=['曹操', '刘备'], negative=['张飞'])

[('臣', 0.9882100820541382),
 ('吾', 0.9866390228271484),
 ('丞相', 0.9861828088760376),
 ('非', 0.9858428239822388),
 ('今', 0.9830744862556458),
 ('耳', 0.9824613332748413),
 ('虚名', 0.9822484850883484),
 ('敢', 0.981271505355835),
 ('此', 0.9809775352478027),
 ('不可', 0.9807896614074707)]

# 4 Save the model

In [31]:
model2.save('./models/w2v.model')
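
Saving writes to a path inside `./models`, so the call may raise an error if that directory does not exist yet. A small sketch of a guard follows; `ensure_parent_dir` is a hypothetical helper introduced here, not part of gensim. The commented lines show the intended use and how the model could be reloaded later with `Word2Vec.load`.

```python
import os


def ensure_parent_dir(path):
    """Create the parent directory of path if needed and return path unchanged."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return path


# Intended use with the notebook's model (not executed here):
#   model2.save(ensure_parent_dir('./models/w2v.model'))
#   model2 = word2vec.Word2Vec.load('./models/w2v.model')
```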