# Training Word Vectors with word2vec

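## 1 Corpus Preprocessing

The text is segmented with jieba. (Feeding the UTF-8 file to jieba directly produced garbled output, so the corpus was converted to GBK.)

Run the following command to preview the segmentation: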

``` cmd
python -m jieba -d ' ' corpus10000.gbk > segmented.gbk
```

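- Many Chinese and English punctuation marks, meaningless whitespace, and advertising URLs were also segmented as tokens.

This meaningless text needs to be filtered out.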

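### 1.1 Filtering Chinese and English punctuation

- Create a file `cutword.dat` and store the punctuation marks to filter in it (one per line)
- Read it in with `open` + `readlines`
- Add an `if` test to the list comprehension that checks whether each token is in the filter list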
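
The three steps above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the helper name `filter_tokens` and the inline `raw_lines` are assumptions standing in for the contents of `cutword.dat`, and the tokens are hard-coded so no segmenter is needed.

```python
# Hypothetical stand-in for the lines read from cutword.dat (one entry per line).
raw_lines = ["，\n", "。\n", "!\n", ",\n"]
stopwords = {line.strip("\n") for line in raw_lines}

def filter_tokens(tokens, stopwords):
    """Return the tokens that are not in the filter set, preserving order."""
    return [tok for tok in tokens if tok not in stopwords]

# Hard-coded tokens, as a segmenter might produce them.
tokens = ["今天", "，", "天气", "不错", "。"]
print(filter_tokens(tokens, stopwords))
```

Storing the filter list as a `set` rather than a `list` keeps each membership test cheap, which matters when the check runs once per token.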

### 1.2 Filtering URLs

- Use a regular expression to delete URLs from the text as it is read in, before segmentation

In [1]:
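# Import the required packages
import re
import jieba

# Compiled once at import time. Note that this pattern also matches dotted
# tokens that are not URLs, such as "3.14".
URL_REGEX = re.compile(r"""
    (https?://|https?:\\\\)?   # optional scheme, with slashes or backslashes
    [a-z0-9]+                  # first host label
    (\.[a-z0-9]+)+             # one or more dot-separated labels
    (/[a-z0-9]+)*              # optional path segments
""", re.VERBOSE | re.IGNORECASE)
SPACE_REGEX = re.compile(r"\s+")

def regex_filter(line):
    """Remove URLs and all whitespace from a line."""
    line = URL_REGEX.sub("", line)
    line = SPACE_REGEX.sub("", line)
    return line

# Load the filter list once, as a set for fast membership tests.
# No encoding is given, so the platform default applies, as in the original run.
with open('cutword.dat', 'r') as f:
    stopwords = {line.strip('\n') for line in f}

def segment(sen_raw) -> list:
    """Segment one sentence and return its tokens as a list."""
    try:
        sen = jieba.lcut(sen_raw)
    except Exception:
        return []
    return [i for i in sen if i not in stopwords]  # drop tokens in the filter list

# Read the whole corpus into a list, stripping newlines and filtering each line
with open('corpus10000.gbk', 'r', encoding='gbk') as f:
    data = [regex_filter(line.strip('\n')) for line in f]

sentences = [segment(i) for i in data]
# print(sentences)  # inspect the segmentation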

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\pc\AppData\Local\Temp\jieba.cache
Loading model cost 0.678 seconds.
Prefix dict has been built successfully.


## 2 Training the Word Vectors
- Import gensim
- Train the models
- Save the models

In [2]:
from gensim.models import word2vec
import logging

# Enable logging so that training progress is printed
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Note: `size` and `iter` are the gensim 3.x parameter names; gensim 4 renamed
# them to `vector_size` and `epochs`.
# Baseline: vector size 100, window 5, CBOW (the gensim defaults), min_count 1
model = word2vec.Word2Vec(sentences, min_count=1, iter=25)
# Save the model
model.save("CBOW_w5_s100_m1.model")

# Variants: min_count 2; size 200; window 3; Skip-Gram (sg=1) with window 6
model = word2vec.Word2Vec(sentences, min_count=2, iter=25)
model.save("CBOW_w5_s100_m2.model")
model = word2vec.Word2Vec(sentences, min_count=1, iter=25, size=200)
model.save("CBOW_w5_s200_m1.model")
model = word2vec.Word2Vec(sentences, min_count=1, iter=25, window=3)
model.save("CBOW_w3_s100_m1.model")
model = word2vec.Word2Vec(sentences, min_count=1, iter=25, window=6, sg=1)
model.save("SKIP_GRAM_w6_s100_m1.model")

2020-07-22 07:01:17,510 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-07-22 07:01:17,511 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-07-22 07:01:17,516 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-07-22 07:01:17,517 : INFO : EPOCH - 14 : training on 651856 raw words (565427 effective words) took 0.4s, 1585284 effective words/s
2020-07-22 07:01:17,902 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-07-22 07:01:17,906 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-07-22 07:01:17,908 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-07-22 07:01:17,908 : INFO : EPOCH - 15 : training on 651856 raw words (565292 effective words) took 0.4s, 1456393 effective words/s
2020-07-22 07:01:18,265 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-07-22 07:01:18,267 : INFO : worker thread finished; awaiting finish of 1 more thre

## 3 Using the Trained Word Vector Models
- Compute the similarity between words
- Output word vectors
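
The similarity score gensim reports for two words is the cosine similarity of their vectors. As a minimal sketch of that computation, the snippet below uses made-up three-dimensional vectors (not taken from any trained model); the helper name `cosine_similarity` is chosen here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score 0, so a value near 0 or below (such as the -0.01 noted in the cell that follows) indicates that the model sees no relation between the two words.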

In [3]:
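def load(name):
    return word2vec.Word2Vec.load(name)

def similar(model, word1, word2):
    # model.wv.similarity returns the cosine similarity of the two word vectors
    print('Similarity between "{0}" and "{1}": {2}'.format(word1, word2, model.wv.similarity(word1, word2)))

def test(model, i):
    print('model {0}:'.format(i))
    similar(model, '物理', '化学')  # physics / chemistry
    similar(model, '早餐', '早饭')  # two words for "breakfast"
    # similar(model, '马虎', '粗心')  # both "careless"; similarity -0.01. "马虎" occurs only once in the corpus, so the result is unreliable
    # similar(model, '寒冷', '炎热')  # cold / hot; '炎热' is not in the vocabulary
    similar(model, '光明', '黑暗')  # light / darkness
    similar(model, '聪明', '愚蠢')  # clever / stupid
    # most_similar returns (word, cosine similarity) pairs; '大学' means "university"
    for key in model.wv.most_similar('大学', topn=10):
        print('Top-10 words most similar to "大学":', key[0], 'similarity:', key[1])

# Load the trained models
test_suite = [load("CBOW_w5_s100_m1.model"), load("CBOW_w5_s100_m2.model"), load("CBOW_w3_s100_m1.model"), load("CBOW_w5_s200_m1.model"), load("SKIP_GRAM_w6_s100_m1.model")]
for i, model in enumerate(test_suite):
    test(model, i+1)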


2020-07-22 07:02:53,680 : INFO : loading Word2Vec object from CBOW_w5_s100_m1.model
2020-07-22 07:02:53,933 : INFO : loading wv recursively from CBOW_w5_s100_m1.model.wv.* with mmap=None
2020-07-22 07:02:53,933 : INFO : setting ignored attribute vectors_norm to None
2020-07-22 07:02:53,934 : INFO : loading vocabulary recursively from CBOW_w5_s100_m1.model.vocabulary.* with mmap=None
2020-07-22 07:02:53,935 : INFO : loading trainables recursively from CBOW_w5_s100_m1.model.trainables.* with mmap=None
2020-07-22 07:02:53,935 : INFO : setting ignored attribute cum_table to None
2020-07-22 07:02:53,936 : INFO : loaded CBOW_w5_s100_m1.model
2020-07-22 07:02:53,996 : INFO : loading Word2Vec object from CBOW_w5_s100_m2.model
2020-07-22 07:02:54,117 : INFO : loading wv recursively from CBOW_w5_s100_m2.model.wv.* with mmap=None
2020-07-22 07:02:54,117 : INFO : setting ignored attribute vectors_norm to None
2020-07-22 07:02:54,118 : INFO : loading vocabulary recursively from CBOW_w5_s100_m2.mode

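## 4 Analysis: Choosing Model Parameters for a Small Corpus

On the small corpus, the Skip-Gram model performs poorly, while the CBOW models perform better.

On a small corpus, a smaller `window` improves accuracy, while changing `size` has little effect on the results.

Raising the minimum word frequency (`min_count`) filters out some rare words, but with a corpus this small it also discards words that are not actually rare, further reducing information that was already limited.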
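
To see how quickly the vocabulary shrinks as the minimum frequency rises, one can count word frequencies directly. The sketch below is an illustration on a tiny hand-made corpus, not a measurement on the notebook's data; `vocab_after_min_count` is a hypothetical helper that mimics the documented rule that words occurring fewer than `min_count` times are ignored.

```python
from collections import Counter

def vocab_after_min_count(sentences, min_count):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Tiny hand-made corpus of already segmented sentences, for illustration only.
toy_sentences = [
    ["物理", "化学", "实验"],
    ["物理", "实验", "马虎"],
    ["化学", "实验"],
]

for m in (1, 2, 3):
    kept = vocab_after_min_count(toy_sentences, m)
    print(m, len(kept), sorted(kept))
```

Running the same count over the real `sentences` list would show how many words each `min_count` setting drops before any model is trained.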

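## 5 word2vec on a Larger Corpus

This section uses 500 MB of Chinese Wikipedia data.

### Preprocessing
- Convert all Traditional Chinese characters to Simplified Chinese (opencc)
- Strip tags and blank lines with a regular expression (the Python version had not finished after more than an hour, so it was rewritten in C++)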

--------------------------

C++ source code for the preprocessing step:
```c++
#include <cctype>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

using namespace std;

// Returns true if the line contains only whitespace.
bool is_empty_line(const string &line) {
    for (unsigned char c : line) {  // unsigned char: isspace is undefined for negative values
        if (!isspace(c)) return false;
    }
    return true;
}

int main(void) {
    ifstream fin("zhwiki_500.txt");
    ofstream fout("zhwiki_500_processed.txt");
    const regex tag_regex("<.*>");  // compiled once, outside the loop
    string line;
    while (getline(fin, line)) {    // avoids the eof() loop and the fixed-size buffer
        string str = regex_replace(line, tag_regex, "");  // strip tags
        if (is_empty_line(str)) continue;                 // skip blank lines
        fout << str << '\n';                              // write the stripped line, not the original
    }
    return 0;
}
```
------------------------

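The final corpus contains 1.8 million sentences, about 500 MB. *(At this size, the segmentation stage ran out of memory.)*

*Given memory usage and training time, only the first 600,000 lines were used: an estimated 160 MB, over 80 million characters.*

Total training time: about 90 minutes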
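
Since segmenting the full corpus ran out of memory, one alternative worth considering is to stream sentences from disk instead of holding them all in a list. gensim's `Word2Vec` accepts any restartable iterable of token lists as its `sentences` argument. The sketch below is not what this notebook did: the class name `LineCorpus` and its `limit` parameter are assumptions for illustration, and the demo uses `str.split` on a temporary file where the notebook's setting would pass `jieba.lcut`.

```python
import os
import tempfile
from itertools import islice

class LineCorpus:
    """Restartable iterable that yields one tokenized sentence per line."""

    def __init__(self, path, tokenize, limit=None, encoding="utf-8"):
        self.path = path
        self.tokenize = tokenize  # e.g. jieba.lcut in the notebook's setting
        self.limit = limit        # maximum number of lines to read, or None for all
        self.encoding = encoding

    def __iter__(self):
        # Re-open the file on every pass, so the corpus can be iterated repeatedly.
        with open(self.path, "r", encoding=self.encoding, errors="ignore") as f:
            for line in islice(f, self.limit):
                tokens = self.tokenize(line.rstrip("\n"))
                if tokens:  # skip lines that produce no tokens
                    yield tokens

# Demo on a small temporary file, with whitespace splitting as the tokenizer.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("a b c\n\nd e\nf\n")
    demo_path = tmp.name

corpus = LineCorpus(demo_path, str.split, limit=3)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass works because __iter__ re-opens the file
os.remove(demo_path)
print(first_pass)
```

The trade-off is that every training epoch re-reads and re-segments the file, so memory is saved at the cost of extra time per pass. Segmenting once to a whitespace-delimited file and streaming that file would avoid the repeated segmentation.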

In [None]:
from itertools import islice
from gensim.models import word2vec
import jieba
import logging

def segment(sen_raw) -> list:
    """Segment one sentence and return its tokens as a list."""
    try:
        return jieba.lcut(sen_raw)
    except Exception:
        return []

# The full file does not fit in memory, so read only the first 600,000 lines
with open('zhwiki_500_processed.txt', 'r', errors='ignore') as f:
    sub = [line.strip('\n') for line in islice(f, 600000)]
sens = [segment(i) for i in sub]
# Enable logging so that training progress is printed
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = word2vec.Word2Vec(sens, min_count=2, iter=25, window=5, size=500)
model.save("wiki_CBOW_w5_s500_m2.model")
model = word2vec.Word2Vec(sens, min_count=2, iter=25, window=5, size=500, sg=1)
model.save("wiki_SKIP_GRAM_w5_s500_m2.model")

In [13]:
def test(model, i):
    # Relies on load() and similar() defined in the earlier cell
    print('model {0}:'.format(i))
    similar(model, '物理', '化学')
    similar(model, '早餐', '早饭')
    similar(model, '寒冷', '炎热')
    similar(model, '光明', '黑暗')
    similar(model, '聪明', '愚蠢')
    string = "大学 高中 初中 中学 小学 教室 幼儿园"
    print('In "{}", the odd one out is {}'.format(string, model.wv.doesnt_match(string.split())))
    # most_similar returns (word, cosine similarity) pairs
    for word in ('大学', '中国', '罗马帝国'):
        for key in model.wv.most_similar(word, topn=10):
            print('Top-10 words most similar to "{}":'.format(word), key[0], 'similarity:', key[1])

test_suite = [load("wiki_CBOW_w5_s500_m2.model"), load("wiki_SKIP_GRAM_w5_s500_m2.model")]

for i, model in enumerate(test_suite):
    test(model, i+1)
    

2020-07-22 08:35:29,691 : INFO : loading Word2Vec object from wiki_CBOW_w5_s500_m2.model
2020-07-22 08:35:30,446 : INFO : loading wv recursively from wiki_CBOW_w5_s500_m2.model.wv.* with mmap=None
2020-07-22 08:35:30,447 : INFO : loading vectors from wiki_CBOW_w5_s500_m2.model.wv.vectors.npy with mmap=None
2020-07-22 08:35:31,796 : INFO : setting ignored attribute vectors_norm to None
2020-07-22 08:35:31,797 : INFO : loading vocabulary recursively from wiki_CBOW_w5_s500_m2.model.vocabulary.* with mmap=None
2020-07-22 08:35:31,798 : INFO : loading trainables recursively from wiki_CBOW_w5_s500_m2.model.trainables.* with mmap=None
2020-07-22 08:35:31,798 : INFO : loading syn1neg from wiki_CBOW_w5_s500_m2.model.trainables.syn1neg.npy with mmap=None
2020-07-22 08:35:32,500 : INFO : setting ignored attribute cum_table to None
2020-07-22 08:35:32,501 : INFO : loaded wiki_CBOW_w5_s500_m2.model
2020-07-22 08:35:33,250 : INFO : loading Word2Vec object from wiki_SKIP_GRAM_w5_s500_m2.model
2020-07

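## 6 Analysis of the Results on the Wiki Corpus

The models trained on the wiki corpus are clearly better, and the Skip-Gram model in particular reaches a usable level.

The different characteristics of the two models also show here: CBOW tends to group words with similar part of speech and function, whereas Skip-Gram groups words that are related in meaning.

Looking at the nearest neighbours, CBOW results can usually be substituted directly for the original word, whereas Skip-Gram results are often words associated with the original in some way.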
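
One way to picture the difference between the two models is to look at the training examples each one builds from a sentence. The sketch below is deliberately simplified and is not gensim's implementation: it ignores random window shrinking, subsampling, and negative sampling, and the function names are chosen here for illustration. It only shows that CBOW predicts the center word from its whole context, while Skip-Gram predicts each context word from the center word.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """CBOW: the whole context jointly predicts the center word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window):
    """Skip-Gram: the center word predicts each context word separately."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

# A hand-segmented example sentence.
sentence = ["我", "在", "大学", "学习", "物理"]
print(cbow_examples(sentence, 1))
print(skipgram_examples(sentence, 1))
```

CBOW produces one example per position, with the context words pooled together, while Skip-Gram produces one example per (center, context) pair.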