[gensim doc](https://radimrehurek.com/gensim/models/keyedvectors.html)  

References (optional):

+ The original word2vec paper (Mikolov et al., 2013): https://arxiv.org/abs/1301.3781
+ Skip-gram word2vec tutorial: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
+ Improvements on skip-gram (hierarchical softmax and negative sampling): https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Conclusion from the paper: according to the authors, hierarchical softmax works better for infrequent words, while negative sampling works better for frequent words and with low-dimensional vectors.
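
For intuition about the negative-sampling side of this trade-off, here is a minimal sketch of the noise distribution described in the skip-gram improvements paper listed above, where negative examples are drawn from the unigram distribution raised to the 3/4 power. The toy corpus and the function name `noise_distribution` are illustrative assumptions and not part of gensim. In gensim's `Word2Vec`, the choice between the two training schemes is made with the `hs` and `negative` parameters.

```python
from collections import Counter

def noise_distribution(tokens, power=0.75):
    """Return {word: probability} proportional to count(word) ** power."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}

# Toy corpus (made up for illustration): one frequent word, two rare ones.
toy_tokens = ["的"] * 8 + ["数学"] + ["逻辑"]
probs = noise_distribution(toy_tokens)

raw_share = 8 / len(toy_tokens)  # share of "的" in the raw counts
print(round(raw_share, 3), round(probs["的"], 3))
```

Raising the counts to a power below 1 flattens the distribution, so rare words are sampled as negatives somewhat more often than their raw frequency would suggest.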

### Procedure
In this chapter you will use Gensim and Wikipedia to train your first word vectors and get a feel for the basic workflow.

- Step-01: Download the Chinese Wikipedia corpus: https://dumps.wikimedia.org/zhwiki/20190720/

- Step-02: Use https://github.com/attardi/wikiextractor (a Python Wikipedia extractor) to extract the article text from the dump.

- Step-03: Use gensim to train word vectors, following the Gensim documentation and the Kaggle notebook:  
  https://radimrehurek.com/gensim/models/word2vec.html  
  https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne  
  Note that you need to segment the Wikipedia text into individual words with Jieba and save the result to a new file. Then read that file with Gensim's `LineSentence` class and train the word-vector model on it.

- Step-04: Test your model's performance on a few words, for example by checking whether their nearest neighbours are synonyms.

- Step-05: Visualize the word vectors with t-SNE, as in the Kaggle notebook: https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne

### Wikipedia Chinese Corpus

In [1]:
! wget  https://dumps.wikimedia.org/zhwiki/20190720/zhwiki-20190720-pages-articles-multistream.xml.bz2
! wget  https://dumps.wikimedia.org/zhwiki/20190720/zhwiki-20190720-pages-articles-multistream-index.txt.bz2

--2019-08-10 05:59:06--  https://dumps.wikimedia.org/zhwiki/20190720/zhwiki-20190720-pages-articles-multistream.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.155.106, 2620:0:861:4:208:80:155:106
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.155.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1901117847 (1.8G) [application/octet-stream]
Saving to: ‘zhwiki-20190720-pages-articles-multistream.xml.bz2’


2019-08-10 06:14:20 (1.99 MB/s) - ‘zhwiki-20190720-pages-articles-multistream.xml.bz2’ saved [1901117847/1901117847]

--2019-08-10 06:14:20--  https://dumps.wikimedia.org/zhwiki/20190720/zhwiki-20190720-pages-articles-multistream-index.txt.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.155.106, 2620:0:861:4:208:80:155:106
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.155.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27138296 (26M) [application/octet-str

In [2]:
# ! mv zhwiki-20190720-pages-articles-multistream.xml.bz2 ./word2vec/zhwiki-20190720-pages-articles-multistream.xml.bz2

# ! mv zhwiki-20190720-pages-articles-multistream-index.txt.bz2 ./word2vec/zhwiki-20190720-pages-articles-multistream-index.txt.bz2

### wikiextractor

In [3]:
# Clone the entire repo.
!git clone -l -s https://github.com/attardi/wikiextractor.git wikiextractor
%cd wikiextractor
!ls

Cloning into 'wikiextractor'...
remote: Enumerating objects: 607, done.
remote: Total 607 (delta 0), reused 0 (delta 0), pack-reused 607
Receiving objects: 100% (607/607), 1.23 MiB | 0 bytes/s, done.
Resolving deltas: 100% (348/348), done.
Checking connectivity... done.
/home/racleme3/wikiextractor
categories.filter  cirrus-extract.py  extract.sh  README.md  WikiExtractor.py


In [4]:
!python WikiExtractor.py  -q --processes 20 -o .. -b 2G ../zhwiki-20190720-pages-articles-multistream.xml.bz2





In [7]:
%cd ..
! ls -l .

/home/racleme3
total 1883136
drwxrwxr-x  2 racleme3 racleme3       4096 Aug 10 06:21 AA
drwxrwxr-x 26 racleme3 racleme3       4096 Aug 10 05:53 anaconda3
drwxrwxr-x  2 racleme3 racleme3       4096 Mar 27 13:22 bin
-rw-rw-r--  1 racleme3 racleme3       1167 Aug 10 04:38 test.ipynb
drwxrwxr-x  3 racleme3 racleme3       4096 Aug 10 06:19 wikiextractor
-rw-rw-r--  1 racleme3 racleme3      44847 Aug 10 06:51 word2vec.ipynb
-rw-rw-r--  1 racleme3 racleme3   27138296 Jul 26 07:56 zhwiki-20190720-pages-articles-multistream-index.txt.bz2
-rw-rw-r--  1 racleme3 racleme3 1901117847 Jul 26 07:53 zhwiki-20190720-pages-articles-multistream.xml.bz2


### Traditional-to-Simplified Chinese conversion

In [8]:
! pip install hanziconv==0.3

Collecting hanziconv==0.3
  Downloading https://files.pythonhosted.org/packages/aa/78/aa953b61c3b4a311728f22ae94ddbcb611b14a174603efd33511927a2ba7/hanziconv-0.3.tar.gz (273kB)
     |████████████████████████████████| 276kB 5.0MB/s eta 0:00:01
Building wheels for collected packages: hanziconv
  Building wheel for hanziconv (setup.py) ... done
  Stored in directory: /home/racleme3/.cache/pip/wheels/0c/7c/34/2281a105db2041dd9b128f86ffd34a7c250d8a0f661c9bf255
Successfully built hanziconv
Installing collected packages: hanziconv
Successfully installed hanziconv-0.3


In [9]:
from hanziconv import HanziConv

In [10]:
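# Read the extracted text in chunks of 1024 * 30000 characters and convert
# each chunk from Traditional to Simplified Chinese.
Simplified_txt = ""
with open('./AA/wiki_00', 'r', encoding='utf-8') as f:
    while True:
        Simplified = HanziConv.toSimplified(f.read(1024 * 30000))
        print(len(Simplified))  # progress: number of characters in this chunk
        if len(Simplified) == 0:  # an empty read means end of file
            break
        Simplified_txt += Simplified
    del Simplified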

30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
30720000
9289150
0
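
The loop above reads the file in fixed-size chunks so that the converter never receives the whole text at once. The same idea can be sketched as a generator whose converted chunks are joined once at the end, which avoids repeatedly copying a growing string. The `convert` argument is a stand-in for `HanziConv.toSimplified`; the helper name, the three-character mapping table and the tiny in-memory file are assumptions made for illustration.

```python
import io

def read_converted(fileobj, convert, chunk_chars=1024 * 30000):
    """Yield converted chunks of at most chunk_chars characters until EOF."""
    while True:
        chunk = fileobj.read(chunk_chars)
        if not chunk:  # read() returns '' at end of file
            break
        yield convert(chunk)

# Stand-in converter: maps three traditional characters to simplified ones.
toy_table = str.maketrans({"數": "数", "學": "学", "門": "门"})

def toy_convert(text):
    return text.translate(toy_table)

fake_file = io.StringIO("數學是一門學科")
simplified_txt = "".join(read_converted(fake_file, toy_convert, chunk_chars=3))
print(simplified_txt)
```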


In [11]:
len(Simplified_txt)

562249150

In [12]:
Simplified_txt[:200]

'<doc id="13" url="https://zh.wikipedia.org/wiki?curid=13" title="数学">\n数学\n\n数学是利用符号语言研究数量、结构、变化以及空间等概念的一门学科，从某种角度看属于形式科学的一种。数学透过抽象化和逻辑推理的使用，由计数、计算、量度和对物体形状及运动的观察而产生。数学家们拓展这些概念，为了公式化新的猜想以及从选定的公理及定义中建立起严谨'
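
Each article in the extracted file starts with a `<doc ...>` header like the one shown above. A small sketch of pulling `id` and `title` out of such headers with a regular expression follows. The pattern is inferred from this single sample line, and the closing `</doc>` tag in the sample string is an assumption about the rest of the file.

```python
import re

# Pattern inferred from the sample header above.
DOC_HEADER = re.compile(
    r'<doc id="(?P<id>\d+)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">'
)

sample = (
    '<doc id="13" url="https://zh.wikipedia.org/wiki?curid=13" title="数学">\n'
    '数学\n\n数学是利用符号语言研究数量、结构、变化以及空间等概念的一门学科。\n'
    '</doc>\n'
)

# Collect (id, title) for every article header in the text.
articles = [(m.group("id"), m.group("title")) for m in DOC_HEADER.finditer(sample)]
# Drop the <doc ...> and </doc> tags, keeping only the article text.
body = re.sub(r'</?doc[^>]*>', '', sample).strip()
print(articles)
print(body[:10])
```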

### gensim

In [13]:
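import re
from multiprocessing import Pool, Manager

def preprocessing(txt, pattern, queue):
    """Find every run of characters matching `pattern` and put the list on the queue."""
    re_tokens = re.findall(pattern, txt)
    queue.put(re_tokens)

tokens = []
def multipro(txt, n_chunks=10):
    """Split `txt` into `n_chunks` slices and extract Chinese-character runs in parallel.

    Results are appended to the global `tokens` list in completion order, so
    fragments from different slices may not follow the original text order.
    """
    pool = Pool(n_chunks)
    queue = Manager().Queue()

    step = len(txt) // n_chunks
    for i in range(n_chunks):
        # The last slice runs to the end of the text so the remainder is not dropped.
        end = (i + 1) * step if i < n_chunks - 1 else len(txt)
        pool.apply_async(preprocessing, args=(txt[i * step : end], r'[\u4e00-\u9fa5]+', queue))

    i = 0
    while i < n_chunks:
        token = queue.get()
        # print(token[:10])
        tokens.extend(token)
        i += 1
        print('\r{}%'.format(i / n_chunks * 100), end='')

    pool.close()
    pool.join()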

In [14]:
multipro(Simplified_txt)

100.0%

In [16]:
# del Simplified_txt
tokens[:10]

['数学',
 '数学',
 '数学是利用符号语言研究数量',
 '结构',
 '变化以及空间等概念的一门学科',
 '从某种角度看属于形式科学的一种',
 '数学透过抽象化和逻辑推理的使用',
 '由计数',
 '计算',
 '量度和对物体形状及运动的观察而产生']
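
The pattern `[\u4e00-\u9fa5]+` keeps only runs of Chinese characters, so punctuation, digits, Latin letters and the `<doc>` markup all act as separators. A minimal check on a shortened version of the sample text (the header is abbreviated here for readability) shows where the fragments above come from:

```python
import re

CJK_RUN = re.compile(r'[\u4e00-\u9fa5]+')

sample = (
    '<doc id="13" title="数学">\n数学\n\n'
    '数学是利用符号语言研究数量、结构、变化以及空间等概念的一门学科，'
)
# Every maximal run of Chinese characters becomes one fragment.
fragments = CJK_RUN.findall(sample)
print(fragments)
```

Because the text is cut into slices by character offset before matching, a run that straddles a slice boundary ends up as two separate fragments.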

In [17]:
# Save the intermediate result to a temporary file
import pickle

with open('tmp.pickle', 'wb') as f:
    pickle.dump(tokens, f)

In [1]:
# Load the intermediate result from the temporary file
import pickle

with open('tmp.pickle', 'rb') as f:
    tokens = pickle.load(f, encoding='utf-8')

In [2]:
len(tokens)

47994811

In [3]:
tokens[:10]

['数学',
 '数学',
 '数学是利用符号语言研究数量',
 '结构',
 '变化以及空间等概念的一门学科',
 '从某种角度看属于形式科学的一种',
 '数学透过抽象化和逻辑推理的使用',
 '由计数',
 '计算',
 '量度和对物体形状及运动的观察而产生']

In [8]:
import jieba
import tqdm

# jieba.enable_parallel(2)
chunk_size = 100  # number of text fragments segmented and written as one line
sentence_list = []
# Mode 'w' (instead of 'a') so that re-running the cell does not append duplicate lines.
with open('sentences.txt', 'w', encoding='utf-8') as f:
    for start in tqdm.trange(0, len(tokens), chunk_size):
        # Segment the chunk with jieba and join the resulting words with spaces;
        # the final, shorter chunk is handled by the same slice.
        tokens_gen = ' '.join(jieba.cut(' '.join(tokens[start : start + chunk_size])))
        f.write(tokens_gen + '\n')
        sentence_list.append(tokens_gen)

In [11]:
print(sentence_list[-5:])
print(len(sentence_list))

['实为 自愿 入 黑暗 位面   从 黑暗 圣女 确认 为 新任 黑暗 王   伊之纱   前任 神女   叶心 夏 父亲 文泰 的 妹妹   现任 帕特农 神女 候选人   有 极大 的 野心   试图 通过 复活 神术 来 达到 操纵 所有 人类 强大 的 势力 的 目的   十四年 前 由于 帕特农 神魂 没有 降临到 自己 身上 而是 降临到 了 文泰 身上 而 对 文泰 产生 了 极大 的 妒忌   设计 使用 黑暗 圣裁 杀死 文泰 夺取 帕特农 神魂   但']
1


#### Training with gensim

In [6]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

In [12]:
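# A bare `%time` line times an empty statement (hence the microseconds below);
# put `%%time` on the first line of the cell to time the whole cell instead.
%time
# LineSentence reads the file lazily: one sentence per line, words separated by whitespace.
sentences = LineSentence('sentences.txt')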

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs
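
`LineSentence` expects one sentence per line with the words already separated by whitespace, which is the layout the jieba step wrote to `sentences.txt`. The sketch below imitates that reading behaviour in plain Python to make the file format concrete. `iter_line_sentences` is a hypothetical helper, not the gensim implementation, and it ignores gensim's handling of very long lines.

```python
import os
import tempfile

def iter_line_sentences(path):
    """Yield each non-empty line of a UTF-8 file as a list of whitespace-separated tokens."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            words = line.split()  # split() also collapses runs of spaces
            if words:
                yield words

# Write a tiny file in the same layout as sentences.txt (contents made up).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('数学 是 一门   学科\n逻辑 推理\n')
    tmp_path = tmp.name

toy_sentences = list(iter_line_sentences(tmp_path))
os.remove(tmp_path)
print(toy_sentences)
```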


In [14]:
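# min_count: ignore words whose total frequency is below this value
# size: dimensionality of the word vectors (renamed to vector_size in gensim 4.x)
model = Word2Vec(sentences, min_count=5, size=200, workers=6)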
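
To make `min_count` concrete: words that occur fewer than `min_count` times in the whole corpus are left out of the vocabulary. A rough pure-Python sketch of that filtering step, on a made-up corpus and with a hypothetical helper name, might look like this:

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Return the set of words whose corpus frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

# Toy corpus (made up): '数学' occurs 3 times, '是' twice, the rest once.
toy_corpus = [['数学', '是', '学科'], ['数学', '是', '科学'], ['数学', '家']]
vocab = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab))
```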

In [21]:
# Use an ASCII comma here: the full-width comma (，) originally typed in this cell
# raised "SyntaxError: invalid character in identifier".
model.wv.most_similar('你好', topn=20)

In [26]:
# Query through model.wv; calling most_similar on the model itself is deprecated in gensim 3.x.
model.wv.most_similar('帅')

[('虎贲', 0.7116403579711914),
 ('中军', 0.6938284635543823),
 ('马军', 0.6936951279640198),
 ('下军', 0.6582335233688354),
 ('六军', 0.6538777351379395),
 ('五威', 0.6505539417266846),
 ('统军', 0.6487195491790771),
 ('军马', 0.6487146615982056),
 ('左军', 0.6485307216644287),
 ('诸军', 0.6386147141456604)]
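
`most_similar` ranks vocabulary words by the cosine similarity between their vectors and the query vector, and the scores in the output above are those cosines. A minimal sketch with made-up 3-dimensional vectors follows; the numbers are illustrative only, and `most_similar_toy` is a hypothetical helper rather than gensim code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_toy(vectors, query, topn=2):
    """Rank all words except the query by cosine similarity to it."""
    scores = [(w, cosine(vectors[query], vec)) for w, vec in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented vectors: the first two point in nearly the same direction.
toy_vectors = {
    '中军': [1.0, 0.0, 0.0],
    '左军': [0.9, 0.1, 0.0],
    '数学': [0.0, 0.0, 1.0],
}
ranking = most_similar_toy(toy_vectors, '中军')
print(ranking)
```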

In [17]:
word_vectors = model.wv

In [22]:
word_vectors.get_vector('库里')

array([ 0.31431022, -0.49659508, -0.41180485, -0.9388593 ,  1.1792672 ,
       -0.68977827,  1.6992861 , -0.23453262, -0.67727226,  0.976191  ,
        0.16994295, -0.27437338,  0.56341237,  0.44257885,  0.24988951,
        0.9972586 ,  0.77312094, -0.62149525, -0.28615594, -0.03804397,
        0.47982404,  0.41198653,  0.73709136,  0.01349468, -0.30985498,
        2.0627558 ,  0.01506568, -0.7384937 , -0.85345733,  1.9329855 ,
        0.04097642,  0.36355823,  0.46070406, -0.48479882,  0.0665291 ,
        1.4704207 ,  0.656128  , -0.14783195, -0.39312753, -0.33782053,
        0.9794409 , -0.1612481 , -0.4890576 , -1.0449501 ,  1.0866667 ,
       -0.31846702,  1.0449609 , -0.08068615,  0.88902146, -0.2146927 ,
       -0.31309587,  0.0656997 , -0.32980585, -0.03476756, -0.8022183 ,
        1.0568886 ,  0.4411424 , -0.89434737,  1.1862843 ,  0.55774057,
        1.8298851 , -1.0133578 , -1.6699753 ,  1.0041021 , -0.5227925 ,
        0.6669368 ,  0.8731735 ,  1.0764211 ,  1.1917291 , -1.07

In [27]:
# Option 1: save the full model
model.save('word2vec_model_200v.w2v')

# from gensim.test.utils import datapath

# news_model_1 = Word2Vec.load('')

In [28]:
# Option 2: save only the word vectors (KeyedVectors)
word_vectors.save('word2vec_keyvector_200v.w2v')

# from gensim.models import KeyedVectors

# word_vectors_1 = KeyedVectors.load('')

# or export in the original word2vec format (binary must match between save and load):
# word_vectors.save_word2vec_format("word2vec_model_5.txt", binary=True)
# word2vec = KeyedVectors.load_word2vec_format(path, binary=True)
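
For reference, the text flavour of the word2vec format (what `save_word2vec_format` writes with `binary=False`) is simple enough to read by hand: a header line with the vocabulary size and vector dimension, then one line per word followed by its components. The sketch below writes and parses that layout in plain Python with made-up vectors. The helper names are hypothetical, and the binary variant used in the commented-out call above is not handled.

```python
import io

def write_w2v_text(vectors, fileobj):
    """Write {word: [floats]} in word2vec text format."""
    dim = len(next(iter(vectors.values())))
    fileobj.write('{} {}\n'.format(len(vectors), dim))  # header: vocab size, dimension
    for word, vec in vectors.items():
        fileobj.write(word + ' ' + ' '.join(repr(x) for x in vec) + '\n')

def read_w2v_text(fileobj):
    """Parse word2vec text format back into {word: [floats]}."""
    vocab_size, dim = map(int, fileobj.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = fileobj.readline().rstrip('\n').split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:1 + dim]]
    return vectors

toy = {'数学': [0.1, -0.2], '逻辑': [0.3, 0.4]}  # made-up vectors
buffer = io.StringIO()
write_w2v_text(toy, buffer)
buffer.seek(0)
restored = read_w2v_text(buffer)
print(restored == toy)
```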

```python
>>> import gensim.downloader as api
>>>
>>> word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data
>>>
>>> result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.7699
>>>
>>> result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.8965
>>>
>>> print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))
cereal
>>>
>>> similarity = word_vectors.similarity('woman', 'man')
>>> similarity > 0.8
True
>>>
>>> result = word_vectors.similar_by_word("cat")
>>> print("{}: {:.4f}".format(*result[0]))
dog: 0.8798
>>>
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>>
>>> similarity = word_vectors.wmdistance(sentence_obama, sentence_president)
>>> print("{:.4f}".format(similarity))
3.4893
>>>
>>> distance = word_vectors.distance("media", "media")
>>> print("{:.1f}".format(distance))
0.0
>>>
>>> sim = word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
>>> print("{:.4f}".format(sim))
0.7067
>>>
>>> vector = word_vectors['computer']  # numpy vector of a word
>>> vector.shape
(100,)
>>>
>>> vector = word_vectors.wv.word_vec('office', use_norm=True)
>>> vector.shape
(100,)

# Correlation with human opinion on word similarity
>>> from gensim.test.utils import datapath
>>>
>>> similarities = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))


>>> analogy_scores = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
```
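
The `most_similar(positive=..., negative=...)` call in the excerpt above answers analogy questions by vector arithmetic: combine the positive vectors, subtract the negative ones, and return the word closest to the result by cosine similarity, skipping the query words themselves. Below is a toy sketch with hand-made 2-dimensional vectors. All values are invented so the arithmetic is easy to follow, `analogy` is a hypothetical helper, and the normalization that gensim applies to the vectors is omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two 2-dimensional vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(vectors, positive, negative):
    """Return the word nearest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Invented vectors: axis 0 stands for "royalty", axis 1 for "gender".
toy_vectors = {
    'king':  [1.0,  1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0,  1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.2],
}
answer = analogy(toy_vectors, positive=['woman', 'king'], negative=['man'])
print(answer)
```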

### t-SNE

In [0]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline


def tsne_plot(model):
    "Creates a t-SNE model and plots the first 100 words of the vocabulary"
    labels = []
    tokens = []
    
    num = 0
    for word in model.wv.vocab:
        tokens.append(model.wv[word])  # look vectors up via model.wv; model[word] is deprecated
        labels.append(word)
        num += 1
        if num == 100:
            break
    
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()