# Training word2vec on Wikipedia
Wikimedia Downloads (https://dumps.wikimedia.org/)

This example uses the 2020/03/01 dump (https://dumps.wikimedia.org/zhwiki/20200301/)

The full Wikipedia dump is about 2 GB and cannot be processed within 12 hours, so only the first part is downloaded

(https://dumps.wikimedia.org/zhwiki/20200301/zhwiki-20200301-pages-articles-multistream1.xml-p1p162886.bz2)


In [1]:
!wget https://dumps.wikimedia.org/zhwiki/20200301/zhwiki-20200301-pages-articles-multistream1.xml-p1p162886.bz2

--2020-04-17 10:34:06--  https://dumps.wikimedia.org/zhwiki/20200301/zhwiki-20200301-pages-articles-multistream1.xml-p1p162886.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 172586252 (165M) [application/octet-stream]
Saving to: ‘zhwiki-20200301-pages-articles-multistream1.xml-p1p162886.bz2’


2020-04-17 10:34:41 (4.79 MB/s) - ‘zhwiki-20200301-pages-articles-multistream1.xml-p1p162886.bz2’ saved [172586252/172586252]
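
Before handing the archive to Gensim, it can help to confirm that the download decompresses cleanly. The following is a minimal sketch using only the standard library; `peek_bz2` is a hypothetical helper name, not part of the original notebook.

```python
import bz2
from itertools import islice


def peek_bz2(path, n_lines=5):
    """Return the first n_lines text lines of a .bz2 file.

    The file is decompressed lazily, so only the beginning is read.
    """
    with bz2.open(path, mode='rt', encoding='utf-8') as f:
        return [line.rstrip('\n') for line in islice(f, n_lines)]
```

Calling `peek_bz2('zhwiki-20200301-pages-articles-multistream1.xml-p1p162886.bz2')` should return the opening XML lines of the dump; a corrupted download would raise an error instead.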



## Gensim
Gensim is a Python module for building models of the relationships between words

(https://radimrehurek.com/gensim/models/word2vec.html)


In [0]:
from gensim.corpora import WikiCorpus

wiki_corpus = WikiCorpus('zhwiki-20200301-pages-articles-multistream1.xml-p1p162886.bz2', dictionary={})


Show the first 10 entries of the first article

In [3]:
next(iter(wiki_corpus.get_texts()))[:10]

['歐幾里得',
 '西元前三世紀的古希臘數學家',
 '現在被認為是幾何之父',
 '此畫為拉斐爾的作品',
 '雅典學院',
 '数学',
 '是利用符号语言研究數量',
 '结构',
 '变化以及空间等概念的一門学科',
 '从某种角度看屬於形式科學的一種']

Join the entries of each Wikipedia article with spaces and write them to wiki_text.txt, one article per line

In [4]:
text_num = 0

with open('wiki_text.txt', 'w', encoding='utf-8') as f:
    for text in wiki_corpus.get_texts():
        f.write(' '.join(text)+'\n')
        text_num += 1
        if text_num % 10000 == 0:
            print('{} articles processed.'.format(text_num))

    print('{} articles processed.'.format(text_num))

10000 articles processed.
20000 articles processed.
27590 articles processed.
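
As a quick sanity check, the number of lines in wiki_text.txt should equal the final article count printed above, assuming no entry contains a line break. This is a small sketch with a hypothetical helper, `count_lines`, that streams the file rather than loading it whole.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)
```

For this run, `count_lines('wiki_text.txt')` would be expected to match the 27590 articles reported by the loop.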


## Install OpenCC (Open Chinese Convert) for Simplified/Traditional conversion

(https://github.com/BYVoid/OpenCC)

In [5]:
!pip install opencc-python-reimplemented

Collecting opencc-python-reimplemented
  Downloading https://files.pythonhosted.org/packages/53/0c/c499c86a719c925a08586085a56f92f3235c03ee8b4db2e59c1e9aab3f55/opencc-python-reimplemented-0.1.5.tar.gz (482kB)

## Jieba Chinese word segmentation

(https://github.com/fxsjy/jieba)

**Segment the articles into words and save the result to seg.txt (takes about 10 minutes)**

In [7]:
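import jieba
from opencc import OpenCC


# Initialize the Simplified-to-Traditional Chinese converter
cc = OpenCC('s2t')
# Read the whole corpus into memory
with open('wiki_text.txt', 'r', encoding='utf-8') as f:
    train_data = f.read()
# Convert Simplified characters to Traditional
train_data = cc.convert(train_data)
# Segment the text into words with jieba
train_data = jieba.lcut(train_data)
train_data = [word for word in train_data if word != '']
train_data = ' '.join(train_data)
# write() returns the number of characters written, which the notebook displays
open('seg.txt', 'w', encoding='utf-8').write(train_data)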


Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.743 seconds.
Prefix dict has been built successfully.


108406476
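
The cell above holds the entire corpus in memory at once. Where memory is tight, the same conversion and segmentation could be applied one line at a time. The sketch below is one possible variant, not the notebook's method: `segment_file` is a hypothetical helper, and `convert` and `tokenize` are passed in as callables so the function itself depends only on the standard library. Unlike the original cell, it also drops whitespace-only tokens.

```python
def segment_file(src_path, dst_path, convert, tokenize):
    """Convert and tokenize src_path line by line.

    Each output line holds the space-joined tokens of one input line.
    Returns the number of lines written.
    """
    n_lines = 0
    with open(src_path, 'r', encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            tokens = tokenize(convert(line.strip()))
            # Drop empty and whitespace-only tokens before joining
            tokens = [t for t in tokens if t.strip()]
            dst.write(' '.join(tokens) + '\n')
            n_lines += 1
    return n_lines
```

With the libraries used in this notebook, the call would look like `segment_file('wiki_text.txt', 'seg.txt', OpenCC('s2t').convert, jieba.lcut)`.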

**Train word2vec on seg.txt (takes about 10 minutes) and save the model to word2vec.model**

In [8]:
from gensim.models import word2vec


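# Settings
seed = 666            # random seed
sg = 0                # 0 = CBOW, 1 = skip-gram
window_size = 10      # context window size
vector_size = 100     # dimensionality of the word vectors
min_count = 1         # keep every word, even those seen only once
workers = 8           # number of worker threads
epochs = 5            # number of passes over the corpus
batch_words = 10000   # words per batch handed to each worker

# Stream the corpus from seg.txt line by line
train_data = word2vec.LineSentence('seg.txt')
model = word2vec.Word2Vec(
    train_data,
    min_count=min_count,
    size=vector_size,   # gensim 3.x name; `vector_size` in gensim >= 4.0
    workers=workers,
    iter=epochs,        # gensim 3.x name; `epochs` in gensim >= 4.0
    window=window_size,
    sg=sg,
    seed=seed,
    batch_words=batch_words
)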

model.save('word2vec.model')
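
The training call uses the gensim 3.x keyword names `size` and `iter`, which gensim 4.0 renamed to `vector_size` and `epochs`. A notebook that has to run on either version could choose the names at run time. The helper below is a hypothetical sketch (`word2vec_kwargs` is not a gensim function) and only inspects the major version number.

```python
def word2vec_kwargs(gensim_version, vector_size, epochs):
    """Return the vector-size and epoch keyword arguments for Word2Vec.

    gensim_version is a version string such as '3.6.0' or '4.3.2'.
    """
    major = int(gensim_version.split('.')[0])
    if major >= 4:
        return {'vector_size': vector_size, 'epochs': epochs}
    # gensim 3.x and earlier used the older keyword names
    return {'size': vector_size, 'iter': epochs}
```

It could then be used as `word2vec.Word2Vec(train_data, **word2vec_kwargs(gensim.__version__, vector_size, epochs), ...)` after `import gensim`, with the remaining arguments unchanged.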



**Use word2vec.model to look up related words (the query word "微生物", "microorganism", can be replaced)**

In [9]:
from gensim.models import word2vec

string = '微生物'
model = word2vec.Word2Vec.load('word2vec.model')
print(string)

for item in model.wv.most_similar(string):
    print(item)



微生物
('細菌', 0.8824469447135925)
('致病', 0.8656439781188965)
('菌', 0.863878607749939)
('真菌', 0.8602553606033325)
('病理', 0.8525905609130859)
('代謝', 0.84392249584198)
('放線菌', 0.8429034948348999)
('激素', 0.8403509855270386)
('免疫', 0.8389871120452881)
('藻類', 0.8362687826156616)
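
The scores returned by `most_similar` are cosine similarities between word vectors. As an illustration of what that number measures, here is a minimal pure-Python sketch; `cosine_similarity` is a name chosen for this example, and gensim computes the same quantity with NumPy internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With the trained model loaded, `cosine_similarity(model.wv['微生物'], model.wv['細菌'])` should land close to the first score in the list above, up to floating-point differences.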


