<h1 align="center">Word Embedding of the Character-Segmented Complete Tang Poems (全唐詩) with word2vec</h1>
<hr>
<p>
The <span style="color:red">corpus_poems</span> project filters, cleans, and character-segments the raw text; the resulting text file is
<span style="color:red">全唐詩（單字）.txt</span> (Traditional Chinese, UTF-8 encoded, about 542K)
</p>
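<p>
Character segmentation here means that every Chinese character becomes its own space-separated token. The sketch below illustrates that idea only; <code>segment_chars</code> is a hypothetical helper written for this note and is not taken from the corpus_poems project, whose actual filtering steps may differ.
</p>

```python
import re

# Hypothetical helper for illustration; not the corpus_poems implementation.
# Everything outside U+4E00..U+9FA5 (punctuation, Latin letters, digits,
# whitespace) is dropped before the characters are joined.
_NON_CJK = re.compile('[^\u4e00-\u9fa5]')

def segment_chars(line):
    """Return the Chinese characters of `line` separated by single spaces."""
    chars = _NON_CJK.sub('', line)
    return ' '.join(chars)

print(segment_chars('床前明月光，疑是地上霜。'))
```

<p>
Each token in the output is a single character, which is the form word2vec.Text8Corpus splits on whitespace during training.
</p>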

<hr>
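<h3>Some text preprocessing routines (kept for now)</h3>
<pre>
The function that removes non-Chinese text and symbols (keeping Chinese characters only) has been revised and renamed:
<strong style="color:blue">remove_punctuation</strong> is now <strong style="color:red">remove_non_chinese</strong>
</pre>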

In [1]:
# Strip the XML tags and write the result out as a plain-text TXT file

import gensim
import re

# Helper: remove non-Chinese text and symbols (keep Chinese characters only)

def remove_non_chinese(line):
    # Remove English letters and digits
    rule = re.compile('[a-zA-Z0-9]')
    line = rule.sub(' ', line)
    # Remove special symbols (including some full-width ones)
    rule = re.compile('[’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}~\\s]+')
    line = rule.sub(' ', line)
    # Remove invisible control characters
    rule = re.compile('[\001\002\003\004\005\006\007\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+')
    line = rule.sub(' ', line)
    # Replace everything outside U+4E00..U+9FA5 (this also catches any remaining full-width symbols)
    rule = re.compile('[^\u4e00-\u9fa5]')
    line = rule.sub(' ', line)
    return line

# Helper: collapse runs of spaces (leave a single space as the separator)

def remove_redundant_space(line):
    line = re.sub(' +', ' ', line)
    return line

# # Downloaded wiki corpus file
# wiki_file = 'c:/python/wiki/zhwiki-20200601-pages-articles.xml.bz2'

# with open('c:/python/wiki/wiki_texts.txt', 'w', encoding='utf8') as fp:
#     # Load it with gensim
#     wiki = gensim.corpora.WikiCorpus(wiki_file, lemmatize=False, dictionary={})
#     # Extract the text (the source is XML and contains many tags)
#     for text in wiki.get_texts():
#         # print(text)
#         # text is one article, represented as a list of strings
#         # Join the strings in text into one long string, separated by spaces
#         s = ' '.join(text)
#         # Keep Chinese only
#         t = remove_non_chinese(s)
#         # Leave a single space as the separator
#         u = remove_redundant_space(t)
#         # Write each article to the output file, one per line
#         fp.write(u + '\n')

# fp.close()
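
<p>
The final rule in remove_non_chinese keeps only U+4E00..U+9FA5. Rare characters encoded elsewhere, for example in CJK Extension A (U+3400..U+4DBF), would therefore be replaced by spaces. Whether the Tang poems corpus actually contains such characters has not been checked here; the sketch below only shows the difference on a synthetic string. The <code>EXTENDED</code> pattern is an assumption for illustration, not part of the original preprocessing.
</p>

```python
import re

# Pattern used in the cell above: keeps U+4E00..U+9FA5 only.
ORIGINAL = re.compile('[^\u4e00-\u9fa5]')

# Hypothetical wider pattern: the whole CJK Unified Ideographs block
# (U+4E00..U+9FFF) plus Extension A (U+3400..U+4DBF).
EXTENDED = re.compile('[^\u3400-\u4dbf\u4e00-\u9fff]')

# Synthetic sample: U+3400 is the first code point of Extension A.
sample = '春\u3400秋'

print(ORIGINAL.sub(' ', sample))  # the Extension A character becomes a space
print(EXTENDED.sub(' ', sample))  # the Extension A character is kept
```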


<hr>
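<h3 style="color:blue">word2vec training (very fast on character-segmented text)</h3>
<pre>
Vector length: typically <strong style="color:red">250-400 (32 is used for the character-segmented Complete Tang Poems)</strong>
Finally, the word2vec model is written to <strong style="color:red">poems.model.bin</strong> (the model trained on character-segmented text is also very small, about 2 MB)
</pre>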

In [2]:
# Train the word2vec model
#

from gensim.models import word2vec
import logging

def main():
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    sentences = word2vec.Text8Corpus('全唐詩（單字）.txt')
    # gensim 3.x parameter names; gensim 4.x renamed size -> vector_size and iter -> epochs
    model = word2vec.Word2Vec(sentences, size=32, min_count=1, iter=100)
    # Save our model.
    model.save('poems.model.bin')
    # To load a model.
    # model = word2vec.Word2Vec.load('c:/python/poems/poems.model.bin')

if __name__ == "__main__":
    main()


2021-06-09 16:39:34,420 : INFO : collecting all words and their counts
2021-06-09 16:39:34,423 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-06-09 16:39:34,846 : INFO : collected 7325 word types from a corpus of 2563538 raw words and 257 sentences
2021-06-09 16:39:34,846 : INFO : Loading a fresh vocabulary
2021-06-09 16:39:34,876 : INFO : effective_min_count=1 retains 7325 unique words (100% of original 7325, drops 0)
2021-06-09 16:39:34,876 : INFO : effective_min_count=1 leaves 2563538 word corpus (100% of original 2563538, drops 0)
2021-06-09 16:39:34,892 : INFO : deleting the raw counts dictionary of 7325 items
2021-06-09 16:39:34,893 : INFO : sample=0.001 downsamples 46 most-common words
2021-06-09 16:39:34,893 : INFO : downsampling leaves estimated 2434108 word corpus (95.0% of prior 2563538)
2021-06-09 16:39:34,905 : INFO : estimated required memory for 7325 words and 32 dimensions: 5537700 bytes
2021-06-09 16:39:34,906 : INFO : resetting layer w

2021-06-09 16:39:52,760 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:39:52,763 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-06-09 16:39:52,764 : INFO : EPOCH - 18 : training on 2563538 raw words (2434315 effective words) took 0.9s, 2628812 effective words/s
2021-06-09 16:39:53,684 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-06-09 16:39:53,685 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:39:53,688 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-06-09 16:39:53,688 : INFO : EPOCH - 19 : training on 2563538 raw words (2433728 effective words) took 0.9s, 2637814 effective words/s
2021-06-09 16:39:54,608 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-06-09 16:39:54,609 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:39:54,612 : INFO : worker thread finished; awaiting finish of 0 more th

2021-06-09 16:40:11,353 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:40:11,355 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-06-09 16:40:11,355 : INFO : EPOCH - 38 : training on 2563538 raw words (2434687 effective words) took 0.9s, 2630644 effective words/s
2021-06-09 16:40:12,269 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-06-09 16:40:12,269 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:40:12,273 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-06-09 16:40:12,273 : INFO : EPOCH - 39 : training on 2563538 raw words (2433867 effective words) took 0.9s, 2656238 effective words/s
2021-06-09 16:40:13,199 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-06-09 16:40:13,200 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:40:13,201 : INFO : worker thread finished; awaiting finish of 0 more th

2021-06-09 16:40:29,911 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:40:29,914 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-06-09 16:40:29,915 : INFO : EPOCH - 58 : training on 2563538 raw words (2434009 effective words) took 0.9s, 2602048 effective words/s
2021-06-09 16:40:30,841 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-06-09 16:40:30,842 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:40:30,845 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-06-09 16:40:30,846 : INFO : EPOCH - 59 : training on 2563538 raw words (2434462 effective words) took 0.9s, 2618465 effective words/s
2021-06-09 16:40:31,762 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-06-09 16:40:31,763 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:40:31,766 : INFO : worker thread finished; awaiting finish of 0 more th

2021-06-09 16:40:48,489 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:40:48,492 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-06-09 16:40:48,493 : INFO : EPOCH - 78 : training on 2563538 raw words (2434349 effective words) took 0.9s, 2641234 effective words/s
2021-06-09 16:40:49,431 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-06-09 16:40:49,432 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:40:49,433 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-06-09 16:40:49,434 : INFO : EPOCH - 79 : training on 2563538 raw words (2434462 effective words) took 0.9s, 2590516 effective words/s
2021-06-09 16:40:50,371 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-06-09 16:40:50,373 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:40:50,374 : INFO : worker thread finished; awaiting finish of 0 more th

2021-06-09 16:41:07,235 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:41:07,237 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-06-09 16:41:07,238 : INFO : EPOCH - 98 : training on 2563538 raw words (2434165 effective words) took 0.9s, 2602205 effective words/s
2021-06-09 16:41:08,171 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-06-09 16:41:08,172 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:41:08,174 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-06-09 16:41:08,175 : INFO : EPOCH - 99 : training on 2563538 raw words (2434701 effective words) took 0.9s, 2601039 effective words/s
2021-06-09 16:41:09,105 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-06-09 16:41:09,106 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-06-09 16:41:09,108 : INFO : worker thread finished; awaiting finish of 0 more th
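
<p>
Some figures in the training log above can be checked by hand. Text8Corpus does not split on poem lines: by default it yields chunks of at most 10000 tokens (its max_sentence_length parameter), which is why the log reports only 257 "sentences". The memory breakdown below (500 bytes of vocabulary overhead per word plus two float32 matrices) is an assumption that happens to reproduce the logged estimate; it is not quoted from the gensim source.
</p>

```python
import math

# Figures copied from the training log.
raw_words = 2563538
kept_after_downsampling = 2434108
vocab_size = 7325
vector_size = 32

# Text8Corpus chunks the token stream; 10000 is its default max_sentence_length.
max_sentence_length = 10000
chunks = math.ceil(raw_words / max_sentence_length)
print(chunks)

# Share of the corpus that survives frequent-word downsampling (sample=0.001).
kept_pct = round(100 * kept_after_downsampling / raw_words, 1)
print(kept_pct)

# Assumed breakdown of the memory estimate: per-word vocabulary overhead
# plus an input and an output matrix of 4-byte floats.
bytes_per_float = 4
vocab_overhead = vocab_size * 500
matrix_bytes = vocab_size * vector_size * bytes_per_float
estimate = vocab_overhead + 2 * matrix_bytes
print(estimate)
```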

<hr>
<h3 style="color:blue">Tests (similar words, similarity, word vectors)</h3>

In [3]:
# Load the word2vec model

import gensim

print(gensim.__version__)

model = gensim.models.Word2Vec.load('poems.model.bin')


2021-06-09 16:41:09,142 : INFO : loading Word2Vec object from poems.model.bin
2021-06-09 16:41:09,164 : INFO : loading wv recursively from poems.model.bin.wv.* with mmap=None
2021-06-09 16:41:09,165 : INFO : setting ignored attribute vectors_norm to None
2021-06-09 16:41:09,166 : INFO : loading vocabulary recursively from poems.model.bin.vocabulary.* with mmap=None
2021-06-09 16:41:09,166 : INFO : loading trainables recursively from poems.model.bin.trainables.* with mmap=None
2021-06-09 16:41:09,167 : INFO : setting ignored attribute cum_table to None
2021-06-09 16:41:09,167 : INFO : loaded poems.model.bin


3.8.0


<hr>
<h3>Similar-word test</h3>

In [4]:
# Similar-word test

# q = '人'
# q = '風'
# q = '水'
# q = '衣'
# q = '篆'
q = '不'

try:
    lst = model.wv.most_similar(q)
except KeyError:
    # The query character is not in the model vocabulary
    print('No %s in corpus!' % q)
    lst = []

for t, w in lst:
    print('%20.16f: %s' % (w, t))


2021-06-09 16:41:09,182 : INFO : precomputing L2-norms of word weight vectors


  0.8352340459823608: 那
  0.8227962255477905: 否
  0.8185630440711975: 未
  0.8061366081237793: 豈
  0.7808675765991211: 誰
  0.7701524496078491: 詎
  0.7530770301818848: 也
  0.7396969795227051: 始
  0.7325989603996277: 只
  0.7218586206436157: 莫
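
<p>
Conceptually, the ranking above comes from scoring every other vocabulary entry by cosine similarity to the query and keeping the top results. The sketch below shows that idea in plain Python on made-up toy vectors; <code>most_similar_sketch</code> is a hypothetical name, and this is not how gensim implements it internally (gensim works on normalized NumPy matrices).
</p>

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(vectors, query, topn=10):
    """Rank all entries except `query` by cosine similarity, highest first."""
    q = vectors[query]  # raises KeyError if the query is not in the vocabulary
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy 2-dimensional vectors, invented for this example.
toy = {
    'a': [1.0, 0.0],
    'b': [0.9, 0.1],
    'c': [0.0, 1.0],
    'd': [-1.0, 0.0],
}

for word, score in most_similar_sketch(toy, 'a', topn=2):
    print('%20.16f: %s' % (score, word))
```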


<hr>
<h3>Similarity test</h3>

In [5]:
# Similarity test

q1 = '春'
q2 = '秋'
# q1 = '不'
# q2 = '否'
# q1 = '水'
# q2 = '川'

try:
    s = model.wv.similarity(q1, q2)
except KeyError:
    # At least one of the two characters is not in the model vocabulary
    s = 0
    print('Unable to test!')

print(s)


0.64299524


<hr>
<h3>Word vectors and cosine similarity computation</h3>

In [6]:
# Word vectors and cosine similarity computation

import math

q1 = '春'
q2 = '秋'

sm = model.wv.similarity(q1, q2)

# Look up the raw (unnormalized) vector of each character
v1 = model.wv.word_vec(q1)
v2 = model.wv.word_vec(q2)

print(q1, '=')
print(v1)
print(q2, '=')
print(v2)

# Cosine of the angle between the vectors: dot(v1, v2) / (|v1| * |v2|)
s0, s1, s2 = 0.0, 0.0, 0.0
for i in range(len(v1)):
    s0 = s0 + (v1[i] * v2[i])
    s1 = s1 + (v1[i] * v1[i])
    s2 = s2 + (v2[i] * v2[i])
s1 = math.sqrt(s1)
s2 = math.sqrt(s2)
cs = s0 / (s1 * s2)

print('similarity =', sm)
print('    cosine =', cs)

# for i in range(len(v1)):
#     t = '%3d. %8.4f %8.4f' % (i, v1[i], v2[i])
#     print(t)


春 =
[ 4.051311    1.1910332  -2.9540813   3.4869964  -0.38508388 -2.1289382
  0.51653314  3.3864117   4.7534957   0.6509034   0.62074137 -1.0021261
 -1.0605375  -0.04934476 -5.0326962  -2.091373    1.8834847  -0.8315506
  2.4946365  -1.3585271  -0.3051079  -3.9430592  -4.4005313  -2.121581
 -1.8241342   4.9539614   1.8375494  -0.14940497 -4.642316   -2.5227432
 -1.9262683  -0.9222319 ]
秋 =
[ 2.7894175   0.3779707  -1.174874    1.6730413  -1.0096142  -2.3177893
  0.30569938  1.1835418   7.5512285   4.097276   -0.61499256  1.9736483
 -1.6048589  -0.98677117 -3.655864   -1.1428906  -1.3572695  -0.5521415
 -0.1674595  -3.8426957  -0.5969252  -0.89309883 -5.859597    2.0801303
 -1.0321467   3.389633    0.566555   -0.48148796 -3.8033376  -0.3151946
  4.9143553   0.7758194 ]
similarity = 0.64299524
    cosine = 0.642995228929133


<hr>
<h3>Generate and save all word vectors</h3>
<p style="color:red">word_vec.pkl</p>

In [7]:
import pickle

with open('全唐詩（單字）.txt', 'r', encoding='utf-8') as fp:
    txt = fp.read()

# Map every distinct character in the corpus to its vector
word_vec = dict()
for c in txt:
    # Skip separators (spaces, newlines); they are not vocabulary entries
    if not c.isspace() and c not in word_vec:
        word_vec[c] = model.wv.word_vec(c)

with open('word_vec.pkl', 'wb') as fp:
    pickle.dump(word_vec, fp)


<hr>
<h3>Load all word vectors</h3>

In [8]:
import pickle

with open('word_vec.pkl', 'rb') as fp:
    word_vec = pickle.load(fp)

# Print the first 10 entries
for i, c in enumerate(word_vec):
    if i == 10:
        break
    print(c, word_vec[c])


秦 [ 2.1026301  -2.7536335  -2.0544972  -0.01830175 -3.122754    0.17718643
 -3.8442      2.6048982   0.8562563  -1.5035603   0.05179761  0.33781773
  0.22665319  0.50929093  2.5081782   0.53456926  0.9918102  -2.8749368
  7.0555086  -3.1846507  -0.2567764   1.2465986  -4.0731597  -0.45749757
 -1.7403893  -2.4881248  -2.0756571   3.6660502  -2.0144305  -2.3534062
 -1.4341304   0.3222719 ]
川 [-1.0051473   1.6695576  -4.207304    2.6659398  -0.45646855 -2.1526043
 -3.3446171   0.74518937  1.0484664  -2.3893294  -3.6149206  -0.74381167
  3.22035    -2.1054273  -1.0412881   2.5404737   0.10607705 -0.6205343
 -0.78406346 -1.7048079  -2.6878805   2.238332   -4.95302     4.7836056
 -0.8092342  -2.0167623   3.2475955   3.3284743  -5.144628   -3.5549617
  0.14818315  4.102822  ]
雄 [-3.9963582  -1.8772268   0.62144    -2.6027794  -5.6666255   1.9551077
 -0.7647267  -2.9234245   3.529084   -0.64440626 -4.014011    2.7052875
  3.929941   -1.4227966   3.0367606   0.3749067   8.607285   -0.5752638
  

<hr>
<h3>Temporary tests below</h3>

In [9]:
# # OpenCC test (cannot be installed at present)

# from opencc import OpenCC

# # convert from Simplified Chinese to Traditional Chinese
# openCC = OpenCC('s2t')

# # can also set conversion by calling set_conversion
# # openCC.set_conversion('s2tw')

# chs= '开放中文转换'
# cht= openCC.convert(chs)

# print(cht)


In [10]:
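# Remove all half-width and full-width symbols, letters, and digits, keeping only Chinese characters.

import re

def remove_non_chinese(line):
    # Remove English letters and digits
    rule = re.compile('[a-zA-Z0-9]')
    line = rule.sub(' ', line)
    # Remove special symbols (including some full-width ones)
    rule = re.compile('[’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}~\\s]+')
    line = rule.sub(' ', line)
    # Remove invisible control characters
    rule = re.compile('[\001\002\003\004\005\006\007\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+')
    line = rule.sub(' ', line)
    # Replace everything outside U+4E00..U+9FA5 (this also catches any remaining full-width symbols)
    rule = re.compile('[^\u4e00-\u9fa5]')
    line = rule.sub(' ', line)
    return line

def remove_punctuation(line):
    # Older variant: delete everything except Chinese characters and whitespace.
    # Inside a character class '|' is a literal, so it is left out of the pattern here.
    rule = re.compile(r'[^\u4e00-\u9fa5\s]')
    line = rule.sub('', line)
    return line

def remove_redundant_space(line):
    line = re.sub(' +', ' ', line)
    return line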

s = '开放中文转换 abc, XyZ#$%， 塵土    123 飛揚'

t = remove_non_chinese(s)
# t = remove_punctuation(s)
u = remove_redundant_space(t)

print(u)


开放中文转换 塵土 飛揚


In [11]:
s = '中文字   空白   刪除 ！'
t = re.sub(' +', ' ', s)
print(t)

中文字 空白 刪除 ！
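
<p>
The pattern ' +' above collapses only ASCII spaces. If the input may also contain tabs, newlines, or full-width (ideographic) spaces, a variant based on \s covers them as well, since \s matches Unicode whitespace in Python 3 string patterns. <code>normalize_whitespace</code> is a hypothetical helper name used for this sketch, not part of the original notebook.
</p>

```python
import re

def normalize_whitespace(line):
    """Collapse any run of Unicode whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', line).strip()

# Mixed separators: space, tab, two ideographic spaces (U+3000), newline.
s = '中文字 \t 空白\u3000\u3000刪除\n'
print(normalize_whitespace(s))
```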
