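# Assignment: Tuning the Training Parameters of a word2vec Model

# [Assignment Goal]
- Adjust the parameters of the word2vec model, then observe and compare the effect of each change

# [Key Points]
- Adjust the training parameters of the word2vec model, then observe and compare the effect of each change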

In [1]:
!pip install gensim



In [2]:
# Load gensim and the word2vec model
import gensim
from gensim.models import word2vec
import gensim.downloader as api

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

dataset = api.load('text8')

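# Word2Vec Training Parameters
- size: dimensionality of the word vectors (renamed vector_size in gensim 4.0 and later).
- min_count: minimum frequency. A word that occurs fewer than min_count times is dropped and takes no part in training.
- window: context window size, i.e. the maximum number of words on each side of a word that are treated as its context.
- For more parameters, see the official documentation:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Text8Corpus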
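The effect of min_count and window can be sketched on a toy corpus in plain Python. This is a simplified illustration, not gensim's implementation: the corpus and the helper names `build_vocab` and `context_pairs` are hypothetical, and refinements such as frequent-word downsampling are left out.

```python
from collections import Counter

# Toy corpus (hypothetical data): each sentence is a list of tokens.
sentences = [
    "the king rules the land".split(),
    "the queen rules the land".split(),
    "a rare word appears once".split(),
]

def build_vocab(sentences, min_count):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(sentence, window):
    """Yield (center, context) pairs using up to `window` words on each side."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]

vocab = build_vocab(sentences, min_count=2)
pairs = list(context_pairs(sentences[0], window=1))

print(sorted(vocab))
print(pairs)
```

Raising min_count shrinks the vocabulary, while raising window produces more (center, context) pairs per sentence and therefore more training examples drawn from wider contexts.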

In [3]:
# Train word2vec word vectors with gensim
#sentences = word2vec.Text8Corpus('text8/text8')
# feed text into model
model = word2vec.Word2Vec(dataset, size=10)
#model = word2vec.Word2Vec(dataset, size=10, min_count=3, window=5)

2022-02-06 12:25:24,060 : INFO : collecting all words and their counts
2022-02-06 12:25:24,070 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-02-06 12:25:34,887 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2022-02-06 12:25:34,893 : INFO : Loading a fresh vocabulary
2022-02-06 12:25:35,608 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2022-02-06 12:25:35,612 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2022-02-06 12:25:36,353 : INFO : deleting the raw counts dictionary of 253854 items
2022-02-06 12:25:36,379 : INFO : sample=0.001 downsamples 38 most-common words
2022-02-06 12:25:36,383 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2022-02-06 12:25:37,159 : INFO : estimated required memory for 71290 words and 10 dimensions: 41348200 bytes
2022-02-06 12:25:37,166 : I
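The memory estimate in the log (41348200 bytes for 71290 words and 10 dimensions) shows how the size parameter feeds into resource use. The arithmetic below reproduces that figure under two assumptions that are not stated in the log itself: roughly 500 bytes of vocabulary bookkeeping per word, and two float32 weight matrices of shape (vocabulary size, vector size). The helper name `estimate_bytes` is hypothetical.

```python
BYTES_PER_FLOAT32 = 4
# Assumption: about 500 bytes of per-word vocabulary bookkeeping.
VOCAB_OVERHEAD_PER_WORD = 500
# Assumption: one matrix for the word vectors, one for the output layer.
NUM_WEIGHT_MATRICES = 2

def estimate_bytes(vocab_size, vector_size):
    """Rough memory estimate: vocabulary overhead plus weight matrices."""
    vocab_bytes = vocab_size * VOCAB_OVERHEAD_PER_WORD
    matrix_bytes = NUM_WEIGHT_MATRICES * vocab_size * vector_size * BYTES_PER_FLOAT32
    return vocab_bytes + matrix_bytes

# Figures taken from the training log: 71290 retained words, size=10.
logged_setting = estimate_bytes(71290, 10)
larger_vectors = estimate_bytes(71290, 100)

print(logged_setting)
print(larger_vectors)
```

Under these assumptions the vocabulary overhead dominates at size=10, so memory grows much more slowly than the vector dimension until the vectors become large.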

In [4]:
# Show the most similar words
model.wv.most_similar(['woman'])

  
2022-02-06 12:27:47,580 : INFO : precomputing L2-norms of word weight vectors


[('angry', 0.9509749412536621),
 ('sing', 0.9479236602783203),
 ('siblings', 0.9417492151260376),
 ('dressed', 0.9222750663757324),
 ('girl', 0.9192980527877808),
 ('daughters', 0.9171081185340881),
 ('brave', 0.9130298495292664),
 ('carmilla', 0.9107134342193604),
 ('husband', 0.9100841283798218),
 ('bride', 0.9052940607070923)]
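The scores in this output are cosine similarities between word vectors. A minimal sketch of the ranking, using hypothetical three-dimensional toy vectors rather than the trained model, might look like this (the function names are illustrative and are not gensim's own code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, word, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]

# Hypothetical toy vectors, not taken from the trained model.
toy_vectors = {
    "woman": [1.0, 0.9, 0.1],
    "girl": [0.9, 1.0, 0.2],
    "husband": [0.8, 0.3, 0.5],
    "computer": [-0.2, 0.1, 1.0],
}

ranking = most_similar(toy_vectors, "woman", topn=2)
print(ranking)
```

When comparing parameter settings, it helps to rerun the same query on each model and check whether the top-ranked neighbors become more plausible.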

In [5]:
# Show the most similar words with a negative term (woman + king - man)
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

  


[('empress', 0.9408944249153137),
 ('reigned', 0.9386017322540283),
 ('reigning', 0.9355425834655762),
 ('deposed', 0.9343705177307129),
 ('tsar', 0.9302720427513123)]
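The positive and negative arguments combine word vectors before the similarity search: positive words are added, negative words are subtracted, and the query words themselves are excluded from the results. The sketch below illustrates that idea on hypothetical toy vectors. It approximates the behavior and is not gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def analogy(vectors, positive, negative, topn=1):
    """Add unit vectors of positive words, subtract negative ones, rank the rest."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)

    excluded = set(positive) | set(negative)
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]

# Hypothetical toy vectors: axis 0 = "person", axis 1 = "female", axis 2 = "royal".
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.5, -1.0, 0.2],
}

result = analogy(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result)
```

In the toy data the answer is "queen" by construction. In the notebook's run with size=10 the top results are royalty-related words such as "empress", so this query is a useful probe when comparing other parameter settings.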

In [6]:
# Choose the word that does not match the others
model.wv.doesnt_match("breakfast cereal dinner lunch".split())



'cereal'
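One way to picture how the odd word is chosen: normalize each word vector, average them, and return the word whose vector points least in the direction of that average. The sketch below follows that idea with hypothetical two-dimensional toy vectors. The function is illustrative, not the library's code.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def doesnt_match(vectors, words):
    """Return the word least similar to the mean of the group's unit vectors."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(units[words[0]])
    mean = unit([sum(units[w][i] for w in words) / len(words) for i in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical toy vectors: the three meals point one way, "cereal" another.
toy_vectors = {
    "breakfast": [1.0, 0.2],
    "lunch": [0.9, 0.3],
    "dinner": [1.0, 0.1],
    "cereal": [0.2, 1.0],
}

odd_one = doesnt_match(toy_vectors, "breakfast cereal dinner lunch".split())
print(odd_one)
```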

In [10]:
# Show the similarity between two words
model.wv.similarity('woman', 'man')

0.88568246

In [8]:
# Show the word vector of a word
model.wv['computer']

  


array([ 5.686946  ,  0.27121416, -4.920066  , -1.9344455 ,  2.6716447 ,
       -2.9096932 ,  0.18301019, -2.1444285 , -6.0022426 ,  2.1373334 ],
      dtype=float32)
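To compare settings as the assignment asks, one option is to enumerate the combinations first and train one model per combination. The helper below is a hypothetical sketch that only builds the keyword-argument dictionaries. The candidate values are arbitrary examples, and the key size follows the parameter name used in this notebook.

```python
from itertools import product

def parameter_grid(**options):
    """Yield one keyword-argument dict per combination of the given values."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical candidate values for the three parameters discussed above.
grid = list(parameter_grid(size=[10, 100], min_count=[3, 5], window=[2, 5]))

for params in grid:
    print(params)
```

Each dictionary could then be passed along as `word2vec.Word2Vec(dataset, **params)`, with the same probes (most_similar, doesnt_match, similarity) rerun on every resulting model so the outputs can be compared side by side.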