<br>
<br>
<br>

**gensim PyPI:** <a href="https://pypi.org/project/gensim/" style="text-decoration:none;font-size:120%">https://pypi.org/project/gensim/</a>

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.

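gensim is mainly used for topic modelling, and it is especially common in work connected to deep learning, where its word vectors serve as input features.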

<br>

Word vectors come in two kinds: `static word vectors` and `dynamic word vectors`.
* Static word vectors: word2vec, GloVe, fastText
* Dynamic word vectors:

<br>
<br>
<br>

## Word2Vec Model

gensim documentation: <a href="https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html" style="text-decoration:none;font-size:100%">Word2Vec Model</a>

### <a href="https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#review-bag-of-words" style="text-decoration:none;font-size:100%">Review: Bag-of-words</a>

`Bag-of-words models` are surprisingly effective, but have several weaknesses.
* First, they lose all information about word order: “John likes Mary” and “Mary likes John” correspond to identical vectors. <br>There is a solution: <font color=maroon>bag of **n-grams** models</font> consider word phrases of length n to represent documents as fixed-length vectors to capture local word order but <font color=maroon>suffer from data sparsity and high dimensionality.</font>


* Second, the model does not attempt to learn the meaning of the underlying words, and as a consequence, the distance between vectors doesn’t always reflect the difference in meaning. <br>The <font color=maroon>**Word2Vec** model</font> addresses this second problem.
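The first weakness can be checked with a few lines of plain Python. This is a minimal sketch, not gensim code: the helper `bag_of_ngrams` is a hypothetical name introduced only for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def bag_of_ngrams(tokens, n=1):
    """Count the n-grams (as tuples) that occur in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "John likes Mary".split()
b = "Mary likes John".split()

# Unigram (bag-of-words) counts ignore order, so the two sentences collide.
same_unigrams = bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1)

# Bigram counts keep local word order, so the sentences become distinguishable,
# at the cost of a larger and sparser feature space.
same_bigrams = bag_of_ngrams(a, 2) == bag_of_ngrams(b, 2)

print(same_unigrams, same_bigrams)
```

Neither representation says anything about what the words mean, which is the second weakness listed above.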

### <a href="https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#introducing-the-word2vec-model" style="text-decoration:none;font-size:100%">Introducing: the Word2Vec Model</a>

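* **CBOW:** Continuous Bag of Words. The input is the surrounding context, and the output is the center word.
* **Skip-gram:** The input is the center word, and the output is the surrounding context.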
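The difference between the two architectures is easiest to see in the training examples each one would draw from a sentence. The sketch below only builds those (input, target) pairs in plain Python. The helper `training_pairs` and the fixed `window` are assumptions made for illustration, and the sketch leaves out everything else a real implementation does (window sampling, subsampling, negative sampling, the network itself).

```python
def training_pairs(tokens, window=2):
    """Yield (center, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


sentence = ["survey", "user", "computer", "system"]

# CBOW: the context words are the input, the center word is the target.
cbow = [(context, center)
        for center, context in training_pairs(sentence, window=1)]

# Skip-gram: the center word is the input, each context word is a target.
skipgram = [(center, word)
            for center, context in training_pairs(sentence, window=1)
            for word in context]

for pair in cbow:
    print("CBOW      ", pair)
for pair in skipgram:
    print("Skip-gram ", pair)
```

In gensim the choice is made with the `sg` argument of `Word2Vec`: `sg=1` selects skip-gram, and the default `sg=0` selects CBOW.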


<br>

In [1]:
from gensim.models.word2vec import Word2Vec
from gensim.test.utils import common_texts

<br>

In [2]:
# No corpus is passed, so no vocabulary is built and the key list is empty.
model_original = Word2Vec()

model_original.wv.index_to_key

[]

<br>

In [3]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [4]:
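import os

model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)

# Make sure the output directory exists before saving the model into it.
os.makedirs("./gensim_data", exist_ok=True)
model.save("./gensim_data/word2vec.model")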

In [12]:
model.wv.index_to_key

# model.wv.index2entity
# Raises: AttributeError: The index2entity attribute has been replaced by index_to_key since Gensim 4.0.0.

# model.wv.index2word
# Raises: AttributeError: The index2word attribute has been replaced by index_to_key since Gensim 4.0.0.

['system',
 'graph',
 'trees',
 'user',
 'minors',
 'eps',
 'time',
 'response',
 'survey',
 'computer',
 'interface',
 'human']

In [6]:
# model.wv['king']  # Raises: KeyError: "Key 'king' not present" ('king' is not in the training corpus)

print(len(model.wv['human']))
model.wv['human']

100


array([ 9.7702928e-03,  8.1651136e-03,  1.2809705e-03,  5.0975773e-03,
        1.4081288e-03, -6.4551616e-03, -1.4280510e-03,  6.4491653e-03,
       -4.6173073e-03, -3.9930656e-03,  4.9244044e-03,  2.7130984e-03,
       -1.8479753e-03, -2.8769446e-03,  6.0107303e-03, -5.7167388e-03,
       -3.2367038e-03, -6.4878250e-03, -4.2346334e-03, -8.5809948e-03,
       -4.4697905e-03, -8.5112313e-03,  1.4037776e-03, -8.6181974e-03,
       -9.9166557e-03, -8.2016252e-03, -6.7726658e-03,  6.6805840e-03,
        3.7845564e-03,  3.5616636e-04, -2.9579829e-03, -7.4283220e-03,
        5.3341867e-04,  4.9989222e-04,  1.9561767e-04,  8.5259438e-04,
        7.8633073e-04, -6.8161491e-05, -8.0070542e-03, -5.8702733e-03,
       -8.3829118e-03, -1.3120436e-03,  1.8206357e-03,  7.4171280e-03,
       -1.9634271e-03, -2.3252917e-03,  9.4871549e-03,  7.9703328e-05,
       -2.4045228e-03,  8.6048460e-03,  2.6870037e-03, -5.3439736e-03,
        6.5881060e-03,  4.5101522e-03, -7.0544672e-03, -3.2317400e-04,
        ...

In [7]:
# model.most_similar('graph')    # raises an error; call most_similar on model.wv instead
model.wv.most_similar('graph')

[('user', 0.06793875992298126),
 ('survey', 0.03364057093858719),
 ('eps', 0.009391184896230698),
 ('human', 0.00831596553325653),
 ('minors', 0.004503015894442797),
 ('system', -0.010839187540113926),
 ('trees', -0.023671669885516167),
 ('computer', -0.09575346857309341),
 ('time', -0.11410722136497498),
 ('response', -0.11557209491729736)]

In [8]:
model.wv.most_similar('graph', topn=5)

[('user', 0.06793875992298126),
 ('survey', 0.03364057093858719),
 ('eps', 0.009391184896230698),
 ('human', 0.00831596553325653),
 ('minors', 0.004503015894442797)]

In [9]:
model.wv.similarity("graph", "user")

0.06793875
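The value returned by `similarity` is the cosine similarity of the two word vectors, and it matches the score that `most_similar` reported for `'user'` above. As a rough illustration of what that number measures, here is a plain-Python version of the formula. The function name `cosine_similarity` and the toy vectors are assumptions for this sketch and do not come from gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Same direction: close to 1, whatever the vector lengths are.
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular directions: 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Opposite directions: close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(parallel, orthogonal, opposite)
```

Scores near zero, like the ones in this notebook, mean the vectors are almost perpendicular. That is unsurprising here, since a nine-sentence corpus is far too small for the model to learn meaningful relationships.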

<br>