# word2vec
---
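Word2Vec is a well-known word embedding method. It computes a distributed representation (usually just called a word vector) for each word, based on the contexts in which the word appears in a given corpus. These word vectors capture the semantics of each word to some extent.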

## Basic usage
---
### Reading a corpus
---
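* class gensim.models.word2vec.BrownCorpus(dirname)
Iterates over sentences from the Brown corpus (part of the NLTK data). dirname is the root directory where the Brown corpus is stored (the corpus can be downloaded with nltk.download()). The resulting object can be looped over to get the sentences of the corpus.

* class gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)
Like BrownCorpus, this produces an iterable over sentences, but the input file must follow a simple format: one document per line, with words already preprocessed and separated by spaces.

* class gensim.models.word2vec.PathLineSentences(source, max_sentence_length=10000, limit=None)
Like LineSentence, except that it processes all files in the given directory. The files must follow the same one-document-per-line format.

* class gensim.models.word2vec.Text8Corpus(fname, max_sentence_length=10000)
Iterates over sentences from the text8 corpus.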


In [1]:
from gensim.models import word2vec

file_path = 'word2vec训练数据.txt'
# Read the corpus with LineSentence
sentences = word2vec.LineSentence(file_path)
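
The one-document-per-line format can be illustrated without gensim. The sketch below is a simplified stand-in, not gensim's implementation: the class name `SimpleLineSentences`, the sample text, and the temporary file are all hypothetical. It shows that each line becomes one list of tokens, that long lines are cut into chunks of at most `max_sentence_length` tokens, and that the object can be iterated more than once.

```python
import os
import tempfile


class SimpleLineSentences:
    """Hypothetical, simplified stand-in for a line-based corpus reader."""

    def __init__(self, path, max_sentence_length=10000):
        self.path = path
        self.max_sentence_length = max_sentence_length

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()  # tokens are separated by whitespace
                # Cut very long lines into chunks of at most max_sentence_length tokens.
                for start in range(0, len(tokens), self.max_sentence_length):
                    yield tokens[start:start + self.max_sentence_length]


# Write a tiny made-up corpus: one document per line, tokens separated by spaces.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the quick brown fox\njumps over the lazy dog\n")
    corpus_path = tmp.name

corpus = SimpleLineSentences(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass yields the same sentences
chunked = list(SimpleLineSentences(corpus_path, max_sentence_length=3))
os.remove(corpus_path)

print(first_pass)
print(chunked)
```

Being able to restart the iteration matters because training makes several passes over the corpus (one to build the vocabulary, then one per training epoch), so a one-shot generator would be exhausted after the first pass.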

### Training word2vec semantic vectors
---
```python
class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5,  
                   max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,  
                   sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,  
                   trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, 
                   compute_loss=False,callbacks=(), 
                   max_final_vocab=None)  
```
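* sentences (iterable of iterables): the sentences can be a plain list, but for larger corpora consider an iterable that streams sentences directly from disk or over the network. See BrownCorpus, Text8Corpus, or LineSentence.
* sg (int, {1, 0}): the training algorithm. If 1, skip-gram is used; otherwise, CBOW is used.
* hs (int, {1, 0}): whether to use Hierarchical Softmax. 1 enables it and 0 disables it (with the default negative=5, negative sampling is used instead).
* size (int): dimensionality of the feature vectors (renamed vector_size in gensim 4.0).
* window (int): maximum distance between the current word and the predicted word within a sentence.
* min_count (int): ignore all words whose total frequency is lower than this value.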

    For more on Hierarchical Softmax and negative sampling, see the following blog posts:  
        http://www.cnblogs.com/pinard/p/7243513.html  
        https://www.cnblogs.com/pinard/p/7249903.html  
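
To make `min_count`, `window`, and `sg` more concrete, here is a minimal plain-Python sketch of what they control. It is only an illustration under simplifying assumptions and does not reproduce gensim's internals (for instance, it ignores subsampling of frequent words and any randomization of the window size). The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def filter_rare_words(sentences, min_count):
    """Drop every token whose total corpus frequency is below min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count] for sentence in sentences]


def context_of(sentence, position, window):
    """Tokens at most `window` positions away from `position`, excluding the token itself."""
    left = sentence[max(0, position - window):position]
    right = sentence[position + 1:position + 1 + window]
    return left + right


def training_examples(sentence, window, sg):
    """Yield (input, target) pairs for one sentence.

    sg=1 (skip-gram): the current word is the input, each context word is a target.
    sg=0 (CBOW): all context words together are the input, the current word is the target.
    """
    for position, word in enumerate(sentence):
        context = context_of(sentence, position, window)
        if not context:
            continue  # a one-word sentence has no context to learn from
        if sg == 1:
            for neighbour in context:
                yield word, neighbour
        else:
            yield tuple(context), word


corpus = [["i", "like", "green", "tea"], ["i", "like", "coffee"]]
kept = filter_rare_words(corpus, min_count=2)
skipgram_pairs = list(training_examples(corpus[0], window=1, sg=1))
cbow_pairs = list(training_examples(corpus[0], window=1, sg=0))

print(kept)
print(skipgram_pairs)
print(cbow_pairs)
```

With `min_count=2` only "i" and "like" survive in this toy corpus, which is why small corpora are often trained with `min_count=1`. A larger `window` produces more pairs per word and pulls in more distant context.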

In [12]:
model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=2, size=128)

In [13]:
model["sfsdf"]

KeyError: "word 'sfsdf' not in vocabulary"
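
Looking up a word that is not in the vocabulary raises a `KeyError`, as the output above shows. One way to guard against this is to test membership before indexing. The helper below is a hypothetical sketch: `lookup_vector` is not part of gensim, and the demo uses a plain dict with made-up vectors as a stand-in for a trained model. With gensim, the same pattern should apply to the model's keyed vectors, assuming the installed version exposes them as `model.wv` and that they support `in` and indexing.

```python
def lookup_vector(vectors, word, default=None):
    """Return the vector for `word`, or `default` if the word is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    return default


# Stand-in for a trained model: a plain dict with made-up 3-dimensional vectors.
toy_vectors = {"apple": [0.1, -0.2, 0.3], "pear": [0.0, 0.5, -0.1]}

known = lookup_vector(toy_vectors, "apple")
unknown = lookup_vector(toy_vectors, "sfsdf")  # not in the vocabulary
print(known, unknown)
```

Returning a default keeps a batch job running when it meets unseen words; whether to skip such words or substitute a fallback vector depends on the application.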

### Saving the model
---
model.save(file_name)
* file_name: name of the file the model is saved to

In [14]:
model.save('model_word2vec_test')

### Loading the model
---
word2vec.Word2Vec.load(file_name)
* file_name: name of the saved model file

In [15]:
model = word2vec.Word2Vec.load('model_word2vec_test')

In [16]:
# Get the word2vec vector of a word
model['apple']

  


array([-0.05317169, -0.15755793, -0.06250006, -0.01180194,  0.04724145,
       -0.26425636, -0.01873707, -0.17131071,  0.03399659, -0.29721585,
       -0.23619682, -0.00187487, -0.03608605, -0.23502819,  0.1677451 ,
        0.03139535, -0.04964776, -0.20321703,  0.03333145, -0.26338065,
       -0.32002532,  0.2947625 ,  0.18000598, -0.03674206,  0.02373896,
        0.07094915,  0.11757199, -0.22713898, -0.3052054 , -0.16245638,
        0.23368882,  0.11960867, -0.08898318,  0.06621196,  0.1065318 ,
       -0.0490705 , -0.16302302, -0.01123774, -0.11983654, -0.02068417,
        0.2732088 ,  0.2906885 , -0.13971347,  0.07436862, -0.21679416,
       -0.13932157, -0.06862234,  0.06071692,  0.04861921, -0.00767402,
       -0.00748343, -0.12588996,  0.2271392 ,  0.1355575 , -0.04126433,
        0.02798286,  0.11357804,  0.07826936,  0.1436871 ,  0.09072269,
       -0.27041167, -0.13087061, -0.03427159, -0.34049234, -0.05541604,
       -0.09847128, -0.15053135, -0.11456229, -0.19661945,  0.15

In [17]:
# Compute the semantic similarity between two words
print(model.similarity('魅族','4g'))
print(model.similarity('16g','64g'))
print(model.similarity('粉色','金色'))

0.662871
0.9732659
0.97379863
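
The scores returned by `similarity` are cosine similarities between the two word vectors. As a minimal sketch of that computation in plain Python (the function name `cosine_similarity` and the sample vectors are made up for illustration and are not taken from the trained model):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The result lies between -1 and 1. It depends only on the direction of the vectors, not on their length, so a score near 1 means the two words point in nearly the same direction in the embedding space.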


  


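### Chinese corpora available online
Tencent AI Lab has open-sourced a large-scale, high-quality Chinese word vector dataset:  
https://ai.tencent.com/ailab/nlp/embedding.html  
A 120G+ pretrained word2vec model (Chinese word vectors):  
https://blog.csdn.net/tu_22/article/details/79035769  
There are also a number of open-source corpora that you can use to train your own model.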

