### word2vec demo


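#### A brief introduction to Word2vec
Word2Vec (Word to Vector) is a technique for representing words as vectors, proposed by Tomas Mikolov et al. in 2013. It maps each word to a dense vector whose dimensionality is far smaller than the vocabulary size, so that words with similar meanings end up close to each other in the vector space.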

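Word2Vec has two main model architectures: Skip-gram and Continuous Bag of Words (CBOW). CBOW predicts the center word from its surrounding context words, while Skip-gram predicts the context words from the center word.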
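
To make the difference between the two architectures concrete, the sketch below builds training examples from a toy sentence: CBOW pairs a list of context words with the center word to predict, while Skip-gram pairs the center word with each context word in turn. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`) and the window size are illustrative assumptions and are not taken from the word2vec-pytorch code.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: (context words, center word) -> the model predicts the center word."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: (center word, context word) -> the model predicts each context word."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, window)
    ]


tokens = "the storm made landfall near california".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[1])       # one CBOW example: context list and target word
print(skipgram[:3])  # first few Skip-gram (input, target) pairs
```

With the same window, Skip-gram produces one example per (center, context) pair, so it yields more training examples than CBOW for the same text.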

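Applications of Word2Vec include:

*   Word vector representation: representing words as vectors makes it possible to measure the semantic similarity between words in the vector space.
*   Text similarity analysis: word vectors can be used to measure the similarity between texts, for tasks such as text classification and clustering.
*   Information retrieval: using word vectors to improve the retrieval quality of search engines.
*   Recommender systems: using word vectors in algorithms such as collaborative filtering to recommend related content.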
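
Semantic similarity in a vector space is commonly measured with cosine similarity. The following is a minimal sketch of that idea; the three-dimensional vectors in `toy_vectors` are made-up values for illustration, not embeddings produced by the model trained in this demo, and `most_similar` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only (not trained embeddings).
toy_vectors = {
    "hurricane": [0.9, 0.8, 0.1],
    "storm": [0.8, 0.9, 0.2],
    "crops": [0.1, 0.2, 0.9],
}


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("hurricane", toy_vectors)
print(ranking)
```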

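The output of word2vec's prediction task is not what we actually need. What we need is the weight matrix of the embedding layer (the lookup table). Multiplying the one-hot row vector of any word by this matrix (with the one-hot vector on the left) yields a low-dimensional dense word vector. Because the input is one-hot, the product simply selects a single row, so each row of the weight matrix is the final vector for one word.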
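
The equivalence between the one-hot multiplication and a plain row lookup can be checked on a toy example. This is a sketch in pure Python under stated assumptions: `embedding_matrix` holds made-up values for a 4-word vocabulary with 3-dimensional vectors, and `one_hot` and `vec_mat_mul` are hypothetical helpers written only for this illustration.

```python
def one_hot(index, size):
    """Row vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat_mul(row_vector, matrix):
    """Multiply a row vector (length V) by a V x D matrix, giving a length-D vector."""
    n_cols = len(matrix[0])
    return [
        sum(row_vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]


# Toy embedding matrix: 4 words, 3 dimensions (made-up values).
embedding_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

word_index = 2
via_matmul = vec_mat_mul(one_hot(word_index, 4), embedding_matrix)
via_lookup = embedding_matrix[word_index]

# Both routes give the same vector: row `word_index` of the matrix.
print(via_matmul, via_lookup)
```

This is why an embedding layer is usually implemented as an indexed lookup rather than an actual matrix product: the result is identical and the lookup avoids multiplying by zeros.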


In [None]:
# Optionally mount Google Drive

from google.colab import drive
drive.mount('/content/drive')
import os

Mounted at /content/drive


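#### Pulling the code
The code tested here comes from [word2vec-pytorch](https://github.com/OlgaChernytska/word2vec-pytorch).

In Colab, IPython magic commands (such as `cd`) are prefixed with `%` and shell commands are prefixed with `!`. Pro subscribers can also use the terminal directly.

Clone the code from Git and switch to the working directory.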

In [None]:
%cd /content/drive/MyDrive
%mkdir nlp-project
%cd nlp-project

!git clone https://github.com/OlgaChernytska/word2vec-pytorch.git

%cd word2vec-pytorch/

%pwd

/content/drive/MyDrive
/content/drive/MyDrive/nlp-project
Cloning into 'word2vec-pytorch'...
remote: Enumerating objects: 144, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 144 (delta 30), reused 25 (delta 25), pack-reused 105
Receiving objects: 100% (144/144), 18.45 MiB | 17.46 MiB/s, done.
Resolving deltas: 100% (60/60), done.
/content/drive/MyDrive/nlp-project/word2vec-pytorch


'/content/drive/MyDrive/nlp-project/word2vec-pytorch'

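#### Dataset
The demo uses the WikiText-2 and WikiText-103 datasets, which are text datasets extracted from Wikipedia's Good and Featured articles. WikiText-103 contains over 100 million tokens, while WikiText-2 is a much smaller subset of roughly 2 million tokens. [homepage](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/#examples)

Example: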

= = = Hurricane Nine = = =

 On September 14 , a tropical cyclone formed off the coast of Central America . This tropical storm tracked northwestward and intensified into a hurricane . The sea @-@ level pressure dropped to 975 mbar ( 28 @.@ 8 inHg ) or lower . The hurricane recurved gradually to the northeast and weakened over cool seas . On September 25 , this tropical storm made landfall near Long Beach , California , and dissipated inland .
 The tropical storm caught Southern Californians unprepared . It brought heavy rain and flooding to the area , which killed 45 people . At sea , 48 were killed . The storm caused heavy property damage amounting to $ 2 million ( 1939 USD ) in total , mostly to crops and coastal infrastructure .

In [None]:
%ls

config.yaml  docs/  notebooks/  README.md  requirements.txt  train.py  utils/  weights/


In [None]:
# Use the GPU; training on the CPU is too slow
import torch
device = 'cuda' if (torch.cuda.is_available()) else 'cpu'
print(device)

cuda


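#### Training and saving the artifacts
Delete the target directory.

The model, learning rate, number of epochs, and other parameters can be changed by editing config.yaml.

Run the program.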

In [None]:
!rm -rf weights/cbow_WikiText2/
!pip install portalocker
!python train.py --config config.yaml

Vocabulary size: 4099
Adjusting learning rate of group 0 to 2.5000e-02.
Epoch: 1/5, Train Loss=5.29369, Val Loss=5.02354
Adjusting learning rate of group 0 to 2.0000e-02.
Epoch: 2/5, Train Loss=4.96373, Val Loss=4.90678
Adjusting learning rate of group 0 to 1.5000e-02.
Epoch: 3/5, Train Loss=4.84497, Val Loss=4.83733
Adjusting learning rate of group 0 to 1.0000e-02.
Epoch: 4/5, Train Loss=4.75150, Val Loss=4.76742
Adjusting learning rate of group 0 to 5.0000e-03.
Epoch: 5/5, Train Loss=4.64980, Val Loss=4.68193
Adjusting learning rate of group 0 to 0.0000e+00.
Training finished.
Model artifacts saved to folder: weights/cbow_WikiText2
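
The learning rates printed in the log drop by the same amount after each epoch, from 2.5000e-02 down to 0 after the fifth epoch, which is what a linear decay schedule would produce. The sketch below reproduces those values; `linear_decay_lr` is a hypothetical helper written for illustration and is not taken from train.py.

```python
def linear_decay_lr(initial_lr, epochs_done, total_epochs):
    """Learning rate after `epochs_done` epochs, decaying linearly to 0."""
    return initial_lr * (total_epochs - epochs_done) / total_epochs


# Initial rate and epoch count as they appear in the training log above.
schedule = [linear_decay_lr(0.025, epoch, 5) for epoch in range(6)]
formatted = [f"{lr:.4e}" for lr in schedule]
print(formatted)
```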


In [None]:
%cd weights/cbow_WikiText2/
%ls

/content/drive/MyDrive/nlp-project/word2vec-pytorch/weights/cbow_WikiText2
config.yaml  loss.json  model.pt  vocab.pt


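With Google Drive mounted, the code, the model file, and the vocabulary file are saved to the drive, so they can be reused directly next time instead of retraining the model every time.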


In [None]:
# Optionally mount Google Drive

from google.colab import drive
drive.mount('/content/drive')
import os

%cd /content/drive/MyDrive/nlp-project/word2vec-pytorch/weights/cbow_WikiText2
%ll

Mounted at /content/drive
/content/drive/MyDrive/nlp-project/word2vec-pytorch/weights/cbow_WikiText2
total 9693
-rw------- 1 root     248 Jan  1 10:08 config.yaml
-rw------- 1 root     214 Jan  1 10:08 loss.json
-rw------- 1 root 9856902 Jan  1 10:08 model.pt
-rw------- 1 root   67480 Jan  1 10:08 vocab.pt


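#### Some notes

model.pt is the model file saved by torch.

vocab.pt is the vocabulary object file (torchtext.vocab.Vocab: A `Vocab` object). It stores the mapping from words to indices along with related statistics.

Example: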
```
>>> from torchtext.vocab import vocab
>>> from collections import Counter, OrderedDict
>>> counter = Counter(["a", "a", "b", "b", "b"])
>>> sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
>>> ordered_dict = OrderedDict(sorted_by_freq_tuples)
>>> v1 = vocab(ordered_dict)
>>> print(v1['a']) #prints 1
>>> print(v1['out of vocab']) #raise RuntimeError since default index is not set
>>> tokens = ['e', 'd', 'c', 'b', 'a']
>>> #adding <unk> token and default index
>>> unk_token = '<unk>'
>>> default_index = -1
>>> v2 = vocab(OrderedDict([(token, 1) for token in tokens]), specials=[unk_token])
>>> v2.set_default_index(default_index)
>>> print(v2['<unk>']) #prints 0
>>> print(v2['out of vocab']) #prints -1
>>> #make default index same as index of unk_token
>>> v2.set_default_index(v2[unk_token])
>>> v2['out of vocab'] is v2[unk_token] #prints True
```