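# Lesson 2: [Word Vectors](https://github.com/tmikolov/word2vec)

Learning objectives for Lesson 2
- Understand the concept of word vectors
- Train word vectors with the Skip-gram model
- Learn to use the PyTorch Dataset and DataLoader
- Learn to define a PyTorch model
- Learn common Modules in torch.nn
    - Embedding
- Learn common PyTorch operations
    - bmm
    - logsigmoid
- Save and load PyTorch models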

In this notebook we will try, as far as possible, to reproduce the word-vector training method from the paper [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).

We will implement the [Skip-gram](https://blog.csdn.net/u010665216/article/details/78721354) model and use the paper's negative sampling objective, which is a simplified form of noise contrastive estimation.

![skip_gram](skip_gram.jpg)
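For reference, the negative sampling objective defined in the paper for a single (center word $w_I$, context word $w_O$) pair can be written as below. Here $v_w$ is the input embedding of word $w$, $v'_w$ is its output embedding, $k$ is the number of negative samples, and $P_n(w)$ is the noise distribution the negatives are drawn from. The objective is maximized during training.

$$
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
$$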

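The paper contains many implementation details that are critical to the quality of the word vectors. We cannot fully reproduce the paper's experimental results, mainly because of limited compute and other details, but we can still show roughly how word vectors are trained.

Details that we do not implement:
- subsampling: see section 2.3 of the paper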

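## Importing commonly used PyTorch packages

In [1]:
# Needed by almost every torch script
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tud # the torch.utils.data module provides Dataset and DataLoader for reading the training set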

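Most layers in torch.nn have a corresponding function in torch.nn.functional.  
The [difference](https://blog.csdn.net/hawkcici160/article/details/80140059) between the two:  
- A layer implemented as a torch.nn.Module is a class that automatically registers its learnable parameters.  
- The functions in nn.functional behave more like pure functions defined with `def function()`: they apply a fixed mathematical operation and hold no parameters of their own.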
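The difference can be seen in a few lines. This is a minimal sketch; the tensor `x` and the layer sizes are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 2.0])

# Module form: instantiate the layer once, then call it like a function.
layer = nn.LogSigmoid()
# Functional form: call the operation directly; the result is the same.
same = torch.allclose(layer(x), F.logsigmoid(x))
print(same)

# A layer that has weights registers them as learnable parameters automatically.
embed = nn.Embedding(num_embeddings=10, embedding_dim=4)
param_shapes = [tuple(p.shape) for p in embed.parameters()]
print(param_shapes)
```

`nn.LogSigmoid` has no weights, so either form works equally well for it; `nn.Embedding` owns a weight matrix, which is why the model later in this notebook uses the Module form for embeddings and the functional form for logsigmoid.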

## Importing other required packages

In [2]:
from collections import Counter
import numpy as np
import random
import math

import pandas as pd
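import scipy.stats # spearmanr, used in evaluate()
import scipy.spatial.distance # cosine distance, used in find_nearest()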
import sklearn
from sklearn.metrics.pairwise import cosine_similarity

## Other initial settings

In [3]:
# Check whether a GPU is available
USE_CUDA=torch.cuda.is_available()

# Fix every random seed to a specific value so that the results can be reproduced
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)
if USE_CUDA:
    torch.cuda.manual_seed(1)

# Hyperparameters
C=3 # context window: predict the C words before and the C words after the center word
K=100 # number of negative samples drawn for each positive (context) word
NUM_EPOCHS=1 # number of training epochs
MAX_VOCAB_SIZE=30000 # the vocabulary size
BATCH_SIZE=128
LEARNING_RATE=0.2 # the initial learning rate
EMBEDDING_SIZE=100 # dimension of the word vectors

LOG_FILE = "word-embedding.log"

# Tokenizer: split the text into words on whitespace
def word_tokenize(text):
    return text.split()

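## Data preprocessing
- Read all the text from the file and build a vocabulary from it
- The number of distinct words may be too large, so we keep only the MAX_VOCAB_SIZE-1 most common words
- We add an `<unk>` token that stands for every other, less common word, which brings the vocabulary to MAX_VOCAB_SIZE entries
- We need to record the word-to-index mapping, the index-to-word mapping, each word's count, each word's (normalized) frequency, and the total number of words.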

In [4]:
# Read the file
with open('./text8/text8.train.txt','r') as fi:
    text=fi.read()

# len(text)

# Tokenize
# str.lower() converts uppercase characters to lowercase
text=[w for w in word_tokenize(text.lower())]

# Take the MAX_VOCAB_SIZE-1 most frequent words and store them in a dict (word -> count)
# The remaining slot is reserved for <unk>
# collections.Counter(text): counts the occurrences of each element and returns a Counter
# Counter(text).most_common(N): the N most frequent elements of text
#https://zhuanlan.zhihu.com/p/350899229
vocab=dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))
# Add <unk> to vocab
# count of <unk> = total number of tokens - total count of the common words
# dict.values() returns a view of all the values in the dict
vocab['<unk>']=len(text)-np.sum(list(vocab.values()))

# All the words in vocab
idx_to_word=[word for word in vocab.keys()]

# Map each word to its index
# enumerate: takes an iterable such as ['a','b','c'] and yields (index, item) pairs (0,'a'),(1,'b'),(2,'c')
# Indices follow descending frequency, so the most common word has index 0; <unk> is appended last
word_to_idx={word:i for i,word in enumerate(idx_to_word)}

# list(word_to_idx.items())[:100]

# Word frequencies, needed for negative sampling
# Count of every word
word_counts=np.array([count for count in vocab.values()], dtype=np.float32)
# Relative frequency of every word
word_freqs=word_counts/np.sum(word_counts)
# The paper (Distributed Representations of Words...) raises the frequencies to the 3/4 power
word_freqs=word_freqs**(3./4.)
# Renormalize so that the smoothed frequencies sum to 1 again
word_freqs=word_freqs/np.sum(word_freqs)

# Check that the vocabulary size equals MAX_VOCAB_SIZE
VOCAB_SIZE=len(idx_to_word)
VOCAB_SIZE

30000
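To make the preprocessing concrete, here is the same recipe on a tiny made-up corpus. The names `toy_text`, `toy_max_vocab`, and the other `toy_` variables are hypothetical and exist only for this illustration.

```python
from collections import Counter
import numpy as np

toy_text = "the cat sat on the mat the cat ran".split()
toy_max_vocab = 3  # hypothetical small limit; one slot is reserved for <unk>

# Keep the most frequent words, then fold everything else into <unk>.
toy_vocab = dict(Counter(toy_text).most_common(toy_max_vocab - 1))
toy_vocab["<unk>"] = len(toy_text) - sum(toy_vocab.values())
print(toy_vocab)  # {'the': 3, 'cat': 2, '<unk>': 4}

# Relative frequencies, then the 3/4-power smoothing used for negative sampling.
toy_counts = np.array(list(toy_vocab.values()), dtype=np.float64)
toy_freqs = toy_counts / toy_counts.sum()
toy_smoothed = toy_freqs ** 0.75
toy_smoothed /= toy_smoothed.sum()
print(toy_freqs, toy_smoothed)
```

Raising the frequencies to the 3/4 power flattens the distribution: the most frequent entry is sampled a little less often and the rarest entry a little more often than under the raw frequencies.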

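## Implementing the DataLoader

The data pipeline needs to do the following:

- Encode all of the text as numbers
- Store the vocabulary, the word counts, and the normalized word frequencies
- Sample one center word per item
- Return the context words of the current center word
- Sample some negative words for the center word
- Return the word counts

Here is a good tutorial on how to use the [PyTorch dataloader](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html).
To use a dataloader, our Dataset class has to define the following two functions:

- ```__len__``` returns the number of items in the whole dataset
- ```__getitem__``` returns one item for a given index

With a dataloader we can easily shuffle the whole dataset, fetch one batch of data at a time, and so on.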
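The two functions are all a Dataset needs. The sketch below uses a made-up `SquaresDataset` (not part of this lesson) to show how a DataLoader turns individual items into batches.

```python
import torch
import torch.utils.data as tud

class SquaresDataset(tud.Dataset):
    """Hypothetical toy dataset: item i is the pair (i, i*i)."""

    def __init__(self, n):
        super().__init__()
        self.n = n

    def __len__(self):
        # Number of items in the dataset
        return self.n

    def __getitem__(self, index):
        # One item for the given index
        return torch.tensor(index), torch.tensor(index * index)

toy_loader = tud.DataLoader(SquaresDataset(10), batch_size=4, shuffle=False)
batches = list(toy_loader)
batch_sizes = [x.shape[0] for x, y in batches]
print(batch_sizes)  # [4, 4, 2]
```

The DataLoader calls `__getitem__` once per index and stacks the results, so the last batch is smaller when the dataset size is not a multiple of `batch_size`.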

In [5]:
class WordEmbeddingDataset(tud.Dataset):
    def __init__(self, text, word_to_idx, idx_to_word, word_freqs, word_counts):
        # Initialize the dataset
        #super(WordEmbeddingDataset, self).__init__()
        super().__init__()
        
        # Store, in order, the word_to_idx index of every word in text;
        # a word that is not in word_to_idx (an unknown word) gets the index of <unk>
        self.text_encoded=[word_to_idx.get(word,word_to_idx['<unk>']) for word in text]
        # Convert to a LongTensor (64-bit integers)
        self.text_encoded=torch.LongTensor(self.text_encoded)
        
        # Store the inputs, converting the frequencies and counts to torch tensors
        self.word_to_idx=word_to_idx
        self.idx_to_word=idx_to_word # not used inside the class
        self.word_freqs=torch.Tensor(word_freqs)
        self.word_counts=torch.Tensor(word_counts) # not used inside the class
        
    # Number of items in the dataset
    def __len__(self):
        return len(self.text_encoded)
    
    # Given an index, return one training example
    # index is the position of a word in the training text, i.e. a subscript into text_encoded
    def __getitem__(self, index):
        # Center word: the word (as a number) at position index in text
        center_word=self.text_encoded[index]
        
        # Context words: the C words before and the C words after the center word
        # pos_indices_serialNumber holds the positions of those context words
        # Note: when index is 0, 1, 2, len(self.text_encoded)-3, len(self.text_encoded)-2 or len(self.text_encoded)-1,
        # some of these positions fall outside the range of text_encoded
        pos_indices_serialNumber=list(range(index-C,index))+list(range(index+1,index+1+C))
        #print(pos_indices_serialNumber)
        
        # so every position is taken modulo the length of text_encoded.
        # This wrap-around is questionable: it uses the last words of the training text
        # as context for the first center words, and the first words as context for the
        # last center words, and neither is justified
        pos_indices_new_serialNumber=[i % len(self.text_encoded) for i in pos_indices_serialNumber]
        #print(pos_indices_new_serialNumber)
        #print(type(pos_indices_new_serialNumber))
        
        # Use pos_indices_new_serialNumber to fetch the words (as numbers) at those positions
        # text_encoded is a Tensor, so it can be indexed with a list of positions
        pos_words=self.text_encoded[pos_indices_new_serialNumber]
        #print(type(self.text_encoded))
        #print(pos_words)
        
        # Negative sampling
        # See https://towardsdatascience.com/nlp-101-negative-sampling-and-glove-936c88f3bc68
        
        #torch.multinomial
        # Samples from a multinomial distribution
        #https://pytorch.org/docs/stable/generated/torch.multinomial.html
        # Draws K * pos_words.shape[0] samples using self.word_freqs as weights and returns their indices into self.word_freqs.
        # The larger a word's value in self.word_freqs, the more likely it is to be drawn.
        # K negatives are drawn per positive word; pos_words.shape[0] is the number of positive words, here 2*C = 6
        # replacement=True means sampling with replacement (the same word can be drawn more than once)
        neg_words=torch.multinomial(self.word_freqs, K*pos_words.shape[0], True)
        #print(neg_words)
        
        return center_word, pos_words, neg_words
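The two less obvious steps of `__getitem__`, the wrapped context positions and the weighted sampling, can be tried in isolation. This is a sketch: `context_positions` is a hypothetical helper written for this illustration, and `probs` is a made-up distribution.

```python
import torch

def context_positions(index, window, length):
    """Positions of the 2*window context words around index, wrapped modulo length."""
    raw = list(range(index - window, index)) + list(range(index + 1, index + 1 + window))
    return [i % length for i in raw]

# At the start of the text the "previous" words wrap around to the end.
print(context_positions(0, 3, 10))  # [7, 8, 9, 1, 2, 3]
# In the middle of the text no wrapping is needed.
print(context_positions(5, 3, 10))  # [2, 3, 4, 6, 7, 8]

# torch.multinomial returns indices, drawn in proportion to the given weights.
probs = torch.tensor([0.7, 0.2, 0.1, 0.0])
torch.manual_seed(0)
neg = torch.multinomial(probs, 12, replacement=True)
print(neg)
```

An index with weight 0 is never drawn. Note that nothing in this scheme prevents a sampled negative from coinciding with the center word or with one of its true context words.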

In [6]:
# Create the Dataset
dataset=WordEmbeddingDataset(text, word_to_idx, idx_to_word, word_freqs, word_counts)

# Create the dataloader
# num_workers: number of worker subprocesses used to load the data
# With num_workers=4, calling next(iter(dataloader)) raised: [Errno 32] Broken pipe
# The cause may be running out of memory; see: https://blog.csdn.net/qq_33666011/article/details/81873217
# The workaround is to set num_workers to 0
# dataloader=tud.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
dataloader=tud.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)

In [7]:
# Check the class definition for bugs
# These checks are best run with the return statement of class WordEmbeddingDataset commented out
# print(dataset.__getitem__(0))
# print(dataset.__getitem__(1))
# print(dataset.__getitem__(2))
# print(dataset.__getitem__(3))
# print(dataset.__getitem__(len(dataset)-3))
# print(dataset.__getitem__(len(dataset)-2))
# print(dataset.__getitem__(len(dataset)-1))

# dte=dataset.text_encoded
# print(dte)
# print(len(dte))
# dtet=dte.tolist()
# print(dtet[:100])
# print(type(dtet[0]))
# print(dict(Counter(dtet)))
# print([(k,v) for i, (k, v) in enumerate(dict(Counter(dtet)).items()) if i <100])

next(iter(dataloader))
# for i, (center_word, pos_words, neg_words) in enumerate(dataloader):
#     print(center_word, pos_words, neg_words)
#     if i>0:
#         break

[tensor([  819,    45,   621,    15,  1797,    29,   328,   157,    25,   598,
             9,    13,    25,    12,     5,  4344,     3,    13,     0,     5,
          1532,   648,     9,   937,    16, 22599,    85,  7406,  2801,   419,
          1238,     1,   966,  1655,   644,     6,    16, 18573, 11226,    37,
           261,  1514,  3537,     1, 29999,   644,     4,   210,   110,     5,
          3316,  1454,    29,     7,     0,   825,     2,  3992,  2991,  9029,
          1881,     0, 20161,    13,     5,     4, 12028,  7117,   394,     3,
         27580,  3642,    36,  2050,    92,     8, 23976,  2184,   335,   339,
          1314,    15,    34,   284,  4247,  2389,    25,  8552,     0,  1467,
           131,  5437,     1, 10596,     2,     4,  1963,    37,     5,   401,
          2111,     2,     6,     0,    14,    10,     1,     9,  5363, 12439,
          8464,     1,   432,   298,  4171, 11035,     0,  3513,     4,   969,
            28,   836, 29999,    88,     5,     6,  

## Defining the PyTorch model

In [8]:
class EmbeddingModel(nn.Module):
    # Define the parameters of the network
    # and initialize the input and output embeddings
    def __init__(self, vocab_size, embed_size):
        super().__init__()

        self.vocab_size=vocab_size #30000
        self.embed_size=embed_size #100
        
        # Two embedding layers, in and out; in_embed and out_embed are the weight matrices of the model
        #https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
        # nn.Embedding is commonly used for word embeddings (https://www.zhihu.com/question/32275069/answer/80188672)
        # Each layer holds a (30000,100) embedding matrix
        self.in_embed=nn.Embedding(self.vocab_size, self.embed_size)
        self.out_embed=nn.Embedding(self.vocab_size, self.embed_size)

        # Custom initialization of the embedding matrices
        # Restricting the initial weights to a small range makes the loss fall faster
        # It is commented out here because training also works well without it
        # initrange = 0.5 / self.embed_size
        # self.out_embed.weight.data.uniform_(-initrange, initrange)
        # self.in_embed.weight.data.uniform_(-initrange, initrange)
        
    # forward defines the computation of the network
    # input_labels: center words, shape (batch_size)
    # pos_labels: true context words, shape (batch_size, window_size*2)
    # neg_labels: negatively sampled (incorrect) context words, shape (batch_size, window_size*2*K)
    def forward(self, input_labels, pos_labels, neg_labels):

        # An embedding lookup selects one row of the weight matrix per index. It is equivalent to
        # multiplying one-hot rows (batch_size, vocab_size) by the weights (vocab_size, embed_size)
        input_embedding=self.in_embed(input_labels) #(batch_size, embed_size)
        pos_embedding=self.out_embed(pos_labels) #(batch_size,(window_size*2), embed_size)
        neg_embedding=self.out_embed(neg_labels) #(batch_size, (window_size*2*K), embed_size)

        # a.unsqueeze(n) inserts a dimension of size 1 at position n of a
        input_embedding=input_embedding.unsqueeze(2) #(batch_size, embed_size, 1)
        
        # Dot product of the center-word embedding with each context-word embedding
        #bmm: If input1 is a (b×n×m) tensor, input2 is a (b×m×p) tensor, out will be a (b×n×p) tensor.
        #https://pytorch.org/docs/stable/generated/torch.bmm.html
        # squeeze(2) removes the trailing dimension of size 1; a bare squeeze() would also drop
        # the batch dimension when batch_size is 1
        pos_dot=torch.bmm(pos_embedding, input_embedding).squeeze(2) #(batch_size, (window_size*2))
        # Dot product of the center-word embedding with each negative-word embedding
        neg_dot=torch.bmm(neg_embedding, input_embedding).squeeze(2) #(batch_size, (window_size*2*K))
        
        # Formula at the end of page 3 of 'Distributed Representations of Words and Phrases and their Compositionality'
        # Term before the plus sign: logsigmoid of the center/context dot products
        log_pos=F.logsigmoid(pos_dot)
        # Term after the plus sign: logsigmoid of the negated center/negative dot products
        log_neg=F.logsigmoid(-neg_dot)
        # The expectation E_{w_i ~ P_n(w)} is not computed explicitly: it is approximated by the
        # negatives that were already sampled from the noise distribution when neg_words was built
        # Sum the negative terms
        log_neg=log_neg.sum(1)
        # print('log_neg:',log_neg, log_neg.shape)
        # Sum over the context words as well. The formula in the paper is written for a single
        # (center, context) pair; each center word here has window_size*2 context words
        log_pos=log_pos.sum(1)
        # print('log_pos:',log_pos, log_pos.shape)

        loss=log_pos+log_neg

        # The formula is an objective to maximize, while the optimizer minimizes,
        # so the negated objective is returned as the loss
        return -loss

    # Return the input embeddings as a numpy array
    def input_embeddings(self):
        return self.in_embed.weight.data.cpu().numpy()


# Instantiate the model
model=EmbeddingModel(VOCAB_SIZE,EMBEDDING_SIZE)

# Move the model to the GPU
if USE_CUDA:
    model=model.cuda()
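The `bmm` call in `forward` is only a batched dot product. The following sketch checks that on small random tensors; the sizes `B`, `W`, and `D` are arbitrary values chosen for illustration.

```python
import torch
import torch.nn.functional as F

B, W, D = 2, 6, 4  # toy batch size, context words per center word, embedding size
torch.manual_seed(0)
center = torch.randn(B, D)
context = torch.randn(B, W, D)

# (B, W, D) x (B, D, 1) -> (B, W, 1), then drop the trailing dimension
dots_bmm = torch.bmm(context, center.unsqueeze(2)).squeeze(2)
# The same dot products written as an elementwise product followed by a sum
dots_manual = (context * center.unsqueeze(1)).sum(dim=2)
print(torch.allclose(dots_bmm, dots_manual, atol=1e-6))

# Positive part of the loss for each example: minus the summed log-sigmoid
per_example = -F.logsigmoid(dots_bmm).sum(dim=1)
print(per_example.shape)
```

Because logsigmoid is always negative, each per-example loss is positive, and it shrinks as the dot products between a center word and its true context words grow.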

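## Training the model
- A model usually needs to be trained for several epochs
- In each epoch we split all the data into batches
- Move the tensors of each batch to the GPU when CUDA is available
- forward pass: look up the embeddings of the center words, their context words, and the sampled negative words
- Compute the negative sampling loss from their dot products
- backward pass
- Update the model parameters
- Clear the model's current gradients
- Every so many iterations, print the current loss and evaluate the embeddings on word-similarity datasets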
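The backward/update/clear cycle from the list above looks the same for any model. Here is a minimal sketch on a made-up regression problem; `toy_model`, `x`, and `y` are hypothetical and unrelated to the word-vector model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(3, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
x = torch.randn(8, 3)
y = torch.randn(8, 1)

losses = []
for step in range(20):
    loss = ((toy_model(x) - y) ** 2).mean()  # forward pass and loss
    loss.backward()                          # backward pass: compute gradients
    toy_optimizer.step()                     # update the parameters
    toy_optimizer.zero_grad()                # clear the gradients for the next step
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Gradients accumulate across calls to `backward()`, so they have to be cleared once per iteration, either after `step()` as here or before the next `backward()`.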

Code for evaluating the model

In [9]:
def evaluate(filename, embedding_weights): 
    # Load a word-similarity dataset: two word columns followed by a human similarity score
    if filename.endswith(".csv"):
        data = pd.read_csv(filename, sep=",")
    else:
        data = pd.read_csv(filename, sep="\t")
    human_similarity = []
    model_similarity = []
    for i in data.iloc[:, 0:2].index:
        word1, word2 = data.iloc[i, 0], data.iloc[i, 1]
        # Skip pairs that contain a word outside the vocabulary
        if word1 not in word_to_idx or word2 not in word_to_idx:
            continue
        else:
            word1_idx, word2_idx = word_to_idx[word1], word_to_idx[word2]
            word1_embed, word2_embed = embedding_weights[[word1_idx]], embedding_weights[[word2_idx]]
            # cosine_similarity returns a (1, 1) array; take its single entry
            model_similarity.append(float(cosine_similarity(word1_embed, word2_embed)[0, 0]))
            human_similarity.append(float(data.iloc[i, 2]))

    # Spearman rank correlation between the human scores and the model similarities
    return scipy.stats.spearmanr(human_similarity, model_similarity)# , model_similarity

def find_nearest(word, embedding_weights):
    # Return the 10 words with the smallest cosine distance to word (the word itself comes first)
    index = word_to_idx[word]
    embedding = embedding_weights[index]
    cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights])
    return [idx_to_word[i] for i in cos_dis.argsort()[:10]]
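Looping over every row with `scipy.spatial.distance.cosine` is easy to read but slow for a 30000-word vocabulary. One possible alternative is a vectorized version in NumPy, sketched below; `nearest_by_cosine` and `toy_weights` are hypothetical names, and the sketch assumes that no embedding row is all zeros.

```python
import numpy as np

def nearest_by_cosine(query_index, weights, top_k=3):
    """Indices of the top_k rows most similar to row query_index (assumes no zero rows)."""
    # Normalize every row to unit length so that a dot product equals cosine similarity.
    normed = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    sims = normed @ normed[query_index]
    # Highest similarity first; the query row itself ranks first.
    return np.argsort(-sims)[:top_k].tolist()

toy_weights = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(nearest_by_cosine(0, toy_weights))  # [0, 1, 2]
```

Sorting by descending cosine similarity gives the same order as sorting by ascending cosine distance, since the distance is one minus the similarity.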

In [10]:
# Define the optimizer
optimizer=torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

for e in range(NUM_EPOCHS):
    for i, (input_labels, pos_labels, neg_labels) in enumerate(dataloader):
        # #for test
        # if i>0:
        #     break
        # Embedding lookups need indices of type long
        input_labels=input_labels.long()
        pos_labels=pos_labels.long()
        neg_labels=neg_labels.long()

        # Move the batch to the GPU
        if USE_CUDA:
            input_labels=input_labels.cuda()
            pos_labels=pos_labels.cuda()
            neg_labels=neg_labels.cuda()

        # Call the model's forward function
        # mean(): average the loss over the examples in the batch
        loss=model(input_labels, pos_labels, neg_labels).mean()
        # Compute the gradients
        loss.backward()
        # Update the weights
        optimizer.step()
        # Reset the gradients to zero
        optimizer.zero_grad()

        # Print the loss every 100 iterations
        if i%100==0:
            print('epoch', e, 'iteration', i, loss.item())

        if i % 2000 == 0:
            embedding_weights = model.input_embeddings()
            sim_simlex = evaluate("simlex-999.txt", embedding_weights)
            sim_men = evaluate("men.txt", embedding_weights)
            sim_353 = evaluate("wordsim353.csv", embedding_weights)
            with open(LOG_FILE, "a") as fout:
                print("epoch: {}, iteration: {}, simlex-999: {}, men: {}, sim353: {}, nearest to monster: {}\n".format(
                    e, i, sim_simlex, sim_men, sim_353, find_nearest("monster", embedding_weights)))
                fout.write("epoch: {}, iteration: {}, simlex-999: {}, men: {}, sim353: {}, nearest to monster: {}\n".format(
                    e, i, sim_simlex, sim_men, sim_353, find_nearest("monster", embedding_weights)))
                
    embedding_weights = model.input_embeddings()
    np.save("embedding-{}".format(EMBEDDING_SIZE), embedding_weights)
    torch.save(model.state_dict(), "embedding-{}.th".format(EMBEDDING_SIZE))

epoch 0 iteration 0 2386.27685546875
epoch: 0, iteration: 0, simlex-999: SpearmanrResult(correlation=-0.009722900741337371, pvalue=0.7638751311958099), men: SpearmanrResult(correlation=0.003240106713839427, pvalue=0.8690633653000476), sim353: SpearmanrResult(correlation=-0.09117149228575723, pvalue=0.10463225392436157), nearest to monster: ['monster', 'sturdy', 'artificially', 'kievan', 'debunking', 'edmond', 'mbox', 'jack', 'crassus', 'skirts']

epoch 0 iteration 100 1110.298095703125
epoch 0 iteration 200 972.8966064453125
epoch 0 iteration 300 756.5603637695312
epoch 0 iteration 400 733.5953979492188
epoch 0 iteration 500 541.1033935546875
epoch 0 iteration 600 525.4776611328125
epoch 0 iteration 700 462.82183837890625
epoch 0 iteration 800 445.95697021484375
epoch 0 iteration 900 394.46051025390625
epoch 0 iteration 1000 450.0299377441406
epoch 0 iteration 1100 344.4065246582031
epoch 0 iteration 1200 359.37451171875
epoch 0 iteration 1300 288.3186950683594
epoch 0 iteration 1400 1

epoch 0 iteration 12800 45.800384521484375
epoch 0 iteration 12900 48.63939666748047
epoch 0 iteration 13000 53.74571228027344
epoch 0 iteration 13100 50.134117126464844
epoch 0 iteration 13200 59.339317321777344
epoch 0 iteration 13300 49.21236038208008
epoch 0 iteration 13400 62.86875915527344
epoch 0 iteration 13500 51.89896774291992
epoch 0 iteration 13600 46.086788177490234
epoch 0 iteration 13700 44.97549819946289
epoch 0 iteration 13800 50.714263916015625
epoch 0 iteration 13900 49.910621643066406
epoch 0 iteration 14000 44.400062561035156
epoch: 0, iteration: 14000, simlex-999: SpearmanrResult(correlation=-0.07511608963214213, pvalue=0.020125791900891542), men: SpearmanrResult(correlation=0.027988596348304372, pvalue=0.1543702606617894), sim353: SpearmanrResult(correlation=-0.05556033946090101, pvalue=0.3233269208895059), nearest to monster: ['monster', 'jack', 'library', 'reasons', 'committed', 'channels', 'stable', 'participate', 'liberty', 'month']

epoch 0 iteration 14100 5

epoch 0 iteration 26100 39.25852966308594
epoch 0 iteration 26200 42.326454162597656
epoch 0 iteration 26300 39.23174285888672
epoch 0 iteration 26400 43.372196197509766
epoch 0 iteration 26500 41.32109832763672
epoch 0 iteration 26600 41.00151062011719
epoch 0 iteration 26700 42.63359451293945
epoch 0 iteration 26800 42.26885986328125
epoch 0 iteration 26900 39.976707458496094
epoch 0 iteration 27000 37.44080352783203
epoch 0 iteration 27100 40.266658782958984
epoch 0 iteration 27200 43.88325500488281
epoch 0 iteration 27300 40.77418899536133
epoch 0 iteration 27400 41.61466979980469
epoch 0 iteration 27500 40.551841735839844
epoch 0 iteration 27600 44.549468994140625
epoch 0 iteration 27700 38.66312789916992
epoch 0 iteration 27800 40.48982620239258
epoch 0 iteration 27900 38.246185302734375
epoch 0 iteration 28000 39.12178039550781
epoch: 0, iteration: 28000, simlex-999: SpearmanrResult(correlation=-0.07360341652605674, pvalue=0.02278230705018539), men: SpearmanrResult(correlation=0

epoch 0 iteration 39500 36.34688186645508
epoch 0 iteration 39600 36.40477752685547
epoch 0 iteration 39700 36.594337463378906
epoch 0 iteration 39800 36.349571228027344
epoch 0 iteration 39900 37.34979248046875
epoch 0 iteration 40000 37.710411071777344
epoch: 0, iteration: 40000, simlex-999: SpearmanrResult(correlation=-0.0655379533856448, pvalue=0.04266511048636075), men: SpearmanrResult(correlation=0.02717772085400754, pvalue=0.1666695868969715), sim353: SpearmanrResult(correlation=-0.07053323809904731, pvalue=0.20969755032393722), nearest to monster: ['monster', 'jack', 'committed', 'participate', 'stable', 'channels', 'liberty', 'month', 'reasons', 'artificially']

epoch 0 iteration 40100 35.87609100341797
epoch 0 iteration 40200 37.983211517333984
epoch 0 iteration 40300 36.61366271972656
epoch 0 iteration 40400 38.32347106933594
epoch 0 iteration 40500 35.564815521240234
epoch 0 iteration 40600 34.1513671875
epoch 0 iteration 40700 36.15362548828125
epoch 0 iteration 40800 37.2

epoch 0 iteration 52100 34.47724914550781
epoch 0 iteration 52200 39.91930389404297
epoch 0 iteration 52300 36.09491729736328
epoch 0 iteration 52400 34.54351043701172
epoch 0 iteration 52500 36.23900604248047
epoch 0 iteration 52600 34.42832565307617
epoch 0 iteration 52700 36.34729766845703
epoch 0 iteration 52800 36.053035736083984
epoch 0 iteration 52900 37.127220153808594
epoch 0 iteration 53000 34.922603607177734
epoch 0 iteration 53100 35.01988220214844
epoch 0 iteration 53200 36.95650863647461
epoch 0 iteration 53300 36.40315246582031
epoch 0 iteration 53400 34.909034729003906
epoch 0 iteration 53500 34.84117889404297
epoch 0 iteration 53600 34.734718322753906
epoch 0 iteration 53700 35.52421569824219
epoch 0 iteration 53800 37.53095245361328
epoch 0 iteration 53900 35.547523498535156
epoch 0 iteration 54000 35.251182556152344
epoch: 0, iteration: 54000, simlex-999: SpearmanrResult(correlation=-0.06063334886727565, pvalue=0.06079385928604918), men: SpearmanrResult(correlation=0

epoch 0 iteration 65500 35.015052795410156
epoch 0 iteration 65600 34.46564483642578
epoch 0 iteration 65700 33.93989944458008
epoch 0 iteration 65800 35.96330261230469
epoch 0 iteration 65900 34.225074768066406
epoch 0 iteration 66000 35.90700912475586
epoch: 0, iteration: 66000, simlex-999: SpearmanrResult(correlation=-0.05984667177366449, pvalue=0.0642240971539514), men: SpearmanrResult(correlation=0.03300739768366436, pvalue=0.09299913745095996), sim353: SpearmanrResult(correlation=-0.06354647004200535, pvalue=0.25852859432521513), nearest to monster: ['monster', 'jack', 'committed', 'channels', 'month', 'liberty', 'kievan', 'participate', 'artificially', 'stable']

epoch 0 iteration 66100 34.23870849609375
epoch 0 iteration 66200 34.699615478515625
epoch 0 iteration 66300 35.30760955810547
epoch 0 iteration 66400 36.5308837890625
epoch 0 iteration 66500 35.208160400390625
epoch 0 iteration 66600 36.90748596191406
epoch 0 iteration 66700 35.7858772277832
epoch 0 iteration 66800 35.

epoch 0 iteration 78100 35.84868621826172
epoch 0 iteration 78200 34.32862854003906
epoch 0 iteration 78300 34.571876525878906
epoch 0 iteration 78400 35.85554122924805
epoch 0 iteration 78500 35.0119514465332
epoch 0 iteration 78600 35.99940490722656
epoch 0 iteration 78700 34.81113052368164
epoch 0 iteration 78800 35.583045959472656
epoch 0 iteration 78900 33.94286346435547
epoch 0 iteration 79000 35.541236877441406
epoch 0 iteration 79100 35.08162307739258
epoch 0 iteration 79200 35.358482360839844
epoch 0 iteration 79300 34.49041748046875
epoch 0 iteration 79400 35.15098190307617
epoch 0 iteration 79500 34.48786926269531
epoch 0 iteration 79600 33.06779098510742
epoch 0 iteration 79700 35.32145309448242
epoch 0 iteration 79800 36.54274368286133
epoch 0 iteration 79900 33.2576789855957
epoch 0 iteration 80000 33.52606964111328
epoch: 0, iteration: 80000, simlex-999: SpearmanrResult(correlation=-0.06091325267644146, pvalue=0.05961051941876366), men: SpearmanrResult(correlation=0.0377

epoch 0 iteration 91500 33.66630554199219
epoch 0 iteration 91600 34.090126037597656
epoch 0 iteration 91700 33.376243591308594
epoch 0 iteration 91800 33.83186340332031
epoch 0 iteration 91900 34.62105941772461
epoch 0 iteration 92000 34.52644348144531
epoch: 0, iteration: 92000, simlex-999: SpearmanrResult(correlation=-0.059620637484745366, pvalue=0.06523869690229597), men: SpearmanrResult(correlation=0.04063544682523474, pvalue=0.03861405659450815), sim353: SpearmanrResult(correlation=-0.04782852761024794, pvalue=0.3953069005982186), nearest to monster: ['monster', 'month', 'committed', 'kievan', 'channels', 'jack', 'artificially', 'financial', 'liberty', 'adolescence']

epoch 0 iteration 92100 35.15739440917969
epoch 0 iteration 92200 35.186344146728516
epoch 0 iteration 92300 35.916255950927734
epoch 0 iteration 92400 33.31974792480469
epoch 0 iteration 92500 34.2358512878418
epoch 0 iteration 92600 33.65740203857422
epoch 0 iteration 92700 33.924652099609375
epoch 0 iteration 928

epoch 0 iteration 104100 33.18549346923828
epoch 0 iteration 104200 35.65579605102539
epoch 0 iteration 104300 34.245948791503906
epoch 0 iteration 104400 33.095008850097656
epoch 0 iteration 104500 33.4302864074707
epoch 0 iteration 104600 32.97980499267578
epoch 0 iteration 104700 34.20043182373047
epoch 0 iteration 104800 34.710662841796875
epoch 0 iteration 104900 34.269596099853516
epoch 0 iteration 105000 33.32252883911133
epoch 0 iteration 105100 34.886661529541016
epoch 0 iteration 105200 33.302791595458984
epoch 0 iteration 105300 32.88618087768555
epoch 0 iteration 105400 34.295127868652344
epoch 0 iteration 105500 33.67564392089844
epoch 0 iteration 105600 35.04034423828125
epoch 0 iteration 105700 33.955726623535156
epoch 0 iteration 105800 33.945716857910156
epoch 0 iteration 105900 34.01935958862305
epoch 0 iteration 106000 34.78515625
epoch: 0, iteration: 106000, simlex-999: SpearmanrResult(correlation=-0.06285490351123663, pvalue=0.05191672631111801), men: SpearmanrResu

epoch 0 iteration 117100 33.66558837890625
epoch 0 iteration 117200 33.99113845825195
epoch 0 iteration 117300 33.29738998413086
epoch 0 iteration 117400 33.22737503051758
epoch 0 iteration 117500 32.3090934753418
epoch 0 iteration 117600 33.09983444213867
epoch 0 iteration 117700 34.31875991821289
epoch 0 iteration 117800 33.63316345214844
epoch 0 iteration 117900 33.43819808959961
epoch 0 iteration 118000 33.876220703125
epoch: 0, iteration: 118000, simlex-999: SpearmanrResult(correlation=-0.05899583593767795, pvalue=0.06811208053657113), men: SpearmanrResult(correlation=0.04940370441481199, pvalue=0.011901109154228488), sim353: SpearmanrResult(correlation=-0.0399035977214312, pvalue=0.4782851531985749), nearest to monster: ['monster', 'month', 'committed', 'channels', 'artificially', 'adolescence', 'kievan', 'grave', 'financial', 'liberty']

epoch 0 iteration 118100 34.601173400878906
epoch 0 iteration 118200 33.498573303222656
epoch 0 iteration 118300 34.59758758544922
epoch 0 iter

## Evaluating the model

### Evaluation on the MEN and SimLex-999 datasets

In [11]:
model.load_state_dict(torch.load("embedding-{}.th".format(EMBEDDING_SIZE)))
embedding_weights = model.input_embeddings()
print("simlex-999", evaluate("simlex-999.txt", embedding_weights))
print("men", evaluate("men.txt", embedding_weights))
print("wordsim353", evaluate("wordsim353.csv", embedding_weights))

simlex-999 SpearmanrResult(correlation=-0.06011816895290745, pvalue=0.0630226585276272)
men SpearmanrResult(correlation=0.0494095291547079, pvalue=0.0118910907017675)
wordsim353 SpearmanrResult(correlation=-0.03678907956391633, pvalue=0.5133172651526973)
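Saving and loading go through the model's `state_dict`, a mapping from parameter names to tensors. The sketch below shows the round trip on a small embedding layer, writing to an in-memory buffer instead of a file; the layer sizes and the names `saved` and `restored` are made up for illustration.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
saved = nn.Embedding(5, 3)

# Save only the parameters; an in-memory buffer stands in for a file path here.
buffer = io.BytesIO()
torch.save(saved.state_dict(), buffer)
buffer.seek(0)

# To load, build a model with the same architecture and copy the parameters in.
restored = nn.Embedding(5, 3)
restored.load_state_dict(torch.load(buffer))
print(torch.equal(saved.weight, restored.weight))  # True
```

Because only the parameters are stored, the model class has to be constructed with the same sizes before `load_state_dict` is called.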


### Finding nearest neighbors

In [12]:
for word in ["good", "fresh", "monster", "green", "like", "america", "chicago", "work", "computer", "language"]:
    print(word, find_nearest(word, embedding_weights))

good ['good', 'instead', 'even', 'often', 'though', 'similar', 'without', 'them', 'so', 'being']
fresh ['fresh', 'institutions', 'vision', 'hermitage', 'disco', 'relief', 'elevation', 'consensus', 'hind', 'exact']
monster ['monster', 'month', 'committed', 'channels', 'artificially', 'adolescence', 'kievan', 'grave', 'financial', 'liberty']
green ['green', 'red', 'along', 'based', 'and', 'group', 'see', 'type', 'including', 'white']
like ['like', 'or', 'include', 'also', 'instead', 'as', 'called', 'whose', 'thus', 'a']
america ['america', 'central', 'southern', 'northern', 'europe', 'africa', 'east', 'western', 'west', 'north']
chicago ['chicago', 'california', 'university', 'school', 'founded', 'london', 'national', 'canada', 'washington', 'st']
work ['work', 'made', 'while', 'produced', 'his', 'own', 'their', 'making', 'which', 'being']
computer ['computer', 'software', 'information', 'data', 'system', 'based', 'systems', 'for', 'standard', 'program']
language ['language', 'languages'

### Relationships between words

In [13]:
man_idx = word_to_idx["man"] 
king_idx = word_to_idx["king"] 
woman_idx = word_to_idx["woman"]
embedding = embedding_weights[woman_idx] - embedding_weights[man_idx] + embedding_weights[king_idx]
cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights])
for i in cos_dis.argsort()[:20]:
    print(idx_to_word[i])

woman
king
son
named
england
father
st
charles
born
henry
louis
s
john
published
young
france
robert
former
became
followed
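In the output above the two top results are query words themselves ("woman" and "king"). Analogy queries are commonly evaluated with the query words excluded from the candidates. A sketch of that variant follows; the `analogy` helper and the hand-made `toy_vecs` are hypothetical and are not the embeddings trained in this notebook.

```python
import numpy as np

def analogy(a, b, c, weights, words, top_k=1):
    """Words closest to vec(b) - vec(a) + vec(c), skipping a, b, and c themselves."""
    index = {w: i for i, w in enumerate(words)}
    target = weights[index[b]] - weights[index[a]] + weights[index[c]]
    # Cosine similarity between the target vector and every row
    normed = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    sims = normed @ (target / np.linalg.norm(target))
    ranked = [words[i] for i in np.argsort(-sims) if words[i] not in (a, b, c)]
    return ranked[:top_k]

toy_words = ["man", "woman", "king", "queen", "apple"]
# Hand-made vectors: axis 1 encodes "female", axis 2 encodes "royal"
toy_vecs = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, -1]], dtype=float)
print(analogy("man", "king", "woman", toy_vecs, toy_words))  # ['queen']
```

With the trained `embedding_weights` and `idx_to_word` from this notebook, the same helper could be called as `analogy("man", "king", "woman", embedding_weights, idx_to_word, top_k=20)`.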
