In [1]:
%matplotlib inline


Word Embeddings: Encoding Lexical Semantics
===========================================

Word embeddings are dense vectors of real numbers, one per word in your
vocabulary. In NLP, it is almost always the case that your features are
words! But how should you represent a word in a computer? You could
store its ASCII character representation, but that only tells you what
the word *is*; it doesn't say much about what it *means* (you might be
able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Even more, in what sense could you
combine these representations? Our neural networks often take inputs
that are $|V|$ dimensional, where $V$ is our vocabulary, yet produce
dense outputs with only a few dimensions (if we are only predicting a
handful of labels, for instance). How do we get from a massive
dimensional space to a smaller dimensional space?

How about instead of ASCII representations, we use a one-hot encoding?
That is, we represent the word $w$ by

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where the 1 is in a location unique to $w$. Any other word will
have a 1 in some other location, and a 0 everywhere else.
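
As a minimal sketch of this encoding in plain Python (the three-word
vocabulary and the helper names `one_hot` and `dot` are made up for
illustration), each word gets a list with a single 1 at its own index.
Note that the dot product of two different one-hot vectors is always 0:

```python
# Hypothetical three-word vocabulary, chosen only for illustration.
vocab = ["mathematician", "physicist", "store"]
word_to_ix = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a |V|-element list with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_ix[word]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("physicist"))
# Two different words never share a non-zero position.
print(dot(one_hot("mathematician"), one_hot("physicist")))
# A word compared with itself.
print(dot(one_hot("physicist"), one_hot("physicist")))
```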

There is an enormous drawback to this representation, besides just how
huge it is. It treats all words as independent entities with
no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.
* The physicist ran to the store.
* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before
seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much
better if we could use the following two facts:

* We have seen mathematician and physicist in the same role in a sentence. Somehow they
  have a semantic relation.
* We have seen mathematician in the same role that physicist now plays
  in this new unseen sentence.

and then infer that physicist is actually a good fit in the new unseen
sentence? This is what we mean by a notion of similarity: we mean
*semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic
data, by connecting the dots between what we have seen and what we
haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each
other semantically. This is called the
[distributional hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics).


Getting Dense Word Embeddings
---------------------------------------------------

How can we solve this problem? That is, how could we actually encode
semantic similarity in words? Maybe we think up some semantic
attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to
run" semantic attribute. Think of some other attributes, and imagine
what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector,
like this:

\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},\overbrace{9.4}^\text{likes coffee},\overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}

\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},\overbrace{9.1}^\text{likes coffee},\overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}

Then we can get a measure of similarity between these words by taking
the dot product of their vectors:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}

It is more common, though, to normalize by the lengths of the vectors:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
   {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}

where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words (words
whose embeddings point in opposite directions) should have
similarity -1.
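
To make the two similarity measures concrete, here is a small sketch in
plain Python that applies them to the hand-crafted vectors above. It
assumes the vectors are truncated to the three attributes shown (the
original vectors continue with "..."), and the helper names `dot` and
`cosine_similarity` are ours:

```python
import math

# Hand-crafted attribute vectors from the text, truncated to the three
# attributes shown: (can run, likes coffee, majored in Physics).
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product normalized by the lengths of both vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


print(dot(q_physicist, q_mathematician))
print(cosine_similarity(q_physicist, q_mathematician))
# A vector compared with itself points in exactly the same direction.
print(cosine_similarity(q_physicist, q_physicist))
```

The two words agree closely on the first two attributes, but the
opposite signs on "majored in Physics" pull the cosine similarity well
below 1 (roughly 0.44 for these three dimensions).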


You can think of the sparse one-hot vectors from the beginning of this
section as a special case of these new vectors we have defined, where
we gave each word its own unique semantic attribute, so that any two
distinct words have similarity 0. These new vectors are *dense*, which
is to say their entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of
different semantic attributes that might be relevant to determining
similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural
network learns representations of the features, rather than requiring
the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during
training? This is exactly what we will do. We will have some *latent
semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is,
although with our hand-crafted vectors above we can see that
mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both
mathematicians and physicists have a large value in the second
dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to
us.


In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**. You can embed other things too: part of speech
tags, parse trees, anything! The idea of feature embeddings is central
to the field.


Word Embeddings in PyTorch
-------------------------------

Before we get to a worked example and an exercise, a few quick notes
about how to use embeddings in PyTorch and in deep learning programming
in general. Similar to how we defined a unique index for each word when
making one-hot vectors, we also need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, you must use torch.LongTensor (since the
indices are integers, not floats).
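
The lookup-table view can be sketched in plain Python. This is not how
torch.nn.Embedding is implemented; it is only an illustration, with a
made-up three-word vocabulary, made-up numbers, and hypothetical helper
names `lookup` and `one_hot_times_matrix`. It shows that fetching row
$i$ of the $|V| \times D$ matrix gives the same result as multiplying
the word's one-hot vector by that matrix, which is why an index is all
the table needs:

```python
# Hypothetical 3-word vocabulary with D = 2; the numbers are made up.
word_to_ix = {"hello": 0, "world": 1, "again": 2}
embedding_matrix = [
    [0.1, -0.4],   # row 0: embedding of "hello"
    [0.7, 0.2],    # row 1: embedding of "world"
    [-0.3, 0.9],   # row 2: embedding of "again"
]


def lookup(word):
    """Fetch the embedding by indexing row i of the |V| x D matrix."""
    return embedding_matrix[word_to_ix[word]]


def one_hot_times_matrix(word):
    """Multiply the word's one-hot row vector by the matrix."""
    vocab_size = len(word_to_ix)
    num_dims = len(embedding_matrix[0])
    one_hot = [1 if i == word_to_ix[word] else 0 for i in range(vocab_size)]
    return [sum(one_hot[i] * embedding_matrix[i][d] for i in range(vocab_size))
            for d in range(num_dims)]


print(lookup("world"))
print(one_hot_times_matrix("world"))
```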




In [2]:
# Author: Robert Guthrie

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7fcf1a0e55a0>

In [3]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.LongTensor([word_to_ix["hello"]])
hello_embed = embeds(autograd.Variable(lookup_tensor))
print(hello_embed)

Variable containing:
-2.9718  1.7070 -0.4305 -2.2820  0.5237
[torch.FloatTensor of size 1x5]



An Example: N-Gram Language Modeling
------------------------------------------

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

where $w_i$ is the $i$th word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.




In [4]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = torch.Tensor([0])
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = [word_to_ix[w] for w in context]
        context_var = autograd.Variable(torch.LongTensor(context_idxs))

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_var)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, autograd.Variable(
            torch.LongTensor([word_to_ix[target]])))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        total_loss += loss.data
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
[
 521.5312
[torch.FloatTensor of size 1]
, 
 519.0699
[torch.FloatTensor of size 1]
, 
 516.6281
[torch.FloatTensor of size 1]
, 
 514.2030
[torch.FloatTensor of size 1]
, 
 511.7951
[torch.FloatTensor of size 1]
, 
 509.4021
[torch.FloatTensor of size 1]
, 
 507.0249
[torch.FloatTensor of size 1]
, 
 504.6612
[torch.FloatTensor of size 1]
, 
 502.3110
[torch.FloatTensor of size 1]
, 
 499.9735
[torch.FloatTensor of size 1]
]


Exercise: Computing Word Embeddings: Continuous Bag-of-Words
--------------------------------------------------------------------

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typically, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and a
context window of $N$ words on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.
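
To see what this objective computes, here is a minimal sketch of the
forward pass in plain Python rather than PyTorch, so the class below is
still left for you to write. Everything in it is an assumption made for
illustration: the four-word vocabulary, the made-up numbers, the helper
names `cbow_log_probs` and `cbow_loss`, and the shapes we chose for the
parameters ($A$ as a $|V| \times D$ matrix and $b$ as a
$|V|$-element bias):

```python
import math

# Hypothetical toy setup: 4-word vocabulary, 2-dimensional embeddings.
# All numbers are made up; in a real model they are learned parameters.
word_to_ix = {"we": 0, "study": 1, "the": 2, "idea": 3}
q = [[0.5, -0.1], [0.2, 0.3], [-0.4, 0.8], [0.1, 0.1]]   # embeddings, |V| x D
A = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2], [0.0, 0.6]]   # assumed |V| x D
b = [0.0, 0.1, -0.1, 0.0]                                # assumed |V| bias


def cbow_log_probs(context):
    """Return log Softmax(A (sum of context embeddings) + b) as a list."""
    dim = len(q[0])
    # Sum the embeddings of all context words, one dimension at a time.
    summed = [sum(q[word_to_ix[w]][d] for w in context) for d in range(dim)]
    # Affine map from the D-dimensional sum to one score per vocabulary word.
    scores = [sum(A[k][d] * summed[d] for d in range(dim)) + b[k]
              for k in range(len(A))]
    # Numerically stable log-softmax: subtract the max before exponentiating.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def cbow_loss(context, target):
    """Negative log-probability of the target word given its context."""
    return -cbow_log_probs(context)[word_to_ix[target]]


log_probs = cbow_log_probs(["we", "the", "idea"])
print(sum(math.exp(lp) for lp in log_probs))      # probabilities sum to 1
print(cbow_loss(["we", "the", "idea"], "study"))  # quantity CBOW minimizes
```

Because the context embeddings are summed, the order of the context
words does not affect the result, which is the sense in which CBOW is
not sequential.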

Implement this model in PyTorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.




In [6]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self):
        pass

    def forward(self, inputs):
        pass

# create your model and train.  here are some functions to help you make
# the data ready for use by your module


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    tensor = torch.LongTensor(idxs)
    return autograd.Variable(tensor)


make_context_vector(data[0][0], word_to_ix)  # example

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


Variable containing:
 31
  5
 42
 23
[torch.LongTensor of size 4]