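# N-Gram Model
In the previous lesson we covered word embeddings and how they are obtained. Now we look at how word embeddings can be used to train a language model. We start with the principle behind the N-Gram model and the problem it is meant to solve.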

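In a sentence, word order matters a great deal, so can we predict the words that follow from the words that come before? For example, in 'I lived in France for 10 years, I can speak _', we can predict that the last word is French.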

For a sentence T made up of the n words $w_1, w_2, \cdots, w_n$, the chain rule gives its probability as

$$
P(T) = P(w_1)P(w_2 | w_1)P(w_3 |w_2 w_1) \cdots P(w_n |w_{n-1} w_{n-2}\cdots w_2w_1)
$$

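We can simplify this model further. A word does not need to be conditioned on every word before it; it can be assumed to depend only on the few words immediately preceding it. This is the Markov assumption.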
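Written out, the Markov assumption replaces each full history with only the most recent words. A sketch in the notation above, assuming a context of two words (a trigram model, chosen here to match the `CONTEXT_SIZE = 2` used in the code later in this section):

$$
P(w_i | w_{i-1} w_{i-2} \cdots w_1) \approx P(w_i | w_{i-1} w_{i-2})
$$

$$
P(T) \approx P(w_1)P(w_2 | w_1) \prod_{i=3}^{n} P(w_i | w_{i-1} w_{i-2})
$$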

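For these conditional probabilities, the traditional approach counts how often words occur in the corpus and estimates the probabilities from those counts via Bayes' theorem. Here we instead represent each word by its embedding and use a neural network to compute the conditional probability (an RNN is one option; the code below uses a small feed-forward network). Maximizing this conditional probability both updates the word embeddings and lets the model predict a word from the computed probabilities.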
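For comparison, the count-based estimate can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy corpus and the helper name `trigram_prob` are made up here and are not part of the original notebook, and no smoothing is applied.

```python
from collections import Counter

# Toy corpus chosen for illustration; not part of the original notebook.
tokens = "the cat sat on the mat the cat ran".split()

# Count every (w1, w2, w3) triple and every (w1, w2) context that has a successor.
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
context_counts = Counter()
for (w1, w2, _w3), count in trigram_counts.items():
    context_counts[(w1, w2)] += count


def trigram_prob(w1, w2, w3):
    """Estimate P(w3 | w1, w2) as count(w1 w2 w3) / count(w1 w2)."""
    context = context_counts[(w1, w2)]
    if context == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return trigram_counts[(w1, w2, w3)] / context


print(trigram_prob("the", "cat", "sat"))
```

Unseen contexts get probability zero in this sketch, which is one of the weaknesses of raw counts that the embedding-based model avoids.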

Now we illustrate the neural approach directly in code.

In [2]:
CONTEXT_SIZE = 2  # number of context words used for prediction
EMBEDDING_DIM = 10  # dimension of the word vectors
# We use a Shakespeare sonnet
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

Here `CONTEXT_SIZE` is the number of preceding words used to predict the current word (two in this case), and `EMBEDDING_DIM` is the dimension of the word embeddings.

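Next we build the training set: we traverse the whole corpus and group the words in threes, with the first two words as the input and the third as the prediction target.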

In [3]:
trigram = [((test_sentence[i], test_sentence[i + 1]), test_sentence[i + 2]) for i in range(len(test_sentence) - 2)]

In [6]:
# total number of samples
print(len(trigram))
# look at the first sample
print(trigram[0])

113
(('When', 'forty'), 'winters')
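The same grouping can be written for any context length. The helper below is a hypothetical generalization (the name `build_ngrams` does not appear in the original notebook), shown on a short made-up token list:

```python
def build_ngrams(tokens, context_size):
    """Pair each run of `context_size` tokens with the token that follows it."""
    return [
        (tuple(tokens[i:i + context_size]), tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]


# Made-up tokens for illustration.
pairs = build_ngrams("a b c d".split(), context_size=2)
print(pairs)
```

With `context_size=2` this should produce the same pairs as the list comprehension above, and a sequence of `n` tokens yields `n - context_size` samples, which matches the 113 samples obtained from the 115-word sonnet.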


In [7]:
# build the vocabulary
vocab=set(test_sentence)
word_to_idx={word:i for i,word in enumerate(vocab)}
idx_to_word=dict(zip(word_to_idx.values(),word_to_idx.keys()))

print(word_to_idx)
print(idx_to_word)

{"feel'st": 0, 'Were': 1, 'beauty': 2, 'livery': 3, 'Thy': 4, 'Proving': 5, 'see': 6, 'sunken': 7, 'make': 8, 'old': 9, 'his': 10, 'besiege': 11, 'brow,': 12, 'thriftless': 13, 'thine!': 14, 'When': 15, 'field,': 16, 'within': 17, 'be': 18, 'were': 19, 'my': 20, 'lies,': 21, 'made': 22, 'count,': 23, "'This": 24, 'thy': 25, 'This': 26, 'proud': 27, 'eyes,': 28, 'mine': 29, 'all-eating': 30, 'How': 31, 'dig': 32, 'say,': 33, 'worth': 34, 'more': 35, 'succession': 36, 'And': 37, 'praise': 38, 'now,': 39, 'small': 40, 'sum': 41, 'a': 42, 'gazed': 43, 'To': 44, 'weed': 45, 'lusty': 46, "totter'd": 47, 'shall': 48, 'days;': 49, 'forty': 50, 'Will': 51, 'treasure': 52, 'an': 53, 'when': 54, 'thou': 55, 'so': 56, 'old,': 57, 'shame,': 58, 'by': 59, 'and': 60, "beauty's": 61, 'of': 62, 'fair': 63, 'held:': 64, 'winters': 65, 'answer': 66, 'thine': 67, 'warm': 68, 'use,': 69, "youth's": 70, 'cold.': 71, 'trenches': 72, 'couldst': 73, 'in': 74, 'art': 75, 'blood': 76, 'Shall': 77, 'praise.': 78,
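One detail worth noting: iterating over a `set` of strings can give a different order from one Python process to the next, so the indices printed above may not be reproduced on another run. A minimal sketch of one way to get stable indices, assuming alphabetical order is acceptable (the `demo_` names and the mini-corpus are made up for illustration):

```python
# Hypothetical mini-corpus for illustration only.
demo_tokens = "thy beauty thy brow thy field".split()

# sorted() fixes the iteration order, so each word gets the same index on every run.
demo_vocab = sorted(set(demo_tokens))
demo_word_to_idx = {word: i for i, word in enumerate(demo_vocab)}
demo_idx_to_word = {i: word for word, i in demo_word_to_idx.items()}

print(demo_word_to_idx)
```

Stable indices matter mainly when a trained embedding table is saved and reloaded later, since the rows of the table are addressed by these indices.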

In [8]:
import torch
from torch.autograd import Variable
import torch.nn.functional as F
from torch import nn

In [20]:
class n_gram(nn.Module):
    def __init__(self, vocab_size, context_size=CONTEXT_SIZE, n_dim=EMBEDDING_DIM):
        super(n_gram, self).__init__()
        self.embed = nn.Embedding(vocab_size, n_dim)  # build the embedding table
        self.classify = nn.Sequential(nn.Linear(context_size * n_dim, 128),
                                      nn.ReLU(),
                                      nn.Linear(128, vocab_size))

    def forward(self, x):
        voc_embed = self.embed(x)  # shape [context_size, n_dim], here [2, 10]
        voc_embed = voc_embed.view(1, -1)  # concatenate the two word vectors: [1, 20]
        out = self.classify(voc_embed)  # unnormalized scores (logits), shape [1, vocab_size]
        return out

In [21]:
net = n_gram(len(word_to_idx))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, weight_decay=1e-5)  # weight_decay adds L2 regularization
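`nn.CrossEntropyLoss` takes the raw scores produced by the network and, for a single sample, computes the negative log of the softmax probability assigned to the target class. The plain-Python sketch below reproduces that per-sample quantity for illustration; the function name `cross_entropy` and the example scores are made up here, and the sketch ignores batching and the other options of the PyTorch class.

```python
import math


def cross_entropy(logits, target):
    """Return -log(softmax(logits)[target]) using the log-sum-exp trick."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Made-up scores for a 3-word vocabulary.
logits = [2.0, 0.5, -1.0]
loss = cross_entropy(logits, target=0)
print(loss)
```

When all scores are equal the loss is `log(vocab_size)`, which gives a rough reference point for the loss of an untrained model.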

In [22]:
for e in range(100):
    train_loss=0
    for word,label in trigram:
        word=Variable(torch.LongTensor([word_to_idx[i] for i in word]))
        label=Variable(torch.LongTensor([word_to_idx[label]]))
        # forward pass
        out=net(word)
        loss=criterion(out,label)
        train_loss+=loss.item()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (e + 1) % 20 == 0:
        print('epoch: {}, Loss: {:.6f}'.format(e + 1, train_loss / len(trigram)))

epoch: 20, Loss: 0.757198
epoch: 40, Loss: 0.143966
epoch: 60, Loss: 0.095272
epoch: 80, Loss: 0.078094
epoch: 100, Loss: 0.068487


In [32]:
net = net.eval()

# test the result
word, label = trigram[19]
print('input: {}'.format(word))
print('label: {}'.format(label))
word = Variable(torch.LongTensor([word_to_idx[i] for i in word]))
out = net(word)

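# tensor.max(dim=1) returns two values: the maximum value and its index.
# `out` holds raw scores (logits), so the first value is a score, not a probability.
pred_label_prob, pred_label_idx = out.max(1)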

predict_word = idx_to_word[pred_label_idx.item()]
print('real word is {}, predicted word is {}'.format(label, predict_word))

input: ('so', 'gazed')
label: on
real word is on, predicted word is on


In [33]:
word, label = trigram[75]
print('input: {}'.format(word))
print('label: {}'.format(label))

word = Variable(torch.LongTensor([word_to_idx[i] for i in word]))
out = net(word)
pred_label_prob,pred_label_idx = out.max(dim=1)
predict_word = idx_to_word[pred_label_idx.item()]
print('real word is {}, predicted word is {}'.format(label, predict_word))

input: ("'This", 'fair')
label: child
real word is child, predicted word is child


### Extension: tensor.max(dim=)

In [34]:
a=torch.randn(2,5)
print(a)
prob,index=a.max(dim=1)
print(prob)
print(index)

tensor([[-1.0764, -0.6976,  0.0015, -0.9912,  0.2761],
        [ 0.1534,  0.1551,  1.4026, -0.4309,  0.2014]])
tensor([0.2761, 1.4026])
tensor([4, 2])
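Because `torch.randn` gives different numbers on every run, a fixed tensor makes the behaviour easier to check by hand. The sketch below assumes PyTorch is installed and uses made-up values; it contrasts `dim=1` (one result per row) with `dim=0` (one result per column).

```python
import torch

# A fixed tensor (values chosen for illustration) so the result is reproducible.
a = torch.tensor([[1.0, 5.0, 3.0],
                  [4.0, 2.0, 6.0]])

# dim=1 reduces across the columns: one (value, index) pair per row.
row_max, row_idx = a.max(dim=1)
# dim=0 reduces across the rows: one (value, index) pair per column.
col_max, col_idx = a.max(dim=0)

print(row_max, row_idx)
print(col_max, col_idx)
```

In the prediction cells above, `out` has shape `[1, vocab_size]`, so `out.max(dim=1)` picks the highest-scoring word for the single input row.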
