# 神经网络语言模型（NNLM）
NNLM在2003年被提出，那时的语言模型主要是以N-gram为代表的统计学模型。语言模型的任务是对一段连续的单词也就是语句的联合概率进行建模，根据的是以下的公式：
$$P(w_1,\cdots,w_T)=P(w_T)P(w_T|w_1,\cdots,w_{T-1})$$
统计学的方法是通过统计$P(w_1,\cdots,w_T)$和$P(w_T)$来求出条件概率，但是由于$w_1,\cdots,w_T$可以有无限的长度和庞大的词语选择空间，统计它们是非常困难的，而且有限的数据集本身也只能覆盖到所有可能句子中的非常小一部分，对于数据集中没有出现的部分不具备泛化能力。
因此人们把能够影响一个单词概率的范围限制在了它之前的固定有限个单词，也就是有公式
$$P(w_{T-n+1},\cdots,w_T)=P(w_T)P(w_T|w_{T-n+1},\cdots,w_{T-1})$$
第T个单词的概率仅受前面的固定几个单词影响，和前面的再多的单词也没有关系了。这样做虽然减少了统计上的压力，但是也使得计算出来的条件概率不再精确。

NNLM能够解决统计所有单词频率难以实现的问题，它使用一个神经网络来构造一个函数$f(i,w_{T-1},\cdots,w_1)=P(i|w_1,\cdots,w_{T-1})$。在了解网络具体模型之前，我们先导入一下训练用的数据集：WikiText-2，这是一个没有标注数据的语料库。

In [36]:
# Prepares training data, uses WikiText-2 dataset from torchtext to train a language model
import torch
import torchtext
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader

# Set up training dataset and tokenizer, first iterator of
# WikiText dataset will be loaded (downloading may consume some time).
# Then the train_iter will be converted into list-like variable supporting __getitem__ method.
# Otherwise the dataset could only be accessed once.
train_iter = torchtext.datasets.WikiText2(split='train')
train_set = to_map_style_dataset(train_iter)
tokenizer = get_tokenizer('basic_english')


# Clean and filter the dataset and shrinks size of dataset 
# to 300 for a quick training if small_set=True
def clean_dataset(dataset, small_set=False):
    processed_dataset = [d for d in dataset if len(tokenizer(d)) > 20] # Filters out abnormal short sentences
    if small_set:
        processed_dataset = processed_dataset[:300]
    return processed_dataset
train_set = clean_dataset(train_set, small_set=True)

# Build a vocabulary
# Feeds whole train set into the vocabulary
def yield_tokens(dataset):
    for text in dataset:
        yield tokenizer(text)
vocab = build_vocab_from_iterator(yield_tokens(train_set), specials="<unk>", min_freq=3)
vocab.set_default_index(vocab['<unk>'])
print("Counted %d words from train dataset"%(len(vocab)))

Counted 2169 words from train dataset


In [37]:
# Build a n-gram dataloader
# the batch from dataloader contains two tensor:
# a context tensor with shape[batch_size, num_order-1], with each row a continous n-1 word indexes
# a target tensor with shape[batch_size], 
# with each element one word index correspondent to its previous n-1 words
num_order = 6 # the 'n' in n-gram
batch_size = 256


def build_ngram_dataset(dataset, num_order):
    ngram_set = []

    for text in dataset:
        tokens = tokenizer(text)
        indexes = vocab(tokens)
        len_text = len(indexes)
        for i in range(len_text - num_order):
            input_tensor = torch.tensor(indexes[i:i+num_order-1], dtype=torch.int64)
            target_tensor = torch.tensor(indexes[i+num_order], dtype=torch.int64)
            # We don't expect model to learn to predict "<unk>" token
            if vocab["<unk>"] not in target_tensor:
                ngram_set.append((input_tensor, target_tensor))
    return ngram_set


ngram_set = build_ngram_dataset(train_set, num_order)
train_loader = DataLoader(ngram_set, batch_size, shuffle=True)
print("Counted %d pairs in ngram_set"%(len(ngram_set)))

Counted 36001 pairs in ngram_set


上面的变量中，比较有用的变量有`train_loader`, `vocab`, `tokenizer`，它们的用法如下：
```
>>> for input_tensor in train_loader:
>>>     do sth... # tensor shape: [sentence_length, ]
```
vocab 用法:
```
>>> vocab(['i', 'am', 'on', 'a', 'mat'])
[69, 1791, 17, 13, 17093]
>>> vocab.lookup_token(187)
large'
```
tokenizer 用法:
```
>>> tokenizer("Have you eaten today?")
['have', 'you', 'eaten', 'today', '?']
```

## NNLM网络结构
NNLM整体上来说由两个部分构成：对输入进行词嵌入————把词语的索引映射成一个几十维的实向量（词嵌入），以及输入连续n个词向量预测第n+1个词语概率的前馈网络（概率函数）。用现在的眼光来看这是一个简单的不得了的网络了，但是在2003年，这是第一个提出利用预测词语的任务来同时训练词嵌入参数和概率函数参数的文章，在此之前，这些参数都是独自分别求解或者手动设置的。而且这个简单的模型揭示了一个重要的道理————人工神经网络在大量训练数据面前可以得到超越一般模型的表现

<img src="https://image.panwenbo.icu/blog20210714225940.png" alt="截屏2021-07-14 下午10.59.35" style="zoom:30%;" />

### 词嵌入部分
词嵌入部分和现在的词嵌入方法是一样的————所有词语共享同一个词嵌入矩阵的参数，因此我们可以用这一个词嵌入模型把任意词语$w_i \in {0,1,\cdots,|V|-1}$ 映射到 $x_i \in R^{[m]}$其中m代表了词向量的维数，这个过程记为$x_i = C(w_i)$，而词嵌入部分的最终输出就是
$$x=[C(w_{t-1}),C(w_{t-2}),\cdots, C(w_{t-n+1})] \in R^{[m\times(n-1)]}$$
在NNLM中我们指定一个超参数n，就像n_gram模型一样，我们只输入固定的前n个词语来预测下一个词语。因此这个模型并不一定要使用循环神经网络。词嵌入部分当中使用的参数只有一个矩阵$C \in R^{[|V|, m]}$，$|V|$代表了所有词语的总数，这个矩阵存储了V中每一个词语对应的m维的词向量。

### 概率函数
我们现在有了词嵌入向量$x$，为了算法的非线性性，我们需要把它再进行一次非线性变换：$x'= \tanh(Hx+ d)$。之后我们把x'输入一个线性层（$y = Ux' + b$）后带入Softmax函数得到
$$
\hat P(w_t|w_{t-1}...w_{t-n+1} )=\frac{\exp(y_{w_t})}{\sum_i \exp(y_i)}
$$
如果x'是一个h维的隐藏层向量，y是一个维数为|V|的向量（代表了每个可能的词语的得分），那我们就可以知道其余参数的大小尺寸：
$$H \in R^{[h, m(n-1)]}, d \in R^{[h]}, U \in R^{[|V|, h]}, b \in R^{[|V|]}$$

最终我们的每个单词的得分就是：
$$y = b + U\tanh(d + Hx)$$
当然，在实际实验中我们还会添加一个直接从x到y的连接，也就是：
$$y = b + Wx + U\tanh(d + Hx), \ \ \ \ W \in R^{[|V|, m]}$$

In [38]:
# Purest NNLM class implement
import torch
import torch.nn as nn

class NNLM(nn.Module):
    """Nueral Network Language Model in its purest form

        :param vocab_size: |V|, the nums of words in vocabulary
        :type vocab_size: int
        :param embedded_size: m, the size of embedded vector, defaults to 100
        :type embedded_size: int, optional
        :param num_order: n, the numbers of input words is n - 1, defaults to 6
        :type num_order: int, optional
        :param hidden_size: the hidden layer size in tanh, defaults to 60
        :type hidden_size: int, optional
        """
    
    def __init__(self, vocab_size, embedded_size=100, num_order=6, hidden_size=60):
        super().__init__()
        self.C = nn.Parameter(torch.rand((vocab_size, embedded_size)))
        self.H = nn.Parameter(torch.rand((hidden_size, embedded_size*(num_order-1))))
        self.d = nn.Parameter(torch.rand((hidden_size)))
        self.U = nn.Parameter(torch.rand((vocab_size, hidden_size)))
        self.W = nn.Parameter(torch.rand((vocab_size, embedded_size*(num_order-1))))
        self.b = nn.Parameter(torch.rand((vocab_size)))
        self.softmax = nn.Softmax(dim=0) # Apply softmax on vector
    
    def forward(self, words):
        """forward function

        :param words: the list of word indexes with length of num_order - 1
        :return: probabilities of all vocab words, shape: [vocab_size]
        """
        x = self.C[words] # shape: [num_order-1, embedding_size]
        x = x.view(-1)
        y = self.b + self.W @ x + self.U @ torch.tanh(self.d + self.H @ x)
        return self.softmax(y)

In [39]:
# Compat version of the same model
class NNLM_compat(nn.Module):
    def __init__(self, vocab_size, embedded_size=100, num_order=6, hidden_size=60):
        super().__init__()   
        self.embedding = nn.Embedding(vocab_size, embedded_size)
        self.tanh_layer = nn.Linear(embedded_size*(num_order-1), hidden_size)
        self.out = nn.Linear(embedded_size*(num_order-1) + hidden_size, vocab_size)
        self.softmax = nn.Softmax(dim=1)
    
    def forward(self, words):
        """forward function

        :param words: the tensor of words indexes, shape: [batch_size, num_order - 1]
        :return: possibility of all vocab-size words, shape: [batch_size, vocab_size]
        """
        x = self.embedding(words)
        x = x.view((x.size(0), -1))
        h = torch.tanh(self.tanh_layer(x))
        combine = torch.cat((x, h), dim=1)
        return self.softmax(self.out(combine))

## 训练部分
训练的目标是给定已知的n-gram，最大化条件概率$\hat P(w_t|w_{t-1}, \cdots, w_{t-n+1})$，由于我们的网络输出的就是概率值，想要使该值最大化，我们只需要在输出的$w_t$对应的概率处调用.backward(-1)即可在计算图内求出每个参数的负梯度。对计算出的负梯度使用梯度下降就可以最大化需要的输出概率，当然实际操作时还需要先对SoftMax概率取一下对数来更好的优化。当然也可以使用nn.NLLLoss，有着差不多的效果

In [40]:
# Training section, firstly prepare some variables
num_epoch = 24
embedded_size = 100
hidden_size = 60

# Because pure version of NNLM doesn't support mini-batch 
# gradient descent, uses NNLM_compat instead
model = NNLM_compat(len(vocab), embedded_size, num_order, hidden_size) 
# optimizer =  torch.optim.SGD(params=model.parameters(), lr=1e-3) # setting in original paper, too slow
optimizer =  torch.optim.Adam(params=model.parameters(), lr=1e-3)

# We'll use backward funtion directly on negative log 
# probabilities so loss function is no longer needed
# criterion = nn.NLLLoss() 

for epoch in range(num_epoch):
    # running_loss sum up and average all loss during 
    # the period of one epoch
    running_loss = 0 
    
    for idx, batch in enumerate(train_loader):
        input, target = batch
        output = model(input)

        # Backward propagation
        indexes = target.view((-1, 1))
        optimizer.zero_grad()
        # Uses gather() to pick out elements in different columns for different rows
        # negative log likelihood loss does the same thing as taking
        # probalities into CrossEntropy loss function.
        loss = -torch.log(output.gather(1, indexes)).mean()
        loss.backward()
        optimizer.step()

        # Print progress
        running_loss += loss.item()

    print("Epoch %d:\t|Step %d:\t|loss=%.3f"%(
        epoch + 1, idx, running_loss / batch_size
        ))

Epoch 1:	|Step 140:	|loss=3.504
Epoch 2:	|Step 140:	|loss=2.790
Epoch 3:	|Step 140:	|loss=2.380
Epoch 4:	|Step 140:	|loss=2.048
Epoch 5:	|Step 140:	|loss=1.785
Epoch 6:	|Step 140:	|loss=1.586
Epoch 7:	|Step 140:	|loss=1.433
Epoch 8:	|Step 140:	|loss=1.314
Epoch 9:	|Step 140:	|loss=1.217
Epoch 10:	|Step 140:	|loss=1.137
Epoch 11:	|Step 140:	|loss=1.067
Epoch 12:	|Step 140:	|loss=1.008
Epoch 13:	|Step 140:	|loss=0.954
Epoch 14:	|Step 140:	|loss=0.906
Epoch 15:	|Step 140:	|loss=0.862
Epoch 16:	|Step 140:	|loss=0.821
Epoch 17:	|Step 140:	|loss=0.783
Epoch 18:	|Step 140:	|loss=0.748
Epoch 19:	|Step 140:	|loss=0.716
Epoch 20:	|Step 140:	|loss=0.684
Epoch 21:	|Step 140:	|loss=0.653
Epoch 22:	|Step 140:	|loss=0.625
Epoch 23:	|Step 140:	|loss=0.597
Epoch 24:	|Step 140:	|loss=0.572


## 测试部分
我们挑训练数据中的一段话输入模型，将输出的概率中最大概率的词语再加入原来的句子中并再次输入模型，以此类推，让模型来续写一段话。可以发现模型记住了文本的少量固定短语，而"the"","".""is"等词语由于有最大的先验概，基本上充满了模型的输出。

In [49]:
# Evaluate by making it produce sth from its previous produce
start_sentence = "The building and the surrounding park were"
tokens = tokenizer(start_sentence)
def pick_last_ngram_tensor(num_order):
    indexes = vocab(tokens[-num_order + 1:])
    return torch.tensor(indexes, dtype=torch.int64).view((1, -1))

for i in range(15):
    input = pick_last_ngram_tensor(num_order)
    pred = model(input)
    tokens.append(vocab.lookup_token(pred.argmax()))

print(' '.join(tokens))

the building and the surrounding park were with points that the ' , had just four of and the . is to


原文中还使用了Perplexity($\frac{1}{\hat P(w_t|w_{t-1}, \cdots, w_{t-n+1})}$的几何平均值)困惑程度来衡量模型的表现。我们编写几段文字，代入模型计算一下它们的困惑程度。如果困惑值越低，说明这句话越符合语言模型所统计到的语言模式

In [87]:
text1 = """Three people walking around the river, 
only to find some dead fishes floating all around."""
text2 = """Three around river walking people the, 
only to find some dead fishes floating all around."""
text3 = """Three around river walking people the, 
Tel Aviv, while Haifa gained status in suffered"""

def perplexity(text: str):
    input = build_ngram_dataset([text], 6)
    probs = 1
    for context, target in input:
        probs *= model(context.view(1, -1))[0, target.item()]
    return torch.pow(probs, -1/len(input)).item()

print("text 1 perplexity: %.3f" % perplexity(text1))
print("text 2 perplexity: %.3f" % perplexity(text2))
print("text 3 perplexity: %.3f" % perplexity(text3))

text 1 perplexity: 637.191
text 2 perplexity: 1801.824
text 3 perplexity: 7079.959
