# 语言模型
## 学习目标
* 学习语言模型
* 学习torchtext的基本使用方法
    * 构建vocabulary
    * word_to_index and index_to_word
* 学习torch.nn的一些基本模型
    * Linear
    * RNN
    * LSTM
    * GRU
* RNN的训练技巧
    * Gradient Clipping
* 如何保存和读取模型

首先使用torchtext来创建vocabulary。然后把数据读成batch的格式。可以去github自行阅读torchtext的readme

In [1]:
import torchtext
from torchtext.vocab import Vectors
import torch
import numpy as np
import random
import os
import tqdm
import time
os.environ['CUDA_VISIBLE_DEVICES'] = '1, 2, 3'

USE_CUDA = torch.cuda.is_available()

random.seed(1234)
np.random.seed(1234)
torch.manual_seed(1234)
if USE_CUDA:
    torch.cuda.manual_seed(1234)
    
BATCH_SIZE = 64
EMBEDDING_SIZE = 100
MAX_VOCAB_SIZE = 50000
HIDDEN_SIZE = 100

In [2]:
print(USE_CUDA)
torch.cuda.device_count()

True


3

* 继续使用text8作为训练验证测试
* torchtext 的一个重要的概念是Field，它决定数据如何被处理。吃用TEXT这个Field来处理文本数据。 其中有lower = True 这个参数，即所有的单词都会被lowercase。
* torchtext提供了LanguageModelingDataset这个class来帮助处理语言模型数据集。
* build_vocab可以根据我们提供的训练数据集来创建最高频单词的单词表，max_size帮助我们限定单词总量
* BPTTiterator可以连续地得到连贯的句子，BPTT全称为back propagation through time

In [3]:
TEXT = torchtext.data.Field(lower = True)
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(path = ".",
                            train = "text8.train.txt", validation = "text8.dev.txt",
                            test = "text8.test.txt", text_field = TEXT)

In [4]:
TEXT.build_vocab(train, max_size = MAX_VOCAB_SIZE)

In [5]:
len(TEXT.vocab)

50002

* 之所以是50002是因为torchtext增加了两个特殊的token，<unk>表示unknown的单词，<pad>表示padding
* 模型的输入是一串文字，模型的输出也是遗传文字，他们之间相差一个位置，因为语言模型的目标是根据之前的单词预测下一个单词

In [6]:
# TEXT.vocab.stoi["damn"]

In [7]:
device = torch.device("cuda" if USE_CUDA else "cpu")

In [8]:
device

device(type='cuda')

In [9]:
train_iter , val_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (train, val, test), batch_size = BATCH_SIZE, device=device, 
    bptt_len = 100, repeat = False, shuffle = True)

In [10]:
# for i in enumerate(train_iter):
#     print(i)

In [11]:
it = iter(train_iter)
batch = next(it)

In [12]:
batch


[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 100x64 (GPU 0)]
	[.target]:[torch.cuda.LongTensor of size 100x64 (GPU 0)]

In [13]:
print(" ".join(TEXT.vocab.itos[i] for i in batch.text[:, 0].data.cpu()))
print(" ".join(TEXT.vocab.itos[i] for i in batch.target[:, 0].data.cpu()))

anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans <unk> of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing
originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans <unk> of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived

接下来定义模型
=

In [14]:
import torch.nn as nn

class RNNModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(RNNModel, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.hidden_size = hidden_size
        
    def forward(self, text, hidden):
        # forward pass
        # text: [seq_length, batch_size]
        emb = self.embed(text) # emb: [seq_length, batch_size, embed_size]
        output, hidden = self.lstm(emb, hidden)
        # output: [seq_length, batch_size, hidden_size]
        # hidden: ([1(number of layers etc. bert), batch_size, hidden_size], [1(number of layers etc. bert) , batch_size, hidden_size])
        out_vocab = self.linear(output.view(-1, output.shape[2])) # [seq_length*batch_size, hidden_size] 这里是因为output是三维的，而linear需要输入是二维的，所以用view压成二维的函数 
        out_vocab = out_vocab.view(output.size(0), output.size(1), out_vocab.size(-1)) # 这里再重新释放出来
        # 这里不用加sigmoid之类的activation，因为本身lstm的gate就扮演了这样的角色 ,当然也可以加，只是默认不加
        return out_vocab, hidden
        
    def init_hidden(self, batchsize, requires_grad = True):
        weight = next(self.parameters())
        return (weight.new_zeros((1, batchsize, self.hidden_size), requires_grad = True),
                weight.new_zeros((1, batchsize, self.hidden_size), requires_grad = True))
        

初始化一个模型
-

In [15]:
model = RNNModel(vocab_size = len(TEXT.vocab), 
                 embed_size = EMBEDDING_SIZE,
                 hidden_size = HIDDEN_SIZE)
if USE_CUDA:
    momdel = model.to(device)

In [16]:
next(model.parameters())

Parameter containing:
tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.5903, -0.1947, -0.2415],
        [ 1.3204,  1.5997, -1.0792,  ...,  0.6060,  0.2209, -0.8245],
        [ 0.7289, -0.7336,  1.5624,  ..., -0.5592, -0.4480, -0.6476],
        ...,
        [ 0.5675,  1.4622, -0.5770,  ..., -0.4970, -0.3513,  1.9668],
        [-0.2747,  1.3695, -0.5266,  ..., -1.2115,  0.1327, -0.7934],
        [ 1.4297, -0.1843,  0.2579,  ..., -0.0684,  0.5642,  0.6348]],
       device='cuda:0', requires_grad=True)

接下来训练模型：
-
* 若干epoch
* 每个epoch分成若干batch
* 把每个batch输入输出的数据包装为 cuda tensor
* forward pass ，通过输入的句子预测每个单词的下个单词
* 用模型的预测和正确的下个单词计算cross entropy loss
* 清空模型当前gradient
* backward pass
* gradient clipping，防止梯度爆炸
* 更新模型参数
* 每隔一定的iteration输出模型当前loss，以及在验证集上做模型评估

In [17]:
def repackage_hidden(h):
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)    # 这个h是一个全新的开始，没有把历史保存下来

In [18]:
VOCAB_SIZE = len(TEXT.vocab)
loss_fn = nn.CrossEntropyLoss()

learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.5)

In [19]:
def evaluate(model, data):
    model.eval()
    total_loss = 0.
    total_count = 0.
    it = iter(data)
    with torch.no_grad():
        hidden = model.init_hidden(BATCH_SIZE, requires_grad = False)
        for i, batch in enumerate(it):
            data, target = batch.text, batch.target
            hidden = repackage_hidden(hidden)
            output, hidden = model(data, hidden)
            loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))
            total_loss = loss.item() * np.multiply(*data.size())
            total_count = np.multiply(*data.size())
            
    loss = total_loss/total_count
    model.train()   #这一步很有必要
    return loss
            

In [25]:

NUM_EPOCHS = 2
GRAD_CLIP = 5.

val_losses = []
for epoch in range(NUM_EPOCHS):
    model.train()
    it = iter(train_iter)
    hidden = model.init_hidden(BATCH_SIZE)

    for i, batch in enumerate(it):
        data, target = batch.text, batch.target
        hidden = repackage_hidden(hidden)
        output, hidden = model(data, hidden) # problem is : backpropogate through all iterations  计算很大很深，并且只有语言模型才会这样传hidden，翻译也不会这么做
        
        loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))
        # 
        optimizer.zero_grad()
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(),GRAD_CLIP)
        optimizer.step()
        
        if i % 100 == 0 :
            print("epoch",epoch,"iterator",i,"       |     loss", loss.item())
        
        if i % 1000 == 0:
            val_loss = evaluate(model, val_iter)
            print("epoch",epoch,"iterator",i,"      |    loss of validation" , val_loss)
            if len(val_losses) == 0 or val_loss < min(val_losses):
                print("best model saved to language_model3.pth")
                torch.save(model.state_dict(),"languagem_model3.pth")
            else:
                # learning rate decay
                scheduler.step()
                
            val.losses.append(val_loss)
        

        
        

epoch 0 iterator 0        |     loss 6.5254669189453125
epoch 0 iterator 0       |    loss of validation 6.448461055755615
best model saved to language_model3.pth
epoch 0 iterator 100        |     loss 6.348505020141602
epoch 0 iterator 200        |     loss 6.3520002365112305
epoch 0 iterator 300        |     loss 6.081464767456055
epoch 0 iterator 400        |     loss 6.179876327514648
epoch 0 iterator 500        |     loss 6.229490756988525
epoch 0 iterator 600        |     loss 6.128426551818848
epoch 0 iterator 700        |     loss 6.121902942657471
epoch 0 iterator 800        |     loss 6.201119422912598
epoch 0 iterator 900        |     loss 6.033709526062012
epoch 0 iterator 1000        |     loss 6.113443374633789
epoch 0 iterator 1000       |    loss of validation 6.225577354431152
best model saved to language_model3.pth
epoch 0 iterator 1100        |     loss 6.0718231201171875
epoch 0 iterator 1200        |     loss 6.109519004821777
epoch 0 iterator 1300        |     los

读取模型
-

In [28]:
best_model = RNNModel(vocab_size = len(TEXT.vocab), 
                 embed_size = EMBEDDING_SIZE,
                 hidden_size = HIDDEN_SIZE)
if USE_CUDA:
    best_momdel = best_model.to(device)
    
best_model.load_state_dict(torch.load("languagem_model3.pth"))

使用最好的模型在valid数据上计算perplexity
-

In [29]:
val_loss = evaluate(best_model , val_iter)
print("perplexity : ", np.exp(val_loss))

perplexity :  308.2964855642628


使用训练好的的模型来生成一些句子
-
很重要，多看

In [31]:
hidden = best_model.init_hidden(1)  # 拿一个batch_size是 1 的hidden state
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input = torch.randint(VOCAB_SIZE, (1, 1), dtype = torch.long).to(device)
words = []
for i in range(100):
    # run forward pass
    output, hidden = best_model(input, hidden)
    # logits exp (like softmax)
    word_weights = output.squeeze().exp().cpu() #squeeze（）把所有维度为 1 的部分扔掉
    # multinomial sampling : the function gets the index of input instead of the value of the input
    word_idx = torch.multinomial(word_weights, 1)[0]
    # fill in the current predicted word to the current input
    input.fill_(word_idx)
    word = TEXT.vocab.itos[word_idx]
    words.append(word)
    
print(" ".join(words))
    

maronites chamorros korea ecps omelette defeats paperback impressions <unk> spaceflight two nine eight gott f n zero as six zero one nicolson two two zero three g himmel cook and trail slide crimson quotas see a standard numeral for several diameters state or concretes the atari tnt judges in the town six th edition in music was stone dressing but most notably guinness style of bologna points and that admire the air marker the batter still exist the conventional color for workflow during the beatles writing at each of contents that the great majority of alpha is now surprising by
