# Language Model (input: a sentence; output: the probability of that sentence)

Goal: predict the next word given the preceding words.
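The title (sentence probability) and the goal (next-word prediction) are linked by the chain rule: the probability of a sentence is the product of the probabilities of each word given the words before it. The sketch below illustrates this with a hypothetical toy table of conditional probabilities that looks only one word back; the table and its numbers are made up for illustration, whereas the LSTM trained in this notebook conditions on the whole preceding context.

```python
import math

# Hypothetical toy conditional probabilities P(next_word | previous_word).
# "<s>" marks the start of the sentence; the numbers are invented for illustration.
toy_bigram = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sleeps"): 0.1,
}

def sentence_log_prob(words, cond_prob):
    """Sum log P(w_t | w_{t-1}) over the sentence (chain rule, one-word context)."""
    log_p = 0.0
    prev = "<s>"
    for w in words:
        log_p += math.log(cond_prob[(prev, w)])
        prev = w
    return log_p

log_p = sentence_log_prob(["the", "cat", "sleeps"], toy_bigram)
prob = math.exp(log_p)  # equals 0.5 * 0.2 * 0.1
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is also why the training loss is a sum of log probabilities.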


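Learning objectives
- Learn what a language model is and how to train one
- Learn the basics of torchtext
    - Building a vocabulary
    - word to index and index to word
- Learn some basic torch.nn modules
    - Linear
    - RNN
    - LSTM
    - GRU
- Training tricks for RNNs
    - Gradient Clipping
- How to save and load a model  
- Updating the learning rate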

## Import the required packages

In [1]:
import torchtext
import torch
import numpy as np
import random
import os
import torch.nn as nn

USE_CUDA=torch.cuda.is_available()
print(USE_CUDA)
device=torch.device('cuda' if USE_CUDA else 'cpu')

# fix the random seeds for reproducibility
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)
if USE_CUDA:
    torch.cuda.manual_seed(1)

True


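## Define the hyperparameters

In [2]:
# number of sequences in one batch
BATCH_SIZE=32 
# dimension of the word embeddings
EMBEDDING_SIZE=100
# maximum number of words kept in the vocabulary
MAX_VOCAB_SIZE=50000
# length of each sequence (used as bptt_len below)
SEQ_LENGTH=20
# number of hidden units
HIDDEN_SIZE = 100
NUM_EPOCHES=2
learning_rate=0.001
# maximum gradient norm used for gradient clipping
GRAD_CLIP=5.0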

## Create the vocabulary
- Install [torchtext](https://github.com/pytorch/text) (used for text preprocessing)  
    pip install torchtext  
- We use torchtext to create the vocabulary and then read the data in batches. Please read the README yourself to learn torchtext. 
- Check that the torchtext and torch versions match; with a mismatch, installing torchtext may replace the installed torch with a newer CPU-only build.

- **Note the API changes**:  
    torchtext.data.Field -> torchtext.legacy.data.Field  
    torchtext.datasets.LanguageModelingDataset -> torchtext.legacy.datasets.LanguageModelingDataset  
    torchtext.data.BPTTIterator -> torchtext.legacy.data.BPTTIterator  

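### Preprocess the data with a Field; create three datasets with the LanguageModelingDataset class
We continue to use the text8 dataset for training, validation, and testing.
1. An important concept in TorchText is the [Field](https://torchtext.readthedocs.io/en/latest/data.html#field), which determines how the data is processed.  
    We use a field named TEXT to process the text data.  
    Our TEXT field sets lower=True, so all words are lowercased.  
    torchtext provides the LanguageModelingDataset class to help handle language modeling datasets.  
2. build_vocab builds a vocabulary of the most frequent words from the training dataset we provide; max_size caps the number of words.
3. BPTTIterator yields consecutive, contiguous chunks of text. [BPTT](https://zh.d2l.ai/chapter_recurrent-neural-networks/bptt.html) stands for backpropagation through time.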

In [3]:
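# locate the dataset directory
# note: '__file__' is a string literal here, so abspath resolves it relative to the
# current working directory (the __file__ variable is not defined inside a notebook)
script_path=os.path.abspath('__file__')
dir_path=os.path.dirname(script_path)
path=os.path.join(dir_path,'text8')
print(path,type(path))

# create a Field named TEXT
# lower=True: lowercase all words
TEXT=torchtext.legacy.data.Field(lower=True)
# create the train, val, and test datasets for language modeling
# from the three pre-split files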
train, val, test = torchtext.legacy.datasets.LanguageModelingDataset.splits(path=path, 
                                                                            train='text8.train.txt', 
                                                                            validation='text8.dev.txt', 
                                                                            test='text8.test.txt', 
                                                                            text_field=TEXT)
# print(train)
# print(dir(train))
# print(train.examples)

C:\Users\Re_AC\Desktop\Pytorch\myTorch\3\languageModelNoteBook\text8 <class 'str'>


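### Create the Vocabulary
- Creating the vocabulary corresponds to building the vocab parameter in __myTorch/2/wordEmbeddingNotebook/2.ipynb#数据预处理及相关操作__ (the data preprocessing section of that notebook)
- The procedure takes the MAX_VOCAB_SIZE most frequent words in the dataset as the Vocabulary
- The vocabulary has 50002 entries rather than 50000, because TorchText adds two special tokens:  
    \< unk \>: stands for unknown words that are not in the vocabulary  
    \< pad \>: stands for padding; when a sentence is short, \< pad \> is appended to bring it up to the required length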
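The vocabulary construction described above can be sketched in plain Python. This is a simplified illustration of the idea, not torchtext's implementation: the helper name `build_vocab` and the toy token list are hypothetical, and tie-breaking between equally frequent words may differ from what torchtext does.

```python
from collections import Counter

def build_vocab(tokens, max_size):
    """Return (itos, stoi) for the max_size most frequent tokens plus two specials."""
    counter = Counter(tokens)
    # The two special tokens come first, so the list has max_size + 2 entries.
    itos = ["<unk>", "<pad>"] + [w for w, _ in counter.most_common(max_size)]
    stoi = {w: i for i, w in enumerate(itos)}
    return itos, stoi

tokens = "the the the cat cat sat".split()
itos, stoi = build_vocab(tokens, max_size=2)

# Words outside the vocabulary ("sat" here) map to the index of <unk>.
unk = stoi["<unk>"]
ids = [stoi.get(w, unk) for w in ["the", "cat", "sat"]]
```

In the notebook, `TEXT.vocab.stoi` is a `defaultdict`, which plays the role of the `stoi.get(w, unk)` fallback used here.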

In [4]:
# build the vocabulary from the training dataset, keeping at most MAX_VOCAB_SIZE words
TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)
# note that the vocabulary has 50002 entries, not the 50000 given by MAX_VOCAB_SIZE

# define VOCAB_SIZE
VOCAB_SIZE = len(TEXT.vocab)
print(len(TEXT.vocab)) #vocabulary size

#itos: index to string
print(type(TEXT.vocab.itos))
print(TEXT.vocab.itos[:10]) # note <unk> and <pad> at the front

#stoi: string to index
print(type(TEXT.vocab.stoi))
print(TEXT.vocab.stoi['apple'])

50002
<class 'list'>
['<unk>', '<pad>', 'the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero']
<class 'collections.defaultdict'>
1259


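### Create the batches (iterator)
Create batches from the datasets; each batch contains BATCH_SIZE sequences.  
Each sequence has length seq_length (= bptt_len), which runs along the time dimension.  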
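To make the batch layout concrete, here is a pure-Python sketch of how a long token stream can be split into `batch_size` parallel columns and then cut along the time axis into chunks of `bptt_len`, with the target shifted by one token. The function `bptt_batches` is a hypothetical helper written for illustration; it truncates the leftover tail for simplicity, and the real BPTTIterator may handle the tail differently.

```python
def bptt_batches(token_ids, batch_size, bptt_len):
    """Yield (text, target) pairs, each a nested list shaped [seq_len][batch_size]."""
    # Split the stream into batch_size contiguous columns (leftover tail is dropped).
    col_len = len(token_ids) // batch_size
    columns = [token_ids[b * col_len:(b + 1) * col_len] for b in range(batch_size)]
    # Walk along the time axis; stop one token early so every input has a target.
    for start in range(0, col_len - 1, bptt_len):
        seq_len = min(bptt_len, col_len - 1 - start)
        text = [[columns[b][start + t] for b in range(batch_size)]
                for t in range(seq_len)]
        target = [[columns[b][start + t + 1] for b in range(batch_size)]
                  for t in range(seq_len)]
        yield text, target

stream = list(range(20))  # stand-in for word indices
batches = list(bptt_batches(stream, batch_size=2, bptt_len=4))
first_text, first_target = batches[0]
```

Column 0 of consecutive batches continues the same stretch of text, which is why the hidden state can be carried from one batch to the next during training.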

In [5]:
# bptt_len: length of sequences for backpropagation through time.
# it also determines the length of each sequence in a batch
# for details see: https://zh.d2l.ai/chapter_recurrent-neural-networks/bptt.html
# repeat=False: an epoch ends after one pass over the dataset
train_iter, val_iter, test_iter=torchtext.legacy.data.BPTTIterator.splits(
    (train, val, test), 
    batch_size=BATCH_SIZE, 
    device=device, 
    bptt_len=SEQ_LENGTH, 
    repeat=False, 
    shuffle=True)

In [6]:
# quick test to get a better feel for the data
it=iter(train_iter)
batch=next(it)
print(batch)
# 20: sequence length seq_length (= bptt_len), along the time dimension; 32: batch_size
# [torchtext.legacy.data.batch.Batch of size 32]
# 	[.text]:[torch.LongTensor of size 20x32]
# 	[.target]:[torch.LongTensor of size 20x32]

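# text holds content from the file text8.train.txt
# target is text shifted by one word: it starts at the second word of text and ends one word later
# for each input word from the dataset, the target (output) is the word that follows it
# the goal of the model is to predict the next word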
print(batch.text)
print(batch.text.shape)
print(' '.join(TEXT.vocab.itos[i] for i in batch.text[:,0].data))
print()
print(' '.join(TEXT.vocab.itos[i] for i in batch.target[:,0].data))


[torchtext.legacy.data.batch.Batch of size 32]
	[.text]:[torch.cuda.LongTensor of size 20x32 (GPU 0)]
	[.target]:[torch.cuda.LongTensor of size 20x32 (GPU 0)]
tensor([[ 5269,  6271,   417,     9,     6,   375,   317,  2278,     6,    21,
            72,    54,   742,     2,  4434,   283,    23,   531,     0,     5,
           463,  5850,    22,  8624,  1455,    68,    11,    66,     2,  5931,
             3, 24395],
        [ 3110,     6,   288,     2,  3047,     2,    25,   109,   261,    50,
          6129,   892,     7, 24782,    25, 12713,    18,     5,   556,    10,
             7,  4664,     5,    43,   163,     5,     9,     2,  1311,    57,
           168,     6],
        [   13,  3593,   458,  1259,    40,   375,    10,   550,     3, 19798,
            21, 43004, 17114,     3,     2,     7,  2316,    10,   427,     5,
          1185,   127,    48,   504,  2461, 14097,     9,   277,     3,    12,
         27121,   314],
        [    7,     4, 11211, 21733,    55,    19,    11,

In [7]:
# take a few more batches from train_iter and look at the contents of text and target
for i in range(5):
    batch=next(it)
    print()
    print(i)
    print(' '.join(TEXT.vocab.itos[i] for i in batch.text[:,0].data))
    print()
    print(' '.join(TEXT.vocab.itos[i] for i in batch.target[:,0].data))


0
revolution and the sans <unk> of the french revolution whilst the term is still used in a pejorative way to

and the sans <unk> of the french revolution whilst the term is still used in a pejorative way to describe

1
describe any act that used violent means to destroy the organization of society it has also been taken up as

any act that used violent means to destroy the organization of society it has also been taken up as a

2
a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king

positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism

3
anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing

as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations

4
interpretations of what this means a

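## Define the model (a simple one)
- Subclass nn.Module
- The \_\_init\_\_ function
- The forward function
- Any other functions the model needs  

[**nn.Embedding and RNN input**](https://www.jianshu.com/p/63e7acc5e890)

**By default, PyTorch RNNs expect sequence length as the first dimension and batch_size as the second** 


For each batch:  
    The first input to the LSTM is the embedding of the first word of each of the batch_size 'sentences'  
    The second input to the LSTM is the embedding of the second word of each of those batch_size 'sentences'  
    ...  
    The seq_length-th input to the LSTM is the embedding of the seq_length-th word of each of those batch_size 'sentences'  
    The loss is then computed from these bptt_len (= seq_length) outputs and backpropagated through time  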

In [8]:
# define a simple one-layer recurrent model (an LSTM)
class RNNModel(nn.Module):
    # declare the required parameters
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        
        self.hidden_size=hidden_size
        
        # embedding layer
        self.embed=nn.Embedding(vocab_size, embed_size) # weight shape: (vocab_size, embed_size), here (50002, 100)
        # LSTM layer
        #https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
        self.lstm=nn.LSTM(embed_size, hidden_size)
        # batch_first=True: make batch_size the first dimension of the LSTM input
        # self.lstm=nn.LSTM(embed_size, hidden_size, batch_first=True)
        # decode the LSTM output into a vocab_size-dimensional vector to determine the predicted word
        self.linear=nn.Linear(hidden_size, vocab_size)
    
    # define the forward computation
    def forward(self, input_text, hidden):
        #forward pass
        #input_text: seq_length * batch_size(32)
        emb= self.embed(input_text) # seq_length * batch_size * embed_size
        # feed the embeddings into the LSTM
        # hidden: (hidden state, cell state), both of the same shape
        output, hidden = self.lstm(emb, hidden)
        # output: seq_length * batch_size * hidden_size
        # hidden: a tuple of two tensors, each 1 * batch_size * hidden_size; the leading 1 is the number of LSTM layers
        # the returned hidden has the same shape as the hidden that was passed in
        output_reshape=output.view(-1, output.shape[2]) #reshape output: (seq_length * batch_size) * hidden_size
        out_vocab=self.linear(output_reshape) # (seq_length * batch_size) * vocab_size
        # restore out_vocab to the original layout
        out_vocab=out_vocab.view(output.shape[0], output.shape[1], out_vocab.shape[-1]) # (seq_length * batch_size * vocab_size)
        
        return out_vocab, hidden
    
    # initialize the hidden state and the cell state
    def init_hidden(self, batch_size, requires_grad=True):
        # pick any one parameter tensor of the model; next() is simply the most convenient way
        # the reason for this step is given below
        weight= next(self.parameters())
        # initialize the hidden state and cell state with zeros
        # new_zeros ensures the new tensors have the same torch.dtype and torch.device as the model's other tensors
        hidden_state=weight.new_zeros((1, batch_size, self.hidden_size), requires_grad= requires_grad)
        cell_state=weight.new_zeros((1, batch_size, self.hidden_size), requires_grad= requires_grad)
        
        return (hidden_state, cell_state)

## Initialize the model

In [9]:
model=RNNModel(vocab_size=VOCAB_SIZE, embed_size=EMBEDDING_SIZE, hidden_size=HIDDEN_SIZE)
if USE_CUDA:
    model=model.to(device)

print(model)
print(next(model.parameters()))

RNNModel(
  (embed): Embedding(50002, 100)
  (lstm): LSTM(100, 100)
  (linear): Linear(in_features=100, out_features=50002, bias=True)
)
Parameter containing:
tensor([[-1.5256, -0.7502, -0.6540,  ...,  1.1899,  0.8165, -0.9135],
        [ 1.3851, -0.8138, -0.9276,  ..., -1.8475, -2.9167, -0.5673],
        [-0.5413,  0.8952, -0.8825,  ..., -0.0586,  1.1788,  0.6222],
        ...,
        [ 0.6637,  0.4019,  1.0508,  ..., -1.6378,  0.6289,  0.1546],
        [ 2.7030,  1.1254,  1.1153,  ...,  1.6220,  0.7710, -0.3384],
        [ 0.3367, -0.3162, -0.1132,  ...,  0.1047,  1.5384, -0.7781]],
       device='cuda:0', requires_grad=True)


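## Train and save the model
- A model usually needs to be trained for several epochs
- In each epoch we split all the data into batches
- Wrap the input and target of each batch as CUDA tensors
- Forward pass: from the input sequence, predict the next word for every word
- Compute the cross entropy loss between the model's predictions and the correct next words
- Backward pass
- Gradient clipping, to prevent exploding gradients
- Update the model parameters
- Zero the model's current gradients
- Every fixed number of iterations, print the loss at the current iteration and evaluate the model on the validation set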
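The gradient clipping step in the list above rescales the gradients whenever their overall norm exceeds a threshold. The sketch below shows that idea on a flat list of floats; `clip_by_global_norm` is a hypothetical helper for illustration, not PyTorch's code, although the scaling rule follows the same pattern as `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm; return (clipped, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale factor is capped at 1.0, so small gradients are left unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]  # L2 norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# Gradients already below the threshold pass through untouched.
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=5.0)
```

Clipping only shrinks gradients and keeps their direction, so it guards against exploding gradients but does nothing for vanishing ones.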

In [10]:
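# The hidden state / cell state tensor is a node in Torch's computation graph; like W, it depends on all earlier hidden / cell states
# Because the hidden state / cell state keeps being passed forward, the graph grows very large and deep, and memory may eventually run out
# So we use detach to cut the hidden state / cell state off from its history
# Backpropagation then starts afresh from the detached tensors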

#detach: https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html
#Returns a new Tensor, detached from the current graph.
#The result will never require gradient.

def repackage_hidden(hidden):
    # case 1: hidden is a Tensor
    
    # isinstance(object, classinfo)
    # returns True if the object is an instance of classinfo (or of a subclass), otherwise False
    if isinstance(hidden, torch.Tensor):
        return hidden.detach()
    # case 2: hidden is the (hidden_state, cell_state) tuple
    # recurse, detach both tensors, and reassemble them into a tuple
    else:
        return tuple(repackage_hidden(i) for i in hidden)

Define the loss function and the optimizer

In [11]:
loss_fn=nn.CrossEntropyLoss()
optimizer=torch.optim.Adam(model.parameters(),lr=learning_rate)

# each call to scheduler.step() lowers the learning rate
# 0.5: multiply the learning rate by 0.5, i.e. halve it
scheduler=torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.5)
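As a quick check of what an exponential decay with factor 0.5 does, the learning rate after k calls to `scheduler.step()` is the initial rate times 0.5 to the power k. The helper below is a hypothetical pure-Python illustration of that formula, not part of PyTorch.

```python
def lr_after_decays(initial_lr, gamma, num_steps):
    """Learning rate after num_steps multiplicative decays by gamma."""
    return initial_lr * gamma ** num_steps

# Starting from the notebook's learning_rate=0.001 with gamma=0.5.
schedule = [lr_after_decays(0.001, 0.5, k) for k in range(4)]
```

In this notebook `scheduler.step()` is called only when the validation loss fails to improve, so k counts those events rather than epochs.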

Define the evaluate function, used to evaluate the model on the validation set

In [12]:
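# stores the results of each evaluation on the validation set
# evaluate mostly mirrors the training loop up to the point where parameters would be updated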
val_losses=[]

def evaluate(model, input_data):
    #Sets the module in evaluation mode.
    model.eval()
    # accumulated total loss
    total_loss=0.
    # total number of predicted words
    total_count=0
    # turn the data into an iterator
    it=iter(input_data)
    
    # we are only predicting here, so no gradients are needed
    # temporarily disable gradient computation
    with torch.no_grad():
        # initialize the hidden state
        hidden= model.init_hidden(BATCH_SIZE, requires_grad=False)
        # enumerate: attaches an index to each item of the iterator
        for i, batch in enumerate(it):
            data, target = batch.text, batch.target # already on the target device, no transfer needed: print(batch.text)

            # before using hidden in each batch, detach it from its history
            # this way the previous hidden is still used, but BPTT stays within the current batch
            hidden=repackage_hidden(hidden)

            # in language modeling, each chunk of text continues the previous one
            # so the next batch / iteration / BPTT pass can keep using the previous hidden state
            output, hidden =model(data, hidden)

            # output shape: (seq_length, batch_size, vocab_size)
            # to compute the cross entropy loss, reshape output to (seq_length * batch_size, vocab_size)
            output=output.reshape(-1, VOCAB_SIZE)

            # compute the loss
            # target is also flattened into a vector
            # note that CrossEntropyLoss already includes LogSoftmax
            # output: (seq_length * batch_size, vocab_size)
            # target.view: (seq_length * batch_size)
            # loss is the mean loss over the seq_length * batch_size predictions
            loss=loss_fn(output, target.view(-1))
            # accumulate the total loss, weighted by the number of tokens in the batch
            total_loss+=loss.item() * np.multiply(*data.size())
            total_count+=np.multiply(*data.size())
    
    # switch back to training mode once evaluation is done
    model.train()
    
    loss=total_loss/total_count
    return loss
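The evaluate function multiplies each batch's mean loss by the number of tokens in that batch before averaging. A small pure-Python sketch shows why this matters when the last batch is shorter than the others; the loss values and the helper name `token_weighted_mean` are made up for illustration.

```python
def token_weighted_mean(batch_losses, batch_token_counts):
    """Average per-batch mean losses, weighting each batch by its token count."""
    total_loss = sum(l * n for l, n in zip(batch_losses, batch_token_counts))
    total_count = sum(batch_token_counts)
    return total_loss / total_count

# Two full batches of 20 x 32 tokens and a short final batch of 2 x 32 tokens.
losses = [6.0, 5.0, 4.0]
counts = [20 * 32, 20 * 32, 2 * 32]

weighted = token_weighted_mean(losses, counts)
naive = sum(losses) / len(losses)  # treats the short batch as equally important
```

The weighted value is the true per-token average, while the naive mean gives the short final batch the same influence as a full one.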

Train and save the model

In [13]:
for epoch in range(NUM_EPOCHES):
    #Sets the module in training mode.
    model.train()
    # turn train_iter into an iterator
    it=iter(train_iter)
    # initialize the hidden state
    hidden= model.init_hidden(BATCH_SIZE)
    # enumerate: attaches an index to each item of the iterator
    for i, batch in enumerate(it):
        data, target = batch.text, batch.target # already on the target device, no transfer needed: print(batch.text)
        
        # before using hidden in each batch, detach it from its history
        # this way the previous hidden is still used, but BPTT stays within the current batch
        hidden=repackage_hidden(hidden)
        
        # in language modeling, each chunk of training text continues the previous one
        # so the next batch / iteration / BPTT pass can keep using the previous hidden state
        output, hidden =model(data, hidden)
        
        # output shape: (seq_length, batch_size, vocab_size)
        # to compute the cross entropy loss, reshape output to (seq_length * batch_size, vocab_size)
        output=output.reshape(-1, VOCAB_SIZE)
        
        # compute the loss
        # target is also flattened into a vector
        # note that CrossEntropyLoss already includes LogSoftmax
        # output: (seq_length * batch_size, vocab_size)
        # target.view: (seq_length * batch_size)
        loss=loss_fn(output, target.view(-1))
        
        #backward
        loss.backward()
        
        # clip the gradient norm (not the parameters) to prevent exploding gradients
        # clipping does not help with vanishing gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
        
        # update the network parameters
        optimizer.step()
        
        # zero the gradients
        optimizer.zero_grad()
        
        # print the loss every 100 iterations
        if i%100==0:
            print('loss', i, ': ', loss.item())
            
            
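        ####################
        # save the model
        ####################
        # evaluate every 1000 iterations and save the model if it improved
        if i%1000 ==0:
            # evaluate the model on the validation dataset
            val_loss=evaluate(model, val_iter)
            print(i,', ','evaluate loss: ', val_loss, ' min of val_losses: ', min(val_losses) if len(val_losses)!=0 else 'Nan')
            # if this is the first evaluation, or the result beats all previous ones, save the model
            if len(val_losses)==0 or val_loss < min(val_losses):
                # model.state_dict(): an OrderedDict holding all of the model's parameters
                # 'language_model.pth': file name of the saved model
                torch.save(model.state_dict(), "language_model.pth")
                # note: the message below says ml.pth, but the file written above is language_model.pth
                print('best model saved to ml.pth')
            # otherwise the result is worse than the best so far, which suggests the learning rate may be too large
            # lowering the learning rate can let the loss keep decreasing
            else:
                scheduler.step()
                print('learning rate decay')
            val_losses.append(val_loss)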
                

loss 0 :  10.816372871398926
0 ,  evaluate loss:  10.80943772664499  min of val_losses:  Nan
best model saved to ml.pth
loss 100 :  7.199383735656738
loss 200 :  7.657475471496582
loss 300 :  7.2424821853637695
loss 400 :  6.971253395080566
loss 500 :  6.809700012207031
loss 600 :  6.5486884117126465
loss 700 :  6.947023868560791
loss 800 :  6.5856828689575195
loss 900 :  6.756621360778809
loss 1000 :  6.849720001220703
1000 ,  evaluate loss:  6.680168256303804  min of val_losses:  10.80943772664499
best model saved to ml.pth
loss 1100 :  6.598322868347168
loss 1200 :  6.6317458152771
loss 1300 :  6.6025519371032715
loss 1400 :  6.171170234680176
loss 1500 :  6.458670139312744
loss 1600 :  6.4029741287231445
loss 1700 :  6.377824306488037
loss 1800 :  6.318821430206299
loss 1900 :  6.365090370178223
loss 2000 :  6.404265403747559
2000 ,  evaluate loss:  6.42416395277828  min of val_losses:  6.680168256303804
best model saved to ml.pth
loss 2100 :  6.0813727378845215
loss 2200 :  6.2218

loss 19300 :  5.998446464538574
loss 19400 :  5.6315999031066895
loss 19500 :  5.305489540100098
loss 19600 :  5.405633449554443
loss 19700 :  5.351130485534668
loss 19800 :  5.340747833251953
loss 19900 :  5.660774230957031
loss 20000 :  5.5968122482299805
20000 ,  evaluate loss:  5.548941659442814  min of val_losses:  5.5720693486321
best model saved to ml.pth
loss 20100 :  5.901849269866943
loss 20200 :  4.952645301818848
loss 20300 :  5.665374755859375
loss 20400 :  5.764018535614014
loss 20500 :  5.888459205627441
loss 20600 :  5.764784336090088
loss 20700 :  5.514475345611572
loss 20800 :  5.5469465255737305
loss 20900 :  5.783713340759277
loss 21000 :  5.575328350067139
21000 ,  evaluate loss:  5.52874816051766  min of val_losses:  5.548941659442814
best model saved to ml.pth
loss 21100 :  5.716557502746582
loss 21200 :  5.741909503936768
loss 21300 :  5.4770588874816895
loss 21400 :  5.939389228820801
loss 21500 :  5.746910095214844
loss 21600 :  6.060485363006592
loss 21700 : 

loss 14900 :  5.453059196472168
loss 15000 :  5.6264967918396
15000 ,  evaluate loss:  5.385779374337618  min of val_losses:  5.385304015428445
learning rate decay
loss 15100 :  5.789181709289551
loss 15200 :  5.269637107849121
loss 15300 :  5.307678699493408
loss 15400 :  5.61185884475708
loss 15500 :  5.940357208251953
loss 15600 :  5.8243632316589355
loss 15700 :  5.576089859008789
loss 15800 :  5.648012161254883
loss 15900 :  5.709059715270996
loss 16000 :  5.940976619720459
16000 ,  evaluate loss:  5.37948165300604  min of val_losses:  5.385304015428445
best model saved to ml.pth
loss 16100 :  5.498838901519775
loss 16200 :  5.255729675292969
loss 16300 :  5.576426029205322
loss 16400 :  5.472335338592529
loss 16500 :  5.629510879516602
loss 16600 :  5.495746612548828
loss 16700 :  5.847012519836426
loss 16800 :  5.7742204666137695
loss 16900 :  5.686188220977783
loss 17000 :  5.3066840171813965
17000 ,  evaluate loss:  5.375715143930925  min of val_losses:  5.37948165300604
best 

KeyboardInterrupt: 

In [14]:
print(model.state_dict())

OrderedDict([('embed.weight', tensor([[-0.9702, -0.6682, -0.5829,  ...,  0.9879,  0.3909, -0.5976],
        [ 1.4040, -0.7950, -0.9087,  ..., -1.8664, -2.8979, -0.5862],
        [-0.5034,  1.0965, -0.7679,  ...,  0.1064,  1.1298,  0.9128],
        ...,
        [ 0.5915,  0.3283,  1.0743,  ..., -1.4168,  0.7816,  0.3870],
        [ 2.4916,  1.0625,  1.0658,  ...,  1.6001,  0.7269, -0.3997],
        [ 0.2299, -0.2738, -0.2193,  ...,  0.0397,  1.6360, -0.7308]],
       device='cuda:0')), ('lstm.weight_ih_l0', tensor([[ 0.5405,  0.2128, -0.4492,  ..., -0.1526, -0.0898, -0.1378],
        [ 0.0942,  0.2235,  0.0158,  ...,  0.3332, -0.0449, -0.0307],
        [ 0.1211, -0.0356,  0.1027,  ...,  0.1102,  0.0684, -0.0661],
        ...,
        [ 0.1776, -0.1508,  0.2254,  ...,  0.2558,  0.2720,  0.6122],
        [-0.1351, -0.0350,  0.0908,  ...,  0.0683,  0.0310, -0.2215],
        [-0.1463,  0.1297, -0.0244,  ..., -0.0258, -0.0372,  0.0979]],
       device='cuda:0')), ('lstm.weight_hh_l0', tensor

## Load the model

In [18]:
best_model=RNNModel(vocab_size=len(TEXT.vocab), 
                    embed_size=EMBEDDING_SIZE, 
                    hidden_size=HIDDEN_SIZE)
if USE_CUDA:
    best_model=best_model.to(device)

best_model.load_state_dict(torch.load('language_model.pth'))

<All keys matched successfully>

In [15]:
print(torch.load('language_model.pth'))

OrderedDict([('embed.weight', tensor([[-0.9659, -0.6699, -0.5871,  ...,  0.9922,  0.3876, -0.5927],
        [ 1.4040, -0.7950, -0.9087,  ..., -1.8664, -2.8979, -0.5862],
        [-0.5051,  1.0984, -0.7697,  ...,  0.0966,  1.1286,  0.9072],
        ...,
        [ 0.5915,  0.3283,  1.0743,  ..., -1.4168,  0.7816,  0.3870],
        [ 2.4916,  1.0625,  1.0658,  ...,  1.6001,  0.7269, -0.3997],
        [ 0.2299, -0.2738, -0.2193,  ...,  0.0397,  1.6360, -0.7308]],
       device='cuda:0')), ('lstm.weight_ih_l0', tensor([[ 0.5379,  0.2186, -0.4546,  ..., -0.1507, -0.0914, -0.1355],
        [ 0.0958,  0.2222,  0.0127,  ...,  0.3348, -0.0509, -0.0362],
        [ 0.1263, -0.0298,  0.0984,  ...,  0.1136,  0.0658, -0.0637],
        ...,
        [ 0.1749, -0.1459,  0.2257,  ...,  0.2547,  0.2676,  0.6108],
        [-0.1373, -0.0316,  0.0896,  ...,  0.0654,  0.0321, -0.2204],
        [-0.1424,  0.1321, -0.0286,  ..., -0.0247, -0.0383,  0.0964]],
       device='cuda:0')), ('lstm.weight_hh_l0', tensor

## Compute perplexity on the validation data with the best model

In [19]:
val_loss=evaluate(best_model, val_iter)
print('perplexity: ', np.exp(val_loss))

perplexity:  215.39051555023698
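Perplexity is the exponential of the mean per-token cross entropy loss (in nats), which is what `np.exp(val_loss)` computes. The sketch below is a minimal pure-Python illustration using a hypothetical model that assigns equal probability to every word in a 50002-word vocabulary: its loss is ln(50002) and its perplexity equals the vocabulary size.

```python
import math

def perplexity(mean_nll):
    """Perplexity for a mean negative log-likelihood measured in nats."""
    return math.exp(mean_nll)

vocab_size = 50002
uniform_loss = math.log(vocab_size)     # loss of a uniform predictor
uniform_ppl = perplexity(uniform_loss)  # recovers the vocabulary size
```

This gives a useful reference point: the first training loss logged in this notebook (about 10.82) is close to ln(50002), consistent with an untrained model predicting almost uniformly, and a perplexity in the low hundreds is far below the uniform baseline of 50002.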


## Compute perplexity on the test data with the best model

In [20]:
test_loss=evaluate(best_model, test_iter)
print('perplexity: ', np.exp(test_loss))

perplexity:  266.89665179177774


## Generate some sentences with the trained model

In [25]:
hidden=best_model.init_hidden(1)
device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# start from a randomly chosen word
input_word = torch.randint(VOCAB_SIZE, (1, 1), dtype=torch.long).to(device)
# the original cell called print(input), which prints Python's built-in input
# (hence the bound-method line in the recorded output); input_word is what was meant
print(input_word)
words=[]

for i in range(100):
    # run the forward pass
    output, hidden=best_model(input_word,hidden)
    if i==0:
        print(output.shape)
    # exponentiate the logits to get unnormalized probabilities
    word_weights=output.squeeze().exp().cpu()
    # multinomial sampling (not greedy: greedy decoding would take the argmax instead)
    word_idx=torch.multinomial(word_weights, 1)[0]
    # feed the sampled word back in as the next input
    input_word.fill_(word_idx)
    word=TEXT.vocab.itos[word_idx]
    words.append(word)
print(' '.join(words))

<bound method Kernel.raw_input of <ipykernel.ipkernel.IPythonKernel object at 0x000001FF2409FA08>>
history of bulletin how to efficacy for their <unk> paperback inc on the planet s showed at the science of westminster acclaim one eight four five located in several one seven zero of the five zero zero zero run away from once rating at first prominence reached rank the rhineland one nine seven three besides the union s two three five th counties of sicily one nine six one in fallen during world war ii charitable pool ends with negotiated larger audiences of a peace treaty the nearby only allowed to consonants against the curtain gradually indentured papal court who
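The generation loop samples the next word in proportion to the exponentiated logits rather than always taking the most likely word. The following pure-Python sketch contrasts the two choices on a toy logit vector; the helper names and the logits are hypothetical, and the standard library's `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    """Draw one index with probability proportional to exp(logit)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

def greedy_index(logits):
    """Always pick the index with the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.1]
rng = random.Random(0)  # seeded for repeatability
samples = [sample_index(logits, rng) for _ in range(1000)]
```

Sampling yields varied text because lower-probability words are still chosen occasionally, whereas greedy decoding returns the same continuation every time and tends to repeat itself.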
