# Language Model (input: a sentence; output: the probability of that sentence)

Goal: predict the next word from the words that precede it.
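
To connect the two statements above: if a model can assign a probability to each next word given the preceding words, the probability of a whole sentence is the product of those conditional probabilities. The following is a minimal sketch with made-up numbers; `cond_probs` is a hypothetical list, not the output of any model in this notebook.

```python
import math

# Hypothetical values of P(word_t | word_1 ... word_{t-1}) for a three-word sentence.
cond_probs = [0.2, 0.1, 0.5]

# Sum log-probabilities instead of multiplying raw probabilities,
# which avoids numerical underflow on long sentences.
log_prob = sum(math.log(p) for p in cond_probs)
sentence_prob = math.exp(log_prob)

print(log_prob, sentence_prob)
```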


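Learning objectives
- Learn what a language model is and how to train one
- Learn the basics of torchtext
    - Building a vocabulary
    - word to index and index to word
- Learn some basic torch.nn modules
    - Linear
    - RNN
    - LSTM
    - GRU
- Training techniques for RNNs
    - Gradient Clipping
- How to save and load a model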

## Import the required packages

In [1]:
import torchtext
import torch
import numpy as np
import random
import os

USE_CUDA=torch.cuda.is_available()
device=torch.device('cuda' if USE_CUDA else 'cpu')

# Fix the random seeds for reproducibility
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)
if USE_CUDA:
    torch.cuda.manual_seed(1)

# Number of sequences in one batch
BATCH_SIZE=32 
# Dimension of the word embeddings
EMBEDDING_SIZE=650
MAX_VOCAB_SIZE=50000

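## Create the vocabulary
- Install [torchtext](https://github.com/pytorch/text)  (used for text preprocessing)  
    pip install torchtext  
- Use torchtext to create the vocabulary, then read the data into batches. Please read the README yourself to learn torchtext.  
- **Note the API changes** (these classes moved under `torchtext.legacy`):  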
    torchtext.data.Field -> torchtext.legacy.data.Field  
    torchtext.datasets.LanguageModelingDataset -> torchtext.legacy.datasets.LanguageModelingDataset  
    torchtext.data.BPTTIterator -> torchtext.legacy.data.BPTTIterator  

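### Preprocess the data with a Field; create three datasets with the LanguageModelingDataset class
We continue to use the text8 dataset as the training, validation, and test data.
1. A key concept in TorchText is the [Field](https://torchtext.readthedocs.io/en/latest/data.html#field), which determines how the data is processed.  
    We use a field named TEXT to process the text data.  
    Our TEXT field is created with lower=True, so every word is lowercased.  
    torchtext provides the LanguageModelingDataset class to help handle language-modeling datasets.  
2. build_vocab builds a vocabulary of the most frequent words from the training dataset we provide; max_size caps the total number of words.
3. BPTTIterator yields consecutive, contiguous stretches of text. [BPTT](https://zh.d2l.ai/chapter_recurrent-neural-networks/bptt.html) stands for back propagation through time.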

In [29]:
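# Resolve the dataset path
# '__file__' is quoted on purpose: a notebook does not define __file__, so this resolves against the current working directory
script_path=os.path.abspath('__file__')
dir_path=os.path.dirname(script_path)
path=os.path.join(dir_path,'text8')
print(path,type(path))

# Create a Field named TEXT
# lower=True: lowercase every word
TEXT=torchtext.legacy.data.Field(lower=True)
# Create the train, val, and test datasets for language modeling
# by reading the three split files below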
train, val, test = torchtext.legacy.datasets.LanguageModelingDataset.splits(path=path, 
                                                                            train='text8.train.txt', 
                                                                            validation='text8.dev.txt', 
                                                                            test='text8.test.txt', 
                                                                            text_field=TEXT)
# print(train)
# print(dir(train))
# print(train.examples)

C:\Users\Re_AC\Desktop\Pytorch\myTorch\3\languageModelNoteBook\text8 <class 'str'>


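### Create the Vocabulary
- Creating the vocabulary corresponds to the step that builds the vocab in __myTorch/2/wordEmbeddingNotebook/2.ipynb#数据预处理及相关操作__
- Concretely, the MAX_VOCAB_SIZE most frequent words in the dataset are taken as the Vocabulary
- The vocabulary has 50002 entries rather than 50000 because TorchText adds two special tokens:  
    \< unk \>: unknown, for words that are not in the vocabulary  
    \< pad \>: padding; when a sentence is too short, \< pad \> is appended to the end to fill it to the required length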
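
The idea behind the bullets above can be sketched in plain Python. This is only an illustration of the concept, not torchtext's implementation: the function `build_vocab` below is hypothetical, and the tie-breaking rule (alphabetical order among equally frequent words) is an assumption made to keep the result deterministic.

```python
from collections import Counter, defaultdict

def build_vocab(tokens, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens and prepend the special tokens."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically (assumed tie-break).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_size]
    itos = list(specials) + [word for word, _ in ranked]  # index -> string
    stoi = defaultdict(int)  # string -> index; unseen words map to 0, i.e. <unk>
    stoi.update({word: index for index, word in enumerate(itos)})
    return itos, stoi

tokens = "the cat sat on the mat the cat".split()
itos, stoi = build_vocab(tokens, max_size=3)

print(itos)         # the special tokens come first
print(len(itos))    # max_size + 2, which is why 50000 becomes 50002
print(stoi["dog"])  # a word outside the vocabulary maps to the <unk> index
```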

In [3]:
# Build the vocabulary from the training dataset, limited to MAX_VOCAB_SIZE words
TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)
# Note: the vocabulary has 50002 entries, not the 50000 specified by MAX_VOCAB_SIZE

VOCAB_SIZE = len(TEXT.vocab)
print(len(TEXT.vocab)) #vocabulary size

#itos: index to string
print(type(TEXT.vocab.itos))
print(TEXT.vocab.itos[:10]) # note <unk> and <pad>

#stoi: string to index
print(type(TEXT.vocab.stoi))
print(TEXT.vocab.stoi['apple'])

50002
<class 'list'>
['<unk>', '<pad>', 'the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero']
<class 'collections.defaultdict'>
1259


### Create batches (iterator)
Create batches for the datasets; each batch contains BATCH_SIZE sequences.

In [15]:
#bptt_len:  Length of sequences for backpropagation through time.
# This also sets the length of every sequence in a batch
# For details see: https://zh.d2l.ai/chapter_recurrent-neural-networks/bptt.html
# repeat=False: an epoch ends after one pass over the dataset
train_iter, val_iter, test_iter=torchtext.legacy.data.BPTTIterator.splits(
    (train, val, test), 
    batch_size=BATCH_SIZE, 
    device=device, 
    bptt_len=50, 
    repeat=False, 
    shuffle=True)
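
As a rough picture of what the iterator above does, the following pure-Python sketch cuts a token stream into `batch_size` parallel columns and then slices each column into chunks of `bptt_len`, with the target being the text shifted by one token. The function `bptt_batches` is hypothetical and simplified: it trims leftover tokens, which is not necessarily how torchtext handles them.

```python
def bptt_batches(tokens, batch_size, bptt_len):
    """Yield (text, target) pairs laid out as [seq_len][batch_size] nested lists."""
    # Split the stream into batch_size contiguous columns (leftover tokens are dropped).
    col_len = len(tokens) // batch_size
    columns = [tokens[i * col_len:(i + 1) * col_len] for i in range(batch_size)]
    # Walk down the columns in steps of bptt_len; the last token has no target.
    for start in range(0, col_len - 1, bptt_len):
        seq_len = min(bptt_len, col_len - 1 - start)
        text = [[col[start + t] for col in columns] for t in range(seq_len)]
        target = [[col[start + t + 1] for col in columns] for t in range(seq_len)]
        yield text, target

tokens = list(range(20))  # stand-in for a stream of word indices
batches = list(bptt_batches(tokens, batch_size=2, bptt_len=4))
first_text, first_target = batches[0]

print(first_text)    # rows are time steps, columns are batch entries
print(first_target)  # the same positions shifted by one token
```

Because each column is a contiguous stretch of the corpus, consecutive batches continue where the previous one stopped, which matches the contiguous text seen when printing successive batches.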

In [23]:
# Quick test to build intuition
it=iter(train_iter)
batch=next(it)
print(batch)
# 50: sequence length (=bptt_len)  32: batch_size
# [torchtext.legacy.data.batch.Batch of size 32]
# 	[.text]:[torch.LongTensor of size 50x32]
# 	[.target]:[torch.LongTensor of size 50x32]

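# text holds the contents of the file text8.train.txt
# target is text shifted by one word: it starts at the second word of text and ends one word later
# For each input word from the dataset, the target (expected output) is the next word in the dataset
# The model's job is to predict what the next word is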
print(batch.text)
print(' '.join(TEXT.vocab.itos[i] for i in batch.text[:,0].data))
print()
print(' '.join(TEXT.vocab.itos[i] for i in batch.target[:,0].data))


[torchtext.legacy.data.batch.Batch of size 32]
	[.text]:[torch.LongTensor of size 50x32]
	[.target]:[torch.LongTensor of size 50x32]
tensor([[ 5269,  6271,   417,  ...,  5931,     3, 24395],
        [ 3110,     6,   288,  ...,    57,   168,     6],
        [   13,  3593,   458,  ...,    12, 27121,   314],
        ...,
        [    8,  1576,     3,  ...,    98,     4,     8],
        [ 3661,     2,   173,  ...,    33,     6,  6264],
        [    2,  2694,  1284,  ...,   479,  2526,    68]])
anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans <unk> of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the

originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans <unk> of the french revolution whilst the term is still used in a pejorativ

In [24]:
# Fetch a few more batches from train_iter and inspect text and target
for i in range(5):
    batch=next(it)
    print()
    print(i)
    print(' '.join(TEXT.vocab.itos[i] for i in batch.text[:,0].data))
    print()
    print(' '.join(TEXT.vocab.itos[i] for i in batch.target[:,0].data))


0
organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing

of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations

1
interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or <unk> but rather a harmonious anti authoritarian society in place of what are regarded

of what this means anarchism also refers to re

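## Define the model (a simple one)
- Subclass nn.Module
- The \_\_init\_\_ function
- The forward function
- Define any other functions the model needs  


**By default, PyTorch RNN modules expect sequence length as the first dimension and batch_size as the second.**  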
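
To make the dimension ordering concrete, here is a small plain-Python sketch (nested lists stand in for tensors, and the token ids are made up). Data that is naturally written one sentence per row is batch-first; transposing it gives the sequence-first layout described above, where each row is one time step.

```python
# Two hypothetical sentences of three token ids each: shape [batch_size=2][seq_len=3].
batch_first = [[5, 6, 7],
               [8, 9, 10]]

# Transpose to [seq_len=3][batch_size=2]: row t holds token t of every sentence.
seq_first = [list(step) for step in zip(*batch_first)]

print(seq_first)
```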

In [None]:
import torch.nn as nn

# Define a simple single-layer recurrent model (an LSTM is used here)
class RNNModel(nn.Module):
    # Declare the layers and the sizes they need
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        # Embedding layer
        self.embed=nn.Embedding(vocab_size, embed_size) # weight shape: (50002, 650) 
        # LSTM layer
        self.lstm=nn.LSTM(embed_size, hidden_size)
        # batch_first=True would make batch_size the first dimension of the LSTM input
        # self.lstm=nn.LSTM(embed_size, hidden_size, batch_first=True)
        # Decode the LSTM output into a vocab_size-dimensional vector to determine the predicted word
        self.decoder=nn.Linear(hidden_size, vocab_size)
    
    # Define the network architecture
    def forward(self, input_text, hidden):
        # forward pass
        # input_text: seq_length(50) * batch_size(32)
        emb= self.embed(input_text) # seq_length * batch_size * embed_size
        # Pass the embeddings to the LSTM
        output, hidden = self.lstm(emb, hidden)