<a href="https://colab.research.google.com/github/PnZheng/DeepLearning/blob/main/RNN/lstm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTM

Here we implement a **bidirectional** LSTM to solve the same problem as before.

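- We will use regularization to counter the overfitting we observed in this exercise.
- We will also use PyTorch's torchtext module to handle data loading and the processing pipeline more efficiently and concisely.
- The RNN model overfit the dataset during training, so the LSTM model uses dropout as its regularization mechanism.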
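
As a reminder of what dropout does, the following is a minimal sketch that is separate from the notebook's model. It assumes a toy input of ones and `p=0.5`; both are chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()          # training mode: elements are zeroed at random
train_out = dropout(x)   # surviving elements are scaled by 1 / (1 - p) = 2.0

dropout.eval()           # evaluation mode: dropout acts as the identity
eval_out = dropout(x)

print(train_out)
print(eval_out)
```

This is why `model.train()` and `model.eval()` have to be called at the right moments later on: the same layer behaves differently in the two modes.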


In [1]:
import os
import time
import numpy as np
from tqdm import tqdm
from string import punctuation
from collections import Counter
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Use CUDA when it is available; otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Seed the random number generator so that the experiment is reproducible
torch.manual_seed(123)

<torch._C.Generator at 0x1ee7bcb5a30>

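- We keep using the IMDb dataset from the earlier RNN exercise, but the pipeline around it changes.
- The powerful torchtext library tokenizes the reviews and builds the vocabulary.
- Finally, sequences are no longer padded by hand: torchtext pads each batch, and the padded batch is packed before it is passed to nn.LSTM.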

In [4]:
import random
from torchtext.legacy import (data, datasets)

### Use torchtext's submodules to read the data and split it into training, validation, and test sets


In [7]:
TEXT_FIELD = data.Field(tokenize = data.get_tokenizer("basic_english"), include_lengths = True)
LABEL_FIELD = data.LabelField(dtype = torch.float)

train_dataset, test_dataset = datasets.IMDB.splits(TEXT_FIELD, LABEL_FIELD)
# random.seed() returns None, so seed first and pass the RNG state explicitly
random.seed(123)
train_dataset, valid_dataset = train_dataset.split(random_state = random.getstate())

### Build the vocabulary with build_vocab


In [8]:
MAX_VOCABULARY_SIZE = 25000
TEXT_FIELD.build_vocab(train_dataset, 
                 max_size = MAX_VOCABULARY_SIZE)

LABEL_FIELD.build_vocab(train_dataset)
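
For intuition, the idea behind a size-capped vocabulary can be sketched in plain Python. This is a simplified illustration, not torchtext's implementation: the helper names `build_vocab` and `numericalize` are hypothetical, and the sketch assumes the unknown token sits at index 0 and the padding token at index 1. Two special tokens on top of 25000 words would account for the vocabulary size of 25002 that the model printout shows further down.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, unk_token="<unk>", pad_token="<pad>"):
    """Keep the max_size most frequent tokens, placed after the two special tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = [unk_token, pad_token] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Replace each token by its index; unseen tokens fall back to the unknown index."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

corpus = [["a", "good", "film"], ["a", "bad", "film"], ["a"]]
itos, stoi = build_vocab(corpus, max_size=2)
print(itos)                                         # ['<unk>', '<pad>', 'a', 'film']
print(numericalize(["a", "great", "film"], stoi))   # [2, 0, 3]
```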

In [10]:
B_SIZE = 64 # batch size

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_data_iterator, valid_data_iterator, test_data_iterator = data.BucketIterator.splits(
    (train_dataset, valid_dataset, test_dataset), 
    batch_size = B_SIZE,
    sort_within_batch = True,
    device = device)
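
The motivation for bucketing is to put sequences of similar length in the same batch so that little padding is needed, and `sort_within_batch = True` orders each batch by decreasing length, which the packing step used later expects. A much simplified sketch of the idea in plain Python follows. The helper `bucket_batches` is hypothetical; the real iterator also shuffles and returns tensors in a different layout.

```python
def bucket_batches(sequences, batch_size, pad_index=1):
    """Sort by length, cut into batches, and pad each batch to its own longest sequence."""
    ordered = sorted(sequences, key=len, reverse=True)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        longest = len(chunk[0])  # chunk is sorted, so its first sequence is the longest
        padded = [seq + [pad_index] * (longest - len(seq)) for seq in chunk]
        lengths = [len(seq) for seq in chunk]
        batches.append((padded, lengths))
    return batches

sequences = [[5, 6], [7, 8, 9, 10], [11], [12, 13, 14]]
batches = bucket_batches(sequences, batch_size=2)
for padded, lengths in batches:
    print(padded, lengths)
# [[7, 8, 9, 10], [12, 13, 14, 1]] [4, 3]
# [[5, 6], [11, 1]] [2, 1]
```

Only two padding values are needed here; padding all four sequences to the global maximum length would need six.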

### When training on a GPU, the following helper is needed for pack_padded_sequence to work correctly

### (Reference: https://discuss.pytorch.org/t/error-with-lengths-in-pack-padded-sequence/35517/3)

In [12]:
if torch.cuda.is_available():
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
from torch.nn.utils.rnn import pack_padded_sequence, PackedSequence

def cuda_pack_padded_sequence(input, lengths, batch_first=False, enforce_sorted=True):
    # the lengths must be an int64 tensor that lives on the CPU
    lengths = torch.as_tensor(lengths, dtype=torch.int64)
    lengths = lengths.cpu()
    if enforce_sorted:
        sorted_indices = None
    else:
        lengths, sorted_indices = torch.sort(lengths, descending=True)
        sorted_indices = sorted_indices.to(input.device)
        batch_dim = 0 if batch_first else 1
        input = input.index_select(batch_dim, sorted_indices)

    data, batch_sizes = \
    torch._C._VariableFunctions._pack_padded_sequence(input, lengths, batch_first)
    return PackedSequence(data, batch_sizes, sorted_indices)
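
To see what packing does, here is a small CPU-only sketch using the public `pack_padded_sequence` and `pad_packed_sequence` functions. The toy token indices, and the use of 0 as the padding value, are assumptions made for this illustration only.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences in (sequence_length, batch_size) layout; 0 marks padding here.
padded = torch.tensor([[4, 7],
                       [5, 8],
                       [6, 0]])
lengths = torch.tensor([3, 2])   # true lengths, in descending order, on the CPU

packed = pack_padded_sequence(padded, lengths)
print(packed.data)          # tensor([4, 7, 5, 8, 6])
print(packed.batch_sizes)   # tensor([2, 2, 1])

# Unpacking restores the padded layout together with the original lengths.
unpacked, unpacked_lengths = pad_packed_sequence(packed)
print(torch.equal(unpacked, padded))   # True
```

The packed data contains no padding entries, so the LSTM never processes the padded positions.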

## Defining the LSTM model

When using nn.LSTM, set the bidirectional argument to True and choose the dropout probability.


In [13]:
class LSTM(nn.Module):
    def __init__(self, vocabulary_size, embedding_dimension, hidden_dimension, output_dimension, dropout, pad_index):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocabulary_size, embedding_dimension, padding_idx = pad_index)
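        # Note: nn.LSTM applies its own dropout only between stacked layers, so with
        # num_layers=1 it has no effect (PyTorch warns); dropout_layer below does the work.
        self.lstm_layer = nn.LSTM(embedding_dimension, 
                           hidden_dimension, 
                           num_layers=1, 
                           bidirectional=True, 
                           dropout=dropout)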
        self.fc_layer = nn.Linear(hidden_dimension * 2, output_dimension)
        self.dropout_layer = nn.Dropout(dropout)
        
    def forward(self, sequence, sequence_lengths=None):
        if sequence_lengths is None:
            sequence_lengths = torch.LongTensor([len(sequence)])
        
        # sequence := (sequence_length, batch_size)
        embedded_output = self.dropout_layer(self.embedding_layer(sequence))
        
        
        # embedded_output := (sequence_length, batch_size, embedding_dimension)
        if torch.cuda.is_available():
            packed_embedded_output = cuda_pack_padded_sequence(embedded_output, sequence_lengths)
        else:
            packed_embedded_output = nn.utils.rnn.pack_padded_sequence(embedded_output, sequence_lengths)
        
        packed_output, (hidden_state, cell_state) = self.lstm_layer(packed_embedded_output)
        # hidden_state := (num_layers * num_directions, batch_size, hidden_dimension)
        # cell_state := (num_layers * num_directions, batch_size, hidden_dimension)
        
        op, op_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        # op := (sequence_length, batch_size, hidden_dimension * num_directions)
        
        hidden_output = torch.cat((hidden_state[-2,:,:], hidden_state[-1,:,:]), dim = 1)        
        # hidden_output := (batch_size, hidden_dimension * num_directions)
        
        return self.fc_layer(hidden_output)

    
INPUT_DIMENSION = len(TEXT_FIELD.vocab)
EMBEDDING_DIMENSION = 100
HIDDEN_DIMENSION = 32
OUTPUT_DIMENSION = 1
DROPOUT = 0.5
PAD_INDEX = TEXT_FIELD.vocab.stoi[TEXT_FIELD.pad_token]
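
The indexing `hidden_state[-2]` and `hidden_state[-1]` in `forward` picks the final states of the forward and backward directions. The sketch below checks this on a toy bidirectional `nn.LSTM` with unpacked, equal-length inputs; the sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dimension = 4
lstm = nn.LSTM(input_size=3, hidden_size=hidden_dimension, bidirectional=True)

inputs = torch.randn(5, 2, 3)   # (sequence_length, batch_size, input_size)
outputs, (hidden_state, cell_state) = lstm(inputs)

print(outputs.shape)        # torch.Size([5, 2, 8])
print(hidden_state.shape)   # torch.Size([2, 2, 4])

forward_final = hidden_state[-2]    # forward direction, after the last time step
backward_final = hidden_state[-1]   # backward direction, after the first time step
sentence_vector = torch.cat((forward_final, backward_final), dim=1)
print(sentence_vector.shape)        # torch.Size([2, 8])

# The same vectors sit at opposite ends of `outputs`.
print(torch.allclose(forward_final, outputs[-1, :, :hidden_dimension]))   # True
print(torch.allclose(backward_final, outputs[0, :, hidden_dimension:]))   # True
```

Concatenating the two gives a vector of size `hidden_dimension * 2`, which is why the linear layer takes `hidden_dimension * 2` input features.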

### Instantiating the LSTM

In [28]:
lstm_model = LSTM(INPUT_DIMENSION, 
            EMBEDDING_DIMENSION, 
            HIDDEN_DIMENSION, 
            OUTPUT_DIMENSION, 
            DROPOUT, 
            PAD_INDEX)

print(lstm_model)

LSTM(
  (embedding_layer): Embedding(25002, 100, padding_idx=1)
  (lstm_layer): LSTM(100, 32, dropout=0.5, bidirectional=True)
  (fc_layer): Linear(in_features=64, out_features=1, bias=True)
  (dropout_layer): Dropout(p=0.5, inplace=False)
)


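### Two special tokens are involved here
- UNK_INDEX is used for words that are not in the vocabulary
- PAD_INDEX is used for the token that pads sequences

We therefore set the embeddings of both tokens to zero.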

In [18]:
UNK_INDEX = TEXT_FIELD.vocab.stoi[TEXT_FIELD.unk_token]

lstm_model.embedding_layer.weight.data[UNK_INDEX] = torch.zeros(EMBEDDING_DIMENSION)
lstm_model.embedding_layer.weight.data[PAD_INDEX] = torch.zeros(EMBEDDING_DIMENSION)
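
The two tokens behave differently during training. The sketch below uses a toy `nn.Embedding` (sizes chosen only for illustration) to show that a row registered through `padding_idx` starts at zero and receives no gradient, whereas other rows, which would include the unknown-token row, are updated as usual.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_index = 1
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=pad_index)
print(embedding.weight[pad_index])        # all zeros right after construction

tokens = torch.tensor([[2, 3, pad_index]])
embedding(tokens).sum().backward()
print(embedding.weight.grad[pad_index])   # zeros: the padding row gets no gradient
print(embedding.weight.grad[2])           # ones: regular rows are trained as usual
```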

### Optimizer and loss function

In [19]:
optim = torch.optim.Adam(lstm_model.parameters())
loss_func = nn.BCEWithLogitsLoss()

lstm_model = lstm_model.to(device)
loss_func = loss_func.to(device)
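
`nn.BCEWithLogitsLoss` expects raw model outputs (logits) rather than probabilities, which is why the model's `forward` does not apply a sigmoid. A small check on made-up values that it matches a sigmoid followed by `nn.BCELoss`:

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])
labels = torch.tensor([1.0, 0.0, 0.0])

fused = nn.BCEWithLogitsLoss()(logits, labels)
two_step = nn.BCELoss()(torch.sigmoid(logits), labels)
print(torch.allclose(fused, two_step))   # True
```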

### Accuracy metric, expressed as a float between 0 and 1

In [20]:
def accuracy_metric(predictions, ground_truth):
    """
    Returns 0-1 accuracy for the given set of predictions and ground truth
    """
    # round predictions to either 0 or 1
    rounded_predictions = torch.round(torch.sigmoid(predictions))
    success = (rounded_predictions == ground_truth).float() #convert into float for division 
    accuracy = success.sum() / len(success)
    return accuracy
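
A worked example on four made-up logits may help. The function is repeated so that the snippet runs on its own.

```python
import torch

def accuracy_metric(predictions, ground_truth):
    rounded_predictions = torch.round(torch.sigmoid(predictions))
    success = (rounded_predictions == ground_truth).float()
    return success.sum() / len(success)

# A positive logit maps to a probability above 0.5 and is rounded to 1.
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
result = accuracy_metric(logits, labels).item()
print(result)   # 0.75
```

The rounded predictions are `[1, 0, 1, 0]`, so three of the four labels match.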

### Training function; returns the average loss and accuracy over the epoch

In [21]:
def train(model, data_iterator, optim, loss_func):
    loss = 0
    accuracy = 0
    model.train()
    
    for curr_batch in data_iterator:
        optim.zero_grad()
        sequence, sequence_lengths = curr_batch.text
        preds = model(sequence, sequence_lengths).squeeze(1)
        
        loss_curr = loss_func(preds, curr_batch.label)
        accuracy_curr = accuracy_metric(preds, curr_batch.label)
        
        loss_curr.backward()
        optim.step()
        
        loss += loss_curr.item()
        accuracy += accuracy_curr.item()
        
    return loss/len(data_iterator), accuracy/len(data_iterator)

### Validation function; returns the average loss and accuracy

In [22]:
def validate(model, data_iterator, loss_func):
    loss = 0
    accuracy = 0
    model.eval()
    
    with torch.no_grad():
        for curr_batch in data_iterator:
            sequence, sequence_lengths = curr_batch.text
            preds = model(sequence, sequence_lengths).squeeze(1)
            
            loss_curr = loss_func(preds, curr_batch.label)
            accuracy_curr = accuracy_metric(preds, curr_batch.label)

            loss += loss_curr.item()
            accuracy += accuracy_curr.item()
        
    return loss/len(data_iterator), accuracy/len(data_iterator)

### Training loop

In [23]:
num_epochs = 10
best_validation_loss = float('inf')

for ep in range(num_epochs):

    time_start = time.time()
    
    training_loss, train_accuracy = train(lstm_model, train_data_iterator, optim, loss_func)
    validation_loss, validation_accuracy = validate(lstm_model, valid_data_iterator, loss_func)
    
    time_end = time.time()
    time_delta = time_end - time_start 
    # save the model parameters whenever the validation loss improves
    if validation_loss < best_validation_loss:
        best_validation_loss = validation_loss
        torch.save(lstm_model.state_dict(), 'lstm_model.pt')
    
    print(f'epoch number: {ep+1} | time elapsed: {time_delta}s')
    print(f'training loss: {training_loss:.3f} | training accuracy: {train_accuracy*100:.2f}%')
    print(f'validation loss: {validation_loss:.3f} |  validation accuracy: {validation_accuracy*100:.2f}%')
    print()

epoch number: 1 | time elapsed: 17.161153078079224s
training loss: 0.687 | training accuracy: 54.12%
validation loss: 0.668 |  validation accuracy: 59.39%

epoch number: 2 | time elapsed: 16.466914653778076s
training loss: 0.649 | training accuracy: 62.04%
validation loss: 0.711 |  validation accuracy: 63.63%

epoch number: 3 | time elapsed: 16.44555950164795s
training loss: 0.573 | training accuracy: 70.28%
validation loss: 0.654 |  validation accuracy: 69.75%

epoch number: 4 | time elapsed: 16.377601861953735s
training loss: 0.516 | training accuracy: 74.99%
validation loss: 0.662 |  validation accuracy: 69.77%

epoch number: 5 | time elapsed: 16.301535606384277s
training loss: 0.464 | training accuracy: 78.67%
validation loss: 0.662 |  validation accuracy: 73.17%

epoch number: 6 | time elapsed: 16.297365427017212s
training loss: 0.432 | training accuracy: 80.22%
validation loss: 0.633 |  validation accuracy: 73.68%

epoch number: 7 | time elapsed: 16.212588787078857s
training loss

### After training, we load the best-performing model and evaluate it on the test set; torch.load reads the .pt file saved earlier

In [25]:
lstm_model.load_state_dict(torch.load('./lstm_model.pt'))

test_loss, test_accuracy = validate(lstm_model, test_data_iterator, loss_func)

print(f'test loss: {test_loss:.3f} | test accuracy: {test_accuracy*100:.2f}%')

test loss: 0.597 | test accuracy: 75.92%
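
A checkpoint saved on a GPU machine can be loaded elsewhere by passing `map_location` to `torch.load`. The sketch below shows a save and load round trip on a toy `nn.Linear` model; the model and the file name `toy_model.pt` are hypothetical stand-ins, not part of this notebook.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
path = os.path.join(tempfile.mkdtemp(), "toy_model.pt")   # hypothetical file name
torch.save(model.state_dict(), path)

restored = nn.Linear(4, 1)   # must have the same architecture as the saved model
restored.load_state_dict(torch.load(path, map_location="cpu"))
print(torch.equal(model.weight, restored.weight))   # True
```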


### Sentiment inference function

In [26]:
def sentiment_inference(model, sentence):
    model.eval()
    
    # convert the sentence to vocabulary indices
    tokenized = data.get_tokenizer("basic_english")(sentence)
    tokenized = [TEXT_FIELD.vocab.stoi[t] for t in tokenized]
    
    # run the model on a batch of size 1
    model_input = torch.LongTensor(tokenized).to(device)
    model_input = model_input.unsqueeze(1)
    
    pred = torch.sigmoid(model(model_input))
    
    return pred.item()

### Test the model on hand-written movie reviews


In [32]:
print(sentiment_inference(lstm_model, "This film is horrible"))
print(sentiment_inference(lstm_model, "Director tried too hard but this film is bad"))
print(sentiment_inference(lstm_model, "Decent movie, although could be shorter"))
print(sentiment_inference(lstm_model, "This film will be houseful for weeks"))
print(sentiment_inference(lstm_model, "I loved the movie, every part of it"))

0.4716528356075287
0.5116175413131714
0.5592977404594421
0.4116787314414978
0.5025923252105713


## Exercise

- How does the model's accuracy change when you vary the dropout probability during training?