<a href="https://colab.research.google.com/github/All4Nothing/pytorch-DL-project/blob/main/Ch04_Deep_RNN(2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 04. Deep RNN Architecture

## Building a Bidirectional LSTM

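LSTMs handle longer sequences better thanks to the gates in their memory cell, which help preserve important information even from many time steps back and forget irrelevant information even when it is recent. By keeping exploding and vanishing gradients in check, LSTMs perform better on long movie reviews.  
We will also use a bidirectional model, which widens the context window available at every time step so that the model can make a better-informed decision about the sentiment of a movie review.  
Finally, to counter overfitting, we apply dropout to the LSTM model as a regularization method.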
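
As a rough illustration of what "bidirectional" means here, the toy sketch below reads a sequence once left to right and once right to left, then concatenates the two final states. It uses a scalar tanh recurrence instead of a real LSTM, and the function names and weight values are hypothetical, chosen only for the example.

```python
import math

def run_rnn(sequence, w_x, w_h):
    """Toy scalar RNN: return the final hidden state after reading `sequence`."""
    h = 0.0
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h)
    return h

def bidirectional_summary(sequence, fwd=(0.5, 0.8), bwd=(-0.3, 0.6)):
    """Concatenate the final states of a forward and a backward pass.

    Each direction has its own (w_x, w_h) weights, as the two directions
    of a bidirectional layer do.
    """
    forward_state = run_rnn(sequence, *fwd)
    backward_state = run_rnn(list(reversed(sequence)), *bwd)
    return [forward_state, backward_state]

tokens = [0.2, -1.0, 0.7, 1.5]
summary = bidirectional_summary(tokens)
print(summary)  # two numbers: one summary per reading direction
```

The model defined later in this notebook does the analogous thing with vectors: it concatenates the last forward and last backward hidden states, which is why the final linear layer takes `hidden_dimension * 2` inputs.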

In [1]:
import os
import time
import numpy as np
from tqdm import tqdm
from string import punctuation
from collections import Counter
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(123)

<torch._C.Generator at 0x7f94bdbf4170>

In [2]:
!pip install torchtext==0.6.0 # pin torchtext to work around import errors in newer versions



In [3]:
import random
from torchtext import (data, datasets) # on newer torchtext releases that provide torchtext.legacy: from torchtext.legacy import (data, datasets)

In [4]:
TEXT_FIELD = data.Field(tokenize = data.get_tokenizer("basic_english"), include_lengths = True)
LABEL_FIELD = data.LabelField(dtype = torch.float)

train_dataset, test_dataset = datasets.IMDB.splits(TEXT_FIELD, LABEL_FIELD)
random.seed(123)  # random.seed() returns None, so seed first and pass the RNG state explicitly
train_dataset, valid_dataset = train_dataset.split(random_state = random.getstate())

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:08<00:00, 9.48MB/s]


In [5]:
MAX_VOCABULARY_SIZE = 25000

TEXT_FIELD.build_vocab(train_dataset,
                 max_size = MAX_VOCABULARY_SIZE)

LABEL_FIELD.build_vocab(train_dataset)

In [6]:
B_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_data_iterator, valid_data_iterator, test_data_iterator = data.BucketIterator.splits(
    (train_dataset, valid_dataset, test_dataset),
    batch_size = B_SIZE,
    sort_within_batch = True,
    device = device)

In [7]:
## If you are training using GPUs, we need to use the following function for the pack_padded_sequence method to work
## (reference : https://discuss.pytorch.org/t/error-with-lengths-in-pack-padded-sequence/35517/3)
if torch.cuda.is_available():
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
from torch.nn.utils.rnn import pack_padded_sequence, PackedSequence

def cuda_pack_padded_sequence(input, lengths, batch_first=False, enforce_sorted=True):
    lengths = torch.as_tensor(lengths, dtype=torch.int64)
    lengths = lengths.cpu()
    if enforce_sorted:
        sorted_indices = None
    else:
        lengths, sorted_indices = torch.sort(lengths, descending=True)
        sorted_indices = sorted_indices.to(input.device)
        batch_dim = 0 if batch_first else 1
        input = input.index_select(batch_dim, sorted_indices)

    data, batch_sizes = \
    torch._C._VariableFunctions._pack_padded_sequence(input, lengths, batch_first)
    return PackedSequence(data, batch_sizes, sorted_indices)

In [8]:
class LSTM(nn.Module):
    def __init__(self, vocabulary_size, embedding_dimension, hidden_dimension, output_dimension, dropout, pad_index):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocabulary_size, embedding_dimension, padding_idx = pad_index)
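        # nn.LSTM applies its own dropout only between stacked layers, so with
        # num_layers=1 it is omitted here; self.dropout_layer handles regularization.
        self.lstm_layer = nn.LSTM(embedding_dimension,
                           hidden_dimension,
                           num_layers=1,
                           bidirectional=True)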
        self.fc_layer = nn.Linear(hidden_dimension * 2, output_dimension)
        self.dropout_layer = nn.Dropout(dropout)

    def forward(self, sequence, sequence_lengths=None):
        if sequence_lengths is None:
            sequence_lengths = torch.LongTensor([len(sequence)])

        # sequence := (sequence_length, batch_size)
        embedded_output = self.dropout_layer(self.embedding_layer(sequence))


        # embedded_output := (sequence_length, batch_size, embedding_dimension)
        if torch.cuda.is_available():
            packed_embedded_output = cuda_pack_padded_sequence(embedded_output, sequence_lengths)
        else:
            packed_embedded_output = nn.utils.rnn.pack_padded_sequence(embedded_output, sequence_lengths)

        packed_output, (hidden_state, cell_state) = self.lstm_layer(packed_embedded_output)
        # hidden_state := (num_layers * num_directions, batch_size, hidden_dimension)
        # cell_state := (num_layers * num_directions, batch_size, hidden_dimension)

        op, op_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        # op := (sequence_length, batch_size, hidden_dimension * num_directions)

        # concatenate the final forward (index -2) and backward (index -1) hidden states
        hidden_output = torch.cat((hidden_state[-2,:,:], hidden_state[-1,:,:]), dim = 1)
        # hidden_output := (batch_size, hidden_dimension * num_directions)

        return self.fc_layer(hidden_output)


INPUT_DIMENSION = len(TEXT_FIELD.vocab)
EMBEDDING_DIMENSION = 100
HIDDEN_DIMENSION = 32
OUTPUT_DIMENSION = 1
DROPOUT = 0.5
PAD_INDEX = TEXT_FIELD.vocab.stoi[TEXT_FIELD.pad_token]

lstm_model = LSTM(INPUT_DIMENSION,
            EMBEDDING_DIMENSION,
            HIDDEN_DIMENSION,
            OUTPUT_DIMENSION,
            DROPOUT,
            PAD_INDEX)



In [9]:
UNK_INDEX = TEXT_FIELD.vocab.stoi[TEXT_FIELD.unk_token]

lstm_model.embedding_layer.weight.data[UNK_INDEX] = torch.zeros(EMBEDDING_DIMENSION)
lstm_model.embedding_layer.weight.data[PAD_INDEX] = torch.zeros(EMBEDDING_DIMENSION)

In [10]:
optim = torch.optim.Adam(lstm_model.parameters())
loss_func = nn.BCEWithLogitsLoss()

lstm_model = lstm_model.to(device)
loss_func = loss_func.to(device)
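
The model returns raw logits rather than probabilities, and `nn.BCEWithLogitsLoss` applies the sigmoid internally. The plain-Python sketch below shows one common numerically stable way to compute binary cross-entropy directly from a logit and compares it with the naive sigmoid-then-log version. It is an illustration of the idea, not PyTorch's actual implementation, and the function names are hypothetical.

```python
import math

def bce_with_logits(logit, target):
    """Stable binary cross-entropy on a raw logit; target is 0.0 or 1.0."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Same quantity via an explicit sigmoid; breaks down for very large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

for logit in (-3.0, -0.5, 0.0, 2.0):
    for target in (0.0, 1.0):
        print(logit, target, bce_with_logits(logit, target), naive_bce(logit, target))

# The stable form still works where the naive one would take log(0).
print(bce_with_logits(1000.0, 1.0))
```

For moderate logits the two versions agree; the stable form is preferable because it never evaluates the logarithm of zero.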

In [11]:
def accuracy_metric(predictions, ground_truth):
    """
    Returns 0-1 accuracy for the given set of predictions and ground truth
    """
    # round predictions to either 0 or 1
    rounded_predictions = torch.round(torch.sigmoid(predictions))
    success = (rounded_predictions == ground_truth).float() #convert into float for division
    accuracy = success.sum() / len(success)
    return accuracy

In [12]:
def train(model, data_iterator, optim, loss_func):
    loss = 0
    accuracy = 0
    model.train()

    for curr_batch in data_iterator:
        optim.zero_grad()
        sequence, sequence_lengths = curr_batch.text
        preds = model(sequence, sequence_lengths).squeeze(1)  # use the model passed in, not the global

        loss_curr = loss_func(preds, curr_batch.label)
        accuracy_curr = accuracy_metric(preds, curr_batch.label)

        loss_curr.backward()
        optim.step()

        loss += loss_curr.item()
        accuracy += accuracy_curr.item()

    return loss/len(data_iterator), accuracy/len(data_iterator)

In [13]:
def validate(model, data_iterator, loss_func):
    loss = 0
    accuracy = 0
    model.eval()

    with torch.no_grad():
        for curr_batch in data_iterator:
            sequence, sequence_lengths = curr_batch.text
            preds = model(sequence, sequence_lengths).squeeze(1)

            loss_curr = loss_func(preds, curr_batch.label)
            accuracy_curr = accuracy_metric(preds, curr_batch.label)

            loss += loss_curr.item()
            accuracy += accuracy_curr.item()

    return loss/len(data_iterator), accuracy/len(data_iterator)

In [14]:
num_epochs = 10
best_validation_loss = float('inf')

for ep in range(num_epochs):

    time_start = time.time()

    training_loss, train_accuracy = train(lstm_model, train_data_iterator, optim, loss_func)
    validation_loss, validation_accuracy = validate(lstm_model, valid_data_iterator, loss_func)

    time_end = time.time()
    time_delta = time_end - time_start

    if validation_loss < best_validation_loss:
        best_validation_loss = validation_loss
        torch.save(lstm_model.state_dict(), 'lstm_model.pt')

    print(f'epoch number: {ep+1} | time elapsed: {time_delta}s')
    print(f'training loss: {training_loss:.3f} | training accuracy: {train_accuracy*100:.2f}%')
    print(f'validation loss: {validation_loss:.3f} |  validation accuracy: {validation_accuracy*100:.2f}%')
    print()

epoch number: 1 | time elapsed: 10.536290407180786s
training loss: 0.688 | training accuracy: 53.98%
validation loss: 0.671 |  validation accuracy: 59.79%

epoch number: 2 | time elapsed: 10.250645637512207s
training loss: 0.657 | training accuracy: 60.31%
validation loss: 0.588 |  validation accuracy: 69.64%

epoch number: 3 | time elapsed: 8.079545974731445s
training loss: 0.584 | training accuracy: 69.34%
validation loss: 0.732 |  validation accuracy: 68.91%

epoch number: 4 | time elapsed: 9.13801908493042s
training loss: 0.535 | training accuracy: 73.45%
validation loss: 0.530 |  validation accuracy: 72.25%

epoch number: 5 | time elapsed: 9.264939785003662s
training loss: 0.484 | training accuracy: 76.79%
validation loss: 0.537 |  validation accuracy: 72.28%

epoch number: 6 | time elapsed: 10.407737255096436s
training loss: 0.447 | training accuracy: 79.34%
validation loss: 0.576 |  validation accuracy: 74.81%

epoch number: 7 | time elapsed: 9.219055414199829s
training loss: 0.

In [15]:
#lstm_model.load_state_dict(torch.load('../../mastering_pytorch_packt/04_deep_recurrent_net_architectures/lstm_model.pt'))
lstm_model.load_state_dict(torch.load('lstm_model.pt'))

test_loss, test_accuracy = validate(lstm_model, test_data_iterator, loss_func)

print(f'test loss: {test_loss:.3f} | test accuracy: {test_accuracy*100:.2f}%')

test loss: 0.549 | test accuracy: 74.22%


In [16]:
def sentiment_inference(model, sentence):
    model.eval()

    # text transformations
    tokenized = data.get_tokenizer("basic_english")(sentence)
    tokenized = [TEXT_FIELD.vocab.stoi[t] for t in tokenized]

    # model inference
    model_input = torch.LongTensor(tokenized).to(device)
    model_input = model_input.unsqueeze(1)

    with torch.no_grad():  # inference only, so gradients are not needed
        pred = torch.sigmoid(model(model_input))

    return pred.item()

In [17]:
print(sentiment_inference(lstm_model, "This film is horrible"))
print(sentiment_inference(lstm_model, "Director tried too hard but this film is bad"))
print(sentiment_inference(lstm_model, "Decent movie, although could be shorter"))
print(sentiment_inference(lstm_model, "This film will be houseful for weeks"))
print(sentiment_inference(lstm_model, "I loved the movie, every part of it"))

0.06481477618217468
0.06424909085035324
0.3703685998916626
0.5785183310508728
0.9535890817642212


## GRU and Attention-Based Models

### GRUs and PyTorch

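A GRU is a type of memory cell made up of two gates (a reset gate and an update gate) and a single hidden state vector. Structurally it is simpler than an LSTM, yet it is comparably effective at dealing with exploding and vanishing gradients.  
GRUs train faster than LSTMs and, on many tasks such as language modeling, can match LSTM performance with far less training data.  
PyTorch provides the nn.GRU module, which instantiates a GRU layer in a single line of code.  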


```
self.gru_layer = nn.GRU(
  input_size, hidden_size, num_layers=2, dropout=0.8, bidirectional=True
)
```
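
To make the roles of the reset and update gates more concrete, here is a minimal scalar sketch of a single GRU time step in plain Python rather than PyTorch. The parameter names (`w_r`, `u_r`, and so on) and their values are made up for illustration. The update rule follows the convention documented for nn.GRU, in which the update gate `z` keeps the old state and `1 - z` admits the new candidate.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h, p):
    """One GRU time step on scalars: input x, previous hidden state h, parameters p."""
    r = sigmoid(p["w_r"] * x + p["u_r"] * h + p["b_r"])  # reset gate
    z = sigmoid(p["w_z"] * x + p["u_z"] * h + p["b_z"])  # update gate
    # candidate state: the reset gate scales how much of the old state is used
    n = math.tanh(p["w_n"] * x + r * (p["u_n"] * h) + p["b_n"])
    # blend old state and candidate according to the update gate
    return (1.0 - z) * n + z * h

# Hypothetical parameter values, for illustration only.
params = {
    "w_r": 0.4, "u_r": -0.2, "b_r": 0.0,
    "w_z": 0.3, "u_z": 0.5, "b_z": 0.0,
    "w_n": 0.9, "u_n": 0.7, "b_n": 0.0,
}

h = 0.0
for x in [0.5, -1.0, 0.25]:
    h = gru_step(x, h, params)
    print(h)
```

There is a single state `h` and no separate cell state, which is the main structural simplification relative to an LSTM.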



### Attention-Based Models

![Attention RNN architecture](https://miro.medium.com/v2/resize:fit:1200/1*TPlS-uko-n3uAxbAQY_STQ.png)

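The idea of attention comes from the observation that we humans pay different amounts of attention depending on the moment and on which part of a sequence (text) we are looking at.  
For example, to complete the sentence 'Martha sings beautifully, I am hooked to ___ voice.', we pay more attention to the word 'Martha' in order to guess that the missing word is 'her'. If instead the sentence to complete is 'Martha sings beautifully, I am hooked to her ___.', we pay more attention to the word 'sings' in order to guess candidates such as 'voice', 'songs', or 'singing'.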

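None of the recurrent network architectures above has a mechanism for focusing on a specific part of the sequence when predicting the output at the current time step. Instead, an RNN only has access to a summary of the past sequence in the form of its hidden state vector.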

In this architecture, a global context vector is computed at every time step. Variants were later developed that use a local context vector, which attends only to the preceding k words rather than to all preceding words.
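
The difference between a global and a local context vector can be sketched in a few lines of plain Python. The sketch assumes dot-product scoring followed by a softmax, which is one common choice and not necessarily the one used in the figure above. The function name `context_vector`, the `window` parameter, and the toy vectors are all hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def context_vector(query, states, window=None):
    """Attention-weighted sum of hidden states.

    window=None attends to every state (global context);
    window=k attends only to the last k states (local context).
    """
    if window is not None and window < 1:
        raise ValueError("window must be at least 1")
    attended = states if window is None else states[-window:]
    weights = softmax([dot(query, s) for s in attended])
    context = [sum(w * s[i] for w, s in zip(weights, attended))
               for i in range(len(query))]
    return weights, context

# Toy hidden states for four earlier words and a query for the current step.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1]]
query = [2.0, 0.0]

global_weights, global_context = context_vector(query, hidden_states)
local_weights, local_context = context_vector(query, hidden_states, window=2)
print(global_weights, global_context)
print(local_weights, local_context)
```

The weights play the role of "how much attention each earlier word gets"; restricting them to the last k states trades some context for less computation per step.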

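Recurrent networks have to be unrolled over time, so their time steps cannot be processed in parallel. The newer transformer model, by contrast, has no recurrent or convolutional layers, which makes it parallelizable and comparatively light in terms of computation.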