In [4]:
import torch
from torch import nn
import numpy as np

### The problem of long-term dependencies
* RNNs connect previous information to the present task:
> enough for predicting the next word for "the clouds are in the **sky**"

> may not be enough when more context is needed: "I grew up in France... I speak fluent **French**"

### RNN
* All recurrent neural networks have the form of a chain of repeating neural network modules
> ht = tanh(W\[ht-1, xt\])
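
The recurrence above can be sketched in a few lines of PyTorch. This is a minimal illustration, not a reference implementation: the sizes, the random weight matrix `W`, and the helper name `rnn_step` are assumptions made for the example, and the bias is omitted as in the formula.

```python
import torch

torch.manual_seed(0)

batch, input_size, hidden_size = 3, 100, 20  # illustrative sizes (assumed)

# One weight matrix acting on the concatenation [h_{t-1}, x_t]
W = torch.randn(hidden_size, hidden_size + input_size) * 0.1

def rnn_step(h_prev, x_t):
    """One step of h_t = tanh(W [h_{t-1}, x_t]) (hypothetical helper)."""
    concat = torch.cat([h_prev, x_t], dim=1)  # [b, hidden + input]
    return torch.tanh(concat @ W.t())         # [b, hidden]

x = torch.randn(10, batch, input_size)  # [seq, b, vec]
h = torch.zeros(batch, hidden_size)
for x_t in x:                            # the same module is applied at every step
    h = rnn_step(h, x_t)
print(h.shape)
```

The loop makes the "chain of repeating modules" explicit: the same `W` is reused at every time step and only `h` changes.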

### LSTM
* LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
* The core idea behind LSTMs: **Cell State**
> Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. An LSTM has three of these gates to protect and control the cell state.

#### LSTM: Forget gate
> ft = sigmoid(Wf\[ht-1, xt\] + bf)
>> It looks at ht-1 and xt and outputs a number between 0 and 1 for each number in the cell state Ct-1.
>> A 1 represents "completely keep this" while a 0 represents "completely get rid of this"

#### LSTM: Input gate and Cell state
* The next step is to decide what new information we're going to store in the cell state
> a sigmoid layer called the "**input gate layer**" decides which values we'll update.
>> it = sigmoid(Wi\[ht-1, xt\] + bi)

> a tanh layer creates a vector of new candidate values, that could be added to the state.
>> Ct^ = tanh(WC\[ht-1, xt\] + bC)

* It's now time to update the old cell state into the new cell state
> Ct = ft * Ct-1 + it * Ct^
>> We multiply the old state by ft, forgetting the things we decided to forget earlier. Then, we add the new candidate values, scaled by how much we decided to update each state value.

#### LSTM: Output
* Finally, we need to decide what we're going to output.
> First, we run a sigmoid layer which decides what parts of the cell state we're going to output.
>> ot = sigmoid(Wo\[ht-1, xt\] + bo)

> Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of sigmoid gate, so that we only output the parts we decided to.
>> ht = ot * tanh(Ct)
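
The gate equations above can be checked against `nn.LSTMCell`. The sketch below is an illustration under stated assumptions: the helper name `lstm_step` and the sizes are made up for the example. PyTorch keeps separate input-to-hidden and hidden-to-hidden weights (plus two biases) rather than one matrix over the concatenation \[ht-1, xt\], which is algebraically the same thing, and it stacks the four layers in the order input, forget, candidate, output.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, input_size, hidden_size = 3, 100, 20  # illustrative sizes (assumed)

cell = nn.LSTMCell(input_size, hidden_size)

def lstm_step(x_t, h_prev, c_prev, cell):
    """Recompute one LSTM step by hand from the cell's parameters (hypothetical helper)."""
    # W_ih x_t + b_ih + W_hh h_{t-1} + b_hh is equivalent to W [h_{t-1}, x_t] + b
    gates = x_t @ cell.weight_ih.t() + cell.bias_ih + h_prev @ cell.weight_hh.t() + cell.bias_hh
    i, f, g, o = gates.chunk(4, dim=1)   # PyTorch order: input, forget, candidate, output
    i_t = torch.sigmoid(i)               # input gate
    f_t = torch.sigmoid(f)               # forget gate
    c_hat = torch.tanh(g)                # candidate values Ct^
    o_t = torch.sigmoid(o)               # output gate
    c_t = f_t * c_prev + i_t * c_hat     # Ct = ft * Ct-1 + it * Ct^
    h_t = o_t * torch.tanh(c_t)          # ht = ot * tanh(Ct)
    return h_t, c_t

x_t = torch.randn(batch, input_size)
h_prev = torch.randn(batch, hidden_size)
c_prev = torch.randn(batch, hidden_size)

with torch.no_grad():
    h_manual, c_manual = lstm_step(x_t, h_prev, c_prev, cell)
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-5), torch.allclose(c_manual, c_ref, atol=1e-5))
```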

#### Intuitive Pipeline
* LSTM memory Cell
> Forget irrelevant parts of previous state --> Selectively update cell state values --> Output certain parts of cell state

* input gate : forget gate : behavior 
> 0 : 1 : remember the previous value, 1 : 1 : add to the previous value, 0 : 0 : erase the value, 1 : 0 : overwrite the value
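
The four gate combinations can be illustrated by plugging saturated gate values (exactly 0 or 1) into the cell update Ct = ft * Ct-1 + it * Ct^. The numbers and the helper name `update` are assumptions chosen for the example; real gates take values strictly between 0 and 1.

```python
import torch

c_prev = torch.tensor([2.0])  # old cell state Ct-1 (illustrative value)
c_hat = torch.tensor([0.5])   # candidate value Ct^ (illustrative value)

def update(i_t, f_t):
    """Cell update with fixed gate values (hypothetical helper)."""
    return f_t * c_prev + i_t * c_hat

results = {
    "remember  (i=0, f=1)": update(0.0, 1.0),  # keeps 2.0
    "add       (i=1, f=1)": update(1.0, 1.0),  # 2.0 + 0.5 = 2.5
    "erase     (i=0, f=0)": update(0.0, 0.0),  # 0.0
    "overwrite (i=1, f=0)": update(1.0, 0.0),  # replaced by 0.5
}
for name, value in results.items():
    print(name, value.item())
```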

### How to solve Gradient Vanishing?
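In an LSTM, d Ct / d Ct-1 is a sum of four terms, one of which is the forget gate ft itself, so the gradient can flow along the cell state without being squashed at every step. In a plain RNN, d hk / d h1 is a product of k-1 factors, which vanishes when most of the factors are < 1 and explodes when most of them are > 1.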
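
A toy scalar experiment makes the contrast concrete. This is a simplified sketch, not a faithful LSTM: the gate values `f`, `i`, `g` are frozen constants (an assumption; in a real LSTM they depend on ht-1 and xt, which is where the other three terms of the sum come from), and the weight `w` and step count `T` are arbitrary.

```python
import torch

T = 50  # number of time steps (arbitrary)

# Plain RNN path: h_t = tanh(w * h_{t-1}); the gradient is a product of T factors
w = 0.5
h0 = torch.tensor(0.1, requires_grad=True)
h = h0
for _ in range(T):
    h = torch.tanh(w * h)
(rnn_grad,) = torch.autograd.grad(h, h0)

# Cell-state path with frozen gates: c_t = f * c_{t-1} + i * g
f, i, g = 1.0, 0.1, 0.3  # forget gate fully open (assumed constants)
c0 = torch.tensor(0.1, requires_grad=True)
c = c0
for _ in range(T):
    c = f * c + i * g
(cell_grad,) = torch.autograd.grad(c, c0)

print("d h_T / d h_0:", rnn_grad.item())   # at most 0.5**50, effectively vanished
print("d c_T / d c_0:", cell_grad.item())  # f**T, which stays at 1 when f = 1
```

With the forget gate near 1, the additive update leaves a direct path for the gradient, while the multiplicative RNN path shrinks geometrically.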

### nn.LSTM
* \_\_init\_\_
    * *input_size*: The number of expected features in the input x
    * *hidden_size*: The number of features in the hidden state h
    * *num_layers*: Number of recurrent layers. E.g., setting *num_layers=2* would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1

### LSTM.forward()
* out, (ht, ct) = lstm(x, (ht_0, ct_0))
    * x: \[seq, b, vec\]
    * h/c: \[num_layer, b, h\]
    * out: \[seq, b, h\]
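
The shapes listed above can be verified with explicit initial states. In this minimal sketch the sizes are illustrative assumptions; the initial states are passed as a tuple of zero tensors, which is also what `nn.LSTM` uses when they are omitted.

```python
import torch
from torch import nn

torch.manual_seed(0)
seq, b, vec, hid, layers = 10, 3, 100, 20, 4  # illustrative sizes (assumed)

lstm = nn.LSTM(input_size=vec, hidden_size=hid, num_layers=layers)
x = torch.randn(seq, b, vec)         # [seq, b, vec]
h0 = torch.zeros(layers, b, hid)     # [num_layer, b, h]
c0 = torch.zeros(layers, b, hid)     # [num_layer, b, h]

out, (ht, ct) = lstm(x, (h0, c0))
print(out.shape, ht.shape, ct.shape)

# out holds the top layer's hidden state at every time step,
# so its last step matches the top layer's final hidden state
print(torch.allclose(out[-1], ht[-1]))
```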

### nn.LSTMCell
* \_\_init\_\_
    * *input_size*: The number of expected features in the input x
    * *hidden_size*: The number of features in the hidden state h
    * There is no *num_layers* argument: an LSTMCell computes a single layer for a single time step, so stacking layers is done by chaining cells manually

### LSTMCell.forward()
* ht, ct = lstmcell(xt, (ht_0, ct_0))
    * xt: \[b, vec\]
    * ht/ct: \[b, h\]

In [3]:
lstm = nn.LSTM(input_size=100, hidden_size=20, num_layers=4)
print(lstm)
x = torch.randn(10, 3, 100)
out, (h, c) = lstm(x)
print(out.shape, h.shape, c.shape)

LSTM(100, 20, num_layers=4)
torch.Size([10, 3, 20]) torch.Size([4, 3, 20]) torch.Size([4, 3, 20])


In [5]:
# single layer
cell = nn.LSTMCell(input_size=100, hidden_size=20)
h = torch.zeros(3, 20)
c = torch.zeros(3,20)
for xt in x:
    h, c = cell(xt, (h, c))
print(h.shape, c.shape)

torch.Size([3, 20]) torch.Size([3, 20])
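
To relate the two APIs, the following sketch unrolls an `nn.LSTMCell` over time and compares it with a single-layer `nn.LSTM` that shares the same parameters. It is an illustration under assumptions: the sizes are arbitrary, and the weights are copied by hand from the layer-0 parameters of the LSTM.

```python
import torch
from torch import nn

torch.manual_seed(0)
seq, b, vec, hid = 10, 3, 100, 20  # illustrative sizes (assumed)

lstm = nn.LSTM(input_size=vec, hidden_size=hid, num_layers=1)
cell = nn.LSTMCell(input_size=vec, hidden_size=hid)

# Copy the LSTM's layer-0 parameters into the cell so both compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(seq, b, vec)
h = torch.zeros(b, hid)
c = torch.zeros(b, hid)
steps = []
with torch.no_grad():
    for xt in x:                     # one call per time step
        h, c = cell(xt, (h, c))
        steps.append(h)
    out_cell = torch.stack(steps)    # [seq, b, h], same layout as nn.LSTM's out
    out_lstm, (hn, cn) = lstm(x)

print(out_cell.shape, torch.allclose(out_cell, out_lstm, atol=1e-5))
```

`nn.LSTM` runs the whole sequence in one call, while `nn.LSTMCell` leaves the loop to the caller, which is useful when each step needs custom logic.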


In [6]:
# two layers
cell1 = nn.LSTMCell(input_size=100, hidden_size=30)
cell2 = nn.LSTMCell(input_size=30, hidden_size=20)
h1 = torch.zeros(3,30)
c1 = torch.zeros(3,30)
h2 = torch.zeros(3,20)
c2 = torch.zeros(3,20)
for xt in x:
    h1, c1 = cell1(xt, (h1, c1))   # first layer consumes the input
    h2, c2 = cell2(h1, (h2, c2))   # second layer consumes the first layer's hidden state
print(h2.shape, c2.shape)

torch.Size([3, 20]) torch.Size([3, 20])


In [None]:
# in google colab
!pip install torch
!pip install torchtext
!python -m spacy download en

import torch
from torch import nn, optim
from torchtext import data, datasets
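# load dataset
# NOTE: Field/LabelField are the legacy torchtext API (moved to torchtext.legacy in 0.9, removed in 0.12)
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL) # torchtext

# build the vocabularies so that TEXT.vocab and TEXT.vocab.vectors exist below
# (assumption: 100-d GloVe vectors, chosen to match the embedding size 100 used for the RNN)
TEXT.build_vocab(train_data, vectors='glove.6B.100d')
LABEL.build_vocab(train_data)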

In [None]:
# load word embedding
# RNN comes from a local rnn.py that is not shown here; it must expose an `embedding` layer
from rnn import RNN
rnn = RNN(len(TEXT.vocab), 100, 256)

pretrained_embedding = TEXT.vocab.vectors
print('pretrained_embedding:', pretrained_embedding.shape)
rnn.embedding.weight.data.copy_(pretrained_embedding)
print('embedding layer inited.')
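
The `RNN` class imported from `rnn.py` is not shown in the source. The sketch below is a hypothetical stand-in, written only to be consistent with how the class is used here: three constructor arguments (assumed to be vocabulary size, embedding size, hidden size), an `embedding` attribute, and an output of shape \[b, 1\]. The two-layer unidirectional LSTM and the final linear layer are assumptions.

```python
import torch
from torch import nn

class RNN(nn.Module):
    """Hypothetical stand-in for the RNN class in rnn.py (not shown in the source)."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # assumption: two stacked unidirectional LSTM layers
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, text):
        emb = self.embedding(text)    # [seq, b] => [seq, b, embedding_dim]
        out, (h, c) = self.lstm(emb)  # h: [num_layer, b, hidden_dim]
        return self.fc(h[-1])         # top layer's final state => [b, 1]

model = RNN(vocab_size=1000, embedding_dim=100, hidden_dim=256)
tokens = torch.randint(0, 1000, (12, 4))  # fake batch of token ids, [seq=12, b=4]
logits = model(tokens)
print(logits.shape)
```

The output is one raw logit per example, which is why the training code squeezes dimension 1 and applies a sigmoid only when computing accuracy.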

In [None]:
# train
def train(rnn, iterator, optimizer, criteon):
    avg_acc = []
    rnn.train()

    for i, batch in enumerate(iterator):
        # [seq, b] => [b, 1] => [b]
        pred = rnn(batch.text).squeeze(1)
        loss = criteon(pred, batch.label)
        acc = binary_acc(pred, batch.label).item()
        avg_acc.append(acc)

        optimizer.zero_grad()
        loss.backward()
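        optimizer.step()

    # report the mean training accuracy over the epoch
    avg_acc = np.array(avg_acc).mean()
    print('avg acc:', avg_acc)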

# accuracy helper
def binary_acc(pred, y):
    # pred holds raw logits: apply sigmoid, then round to 0/1 predictions
    preds = torch.round(torch.sigmoid(pred))
    correct = torch.eq(preds, y).float()
    acc = correct.sum() / len(correct)
    return acc

# test
def eval(rnn, iterator, criteon):
    avg_acc = []
    rnn.eval()
    with torch.no_grad():
        for batch in iterator:
            # [b, 1] => [b]
            pred = rnn(batch.text).squeeze(1)
            loss = criteon(pred, batch.label)
            acc = binary_acc(pred, batch.label).item()
            avg_acc.append(acc)
    avg_acc = np.array(avg_acc).mean()
    print('>>test:', avg_acc)
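
As a quick sanity check, the accuracy helper can be run on hand-made logits. The version of `binary_acc` below applies the sigmoid to its `pred` argument; the sample values are made up, and `nn.BCEWithLogitsLoss` is an assumption about what `criteon` is, since the source never defines it (it is a natural fit because the model outputs raw logits).

```python
import torch
from torch import nn

def binary_acc(pred, y):
    # pred holds raw logits: sigmoid maps them to (0, 1), round gives 0/1 predictions
    preds = torch.round(torch.sigmoid(pred))
    correct = torch.eq(preds, y).float()
    return correct.sum() / len(correct)

logits = torch.tensor([2.0, -1.0, 0.5, -3.0])  # made-up model outputs
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])    # made-up ground truth

criteon = nn.BCEWithLogitsLoss()  # assumption: a loss that takes raw logits
acc = binary_acc(logits, labels)
loss = criteon(logits, labels)
print(acc.item(), loss.item())    # accuracy is 0.75: three of the four predictions match
```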