## Recurrent neural networks

We have been using rich `semantic representations of text` with a simple linear classifier on top of the embeddings. This architecture captures the aggregated meaning of the words in a sentence, but it does not take into account the `order` of words, because the aggregation operation on top of the embeddings removes this information from the original text. Because these models are unable to model word ordering, they cannot solve more complex or ambiguous tasks such as text generation or question answering.

To capture the meaning of a text sequence, we need another neural network architecture, called a `recurrent neural network`, or `RNN`. In an RNN, we pass the sentence (here, a news article) through the network one word vector at a time, and the network produces some `state`, which we then pass to the network again together with the next word vector from the sequence. By storing a `memory` of the previous words in the state, the RNN can use the `context of the sentence` to predict the next word in the sequence.

<figure><img src="https://hostux.social/system/media_attachments/files/110/761/945/870/046/541/original/7f1738b2f377a8af.png" alt="" width="400"><figcaption><p>Source: Microsoft Learning</p></figcaption></figure>

* Given the input sequence of word vectors $X_{0},...,X_{n}$, the RNN creates a sequence of neural network blocks and trains the sequence end-to-end using backpropagation.
* Each network block takes a pair ($X_{i}$, $h_{i}$) as input and produces $h_{i+1}$ as a result.
* The final state $h_{n}$ or the output $y$ goes into a linear classifier to produce the result.
* All network blocks share the same weights and are trained end-to-end in one backpropagation pass.

The hidden state, which combines the current input with the prior state, is calculated with the following formulas:
* $h_{t} = \tanh(W_{h}h_{t-1} + W_{x}x_{t} + B_{h})$
* $y_{t} = W_{y}h_{t} + B_{y}$
* $\tanh$ is the hyperbolic tangent function, defined as $\tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$

At each network block, the weights $W_{x}$ are applied to the numeric word vector input, the weights $W_{h}$ to the previous hidden state, and the weights $W_{y}$ to the resulting hidden state to produce the output. The $\tanh$ activation function is applied to the hidden layer to produce values between -1 and 1.
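
The recurrence above can be sketched directly with tensors. This is a minimal illustration, not how PyTorch implements its RNN layer: the sizes, the random weights, and the helper name `rnn_step` are assumptions made for the example.

```python
import torch

torch.manual_seed(0)
embed_dim, hidden_dim, num_class = 4, 3, 2   # illustrative sizes (assumptions)

# Randomly initialised parameters standing in for trained weights
W_x = torch.randn(hidden_dim, embed_dim)
W_h = torch.randn(hidden_dim, hidden_dim)
B_h = torch.zeros(hidden_dim)
W_y = torch.randn(num_class, hidden_dim)
B_y = torch.zeros(num_class)

def rnn_step(x_t, h_prev):
    """One network block: h_t = tanh(W_h h_{t-1} + W_x x_t + B_h), y_t = W_y h_t + B_y."""
    h_t = torch.tanh(W_h @ h_prev + W_x @ x_t + B_h)
    y_t = W_y @ h_t + B_y
    return h_t, y_t

xs = torch.randn(5, embed_dim)   # a toy sequence of 5 word vectors
h = torch.zeros(hidden_dim)      # initial state
for x_t in xs:                   # the same weights are reused at every step
    h, y = rnn_step(x_t, h)

print(tuple(h.shape), tuple(y.shape))
```

Because every step goes through $\tanh$, each component of the state stays between -1 and 1 no matter how long the sequence is.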

Because the state vectors $h_{0},...,h_{n}$ are passed through the network, it is able to learn the sequential dependencies between words. For example, when the word *not* appears somewhere in the sequence, it can learn to negate certain elements within the state vector, resulting in negation.

Let's see how recurrent neural networks can help us classify our news dataset.

In [None]:
pip install portalocker

In [None]:
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

In [None]:
import torch
import torchtext
from torchinfo import summary
import collections


In [None]:
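# Pick the compute device: CUDA on Kaggle, MPS on macOS (Apple Silicon), otherwise CPU
import os, platform

torch_device = "cpu"

# Only select an accelerator if PyTorch reports it as available
if 'kaggle' in os.environ.get('KAGGLE_URL_BASE','localhost') and torch.cuda.is_available():
    torch_device = 'cuda'
elif platform.system() == 'Darwin' and torch.backends.mps.is_available():
    torch_device = 'mps'

torch_device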

In [None]:
# Loading dataset
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
train_dataset, test_dataset = list(train_dataset), list(test_dataset)
classes = ['World','Sports','Business','Sci/Tech']

tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

In [None]:
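# Building vocab
def build_vocab(train_dataset,ngrams=1,min_freq=1):
    counter = collections.Counter()
    for (label, line) in train_dataset:
        counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line), ngrams=ngrams))
    # Reserve index 0 for '<unk>' so that zero padding and
    # out-of-vocabulary tokens both map to it
    vocab = torchtext.vocab.vocab(counter, min_freq=min_freq, specials=['<unk>'])
    vocab.set_default_index(vocab['<unk>'])
    return vocab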

vocab = build_vocab(train_dataset,ngrams=1,min_freq=1)

vocab_size = len(vocab)
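
The idea behind the vocabulary can be sketched with the standard library alone. This is a simplified stand-in, not torchtext's implementation: `simple_tokenize` and `build_toy_vocab` are hypothetical helpers, the whitespace tokenizer is much cruder than `basic_english`, and sorting tokens by frequency is a choice made for this example.

```python
import collections

def simple_tokenize(text):
    # Hypothetical stand-in for the tokenizer: lowercase and split on whitespace
    return text.lower().split()

def build_toy_vocab(lines, min_freq=1, specials=('<unk>',)):
    counter = collections.Counter()
    for line in lines:
        counter.update(simple_tokenize(line))
    # Special tokens come first, then the remaining tokens by descending frequency
    itos = list(specials) + [tok for tok, n in counter.most_common() if n >= min_freq]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

lines = ["Stocks rally as markets open", "Markets fall as stocks slide"]
stoi, itos = build_toy_vocab(lines)

# Words that are not in the vocabulary fall back to the '<unk>' index
encoded = [stoi.get(tok, stoi['<unk>']) for tok in simple_tokenize("Stocks soar")]
print(len(itos), encoded)
```

Each text becomes a list of integer indices, which is the form the embedding layer expects.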

## Simple RNN classifier

In the case of a simple RNN, each recurrent unit is a simple linear network, which takes the concatenated input vector and state vector and produces a new state vector. PyTorch represents this unit with the `RNNCell` class, and a network of such cells with the `RNN` layer.
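
The relationship between the two classes can be checked with a small experiment. This is a sketch with toy sizes chosen for illustration: it copies the weights of an `RNN` layer into an `RNNCell` and unrolls the cell by hand over the sequence.

```python
import torch

torch.manual_seed(0)
embed_dim, hidden_dim = 8, 4   # toy sizes (assumptions)

rnn = torch.nn.RNN(embed_dim, hidden_dim, batch_first=True)
cell = torch.nn.RNNCell(embed_dim, hidden_dim)

# Copy the layer's weights into the cell so both compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(rnn.weight_ih_l0)
    cell.weight_hh.copy_(rnn.weight_hh_l0)
    cell.bias_ih.copy_(rnn.bias_ih_l0)
    cell.bias_hh.copy_(rnn.bias_hh_l0)

x = torch.randn(2, 5, embed_dim)   # (batch, seq_len, embed_dim)

with torch.no_grad():
    out_layer, h_layer = rnn(x)        # whole sequence in one call
    h = torch.zeros(2, hidden_dim)     # initial state
    steps = []
    for t in range(x.size(1)):         # one cell application per time step
        h = cell(x[:, t, :], h)
        steps.append(h)
    out_cell = torch.stack(steps, dim=1)

print(torch.allclose(out_layer, out_cell, atol=1e-5))
```

In practice the `RNN` layer is the convenient choice, since it runs the loop over time steps internally.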

To define an RNN classifier, we will first apply an embedding layer to lower the dimensionality of the input vocabulary, and then place an RNN layer on top of it:

In [None]:
class RNNClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Dense gradients (no sparse=True): torch.optim.Adam, the default
        # optimizer in train_epoch, rejects sparse gradients
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.rnn = torch.nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

    def forward(self,x):
        x = self.embedding(x)             # (batch, seq_len, embed_dim)
        x,h = self.rnn(x)                 # x: outputs at every step, h: final hidden state
        return self.fc(x.mean(dim=1))     # average the step outputs, then classify

>**Note:** We use an untrained embedding layer here for simplicity, but for better results we can use a pre-trained embedding layer with Word2Vec or GloVe embeddings, as described in the previous notebook.

In our case, we will use a padded data loader, so each batch contains padded sequences of the same length:
* The `input` to the embedding layer is the word sequence (a news article), given as the vocabulary index of each word.
* The `embedding_layer` output contains the embedding vector for each word in the sequence.

The RNN layer takes this sequence of embedding tensors and produces two outputs:
* $x$ is the sequence of RNN cell outputs at each step.
* $h$ is the final `hidden_state` for the last element of the sequence. As each word passes through the layer, the hidden state combines the current word with the information kept from the prior words.


We then apply a fully-connected linear classifier to get a score for each class, which can be converted into class probabilities.
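
To make the tensor shapes concrete, here is a small sketch that pushes a random batch through the same three layers. The sizes and the random word indices are assumptions chosen for illustration, and the final `softmax` is only there to show how the scores become probabilities.

```python
import torch

vocab_size, embed_dim, hidden_dim, num_class = 100, 64, 32, 4   # toy sizes
batch = torch.randint(0, vocab_size, (16, 20))   # 16 padded sequences of 20 word indices

embedding = torch.nn.Embedding(vocab_size, embed_dim)
rnn = torch.nn.RNN(embed_dim, hidden_dim, batch_first=True)
fc = torch.nn.Linear(hidden_dim, num_class)

emb = embedding(batch)                 # (16, 20, 64): one vector per word
x, h = rnn(emb)                        # x: (16, 20, 32), h: (1, 16, 32)
logits = fc(x.mean(dim=1))             # (16, 4): one score per class
probs = torch.softmax(logits, dim=1)   # each row sums to 1

for name, t in [('emb', emb), ('x', x), ('h', h), ('logits', logits)]:
    print(name, tuple(t.shape))
```

Note that `CrossEntropyLoss` expects the raw scores (logits), so the classifier itself does not need to apply `softmax` during training.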

>**Note:** RNNs are quite difficult to train, because once the RNN cells are unrolled along the sequence length, the resulting number of layers involved in backpropagation is quite large. Thus we need to select a small learning rate and train the network on a larger dataset to produce good results. This can take quite a long time, so using a GPU is preferred.

In [None]:
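def encode(x, voc=None, tokenizer=tokenizer, unk=0):
    v = vocab if voc is None else voc
    # Build the token -> index mapping once per call; tokens missing from
    # the vocabulary fall back to index `unk`
    stoi = v.get_stoi()
    return [stoi.get(s, unk) for s in tokenizer(x)]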

def padify(b):
    # b is the list of tuples of length batch_size
    # - first element of a tuple = label
    # - second = feature (text, sequence)
    # build vectorized sequence
    v = [encode(x[1]) for x in b]
    # first, compute max length of a sequence in this minibatch
    l = max(map(len,v))
    return (
        # tuple of two tensors - labels and features
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)), mode='constant', value=0) for t in v])
    )

def train_epoch(net, dataloader, lr=0.01, optimizer=None, loss_fn=torch.nn.CrossEntropyLoss(), epoch_size=None, report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(), lr=lr)
    net.train()
    total_loss, acc, count, i = 0, 0, 0, 0
    for labels, features in dataloader:
        labels, features = labels.to(torch_device), features.to(torch_device)
        optimizer.zero_grad()
        output = net(features)
        loss = loss_fn(output, labels)
        loss.backward()
        optimizer.step()
        # item() converts a one-element tensor to a Python number, so the
        # running totals do not keep the computation graph alive
        total_loss += loss.item()
        _, predicted = torch.max(output, 1)
        acc += (predicted == labels).sum().item()
        count += len(labels)
        i += 1
        if i % report_freq == 0:
            print(f'iteration {count}, loss {total_loss/count}, accuracy {acc/count}')
        # epoch_size is a sample count: stop early once that many samples were seen
        if epoch_size and count >= epoch_size:
            break

    return total_loss/count, acc/count
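
To see what the padding step in `padify` does to a minibatch, here is a small sketch on hand-made index lists (the values are hypothetical). It also shows `torch.nn.utils.rnn.pad_sequence`, which produces the same result in this case.

```python
import torch

encoded = [[1, 2, 3, 4], [5, 6], [7]]    # three encoded texts of different lengths
max_len = max(map(len, encoded))

# Right-pad every sequence with zeros up to the longest one, then stack into one tensor
padded = torch.stack([
    torch.nn.functional.pad(torch.tensor(seq), (0, max_len - len(seq)), mode='constant', value=0)
    for seq in encoded
])
print(padded)

# The same result using the built-in helper
padded_alt = torch.nn.utils.rnn.pad_sequence(
    [torch.tensor(seq) for seq in encoded], batch_first=True, padding_value=0
)
print(torch.equal(padded, padded_alt))
```

Padding only up to the longest sequence in each minibatch, rather than in the whole dataset, keeps the batches as small as possible.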

In [None]:
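train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16,collate_fn=padify,shuffle=True)
net = RNNClassifier(vocab_size, 64, 32, len(classes)).to(torch_device)

# epoch_size counts samples, so epoch_size=1 stops after the first minibatch;
# pass a larger value (or None) to train on more of the dataset
train_epoch(net, train_loader, lr=0.01, epoch_size=1)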


Now, let's load the test dataset to evaluate the trained RNN model. We will use the four news category classes to map the predicted output to the target label.

In [None]:
print(f'class map: {classes}')

test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16,collate_fn=padify,shuffle=True)

Before we evaluate the model, we'll extract a padded batch from the data loader. We will use the `vocab.get_itos` function to convert each numeric index back to the word it matches in the vocabulary. When padded vectors are converted from numbers to strings, the padded '0' values map to the token stored at index 0 of the vocabulary, which is the special `<unk>` (unknown) token when the vocabulary reserves index 0 for it. These tokens therefore need to be removed from the recovered text.

Finally, we'll run the model with our test dataset to verify whether the predicted output matches the expected one.

In [None]:
net.eval()

with torch.no_grad():
    # Inspect only the first sample of the first batch
    target, data = next(iter(test_loader))
    itos = vocab.get_itos()
    # Map indices back to tokens and drop the '<unk>' entries left by padding
    word_lookup = [itos[w] for w in data[0].tolist()]
    unknown_vals = {'<unk>'}
    word_lookup = [ele for ele in word_lookup if ele not in unknown_vals]
    print('input text:\n{}\n'.format(' '.join(word_lookup)))

    data, target = data.to(torch_device), target.to(torch_device)
    pred = net(data)
    # argmax returns the index of the highest score, i.e. the predicted class
    actual, predicted = target[0].item(), pred[0].argmax(0).item()
    print("Actual:\nvalue={}, class_name={}\n".format(actual, classes[actual]))
    print("Predicted:\nvalue={}, class_name={}\n".format(predicted, classes[predicted]))

## Long Short Term Memory (LSTM) networks

One of the main problems of classical RNNs is the so-called `vanishing gradients` problem. Because RNNs are trained end-to-end in one backpropagation pass, they have a hard time propagating the error to the first layers of the network, and thus the network cannot learn relationships between distant tokens. The gradient is what adjusts the weights during backpropagation to improve accuracy and reduce the error. If the gradients are too small, the weights barely change and the network does not learn. Since the gradient shrinks as it is propagated back through the steps of an RNN, the network does not learn from the initial inputs of the sequence. In other words, the network "forgets" the earlier word inputs.
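
The effect can be observed on a small, untrained RNN. The following sketch uses toy sizes and random inputs (assumptions made for illustration) and measures how strongly the final hidden state depends on the first input as the sequence gets longer. With PyTorch's default initialization in this small configuration the gradient shrinks; other weights could behave differently.

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=8, hidden_size=16, batch_first=True)

def grad_wrt_first_input(seq_len):
    """Norm of d(sum of final hidden state) / d(first input vector)."""
    x = torch.randn(1, seq_len, 8, requires_grad=True)
    _, h = rnn(x)
    h.sum().backward()
    return x.grad[0, 0].norm().item()

grads = {n: grad_wrt_first_input(n) for n in (2, 10, 50)}
for n, g in grads.items():
    print(f'sequence length {n:3d}: gradient norm {g:.2e}')
```

The longer the sequence, the less the first word can influence the weight updates, which is the "forgetting" described above.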

One way to avoid this problem is to introduce **explicit state management** by using so-called **gates**. The two best-known architectures of this kind are **Long Short Term Memory** (LSTM) and **Gated Recurrent Unit** (GRU).


<figure><img src="https://raw.githubusercontent.com/hololandscape/notebooks/main/pytorch/natural_language_processing/images/long-short-term-memory-cell.svg" alt="" width="400"><figcaption><p>Source: Microsoft Learning</p></figcaption></figure>

An LSTM network is organized in a manner similar to an RNN, but there are two states that are passed from layer to layer: the actual state $c$ and the hidden vector $h$. At each unit, the hidden vector $h_{i}$ is concatenated with the input $x_{i}$, and together they control what happens to the state $c$ via **gates**. Each gate is a neural network with sigmoid activation (output in the range [0,1]), which can be thought of as a bitwise mask when multiplied by the state vector. There are the following gates (from left to right in the picture above):

* **forget gate** takes the hidden vector and determines which components of the vector $c$ we need to forget and which to pass through.
* **input gate** takes some information from the input and hidden vector and inserts it into the state.
* **output gate** transforms the state with a $\tanh$ activation, then selects some of its components using the hidden vector $h_{i}$ to produce the new hidden vector $h_{i+1}$.
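
The gates can be written out by hand and compared against PyTorch's `LSTMCell`. This is a sketch with toy sizes and random inputs (assumptions for illustration). It relies on the way PyTorch stacks the gate weights inside the cell: input, forget, candidate, output.

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 8, 4   # toy sizes (assumptions)
cell = torch.nn.LSTMCell(input_dim, hidden_dim)

x = torch.randn(2, input_dim)    # batch of 2 inputs x_i
h = torch.randn(2, hidden_dim)   # hidden vector h_i
c = torch.randn(2, hidden_dim)   # state c_i

with torch.no_grad():
    # Every gate reads both the input and the hidden vector
    gates = x @ cell.weight_ih.T + cell.bias_ih + h @ cell.weight_hh.T + cell.bias_hh
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)   # masks in (0, 1)
    g = torch.tanh(g)                                                # candidate values

    c_next = f * c + i * g             # forget part of the old state, insert new information
    h_next = o * torch.tanh(c_next)    # select which state components become the hidden vector

    h_ref, c_ref = cell(x, (h, c))     # reference result from PyTorch

print(torch.allclose(h_next, h_ref, atol=1e-5), torch.allclose(c_next, c_ref, atol=1e-5))
```

The state update `f * c + i * g` is mostly additive, which is what lets information and gradients survive over many steps.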

Components of the state $c$ can be thought of as flags that can be switched on and off. For example, when we encounter the name *Alice* in the sequence, we may want to assume that it refers to a female character and raise the flag in the state that we have a female noun in the sentence. When we further encounter the phrase *and Tom*, we will raise the flag that we have a plural noun. Thus, by manipulating the state, we can supposedly keep track of the grammatical properties of sentence parts.

>**Note:** A great resource for understanding the internals of LSTM is this article by Christopher Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

While the internal structure of an LSTM cell may look complex, PyTorch hides this implementation inside the **LSTMCell** class and provides the **LSTM** object to represent the whole LSTM layer. Thus, the implementation of an LSTM classifier will be pretty similar to the simple RNN which we have seen above:

In [None]:
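class LSTMClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Dense gradients (no sparse=True): torch.optim.Adam, the default
        # optimizer in train_epoch, rejects sparse gradients
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        # Re-initialise the embedding weights with shifted normal noise
        self.embedding.weight.data = torch.randn_like(self.embedding.weight.data)-0.5
        self.rnn = torch.nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

    def forward(self, x):
        x = self.embedding(x)     # (batch, seq_len, embed_dim)
        x,(h,c) = self.rnn(x)     # h, c: (num_layers, batch, hidden_dim)
        return self.fc(h[-1])     # classify from the last layer's final hidden vector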
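
The classifier above uses `h[-1]` rather than the per-step outputs. The following sketch, with toy sizes and random inputs chosen for illustration, shows the shapes an `LSTM` layer returns and how `h[-1]` relates to the output sequence.

```python
import torch

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=64, hidden_size=32, batch_first=True)
emb = torch.randn(16, 20, 64)    # (batch, seq_len, embed_dim)

out, (h, c) = lstm(emb)
print(tuple(out.shape), tuple(h.shape), tuple(c.shape))

# For a single-layer, unidirectional LSTM, the output at the last
# time step holds the same values as h[-1]
print(torch.allclose(out[:, -1, :], h[-1]))
```

With padded batches, the last time step of a short sequence is a padding position, so `h[-1]` also reflects the padding tokens that were processed after the real words.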

Now let's train the network. Note that training an LSTM is also quite slow, and you may need to play with the `lr` learning rate parameter to find a learning rate that results in reasonable training speed and yet does not cause unstable training.

In [None]:
net = LSTMClassifier(vocab_size, 64, 32, len(classes)).to(torch_device)
train_epoch(net, train_loader, lr=0.01, epoch_size=1)