# Recurrent Neural Networks

> __Recurrent neural networks are a parameter-efficient generalization of `nn.Linear` which allows us to work with temporal data structures__

Due to their specific structure they allow us to work with:
- text (appropriately represented)
- timeseries (for example weather prediction)
- video (together with `torch.nn.Conv2d` layers)

Their specific features (when compared to `nn.Linear` or `nn.Conv{1,2,3}d`) include:
- memory of previous timesteps
- can depend on next batches

## RNN Cell

> The main building block of recurrent neural network(s) is called a __cell__ and __is a `torch.nn.Linear` layer__

![rnn_cell](images/rnn_cell.png)


Let's specify above diagram with some math/textual notation:

$$
h_{t-1} \rightarrow \text{hidden state from timestep t-1 (used for current timestep)}
$$

$$
x_t \rightarrow \text{input at timestep t}
$$

$$
[h_{t-1}, x_t] \rightarrow \text{concatenation of previous hidden state and current input}
$$

$$
W_{[h_{t-1}, x_t]} \rightarrow \text{Linear weights used to transform concatenated previous hidden state and current input}
$$

$$
h_t \rightarrow \text{hidden state at timestep t (output from linear transformation)}
$$


__Keep information below in mind:__

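- the initial hidden state `h_0` is a tensor filled with zeros by default (__no need to pass initial hidden state__)
- in `torch.nn.RNNCell` the `tanh` activation is applied to the output of the linear transformation, so the returned `h_t` (which is also passed as the hidden state to the next cell) is already activated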
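The notation above can be checked numerically. The snippet below is a minimal sketch, assuming the default `tanh` nonlinearity of `torch.nn.RNNCell`: it builds a single weight matrix acting on the concatenation `[h_{t-1}, x_t]` from the cell's own parameters and compares the result with the cell's output. Names such as `h_manual` and `h_reference` are only illustrative.

```python
import torch

torch.manual_seed(0)

cell = torch.nn.RNNCell(input_size=30, hidden_size=50)
x_t = torch.randn(64, 30)     # input at timestep t
h_prev = torch.randn(64, 50)  # hidden state from timestep t-1

with torch.no_grad():
    # One weight matrix acting on the concatenation [h_{t-1}, x_t]
    weight = torch.cat((cell.weight_hh, cell.weight_ih), dim=1)  # (50, 80)
    bias = cell.bias_hh + cell.bias_ih
    concatenated = torch.cat((h_prev, x_t), dim=1)  # (64, 80)
    h_manual = torch.tanh(concatenated @ weight.T + bias)

    # Reference: what the cell itself returns for the same inputs
    h_reference = cell(x_t, h_prev)

matches = torch.allclose(h_manual, h_reference, atol=1e-5)
print(h_manual.shape, matches)
```

PyTorch stores the input and hidden weights separately (`weight_ih`, `weight_hh`), which is equivalent to one linear layer over the concatenated vector.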

Let's see how to use `torch.nn.RNNCell` with example input:

In [1]:
import torch

features = 30
batch_size = 64

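# hidden_size is the number of features in the hidden state h_t
cell = torch.nn.RNNCell(input_size=features, hidden_size=50)

data = torch.randn(batch_size, features)

# No hidden state is passed, so zeros are used; the result is h_t for this single timestep
cell(data).shape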

torch.Size([64, 50])

## RNN

Now that we know what single cell does, the whole recurrent neural network is:

> Chain of cells each taking __previous hidden state__ and __current input__ and __outputting next hidden state__

![rnn_classification](images/rnn_classification.png)

Things to keep in mind about RNN layer:
- each cell also outputs its own hidden state
- __By default input shape should be `[seq_len, batch, features]` but we can use `batch_first=True` argument to make it `[batch, seq_len, features]`!__
- __`seq_len` and `batch` can be of variable length (no need to specify them!)__
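To make the "chain of cells" picture concrete, here is a small sketch that copies the weights of a single-layer `torch.nn.RNN` into a `torch.nn.RNNCell` and unrolls the cell by hand over the timesteps. It assumes a unidirectional, single-layer RNN with `batch_first=True`; the variable names are illustrative only.

```python
import torch

torch.manual_seed(0)

rnn = torch.nn.RNN(input_size=30, hidden_size=50, batch_first=True)
cell = torch.nn.RNNCell(input_size=30, hidden_size=50)

# Copy the single-layer RNN weights into the cell so both compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(rnn.weight_ih_l0)
    cell.weight_hh.copy_(rnn.weight_hh_l0)
    cell.bias_ih.copy_(rnn.bias_ih_l0)
    cell.bias_hh.copy_(rnn.bias_hh_l0)

data = torch.randn(64, 15, 30)  # [batch, seq_len, features]

with torch.no_grad():
    hidden = torch.zeros(64, 50)  # initial hidden state
    per_timestep = []
    for t in range(data.shape[1]):
        # previous hidden state + current input -> next hidden state
        hidden = cell(data[:, t], hidden)
        per_timestep.append(hidden)
    unrolled_outputs = torch.stack(per_timestep, dim=1)  # [batch, seq_len, hidden_size]

    outputs, h_n = rnn(data)

same_outputs = torch.allclose(unrolled_outputs, outputs, atol=1e-4)
same_last_hidden = torch.allclose(hidden, h_n[0], atol=1e-4)
print(same_outputs, same_last_hidden)
```

Because the loop only depends on `data.shape[1]`, the same cell works for any sequence length.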

This time inputs and outputs of the `RNN` layer are a little more complicated (check [documentation](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) if in doubt)

## RNN Inputs

> __Inputs are `data` and `initial_hidden`__

There are two inputs, but __we almost always only pass our data__ because:
- RNN's `initial_hidden` is comprised of zeros when left untouched
- __When would we want to change that?__

All in all we only need our data!
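As a quick sanity check of the default, the sketch below passes an explicit zero tensor as the initial hidden state and compares the result with the call that passes only the data. The name `initial_hidden` mirrors the text above; in PyTorch it is simply the second positional argument.

```python
import torch

torch.manual_seed(0)

rnn = torch.nn.RNN(input_size=30, hidden_size=50, batch_first=True)
data = torch.randn(64, 15, 30)

# Shape is (num_layers * num_directions, batch, hidden_size)
initial_hidden = torch.zeros(1, 64, 50)

with torch.no_grad():
    outputs_default, h_default = rnn(data)
    outputs_explicit, h_explicit = rnn(data, initial_hidden)

identical = torch.allclose(outputs_default, outputs_explicit) and torch.allclose(
    h_default, h_explicit
)
print(identical)
```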

## RNN Outputs

> __Each time a `tuple` with two tensors is returned, NOT A SINGLE TENSOR__

- `outputs` of shape `(seq_len, batch, num_directions * hidden_size)` (__`batch` is the first dimension if `batch_first=True` is specified!__)
- `h_n` of shape `(num_layers * num_directions, batch, hidden_size)` (__`batch_first=True` does NOT change this shape!__)

Now, let's analyze what those mean!

### outputs

> Outputs __from the last layer of RNN__ (we may specify multiple layers via `num_layers`) __FOR EACH TIMESTEP `t`__

Using it we can get:
- all hidden states of the last layer (one per timestep, after the `tanh` activation)
- no matter how many layers we specify
- `reshape` data to obtain specific parts of data (e.g. `output.reshape(seq_len, batch, num_directions, hidden_size)` for the default, non-`batch_first` layout)

> __Use when you need data from each timestep (for example attention, transformers etc.)__

### h_n

> Tensor containing __LAST HIDDEN STATE `t_final`__ (also we can get last output __from each layer__)

Using it we can:
- get summarization of sequence in the last hidden state
- perform classification of shorter documents

> __Use for `seq2seq`, basic classification (without attention)__

Let's see both in action:

In [7]:
# Batch first for easier usage
rnn = torch.nn.RNN(input_size=30, hidden_size=50, batch_first=True)

for i, param in enumerate(rnn.parameters()):
    print(i, param.shape)


data = torch.randn(64, 15, 30)
outputs, h_n = rnn(data)

print(f"Direct Outputs: {outputs.shape} | Hidden: {h_n.shape}")

0 torch.Size([50, 30])
1 torch.Size([50, 50])
2 torch.Size([50])
3 torch.Size([50])
Direct Outputs: torch.Size([64, 15, 50]) | Hidden: torch.Size([1, 64, 50])
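
The relation between `outputs` and `h_n` can be verified directly. This is a minimal sketch for a single-layer, unidirectional RNN with `batch_first=True`: the last timestep of `outputs` should be the same tensor as the last hidden state.

```python
import torch

torch.manual_seed(0)

rnn = torch.nn.RNN(input_size=30, hidden_size=50, batch_first=True)
data = torch.randn(64, 15, 30)

with torch.no_grad():
    outputs, h_n = rnn(data)

last_timestep = outputs[:, -1, :]  # [batch, hidden_size], taken from the per-timestep outputs
last_hidden = h_n[-1]              # [batch, hidden_size], last layer of h_n

matches = torch.allclose(last_timestep, last_hidden)
print(last_timestep.shape, matches)
```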


## Recurrent variations

> Recurrent Neural networks have multiple shortcomings __that we will focus on in the next lessons__

Some of them were fixed (or improved upon) by RNNs themselves, in other cases new architectures were introduced. The shortcomings are:
- Vanishing gradients due to long sequences and `tanh` (__even `20` timesteps might be a problem!__)
- Context only from the previous timesteps (__does not look into the future timesteps__)
- __All of the history is summarized in a single `hidden_state`!__
- __No way to attend to a single influential timestep__ (next lesson and attention)

### Bidirectional

![bi_rnn_classification](images/bi_rnn_classification.png)

Bidirectional RNNs are a simple modification which consists of:
- One RNN going through the sequence __from the beginning towards the end__
- Another RNN going through the sequence __from the end to the beginning__

Pros:
- Reduced gradient vanishing if we sum the states together
- Knowledge from next timesteps due to reversed RNN

Cons:
- A little slower
- __Twice as many parameters__ (although RNNs are very parameter efficient)
- __Still everything is summarized into a single hidden state per direction__
- __For longer sequences gradient might still die__

> We can simply use `bidirectional=True` to turn on this behaviour

In [6]:
bidirectional_rnn = torch.nn.RNN(
    input_size=30, hidden_size=50, batch_first=True, bidirectional=True
)

for i, param in enumerate(bidirectional_rnn.parameters()):
    print(i, param.shape)


# By default outputs are concatenated from both directions
outputs, h_n = bidirectional_rnn(data)
print(f"Direct Outputs: {outputs.shape} | Hidden: {h_n.shape}")

# -1 because we don't know length of sequence beforehand
summed_outputs = outputs.reshape(64, -1, 2, 50).sum(dim=2)
print(f"Outputs after summation: {summed_outputs.shape}")

# Concatenate last hidden for single layer
concatenated_last_hidden = torch.cat((h_n[0], h_n[1]), dim=-1)
print(f"Concatenated Last Hidden: {concatenated_last_hidden.shape}")

0 torch.Size([50, 30])
1 torch.Size([50, 50])
2 torch.Size([50])
3 torch.Size([50])
4 torch.Size([50, 30])
5 torch.Size([50, 50])
6 torch.Size([50])
7 torch.Size([50])
Direct Outputs: torch.Size([64, 15, 100]) | Hidden: torch.Size([2, 64, 50])
Outputs after summation: torch.Size([64, 15, 50])
Concatenated Last Hidden: torch.Size([64, 100])
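
One detail worth checking for the bidirectional case is where each direction's final state sits in `outputs`. The sketch below assumes a single bidirectional layer with `batch_first=True`: the forward direction finishes at the last timestep, while the reversed direction finishes at the first one.

```python
import torch

torch.manual_seed(0)

hidden_size = 50
bidirectional_rnn = torch.nn.RNN(
    input_size=30, hidden_size=hidden_size, batch_first=True, bidirectional=True
)
data = torch.randn(64, 15, 30)

with torch.no_grad():
    outputs, h_n = bidirectional_rnn(data)

# Forward direction: first half of the features, final state at the LAST timestep
forward_final = outputs[:, -1, :hidden_size]
# Backward direction: second half of the features, final state at the FIRST timestep
backward_final = outputs[:, 0, hidden_size:]

forward_matches = torch.allclose(forward_final, h_n[0])
backward_matches = torch.allclose(backward_final, h_n[1])
print(forward_matches, backward_matches)
```

This is why taking `outputs[:, -1]` alone is usually not what we want for a bidirectional network; concatenating `h_n[0]` and `h_n[1]` uses the final state of both directions.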


### Multiple layers

> __Unlike other PyTorch layers, RNNs have a `num_layers` parameter which lets us use multiple layers__

Why does the API change?

- __Each hidden timestep from a given layer is passed on to the next RNN layer__
- No clear way to do that otherwise
- Other approaches are inefficient from the implementation POV

__Pros:__
- More representational power
- We can access `last_hidden` __from any layer__
- __Useful for harder language problems__

__Let's see multiple bidirectional layers and how to use their outputs!__

In [8]:
bidirectional_multiple_layers = torch.nn.RNN(
    input_size=30, hidden_size=50, batch_first=True, bidirectional=True, num_layers=3
)

# By default outputs are concatenated from both directions
# Outputs are from last layer, hence we are fine in most cases
outputs, h_n = bidirectional_multiple_layers(data)
print(f"Direct Outputs: {outputs.shape} | Hidden: {h_n.shape}")

# After the reshape h_n is laid out as (num_layers, num_directions, batch, hidden_size)
h_n_last_layer = h_n.reshape(3, 2, 64, 50)[-1]
print(f"After choosing last layer: {h_n_last_layer.shape}")
h_n_concatenated = torch.cat((h_n_last_layer[0], h_n_last_layer[1]), dim=-1)
print(f"After concatenating bidirectional: {h_n_concatenated.shape}")

Direct Outputs: torch.Size([64, 15, 100]) | Hidden: torch.Size([6, 64, 50])
After choosing last layer: torch.Size([2, 64, 50])
After concatenating bidirectional: torch.Size([64, 100])


## LSTM

> __LSTMs are an improvement over plain RNNs which allows them to work with longer sequences (up to `1000` timesteps)__


### What is improved?

> __Vanishing gradient__

Let's see a sentence oriented example and see the dependency:

> A patient with a rare sarcoma of soft tissue on the left thigh was presented to the hospital yesterday.

In this case:
- "was presented" depends on "patient" __and is separated by 11 tokens__
- Gradient can be seen as influence of the past on the future
- __In this case it is large__ yet __due to `11` linear layers the gradient (dependency) vanishes__
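
The vanishing dependency can also be observed numerically. The sketch below is an illustration under stated assumptions (a randomly initialized, untrained `torch.nn.RNN` with PyTorch's default initialization and random input): it backpropagates from the last timestep only and measures how much gradient reaches each input timestep.

```python
import torch

torch.manual_seed(0)

rnn = torch.nn.RNN(input_size=30, hidden_size=50, batch_first=True)

# 50 timesteps; requires_grad lets us inspect the gradient w.r.t. every input timestep
data = torch.randn(8, 50, 30, requires_grad=True)
outputs, _ = rnn(data)

# Backpropagate from the last timestep only
outputs[:, -1].sum().backward()

# Average gradient norm per timestep, shape [seq_len]
grad_per_timestep = data.grad.norm(dim=-1).mean(dim=0)
print(f"first timestep: {grad_per_timestep[0].item():.3e}")
print(f"last timestep:  {grad_per_timestep[-1].item():.3e}")
```

The gradient reaching the first timestep is expected to be many orders of magnitude smaller than the one at the last timestep, which is the "influence of the past on the future" fading away.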

Let's go over __new__ notation:


### How is it done?

Let's see how a single LSTM cell looks:

![lstm_full](images/lstm_full.png)

And let's go over content in this image step by step:

- __Cell state__ $c_{t-1}$ runs through the cells in the network and __retains important information through longer steps__
- Incorporates __forget gate__ $f_t$ which decides:
    - how much information we will throw away __from previous steps__ by `[0, 1]` element-wise multiplication
    - `0` - "reset" previous information
    - `1` - keep the information unchanged (anything in-between is possible)
    - __It does this via learnable linear layer with `sigmoid` activation__
- This creates the filtered previous cell state $f_t \odot c_{t-1}$
- __Next we add new information__ to cell state via:
    - `input` gate (which decides how much data will be added due to `sigmoid` and `linear` layers)
    - `candidate` gate (a `linear` layer with `tanh`, which decides whether the addition is positive or negative)
    - After multiplication of those two the information is added, which gives the current cell state $c_t$
- Finally, the `output` gate: the concatenated `[h_{t-1}, x_t]` is pushed through a `linear` layer with `sigmoid` __and multiplied by `tanh` of the `cell state`__ (squashed to `[-1, 1]`), which activates/deactivates parts of the data and gives the new hidden state $h_t$
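
The steps above can be sketched in a few lines and compared against `torch.nn.LSTMCell`. This is a minimal sketch, relying on PyTorch stacking the gate weights in the order input, forget, candidate, output; the variable names (`i_t`, `f_t`, `g_t`, `o_t`) are illustrative.

```python
import torch

torch.manual_seed(0)

cell = torch.nn.LSTMCell(input_size=30, hidden_size=50)
x_t = torch.randn(64, 30)     # current input
h_prev = torch.randn(64, 50)  # previous hidden state
c_prev = torch.randn(64, 50)  # previous cell state

with torch.no_grad():
    # All four linear layers are computed at once, then split into gates
    gates = x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
    i_t, f_t, g_t, o_t = gates.chunk(4, dim=1)

    f_t = torch.sigmoid(f_t)  # forget gate: how much of c_prev to keep
    i_t = torch.sigmoid(i_t)  # input gate: how much new information to add
    g_t = torch.tanh(g_t)     # candidate: sign and content of the addition
    o_t = torch.sigmoid(o_t)  # output gate: which parts of the cell state to expose

    c_t = f_t * c_prev + i_t * g_t  # additive update of the cell state
    h_t = o_t * torch.tanh(c_t)     # new hidden state

    h_reference, c_reference = cell(x_t, (h_prev, c_prev))

hidden_matches = torch.allclose(h_t, h_reference, atol=1e-5)
cell_matches = torch.allclose(c_t, c_reference, atol=1e-5)
print(hidden_matches, cell_matches)
```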

### Features

- __Works a little like ResNet__ - new information is only added if needed (__but via learnable layers!__)
- __Additive paths (unlike multiplicative ones) suffer much less from the vanishing gradient__

Let's see how to use `torch.nn.LSTM` layer (there is also `torch.nn.LSTMCell`):

In [10]:
# Same arguments can be used
lstm = torch.nn.LSTM(
    input_size=30, hidden_size=50, batch_first=True, bidirectional=True, num_layers=3
)

# By default outputs are concatenated from both directions
# Outputs are from last layer, hence we are fine in most cases
outputs, (h_n, c_n) = lstm(data)
# We usually need `outputs` or `h_n` only
# Rest is done the same as with the RNNs previously
print(f"Outputs: {outputs.shape} | Hidden: {h_n.shape} | Cell: {c_n.shape}")

Outputs: torch.Size([64, 15, 100]) | Hidden: torch.Size([6, 64, 50]) | Cell: torch.Size([6, 64, 50])


## Word representation

You have probably heard that `RNN`s are used for text classification, translation and other language related tasks.

On the other hand we know that:
- __Neural networks require numbers to work__
- __Each sample has predefined number of features describing it__

How could we obtain that with words?

### Semantic text representation

> __Each word is represented as `N`-dimensional vector__

This representation has (usually) a few characteristics:
- The more "semantically related" two words are, the closer they are in `N`-dimensional space
- Arithmetic on word representations gives us intuitively correct results (__the result is the word closest distance-wise among all available words__)

![word_repr](images/word_representation.png)

For example:

$$
\text{king} + \text{cap} = \text{crown}
$$
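
To illustrate the "closest word" idea, here is a toy sketch. The 3-dimensional vectors below are made up by hand for illustration (they are NOT learned embeddings), and cosine similarity is assumed as the distance measure.

```python
import torch
import torch.nn.functional as F

# Hand-made toy vectors (royalty, headwear, person); NOT real learned representations
vocabulary = {
    "king": torch.tensor([1.0, 0.0, 1.0]),
    "queen": torch.tensor([0.9, 0.0, 1.0]),
    "cap": torch.tensor([0.0, 1.0, 0.0]),
    "crown": torch.tensor([1.0, 1.0, 0.2]),
    "apple": torch.tensor([0.0, 0.0, -1.0]),
}

# Arithmetic on representations
query = vocabulary["king"] + vocabulary["cap"]

# Look for the closest word, skipping the words used in the query
candidates = [word for word in vocabulary if word not in ("king", "cap")]
similarities = {
    word: F.cosine_similarity(query, vocabulary[word], dim=0).item()
    for word in candidates
}
nearest = max(similarities, key=similarities.get)
print(nearest)
```

With real pretrained vectors the procedure is the same, only the vocabulary and dimensionality are much larger.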

> __These representations are learned in a self-supervised fashion, see first Assessment Challenge__

We will use [`spacy`](https://spacy.io/usage/spacy-101) in order to load those (pretrained) representations:

In [11]:
!pip install spacy
# Load textual representations created by spacy (small version)
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [12]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN advcl
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


## Working with spacy

In general, here are the required steps when working with `spacy` and neural networks:
- Get a list of sentences you want to work on
- Create a [pipeline](https://spacy.io/usage/processing-pipelines) __from batch of text data__ (see [`nlp.pipe`](https://spacy.io/api/language#pipe))
- Iterate over pipeline(s) and, using __`vector` attribute__, __obtain semantic representations of each word__

Let's see part of this pipeline below:

In [18]:
texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
# Each doc is a spaCy Doc instance; iterating over it yields one token per word
for doc in nlp.pipe(texts):
    # Do something with the doc here
    text_representation = [token.vector for token in doc]
    print(
        len(text_representation),
        type(text_representation[0]),
        text_representation[0].shape,
        text_representation[0].dtype,
    )

16 <class 'numpy.ndarray'> (96,) float32
13 <class 'numpy.ndarray'> (96,) float32


### Output

Let's analyze the output:
- __Each sentence is of different length while batches of data CANNOT HAVE VARIABLE SIZE__
- `np.ndarray` is returned, __we can transform it to PyTorch tensor easily__
- Each has `96` elements (number of features for this representation)
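
Converting such a list of arrays into a single tensor could look like the sketch below. Random `float32` arrays stand in for `token.vector` here (an assumption made so the snippet runs without the spaCy model).

```python
import numpy as np
import torch

# Stand-ins for `token.vector`: 16 tokens, 96 features each
rng = np.random.default_rng(0)
token_vectors = [rng.standard_normal(96).astype(np.float32) for _ in range(16)]

# Stack into one array first, then convert without copying the data
sentence = torch.from_numpy(np.stack(token_vectors))  # [seq_len, features]
print(sentence.shape, sentence.dtype)
```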

The first one would be a problem, __fortunately PyTorch provides [`torch.nn.utils.rnn.pack_sequence`](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_sequence.html#torch.nn.utils.rnn.pack_sequence)__ which:
- packs variable-length sequences into a single `PackedSequence` (__padding is handled internally and padded timesteps are not fed to the RNN__)
- __Special data structure USABLE ONLY FOR RNNs__.

Let's see `pack_sequence` in action:

In [21]:
a = torch.tensor([1, 2])
b = torch.tensor([3, 4, 5])
c = torch.tensor([6])

# If sequence is not sorted by length we have to use enforce_sorted=False
packed = torch.nn.utils.rnn.pack_sequence([a, b, c], enforce_sorted=False)
packed

PackedSequence(data=tensor([3, 1, 6, 4, 2, 5]), batch_sizes=tensor([3, 2, 1]), sorted_indices=tensor([1, 0, 2]), unsorted_indices=tensor([1, 0, 2]))

A little more realistic, just like our data above:

In [29]:
a = torch.randn(16, 96) # seq_len x features
b = torch.randn(13, 96) # seq_len x features

packed = torch.nn.utils.rnn.pack_sequence([a, b], enforce_sorted=False)

module = torch.nn.LSTM(96, 128)

_, (h_n, _) = module(packed)

# To get THE OUTPUTS FOR ALL TIMESTEPS we would need to pad the packed output
# (see torch.nn.utils.rnn.pad_packed_sequence)
# No need for that with h_n, it is already a plain tensor!
h_n.shape

torch.Size([1, 2, 128])
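
If the per-timestep outputs are needed, the packed output can be turned back into a padded tensor. This is a minimal sketch using `torch.nn.utils.rnn.pad_packed_sequence` on the same kind of data as above; it also checks that `h_n` holds the state at the last valid timestep of the shorter sequence rather than at a padded position.

```python
import torch
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence

torch.manual_seed(0)

a = torch.randn(16, 96)  # seq_len x features
b = torch.randn(13, 96)  # seq_len x features
packed = pack_sequence([a, b], enforce_sorted=False)

module = torch.nn.LSTM(96, 128)

with torch.no_grad():
    packed_outputs, (h_n, c_n) = module(packed)

# Back to a regular tensor; shorter sequences are filled with zeros
padded_outputs, lengths = pad_packed_sequence(packed_outputs, batch_first=True)
print(padded_outputs.shape, lengths)

# Timesteps past the real length of `b` are padding
padding_is_zero = bool((padded_outputs[1, 13:] == 0).all())
# h_n for `b` is its output at the last VALID timestep (index 12)
last_valid_matches = torch.allclose(h_n[0, 1], padded_outputs[1, 12])
```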

## DataLoader's collate_fn

> `collate_fn` argument to `torch.utils.data.DataLoader` allows us to specify custom behaviour for __batch creation__

Default `collate_fn`:

- Prepends batch dimension
- Transforms `np.array`s/Python scalars to `torch.Tensor`
- Concatenates data and preserves structure (e.g. `dict` return values from `torch.utils.data.Dataset`)

> By default it gets only a single argument (list of samples from `torch.utils.data.Dataset`)

We will use it to:
- transform `list` of texts to vectors via `nlp.pipe`
- transform labels for classification task to `torch.Tensor`
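
As a rough illustration of the mechanism (not the solution to the exercise below), here is a sketch of a custom `collate_fn` for variable-length samples. The toy dataset of random tensors and the function name `collate_variable_length` are hypothetical; real samples would be texts processed with `nlp.pipe`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

torch.manual_seed(0)

# Toy dataset: (sequence of shape [seq_len, 96], integer label)
samples = [
    (torch.randn(length, 96), label)
    for length, label in [(16, 0), (13, 2), (7, 1), (20, 3)]
]


def collate_variable_length(batch):
    # `batch` is a list of samples exactly as returned by the dataset
    sequences = [sequence for sequence, _ in batch]
    labels = torch.tensor([label for _, label in batch])
    lengths = torch.tensor([len(sequence) for sequence in sequences])
    # Pad to the longest sequence in THIS batch only
    padded = pad_sequence(sequences, batch_first=True)
    return padded, lengths, labels


loader = DataLoader(samples, batch_size=2, collate_fn=collate_variable_length)
padded, lengths, labels = next(iter(loader))
print(padded.shape, lengths, labels)
```

Returning `lengths` alongside the padded batch keeps the information needed to ignore padded timesteps later on.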

## torchtext

> `torchtext` is a PyTorch library used for working with text

In our case, we will use it only to load data (rest of the pipeline will be done via `spacy`):

In [13]:
!pip install torchtext

Collecting torchtext
  Downloading torchtext-0.9.1-cp39-cp39-manylinux1_x86_64.whl (7.0 MB)
Installing collected packages: torchtext
Successfully installed torchtext-0.9.1


In [7]:
import torchtext

train, test = torchtext.datasets.AG_NEWS(root='.data', split=('train', 'test'))
print(type(train))

<class 'torchtext.data.datasets_utils._RawTextIterableDataset'>


> __Notice this dataset is iterable only and NON-INDEXABLE__

This means we have to gather our samples in custom `torch.utils.data.Dataset`

In [6]:
for elem in train:
    print(elem)
    break

(3, "Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.")


# Exercise

Build `LSTM` classification network.

> __Most of the work will be about getting and processing textual data into correct form__

To do that we have to:
- Load `AG_NEWS` test (we will use it as a validation) and train splits
- Implement `torch.utils.data.Dataset` which gets one of the above datasets and:
    - saves all of the samples in `self.samples`
    - saves all of the targets as `torch.Tensor` in `self.targets`
    - __calculates total count of unique targets__ (we will later use it for the neural network) and saves it as `self.n_targets`
    - uses `__getitem__` to return specific sample and respective label
    - __SAMPLES ARE RETURNED AS `str` AND single `torch.Tensor` value respectively!__
- Create `collate_fn` __FUNCTOR__ which:
    - Inside `__init__` sets up `nlp` object created via `spacy.load("en_core_web_sm")` and assigns it to `self`
    - Inside `__call__(self, batch)`:
        - gets `[0]` elements from batch (all of the sentences)
        - gets `[1]` elements from batch (all of the targets)
        - Pushes sentences through pipeline (specify the same `batch_size` as number of sentences)
            - Gets `token.vector` for each word in the sentence and transforms it to `torch.Tensor` instance
            - Creates a list from these tokens (shape: `(seq_len, 96)`)
            - Adds this list to another `list` (outside of loop) __which will keep all of the sentences representations__
        - Given our `list` containing all sentences use `torch.nn.utils.rnn.pad_sequence` to pad all the sentences into an `RNN`-digestible form
        - Return `padded_sequence` and concatenated `targets` (__remember to also transform them into `torch.Tensor` instance!__)
- Create simple bidirectional `LSTM` model with a few layers (pack it in `torch.nn.Module`):
    - Add `torch.nn.Linear` to project `hidden_size` to the targets we want (__use `dataset.n_targets` attribute here!__)
    - `forward` has to select appropriate output (__reshapes and indexing needed!__) and push output from last recurrent step through `nn.ReLU` and `nn.Linear` defined previously
- Once all of that is done create a simple training loop (or reuse the one you had previously) choosing `Adam` and appropriate loss (based on

> __Good luck :)__

In [None]:
# Mini-project

# Challenges

## Assessment

- What is [Word2Vec](https://wiki.pathmind.com/word2vec) and how do we obtain semantic representation of words/characters?
- What does the `teacher forcing` for recurrent neural networks do and why would we use it?
- Read more about [spaCy pipelines](https://spacy.io/usage/processing-pipelines). What do they consist of?
- What is [perplexity](https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3) metric and what does it measure?

## Non-assessment

- When would we feed `(h_n, c_n)` as input to `nn.LSTM` layer? What would it achieve?
- What is [GRU Unit](https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be) and why would one want to use it instead of `nn.LSTM` layer?
- How does the `BLEU` score work?
- How to optimize data loading pipeline we have created?
