# Explanation

Descriptions are based on the awesome lecture by [Andrej Karpathy](https://www.youtube.com/watch?v=yCC09vCHzF8).

## Recurrent Networks offer a lot of flexibility
![Types](./Images/types.png)

* **One to One:** A fixed-size input vector (red) is processed by some hidden layers (green) to produce a fixed-size output vector (blue).
* **One to Many:** e.g. **Image Captioning**. Image -> sequence of words that describes the content of the image.
* **Many to One:** e.g. **Sentiment Classification (in NLP)**. Sequence of words -> sentiment (the sentence is positive or negative).
* **Many to Many:** e.g. **Machine Translation**. Sequence of words (English) -> sequence of words (Persian).
* **Many to Many (last case):** e.g. **Video Classification**. We classify every single frame of a video into some number of classes, but the prediction at every time step is a function of all the frames that have come in up to that point.

## Simplified RNN Box
![Simplified RNN Box Image](Images/SimplifiedRNNBox.png)

An RNN is basically this green box, and it has a state. Over time, it receives input vectors. At every time step we can feed an input vector into the RNN, and it can modify its internal state as a function of what it receives at that time step. There are weights inside the RNN, and when we tune those weights the RNN will behave differently in terms of how its state evolves as it receives these inputs.

Usually we are also interested in producing an output based on the RNN state, so we can produce these output vectors on top of the RNN.

We can denote this state as a vector $h_{t}$ (or a collection of vectors). We then define it as a function of the hidden state at the previous time step (t-1) and the current input vector $x_{t}$:
$$h_{t} = f_{w}(h_{t-1},x_{t})$$
* $h_{t}$: new state
* $f_{w}$: recurrence function
* $h_{t-1}$: old state
* $x_{t}$: input vector at some time step

As we change the W in $f_{w}$, the RNN will behave differently. Hence, we can train those weights on data.

**Important:** the same function is used at every single time step. We have this fixed function with weights W, and we apply that single function at every time step. That allows us to use the recurrent network on sequences without having to commit to the size of the sequence, **no matter how long the input or output sequences are.**
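To make the "same function at every time step" idea concrete, here is a minimal sketch in plain Python. The helper `run_recurrence` and the toy scalar recurrence `toy_f` (with its weight `w=0.5`) are hypothetical names chosen for illustration; they are not part of the lecture or of any library.

```python
def run_recurrence(f_w, h0, inputs):
    """Apply the same recurrence function f_w at every time step."""
    h = h0
    states = []
    for x_t in inputs:
        h = f_w(h, x_t)  # h_t = f_w(h_{t-1}, x_t)
        states.append(h)
    return states


def toy_f(h_prev, x_t, w=0.5):
    """Toy scalar recurrence (an assumption for illustration only)."""
    return w * h_prev + x_t


# The same function handles sequences of different lengths.
short_states = run_recurrence(toy_f, 0.0, [1.0, 1.0])
long_states = run_recurrence(toy_f, 0.0, [1.0] * 5)
print(short_states)
print(long_states)
```

Nothing in `run_recurrence` depends on the sequence length, which is the property described above.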

## Vanilla RNN
The simplest way to set this up is to use a single hidden state $h$ together with a recurrence formula that tells us how to update this hidden state as a function of the previous hidden state and the current input $x_{t}$.

In the simplest case, we have two weight matrices, $w_{hh}$ and $w_{xh}$. They project the hidden state from the previous time step and the current input, respectively. The two projections are added, and the sum is squashed with a tanh.
$$h_{t} = \tanh(w_{hh}h_{t-1} + w_{xh}x_{t})$$

Then we can base predictions on top of $h_{t}$, for example by using just another matrix projection on top of the hidden state. This is the simplest complete case in which we can wire up a neural network.
$$y_{t} = w_{hy}h_{t}$$
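The two formulas above can be sketched as a single step function in plain Python. The helper names `matvec` and `rnn_step`, the layer sizes, and all weight values below are assumptions made for illustration; a real model would learn the weights from data.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def rnn_step(w_hh, w_xh, w_hy, h_prev, x_t):
    """One vanilla RNN step: returns the new state h_t and the output y_t."""
    pre_activation = [
        a + b for a, b in zip(matvec(w_hh, h_prev), matvec(w_xh, x_t))
    ]
    h_t = [math.tanh(v) for v in pre_activation]  # h_t = tanh(w_hh h_{t-1} + w_xh x_t)
    y_t = matvec(w_hy, h_t)                       # y_t = w_hy h_t
    return h_t, y_t


# Hypothetical weights: hidden size 2, input size 2, output size 1.
w_hh = [[0.1, 0.0], [0.0, 0.1]]
w_xh = [[0.5, -0.5], [1.0, 0.0]]
w_hy = [[1.0, 1.0]]

h0 = [0.0, 0.0]
h1, y1 = rnn_step(w_hh, w_xh, w_hy, h0, [1.0, 0.0])
h2, y2 = rnn_step(w_hh, w_xh, w_hy, h1, [0.0, 1.0])  # same weights, next step
print(h1, y1)
print(h2, y2)
```

Because of the tanh, every entry of the hidden state stays strictly between -1 and 1, while the output $y_{t}$ is an unbounded linear projection.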

## One RNN Application
### Character-level language model
In this application, we feed a sequence of characters into the RNN, and at every time step we ask the RNN to predict the next character in the sequence. The prediction is an entire distribution over what it thinks should come next, given the sequence it has seen so far.

Example training sequence: **"hello"**  
Vocabulary: [h, e, l, o]

In the example above, we have the training sequence **"hello"**, so we have a vocabulary of four characters. Now we are going to try to get an RNN to learn to predict the next character in a sequence on this training data.  
![Character-level language model](Images/char_level_language_model.png)
In the picture above, the x axis is time, and the input layer uses [one-hot encoding](https://en.wikipedia.org/wiki/One-hot). This encoding represents each character as a vector with a single bit set, at the character's index in the vocabulary.
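As a small sketch of this encoding for the "hello" example, the snippet below builds the input vectors and the next-character targets in plain Python. The names `char_to_index` and `one_hot` are hypothetical helpers introduced here; the vocabulary order [h, e, l, o] is the one given above.

```python
vocab = ["h", "e", "l", "o"]
char_to_index = {ch: i for i, ch in enumerate(vocab)}


def one_hot(ch):
    """Return a vector with a single 1 at the character's vocabulary index."""
    vec = [0] * len(vocab)
    vec[char_to_index[ch]] = 1
    return vec


sequence = "hello"
# Inputs are all characters except the last; targets are the characters that follow.
inputs = [one_hot(ch) for ch in sequence[:-1]]
targets = [char_to_index[ch] for ch in sequence[1:]]

for ch, vec, tgt in zip(sequence[:-1], inputs, targets):
    print(ch, vec, "-> target:", vocab[tgt])
```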

Then we use the recurrence formula, with $h_{t-1}$ set to all zeros at the first step. We apply this fixed recurrence to compute the hidden state vector at every time step. Suppose we have only three numbers in the hidden state (the three numbers in each green block). We end up with a three-dimensional representation that, at any point in time, summarizes all the characters that have come until then. From this hidden state, we predict at every time step what the next character in the sequence should be.

Since we have four characters in this vocabulary, we predict four numbers at every time step (the numbers in the blue blocks). For instance, in the very first time step we feed in the letter 'h', and the RNN with its current setting of weights computes the unnormalized log probabilities for what should come next, shown in the first blue block.

We know that after 'h' in "hello" comes 'e', so the score for 'e' (2.2 in the first blue block) is the score of the correct answer. At every time step we have a target for which character should come next, so we want the score of the target character to be high and all the other scores to be low.
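One common way to express "target score high, other scores low" as a training objective is softmax followed by cross-entropy. This is a hedged sketch: the lecture text above does not name the loss here, and of the four scores below only the 2.2 for 'e' comes from the text; the other three values are assumed for illustration.

```python
import math


def softmax(scores):
    """Turn unnormalized log probabilities into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target_index):
    """Negative log probability assigned to the correct next character."""
    return -math.log(softmax(scores)[target_index])


vocab = ["h", "e", "l", "o"]
scores = [1.0, 2.2, -3.0, 4.1]  # assumed values, except 2.2 for 'e'
target = vocab.index("e")       # after 'h' comes 'e'

probs = softmax(scores)
loss = cross_entropy(scores, target)

# Raising the score of the target character lowers the loss.
better_scores = [1.0, 5.0, -3.0, 4.1]
better_loss = cross_entropy(better_scores, target)
print(probs, loss, better_loss)
```

Minimizing this loss pushes the target score up relative to the others, which matches the goal described above.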

# Code

The code is based on [this tutorial](https://www.youtube.com/watch?v=Gl2WXLIMvKA&list=PLhhyoLH6IjfxeoooqP9rhU3HJIAVAJ3Vz&index=5).

## Imports

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F  # Functions without parameters such as relu, tanh, ...
from torch.utils.data import DataLoader  # Dataset management. It helps us create mini-batches to train and ..
import torchvision.datasets as datasets  # Datasets such as MNIST
import torchvision.transforms as transforms  # transformations that we can do on our datasets 

## Create an RNN

Normally we don't use RNNs for images, but here we do for the sake of learning.

In [None]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # We don't have to say explicitly how long the sequences are.
        # nn.RNN works for any number of time steps that we send in.
        # Its third positional argument is the number of stacked layers
        # (num_layers), not the number of classes.
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        # num_classes would size an output layer, which this cell does not define yet.
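The class above stops before defining an output layer or a forward pass. The following is one possible way to complete it, not necessarily the tutorial's version: the class name `RNNClassifier`, the `fc` layer, and the choice to classify from the hidden state of the last time step are all assumptions made for this sketch.

```python
import torch
import torch.nn as nn


class RNNClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        # Assumption: map the last time step's hidden state to class scores.
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_size); start from an all-zero hidden state.
        h0 = torch.zeros(
            self.num_layers, x.size(0), self.hidden_size, device=x.device
        )
        out, _ = self.rnn(x, h0)       # out: (batch, seq_len, hidden_size)
        return self.fc(out[:, -1, :])  # (batch, num_classes)


model = RNNClassifier(input_size=28, hidden_size=256, num_layers=2, num_classes=10)
scores = model(torch.randn(64, 28, 28))        # 28 time steps
other_scores = model(torch.randn(3, 15, 28))   # 15 time steps, same model
print(scores.shape, other_scores.shape)
```

Using only the last time step keeps the output layer independent of the sequence length, so the same model accepts sequences of any length.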
        
        

## Hyperparameters

In [2]:
# Each image is 28x28. We treat it as a sequence of 28 time steps (rows), each with 28 features.
input_size = 28 
sequence_length = 28

num_layers = 2
hidden_size = 256

num_classes = 10
learning_rate = 0.001 
batch_size = 64
num_epochs = 2
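To show how these hyperparameters determine tensor shapes, here is a small sketch that pushes a random stand-in batch through `nn.RNN`. The random tensor replaces a real MNIST batch (which has shape `(batch_size, 1, 28, 28)` after `transforms.ToTensor()`), so no dataset download is needed; the variable names `images` and `sequences` are assumptions for illustration.

```python
import torch
import torch.nn as nn

input_size = 28
sequence_length = 28
num_layers = 2
hidden_size = 256
batch_size = 64

# Random stand-in for one MNIST batch: (batch, channels, height, width).
images = torch.randn(batch_size, 1, sequence_length, input_size)

# Drop the channel dimension so each image row becomes one time step.
sequences = images.squeeze(1)  # (batch_size, sequence_length, input_size)

rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
out, h_n = rnn(sequences)

print(sequences.shape)  # (batch_size, sequence_length, input_size)
print(out.shape)        # (batch_size, sequence_length, hidden_size)
print(h_n.shape)        # (num_layers, batch_size, hidden_size)
```

Note that `out` holds the top layer's hidden state at every time step, while `h_n` holds the final hidden state of each layer.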