In [1]:
# settings for tutorial presentation with RISE
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
              'width': '100%',
              'height': '100%',
              'scroll': True,
              'enable_chalkboard': False,
})

{'width': '100%', 'height': '100%', 'scroll': True, 'enable_chalkboard': False}

In [2]:
import torch

# ASDS 2

## Tutorial: Recurrent Neural Nets with PyTorch

Anna Rogers

# Recap: basic neural network in PyTorch


In [3]:
class LinearClassifier(torch.nn.Module):
    # initialization parameters
    def __init__(self, n_features, n_classes):
        super().__init__()
        # we will have only one linear layer which takes the given number of features as its inputs,
        # and outputs a score for each of the given number of classes
        self.linear = torch.nn.Linear(n_features, n_classes)

    # you always need to define the forward() method which defines how your model performs
    # forward propagation
    def forward(self, x):
        linear_out = self.linear(x)
        return linear_out
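
As a quick check of the recap, the classifier can be run on a random batch. This is a minimal sketch: the sizes (10 features, 3 classes, batch of 4) are made up for illustration, and the class is repeated only to keep the snippet self-contained.

```python
import torch

class LinearClassifier(torch.nn.Module):
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.linear = torch.nn.Linear(n_features, n_classes)

    def forward(self, x):
        return self.linear(x)

# hypothetical sizes, chosen only for this example
model = LinearClassifier(n_features=10, n_classes=3)
batch = torch.rand(4, 10)   # 4 observations with 10 features each
scores = model(batch)       # calling the model runs forward()
print(scores.shape)         # one score per class for each observation
```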

# RNN in PyTorch

Introducing a new layer type: the [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) layer. Its two main parameters are `input_size` (the number of features per input token) and `hidden_size` (the number of dimensions of each hidden state).

In [4]:
rnn_layer = torch.nn.RNN(input_size=5, hidden_size=2, batch_first=True)

`batch_first=True` means that the first dimension of the tensors we feed in indexes the sequences in the batch. Without it, PyTorch expects the sequence (time step) dimension first and the batch dimension second.

The input of the RNN layer has three dimensions: the number of observations (sequences) in the batch, the number of inputs in one sequence, and the number of features per input.

In [5]:
sample_input = torch.rand(3, 4, 5)
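
To make the dimension order concrete, here is a small sketch comparing `batch_first=True` with the default layout. The variable names (`rnn_bf`, `rnn_sf`, and so on) are made up for this example.

```python
import torch

batch_size, seq_len, n_features, hidden_size = 3, 4, 5, 2

# batch_first=True: input is (batch, sequence, features)
rnn_bf = torch.nn.RNN(input_size=n_features, hidden_size=hidden_size, batch_first=True)
x_bf = torch.rand(batch_size, seq_len, n_features)
out_bf, h_bf = rnn_bf(x_bf)

# default (batch_first=False): input is (sequence, batch, features)
rnn_sf = torch.nn.RNN(input_size=n_features, hidden_size=hidden_size)
x_sf = x_bf.transpose(0, 1).contiguous()
out_sf, h_sf = rnn_sf(x_sf)

print(out_bf.shape, h_bf.shape)
print(out_sf.shape, h_sf.shape)
```

Only the tensor of all hidden states follows the `batch_first` layout. The final hidden state has the same shape in both cases, with a leading dimension for the number of layers.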

The output of the RNN layer is a tuple of (all_hidden_states, final_hidden_state). There are as many hidden states as tokens in the input sequence, because at each step the RNN cell takes in one input token (plus the previous hidden state) and produces one new hidden state. The final hidden state is simply the hidden state produced after the last token.

In [6]:
all_hidden_states, last_hidden_state = rnn_layer(sample_input)
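
To see what the layer does at each step, the recurrence can be unrolled by hand. This is a sketch assuming the default settings of `torch.nn.RNN` (one layer, `tanh` nonlinearity, initial hidden state of zeros). It reads the layer's own weights and recomputes the hidden states with a plain loop; `manual_states` is a name made up for this example.

```python
import torch

torch.manual_seed(0)
rnn_layer = torch.nn.RNN(input_size=5, hidden_size=2, batch_first=True)
sample_input = torch.rand(3, 4, 5)
all_hidden_states, last_hidden_state = rnn_layer(sample_input)

# parameters of the (single) RNN layer
W_ih, W_hh = rnn_layer.weight_ih_l0, rnn_layer.weight_hh_l0   # shapes 2x5 and 2x2
b_ih, b_hh = rnn_layer.bias_ih_l0, rnn_layer.bias_hh_l0

h = torch.zeros(3, 2)            # initial hidden state for each of the 3 sequences
manual_states = []
for t in range(sample_input.shape[1]):
    x_t = sample_input[:, t, :]  # the t-th token of every sequence, shape 3x5
    # new hidden state from the current token and the previous hidden state
    h = torch.tanh(x_t @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
    manual_states.append(h)
manual_states = torch.stack(manual_states, dim=1)   # shape 3x4x2

# compare the hand-computed states with the layer's output
print(torch.allclose(manual_states, all_hidden_states, atol=1e-5))
```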

In [7]:
# we have a batch of 3 sequences; each sequence leads to 4 hidden states (~ input tokens),
# and we told the model to produce 2-dimensional hidden states. So we get 3x4x2 numbers.
print(all_hidden_states)

tensor([[[0.5234, 0.2906],
         [0.4931, 0.7405],
         [0.9336, 0.8856],
         [0.8287, 0.6910]],

        [[0.1451, 0.5003],
         [0.5696, 0.7911],
         [0.7198, 0.7579],
         [0.7475, 0.6627]],

        [[0.2829, 0.5905],
         [0.8499, 0.8172],
         [0.8443, 0.7516],
         [0.6159, 0.8172]]], grad_fn=<TransposeBackward1>)


In [8]:
# last_hidden_state just reproduces the final hidden state for each sequence in the batch.
# So 1x3x2 numbers: the extra leading dimension of size 1 is the number of RNN layers.
print(last_hidden_state)

tensor([[[0.8287, 0.6910],
         [0.7475, 0.6627],
         [0.6159, 0.8172]]], grad_fn=<StackBackward0>)
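
The relationship between the two outputs can be checked directly. A minimal sketch, assuming a single-layer, unidirectional RNN as above (the names `final_from_all` and `final_from_rnn` are made up for this example):

```python
import torch

rnn_layer = torch.nn.RNN(input_size=5, hidden_size=2, batch_first=True)
sample_input = torch.rand(3, 4, 5)
all_hidden_states, last_hidden_state = rnn_layer(sample_input)

# last time step of every sequence, taken from the full output: shape 3x2
final_from_all = all_hidden_states[:, -1, :]
# drop the leading "number of layers" dimension: shape 3x2
final_from_rnn = last_hidden_state.squeeze(0)

print(final_from_all.shape, final_from_rnn.shape)
print(torch.allclose(final_from_all, final_from_rnn))
```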


# Using the RNN activations

In the exercise we will once again try to classify tweets into three sentiment categories. This is a many-to-one set-up. We will therefore only use the final hidden state of the RNN. We can then feed it into an output layer to produce our final output.
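
Put together as a module, the many-to-one set-up could look roughly like this. This is only a sketch: the class name `RNNClassifier` and its arguments are assumptions for illustration, not the exercise solution. It returns raw, unnormalized class scores.

```python
import torch

class RNNClassifier(torch.nn.Module):
    # hypothetical model: one RNN layer followed by a linear output layer
    def __init__(self, n_features, hidden_size, n_classes):
        super().__init__()
        self.rnn = torch.nn.RNN(input_size=n_features, hidden_size=hidden_size,
                                batch_first=True)
        self.linear = torch.nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # ignore the per-token hidden states, keep only the final one
        _, last_hidden_state = self.rnn(x)           # shape 1 x batch x hidden
        last_hidden_state = last_hidden_state.squeeze(0)   # shape batch x hidden
        return self.linear(last_hidden_state)        # shape batch x n_classes

model = RNNClassifier(n_features=5, hidden_size=2, n_classes=3)
scores = model(torch.rand(3, 4, 5))   # batch of 3 sequences, 4 tokens, 5 features
print(scores.shape)
```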

In [11]:
# linear combination of the (in this case two) elements of the hidden states
# output three scores: one for each of the options in our classification problem 
linear_layer = torch.nn.Linear(2, 3)
# softmax activation function - normalizing the class scores to have sum=1.
# last_hidden_state has shape 1x3x2, so the class scores end up on the last dimension;
# dim=1 would normalize across the 3 sequences instead of across the 3 classes
activ_layer = torch.nn.Softmax(dim=-1)

# produce some example output. Note we are only using the last hidden state
linear = linear_layer(last_hidden_state)
activ = activ_layer(linear)

# for each of the 3 sequences in our example batch,
# we get probabilities for the 3 classes (each row sums to 1)
activ

**Note:** in the exercise set, we will not be using a softmax layer. We will only be doing a linear step in the output layer.

The reason is that when we later calculate the cross-entropy loss with PyTorch's built-in [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) function, that function expects raw, unnormalized scores and applies a (log-)softmax transformation for us before calculating the loss.

So, we are "outsourcing" the final softmax activation function of the model to the loss calculation. (We actually did this wrong in PSet 4.1 and added a softmax layer where none was needed. This will be fixed!)
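
This behaviour can be checked on made-up scores. The sketch below compares `CrossEntropyLoss` on raw scores with a manual log-softmax followed by the negative log-likelihood loss. It also shows that applying softmax first and then `CrossEntropyLoss` gives a different value. The scores and targets are random and arbitrary, chosen only for illustration.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(3, 3)            # raw model outputs: 3 sequences x 3 classes
targets = torch.tensor([0, 2, 1])     # arbitrary "true" classes for the example

# built-in loss applied directly to the raw scores
loss_builtin = torch.nn.CrossEntropyLoss()(scores, targets)

# the same computation in two explicit steps
log_probs = torch.log_softmax(scores, dim=1)
loss_manual = torch.nn.functional.nll_loss(log_probs, targets)

# what happens if the model already applied softmax: softmax is applied twice
loss_double = torch.nn.CrossEntropyLoss()(torch.softmax(scores, dim=1), targets)

print(torch.allclose(loss_builtin, loss_manual))
print(loss_builtin.item(), loss_double.item())
```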