<a href="https://colab.research.google.com/github/Jacobluke-/FYPI/blob/main/Pytorch_Tutorial/LSTM_related.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence Models and Long Short-Term Memory Networks
At this point, we have seen various feed-forward networks, that is, networks that maintain no state at all. This might not be the behavior we want. Sequence models are central to NLP: they are models where there is some sort of dependence through time between your inputs. The classical example of a sequence model is the Hidden Markov Model for part-of-speech tagging. Another example is the conditional random field.

A recurrent neural network is a network that maintains some kind of state. For example, its output could be used as part of the next input, so that information can propagate along as the network passes over the sequence. In the case of an LSTM, for each element in the sequence, there is a corresponding *hidden state* $ h_t $, which in principle can contain information from arbitrary points earlier in the sequence. We can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things.


Before getting to the example, note a few things. PyTorch's LSTM expects
all of its inputs to be 3D tensors. The semantics of the axes of these
tensors are important. With the default `batch_first=False`, the first axis
is the sequence itself, the second indexes instances in the mini-batch, and
the third indexes elements of the input. We haven't discussed mini-batching,
so let's just ignore that and assume the second axis always has size 1. If
we want to run the sequence model over the sentence "The cow jumped",
our input should look like
 
\begin{align}\begin{bmatrix}
   \overbrace{q_\text{The}}^\text{row vector} \\
   q_\text{cow} \\
   q_\text{jumped}
   \end{bmatrix}\end{align}

Remember, though, that there is an additional second dimension of size 1.
 
In addition, you could go through the sequence one element at a time, in which
case the first axis will also have size 1.
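
To make the axis layout concrete, here is a minimal shape-only sketch. The embedding size `EMBED_DIM = 4` and the random vectors standing in for $q_\text{The}$, $q_\text{cow}$ and $q_\text{jumped}$ are assumptions made purely for illustration; they are not values used elsewhere in this tutorial.

```python
import torch

torch.manual_seed(0)

EMBED_DIM = 4  # hypothetical embedding size, chosen only for illustration

# One row vector per word of "The cow jumped" (random stand-ins for the q vectors).
sentence = "The cow jumped".split()
word_vectors = [torch.randn(EMBED_DIM) for _ in sentence]

# Stack to (seq_len, input_size), then add the mini-batch axis of size 1 at position 1.
whole_sequence = torch.stack(word_vectors).unsqueeze(1)
print(whole_sequence.shape)  # torch.Size([3, 1, 4])

# Feeding one element at a time: the first axis has size 1 as well.
single_step = word_vectors[0].view(1, 1, -1)
print(single_step.shape)  # torch.Size([1, 1, 4])
```

Both tensors have three axes: sequence position, mini-batch instance, and input element.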
 
Let's see a quick example.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7fce5f47cab0>

In [6]:
lstm = nn.LSTM(3,3) # Input dimension is 3, and the output dimension is 3 too.
inputs = [torch.randn(1, 3) for _ in range(5)] # make a sequence of length 5

# initialize the hidden state
hidden = (torch.randn(1,1,3),
          torch.randn(1,1,3))
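for i in inputs:
  # Step through the sequence one element at a time.
  # After each step, hidden holds the updated (hidden state, cell state) pair.
  out, hidden = lstm(i.view(1,1,-1),hidden)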

In [7]:
# Alternatively, we can process the entire sequence at once.
# The first value returned by the LSTM is all of the hidden states throughout
# the sequence. The second is the most recent hidden state and cell state
# (compare the last slice of "out" with the first element of "hidden" below;
# they are the same).
# The reason for returning both is that:
# "out" gives you access to all hidden states in the sequence;
# "hidden" lets you continue the sequence and backpropagate,
# by passing it as an argument to the lstm at a later time.
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs),1,-1)
hidden = (torch.randn(1,1,3),torch.randn(1,1,3))  # re-initialize the hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)


tensor([[[ 0.3941, -0.2793,  0.1220]],

        [[ 0.3006,  0.0331,  0.3227]],

        [[ 0.1311,  0.0330, -0.1106]],

        [[ 0.1626,  0.0292,  0.0416]],

        [[ 0.1789,  0.1566,  0.0381]]], grad_fn=<StackBackward>)
(tensor([[[0.1789, 0.1566, 0.0381]]], grad_fn=<StackBackward>), tensor([[[0.4998, 0.3463, 0.0569]]], grad_fn=<StackBackward>))
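
The relationship between the two return values can be checked numerically. The following is a small sketch, not part of the original notebook: it uses its own random inputs (so the numbers differ from the output above) and confirms that the last slice of `out` matches the final hidden state, and that stepping through the sequence one element at a time gives the same hidden states as a single call.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

lstm = nn.LSTM(3, 3)             # input size 3, hidden size 3
inputs = torch.randn(5, 1, 3)    # (seq_len=5, batch=1, input_size=3)
h0 = torch.randn(1, 1, 3)
c0 = torch.randn(1, 1, 3)

# Whole sequence in one call.
out, (h_n, c_n) = lstm(inputs, (h0, c0))

# Same sequence, one step at a time, carrying the state forward.
hidden = (h0, c0)
step_outputs = []
for x in inputs:                                  # x has shape (1, 3)
    step_out, hidden = lstm(x.view(1, 1, -1), hidden)
    step_outputs.append(step_out)
stepwise_out = torch.cat(step_outputs)            # (5, 1, 3)

print(out.shape, h_n.shape, c_n.shape)
# Last slice of "out" versus the final hidden state.
print(torch.allclose(out[-1], h_n[0]))
# One call versus step-by-step processing (up to floating-point error).
print(torch.allclose(out, stepwise_out, atol=1e-5))
```

Note that the second return value is a pair: only its first element (the hidden state) appears in `out`; the cell state is returned separately.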


## Example: An LSTM for Part-of-Speech Tagging
In this section, we will use an LSTM to predict part-of-speech tags. We won't use Viterbi or Forward-Backward or anything like that, but as a (challenging) exercise to the reader, think about how Viterbi could be used after you have seen what is going on. 
In this example, we also refer to embeddings.

The model is as follows: let our input sentence be $ w_1, \cdots, w_M $, where $ w_i \in V $, our vocabulary.
Also, let $ T $ be our tag set, and $ y_i $ the tag of word $ w_i $. Denote our prediction of the tag of word $ w_i $ by $ \hat y_i $.

This is a structured prediction model, where our output is a sequence $ \hat y_1, \cdots, \hat y_M $, with $ \hat y_i \in T $.

To do the prediction, pass an LSTM over the sentence. Denote the hidden
state at timestep $i$ as $h_i$. Also, assign each tag a
unique index (like how we had word\_to\_ix in the word embeddings
section). Then our prediction rule for $\hat{y}_i$ is
 
\begin{align}\hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(Ah_i + b))_j\end{align}

That is, take the log softmax of the affine map of the hidden state,
and the predicted tag is the tag that has the maximum value in this
vector. Note this implies immediately that the dimensionality of the
target space of $A$ is $|T|$.
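
The prediction rule can be sketched in isolation as follows. This is an illustration under assumptions: the hidden states are random stand-ins rather than LSTM outputs, and `HIDDEN_DIM`, `NUM_TAGS` and the variable names are chosen only for this example. `nn.Linear` plays the role of the affine map $Ah_i + b$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

HIDDEN_DIM = 6   # size of each h_i (illustrative)
NUM_TAGS = 3     # |T|

# nn.Linear stores the affine map h -> A h + b; A has shape (|T|, HIDDEN_DIM).
affine = nn.Linear(HIDDEN_DIM, NUM_TAGS)

# Stand-in hidden states for a 4-word sentence, one row per timestep.
hidden_states = torch.randn(4, HIDDEN_DIM)

with torch.no_grad():
    scores = affine(hidden_states)              # (4, |T|)
    log_probs = F.log_softmax(scores, dim=1)    # each row: log-probabilities over tags
    predicted = log_probs.argmax(dim=1)         # predicted tag index for each timestep

print(affine.weight.shape)
print(predicted)
# Log softmax is monotonic within each row, so it does not change the argmax.
print(torch.equal(predicted, scores.argmax(dim=1)))
```

The log softmax matters for training, where the log-probabilities feed a negative log-likelihood loss; for prediction alone, the argmax of the raw scores gives the same tags.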
 
 
Prepare data:

In [11]:
def prepare_sequence(seq, to_ix):
  idxs = [to_ix[w] for w in seq]
  return torch.tensor(idxs,dtype = torch.long)

training_data = [
  # Tags are: DET - determiner; NN - noun; V - verb
  # For example, the word "The" is a determiner
  ("The dog ate the apple".split(), ["DET","NN","V","DET","NN"]),
  ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
  for word in sent:
    if word not in word_to_ix:
      word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
tag_to_ix = {"DET":0, "NN":1, "V":2}


{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
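
As a quick check of what `prepare_sequence` produces, the sketch below applies it to the second training sentence and its tags. The dictionaries are copied from the output and code above so that the snippet runs on its own.

```python
import torch

def prepare_sequence(seq, to_ix):
    # Map each token to its integer index and pack the result into a LongTensor.
    return torch.tensor([to_ix[w] for w in seq], dtype=torch.long)

word_to_ix = {"The": 0, "dog": 1, "ate": 2, "the": 3, "apple": 4,
              "Everybody": 5, "read": 6, "that": 7, "book": 8}
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

sentence_in = prepare_sequence("Everybody read that book".split(), word_to_ix)
targets = prepare_sequence(["NN", "V", "DET", "NN"], tag_to_ix)

print(sentence_in)  # tensor([5, 6, 7, 8])
print(targets)      # tensor([1, 2, 0, 1])
```

Since no lowercasing is applied, "The" and "the" receive different indices, and a word missing from `word_to_ix` would raise a `KeyError`.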


In [16]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

class LSTMTagger(nn.Module):

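  def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
    super(LSTMTagger, self).__init__()
    self.hidden_dim = hidden_dim
    # Maps each word index to a dense vector of size embedding_dim
    self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
    # The LSTM takes word embeddings as inputs and outputs hidden states
    # of size hidden_dim
    self.lstm = nn.LSTM(embedding_dim, hidden_dim)
    # The linear layer maps from hidden state space to tag space
    self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
  
  def forward(self, sentence):
    # Look up an embedding for each word index: (len(sentence), embedding_dim)
    embeds = self.word_embeddings(sentence)
    # The LSTM expects (seq_len, batch, input_size); the batch size here is 1
    lstm_out, _ = self.lstm(embeds.view(len(sentence),1,-1))
    # Map each hidden state to tag space: (len(sentence), tagset_size)
    tag_space = self.hidden2tag(lstm_out.view(len(sentence),-1))
    # Log-probabilities over tags for each word
    tag_scores = F.log_softmax(tag_space, dim = 1)
    return tag_scores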

In [17]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM,len(word_to_ix),len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(),lr = 0.1)

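# See what the scores are before training.
# Element (i, j) of the output is the score for tag j for word i.
# The code is wrapped in torch.no_grad() because no gradients are needed here.
with torch.no_grad():
  inputs = prepare_sequence(training_data[0][0], word_to_ix)
  tag_scores = model(inputs)
  print(tag_scores)

# 300 epochs is a lot, but this is a tiny toy data set
for epoch in range(300):
  for sentence, tags in training_data:
    # Step 1. PyTorch accumulates gradients, so clear them before each instance
    model.zero_grad()

    # Step 2. Turn the words and tags into tensors of indices
    sentence_in = prepare_sequence(sentence,word_to_ix)
    targets = prepare_sequence(tags, tag_to_ix)

    # Step 3. Run the forward pass
    tag_scores = model(sentence_in)

    # Step 4. Compute the loss and gradients, then update the parameters
    loss = loss_function(tag_scores, targets)
    loss.backward()
    optimizer.step()

# See what the scores are after training.
# The sentence is "The dog ate the apple"; the predicted tag for each word
# is the index of the maximum score in its row.
with torch.no_grad():
  inputs = prepare_sequence(training_data[0][0], word_to_ix)
  tag_scores = model(inputs)

  print(tag_scores)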

tensor([[-1.3747, -0.6954, -1.3936],
        [-1.3148, -0.7536, -1.3439],
        [-1.3490, -0.7199, -1.3716],
        [-1.3210, -0.7491, -1.3458],
        [-1.2844, -0.7818, -1.3258]])
tensor([[-0.0268, -4.3567, -4.2986],
        [-5.2154, -0.0301, -3.7195],
        [-3.2546, -3.6441, -0.0669],
        [-0.0309, -5.1158, -3.7140],
        [-4.9647, -0.0170, -4.6165]])
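
To read the trained scores as tags, take the index of the largest entry in each row and map it back through `tag_to_ix`. The sketch below does this for the second tensor printed above; `ix_to_tag` is a helper name introduced here for illustration.

```python
import torch

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}  # hypothetical reverse lookup

# Log-probabilities printed after training for "The dog ate the apple".
tag_scores = torch.tensor([[-0.0268, -4.3567, -4.2986],
                           [-5.2154, -0.0301, -3.7195],
                           [-3.2546, -3.6441, -0.0669],
                           [-0.0309, -5.1158, -3.7140],
                           [-4.9647, -0.0170, -4.6165]])

predicted_ix = tag_scores.argmax(dim=1)
predicted_tags = [ix_to_tag[i] for i in predicted_ix.tolist()]
print(predicted_tags)  # ['DET', 'NN', 'V', 'DET', 'NN']
```

The decoded sequence matches the gold tags for the first training sentence, whereas the scores printed before training favor a single tag for every word.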


## Exercise: Augmenting the LSTM part-of-speech tagger with character-level features

In the example above, each word had an embedding, which served as the
inputs to our sequence model. Let's augment the word embeddings with a
representation derived from the characters of the word. We expect that
this should help significantly, since character-level information like
affixes has a large bearing on part-of-speech. For example, words with
the suffix *-ly* are almost always tagged as adverbs in English.

To do this, let $c_w$ be the character-level representation of
word $w$. Let $x_w$ be the word embedding as before. Then
the input to our sequence model is the concatenation of $x_w$ and
$c_w$. So if $x_w$ has dimension 5, and $c_w$
dimension 3, then our LSTM should accept an input of dimension 8.

To get the character-level representation, run an LSTM over the
characters of a word, and let $c_w$ be the final hidden state of
this LSTM. Hints:

* There are going to be two LSTMs in your new model:
  the original one that outputs POS tag scores, and the new one that
  outputs a character-level representation of each word.
* To do a sequence model over characters, you will have to embed characters.
  The character embeddings will be the input to the character LSTM.
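
One possible starting point for the exercise is sketched below. It is an untrained skeleton under assumptions, not a reference solution: the class name `CharLSTMTagger`, the argument names, the character embedding size of 4, and the choice to run the character LSTM one word at a time are all illustrative. The dimensions follow the example in the text ($x_w$ of dimension 5 and $c_w$ of dimension 3, so the word LSTM takes inputs of dimension 8).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

class CharLSTMTagger(nn.Module):
    """Hypothetical sketch: word embedding x_w concatenated with a char-LSTM summary c_w."""

    def __init__(self, word_dim, char_dim, char_hidden_dim, hidden_dim,
                 vocab_size, charset_size, tagset_size):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, word_dim)
        self.char_embeddings = nn.Embedding(charset_size, char_dim)
        # Character LSTM: reads one word's characters and produces c_w.
        self.char_lstm = nn.LSTM(char_dim, char_hidden_dim)
        # Word LSTM: its input is the concatenation of x_w and c_w.
        self.lstm = nn.LSTM(word_dim + char_hidden_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, word_ixs, char_ixs_per_word):
        word_embeds = self.word_embeddings(word_ixs)           # (M, word_dim)
        char_reprs = []
        for char_ixs in char_ixs_per_word:
            char_embeds = self.char_embeddings(char_ixs)       # (num_chars, char_dim)
            _, (h_n, _) = self.char_lstm(char_embeds.view(len(char_ixs), 1, -1))
            char_reprs.append(h_n.view(-1))                    # c_w: final hidden state
        char_reprs = torch.stack(char_reprs)                   # (M, char_hidden_dim)
        combined = torch.cat([word_embeds, char_reprs], dim=1) # (M, word_dim + char_hidden_dim)
        lstm_out, _ = self.lstm(combined.view(len(word_ixs), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(word_ixs), -1))
        return F.log_softmax(tag_space, dim=1)

sentence = "The dog ate the apple".split()
word_to_ix = {w: i for i, w in enumerate(dict.fromkeys(sentence))}
char_to_ix = {c: i for i, c in enumerate(sorted(set("".join(sentence))))}

word_ixs = torch.tensor([word_to_ix[w] for w in sentence], dtype=torch.long)
char_ixs_per_word = [torch.tensor([char_to_ix[c] for c in w], dtype=torch.long)
                     for w in sentence]

model = CharLSTMTagger(word_dim=5, char_dim=4, char_hidden_dim=3, hidden_dim=6,
                       vocab_size=len(word_to_ix), charset_size=len(char_to_ix),
                       tagset_size=3)
with torch.no_grad():
    tag_scores = model(word_ixs, char_ixs_per_word)
print(tag_scores.shape)       # torch.Size([5, 3])
print(model.lstm.input_size)  # 8
```

Training would proceed as in the word-level example, with the character indices passed alongside the word indices on each forward pass. Looping over words keeps the sketch simple; batching the character sequences with padding is a possible refinement.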