# Lab 6: An LSTM for Part-of-Speech Tagging



## Part-of-Speech Tagging


In this section, we will use an LSTM to predict part-of-speech tags.

The model is as follows: let our input sentence be
$w_1, \dots, w_M$, where $w_i \in V$, our vocabulary. Also, let
$T$ be our tag set, and $y_i$ the tag of word $w_i$.
Denote our prediction of the tag of word $w_i$ by
$\hat{y}_i$.

This is a structured prediction model, where our output is a sequence
$\hat{y}_1, \dots, \hat{y}_M$, where $\hat{y}_i \in T$.

To do the prediction, pass an LSTM over the sentence. Denote the hidden
state at timestep $i$ as $h_i$. Also, assign each tag a
unique index (like how we had word\_to\_ix in the word embeddings
section). Then our prediction rule for $\hat{y}_i$ is

$$\begin{align}\hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(Ah_i + b))_j\end{align}$$

That is, take the log softmax of the affine map of the hidden state,
and the predicted tag is the tag that has the maximum value in this
vector. Note this implies immediately that the dimensionality of the
target space of $A$ is $|T|$.
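
The prediction rule can be sketched without any deep learning library. The snippet below is only an illustration: the scores stand in for the affine output $Ah_i + b$ of a single word and are made-up values, and the helper names `log_softmax` and `predict_tag` are assumptions, not part of the lab code.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of a list of floats."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def predict_tag(scores, ix_to_tag):
    """Return the tag whose log-probability is largest."""
    log_probs = log_softmax(scores)
    best = max(range(len(log_probs)), key=lambda j: log_probs[j])
    return ix_to_tag[best]

# Hypothetical values of A h_i + b for one word, with |T| = 3.
example_scores = [0.5, 2.0, -1.0]
example_tags = {0: "DET", 1: "NN", 2: "V"}

log_probs = log_softmax(example_scores)
predicted = predict_tag(example_scores, example_tags)
print(log_probs)
print(predicted)  # NN
```

Because the log softmax is monotonic, the argmax of the log-probabilities is the same index as the argmax of the raw scores; the log softmax matters for the loss, not for the decision itself.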





In [42]:
%matplotlib inline

## Sequence Models and Long Short-Term Memory Networks


At this point, we have seen various feed-forward networks, which maintain
no state at all between inputs. This might not be
the behavior we want. Sequence models are central to NLP: they are
models where there is some sort of dependence through time between your
inputs. The classical example of a sequence model is the Hidden Markov
Model for part-of-speech tagging. Another example is the conditional
random field.

A recurrent neural network is a network that maintains some kind of
state. For example, its output could be used as part of the next input,
so that information can propagate along as the network passes over the
sequence. In the case of an LSTM, for each element in the sequence,
there is a corresponding hidden state $h_t$, which in principle
can contain information from arbitrary points earlier in the sequence.
We can use the hidden state to predict words in a language model,
part-of-speech tags, and a myriad of other things.
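
To make the idea of a state carried through the sequence concrete, here is a toy recurrence in plain Python. It is deliberately not an LSTM: it is a single-unit recurrent update $h_t = \tanh(w_x x_t + w_h h_{t-1} + b)$ with hand-picked weights, and the function name `run_recurrence` and all numeric values are assumptions chosen for illustration.

```python
import math

def run_recurrence(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Apply h_t = tanh(w_x * x_t + w_h * h_{t-1} + b) over a sequence."""
    h = 0.0  # initial state
    states = []
    for x in inputs:
        # The new state depends on the current input and on the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Two sequences that differ only in their first element.
states_a = run_recurrence([1.0, 0.0, 0.0])
states_b = run_recurrence([-1.0, 0.0, 0.0])
print(states_a)
print(states_b)
```

The last two inputs are identical in both sequences, yet the final states differ, because information about the first element is still present in the state. An LSTM adds gates and a separate cell state on top of this basic pattern.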






In [43]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x17b5d4a70b0>

## Prepare data:

In [44]:
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

Print the structure of the training data

In [45]:
print(training_data)
print(training_data[0])
print(training_data[0][0])

[(['The', 'dog', 'ate', 'the', 'apple'], ['DET', 'NN', 'V', 'DET', 'NN']), (['Everybody', 'read', 'that', 'book'], ['NN', 'V', 'DET', 'NN'])]
(['The', 'dog', 'ate', 'the', 'apple'], ['DET', 'NN', 'V', 'DET', 'NN'])
['The', 'dog', 'ate', 'the', 'apple']


In [46]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}



{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}


In [47]:
ix_to_tag = {}
for i, tag in enumerate(tag_to_ix):
    ix_to_tag[i] = tag

print(ix_to_tag)

{0: 'DET', 1: 'NN', 2: 'V'}
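
Building `ix_to_tag` by enumerating the keys of `tag_to_ix` gives the right result only when the keys were inserted in the same order as their indices. A slightly more defensive sketch inverts the mapping from the stored indices themselves; the `shuffled` dictionary below is a hypothetical reordering used only to show the difference.

```python
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# Invert the mapping using the stored indices rather than the key order.
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}
print(ix_to_tag)  # {0: 'DET', 1: 'NN', 2: 'V'}

# Hypothetical dictionary with the same content but a different insertion order.
shuffled = {"V": 2, "DET": 0, "NN": 1}
inverse = {ix: tag for tag, ix in shuffled.items()}
by_enumeration = {i: tag for i, tag in enumerate(shuffled)}

print(inverse == ix_to_tag)         # True
print(by_enumeration == ix_to_tag)  # False
```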


In [48]:
# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 10

## Play with the LSTM:



A few cells to understand and test the LSTM

In [49]:
vocab_size = len(word_to_ix)
print(vocab_size)
embedding_dim = EMBEDDING_DIM
hidden_dim = HIDDEN_DIM
word_embeddings = nn.Embedding(vocab_size, embedding_dim)
print(word_embeddings)


9
Embedding(9, 6)


In [50]:
sentence = training_data[0][0]
print(sentence)
sentence_in = prepare_sequence(sentence, word_to_ix)
print(sentence_in)


['The', 'dog', 'ate', 'the', 'apple']
tensor([0, 1, 2, 3, 4])


### Question: how to interpret the content of "embeds" in the next cell?

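embeds is a matrix with one row per word of the input sentence: row $i$ is the embedding vector (of size EMBEDDING_DIM = 6) of the $i$-th word. At this stage the values are the randomly initialized weights of the embedding layer, so they carry no linguistic meaning until the model is trained.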

In [51]:
embeds = word_embeddings(sentence_in)
print(sentence_in)
print(embeds)
print(embeds.shape)

tensor([0, 1, 2, 3, 4])
tensor([[-1.5256, -0.7502, -0.6540, -1.6095, -0.1002, -0.6092],
        [-0.9798, -1.6091, -0.7121,  0.3037, -0.7773, -0.2515],
        [-0.2223,  1.6871,  0.2284,  0.4676, -0.6970, -1.1608],
        [ 0.6995,  0.1991,  0.8657,  0.2444, -0.6629,  0.8073],
        [ 1.1017, -0.1759, -2.2456, -1.4465,  0.0612, -0.6177]],
       grad_fn=<EmbeddingBackward0>)
torch.Size([5, 6])


### Question: how to interpret the content of "lstm_out" in the next cell?

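lstm_out contains the hidden state $h_t$ of the LSTM at every timestep, one vector of size HIDDEN_DIM per word. Its shape is (sequence length, batch size, hidden size), here (5, 1, 10).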

In [52]:
embeds_view = embeds.view(len(sentence), 1, -1)
print(embeds_view)
print(embeds_view.shape)

lstm = nn.LSTM(embedding_dim, hidden_dim)
# Parameters of LSTM
# input_size – The number of expected features in the input x
# hidden_size – The number of features in the hidden state h
lstm_out, _ = lstm(embeds_view)


tensor([[[-1.5256, -0.7502, -0.6540, -1.6095, -0.1002, -0.6092]],

        [[-0.9798, -1.6091, -0.7121,  0.3037, -0.7773, -0.2515]],

        [[-0.2223,  1.6871,  0.2284,  0.4676, -0.6970, -1.1608]],

        [[ 0.6995,  0.1991,  0.8657,  0.2444, -0.6629,  0.8073]],

        [[ 1.1017, -0.1759, -2.2456, -1.4465,  0.0612, -0.6177]]],
       grad_fn=<ViewBackward0>)
torch.Size([5, 1, 6])
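
The second value returned by `nn.LSTM`, which the cell above discards as `_`, is the pair of final hidden and cell states. The sketch below checks how it relates to `lstm_out` for a single-layer, unidirectional LSTM. It uses a random tensor as a stand-in for `embeds_view`, so the numbers differ from the ones printed in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

seq_len, embedding_dim, hidden_dim = 5, 6, 10
# Random stand-in for embeds_view: (seq_len, batch=1, embedding_dim)
inputs = torch.randn(seq_len, 1, embedding_dim)

lstm = nn.LSTM(embedding_dim, hidden_dim)
lstm_out, (h_n, c_n) = lstm(inputs)

print(lstm_out.shape)  # torch.Size([5, 1, 10])
print(h_n.shape)       # torch.Size([1, 1, 10])
# The last row of lstm_out is the final hidden state.
print(torch.allclose(lstm_out[-1], h_n[0]))  # True
```

For tagging we need one prediction per word, which is why the model uses every row of `lstm_out` rather than only the final state.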


## Create the model:

### Question: add an LSTM layer in the following neural network with an embedding layer.

Write your answer here.

### Question: what is the role of the linear layer in the following neural network?

Write your answer here.

### Question: how to interpret the output of the "forward" function?

Write your answer here.

In [53]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        # Write your code here.
        
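        # Maps each word index to a dense vector of size embedding_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Takes word embeddings as input and outputs hidden states of size hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)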

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

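    def forward(self, sentence):
        # Look up embeddings and reshape to (seq_len, batch=1, embedding_dim)
        embeds = self.word_embeddings(sentence).view(len(sentence), 1, -1)
        # lstm_out holds the hidden state at every timestep: (seq_len, 1, hidden_dim)
        lstm_out, _ = self.lstm(embeds)
        # Project each hidden state to tag space: (seq_len, tagset_size)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        # Log-probabilities over the tags for each word
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores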
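
As a hint for the questions about the linear layer and the output of `forward`, the following sketch isolates the last two steps of the model. A random tensor stands in for the LSTM output, so the values are illustrative only; the dimensions match the ones used in this lab.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

seq_len, hidden_dim, tagset_size = 5, 10, 3
# Random stand-in for the LSTM output: (seq_len, batch=1, hidden_dim)
lstm_out = torch.randn(seq_len, 1, hidden_dim)

hidden2tag = nn.Linear(hidden_dim, tagset_size)
# Drop the batch dimension, then map each hidden state to one score per tag.
tag_space = hidden2tag(lstm_out.view(seq_len, -1))  # (5, 10) -> (5, 3)
tag_scores = F.log_softmax(tag_space, dim=1)

print(tag_space.shape)
# Exponentiating the log-softmax gives one probability distribution per word.
print(tag_scores.exp().sum(dim=1))
```

Each row of `tag_scores` corresponds to one word and each column to one tag, so row $i$ holds the log-probabilities from which $\hat{y}_i$ is chosen.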

## Train the model:



In [65]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

In [55]:
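def get_index_of_max(scores):
    # Return the index of the largest entry (the first one in case of ties)
    index = 0
    for i in range(1, len(scores)):
        if scores[i] > scores[index]:
            index = i
    return index

def get_max_prob_result(scores, ix_to_tag):
    # Map the highest-scoring index back to its tag name
    return ix_to_tag[get_index_of_max(scores)]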
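
The same lookup can be done for a whole sentence at once with `torch.argmax`. The sketch below is an alternative to the hand-written loop, not a replacement required by the lab; the score values are made up for illustration and do not come from the model.

```python
import torch

ix_to_tag = {0: "DET", 1: "NN", 2: "V"}

# Illustrative log-scores for two words (not taken from the model).
scores = torch.tensor([[-0.1, -2.5, -3.0],
                       [-2.0, -0.2, -3.5]])

# dim=1 takes the argmax over the tag dimension, one index per word.
best_ix = torch.argmax(scores, dim=1)
tags = [ix_to_tag[i] for i in best_ix.tolist()]
print(tags)  # ['DET', 'NN']
```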

### Question: compute and print the tags before the training for the input defined below.

In [67]:
with torch.no_grad():
    sentence = training_data[0][0]
    inputs = prepare_sequence(sentence, word_to_ix)
    tags_score = model(inputs)
    print(tags_score)
    tags = [get_max_prob_result(tag_val,ix_to_tag) for tag_val in tags_score]
    print(" ".join(sentence))
    print(" ".join(tags))



tensor([[-1.0275, -0.9555, -1.3567],
        [-0.9928, -0.9257, -1.4559],
        [-0.9537, -1.0331, -1.3517],
        [-0.9893, -0.9985, -1.3481],
        [-1.0563, -0.9388, -1.3426]])
The dog ate the apple
NN NN DET DET NN
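
To see what `nn.NLLLoss` measures on this untrained output, the loss can be recomputed by hand from the log-probabilities printed above. This is a plain-Python sketch that assumes the default `reduction='mean'` of `nn.NLLLoss`; the gold tags are those of the first training sentence.

```python
import math

# Log-probabilities printed above for "The dog ate the apple" (before training).
tag_scores = [
    [-1.0275, -0.9555, -1.3567],
    [-0.9928, -0.9257, -1.4559],
    [-0.9537, -1.0331, -1.3517],
    [-0.9893, -0.9985, -1.3481],
    [-1.0563, -0.9388, -1.3426],
]
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
gold_tags = ["DET", "NN", "V", "DET", "NN"]

# NLL: average of minus the log-probability assigned to each gold tag.
targets = [tag_to_ix[t] for t in gold_tags]
nll = -sum(row[t] for row, t in zip(tag_scores, targets)) / len(targets)

print(round(nll, 4))          # 1.0466
print(round(math.log(3), 4))  # 1.0986
```

The loss is close to $\ln 3$, the value obtained by guessing uniformly among three tags, which is consistent with a model that has not learned anything yet.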


### Question: write the code that trains the neural network.

In [68]:
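for epoch in range(300):
    for sentence, tags in training_data:
        # Clear the gradients accumulated during the previous step
        optimizer.zero_grad()
        inputs = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        # Forward pass, negative log-likelihood loss, backward pass
        loss = loss_function(model(inputs), targets)
        loss.backward()
        optimizer.step()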
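
The training loop does not record the loss, so it gives no feedback on whether training is progressing. The self-contained sketch below repeats the setup of this lab and stores the average loss per epoch. The class name `TinyTagger` and the list `epoch_losses` are assumptions introduced for this example; the hyperparameters are the ones used above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"]),
]
word_to_ix = {}
for sent, _ in training_data:
    for word in sent:
        word_to_ix.setdefault(word, len(word_to_ix))
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

def prepare_sequence(seq, to_ix):
    return torch.tensor([to_ix[w] for w in seq], dtype=torch.long)

class TinyTagger(nn.Module):  # hypothetical name; same layers as LSTMTagger
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence).view(len(sentence), 1, -1)
        lstm_out, _ = self.lstm(embeds)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        return F.log_softmax(tag_space, dim=1)

model = TinyTagger(6, 10, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(300):
    total = 0.0
    for sentence, tags in training_data:
        optimizer.zero_grad()
        inputs = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        loss = loss_function(model(inputs), targets)
        loss.backward()
        optimizer.step()
        total += loss.item()  # .item() extracts the Python float
    epoch_losses.append(total / len(training_data))

print(epoch_losses[0], epoch_losses[-1])
```

Comparing the first and last entries of `epoch_losses` is a quick sanity check; plotting the full list would show the whole training curve.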


### Question: compute and print the tags after the training for the input defined below.

In [73]:

with torch.no_grad():
    sentence = training_data[0][0]
    inputs = prepare_sequence(sentence, word_to_ix)
    tags_score = model(inputs)
    print(tags_score)
    tags = [get_max_prob_result(tag_val,ix_to_tag) for tag_val in tags_score]
    print(" ".join(sentence))
    print(" ".join(tags))


tensor([[-0.0547, -3.4119, -3.9009],
        [-3.6310, -0.0289, -6.2381],
        [-3.4301, -6.1423, -0.0351],
        [-0.0469, -4.5868, -3.3348],
        [-4.7391, -0.0097, -7.0291]])
The dog ate the apple
DET NN V DET NN
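
Since the model outputs log-probabilities, exponentiating them recovers the probability the trained model assigns to each tag. The sketch below does this in plain Python for the scores printed above; the variable names are assumptions introduced for this example.

```python
import math

# Log-probabilities printed above for "The dog ate the apple" (after training).
tag_scores = [
    [-0.0547, -3.4119, -3.9009],
    [-3.6310, -0.0289, -6.2381],
    [-3.4301, -6.1423, -0.0351],
    [-0.0469, -4.5868, -3.3348],
    [-4.7391, -0.0097, -7.0291],
]
ix_to_tag = {0: "DET", 1: "NN", 2: "V"}

all_probs, predicted, confidences = [], [], []
for row in tag_scores:
    probs = [math.exp(v) for v in row]  # log-probabilities -> probabilities
    best = max(range(len(probs)), key=lambda j: probs[j])
    all_probs.append(probs)
    predicted.append(ix_to_tag[best])
    confidences.append(probs[best])

print(predicted)  # ['DET', 'NN', 'V', 'DET', 'NN']
print([round(c, 3) for c in confidences])
```

Every predicted tag receives a probability above 0.9, whereas before training the three tags were close to equally likely.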
