# PS2, Part 2: POS Tagging with an LSTM

This is adapted from the [PyTorch tutorial on LSTMs for sequence tagging](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html). We are using the same test data that you used for your HMM POS tagger, and the same training data from which the provided emission probabilities and transition probabilities were derived.

You'll be able to experiment with adjusting the dimensions of embeddings and the hidden layers, as well as the number of epochs, to see whether you can beat the HMM baseline.

**In your PDF containing the answers to the HMM questions for Part 1, please include the answers to the final questions, Q6 and Q7, which appear at the very end of this notebook. I have included a reminder about these questions in the README so you don't forget, but you'll need to actually do the work here to answer them.**

Let's get started with importing the libraries we need and mounting your Google Drive.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

from google.colab import drive
drive.mount('/content/drive')

## Step 1: Prepare the training data

First, we need to read in the training data. You did not actually deal with the training data in the Viterbi code because I calculated the probabilities for you.

You'll need to mount your Google Drive (as seen in lab 7 and lab 8) and upload `train_pos.txt`, `train_tok.txt`, `test_pos.txt`, and `test_tok.txt` to a folder on your Google Drive called `ps2` for the code below to work. 

If you'd rather use the little file system thingy in the left panel in Colab, that's fine, but you'll need to rewrite the code below.

In [None]:
f = open("/content/drive/MyDrive/ps2/train_pos.txt")
posbyline = f.read().split("\n")
f.close()


f = open("/content/drive/MyDrive/ps2/train_tok.txt")
tokbyline = f.read().split("\n")
f.close()

# just print out the tags and text for a random example so you can see them
print(posbyline[2])
print(tokbyline[2])

In [None]:
# Create a list of tuples of lists to store each pair of 
# token sequence and tag sequence

training_data = []
for t, p in zip(tokbyline, posbyline):
  training_data.append( (t.split(),p.split()))

# training_data will be a list of tuples
# each tuple will consist of a list of tokens and a list
# of their corresponding tags
print(training_data[2][0])
print(training_data[2][1])

In [None]:
# My code for reading in the data can't seem to get rid of the final empty line
# that gets added to files sometimes, so I'm just deleting it here manually.
print(training_data[-1])
del(training_data[-1])
print(training_data[-1])
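
The final empty entry comes from `split("\n")`: when a file ends with a newline, the split produces one extra empty string. Here is a minimal sketch of an alternative, assuming the lines are separated by ordinary newline characters. `read_lines` is a hypothetical helper, not part of the original notebook.

```python
def read_lines(path):
    """Return the lines of a text file without a trailing empty entry."""
    with open(path, encoding="utf-8") as f:
        # splitlines() does not produce a final "" when the text ends in "\n"
        return f.read().splitlines()

# The difference, shown on a small in-memory string:
text = "DT NN\nNN VB\n"
print(text.split("\n"))    # ['DT NN', 'NN VB', '']
print(text.splitlines())   # ['DT NN', 'NN VB']
```

If you read the files this way, there is no empty final tuple to delete.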

In [None]:
# This code creates a dictionary mapping each word to a unique integer ID,
# and each tag to a unique integer ID.
# I also create a list of postags for easy lookup by index later on.

word_to_ix = {}
tag_to_ix = {}
postaglist = []

# For each tok list and pos list in each tuple of training_data
for sent, tags in training_data:

    # add any new word to the word dictionary with the next integer ID
    for word in sent:
        if word not in word_to_ix: 
            word_to_ix[word] = len(word_to_ix)

    # add any new tag to the tag dictionary with the next integer ID
    for t in tags:
        if t not in tag_to_ix:
            tag_to_ix[t] = len(tag_to_ix)
            postaglist.append(t)

# Add unknown word "UNK" to word_to_ix.
# Doing this in case you find an unknown word in testing.
word_to_ix["UNK"] = len(word_to_ix)


In [None]:
# This little function converts a list of words into a list of their integer IDs
# or a list of tags into a list of their integer IDs.
# We account for possibility of OOV words by getting the "UNK" ID.

def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] if w in to_ix else to_ix["UNK"] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)
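
To see what the two cells above do, here is a toy sketch in plain Python. The two sentences are made up for illustration, and `to_ids` is a hypothetical helper that mirrors `prepare_sequence` but returns a plain list instead of a tensor.

```python
# Toy training data, invented for illustration only.
toy_data = [
    ("the dog barks".split(), "DT NN VBZ".split()),
    ("the cat sleeps".split(), "DT NN VBZ".split()),
]

# Give each new word the next free integer ID, as in the notebook.
toy_word_to_ix = {}
for sent, _tags in toy_data:
    for word in sent:
        if word not in toy_word_to_ix:
            toy_word_to_ix[word] = len(toy_word_to_ix)
toy_word_to_ix["UNK"] = len(toy_word_to_ix)

def to_ids(seq, to_ix):
    # Same lookup as prepare_sequence, minus the tensor conversion.
    return [to_ix.get(w, to_ix["UNK"]) for w in seq]

print(toy_word_to_ix)
# {'the': 0, 'dog': 1, 'barks': 2, 'cat': 3, 'sleeps': 4, 'UNK': 5}

# "bird" never appeared in the toy training data, so it maps to the UNK ID.
print(to_ids("the bird sleeps".split(), toy_word_to_ix))
# [0, 5, 4]
```

Note that no training token maps to `UNK`, so with this setup its embedding is never updated during training and stays at its random initial value.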

## Step 2: Setting up the model

For fun, I am setting the embedding dimension and the hidden dimension for the model at 8, and the number of epochs at 3. Those are all stupidly small, but it will be interesting to see how well the model does with such a small network and so few epochs. Later on you will change these dimensions to more ambitious values (like 32 or 64 or 128), and increase the number of epochs, to see if you can improve accuracy.

In [None]:
EMBEDDING_DIM = 8
HIDDEN_DIM = 8
EPOCHS = 3

Now we set up the LSTM itself. When we instantiate it, we'll pass in the two variables above, along with the size of the vocabulary (i.e., the length of `word_to_ix`), and the size of the tagset (i.e., the length of `tag_to_ix`).

First, we want the model to create embeddings (of the dimension specified by the `embedding_dim` parameter) for the set of possible input tokens (size = `vocab_size`).

Then, we have the LSTM layer, which goes from embeddings to a layer of hidden states, with the dimension specified by the `hidden_dim` parameter.

Finally, we have a linear layer that maps from the hidden states to the output, i.e., a probability distribution over the possible tags for the input token, whose size is specified by the `tagset_size` parameter.
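
When you pick values for the dimensions, it can help to know how many trainable parameters they imply. The sketch below counts them for the three layers just described, assuming PyTorch's single-layer, unidirectional `nn.LSTM`, which has four gates, each with input weights, recurrent weights, and two bias vectors. The function name is hypothetical, and the vocabulary and tagset sizes in the usage line are made-up numbers, not the sizes of the real data.

```python
def count_parameters(embedding_dim, hidden_dim, vocab_size, tagset_size):
    # One embedding vector per vocabulary entry.
    embedding = vocab_size * embedding_dim
    # Four gates, each with input weights, recurrent weights, and two biases.
    lstm = 4 * hidden_dim * (embedding_dim + hidden_dim) + 2 * 4 * hidden_dim
    # Linear layer: a weight matrix plus one bias per tag.
    linear = hidden_dim * tagset_size + tagset_size
    return {
        "embedding": embedding,
        "lstm": lstm,
        "linear": linear,
        "total": embedding + lstm + linear,
    }

# Hypothetical sizes: 10,000 word types and 45 tags.
print(count_parameters(8, 8, 10000, 45))
# {'embedding': 80000, 'lstm': 576, 'linear': 405, 'total': 80981}
```

With numbers like these, the embedding table holds most of the parameters, so changing `EMBEDDING_DIM` changes the model size far more than changing `HIDDEN_DIM` does.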



In [None]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

Now instantiate the model, specify the loss function (here we choose negative log likelihood), and specify the optimizer (we choose SGD) with its learning rate.
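
To make the loss concrete: negative log likelihood, as `nn.NLLLoss` computes it by default, is the average over tokens of the negative log probability that the model assigned to the correct tag. Here is a plain-Python sketch with made-up log probabilities. `nll` is a hypothetical helper written for illustration, not the PyTorch implementation.

```python
import math

def nll(log_probs, targets):
    # Pick out the log probability of the correct tag for each token...
    picked = [row[t] for row, t in zip(log_probs, targets)]
    # ...then average and negate.
    return -sum(picked) / len(picked)

# Two tokens, three possible tags; values invented for illustration.
log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
]
targets = [0, 1]  # index of the correct tag for each token

print(round(nll(log_probs, targets), 4))  # 0.2899
```

The loss shrinks toward 0 as the model puts more probability on the correct tags, and grows when it puts probability elsewhere.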

In [None]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

## Step 3: Train the model

First let's look at the output for the first example before training. Remember, the output will be a probability distribution over the set of possible POS tags for each word. Without training, these probabilities will be essentially random. They are log probabilities, so the closer a value is to 0, the more probable the tag.
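
As a quick check on how to read log probabilities, this plain-Python sketch applies log-softmax to some made-up raw scores and converts the result back to ordinary probabilities. The `log_softmax` function here is a hand-written stand-in for `F.log_softmax`, not the PyTorch function itself.

```python
import math

def log_softmax(scores):
    # Subtract the maximum first for numerical stability.
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

scores = [2.0, 0.5, -1.0]  # raw scores for three tags, invented for illustration
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print(log_probs)   # every value is at most 0; the largest is closest to 0
print(probs)       # the same distribution as ordinary probabilities
print(sum(probs))  # sums to 1, up to floating-point rounding
```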

In [None]:
# For fun, let's see what the probabilities are before we even train.
# Element i,j of the output is the score for tag j for word i.
# Scores are log probabilities. The closer they are to 0, the more probable.

# We don't want to do any training yet,
# so the code is wrapped in torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)


Okay, now let's train the model as we have parameterized it above.

In [None]:
# And here the training begins
for epoch in range(EPOCHS):

    print(f"Epoch  {epoch}")

    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices using the little function we wrote above.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

print("TRAINING COMPLETE")

## Step 4: Evaluate the LSTM on the test data

First, read in the test data. It's just like what we did before for reading in the training data.

In [None]:
import numpy as np

# Read in test data, just as we read in the training data.
f = open("/content/drive/MyDrive/ps2/test_pos.txt")
test_posbyline = f.read().split("\n")
print(test_posbyline[2])
f.close()

f = open("/content/drive/MyDrive/ps2/test_tok.txt")
test_tokbyline = f.read().split("\n")
print(test_tokbyline[2])
f.close()

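test_data = []
for t, p in zip(test_tokbyline, test_posbyline):
  # Skip blank lines, such as a trailing empty line at the end of the file,
  # since the model cannot process an empty token sequence.
  if t.strip():
    test_data.append((t.split(), p.split()))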


Now evaluate! For each test token sequence, convert it to integer IDs, then get the predicted tag sequence. Then compare that tag sequence to the known tag sequence, and count how many tags you get right.
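
The counting logic can be sketched in plain Python on made-up scores. The tag list, scores, and gold tags below are invented for illustration, and `argmax` is a small helper written here rather than a library call.

```python
def argmax(values):
    # Index of the largest value in a list.
    return max(range(len(values)), key=lambda i: values[i])

toy_tags = ["DT", "NN", "VBZ"]

# One row of log-probability scores per token, one column per tag.
toy_scores = [
    [-0.1, -2.5, -3.0],
    [-1.9, -0.3, -2.2],
    [-1.0, -0.6, -1.8],
]
gold = ["DT", "NN", "VBZ"]

# Pick the highest-scoring tag for each token, then compare with the gold tags.
predicted = [toy_tags[argmax(row)] for row in toy_scores]
correct = sum(p == g for p, g in zip(predicted, gold))

print(predicted)            # ['DT', 'NN', 'NN']
print(correct / len(gold))  # 2 of 3 tags correct
```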

In [None]:
# No need to train. We're just passing our test data through the
# model we trained above, so wrap it in torch.no_grad() again.
with torch.no_grad():
 
  # some variable to store how many tags and how many correct
  totaltags = 0
  totalcorrect = 0

  # For each input in the test data and its correct tag sequence...
  for toks, tags in test_data:

    # Convert the input tokens to integer IDs.
    inputs = prepare_sequence(toks, word_to_ix)

    # Run that sequence through the model to get the scores (log probabilities)
    # for each tag in the set of possible tags.
    tag_scores = model(inputs)

    # Get the predicted tags as follows:
    # for each output tensor, find the index of the largest score,
    # then look up the tag associated with that index
    pred_tag = []
    for sc in tag_scores:
      pred_tag.append(postaglist[torch.argmax(sc).item()])

    # Count up how many of the predicted tags were correct.
    for i in range(len(pred_tag)):
      totaltags += 1
      if pred_tag[i] == tags[i]:
        totalcorrect += 1

# Print out accuracy
print(f"The accuracy of the model with \n* {EMBEDDING_DIM}-dimensional embeddings \
and \n* {HIDDEN_DIM}-dimensional hidden layer \
\n* trained for {EPOCHS} epochs \n is {totalcorrect/totaltags}")

# Q6: Impact of adjusting the hyperparameters
Adjust and experiment with the `EMBEDDING_DIM`, `HIDDEN_DIM`, and `EPOCHS` hyperparameters above. Specifically, you'll want to increase them gradually and see whether even small changes result in improvement. Dimensions are usually powers of 2, so you'll jump from 8 to 16 to 32, etc. You won't need to go bigger than 32 or 64 for the dimensions, or higher than 10 for the number of epochs.

In the PDF with your answers to other questions, **create a table** that shows the POS tagging accuracy for each `EMBEDDING_DIM`, `HIDDEN_DIM`, and `EPOCHS` combination that you explore, including the default one provided. Try at least 4 different combinations. Some ideas: increase the layer dimensions but keep epochs small, increase epochs but keep layer dimensions small, increase one layer dimension but not the other, increase everything a little, increase everything a lot. **The final two rows of the table should be the results you got in Part 1 for the most frequent POS tag and the Viterbi search.**
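
One way to keep track of your runs is to collect them in a list and print the table rows from it. This is only a bookkeeping sketch: `format_results_table` is a hypothetical helper, the combinations listed are just examples, and the accuracy values are left as `None` placeholders for you to fill in from your own runs.

```python
def format_results_table(rows):
    """Format (embedding_dim, hidden_dim, epochs, accuracy) tuples as a table."""
    lines = [
        "| EMBEDDING_DIM | HIDDEN_DIM | EPOCHS | Accuracy |",
        "|---|---|---|---|",
    ]
    for emb, hid, epochs, acc in rows:
        # Accuracy is None until you have actually run that setting.
        acc_text = "TODO" if acc is None else f"{acc:.4f}"
        lines.append(f"| {emb} | {hid} | {epochs} | {acc_text} |")
    return "\n".join(lines)

runs = [
    (8, 8, 3, None),     # the default setting provided in this notebook
    (32, 8, 3, None),    # larger embeddings only
    (8, 32, 3, None),    # larger hidden layer only
    (32, 32, 10, None),  # everything larger
]
print(format_results_table(runs))
```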

# Q7: Which adjustment resulted in the largest improvement? Speculate about why.
There is no right or wrong answer. Try to reason about this yourself based on what you have learned in class. The important thing is to show me that you have given this some thought.


**Put your table (Q6) and your discussion (Q7) in the PDF you submit with your answers to the other questions from Part 1.**


