**POS-Tagging**

Welcome to the fourth lab! In this exercise you will build a simple POS tagger.
The exercise is based on the PyTorch sequence models tutorial: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f64a715a0f0>

In [3]:
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5

# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument  to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

tensor([[[-0.0187,  0.1713, -0.2944]],

        [[-0.3521,  0.1026, -0.2971]],

        [[-0.3191,  0.0781, -0.1957]],

        [[-0.1634,  0.0941, -0.1637]],

        [[-0.3368,  0.0959, -0.0538]]], grad_fn=<StackBackward0>)
(tensor([[[-0.3368,  0.0959, -0.0538]]], grad_fn=<StackBackward0>), tensor([[[-0.9825,  0.4715, -0.0633]]], grad_fn=<StackBackward0>))
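
To make the relationship between `out` and `hidden` concrete, here is a minimal sketch that runs a small LSTM twice, once on the whole sequence and once step by step. It checks that the last slice of `out` matches the final hidden state and that both ways of running the LSTM agree. The seed and the names `seq`, `h0`, `c0` are illustrative assumptions, not part of the lab.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(3, 3)
seq = torch.randn(5, 1, 3)   # (seq_len, batch, input_size)
h0 = torch.randn(1, 1, 3)    # initial hidden state
c0 = torch.randn(1, 1, 3)    # initial cell state

# Run the whole sequence in one call.
out, (h_n, c_n) = lstm(seq, (h0, c0))

# Run the same sequence one element at a time, carrying the state forward.
hidden = (h0, c0)
step_outs = []
for x in seq:
    o, hidden = lstm(x.view(1, 1, -1), hidden)
    step_outs.append(o)
step_out = torch.cat(step_outs)

# The last slice of "out" is the final hidden state h_n.
same_last = torch.allclose(out[-1], h_n[0])
# Both ways of running the LSTM give the same outputs (up to float error).
same_steps = torch.allclose(out, step_out, atol=1e-5)
print(same_last, same_steps)
```

Note that `hidden` is a tuple of the hidden state and the cell state; only the first element appears in `out`.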


**Task:**

Load the `training_data` from `corpus-small.train`, and modify the `tag_to_ix` dictionary to have all different tags.

In [4]:
!head -n 5 corpus-small.train

In/IN an/DT Oct./NNP 19/CD review/NN of/IN ``/`` The/DT Misanthrope/NN ''/'' at/IN Chicago/NNP 's/POS Goodman/NNP Theatre/NNP (/-LRB- ``/`` Revitalized/VBN Classics/NNS Take/VBP the/DT Stage/NN in/IN Windy/NNP City/NNP ,/, ''/'' Leisure/NN &/CC Arts/NNS )/-RRB- ,/, the/DT role/NN of/IN Celimene/NNP ,/, played/VBN by/IN Kim/NNP Cattrall/NNP ,/, was/VBD mistakenly/RB attributed/VBN to/TO Christina/NNP Haag/NNP ./.
Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./.
Rolls-Royce/NNP Motor/NNP Cars/NNPS Inc./NNP said/VBD it/PRP expects/VBZ its/PRP$ U.S./NNP sales/NNS to/TO remain/VB steady/JJ at/IN about/IN 1,200/CD cars/NNS in/IN 1990/CD ./.
The/DT luxury/NN auto/NN maker/NN last/JJ year/NN sold/VBD 1,214/CD cars/NNS in/IN the/DT U.S./NNP
Howard/NNP Mosher/NNP ,/, president/NN and/CC chief/JJ executive/NN officer/NN ,/, said/VBD he/PRP anticipates/VBZ growth/NN for/IN the/DT luxury/NN auto/NN maker/NN in/IN Britain/NNP and/CC Europe/NNP ,/, and/CC in/IN Far/JJ Eastern/JJ markets/NNS ./.
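
Each line of the corpus is a sentence of `word/TAG` tokens separated by spaces. As a rough sketch of the parsing step, the helper below (the name `parse_tagged_line` is an assumption, not part of the lab) splits on the last `/` only, so a word that itself contains a slash is kept intact. The `1/2/CD` token is a made-up example to show that case.

```python
def parse_tagged_line(line):
    """Split a 'word/TAG word/TAG ...' line into parallel word and tag lists."""
    words, tags = [], []
    for token in line.split():
        # Split on the last "/" only, so "1/2/CD" becomes ("1/2", "CD").
        word, tag = token.rsplit("/", maxsplit=1)
        words.append(word)
        tags.append(tag)
    return words, tags


words, tags = parse_tagged_line("Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./.")
print(words)
print(tags)

# Hypothetical token whose word part contains a slash.
slash_words, slash_tags = parse_tagged_line("1/2/CD")
print(slash_words, slash_tags)
```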


In [6]:
def prepare_sequence(seq, to_ix):
    # Map each item to its index; unseen items fall back to the index of 'the'.
    idxs = [to_ix[w] if w in to_ix else to_ix['the'] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

def load_data(path):
    def process(line):
        ### YOUR CODE HERE ~7 lines
        tagged_words = line.split()
        words, tags = [], []
        for word_tag in tagged_words:
            w, t = word_tag.rsplit("/", maxsplit=1)
            words.append(w)
            tags.append(t)
        return (words, tags)
    # A context manager closes the file even if reading fails.
    with open(path, 'r') as file:
        lines = file.read().split("\n")
    return [process(line) for line in lines if line]


training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
training_data = load_data("corpus-small.train") ### YOUR CODE HERE 1 line

word_to_ix = {}
tag_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

    # Write your code here to fill tag_to_ix
    ### YOUR CODE HERE ~3 lines
    for tag in tags:
        if tag not in tag_to_ix:
            tag_to_ix[tag] = len(tag_to_ix)
print(word_to_ix)
print(tag_to_ix)
# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

{'In': 0, 'an': 1, 'Oct.': 2, '19': 3, 'review': 4, 'of': 5, '``': 6, 'The': 7, 'Misanthrope': 8, "''": 9, 'at': 10, 'Chicago': 11, "'s": 12, 'Goodman': 13, 'Theatre': 14, '(': 15, 'Revitalized': 16, 'Classics': 17, 'Take': 18, 'the': 19, 'Stage': 20, 'in': 21, 'Windy': 22, 'City': 23, ',': 24, 'Leisure': 25, '&': 26, 'Arts': 27, ')': 28, 'role': 29, 'Celimene': 30, 'played': 31, 'by': 32, 'Kim': 33, 'Cattrall': 34, 'was': 35, 'mistakenly': 36, 'attributed': 37, 'to': 38, 'Christina': 39, 'Haag': 40, '.': 41, 'Ms.': 42, 'plays': 43, 'Elianti': 44, 'Rolls-Royce': 45, 'Motor': 46, 'Cars': 47, 'Inc.': 48, 'said': 49, 'it': 50, 'expects': 51, 'its': 52, 'U.S.': 53, 'sales': 54, 'remain': 55, 'steady': 56, 'about': 57, '1,200': 58, 'cars': 59, '1990': 60, 'luxury': 61, 'auto': 62, 'maker': 63, 'last': 64, 'year': 65, 'sold': 66, '1,214': 67, 'Howard': 68, 'Mosher': 69, 'president': 70, 'and': 71, 'chief': 72, 'executive': 73, 'officer': 74, 'he': 75, 'anticipates': 76, 'growth': 77, 'for': 

**Task:**

Add dropout inside the LSTM module.
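
One point worth knowing for this task: the `dropout` argument of `nn.LSTM` is applied between stacked layers, so it only has an effect when `num_layers` is greater than 1. The sketch below shows two possible ways to get active dropout, assuming the same small dimensions used in this lab. Neither is presented as the intended solution; the names `stacked`, `single` and `drop` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Option A: two stacked layers, so the built-in dropout between layers is active.
stacked = nn.LSTM(input_size=6, hidden_size=6, num_layers=2, dropout=0.2)

# Option B: keep a single layer and apply an explicit dropout layer to its outputs.
single = nn.LSTM(input_size=6, hidden_size=6)
drop = nn.Dropout(p=0.2)

x = torch.randn(5, 1, 6)  # (seq_len, batch, embedding_dim)

out_a, _ = stacked(x)
out_b, _ = single(x)
out_b_dropped = drop(out_b)  # some entries zeroed, the rest rescaled (training mode)

# In eval mode nn.Dropout is the identity, which is why model.eval() matters at test time.
drop.eval()
out_b_eval = drop(out_b)

print(out_a.shape, out_b_dropped.shape)
print(torch.equal(out_b_eval, out_b))
```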

In [8]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
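        # Note: nn.LSTM applies dropout only between stacked layers, so with the
        # default num_layers=1 this setting has no effect and PyTorch emits a warning.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, dropout=0.2) ### CHANGE THIS LINE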

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
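
The model returns log-probabilities (`F.log_softmax`), which is why it is paired with `nn.NLLLoss` during training. As a small check of that pairing, the sketch below uses made-up scores for 4 words and 3 tags and compares the result with `F.cross_entropy` applied to the raw scores. The tensor values and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

tag_space = torch.randn(4, 3)          # raw scores: 4 words, 3 tags
targets = torch.tensor([0, 2, 1, 2])   # gold tag index for each word

# log_softmax turns each row into log-probabilities.
tag_scores = F.log_softmax(tag_space, dim=1)
row_sums = tag_scores.exp().sum(dim=1)  # each row of probabilities sums to 1

# NLLLoss on log-probabilities equals cross-entropy on the raw scores.
nll = nn.NLLLoss()(tag_scores, targets)
ce = F.cross_entropy(tag_space, targets)
print(nll.item(), ce.item())
```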

In [9]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)

### Reduce the number of epochs if the training data is large
for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# See what the scores are after training
import numpy as np
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    # The input is the first training sentence. Element i,j is the score of
    # tag j for word i, and the predicted tag for each word is the index of
    # the maximum value in that word's row.
    print(tag_scores)


tensor([[-3.7736, -3.4513, -3.7759,  ..., -3.6650, -3.9193, -3.7274],
        [-3.7482, -3.4301, -3.7756,  ..., -3.6177, -3.9206, -3.6928],
        [-3.8730, -3.4990, -3.7137,  ..., -3.6426, -4.0071, -3.8072],
        ...,
        [-3.8692, -3.4937, -3.7748,  ..., -3.5918, -3.9295, -3.6922],
        [-3.8353, -3.5286, -3.9019,  ..., -3.5698, -3.9514, -3.8075],
        [-3.6774, -3.3781, -3.8495,  ..., -3.6257, -3.9517, -3.7885]])
tensor([[-9.4084e-01, -1.5780e+01, -1.0764e+01,  ..., -9.9006e+00,
         -7.1127e+00, -1.0230e+01],
        [-1.3622e+01, -1.7584e-02, -1.1248e+01,  ..., -1.2287e+01,
         -8.9040e+00, -1.0716e+01],
        [-1.2903e+01, -1.1716e+01, -9.1159e-02,  ..., -1.0576e+01,
         -1.0262e+01, -1.0857e+01],
        ...,
        [-1.0035e+01, -1.1695e+01, -1.6151e+00,  ..., -9.3889e+00,
         -7.5078e+00, -9.7556e+00],
        [-1.7440e+01, -5.9350e+00, -1.5731e-01,  ..., -1.3054e+01,
         -1.0158e+01, -1.2978e+01],
        [-4.9526e+00, -7.2188e+00, -2.
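
Score matrices like the ones above are turned into tags by taking the highest-scoring column in each row. A minimal sketch, using a tiny made-up score matrix and a hypothetical three-tag `tag_to_ix` rather than the real model output:

```python
import torch

# Hypothetical tag set, for illustration only.
tag_to_ix = {"DT": 0, "NN": 1, "VBD": 2}
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

# Made-up log-probabilities: one row per word, one column per tag.
tag_scores = torch.log(torch.tensor([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]))

# The predicted tag for each word is the column with the maximum score.
best = torch.argmax(tag_scores, dim=1)
predicted = [ix_to_tag[i] for i in best.tolist()]
print(predicted)  # ['DT', 'NN', 'VBD']
```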

**Task:**

Read the test data `corpus-small.test`, process it, and get the predictions. Write the tagged predictions to the file `corpus-small.out` in the same format as `corpus-small.answer`.

Note: if a word in the test set was not seen during training, replace it with a random seen word. (There is a better solution.)

At the end, run the last cell, to get the accuracy of your model.
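
One common alternative for unseen words is a reserved unknown-word token. The sketch below is an assumption about what a better solution could look like, not the lab's reference answer: words that occur fewer than `min_count` times in training are mapped to `<UNK>`, so the model learns an embedding for that token and can reuse it for unseen test words. The names `UNK`, `build_vocab`, `to_indices` and the toy sentences are hypothetical.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical reserved token for rare and unseen words


def build_vocab(sentences, min_count=2):
    """Index every word seen at least min_count times; index 0 is reserved for UNK."""
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {UNK: 0}
    for sent in sentences:
        for w in sent:
            if counts[w] >= min_count and w not in vocab:
                vocab[w] = len(vocab)
    return vocab


def to_indices(sentence, vocab):
    """Map words to indices, sending anything outside the vocabulary to UNK."""
    return [vocab.get(w, vocab[UNK]) for w in sentence]


# Toy training sentences, for illustration only.
train = [["the", "dog", "ate", "the", "apple"], ["the", "cat", "ate"]]
vocab = build_vocab(train)
print(vocab)                                      # {'<UNK>': 0, 'the': 1, 'ate': 2}
print(to_indices(["the", "bird", "ate"], vocab))  # [1, 0, 2]
```

Because rare training words are also mapped to `<UNK>`, that index receives gradient updates during training instead of staying at its random initialization.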

In [13]:
def load_test_data(path):
    with open(path, 'r') as f:
        # Skip blank lines: readlines() keeps the trailing newline, so test the stripped line.
        return [line.split() for line in f.readlines() if line.strip()]  ### FINISH THIS LINE

test_data = load_test_data('corpus-small.test')

reverse_tag = { v : k for k , v in tag_to_ix.items()}

answers = []
model.zero_grad()
model.eval()
with torch.no_grad():
    for sentence in test_data:
        # Write your code here to make the predictions.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        tag_scores =  model(sentence_in)
        labels = torch.argmax(tag_scores,dim=1).detach().cpu().numpy()
        tagss = [reverse_tag[ids] for ids in labels]
        results = " ".join([w + "/" + t for w, t in zip(sentence, tagss)])
        answers.append(results)

print(answers)
# A context manager closes the file even if writing fails.
with open('corpus-small.out', 'w') as file:
    file.write("\n".join(answers))

['For/IN six/DT years/NNS ,/, T./DT Marshall/DT Hahn/DT Jr./DT has/NN made/DT corporate/JJ acquisitions/DT in/JJ the/DT George/DT Bush/JJ mode/DT :/VBP kind/DT and/CC gentle/DT ./.', "The/DT question/DT now/VBD :/'' Can/DT he/PRP act/DT more/JJR like/IN hard-charging/DT Teddy/DT Roosevelt/DT ?/.", "Mr./NNP Hahn/DT ,/, the/DT 62-year-old/DT chairman/DT and/CC chief/JJ executive/NN officer/NN of/IN Georgia-Pacific/DT Corp./DT is/VBZ leading/DT the/DT forest-product/DT concern/NN 's/POS unsolicited/DT $/NN 3.19/DT billion/JJ bid/DT for/IN Great/DT Northern/NNP Nekoosa/DT Corp/DT ./.", "Nekoosa/DT has/VBZ given/DT the/DT offer/IN a/DT public/DT cold/DT shoulder/DT ,/, a/DT reaction/DT Mr./NNP Hahn/DT has/VBZ n't/RB faced/DT in/JJ his/PRP$ 18/DT earlier/DT acquisitions/DT ,/, all/VBN of/IN which/IN were/VBD negotiated/DT behind/DT the/DT scenes/DT ./.", 'So/DT far/DT ,/, Mr./NNP Hahn/DT is/VBZ trying/DT to/NNP entice/DT Nekoosa/DT into/IN negotiating/DT a/DT friendly/DT surrender/DT while/J

In [14]:
%run tagger_eval.py corpus-small.out corpus-small.answer

Accuracy= 0.4881422924901186
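
The evaluation script `tagger_eval.py` is not shown here, so the following is only a guess at how a token-level accuracy of this kind might be computed: compare the tag of each predicted token with the tag of the gold token in the same position. The function name `token_accuracy` and the example lines are hypothetical.

```python
def token_accuracy(predicted_lines, gold_lines):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for pred_line, gold_line in zip(predicted_lines, gold_lines):
        for pred_tok, gold_tok in zip(pred_line.split(), gold_line.split()):
            # The tag is whatever follows the last "/" in a word/TAG token.
            pred_tag = pred_tok.rsplit("/", 1)[1]
            gold_tag = gold_tok.rsplit("/", 1)[1]
            correct += pred_tag == gold_tag
            total += 1
    return correct / total if total else 0.0


# Made-up example: one of the four tags is wrong.
pred = ["The/DT dog/DT barked/VBD ./."]
gold = ["The/DT dog/NN barked/VBD ./."]
print(token_accuracy(pred, gold))  # 0.75
```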
