# A POS tagger based on deep neural network, Pytorch Version

## Creating Vocabulary from the training data
First we should read from data file. The data looks like the following
```txt
Two     NUM
of      ADP
them    PRON
were    AUX
being   AUX
run     VERB
by      ADP
2       NUM
officials       NOUN
of      ADP
the     DET
Ministry        PROPN
of      ADP
the     DET
Interior        PROPN
!       PUNCT

The     DET
MoI     PROPN
in      ADP
Iraq    PROPN
is      AUX
equivalent      ADJ
to      ADP
the     DET
US      PROPN
FBI     PROPN
,       PUNCT
so      ADV
this    PRON
would   AUX
be      VERB
like    SCONJ
having  VERB
J.      PROPN
Edgar   PROPN
Hoover  PROPN
unwittingly     ADV
employ  VERB
at      ADP
a       DET
high    ADJ
level   NOUN
members NOUN
of      ADP
the     DET
Weathermen      PROPN
bombers NOUN
back    ADV
in      ADP
the     DET
1960s   NOUN
.       PUNCT


```

In [9]:
data_path = 'data/en.pos.train'
sentences = open(data_path, 'r').read().strip().split('\n\n')

Then we should count the frequency of words and pos tags

In [11]:
from collections import defaultdict

word_count, tags = defaultdict(int), set()
for sentence in sentences:
    lines = sentence.strip().split('\n')
    for line in lines:
        word, tag = line.strip().split('\t')
        word_count[word] += 1
        tags.add(tag)
tags = list(tags)

Now we assume that words with frequency less than one should be disregarded.

In [12]:
words = [word for word in word_count.keys() if word_count[word]>1]

We should also take into account special symbols start of a sentence and end of a sentence

In [13]:
words = ['<UNK>', '<s>', '</s>'] + words
feat_tags = ['<s>'] + tags
output_tags = tags

We should also create string to integer mapping (because neural network libraries work with integers)

In [14]:
word_dict = {word: i for i, word in enumerate(words)}
feat_tags_dict = {tag: i for i, tag in enumerate(feat_tags)}
output_tag_dict = {tag: i for i, tag in enumerate(output_tags)}

We define some auxiliary functions to access the words, tag feature and tag output

In [15]:
def tagid2tag_str(id):
    return output_tags[id]

def tag2id(tag):
    return output_tag_dict[tag]

def feat_tag2id(tag):
    return feat_tags_dict[tag]

def word2id(word):
    return word_dict[word] if word in word_dict else word_dict['<UNK>']

def num_words():
    return len(words)

def num_tag_feats():
    return len(feat_tags)

def num_tags():
    return len(output_tags)

# Converting training data to a csv-style format


In [16]:
sens = open(data_path, 'r').read().strip().split('\n\n')
writer = open(data_path+'.data', 'w')

for sen in sens:
    lines = sen.strip().split('\n')
    ws, ts = ['<s>', '<s>'], ['<s>', '<s>']
    for line in lines:
        word, tag = line.strip().split()
        ws.append(word)
        ts.append(tag)
    ws += ['</s>', '</s>']

    for i in range(len(lines)):
        feats = [ws[i], ws[i + 1], ws[i + 2], ws[i + 3], ws[i + 4], ts[i], ts[i + 1]]
        label = ts[i + 2]
        writer.write('\t'.join(feats) + '\t' + label + '\n')
writer.close()


The output data should look like the following (__first 5 are word features, the other two are pos features, the last is the pos label__)
```txt
<s>     <s>     Al      -       Zaman   <s>     <s>     PROPN
<s>     Al      -       Zaman   :       <s>     PROPN   PUNCT
Al      -       Zaman   :       American        PROPN   PUNCT   PROPN
-       Zaman   :       American        forces  PUNCT   PROPN   PUNCT
Zaman   :       American        forces  killed  PROPN   PUNCT   ADJ
:       American        forces  killed  Shaikh  PUNCT   ADJ     NOUN
American        forces  killed  Shaikh  Abdullah        ADJ     NOUN    VERB
forces  killed  Shaikh  Abdullah        al      NOUN    VERB    PROPN
killed  Shaikh  Abdullah        al      -       VERB    PROPN   PROPN
Shaikh  Abdullah        al      -       Ani     PROPN   PROPN   PROPN
Abdullah        al      -       Ani     ,       PROPN   PROPN   PUNCT
al      -       Ani     ,       the     PROPN   PUNCT   PROPN
-       Ani     ,       the     preacher        PUNCT   PROPN   PUNCT
Ani     ,       the     preacher        at      PROPN   PUNCT   DET
,       the     preacher        at      the     PUNCT   DET     NOUN
the     preacher        at      the     mosque  DET     NOUN    ADP
preacher        at      the     mosque  in      NOUN    ADP     DET
at      the     mosque  in      the     ADP     DET     NOUN
the     mosque  in      the     town    DET     NOUN    ADP
mosque  in      the     town    of      NOUN    ADP     DET
in      the     town    of      Qaim    ADP     DET     NOUN
the     town    of      Qaim    ,       DET     NOUN    ADP
town    of      Qaim    ,       near    NOUN    ADP     PROPN
of      Qaim    ,       near    the     ADP     PROPN   PUNCT
Qaim    ,       near    the     Syrian  PROPN   PUNCT   ADP
,       near    the     Syrian  border  PUNCT   ADP     DET
near    the     Syrian  border  .       ADP     DET     ADJ
the     Syrian  border  .       </s>    DET     ADJ     NOUN
Syrian  border  .       </s>    </s>    ADJ     NOUN    PUNCT
<s>     <s>     [       This    killing <s>     <s>     PUNCT
<s>     [       This    killing of      <s>     PUNCT   DET
[       This    killing of      a       PUNCT   DET     NOUN
This    killing of      a       respected       DET     NOUN    ADP
killing of      a       respected       cleric  NOUN    ADP     DET
of      a       respected       cleric  will    ADP     DET     ADJ
a       respected       cleric  will    be      DET     ADJ     NOUN
respected       cleric  will    be      causing ADJ     NOUN    AUX
cleric  will    be      causing us      NOUN    AUX     AUX
will    be      causing us      trouble AUX     AUX     VERB
be      causing us      trouble for     AUX     VERB    PRON
```

# Defining the network in Pytorch
Here we introduce the major change from the forked library, which comes from writing the network in Pytorch. 

In [17]:
import torch
import torch.nn as nn
%matplotlib nbagg
import random
import matplotlib.pyplot as plt
import numpy as np

In [18]:
cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if cuda else "cpu")
seed = 1008
torch.manual_seed(seed)
if cuda:
    torch.cuda.manual_seed_all(seed)

Our model has the following architecture (the same of the Dynet example):
    1. Embedding layer for words and parts of speech (of respective dimensions word_embed_dim and pos_embed_dim);
    2. Hidden layer (linear+ReLu) having as input the concatenation of 5 words and 2 parts of speech corresponding to the first two words and output of dimension hidden_dim;
    3. A linear output layer trying to predict the part of speech corresponding to the third word.

In [20]:
word_embed_dim, pos_embed_dim = 2,2
input_dim=5*word_embed_dim+2*pos_embed_dim
hidden_dim,output_dim=200,len(feat_tags)


In [21]:
class POS_tagging(nn.Module):
    def __init__(self):
        super(POS_tagging, self).__init__()
        self.word_embeddings=nn.Embedding(len(words),word_embed_dim)
        self.tag_embeddings=nn.Embedding(len(feat_tags),pos_embed_dim)
        self.network=torch.nn.Sequential(torch.nn.Linear(input_dim, hidden_dim),nn.ReLU(),nn.Linear(hidden_dim,output_dim))
       
    def forward(self, features):
        word_ids = torch.tensor([word2id(word_feat) for word_feat in features[0:5]], dtype=torch.long)
        tag_ids = torch.tensor([feat_tag2id(tag_feat) for tag_feat in features[5:]],dtype=torch.long)
        word_embeds = self.word_embeddings(word_ids).view((1, -1))
        tag_embeds = self.tag_embeddings(tag_ids).view((1,-1))
        embedding_layer=torch.cat((word_embeds,tag_embeds),1)
        out=self.network(embedding_layer)
        output=nn.functional.log_softmax(out, dim=1)
        
        return output


Uploading training data and writing the training function with Negative Log-Likelihood loss and batching

In [23]:
train_data = open(data_path+'.data', 'r').read().strip().split('\n') 
minibatch_size=1000

In [28]:
def train(model,epochs,train_data):
    model.train()
    total_loss=torch.tensor([0.0])
    random.shuffle(train_data)
    loss_function=nn.NLLLoss()
    optimizer = torch.optim.SGD(model.parameters(),lr=0.1)
    
    for epochs in range(epochs):
         print('epoch:',epochs+1)
        
         for j,line in enumerate(train_data):
            fields = line.strip().split('\t')
            features, label, gold_label = fields[:-1], fields[-1], tag2id(fields[-1])
            result = model(features)
            loss = loss_function(result, torch.tensor([gold_label], dtype=torch.long))
            total_loss+=loss
            if j % minibatch_size == 0:
                minibatch_loss = total_loss / minibatch_size
                optimizer.zero_grad()
                minibatch_loss.backward()
                optimizer.step()
                total_loss=torch.tensor([0.0])
                


        
            
    return result.detach()

        

In [25]:
model= POS_tagging().to(device)

Training the model on a fixed number of epochs

In [27]:
train(model,5,train_data)
print('finished training!') 

epoch: 0
epoch: 1
epoch: 2
epoch: 3
epoch: 4
finished training!


Using the trained model to classify our test data

In [29]:
def decode(model,ws):
   # first putting two start symbols
    ws = ['<s>', '<s>'] + ws + ['</s>', '</s>']
    ts = ['<s>', '<s>']
    with torch.no_grad():
        for i in range(2, len(ws) - 2):
            features = ws[i - 2:i + 3] + ts[i - 2:i]

       # running forward
            output = model(features)

       # getting best tag
            best_tag_id = np.argmax(output)

       # assigning the best tag
            ts.append(tagid2tag_str(best_tag_id.item()))

    return ts[2:]

We now upload and classify test data, evaluating the accuracy of the model depending on the embedding dimension

In [31]:
test_file = 'data/en.pos.dev.raw'
writer = open(test_file+'.output.pyex.dim2', 'w')
for sentence in open(test_file, 'r'):
    words = sentence.strip().split()
    tags = decode(model, words)
    output = [word + '\t' + tag for word, tag in zip(words, tags)]
    writer.write('\n'.join(output) + '\n\n')
writer.close()

In [32]:
def evaluate_test(w_test_file,data_file):
    true=0
    compare=open(data_file,'r')
    l=[]
    k=[]
    for sentence1 in compare:
        words1=sentence1.strip().split()
        if len(words1)==2:
            l.append(words1[1])
    for sentence2 in open(w_test_file,'r'):
        words2=sentence2.strip().split()
        if len(words2)==2:
            k.append(words2[1])
    for i in range(len(l)):
        if l[i]==k[i]:
            true+=1
    accuracy=true/len(l)
    return accuracy


In [33]:
print(evaluate_test('data/en.pos.dev.raw.output.pyex.dim2','data/en.pos.dev'))

0.40361147303323364
