# Training and Evaluating a POS Tagger

**Goal**
- assign a POS tag to each word
    - x = first column of the CoNLL file (word), y = second column (POS tag)

**Plan**
- preprocess data
    - numerical classes based on unique POS tags
    - encoded strings (look into ways to do that)
- decide on a model (must be able to explain it!)
- training and evaluation loop
- pick an appropriate metric (F1! precision? recall?)
- optional: create some nice plots, e.g. confusion matrix, learning curve, precision-recall curve
- optional: analyse the dataset (distribution of POS tags, most common words per POS tag, etc.)
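
To make the metric choice concrete, here is a minimal pure-Python sketch of per-tag precision, recall and F1, plus macro-averaged F1 and token accuracy. The function names `per_tag_scores` and `macro_f1` are hypothetical, and the gold/predicted tags are a made-up toy example, not results from this dataset.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f1)} for two flat, equal-length tag lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted, but it is wrong
            fn[g] += 1  # g should have been predicted, but was missed
    scores = {}
    for tag in set(gold) | set(predicted):
        precision = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        recall = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-tag F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# toy example (made-up predictions)
gold = ["DT", "NN", "VBZ", "NN"]
predicted = ["DT", "NN", "VBZ", "JJ"]

scores = per_tag_scores(gold, predicted)
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```

Macro averaging weights rare tags as much as frequent ones, which may or may not be what is wanted for a skewed tag distribution; token accuracy is the simpler alternative.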

**Model**
- decision tree: create features for each word (https://nlpforhackers.io/training-pos-tagger/)
- LSTM/RNN with word ids based on unique words --> study how LSTM/RNN network works!
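
For the decision-tree option, each word has to be turned into a set of features. The sketch below shows the kind of hand-crafted features that could be used; the function name `word_features` and the particular feature set are illustrative assumptions, not taken from the linked article.

```python
def word_features(sentence, index):
    """Build a feature dict for the word at `index` in a list of words."""
    word = sentence[index]
    return {
        "word": word,
        "lower": word.lower(),
        "is_first": index == 0,
        "is_last": index == len(sentence) - 1,
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        # treats "2.2" as numeric by ignoring a single decimal point
        "is_numeric": word.replace(".", "", 1).isdigit(),
        "has_hyphen": "-" in word,
        "prefix_2": word[:2],
        "suffix_2": word[-2:],
        "suffix_3": word[-3:],
        # placeholder strings mark the sentence boundaries
        "prev_word": sentence[index - 1] if index > 0 else "<START>",
        "next_word": sentence[index + 1] if index < len(sentence) - 1 else "<END>",
    }


sentence = ["Chancellor", "of", "the", "Exchequer"]
features = [word_features(sentence, i) for i in range(len(sentence))]
```

Feature dicts like these would still need to be vectorized before training, for example with scikit-learn's `DictVectorizer`.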

If tokenization is needed: use spaCy's `en_core_web_sm`.

In [61]:
import pandas as pd
import spacy

## Preprocessing

In [48]:
with open('train.txt') as f:
    train_data = f.readlines()

In [49]:
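# split the dataset into sentences (a blank line separates two sentences)

lines = [[]]

for string in train_data:
    if string.strip() == "":
        # start a new sentence, but ignore repeated blank lines
        if lines[-1]:
            lines.append([])
    else:
        lines[-1].append(string)

# drop the empty list left behind by a trailing blank line
if not lines[-1]:
    lines.pop()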
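
The same split can also be written with `itertools.groupby`, which groups consecutive blank and non-blank lines and so never produces empty sentences for repeated or trailing blank lines. This is only an alternative sketch; `split_sentences` is a hypothetical helper and `sample` is a shortened stand-in for `train_data`.

```python
from itertools import groupby


def split_sentences(raw_lines):
    """Group consecutive non-blank lines into sentences."""
    sentences = []
    for is_blank, group in groupby(raw_lines, key=lambda line: line.strip() == ""):
        if not is_blank:
            sentences.append(list(group))
    return sentences


# shortened stand-in for train_data, with a doubled and a trailing blank line
sample = [
    "Chancellor NNP O\n",
    "of IN B-PP\n",
    "\n",
    "\n",
    "the DT B-NP\n",
    "\n",
]
sentences = split_sentences(sample)
```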

In [54]:
lines[1]

['Chancellor NNP O\n',
 'of IN B-PP\n',
 'the DT B-NP\n',
 'Exchequer NNP I-NP\n',
 'Nigel NNP B-NP\n',
 'Lawson NNP I-NP\n',
 "'s POS B-NP\n",
 'restated VBN I-NP\n',
 'commitment NN I-NP\n',
 'to TO B-PP\n',
 'a DT B-NP\n',
 'firm NN I-NP\n',
 'monetary JJ I-NP\n',
 'policy NN I-NP\n',
 'has VBZ B-VP\n',
 'helped VBN I-VP\n',
 'to TO I-VP\n',
 'prevent VB I-VP\n',
 'a DT B-NP\n',
 'freefall NN I-NP\n',
 'in IN B-PP\n',
 'sterling NN B-NP\n',
 'over IN B-PP\n',
 'the DT B-NP\n',
 'past JJ I-NP\n',
 'week NN I-NP\n',
 '. . O\n']

In [26]:
# for each sentence, extract each word and corresponding POS tag

text = list()
target = list()

for line in lines:
    words = list()
    pos_tags = list()
    for string in line:
        # each line is "word POS chunk"; the chunk tag is not needed here
        word, pos, _ = string.split()
        words.append(word)
        pos_tags.append(pos)
    text.append(words)
    target.append(pos_tags)

In [34]:
df = pd.DataFrame(data={"text": text, "target": target})

In [36]:
df.head(10)

,text,target
0,"[Confidence, in, the, pound, is, widely, expec...","[NN, IN, DT, NN, VBZ, RB, VBN, TO, VB, DT, JJ,..."
1,"[Chancellor, of, the, Exchequer, Nigel, Lawson...","[NNP, IN, DT, NNP, NNP, NNP, POS, VBN, NN, TO,..."
2,"[But, analysts, reckon, underlying, support, f...","[CC, NNS, VBP, VBG, NN, IN, NN, VBZ, VBN, VBN,..."
3,"[This, has, increased, the, risk, of, the, gov...","[DT, VBZ, VBN, DT, NN, IN, DT, NN, VBG, VBN, T..."
4,"[``, The, risks, for, sterling, of, a, bad, tr...","[``, DT, NNS, IN, NN, IN, DT, JJ, NN, NN, VBP,..."
5,"[``, If, there, is, another, bad, trade, numbe...","[``, IN, EX, VBZ, DT, JJ, NN, NN, ,, EX, MD, V..."
6,"[Forecasts, for, the, trade, figures, range, w...","[NNS, IN, DT, NN, NNS, VBP, RB, ,, CC, JJ, NNS..."
7,"[The, August, deficit, and, the, #, 2.2, billi...","[DT, NNP, NN, CC, DT, #, CD, CD, NN, VBN, IN, ..."
8,"[Sanjay, Joshi, ,, European, economist, at, Ba...","[NNP, NNP, ,, JJ, NN, IN, NNP, NNPS, CC, NNP, ..."
9,"[At, the, same, time, ,, he, remains, fairly, ...","[IN, DT, JJ, NN, ,, PRP, VBZ, RB, JJ, IN, DT, ..."


In [57]:
# check that text and target have the same length for each sample;
# prints the index of any mismatch (no output means all samples match)

for idx, (words, tags) in enumerate(zip(df["text"], df["target"])):
    if len(words) != len(tags):
        print(idx)

In [59]:
# create POS encodings: collect the unique POS tags across all sentences
unique_pos_tags = set()
for tags in df["target"]:
    unique_pos_tags.update(tags)

print(unique_pos_tags)

{'DT', 'RBS', 'TO', 'VBG', 'MD', 'POS', ',', 'PRP$', '#', 'SYM', 'PRP', 'UH', 'VBP', 'EX', '(', 'VBN', 'NNP', 'CC', 'WP', 'NNS', 'JJR', 'WRB', '.', 'RP', 'FW', 'WP$', 'CD', 'IN', 'NN', 'JJS', ':', '$', 'RBR', "''", 'RB', 'NNPS', 'WDT', ')', '``', 'VBD', 'VB', 'PDT', 'JJ', 'VBZ'}
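
For the optional dataset analysis, the tag distribution can be counted in the same pass. A minimal sketch with `collections.Counter`; the `target` list here is a toy stand-in for `df["target"]`, so the counts are not the real dataset statistics.

```python
from collections import Counter

# toy stand-in for df["target"]: one list of POS tags per sentence
target = [
    ["NNP", "IN", "DT", "NNP"],
    ["DT", "NN", "VBZ"],
]

# count every tag occurrence across all sentences
tag_counts = Counter(tag for tags in target for tag in tags)
total = sum(tag_counts.values())

# print tags from most to least frequent with their share of all tokens
for tag, count in tag_counts.most_common():
    print(f"{tag}\t{count}\t{count / total:.1%}")
```

A strongly skewed distribution would be one argument for reporting per-tag or macro-averaged scores rather than accuracy alone.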


In [60]:
# map each POS tag to an integer class id
pos2value = dict()
for idx, tag in enumerate(unique_pos_tags):
    pos2value[tag] = idx

print(pos2value)

{'DT': 0, 'RBS': 1, 'TO': 2, 'VBG': 3, 'MD': 4, 'POS': 5, ',': 6, 'PRP$': 7, '#': 8, 'SYM': 9, 'PRP': 10, 'UH': 11, 'VBP': 12, 'EX': 13, '(': 14, 'VBN': 15, 'NNP': 16, 'CC': 17, 'WP': 18, 'NNS': 19, 'JJR': 20, 'WRB': 21, '.': 22, 'RP': 23, 'FW': 24, 'WP$': 25, 'CD': 26, 'IN': 27, 'NN': 28, 'JJS': 29, ':': 30, '$': 31, 'RBR': 32, "''": 33, 'RB': 34, 'NNPS': 35, 'WDT': 36, ')': 37, '``': 38, 'VBD': 39, 'VB': 40, 'PDT': 41, 'JJ': 42, 'VBZ': 43}
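
For training, both directions of the mapping are useful (tag to id for the targets, id to tag for decoding predictions), and the words need ids as well for the LSTM/RNN option. The sketch below is one possible way to do this; the names `tag2id`, `id2tag`, `build_vocab`, `encode_words` and `encode_tags`, the `<PAD>`/`<UNK>` entries and the lowercasing are all assumptions. Sorting the tags keeps the ids reproducible, since the iteration order of a set of strings can differ between Python sessions.

```python
def build_vocab(sentences, pad="<PAD>", unk="<UNK>"):
    """Assign an integer id to every lowercased word; ids 0 and 1 are reserved."""
    vocab = {pad: 0, unk: 1}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word.lower(), len(vocab))
    return vocab


def encode_words(sentence, vocab, unk="<UNK>"):
    """Map words to ids, falling back to the unknown-word id."""
    return [vocab.get(word.lower(), vocab[unk]) for word in sentence]


def encode_tags(tags, tag2id):
    """Map POS tags to their integer class ids."""
    return [tag2id[tag] for tag in tags]


# toy data standing in for df["text"] and df["target"]
text = [["The", "pound", "falls"]]
target = [["DT", "NN", "VBZ"]]

unique_tags = {tag for tags in target for tag in tags}
tag2id = {tag: idx for idx, tag in enumerate(sorted(unique_tags))}
id2tag = {idx: tag for tag, idx in tag2id.items()}

vocab = build_vocab(text)
x = encode_words(["The", "dollar", "falls"], vocab)  # "dollar" is unseen
y = encode_tags(target[0], tag2id)
decoded = [id2tag[idx] for idx in y]
```

Whether to lowercase is a judgment call: capitalization is a strong cue for tags like NNP, so a model that only sees lowercased ids loses that signal unless it is added back as a separate feature.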
