# Assignment 2

This assignment is about training and evaluating a POS tagger on real data. The dataset is available through the [Universal Dependencies](https://universaldependencies.org/) (UD) project. To get to know the project, please visit https://universaldependencies.org/introduction.html.

In [None]:
!pip install conllutils
!pip install conllu


In [None]:
import numpy as np
import operator
import nltk

import conllutils
from io import open
import conllu
from conllu import parse_incr

from collections import defaultdict
from conllutils import pipe





**Part 1** (getting the data)

You can download the dataset files directly from the UD website, but it only lets you download all the languages in one compressed file. In this assignment you will be working with the GUM dataset, which you can download directly from:
https://github.com/UniversalDependencies/UD_English-GUM.
Please download it to your Colab machine.



In [None]:
!git clone https://github.com/UniversalDependencies/UD_English-GUM

Cloning into 'UD_English-GUM'...
remote: Enumerating objects: 2259, done.
remote: Counting objects: 100% (1270/1270), done.
remote: Compressing objects: 100% (499/499), done.
remote: Total 2259 (delta 1114), reused 920 (delta 771), pack-reused 989
Receiving objects: 100% (2259/2259), 14.99 MiB | 9.96 MiB/s, done.
Resolving deltas: 100% (2004/2004), done.


We will use the (train/dev/test) files:

UD_English-GUM/en_gum-ud-train.conllu

UD_English-GUM/en_gum-ud-dev.conllu

UD_English-GUM/en_gum-ud-test.conllu

They are all in the CoNLL-U format. You may read about it [here](https://universaldependencies.org/format.html). There is a utility library, **conllutils**, which can help you read the data into memory. It has already been installed and imported above.

Write code that reads the three datasets into memory.
You may choose the data structure yourself.
As you can see in the files:
1. Every word is represented by a line.
2. The columns of each line represent specific features.
   * We are only interested in the second and fourth columns (FORM and UPOS),
   * which correspond to the word and its POS tag; the first column holds the token ID.
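
To make the file layout concrete, here is a minimal sketch that parses CoNLL-U text with plain Python instead of a library. The function name `read_conllu_text` and the inline `sample` sentence are made up for illustration. It assumes tab-separated token lines, `#` comment lines, and blank lines between sentences, as the format page describes.

```python
def read_conllu_text(text):
    """Parse CoNLL-U text into a list of sentences, each a list of (word, upos) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():              # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):          # comment / metadata line
            continue
        cols = line.split("\t")
        # cols[0] is the token ID; ranges ("1-2") and empty nodes ("3.1") are skipped
        if not cols[0].isdigit():
            continue
        current.append((cols[1], cols[3]))  # FORM and UPOS columns
    if current:                           # file may end without a trailing blank line
        sentences.append(current)
    return sentences


# Illustrative sample, not taken from the GUM files
sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)
print(read_conllu_text(sample))
# [[('Dogs', 'NOUN'), ('bark', 'VERB'), ('.', 'PUNCT')]]
```

A list of sentences, each a list of `(word, tag)` pairs, is only one possible data structure, but it matches the "list of sentences" input that the later parts ask for.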

In [None]:
def read_conllu(path):
    """Read a CoNLL-U file into a per-sentence dict and a flat list of words.

    Parsing approach based on https://www.youtube.com/watch?v=lvJRFMvWtFI
    """
    list_of_words = []
    # data_dict[sentence index] holds the sentence 'text' plus one word -> POS tag entry per word;
    # a word repeated within a sentence keeps only its last tag
    data_dict = defaultdict(lambda: defaultdict(dict))
    with open(path, "r", encoding="utf-8") as data_file:
        # parse_incr yields one parsed sentence at a time; sentence indices start at 1
        for item_idx, item in enumerate(parse_incr(data_file), start=1):
            text = item.metadata['text']
            for i_token in item:
                # skip multiword-token ranges and empty nodes, whose ids are not plain integers
                if not isinstance(i_token['id'], int):
                    continue
                word = i_token['form']  # surface word (FORM column), as the assignment asks
                pos_tag = i_token['upos']
                list_of_words.append(word)
                if 'text' not in data_dict[item_idx]:
                    data_dict[item_idx]['text'] = text
                data_dict[item_idx][word] = pos_tag

    return data_dict, list_of_words

## Set Paths

In [None]:
# Your code goes here
user = 'Or'
if user == 'Or':
    ud_dev =   r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-dev.conllu"
    ud_train = r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-train.conllu"
    ud_test =  r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-test.conllu"
elif user == 'roni':
    ud_dev =   r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-dev.conllu"
    ud_train = r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-train.conllu"
    ud_test =  r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-test.conllu"
else:
    ud_dev =   r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-dev.conllu"
    ud_train = r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-train.conllu"
    ud_test =  r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-test.conllu"

In [None]:

# get data

train_dict, train_list_of_word = read_conllu(ud_train)
dev_dict, dev_list_of_word = read_conllu(ud_dev)
test_dict, test_list_of_word  = read_conllu(ud_test)
train_words, train_counts_word = np.unique(np.array(train_list_of_word), return_counts=True)
word_freq = train_counts_word/np.sum(train_counts_word)


**Part 2**

Write a class **simple_tagger** with the methods *train* and *evaluate*.

The method *train*
1. receives the data as a list of sentences and uses it to train the tagger.
2. In this case, it should learn a simple dictionary that maps each word to a tag:
    * The tag is the most frequent tag for that word (if several tags tie for most frequent, you may select one of them randomly).
    * The dictionary should be stored as a class member for evaluation.

The method *evaluate*
1. receives the data as a list of sentences and uses it to evaluate the tagger's performance. Specifically, you should calculate the word-level and sentence-level accuracy.
2. The evaluation goes word by word, querying the dictionary (created by the *train* method) for each word's tag and comparing it to the true tag of that word.
3. The word-level accuracy is the number of correctly tagged words divided by the total number of words.
4. For OOV (out-of-vocabulary, or unknown) words, the tagger should assign the most frequent tag in the entire training set (i.e., the mode).
5. The function should return two numbers:
    * word-level accuracy
    * sentence-level accuracy.
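
As a rough illustration of the steps above, here is a minimal sketch of a most-frequent-tag baseline. The class name `SimpleTaggerSketch` and the toy sentences are invented for the example. It makes two assumptions: each sentence is a list of `(word, tag)` pairs, and a sentence counts as correct only if every word in it is tagged correctly. Ties are broken by first occurrence rather than randomly.

```python
from collections import Counter, defaultdict


class SimpleTaggerSketch:
    def train(self, data):
        tags_per_word = defaultdict(Counter)  # word -> counts of its tags
        all_tags = Counter()                  # tag counts over the whole training set
        for sentence in data:
            for word, tag in sentence:
                tags_per_word[word][tag] += 1
                all_tags[tag] += 1
        # most_common(1) returns the first-seen tag among ties (not a random one)
        self.word_to_tag = {w: c.most_common(1)[0][0] for w, c in tags_per_word.items()}
        self.default_tag = all_tags.most_common(1)[0][0]  # mode, used for OOV words

    def evaluate(self, data):
        correct_words = total_words = correct_sentences = 0
        for sentence in data:
            hits = sum(self.word_to_tag.get(w, self.default_tag) == t for w, t in sentence)
            correct_words += hits
            total_words += len(sentence)
            correct_sentences += int(hits == len(sentence))
        return correct_words / total_words, correct_sentences / len(data)


# Toy data, for illustration only
toy_train = [[("the", "DET"), ("dog", "NOUN"), ("chases", "VERB"), ("cats", "NOUN")],
             [("the", "DET"), ("cat", "NOUN")]]
toy_test = [[("the", "DET"), ("cat", "NOUN")],
            [("a", "DET"), ("bird", "NOUN")]]

tagger = SimpleTaggerSketch()
tagger.train(toy_train)
word_acc, sent_acc = tagger.evaluate(toy_test)
print(word_acc, sent_acc)  # 0.75 0.5
```

In the toy test set both "a" and "bird" are OOV and receive the mode tag `NOUN`, so only "a" is tagged wrongly.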


In [None]:
class simple_tagger:
    def train(self, data):
        # TODO
        pass

    def evaluate(self, data):
        # TODO
        pass

**Part 3**

Similar to Part 2, write the class hmm_tagger, which implements HMM tagging.
1. The method *train* should build the matrices A, B and Pi from the data, as discussed in class.
2. The method *evaluate* should find the best tag sequence for every input sentence using the Viterbi decoding algorithm, and then calculate the word-level and sentence-level accuracy using the gold-standard tags.
    * You should implement the Viterbi algorithm in the next block and call it from your class.
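
One way to picture the training step is plain relative-frequency counting. The sketch below assumes unsmoothed maximum-likelihood estimates, which may differ from what was presented in class. The function name `estimate_hmm` and the toy sentences are made up for the example. It uses nested lists so that it runs without extra packages; `np.array(A)` would turn a result into a matrix.

```python
def estimate_hmm(data):
    """Estimate A (transitions), B (emissions) and Pi (initial) by counting."""
    tags = sorted({tag for sentence in data for _, tag in sentence})
    words = sorted({word for sentence in data for word, _ in sentence})
    tag_id = {tag: i for i, tag in enumerate(tags)}
    word_id = {word: i for i, word in enumerate(words)}

    A = [[0.0] * len(tags) for _ in tags]    # A[i][j]: tag i followed by tag j
    B = [[0.0] * len(words) for _ in tags]   # B[i][k]: tag i emits word k
    Pi = [0.0] * len(tags)                   # Pi[i]: sentence starts with tag i

    for sentence in data:
        prev = None
        for word, tag in sentence:
            t = tag_id[tag]
            B[t][word_id[word]] += 1
            if prev is None:
                Pi[t] += 1
            else:
                A[prev][t] += 1
            prev = t

    def normalize(row):
        total = sum(row)
        # rows with no counts (e.g. a tag seen only sentence-finally) stay all zero
        return [count / total for count in row] if total else row

    return [normalize(r) for r in A], [normalize(r) for r in B], normalize(Pi), tag_id, word_id


# Toy data, for illustration only
toy = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
       [("the", "DET"), ("cat", "NOUN")]]
A, B, Pi, tag_id, word_id = estimate_hmm(toy)
print(Pi)                                   # [1.0, 0.0, 0.0]
print(B[tag_id["NOUN"]][word_id["dog"]])    # 0.5
```

Without smoothing, any transition or emission never seen in training gets probability zero, which is one reason the guidance on OOV words matters.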

## Additional guidance:
1. The matrix B represents the emission probabilities.
    * Since B is a matrix, you should build a dictionary that maps every unique word in the corpus to a serial numeric id (starting with 0).
    * This way, the columns of B correspond to word ids.
2. During the evaluation,
    * you should first convert each word into its index
    * and then create the observation array to be given to Viterbi as a list of ids.
    * OOV words should be assigned a random tag.
        * To make sure Viterbi works appropriately, you can simply break the sentence into multiple segments every time you see an OOV word, and decode every segment individually using Viterbi.
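
The segmentation idea can be sketched as follows. The names `split_on_oov` and `tag_with_segments` are hypothetical helpers, and `decode` stands in for whatever function turns a list of word ids into a list of tag ids (for example, a Viterbi call with A, B and Pi already bound).

```python
import random


def split_on_oov(words, word_id):
    """Split a sentence into runs of known word ids, with None marking each OOV word."""
    segments, current = [], []
    for word in words:
        if word in word_id:
            current.append(word_id[word])
        else:
            if current:               # close the run of known words before the OOV word
                segments.append(current)
                current = []
            segments.append(None)     # one placeholder per OOV word
    if current:
        segments.append(current)
    return segments


def tag_with_segments(words, word_id, n_tags, decode, rng=random):
    """Decode each known segment with `decode`; give each OOV word a random tag id."""
    predicted = []
    for segment in split_on_oov(words, word_id):
        if segment is None:
            predicted.append(rng.randrange(n_tags))
        else:
            predicted.extend(decode(segment))
    return predicted


# Toy vocabulary, for illustration only
toy_word_id = {"the": 0, "dog": 1, "barks": 2}
print(split_on_oov(["the", "zebra", "dog", "barks"], toy_word_id))
# [[0], None, [1, 2]]
```

Each segment is decoded independently, so the tag context on both sides of an OOV word is lost; that is the price of keeping Viterbi's input restricted to known ids.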


In [None]:
class hmm_tagger:
    def train(self, data):
        # TODO
        pass

    def evaluate(self, data):
        # TODO
        pass

In [None]:
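# Viterbi
def viterbi(observations, A, B, Pi):
    # TODO: compute the most likely state sequence for the observations
    best_sequence = []  # placeholder so that the function runs before it is implemented

    return best_sequence

# A simple example to run the Viterbi algorithm:
# (Same as in the presentation "NLP 3 - Tagging", slide 35)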

# A = np.array([[0.3, 0.7], [0.2, 0.8]])
# B = np.array([[0.1, 0.1, 0.3, 0.5], [0.3, 0.3, 0.2, 0.2]])
# Pi = np.array([0.4, 0.6])
# print(viterbi([0, 3, 2, 0], A, B, Pi))
# Expected output: 1, 1, 1, 1
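
For orientation, here is a minimal sketch of Viterbi decoding, checked against the toy example above. It is one possible shape rather than the required solution. The name `viterbi_sketch` is made up. It multiplies raw probabilities, which is fine for short inputs; for long sentences, summing log-probabilities is the usual way to avoid underflow. It indexes with `A[i][j]`, so it accepts nested lists as well as NumPy arrays.

```python
def viterbi_sketch(observations, A, B, Pi):
    """Return the most likely state sequence for a list of observation ids."""
    if len(observations) == 0:
        return []
    n_states = len(Pi)
    # delta[s]: probability of the best path that ends in state s at the current step
    delta = [Pi[s] * B[s][observations[0]] for s in range(n_states)]
    backpointers = []  # backpointers[t][s]: best predecessor of state s at step t + 1

    for obs in observations[1:]:
        new_delta, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: delta[p] * A[p][s])
            pointers.append(best_prev)
            new_delta.append(delta[best_prev] * A[best_prev][s] * B[s][obs])
        delta = new_delta
        backpointers.append(pointers)

    # Pick the best final state, then follow the backpointers to the start
    state = max(range(n_states), key=lambda s: delta[s])
    best_sequence = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        best_sequence.append(state)
    best_sequence.reverse()
    return best_sequence


A_toy = [[0.3, 0.7], [0.2, 0.8]]
B_toy = [[0.1, 0.1, 0.3, 0.5], [0.3, 0.3, 0.2, 0.2]]
Pi_toy = [0.4, 0.6]
print(viterbi_sketch([0, 3, 2, 0], A_toy, B_toy, Pi_toy))  # [1, 1, 1, 1]
```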

**Part 4**

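Compare the results obtained from both taggers with those of a third tagger implemented in NLTK (a well-known NLP library), on both the dev and test datasets. The assignment refers to this third tagger as a MEMM tagger; the code below trains NLTK's TnT tagger, which is a statistical tagger based on a second-order Markov model. To train it, execute the following lines (it may take some time to train...):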

In [None]:
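from nltk.tag import tnt

# train_data and test_data must be defined beforehand as lists of sentences,
# each sentence being a list of (word, tag) tuples
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)
# evaluate returns the word-level accuracy on the given tagged sentences
print(tnt_pos_tagger.evaluate(test_data))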

Print both the word-level and the sentence-level accuracy for all three taggers in a table.
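
A plain-text table can be produced with f-string alignment, as in the sketch below. The function name `format_results_table` and the tagger labels are assumptions, and the zeros are placeholders that only show the layout; they are not results.

```python
def format_results_table(results):
    """Format {(tagger, dataset): (word_acc, sentence_acc)} as an aligned text table."""
    header = f"{'Tagger':<10}{'Dataset':<9}{'Word acc.':>10}{'Sent. acc.':>12}"
    lines = [header, "-" * len(header)]
    for (tagger, dataset), (word_acc, sent_acc) in results.items():
        lines.append(f"{tagger:<10}{dataset:<9}{word_acc:>10.3f}{sent_acc:>12.3f}")
    return "\n".join(lines)


# Placeholder values only; replace them with the numbers returned by each tagger
placeholder = {
    ("simple", "dev"): (0.0, 0.0),
    ("simple", "test"): (0.0, 0.0),
    ("hmm", "dev"): (0.0, 0.0),
    ("hmm", "test"): (0.0, 0.0),
    ("nltk", "dev"): (0.0, 0.0),
    ("nltk", "test"): (0.0, 0.0),
}
table = format_results_table(placeholder)
print(table)
```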

In [None]:
# Your code goes here