# Assignment 2

This assignment is about training and evaluating a POS tagger on real data. The dataset is available through the [Universal Dependencies](https://universaldependencies.org/) (UD) project. To get to know the project, please visit https://universaldependencies.org/introduction.html.

In [1]:
! pip install conllutils
! pip install conllu
! pip install conll-df


Collecting conllutils
  Downloading conllutils-1.1.4.tar.gz (18 kB)
Building wheels for collected packages: conllutils
  Building wheel for conllutils (setup.py) ... done
  Created wheel for conllutils: filename=conllutils-1.1.4-py3-none-any.whl size=17697 sha256=15e49b94130707a99ee738b6cf2d8f5c8fa0e57db16e396c92ddd04914286ad6
  Stored in directory: /root/.cache/pip/wheels/70/9c/af/495f50326290abb66f82ac92273619cdad168cc1b79af379db
Successfully built conllutils
Installing collected packages: conllutils
Successfully installed conllutils-1.1.4
Collecting conllu
  Downloading conllu-4.4.1-py2.py3-none-any.whl (15 kB)
Installing collected packages: conllu
Successfully installed conllu-4.4.1
Collecting conll-df
  Downloading conll_df-0.0.4.tar.gz (3.5 kB)
Building wheels for collected packages: conll-df
  Building wheel for conll-df (setup.py) ... error
  ERROR: Failed building wheel for conll-df
  Running setup.py clean for conll-df
Failed to build conll-df

In [2]:
import numpy as np
import pandas as pd
import operator
import nltk

import conllutils
from io import open
from conllu import parse_incr

from collections import defaultdict
from conllutils import pipe
from conll_df import conll_df



**Part 1** (getting the data)

You can download the dataset files directly from the UD website, but it only lets you download all the languages in one compressed file. In this assignment you will be working with the GUM dataset, which you can download directly from:
https://github.com/UniversalDependencies/UD_English-GUM.
Please download it to your Colab machine.



In [3]:
!git clone https://github.com/UniversalDependencies/UD_English-GUM

Cloning into 'UD_English-GUM'...
remote: Enumerating objects: 3888, done.
remote: Counting objects: 100% (190/190), done.
remote: Compressing objects: 100% (184/184), done.
remote: Total 3888 (delta 149), reused 11 (delta 6), pack-reused 3698
Receiving objects: 100% (3888/3888), 36.19 MiB | 15.15 MiB/s, done.
Resolving deltas: 100% (3499/3499), done.


We will use the (train/dev/test) files:

UD_English-GUM/en_gum-ud-train.conllu

UD_English-GUM/en_gum-ud-dev.conllu

UD_English-GUM/en_gum-ud-test.conllu

They are all formatted in the CoNLL-U format. You may read about it [here](https://universaldependencies.org/format.html). There is a utility library, **conllutils**, which can help you read the data into memory. It has already been installed and imported above.

You should write code that reads the three datasets into memory.
You may choose the data structure yourself.
As you can see:
1. every word is represented by a line;
2. the columns of each line represent specific features.
   * We are only interested in the word (the FORM field) and its POS tag (the UPOS field).
   * In the CoNLL-U specification these are fields 2 and 4; field 1 is the word ID.
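
As an illustration of the file layout described above, here is a minimal sketch of a reader that uses only the standard library. The function name `read_conllu` is hypothetical (it is not part of conllutils, conllu, or conll-df), and the sketch assumes the standard CoNLL-U conventions: `#` comment lines, a blank line between sentences, tab-separated fields, and ID ranges such as `1-2` for multiword tokens.

```python
def read_conllu(lines):
    """Return a list of sentences, each a list of (word, tag) tuples.

    `lines` is any iterable of text lines, e.g. an open file object.
    Hypothetical helper for illustration only.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) carry no tokens.
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        token_id = fields[0]
        if "-" in token_id or "." in token_id:
            # Multiword-token ranges (1-2) and empty nodes (1.1) have no UPOS of their own.
            continue
        current.append((fields[1], fields[3]))  # FORM, UPOS
    if current:
        sentences.append(current)
    return sentences
```

A file could then be read with `with open(ud_train, encoding="utf-8") as f: train_sentences = read_conllu(f)`. The resulting list of `(word, tag)` sentences is one possible in-memory structure; a DataFrame works equally well.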

### Set Paths

In [29]:
# Your code goes here
FOLDER = 'UD_English-GUM'



user = 'va'
if user == 'Or':
    ud_dev =   r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-dev.conllu"
    ud_train = r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-train.conllu"
    ud_test =  r"C:\MSC\NLP2\HW2\UD_English-GUM\en_gum-ud-test.conllu"
elif user == 'Roni':
    ud_dev =   '/Users/ronibendom/Master/NLP/HW2/UD_English-GUM/en_gum-ud-dev.conllu'
    ud_train = '/Users/ronibendom/Master/NLP/HW2/UD_English-GUM/en_gum-ud-train.conllu'
    ud_test =  '/Users/ronibendom/Master/NLP/HW2/UD_English-GUM/en_gum-ud-test.conllu'
else:
    ud_dev =   f'{FOLDER}/en_gum-ud-dev.conllu'
    ud_train = f'{FOLDER}/en_gum-ud-train.conllu'
    ud_test =  f'{FOLDER}/en_gum-ud-test.conllu'

train_csv = FOLDER + '/en_gum-ud-train.csv'
test_csv = FOLDER + '/en_gum-ud-test.csv'
dev_csv = FOLDER + '/en_gum-ud-dev.csv'


### Get Data

In [30]:
train_df = conll_df(ud_train, file_index=False)
train_df = train_df.iloc[:, [0, 3]]

dev_df = conll_df(ud_dev, file_index=False)
dev_df = dev_df.iloc[:, [0, 3]]

test_df = conll_df(ud_test, file_index=False)
test_df = test_df.iloc[:, [0, 3]]

# train_df.to_csv(train_csv)
# test_df.to_csv(test_csv)
# dev_df.to_csv(dev_csv)

In [31]:

# train_df = pd.read_csv(train_csv)
# dev_df = pd.read_csv(dev_csv)
# test_df = pd.read_csv(test_csv)

In [32]:
def extract_ommision_matrix_B(train_df, unique_pos, words_indexing_dict):
    """Build the emission matrix B, where B[tag, word] = P(word | tag)."""
    B = np.zeros([len(unique_pos), len(words_indexing_dict)])
    for B_row_index, i_pos in enumerate(unique_pos):
        # Count how often each word occurs with this tag.
        i_pos_train_df = train_df.loc[train_df['p'] == i_pos]
        i_pos_words, i_pos_word_count = np.unique(i_pos_train_df.loc[:, 'w'].values, return_counts=True)
        # Relative frequency of each word among all tokens carrying this tag.
        i_pos_percent = i_pos_word_count / np.sum(i_pos_word_count)
        for i_word, i_word_percent in zip(i_pos_words, i_pos_percent):
            B[B_row_index, words_indexing_dict[i_word]] = i_word_percent

    return B

def generate_transition_matrix_A_and_pi(train_df, unique_pos, pos_indexing_dict):
    """Estimate the tag-transition matrix A and the initial-tag vector pi from counts."""
    A = np.zeros((len(unique_pos), len(unique_pos)))
    pi = np.zeros((len(unique_pos), 1))

    # Look up every token's tag id once instead of calling iloc per token.
    tag_ids = [pos_indexing_dict[p] for p in train_df['p'].values]
    # The first level of the index is the sentence number.
    sentence_ids = [idx[0] for idx in train_df.index]

    pi[tag_ids[0]] += 1  # the first token opens the first sentence
    for i in range(1, len(tag_ids)):
        if sentence_ids[i] != sentence_ids[i - 1]:
            # A new sentence starts: count its first tag in pi, not in A.
            pi[tag_ids[i]] += 1
        else:
            A[tag_ids[i - 1], tag_ids[i]] += 1

    # Normalize counts into probabilities (each row of A sums to 1).
    A = A / A.sum(axis=1, keepdims=True)
    pi = pi / pi.sum()

    return A, pi

### Create matrices

In [35]:
### Create matrices
pos_values = list(np.unique(train_df.loc[:, 'p'].values, return_counts=True))
unique_words = np.unique(train_df.loc[:, 'w'].values)
unique_pos = pos_values[0]

words_indexing_dict = {unique_words[i] : i for i in range(len(unique_words))}
pos_indexing_dict = {unique_pos[i] : i for i in range(len(unique_pos))}

# Generate emission matrix
B = extract_ommision_matrix_B(train_df, unique_pos, words_indexing_dict)

# Generate transition matrix and initial matrix
A, pi_initial_matrix = generate_transition_matrix_A_and_pi(train_df, unique_pos, pos_indexing_dict)

### Create sentences from df

In [None]:
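def sentences_from_df(df):
    """Rebuild each sentence as a space-joined string of its words."""
    sentences = []
    sentence = []
    prev_sentence_ind = None

    for idx, word in zip(df.index, df['w'].values):
        curr_sentence_ind = idx[0]  # first index level is the sentence number
        if prev_sentence_ind is not None and curr_sentence_ind != prev_sentence_ind:
            sentences.append(" ".join(sentence))
            sentence = []
        sentence.append(word)
        prev_sentence_ind = curr_sentence_ind

    # Flush the final sentence, which is not followed by an index change.
    if sentence:
        sentences.append(" ".join(sentence))

    return sentences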

In [None]:
sentences = sentences_from_df(train_df)

**Part 2**

Write a class **simple_tagger** with methods *train* and *evaluate*.

The method *train*:
1. receives the data as a list of sentences and uses it to train the tagger.
2. learns a simple dictionary that maps words to tags, defined as the most frequent tag for every word (in case there is more than one most frequent tag, you may select one of them randomly).
3. stores the dictionary as a class member for evaluation.

The method *evaluate*:
1. receives the data as a list of sentences and uses it to evaluate the tagger's performance.
2. goes word by word, querying the dictionary (created by the *train* method) for each word's tag and comparing it to the true tag of that word.
3. assigns OOV (out-of-vocabulary, or unknown) words the most frequent tag in the entire training set (i.e., the mode).
4. returns two numbers:
    * word-level accuracy: the number of successes divided by the number of words.
    * sentence-level accuracy.


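
The dictionary-based tagger described above could be sketched as follows. This is an illustration, not the required solution: the class name `SimpleTaggerSketch` is hypothetical, the data is assumed to be a list of sentences where each sentence is a list of `(word, tag)` tuples, ties are broken by first occurrence rather than randomly, and sentence-level accuracy is assumed to mean the fraction of sentences in which every word is tagged correctly.

```python
from collections import Counter, defaultdict


class SimpleTaggerSketch:
    """Most-frequent-tag baseline (hypothetical illustration)."""

    def __init__(self):
        self.word_to_tag = {}    # word -> its most frequent tag in training
        self.default_tag = None  # most frequent tag overall, used for OOV words

    def train(self, sentences):
        per_word = defaultdict(Counter)
        overall = Counter()
        for sentence in sentences:
            for word, tag in sentence:
                per_word[word][tag] += 1
                overall[tag] += 1
        self.word_to_tag = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
        self.default_tag = overall.most_common(1)[0][0]

    def evaluate(self, sentences):
        correct_words = total_words = correct_sentences = 0
        for sentence in sentences:
            # Count words whose predicted tag matches the gold tag.
            hits = sum(
                self.word_to_tag.get(word, self.default_tag) == tag
                for word, tag in sentence
            )
            correct_words += hits
            total_words += len(sentence)
            # Assumption: a sentence is correct only if all its words are correct.
            correct_sentences += hits == len(sentence)
        return correct_words / total_words, correct_sentences / len(sentences)
```

Both returned values are fractions in the range 0 to 1; multiply by 100 if percentages are wanted.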

In [None]:
class simple_tagger:
  def __init__(self):
    self.tagger = {}             # word -> most frequent tag for that word
    self.most_frequent_pos = ''  # fallback tag for OOV words

  def train(self, data):
    unique_words = np.unique(data.loc[:, 'w'].values)
    tagger = {}

    for word in unique_words:
        # Tags seen with this word and their counts; argmax takes the first tag on ties.
        tags, counts = np.unique(data.loc[data['w'] == word, 'p'], return_counts=True)
        tagger[word] = tags[np.argmax(counts)]

    self.tagger = tagger

    # The most frequent tag overall (the mode) is used for OOV words.
    tags, counts = np.unique(data['p'], return_counts=True)
    self.most_frequent_pos = tags[np.argmax(counts)]

  def evaluate(self, data):
    # TODO
    pass

In [None]:

tagger = simple_tagger()
tagger.train(train_df)

In [None]:
# https://piazza.com/class/klxc3m1tzqz2o8?cid=40 - "You should evaluate on the test and dev datasets separately. The train file is for training only"
simple_tagger_word_level_accuracy_train, simple_tagger_sentence_level_accuracy_train = tagger.evaluate(train_df)
simple_tagger_word_level_accuracy_test, simple_tagger_sentence_level_accuracy_test = tagger.evaluate(test_df)
simple_tagger_word_level_accuracy_dev, simple_tagger_sentence_level_accuracy_dev = tagger.evaluate(dev_df)



In [None]:
print(f'*train* data: word accuracy = {simple_tagger_word_level_accuracy_train} %, sentence accuracy = {simple_tagger_sentence_level_accuracy_train} %')
print(f'*test* data: word accuracy = {simple_tagger_word_level_accuracy_test} %, sentence accuracy = {simple_tagger_sentence_level_accuracy_test} %')
print(f'*dev* data: word accuracy = {simple_tagger_word_level_accuracy_dev} %, sentence accuracy = {simple_tagger_sentence_level_accuracy_dev} %')



**Part 3**

Similar to Part 2, write the class hmm_tagger, which implements HMM tagging. The method *train* should build the matrices A, B and Pi from the data, as discussed in class. The method *evaluate* should find the best tag sequence for every input sentence using the Viterbi decoding algorithm, and then calculate the word- and sentence-level accuracy using the gold-standard tags. You should implement the Viterbi algorithm in the next block and call it from your class.
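
To make the three matrices concrete, here is a minimal sketch of estimating them by plain relative-frequency counts. The function name `estimate_hmm` is hypothetical, the input is assumed to be a list of sentences of `(word, tag)` tuples, and no smoothing is applied; the method discussed in class may differ in these details.

```python
def estimate_hmm(tagged_sentences):
    """Return (A, B, Pi, tag_id, word_id) estimated from counts.

    A[i][j]  = P(tag j follows tag i)
    B[i][k]  = P(word k | tag i)
    Pi[i]    = P(a sentence starts with tag i)
    Hypothetical helper for illustration only.
    """
    tags = sorted({tag for sent in tagged_sentences for _, tag in sent})
    words = sorted({word for sent in tagged_sentences for word, _ in sent})
    tag_id = {tag: i for i, tag in enumerate(tags)}
    word_id = {word: i for i, word in enumerate(words)}

    n_tags, n_words = len(tags), len(words)
    A = [[0.0] * n_tags for _ in range(n_tags)]
    B = [[0.0] * n_words for _ in range(n_tags)]
    Pi = [0.0] * n_tags

    for sent in tagged_sentences:
        prev = None
        for word, tag in sent:
            t = tag_id[tag]
            B[t][word_id[word]] += 1
            if prev is None:
                Pi[t] += 1       # first tag of the sentence
            else:
                A[prev][t] += 1  # transition inside the sentence
            prev = t

    def normalize(row):
        total = sum(row)
        # Rows with no counts (e.g. a tag never followed by anything) stay all-zero.
        return [x / total for x in row] if total else row

    return [normalize(r) for r in A], [normalize(r) for r in B], normalize(Pi), tag_id, word_id
```

Transitions are counted only within a sentence, so the last tag of one sentence never contributes a transition to the first tag of the next.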

Additional guidance:
1. The matrix B represents the emission probabilities. Since B is a matrix, you should build a dictionary that maps every unique word in the corpus to a serial numeric id (starting with 0). This way, the columns of B represent word ids.
2. During the evaluation, you should first convert each word into its index and then create the observation array to be given to Viterbi, as a list of ids. OOV words should be assigned a random tag. To make sure Viterbi works appropriately, you can simply break the sentence into multiple segments every time you see an OOV word, and decode every segment individually using Viterbi.
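
The segmentation idea in item 2 can be sketched as below. The function name `split_on_oov` and the `("known", ...)` / `("oov", ...)` segment labels are hypothetical choices for illustration; `word_to_id` stands for the word-to-id dictionary from item 1.

```python
def split_on_oov(words, word_to_id):
    """Split a sentence into runs of known word ids, separated by OOV words.

    Returns a list of segments in sentence order:
      ("known", [id, id, ...])  -> decode this run with Viterbi
      ("oov", word)             -> assign this word a random tag
    Hypothetical helper for illustration only.
    """
    segments = []
    current = []
    for word in words:
        if word in word_to_id:
            current.append(word_to_id[word])
        else:
            if current:
                segments.append(("known", current))
                current = []
            segments.append(("oov", word))
    if current:
        segments.append(("known", current))
    return segments
```

Concatenating the decoded tags of each segment in order gives one tag per word of the original sentence.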


In [None]:
class hmm_tagger:
    def train(self, data):
        # TODO
        pass

    def evaluate(self, data):
        # TODO
        pass

In [None]:
# Viterbi
def viterbi (observations, A, B, Pi):
  #...

  return best_sequence

# A simple example to run the Viterbi algorithm:
#( Same as in presentation "NLP 3 - Tagging" on slide 35)

# A = np.array([[0.3, 0.7], [0.2, 0.8]])
# B = np.array([[0.1, 0.1, 0.3, 0.5], [0.3, 0.3, 0.2, 0.2]])
# Pi = np.array([0.4, 0.6])
# print(viterbi([0, 3, 2, 0], A, B, Pi))
# Expected output: 1, 1, 1, 1
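
For reference, one possible shape of the decoder is sketched below. It is named `viterbi_sketch` (a hypothetical name) to keep it apart from the `viterbi` function to be implemented, and it assumes that `A[i][j]` is the probability of moving from state `i` to state `j`, `B[i][k]` the probability of state `i` emitting observation `k`, and `Pi[i]` the initial probability of state `i`. It multiplies raw probabilities, so very long sequences may underflow; working in log space is a common alternative.

```python
def viterbi_sketch(observations, A, B, Pi):
    """Return the most probable state sequence for a list of observation ids."""
    if not observations:
        return []
    n_states = len(Pi)

    # delta[s]: probability of the best path that ends in state s at the current step.
    delta = [Pi[s] * B[s][observations[0]] for s in range(n_states)]
    backpointers = []  # backpointers[t][s]: best predecessor of state s at step t + 1

    for obs in observations[1:]:
        new_delta, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda r: delta[r] * A[r][s])
            pointers.append(best_prev)
            new_delta.append(delta[best_prev] * A[best_prev][s] * B[s][obs])
        delta = new_delta
        backpointers.append(pointers)

    # Start from the best final state and follow the backpointers to the beginning.
    state = max(range(n_states), key=lambda s: delta[s])
    best_sequence = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        best_sequence.append(state)
    best_sequence.reverse()
    return best_sequence
```

On the example from the slide (the commented-out values of `A`, `B` and `Pi` above, with observations `[0, 3, 2, 0]`), this sketch returns `[1, 1, 1, 1]`, which agrees with the expected output.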

**Part 4**

Compare the results obtained from both taggers and a MEMM tagger implemented by NLTK (a well-known NLP library) on both the dev and test datasets. To train the NLTK tagger (the code below uses NLTK's TnT tagger), execute the following lines (training may take some time):

In [None]:
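from nltk.tag import tnt

# train_data and test_data are not defined above: TnT expects each to be a list of
# sentences, where every sentence is a list of (word, tag) tuples.
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)
print(tnt_pos_tagger.evaluate(test_data))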

Print both the word-level and sentence-level accuracy for all three taggers in a table.
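
One simple way to lay out such a table in plain text is sketched below. The function name `format_results_table` and the nested-dictionary layout of `results` are hypothetical choices; the accuracies themselves must come from your own evaluation runs.

```python
def format_results_table(results):
    """Format accuracies as a fixed-width text table.

    results: {tagger_name: {dataset_name: (word_accuracy, sentence_accuracy)}}
    Hypothetical helper for illustration only.
    """
    header = f"{'tagger':<10}{'dataset':<9}{'word acc':>10}{'sent acc':>10}"
    lines = [header, "-" * len(header)]
    for tagger_name, per_dataset in results.items():
        for dataset_name, (word_acc, sent_acc) in per_dataset.items():
            lines.append(
                f"{tagger_name:<10}{dataset_name:<9}{word_acc:>10.4f}{sent_acc:>10.4f}"
            )
    return "\n".join(lines)
```

Calling `print(format_results_table(results))` prints one row per tagger and dataset. A pandas DataFrame would be an equally reasonable choice in a notebook.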

In [None]:
# Your code goes here