# HMM PoS tagger

## Introduction

**The Georgetown University Multilayer Corpus (GUM) is an open source multilayer corpus of richly annotated texts from 16 text types.**

- https://gucorpling.org/gum/

- Zeldes, Amir (2017) "The GUM Corpus: Creating Multilayer Resources in the Classroom". Language Resources and Evaluation 51(3), 581–612. https://link.springer.com/article/10.1007/s10579-016-9343-x

- https://github.com/UniversalDependencies/UD_English-GUM/blob/master/README.md


**A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation (GENTLE) is a manually annotated multilayer corpus following the same design and annotation layers as GUM, but of unusual text types.**

- https://gucorpling.org/gum/gentle.html

- Aoyama, Tatsuya, Behzad, Shabnam, Gessler, Luke, Levine, Lauren, Lin, Jessica, Liu, Yang Janet, Peng, Siyao, Zhu, Yilun and Zeldes, Amir (2023) "GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation". In: Proceedings of the Seventeenth Linguistic Annotation Workshop (LAW-XVII 2023). Toronto, Canada. https://arxiv.org/abs/2306.01966

- https://github.com/UniversalDependencies/UD_English-GENTLE/blob/master/README.md

### PoS Tags

There are 17 unique PoS tags in the datasets. The following code cell shows the number of tokens per tag, followed by the most common lemmas. However, GUM's stats file is inconsistent regarding the number of tokens per tag: our reader recovers the GENTLE counts exactly but finds small differences for GUM, so we believe the stats file in the GUM repository is wrong.

Note that some lines carry a `_` tag and must be skipped. These are multiword-token range lines (here ID `12-13`), whose parts are annotated on the lines that follow, as in this example:

- 12-13	workforce’s	_	_	_	_	_	_	_	_

- 12	workforce	workforce	NOUN	NN	Number=Sing	14	nmod:poss	14:nmod:poss	_

- 13	’s	's	PART	POS	_	12	case	12:case	Entity=42)
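
As a minimal sketch of how such range lines can be filtered, assuming plain tab-separated CoNLL-U rows: the helper name `iter_tagged_tokens` is hypothetical and is not part of the tagger in this notebook, which applies the same test (`upos == "_"`) to tokens returned by the `conllu` library.

```python
from typing import Iterator, List, Tuple

# The three example rows above, written as tab-separated CoNLL-U lines
ROWS: List[str] = [
    "12-13\tworkforce’s\t_\t_\t_\t_\t_\t_\t_\t_",
    "12\tworkforce\tworkforce\tNOUN\tNN\tNumber=Sing\t14\tnmod:poss\t14:nmod:poss\t_",
    "13\t’s\t's\tPART\tPOS\t_\t12\tcase\t12:case\tEntity=42)",
]


def iter_tagged_tokens(rows: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (form, upos) pairs, skipping rows whose UPOS column is "_"."""
    for row in rows:
        fields = row.split("\t")
        form, upos = fields[1], fields[3]
        if upos == "_":  # multiword-token range such as 12-13
            continue
        yield form, upos


tagged = list(iter_tagged_tokens(ROWS))
print(tagged)
```

Only the two syntactic words survive, so the surface form `workforce’s` is never counted as a token of its own.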

In [None]:
""" https://github.com/UniversalDependencies/UD_English-GUM/blob/master/stats.xml
<!-- Statistics of universal POS tags. The comments show the most frequent lemmas. -->
<tags unique="17">
<tag name="ADJ">13961</tag><!-- good, other, first, new, many, great, little, large, more, different -->
<tag name="ADP">20170</tag><!-- of, in, to, for, on, with, at, from, by, as -->
<tag name="ADV">10103</tag><!-- so, when, just, then, also, how, now, more, here, really -->
<tag name="AUX">11355</tag><!-- be, have, do, can, will, would, should, could, may, might -->
<tag name="CCONJ">7057</tag><!-- and, or, but, both, &, either, nor, yet, neither, plus -->
<tag name="DET">17331</tag><!-- the, a, this, all, that, some, no, any, every, another -->
<tag name="INTJ">2023</tag><!-- like, yeah, oh, well, so, um, uh, no, okay, yes -->
<tag name="NOUN">35507</tag><!-- person, time, year, day, thing, way, life, city, world, work -->
<tag name="NUM">3994</tag><!-- one, two, 1, three, 2, 3, four, 4, five, 10 -->
<tag name="PART">5113</tag><!-- to, not, 's -->
<tag name="PRON">17819</tag><!-- I, it, you, we, that, they, he, his, this, your -->
<tag name="PROPN">12184</tag><!-- State, President, University, America, York, New, Warhol, Figure, American, south -->
<tag name="PUNCT">28955</tag><!-- ,, ., '', -, ?, (, ), —, [, : -->
<tag name="SCONJ">3393</tag><!-- that, if, as, because, for, of, by, while, in, after -->
<tag name="SYM">317</tag><!-- -, /, $, %, +, =, DKK, €, £, § -->
<tag name="VERB">22277</tag><!-- have, know, go, make, do, get, say, be, take, think -->
<tag name="X">361</tag><!-- _, et, al., de, 1, 1., 2., in, situ, 2 -->
</tags>
"""

""" https://github.com/UniversalDependencies/UD_English-GENTLE/blob/master/stats.xml
<!-- Statistics of universal POS tags. The comments show the most frequent lemmas. -->
<tags unique="17">
<tag name="ADJ">1240</tag><!-- next, other, first, old, open, more, good, straight, chronic, right -->
<tag name="ADP">1588</tag><!-- of, in, to, for, with, on, from, by, at, as -->
<tag name="ADV">729</tag><!-- then, just, so, well, here, also, thus, how, where, now -->
<tag name="AUX">753</tag><!-- be, will, have, can, do, would, may, should, could, shall -->
<tag name="CCONJ">618</tag><!-- and, or, but, &, either, /, plus, yet, both, neither -->
<tag name="DET">1195</tag><!-- the, a, this, all, no, any, that, some, each, every -->
<tag name="INTJ">60</tag><!-- fucking, please, ah, well, oh, okay, so, uh, ha, now -->
<tag name="NOUN">3783</tag><!-- week, x, T, project, school, S, person, day, time, y -->
<tag name="NUM">386</tag><!-- one, 1, 5, 2, two, 4, 3, X, 10, three -->
<tag name="PART">349</tag><!-- to, not, 's -->
<tag name="PRON">1188</tag><!-- I, you, he, it, we, his, that, my, your, they -->
<tag name="PROPN">901</tag><!-- Company, JavaScript, Book, Proposition, Court, English, Week, Career, React, Agreement -->
<tag name="PUNCT">2655</tag><!-- ,, ., :, '', (, ), -, ;, —, ! -->
<tag name="SCONJ">234</tag><!-- that, if, as, because, in, by, like, of, while, before -->
<tag name="SYM">167</tag><!-- ⪯, ∈, =, -, ⋅, /, %, +, $, ≤ -->
<tag name="VERB">1653</tag><!-- have, go, do, get, take, see, follow, make, know, let -->
<tag name="X">300</tag><!-- 1., 2., 3., 4., 5., 6., 7., 8., 9., 10. -->
</tags>
"""

PoS_tags = [
    ("ADJ", 12348, 1621, 1240),
    ("ADP", 17702, 2481, 1588),
    ("ADV", 8989, 1115, 729),
    ("AUX", 10174, 1189, 753),
    ("CCONJ", 6218, 839, 618),
    ("DET", 15224, 2111, 1195),
    ("INTJ", 1859, 164, 60),
    ("NOUN", 31274, 4240, 3783),
    ("NUM", 3554, 440, 386),
    ("PART", 4595, 519, 349),
    ("PRON", 16077, 1746, 1188),
    ("PROPN", 10557, 1627, 901),
    ("PUNCT", 25928, 3027, 2655),
    ("SCONJ", 3053, 340, 234),
    ("SYM", 282, 35, 167),
    ("VERB", 19865, 2479, 1653),
    ("X", 329, 32, 300),
]
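
To make the size of the mismatch with GUM's `stats.xml` concrete, the sketch below adds the train+dev and GUM test counts from `PoS_tags` and subtracts the totals reported in the stats file. Only a few tags are copied over for brevity, and `gum_discrepancies` is a hypothetical helper name used for illustration.

```python
from typing import Dict, Tuple

# tag -> (train+dev count, GUM test count, total reported in GUM's stats.xml),
# copied from the cell above for a handful of tags
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "ADJ": (12348, 1621, 13961),
    "ADP": (17702, 2481, 20170),
    "NOUN": (31274, 4240, 35507),
    "SYM": (282, 35, 317),
    "X": (329, 32, 361),
}


def gum_discrepancies(counts: Dict[str, Tuple[int, int, int]]) -> Dict[str, int]:
    """Return (train+dev + test) - stats.xml total for every tag."""
    return {
        tag: train_dev + test - stats
        for tag, (train_dev, test, stats) in counts.items()
    }


diffs = gum_discrepancies(COUNTS)
for tag, diff in diffs.items():
    print(f"{tag:5s} {diff:+d}")
```

A difference of zero means the counts read from the files agree with the stats file; a positive value means the files contain more tokens with that tag than the stats file reports.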

### Document type

Documents come from a wide variety of sources. To group them efficiently, documents belonging to the same area are gathered under a single document type.

For instance, GUM_academic_art, GUM_academic_census, etc. all belong to GUM_academic.

The following code block lists every document type.

In [None]:
# https://github.com/UniversalDependencies/UD_English-GUM/tree/master/not-to-release/sources

doc_type_GUM = [
    "GUM_academic",
    "GUM_bio",
    "GUM_conversation",
    "GUM_court",
    "GUM_essay",
    "GUM_fiction",
    "GUM_interview",
    "GUM_letter",
    "GUM_news",
    "GUM_podcast",
    "GUM_speech",
    "GUM_textbook",
    "GUM_vlog",
    "GUM_voyage",
    "GUM_whow",
]

# https://github.com/UniversalDependencies/UD_English-GENTLE/tree/master/not-to-release/sources

doc_type_GENTLE = [
    "GENTLE_dictionary",
    "GENTLE_esports",
    "GENTLE_legal",
    "GENTLE_medical",
    "GENTLE_poetry",
    "GENTLE_proof",
    "GENTLE_syllabus",
    "GENTLE_threat",
]
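
The grouping rule described above amounts to keeping the first two underscore-separated parts of a document identifier. A minimal sketch, where `doc_type_of` is a hypothetical helper name and the identifiers are the two examples mentioned earlier:

```python
def doc_type_of(doc_id: str) -> str:
    """Keep the corpus prefix and the genre, dropping the document name."""
    return "_".join(doc_id.split("_")[:2])


examples = ["GUM_academic_art", "GUM_academic_census"]
doc_types = [doc_type_of(doc_id) for doc_id in examples]
print(doc_types)
```

Both identifiers map to `GUM_academic`, which is one of the keys listed in `doc_type_GUM`.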

## HMM implementation

In [None]:
import os
from conllu import parse_incr
from typing import Dict, List, Tuple
from collections import Counter
import numpy as np
import pandas as pd

In [None]:
# Implementation of the HMM model
class HMM_PoS_tagger:
    def __init__(self, path_data: str, lemmatize: bool, threshold: float):
        self.path_data = path_data
        self.lemmatize = lemmatize
        self.threshold = threshold
        self.counter = Counter()  # 12770 unique tokens in train, 188028 total tokens
        # Read the train and test datasets and check that they are correct
        self.train = self.read_train_data()
        self.test_GUM = self.read_test_data(is_GUM=True)
        self.test_GENTLE = self.read_test_data(is_GUM=False)
        # Create vocabulary
        self.vocab = self.create_vocab()
        # Add UNK token to the vocabulary
        self.vocab = ["UNK"] + self.vocab
        # Add START, STOP and UNK tags to the tags
        self.tags = ["UNK", "START", "STOP"] + [x[0] for x in PoS_tags]
        # Train the model and get the emission and transmission matrices
        self.emission, self.transmission = self.train_model()

    def read_train_data(self) -> Dict[str, List[List[Tuple[str, str]]]]:

        # Train and Dev datasets will be used to train the model
        paths = [
            os.path.join(self.path_data, "en_gum-ud-train.conllu"),
            os.path.join(self.path_data, "en_gum-ud-dev.conllu"),
        ]

        data = self.read_data(paths)

        # Check that number of sentences and tokens per tag are correct
        # Also, update the vocabulary for future processing
        tags_counter = Counter()
        cont_sentences = 0
        for sentences in data.values():
            cont_sentences += len(sentences)
            for sentence in sentences:
                for tok, tag in sentence:
                    tags_counter.update([tag])
                    self.counter.update([tok])

        # Number of sentences is correct
        assert cont_sentences == 9520 + 1341
        # Document types are correct
        assert list(data.keys()) == doc_type_GUM
        # Number of tokens per tag is correct
        assert [(x[0], x[-3]) for x in PoS_tags] == sorted(tags_counter.items())
        return data

    def read_test_data(self, is_GUM: bool) -> Dict[str, List[List[Tuple[str, str]]]]:

        # Both GUM and GENTLE test datasets will be used to evaluate the model
        path = (
            os.path.join(self.path_data, "en_gum-ud-test.conllu")
            if is_GUM
            else os.path.join(self.path_data, "en_gentle-ud-test.conllu")
        )

        data = self.read_data([path])

        # Check that number of sentences and tokens per tag are correct
        cont_sentences = 0
        tags_counter = Counter()

        for sentences in data.values():
            cont_sentences += len(sentences)
            for sentence in sentences:
                for _, tag in sentence:
                    tags_counter.update([tag])

        if is_GUM:
            # Number of sentences is correct
            assert cont_sentences == 1285
            # Document types are correct
            assert list(data.keys()) == doc_type_GUM
            # Number of tokens per tag is correct
            assert [(x[0], x[-2]) for x in PoS_tags] == sorted(tags_counter.items())
        else:
            # Number of sentences is correct
            assert cont_sentences == 1334
            # Document types are correct
            assert list(data.keys()) == doc_type_GENTLE
            # Number of tokens per tag is correct
            assert [(x[0], x[-1]) for x in PoS_tags] == sorted(tags_counter.items())

        return data

    def read_data(self, paths: List[str]) -> Dict[str, List[List[Tuple[str, str]]]]:
        data = {}
        for path in paths:
            assert os.path.exists(path), f"The {path} path does not exist"
            with open(path, "r", encoding="utf-8") as file:
                for tokenlist in parse_incr(file):
                    # "newdoc id" appears on the first sentence of a document; doc_type
                    # stays in effect for the following sentences until the next one
                    if "newdoc id" in tokenlist.metadata:
                        # Keep only the document type, e.g. GUM_academic_art -> GUM_academic
                        doc_type = "_".join(
                            tokenlist.metadata["newdoc id"].split("_")[:2]
                        )

                    # Auxiliary list storing the (token, tag) tuples of the sentence
                    auxiliar = []
                    for token in tokenlist:
                        # Token has the following internal structure:
                        # token: <class 'conllu.models.Token'> /// dict_keys(['id', 'form', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc'])

                        # Avoid _ case as previously explained
                        if token["upos"] == "_":
                            continue
                        # Possibility to use the lemma or the form of the token
                        auxiliar.append(
                            (
                                token["lemma" if self.lemmatize else "form"].lower(),
                                token["upos"],
                            )
                        )

                    # If the document type is already in the dictionary, append the new data, otherwise, create a new key
                    if doc_type in data:
                        data[doc_type].append(auxiliar)
                    else:
                        data[doc_type] = [auxiliar]
        return data

    def create_vocab(self) -> List[str]:
        # Get the most common tokens in the train dataset
        tokens, times = zip(*self.counter.most_common())
        tokens = np.array(tokens)
        times = np.array(times)

        # Calculate the index of the tokens that are necessary to reach the threshold
        cum_times = np.cumsum(times)
        total_tokens = cum_times[-1]
        idx = np.searchsorted(cum_times, self.threshold * total_tokens)

        return tokens[: idx + 1].tolist()

    def train_model(self) -> Tuple[pd.DataFrame, pd.DataFrame]:
        emission = pd.DataFrame(
            np.zeros((len(self.tags), len(self.vocab)), dtype=np.float64),
            columns=self.vocab,
            index=self.tags,
        )
        transmission = pd.DataFrame(
            np.zeros((len(self.tags), len(self.tags)), dtype=np.float64),
            columns=self.tags,
            index=self.tags,
        )

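        # Set lookup keeps the membership test O(1); self.vocab is a list
        vocab = set(self.vocab)

        for sentences in self.train.values():
            for sentence in sentences:
                # Every sentence starts from START, so reset the previous tag here
                previous_tag = "START"
                for tok, tag in sentence:
                    if tok not in vocab:
                        # Out-of-vocabulary token: counted under the UNK column and the UNK tag
                        emission.loc[tag, "UNK"] += 1
                        transmission.loc[previous_tag, "UNK"] += 1
                        previous_tag = "UNK"
                    else:
                        emission.loc[tag, tok] += 1
                        transmission.loc[previous_tag, tag] += 1
                        previous_tag = tag
                transmission.loc[previous_tag, "STOP"] += 1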

        ###############################################################################
        ################################ EMISSION #####################################
        ###############################################################################

        # Row totals as a NumPy array; a pandas Series cannot be indexed with
        # [:, np.newaxis] (multidimensional indexing error)
        row_sums = emission.sum(axis=1).to_numpy()

        # Add-one (Laplace) smoothing, alpha = 1, computed in log2 space
        #'''
        emission = np.log2(emission + 1) - np.log2(
            row_sums[:, np.newaxis] + len(self.vocab)
        )
        #'''

        """ Check that the sum of the rows is 1 as it is a probability distribution
        emission = (emission + 1) / (row_sums[:, np.newaxis] + len(self.vocab))
        row_sums_check = emission.sum(axis=1)
        assert np.allclose(row_sums_check, 1), f"Sums of rows must be 1"
        """

        ###############################################################################
        ################################ TRANSMISSION #################################
        ###############################################################################

        # Row totals as a NumPy array; a pandas Series cannot be indexed with
        # [:, np.newaxis] (multidimensional indexing error)
        row_sums = transmission.sum(axis=1).to_numpy()

        # Add-one (Laplace) smoothing, alpha = 1, computed in log2 space
        #'''
        transmission = np.log2(transmission + 1) - np.log2(
            row_sums[:, np.newaxis] + len(self.tags)
        )
        #'''

        """ Check that the sum of the rows is 1 as it is a probability distribution
        transmission = (transmission + 1) / (row_sums[:, np.newaxis] + len(self.tags))
        row_sums_check = transmission.sum(axis=1)
        assert np.allclose(row_sums_check, 1), f"Sums of rows must be 1"
        """

        ###############################################################################

        return emission, transmission
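
The smoothing step in `train_model` can be checked on a tiny example. The sketch below is a dependency-free restatement of the same formula, log2(count + 1) - log2(row total + number of columns), on made-up counts; `log2_add_one` and the toy table are assumptions for illustration, not part of the class above.

```python
import math
from typing import Dict


def log2_add_one(
    counts: Dict[str, Dict[str, int]], alpha: float = 1.0
) -> Dict[str, Dict[str, float]]:
    """Row-wise additive smoothing, returned as log2 probabilities."""
    smoothed = {}
    for tag, row in counts.items():
        # Each of the len(row) columns receives alpha extra counts
        denominator = sum(row.values()) + alpha * len(row)
        smoothed[tag] = {
            tok: math.log2(count + alpha) - math.log2(denominator)
            for tok, count in row.items()
        }
    return smoothed


# Made-up emission counts: rows are tags, columns are vocabulary entries
toy_counts = {
    "NOUN": {"UNK": 3, "dog": 1, "run": 0},
    "VERB": {"UNK": 0, "dog": 0, "run": 4},
}

log_probs = log2_add_one(toy_counts)
row_totals = {
    tag: sum(2 ** value for value in row.values()) for tag, row in log_probs.items()
}
print(row_totals)
```

Back in probability space every row sums to 1 (up to floating-point error), and a pair that was never observed, such as NOUN emitting `run`, keeps a small non-zero probability (1/7 here) instead of minus infinity in log space.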

In [None]:
PATH_TO_REPO_FOLDER = input(
    "Enter the path to the repository folder (must end in HMM_PoS_Tagger): "
)

PATH_TO_DATA_FOLDER = os.path.join(PATH_TO_REPO_FOLDER, "data")
assert os.path.exists(
    PATH_TO_DATA_FOLDER
), f"The {PATH_TO_DATA_FOLDER} path does not exist"
LEMMATIZE = True
THRESHOLD = 0.9

tagger = HMM_PoS_tagger(path_data=PATH_TO_DATA_FOLDER, lemmatize=LEMMATIZE, threshold=THRESHOLD)
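
The effect of `THRESHOLD` is easier to see on a toy counter. The sketch below mirrors the rule in `create_vocab` (keep the most frequent tokens until their cumulative count reaches the given fraction of all tokens) using only the standard library; `coverage_vocab` and the toy counts are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate
from typing import List


def coverage_vocab(counter: Counter, threshold: float) -> List[str]:
    """Most frequent tokens whose cumulative count reaches threshold * total."""
    ranked = counter.most_common()
    cumulative = list(accumulate(count for _, count in ranked))
    cutoff = threshold * cumulative[-1]
    # First position whose cumulative count is >= cutoff, the same rule as
    # np.searchsorted with its default side="left"
    idx = bisect_left(cumulative, cutoff)
    return [tok for tok, _ in ranked[: idx + 1]]


# Made-up counts: 20 tokens in total, cumulative counts 9, 14, 17, 19, 20
toy = Counter({"the": 9, "cat": 5, "sat": 3, "mat": 2, "hat": 1})
vocab = coverage_vocab(toy, 0.9)
print(vocab)
```

With a threshold of 0.9 the cutoff is 18 tokens, so the first four types are kept and `hat` would be mapped to `UNK`.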

In [None]:
print(tagger.train["GUM_academic"][0])
print(tagger.test_GUM["GUM_academic"][0])
print(tagger.test_GENTLE["GENTLE_dictionary"][0])

## Viterbi algorithm

In [None]:
# Implementation of the Viterbi algorithm
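
One possible shape for the decoder, offered as a sketch under stated assumptions rather than as this notebook's implementation: it takes log2-probability tables laid out like `emission` and `transmission` (here plain nested dictionaries), treats only real PoS tags as candidate states, and falls back to the `UNK` column for unseen tokens. The function `viterbi`, the helper `to_log2` and all toy probabilities are made up for illustration.

```python
import math
from typing import Dict, List

Table = Dict[str, Dict[str, float]]


def to_log2(table: Table) -> Table:
    """Convert a nested table of probabilities to log2 probabilities."""
    return {row: {k: math.log2(p) for k, p in cols.items()} for row, cols in table.items()}


def viterbi(tokens: List[str], tags: List[str], emission: Table, transition: Table) -> List[str]:
    """Return the highest-scoring tag sequence for `tokens`."""
    if not tokens:
        return []

    def emit(tag: str, tok: str) -> float:
        row = emission[tag]
        return row[tok] if tok in row else row["UNK"]  # unseen token -> UNK column

    # best[i][tag]: score of the best path that ends in `tag` at position i
    best = [{t: transition["START"][t] + emit(t, tokens[0]) for t in tags}]
    back: List[Dict[str, str]] = [{}]
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transition[p][t])
            scores[t] = best[-1][prev] + transition[prev][t] + emit(t, tok)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Close the sentence with the transition into STOP, then follow the pointers back
    path = [max(tags, key=lambda t: best[-1][t] + transition[t]["STOP"])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["DET", "NOUN", "VERB"]
toy_transition = to_log2({
    "START": {"DET": 0.7, "NOUN": 0.2, "VERB": 0.1},
    "DET": {"DET": 0.05, "NOUN": 0.85, "VERB": 0.05, "STOP": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.5, "STOP": 0.2},
    "VERB": {"DET": 0.4, "NOUN": 0.2, "VERB": 0.1, "STOP": 0.3},
})
toy_emission = to_log2({
    "DET": {"the": 0.9, "dog": 0.03, "barks": 0.02, "UNK": 0.05},
    "NOUN": {"the": 0.02, "dog": 0.6, "barks": 0.18, "UNK": 0.2},
    "VERB": {"the": 0.02, "dog": 0.08, "barks": 0.7, "UNK": 0.2},
})

decoded = viterbi(["the", "dog", "barks"], TAGS, toy_emission, toy_transition)
print(decoded)
```

Working in log space turns products of probabilities into sums, which avoids underflow on long sentences and matches the log2 tables produced by `train_model`.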

## Experiments

### In-domain

### Out-of-domain

## Conclusion