# Hidden Markov Models for Sequence Tagging

The aim of this lab session is to develop an HMM sequence tagger and evaluate it on classical sequence tagging
tasks, such as **part-of-speech (POS) tagging** or **Name Entity Recognition**.
Sequence tagging is a structured prediction problem consisting in assigning a sequence of $n$ tags $c = c_1,c_2,...,c_n$
to an input sequence of the same length $w = w_1,w_2,...,w_n$.
For exemple, in POS tagging, the French sentence `"Le chat mange une pomme ."` should be assigned the POS sequence: `"D N V D N PONCT"`.

**Note** : The whole implementation should be carried out without importing additional libraries (numpy, collections and datasets are allowed)

In [25]:
!pip install datasets



In [26]:
import numpy as np
from collections import defaultdict
import datasets

## HiddenMarkovModel initial implementation



Implement the `train` function in the `hiddenMarkovModel` class. This function must estimate the parameters of the HMM model from a corpus. The `test_estimation` function will test estimation on the 2 sentences in Section 2:
  - `le/D chat/N ferme/V la/D porte/N`
  - `le/D chien/N le/CL porte/N`


In [27]:
class HiddenMarkovModel():
    """
    Sequence tagging model: Hidden Markov model
    """

    def __init__(self) :

        # log probabilities of transitions P( tag_i | tag_i-1.)
        # for example, self.transitions["NOUN"]["DET"] should contain log P(DET | NOUN)
        #              self.transitions[START]["DET"] should contain log P(DET | début de phrase)
        #              self.transitions["V"][END] should contain log  P(fin de phrase | "V")
        self.transitions            = defaultdict(lambda : defaultdict(float))

        # log probabilities of emissions: P( word | tag )
        # For example, self.emissions["NOUN"]["chat"] should contain log P(chat | NOUN)
        self.emissions              = defaultdict(lambda : defaultdict(float))

        self.tags = []              # list of unique possible tags (without END and START)
        self.voc = {UNKNOWN}        # vocabulary
    def train(self, minicorpusn, precision):
      pass



    def print_parameters(self, exp = np.exp) :
        """
        Print all parameters of the model
        """
        print("Transitions")
        for k1 in sorted(self.transitions) :
            for k2 in sorted(self.transitions[k1]) :
                v = self.transitions[k1][k2]
                print(f"P({k2}|{k1}) = {round(exp(v),5)}\t\t(log p = {round(v, 5)})")

        print("Emissions")
        for t in sorted(self.emissions) :
            for w,v in sorted(self.emissions[t].items()) :
                print(f"P({w}|{t}) = {round(exp(v),5)}\t\t(log p = {round(v,5)})")

    def predict(self, sentence):
        """
        Predict a sequence of tags for the input sentence (list of tokens)
        """
        emissions_scores = [ {tag : self.emissions[tag][word] if word in self.emissions[tag] else self.emissions[tag][UNKNOWN] for tag in self.tags } for word in sentence ]
        y = self.viterbi(self.transitions, emissions_scores)
        return y

    def predict_corpus(self, list_of_sentences):
        """
        Perform  predictions for a list_of_sentences
        """
        return [self.predict(sentence) for sentence in list_of_sentences]


In [28]:
def test_estimation():
    minicorpus=[("le chat ferme la porte", "D N V D N"),("le chien le porte","D N CL V")]
    for i in range(len(minicorpus)) :
        minicorpus[i] = {"tokens": minicorpus[i][0].split(), "upos": minicorpus[i][1].split()}
    # print(minicorpus)

    hmm = HiddenMarkovModel()
    hmm.train(minicorpus, 1e-8)

    hmm.print_parameters()

    print()
    print("Expected results")
    print('Transitions\nP(</S>|<S>) = 0.0		(log p = -18.42068)\nP(CL|<S>) = 0.0		(log p = -18.42068)\nP(D|<S>) = 1.0		(log p = 0.0)\nP(N|<S>) = 0.0		(log p = -18.42068)\nP(V|<S>) = 0.0		(log p = -18.42068)\nP(</S>|CL) = 0.0		(log p = -18.42068)\nP(CL|CL) = 0.0		(log p = -18.42068)\nP(D|CL) = 0.0		(log p = -18.42068)\nP(N|CL) = 0.0		(log p = -18.42068)\nP(V|CL) = 1.0		(log p = 0.0)\nP(</S>|D) = 0.0		(log p = -18.42068)\nP(CL|D) = 0.0		(log p = -18.42068)\nP(D|D) = 0.0		(log p = -18.42068)\nP(N|D) = 1.0		(log p = 0.0)\nP(V|D) = 0.0		(log p = -18.42068)\nP(</S>|N) = 0.33333		(log p = -1.09861)\nP(CL|N) = 0.33333		(log p = -1.09861)\nP(D|N) = 0.0		(log p = -18.42068)\nP(N|N) = 0.0		(log p = -18.42068)\nP(V|N) = 0.33333		(log p = -1.09861)\nP(</S>|V) = 0.5		(log p = -0.69315)\nP(CL|V) = 0.0		(log p = -18.42068)\nP(D|V) = 0.5		(log p = -0.69315)\nP(N|V) = 0.0		(log p = -18.42068)\nP(V|V) = 0.0		(log p = -18.42068)\nEmissions\nP(<UNK>|CL) = 0.0		(log p = -18.42068)\nP(chat|CL) = 0.0		(log p = -18.42068)\nP(chien|CL) = 0.0		(log p = -18.42068)\nP(ferme|CL) = 0.0		(log p = -18.42068)\nP(la|CL) = 0.0		(log p = -18.42068)\nP(le|CL) = 1.0		(log p = 0.0)\nP(porte|CL) = 0.0		(log p = -18.42068)\nP(<UNK>|D) = 0.0		(log p = -18.42068)\nP(chat|D) = 0.0		(log p = -18.42068)\nP(chien|D) = 0.0		(log p = -18.42068)\nP(ferme|D) = 0.0		(log p = -18.42068)\nP(la|D) = 0.33333		(log p = -1.09861)\nP(le|D) = 0.66667		(log p = -0.40547)\nP(porte|D) = 0.0		(log p = -18.42068)\nP(<UNK>|N) = 0.0		(log p = -18.42068)\nP(chat|N) = 0.33333		(log p = -1.09861)\nP(chien|N) = 0.33333		(log p = -1.09861)\nP(ferme|N) = 0.0		(log p = -18.42068)\nP(la|N) = 0.0		(log p = -18.42068)\nP(le|N) = 0.0		(log p = -18.42068)\nP(porte|N) = 0.33333		(log p = -1.09861)\nP(<UNK>|V) = 0.0		(log p = -18.42068)\nP(chat|V) = 0.0		(log p = -18.42068)\nP(chien|V) = 0.0		(log p = -18.42068)\nP(ferme|V) = 0.5		(log p = -0.69315)\nP(la|V) = 0.0		(log p = -18.42068)\nP(le|V) = 0.0		(log p = -18.42068)\nP(porte|V) = 0.5		(log p = -0.69315)\n')


In [29]:
### Expected Results:

START="<S>"     # Start of sentence
END="</S>"      # End of sentence
UNKNOWN="<UNK>" # Unkown word s

test_estimation()

Transitions
Emissions

Expected results
Transitions
P(</S>|<S>) = 0.0		(log p = -18.42068)
P(CL|<S>) = 0.0		(log p = -18.42068)
P(D|<S>) = 1.0		(log p = 0.0)
P(N|<S>) = 0.0		(log p = -18.42068)
P(V|<S>) = 0.0		(log p = -18.42068)
P(</S>|CL) = 0.0		(log p = -18.42068)
P(CL|CL) = 0.0		(log p = -18.42068)
P(D|CL) = 0.0		(log p = -18.42068)
P(N|CL) = 0.0		(log p = -18.42068)
P(V|CL) = 1.0		(log p = 0.0)
P(</S>|D) = 0.0		(log p = -18.42068)
P(CL|D) = 0.0		(log p = -18.42068)
P(D|D) = 0.0		(log p = -18.42068)
P(N|D) = 1.0		(log p = 0.0)
P(V|D) = 0.0		(log p = -18.42068)
P(</S>|N) = 0.33333		(log p = -1.09861)
P(CL|N) = 0.33333		(log p = -1.09861)
P(D|N) = 0.0		(log p = -18.42068)
P(N|N) = 0.0		(log p = -18.42068)
P(V|N) = 0.33333		(log p = -1.09861)
P(</S>|V) = 0.5		(log p = -0.69315)
P(CL|V) = 0.0		(log p = -18.42068)
P(D|V) = 0.5		(log p = -0.69315)
P(N|V) = 0.0		(log p = -18.42068)
P(V|V) = 0.0		(log p = -18.42068)
Emissions
P(<UNK>|CL) = 0.0		(log p = -18.42068)
P(chat|CL) = 0.0		(log p 

**Let's define the training function for the HiddenMarkovModel class:**

In [30]:
def hmm_train(self, train_data, smooth=1e-8) :
    """
    Parameter (transitions and emissions) estimation from a train corpus


    parameters:
        train_data : Dataset object with fields: "tokens" and "upos"
        smooth : value for probability smoothing

    """
    self.voc = {}
    self.tags = {}
    card = defaultdict(float) # store the number of occurences of each upos

    # count the number of transitions and emissions by looping through each (token,upos)
    for corpus in train_data:
        last = START
        card[START] += 1
        for token, upos in zip(corpus["tokens"], corpus["upos"]):
            self.transitions[last][upos] += 1
            self.emissions[upos][token] += 1
            last = upos
            card[upos] += 1
        self.transitions[last][END] += 1
        card[END] += 1

    # divide by the number of occurences and take log

    for key,item in self.transitions.items():
        for upos, val in item.items():
            item[upos] = np.log(val/card[key])

    for upos,tokens in self.emissions.items():
        for token, val in tokens.items():
            tokens[token] = np.log(val/card[upos])


    # TODO: implement this function
    # update self.transitions, self.emissions, self.tags and self.voc instance variables
    # log probabilities of transitions P( tag_i | tag_i-1.)
        # for example, self.transitions["NOUN"]["DET"] should contain log P(DET | NOUN)
        #              self.transitions[START]["DET"] should contain log P(DET | début de phrase)
        #              self.transitions["V"][END] should contain log  P(fin de phrase | "V")
    # log probabilities of emissions: P( word | tag )
        # For example, self.emissions["NOUN"]["chat"] should contain log P(chat | NOUN)
    # - don't forget to smooth probabilities
    # - don't forget to switch to log space at the end of the function

    pass

HiddenMarkovModel.train = hmm_train

test_estimation()

Transitions
P(D|<S>) = 1.0		(log p = 0.0)
P(V|CL) = 1.0		(log p = 0.0)
P(N|D) = 1.0		(log p = 0.0)
P(</S>|N) = 0.33333		(log p = -1.09861)
P(CL|N) = 0.33333		(log p = -1.09861)
P(V|N) = 0.33333		(log p = -1.09861)
P(</S>|V) = 0.5		(log p = -0.69315)
P(D|V) = 0.5		(log p = -0.69315)
Emissions
P(le|CL) = 1.0		(log p = 0.0)
P(la|D) = 0.33333		(log p = -1.09861)
P(le|D) = 0.66667		(log p = -0.40547)
P(chat|N) = 0.33333		(log p = -1.09861)
P(chien|N) = 0.33333		(log p = -1.09861)
P(porte|N) = 0.33333		(log p = -1.09861)
P(ferme|V) = 0.5		(log p = -0.69315)
P(porte|V) = 0.5		(log p = -0.69315)

Expected results
Transitions
P(</S>|<S>) = 0.0		(log p = -18.42068)
P(CL|<S>) = 0.0		(log p = -18.42068)
P(D|<S>) = 1.0		(log p = 0.0)
P(N|<S>) = 0.0		(log p = -18.42068)
P(V|<S>) = 0.0		(log p = -18.42068)
P(</S>|CL) = 0.0		(log p = -18.42068)
P(CL|CL) = 0.0		(log p = -18.42068)
P(D|CL) = 0.0		(log p = -18.42068)
P(N|CL) = 0.0		(log p = -18.42068)
P(V|CL) = 1.0		(log p = 0.0)
P(</S>|D) = 0.0		(log p 

## BaselineTagger

Implement a baseline tagger that uses the most-frequent class strategy:
- For a known word: assign the most frequent tag for this word. For example, if ferme is tagged 5 time `VERB` and twice `NOUN` in the training corpus, tag all occurrences of ferme (in the test corpus) as `VERB`.
- For an unknown word: assign the most frequent tag overall.

In [42]:
class BaselineTagger() :
    """
    Baseline tagger
    """

    #### TODO:
    # implement the methods: __init__, train and predict of this class

    def __init__(self) :

        self.frequency = defaultdict(float)
        self.classification = defaultdict(float)


        self.tags = set()           # list of unique possible tags (without END and START)
        self.voc = set()
        self.max_tag_glob = ''
        self.max_tag_freq = 0.0


    def train(self, sentences_lst) :

        for sentence in sentences_lst:
            mapper = ['NOUN', 'PUNCT', 'ADP', 'NUM', 'SYM', 'SCONJ', 'ADJ', 'PART', 'DET', 'CCONJ', 'PROPN', 'PRON', 'X', '_', 'ADV', 'INTJ', 'VERB', 'AUX']
            tags = [mapper[tag_id] for tag_id in sentence["upos"]]
            for i in range(len(sentence['tokens'])):
                self.tags.add(tags[i])
                self.voc.add(sentence['tokens'][i])



        for sentence in sentences_lst:
            mapper = ['NOUN', 'PUNCT', 'ADP', 'NUM', 'SYM', 'SCONJ', 'ADJ', 'PART', 'DET', 'CCONJ', 'PROPN', 'PRON', 'X', '_', 'ADV', 'INTJ', 'VERB', 'AUX']
            tags = [mapper[tag_id] for tag_id in sentence["upos"]]
            for i in range(len(sentence['tokens'])):


              # tags = map_id_to_UPOS(sentence['tokens'])
              assign = sentence['tokens'][i] +'|' + tags[i]
              self.frequency[assign] = 1 + self.frequency[assign]
        for word in self.voc:
            max_freq = 0.0
            max_tag = ''
            for tag in self.tags:
                if self.frequency[word + '|' + tag] > max_freq:
                  max_freq = self.frequency[word + '|' + tag]
                  max_tag = tag
                if self.frequency[word + '|' + tag] > self.max_tag_freq:
                  self.max_tag_freq = self.frequency[word + '|' + tag]
                  self.max_tag_glob = tag

            self.classification[word] = max_tag


    def predict(self, sentence) :
        pred = []
        for token in sentence:
            if self.classification[token] != 0:
                pred.append(self.classification[token])
            else :
                pred.append(self.max_tag_glob)
        return pred

    def predict_corpus(self, list_of_sentences):
        """
        Perform  predictions for a list_of_sentences
        """
        return [self.predict(sentence) for sentence in list_of_sentences]



**Testing the BaselineTagger**
1. load the dataset `"universal_dependencies", "fr_gsd"`
2. implement the accuracy function
3. run the `test_baseline` code

In [32]:
!pip install conllu



In [43]:
## 1. load the dataset
def map_id_to_UPOS(example):
    mapper = ['NOUN', 'PUNCT', 'ADP', 'NUM', 'SYM', 'SCONJ', 'ADJ', 'PART', 'DET', 'CCONJ', 'PROPN', 'PRON', 'X', '_', 'ADV', 'INTJ', 'VERB', 'AUX']
    example["upos"] = [mapper[tag_id] for tag_id in example["upos"]]
    return example

corpus = datasets.load_dataset("universal_dependencies", "fr_gsd")
corpus = corpus.map(map_id_to_UPOS)

train_data = corpus["train"]
dev_data = corpus["validation"]
test_data = corpus["test"]
print(test_data,'\n', test_data['text'][0],'\n', test_data['upos'][0])

Dataset({
    features: ['idx', 'text', 'tokens', 'lemmas', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc'],
    num_rows: 416
}) 
 Je sens qu'entre ça et les films de médecins et scientifiques fous que nous avons déjà vus, nous pourrions emprunter un autre chemin pour l'origine. 
 [11, 16, 5, 2, 11, 9, 8, 0, 2, 0, 9, 0, 6, 11, 11, 17, 14, 16, 1, 11, 16, 16, 8, 6, 0, 2, 8, 0, 1]


**Implement the `compute_accuracy` function:**

In [44]:
def compute_accuracy(test_data, predictions):
    """
    Computes token-level accuracy for

    parameters:
        test_data: Dataset object
        predictions: list of list of tags



    """

    n = 0
    n_good = 0
    for t, p in zip(test_data, predictions):

        test = map_id_to_UPOS(t)
        for x, y in zip(test['upos'], p):
          n+=1
          if x == y:
            n_good += 1
    return n_good / n

In [45]:
def test_baseline(train_data, test_data) :
    baseline = BaselineTagger()
    baseline.train(train_data)

    test_sentences = [sentence["tokens"] for sentence in test_data]

    accuracy = compute_accuracy(test_data, baseline.predict_corpus(test_sentences))
    print(f"Baseline accuracy: {accuracy}, expected result ~= 90.8")

In [46]:
## 3. run the test
test_baseline(train_data, test_data)    # test the baseline

Baseline accuracy: 0.8934744610604001, expected result ~= 90.8


## HiddenMarkovModel virterbi implementation

1. Implement the `viterbi` function for the `HiddenMarkovModel` class. The function should return the best sequence of tags (see inline documentation for details).


2. Test your code by using the `test_viterbi()` function (evaluation on short examples).

3. If everything goes well, you can evaluate the tagger on the whole corpus. by uncommenting the line test_tagger(train data, test data).

In [47]:
### 1. Implement the viterbi funtion
def viterbi(self, transitions, emissions) :
  """
  Uses the Viterbi algorithm to compute the best sequence
  of tags for a given sentence.

  parameters:
      transitions : dict of dicts such that
          transitions[previous_tag][tag] contains log(P(tag | previous_tag))

      emissions : dict list such that
          emissions[i][tag] contains log(P(w_i | tag))    (where w_i is the i^th token in the sentence

  """


  n_classes = len(self.tags)   # number of labels
  n_words = len(emissions)     # length of sentence

  # stores the weights of paths (initialized to -inf)
  scores = np.zeros((n_classes, n_words), dtype = float) - np.inf

  # backtrack will store pointers to recover the best paths
  backtrack = np.zeros((n_classes, n_words), dtype = int) - 1


  tags = list(emissions[0].keys())

  for i in range(n_classes):
      scores[i][0] = transitions[START][tags[i]] + emissions[0][tags[i]]

  for j in range(1,n_words):
      for i in range(n_classes):
          maxi, index =  -np.inf, 0
          for k, tag in enumerate(tags):
              v = transitions[tag][tags[i]] + emissions[j][tags[i]] + scores[k][j-1]
              if v > maxi:
                  maxi, index = v, k
          scores[i][j] = maxi
          backtrack[i][j] = index

  maxi, index =  -np.inf, 0
  for k, tag in enumerate(tags):
      v = transitions[tag][END] + scores[k][n_words-1]
      if v > maxi:
          maxi, index = v, k

  res = [tags[index]]
  for i in range(n_words-1, 0, -1):
      res.append(tags[backtrack[index][i]])
      index = backtrack[index][i]

  res.reverse()
  return res

HiddenMarkovModel.viterbi = viterbi

In [48]:
def pretrainedHMM() :
    tagger = HiddenMarkovModel()
    tagger.transitions = {'V': {END: -18.420680743952367, 'V': -18.420680743952367, 'D': -0.69314720555994502, 'N': -18.420680743952367, 'P': -0.69314720555994502}, 'N': {'V': -0.69314719305994521, END: -0.69314719305994521, 'D': -18.420680743952367, 'N': -18.420680743952367, 'P': -18.420680743952367}, START: {END: -18.420680743952367, 'V': -18.420680743952367, 'D': -2.4999999716474007e-08, 'N': -18.420680743952367, 'P': -18.420680743952367}, 'P': {END: -18.420680743952367, 'V': -18.420680743952367, 'D': -4.9999998614658012e-08, 'N': -18.420680743952367, 'P': -18.420680743952367}, 'D': {END: -18.420680743952367, 'V': -18.420680743952367, 'D': -18.420680743952367, 'P': -18.420680743952367, 'N': -1.2499999891134309e-08}}
    tagger.emissions =  {'V': {'mange': -0.69314723555994395, 'pomme': -18.420680743952367, 'dort': -0.69314723555994395, 'la': -18.420680743952367, 'le': -18.420680743952367, 'chien': -18.420680743952367, UNKNOWN: -18.420680743952367, 'chat': -18.420680743952367, 'dans': -18.420680743952367, 'cuisine': -18.420680743952367, 'une': -18.420680743952367}, 'D': {'mange': -18.420680743952367, 'pomme': -18.420680743952367, 'dort': -18.420680743952367, 'la': -1.3862943886198902, 'le': -0.69314720805994501, 'chien': -18.420680743952367, UNKNOWN: -18.420680743952367, 'chat': -18.420680743952367, 'dans': -18.420680743952367, 'cuisine': -18.420680743952367, 'une': -1.3862943886198902}, 'P': {'mange': -18.420680743952367, 'pomme': -18.420680743952367, 'dort': -18.420680743952367, 'la': -18.420680743952367, 'le': -18.420680743952367, 'chien': -18.420680743952367, UNKNOWN: -18.420680743952367, 'chat': -18.420680743952367, 'dans': -1.0999999394618015e-07, 'cuisine': -18.420680743952367, 'une': -18.420680743952367}, 'N': {'mange': -18.420680743952367, 'pomme': -1.3862943886198902, 'dort': -18.420680743952367, 'la': -18.420680743952367, 'cuisine': -1.3862943886198902, 'chien': -1.3862943886198902, UNKNOWN: -18.420680743952367, 'chat': -1.3862943886198902, 'dans': -18.420680743952367, 'le': -18.420680743952367, 'une': -18.420680743952367}}
    tagger.tags= ['D', 'N', 'P', 'V']
    return tagger


def test_viterbi() :
    print("Test viterbi function: ")
    tagger = pretrainedHMM()
    sentences = [sentence.split() for sentence in ["le chat mange une pomme", "le chien dort dans la cuisine"]]
    tags      = [t.split() for t in ["D N V D N", "D N V P D N"]]
    for i,sentence in enumerate(sentences):
        print(sentence, " : ")
        print(f"Prediction      : {tagger.predict(sentence)}")
        print(f"Expected result : {tags[i]}")



In [49]:
test_viterbi()  # test the viterbi algorithm

Test viterbi function: 
['le', 'chat', 'mange', 'une', 'pomme']  : 
Prediction      : ['D', 'N', 'V', 'D', 'N']
Expected result : ['D', 'N', 'V', 'D', 'N']
['le', 'chien', 'dort', 'dans', 'la', 'cuisine']  : 
Prediction      : ['D', 'N', 'V', 'P', 'D', 'N']
Expected result : ['D', 'N', 'V', 'P', 'D', 'N']


In [50]:
def test_tagger(train_data, test_data, smooth=1e-7) :
    print("Test tagger")
    print(f"Train corpus size (number of sentences): {len(train_data)}")
    print(f"Test corpus size (number of sentences): {len(test_data)}")

    print("Training HMM...")
    tagger = HiddenMarkovModel()
    tagger.train(train_data, smooth)

    print("HMM evaluation ...")
    test_sentences = [sentence["tokens"] for sentence in test_data]
    predictions = tagger.predict_corpus(test_sentences)
    accuracy = compute_accuracy(test_data, predictions)
    print(f"Accuracy = {accuracy} , Expected result: > 93%")

In [51]:
smooth=1e-7
test_tagger(train_data, test_data,smooth)      # evaluate the tagger

Test tagger
Train corpus size (number of sentences): 14449
Test corpus size (number of sentences): 416
Training HMM...
HMM evaluation ...


IndexError: list index out of range

## Additional questions (go as fas as possible!):



1. Test different values for the smoothing of probabilities and check how this impacts results on the validation set.

In [None]:
# TODO

2. Write a function that computes a confusion matrix that takes as input (i) the evaluation corpus (ii) the predicted sequences for the evaluation corpus. Test your function, what are the main types of mistakes made by the model?

In [None]:
#TODO

We now want to evaluate the HMM model on a different task: named entity recognition (NER), on the CONLL-2003 dataset, a standard English benchmark for this task.

To do so, load the copus using the following lines:

```
corpus = datasets.load_dataset("conllpp")
corpus = corpus.map(map_id_to_NER_TAGS)
```

And, write an evaluation function that computes the F1 score (standard
evaluation metric for NER).




In [None]:
def map_id_to_NER_TAGS(example):
    mapper = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
    example["upos"] = [mapper[tag_id] for tag_id in example["ner_tags"]]
    return example

corpus = datasets.load_dataset("conllpp")
corpus = corpus.map(map_id_to_NER_TAGS)

train_data = corpus["train"]
dev_data = corpus["validation"]
test_data = corpus["test"]

## TODO

In [None]:
##TODO compute_f1