# Sequence Labelling

In this session we will build an HMM model for PoS-tagging and then CRF and neural models for Named Entity Recognition.

## Building a simple Hidden Markov Model

In this first part of the lab we will build a very simple bigram HMM using probability estimates over the Brown corpus, which is Part-of-Speech tagged.

Recall from course 6: with maximum likelihood estimation (MLE), the probability of a bigram is estimated by dividing the number of occurrences of the bigram by the number of occurrences of its first element.
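As a small illustration of this estimate, here is a minimal sketch on a hand-made list of tag sequences (the data and the helper name `mle_bigram` are hypothetical, not part of the lab). It counts the first element only where it has a successor, which is one possible convention and keeps the estimates for a given predecessor summing to 1.

```python
from collections import Counter

# Toy tag sequences (hypothetical data, not taken from the Brown corpus)
tagged = [["AT", "NN", "VBD"], ["AT", "JJ", "NN"], ["AT", "NN", "NN"]]

unigram_counts = Counter()
bigram_counts = Counter()
for tags in tagged:
    # count a tag only when it has a successor in the sentence
    unigram_counts.update(tags[:-1])
    bigram_counts.update(zip(tags, tags[1:]))


def mle_bigram(prev, cur):
    """Estimate P(cur | prev) = count(prev, cur) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / unigram_counts[prev]


print(mle_bigram("AT", "NN"))
```

Here `("AT", "NN")` occurs twice and `"AT"` occurs three times as a predecessor, so the estimate is 2/3.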

First of all, we import the corpus where we will estimate the probabilities:


In [4]:
import nltk
from nltk.corpus import brown

This corpus is a sequence of sentences, where each sentence is a sequence of (word, POS-tag) pairs, like this:

In [5]:
nltk.download("brown")
brown.tagged_sents()

[nltk_data] Downloading package brown to
[nltk_data]     /Users/maximemoutet/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlant

We recall (from the course) that a Hidden Markov Model is composed of:

- A set of $N$ states $Q = \{q_1, q_2, \ldots, q_N\}$
- A transition probability matrix $A=a_{11}\ldots a_{ij} \ldots a_{NN}$, where each $a_{ij}$ represents the probability of transitioning from state $q_i$ to $q_j$; note that $\sum_{j=1}^N{a_{ij}} = 1 \forall i$
- A sequence of $T$ observations $O = o_1, o_2, \ldots, o_T$, each one drawn from a vocabulary $V = \{v_1, v_2, \ldots, v_M\}$ of size $M$;
- A sequence of *observation likelihoods* $B=b_i(o_t)$, also called **emission probabilities**, each expressing the probability of an observation $o_t$ being generated from a state $q_i$;
- Finally, an initial probability distribution $\Pi = \pi_1, \pi_2, \ldots, \pi_N$ where $\pi_i$ indicates the probability that the Markov chain will start from state $q_i$. Some states $q_j$ may have $\pi_j = 0$, meaning that they cannot be initial states. Also, $\sum_{i=1}^N{\pi_i}=1$.

In our case, the set of states $Q$ is the vocabulary of labels (the POS-tags). The vocabulary $V$ is the word vocabulary (i.e. the set of distinct words that appear in our corpus). The observation sequences are the sentences in the corpus.
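To see how the components $\Pi$, $A$ and $B$ fit together, the sketch below computes the joint probability of a word sequence and a tag sequence for a toy HMM. All numbers, the two-tag state set and the function name `joint_probability` are made up for illustration; only the factorization itself follows from the definition above.

```python
# Toy HMM with two tags and three words (all probabilities are made up)
pi = {"AT": 0.8, "NN": 0.2}
A = {
    "AT": {"AT": 0.1, "NN": 0.9},
    "NN": {"AT": 0.4, "NN": 0.6},
}
B = {
    "AT": {"the": 0.9, "dog": 0.05, "barks": 0.05},
    "NN": {"the": 0.1, "dog": 0.6, "barks": 0.3},
}


def joint_probability(words, tags):
    """P(words, tags) = pi[t_1] * B[t_1][w_1] * prod_i A[t_{i-1}][t_i] * B[t_i][w_i]."""
    prob = pi[tags[0]] * B[tags[0]][words[0]]
    for prev, cur, word in zip(tags, tags[1:], words[1:]):
        prob *= A[prev][cur] * B[cur][word]
    return prob


print(joint_probability(["the", "dog"], ["AT", "NN"]))
```

For this toy model the product is 0.8 × 0.9 × 0.9 × 0.6 = 0.3888, up to floating-point rounding. Decoding then amounts to finding the tag sequence that maximizes this quantity for a given sentence.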

We will now split our corpus into a training and a test set:



In [6]:
corpus = brown.tagged_sents()

training = corpus[:-10]
testing = corpus[-10:]

**Exercise 1**: Extract $Q$ and $V$ from the Brown corpus and determine their respective size

In [7]:
Q, V = set(), set()

for sentence in corpus:
    for word, tag in sentence:
        Q.add(tag)
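        V.add(word)

# report the sizes asked for in the exercise
print("Number of tags |Q|:", len(Q))
print("Size of vocabulary |V|:", len(V))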

**Exercise 2**: Create the matrices ($A$, $B$ and $\Pi$) by using the probabilities estimated on the training set; since we are considering bigrams, the probabilities of the transition matrix will be calculated as $\frac{count(t_{-1}, t)}{count(t_{-1})}$.

*Important*: you will need to add smoothing (for instance Lidstone with $k=0.1$) to $B$; otherwise any (tag, word) pair unseen in training gets probability $0$, and so does every tag sequence that uses it.
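As a quick illustration of Lidstone smoothing, the sketch below applies $\frac{count + k}{total + k \cdot |V|}$ to the word counts of a single tag. The counts, the three-word vocabulary and the helper name `lidstone` are hypothetical.

```python
from collections import Counter


def lidstone(counts, vocabulary, k=0.1):
    """Return smoothed probabilities (count + k) / (total + k * |V|) for every word."""
    total = sum(counts.values())
    denominator = total + k * len(vocabulary)
    return {word: (counts[word] + k) / denominator for word in vocabulary}


vocabulary = ["the", "dog", "barks"]
# hypothetical counts of words emitted by one tag; "the" was never seen with it
counts_for_tag = Counter({"dog": 3, "barks": 1})

probs = lidstone(counts_for_tag, vocabulary)
for word, p in probs.items():
    print(word, round(p, 4))
```

The unseen word receives a small non-zero probability, and the distribution over the vocabulary still sums to 1.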

In [8]:
from nltk.probability import ConditionalFreqDist, LidstoneProbDist
from nltk.util import ngrams
import numpy as np


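def transition_probs(training_data, tags):
    # map each tag to its index once (list.index is linear in the number of tags)
    tag_to_index = {tag: i for i, tag in enumerate(tags)}
    transition_count = np.zeros((len(tags), len(tags)), dtype=np.float64)

    for sentence in training_data:
        tags_in_sentence = [tag for _, tag in sentence]
        for prev_tag, tag in ngrams(tags_in_sentence, 2):
            transition_count[tag_to_index[prev_tag], tag_to_index[tag]] += 1

    # normalize each row; a tag never seen as a predecessor keeps a row of zeros
    # instead of producing NaN through a division by zero
    row_totals = transition_count.sum(axis=1, keepdims=True)
    transition_prob = np.divide(
        transition_count,
        row_totals,
        out=np.zeros_like(transition_count),
        where=row_totals > 0,
    )

    return transition_prob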


def emission_probs(training_data, tags, words, k):
    tag_to_index = {tag: i for i, tag in enumerate(tags)}
    word_to_index = {word: j for j, word in enumerate(words)}

    # count how often each word is emitted by each tag
    emission_count = np.zeros((len(tags), len(words)), dtype=np.float64)
    for sentence in training_data:
        for word, tag in sentence:
            emission_count[tag_to_index[tag], word_to_index[word]] += 1

    # Lidstone smoothing: (count + k) / (total for the tag + k * |V|)
    totals = emission_count.sum(axis=1, keepdims=True)
    emission_matrix = (emission_count + k) / (totals + k * len(words))

    return emission_matrix


def initial_state_probs(training_data, tags):
    initial_state_count = np.zeros(len(tags), dtype=np.float64)
    for sentence in training_data:
        initial_state_count[tags.index(sentence[0][1])] += 1

    initial_state_prob = initial_state_count / initial_state_count.sum()

    return initial_state_prob


Q = sorted(Q)
V = sorted(V)


A = transition_probs(training, Q)

k = 0.1
B = emission_probs(training, Q, V, k)

pi = initial_state_probs(training, Q)


print("Shape of transition matrix (A):", A.shape)
print("Shape of emission matrix (B):", B.shape)
print("Shape of initial state probabilities (Pi):", pi.shape)

Shape of transition matrix (A): (472, 472)
Shape of emission matrix (B): (472, 56057)
Shape of initial state probabilities (Pi): (472,)


We now have a model and we can estimate its performance on the test set.

To do this, we need the Viterbi algorithm for the decoding. To help you, an implementation of Viterbi is provided:
(note: to use this version you need to assign a numeric id to each word and tag, if you haven't already)

In [9]:
"""
params is a triple (pi, A, B) where
pi = initial probability distribution over states
A = transition probability matrix
B = emission probability matrix

observations is the sequence of observations (in our case, the observed words)

the function returns the optimal sequence of states and its score
"""


def viterbi(params, observations):
    pi, A, B = params
    M = len(observations)
    S = pi.shape[0]

    alpha = np.zeros((M, S))
    alpha[:, :] = float("-inf")  # cases that have not been treated
    backpointers = np.zeros((M, S), "int")

    # base case
    alpha[0, :] = pi * B[:, observations[0]]

    # recursive case
    for t in range(1, M):
        for s2 in range(S):
            for s1 in range(S):
                score = alpha[t - 1, s1] * A[s1, s2] * B[s2, observations[t]]
                if score > alpha[t, s2]:
                    alpha[t, s2] = score
                    backpointers[t, s2] = s1
    # now follow backpointers to resolve the state sequence
    ss = []
    ss.append(np.argmax(alpha[M - 1, :]))
    for i in range(M - 1, 0, -1):
        ss.append(backpointers[i, ss[-1]])

    return list(reversed(ss)), np.max(alpha[M - 1, :])


# Example:

# original sentence: you can't very well sidle up to people on the street and ask if they want to buy a hot Bodhisattva .
# sentence as sequence of word indexes:
# word_index=[42350, 44913, 3024, 50638, 15858, 16209, 36949, 31092, 28334, 45518, 22719, 26179, 32651, 52996, 25205, 16840, 36949, 1402, 46003, 10606, 19795, 3739]

# predicted, score = viterbi((pi, A, B), word_index)

# predicted will be a sequence of tag indexes:
# [12, 55, 86, 39, 29, 4, 70, 7, 14, 7, 0, 6, 21, 28, 12, 55, 28, 27, 28, 0, 9, 14, 15]
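The implementation above multiplies probabilities, so scores shrink quickly as sentences get longer. A common variant works with sums of log-probabilities instead. The following is a minimal sketch of that variant, not the version provided for the lab: the name `viterbi_log`, the helper `safe_log` and the toy parameters are assumptions, and the input sequence is assumed to be non-empty. It accepts nested lists as well as NumPy arrays.

```python
import math


def safe_log(p):
    """Logarithm that maps a zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_log(params, observations):
    """Viterbi decoding in log space; returns (best state sequence, its log-score)."""
    pi, A, B = params
    n_states = len(pi)
    T = len(observations)

    # delta[t][s]: best log-score of any path that ends in state s at time t
    delta = [[float("-inf")] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]

    # base case
    for s in range(n_states):
        delta[0][s] = safe_log(pi[s]) + safe_log(B[s][observations[0]])

    # recursive case
    for t in range(1, T):
        for s2 in range(n_states):
            emit = safe_log(B[s2][observations[t]])
            for s1 in range(n_states):
                score = delta[t - 1][s1] + safe_log(A[s1][s2]) + emit
                if score > delta[t][s2]:
                    delta[t][s2] = score
                    back[t][s2] = s1

    # follow the backpointers from the best final state
    last = max(range(n_states), key=lambda s: delta[T - 1][s])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, delta[T - 1][last]


# Toy model (made-up numbers): 2 states, 3 observation symbols
toy_pi = [0.8, 0.2]
toy_A = [[0.1, 0.9], [0.4, 0.6]]
toy_B = [[0.9, 0.05, 0.05], [0.1, 0.6, 0.3]]

path, log_score = viterbi_log((toy_pi, toy_A, toy_B), [0, 1, 2])
print(path, math.exp(log_score))
```

On the toy model the best path is `[0, 1, 1]`, with probability 0.8 × 0.9 × 0.9 × 0.6 × 0.6 × 0.3 ≈ 0.07.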

In [11]:
# Example of results calculation
word_to_index = {word: i for i, word in enumerate(V)}

testing_formated = []
for sentence in testing:
    sent = ""
    words_index = []  # vector of word indices to be passed to Viterbi
    true_label = []  # vector of the true labels from labeled corpus
    for word, tag in sentence:
        words_index.append(
            word_to_index[word]
        )  # word_to_index is a dictionary mapping a word to its index
        true_label.append(tag)
        sent = sent + " " + word
    testing_formated.append((words_index, true_label, sent))


for word_index, labels, sentence in testing_formated:
    print("  The sentence is:", sentence)
    print("   ##TRUE##    ##PRED##")
    predicted, score = viterbi((pi, A, B), word_index)  # call the viterbi decoder
    for i, true_label in enumerate(labels):
        predicted_label = Q[
            predicted[i]
        ]  # Q here is the vector of tags, so that at Q[i] we have the i_th tag in literal form
        print("      " + true_label + "         " + predicted_label)

  The sentence is:  you can't very well sidle up to people on the street and ask if they want to buy a hot Bodhisattva .
   ##TRUE##    ##PRED##
      PPSS         PPSS
      MD*         MD*
      QL         QL
      RB         RB
      VB         VBD
      IN         RP
      IN         IN
      NNS         NNS
      IN         IN
      AT         AT
      NN         NN
      CC         CC
      VB         VB
      CS         CS
      PPSS         PPSS
      VB         VB
      TO         TO
      VB         VB
      AT         AT
      JJ         JJ
      NP         .
      .         .
  The sentence is:  Additionally , since you're going to be hors de combat pretty soon with sprue , yaws , Delhi boil , the Granville wilt , liver fluke , bilharziasis , and a host of other complications of the hex you've aroused , you mustn't expect to be lionized socially .
   ##TRUE##    ##PRED##
      RB         RB
      ,         ,
      CS         CS
      PPSS+BER         PPSS+BER
      VBG     

**Exercise 3**: calculate Precision, Recall and F-measure for the bigram model

In [12]:
# Greedy left-to-right tagger based on the bigram model: each tag is chosen
# from the previously predicted tag only (this is not a Viterbi search)
def pos_tagging(
    sentence,
    transition_matrix,
    emission_matrix,
    initial_state_probabilities,
    tags,
    words,
):
    predicted_tags = []
    prev_tag = None
    for word in sentence:
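        # NOTE: V is case-sensitive, so a lowercased word that only occurs
        # capitalized in the corpus is not found and falls back to index 0
        if word.lower() in words:
            word_index = words.index(word.lower())
        else:
            word_index = 0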
        if prev_tag:
            prev_tag_index = tags.index(prev_tag)
            transition_probs = transition_matrix[prev_tag_index]
            emission_prob = emission_matrix[:, word_index]
            joint_probs = transition_probs * emission_prob

            predicted_tag_index = np.argmax(joint_probs)
            predicted_tag = tags[predicted_tag_index]
            predicted_tags.append(predicted_tag)
        else:
            # first word: pick the most likely initial tag (the word's emission is not used)
            predicted_tag_index = np.argmax(initial_state_probabilities)
            predicted_tag = tags[predicted_tag_index]
            predicted_tags.append(predicted_tag)
        prev_tag = predicted_tag
    return predicted_tags


def evaluate_model(
    test_data,
    transition_matrix,
    emission_matrix,
    initial_state_probabilities,
    tags,
    words,
):
    true_positive, false_positive, false_negative = 0, 0, 0

    for sentence in test_data:
        words_in_sentence = [word.lower() for word, _ in sentence]
        gold_tags = [tag for _, tag in sentence]
        predicted_tags = pos_tagging(
            words_in_sentence,
            transition_matrix,
            emission_matrix,
            initial_state_probabilities,
            tags,
            words,
        )
        for gold_tag, predicted_tag in zip(gold_tags, predicted_tags):
            if gold_tag == predicted_tag:
                true_positive += 1
            else:
                if gold_tag in tags:
                    false_negative += 1
                if predicted_tag in tags:
                    false_positive += 1

    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    f_measure = 2 * (precision * recall) / (precision + recall)

    return precision, recall, f_measure


precision, recall, f_measure = evaluate_model(testing, A, B, pi, Q, V)

print("Precision:", precision)
print("Recall:", recall)
print("F-measure:", f_measure)

Precision: 0.47280334728033474
Recall: 0.47280334728033474
F-measure: 0.47280334728033474
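The three scores above are identical because every token has exactly one gold tag and one predicted tag: each mismatch is counted once as a false positive and once as a false negative, so precision, recall and F-measure all reduce to token accuracy. A per-tag breakdown is often more informative. The sketch below is one possible way to compute it; the helper names `per_tag_scores` and `macro_average` and the toy tag lists are hypothetical.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f1)} from two aligned tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was expected but missed

    scores = {}
    for tag in set(gold) | set(predicted):
        n_pred = tp[tag] + fp[tag]
        n_gold = tp[tag] + fn[tag]
        precision = tp[tag] / n_pred if n_pred else 0.0
        recall = tp[tag] / n_gold if n_gold else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall and F1 over all tags."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# toy example (made-up tags)
gold = ["AT", "NN", "VB", "NN"]
predicted = ["AT", "NN", "NN", "VB"]

scores = per_tag_scores(gold, predicted)
print(scores["NN"])
print(macro_average(scores))
```

With real data, `gold` and `predicted` would be the flattened tag lists collected over the test sentences.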


**Exercise 4**: modify your HMM to use trigrams instead of bigrams, and re-evaluate the results

In [13]:
def transition_probs_trigram(training_data, tags):
    trigram_transition_count = np.zeros(
        (len(tags), len(tags), len(tags)), dtype=np.float64
    )

    for sentence in training_data:
        tags_in_sentence = [tag for _, tag in sentence]
        trigrams = list(ngrams(tags_in_sentence, 3))
        for prev_prev_tag, prev_tag, tag in trigrams:
            prev_prev_tag_index = tags.index(prev_prev_tag)
            prev_tag_index = tags.index(prev_tag)
            tag_index = tags.index(tag)
            trigram_transition_count[prev_prev_tag_index][prev_tag_index][
                tag_index
            ] += 1

    trigram_transition_prob = np.zeros(
        (len(tags), len(tags), len(tags)), dtype=np.float64
    )

    for i in range(len(tags)):
        for j in range(len(tags)):
            total_count = np.sum(trigram_transition_count[i, j])
            if total_count > 0:
                trigram_transition_prob[i, j] = (
                    trigram_transition_count[i, j] / total_count
                )

    return trigram_transition_prob


def pos_tagging(
    sentence,
    transition_matrix,
    emission_matrix,
    initial_state_probabilities,
    tags,
    words,
):
    predicted_tags = []
    prev_prev_tag = None
    prev_tag = None
    for word in sentence:
        if word.lower() in words:
            word_index = words.index(word.lower())
        else:
            word_index = 0
        if prev_tag and prev_prev_tag:
            prev_prev_tag_index = tags.index(prev_prev_tag)
            prev_tag_index = tags.index(prev_tag)
            transition_probs = transition_matrix[prev_prev_tag_index, prev_tag_index]
            emission_prob = emission_matrix[:, word_index]
            joint_probs = transition_probs * emission_prob

            predicted_tag_index = np.argmax(joint_probs)
            predicted_tag = tags[predicted_tag_index]
            predicted_tags.append(predicted_tag)
        else:
            # first two words: no full trigram history yet, so fall back to the most likely initial tag
            predicted_tag_index = np.argmax(initial_state_probabilities)
            predicted_tag = tags[predicted_tag_index]
            predicted_tags.append(predicted_tag)
        prev_prev_tag = prev_tag
        prev_tag = predicted_tag
    return predicted_tags


precision, recall, f_measure = evaluate_model(
    testing, transition_probs_trigram(training, Q), B, pi, Q, V
)

print("Precision:", precision)
print("Recall:", recall)
print("F-measure:", f_measure)

Precision: 0.5690376569037657
Recall: 0.5690376569037657
F-measure: 0.5690376569037657


## Using NLTK's HMM implementation

We will now compare our model built from scratch to the implementation provided by NLTK:

In [14]:
import nltk
from nltk.tag import hmm

trainer = hmm.HiddenMarkovModelTrainer(states=Q, symbols=V)

model = trainer.train_supervised(
    training, estimator=lambda fd, bins: hmm.LidstoneProbDist(fd, 0.1, bins)
)

for sent in testing:
    u_sent = []
    for word, tag in sent:
        u_sent.append(word)
    tagged = model.tag(u_sent)
    print(sent)
    print(tagged)

[('you', 'PPSS'), ("can't", 'MD*'), ('very', 'QL'), ('well', 'RB'), ('sidle', 'VB'), ('up', 'IN'), ('to', 'IN'), ('people', 'NNS'), ('on', 'IN'), ('the', 'AT'), ('street', 'NN'), ('and', 'CC'), ('ask', 'VB'), ('if', 'CS'), ('they', 'PPSS'), ('want', 'VB'), ('to', 'TO'), ('buy', 'VB'), ('a', 'AT'), ('hot', 'JJ'), ('Bodhisattva', 'NP'), ('.', '.')]
[('you', 'PPSS'), ("can't", 'MD*'), ('very', 'QL'), ('well', 'RB'), ('sidle', 'VBD'), ('up', 'RP'), ('to', 'IN'), ('people', 'NNS'), ('on', 'IN'), ('the', 'AT'), ('street', 'NN'), ('and', 'CC'), ('ask', 'VB'), ('if', 'CS'), ('they', 'PPSS'), ('want', 'VB'), ('to', 'TO'), ('buy', 'VB'), ('a', 'AT'), ('hot', 'JJ'), ('Bodhisattva', '.'), ('.', '.')]
[('Additionally', 'RB'), (',', ','), ('since', 'CS'), ("you're", 'PPSS+BER'), ('going', 'VBG'), ('to', 'TO'), ('be', 'BE'), ('hors', 'FW-RB'), ('de', 'FW-IN'), ('combat', 'FW-NN'), ('pretty', 'QL'), ('soon', 'RB'), ('with', 'IN'), ('sprue', 'NN'), (',', ','), ('yaws', 'NNS'), (',', ','), ('Delhi', 'NP

**Exercise 5**: Calculate precision, recall and F-measure and compare them to the results that you obtained with the two models (bigram and trigram) that you implemented before. Can you deduce whether the NLTK model is using bigrams or trigrams? (It is not stated in the manual)

In [15]:
true_tags = []
predicted_tags = []

for sent in testing:
    words = [word for word, tag in sent]
    true_tags.extend([tag for word, tag in sent])
    predicted_tags.extend([tag for word, tag in model.tag(words)])

# compare (position, tag) pairs so that every token counts;
# sets of bare tags would only measure the overlap between the two tag inventories
true_tags_set = set(enumerate(true_tags))
predicted_tags_set = set(enumerate(predicted_tags))

precision = nltk.precision(true_tags_set, predicted_tags_set)
recall = nltk.recall(true_tags_set, predicted_tags_set)
f_measure = nltk.f_measure(true_tags_set, predicted_tags_set)

print("Precision:", precision)
print("Recall:", recall)
print("F-measure:", f_measure)


## Named Entity Recognition with Conditional Random Fields

For this exercise we will need to use the sklearn_crfsuite package. If it is not installed, it can be installed using pip with ```pip install sklearn-crfsuite```.

We will work on a Kaggle dataset named ```ner_dataset.csv``` (it should be in the same directory as the notebook).

Pandas can be used to read the content of the file:

In [16]:
import pandas as pd

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.ffill()  # repeat sentence number on each row

words = list(set(data["Word"].values))  # vocabulary V
n_words = len(words)

print(words[:10])
print(n_words)


['fresh-cut', 'Thirty-six', 'surroundings', '130.6', 'voyage', 'Coup', 'viewership', 'Holder', 'horrific', 'Group-E']
35177


We provide you with some code that reads the sentences and produces the features in the format required by ```sklearn_crfsuite```. The ```SentenceGetter``` class transforms sentences into sequences of ```(word, POS, tag)``` triples.

In [17]:
class SentenceGetter(object):

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [
            (w, p, t)
            for w, p, t in zip(
                s["Word"].values.tolist(),
                s["POS"].values.tolist(),
                s["Tag"].values.tolist(),
            )
        ]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:  # no sentence with this number
            return None


# load data
getter = SentenceGetter(data)  # transform sentences into sequences of (Word, POS, Tag)
sentences = getter.sentences


The next function allows us to define features that are used in the CRF. The features are stored in a dictionary.

In [18]:
def word2features(sent, i):
    """
    input:
       sent: sentence in the format of sequence of (Word, POS, Tag) triples
       i: position in the sentence
    output:
       features: a dictionary mapping the feature name into a value
    """
    word = sent[i][0]
    postag = sent[i][1]

    features = {  # features related to the current position
        "bias": 1.0,
        "word.lower()": word.lower(),
        "postag": postag,
    }
    if i > 0:  # features related to preceding word/tag
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.update(
            {
                "-1:word.lower()": word1.lower(),
                "-1:postag": postag1,
            }
        )
    else:
        features["BOS"] = True  # feature for Beginning of Sentence

    if i < len(sent) - 1:  # features related to the following word/tag
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.update(
            {
                "+1:word.lower()": word1.lower(),
                "+1:postag": postag1,
            }
        )
    else:
        features["EOS"] = True  # feature for end of sentence

    return features


def sent2features(sent):
    # transforms the sentence in a sequence of features
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    # transforms the sentence in a sequence of labels
    return [label for token, postag, label in sent]


def sent2tokens(sent):
    # transforms the sentence in a sequence of tokens (removes POS tags and labels)
    return [token for token, postag, label in sent]

We can now build the features and label vectors, and create a CRF model:

In [19]:
!pip install sklearn_crfsuite

In [20]:
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

from sklearn_crfsuite import CRF

crf = CRF(algorithm="lbfgs", max_iterations=100)

This creates a model trained with the L-BFGS optimization algorithm ("lbfgs") and a limit of $100$ iterations.

Now we train the model and evaluate it on a 67/33 split between training and testing:

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
crf.fit(X_train, y_train)
pred = crf.predict(X_test)

n_pred = [item for sublist in pred for item in sublist]
n_test = [item for sublist in y_test for item in sublist]

report = classification_report(y_pred=n_pred, y_true=n_test)
print(report)

              precision    recall  f1-score   support

       B-art       0.40      0.01      0.03       137
       B-eve       0.69      0.31      0.43       111
       B-geo       0.83      0.91      0.87     12357
       B-gpe       0.98      0.82      0.89      5226
       B-nat       0.60      0.09      0.15        69
       B-org       0.78      0.67      0.72      6762
       B-per       0.83      0.79      0.81      5649
       B-tim       0.93      0.84      0.88      6650
       I-art       0.38      0.02      0.05       124
       I-eve       0.56      0.21      0.31        89
       I-geo       0.80      0.79      0.80      2433
       I-gpe       0.95      0.33      0.49        55
       I-nat       0.33      0.05      0.08        21
       I-org       0.76      0.78      0.77      5545
       I-per       0.85      0.87      0.86      5730
       I-tim       0.81      0.74      0.78      2110
           O       0.99      0.99      0.99    292571

    accuracy              

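The report shows precision, recall and F1-score for every class, but we are not interested in the **O** class. The scores for person names, place names and organizations are lower than for **O** (for example, an F1-score of 0.72 for **B-org**), and the rare classes (**art**, **eve**, **nat**) score much lower still.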

**Exercise 6**: Can you think of some new features for the CRF model to improve the results, especially on **B-org** ? Modify the *word2features* function to include the additional features and compare with the above results.
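As a possible starting point, the sketch below extends the feature dictionary with word-shape features (capitalization, digits, prefixes and suffixes), which are commonly tried for named entities. The function names `extra_word_features` and `word2features_extended`, the choice of features and the toy sentence are all assumptions; whether they actually improve **B-org** has to be checked by retraining the CRF and comparing the reports.

```python
def extra_word_features(word, prefix=""):
    """Word-shape features for one token; `prefix` marks neighbours such as '-1:'."""
    return {
        prefix + "word.istitle()": word.istitle(),
        prefix + "word.isupper()": word.isupper(),
        prefix + "word.isdigit()": word.isdigit(),
        prefix + "word[-3:]": word[-3:].lower(),
        prefix + "word[:3]": word[:3].lower(),
        prefix + "has_hyphen": "-" in word,
    }


def word2features_extended(sent, i):
    """Same layout as word2features, plus the shape features defined above."""
    word, postag = sent[i][0], sent[i][1]
    features = {"bias": 1.0, "word.lower()": word.lower(), "postag": postag}
    features.update(extra_word_features(word))

    if i > 0:  # preceding word/tag
        prev_word, prev_postag = sent[i - 1][0], sent[i - 1][1]
        features.update({"-1:word.lower()": prev_word.lower(), "-1:postag": prev_postag})
        features.update(extra_word_features(prev_word, prefix="-1:"))
    else:
        features["BOS"] = True

    if i < len(sent) - 1:  # following word/tag
        next_word, next_postag = sent[i + 1][0], sent[i + 1][1]
        features.update({"+1:word.lower()": next_word.lower(), "+1:postag": next_postag})
        features.update(extra_word_features(next_word, prefix="+1:"))
    else:
        features["EOS"] = True

    return features


# toy sentence in the (Word, POS, Tag) format (made-up example)
toy_sent = [("United", "NNP", "B-org"), ("Nations", "NNP", "I-org"), ("officials", "NNS", "O")]
toy_features = word2features_extended(toy_sent, 0)
print(toy_features)
```

To try it, `sent2features` would call `word2features_extended` instead of `word2features` before rebuilding `X` and refitting the model.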

## NER using an LSTM model

In this final section we will see an example of a neural network model written in PyTorch that uses an LSTM-based architecture for Named Entity Recognition.

First of all, we will prepare the data to have all information coded numerically (words and tags) and the sentences padded to a max length, in order to have all sentences of the same size.


In [27]:
# first of all we need to retrieve the number of different tags (a.k.a categories or classes of the words)
tags = list(set(data["Tag"].values))
n_tags = len(tags)

vocab = {w: i + 1 for i, w in enumerate(words)}  # map words into a number
tag_map = {t: i for i, t in enumerate(tags)}  # map tags into a number

from keras.preprocessing.sequence import pad_sequences

max_len = 75

X = [[vocab[w[0]] for w in s] for s in sentences]
X = pad_sequences(
    maxlen=max_len, sequences=X, padding="post", value=0
)  # pad with special token PAD, with ID=0 (word IDs start at 1, so 0 is free)
y = [[tag_map[w[2]] for w in s] for s in sentences]
y = pad_sequences(
    maxlen=max_len, sequences=y, padding="post", value=-1
)  # -1 is associated to the PAD token
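To make the padding step concrete, here is a minimal pure-Python sketch of what post-padding does. The function name `pad_post` is hypothetical and this is not the Keras implementation. Sequences longer than `maxlen` keep their last `maxlen` items, which should correspond to the Keras default `truncating="pre"`; `maxlen` is assumed to be positive.

```python
def pad_post(sequences, maxlen, value):
    """Pad each sequence on the right with `value` up to length `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# toy word-ID sequences (made-up IDs)
toy_sequences = [[1, 2], [3, 4, 5, 6]]
print(pad_post(toy_sequences, maxlen=3, value=0))
```

After padding, all sentences have the same length and can be stacked into one array, and the padding value lets later steps tell real tokens from filler.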

Now we will load the data and split it into training and test sets. We set the batch size to 32.

In [28]:
import torch
from NERDataset import NERDataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

ner_train = NERDataset(X_train, y_train)
ner_test = NERDataset(X_test, y_test)

trainloader = torch.utils.data.DataLoader(ner_train, batch_size=32, shuffle=True)

The LSTM model is defined here. An embedding layer maps each word to a vector of size 100, which is learnt from the dataset. The embedded sentence is fed to an LSTM layer of size 50. The output is passed to a fully connected layer with *n_tags* outputs, one for each possible label. The loss is a cross-entropy loss over all tokens (excluding the "pad" tokens).

In [29]:
import torch.nn as nn
import torch.nn.functional as F

vocab_size = n_words + 1
embedding_dim = 100
lstm_hidden_dim = 50


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        # maps each token to an embedding_dim vector
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # the LSTM takes the embedded sentence as input
        self.lstm = nn.LSTM(embedding_dim, lstm_hidden_dim, batch_first=True)

        # fc layer transforms the output to give the final output layer
        self.fc = nn.Linear(lstm_hidden_dim, n_tags)

    def forward(self, s):
        # apply the embedding layer that maps each token to its embedding
        s = self.embedding(s)  # dim: batch_size x batch_max_len x embedding_dim

        # run the LSTM along the sentences of length batch_max_len. We discard the cell state as output
        s, _ = self.lstm(s)  # dim: batch_size x batch_max_len x lstm_hidden_dim

        # reshape the tensor so that each row contains one token
        s = s.reshape(-1, s.shape[2])  # dim: batch_size*batch_max_len x lstm_hidden_dim

        # apply the fully connected layer and obtain the output for each token
        s = self.fc(s)  # dim: batch_size*batch_max_len x num_tags

        return F.log_softmax(s, dim=1)  # dim: batch_size*batch_max_len x num_tags


def loss_fn(outputs, labels):
    # reshape labels to give a flat vector of length batch_size*seq_len
    labels = labels.view(-1)

    # discard 'PAD' tokens
    mask = (labels >= 0).float()

    # the number of tokens is the sum of elements in mask
    num_tokens = int(torch.sum(mask))

    # pick the values corresponding to labels and multiply by mask
    outputs = outputs[range(outputs.shape[0]), labels] * mask

    # cross entropy loss for all non 'PAD' tokens
    return -torch.sum(outputs) / num_tokens
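The masking logic of `loss_fn` can be checked by hand on a tiny example. The sketch below reproduces the same computation with plain Python lists instead of tensors; the function name `masked_nll` and the toy numbers are assumptions made for illustration.

```python
import math


def masked_nll(log_probs, labels, pad_label=-1):
    """Average negative log-likelihood over the tokens whose label is not PAD."""
    total, n_tokens = 0.0, 0
    for row, label in zip(log_probs, labels):
        if label == pad_label:
            continue  # PAD tokens do not contribute to the loss
        total -= row[label]  # log-probability assigned to the true label
        n_tokens += 1
    return total / n_tokens


# three tokens, two possible tags; the last token is padding
toy_log_probs = [
    [math.log(0.7), math.log(0.3)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.5), math.log(0.5)],
]
toy_labels = [0, 1, -1]

toy_loss = masked_nll(toy_log_probs, toy_labels)
print(toy_loss)
```

Only the first two tokens count, so the result is $-(\log 0.7 + \log 0.8) / 2$.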

In the following block we train the model for 5 epochs:

In [30]:
network = Net()
# Initialize optimizer
optimizer = torch.optim.Adam(network.parameters(), lr=1e-4)

num_epochs = 5
# Run the training loop for defined number of epochs
for epoch in range(0, num_epochs):
    # Print epoch
    print(f"Starting epoch {epoch+1}")

    # Set current loss value
    current_loss = 0.0

    i = 0
    for inputs, targets in trainloader:
        # print(inputs, targets)

        optimizer.zero_grad()  # Zero the gradients
        outputs = network(inputs)  # Perform forward pass
        loss = loss_fn(outputs, targets)  # Compute loss
        loss.backward()  # Backprop
        optimizer.step()  # Optimization

        # Print statistics
        current_loss += loss.item()
        if i % 500 == 499:
            print("Loss after mini-batch %5d: %.3f" % (i + 1, current_loss / 500))
            current_loss = 0.0
        i += 1


Starting epoch 1
Loss after mini-batch   500: 1.667
Loss after mini-batch  1000: 0.838
Starting epoch 2
Loss after mini-batch   500: 0.753
Loss after mini-batch  1000: 0.693
Starting epoch 3
Loss after mini-batch   500: 0.635
Loss after mini-batch  1000: 0.576
Starting epoch 4
Loss after mini-batch   500: 0.534
Loss after mini-batch  1000: 0.495
Starting epoch 5
Loss after mini-batch   500: 0.460
Loss after mini-batch  1000: 0.426
