# Conditional Random Fields (CRF) for Sequence Tagging

## Definition

Let $\mathcal{V}^m$ be the space of all sentences of length $m$ and $\mathcal{S}^m$ the space of all possible state sequences, where $|\mathcal{S}|=k$. We will denote a sentence $\vec{x} \in \mathcal{V}^m$ as $\vec{x} = (x_1, \dots, x_m)$ and a state sequence $\vec{s} \in \mathcal{S}^m$ as $\vec{s} = (s_1, \dots, s_m)$. We assume the existence of some feature vector $\Phi:\mathcal{V}^m \times \mathcal{S}^m \rightarrow \mathbb{R}^d$ and model the conditional probability $p(\vec{s}|\vec{x})$ as

$$
p(\vec{s}|\vec{x}, w) = \frac{\exp(w \cdot \Phi(\vec{x}, \vec{s}))}{\sum_{\vec{s}' \in \mathcal{S}^m} \exp(w \cdot \Phi(\vec{x}, \vec{s}'))},\;\; w \in \mathbb{R}^d
$$


Models that estimate the conditional probability $p(\vec{s}|\vec{x})$ directly are called discriminative models. The CRF can also be seen as a single-layer neural network with softmax as its activation function, or as a variant of logistic regression in which the classes are entire state sequences.

## Feature Vector

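One way to define the feature vector $\Phi$ is to set it to $\Phi(\vec{x}, \vec{s}) = \sum_{j=1}^m \phi(\vec{x}, j, s_{j-1}, s_j)$, where $s_0$ is a special start symbol. In this case, we are assuming that the state at position $j$ interacts directly only with the state at position $j-1$, which is the Markovian independence assumption behind the linear-chain CRF. The model is closely related to the so-called Maximum Entropy Markov Models (MEMMs), which rely on the same assumption but normalize each transition locally, whereas the CRF normalizes over whole state sequences.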
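To make the definition concrete, here is a minimal sketch that evaluates $p(\vec{s}|\vec{x}, w)$ by brute force for a tiny input. The tag set, the start symbol and the hand-written `local_score` function (standing in for $w \cdot \phi(\vec{x}, j, s_{j-1}, s_j)$) are assumptions made only for illustration; enumerating all $k^m$ sequences is feasible only for toy examples.

```python
import itertools
import math

STATES = ["O", "B", "I"]  # toy tag set (assumption)
START = "<s>"             # special start symbol s_0


def local_score(x, j, prev, cur):
    """Hypothetical stand-in for w . phi(x, j, s_{j-1}, s_j)."""
    score = 0.0
    if x[j].isdigit() and cur == "B":
        score += 2.0   # digits tend to start an entity
    if not x[j].isdigit() and cur == "O":
        score += 0.5   # plain words tend to be outside entities
    if prev == "B" and cur == "I":
        score += 1.0   # "I" naturally follows "B"
    if prev in ("O", START) and cur == "I":
        score -= 5.0   # "I" should not open an entity
    return score


def sequence_score(x, s):
    """w . Phi(x, s), computed as the sum of the local scores."""
    total, prev = 0.0, START
    for j, cur in enumerate(s):
        total += local_score(x, j, prev, cur)
        prev = cur
    return total


def crf_probability(x, s):
    """p(s | x, w), normalizing over all k^m candidate sequences."""
    z = sum(math.exp(sequence_score(x, cand))
            for cand in itertools.product(STATES, repeat=len(x)))
    return math.exp(sequence_score(x, s)) / z


x = ["5", "star", "food"]
probs = {s: crf_probability(x, s)
         for s in itertools.product(STATES, repeat=len(x))}
best = max(probs, key=probs.get)
print(best, probs[best])
```

The probabilities of all $3^3 = 27$ candidate sequences sum to one, which is exactly what the denominator in the definition guarantees.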

## Parameter Estimation

Assume we have a finite set of training pairs $\{(\vec{x}_i, \vec{s}_i)\}_{i=1}^n \subset \mathcal{V}^m \times \mathcal{S}^m$. We set $w^*$ to be the vector that maximizes the regularized log-likelihood function $L_{\lambda}(w)$:

$$
w^* = \arg\max\limits_{w \in \mathbb{R}^d} L_{\lambda}(w) = \arg\max\limits_{w \in \mathbb{R}^d} \sum_{i=1}^n \log p(\vec{s}_i|\vec{x}_i, w) - \lambda\frac{||w||^2}{2}
$$

Since $L_{\lambda}$ is differentiable with respect to the parameter $w$, any gradient-based optimization method can be applied here to find $w^*$. The most common choice for the CRF is L-BFGS.
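As a sketch of what such an optimizer needs, the gradient of the regularized conditional log-likelihood $\sum_{i} \log p(\vec{s}_i|\vec{x}_i, w) - \lambda\frac{||w||^2}{2}$ follows directly from the definition of $p$ and has the usual form for log-linear models:

$$
\nabla_w L_{\lambda}(w) = \sum_{i=1}^n \Phi(\vec{x}_i, \vec{s}_i) - \sum_{i=1}^n \sum_{\vec{s} \in \mathcal{S}^m} p(\vec{s}|\vec{x}_i, w)\, \Phi(\vec{x}_i, \vec{s}) - \lambda w
$$

The first term collects the feature values observed in the training data, and the second the feature values expected under the current model. The inner sum ranges over $k^m$ sequences, so in practice it is computed with a forward-backward style dynamic program that exploits the decomposition of $\Phi$ into local features.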

## Decoding

Once the model is trained, the predicted state for a given sentence $\vec{x}$ is given by

$$
\arg\max\limits_{\vec{s}' \in \mathcal{S}^m} p(\vec{s}'|\vec{x}, w)
$$

Since the exponential is a monotone function and the normalizing denominator does not depend on $\vec{s}'$, it is enough to output the state sequence $\vec{s}$ such that

$$
\vec{s} = \arg\max\limits_{\vec{s}' \in \mathcal{S}^m} \sum_{j=1}^m w \cdot \phi(\vec{x}, j, s'_{j-1}, s'_j)
$$

We will now apply a variation of the Viterbi algorithm to compute $\vec{s}$ efficiently. The procedure is the following:

- For every $s \in \mathcal{S}$, initialize $\pi[1, s] = w \cdot \phi(\vec{x}, 1, s_0, s)$, where $s_0$ is a special start symbol;
- For $j=2, \dots, m$ and every $s \in \mathcal{S}$, set
$$
\pi[j, s] = \max\limits_{s' \in \mathcal{S}} \left[ \pi[j-1, s'] + w \cdot \phi(\vec{x}, j, s', s) \right]
$$

Therefore,

$$
\max\limits_{s_1, \dots, s_m} \sum_{j=1}^m w \cdot \phi(\vec{x}, j, s_{j-1}, s_j) = \max\limits_{s \in \mathcal{S}} \pi[m, s]
$$

If we also store, for each entry $\pi[j, s]$, a back-pointer to the previous state $s'$ that attains the maximum, we can recover the highest-scoring state sequence in $\mathcal{O}(mk^2)$ time.
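The recursion and the back-pointers can be sketched in a few lines of Python. This is an illustrative implementation, not the one used by the library later in this notebook: `score` is a hypothetical callable standing in for $w \cdot \phi(\vec{x}, j, s', s)$, `toy_score` is made up for the example, and positions are 0-based rather than 1-based.

```python
def viterbi(x, states, score, start="<s>"):
    """Return (best_score, best_path) maximizing sum_j score(x, j, s_{j-1}, s_j)."""
    m = len(x)
    if m == 0:
        return 0.0, []
    # pi[j][s]: best score of any state sequence for x[0..j] that ends in s
    pi = [{s: score(x, 0, start, s) for s in states}]
    # back[j][s]: previous state that attains pi[j][s]
    back = [{}]
    for j in range(1, m):
        pi.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda sp: pi[j - 1][sp] + score(x, j, sp, s))
            pi[j][s] = pi[j - 1][prev] + score(x, j, prev, s)
            back[j][s] = prev
    # Start from the best final state and follow the back-pointers
    last = max(states, key=lambda s: pi[m - 1][s])
    path = [last]
    for j in range(m - 1, 0, -1):
        path.append(back[j][path[-1]])
    path.reverse()
    return pi[m - 1][last], path


def toy_score(x, j, prev, cur):
    """Made-up local score, used only to exercise the algorithm."""
    total = 0.0
    if x[j].isdigit() and cur == "B":
        total += 2.0
    if not x[j].isdigit() and cur == "O":
        total += 0.5
    if prev == "B" and cur == "I":
        total += 1.0
    if prev in ("O", "<s>") and cur == "I":
        total -= 5.0
    return total


best_score, best_path = viterbi(["5", "star", "food"], ["O", "B", "I"], toy_score)
print(best_score, best_path)
```

The two nested loops over states inside the loop over positions give the $\mathcal{O}(mk^2)$ cost, compared with $k^m$ candidate sequences for exhaustive search.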

## Drawbacks

The feature vector $\Phi$ must be defined a priori, that is, the features are designed by hand rather than learned from the data.

## References

All the ideas presented here are discussed in detail at this link: http://www.cs.columbia.edu/~mcollins/crf.pdf

In [1]:
import nltk
import time
from pprint import pprint
from functools import reduce
from operator import iconcat
from nltk.tag import CRFTagger
from sklearn import metrics

## Dataset

We will use the MIT restaurant corpus as a toy dataset to perform named entity recognition (NER). The data is available here: https://groups.csail.mit.edu/sls/downloads/restaurant/

In [2]:
def process_file(fname:str)->list:
    with open(fname, "r") as f:
        data = f.read()
    data = data.split("\n\n")
    data = list(map(lambda x:x.split("\n"), data))
    data = list(map(lambda x:[tuple(s.split("\t"))[::-1] for s in x], data))
    return data

In [3]:
train = process_file("./datasets/restauranttrain.bio")

In [4]:
train[:3]

[[('2', 'B-Rating'),
  ('start', 'I-Rating'),
  ('restaurants', 'O'),
  ('with', 'O'),
  ('inside', 'B-Amenity'),
  ('dining', 'I-Amenity')],
 [('34', 'O')],
 [('5', 'B-Rating'),
  ('star', 'I-Rating'),
  ('resturants', 'O'),
  ('in', 'B-Location'),
  ('my', 'I-Location'),
  ('town', 'I-Location')]]

The format nltk uses to represent tagged sentences is quite simple: the training set is a list of sentences, and each sentence is a list of tuples whose first element is a word and whose second element is the corresponding tag. Also note that the tag O marks a word that is not part of any entity, B-SOMETHING marks the first word of an entity, and I-SOMETHING marks every following word of the entity, including the last one.
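To illustrate how the B-/I-/O tags encode entities, here is a small sketch that groups a tagged sentence into entity spans. The helper name `bio_to_spans` is hypothetical and not part of nltk, and treating a stray I- tag (one that does not continue an open entity of the same type) as outside any entity is a design choice made for this example.

```python
def bio_to_spans(tagged_sentence):
    """Group (word, tag) pairs in BIO format into (entity_type, words) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in tagged_sentence:
        if tag.startswith("B-"):
            # A new entity begins; close the one that was open, if any
            if current_type is not None:
                spans.append((current_type, current_words))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity
            current_words.append(word)
        else:
            # "O", or an "I-" tag that does not continue the open entity
            if current_type is not None:
                spans.append((current_type, current_words))
            current_type, current_words = None, []
    if current_type is not None:
        spans.append((current_type, current_words))
    return spans


sentence = [("5", "B-Rating"), ("star", "I-Rating"), ("resturants", "O"),
            ("in", "B-Location"), ("my", "I-Location"), ("town", "I-Location")]
print(bio_to_spans(sentence))
# [('Rating', ['5', 'star']), ('Location', ['in', 'my', 'town'])]
```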

In [5]:
train.pop() # Last sentence is empty

[('',)]

In [6]:
f"There are {len(train)} sentences in the training set"

'There are 7660 sentences in the training set'

In [7]:
def to_list(data:list)->list:
    return reduce(iconcat, data, [])

def split_words_n_tags(data:list)->tuple:
    words, tags = map(list, zip(*data))
    return words, tags

In [8]:
all_pairs = to_list(train)
all_words, all_tags = split_words_n_tags(all_pairs)

In [9]:
f"There are {len(set(all_words))} unique words in the training set"

'There are 3804 unique words in the training set'

In [10]:
f"There are {len(set(all_tags))} unique tags in the training set"

'There are 17 unique tags in the training set'

In [11]:
hist = nltk.probability.FreqDist(all_tags)
hist.most_common()

[('O', 43670),
 ('B-Location', 3817),
 ('I-Location', 3658),
 ('B-Cuisine', 2839),
 ('I-Amenity', 2676),
 ('B-Amenity', 2541),
 ('B-Restaurant_Name', 1901),
 ('I-Restaurant_Name', 1668),
 ('B-Dish', 1475),
 ('I-Hours', 1283),
 ('B-Rating', 1070),
 ('B-Hours', 990),
 ('I-Dish', 767),
 ('B-Price', 730),
 ('I-Cuisine', 630),
 ('I-Rating', 527),
 ('I-Price', 283)]

The class nltk.probability.FreqDist builds a histogram that tells us how many times each tag appears in the training set. We can already expect the low-frequency tags to be the ones the model has the most trouble with at test time.

## Training

Before training this model we need to set up the following parameters:
- The feature vector;
- The parameters for the L-BFGS optimizer.

### Feature Vector

The parameter $\textit{feature_func}$ must be a function that takes two arguments, a list of tokens (list of strings) and an index (integer), and returns a list of features. The API already provides a default function, so we just need to modify it for our purposes:

In [12]:
def feature_func(tokens, idx):
    token = tokens[idx]
    
    feature_list = []
    
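    # We have hours, rating and price entities, so digits could be a useful feature.
    # Note: isalpha() is False for any token that is not purely alphabetic,
    # so this also fires on punctuation, not only on digits.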
    if not token.isalpha():
        feature_list.append("HAS_NUMBER")
    
    n = len(token)
    # Let's extract some prefixes and suffixes from the current word
    if n>1:
        feature_list.append("PREF_"+token[:2])
        feature_list.append("SUF_"+token[-1:])
    if n>2:
        feature_list.append("PREF_"+token[:3])
        feature_list.append("SUF_"+token[-2:])
        
    if n>3:
        feature_list.append("PREF_"+token[:4])
        feature_list.append("SUF_"+token[-3:])
        
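    # First token: there is no previous word, and the early return
    # also means no next-word features are added for it
    if idx==0:
        return feature_list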
        
    # The same for the previous word
    previous_token = tokens[idx-1]
    
    if not previous_token.isalpha():
        feature_list.append("PHAS_NUMBER")
    
    n = len(previous_token)
    if n>1:
        feature_list.append("PPREF_"+previous_token[:2])
        feature_list.append("PSUF_"+previous_token[-1:])
    if n>2:
        feature_list.append("PPREF_"+previous_token[:3])
        feature_list.append("PSUF_"+previous_token[-2:])
        
    if n>3:
        feature_list.append("PPREF_"+previous_token[:4])
        feature_list.append("PSUF_"+previous_token[-3:])
        
    if idx==len(tokens)-1:
        return feature_list
    
    # The same for the next word
    next_token = tokens[idx+1]
    
    if not next_token.isalpha():
        feature_list.append("NHAS_NUMBER")
    
    n = len(next_token)
    if n>1:
        feature_list.append("NPREF_"+next_token[:2])
        feature_list.append("NSUF_"+next_token[-1:])
    if n>2:
        feature_list.append("NPREF_"+next_token[:3])
        feature_list.append("NSUF_"+next_token[-2:])
        
    if n>3:
        feature_list.append("NPREF_"+next_token[:4])
        feature_list.append("NSUF_"+next_token[-3:])
        
    return feature_list    

Let's see what the output of this function looks like for a sentence in the training set.

In [13]:
toy = [w for w,t in train[0]]
toy

['2', 'start', 'restaurants', 'with', 'inside', 'dining']

In [14]:
feature_func(toy, 0)

['HAS_NUMBER']

In [15]:
feature_func(toy, 1)

['PREF_st',
 'SUF_t',
 'PREF_sta',
 'SUF_rt',
 'PREF_star',
 'SUF_art',
 'PHAS_NUMBER',
 'NPREF_re',
 'NSUF_s',
 'NPREF_res',
 'NSUF_ts',
 'NPREF_rest',
 'NSUF_nts']

In [16]:
feature_func(toy, 5)

['PREF_di',
 'SUF_g',
 'PREF_din',
 'SUF_ng',
 'PREF_dini',
 'SUF_ing',
 'PPREF_in',
 'PSUF_e',
 'PPREF_ins',
 'PSUF_de',
 'PPREF_insi',
 'PSUF_ide']

Notice that the feature space will be huge; fortunately, log-linear models cope well with sparse features. The next step is to define the parameters for the L-BFGS optimizer.

### L-BFGS parameters

The most important parameters to set are:
- feature.possible_states: Force the generation of all possible state features, including ones that never occur in the training data;
- c1: Coefficient for L1 regularization;
- c2: Coefficient for L2 regularization;
- max_iterations: The maximum number of iterations for L-BFGS optimization;
- num_memories: The number of limited memories for approximating the inverse Hessian matrix;
- period: The duration of iterations to test the stopping criterion.

The last three are useful when the dataset is too large: they control how long the model keeps training and how much memory the optimizer is allowed to use. The first three are related to the optimization problem itself: L2 regularization helps with generalization, L1 regularization performs feature selection, and feature.possible_states adds more features to the model related to the states themselves.

In [17]:
class CRFTaggerCustom(CRFTagger):
    def probability(self, tokens):
        y = self.tag(tokens)
        y = list(list(zip(*y))[1])
        return self._tagger.probability(y)

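The $\textit{nltk}$ package does not provide a method to calculate the probability $P(\vec{s}|\vec{x})$, so we add a custom method to the original API as above. It first tags the sentence and then asks the underlying tagger object for the probability of that predicted tag sequence.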

In [18]:
crf = CRFTaggerCustom(feature_func, training_opt={"feature.possible_states":True, "c1":0.25, "c2":0.3})

In [20]:
tic = time.time()
crf.train(train, "model.cfr.tagger")  # The trained model will be saved to this file
toc = time.time()
f"CRF took {(toc-tic):.5f} seconds to train"

'CRF took 93.97305 seconds to train'

Let's pick a sentence in the test set to see what we can get from the trained model.

In [21]:
test = process_file("./datasets/restauranttest.bio")
test.pop()

[('',)]

In [22]:
toy = [w for w,t in test[2]]
toy

['any', 'bbq', 'places', 'open', 'before', '5', 'nearby']

In [23]:
crf.probability(toy)

0.613394375287388

As defined earlier, this is the probability of the predicted label sequence $\vec{s}$ given the sentence $\vec{x}$. It can also be read as a confidence measure: the higher the probability, the more confident the model is about its prediction. This does not mean, however, that the prediction is correct.

In [24]:
crf.tag(toy)  # Tags the sentence

[('any', 'O'),
 ('bbq', 'B-Cuisine'),
 ('places', 'O'),
 ('open', 'B-Hours'),
 ('before', 'I-Hours'),
 ('5', 'I-Hours'),
 ('nearby', 'B-Location')]

In [25]:
test[2]  # Compare with the actual labels

[('any', 'O'),
 ('bbq', 'B-Cuisine'),
 ('places', 'O'),
 ('open', 'B-Hours'),
 ('before', 'I-Hours'),
 ('5', 'I-Hours'),
 ('nearby', 'B-Location')]

## Testing

Let's test our model on all sentences in the test set to see how it performs. First, we strip the labels from the test set to simulate the situation in which real predictions are needed.

In [26]:
def retrive_sents(data:list)->list:
    return list(map(lambda x:[w for w,t in x], data))

In [27]:
_, labels = split_words_n_tags(to_list(test))
unlabeled_sents = retrive_sents(test)

In [28]:
unlabeled_sents[:3]

[['a', 'four', 'star', 'restaurant', 'with', 'a', 'bar'],
 ['any', 'asian', 'cuisine', 'around'],
 ['any', 'bbq', 'places', 'open', 'before', '5', 'nearby']]

By calling the method $\textit{tag_sents}$ we can tag all sentences at once using the CRF.

In [29]:
tic = time.time()
preds = crf.tag_sents(unlabeled_sents)
toc = time.time()
f"CRF took {(toc-tic):.5f} seconds to tag all sequences in the testing set"

'CRF took 0.17402 seconds to tag all sequences in the testing set'

In [30]:
_, preds = split_words_n_tags(to_list(preds))
preds[:10]

['O',
 'B-Rating',
 'I-Rating',
 'O',
 'O',
 'O',
 'B-Amenity',
 'O',
 'B-Cuisine',
 'O']

In [31]:
labels[:10]

['O',
 'B-Rating',
 'I-Rating',
 'O',
 'B-Location',
 'I-Location',
 'B-Amenity',
 'O',
 'B-Cuisine',
 'O']

Now we have a flat list with all predicted tags, regardless of the sentence they came from, since we are not interested in sentence-wise performance. The $\textit{labels}$ variable is a list aligned element by element with $\textit{preds}$, so we can call the $\textit{classification_report}$ function from the $\textit{sklearn}$ API to measure the per-tag precision, recall and F1-score of the CRF's predictions.

In [32]:
pprint(metrics.classification_report(labels, preds))

('                   precision    recall  f1-score   support\n'
 '\n'
 '        B-Amenity       0.77      0.70      0.73       533\n'
 '        B-Cuisine       0.86      0.85      0.86       532\n'
 '           B-Dish       0.77      0.80      0.78       288\n'
 '          B-Hours       0.74      0.68      0.71       212\n'
 '       B-Location       0.86      0.84      0.85       812\n'
 '          B-Price       0.82      0.79      0.81       171\n'
 '         B-Rating       0.80      0.82      0.81       201\n'
 'B-Restaurant_Name       0.88      0.79      0.83       402\n'
 '        I-Amenity       0.72      0.72      0.72       524\n'
 '        I-Cuisine       0.66      0.65      0.66       135\n'
 '           I-Dish       0.70      0.76      0.73       121\n'
 '          I-Hours       0.86      0.86      0.86       295\n'
 '       I-Location       0.83      0.84      0.84       788\n'
 '          I-Price       0.79      0.64      0.71        66\n'
 '         I-Rating       0.88    

As we can see, the model performed quite well on this dataset, but to make sure it is useful in practice it is necessary to dig deeper into the analysis of those metrics.
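One possible first step in that direction is to count which gold tags get confused with which predicted tags. The sketch below uses only the standard library; the helper name `top_confusions` is hypothetical, and the two short lists simply repeat the first ten gold and predicted tags displayed earlier in this notebook.

```python
from collections import Counter


def top_confusions(gold, predicted, n=5):
    """Return the n most frequent (gold_tag, predicted_tag) pairs that differ."""
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return errors.most_common(n)


# First ten gold and predicted tags, copied from the outputs above
gold = ['O', 'B-Rating', 'I-Rating', 'O', 'B-Location',
        'I-Location', 'B-Amenity', 'O', 'B-Cuisine', 'O']
pred = ['O', 'B-Rating', 'I-Rating', 'O', 'O',
        'O', 'B-Amenity', 'O', 'B-Cuisine', 'O']
print(top_confusions(gold, pred))
```

In the notebook itself, the same function could be called as `top_confusions(labels, preds)` to see which entity types are most often missed or mixed up over the whole test set.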