## CRF

A CRF (Conditional Random Field) is a classic probabilistic graphical model. It can take the context of adjacent positions in a sequence into account, which makes it well suited to sequence labeling; part-of-speech tagging is one of its most common applications. In addition, early deep learning models for semantic segmentation used CRFs as a post-processing step to refine the segmentation output of neural networks.

### Probabilistic Undirected Graph

A probabilistic undirected graphical model, also called a Markov random field, uses an undirected graph to represent a joint probability distribution.

Suppose the joint probability distribution $P(Y)$ is represented by an undirected graph $G=(V,E)$, where the nodes represent random variables and the edges represent dependencies between them. If $P(Y)$ satisfies the pairwise, local, or global Markov property, it is a probabilistic undirected graphical model. The pairwise Markov property, for example, states that any two random variables not joined by an edge are conditionally independent given all the other random variables.

A clique is a subset of nodes in an undirected graph $G$ in which every two nodes are connected by an edge. If $C$ is a clique of $G$ and no further node can be added to make it a larger clique, $C$ is called a maximal clique. The joint probability distribution $P(Y)$ of a probabilistic undirected graphical model can be written as a product of functions $\Psi_{C}\left(Y_{C}\right)$ over all maximal cliques $C$ of the graph:
$$
P(Y)=\frac{1}{Z} \prod_{C} \Psi_{C}\left(Y_{C}\right)
$$

$Z$ is the normalization factor:
$$
Z=\sum _{Y} \prod_{C} \Psi_{C}\left(Y_{C}\right)
$$
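
The factorization can be checked by brute force on a tiny graph. The sketch below assumes a chain $Y_1 - Y_2 - Y_3$ of binary variables, whose maximal cliques are $\{Y_1, Y_2\}$ and $\{Y_2, Y_3\}$; the potential values are made up for illustration.

```python
import itertools
import numpy as np

# Hypothetical potentials for the chain Y1 - Y2 - Y3 (binary variables).
psi_12 = np.array([[4.0, 1.0],
                   [1.0, 3.0]])  # psi_12[y1, y2]
psi_23 = np.array([[2.0, 1.0],
                   [1.0, 5.0]])  # psi_23[y2, y3]


def unnormalized(y):
    """Product of the maximal-clique potentials for y = (y1, y2, y3)."""
    y1, y2, y3 = y
    return psi_12[y1, y2] * psi_23[y2, y3]


# Z sums the unnormalized product over every joint assignment.
assignments = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(y) for y in assignments)

# Dividing by Z turns the products into a proper joint distribution.
P = {y: unnormalized(y) / Z for y in assignments}

print("Z =", Z)
print("sum of P(Y) over all assignments =", sum(P.values()))
```

Enumerating all assignments is only feasible for very small graphs, which is why chain-structured models rely on the recursive algorithms described later.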

A CRF is a probabilistic undirected graphical model, so it inherits the properties of such models, including the factorization over maximal cliques described above.

### Definition of CRF

A CRF is a Markov random field over the random variable $Y$ conditioned on the random variable $X$. Suppose $X$ and $Y$ are random variables and $P(Y|X)$ is the conditional probability distribution of $Y$ given $X$. If $Y$ forms a Markov random field represented by an undirected graph $G=(V,E)$, that is, if the following holds for every node $v$, then $P(Y|X)$ is called a conditional random field:
$$
P(Y_{v}|X,Y_{w},w \neq v) = P(Y_{v}|X,Y_{w},w \sim v)
$$

Here $w \neq v$ denotes all nodes in the graph other than node $v$, and $w \sim v$ denotes all nodes $w$ that are connected to node $v$ by an edge.

### Parametric Expression of CRF

Assume that $P(Y|X)$ is a linear-chain CRF. Given that the random variable $X$ takes the value $x$, the conditional probability that the random variable $Y$ takes the value $y$ has the following form:
$$
\begin{aligned}
&P(y \mid x)=\frac{1}{Z(x)} \exp \left(\sum_{i, k} \lambda_{k} t_{k}\left(y_{i-1}, y_{i}, x, i\right)+\sum_{i, l} u_{l} s_{l}\left(y_{i}, x, i\right)\right) \\
&Z(x)=\sum_{y} \exp \left(\sum_{i, k} \lambda_{k} t_{k}\left(y_{i-1}, y_{i}, x, i\right)+\sum_{i, l} u_{l} s_{l}\left(y_{i}, x, i\right)\right)
\end{aligned}
$$

In the formula above, $t_{k}$ and $s_{l}$ are feature functions (also called characteristic functions), $\lambda_{k}$ and $u_{l}$ are the corresponding weights, and $Z(x)$ is the normalization factor, whose summation runs over all possible output sequences.

For example, in a part-of-speech tagging task, $x$ is the whole input sentence, $i$ is the current position, and $y_{i}$ and $y_{i-1}$ are the labels at the current and previous positions. These four quantities are the inputs to the feature functions. $t_{k}$ is a transition feature function and $s_{l}$ is a state feature function; each takes the value 1 when its condition is satisfied and 0 otherwise.
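
To make the parametric form concrete, here is a minimal sketch of indicator feature functions and the resulting $P(y \mid x)$, with $Z(x)$ computed by enumerating every label sequence. The label set, the three features, and their weights are all hypothetical and chosen only for illustration.

```python
import itertools
import math

LABELS = ["DET", "NOUN", "VERB"]  # hypothetical tag set


def t1(y_prev, y_cur, x, i):
    """Transition feature: a determiner is followed by a noun."""
    return 1 if (y_prev == "DET" and y_cur == "NOUN") else 0


def s1(y_cur, x, i):
    """State feature: the word 'the' is tagged DET."""
    return 1 if (x[i].lower() == "the" and y_cur == "DET") else 0


def s2(y_cur, x, i):
    """State feature: a word ending in 's' is tagged VERB."""
    return 1 if (x[i].endswith("s") and y_cur == "VERB") else 0


TRANSITION = [(1.5, t1)]        # pairs (lambda_k, t_k), weights are made up
STATE = [(2.0, s1), (1.0, s2)]  # pairs (u_l, s_l), weights are made up


def score(y, x):
    """Weighted sum of all features over all positions (the exponent)."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "START"
        total += sum(lam * t(y_prev, y[i], x, i) for lam, t in TRANSITION)
        total += sum(u * s(y[i], x, i) for u, s in STATE)
    return total


def conditional_probability(y, x):
    """P(y | x), with Z(x) summed over every possible label sequence."""
    Z = sum(math.exp(score(c, x))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / Z


x = ["The", "dog", "barks"]
y = ["DET", "NOUN", "VERB"]
print(score(y, x), conditional_probability(y, x))
```

The enumeration in `conditional_probability` grows exponentially with the sentence length, so it is only a reference implementation for very short inputs.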

### Three Key Problems in CRF

A linear-chain CRF involves three core problems: probability computation with the forward-backward algorithm, parameter learning by maximum likelihood estimation with an optimizer such as a Newton-type method, and prediction with the Viterbi algorithm.

• Forward-Backward Algorithm

The probability computation problem for a CRF is: given the conditional probability distribution $P(y|x)$, an input sequence $x$, and an output sequence $y$, compute the conditional probabilities $P(y_{i}|x)$ and $P(y_{i-1},y_{i}|x)$ and the corresponding expectations.

The forward-backward algorithm can be used to compute $P(y_{i}|x)$ and $P(y_{i-1},y_{i}|x)$. In the forward pass, $\alpha_{i}(y_{i}|x)$ denotes the unnormalized probability of the partial label sequence up to position $i$ when the label at position $i$ is $y_{i}$. Writing $f_{k}$ for the $K$ transition and state feature functions taken together and $w_{k}$ for their weights, define the unnormalized probability of transitioning from $y_{i-1}$ to $y_{i}$, given $y_{i-1}$:
$$
M_{i}\left(y_{i-1}, y_{i} \mid x\right)=\exp \left(\sum_{k=1}^{K} w_{k} f_{k}\left(y_{i-1}, y_{i}, x, i\right)\right)
$$

Correspondingly, when the label at position $i+1$ is $y_{i+1}$, the unnormalized probability of the partial label sequence up to position $i+1$ is obtained by summing over the label $y_{i}$ at position $i$:
$$
\alpha_{i+1}\left(y_{i+1} \mid x\right)=\sum_{y_{i}} \alpha_{i}\left(y_{i} \mid x\right) M_{i+1}\left(y_{i}, y_{i+1} \mid x\right)
$$

Definition at the start of the sequence:
$$
\alpha_{0}\left(y_{0} \mid x\right)=\left\{\begin{array}{lr}
1 & y_{0}=\text { start } \\
0 & \text { else }
\end{array}\right.
$$

Assume the number of possible labels is $m$, so $y_{i}$ can take $m$ different values. Let $\alpha_{i}(x)$ denote the forward vector composed of these $m$ values:
$$
\alpha_{i}(x) = (\alpha_{i}(y_{i}=1|x), \alpha_{i}(y_{i}=2|x), \cdots, \alpha_{i}(y_{i}=m|x))^{\top}
$$

$M_{i}(x)$ denotes the $m \times m$ matrix whose entries are $M_{i}\left(y_{i-1}, y_{i} \mid x\right)$, with rows indexed by $y_{i-1}$ and columns by $y_{i}$:
$$
M_{i}(x)=[M_{i}\left(y_{i-1}, y_{i} \mid x\right)]
$$

The forward recursion can then be written in matrix form as:
$$
\alpha^{\top}_{i+1}(x)=\alpha^{\top}_{i}(x)M_{i+1}(x)
$$

The backward pass is defined in a similar way. Let $\beta_{i}(y_{i}|x)$ be the unnormalized probability of the partial label sequence after position $i$ when the label at position $i$ is $y_{i}$:
$$
\beta_{i}\left(y_{i} \mid x\right)=\sum_{y_{i+1}} M_{i+1}\left(y_{i}, y_{i+1} \mid x\right) \beta_{i+1}\left(y_{i+1} \mid x\right)
$$

Definition at the end of the sequence:
$$
\beta_{n+1}\left(y_{n+1} \mid x\right)=\left\{\begin{array}{lr}
1 & y_{n+1}=\text { stop } \\
0 & \text { else }
\end{array}\right.
$$

In vector form, the backward recursion is:
$$
\beta_{i}\left( x\right)=M_{i+1}\left( x\right) \beta_{i+1}\left( x\right)
$$

The normalization factor is obtained by summing over the $m$ possible labels $y_{c}$:
$$
Z(x)=\sum_{c=1}^{m} \alpha_{n}\left(y_{c} \mid x\right)=\sum_{c=1}^{m} \beta_{1}\left(y_{c} \mid x\right)
$$

The vectorized expression of $Z(x)$ is:
$$
Z(x) = \alpha^{\top}_{n}(x) \cdot 1 = 1^{\top} \cdot \beta_{1}(x)
$$

Using the forward and backward quantities, the conditional probability that the label at position $i$ is $y_{i}$, and the conditional probability that the labels at positions $i-1$ and $i$ are $y_{i-1}$ and $y_{i}$, are:
$$
\begin{gathered}
P\left(Y_{i}=y_{i} \mid x\right)=\frac{\alpha_{i}\left(y_{i} \mid x\right) \beta_{i}\left(y_{i} \mid x\right)}{Z(x)} \\
P\left(Y_{i-1}=y_{i-1}, Y_{i}=y_{i} \mid x\right)=\frac{\alpha_{i-1}\left(y_{i-1} \mid x\right) M_{i}\left(y_{i-1}, y_{i} \mid x\right) \beta_{i}\left(y_{i} \mid x\right)}{Z(x)}
\end{gathered}
$$
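
The recursions above can be sketched in NumPy as follows. This is a toy under stated assumptions, not a reference implementation: the potentials are random numbers standing in for $\exp(\sum_k w_k f_k)$, positions are 0-based, the start potential is kept as a separate vector `start`, and the transition into the stop state is taken to be 1 for every label. A brute-force sum over all label sequences is included as a check.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3  # toy sequence length and number of labels

# Hypothetical unnormalized potentials.
# start[b]   : potential of label b at the first position.
# M[t][a, b] : potential of label a at t - 1 followed by label b at t.
start = rng.uniform(0.5, 2.0, size=m)
M = rng.uniform(0.5, 2.0, size=(n, m, m))  # M[0] is unused

# Forward pass: alpha[t, b] sums over all label prefixes ending in b at t.
alpha = np.zeros((n, m))
alpha[0] = start
for t in range(1, n):
    alpha[t] = alpha[t - 1] @ M[t]

# Backward pass: beta[t, a] sums over all label suffixes following a at t.
beta = np.ones((n, m))
for t in range(n - 2, -1, -1):
    beta[t] = M[t + 1] @ beta[t + 1]

Z = alpha[-1].sum()

# Single-position marginals, shape (n, m).
node_marginals = alpha * beta / Z

# Pairwise marginals for positions (t - 1, t), t = 1..n-1, shape (n-1, m, m).
edge_marginals = np.array([
    alpha[t - 1][:, None] * M[t] * beta[t][None, :] / Z
    for t in range(1, n)
])


def path_score(y):
    """Unnormalized probability of one complete label sequence."""
    s = start[y[0]]
    for t in range(1, n):
        s *= M[t][y[t - 1], y[t]]
    return s


Z_brute = sum(path_score(y) for y in itertools.product(range(m), repeat=n))
print(Z, Z_brute)
```

The forward and backward passes cost on the order of $n m^{2}$ operations, while the brute-force sum visits $m^{n}$ sequences.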

• Learning Algorithm

Given a training data set $X$, the corresponding label sequences $Y$, and $K$ feature functions $f_{k}(x,y)$, the CRF must learn the model parameters $w_{k}$ and hence the conditional probability $P_{w}(y|x)$, which are related by:
$$
P_{w}(y \mid x)=\frac{1}{Z_{w}(x)} \exp \sum_{k=1}^{K} w_{k} f_{k}(x, y)=\frac{\exp \sum_{k=1}^{K} w_{k} f_{k}(x, y)}{\sum_{y} \exp \sum_{k=1}^{K} w_{k} f_{k}(x, y)}
$$

The formula above has the form of a softmax function over all candidate label sequences. Once the parameters $w_{k}$ have been learned, they can be substituted into it to compute $P_{w}(y|x)$.

Learning a linear-chain CRF amounts to fitting a log-linear model defined on sequence data. The learning criteria include maximum likelihood estimation and regularized maximum likelihood estimation, and the optimization algorithms include gradient descent, Newton's method, quasi-Newton methods, and iterative scaling.
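
The maximum likelihood objective can be written out explicitly. The following is the standard log-likelihood of a log-linear model and its gradient; the number of training pairs $N$ and the superscript $(j)$ indexing them are notation introduced here for illustration.

$$
\begin{aligned}
L(w) &=\sum_{j=1}^{N} \log P_{w}\left(y^{(j)} \mid x^{(j)}\right)=\sum_{j=1}^{N}\left[\sum_{k=1}^{K} w_{k} f_{k}\left(x^{(j)}, y^{(j)}\right)-\log Z_{w}\left(x^{(j)}\right)\right] \\
\frac{\partial L}{\partial w_{k}} &=\sum_{j=1}^{N}\left[f_{k}\left(x^{(j)}, y^{(j)}\right)-\sum_{y} P_{w}\left(y \mid x^{(j)}\right) f_{k}\left(x^{(j)}, y\right)\right]
\end{aligned}
$$

The gradient is the difference between the observed feature counts and the feature counts expected under the current model. Because each $f_{k}(x, y)$ is a sum of local features over positions, the expected counts can be computed from the marginals $P(y_{i-1}, y_{i} \mid x)$ given by the forward-backward algorithm, without enumerating all label sequences.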

• Prediction Algorithm

The prediction problem for a CRF is: given the conditional random field $P(Y|X)$ and an input sequence $x$, find the output sequence $y^{*}$ with the largest conditional probability. CRFs use the Viterbi algorithm for this.

The inputs to the Viterbi algorithm are the feature vector $F(y,x)$, the weight vector $w$, and the observation sequence $x=(x_{1},x_{2},\cdots,x_{n})$; the output is the optimal path $y^{*}=(y^{*}_{1},y^{*}_{2},\cdots,y^{*}_{n})$. The algorithm proceeds as follows:

(1) Initialization:
$$
\delta_{1}(j)=w \cdot F_{1}(y_{0}=\text{start}, y_{1}=j, x), \enspace j=1,2, \ldots, m
$$

(2) Recursion: for $i=2,3,\cdots,n$
$$
\begin{aligned}
\delta_{i}(l)&=\max _{1 \le j \le m}\left\{\delta_{i-1}(j)+w \cdot F_{i}\left(y_{i-1}=j, y_{i}=l, x\right)\right\}, \enspace l=1,2, \ldots, m \\
\Psi_{i}(l)&=\arg \max _{1 \le j \le m}\left\{\delta_{i-1}(j)+w \cdot F_{i}\left(y_{i-1}=j, y_{i}=l, x\right)\right\}, \enspace l=1,2, \ldots, m
\end{aligned}
$$

(3) Termination:
$$
\max _{y}(w \cdot F(y, x))=\max _{1 \le j \le m} \delta_{n}(j), \enspace y_{n}^{*}=\arg \max _{1 \le j \le m} \delta_{n}(j)
$$

(4) Optimal path backtracking: for $i=n-1, n-2, \cdots, 1$
$$
y^{*}_{i}= \Psi_{i+1}(y^{*}_{i+1})
$$
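
The four steps can be sketched in NumPy as below. As an assumption for illustration, the local scores $w \cdot F_{i}$ are supplied as precomputed arrays (`start_scores` and `step_scores` are hypothetical names, filled with random numbers here), and positions and labels are 0-based. A brute-force search over all paths is included as a check.

```python
import itertools
import numpy as np


def viterbi(start_scores, step_scores):
    """Return the highest-scoring label path and its score.

    start_scores[j]          : score of label j at the first position.
    step_scores[t - 1][j, l] : score of label j at t - 1 followed by l at t.
    """
    n, m = len(step_scores) + 1, len(start_scores)
    delta = np.zeros((n, m))
    psi = np.zeros((n, m), dtype=int)

    delta[0] = start_scores                          # (1) initialization
    for t in range(1, n):                            # (2) recursion
        cand = delta[t - 1][:, None] + step_scores[t - 1]  # cand[j, l]
        delta[t] = cand.max(axis=0)
        psi[t] = cand.argmax(axis=0)

    path = [int(delta[-1].argmax())]                 # (3) termination
    for t in range(n - 1, 0, -1):                    # (4) backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())


rng = np.random.default_rng(1)
n, m = 5, 3  # toy sequence length and number of labels
start_scores = rng.normal(size=m)
step_scores = rng.normal(size=(n - 1, m, m))

best_path, best_score = viterbi(start_scores, step_scores)


def total_score(y):
    """Score of one complete path, used only for the brute-force check."""
    return start_scores[y[0]] + sum(
        step_scores[t - 1][y[t - 1], y[t]] for t in range(1, n))


brute_path = max(itertools.product(range(m), repeat=n), key=total_score)
print(best_path, list(brute_path))
```

Because the scores are added rather than multiplied, this corresponds to working with the exponent of $P(y \mid x)$ directly; $Z(x)$ does not depend on $y$ and can be ignored when taking the maximum.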

Example of Viterbi algorithm:
 
<img src="1.png" />

To find the shortest path from $S$ to $E$, the Viterbi algorithm does not compare all paths one by one. Instead, it first compares the three paths that pass through $B1$ and keeps only the shortest of them, then does the same for $B2$ and $B3$. Finally, it compares the three retained paths to obtain the optimal one. This process shows that the Viterbi algorithm is a dynamic programming algorithm.
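
The same idea can be sketched on a small layered graph. The graph below (`S`, then `A1` to `A3`, then `B1` to `B3`, then `E`) and all edge lengths are hypothetical and are not taken from the figure; they only illustrate keeping one best partial path per node.

```python
# weights[k][i][j]: hypothetical length of the edge from node i of layer k
# to node j of layer k + 1.
weights = [
    [[6, 7, 5]],                        # S  -> A1, A2, A3
    [[6, 4, 9], [3, 8, 5], [7, 6, 2]],  # A* -> B1, B2, B3
    [[4], [6], [3]],                    # B* -> E
]
names = [["S"], ["A1", "A2", "A3"], ["B1", "B2", "B3"], ["E"]]


def shortest_layered_path(weights):
    dist = [0.0]  # shortest distance from S to each node of the current layer
    back = []     # back[k][j]: best predecessor of node j in layer k + 1
    for W in weights:
        new_dist, pointers = [], []
        for j in range(len(W[0])):
            # Keep only the best way of reaching node j of the next layer.
            best_i = min(range(len(W)), key=lambda i: dist[i] + W[i][j])
            new_dist.append(dist[best_i] + W[best_i][j])
            pointers.append(best_i)
        dist = new_dist
        back.append(pointers)

    # Follow the stored predecessors back from E, the only final node.
    node, path = 0, [0]
    for pointers in reversed(back):
        node = pointers[node]
        path.append(node)
    return path[::-1], dist[0]


path, length = shortest_layered_path(weights)
print(" -> ".join(names[k][i] for k, i in enumerate(path)), length)
```

With these lengths the nine complete paths never need to be listed: each layer only stores one distance and one back-pointer per node.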

In [13]:
# CRF in sklearn
import ssl
import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

In [14]:
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
# Download the example dataset via NLTK
nltk.download('conll2002')

[nltk_data] Downloading package conll2002 to
[nltk_data]     /Users/imchengliang/nltk_data...
[nltk_data]   Unzipping corpora/conll2002.zip.


True

In [15]:
# Load the training and test sentences
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

train_sents[0]

[('Melbourne', 'NP', 'B-LOC'),
 ('(', 'Fpa', 'O'),
 ('Australia', 'NP', 'B-LOC'),
 (')', 'Fpt', 'O'),
 (',', 'Fc', 'O'),
 ('25', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFE', 'NC', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]

In [16]:
# Convert the word at position i of a sentence into a feature dictionary
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [17]:
sent2features(train_sents[0])[0]

{'bias': 1.0,
 'word.lower()': 'melbourne',
 'word[-3:]': 'rne',
 'word[-2:]': 'ne',
 'word.isupper()': False,
 'word.istitle()': True,
 'word.isdigit()': False,
 'postag': 'NP',
 'postag[:2]': 'NP',
 'BOS': True,
 '+1:word.lower()': '(',
 '+1:word.istitle()': False,
 '+1:word.isupper()': False,
 '+1:postag': 'Fpa',
 '+1:postag[:2]': 'Fp'}

In [18]:
# Build the training and test sets
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

print(len(X_train), len(X_test))

8323 1517


In [19]:
# Create the CRF model instance
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
# Train the model
crf.fit(X_train, y_train)
# Class labels, excluding the 'O' (outside) tag
labels = list(crf.classes_)
labels.remove('O')
# Predict on the test set
y_pred = crf.predict(X_test)
# Compute the weighted F1 score
metrics.flat_f1_score(y_test, y_pred,
                    average='weighted', labels=labels)

0.7964686316443963

In [27]:
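# Print per-class results, grouping the B- and I- tags of each entity type
sorted_label = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
# metrics.flat_classification_report raised the following error in this environment:
#   TypeError: classification_report() takes 2 positional arguments but 3 positional
#   arguments (and 1 keyword-only argument) were given
# Workaround: flatten the label sequences and call scikit-learn's
# classification_report directly, passing labels as a keyword argument.
from itertools import chain
from sklearn.metrics import classification_report

y_test_flat = list(chain.from_iterable(y_test))
y_pred_flat = list(chain.from_iterable(y_pred))
print(classification_report(
    y_test_flat, y_pred_flat, labels=sorted_label, digits=3
))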