# Hidden Markov Model

This notebook applies a hidden Markov model (HMM) to the task of deciding whether a letter is a vowel.

## Imports, data loading, and preparation

In [1]:
import re

import numpy as np
import nltk
from nltk.tag import hmm
from nltk.corpus import brown
import pandas as pd

In [2]:
nltk.download('brown')
english = re.compile('^[a-z]+$')

[nltk_data] Downloading package brown to /home/twlvth/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Each token is lowercased, and only tokens written entirely in Latin letters are kept.

In [3]:
tokens = []
for sent in brown.sents():
    for w in sent:
        w = w.lower()
        if english.match(w):
            tokens.append(w)
print(f'Number of tokens: {len(tokens)}')

Number of tokens: 981716
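
To make the filter concrete, here is a small sketch of what the pattern keeps and drops. The sample tokens are hypothetical and chosen only for illustration; the pattern is the one defined in the notebook.

```python
import re

# Same pattern as in the notebook: one or more lowercase Latin letters, nothing else.
english = re.compile('^[a-z]+$')

# Hypothetical sample tokens, chosen only to illustrate the filter.
sample = ['The', 'Fulton', "didn't", '1961', 'jury', '``', 'co-op']

# Lowercase first, then keep only the tokens that match the pattern.
kept = [w.lower() for w in sample if english.match(w.lower())]
print(kept)  # ['the', 'fulton', 'jury']
```

Tokens containing digits, apostrophes, hyphens, or punctuation are dropped entirely rather than cleaned.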


In [4]:
text = ' '.join(tokens)
len(text), text[:100]

(5579335,
 'the fulton county grand jury said friday an investigation of recent primary election produced no evi')

## Unsupervised training of the hidden Markov model (Baum-Welch algorithm)

Extract the "vocabulary": the set of all characters in the text (the 26 letters plus the space).

In [5]:
vocab = sorted(set(text))
len(vocab)

27

Training

In [6]:
trainer = hmm.HiddenMarkovModelTrainer(range(2), vocab)

In [7]:
tagger = trainer.train_unsupervised([text[:50000]], max_iterations=50)

iteration 0 logprob -245753.192254148
iteration 1 logprob -205572.80184320567
iteration 2 logprob -205474.42046654748
iteration 3 logprob -205398.91249035764
iteration 4 logprob -205335.90739825784
iteration 5 logprob -205279.1175605272
iteration 6 logprob -205224.5351019055
iteration 7 logprob -205169.6013560973
iteration 8 logprob -205112.80920344943
iteration 9 logprob -205053.49705048298
iteration 10 logprob -204991.71232615146
iteration 11 logprob -204928.0813018238
iteration 12 logprob -204863.65900253117
iteration 13 logprob -204799.75797225125
iteration 14 logprob -204737.77034030575
iteration 15 logprob -204679.0069611271
iteration 16 logprob -204624.5797564362
iteration 17 logprob -204575.3400475508
iteration 18 logprob -204531.8546765011
iteration 19 logprob -204494.3832365968
iteration 20 logprob -204462.84901943678
iteration 21 logprob -204436.84423482
iteration 22 logprob -204415.70614163717
iteration 23 logprob -204398.64542127788
iteration 24 logprob -204384.87232856566
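
The `logprob` value printed at each iteration is the log-likelihood of the training sequence under the current parameters, which Baum-Welch, being an EM procedure, should not decrease. As a rough illustration of how such a value can be computed, here is a minimal sketch of the scaled forward algorithm in plain Python. The two-symbol model and all its parameters are toy assumptions, not values from the trained model, and base 2 is used only to mirror the `2 ** log_p` conversion applied later in the notebook.

```python
import math

def forward_log2_likelihood(obs, prior, trans, emit):
    """Return log2 p(obs) for a discrete HMM using the scaled forward algorithm."""
    n_states = len(prior)
    # alpha[i] = p(x_1, state_1 = i)
    alpha = [prior[i] * emit[i][obs[0]] for i in range(n_states)]
    log2_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission of obs[t].
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs[t]]
                for j in range(n_states)
            ]
        # Rescale to avoid underflow on long sequences; accumulate the log of the scale.
        scale = sum(alpha)
        log2_lik += math.log2(scale)
        alpha = [a / scale for a in alpha]
    return log2_lik

# Toy parameters (assumptions for illustration only).
prior = [0.5, 0.5]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [{'a': 0.6, 'b': 0.4}, {'a': 0.1, 'b': 0.9}]

print(forward_log2_likelihood('abba', prior, trans, emit))
```

The rescaling step matters for sequences as long as the 50,000 characters used here: without it, the unnormalised forward probabilities would underflow to zero.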

### Inspecting the trained model

Transition matrix $\{a_{ij} = p(s_j|s_i)\}_{i,j = 1}^{|S|}$

In [8]:
# `_data` holds base-2 log-probabilities, so 2 ** log_p recovers the probability
trans_matr = pd.DataFrame(
    data=np.array([
        [2 ** log_p for log_p in tagger._transitions[0]._data],
        [2 ** log_p for log_p in tagger._transitions[1]._data],
    ]),
    columns=[0, 1],
    index=[0, 1],
)
trans_matr

| | 0 | 1 |
|---|---|---|
| 0 | 0.766817 | 0.233183 |
| 1 | 0.379457 | 0.620543 |


In [9]:
trans_matr.sum(axis=1)

0    1.0
1    1.0
dtype: float64

Emission (output) probability matrix $\{ b_{ij} = p(x_j|s_i) \}_{i, j = 1}^{|S|, |X|}$

In [10]:
# One row per hidden state, one column per character; values converted from log2
out_matr = pd.DataFrame(
    data=np.array([
        [2 ** log_p for log_p in tagger._outputs[0]._data],
        [2 ** log_p for log_p in tagger._outputs[1]._data],
    ]),
    index=[0, 1],
    columns=vocab,
)
out_matr

| | ' ' | a | b | c | d | e | f | g | h | i | ... | q | r | s | t | u | v | w | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.191898 | 0.083814 | 0.009678 | 0.026559 | 0.049537 | 0.152913 | 3.088636e-08 | 0.008149 | 0.06148013 | 0.050879 | ... | 2.737425e-13 | 0.03977 | 0.082238 | 0.116546 | 0.000232 | 0.008488 | 0.015652 | 0.00361643 | 0.019588 | 0.000967 |
| 1 | 0.126219 | 0.043683 | 0.014885 | 0.033339 | 0.011817 | 0.028238 | 0.04808024 | 0.024572 | 8.189423e-08 | 0.076839 | ... | 0.002101871 | 0.069643 | 0.012672 | 0.019635 | 0.057792 | 0.008256 | 0.005636 | 8.331189e-08 | 0.002962 | 4e-06 |


In [11]:
out_matr.sum(axis=1)

0    1.0
1    1.0
dtype: float64
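
One rough way to read the emission matrix is to ask, for each character, which state emits it with the higher probability. The sketch below does this for a handful of columns copied from the table above. It ignores the transition probabilities entirely, so it is only a quick reading of the matrix and not a decoding of any sequence.

```python
# Emission probabilities copied from the unsupervised model's table (selected columns only).
emissions = {
    0: {'a': 0.083814, 'e': 0.152913, 'i': 0.050879, 'u': 0.000232,
        't': 0.116546, 'f': 3.088636e-08},
    1: {'a': 0.043683, 'e': 0.028238, 'i': 0.076839, 'u': 0.057792,
        't': 0.019635, 'f': 0.04808024},
}

# For each character, pick the state with the larger emission probability.
preferred = {
    c: max(emissions, key=lambda s: emissions[s][c])
    for c in emissions[0]
}
print(preferred)  # {'a': 0, 'e': 0, 'i': 1, 'u': 1, 't': 0, 'f': 1}
```

On these columns the vowels are split across both states (`a`, `e` in state 0; `i`, `u` in state 1), so the two learned states do not line up cleanly with the vowel/consonant distinction at this point in training. With the full DataFrame in the notebook, `out_matr.idxmax(axis=0)` gives the same reading for every character.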

## Supervised training of the hidden Markov model (maximum likelihood)

In [12]:
def make_tag(c):
    """Pair a character with its tag: '1' for a vowel, '0' otherwise (consonants and the space)."""
    return (c, '1' if c in 'aeiou' else '0')

supervised = [make_tag(c) for c in text]

In [13]:
tagger = trainer.train_supervised([supervised[:500]])

### Inspecting the trained model

Tag co-occurrence counts (how often each tag follows each tag)

In [14]:
for t in tagger._transitions:
    print(t, tagger._transitions[t].__dict__)

0 {'_freqdist': FreqDist({'0': 199, '1': 142})}
1 {'_freqdist': FreqDist({'0': 142, '1': 16})}
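
These counts are all the maximum-likelihood estimate needs: each transition probability is the count of a tag pair divided by the total number of transitions leaving the first tag, $a_{ij} = c(i \to j) / \sum_k c(i \to k)$. A short sketch, using only the counts printed above, reproduces the arithmetic by hand:

```python
# Tag-bigram counts copied from the FreqDist output above.
counts = {
    '0': {'0': 199, '1': 142},
    '1': {'0': 142, '1': 16},
}

# Maximum-likelihood estimate: a_ij = count(i -> j) / sum_k count(i -> k)
trans = {
    i: {j: n / sum(row.values()) for j, n in row.items()}
    for i, row in counts.items()
}

for i, row in trans.items():
    print(i, {j: round(p, 6) for j, p in row.items()})
# 0 {'0': 0.583578, '1': 0.416422}
# 1 {'0': 0.898734, '1': 0.101266}
```

The counts sum to 499, one fewer than the 500 tagged characters used for training, because a sequence of 500 symbols contains 499 transitions.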


Transition matrix $\{a_{ij} = p(s_j|s_i)\}_{i,j = 1}^{|S|}$

In [15]:
trans_matr = pd.DataFrame(
    data=np.array([
        [tagger._transitions['0'].prob('0'), tagger._transitions['0'].prob('1')],
        [tagger._transitions['1'].prob('0'), tagger._transitions['1'].prob('1')],
    ]),
    columns=[0, 1],
    index=[0, 1],
)
trans_matr

| | 0 | 1 |
|---|---|---|
| 0 | 0.583578 | 0.416422 |
| 1 | 0.898734 | 0.101266 |


In [16]:
trans_matr.sum(axis=1)

0    1.0
1    1.0
dtype: float64

Emission (output) probability matrix $\{ b_{ij} = p(x_j|s_i) \}_{i, j = 1}^{|S|, |X|}$

In [17]:
out_matr = pd.DataFrame(
    data=np.array([
        [tagger._outputs['0'].prob(c) for c in vocab],
        [tagger._outputs['1'].prob(c) for c in vocab],
    ]),
    index=[0, 1],
    columns=vocab,
)
out_matr

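| | ' ' | a | b | c | d | e | f | g | h | i | ... | q | r | s | t | u | v | w | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.236842 | 0.0 | 0.008772 | 0.05848 | 0.049708 | 0.0 | 0.02924 | 0.023392 | 0.070175 | 0.0 | ... | 0.0 | 0.096491 | 0.052632 | 0.128655 | 0.0 | 0.01462 | 0.017544 | 0.002924 | 0.035088 | 0.0 |
| 1 | 0.0 | 0.177215 | 0.0 | 0.0 | 0.0 | 0.348101 | 0.0 | 0.0 | 0.0 | 0.208861 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.101266 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |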


In [18]:
out_matr.sum(axis=1)

0    1.0
1    1.0
dtype: float64
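
With transition and emission probabilities in hand, the stated task (deciding whether each letter is a vowel) comes down to finding the most probable tag sequence for a string. The following is a minimal sketch of Viterbi decoding in plain Python, not the notebook's own code. The transition probabilities are the supervised estimates from above; the uniform prior and the uniform emission distributions are simplifying assumptions made for illustration, not the trained values.

```python
import math

def viterbi(obs, states, prior, trans, emit):
    """Return the most probable state sequence for obs (log-space Viterbi)."""
    def log(p):
        return math.log(p) if p > 0 else float('-inf')

    # best[s] = log-probability of the best path that ends in state s
    best = {s: log(prior[s]) + log(emit[s].get(obs[0], 0.0)) for s in states}
    back = []
    for x in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # Best predecessor of s, then extend the path by emitting x from s.
            r = max(states, key=lambda q: prev[q] + log(trans[q][s]))
            best[s] = prev[r] + log(trans[r][s]) + log(emit[s].get(x, 0.0))
            pointers[s] = r
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

states = ['0', '1']
prior = {'0': 0.5, '1': 0.5}  # assumption: uniform start distribution
trans = {  # supervised estimates from the transition matrix above
    '0': {'0': 0.583578, '1': 0.416422},
    '1': {'0': 0.898734, '1': 0.101266},
}
non_vowels = ' bcdfghjklmnpqrstvwxyz'
emit = {  # assumption: uniform emissions within each tag
    '0': {c: 1 / len(non_vowels) for c in non_vowels},
    '1': {c: 0.2 for c in 'aeiou'},
}

print(''.join(viterbi('the jury', states, prior, trans, emit)))  # 00100100
```

Because the supervised emission matrix assigns zero probability to vowels under tag `0` and to everything else under tag `1`, the decoding here is fully determined by the emissions. In the notebook itself, calling `tagger.tag(list('the jury'))` on the trained model should perform the equivalent decoding with the actual parameters.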