# Reference
[hmmlearn](https://hmmlearn.readthedocs.io/en/latest/tutorial.html)

[Sampling from HMM](https://hmmlearn.readthedocs.io/en/latest/auto_examples/plot_hmm_sampling.html#sphx-glr-auto-examples-plot-hmm-sampling-py)

# The three fundamental problems of HMMs
There are three fundamental problems for HMMs:

1. Given the model parameters and observed data, estimate the optimal sequence of hidden states.
2. Given the model parameters and observed data, calculate the likelihood of the data.
3. Given just the observed data, estimate the model parameters.
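
As a rough illustration of the second problem (the likelihood of the data), here is a minimal sketch of the forward algorithm on a toy two-box model. The state names, symbols, and all probabilities are made-up values for illustration only, and the helper names `forward_likelihood` and `brute_force_likelihood` are hypothetical, not part of NLTK or hmmlearn.

```python
from itertools import product

# Toy HMM; every number below is an illustrative assumption.
states = ["box1", "box2"]
start_p = {"box1": 0.5, "box2": 0.5}
trans_p = {
    "box1": {"box1": 0.7, "box2": 0.3},
    "box2": {"box1": 0.4, "box2": 0.6},
}
emit_p = {
    "box1": {"red": 0.6, "white": 0.4},
    "box2": {"red": 0.2, "white": 0.8},
}


def forward_likelihood(obs):
    """P(obs | model), summing over all hidden state paths step by step."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            j: sum(alpha[i] * trans_p[i][j] for i in states) * emit_p[j][o]
            for j in states
        }
    return sum(alpha.values())


def brute_force_likelihood(obs):
    """Same quantity by enumerating every state path (tiny inputs only)."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = start_p[path[0]] * emit_p[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans_p[path[t - 1]][path[t]] * emit_p[path[t]][obs[t]]
        total += p
    return total


obs = ["red", "white", "red"]
print(forward_likelihood(obs), brute_force_likelihood(obs))
```

The two functions should agree up to floating-point error; the forward version avoids enumerating all paths, which grows exponentially with the sequence length.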

# NLTK example
[demo_pos](https://www.nltk.org/_modules/nltk/tag/hmm.html#demo_pos)


In [1]:
import nltk

[nltk.tag.hmm module](https://www.nltk.org/api/nltk.tag.html?highlight=hmm#module-nltk.tag.hmm)

# HMM formulation
- the output observation alphabet. This is the set of symbols which may be observed as output of the system, e.g. [red ball, white ball, ...] or [cloudy, sunny, ...].
- the set of states. These are the hidden states, e.g. [box 1, box 2, ...].
- the transition probabilities a_{ij} = P(s_t = j | s_{t-1} = i). These represent the probability of transition to each state from a given state, e.g. the probability of moving from box i to box j.
- the output probability matrix b_i(k) = P(X_t = o_k | s_t = i). These represent the probability of observing each symbol in a given state. For example, if box 1 holds 3 red balls and 2 white balls, the output probabilities in box 1 are [0.6, 0.4].
- the initial state distribution. This gives the probability of starting in each state, i.e. the probability that the first observation is produced by each hidden state.
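
The five components above can be written down directly as plain Python data. The sketch below uses the box-and-ball example; apart from the [0.6, 0.4] output probabilities for box 1, every number is an assumed illustrative value, and the names `A`, `B`, `pi`, `draw`, and `sample` are hypothetical helpers rather than any library's API.

```python
import random

symbols = ["red", "white"]   # output observation alphabet
states = ["box1", "box2"]    # hidden states

# Transition probabilities a_ij = P(s_t = j | s_{t-1} = i); assumed values.
A = {
    "box1": {"box1": 0.7, "box2": 0.3},
    "box2": {"box1": 0.4, "box2": 0.6},
}
# Output probabilities b_i(k); box1 holds 3 red and 2 white balls.
B = {
    "box1": {"red": 3 / 5, "white": 2 / 5},
    "box2": {"red": 0.2, "white": 0.8},   # assumed values
}
# Initial state distribution; assumed values.
pi = {"box1": 0.5, "box2": 0.5}


def draw(dist, rng):
    """Draw one outcome from a {outcome: probability} dict."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def sample(length, seed=0):
    """Generate a hidden state path and the observations it emits."""
    rng = random.Random(seed)
    state = draw(pi, rng)
    path, observations = [], []
    for _ in range(length):
        path.append(state)
        observations.append(draw(B[state], rng))
        state = draw(A[state], rng)
    return path, observations


path, observations = sample(5)
print(path)
print(observations)
```

Each row of `A` and `B`, and `pi` itself, is a probability distribution and should sum to 1.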


> To ground this discussion, take a common NLP application, part-of-speech (POS) tagging. An HMM is desirable for this task as the highest probability tag sequence can be calculated for a given sequence of word forms.

In other words, the HMM computes the probability of a whole sequence of output tags for a whole sequence of observed outputs at once.
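
Finding the highest-probability tag sequence is usually done with the Viterbi algorithm. The following is a minimal sketch on a three-tag, three-word toy model; the tags are borrowed from the Brown tag set, but all probabilities are invented for illustration and the function `viterbi` is a hypothetical helper, not NLTK's implementation.

```python
# Toy tagging model; all probabilities are illustrative assumptions.
tags = ["AT", "NN", "VBD"]
start_p = {"AT": 0.8, "NN": 0.1, "VBD": 0.1}
trans_p = {
    "AT": {"AT": 0.05, "NN": 0.9, "VBD": 0.05},
    "NN": {"AT": 0.1, "NN": 0.3, "VBD": 0.6},
    "VBD": {"AT": 0.5, "NN": 0.3, "VBD": 0.2},
}
emit_p = {
    "AT": {"the": 0.9, "jury": 0.05, "said": 0.05},
    "NN": {"the": 0.05, "jury": 0.9, "said": 0.05},
    "VBD": {"the": 0.05, "jury": 0.05, "said": 0.9},
}


def viterbi(words):
    """Return (best tag sequence, its joint probability with the words)."""
    # best[t][tag] = (probability of the best path ending in tag, previous tag)
    best = [{tag: (start_p[tag] * emit_p[tag][words[0]], None) for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            column[tag] = max(
                (best[-1][prev][0] * trans_p[prev][tag] * emit_p[tag][word], prev)
                for prev in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


print(viterbi(["the", "jury", "said"]))
```

For long sentences a real implementation would work with log probabilities to avoid underflow; plain products are kept here for readability.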

In [2]:
from nltk.tag.hmm import *  # load everything in nltk.tag.hmm, including helpers such as the data-loading function and the training functions

print()
print("HMM POS tagging demo")
print()

print('Training HMM...')
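# load_pos(n) returns up to n tagged sentences (lower-cased words),
# plus the list of tags and the list of symbols (words) seen in them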
labelled_sequences, tag_set, symbols = load_pos(20000)
trainer = HiddenMarkovModelTrainer(tag_set, symbols)
hmm = trainer.train_supervised(
    labelled_sequences[10:],
    estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins),
)

print('Testing...')
hmm.test(labelled_sequences[:10], verbose=True)

()
HMM POS tagging demo
()
Training HMM...
Testing...
Test: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Untagged: the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

HMM-tagged: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Entropy: 18.7331739704

------------------------------------------------------------
Test: the/AT jury/NN further/RBR said/VBD in/IN term-end/NN presentments/NNS that/CS the/AT city/NN executive/JJ committee/NN ,/, which/WDT had/HVD over-all/JJ charge/NN of/IN the/AT election/N

In [3]:
labelled_sequences[:2]

[[(u'the', u'AT'),
  (u'fulton', u'NP'),
  (u'county', u'NN'),
  (u'grand', u'JJ'),
  (u'jury', u'NN'),
  (u'said', u'VBD'),
  (u'friday', u'NR'),
  (u'an', u'AT'),
  (u'investigation', u'NN'),
  (u'of', u'IN'),
  (u"atlanta's", u'NP$'),
  (u'recent', u'JJ'),
  (u'primary', u'NN'),
  (u'election', u'NN'),
  (u'produced', u'VBD'),
  (u'``', u'``'),
  (u'no', u'AT'),
  (u'evidence', u'NN'),
  (u"''", u"''"),
  (u'that', u'CS'),
  (u'any', u'DTI'),
  (u'irregularities', u'NNS'),
  (u'took', u'VBD'),
  (u'place', u'NN'),
  (u'.', u'.')],
 [(u'the', u'AT'),
  (u'jury', u'NN'),
  (u'further', u'RBR'),
  (u'said', u'VBD'),
  (u'in', u'IN'),
  (u'term-end', u'NN'),
  (u'presentments', u'NNS'),
  (u'that', u'CS'),
  (u'the', u'AT'),
  (u'city', u'NN'),
  (u'executive', u'JJ'),
  (u'committee', u'NN'),
  (u',', u','),
  (u'which', u'WDT'),
  (u'had', u'HVD'),
  (u'over-all', u'JJ'),
  (u'charge', u'NN'),
  (u'of', u'IN'),
  (u'the', u'AT'),
  (u'election', u'NN'),
  (u',', u','),
  (u'``', u'``'

In [4]:
hmm.test(labelled_sequences[:1], verbose=True)

Test: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Untagged: the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

HMM-tagged: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Entropy: 18.7331739704

------------------------------------------------------------
accuracy over 25 tokens: 100.00


In [5]:
hmm.test(labelled_sequences[:1])

accuracy over 25 tokens: 100.00


In [6]:
tag_set

[u'BE',
 u'NP$',
 u'CS',
 u'WDT',
 u'JJ',
 u'AP$',
 u'RP',
 u'(',
 u'FW',
 u',',
 u'RB',
 u'NNS',
 u'WRB',
 u'NN$',
 u'PPLS',
 u'NNS$',
 u'--',
 u'OD',
 u'PP$$',
 u'NPS',
 u'DTI',
 u'BEN',
 u'BEM',
 u'HV',
 u'BEG',
 u'BED',
 u'HVD',
 u'BEZ',
 u'DTX',
 u'NPS$',
 u'DTS',
 u"'",
 u'WQL',
 u'BEDZ',
 u'PN',
 u'EX',
 u'MD',
 u'DT$',
 u'UH',
 u'CD$',
 u'VBG',
 u'VBD',
 u'VBN',
 u'DOD',
 u'DOZ',
 u'VBZ',
 u'.',
 u'NN',
 u'*',
 u'WPS',
 u'WPO',
 u'NP',
 u'NR',
 u':',
 u'``',
 u'CC',
 u'CD',
 u'WP$',
 u'HVN',
 u'JJT',
 u'JJS',
 u'JJR',
 u'HVG',
 u'HVZ',
 u'DO',
 u'PPSS',
 u'NR$',
 u"''",
 u'BER',
 u'DT',
 u'PN$',
 u')',
 u'PPO',
 u'PPL',
 u'PPS',
 u'TO',
 u'RB$',
 u'VB',
 u'PP$',
 u'RBT',
 u'ABL',
 u'RBR',
 u'ABN',
 u'AP',
 u'AT',
 u'IN',
 u'ABX',
 u'QLP',
 u'QL']

In [7]:
symbols

[u'sunbonnet',
 u'yellow',
 u'narcotic',
 u'four',
 u'woods',
 u'boogie',
 u'railing',
 u'francesca',
 u'aggression',
 u'marching',
 u'someplace',
 u'augustine',
 u'eligible',
 u'electricity',
 u'$25-a-plate',
 u'consulate',
 u'sanantonio',
 u'all-county',
 u'averell',
 u'lord',
 u'1959-60',
 u'sinking',
 u'1,119',
 u'co-operation',
 u'desired',
 u'regional',
 u'fortier',
 u'appropriation',
 u'leisurely',
 u'unify',
 u'bringing',
 u'lumia',
 u'durocher',
 u'prize',
 u'wooden',
 u'clientele',
 u'963',
 u'wednesday',
 u'journal-bulletin',
 u'raoul',
 u'specialties',
 u'succession',
 u'debuting',
 u'commented',
 u'charter',
 u'tired',
 u'miller',
 u'pulse',
 u'tires',
 u'271',
 u'second',
 u'273',
 u'sustaining',
 u'melvin',
 u'errors',
 u'forgetting',
 u'ruthless',
 u"u.'s",
 u'contributed',
 u"communism's",
 u'designing',
 u'increasing',
 u'hero',
 u'whose',
 u'avery',
 u'herb',
 u'fronts',
 u'here',
 u'reported',
 u'china',
 u'affiliated',
 u'doldrums',
 u'cyclical',
 u'kids',
 u'oxfor

In [13]:
_TEXT = 0
def _untag(sentences):
    unlabeled = []
    for sentence in sentences:
        unlabeled.append([(token[_TEXT], None) for token in sentence])
    return unlabeled

def demo_pos_bw(
    test=10, supervised=20, unsupervised=10, verbose=True, max_iterations=5
):
    # demonstrates the Baum-Welch algorithm in POS tagging

    print()
    print("Baum-Welch demo for POS tagging")
    print()

    print('Training HMM (supervised, %d sentences)...' % supervised)

    sentences, tag_set, symbols = load_pos(test + supervised + unsupervised)

    symbols = set()
    for sentence in sentences:
        for token in sentence:
            symbols.add(token[_TEXT])

    trainer = HiddenMarkovModelTrainer(tag_set, list(symbols))
#     hmm = trainer.train_supervised(
#         sentences[test : test + supervised],
#         estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins),
#     )

#     hmm.test(sentences[:test], verbose=verbose)

    print('Training (unsupervised, %d sentences)...' % unsupervised)
    # it's rather slow - so only use 10 samples by default
    unlabeled = _untag(sentences[test + supervised :])
    print(unlabeled[0])
    hmm = trainer.train_unsupervised(
        unlabeled, max_iterations=max_iterations
    )
    hmm.test(sentences[:test], verbose=verbose)

demo_pos_bw()

()
Baum-Welch demo for POS tagging
()
Training HMM (supervised, 20 sentences)...
Training (unsupervised, 10 sentences)...
[(u'mayor', None), (u'william', None), (u'b.', None), (u'hartsfield', None), (u'filed', None), (u'suit', None), (u'for', None), (u'divorce', None), (u'from', None), (u'his', None), (u'wife', None), (u',', None), (u'pearl', None), (u'williams', None), (u'hartsfield', None), (u',', None), (u'in', None), (u'fulton', None), (u'superior', None), (u'court', None), (u'friday', None), (u'.', None)]


  B_numer[i] = np.logaddexp2(B_numer[i], seq_B_numer[i] - lpk)


iteration 0 logprob -1271.45168234
iteration 1 logprob -898.231520722
iteration 2 logprob -894.900970355
iteration 3 logprob -891.214340108
iteration 4 logprob -885.94389566
Test: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Untagged: the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

HMM-tagged: the/CD fulton/BEZ county/DO grand/DO jury/DO said/DO friday/DO an/DO investigation/DO of/DO atlanta's/DO recent/DO primary/DO election/DO produced/DO ``/DO no/DO evidence/DO ''/DO that/DO any/DO irregularities/DO took/DO place/DO ./DO



  return np.log2(np.sum(2 ** (arr - max_))) + max_


Entropy: nan

------------------------------------------------------------
Test: the/AT jury/NN further/RBR said/VBD in/IN term-end/NN presentments/NNS that/CS the/AT city/NN executive/JJ committee/NN ,/, which/WDT had/HVD over-all/JJ charge/NN of/IN the/AT election/NN ,/, ``/`` deserves/VBZ the/AT praise/NN and/CC thanks/NNS of/IN the/AT city/NN of/IN atlanta/NP ''/'' for/IN the/AT manner/NN in/IN which/WDT the/AT election/NN was/BEDZ conducted/VBN ./.

Untagged: the jury further said in term-end presentments that the city executive committee , which had over-all charge of the election , `` deserves the praise and thanks of the city of atlanta '' for the manner in which the election was conducted .

HMM-tagged: the/ABX jury/DO further/DO said/DO in/DO term-end/DO presentments/DO that/DO the/DO city/DO executive/DO committee/DO ,/DO which/DO had/DO over-all/DO charge/DO of/DO the/DO election/DO ,/DO ``/DO deserves/DO the/DO praise/DO and/DO thanks/DO of/DO the/DO city/DO of/DO atlanta/

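1. The structure of the unlabeled data passed to hmm train_unsupervised differs from what is described at http://www.nltk.org/_modules/nltk/tag/hmm.html: each sentence is a list of (token, None) tuples. See _untag for the details.
2. symbols must match the tokens used as training input, including their string encoding; otherwise a KeyError is raised.

# Open questions
1. Why does an HMM have both supervised training and unsupervised training? As I understood it, it should only involve unsupervised statistics.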
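
One possible way to look at this question: when the hidden states (tags) are given in the training data, the parameters can be estimated simply by counting and smoothing, which is the supervised case. When the tags are missing, the counts cannot be collected directly, so Baum-Welch estimates expected counts iteratively, which is the unsupervised case. The sketch below illustrates the counting view only; `train_supervised_by_counting` is a hypothetical helper, and the smoothing value 0.1 is chosen to mirror the LidstoneProbDist(fd, 0.1, bins) estimator used earlier. NLTK's actual implementation may differ in its details.

```python
from collections import Counter, defaultdict


def train_supervised_by_counting(tagged_sentences, gamma=0.1):
    """Estimate (initial, transition, output) distributions from tagged data.

    Uses Lidstone smoothing: (count + gamma) / (total + gamma * bins).
    """
    tags = sorted({tag for sent in tagged_sentences for _, tag in sent})
    words = sorted({word for sent in tagged_sentences for word, _ in sent})

    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sent in tagged_sentences:
        start[sent[0][1]] += 1                      # tag of the first token
        for word, tag in sent:
            emit[tag][word] += 1                    # tag -> word counts
        for (_, prev), (_, cur) in zip(sent, sent[1:]):
            trans[prev][cur] += 1                   # tag -> next tag counts

    def lidstone(counts, outcomes):
        total = sum(counts.values()) + gamma * len(outcomes)
        return {o: (counts[o] + gamma) / total for o in outcomes}

    return (
        lidstone(start, tags),
        {t: lidstone(trans[t], tags) for t in tags},
        {t: lidstone(emit[t], words) for t in tags},
    )


toy_data = [
    [("the", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("the", "AT"), ("election", "NN")],
]
initial, transitions, outputs = train_supervised_by_counting(toy_data)
print(transitions["AT"])
```

With labelled data no iteration is needed; every probability comes from a single pass over the corpus.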