In [1]:
#!/usr/bin/env python3

This file illustrates how you might experiment with the HMM and CRF interfaces.
You can paste these commands in at the Python prompt, or execute `test_en.py` directly.
A notebook interface is nicer than the plain Python prompt, so we provide
a notebook version of this file as `test_en.ipynb`, which you can open with
`jupyter` or with Visual Studio `code` (run it with the `nlp-class` kernel).

In [2]:
import logging
import math
import os
from pathlib import Path

In [3]:
from corpus import TaggedCorpus
from eval import eval_tagging, model_cross_entropy, viterbi_error_rate
from hmm import HiddenMarkovModel
from crf import ConditionalRandomField

Set up logging.

In [4]:
logging.root.setLevel(level=logging.INFO)
log = logging.getLogger("test_en")       # For usage, see findsim.py in earlier assignment.
logging.basicConfig(format="%(levelname)s : %(message)s", level=logging.INFO)  # could change INFO to DEBUG

Switch working directory to the directory where the data live.  You may need to edit this line.

In [5]:
os.chdir("../data")

In [6]:
entrain = TaggedCorpus(Path("ensup"), Path("enraw"))                               # all training
ensup =   TaggedCorpus(Path("ensup"), tagset=entrain.tagset, vocab=entrain.vocab)  # supervised training
endev =   TaggedCorpus(Path("endev"), tagset=entrain.tagset, vocab=entrain.vocab)  # evaluation
print(f"{len(entrain)=}  {len(ensup)=}  {len(endev)=}")

INFO : Read 191873 tokens from ensup, enraw
INFO : Created 26 tag types
INFO : Created 18461 word types


len(entrain)=8064  len(ensup)=4051  len(endev)=996


In [7]:
known_vocab = TaggedCorpus(Path("ensup")).vocab    # words seen with supervised tags; used in evaluation
log.info(f"Tagset: {list(entrain.tagset)}")

INFO : Read 95936 tokens from ensup
INFO : Created 26 tag types
INFO : Created 12466 word types
INFO : Tagset: ['W', 'J', 'N', 'C', 'V', 'I', 'D', ',', 'M', 'P', '.', 'E', 'R', '`', "'", 'T', '$', ':', '-', '#', 'S', 'F', 'U', 'L', '_EOS_TAG_', '_BOS_TAG_']


Make an HMM.  Let's do some pre-training to approximately maximize the
regularized log-likelihood on supervised training data.  In other words, the
probabilities at the M step will just be smoothed supervised count ratios.

On each epoch, you will see two progress bars: first it collects counts from
all the sentences (E step), and then after the M step, it evaluates the loss
function, which is the (unregularized) cross-entropy on the training set.

The parameters don't actually matter during the E step because there are no
hidden tags to impute.  The first M step will jump right to the optimal
solution.  The code will try a second epoch with the revised parameters, but
the result will be identical, so it will detect convergence and stop.

We arbitrarily choose λ=1 for our add-λ smoothing at the M step, but it would
be better to search for the best value of this hyperparameter.
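
To make the add-λ M step concrete, here is a minimal sketch of how smoothed count ratios could be computed for one conditional distribution. The function name and the toy counts are illustrative assumptions; the real `hmm.py` may organize its counts differently (for example, with special handling of the BOS/EOS tags).

```python
from collections import Counter

def add_lambda_probs(counts, outcomes, lam=1.0):
    """Smoothed count ratios: (count + lam) / (total + lam * number of outcomes)."""
    total = sum(counts[o] for o in outcomes)
    denom = total + lam * len(outcomes)
    return {o: (counts[o] + lam) / denom for o in outcomes}

# Hypothetical counts of which tag follows "D" in some supervised data.
tags = ["N", "V", "D"]
next_tag_counts = Counter({"N": 8, "V": 1})   # "D" was never followed by "D"

probs = add_lambda_probs(next_tag_counts, tags, lam=1.0)
for tag in tags:
    print(f"p({tag} | D) = {probs[tag]:.4f}")
```

With λ=0 this reduces to plain count ratios, so the unseen transition would get probability 0; larger λ pulls the distribution toward uniform.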

In [9]:
log.info("*** Hidden Markov Model (HMM)")
hmm = HiddenMarkovModel(entrain.tagset, entrain.vocab)  # randomly initialized parameters  
loss_sup = lambda model: model_cross_entropy(model, eval_corpus=ensup)
hmm.train(corpus=ensup, loss=loss_sup, λ=1.0,
          save_path="en_hmm.pkl") 

INFO : *** Hidden Markov Model (HMM)
100%|██████████| 4051/4051 [00:14<00:00, 274.26it/s]
INFO : Cross-entropy: 12.6439 nats (= perplexity 309875.493)
100%|██████████| 4051/4051 [00:38<00:00, 104.92it/s]
100%|██████████| 4051/4051 [00:13<00:00, 290.43it/s]
INFO : Cross-entropy: 7.4505 nats (= perplexity 1720.751)
100%|██████████| 4051/4051 [00:41<00:00, 97.10it/s] 
100%|██████████| 4051/4051 [00:15<00:00, 264.05it/s]
INFO : Cross-entropy: 7.4505 nats (= perplexity 1720.764)
INFO : Saved model to en_hmm.pkl
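
The `Cross-entropy ... (= perplexity ...)` lines above give two views of the same quantity: perplexity is the exponential of the per-token cross-entropy when that cross-entropy is measured in nats. A minimal sketch of the conversions, using only `math` (the helper names are illustrative and not part of `eval.py`):

```python
import math

def perplexity_from_nats(xent_nats: float) -> float:
    """Per-token perplexity for a cross-entropy measured in nats per token."""
    return math.exp(xent_nats)

def nats_to_bits(xent_nats: float) -> float:
    """Convert a cross-entropy from nats to bits (divide by ln 2)."""
    return xent_nats / math.log(2)

xent_nats = 7.4505   # converged supervised cross-entropy from the log above
print(f"perplexity:    {perplexity_from_nats(xent_nats):.1f}")
print(f"cross-entropy: {nats_to_bits(xent_nats):.4f} bits per token")
```

The logged perplexity differs slightly from `exp(7.4505)` because the logged cross-entropy is rounded to four decimal places.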


Now let's throw in the unsupervised training data as well, and continue
training as before, in order to increase the regularized log-likelihood on
this larger, semi-supervised training set.  It's now the *incomplete-data*
log-likelihood.

This time, we'll use a different evaluation loss function: we'll stop when the
*tagging error rate* on a held-out dev set stops getting better.  Also, the
implementation of this loss function (`viterbi_error_rate`) includes a helpful
side effect: it logs the *cross-entropy* on the held-out dataset as well, just
for your information.

We hope that held-out tagging accuracy will go up for a little bit before it
goes down again (see Merialdo 1994). (Log-likelihood on training data will
continue to improve, and that improvement may generalize to held-out
cross-entropy.  But getting accuracy to increase is harder.)

In [11]:
hmm = HiddenMarkovModel.load("en_hmm.pkl")  # reset to supervised model (in case you're re-executing this bit)
loss_dev = lambda model: viterbi_error_rate(model, eval_corpus=endev, 
                                            known_vocab=known_vocab)
hmm.train(corpus=entrain, loss=loss_dev, λ=1.0,
          save_path="en_hmm_raw.pkl")

INFO : Loaded model from en_hmm.pkl
100%|██████████| 996/996 [00:04<00:00, 207.06it/s]
INFO : Cross-entropy: 7.5995 nats (= perplexity 1997.178)
100%|██████████| 996/996 [00:05<00:00, 166.80it/s]
INFO : Tagging accuracy: all: 88.663%, known: 93.059%, seen: 44.108%, novel: 42.734%
100%|██████████| 8064/8064 [01:22<00:00, 97.38it/s] 
100%|██████████| 996/996 [00:04<00:00, 210.13it/s]
INFO : Cross-entropy: 7.3486 nats (= perplexity 1553.990)
100%|██████████| 996/996 [00:05<00:00, 168.23it/s]
INFO : Tagging accuracy: all: 87.035%, known: 91.397%, seen: 45.791%, novel: 40.291%
INFO : Saved model to en_hmm_raw.pkl


You can also retry the above workflow where you start with a worse supervised
model (like Merialdo).  Does EM help more in that case?  It's easiest to rerun
exactly the code above, but first make the `ensup` file smaller by copying
`ensup-tiny` over it.  `ensup-tiny` is only 25 sentences (that happen to cover
all tags in `endev`).  Back up your old `ensup` and your old `*.pkl` models
before you do this.

Let's take a more detailed look at the first 10 sentences in the held-out
corpus, including their Viterbi taggings.

In [12]:
def look_at_your_data(model, dev, N):
    """Log gold tags, Viterbi tags, error counts, and cross-entropy for the first N sentences of dev."""
    for m, sentence in enumerate(dev):
        if m >= N:
            break
        # Tag the sentence using the corpus passed in as `dev`.
        viterbi = model.viterbi_tagging(sentence.desupervise(), dev)
        counts = eval_tagging(predicted=viterbi, gold=sentence, 
                              known_vocab=known_vocab)
        num = counts['NUM', 'ALL']
        denom = counts['DENOM', 'ALL']
        
        log.info(f"Gold:    {sentence}")
        log.info(f"Viterbi: {viterbi}")
        log.info(f"Loss:    {denom - num}/{denom}")
        xent = -model.logprob(sentence, dev) / len(sentence)  # measured in nats
        # Dividing by ln 2 converts nats to bits; perplexity is exp of the value in nats.
        log.info(f"Cross-entropy: {xent/math.log(2)} bits (= perplexity {math.exp(xent)})\n---")
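
For intuition about what a Viterbi tagger computes, here is a minimal log-space sketch on a toy three-tag HMM. It is an assumption-laden illustration, not the `viterbi_tagging` method of `hmm.py`: the probability tables are invented, there are no BOS/EOS tags, and the input must be non-empty.

```python
import math

def viterbi(words, tags, log_init, log_trans, log_emit):
    """Return the highest-scoring tag sequence for words (all scores are log-probs)."""
    # best[j][t]: score of the best tag sequence for words[:j+1] that ends in tag t
    best = [{t: log_init[t] + log_emit[t][words[0]] for t in tags}]
    backpointers = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: best[-1][s] + log_trans[s][t])
            scores[t] = best[-1][prev] + log_trans[prev][t] + log_emit[t][w]
            pointers[t] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Start from the best final tag and follow backpointers to the front.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

def logs(table):
    """Take logs of a dict of probabilities, recursing into nested dicts."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v) for k, v in table.items()}

# Invented toy parameters (each row sums to 1).
tags = ["D", "N", "V"]
log_init = logs({"D": 0.8, "N": 0.1, "V": 0.1})
log_trans = logs({"D": {"D": 0.05, "N": 0.9, "V": 0.05},
                  "N": {"D": 0.1, "N": 0.3, "V": 0.6},
                  "V": {"D": 0.5, "N": 0.3, "V": 0.2}})
log_emit = logs({"D": {"the": 0.9, "dog": 0.05, "barks": 0.05},
                 "N": {"the": 0.05, "dog": 0.6, "barks": 0.35},
                 "V": {"the": 0.05, "dog": 0.15, "barks": 0.8}})

best_path = viterbi(["the", "dog", "barks"], tags, log_init, log_trans, log_emit)
print(best_path)
```

Working in log space turns products of probabilities into sums, which avoids underflow on long sentences.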

In [13]:
look_at_your_data(hmm, endev, 10)

INFO : Gold:    ``/` We/P 're/V strongly/R _OOV_/V that/I anyone/N who/W has/V eaten/V in/I the/D cafeteria/N this/D month/N have/V the/D shot/N ,/, ''/' Mr./N Mattausch/N added/V ,/, ``/` and/C that/D means/V virtually/R everyone/N who/W works/V here/R ./.
INFO : Viterbi: ``/` We/P 're/V strongly/R _OOV_/V that/I anyone/N who/W has/V eaten/V in/I the/D cafeteria/N this/D month/N have/V the/D shot/N ,/, ''/' Mr./N Mattausch/T added/V ,/, ``/` and/C that/I means/V virtually/R everyone/, who/W works/V here/R ./.
INFO : Loss:    3/34
INFO : Cross-entropy: 10.617973327636719 bits (= perplexity 1571.5512353802962)
---
INFO : Gold:    I/P was/V _OOV_/V to/T read/V the/D _OOV_/N of/I facts/N in/I your/P Oct./N 13/C editorial/N ``/` _OOV_/N 's/P _OOV_/N _OOV_/N ./. ''/'
INFO : Viterbi: I/P was/V _OOV_/V to/T read/V the/D _OOV_/N of/I facts/N in/I your/P Oct./N 13/C editorial/, ``/` _OOV_/P 's/V _OOV_/D _OOV_/N ./. ''/'
INFO : Loss:    4/21
INFO : Cross-entropy: 10.876396179199219 bits (= perpl

Now let's try supervised training of a CRF (this doesn't use the unsupervised
part of the data, so it is comparable to the supervised pre-training we did
for the HMM).  We will use SGD to approximately maximize the regularized
log-likelihood. 

As with the semi-supervised HMM training, we'll periodically evaluate the
tagging accuracy (and also print the cross-entropy) on a held-out dev set.
We use the default `eval_interval` and `tolerance`.  If you want to stop
sooner, then you could increase the `tolerance` so the training method decides
sooner that it has converged.
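
The role of `tolerance` can be sketched as follows. This is only an assumed stopping rule (stop when the held-out loss fails to improve by more than a relative `tolerance`); the actual `train` methods may test convergence differently, and `train_with_early_stopping`, `step`, and `loss` are hypothetical names.

```python
def train_with_early_stopping(step, loss, tolerance=0.001, max_evals=50):
    """Alternate training and evaluation until the loss stops improving.

    step:  callable that does some training (e.g. eval_interval minibatches)
    loss:  callable that returns the current held-out loss
    Returns the list of losses observed, including the initial one.
    """
    history = [loss()]
    for _ in range(max_evals):
        step()
        new_loss = loss()
        history.append(new_loss)
        # Converged if we did not improve by more than a `tolerance` fraction.
        if new_loss >= history[-2] * (1 - tolerance):
            break
    return history

def run(tolerance):
    """Replay an invented sequence of held-out losses through the loop."""
    fake_losses = iter([0.50, 0.40, 0.39, 0.38, 0.38])
    return train_with_early_stopping(step=lambda: None,
                                     loss=lambda: next(fake_losses),
                                     tolerance=tolerance)

print(run(0.001))   # keeps going while improvements are small but real
print(run(0.05))    # a larger tolerance gives up sooner
```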

We arbitrarily choose reg = 1.0 for L2 regularization, learning rate = 0.05,
and a minibatch size of 10, but it would be better to search for the best
values of these hyperparameters.
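
One simple way to run such a search is a grid search over candidate settings, keeping the one with the lowest held-out error. The sketch below is generic: `grid_search` and `toy_error` are hypothetical, and in practice `evaluate` would train a fresh model with the given settings and return its dev error rate, which is far slower than this stand-in.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every combination in grid; return (lowest error, best settings)."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        settings = dict(zip(names, values))
        error = evaluate(**settings)
        if best is None or error < best[0]:
            best = (error, settings)
    return best

def toy_error(reg, lr):
    """Stand-in for 'train with these settings and return dev error'."""
    return (reg - 1.0) ** 2 + (lr - 0.05) ** 2

best_error, best_settings = grid_search(
    toy_error, {"reg": [0.1, 1.0, 10.0], "lr": [0.01, 0.05, 0.1]})
print(best_settings, best_error)
```

Because the settings are chosen to look good on the dev set, the final comparison between models is more trustworthy on data that was not used for this selection.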

Note that the logger reports the CRF's *conditional* cross-entropy, -log p(tags
| words) / n.  This is much lower than the HMM's *joint* cross-entropy, -log
p(tags, words) / n, but that doesn't mean the CRF is better at tagging.  The
CRF is just predicting less information.
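
The gap between the two numbers follows from the chain rule: -log p(tags, words) = -log p(tags | words) - log p(words), so the joint cost is the conditional cost plus the cost of predicting the words themselves. A toy check on single (word, tag) pairs, using an invented joint distribution rather than either trained model:

```python
import math

# Invented joint distribution over (word, tag) pairs; it sums to 1.
joint = {("the", "D"): 0.4, ("dog", "N"): 0.3, ("dog", "V"): 0.1, ("barks", "V"): 0.2}

def p_word(word):
    """Marginal probability of a word, summing the joint over tags."""
    return sum(p for (w, _), p in joint.items() if w == word)

rows = []
for (word, tag), p in joint.items():
    joint_nats = -math.log(p)                 # -log p(tag, word)
    cond_nats = math.log(p_word(word) / p)    # -log p(tag | word)
    word_nats = -math.log(p_word(word))       # -log p(word)
    rows.append((joint_nats, cond_nats, word_nats))
    print(f"{word}/{tag}: joint {joint_nats:.3f} = "
          f"conditional {cond_nats:.3f} + word {word_nats:.3f}")
```

The conditional cost is never larger than the joint cost, so the two kinds of cross-entropy are not directly comparable across models.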

In [None]:
log.info("*** Conditional Random Field (CRF)\n")
crf = ConditionalRandomField(entrain.tagset, entrain.vocab)  # randomly initialized parameters  
crf.train(corpus=ensup, loss=loss_dev, reg=1.0, lr=0.05, minibatch_size=10,
          save_path="en_crf.pkl")

INFO : *** Conditional Random Field (CRF)

100%|██████████| 996/996 [00:06<00:00, 160.34it/s]
INFO : Cross-entropy: 3.0507 nats (= perplexity 21.131)
100%|██████████| 996/996 [00:04<00:00, 222.31it/s]
INFO : Tagging accuracy: all: 6.764%, known: 6.831%, seen: 4.209%, novel: 6.803%
100%|██████████| 500/500 [00:09<00:00, 51.92it/s]
100%|██████████| 996/996 [00:06<00:00, 155.69it/s]
INFO : Cross-entropy: 0.9112 nats (= perplexity 2.487)
100%|██████████| 996/996 [00:04<00:00, 227.68it/s]
INFO : Tagging accuracy: all: 72.542%, known: 73.513%, seen: 58.754%, novel: 63.937%
100%|██████████| 500/500 [00:09<00:00, 50.71it/s]
100%|██████████| 996/996 [00:06<00:00, 152.10it/s]
INFO : Cross-entropy: 0.7513 nats (= perplexity 2.120)
100%|██████████| 996/996 [00:04<00:00, 217.26it/s]
INFO : Tagging accuracy: all: 75.310%, known: 77.061%, seen: 55.892%, novel: 57.662%
100%|██████████| 500/500 [00:09<00:00, 50.83it/s]
100%|██████████| 996/996 [00:06<00:00, 164.51it/s]
INFO : Cross-entropy: 0.6580 nats

Let's examine how the CRF does on individual sentences. 
(Do you see any error patterns here that would inspire additional CRF features?)

In [15]:
look_at_your_data(crf, endev, 10)

INFO : Gold:    ``/` We/P 're/V strongly/R _OOV_/V that/I anyone/N who/W has/V eaten/V in/I the/D cafeteria/N this/D month/N have/V the/D shot/N ,/, ''/' Mr./N Mattausch/N added/V ,/, ``/` and/C that/D means/V virtually/R everyone/N who/W works/V here/R ./.
INFO : Viterbi: ``/` We/P 're/V strongly/J _OOV_/N that/I anyone/N who/W has/V eaten/N in/I the/D cafeteria/N this/D month/N have/V the/D shot/N ,/, ''/' Mr./N Mattausch/N added/N ,/, ``/` and/C that/I means/J virtually/N everyone/N who/W works/V here/R ./.
INFO : Loss:    7/34
INFO : Cross-entropy: 0.7668604254722595 bits (= perplexity 1.7015628106627378)
---
INFO : Gold:    I/P was/V _OOV_/V to/T read/V the/D _OOV_/N of/I facts/N in/I your/P Oct./N 13/C editorial/N ``/` _OOV_/N 's/P _OOV_/N _OOV_/N ./. ''/'
INFO : Viterbi: I/P was/V _OOV_/V to/T read/V the/D _OOV_/N of/I facts/N in/I your/J Oct./N 13/C editorial/N ``/` _OOV_/N 's/P _OOV_/N _OOV_/N ./. ''/'
INFO : Loss:    1/21
INFO : Cross-entropy: 0.4758842885494232 bits (= perpl

In [16]:
crf = ConditionalRandomField.load("en_crf.pkl")  # reset to supervised CRF (in case you're re-executing this bit)
loss_dev = lambda model: viterbi_error_rate(model, eval_corpus=endev, 
                                            known_vocab=known_vocab)
crf.train(corpus=entrain, loss=loss_dev, reg=1.0, lr=0.05, minibatch_size=10,
          save_path="en_crf_raw.pkl")

INFO : Loaded model from en_crf.pkl
100%|██████████| 996/996 [00:10<00:00, 98.56it/s] 
INFO : Cross-entropy: 0.3986 nats (= perplexity 1.490)
100%|██████████| 996/996 [00:05<00:00, 182.65it/s]
INFO : Tagging accuracy: all: 86.283%, known: 88.490%, seen: 62.963%, novel: 63.606%
100%|██████████| 500/500 [00:10<00:00, 46.07it/s]
100%|██████████| 996/996 [00:10<00:00, 94.32it/s] 
INFO : Cross-entropy: 0.3972 nats (= perplexity 1.488)
100%|██████████| 996/996 [00:05<00:00, 181.47it/s]
INFO : Tagging accuracy: all: 86.350%, known: 88.284%, seen: 64.983%, novel: 66.843%
100%|██████████| 500/500 [00:10<00:00, 49.15it/s]
100%|██████████| 996/996 [00:10<00:00, 97.72it/s] 
INFO : Cross-entropy: 0.3918 nats (= perplexity 1.480)
100%|██████████| 996/996 [00:05<00:00, 183.46it/s]
INFO : Tagging accuracy: all: 86.964%, known: 89.103%, seen: 63.468%, novel: 65.324%
100%|██████████| 500/500 [00:10<00:00, 49.68it/s]
100%|██████████| 996/996 [00:09<00:00, 103.99it/s]
INFO : Cross-entropy: 0.3867 nats (= 