# Project: Part of Speech Tagging with Hidden Markov Models 
---
### Introduction

Part of speech tagging is the process of determining the syntactic category of a word from the words in its surrounding context. It is often used to help disambiguate natural language phrases because it can be done quickly with high accuracy. Tagging can be used for many NLP tasks like determining correct pronunciation during speech synthesis (for example, _dis_-count as a noun vs dis-_count_ as a verb), for information retrieval, and for word sense disambiguation.

In this notebook, you'll use the [Pomegranate](http://pomegranate.readthedocs.io/) library to build a hidden Markov model for part of speech tagging using a "universal" tagset. Hidden Markov models have been able to achieve [>96% tag accuracy with larger tagsets on realistic text corpora](http://www.coli.uni-saarland.de/~thorsten/publications/Brants-ANLP00.pdf). Hidden Markov models have also been used for speech recognition and speech generation, machine translation, gene recognition for bioinformatics, and human gesture recognition for computer vision, and more. 

![](_post-hmm.png)

The notebook already contains some code to get you started. You only need to add some new functionality in the areas indicated to complete the project; you will not need to modify the included code beyond what is requested. Sections that begin with **'IMPLEMENTATION'** in the header indicate that you must provide code in the block that follows. Instructions will be provided for each section, and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!

<div class="alert alert-block alert-info">
**Note:** Once you have completed all of the code implementations, you need to finalize your work by exporting the iPython Notebook as an HTML document. Before exporting the notebook to html, all of the code cells need to have been run so that reviewers can see the final implementation and output. You must then **export the notebook** by running the last cell in the notebook, or by using the menu above and navigating to **File -> Download as -> HTML (.html)** Your submissions should include both the `html` and `ipynb` files.
</div>

<div class="alert alert-block alert-info">
**Note:** Code and Markdown cells can be executed using the `Shift + Enter` keyboard shortcut. Markdown cells can be edited by double-clicking the cell to enter edit mode.
</div>

### The Road Ahead
You must complete Steps 1-3 below to pass the project. The section on Step 4 includes references & resources you can use to further explore HMM taggers.

- [Step 1](#Step-1:-Read-and-preprocess-the-dataset): Review the provided interface to load and access the text corpus
- [Step 2](#Step-2:-Build-a-Most-Frequent-Class-tagger): Build a Most Frequent Class tagger to use as a baseline
- [Step 3](#Step-3:-Build-an-HMM-tagger): Build an HMM Part of Speech tagger and compare to the MFC baseline
- [Step 4](#Step-4:-[Optional]-Improving-model-performance): (Optional) Improve the HMM tagger

<div class="alert alert-block alert-warning">
**Note:** Make sure you have selected a **Python 3** kernel in Workspaces or the hmm-tagger conda environment if you are running the Jupyter server on your own machine.
</div>

In [1]:
# Jupyter "magic methods" -- only need to be run once per kernel restart
%load_ext autoreload
%aimport helpers, tests
%autoreload 1

In [2]:
# import python modules -- this cell needs to be run again if you make changes to any of the files
import matplotlib.pyplot as plt
import numpy as np

from IPython.core.display import HTML
from itertools import chain
from collections import Counter, defaultdict
from helpers import show_model, Dataset
from pomegranate import State, HiddenMarkovModel, DiscreteDistribution

## Step 1: Read and preprocess the dataset
---
We'll start by reading in a text corpus and splitting it into a training and testing dataset. The data set is a copy of the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) (originally from the [NLTK](https://www.nltk.org/) library) that has already been pre-processed to only include the [universal tagset](https://arxiv.org/pdf/1104.2086.pdf). You should expect to get slightly higher accuracy using this simplified tagset than the same model would achieve on a larger tagset like the full [Penn treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), but the process you'll follow would be the same.

The `Dataset` class provided in helpers.py will read and parse the corpus. You can generate your own datasets compatible with the reader by writing them to the following format. The dataset is stored in plaintext as a collection of words and corresponding tags. Each sentence starts with a unique identifier on the first line, followed by one tab-separated word/tag pair on each following line. Sentences are separated by a single blank line.

Example from the Brown corpus. 
```
b100-38532
Perhaps	ADV
it	PRON
was	VERB
right	ADJ
;	.
;	.

b100-35577
...
```

In [3]:
data = Dataset("tags-universal.txt", "brown-universal.txt", train_test_split=0.8)

print("There are {} sentences in the corpus.".format(len(data)))
print("There are {} sentences in the training set.".format(len(data.training_set)))
print("There are {} sentences in the testing set.".format(len(data.testing_set)))

assert len(data) == len(data.training_set) + len(data.testing_set), \
       "The number of sentences in the training set + testing set should sum to the number of sentences in the corpus"

There are 57340 sentences in the corpus.
There are 45872 sentences in the training set.
There are 11468 sentences in the testing set.


In [4]:
data.X[0]

('Mr.',
 'Podger',
 'had',
 'thanked',
 'him',
 'gravely',
 ',',
 'and',
 'now',
 'he',
 'made',
 'use',
 'of',
 'the',
 'advice',
 '.')

### The Dataset Interface

You can access (mostly) immutable references to the dataset through a simple interface provided through the `Dataset` class, which represents an iterable collection of sentences along with easy access to partitions of the data for training & testing. Review the reference below, then run and review the next few cells to make sure you understand the interface before moving on to the next step.

```
Dataset-only Attributes:
    training_set - reference to a Subset object containing the samples for training
    testing_set - reference to a Subset object containing the samples for testing

Dataset & Subset Attributes:
    sentences - a dictionary with an entry {sentence_key: Sentence()} for each sentence in the corpus
    keys - an immutable ordered (not sorted) collection of the sentence_keys for the corpus
    vocab - an immutable collection of the unique words in the corpus
    tagset - an immutable collection of the unique tags in the corpus
    X - returns an array of words grouped by sentences ((w11, w12, w13, ...), (w21, w22, w23, ...), ...)
    Y - returns an array of tags grouped by sentences ((t11, t12, t13, ...), (t21, t22, t23, ...), ...)
    N - returns the number of distinct samples (individual words or tags) in the dataset

Methods:
    stream() - returns an flat iterable over all (word, tag) pairs across all sentences in the corpus
    __iter__() - returns an iterable over the data as (sentence_key, Sentence()) pairs
    __len__() - returns the nubmer of sentences in the dataset
```

For example, consider a Subset, `subset`, of the sentences `{"s0": Sentence(("See", "Spot", "run"), ("VERB", "NOUN", "VERB")), "s1": Sentence(("Spot", "ran"), ("NOUN", "VERB"))}`. The subset will have these attributes:

```
subset.keys == {"s1", "s0"}  # unordered
subset.vocab == {"See", "run", "ran", "Spot"}  # unordered
subset.tagset == {"VERB", "NOUN"}  # unordered
subset.X == (("Spot", "ran"), ("See", "Spot", "run"))  # order matches .keys
subset.Y == (("NOUN", "VERB"), ("VERB", "NOUN", "VERB"))  # order matches .keys
subset.N == 7  # there are a total of seven observations over all sentences
len(subset) == 2  # because there are two sentences
```

<div class="alert alert-block alert-info">
**Note:** The `Dataset` class is _convenient_, but it is **not** efficient. It is not suitable for huge datasets because it stores multiple redundant copies of the same data.
</div>

#### Sentences

`Dataset.sentences` is a dictionary of all sentences in the training corpus, each keyed to a unique sentence identifier. Each `Sentence` is itself an object with two attributes: a tuple of the words in the sentence named `words` and a tuple of the tag corresponding to each word named `tags`.

In [7]:
key = 'b100-38532'
print("Sentence: {}".format(key))
print("words:\n\t{!s}".format(data.sentences[key].words))
print("tags:\n\t{!s}".format(data.sentences[key].tags))

Sentence: b100-38532
words:
	('Perhaps', 'it', 'was', 'right', ';', ';')
tags:
	('ADV', 'PRON', 'VERB', 'ADJ', '.', '.')


<div class="alert alert-block alert-info">
**Note:** The underlying iterable sequence is **unordered** over the sentences in the corpus; it is not guaranteed to return the sentences in a consistent order between calls. Use `Dataset.stream()`, `Dataset.keys`, `Dataset.X`, or `Dataset.Y` attributes if you need ordered access to the data.
</div>

#### Counting Unique Elements

You can access the list of unique words (the dataset vocabulary) via `Dataset.vocab` and the unique list of tags via `Dataset.tagset`.

In [8]:
print("There are a total of {} samples of {} unique words in the corpus."
      .format(data.N, len(data.vocab)))
print("There are {} samples of {} unique words in the training set."
      .format(data.training_set.N, len(data.training_set.vocab)))
print("There are {} samples of {} unique words in the testing set."
      .format(data.testing_set.N, len(data.testing_set.vocab)))
print("There are {} words in the test set that are missing in the training set."
      .format(len(data.testing_set.vocab - data.training_set.vocab)))

assert data.N == data.training_set.N + data.testing_set.N, \
       "The number of training + test samples should sum to the total number of samples"

There are a total of 1161192 samples of 56057 unique words in the corpus.
There are 928458 samples of 50536 unique words in the training set.
There are 232734 samples of 25112 unique words in the testing set.
There are 5521 words in the test set that are missing in the training set.


#### Accessing word and tag Sequences
The `Dataset.X` and `Dataset.Y` attributes provide access to ordered collections of matching word and tag sequences for each sentence in the dataset.

In [9]:
# accessing words with Dataset.X and tags with Dataset.Y 
for i in range(2):    
    print("Sentence {}:".format(i + 1), data.X[i])
    print()
    print("Labels {}:".format(i + 1), data.Y[i])
    print()

Sentence 1: ('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.')

Labels 1: ('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')

Sentence 2: ('But', 'there', 'seemed', 'to', 'be', 'some', 'difference', 'of', 'opinion', 'as', 'to', 'how', 'far', 'the', 'board', 'should', 'go', ',', 'and', 'whose', 'advice', 'it', 'should', 'follow', '.')

Labels 2: ('CONJ', 'PRT', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'ADP', 'ADV', 'ADV', 'DET', 'NOUN', 'VERB', 'VERB', '.', 'CONJ', 'DET', 'NOUN', 'PRON', 'VERB', 'VERB', '.')



#### Accessing (word, tag) Samples
The `Dataset.stream()` method returns an iterator that chains together every pair of (word, tag) entries across all sentences in the entire corpus.

In [10]:
# use Dataset.stream() (word, tag) samples for the entire corpus
print("\nStream (word, tag) pairs:\n")
for i, pair in enumerate(data.stream()):
    print("\t", pair)
    if i > 5: break


Stream (word, tag) pairs:

	 ('Mr.', 'NOUN')
	 ('Podger', 'NOUN')
	 ('had', 'VERB')
	 ('thanked', 'VERB')
	 ('him', 'PRON')
	 ('gravely', 'ADV')
	 (',', '.')



For both our baseline tagger and the HMM model we'll build, we need to estimate the frequency of tags & words from the frequency counts of observations in the training corpus. In the next several cells you will complete functions to compute the counts of several sets of counts. 

## Step 2: Build a Most Frequent Class tagger
---

Perhaps the simplest tagger (and a good baseline for tagger performance) is to simply choose the tag most frequently assigned to each word. This "most frequent class" tagger inspects each observed word in the sequence and assigns it the label that was most often assigned to that word in the corpus.

### IMPLEMENTATION: Pair Counts

Complete the function below that computes the joint frequency counts for two input sequences.

In [11]:
len(data.vocab)

56057

In [12]:
def pair_counts(sequences_A, sequences_B):
    """Return a dictionary keyed to each unique value in the first sequence list
    that counts the number of occurrences of the corresponding value from the
    second sequences list.
    
    For example, if sequences_A is tags and sequences_B is the corresponding
    words, then if 1244 sequences contain the word "time" tagged as a NOUN, then
    you should return a dictionary such that pair_counts[NOUN][time] == 1244
    """
    # TODO: Finish this function!
    big_dict = dict([(key, {}) for key in data.tagset])
    for set_x, set_y in zip(sequences_A, sequences_B):
        for x, y in zip(set_x, set_y):
            if x in big_dict[y]:
                big_dict[y][x] +=1
            else:
                big_dict[y].update({x:1})
    return big_dict
# Calculate C(t_i, w_i)
emission_counts = pair_counts(data.X, data.Y)# TODO: YOUR CODE HERE)

assert len(emission_counts) == 12, \
       "Uh oh. There should be 12 tags in your dictionary."
assert max(emission_counts["NOUN"], key=emission_counts["NOUN"].get) == 'time', \
       "Hmmm...'time' is expected to be the most common NOUN."
HTML('<div class="alert alert-block alert-success">Your emission counts look good!</div>')

### IMPLEMENTATION: Most Frequent Class Tagger

Use the `pair_counts()` function and the training dataset to find the most frequent class label for each word in the training data, and populate the `mfc_table` below. The table keys should be words, and the values should be the appropriate tag string.

The `MFCTagger` class is provided to mock the interface of Pomegranite HMM models so that they can be used interchangeably.

In [87]:
def create_table():
    ''' '''
    new_dict ={}
    for key in data.training_set.vocab:
        new_dict.update({key : {}})
    for x ,y  in data.training_set.stream():
        if y in new_dict[x]:
            new_dict[x][y] +=1
        else:
            new_dict[x].update({y:1})
    final_dict= {}
    for key in new_dict:
        final_dict.update({key:max(new_dict[key], key= lambda k: new_dict[key][k])})
    return final_dict

In [90]:
# Create a lookup table mfc_table where mfc_table[word] contains the tag label most frequently assigned to that word
from collections import namedtuple

FakeState = namedtuple("FakeState", "name")

class MFCTagger:
    # NOTE: You should not need to modify this class or any of its methods
    missing = FakeState(name="<MISSING>")
    
    def __init__(self, table):
        self.table = defaultdict(lambda: MFCTagger.missing)
        self.table.update({word: FakeState(name=tag) for word, tag in table.items()})
        
    def viterbi(self, seq):
        """This method simplifies predictions by matching the Pomegranate viterbi() interface"""
        return 0., list(enumerate(["<start>"] + [self.table[w] for w in seq] + ["<end>"]))


# TODO: calculate the frequency of each tag being assigned to each word (hint: similar, but not
# the same as the emission probabilities) and use it to fill the mfc_table

word_counts = pair_counts(data.X, data.Y) # TODO: YOUR CODE HERE)

mfc_table = create_table() # TODO: YOUR CODE HERE

# DO NOT MODIFY BELOW THIS LINE
mfc_model = MFCTagger(mfc_table) # Create a Most Frequent Class tagger instance

assert len(mfc_table) == len(data.training_set.vocab), ""
assert all(k in data.training_set.vocab for k in mfc_table.keys()), ""
assert sum(int(k not in mfc_table) for k in data.testing_set.vocab) == 5521, ""
HTML('<div class="alert alert-block alert-success">Your MFC tagger has all the correct words!</div>')

### Making Predictions with a Model
The helper functions provided below interface with Pomegranate network models & the mocked MFCTagger to take advantage of the [missing value](http://pomegranate.readthedocs.io/en/latest/nan.html) functionality in Pomegranate through a simple sequence decoding function. Run these functions, then run the next cell to see some of the predictions made by the MFC tagger.

In [91]:
def replace_unknown(sequence):
    """Return a copy of the input sequence where each unknown word is replaced
    by the literal string value 'nan'. Pomegranate will ignore these values
    during computation.
    """
    return [w if w in data.training_set.vocab else 'nan' for w in sequence]

def simplify_decoding(X, model):
    """X should be a 1-D sequence of observations for the model to predict"""
    _, state_path = model.viterbi(replace_unknown(X))
    return [state[1].name for state in state_path[1:-1]]  # do not show the start/end state predictions

### Example Decoding Sequences with MFC Tagger

In [92]:
for key in data.testing_set.keys[:3]:
    print("Sentence Key: {}\n".format(key))
    print("Predicted labels:\n-----------------")
    print(simplify_decoding(data.sentences[key].words, mfc_model))
    print()
    print("Actual labels:\n--------------")
    print(data.sentences[key].tags)
    print("\n")

Sentence Key: b100-28144

Predicted labels:
-----------------
['CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.']

Actual labels:
--------------
('CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.')


Sentence Key: b100-23146

Predicted labels:
-----------------
['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'DET', 'NOUN', '.']

Actual labels:
--------------
('PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'DET', 'NOUN', '.')


Sentence Key: b100-35462

Predicted labels:
-----------------
['DET', 'ADJ', 'NOUN', 'VERB', 'VERB', 'VERB', 'ADP', 'DET', 'ADJ', 'ADJ', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', '.', 'ADP', 'ADJ', 'NOUN', '.', 'CONJ', 'ADP', 'DET', '<MISSING>', 'ADP', 'ADJ', 'ADJ', 

### Evaluating Model Accuracy

The function below will evaluate the accuracy of the MFC tagger on the collection of all sentences from a text corpus. 

In [93]:
def accuracy(X, Y, model):
    """Calculate the prediction accuracy by using the model to decode each sequence
    in the input X and comparing the prediction with the true labels in Y.
    
    The X should be an array whose first dimension is the number of sentences to test,
    and each element of the array should be an iterable of the words in the sequence.
    The arrays X and Y should have the exact same shape.
    
    X = [("See", "Spot", "run"), ("Run", "Spot", "run", "fast"), ...]
    Y = [(), (), ...]
    """
    correct = total_predictions = 0
    for observations, actual_tags in zip(X, Y):
        
        # The model.viterbi call in simplify_decoding will return None if the HMM
        # raises an error (for example, if a test sentence contains a word that
        # is out of vocabulary for the training set). Any exception counts the
        # full sentence as an error (which makes this a conservative estimate).
        try:
            most_likely_tags = simplify_decoding(observations, model)
            correct += sum(p == t for p, t in zip(most_likely_tags, actual_tags))
        except:
            pass
        total_predictions += len(observations)
    return correct / total_predictions

#### Evaluate the accuracy of the MFC tagger
Run the next cell to evaluate the accuracy of the tagger on the training and test corpus.

In [94]:
mfc_training_acc = accuracy(data.training_set.X, data.training_set.Y, mfc_model)
print("training accuracy mfc_model: {:.2f}%".format(100 * mfc_training_acc))

mfc_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, mfc_model)
print("testing accuracy mfc_model: {:.2f}%".format(100 * mfc_testing_acc))

assert mfc_training_acc >= 0.955, "Uh oh. Your MFC accuracy on the training set doesn't look right."
assert mfc_testing_acc >= 0.925, "Uh oh. Your MFC accuracy on the testing set doesn't look right."
HTML('<div class="alert alert-block alert-success">Your MFC tagger accuracy looks correct!</div>')

training accuracy mfc_model: 95.72%
testing accuracy mfc_model: 93.01%


## Step 3: Build an HMM tagger
---
The HMM tagger has one hidden state for each possible tag, and parameterized by two distributions: the emission probabilties giving the conditional probability of observing a given **word** from each hidden state, and the transition probabilities giving the conditional probability of moving between **tags** during the sequence.

We will also estimate the starting probability distribution (the probability of each **tag** being the first tag in a sequence), and the terminal probability distribution (the probability of each **tag** being the last tag in a sequence).

The maximum likelihood estimate of these distributions can be calculated from the frequency counts as described in the following sections where you'll implement functions to count the frequencies, and finally build the model. The HMM model will make predictions according to the formula:

$$t_i^n = \underset{t_i^n}{\mathrm{argmin}} \prod_{i=1}^n P(w_i|t_i) P(t_i|t_{i-1})$$

Refer to Speech & Language Processing [Chapter 10](https://web.stanford.edu/~jurafsky/slp3/10.pdf) for more information.

### IMPLEMENTATION: Unigram Counts

Complete the function below to estimate the co-occurrence frequency of each symbol over all of the input sequences. The unigram probabilities in our HMM model are estimated from the formula below, where N is the total number of samples in the input. (You only need to compute the counts for now.)

$$P(tag_1) = \frac{C(tag_1)}{N}$$

In [105]:
def unigram_counts(sequences):
    """Return a dictionary keyed to each unique value in the input sequence list that
    counts the number of occurrences of the value in the sequences list. The sequences
    collection should be a 2-dimensional array.
    
    For example, if the tag NOUN appears 275558 times over all the input sequences,
    then you should return a dictionary such that your_unigram_counts[NOUN] == 275558.
    """
    # TODO: Finish this function!
    unigram_dict = {}
    for key in data.training_set.tagset:
        unigram_dict.update({key : 0})
    for tags in sequences:
        for tag in tags:
            unigram_dict[tag] +=1
    return unigram_dict

# TODO: call unigram_counts with a list of tag sequences from the training set
tag_unigrams = unigram_counts(data.Y)

assert set(tag_unigrams.keys()) == data.training_set.tagset, \
       "Uh oh. It looks like your tag counts doesn't include all the tags!"
assert min(tag_unigrams, key=tag_unigrams.get) == 'X', \
       "Hmmm...'X' is expected to be the least common class"
assert max(tag_unigrams, key=tag_unigrams.get) == 'NOUN', \
       "Hmmm...'NOUN' is expected to be the most common class"
HTML('<div class="alert alert-block alert-success">Your tag unigrams look good!</div>')

### IMPLEMENTATION: Bigram Counts

Complete the function below to estimate the co-occurrence frequency of each pair of symbols in each of the input sequences. These counts are used in the HMM model to estimate the bigram probability of two tags from the frequency counts according to the formula: $$P(tag_2|tag_1) = \frac{C(tag_2|tag_1)}{C(tag_2)}$$


In [192]:
def bigram_counts(sequences):
    """Return a dictionary keyed to each unique PAIR of values in the input sequences
    list that counts the number of occurrences of pair in the sequences list. The input
    should be a 2-dimensional array.
    
    For example, if the pair of tags (NOUN, VERB) appear 61582 times, then you should
    return a dictionary such that your_bigram_counts[(NOUN, VERB)] == 61582
    """
    
    # TODO: Finish this function!
    biagram_dict = {}
    for tag1 in data.tagset:
        for tag2 in data.tagset:
            biagram_dict.update({(tag1, tag2):0})
    for text in data.Y:
        for i in range(len(text) - 1):
            biagram_dict[tuple((text[i], text[i+1]))] +=1        
    return biagram_dict
    

# TODO: call bigram_counts with a list of tag sequences from the training set
tag_bigrams = bigram_counts(data.training_set.sentences)

assert len(tag_bigrams) == 144, \
       "Uh oh. There should be 144 pairs of bigrams (12 tags x 12 tags)"
assert min(tag_bigrams, key=tag_bigrams.get) in [('X', 'NUM'), ('PRON', 'X')], \
       "Hmmm...The least common bigram should be one of ('X', 'NUM') or ('PRON', 'X')."
assert max(tag_bigrams, key=tag_bigrams.get) in [('DET', 'NOUN')], \
       "Hmmm...('DET', 'NOUN') is expected to be the most common bigram."
HTML('<div class="alert alert-block alert-success">Your tag bigrams look good!</div>')

### IMPLEMENTATION: Sequence Starting Counts
Complete the code below to estimate the bigram probabilities of a sequence starting with each tag.

In [201]:
def starting_counts(sequences):
    """Return a dictionary keyed to each unique value in the input sequences list
    that counts the number of occurrences where that value is at the beginning of
    a sequence.
    
    For example, if 8093 sequences start with NOUN, then you should return a
    dictionary such that your_starting_counts[NOUN] == 8093
    """
    # TODO: Finish this function!
    start_dict = {}
    for key in data.tagset:
        start_dict.update({key : 0})
    for tag in sequences:
        start_dict[tag[0]] +=1
    return start_dict
# TODO: Calculate the count of each tag starting a sequence
tag_starts = starting_counts(data.Y)

assert len(tag_starts) == 12, "Uh oh. There should be 12 tags in your dictionary."
assert min(tag_starts, key=tag_starts.get) == 'X', "Hmmm...'X' is expected to be the least common starting bigram."
assert max(tag_starts, key=tag_starts.get) == 'DET', "Hmmm...'DET' is expected to be the most common starting bigram."
HTML('<div class="alert alert-block alert-success">Your starting tag counts look good!</div>')

### IMPLEMENTATION: Sequence Ending Counts
Complete the function below to estimate the bigram probabilities of a sequence ending with each tag.

In [204]:
def ending_counts(sequences):
    """Return a dictionary keyed to each unique value in the input sequences list
    that counts the number of occurrences where that value is at the end of
    a sequence.
    
    For example, if 18 sequences end with DET, then you should return a
    dictionary such that your_starting_counts[DET] == 18
    """
    # TODO: Finish this function!
    end_dict = {}
    for key in data.tagset:
        end_dict.update({key : 0})
    for tag in sequences:
        end_dict[tag[len(tag)-1]] +=1
    return end_dict

# TODO: Calculate the count of each tag ending a sequence
tag_ends = ending_counts(data.Y)

assert len(tag_ends) == 12, "Uh oh. There should be 12 tags in your dictionary."
assert min(tag_ends, key=tag_ends.get) in ['X', 'CONJ'], "Hmmm...'X' or 'CONJ' should be the least common ending bigram."
assert max(tag_ends, key=tag_ends.get) == '.', "Hmmm...'.' is expected to be the most common ending bigram."
HTML('<div class="alert alert-block alert-success">Your ending tag counts look good!</div>')

### IMPLEMENTATION: Basic HMM Tagger
Use the tag unigrams and bigrams calculated above to construct a hidden Markov tagger.

- Add one state per tag
    - The emission distribution at each state should be estimated with the formula: $P(w|t) = \frac{C(t, w)}{C(t)}$
- Add an edge from the starting state `basic_model.start` to each tag
    - The transition probability should be estimated with the formula: $P(t|start) = \frac{C(start, t)}{C(start)}$
- Add an edge from each tag to the end state `basic_model.end`
    - The transition probability should be estimated with the formula: $P(end|t) = \frac{C(t, end)}{C(t)}$
- Add an edge between _every_ pair of tags
    - The transition probability should be estimated with the formula: $P(t_2|t_1) = \frac{C(t_1, t_2)}{C(t_1)}$

In [206]:
basic_model = HiddenMarkovModel(name="base-hmm-tagger")

In [None]:
s1 = State()

In [229]:
for key in emission_counts:
    for word in emission_counts[key]:
        print(word, emission_counts[key][word])

per 369
durin' 3
As 492
above 169
Till 3
During 132
Whereas 10
upon 431
Along 8
Until 28
except 163
concerning 59
towards 62
Respecting 1
Besides 16
considerin' 1
After 246
inside 77
Than 2
lest 17
unlike 33
till 45
tho' 1
save 5
whether 258
with-but-after 1
Vs. 1
nearer 2
so's 4
: 20
behind 217
But 3
times 3
despite 66
under 632
Fur 1
respecting 1
Next 5
t'hi-im 1
including 165
On 345
According 29
but 128
Despite 38
Upon 20
Depending 1
Unlike 9
o' 15
seeing 1
Save 2
'ceptin' 1
pending 4
lahk 1
around 321
underneath 6
while 480
astride 3
to 10985
albeit 2
Below 7
At 409
notwithstanding 2
- 72
round 4
'bout 1
in 19066
Pending 1
Except 10
up 175
as 5656
over 820
involving 3
providing 3
unto 16
Including 1
inter 1
between 715
out 436
beside 76
Whether 28
With 277
Once 28
whereas 31
Toward 6
Over 21
by 5044
across 250
befoh 1
depending 28
on 5837
Supposing 1
Against 9
than 1788
Out 18
plus 16
By 200
'round 1
vis-a-vis 2
Uh 1
next 37
until 433
pursuant 20
Without 42
without 541
then 2
follo

1887 4
31730 1
1966 4
five-and-a-half 1
203 7
1300 1
1745 1
36-A 1
81 3
1801 4
7,484,268 1
1891 4
140,000 1
230 4
1797 3
5.6 2
sixteen 18
10:30 2
1307 1
268,900 1
Forty-six 2
ten-twelve 1
2:34.2 1
98 3
-.10 1
1200 1
15.0 1
21-2 1
1955 28
0.154 1
1644 1
1690 1
1-3 1
196 2
7-1 6
2,418 1
1310 1
182 1
3300 1
forty-eight 1
1806 1
1820 3
2.44 1
790 1
1639 1
4,122,354 1
60 46
4.0 1
96 5
17.3 1
1720 1
1777 1
Twenty 7
three 553
1925 9
fifty-five 2
1565 2
2.55 1
2 445
1911-1912 1
1642 1
0.4 4
1909-10 1
7.2 1
1639-40 1
920 1
108 4
1981 1
2500 1
2:31.3-:35.3 1
1688 1
1959-60 2
6- 1
two-to-three 1
1692 1
3.190 1
1,338,000 1
'51 1
2433 1
113 3
1640 1
Eighty-Four 3
1947 14
5612 1
hundred-odd 2
'38 1
4.7 1
1687 1
147,000 1
billion 62
127 1
Thirteen 2
11,330 1
13 49
1816 1
87 2
3,500 3
2:33.3 1
55,000 1
430,000 1
48,000 1
1,571 1
2:36 12
151 1
188 2
1514 1
1783 3
1958-60 1
1,450,000 1
389 1
275-300 1
120,000 1
367 1
'60 1
6(j) 1
2:05.2 1
2:33 3
5- 1
08 2
25-30 1
710 1
643 1
1986 3
9.3 1
0.2 4
260 5
9N 

a.m. 27
large 5
improbably 1
blindly 8
nakedly 1
upward 24
weekly 7
scholastically 1
furthermore 3
fantastically 2
warmly 7
slimly 1
keenly 3
Less 5
electronically 1
Happily 1
sweet 1
yet 226
'nuff 1
thrice 1
vaguely 17
Wisely 1
uselessly 3
noticeably 2
majestically 1
customarily 4
quicker 1
counter-clockwise 1
diagonally 4
Also 70
clinically 1
A.M. 13
thenceforth 1
privately 10
naively 1
Entirely 1
honorably 3
subconsciously 4
Functionally 1
some 19
analytically 1
Very 24
louder 1
thereto 11
insofar 5
altruistically 1
gust 1
later 304
atonally 1
horrifyingly 1
meticulously 6
casually 13
whereof 8
believably 1
soothingly 1
according 10
Moreover 67
fiercely 4
yonder 1
mightily 1
capably 1
tenaciously 1
correspondingly 2
normally 34
admiringly 1
o'clock 39
Traditionally 1
glibly 4
therefrom 5
Obligingly 1
natch 1
huskily 1
uptown 1
unreasonably 1
temptingly 1
Acourse 1
indecisively 1
awake 7
extensively 10
industriously 2
partially 25
theoretically 4
Probably 25
sinuously 1
analogously 1

Great's 1
in 465
Blackwell's 1
Morning 1
Baseball's 1
up 1699
She'll 6
Rockabye 1
thet's 1
buzz-buzz-buzz 1
Please 7
what're 1
Myra's 1
Uh-huh 4
Arlene's 1
many 20
Kaboom 1
Yalagaloo 1
Aye-yah-ah-ah 1
damn 3
they'd 24
There'll 1
Wow 1
she'll 4
Sonuvabitch 1
Yesiree 1
Uh 3
we're 35
a-gracious 1
Jee-sus 1
Nobody's 2
Goddammit 1
fire's 2
you've 43
wife's 1
You'll 30
Hurray 1
Mmmm 1
Kitty's 4
presto 1
Aaa-ee 1
Whoa 1
Black's 1
Hell 11
oystchers'll 1
Goolick 1
He'll 14
Knife's 2
t'lah 1
Man 1
Jerusalem 1
insomma 1
Goddamn 1
They've 8
throat's 1
Indeed 2
Umm 1
Atta 2
Hey 11
Somebody'll 1
Nothing's 1
Mack's 2
All 210
damnit 1
O 10
Hello 6
Whyn't 1
Hmpf 1
Sssshoo 1
Hooray 1
They're 28
Ahah 1
diddle 1
eh 3
granite's 1
well 22
That's 97
Say 1
shucks 1
Good-by 1
Arthur's 1
She'd 13
godamit 1
uh-huh 1
sky's 2
Who's 10
hush 1
sun's 1
what's 15
please 8
Off 4
Heat's 1
God 2
Shh 1
camera's 1
over 383
amen 5
pugh 1
money's 1
leg's 1
goodnight 2
Down 2
okay 2
Bullshit 1
we'll 26
How's 9
Them's 2
dammit

radioactive 10
ghostly 2
immutable 1
self-judging 1
Theological 3
homebound 1
gowned 1
life-death 1
analeptic 2
swanlike 1
overhead 7
Systemic 2
oversize 2
Loud 1
eighth 17
Residential 4
lurid 3
innermost 1
fringed-wrapped 1
paranormal 1
straight-backed 1
S-11 1
buccolic 1
physical-chemical 1
unchristian 1
fine-chiseled 1
floppy 1
oblique 1
wily 2
by-passing 1
two-game 1
six-foot 2
memorial 2
wiser 7
selling 1
deductive 3
cross-eyed 1
baseless 1
sq. 4
irresolvable 1
unwrinkled 1
prehistoric 2
uncousinly 1
black-clad 1
drunker 2
larger 122
noxious 2
rotund 1
libelous 1
Carolingian 1
papillary 1
sway-backed 1
Einsteinian 1
Past 1
immeasurable 2
16-mesh 1
gilt 2
unsatisfactory 8
fore 1
graduate 4
double-breasted 2
Heavy 3
creedal 1
once-in-a-lifetime 1
addle-brained 1
Western-style 1
presumptuous 4
unmalicious 1
territorial 12
Ithacan 1
inertial 3
vindictive 2
tolerant 9
sun-bleached 1
bittersweet 1
Colonial 7
folksy 3
hypophyseal 1
world-ignoring 1
masterly 1
sterling 2
bogus 3
American-

opinionated 2
Hawaiian 6
nine-game 1
devoid 6
picturesque 9
carboxy-labeled 1
test-like 1
21-year 1
drive-in 1
Treasonable 1
awe-inspiring 1
Sour 1
Organic 1
angry 42
unabashed 3
reverent 3
rectilinear 1
ol' 1
stationary 2
villainous 1
pastoral 6
stormbound 1
water-filled 1
excellent 67
multi-state 1
cozy 1
unimproved 2
Loyal 1
developmental 9
inescapable 8
Dry 2
inefficient 7
fifty-pound 1
blithe 2
downtalking 1
rateable 1
fusty 1
seven-inch 1
permissive 5
frothy 2
A-1 2
simultaneous 8
world-shattering 1
smashed-out 1
closer 18
cathodoluminescent 4
bereft 2
flat-footed 1
richest 5
starboard 1
Viennese 1
fluffy 1
satin-covered 1
neatest 1
morphophonemic 6
Meaningful 1
Antithyroid 2
friendlier 2
senior-graduate 1
theoretical 18
elegiac 2
ready 140
Operational 1
creditable 4
Congregational-Baptist 2
listless 1
snow-white 1
forty-year 1
few 583
desultory 1
cognitive 2
ultramarine 1
no-good 1
contemptible 2
melancholy 3
ubiquitous 2
unbeknownst 1
exclusive 26
clonic 1
friction-free 1
appro

undersize 1
recovery 1
busiest 2
Filipino 1
salacious 2
always-present 1
idle 11
matchless 2
kingdom-wide 1
Spike-haired 1
Liberal 4
Siamese 4
semi-inflated 1
Tensile 1
soul-searching 1
Annual 4
olive-flushed 1
cohesive 11
pestilent 1
Faint 1
bad-fitting 1
spring-back 1
brocaded 1
60-city 1
simplest 10
crotchety 1
heartening 4
overactive 1
morning-frightened 1
year-end 1
snob-clannish 1
Migrant 1
self-sustaining 5
ankle-deep 2
midwestern 4
roughshod 1
thirteenth 2
spectral 6
fade-in 1
efficacious 2
Rhythmic 1
sincerest 1
ministerial 2
blue-green 3
mud-beplastered 1
two-digit 5
invincible 2
gyro-stabilized 6
manmade 2
indigenous 3
unanimous 5
above-ground 1
Blind 1
Frail 1
tranquil 2
nationalistic 3
cream 2
hypothetical 8
present-day 17
Agricultural 3
over-the-counter 1
wobbly 2
sweeter 2
supernatural 14
optimistic 15
noblest 5
touchy 1
Rotary 3
inapplicable 2
humane 5
nerve-shattering 2
fine-featured 1
therapeutic 13
microscopic 5
Anti-Communist 2
strangest 1
grave 19
red-haired 3
rece

Temporary 2
sinewy 2
philosophic 10
Local 12
affluent 2
unsure 1
reciprocal 8
cancer-ridden 1
salubrious 2
unrelenting 1
stubborn 12
topsy-turvy 1
red-faced 1
supernormal 1
municipally-sponsored 1
ironical 2
Facilitatory 1
iliac 1
pre-Fair 1
erect 8
short-cut 1
intradepartmental 1
nasal 2
five-a-week 1
dissimilar 3
cumulative 12
11-year-old 2
cultural 49
cynical 8
vivid 25
fractional 1
all-college 1
seventh 14
nodular 1
eight-year 1
Goddamn 1
blonde-headed 1
olive-green 1
auspicious 1
storied 1
bipartisan 2
tense 10
Assyrian 1
gnomelike 1
Lebanese 1
drugless 1
forthcoming 10
feudalistic 1
biochemical 3
placeless 1
understandable 13
Little 41
every-day 1
44-year-old 3
cold 138
mature 24
parental 2
elderly 13
roaringest 1
non-itemized 1
well-received 1
1/50th 1
bloodstained 1
half-dressed 1
odious 3
plus-one 2
conservative-liberal 1
virtuous 6
sub-freezing 1
12-shot 1
strict 10
twenty-first 3
all-pervading 1
transient 3
adolescent 3
adventurous 5
record-tying 1
various 195
rocky 9
calami

unrelieved 4
nonsegregated 1
unauthorized 2
brisk 7
snap-in 1
militant 7
four-fold 1
threatening 4
sixteen-year-old 1
Israeli 4
revolting 2
half-blood 1
220-yard 1
unfertile 1
uninjured 2
select 5
Cool 2
Three-day 1
loudest 3
punishable 1
swift-striding 1
aniseikonic 1
quicker 5
unrevealing 1
advisable 1
curtained 1
Trichrome 1
pit-run 3
mournful 1
thirty-caliber 1
libertarian 1
fortunate 21
unfenced 2
compensatory 3
brutal 7
Formosan 1
tenuous 6
eastern 11
on-the-spot 1
psychiatric 5
louder 11
pregnant 8
non-Western 1
baronial 1
later 53
critical-intellectual 1
Physiological 2
non-absorbent 1
rickety 1
stronger 37
host-specific 1
bushwhackin' 1
hilar 4
penal 1
brief 61
instrumental-reward 1
pleasing 10
sunshiny 1
bluish 2
superficial 7
over-arranged 1
mud-sweat-and-tears 1
mother-naked 1
non-competitive 1
deceitful 1
pre-eminent 1
Primary 3
emotional 66
never-to-be-forgotten 1
allegro 1
weakest 3
individual 152
obsequious 2
out-of-town 6
guerrilla-th'-wisp 1
polar 7
lonely 25
Straight

Lay 1
recurring 6
threatens 5
leveling 9
heeded 1
Dried 2
scaring 1
registering 4
debilitated 2
surfeit 1
position 3
inferred 3
distributing 4
Note 21
alleviating 1
reinforces 3
hating 2
equipped 36
Preserving 3
bristling 3
received 163
cause 52
flows 4
believeth 2
browbeaten 1
pocketed 1
Spread 4
substantiates 1
Invite 2
obliged 21
remanded 2
grill 1
straining 7
allege 1
played 103
rouse 2
test 17
tease 4
translating 2
Depicted 1
swoops 1
Travelling 1
Upholds 1
worshipped 2
praised 13
comparing 9
gasping 5
Boating 2
tiptoeing 2
garbed 1
invests 1
Hurt 1
Detailed 2
envisages 1
Fill 1
re-elected 3
castigates 1
begins 55
stoked 1
eluted 1
hugged 2
spade 1
tolerated 6
peppered 2
Meaning 1
simmered 1
accentuates 1
Using 13
grab 9
centered 14
spill 1
garaged 1
Meet 2
fancied 2
dislike 7
magnifies 1
circumscribed 1
explaining 13
subject 1
Glaze 2
consent 1
modified 13
expecting 17
hulking 2
damages 2
bunt 1
forgit 2
being 663
disclaimed 1
coincides 5
crazing 1
withstands 1
perceived 12
strai

listed 42
questioning 19
consented 4
forked 3
Illustrated 3
haunts 1
tack-solder 1
wasting 5
Skipping 1
preached 8
scoured 3
stains 1
Filling 2
unitized 5
reckons 1
knotted 4
Named 2
shock 2
awakening 1
estimating 2
implemented 1
gassed 2
beget 1
flinching 1
cutting 54
Living 9
indoctrinated 1
deal 45
overheat 1
shave 6
spending 29
inform 7
Contribute 1
wrought 3
declaring 10
paw 1
skindive 1
refreshed 5
blistered 1
veering 2
court 2
train 10
leaked 5
digesting 2
Weren't 1
sings 10
cooking 26
promoted 12
nod 4
convincing 3
trouble 4
speculated 2
alphabetized 1
Situated 1
drawin' 1
punch 1
fairing 1
scans 1
fightin' 1
snaked 3
veined 1
entrenched 5
draft 1
buffet 1
flounced 1
Retiring 1
confine 2
lounging 4
clot 1
Shouldering 1
scorched 2
tick 2
cheering 1
blurted 3
burns 2
preventing 10
Computing 2
flock 1
confessed 7
crammed 3
inhabiting 1
Meeting 5
burgeoned 1
liked 58
roaming 3
lynched 1
Throw 1
bailing 2
till 1
suburbanized 1
tormented 3
Didn't 8
reopening 1
chinked 1
miswritten 1


slapping 6
extruded 8
nolle 1
halt 7
Live 2
Resolved 4
Called 5
Praises 1
smacked 2
Harnessing 1
Fuss 1
noticing 3
unmasked 1
Listed 2
unfolded 4
save 55
accord 2
resorting 3
Introduce 1
vivified 1
fined 4
fluorinated 1
paralleled 4
supposes 1
Serve 4
belittling 1
Racin' 1
calcified 1
fails 14
providing 53
hankered 1
endorsed 4
arching 1
resealed 1
sewn 1
shute 1
dustin' 1
Hold 7
swum 1
coordinate 7
name 20
plummeting 1
complicate 2
skirts 1
insisted 43
contracts 5
echoing 2
traverse 2
alienate 2
rode 40
culminated 2
Combine 3
toll 1
misted 1
feigned 1
forsake 1
budgeting 3
swollen 4
frozen 26
damning 1
stratified 2
instruct 3
Art 1
wed 2
Walk 3
pulling 22
Waiting 6
Cursed 1
deceiving 1
inactivate 2
tug 1
whitewashed 1
hooking 1
defending 13
nudge 1
frame 4
cleared 23
punish 3
browsing 1
tyrannize 1
Hearing 1
average 5
agree 51
dreams 2
begin 82
Get 30
purify 2
dot 2
cleft 1
class 1
discard 1
attended 36
towers 1
compile 1
reassert 1
re-explore 1
secularized 3
locking 31
ceasing 2
sche

leavened 1
consenting 1
slash 3
peering 9
establish 58
See 44
exaggerating 4
rejoiced 1
Supports 1
created 81
lend 13
decreases 7
scream 7
hurried 23
hopped 5
centralized 9
Make 26
berated 3
limit 16
drag 9
inhaling 1
traveling 16
Moving 4
cooperated 2
Seizin' 1
metabolized 1
comprehending 3
Exhibited 1
unbalanced 3
recoiled 1
errs 1
Commenting 2
registered 22
Defining 1
weaving 4
averted 3
appraising 1
recruiting 4
refresh 1
proscribed 1
washing 37
accounted 5
daring 11
counts 6
thrived 5
neglect 4
Deciding 1
handicapped 10
deepened 2
tasting 3
affiliated 7
twirling 5
debuting 2
troubled 30
trip 1
aim 9
Help 5
governing 21
conformed 3
lurching 2
lift 17
Ducking 2
Wouldn't 4
bandaged 4
parodied 2
parlayed 1
rimmed 2
preserve 31
discarded 8
ticking 1
declines 4
scavenging 1
Elected 1
bumping 2
reviving 2
hogging 1
Fleeting 1
tilted 12
constituted 11
inhibit 8
pulls 8
advancing 4
attracts 3
peed 1
marched 9
concern 12
advertised 9
tipped 4
secure 16
abiding 5
harping 1
accepting 19
resta

whirled 6
estranged 3
stored 36
Squeezing 1
prefer 27
counteracting 2
Pardon 2
terrified 7
selected 73
campaigning 2
rutted 1
sorting 1
petting 2
space 1
skiing 5
connecting 6
amortize 2
entrust 2
Choose 2
depresses 1
Mind 1
squared 5
level 2
marveled 2
bye 1
weakened 6
circled 9
supposed 65
crossroading 1
snuck 1
roars 1
stir 7
doubting 3
clinging 7
grazing 3
base 3
bogged 1
solves 2
gnaw 1
Corroborating 1
flow 13
sneering 1
protect 34
presumed 12
pleads 1
muddling 1
dressed 36
sustain 14
sample 1
reaffirmed 3
intersect 6
munching 1
co-operating 1
suffered 42
dehumanize 1
expediting 2
welding 1
question 17
surfaced 1
enjoyed 57
secluded 1
discriminate 1
Surrounded 2
elapses 1
Dazed 1
arrange 9
invoked 6
given 371
adjoins 2
hypothesize 1
stammered 4
praising 2
scowling 2
furthered 3
Fired 1
eluding 2
parting 3
portray 6
prospers 1
Seems 7
entranced 3
entreat 1
lapses 1
adapted 13
trotted 7
subduing 1
angle 1
scheduling 1
enmeshed 2
wounding 1
unsee 1
pull 38
leavin 2
sed 2
reassuring 6

clogged 2
cavorted 1
overlapping 3
terrifies 2
skirmished 1
overload 2
disorganized 5
encamped 1
impacted 2
Square 1
timbered 2
traces 1
decorating 3
reinstated 1
munch 1
damage 5
catching 7
grooved 1
rehearing 1
loading 5
fearing 5
envisaged 1
progressing 2
cruising 6
observed 74
whistling 5
Gaining 2
uttered 5
dissolved 15
inculcated 2
generates 5
knowed 1
divide 11
disqualified 1
degenerated 1
suffuse 1
Soothing 1
lingering 5
vows 1
fulfill 9
availing 1
Ain't 4
Voting 2
expanding 28
Undertaken 1
substitute 4
glamorize 1
fought 45
banishing 1
bevel 1
Hunting 1
Start 13
graced 1
smiled 71
mended 1
Expected 1
commented 18
start 90
dipped 3
demonstrates 6
prove 53
capture 13
scrambled 9
Am 3
scattering 1
protest 6
condition 1
radiated 4
quickened 1
coincided 6
mistaken 17
sparring 3
beholds 1
bugged 2
subdued 7
encroached 1
Viewing 1
increase 82
recognized 80
wished 55
fashioned 7
wheeling 1
expelling 1
betting 3
Convinced 1
parried 1
slung 2
Grinned 1
Fingered 1
overeat 1
imported 8
tr

wish 86
birdie 1
tiled 4
cook 13
Ionizing 1
crouchin' 1
perform 29
sponsoring 3
must've 2
drowned 6
blustered 1
lease 3
Rock 1
regaled 1
shot 48
undulating 1
separating 6
forgive 20
planned 72
penetrate 7
shan't 1
blow 8
tenting 1
sluiced 2
claimed 35
rejoices 1
noticed 50
doctored 4
steadied 2
confounding 1
memorizing 1
waning 2
dispatched 5
stamped 7
fumbling 4
blown 9
tantalizing 4
mapped 1
yellin' 1
regenerates 1
Refuses 1
culminates 5
Authenticated 1
carousing 1
shipped 6
Shu-tt 1
Construct 1
removing 8
organizes 1
computing 9
Leaving 8
flared 5
accommodated 1
stricken 5
incurring 1
remitted 1
settle 23
Furnishes 1
accelerated 13
Notice 3
Weaken 1
span 4
bilked 2
detail 1
overrated 1
fielded 1
boards 1
trudged 4
doped 1
Strolling 1
Position 1
tolerate 4
dwarf 2
dazzles 1
tour 1
deter 1
drugging 1
thrive 1
enforce 8
lurched 5
Allow 2
harvesting 2
documented 6
paint 18
boxed 2
abstracting 3
Looking 11
extended 55
undermined 2
butchered 1
weighed 16
welcome 15
credit 2
accelerate 5
f

crossing 7
Appendixes 1
scholar 11
dicks 1
lords 1
Flames 1
Bryan 12
guerilla 1
Berrellez 1
Mercers 1
Nestor 1
Peep 2
rage 16
Newburger 1
U.M.C.I.A. 3
regularity 2
aircraft 64
nonreactors 2
Dominique 1
K.G. 1
Curt's 8
Teacher 2
inheritors 1
Brush-off 1
spirits 42
titer 4
stratagem 1
consanguinity 2
runways 4
Hitlers 1
navigation 5
Komurasaki 1
tenderfoot 2
marauders 1
delineaments 1
Casey 23
clatter 1
stretches 5
Valmet 1
clotheslines 1
Slavs 1
Scobee-Frazier 1
Center's 1
integers 1
Vittorio 2
hide-out 2
duets 1
Stevenses' 1
reproducibility 1
Leukemia 1
Beryl 2
havens 1
Lincoln 47
catastrophes 5
Jacoby 5
Luck 2
Friends' 1
Title 25
bales 3
deduction 9
Bourcier 3
Patrimony 1
tar 2
Playhouse 4
tape 31
Stars 4
Belletch 2
Suburbs 2
Harnack 1
first-level 2
Key 7
firelight 2
Tchaikovsky 2
definition 34
Cudmore 1
needle 15
courage 32
atrophy 4
Expansion 2
Chuck 4
variables 26
Maguire 5
Opinion 2
training 83
reach 14
Finalists 1
animal's 3
Beardens 5
character-education 1
going-over 2
Scarface 

announcements 6
Kellum 1
Monthly 3
effort 145
1940s 1
Nishimo 1
Biscayne 2
Sessions 1
Constable 5
1940's 1
track-signal 1
Scotch 2
Dilys 1
Todd 2
Rafer 1
automobile 43
Payments 2
Fyodor 1
propriety 6
restraints 7
facility 11
fashions 3
Holders 1
Ryder 1
$4,000 1
clamps 6
self-insurance 1
ambuscade 1
tragedians 2
trestle 1
Berea 1
Lessons 1
heelers 1
letterman 1
wrapper 2
bourbon 4
Lagrange's 1
Albers 1
wisdom 42
Mozart 2
arenas 3
Pope's 1
Ado 4
Ellamae 1
Sunset 2
over-simplification 1
Bengal 4
deadliness 2
investigation 43
Gums 1
lyricist 2
townships 1
Viola's 4
instruments 25
Drill 1
Bishop 13
Christopher 5
miscellanies 1
price-level 1
stake-out 3
synthesis 16
Cezanne 2
Octet 1
mania 5
Milledgeville 1
scalp 4
heavens 9
Commodities 1
dinosaur 1
dimers 1
hue 1
Rachel's 2
subsidiary 6
Big 1
Wechsler 1
Don 22
liberalism 13
Pa. 9
Felix 30
active 6
scandals 7
Yahwe 1
Trig's 1
White's 4
stubble 2
ironies 1
Dionysus 1
Schopenhauer 1
App 1
navels 1
Trujillo's 2
squaw 1
apricot 1
rupees 14
bang

Kedzie 1
Boothby 1
enrage 1
ft 4
smaller-size 1
scrap 8
silences 3
bull 13
elm 2
Kabalevsky 1
treble 1
Eugene's 2
Sirs 2
quarter-inch 1
circuitry 1
witness 19
carrot 1
dislocations 2
offer 12
battleground 2
forums 1
Candide 1
mornings 10
Clemenceau 2
dolls 12
Vero 2
bag 41
Concepts 1
Darius 1
Nile 3
bop 3
globetrotter 1
zeal 8
F-major 1
maiestie 1
Steak 2
Lanesmanship 1
detachment 4
Berry 6
heroes 17
divine 2
lectures 13
Railroad 11
drums 14
Blackwells 1
scarcity 2
Mar. 3
founder-originator 1
organ 11
riders 6
DiSimone 1
Cauffman 1
Convocation 3
setback 3
snips 1
Clemens 2
feare 1
consideration 49
Bartha 1
throw-rug 1
prognosis 2
Sprague 3
possessive 1
deterrence 1
cluster 8
Chennault's 1
Trimmer 2
Merner 1
breakwater 2
flange 2
chastisement 2
Reuveni 5
midsummer 3
Identification 1
Krutch 2
slave-laborers 1
hydraulics 1
Evangelism 3
cowbirds 3
tempo 4
Hospitals 2
Patty 1
pressure-volume-temperature 1
Linden 6
Plastics 3
Josephus 1
Calhoun 10
$2.50 1
salvation 27
catharsis 5
countryside

dissemination 2
fishpond 1
Yuba 2
Hartweger 5
Searles 1
embargo 2
monitor 1
Alamogordo 1
setbacks 3
license 35
Jacopo 2
Assimilation 1
city 259
Hebrew 7
stable-garage 1
advocates 1
Concerto 5
saffron 1
follow-through 1
$8,313,514 1
layers 10
springtime 4
grams 18
nuisance 5
examples 47
Khasi 1
contexts 2
Packs 1
routes 6
bedpost 1
John-Henry 1
Beadles' 1
canvassers 2
thrusts 4
Enlargement 1
pogroms 2
ultracentrifugation 3
song 56
Margaret 10
datum 1
Knee 2
accessibility 2
glycol 2
Hanover-Lucy 1
gloss 1
aviator 3
Privacy 1
CDC 7
offering 6
Southwest 7
cat 22
adroitness 1
preservation 17
regret 3
Toland 1
rat's 1
yachts 3
mapping 3
skillet 2
Oopsie-Cola 1
meeting 122
gray-backs 1
Blanc 1
brother's 10
neo-stagnationist 1
front-page 2
Boogie 1
flying-mount 1
Manor 2
coincidence 11
barns 4
$12.7 1
Groggins 4
parapsychology 1
Devil 5
conversion 20
contrasts 9
Charts 1
beers 1
Claims 8
collector 7
trait 3
pirate 2
Shah 2
skiffs 4
phosphor-screen 1
drawing 12
thigh 9
culprits 2
Tropez 1
musts

Cerv 5
to-day 4
daylight 15
riddle 1
Commander-in-Chief 1
Godunov 1
materialism 6
Succession 7
bosom 8
goose 2
Yin-Yang 2
Commentary 3
sluggers 2
organdy 3
khan 1
lions 4
coronary 3
pip 1
Hitler's 7
matter 281
Dusseldorf 3
rafters 1
Murtaugh 3
o 1
Brighetti 1
Pact 3
newlyweds 2
ground-level 1
Oyster 1
pool-side 2
Arthur 51
Super-Set 8
fealty 1
Mityukh 5
hearing 48
contraceptives 4
sun-tan 1
Burr 5
People 36
Borromini 1
Dodgers 5
supremacy 5
Stripes 1
restrictions 27
Amazon 2
reunion-Halloween 1
nagging 4
recording 14
Trafton's 1
Miss 232
applications 24
Huff's 1
will 99
pillow 8
meddling 1
radiochlorine 3
screenland 1
Osram 1
galleys 4
shoji 1
pseudonym 1
Brakke 1
U. 61
Fairmount 2
decree 3
dressings 1
choruses 2
walk-up 1
ciliates 1
Marchand 1
applicator 1
swamps 2
promptings 1
Boylston 1
Chappell 1
bedtime 3
Reunion 1
January's 1
Midge 4
Dove 1
seers 1
FELA 2
Chantilly 1
B's 1
Wieland 1
call 50
constancy 5
indisposition 1
anarchist-adventurers 1
subtype 3
Claude 11
$6,666.66 1
Lodowi

Review 10
Galahad 1
Vickery 3
Eastman 1
differentiability 2
bodies 63
control 181
physiotherapist 1
clumps 2
Plenty 4
turkey 3
engineers' 3
mass 81
cession 1
Unit 9
Gottingen 1
equivalent 10
Enos 1
Berto 3
$40 3
applicants 10
biophysicist 1
Designs 2
know 3
X-ray-proof 1
Piranesi 1
Michilimackinac 1
Millay 1
anomaly 1
stomachs 3
Bennett 1
beast 7
Koh 1
Okinawa 1
mall 1
smear 2
psychiatrists 5
pails 4
immortality 19
thrusting 1
Butcher 6
community 214
busy-work 1
locomotives 1
Statistics 1
affiliations 5
Coming 1
matrix 1
Vecchio 8
commonness 1
multitude 3
Hoyt 4
Batavia 3
Fox 3
Brandeis 1
Brumby 2
plot 32
earthmoving 1
August 52
pool's 6
guinea 1
Hotel 43
Graves 3
polemics 1
delegates 14
revetments 1
Sibylla 4
graybeards 1
Baird 4
polyether 7
furs 5
Tobacco 5
comet's-tail 1
patriarch 1
Covent 2
Meyner 3
Pesce 2
garments 5
coves 1
Shep 1
firemen 5
Dunne 4
Sihanouk's 1
lieutenant 12
Truth 2
creepers 2
Shoettle 2
Drs. 5
Pontissara 1
deferent 2
forecast 3
smatterings 1
Discourse 3
schoolho

regattas 1
sponge 4
Caneli 1
Argos 1
teahouses 1
rigors 4
shrieking 1
comment 34
Auntie 3
warrant 9
garment 6
Cloth 1
Franklin 30
Metro 3
citizens 78
gyrations 1
inheritance 5
dealers 29
teasing 1
Rotarians 1
Howser 2
threesome 3
pair 50
Superintendents 1
Spirituals 1
Nurses' 1
codfish 1
Characteristics 1
misconstructions 1
stiletto 1
easements 2
Mimi 1
palazzo 3
Halls 1
buckle 2
spate 2
aerosols 3
fares 3
Service 69
ex-jazz 1
14% 3
Malcolm 3
Valerie 1
reasoning 11
Looks 1
elevator 12
Funk 4
Revolution's 1
disputes 7
Format 4
2'' 1
Apache 1
Favre 5
holdings 4
cue-phrase 1
octaves 2
lovering 1
interlude 4
sectors 10
aide 8
Rents 1
half-inch 3
Heel-Terka 1
Dickson 3
suicides 2
Sunay 1
oversimplification 3
sham 1
Artist 2
cytolysis 1
Markovitz 1
modernists 2
Mayor 30
cowboy's 1
realist 2
reader 43
little-town 1
states 141
Potowomut 1
behaviour 3
Cheyenne 2
non-farm 1
founders 2
Guatemala 3
Benets 1
Styka 16
Keynotes 1
sittings 2
220-degrees 2
Gods 2
caucus 2
Doolin 6
Millay's 2
winehead 1

dooms 1
fetes 1
Mansion's 1
wine's 1
Christiansen 2
blizzard 7
necropsy 2
courtyards 1
Islands 18
Proceeds 2
Sources 6
pets 5
fancy 7
finance 9
Nickel-iron 1
pillage 1
limited-time 1
madhouse 1
H.M. 2
butterfly 2
Woman's 5
Sober 3
Adam 44
radar-type 1
sellout 1
Kalonji 1
Kathy 4
Alamein 1
dearth 3
Improvement 3
Act 116
Money 6
dissonances 1
ashes 6
Sally 13
restoration 8
street 146
Coconut 1
Sha. 1
Sunshine 1
Scots 8
Aparicio 1
league's 1
Kirkland 1
Breed 1
mean 12
solipsism 1
Pittenger 1
decorum 2
provisons 1
Ai 1
skullcap 3
episodes 6
obviousness 1
production 140
caution 13
triol 1
immorality 4
celebrities 3
tidelands 1
$77,389,000 1
elephant 7
allurement 1
infections 5
inter-relation 1
Englishman 15
parlors 1
Centralia 1
Citizen 2
Rutstein 1
Palisades 1
dilation 2
Duchess 1
NCTA 3
Open 8
cm. 12
colonials 1
foraging 1
store 65
Rector 30
Iraqw 2
450 1
Cable 2
processors 1
uranyl 2
sea 78
filbert 1
photos 4
Nouvelle-Heloise 1
hydrophobia 1
organs 14
comedian 4
traders' 1
ee-faket 1
pla

Garza 1
Carlo 2
youngster's 1
$.03 4
Bluthenzweig 1
recoil 4
waterways 2
low-foam 1
Autocoder 10
performance 122
Cloud 3
dependent 1
strangers' 1
non-English 1
Mycenae 2
sonofabitch 4
cubists 1
Henry's 2
Poet's 5
Orange 6
wail 3
pines 2
file 44
Pall 1
Information 11
Nordstrom 1
pfennig 1
Voroshilov 1
insanity 3
handyman 2
Father 23
folder 1
transports 4
prototype 3
anemia 5
dynamo 2
Rostagnos 1
flux 30
astronomer 1
savings 19
trumpeter 1
socket 3
pianist's 4
Princess 5
Coroner 1
petition 14
schedules 10
Purcell 1
glob-flakes 1
Wolfgang 1
balkiness 1
archipelago 1
formality 2
Vue 3
privet 1
bunters 1
Vachell 1
Hatfield 2
adversary 5
discussant 1
Samples 2
drugs 28
Gelly 1
Aeschbacher 1
skewer 1
Joshual 1
dictionary 54
Athlete 1
Duty 3
Twist 1
soles 5
ACTH 1
lightweight 1
photography 7
Casbah 3
ileum 1
stabs 1
Lateran 1
Gershwins 1
pyocanea 1
rascals 1
Constantinople 3
Mineral 1
sacrifices 5
Chattanooga 3
madness 2
rf 1
bunter 1
Montgomery 16
Jock 1
idiosyncrasies 3
Hudson's 16
DIOCS 7
a

balsams 1
Mohammedanism 1
speech-making 1
Tolerance 1
Blvd. 4
teats 2
Bayreuth 4
grapevine 3
campgrounds 2
F.R. 2
ROTC 1
meantime 9
ward-personnel 1
pool-owners 1
treaties 4
whims 1
tree 56
prostate 2
Bean 1
interment 2
happiness 22
Finberg 2
room's 1
Cocktails 2
stench 1
Giovanni 3
subway 7
Bellwood 1
Snow 8
Roberta 6
kid 54
abdomen 6
side-rack 1
Bucer 1
upswing 2
tunefulness 1
opener 6
tappet 15
interest 320
Anthony's 1
rascal 1
sentinels 2
Epicurean 1
Hesiometer 6
advisory 2
shams 1
Rogers 9
tapes 4
reading-rooms 1
usages 3
python 14
edition 36
Artur 1
Finney's 2
bayonet 6
pin 14
oven 7
Visa 1
Medfield's 1
Friar 1
mythology 3
Backyard 1
thermocouple 3
analogy 13
whit 1
verisimilitude 1
Federation 12
understructure 1
70's 1
parish 10
Admissions 1
aqueducts 1
copings 1
Tenn. 1
old-age 1
weave 1
barn 29
coke 2
short-range 1
bibliophiles 1
verge 2
tournament 18
proliferation 5
$35,823 1
day's 14
draining 2
filets 1
occupations 3
Sino-Soviet 1
bishops 5
Sholom 1
ruling's 1
photo 5
fables

Hurts 1
grabs 3
dairy 18
diapers 3
O'Brien 2
chamois 1
bewilderment 3
husband 131
velour 1
bookseller 1
tenant 5
Lancret 1
tent 20
partisan 3
stories 58
polysiloxanes 1
Minnett 3
spasm 3
boxcars 2
Cardboard 1
Hegel 2
violation 17
Eli 2
chromatogram 1
additions 9
Sexton 2
wager 1
Mauldin 1
AjA 1
hinge 1
barracks 3
suits 24
curricula 3
cab's 2
equator 1
bandage 4
Permit 1
$625,561 1
towels 11
utopian 2
second-level 1
antagonism 9
J.D.H. 2
Russia's 14
tissue 41
steamboat 2
planetoids 1
Eisler 1
u. 1
Sweeney 4
beans 9
teens 5
Errol 1
tongue 35
stove 15
Moffett 1
realty 1
attrition 5
Honotassa 4
distrust 4
Grumble 1
cutters 6
Oldsmobile 2
Jannsen 1
Norris-LaGuardia 1
$30,000 3
Grass 1
disrepute 2
Prisca 1
variety 84
Message 1
Pohly 2
bout 4
graveyard 7
Seigner 5
rationalization 2
Dana's 3
Canada 34
tail 24
Muck's 1
cadre 3
overpressure 1
Galveston-Port 1
Stram 4
hunt 3
prestige 27
municipality 1
minarets 4
guests 57
flu 8
Junior 8
horse-chestnut 1
Billy 26
Dog 5
Alexis 2
subsedies 1
Thailan

Waal's 1
lagoons 2
Minutes 3
Svevo 1
handicraftsman 1
Respondents' 5
cathode 10
Vacancy 2
purchaser's 1
Blanche 24
Lorlyn 1
vintage 3
ditties 3
Sarkees 2
Gustaf 1
Crawford 3
Transpiration 1
Mackey 3
pitch 20
yodel 1
hideout 1
Thruston 1
exodus 3
bindle 1
grandma 1
Handbook 1
Experiments 6
fracases 1
bin 9
spires 3
gluttons 3
random-storage 1
gardenia 1
giving 2
holidays 10
taxi-ways 1
overpayment 4
neurotic 1
$75-billion 1
corn 31
bodyguard 1
hemoglobin 4
croak 1
exit 7
Paulus 1
hookups 1
hiding 1
drinking 6
Buckhannon 1
it-wit 1
fingernails 2
Lars 1
musicians 39
Betsey 2
Edgewater 2
sounds 32
republics 2
sponging 1
Rush 1
Ellsworth 1
Voters 2
Tipoff 1
blueberries 1
Missions 1
tarpaulin 1
mavericks 1
sisters-in-law 1
realities 15
Vermouth 1
metaphor 5
Jansen 1
Rev 11
perfectibility 1
Campaigne 1
phantasy 2
Martians 3
Maggie's 3
Exports 1
cleat 1
luxuriance 1
Squaresville 1
favorer 1
Olvey 1
quake's 1
avocation 1
layoffs 1
let 7
lymph 2
Magnums 5
punching 1
cleaners 8
one-sixteenth 1
po

narration 2
defenses 12
sensuality 5
abilities 13
bolts 1
Doug 1
T. 27
cost-plus 1
Homestead 1
Hell 10
Danehy 1
taper 2
stimuli 5
springs 7
acres 42
sheriff's 5
Leighton 1
knott 1
devotees 3
Fires 1
Bud 5
Brace 3
Alfred's 3
risks 5
General 103
nonshifters 1
parson 1
Merchandise 1
Wallingford 1
Richmond-Petersburg 1
scouring 1
panjandrum 1
cruise 1
Quint 11
Railroad's 1
platoons 2
Meyers 1
lackeys 1
capitals 4
Reich 6
Boats 4
ones 114
Whipple 7
excellence 13
experiment 53
20-gauge 1
aerator 11
crawlspace 1
pastness 1
Marcile 1
Conservatism 1
raggedness 1
Mambo 1
grub 2
mysteries 7
gradualist 1
Bottom 8
puddles 2
Kiz 3
best-sellers 1
moon's 2
Ahmet 3
Brewers 1
Pressure 1
Pageants 2
airfields 6
Iodination 1
crew 36
Pains 1
breeches 1
Casualties 1
monopolies 5
Maplecrest 1
sitting 4
chant 1
deprivation 1
scurvy 1
Jump 1
publishers 9
Tract 2
tap 8
Evidences 1
Shingles 1
'90s 1
single-dose 1
Reilly 1
Gracie's 1
ballerinas 1
0.002'' 1
cash 31
brotherhood 5
Equivalents 1
dividends 8
braids 1
m

drawing-room 2
panthers 1
Cathedral 2
Brownings 1
Bekkai 1
sound 126
Clipper 1
It-wit 1
lats 2
Bertha 5
Pantheon's 2
SAMOS 3
psychologist 10
Smokies 1
HBO 1
temperatures 25
Bumblebees 2
paradise 5
trusses 2
Erik 2
stick 23
statisticians 1
Socialism 3
Dicks 1
banners 2
Negroes' 1
waltz 1
orchestration 3
Fiedler's 1
sniper 1
rump 2
Sounder 1
Property 3
peptide 1
Ticker 1
M.P. 2
territory 27
texts 3
blossoms 7
Barker 8
Brestowe 1
Mikeen 1
Twain 1
Mongolia's 1
verbs 7
Smoke 1
Roger 14
symposium 3
Ainus 1
callers 3
Senese 1
Dillinger 1
Mame 1
hints 7
stances 1
influenza 2
boors 1
Reiss 1
enthusiast 2
incentive 12
ages 37
motets 1
Traitor 1
sparrow's 1
stranger 37
blight 2
Customer 1
Dnieper 1
perceptions 9
Sharkey 1
commissioner 6
ruins 8
Yale 13
Laos 64
Messiah 2
3.25'' 1
steroid 1
Stallard 1
shoe 14
spider 2
Osipenko 1
ponds 7
income 97
Rennell 1
moderates 3
leader 69
Gasset 1
blessing 9
subjectivist 2
pope 5
exhibitions 3
Taras-Tchaikovsky 1
Tygartis 1
sportsmen 6
Asteria 2
Signor 2
Jack

Verner 1
churches 82
looseness 2
client's 7
windbreaks 1
pores 3
Israel 15
first-aid 1
rhythms 11
robe 6
Braques 1
caterpillar 1
59-cents 1
Nigger 1
Snyder's 1
librarian 5
fertilizers 3
probity 1
squadrons 1
Palasts 1
problems 240
Self 5
transference 2
Attendance 3
Lafayette 13
founding 3
countenance 6
Downs 2
research 133
20th-Century 2
Havilland 1
Land's 1
Kings 5
nurse 16
teacher's 2
Sherry 4
Frostbite 1
Pye 1
Erasmus's 1
Beverly 14
classic 5
recruits 8
$172,400 1
berth 3
vs 4
Anne 42
speculation 3
Yang 12
swearinge 1
billets 1
knots 1
immaturity 1
neutralists 2
Mubarak 1
dismissal 7
stair 2
Boulder 3
exhaustion 1
Ulbricht 2
try 3
File 3
Nations 58
Song 14
Geology 2
elite 11
cattle 90
misunderstandings 1
passerby 1
pearls 2
Shares 1
threes 3
Jesus 58
woodcutters 1
liability 7
clown's 1
Ellwood 1
safeties 1
wage-rate 1
Chambers 3
margin 10
disparagement 2
Sell 1
stain 4
simplicitude 1
streak 10
orthicon 2
Barrette 1
patchwork 2
I 5
Charleston 2
Sodium 1
decay 11
helpmate 1
abstractio

idea-exchange 1
child-face 1
alkalis 2
endosperm 1
deceit 2
Gilkson 1
Westhampton 1
pedigree 3
Burton's 2
metalsmiths 1
parimutuels 1
timeliness 2
Robertsons 1
Conflict 1
member 133
Roleplaying 3
Ebbetts 1
pilot's 2
boldness 3
patsy 1
Illinois' 1
deed 7
newsboy 2
tappets 12
double-crossing 1
operetta 6
bass 15
Hypotheses 2
asbestos-cement 1
inns 1
blue 15
knife 73
Disquisition 1
quarry 7
jade 1
Vienna 22
Milk 5
biopsy 2
deer 8
sub-chiefdom 1
blocks 37
saturation 5
chlorine-carbon 1
Motel 4
tutor 4
early-season 1
Ormoc 2
snorkle 2
dislikes 1
sun-suit 1
post-mortem 1
mills 11
Seymour 1
fund-raisers 1
seminarians 1
Mushr 1
connection 68
Varnessa 1
favorites 10
complainant 1
Morgan 72
Science 26
Judson 3
complaint 14
blanket 29
enrollment 6
floorboards 4
Vagabonds 1
creases 1
diamonds 7
wrath 8
sweets 2
wealth 20
Rum 1
circumlocution 1
de-iodinase 1
Strindberg 1
flash-bulbs 1
blinkers 1
O'Hare 1
Bermuda 9
Palestine 7
Laughlin 2
Bosis' 1
side-arm 1
wrought-iron 1
entropy 4
Boys 11
proficien

pebbles 3
bark 13
reflector 5
integration 47
pecs 3
beer 32
tularemia 2
Eshleman 1
Hume's 2
look-see 1
vibrato 1
Lante 1
surge 8
Angelina 4
lucy 1
mister 3
reflex 3
chops 3
Silk 1
Perier 6
restlessness 2
Pilgrim's 1
Raoul 2
y 1
Arrack 1
Eckart 1
nation's 32
Jolla 2
enzymes 11
schedule 34
beggar 2
Arrington 1
chowders 1
outposts 1
butterfat 1
Orioles' 1
equivalent-choice 1
hens' 1
confidences 1
itinerary 3
transposition 1
Vernava 3
schnooks 1
merry-go-round 1
Mommor 1
Flats 1
plainclothes 1
elan 1
canals 1
cubes 4
demolition 1
Boils 1
ethers 1
Asch 1
Catalog 2
classification 21
Field's 2
fish 31
launch 3
catalyst 3
Hayes 5
Lorde 1
eulogizers 1
freights 1
1% 2
ineffectiveness 1
textiles 15
plagiarism 1
once-over 1
P 47
high-velocity 1
Collinsville 1
Peru 4
glover 1
Gatlinburg 2
Pensacola 4
hour's 1
rolls 8
Requests 3
Osaka 6
reorganizations 2
Huxley 6
wreath 8
Cedvet 1
Morse's 5
Merc 2
Travellers 1
20-megaton 1
ocher 1
**yb 3
Bayanihan 1
gastronomy 1
transcription 2
Bellini 3
Jouvet 4
va

cobalt-60 1
granite 2
road-circuit 1
boat-yard 1
Culbertson 1
monograph 1
Yancy-6 1
rocket-bombs 1
lagers 1
april 1
squires 1
Cen-Tennial 2
crudity 1
broods 3
employees 65
foot 68
County's 1
couple's 1
secularism 1
Federalist 1
fritters 1
Rawlins 3
voters 18
father-brother 1
Marlin 2
acclaim 4
radiocarbon 1
jump 8
doorbell 2
raising 8
Guiana 2
assists 1
porches 2
inventory 20
metrazol 1
Torpetius 1
bigotry 2
windowpanes 2
Johnson's 10
planes 24
carpeting 2
Beman 1
coatings 12
pre-selling 1
economists 5
mothers' 3
ulcerations 1
Sparrow-size 1
feat 6
paunch 2
Canal 3
curiosity 23
t's 1
undoing 1
cartons 1
Rebecca 1
chairing 1
Valentine 2
Yankees 28
helmsman 1
glands 6
manual 1
Martin's 1
Tiao 1
concept 85
handhold 1
jaws 10
Catskill 5
Mahler's 4
resolution 62
passport 6
fears 43
runabout 1
10-degrees-C 1
Square's 1
snake 42
1-inch 1
neutrality 3
colossus 1
Lemon 4
Napoleon 7
canyonside 1
Verdi's 2
sincerity 13
Physique 1
Borough 2
Vienot 1
Hewlett-Woodmere 1
Burke's 1
larceny 2
Gizenga 2

ability 74
Cole 1
cubist 1
Nathanael 1
polymerizations 1
Tannhaeuser 2
grease 9
RA 1
loader 1
Pasadena's 1
Auditorium 6
embarrassment 8
cabs 1
Karamazov 4
depressions 3
noun 1
Gortonists 2
formability 1
Wildlife 1
Allay 1
auditors 3
Moriarty 5
acumen 1
poetry-and-jazz 4
Ada's 4
Stephane 1
misfortunes 1
Industry's 1
sienna 2
larder 1
movers 4
creak 1
Hubie's 1
Robby's 1
specificity 10
turnips 1
disunion 1
baseman 3
TNT 4
surfaceness 1
calisthenics 4
brunches 1
Urielites 1
conducts 1
ex-Yankee 1
courtliness 1
Spillane's 1
Zemlinsky 2
F. 59
Herford 5
watching 2
poppy 2
yoga 1
grant 22
sorrow 9
Rockport 2
buyers 5
nymphs 1
Fellowship 10
confidentiality 1
unawareness 3
rediscovery 2
Height 1
hoste 1
Luthuli 1
foal 2
foaming 4
attendants 7
Lescaut 1
sect 2
Toll 2
quibs 1
widower 1
Ziminska-Sygietynska 1
Spanish 5
sheriffs 3
alizarin 1
Curie 2
Tuberculosis 1
Trapp 1
phosphate 7
suey 1
sultans 3
Grafton 5
Carruthers 5
turtle-neck 1
Albania 3
Danaher 2
spoonful 1
shop 49
Attic 1
rulers' 1
madme

Houston 25
sprays 1
Bath 1
mulch 6
salesman 12
Melamine 2
lies 3
kitchen 90
kindergarten 3
almonds 2
nomenclature 7
stacks 1
Tire 2
arrangers 1
exposition 4
Orcutt 2
Mike 91
prosperity 12
concentration 47
Suzuki 1
Lauri 1
Diseases 3
handshake 1
continence 1
City 134
cards 33
Proxmire 1
dismemberment 2
pause 16
wines 24
Brumidi's 6
starter 2
Hanover-Sally 1
gasps 5
Djakarta 1
currency 7
gasser 1
dower 1
lambs 7
Heraclitus 1
comparisons 6
fall's 1
rival's 1
sentinel 2
headlines 7
configuration 7
Turnout 1
gangs 6
hypervelocity 1
Krumpp 1
Epsom 1
porch 42
constables 1
antibodies 9
Swift 20
imminence 1
carbon 28
parenthood 3
furlough 2
Arco 1
Lublin 15
Boredom 1
Scrapiron 1
airplanes 10
Ward's 1
D-night 1
tight-turn 1
Sunny 1
Caldwell's 2
Alsing 1
interferometer 4
Feuchtwanger 2
SETSW 1
enterprise 31
gain 50
resorcinol 3
judging 2
territories 8
sandals 5
combinations 19
Roomberg 1
Jockey 2
AMA 1
Bench 7
Connally 8
abeyance 3
Quick-Wate 1
mite-box 1
young 2
wrongdoing 2
suitor 1
simpleton 1

In [221]:
for tag in data.tagset:
    

ADP
DET
NUM
CONJ
PRT
PRON
.
X
ADJ
VERB
NOUN
ADV


In [219]:
tag_unigrams

{'.': 147565,
 'ADJ': 83721,
 'ADP': 144766,
 'ADV': 56239,
 'CONJ': 38151,
 'DET': 137019,
 'NOUN': 275558,
 'NUM': 14874,
 'PRON': 49334,
 'PRT': 29829,
 'VERB': 182750,
 'X': 1386}

In [216]:
for 
basic_model.states()

TypeError: 'NoneType' object is not callable

In [205]:
basic_model = HiddenMarkovModel(name="base-hmm-tagger")

# TODO: create states with emission probability distributions P(word | tag) and add to the model
# (Hint: you may need to loop & create/add new states)
basic_model.add_states()

# TODO: add edges between states for the observed transition frequencies P(tag_i | tag_i-1)
# (Hint: you may need to loop & add transitions
basic_model.add_transition()


# NOTE: YOU SHOULD NOT NEED TO MODIFY ANYTHING BELOW THIS LINE
# finalize the model
basic_model.bake()

assert all(tag in set(s.name for s in basic_model.states) for tag in data.training_set.tagset), \
       "Every state in your network should use the name of the associated tag, which must be one of the training set tags."
assert basic_model.edge_count() == 168, \
       ("Your network should have an edge from the start node to each state, one edge between every " +
        "pair of tags (states), and an edge from each state to the end node.")
HTML('<div class="alert alert-block alert-success">Your HMM network topology looks good!</div>')

TypeError: add_transition() takes at least 3 positional arguments (0 given)

In [None]:
hmm_training_acc = accuracy(data.training_set.X, data.training_set.Y, basic_model)
print("training accuracy basic hmm model: {:.2f}%".format(100 * hmm_training_acc))

hmm_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, basic_model)
print("testing accuracy basic hmm model: {:.2f}%".format(100 * hmm_testing_acc))

assert hmm_training_acc > 0.97, "Uh oh. Your HMM accuracy on the training set doesn't look right."
assert hmm_testing_acc > 0.955, "Uh oh. Your HMM accuracy on the testing set doesn't look right."
HTML('<div class="alert alert-block alert-success">Your HMM tagger accuracy looks correct! Congratulations, you\'ve finished the project.</div>')

### Example Decoding Sequences with the HMM Tagger

In [None]:
for key in data.testing_set.keys[:3]:
    print("Sentence Key: {}\n".format(key))
    print("Predicted labels:\n-----------------")
    print(simplify_decoding(data.sentences[key].words, basic_model))
    print()
    print("Actual labels:\n--------------")
    print(data.sentences[key].tags)
    print("\n")


## Finishing the project
---

<div class="alert alert-block alert-info">
**Note:** **SAVE YOUR NOTEBOOK**, then run the next cell to generate an HTML copy. You will zip & submit both this file and the HTML copy for review.
</div>

In [None]:
!!jupyter nbconvert *.ipynb

## Step 4: [Optional] Improving model performance
---
There are additional enhancements that can be incorporated into your tagger that improve performance on larger tagsets where the data sparsity problem is more significant. The data sparsity problem arises because the same amount of data split over more tags means there will be fewer samples in each tag, and there will be more missing data  tags that have zero occurrences in the data. The techniques in this section are optional.

- [Laplace Smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) (pseudocounts)
    Laplace smoothing is a technique where you add a small, non-zero value to all observed counts to offset for unobserved values.

- Backoff Smoothing
    Another smoothing technique is to interpolate between n-grams for missing data. This method is more effective than Laplace smoothing at combatting the data sparsity problem. Refer to chapters 4, 9, and 10 of the [Speech & Language Processing](https://web.stanford.edu/~jurafsky/slp3/) book for more information.

- Extending to Trigrams
    HMM taggers have achieved better than 96% accuracy on this dataset with the full Penn treebank tagset using an architecture described in [this](http://www.coli.uni-saarland.de/~thorsten/publications/Brants-ANLP00.pdf) paper. Altering your HMM to achieve the same performance would require implementing deleted interpolation (described in the paper), incorporating trigram probabilities in your frequency tables, and re-implementing the Viterbi algorithm to consider three consecutive states instead of two.

### Obtain the Brown Corpus with a Larger Tagset
Run the code below to download a copy of the brown corpus with the full NLTK tagset. You will need to research the available tagset information in the NLTK docs and determine the best way to extract the subset of NLTK tags you want to explore. If you write the following the format specified in Step 1, then you can reload the data using all of the code above for comparison.

Refer to [Chapter 5](http://www.nltk.org/book/ch05.html) of the NLTK book for more information on the available tagsets.

In [None]:
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import brown

nltk.download('brown')
training_corpus = nltk.corpus.brown
training_corpus.tagged_sents()[0]