<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Getting-Started" data-toc-modified-id="Getting-Started-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Getting Started</a></span></li><li><span><a href="#Dataset" data-toc-modified-id="Dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Dataset</a></span><ul class="toc-item"><li><span><a href="#Fiction" data-toc-modified-id="Fiction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Fiction</a></span></li><li><span><a href="#Shakespeare" data-toc-modified-id="Shakespeare-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Shakespeare</a></span></li><li><span><a href="#Wine-Reviews" data-toc-modified-id="Wine-Reviews-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Wine Reviews</a></span></li></ul></li><li><span><a href="#N-gram-Length" data-toc-modified-id="N-gram-Length-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>N-gram Length</a></span></li><li><span><a href="#Dataset-Size" data-toc-modified-id="Dataset-Size-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Dataset Size</a></span></li></ul></div>

Now that we've created and interacted with some language models, let's have some fun exploring a few other parameters: the training dataset, the n-gram length, and the dataset size. Note: I recommend working through the "Language Model Tutorial" before this notebook.

# Getting Started

As usual, let's import a set of libraries we'll find useful later on.

In [33]:
%matplotlib inline  

# for manipulating data
from collections import Counter
import random

# useful nlp methods
import nltk
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace

# download some datasets (the stopwords list is needed for STOPWORDS below)
nltk.download('brown')
nltk.download('gutenberg')
nltk.download('webtext')
nltk.download('stopwords', quiet=True)

# plotting
import matplotlib
import matplotlib.pyplot as plt

# printing
from tabulate import tabulate

[nltk_data] Downloading package brown to
[nltk_data]     /Users/eugenetang/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/eugenetang/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package webtext to
[nltk_data]     /Users/eugenetang/nltk_data...
[nltk_data]   Package webtext is already up-to-date!


We'll define a few constants and helper functions that we'll use later. They should be familiar from the previous notebook.

In [27]:
SENTENCE_BEGIN = '<s>'
SENTENCE_END = '</s>'
STOPWORDS = set(stopwords.words('english'))

def pretty_print_tuples(tuples, headers):
    '''Pretty print tuples using tabulate.
    
    Parameters
    ----------
    tuples: list[tuple[str]]
        a list of tuples; each tuple must have the same dimensions
    headers: list[str]
        a list of headers to use; this list must be the same size as the number of elements in each tuple

    '''
    table = [list(tup) for tup in tuples]
    print(tabulate(table, headers=headers, floatfmt=".5f"))


def print_top_unigrams(sentences, n, remove_stopwords_and_punc):
    '''Print the top n unigrams in the sentences.
    
    Parameters
    ----------
    sentences: list[list[str]]
        a list of tokenized sentences
    n: int
        the number of unigrams to print
    remove_stopwords_and_punc: bool
        whether to remove stopwords and punctuation from list of unigrams

    '''
    unigram_counter = Counter()
    for sentence in sentences:
        for word in sentence:
            if not remove_stopwords_and_punc or (word.lower() not in STOPWORDS and word.isalpha()):
                unigram_counter[word] += 1
    print('Our dataset has {} unique words.'.format(len(unigram_counter)))
    print()
    print('--Top {} Unigrams--'.format(n))
    print()
    pretty_print_tuples(unigram_counter.most_common(n=n), ['Unigram', 'Count'])
    
def generate_sentence(lm, text_seed, random_seed=None):
    '''Generate a random sentence from the given language model.
    
    Parameters
    ----------
    lm: nltk.lm.api.LanguageModel
        an nltk language model object
    text_seed: list[str]
        a list of strings to seed the sentence with
    random_seed: int
        an integer seed for the randomization

    '''
    tokens = lm.generate(100, text_seed=text_seed, random_seed=random_seed)
    # remove the sentence begin and end tokens to keep it clean
    return ' '.join(text_seed + [t for t in tokens if t != SENTENCE_BEGIN and t != SENTENCE_END])

# Dataset

Now let's try training n-gram language models with different training data and see how the output changes. Conveniently, nltk provides us an interface to some additional datasets.

## Fiction
The Brown corpus has category labels for each document. Let's focus on just the fiction categories, `fiction` and `science_fiction`.

In [5]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [17]:
fiction_sentences = brown.sents(categories=['fiction', 'science_fiction'])
print('The fiction and science_fiction categories have {} sentences.'.format(len(fiction_sentences)))
print()
print('--Sample sentences--')
for i in range(5):
    print('>> sentence {}:'.format(i), ' '.join(fiction_sentences[i]))
print()
print_top_unigrams(fiction_sentences, 10, True)

The fiction and science_fiction categories have 5197 sentences.

--Sample sentences--
>> sentence 0: Thirty-three
>> sentence 1: Scotty did not go back to school .
>> sentence 2: His parents talked seriously and lengthily to their own doctor and to a specialist at the University Hospital -- Mr. McKinley was entitled to a discount for members of his family -- and it was decided it would be best for him to take the remainder of the term off , spend a lot of time in bed and , for the rest , do pretty much as he chose -- provided , of course , he chose to do nothing too exciting or too debilitating .
>> sentence 3: His teacher and his school principal were conferred with and everyone agreed that , if he kept up with a certain amount of work at home , there was little danger of his losing a term .
>> sentence 4: Scotty accepted the decision with indifference and did not enter the arguments .

Our dataset has 9762 unique words.

--Top 10 Unigrams--

Unigram      Count
---------  -------
woul

In [21]:
lm_fiction = MLE(3)
train_text, text_vocab = padded_everygram_pipeline(3, fiction_sentences)
lm_fiction.fit(train_text, vocabulary_text=text_vocab)
generate_sentence(lm_fiction, ['Doctor'], 4)

'Doctor It infuriated him .'
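
To make the training step a little less of a black box, here is a minimal pure-Python sketch of what `padded_everygram_pipeline(3, ...)` prepares for a single sentence: the sentence is padded with n-1 begin and end markers, and every n-gram of length 1 through n is extracted. The helper name `padded_everygrams` is hypothetical and only mimics the nltk behaviour for illustration; nltk may yield the n-grams in a different order.

```python
def padded_everygrams(sentence, n, begin='<s>', end='</s>'):
    '''Return all 1..n-grams of a sentence padded with n-1 begin/end markers.

    Hypothetical stand-in for illustration; this is not the nltk implementation.
    '''
    padded = [begin] * (n - 1) + list(sentence) + [end] * (n - 1)
    grams = []
    for size in range(1, n + 1):
        # slide a window of the current size across the padded sentence
        for start in range(len(padded) - size + 1):
            grams.append(tuple(padded[start:start + size]))
    return grams

toy_sentence = ['It', 'infuriated', 'him', '.']
toy_grams = padded_everygrams(toy_sentence, 3)
toy_trigrams = [gram for gram in toy_grams if len(gram) == 3]
print(toy_trigrams[0])   # ('<s>', '<s>', 'It')
print(toy_trigrams[-1])  # ('.', '</s>', '</s>')
```

The padding is what lets a trigram model learn which words tend to start and end a sentence.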

## Shakespeare
Let's train our language model to produce Shakespeare!

In [22]:
shakespeare_sentences = []
for fileid in ['shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt']:
    shakespeare_sentences += list(nltk.corpus.gutenberg.sents(fileid))

print('"Julius Caesar", "Hamlet", and "Macbeth" by Shakespeare have {} sentences.'.format(len(shakespeare_sentences)))
print()
print('--Sample sentences--')
for i in range(5):
    print('>> sentence {}:'.format(i), ' '.join(shakespeare_sentences[i]))
print()
print_top_unigrams(shakespeare_sentences, 10, True)

"Julius Caesar", "Hamlet", and "Macbeth" by Shakespeare have 7176 sentences.

--Sample sentences--
>> sentence 0: [ The Tragedie of Julius Caesar by William Shakespeare 1599 ]
>> sentence 1: Actus Primus .
>> sentence 2: Scoena Prima .
>> sentence 3: Enter Flauius , Murellus , and certaine Commoners ouer the Stage .
>> sentence 4: Flauius .

Our dataset has 8729 unique words.

--Top 10 Unigrams--

Unigram      Count
---------  -------
haue           406
Ham            337
Lord           293
shall          259
thou           256
King           231
Enter          225
Caesar         192
vs             183
thy            175


In [23]:
lm_shakespeare = MLE(3)
train_text, text_vocab = padded_everygram_pipeline(3, shakespeare_sentences)
lm_shakespeare.fit(train_text, vocabulary_text=text_vocab)
generate_sentence(lm_shakespeare, ['Why'], 4)

'Why are you aught That man may question ?'

## Wine Reviews
For something a little different, let's look at some wine reviews too.

In [25]:
wine_sentences = list(nltk.corpus.webtext.sents('wine.txt'))

print('The wine review corpus has {} sentences.'.format(len(wine_sentences)))
print()
print('--Sample sentences--')
for i in range(5):
    print('>> sentence {}:'.format(i), ' '.join(wine_sentences[i]))
print()
print_top_unigrams(wine_sentences, 10, True)

The wine review corpus has 2984 sentences.

--Sample sentences--
>> sentence 0: Lovely delicate , fragrant Rhone wine .
>> sentence 1: Polished leather and strawberries .
>> sentence 2: Perhaps a bit dilute , but good for drinking now .
>> sentence 3: *** Liquorice , cherry fruit .
>> sentence 4: Simple and coarse at the finish .

Our dataset has 3121 unique words.

--Top 10 Unigrams--

Unigram      Count
---------  -------
fruit          296
good           250
wine           229
bit            217
quite          204
Top            182
nose           151
touch          146
Bare           133
palate         121


In [26]:
lm_wine = MLE(3)
train_text, text_vocab = padded_everygram_pipeline(3, wine_sentences)
lm_wine.fit(train_text, vocabulary_text=text_vocab)
generate_sentence(lm_wine, ['Red'], 4)

'Red Burgundy Wine as it approaches its 10th birthday .'

# N-gram Length

Let's create some n-gram models of varying lengths and see how the value of n affects the quality of the generated sentences.

In [29]:
# train n-gram models for n = 1 through 5; this cell takes a few minutes to run
fiction_sentences = brown.sents(categories=['fiction', 'science_fiction'])
def train_ngram_language_model(n):
    '''Train an n-gram language model on the fiction categories of the Brown corpus.'''
    lm = MLE(n)
    train_text, text_vocab = padded_everygram_pipeline(n, fiction_sentences)
    lm.fit(train_text, vocabulary_text=text_vocab)
    return lm
lm_unigram = train_ngram_language_model(1)
lm_bigram = train_ngram_language_model(2)
lm_trigram = train_ngram_language_model(3)
lm_fourgram = train_ngram_language_model(4)
lm_fivegram = train_ngram_language_model(5)

In [30]:
print('---Sample Unigram Language Model Sentences---')
print('Sample 1:', generate_sentence(lm_unigram, ['The'], 4))
print('Sample 2:', generate_sentence(lm_unigram, ['The'], 5))
print()
print('---Sample Bigram Language Model Sentences---')
print('Sample 1:', generate_sentence(lm_bigram, ['The'], 4))
print('Sample 2:', generate_sentence(lm_bigram, ['The'], 5))
print()
print('---Sample Trigram Language Model Sentences---')
print('Sample 1:', generate_sentence(lm_trigram, ['The'], 4))
print('Sample 2:', generate_sentence(lm_trigram, ['The'], 5))
print()
print('---Sample Four-gram Language Model Sentences---')
print('Sample 1:', generate_sentence(lm_fourgram, ['The'], 4))
print('Sample 2:', generate_sentence(lm_fourgram, ['The'], 5))
print()
print('---Sample Five-gram Language Model Sentences---')
print('Sample 1:', generate_sentence(lm_fivegram, ['The'], 4))
print('Sample 2:', generate_sentence(lm_fivegram, ['The'], 5))
print()

---Sample Unigram Language Model Sentences---
Sample 1: The To . consult Certain , county tune take should So his about He . Rameau us the that take Jesus and might receipt the this . listened of he Homemakers gesture . war their horizon and to intercept thought the he did left east Eugene and that , , metal address himself funeral baritone you Kahler destination Michelangelo more about believe salvation any in to . , The should maids Unless asked His for , original to we'd remembered were , alone where snoring declaration was mates the an It face ? categories what asked '' , He spirit bone
Sample 2: The me room such was road uniform , four was no to . from `` his into '' Said across tried shouted Each support ? man : ! there Once Repeating within they along westerly his of Mrs. was one where to and blue Godwin And , and like ! of audience and the good and got own , wings , say the , stealthily breathlessly it '' , I wearing Kayabashi selfishness very was be behind her so . satisfactio

You'll notice that as the n-grams get longer, the diversity of the generated sentences starts to decrease. For example, if we want the word after "The clock on the", a five-gram model conditions on all four words "The clock on the", which has far fewer observed continuations than the two-word context "on the" used by a trigram model.
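
The shrinking set of options can be checked directly by counting how many distinct words follow a context of a given length. The sketch below does this on a small made-up corpus; the sentences and the helper name `continuations` are illustrative assumptions and are not taken from the Brown corpus.

```python
toy_corpus = [
    'the clock on the wall stopped'.split(),
    'the clock on the tower chimed'.split(),
    'the cat on the mat slept'.split(),
    'the book on the shelf fell'.split(),
]

def continuations(sentences, context):
    '''Return the set of words that directly follow `context` in the sentences.'''
    context = tuple(context)
    k = len(context)
    followers = set()
    for sentence in sentences:
        for i in range(len(sentence) - k):
            if tuple(sentence[i:i + k]) == context:
                followers.add(sentence[i + k])
    return followers

# a trigram model conditions on two words, a five-gram model on four
trigram_options = continuations(toy_corpus, ['on', 'the'])
fivegram_options = continuations(toy_corpus, ['the', 'clock', 'on', 'the'])
print(sorted(trigram_options))   # ['mat', 'shelf', 'tower', 'wall']
print(sorted(fivegram_options))  # ['tower', 'wall']
```

With only a couple of continuations per context, a long n-gram model has little choice but to follow the training sentences closely.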

# Dataset Size
How does the quality of our language models change with dataset size? To find out, we'll train several language models on different numbers of sentences from the Brown fiction categories. We'll stick to the trigram language model here.

In [39]:
# train trigram models on increasingly large subsets; this cell takes a few minutes to run
fiction_sentences = brown.sents(categories=['fiction', 'science_fiction'])
def train_trigram_language_model_n_sentences(n):
    '''Train a trigram language model on the first n sentences of the Brown fiction categories.'''
    lm = MLE(3)
    train_text, text_vocab = padded_everygram_pipeline(3, list(fiction_sentences)[0:n])
    lm.fit(train_text, vocabulary_text=text_vocab)
    return lm
lm_trigram_10s = train_trigram_language_model_n_sentences(10)
lm_trigram_100s = train_trigram_language_model_n_sentences(100)
lm_trigram_1000s = train_trigram_language_model_n_sentences(1000)
lm_trigram_all = train_trigram_language_model_n_sentences(len(fiction_sentences))

In [40]:
print('---Sample Sentences from Language Model Trained on 10 Sentences---')
print('Sample 1:', generate_sentence(lm_trigram_10s, ['The'], 4))
print('Sample 2:', generate_sentence(lm_trigram_10s, ['The'], 5))
print()
print('---Sample Sentences from Language Model Trained on 100 Sentences---')
print('Sample 1:', generate_sentence(lm_trigram_100s, ['The'], 4))
print('Sample 2:', generate_sentence(lm_trigram_100s, ['The'], 5))
print()
print('---Sample Sentences from Language Model Trained on 1,000 Sentences---')
print('Sample 1:', generate_sentence(lm_trigram_1000s, ['The'], 4))
print('Sample 2:', generate_sentence(lm_trigram_1000s, ['The'], 5))
print()
print('---Sample Sentences from Language Model Trained on All 5,197 Sentences---')
print('Sample 1:', generate_sentence(lm_trigram_all, ['The'], 4))
print('Sample 2:', generate_sentence(lm_trigram_all, ['The'], 5))
print()

---Sample Sentences from Language Model Trained on 10 Sentences---
Sample 1: The Mr. McKinley described as a `` celebration lunch '' at the University Hospital -- Mr. McKinley was entitled to a discount for members of his family -- and it was decided it would be best for him to take the remainder of the room with an expression of proprietorship .
Sample 2: The if he kept up with a certain amount of work at home , there was little danger of his losing a term .

---Sample Sentences from Language Model Trained on 100 Sentences---
Sample 1: The doctor , since Scotty was neutral .
Sample 2: The doctors had suggested Scotty remain most of every afternoon in bed , his eyes dull .

---Sample Sentences from Language Model Trained on 1,000 Sentences---
Sample 1: The big shock everybody had when they found ol Slater and those krautheads tune in on Father Werther every night , and someone invented the name Trig for him to deliver his package in person .
Sample 2: The only one who would have to be 

Here we see that the language models trained on less data can actually look better! That's because they are more or less returning existing training sentences word-for-word.

The weakness of having less training data is less accurate probabilities. For example, let's compute $P(w_n = "University" \mid w_{n-2} = "at", w_{n-1} = "the")$. The smaller corpora have seen the context "at the" only a handful of times, so their estimates rest on very few counts and can be far from the estimate obtained from the full dataset.

In [50]:
word = 'University'
context = ['at', 'the']
print('---Computed Probabilities---')
print('10-sentence training data:   {:.3f}'.format(lm_trigram_10s.score(word, context)))
print('100-sentence training data:  {:.3f}'.format(lm_trigram_100s.score(word, context)))
print('1000-sentence training data: {:.3f}'.format(lm_trigram_1000s.score(word, context)))
print('All training data:           {:.3f}'.format(lm_trigram_all.score(word, context)))

---Computed Probabilities---
10-sentence training data:   0.333
100-sentence training data:  0.125
1000-sentence training data: 0.028
All training data:           0.006
