# N-gram Text Generation

This notebook demonstrates text generation with unigram and bigram models built from the NLTK book texts. Before sampling, every token is lowercased and non-alphabetic tokens are dropped.

In [None]:
import nltk
import random
from nltk.book import *
from collections import Counter, defaultdict

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [None]:
# Example: weighted random selection
values = "a b c d e f g".split()
weights = [1, 1, 5, 5, 10, 10, 20]

In [None]:
random.choices(population=values, weights=weights, k=5)

['g', 'd', 'c', 'f', 'f']

In [None]:
Counter(random.choices(population=values, weights=weights, k=1000)).most_common(10)

[('g', 390),
 ('e', 177),
 ('f', 172),
 ('c', 120),
 ('d', 105),
 ('b', 20),
 ('a', 16)]
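
The counts above can be checked against what the weights predict: each value's probability is its weight divided by the sum of the weights (52 here). The sketch below redefines `values` and `weights` so that it runs on its own; `n_draws` and `expected` are names introduced only for this illustration.

```python
# Expected counts for 1000 weighted draws, using the same values and weights as above.
values = "a b c d e f g".split()
weights = [1, 1, 5, 5, 10, 10, 20]

total = sum(weights)  # 52
n_draws = 1000        # illustrative name, matches k=1000 above

# Each value's probability is its weight divided by the total weight.
expected = {v: n_draws * w / total for v, w in zip(values, weights)}

# Print the expected counts from most to least likely.
for v, count in sorted(expected.items(), key=lambda kv: -kv[1]):
    print(f"{v}: {count:.1f}")
```

With these weights, `g` is expected about 385 times in 1000 draws (20/52 of the total), which is close to the 390 observed above.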

## Unigram Text Generation

The function below generates text by sampling each word independently from the text's unigram distribution, so the order of the output words carries no information.

In [None]:
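def generate_unigram_text(text, length=10):
    # Keep alphabetic tokens only, lowercased.
    words = [i.lower() for i in text if i.isalpha()]
    if not words:
        return ""
    # Frequent words fill more of the list, so uniform draws follow the unigram distribution.
    nonsense = random.choices(population=words, k=length)
    return ' '.join(nonsense)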
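
Because `words` keeps every occurrence of every token, drawing uniformly from it is the same as drawing each distinct word with probability proportional to its count. A minimal sketch of that equivalent form, assuming a small made-up token list in place of an NLTK text; `generate_unigram_text_weighted` and `toy` are hypothetical names, not part of the notebook.

```python
import random
from collections import Counter

def generate_unigram_text_weighted(tokens, length=10):
    """Hypothetical variant: sample from explicit unigram counts."""
    # Same preprocessing as above: alphabetic tokens only, lowercased.
    words = [t.lower() for t in tokens if t.isalpha()]
    if not words:
        return ""
    counts = Counter(words)
    vocab = list(counts)
    # Use each word's count as its sampling weight.
    weights = [counts[w] for w in vocab]
    return " ".join(random.choices(population=vocab, weights=weights, k=length))

# Made-up token list standing in for an NLTK text.
toy = "The cat sat on the mat . The dog sat too !".split()
sample = generate_unigram_text_weighted(toy, length=8)
print(sample)
```

The weighted form draws from the vocabulary rather than from the full token list, which may be preferable when the counts are already available.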

Sample output from three of the texts. Results vary from run to run because the sampling is random.

In [54]:
generate_unigram_text(text1)

'of s of ice sideways that too it the ahab'

In [8]:
generate_unigram_text(text2)

'the left ill marriage not quite is in elinor which'

In [9]:
generate_unigram_text(text5)

'does i join ben me bites join am the part'

## Bigram Text Generation

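The function below generates text from bigram statistics: each next word is drawn from the words that followed the current word in the source text, in proportion to how often they did.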

In [None]:
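def generate_bigram_text(text, length=10, start=None):
    # Keep alphabetic tokens only, lowercased.
    words = [i.lower() for i in text if i.isalpha()]
    if not words:
        return ""
    uni_fd = nltk.FreqDist(words)
    if not start:
        # No starting word given: pick one at random, weighted by frequency.
        start = random.choice(words)
    else:
        # The vocabulary is lowercased, so lowercase the starting word too.
        start = start.lower()
        if start not in uni_fd:
            print(f"The starting word, {start}, isn't in the text!")
            return ""
    results = [start]
    # Map each word to the list of words that follow it, repeats included.
    d = defaultdict(list)
    for key, value in nltk.bigrams(words):
        d[key].append(value)
    # Repeats make frequent successors proportionally more likely to be drawn.
    for _ in range(length - 1):
        options = d.get(start)
        if not options:
            # Dead end: this word only occurs as the final token.
            break
        word = random.choice(options)
        results.append(word)
        start = word
    return " ".join(results)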
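
The successor lists built inside the function act as an implicit conditional distribution: choosing uniformly from the list for a word is equivalent to sampling from P(next word | current word). A sketch on a made-up toy sentence, using `zip` in place of `nltk.bigrams` so that it runs without NLTK; `toy`, `successors`, and `probs` are names introduced only for this illustration.

```python
from collections import Counter, defaultdict

# Made-up token list standing in for an NLTK text.
toy = "the cat sat on the mat and the cat ran".split()

# zip(toy, toy[1:]) yields the same adjacent pairs as nltk.bigrams(toy).
successors = defaultdict(list)
for first, second in zip(toy, toy[1:]):
    successors[first].append(second)

# Turn the successor list for "the" into conditional probabilities P(next | "the").
counts = Counter(successors["the"])
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}
print(probs)
```

Here "the" is followed by "cat" twice and "mat" once, so "cat" is drawn two times out of three on average. The final token "ran" has no entry in `successors`, which is the dead-end case where generation stops early.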

In [56]:
generate_bigram_text(text1)

'easily dent sir sailor and mealy aspect or in the'

In [57]:
generate_bigram_text(text2)

'eyes to carry to him to discompose the latter which'

In [59]:
generate_bigram_text(text5)

'how have a spat now my whisper lmao girl s'