## N-gram Text Generation

In this assignment we'll generate text via various n-gram models. See the README for full instructions. For this whole assignment use a tokenization that folds to lowercase and removes tokens where `isalpha` is False. 

In [1]:
import nltk
import random
from nltk.book import *
from collections import Counter
from collections import defaultdict

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [2]:
# In this assignment, I recommend you use the random.choices function. 
# here are some examples of its use.
values = "a b c d e f g".split()
weights = [1,1,5,5,10,10,20]

In [3]:
random.choices(population=values,weights=weights,k=5)

['e', 'f', 'f', 'd', 'c']

In [4]:
from collections import Counter

Counter(random.choices(population=values,weights=weights,k=1000)).most_common(10)

[('g', 412),
 ('f', 199),
 ('e', 183),
 ('d', 89),
 ('c', 87),
 ('b', 17),
 ('a', 13)]

Now, write a function that generates text of a given length, using the probabilistic approach to glue one word to another. Have it start with a text and the desired length of the output.

In [21]:
def generate_unigram_text(text,length=10) :
    
    # tokenize the text here, 
    # 1. fold to lowercase
    # 2. remove any tokens where isalpha is False
    
    words = [i.lower() for i in text
              if i.isalpha()]
    
    # Now use random.choices to select `length` words and return them 
    
    nonsense = random.choices(population = words, k = length)
    
    return(' '.join(nonsense))



Now play around with the various texts, generating nonsense sentences from them. 

In [22]:
generate_unigram_text(text1)

'torrid with secured dismal the ahead to d than hole'

In [23]:
generate_unigram_text(text2)

'thousand my him cannot do they been s end else'

In [24]:
generate_unigram_text(text5)

'bored so the action dawnstar and wanna so pic hmm'

Now do the same thing, but have it work with bigrams. This is harder, since you have a "current word" you want to glue text onto. The parameter "start" will give you a word to start with. 

In [25]:
def generate_bigram_text(text,length=10,start=None) :
    
    uni_fd = FreqDist(text)

    if not start :
        # Select a starting point here. 
        words = []
        for i in text:
            if i.isalpha():
                words.append(i)
        start = random.choice(words)
        
    else :
        if start not in uni_fd :
            print(f"The starting word, {start}, isn't in the text!")
            return("")
    
    # here we'll need the frequency distribution for the bigrams   
    
    
    results = [] # the results of your text generator
    
    results.append(start)
    
    bigram_list = list(bigrams(words))
    
    d = defaultdict(list)
    
    
    for key, value in bigram_list:
        d[key].append(value)
        
    key_list = list(d.keys())
    value_list = list(d.values())
    
    count = 0
    
    #the loop will stop after the tenth word gets added to the results variable.
    #it will use the start variable as the start word, randomly choose a value from the values list,
    #add it to the results list, then assign a different value to the start variable, append that new value,
    #and continue unitl there are 10 values in "results"
    
    while count < length - 1:
        pos = key_list.index(start)
        word = random.choice(value_list[pos])
        results.append(word)
        start = word
        count += 1

   
    
    # here you'll build up results by randomly selecting
    # bigrams that "chain" on to the last word in results
    
    
    return(" ".join(results))

In [26]:
generate_bigram_text(text1)

'by cords woven verdant land or Case While Daggoo was'

In [27]:
generate_bigram_text(text2)

'married till they were any curiosity on her four weeks'

In [28]:
generate_bigram_text(text5)

'was off your poor thing It Rang dont like xbox'