## N-gram Model From Scratch

In this notebook, we'll implement a simple n-gram language model using the text of *Moby Dick*.

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import defaultdict
import random

from tqdm.notebook import tqdm

First, we'll read in the text and tokenize it into sentences.

In [None]:
with open('moby_dick.txt') as fi:
    text = fi.read()
    
sentences = sent_tokenize(text)

Next, let's build a simple unigram language model. This only requires us to keep track of the count of each individual word.

For this, let's use a [defaultdict](https://docs.python.org/3/library/collections.html#defaultdict-objects) from the collections module. A defaultdict behaves very much like a dictionary, but it allows us to specify what happens when we try to access a key which does not yet exist.

You might want to check out this page which talks about working with defaultdict objects: https://www.geeksforgeeks.org/defaultdict-in-python/
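
As a quick illustration of the behavior described above, here is a minimal sketch on a made-up list of tokens (the names `tokens` and `counts` are only for this example and are not part of the exercise):

```python
from collections import defaultdict

# Toy tokens, made up for illustration (not read from the text file).
tokens = ["call", "me", "ishmael", "call", "me"]

# int() returns 0, so a missing key starts with a count of zero.
counts = defaultdict(int)
for token in tokens:
    counts[token] += 1  # no KeyError the first time a token is seen

print(counts["call"])   # 2
print(counts["whale"])  # 0; note that this lookup also inserts the key
```

One thing to keep in mind: simply reading a missing key adds it to the defaultdict, which can matter if you later iterate over the keys.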

Create a defaultdict named `unigram_model` which contains the count of each word in the text. Do this by iterating through each sentence, tokenizing it, and then incrementing the count of each token in `unigram_model`.

In [None]:
# Your code here

Now, write a function called `unigram_prob` which takes in a word as an argument and returns the probability (based on word counts) of seeing that word.
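
One common way to define this probability is the relative frequency of the word. Writing $c(w)$ for the count of word $w$ (notation introduced here, not used elsewhere in the notebook), this would be:

$$P(w) = \frac{c(w)}{\sum_{w'} c(w')}$$

For instance, a word that appears 3 times among 12 total tokens would get probability $3/12 = 0.25$.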

In [None]:
# Your code here

In [None]:
unigram_prob('the')

In [None]:
unigram_prob('whale')

Finally, create a function `random_unigram` that randomly selects a word according to the probabilities above.

**Hint:** The [`random.choices` function](https://www.w3schools.com/python/ref_random_choices.asp) might be useful for this task. It lets you randomly select elements from a list, with the relative chance of each element specified by a list of weights.
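
To show how `random.choices` handles weights, here is a small sketch using made-up counts (`toy_counts` is a hypothetical stand-in, not your `unigram_model`):

```python
import random

# Toy counts, made up for illustration.
toy_counts = {"the": 6, "whale": 3, "sea": 1}

words = list(toy_counts.keys())
weights = list(toy_counts.values())

# The weights are relative, so raw counts work without normalizing.
# random.choices returns a list of k samples (k defaults to 1),
# so index with [0] to get a single word.
word = random.choices(words, weights=weights)[0]
print(word)
```

With these weights, "the" should come up roughly six times as often as "sea" over many draws.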

In [None]:
# Your Code Here

Now, we can see what random text generated by your model looks like.

In [None]:
sentence = []

for _ in range(50):
    sentence.append(random_unigram())

print(' '.join(sentence))

## Bigram Model

Now, let's build a bigram model. This will require us to count how many times each bigram appears. 

Let's do this by again utilizing a defaultdict, but this time we'll use a nested structure.

Create a defaultdict `bigram_model` where each key is the first word of a bigram and each value is a defaultdict counting how many times each word appears as the second word of a bigram starting with that key.

For example, `bigram_model["white"]` would be a defaultdict counting the number of times each word shows up as the second part of a bigram which starts with "white". So `bigram_model["white"]["whale"]` gives the number of times that "white whale" shows up in our text and `bigram_model["accursed"]["whale"]` gives the number of times that "accursed whale" appears.

**Note:** To account for words that appear at the beginning and end of sentences, add a `"<s>"` token at the beginning of each sentence and a `"</s>"` token at the end.
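
The nested structure and the padding tokens can be sketched on toy data as follows. This is only one possible approach; `toy_bigrams` and `toy_sentences` are hypothetical names, and the sentences are made up and already tokenized:

```python
from collections import defaultdict

# The outer default builds a fresh inner defaultdict(int) for each new first word.
toy_bigrams = defaultdict(lambda: defaultdict(int))

# Toy pre-tokenized sentences, made up for illustration.
toy_sentences = [["the", "white", "whale"], ["the", "whale", "swam"]]

for tokens in toy_sentences:
    padded = ["<s>"] + tokens + ["</s>"]
    # zip pairs each token with the token that follows it.
    for first, second in zip(padded, padded[1:]):
        toy_bigrams[first][second] += 1

print(toy_bigrams["<s>"]["the"])      # 2
print(toy_bigrams["white"]["whale"])  # 1
print(dict(toy_bigrams["the"]))       # {'white': 1, 'whale': 1}
```

Using a `lambda` for the outer default is a design choice here: `defaultdict` needs a callable that produces the default value, and the inner defaultdict is that value.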

In [None]:
# Your Code Here

Now, write a function `random_bigram` which takes as an argument `seed` (which should default to `"<s>"`) and randomly selects a next word based on the bigram probabilities from your bigram model.

In [None]:
# Your Code Here

Now, let's try it out and see how well it creates sentences.

Starting with a list containing `['<s>']`, use your `random_bigram` function to add tokens until you have added the `"</s>"` token. Then print out the resulting sentence.
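
The shape of such a loop can be sketched with a deterministic stand-in. Here `toy_next_word` is a hypothetical replacement for `random_bigram` that follows a fixed, made-up successor table instead of sampling, so the behavior is easy to trace:

```python
# Hypothetical stand-in for random_bigram: a fixed successor table.
# A real bigram model would sample the next word instead.
toy_next = {"<s>": "the", "the": "whale", "whale": "</s>"}

def toy_next_word(seed="<s>"):
    """Return the single successor of `seed` in the toy table."""
    return toy_next[seed]

toy_sentence = ["<s>"]
# Keep extending from the most recent token until the end marker appears.
while toy_sentence[-1] != "</s>":
    toy_sentence.append(toy_next_word(toy_sentence[-1]))

print(" ".join(toy_sentence))  # <s> the whale </s>
```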

In [None]:
sentence = ['<s>']

# Fill in the code to generate a random sentence

print(' '.join(sentence))

**Bonus:** Extend your work above to trigrams or even higher-order n-grams. What are the disadvantages of 3- or 4-gram models built from the current text?
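
As one possible starting point, a trigram model could key the outer dictionary on a tuple of the two preceding words. The sketch below uses a made-up token list and assumes two `"<s>"` tokens of padding so that the first real word has a full two-word context; both the names and the padding scheme are assumptions, not requirements:

```python
from collections import defaultdict

# The context is a tuple of the two preceding tokens.
toy_trigrams = defaultdict(lambda: defaultdict(int))

# Toy padded sentence, made up for illustration.
toy_tokens = ["<s>", "<s>", "the", "white", "whale", "</s>"]

# Slide a window of three tokens across the sentence.
for w1, w2, w3 in zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]):
    toy_trigrams[(w1, w2)][w3] += 1

print(toy_trigrams[("the", "white")]["whale"])  # 1
print(len(toy_trigrams))                        # 4 distinct contexts
```

Counting how many contexts in the full text have only a single observed continuation is one way to start exploring the question above.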