# STATISTICAL LANGUAGE MODEL OF THE SHAKESPEARE CORPUS

In [6]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
import pandas as pd
import numpy as np
import os
import re
import requests
import time

In [8]:
from project import *

### Introduction 

In this project, I will build *[statistical language models](https://en.wikipedia.org/wiki/Language_model)* using public domain books found on [Project Gutenberg](https://www.gutenberg.org/). Language models attempt to capture the likelihood that a given sequence of words occur in a given "language" (the precise term is "corpus" or "corpora"). Here, "language" is a broad term that, in addition to the normal usage, may mean the language of a particular author or style. As with all statistical models, the true data generating process is never known and thus we cannot know the true probability that a sequence of words will occur – however, we can estimate these probabilities via various methods, some of which are more reliable than others. For example, one might guess that the probability of a sentence is simply the product of the empirical probabilities (i.e., the number of times a word is observed in a dataset divided by the number of words in that dataset). This is one of the methods of estimating the probability of a sequence of words that you will implement in this project.

### Tokenizing Corpora

Computing the probabilities of a language model from a book requires breaking up the text of book into sequences of words. This process is called *tokenization*. In reality though, the sequences are not made up entirely of words, but rather more general terms called *tokens*. In this project tokens will include not only whole words, but also punctuation and other terms. Below are a few examples of other types of tokens:

* Punctuation. For example, the period makes sense as a token, as certain words tend to end sentences (i.e. appear right before a `'.'`), while other words tend to begin sentences (i.e. appear right after a `'.'`).
* We have special "START" and "END" tokens that begin and end every word sequence (in our case, paragraphs of words in a given book). These make sense as tokens, as certain words may tend to begin and end paragraphs. 

It is useful for the tokens used to represent START and END to be single characters that can never be found in the text of the book you use to create your language model. Thus, you will use two ASCII hidden "[control characters](https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C0_(ASCII_and_derivatives))":
* For START, you will use the character `'\x02'`, which refers to the "beginning of text".
* For END, you will use the character `'\x03'`, which refers to the "end of text".

## Part 1: Dissecting the Corpus 
<a name='part1'></a>

Preparing the Corpus

<a name='question1'></a>

In this part, I will use `requests` to download a public domain book from [Project Gutenberg](https://www.gutenberg.org/), like [this one](http://www.gutenberg.org/files/57988/57988-0.txt), and prepare it for analysis in later questions. Create a function `get_book` that takes in the `url` of a "Plain Text UTF-8" book and **returns a string** containing the contents of the book. 

In [9]:
url = "https://www.gutenberg.org/cache/epub/100/pg100.txt"
r = requests.get(url)
post_page = r.text
text = post_page.replace('\r\n', '\n')

pattern1 = " ***"
pattern2 = "*** END"
book_string = text[text.find(pattern1)+len(pattern1):text.find(pattern2)]


In [10]:
book_string



### Tokenizing the Corpus

<a name='question2'></a>

Now, we need to **tokenize** the text of a book. To do so, we create a function `tokenize` that takes in a string, `book_string`, and returns a **list of the tokens** (words, numbers, and all punctuation).
For example, consider the following excerpt. (The first sentence is at the end of a larger paragraph, and the second sentence is at the start of a longer paragraph.)
```
...
My phone's dead.

I didn't get your call.
...
```
Will tokenize to:
```py
[...
'My', 'phone', "'", 's', 'dead', '.', '\x03', '\x02', 'I', 'didn', "'", 't', 'get', 'your', 'call', '.'
...]
```

In [12]:
book_string
l = re.split("\n{2,}",book_string)
l

['',
 'The Complete Works of William Shakespeare',
 'by William Shakespeare',
 '                    Contents',
 '    THE SONNETS\n    ALL’S WELL THAT ENDS WELL\n    THE TRAGEDY OF ANTONY AND CLEOPATRA\n    AS YOU LIKE IT\n    THE COMEDY OF ERRORS\n    THE TRAGEDY OF CORIOLANUS\n    CYMBELINE\n    THE TRAGEDY OF HAMLET, PRINCE OF DENMARK\n    THE FIRST PART OF KING HENRY THE FOURTH\n    THE SECOND PART OF KING HENRY THE FOURTH\n    THE LIFE OF KING HENRY THE FIFTH\n    THE FIRST PART OF HENRY THE SIXTH\n    THE SECOND PART OF KING HENRY THE SIXTH\n    THE THIRD PART OF KING HENRY THE SIXTH\n    KING HENRY THE EIGHTH\n    THE LIFE AND DEATH OF KING JOHN\n    THE TRAGEDY OF JULIUS CAESAR\n    THE TRAGEDY OF KING LEAR\n    LOVE’S LABOUR’S LOST\n    THE TRAGEDY OF MACBETH\n    MEASURE FOR MEASURE\n    THE MERCHANT OF VENICE\n    THE MERRY WIVES OF WINDSOR\n    A MIDSUMMER NIGHT’S DREAM\n    MUCH ADO ABOUT NOTHING\n    THE TRAGEDY OF OTHELLO, THE MOOR OF VENICE\n    PERICLES, PRINCE OF TYRE\

In [13]:
pattern = "[^0-9a-zA-Z\s]"
compiled = re.compile(pattern)
lst = []
for i in re.split("\n{2,}",book_string):
    i = compiled.sub(' \g<0> ',i)
    if i != "":
        lst.append('\x02 '+i+' \x03 ')

answer = "".join(lst).strip().split(" ")
answer = list(filter(None,answer))

answer


['\x02',
 'The',
 'Complete',
 'Works',
 'of',
 'William',
 'Shakespeare',
 '\x03',
 '\x02',
 'by',
 'William',
 'Shakespeare',
 '\x03',
 '\x02',
 'Contents',
 '\x03',
 '\x02',
 'THE',
 'SONNETS\n',
 'ALL',
 '’',
 'S',
 'WELL',
 'THAT',
 'ENDS',
 'WELL\n',
 'THE',
 'TRAGEDY',
 'OF',
 'ANTONY',
 'AND',
 'CLEOPATRA\n',
 'AS',
 'YOU',
 'LIKE',
 'IT\n',
 'THE',
 'COMEDY',
 'OF',
 'ERRORS\n',
 'THE',
 'TRAGEDY',
 'OF',
 'CORIOLANUS\n',
 'CYMBELINE\n',
 'THE',
 'TRAGEDY',
 'OF',
 'HAMLET',
 ',',
 'PRINCE',
 'OF',
 'DENMARK\n',
 'THE',
 'FIRST',
 'PART',
 'OF',
 'KING',
 'HENRY',
 'THE',
 'FOURTH\n',
 'THE',
 'SECOND',
 'PART',
 'OF',
 'KING',
 'HENRY',
 'THE',
 'FOURTH\n',
 'THE',
 'LIFE',
 'OF',
 'KING',
 'HENRY',
 'THE',
 'FIFTH\n',
 'THE',
 'FIRST',
 'PART',
 'OF',
 'HENRY',
 'THE',
 'SIXTH\n',
 'THE',
 'SECOND',
 'PART',
 'OF',
 'KING',
 'HENRY',
 'THE',
 'SIXTH\n',
 'THE',
 'THIRD',
 'PART',
 'OF',
 'KING',
 'HENRY',
 'THE',
 'SIXTH\n',
 'KING',
 'HENRY',
 'THE',
 'EIGHTH\n',
 'THE',
 '


## Part 2: Creating the Baseline Language Models 

Now that we're able to tokenize a corpus, it is time to start building language models (LM).

In this project, we will build three different language models. They all operate under the premise of assigning probabilities to sentences. Given a sentence – that is, a sequence of tokens $w = w_1\ldots w_n$ – we want to be able to compute the **probability** that sentence is used: 
$$P(w) = P(w_1,\ldots,w_n)$$

However, sentences are built from tokens, and the likelihood that a token occurs where it does depends on the tokens before it. This points to using **conditional probability** to compute $P(w)$. That is, we can write:

$$
P(w) = P(w_1,\ldots,w_n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1,w_2) \cdot\ldots\cdot P(w_n|w_1,\ldots,w_{n-1})
$$  
This is also called the **chain rule** for probabilities.

**Example:** Consider the sentence 

<center><code>'when I drink Coke I smile'</code></center>
    
The probability that it occurs, according the the chain rule, is

$$
P(\text{when}) \cdot P(\text{I | when}) \cdot P(\text{drink | when I})\cdot P(\text{Coke | when I drink}) \cdot P(\text{I | when I drink Coke}) \cdot P(\text{smile | when I drink Coke I})
$$

That is, the probability that the sentence occurs is the product of the probability that each subsequent token follows the tokens that came before. For example, the probability $P(\text{Coke | when I drink})$ is likely pretty high, as Coke is something that you drink. The probability $P(\text{pizza | when I drink})$ is likely low (if not 0), because pizza is not something that you drink.

### Model 1: Uniform Language Models


A uniform language model is one in which each **unique** word is **equally likely** to appear in any position, unconditional of any other information.

Let's put into context what this means by using the following example corpus:

```py
>>> corpus = 'when I eat pizza, I smile, but when I drink Coke, my stomach hurts'
>>> tokenize(corpus)
['\x02', 'when', 'I', 'eat', 'pizza', ',', 'I', 'smile', ',', 'but', 'when', 'I', 'drink', 'Coke', ',', 'my', 'stomach', 'hurts', '\x03']
```

Given a tokenized corpus, we build the language model itself in `train`. As mentioned above, language models are stored as Series, where words are the index and probabilities are the values. **In a uniform language model**, the probability assigned to each token is **1 over the total number of unique tokens in the corpus**.

The example corpus above has 14 **unique** tokens. This means that we'd have $P(\text{\x02}) = \frac{1}{14}$, $P(\text{when}) = \frac{1}{14}$, and so on. Specifically, in this example, **the Series that `train` returns should contain the following values**:

| Token | Probability |
| --- | --- |
| `'\x02'` | $\frac{1}{14}$ |
| `'when'` | $\frac{1}{14}$ |
| `'I'` | $\frac{1}{14}$ |
| `'eat'` | $\frac{1}{14}$ |
| `'pizza'` | $\frac{1}{14}$ |
| `','` | $\frac{1}{14}$ |
| `'smile'` | $\frac{1}{14}$ |
| `'but'` | $\frac{1}{14}$ |
| `'drink'` | $\frac{1}{14}$ |
| `'Coke'` | $\frac{1}{14}$ |
| `'my'` | $\frac{1}{14}$ |
| `'stomach'` | $\frac{1}{14}$ |
| `'hurts'` | $\frac{1}{14}$ |
| `'\x03'` | $\frac{1}{14}$ |

Note that:
- **None of the probabilities we computed are conditional – the uniform model does not use conditional probabilities!**
- When looking at the Series that `train` returns (i.e. when looking at the `mdl` attribute), the `'\x02'` and `'\x03'` characters will show up as blank characters in the index. This is to be expected.
- Your Series doesn't need to have the labels `'Token'` or `'Probability'` in it, like the above table does.

After training the model, we implement two more methods:

### `probability`:

`probability` Takes in any tuple of tokens and use the probabilities computed in `train` (that are stored in the `mdl` attribute) to assign a probability to that sequence. For instance, suppose the input tuple is `('when', 'I', 'drink', 'Coke', 'I', 'smile')`. The returned probability is:

$$P(\text{when I drink Coke I smile}) = P(\text{when}) \cdot P(\text{I}) \cdot P(\text{drink}) \cdot P(\text{Coke}) \cdot P(\text{I}) \cdot P(\text{smile}) = \frac{2}{19} \cdot \frac{3}{19} \cdot \frac{1}{19} \cdot \frac{1}{19} \cdot \frac{3}{19} \cdot \frac{1}{19}$$


### `sample`

Finally, `sample` takes in a positive integer, `M`, and return a sentence made up of `M` randomly sampled tokens, in which the probabilities come from `mdl`.Note that this sentence doesn't make any grammatical sense as we are assuming a **uniform probability** in *this* model and that tokens are separated by spaces.

In [42]:
class UniformLM(object):

    def __init__(self, tokens):
        """
        Initializes a Uniform languange model using a
        list of tokens. It trains the language model
        using `train` and saves it to an attribute
        self.mdl.
        """
        self.mdl = self.train(tokens)
    
    
    def train(self, tokens):
        index = pd.Series(tokens).unique()
        return pd.Series(index).value_counts()/len(index)

    def probability(self, words):
        try:
            return (pd.DataFrame((self.mdl[list(words)].values)).T.prod(axis=1).values[0])
        except:
            return 0
    
    def sample(self, M):
        return " ".join(self.mdl.sample(n=M,replace = True).index)

In [45]:
tokens = tuple('the world\'s a stage, and all the men and women merely players.'.split())
unif = UniformLM(tokens)
unif.train(tokens)

the         0.1
world's     0.1
a           0.1
stage,      0.1
and         0.1
all         0.1
men         0.1
women       0.1
merely      0.1
players.    0.1
dtype: float64

In [48]:
tokens = tuple('the world\'s a stage, and all the men and women merely players.'.split())
unif = UniformLM(tokens)
unif.sample(100)

"and men players. merely merely stage, all men all and women stage, women merely stage, women world's players. a the all merely merely players. a merely merely world's stage, women women all women a a women the men stage, women all all women world's the women the a and stage, stage, a all world's and men stage, players. and and merely players. merely merely men stage, world's all the a all and a stage, all and men a players. merely stage, women women a a all a women the men stage, merely all and women and merely the stage, women"

### Model 2: Unigram Language Models


A unigram language model is one in which the **probability assigned to a token is equal to the proportion of tokens in the corpus that are equal to said token**. That is, the probability distribution associated with a unigram language model is just the empirical distribution of tokens in the corpus. 

Let's understand how probabilities are assigned to tokens using our example corpus from before.

```py
>>> corpus = 'when I eat pizza, I smile, but when I drink Coke, my stomach hurts'
>>> tokenize(corpus)
['\x02', 'when', 'I', 'eat', 'pizza', ',', 'I', 'smile', ',', 'but', 'when', 'I', 'drink', 'Coke', ',', 'my', 'stomach', 'hurts', '\x03']
```

Here, there are 19 total tokens. 3 of them are equal to `'I'`, so $P(\text{I}) = \frac{3}{19}$. Here, the Series that `train` returns should contain the following values:

| Token | Probability |
| --- | --- |
| `'\x02'` | $\frac{1}{19}$ |
| `'when'` | $\frac{2}{19}$ |
| `'I'` | $\frac{3}{19}$ |
| `'eat'` | $\frac{1}{19}$ |
| `'pizza'` | $\frac{1}{19}$ |
| `','` | $\frac{3}{19}$ |
| `'smile'` | $\frac{1}{19}$ |
| `'but'` | $\frac{1}{19}$ |
| `'drink'` | $\frac{1}{19}$ |
| `'Coke'` | $\frac{1}{19}$ |
| `'my'` | $\frac{1}{19}$ |
| `'stomach'` | $\frac{1}{19}$ |
| `'hurts'` | $\frac{1}{19}$ |
| `'\x03'` | $\frac{1}{19}$ |

As before, we implement the 2 methods - `probability` & `sample`, except the probabilities will be different given that the probabilities are now *proportional* to presence in the corpus.

In [47]:
class UnigramLM(object):
    
    def __init__(self, tokens):
        self.mdl = self.train(tokens)
        
    def train(self, tokens):

        return pd.Series(tokens).value_counts()/len(pd.Series(tokens))
    
    
    def probability(self, words):

        try:
            return (pd.DataFrame((self.mdl[list(words)].values)).T.prod(axis=1).values[0])
        except:
            return 0
        
    def sample(self, M):

        return " ".join(self.mdl.sample(n = M,replace = True,weights = self.mdl.values).index)


In [69]:
tokens = tuple('the world\'s a stage, and all the men and women merely players.'.split())
unigram = UnigramLM(tokens)


unigram.probability(('and', 'all', 'the', 'men'))

0.00019290123456790122

In [70]:
unigram.sample(100)

"and the and stage, all and men the merely a world's a players. players. and and the a and men the the world's a players. all players. men and world's and the the women all world's all players. a and and merely women stage, the a the and and and the the all players. world's merely and the the women the and the stage, stage, the merely women players. the women all players. players. the stage, and stage, players. women world's merely and all and a women merely all a players. and and world's all players. the the all women"

---

<a name='part3'></a>

## Model 3: N-Gram Language Model 

### N-Gram Overview

Now we will build an N-Gram language model, in which the probability of a token appearing in a sentence **does depend** on the tokens that come before it. 

The chain rule above specifies that the probability that a token occurs at in a particular position in a sentence depends on **all** previous tokens in the sentence. However, it is often the case that the likelihood that a token appears in a sentence is influenced more by **nearby** tokens. (Remember, tokens are words, punctuation, or `'\x02'` / `'\x03'`).

The N-Gram language model relies on the assumption that only nearby tokens matter. Specifically, it assumes that the probability that a token occurs depends only on the previous $N-1$ tokens, rather than all previous tokens. That is:

$$P(w_n|w_1,\ldots,w_{n-1}) = P(w_n|w_{n-(N-1)},\ldots,w_{n-1})$$

In an N-Gram language model, there is a hyperparameter that we get to choose when creating the model, $N$. For any $N$, the resulting N-Gram model looks at the previous $N-1$ tokens when computing probabilities. (Note that the unigram model you built in Question 4 is really an N-Gram model with $N=1$, since it looked at 0 previous tokens when computing probabilities.)

<br>

#### Example: Trigram Model

When $N=3$, we have a "trigram" model. Such a model looks at the previous $N-1 = 2$ tokens when computing probabilities.

Consider the tuple `('when', 'I', 'drink', 'Coke', 'I', 'smile')`, corresponding to the sentence `'when I drink Coke I smile'`. Under the trigram model, the probability of this sentence is computed as follows:

$$P(\text{when I drink Coke I smile}) = P(\text{when}) \cdot P(\text{I | when}) \cdot P(\text{drink | when I}) \cdot P(\text{Coke | I drink}) \cdot P(\text{I | drink Coke}) \cdot P(\text{smile | Coke I})$$

The trigram model doesn't consider the beginning of the sentence when computing the probability that the sentence ends in `'smile'`.

<br>

#### N-Grams

Both when working with a training corpus and when implementing the `probability` method to compute the probabilities of other sentences, you will need to work with "chunks" of $N$ tokens at a time.

**Definition:** The **N-Grams of a text** are a list of tuples containing sliding windows of length $N$.

For instance, the trigrams in the sentence `'when I drink Coke I smile'` are:

```py
[('when', 'I', 'drink'), ('I', 'drink', 'Coke'), ('drink', 'Coke', 'I'), ('Coke', 'I', 'smile')]
```

<br>

#### Computing N-Gram Probabilities

Notice in our trigram model above, we computed $P(\text{when I drink Coke I smile})$ as being the product of several conditional probabilities. These conditional probabilities are the result of **training** our N-Gram model on a training corpus.

To train an N-Gram model, we must compute a conditional probability for every $N$-token sequence in the corpus. For instance, suppose again that we are training a trigram model. Then, for every 3-token sequence $w_1, w_2, w_3$, we must compute $P(w_3 | w_1, w_2)$. To do so, we use:

$$P(w_3 | w_1, w_2) = \frac{C(w_1, w_2, w_3)}{C(w_1, w_2)}$$

where $C(w_1, w_2, w_3)$ is the number of occurrences of the trigram sequence $w_1, w_2, w_3$ in the training corpus and $C(w_1, w_2)$ is the number of occurrences of the bigram sequence  $w_1, w_2$ in the training corpus. (Technical note: the probabilities that we compute using the ratios of counts are _estimates_ of the true conditional probabilities of N-Grams in the population of corpuses from which our corpus was drawn.)

In general, for any $N$, conditional probabilities are computed by dividing the counts of N-Grams by the counts of the (N-1)-Grams they follow. 


### Creating the N-Grams Class:

In [82]:
class NGramLM(object):
    
    def __init__(self, N, tokens):
        """
        Initializes a N-gram languange model using a
        list of tokens. It trains the language model
        using `train` and saves it to an attribute
        self.mdl.
        """
        
        self.N = N

        ngrams = self.create_ngrams(tokens)

        self.ngrams = ngrams
        self.mdl = self.train(ngrams)

        if N < 2:
            raise Exception('N must be greater than 1')
        elif N == 2:
            self.prev_mdl = UnigramLM(tokens)
        else:
            self.prev_mdl = NGramLM(N-1, tokens)

    def create_ngrams(self, tokens):
        """
        create_ngrams takes in a list of tokens and returns a list of N-grams. 
        The START/STOP tokens in the N-grams should be handled as 
        explained in the notebook.
        """
        ngram = []
        for i in range(0,len(tokens) - self.N + 1):
            ngram.append(tuple(tokens[i:i+self.N]))
        return ngram
        
    def train(self, ngrams):
        """
        Trains a n-gram language model given a list of tokens.
        The output is a dataframe with three columns (ngram, n1gram, prob).

        """
        # N-Gram counts C(w_1, ..., w_n)
        numerator = pd.DataFrame(pd.Series(self.ngrams).value_counts(), columns = ["numerator"])
        
        # (N-1)-Gram counts C(w_1, ..., w_(n-1))
        n1gram = []
        for i in self.ngrams:
            n1gram.append(i[0:self.N-1])
        denominator = pd.DataFrame(pd.Series(n1gram).value_counts(), columns = ["denominator"])

        # Create the conditional probabilities
        df =  pd.DataFrame()
        df["ngram"] = self.ngrams
        df["n1gram"] = n1gram
        merged_ne = pd.merge(df,numerator,left_on = "ngram", right_on = numerator.index)
        final_df = pd.merge(merged_ne,denominator,left_on = "n1gram", right_on = denominator.index)
        final_df["prob"] = final_df["numerator"]/final_df["denominator"]
        # Put it all together

        final_df = final_df.drop(["numerator","denominator"],axis = 1)
        return final_df.drop_duplicates()
    
    def probability(self, words):
        """
        probability gives the probabiliy a sequence of words
        appears under the language model.
        :param: words: a tuple of tokens
        :returns: the probability `words` appears under the language
        model.
        """
        try:
            if type(self.prev_mdl) == UnigramLM:
                idk = NGramLM(2, words)
                lst = []
                for i in idk.mdl["ngram"].values:
                    lst.append(self.mdl[self.mdl["ngram"] == i]["prob"].values[0])
                return self.prev_mdl.mdl[words[0]]*pd.Series(lst).prod()
            else: 
                trigrams = NGramLM(3, words)
                lst = []
                for i in trigrams.mdl["ngram"]:
                    lst.append(self.mdl["prob"][self.mdl["ngram"]==i].values[0])
                return self.prev_mdl.prev_mdl.mdl[words[0]]*self.prev_mdl.mdl[self.prev_mdl.mdl["ngram"] == tuple(words[0:2])]["prob"].values[0]*pd.Series(lst).prod()
        except IndexError:
            return 0
    def sample(self, M):
        """
        sample selects tokens from the language model of length M, returning
        a string of tokens.
        """
        # Use a helper function to generate sample tokens of length `length`
        def helper(string1,string2):
            if type(self.prev_mdl) == UnigramLM:
                lst = []
                df = pd.DataFrame()
                for i in self.mdl["ngram"]:
                    if i[0]==string1:
                        lst.append(i)
                for i in lst:
                    df = df.append(self.mdl[self.mdl["ngram"] == i])
                return df.sample(n=1,replace = True, weights = df["prob"])["ngram"].values[0]
            elif (string1 == "\x02")&(string2 == None):
                lst = []
                df = pd.DataFrame()
                for i in self.prev_mdl.mdl["ngram"]:
                    if i[0]==string1:
                        lst.append(i)
                for i in lst:
                    df = df.append(self.prev_mdl.mdl[self.prev_mdl.mdl["ngram"] == i])
                return df.sample(n=1,replace = True, weights = df["prob"])["ngram"].values[0]
            else:
                lst = []
                df = pd.DataFrame()
                for i in self.mdl["n1gram"]:
                    if (i[0] == string1)&(i[1] == string2):
                        lst.append(i)
                for i in lst:
                    df = df.append(self.mdl[self.mdl["n1gram"] == i])
                df = df.drop_duplicates()
                return df.sample(n=1,replace = True,weights = df["prob"])["ngram"].values[0]
        
        
        # Transform the tokens to strings
        words = ""
        Mcount = 0
        if M == 1:
            return "\x02 \x03"
        else :
            while Mcount != M:
                if type(self.prev_mdl) == UnigramLM:
                    if Mcount == 0:
                        start = helper("\x02",None)
                        words += start[0]
                        Mcount +=1
                    elif Mcount == 1:
                        word = helper(start[1],None)
                        words += (" " + word[0])
                        Mcount +=1
                    else :
                        word = helper(word[1],None)
                        words += (" " + word[0])
                        Mcount +=1
                else:
                    if Mcount == 0:
                        start = helper("\x02",None)
                        words += start[0]
                        Mcount +=1
                    elif Mcount == 1:
                        word = helper(start[0],start[1])
                        words += (" " + word[1])
                        Mcount +=1
                    else :
                        word = helper(word[1],word[2])
                        words += (" " + word[1])
                        Mcount +=1
        words += " \x03"
        return words


### Training the N-Gram Probabilities: 

<a name='question5b'></a>

Now, you will compute the probabilities that define N-Gram language model itself. Recall that the N-Gram LM consists of probabilities of the form

$$P(w_n|w_{n-(N-1)},\ldots,w_{n-1})$$

which we estimate by  

$$\frac{C(w_{n-(N-1)}, w_{n-(N-2)}, \ldots, w_{n-1}, w_n)}{C(w_{n-(N-1)}, w_{n-(N-2)}, \ldots, w_{n-1})}$$

for every N-Gram that occurs in the corpus. To illustrate, consider again the following example corpus:

```py
>>> corpus = 'when I eat pizza, I smile, but when I drink Coke, my stomach hurts'
>>> tokens = tokenize(corpus)
>>> tokens
['\x02', 'when', 'I', 'eat', 'pizza', ',', 'I', 'smile', ',', 'but', 'when', 'I', 'drink', 'Coke', ',', 'my', 'stomach', 'hurts', '\x03']
>>> pizza_model = NGrams(3, tokens)
```

Here, `pizza_model.train` must compute $P(\text{I | \x02 when})$, $P(\text{eat | when I})$, $P(\text{pizza | I eat})$, and so on, until $P(\text{\x03 | stomach hurts})$.

To compute $P(\text{eat | when I})$, we must find the number of occurrences of `'when I eat'` in the training corpus, and divide it by the number of occurrences of `'when I'` in the training corpus. `'when I eat'` occurred exactly once in the training corpus, while `'when I'` occurred twice, so,

$$P(\text{eat | when I}) = \frac{C(\text{when I eat})}{C(\text{when I})} = \frac{1}{2}$$




### Computing Probabilities using the N-Gram Model

<a name='question5c'></a>

After we've trained our N-Gram model – that is, after we've computed a DataFrame associating each N-Gram with a conditional probability – we need to compute probabilities for new sentences.

To illustrate how this may work, let's look at an example input tuple to `probability`. Assume our model is `pizza_model` from Question 5.2's description; we will not repeat the probability table here.

Suppose our input tuple is `('when', 'I', 'eat', 'pizza', ',', 'I', 'smile')`, corresponding to the sentence `'when I eat pizza, I smile'` (remember again that the tuples provided to `probability` don't need to include `'\x02'` or `'\x03'`). Then,

$$
\begin{align*} &P(\text{when I eat pizza, I smile}) \\ &= P(\text{when}) \cdot P(\text{I | when}) \cdot P(\text{eat | when I}) \cdot P(\text{pizza | I eat}) \cdot P(\text{, | eat pizza}) \cdot P(\text{I | pizza,})\cdot P(\text{smile | , I}) \\ &= \frac{2}{19} \cdot 1 \cdot \frac{1}{2} \cdot 1 \cdot 1 \cdot 1 \cdot 1 \\ &= \frac{1}{19} \end{align*}
$$


- To find the latter five probabilities – $P(\text{eat | when I}) , P(\text{pizza | I eat}) , P(\text{, | eat pizza}) , P(\text{I | pizza,}),$ and $P(\text{smile | , I})$, we can use the `mdl` DataFrame that the `train` method computes.
- To find $P(\text{I | when})$, we can't just look at the `mdl` DataFrame, because `('when', 'I')` is not a trigram, it is a bigram. Instead, we look at our model's `prev_mdl` attribute, which itself is another instance of `NGramLM`, corresponding to a bigram model over the same corpus. There, we can find the probability $P(\text{I | when})$.
- To find $P(\text{when})$, we can't just look at the `mdl` DataFrame, because `'when'` is not a trigram. It is not a bigram either. Instead, we need to look at `prev_mdl`'s `prev_mdl`, which is a `UnigramLM`, to find $P(\text{when})$.

Note that if the input tuple contains an N-Gram that was never seen in the training corpus, the returned probability is 0. Convince yourself why `pizza_model.probability(('when', 'I', 'drink', 'Coke', ',', 'I', 'smile'))` is 0 before proceeding.


### Sampling from the N-Gram Model

<a name='question5d'></a>

The last method you implemented in the `UniformLM` and `UnigramLM` classes was `sample`, which gave you a way of generating new sentences. 

Now, you will implement the `sample` method in the `NGramLM` class. It should take in a positive integer `M` and generate a string of M tokens using the trained language model. It should begin with a starting token `'\x02'`, then generate subsequent tokens from the probabilities in `self.mdl` and continue picking words conditional on the previous choice. 

Let's illustrate how sampling works using a small concrete example. Suppose our corpus and **trigram** model are defined below:

```py
>>> short_corpus = 'zebras eat green peas \n\n cows eat green grass \n\n zebras eat green peppers'
>>> short_tokens = tokenize(short_corpus)
>>> short_tokens
['\x02', 'zebras', 'eat', 'green', 'peas', '\x03', '\x02', 'cows', 'eat', 'green', 'grass', '\x03', '\x02', 'zebras', 'eat', 'green', 'peppers', '\x03']
>>> grass_model = NGramLM(3, short_tokens)
```

Suppose we are told to execute `grass_model.sample(5)`. Here's how we'd proceed:

0. The first character in the output is `'\x02'`, as specified above. **We won't count `'\x02'` in the length of our output string**, so we still need to find 5 more tokens.
1. The next character needs to be either `'zebras'` or `'cows'`, since `('\x02', 'zebras')` and `('\x02', 'cows')` are the only **bigrams** in `short_tokens` that start with an `'\x02'`. $P(\text{zebras | \x02})$ is $\frac{2}{3}$ and $P(\text{cows | \x02})$ is $\frac{1}{3}$, so we select either `'zebras'` or `'cows'` for our next token according to these probabilities. For the sake of example, suppose we select `'cows'`. 4 more tokens to go.
2. Now, we must look for **trigrams** that start with the bigram `('\x02', 'cows')`. There is just one, `('\x02', 'cows', 'eat')`, so our next token must be `'eat'`. 3 more tokens to go.
3. Now, we must look for **trigrams** that start with the bigram `('cows', 'eat')`. Again, there is just one, `('cows', 'eat', 'green')`, so our next token must be `'green'`. 2 more tokens to go.
4. Now, we must look for **trigrams** that start with the bigram `('eat', 'green')`. There are three options – `('eat', 'green', 'peas')`, `('eat', 'green', 'grass')`, and `('eat', 'green', 'peppers')`. Since $P(\text{peas | eat green}) = P(\text{grass | eat green}) = P(\text{peppers | eat green}) = \frac{1}{3}$, we pick either `'peas'`, `'grass'`, or `'peppers'` uniformly at random. For the sake of example, suppose we select `'peppers'`. 1 more token to go.
5. We must end the output string now with `'\x03'`, putting us at `'\x02'` plus 5 tokens, which is the number of tokens we were told to sample. Note that `'\x03'` **does** count towards the number of tokens we were asked to sample.

Our result is `'\x02 cows eat green peppers \x03'`. **Note that in our training corpus we never encountered an instance of cows 🐄 eating green peppers 🫑, but we were able to generate a coherent sentence in which they did – pretty cool!**


Some additional guidance:
- If you run into a situation where there are no N-Grams that match the most recent (N-1)-Gram, you should add `'\x03'` (STOP) token as the next token in your output sentence. There is a chance that your sampled sentence ends in many `'\x03'`s, and that's fine.
- Helper functions and recursion will be very helpful.

After you've understood the above example, complete the implementation of the `sample` method in `NGramLM`.

# TESTING OUR NGRAM MODEL ON SHAKESPEARE & HOMER:

In [93]:
homer_tokens = tuple(open('data/homertokens.txt').read().split(' '))
shakes_tokens = tuple(open('data/shakespeare.txt').read().split(' '))

In [96]:
homer_corpus  = NGramLM(5, homer_tokens)
shakes_corpus = NGramLM(5, shakes_tokens)
#homer_text_sample = homer_corpus.sample(20)
#Shakespeare_text_sample = shakes_corpus.sample(20)

### Recreating a text in the style of Homer:

In [90]:
homer_text_sample

'\x02 { 120 } And may we not tramps and beggars generally ? I wish you would give my message \x03'

### Recreating a text in the style of William Shakespeare:

In [98]:
Shakespeare_text_sample

' \x02 { 101 } How else must we attempt to deliver ? I pray that you love that which you dream \x03'

After our `NGramLM` implementation, we need to do a little bit of reflecting. How can we improve the `NGramLM` model? One major deficit is that it assigns a probability of 0 to sentences that contain N-Grams that weren't seen in the corpus; how might can we address this? 