## Playing with (unsmoothed character-level maximum likelihood) Language Models

In this exercise we'll play with so-called unsmoothed character-level maximum likelihood language models and discuss how impressively effective even these simple models can be for modeling language.

There will be one significant difference from what we've seen so far -- the language model will consider characters as its "atomic units", as opposed to words. This has its pros and cons, but we'll stick with characters due to the simplicity of the resulting model. You will find that we'll need surprisingly little code for what we are trying to do.

Speaking of which, what is it that we want to do exactly? Well, contrary to the model's name, the idea is actually very simple. We want a model that will receive $n$ characters as input and guess the next character in the sequence (hence the **character-level** part of the title). The $n$ characters the model sees are known as the "history", and their number $n$ is the "order" of the language model.

Looking at the task intuitively, we could easily rephrase it as "looking at the history $h$, tell which character is likely to go next". Humans have whole lifetimes of reading to train their internal language models for this task, and it turns out we are pretty good at that! In our case we'll try to learn this language model from English text, namely the works of Shakespeare.

Mathematically speaking, we'll try to learn a function $P(c|h)$, where $h$ is the history and $c$ is a character. The function $P$ then describes the probability of character $c$ following the history $h$. This should once again make sense intuitively, as $P(\texttt{o}|\texttt{hell})$ should be higher than, say, $P(\texttt{l}|\texttt{hell})$.

So how should we go about implementing something like this? In a very straightforward manner: let's just count the number of times a character $c$ follows the history $h$ and divide it by the total number of characters that follow $h$ in the text we are provided (this is what's called the **maximum likelihood** estimate). We call this language model **unsmoothed** because when we never see a character following $h$, we simply set its probability to zero. That can potentially cause some trouble, but it should not be a big problem in our case.
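
Written out, the estimate described above looks roughly as follows. Here $\operatorname{count}(h, c)$ is notation introduced only for this sketch; it stands for the number of times character $c$ follows history $h$ in the training text:

$$P(c \mid h) = \frac{\operatorname{count}(h, c)}{\sum_{c'} \operatorname{count}(h, c')}$$

Whenever $\operatorname{count}(h, c) = 0$, the numerator is zero and so is the probability, which is exactly the unsmoothed behavior.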

Let's put it together!

### Training

In [None]:
from collections import defaultdict, Counter

There are two components in Python's standard library that can help us implement our language model -- a `defaultdict` defaulting to a [`Counter`](https://docs.python.org/3.7/library/collections.html#collections.Counter). That will then allow us to do things like the following:

In [None]:
# Initialize a sample Language Model
sample_lm = defaultdict(Counter)

# Add a count for character 'p' that follows the history 'hel'
sample_lm['hel']['p'] += 1
# Add a count for character 'l' that follows the history 'hel'
sample_lm['hel']['l'] += 3

We can also use this setup to easily get the number of characters we saw after a particular history (which should be quite helpful for implementing the normalization step):

In [None]:
sum(sample_lm['hel'].values())
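
To make the normalization step concrete, here is a minimal sketch on the same toy counts. The name `sample_probs` is made up for this illustration, and this is only one possible way of turning counts into probabilities:

```python
from collections import Counter, defaultdict

# Same toy counts as in the cells above
sample_lm = defaultdict(Counter)
sample_lm['hel']['p'] += 1
sample_lm['hel']['l'] += 3

# Total number of characters seen after the history 'hel'
total = sum(sample_lm['hel'].values())

# Divide each count by the total to get a probability
sample_probs = [(char, count / total) for char, count in sample_lm['hel'].items()]
print(sample_probs)
# [('p', 0.25), ('l', 0.75)]
```

The probabilities for a single history always sum to one, which is a handy sanity check for a trained model.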

Armed with this knowledge, it should not be a problem to implement the following `train_char_lm` function that takes an input text and returns a trained language model of specific order.

*Note: the language model should return probabilities. You can easily get them just by normalizing with the total number of characters seen after some history (previous cell). Creating a new dictionary for the output should be the easiest way of doing so.*

In [None]:
SENTINEL = '~'
def train_char_lm(text, order=3):
    """
    Args:
        text (str): the input text to train the language model on
        order (int): the order of the language model
    
    Returns:
        dict[str, list]: a dictionary with a list of (char, probability)
        tuples for each encountered history.
    
    Example:
    
        lm = train_char_lm('what am i doing here', order=3)
        print(lm['wha'])
        # [('t', 1.0)] 
        
    """
    padding = SENTINEL * order
    text = padding + text

    lm = defaultdict(Counter)
    
    # YOUR CODE HERE
    
    return {}

Using this function, training a language model should be quite straightforward (and take only a few seconds on modern hardware):

In [None]:
lm = train_char_lm(open('./shakespeare.txt').read())

Here are a couple of tests to ensure your language model got trained correctly:

In [None]:
assert isinstance(lm['hel'], list)
assert isinstance(lm['hel'][0], tuple)
assert isinstance(lm['hel'][0][0], str)
assert isinstance(lm['hel'][0][1], float)

In [None]:
lm['hel']

In [None]:
lm['DUK']

If you implemented the `train_char_lm` function correctly, you should see above that `hel` is generally followed by `p`, `l` or `d` and that once we find `DUK` in this Shakespeare text, we can be sure that it is followed by `E`.

### Generating from the model

Once we have the model, generating from it is also quite easy. We'll just take the history, look at the last `order` characters, obtain a probability distribution and sample the next character based on that.

In [None]:
from random import random

def sample_char(lm, history, order):
    h = history[-order:]
    dist = lm[h]
    
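    # A poor man's weighted sampling method: subtract probabilities from a
    # random number in [0, 1) until it drops to zero or below
    x = random()
    for c, v in dist:
        x = x - v
        if x <= 0:
            return c
    # Floating-point rounding can leave x slightly above zero after the loop;
    # fall back to the last character instead of returning None
    return dist[-1][0]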
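
As an aside, the standard library can do the weighted draw for us. The sketch below uses `random.choices`; the helper name `sample_char_choices` and the toy distribution are made up for this illustration, and the function is meant as an alternative to the loop above, not as a replacement the exercise requires:

```python
from random import choices

def sample_char_choices(dist):
    """Hypothetical helper: draw one character from a list of (char, probability) tuples."""
    chars = [c for c, _ in dist]
    weights = [p for _, p in dist]
    # choices() returns a list of k samples; we ask for a single one
    return choices(chars, weights=weights, k=1)[0]

toy_dist = [('p', 0.25), ('l', 0.75)]
print(sample_char_choices(toy_dist))  # 'p' or 'l'; 'l' comes up about three times as often
```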

If we then want to generate say $k$ characters in a row, we'll just pass in the initial history and sample the next character $k$ times, while passing the newly generated characters back to the history.

In [None]:
def generate_text(lm, order, k=1000):
    
    history = SENTINEL * order
    text = ''
    
    # YOUR CODE HERE
    
    return text

Having done all this, we can finally experiment with trained models!

#### Generating text from language models of various orders

Let's start slow with a **model of order 2**. Note that this model will only look at the history of the past two characters when generating a new one.

In [None]:
order = 2
lm = train_char_lm(open('./shakespeare.txt').read(), order=order)
print(generate_text(lm, order=order))

Well, the results do not look too good -- let's try to up the **order to 4**.

In [None]:
order = 4
lm = train_char_lm(open('./shakespeare.txt').read(), order=order)
print(generate_text(lm, order=order))

Hmm, interesting. Will it help if we go up? Let's bump it up even more to **order of 7**.

In [None]:
order = 7
lm = train_char_lm(open('./shakespeare.txt').read(), order=order)
print(generate_text(lm, order=order))

In [None]:
order = 10
lm = train_char_lm(open('./shakespeare.txt').read(), order=order)
print(generate_text(lm, order=order))

As we can see, with orders of 2 and 4 we generally get gibberish, but once we move towards 7 to 10 (which in English corresponds to roughly 1.5 to 2 short words), we get text that could pass as something Shakespeare might have written! The ugly truth about the generated text is that quite a few parts of it are actually copied straight from the training text. It is kind of understandable -- we never let the model "experiment" with things that may not be in the training text, but you can try to fix that in the bonus part of the exercise.
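
If you want to put a number on how much is copied, one rough way is to check which fixed-length windows of the generated text occur verbatim in the training text. The helper `copied_fraction` and the window length `n` below are assumptions made for this sketch, not part of the exercise:

```python
def copied_fraction(generated, training, n=20):
    """Hypothetical helper: share of length-n substrings of `generated`
    that appear verbatim in `training`."""
    windows = [generated[i:i + n] for i in range(len(generated) - n + 1)]
    if not windows:
        return 0.0
    return sum(w in training for w in windows) / len(windows)

# Toy check: every 5-character window of the first string occurs in the second
print(copied_fraction('to be or not', 'to be or not to be', n=5))  # 1.0
print(copied_fraction('zzzzzz', 'to be or not to be', n=5))        # 0.0
```

With the real data you would pass in the output of `generate_text` and the contents of `shakespeare.txt`. Higher orders should push the fraction up, since longer histories leave the model fewer alternatives at each step.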

### Shakespearean autocomplete

One of the interesting use cases of well-trained language models is something we all know from (smart)phones: autocomplete. Given the training text we have, it would be one heck of an overstatement to call any language model that sees nothing else "well-trained", but still, since we have a trained language model, let's use it to autocomplete sentences for us!

Yes, in a standard setup this task is usually done with word-level language models. But with a few changes, we can use character-level language models as well.

Here is what we are going to do: we'll use the beginning of the text (or sentence) we want to autocomplete as a history and then sample as many characters as necessary until we get a single word. In practice, this means we'll sample characters until we hit a word boundary (basically one of `[' ', '?', '!', '.', ';', ',', '\n']`). Doing this multiple times will provide various autocompletions, just like an autocompletion engine.
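
The stopping rule on its own can be sketched without any language model. The names `WORD_BOUNDARIES` and `first_word` are made up for this illustration, and the characters come from a fixed string here, whereas in the exercise they would be sampled one at a time:

```python
# Boundary characters as listed above
WORD_BOUNDARIES = {' ', '?', '!', '.', ';', ',', '\n'}

def first_word(chars):
    """Hypothetical helper: collect characters until a word boundary shows up."""
    word = ''
    for c in chars:
        if c in WORD_BOUNDARIES:
            break
        word += c
    return word

print(first_word('think about it'))  # think
```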

Your task is to implement this autocompletion process in the following cell:

In [None]:
def autocomplete(lm, order, prompt, n=3):
    """
    Args:
    
        lm (dict[str, list]): A trained language model.
        order (int): The order of the language model.
        prompt (str): The prompt to base the completion on.
        n (int): The number of completions.
    
    Returns:
        list[str]: a list of n possible completions for
        the provided prompt.
    """
    completions = []

    # YOUR CODE HERE
    
    return completions

Once you implement it correctly, the following cell should output `['think']`.

In [None]:
order = 3
lm = train_char_lm('what do you think about it', order=order)
autocomplete(lm, order, 'do you ', n=1)

And if that's the case, see if you can generate some more interesting, "Shakespearean" autocompletions:

In [None]:
order = 7
lm = train_char_lm(open('./shakespeare.txt').read(), order=order)

In [None]:
prompts = [
    'I would ',
    'How do I ',
    'I want a ',
    'There are some '
]

for p in prompts:
    print(p, ':', autocomplete(lm, order, p, n=5))

### Final words

As we saw, it does not take too much code to put together a language model that produces some "readable" text (for some definition of readable). All we need is to have some suitable training text and look a couple of characters back at every point in it.

If you look at the generated text a bit more closely though (especially for higher-order language models), you may notice that it copies quite a few words directly from the source text. Some could of course say that "humans do that too" and they would obviously be correct to some extent. And before you know it, this discussion would end with questions like "what does it actually mean to be *creative*?" and we certainly do not want to go there. The larger point here is that text can be generated even with a simple model like this, although its "inspiration" may come directly from the data it saw in training.

This should help put things into perspective when we see text generated by [various neural models](https://transformer.huggingface.co/doc/gpt2-large). They are indeed quite impressive, but not because they can generate text: even a simple character-level model can do that. What such a simple model cannot do, however, is to be aware of context. And it turns out the neural models are somehow not too bad at doing so.

Hopefully, this quick exercise made you at least a bit curious to check them out!

*Note: the gist of this exercise is based on [this article by Yoav Goldberg](https://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139). You may also appreciate [Peter Norvig's](https://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb) article on playing with word-level language models.*

### Bonus: Smoothed character-level language model

As we've said above, the model we described is **unsmoothed**, which means that things that we do not see in the training dataset are considered not to exist (or to occur with zero probability). If we only sample text from known prompts (i.e. if we only use histories we are sure have occurred in the training text) this does not cause too much trouble, but try some more "exotic" prompts for the autocompletion above (like `'THERE ARE FOUR '`) and observe how the whole procedure fails with a `KeyError`.

This situation can be fixed though. Compared to word-level language models, it is much easier to enumerate all possible options (it's basically just all the characters that we expect to find in the input text -- about 70 in total).

In this bonus part, your task is to implement [simple Laplace (add-one) smoothing](https://web.stanford.edu/~jurafsky/slp3/slides/LM_4.pdf#page=48) in the training procedure and see whether this addition helps you generate better texts and autocompletions!
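
For reference, add-one smoothing is usually written as below. The symbols are introduced only for this sketch: $\operatorname{count}(h, c)$ is the number of times character $c$ follows history $h$ in the training text, and $V$ is the set of all characters we expect to find in the input:

$$P_{\text{add-one}}(c \mid h) = \frac{\operatorname{count}(h, c) + 1}{\sum_{c'} \operatorname{count}(h, c') + |V|}$$

For a history that never occurred in training, all counts are zero and every character gets the same probability $1 / |V|$, so sampling no longer breaks on unseen prompts.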