# Character-level Language Models

### [Derived from a blog post by Yoav Goldberg](https://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139)

## Unsmoothed Maximum Likelihood Character Level Language Model 

The name is quite long, but the idea is very simple. We want a model whose job is to guess the next character based on the previous $n$ letters. For example, having seen `ello`, the next character is likely to be either a comma or a space (if we assume it is the end of the word "hello"), or the letter `w` if we believe we are in the middle of the word "mellow". Humans are quite good at this, but of course seeing a larger history makes things easier (if we were to see 5 letters instead of 4, the choice between space and `w` would have been much easier).

We will call $n$, the number of letters we base our guess on, the _order_ of the language model.

So, we are seeing $n$ letters, and need to guess the $(n+1)$th one. We are also given a large-ish amount of text (say, all of Shakespeare's works) that we can use. How would we go about solving this task?

Mathematically, we would like to learn a function $P(c | h)$. Here, $c$ is a character, $h$ is an $n$-letter history, and $P(c|h)$ stands for how likely it is to see $c$ after we've seen $h$.

Perhaps the simplest approach would be to just count and divide (a.k.a. **maximum likelihood estimates**). We will count the number of times each letter $c'$ appeared after $h$, and divide by the total number of letters appearing after $h$. The **unsmoothed** part means that if we did not see a given letter following $h$, we will just give it a probability of zero.

And that's all there is to it.
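
As a minimal sketch of count-and-divide, here is the estimate computed for a tiny text with order 1. The helper name `mle_estimates` and the toy string are illustrative assumptions, not part of the original post.

```python
from collections import Counter, defaultdict

def mle_estimates(text, order):
    """Count which character follows each history, then divide by the history total."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        history, char = text[i:i + order], text[i + order]
        counts[history][char] += 1
    probs = {}
    for history, counter in counts.items():
        total = sum(counter.values())
        probs[history] = {c: n / total for c, n in counter.items()}
    return probs

probs = mle_estimates("abracadabra", order=1)
print(probs["a"])                # {'b': 0.5, 'c': 0.25, 'd': 0.25}
print(probs["a"].get("z", 0.0))  # 0.0 (unsmoothed: unseen letters get probability zero)
```

After `a`, the toy text contains `b` twice and `c` and `d` once each, so the estimates are 2/4, 1/4 and 1/4.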


### Training Code
Here is the code for training the model. `fname` is a file to read the characters from. `order` is the history size to consult (i.e., the $n$ in your $n$-gram). Note that we pad the data with leading `~` so that we also learn how to start (this is your `<START>` symbol in most language model notation).


In [1]:
import pandas as pd

In [1]:
from collections import Counter, defaultdict

def train_char_lm(fname, order=4):
    with open(fname) as f:
        # readlines() keeps each line's trailing newline, so this join doubles the line breaks.
        data = '\n'.join(f.readlines())
    lm = defaultdict(Counter)
    pad = "~" * order
    data = pad + data
    for i in range(len(data)-order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char]+=1
    def normalize(counter):
        s = float(sum(counter.values()))
        return [(c,cnt/s) for c,cnt in counter.items()]

    outlm = {hist:normalize(chars) for hist, chars in lm.items()}
    return outlm
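
To see what the `~` padding contributes, the following sketch lists the (history, next character) pairs that the training loop would visit for a short string. The helper name `history_char_pairs` is a hypothetical addition for illustration; it mirrors the slicing used in `train_char_lm` but does no counting.

```python
def history_char_pairs(text, order):
    """Return the (history, next_char) pairs seen during training (hypothetical helper)."""
    data = "~" * order + text  # same padding as train_char_lm
    return [(data[i:i + order], data[i + order]) for i in range(len(data) - order)]

pairs = history_char_pairs("hello", order=2)
print(pairs)
# [('~~', 'h'), ('~h', 'e'), ('he', 'l'), ('el', 'l'), ('ll', 'o')]
```

The first `order` pairs have histories containing `~`, which is how the model learns which characters tend to start the text.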

In [2]:
df=pd.read_csv('./ROCStories__spring2016 - ROCStories_spring2016.csv')

In [3]:
df=df.drop(columns=['storyid', 'storytitle'])

In [4]:
df['text'] = df.apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

In [6]:
df['text'].to_csv('./ROCstories2016.txt', index=False, header=False)

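Let's train it. The original post used Shakespeare (`shakespeare_input.txt`, which you can find on Canvas for SI 630); here we train on the ROCStories text file written above.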

In [21]:
# lm = train_char_lm("shakespeare_input.txt", order=4)
lm = train_char_lm('ROCstories2016.txt', order=4)

Ok. Now let's do some queries:

In [22]:
lm['hell']

[('y', 0.16666666666666666),
 ('o', 0.08536585365853659),
 ('e', 0.2845528455284553),
 ('s', 0.22357723577235772),
 (' ', 0.17073170731707318),
 ('.', 0.052845528455284556),
 (',', 0.008130081300813009),
 ('f', 0.0040650406504065045),
 ('i', 0.0040650406504065045)]

In [23]:
lm['Firs']

[('t', 1.0)]

In [24]:
lm['rst ']

[('d', 0.15833333333333333),
 ('s', 0.08666666666666667),
 ('e', 0.011666666666666667),
 ('t', 0.18766666666666668),
 ('p', 0.08133333333333333),
 ('b', 0.042333333333333334),
 ('i', 0.028),
 ('h', 0.04566666666666667),
 ('c', 0.062),
 ('g', 0.03933333333333333),
 ('f', 0.026333333333333334),
 ('A', 0.003),
 ('y', 0.008),
 ('m', 0.021),
 ('2', 0.0016666666666666668),
 ('v', 0.005),
 ('j', 0.01633333333333333),
 ('n', 0.017),
 ('I', 0.01),
 ('r', 0.021),
 ('a', 0.026333333333333334),
 ('o', 0.02666666666666667),
 ('J', 0.002),
 ('1', 0.0006666666666666666),
 ('k', 0.006333333333333333),
 ('l', 0.012),
 ('B', 0.003),
 ('w', 0.025),
 ('C', 0.004666666666666667),
 ('M', 0.0016666666666666668),
 ('R', 0.001),
 ('q', 0.002),
 ('S', 0.0026666666666666666),
 ('6', 0.0003333333333333333),
 ('4', 0.0006666666666666666),
 ('K', 0.0006666666666666666),
 ('3', 0.001),
 ('9', 0.0006666666666666666),
 ('u', 0.0013333333333333333),
 ('L', 0.0016666666666666668),
 ('P', 0.0006666666666666666),
 ('Y', 0

So `hell` is most often followed by `e`, `s`, a space, or `y`; `Firs` is deterministic (always followed by `t`); and the word following `rst ` can start with pretty much every letter.
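
The distributions are stored in insertion order, not by probability, so the most likely continuations are easier to read after sorting. A small sketch, assuming the list-of-pairs format returned by `train_char_lm`; the helper name `top_k` is hypothetical, and the numbers are copied from the `lm['hell']` output above.

```python
def top_k(dist, k=3):
    """Return the k most probable (char, prob) pairs (hypothetical helper)."""
    return sorted(dist, key=lambda pair: pair[1], reverse=True)[:k]

# Distribution copied from the lm['hell'] query output.
hell_dist = [
    ('y', 0.16666666666666666),
    ('o', 0.08536585365853659),
    ('e', 0.2845528455284553),
    ('s', 0.22357723577235772),
    (' ', 0.17073170731707318),
    ('.', 0.052845528455284556),
    (',', 0.008130081300813009),
    ('f', 0.0040650406504065045),
    ('i', 0.0040650406504065045),
]

print(top_k(hell_dist))
# [('e', 0.2845528455284553), ('s', 0.22357723577235772), (' ', 0.17073170731707318)]
```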

### Generating from the model
Generating is also very simple. To generate a letter, we will take the history, look at the last `order` characters, and then sample a random letter based on the corresponding distribution.

In [25]:
from random import random

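def generate_letter(lm, history, order):
    history = history[-order:]
    dist = lm[history]
    # Walk the cumulative distribution until the random draw is used up.
    x = random()
    for c, v in dist:
        x = x - v
        if x <= 0:
            return c
    # Floating-point rounding can leave x slightly above 0; fall back to the last character.
    return c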
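
The loop above is a hand-written weighted draw. As an alternative sketch (not the approach used in the original post), the standard library's `random.choices` does the same job. The helper name `sample_letter` and the toy distribution are illustrative assumptions.

```python
import random
from collections import Counter

def sample_letter(dist, rng=random):
    """Draw one character from a list of (char, prob) pairs using random.choices."""
    chars = [c for c, _ in dist]
    weights = [p for _, p in dist]
    return rng.choices(chars, weights=weights, k=1)[0]

toy_dist = [("a", 0.5), ("b", 0.3), ("c", 0.2)]  # made-up distribution for the demo
rng = random.Random(0)                            # seeded so the demo is repeatable
draws = Counter(sample_letter(toy_dist, rng) for _ in range(10000))
freqs = {c: n / 10000 for c, n in draws.items()}
print(freqs)  # empirical frequencies should be close to 0.5, 0.3 and 0.2
```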

To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [26]:
def generate_text(lm, order, nletters=500):
    history = "~" * order
    out = []
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)
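
Because the model is unsmoothed and `lm` is a plain dictionary, looking up a history that was never followed by a character in training (for example, the final `order` characters of the file, if they appear nowhere else) raises a `KeyError`. Below is a minimal sketch of a more defensive loop that stops early instead. It assumes the `lm` format returned by `train_char_lm`; the function name `generate_until_dead_end` and the hand-built toy model are illustrative.

```python
from random import random

def generate_until_dead_end(lm, order, nletters=500):
    """Like generate_text, but stop when the current history is unknown to the model."""
    history = "~" * order
    out = []
    for _ in range(nletters):
        dist = lm.get(history[-order:])
        if dist is None:       # unseen history: nothing to sample from
            break
        x = random()
        c = dist[-1][0]        # fallback in case rounding leaves x above 0
        for cand, p in dist:
            x -= p
            if x <= 0:
                c = cand
                break
        history = history[-order:] + c
        out.append(c)
    return "".join(out)

# Toy order-2 model for the text "ab" (padded to "~~ab"), written out by hand.
toy_lm = {"~~": [("a", 1.0)], "~a": [("b", 1.0)]}
print(generate_until_dead_end(toy_lm, order=2))  # ab
```

The toy model has no entry for the history `ab`, so generation ends after two characters rather than failing.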

### Generated text from different order models

Let's try to generate text based on different language-model orders. Let's start with something silly:

### order 2:

In [27]:
lm = train_char_lm("ROCstories2016.txt", order=2)

In [28]:
print(generate_text(lm, 2))

Daving the caus vina pas the beca. Now toptingre eme. Sooll, alled clet anto at to preamin print. I use ind was a pulas eves in to alwas sady he cryondman sa's ing it intick holdentelf try. He dechilly. Fing the wat himet piz wayed, hand aborrentir was lowennef massy a quid alk went the lito she st a was do beighboucce grout and th any gavelfribre mad nigned a the suld shout a drad atchourprojew out but gothary waskiniverget hiso hare zook out by ass a but bett dause wask a schhin She whe decol.


Not so great... but what if we increase the order to 4?

### order 4

In [31]:
lm = train_char_lm("ROCstories2016.txt", order=4)

In [32]:
print(generate_text(lm, 4))

Dan's who work one wered her job incomfortunately. Christmas the he woman mad. He went a uninja. Kat was dog computer dropped up.

"Amy had evening else.

A littled out with her set. She were play walk and feet. He has bear mouth. He went ther to birthday. He ran into the affortunately was a labcoat a pet his a tracked than had way the stressing

Lisa. I ask. After school and bulldog snowblow on.

"Brent to walked our Years of and have gave heard the trained close heard told because tooth. She d


In [33]:
print(generate_text(lm, 4))

Dan's old his back.

"Tiffany keys acted what in 1987 to the figures, and she was able blue house. But some soon her per boss to review at to he grow animals. After was all disasterday.

"My moving on the mother typing. Not to keep on the slid could the kids wife and really decided up my fire that an and went with made store him to see the the rations. Kelly friend Megan the felt practers severy his deals. She punch. The saw the began to buy his knew friend the were to money. She decided about o


This is already quite reasonable, and reads like English. Just a 4-letter history! What if we increase it to 7?

### order 7

In [34]:
lm = train_char_lm("ROCstories2016.txt", order=7)

In [35]:
print(generate_text(lm, 7))

Dan's phone. The new intern shrieked and kicked the picture wife did not own every day they are going to lunch and it was missing food. Suddenly, a small hole in a wheat flourish. After a year. Faith taking the costume to show. All of a store. Kacie decided to take the city street to rehearsal. She memory."

Milly woke up, she needed a new video games all over Gina cried classified and pressed.

"Leo really bought he was done, I looked for an incredibly landed 4 feet and inviting his house. Sinc


### How about 10?

In [36]:
lm = train_char_lm("ROCstories2016.txt", order=10)

In [37]:
print(generate_text(lm, 10))

Dan's parents called Lena's mom. Lena's mom found a giant, black, beast of a truck to the casino. At the commute. Tim began to break free. Hours later explained that Jesus loved everything went smoothly. Then, she called her bird.

"Nola lost her judgmental again."

"Jeff insulted her. The next day.

"Sam bought a chipmunk, but it was boring. Luckily, it ended up walking my dog. My dog got home and relax. He put out swan status to scare me."

"Stone was always sweet to the garbage men stopped by


In [46]:
# pip install evaluate bert_score

In [47]:
from evaluate import load
bertscore = load("bertscore")


In [48]:
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]
# results = 
bertscore.compute(predictions=predictions, references=references, lang="en")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'precision': [0.9999998211860657, 1.0000003576278687],
 'recall': [0.9999998211860657, 1.0000003576278687],
 'f1': [0.9999998211860657, 1.0000003576278687],
 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.38.2)'}

In [49]:
predictions = [generate_text(lm, 10)]

In [50]:
predictions

['Dan\'s parents took off his shoes. That\'s when Sam decided he was just a little late, so had to go last week. I received the album. Then, Jo, turned blue. There were raccoons flew past Rick\'s face. Rick grabbed him. He asked the pretty girl."\n\n"Bill was extremely tired and fell. Jane ended up getting a flat iron. When I came back on Tom saw that the clients I\'ve dealt with his dad. At first he was getting his new house, he started snowing. In my rush to get out. They told him he needed the car. I']

In [51]:
def read_first_n_lines(file_path, n=10000):
    """
    Reads the first n lines from a file.

    Parameters:
    - file_path: Path to the file.
    - n: Number of lines to read.

    Returns:
    - A list containing the first n lines of the file.
    """
    lines = []
    try:
        with open(file_path, 'r') as file:
            for _ in range(n):
                line = file.readline()
                if not line:
                    break  # Stop if the file ends before reaching n lines
                lines.append(line.strip())  # Use strip() to remove newline characters
    except FileNotFoundError:
        print("File not found.")
        return []

    return lines


In [52]:
# Example usage: read the first 10,000 lines of the ROCStories text file
file_path = './ROCstories2016.txt'
first_10000_lines = read_first_n_lines(file_path)

In [58]:
len(predictions)

1

In [67]:
references = "\n".join(first_10000_lines)

In [69]:
len(referenc

2370445

In [61]:
bertscore.compute(predictions=predictions, references=first_10000_lines, lang="en")

ValueError: Mismatch in the number of predictions (1) and references (10000)
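
The error arises because the metric expects one reference entry per prediction. One possible fix, assuming (as the `evaluate` BERTScore documentation describes) that each reference entry may itself be a list of strings, is to wrap the reference pool so that it lines up with the predictions. The helper name `align_references`, the cap `max_refs`, and the placeholder sentences are illustrative assumptions. The sketch only shapes the inputs; it does not call the metric.

```python
def align_references(predictions, reference_pool, max_refs=100):
    """Give every prediction the same list of references (hypothetical helper).

    max_refs caps the list so scoring stays affordable; 100 is an arbitrary choice.
    """
    refs = list(reference_pool[:max_refs])
    return [refs for _ in predictions]

# Placeholder strings standing in for generated text and ROCStories lines.
predictions = ["Dan went to the store."]
reference_pool = ["Dan bought milk.", "Amy walked her dog.", "Tom lost his keys."]
references = align_references(predictions, reference_pool, max_refs=2)

print(len(predictions), len(references))  # 1 1
print(references[0])                      # ['Dan bought milk.', 'Amy walked her dog.']

# With matching lengths, the call would then look like:
# bertscore.compute(predictions=predictions, references=references, lang="en")
```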