# Character-level Language Models

### [Derived from a blog post by Yoav Goldberg](https://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139)

## Unsmoothed Maximum Likelihood Character Level Language Model 




### Training Code
Here is the code for training the model. `fname` is the file to read characters from, and `order` is the history size to consult (roughly the $n$ in your $n$-gram; strictly, a history of `order` characters plus the predicted character makes an $(\text{order}+1)$-gram). Note that we pad the data with leading `~` characters so that the model also learns how to start (this is the `<START>` symbol in most language-model notation).


In [92]:
import pandas as pd
import csv
import random
from collections import Counter, defaultdict

In [93]:
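def train_char_lm(fname, order=4):
    # Read the whole corpus. readlines() keeps each line's trailing '\n', so
    # joining with '\n' leaves a blank line between consecutive lines.
    with open(fname) as f:
        data = '\n'.join(f.readlines())
    lm = defaultdict(Counter)
    # Pad the start so the first characters also have a full-length history.
    pad = "~" * order
    data = pad + data
    # Count how often each character follows each `order`-character history.
    for i in range(len(data) - order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char] += 1

    def normalize(counter):
        # Turn raw counts into (character, probability) pairs.
        s = float(sum(counter.values()))
        return [(c, cnt / s) for c, cnt in counter.items()]

    outlm = {hist: normalize(chars) for hist, chars in lm.items()}
    return outlm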
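
To see what the training step produces, here is a minimal sketch that applies the same counting to a short in-memory string. The helper `count_char_ngrams` and the toy text are assumptions made for illustration; unlike `train_char_lm`, the helper takes a string rather than a file name and returns plain dictionaries instead of lists of pairs.

```python
from collections import Counter, defaultdict

def count_char_ngrams(text, order):
    """Hypothetical in-memory variant of train_char_lm (string in, dicts out)."""
    lm = defaultdict(Counter)
    data = "~" * order + text
    # Count each character that follows each `order`-character history.
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        lm[history][char] += 1
    # Normalise the counts of every history into probabilities.
    return {
        hist: {c: cnt / sum(chars.values()) for c, cnt in chars.items()}
        for hist, chars in lm.items()
    }

toy_lm = count_char_ngrams("abracadabra", order=1)
print(toy_lm["~"])  # distribution over the first character
print(toy_lm["a"])  # distribution over what follows an 'a'
```

In the toy text, `a` is followed by `b` twice and by `c` and `d` once each, so the history `a` maps to probabilities 0.5, 0.25 and 0.25. The final `a` has no successor and contributes no count.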

In [94]:
df=pd.read_csv('./ROCStories__spring2016 - ROCStories_spring2016.csv')

In [95]:
df=df.drop(columns=['storyid', 'storytitle'])

In [96]:
df['text'] = df.apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

In [97]:
df['text'].to_csv('./ROCstories2016.txt', index=False, header=False)

In [98]:
def read_random_sentences(file_path, n=10):
    """
    Reads the fifth sentence from n random lines in a CSV file.

    Parameters:
    - file_path: Path to the CSV file.
    - n: Number of random lines to read.

    Returns:
    - A list containing the fifth sentence from n random lines of the file.
    """
    # Initialize an empty list to hold the fifth sentences
    sentences = []

    try:
        # Open the file and create a csv.DictReader so columns can be accessed by name
        with open(file_path, 'r', newline='', encoding='utf-8') as file:
            reader = csv.DictReader(file)
            rows = list(reader)  # Convert iterator to list to get total count and allow random access
            
            total_lines = len(rows)  # Get the total number of lines in the file
            
            # Ensure we don't try to sample more lines than exist in the file
            n = min(n, total_lines)
            
            # Generate n unique random indices
            random_indices = random.sample(range(total_lines), n)
            
            # Retrieve and store the fifth sentence from each randomly selected line
            for index in random_indices:
                sentences.append(rows[index]['sentence5'])  # Access by column name

    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

    return sentences


In [99]:
# Example usage:
file_path = './ROCStories__spring2016 - ROCStories_spring2016.csv'  # Replace with your actual CSV file path
random_sentences = read_random_sentences(file_path, n=100)
print(random_sentences)

['The DJ appreciated the enthusiasm so much he played the song.', 'Disgusted, he got up and quickly brushed his teeth.', 'Julia was both in shock and happy with her new style.', 'Now he is an advocate for all his friends to get their eyes checked.', 'John is hoping this is his big break!', 'Amy left the class without thinking of Ray again that day.', 'Luckily, when the results were posted he saw that he had passed.', "Dad found Jason with a 100' of VHS tape in his lap.", 'Neighbors now help Doris take care of her flower bed.', 'Now I save a lot of money and am insured!', 'Cassie switched bras, and her shirts began to fit her just fine.', 'John surprised Brittany with a carnation and it made her very happy.', 'James now has achieved his life long dream of flying a rocket.', 'Skye and her friends chased him, then resumed trick-or-treating.', 'She vowed never to touch rum again!', 'Allison and Bobo are now best friends!', 'Maya decides that she should never have lived without steak.', 'Lu

Let's train!

### Generating from the model
Generating is also very simple. To generate a letter, we take the history, look at its last `order` characters, and then sample a random letter from the corresponding distribution.

In [100]:
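def generate_letter(lm, history, order):
    # Only the last `order` characters determine the next-character distribution.
    history = history[-order:]
    dist = lm[history]
    # Inverse-CDF sampling: subtract probabilities from x until it drops to 0 or below.
    x = random.random()
    for c, v in dist:
        x = x - v
        if x <= 0:
            return c
    # Floating-point round-off can leave x slightly above 0; fall back to the last character.
    return dist[-1][0]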
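
As a quick check of the sampling step, the sketch below draws many characters from a small hand-written distribution and compares the empirical frequencies with the stated probabilities. The function name `sample_char`, the distribution, the seed and the number of draws are all illustrative assumptions, not part of the notebook.

```python
import random
from collections import Counter

def sample_char(dist, rng):
    """Inverse-CDF sampling over (char, probability) pairs, in the style of generate_letter."""
    x = rng.random()
    for c, p in dist:
        x -= p
        if x <= 0:
            return c
    return dist[-1][0]  # guard against floating-point round-off

dist = [("b", 0.5), ("c", 0.25), ("d", 0.25)]  # illustrative distribution
rng = random.Random(0)                          # fixed seed, arbitrary choice
n_draws = 10_000
draws = Counter(sample_char(dist, rng) for _ in range(n_draws))
freqs = {c: draws[c] / n_draws for c, _ in dist}
print(freqs)
```

With this many draws the frequencies should land close to 0.5, 0.25 and 0.25, which is what we expect if the loop really samples in proportion to the stored probabilities.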

To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [101]:
def generate_text(lm, order, nletters=500):
    history = "~" * order
    out = []
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)
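
One consequence of padding only the very start of the data is worth spelling out: every history that contains a `~` occurs exactly once in training (assuming `~` does not appear in the corpus itself), so the first `order` generated characters always reproduce the first `order` characters of the training text. The sketch below illustrates this on a toy corpus. The helpers `train_on_string` and `generate`, the corpus and the seeds are assumptions for illustration, and `random.choices` stands in for the hand-written sampling loop.

```python
import random
from collections import Counter, defaultdict

def train_on_string(text, order):
    """Hypothetical in-memory stand-in for train_char_lm (keeps raw counts)."""
    counts = defaultdict(Counter)
    data = "~" * order + text
    for i in range(len(data) - order):
        counts[data[i:i + order]][data[i + order]] += 1
    return counts

def generate(counts, order, nletters, rng):
    """Generate nletters characters, starting from the all-padding history."""
    history, out = "~" * order, []
    for _ in range(nletters):
        chars = counts[history[-order:]]
        # Sample the next character in proportion to its count.
        c = rng.choices(list(chars), weights=list(chars.values()))[0]
        history += c
        out.append(c)
    return "".join(out)

corpus = "the cat sat on the mat. the dog sat on the log. " * 3
order = 4
counts = train_on_string(corpus, order)
samples = [generate(counts, order, 12, random.Random(seed)) for seed in range(5)]
for s in samples:
    print(repr(s))
```

Every sample begins with `corpus[:order]`; only after the padding has left the history can the samples diverge.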

### Generated Story Ending from different order models

Let's try to generate text based on different language-model orders. Let's start with something silly:

### order 2

In [102]:
lm = train_char_lm("ROCstories2016.txt", order=2)

In [103]:
print(generate_text(lm, 2))

Dall. She us."

Len and parted it triummy wore."

"Jim thday for frileft the gavoultur cobought didencionted trook taught her. She balked. So wany st the to to buslee ther a steved he losides but and he top a recorricked Cor schounignme vot and as day st as pas firs taked. Her had New sking herah gan hout iffeck ane he riesso everess th iteren like re me wor to beas exambe gir thas ary whising ard nis the hooks frived the piell the doctry was was roun. Hisawand hat hiso antool pes sounds ore his


Not so great... but what if we increase the order to 4?

### order 4

In [104]:
lm = train_char_lm("ROCstories2016.txt", order=4)

In [105]:
print(generate_text(lm, 4))

Dan's doing to he next daughter and adorability of a geckoutside thing! She tide to water a come wires if went back home more all. As that night shoes. On his from the next money to the first weekend. He just ever, Joe.

Early the had beach morning to badly. Thing is a long a has best he what guy fit. It was started nevery looking the heard and Sam at the loved shell overcame asketball since on. I check homewhere.

"Jane on likes mom the road their school swing he was believed to a Fred and hot 


In [106]:
print(generate_text(lm, 4))

Dan's bowling back at Moroccoli airfield an apples nerved the on said she banne was Alas, I play the spillion. She doctor it was sister photos. I rough for died as family always play what his pizza food idea the more to help her eyes white, Jared to a new he chance headphone cookies and looking. She went in class. So her coach the creamed it for excited they seat exactly. We disticing at a free was verywhere, and traves the didn't life and city."

Jake park. They kept the for the taught during w


This is already quite reasonable, and reads like English, with just four characters of history! What if we increase it to 7?

### order 7

In [107]:
lm = train_char_lm("ROCstories2016.txt", order=7)

In [108]:
print(generate_text(lm, 7))

Dan's pair of his life!"

Little Jimmy got to her family. They check.

I just worn out to leave. Four weeks. She saw he had a twin. One day I realized the boy's' parents as well. Unfortunately they greetings scheduled that the best gym around a beating post in time. One day he took us out of the guitar and immediately and decorating was too tall this to hang out. When he won't go anywhere force out that he had a girl saw amazing progress. But soon so Lulu got a new brother zoo and pulled into he


### order 10

How about 10?

In [109]:
lm = train_char_lm("ROCstories2016.txt", order=10)

In [110]:
print(generate_text(lm, 10))

Dan's parent's trust."

"When I was ten my uncle gave me some sea life. She had never been to one before. On his first day of school. She hoped they would. Horace's girlfriend with her sister has formed on her day. The next day at school. The room stayed home to his friend Timothy. James trained the whole day making crafts for Christmas. She went for a walk and saw his friend told them to go to college was looking forward to Saint Croix. For our 25th wedding and watch the premiere of a new movie


In [111]:
from evaluate import load
bertscore = load("bertscore")


In [112]:
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]
# Sanity check: identical predictions and references should score close to 1.0
bertscore.compute(predictions=predictions, references=references, lang="en")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'precision': [0.9999998211860657, 1.0000003576278687],
 'recall': [0.9999998211860657, 1.0000003576278687],
 'f1': [0.9999998211860657, 1.0000003576278687],
 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.38.2)'}

In [113]:
predictions = []

for current_line in random_sentences:
    # Generate as many characters as the reference sentence contains,
    # so each prediction has the same length as its reference.
    num_letters = len(current_line)

    # Generate text with the order-10 character-level model trained above.
    generated_text = generate_text(lm, 10, num_letters)

    predictions.append(generated_text)


In [114]:
len(predictions)

100

In [115]:
len(random_sentences)

100

In [116]:
predictions

["Dan's parents if there were bears in the beach. She went from",
 "Dan's parents told her there was a tornado was comi",
 'Dan\'s parent\'s trust."\n\n"Lou was Italian parents. He ',
 "Dan's parents sell my dirt bike skid and spilled the air-conditionin",
 "Dan's parents for her causes in other",
 "Dan's parent's divorce.\n\nMeg bought a refurbished the nigh",
 'Dan\'s parents went viral.\n\n"Three kids to school by herself."\n\nM',
 "Dan's parents planned a huge pasta party. When they",
 "Dan's parents said no. I decided to go to his neighbo",
 'Dan\'s parents.\n\n"Amy wanted to be able to',
 "Dan's parents to earn money for a vacation to Las Vegas. Charles",
 "Dan's parents house one day I decided I'd get an apartment. The old ",
 'Dan\'s parents were gone soon.\n\n"The teacher was in so much pie',
 "Dan's parents didn't notice that a piece of Basalt he found out ",
 "Dan's parents were close to her bed",
 "Dan's parents brought a puppy. There w",
 "Dan's parents took away the food 

In [117]:
random_sentences

['The DJ appreciated the enthusiasm so much he played the song.',
 'Disgusted, he got up and quickly brushed his teeth.',
 'Julia was both in shock and happy with her new style.',
 'Now he is an advocate for all his friends to get their eyes checked.',
 'John is hoping this is his big break!',
 'Amy left the class without thinking of Ray again that day.',
 'Luckily, when the results were posted he saw that he had passed.',
 "Dad found Jason with a 100' of VHS tape in his lap.",
 'Neighbors now help Doris take care of her flower bed.',
 'Now I save a lot of money and am insured!',
 'Cassie switched bras, and her shirts began to fit her just fine.',
 'John surprised Brittany with a carnation and it made her very happy.',
 'James now has achieved his life long dream of flying a rocket.',
 'Skye and her friends chased him, then resumed trick-or-treating.',
 'She vowed never to touch rum again!',
 'Allison and Bobo are now best friends!',
 'Maya decides that she should never have lived with

In [118]:
results_bert = bertscore.compute(predictions=predictions, references=random_sentences, lang="en")

In [119]:
results_bert

{'precision': [0.8503642678260803,
  0.8422554731369019,
  0.8366479873657227,
  0.8098124861717224,
  0.8302710652351379,
  0.8242311477661133,
  0.8318902850151062,
  0.8638921976089478,
  0.8607701659202576,
  0.8463727831840515,
  0.8503743410110474,
  0.8462571501731873,
  0.8444008231163025,
  0.8268052935600281,
  0.8503530025482178,
  0.8631380200386047,
  0.8471012711524963,
  0.8181440234184265,
  0.841023325920105,
  0.8206250071525574,
  0.8769026398658752,
  0.8395283222198486,
  0.8560234904289246,
  0.8770163059234619,
  0.8569288849830627,
  0.8509084582328796,
  0.8368858695030212,
  0.8438241481781006,
  0.8737807869911194,
  0.853492259979248,
  0.816209077835083,
  0.8475600481033325,
  0.8605552911758423,
  0.8333156704902649,
  0.8295606970787048,
  0.8186219334602356,
  0.8656216859817505,
  0.8192716836929321,
  0.8704106211662292,
  0.8208596110343933,
  0.8325019478797913,
  0.8307416439056396,
  0.8315919041633606,
  0.8455209136009216,
  0.8323401212692261,


In [120]:
# Calculates the BLEU score for the order-10 character-level model
bleu = load("bleu")
results_bleu = bleu.compute(predictions=predictions, references=random_sentences)

In [121]:
results_bleu

{'bleu': 0.0,
 'precisions': [0.09840954274353876, 0.002207505518763797, 0.0, 0.0],
 'brevity_penalty': 0.9281585737605184,
 'length_ratio': 0.9306197964847364,
 'translation_length': 1006,
 'reference_length': 1081}
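
The overall score of 0.0 follows directly from the reported components. The sketch below recomputes it from the numbers printed above, assuming the standard unsmoothed BLEU definition: the brevity penalty multiplied by the geometric mean of the 1- to 4-gram precisions. Nothing is recomputed from the texts themselves.

```python
import math

# Numbers copied from the `results_bleu` output above.
precisions = [0.09840954274353876, 0.002207505518763797, 0.0, 0.0]
translation_length = 1006
reference_length = 1081

# Brevity penalty: 1 if the candidate is longer than the reference, else exp(1 - r/c).
if translation_length > reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)

# Geometric mean of the n-gram precisions; a single zero precision makes it zero.
if min(precisions) > 0:
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
else:
    geo_mean = 0.0

length_ratio = translation_length / reference_length
bleu_score = brevity_penalty * geo_mean
print(brevity_penalty, length_ratio, bleu_score)
```

Because no 3-gram or 4-gram of words in the predictions matches the references, the geometric mean collapses to zero regardless of the roughly 10% unigram precision.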

In [123]:
# Calculates the METEOR score for the order-10 character-level model
meteor = load('meteor')
results_met = meteor.compute(predictions=predictions, references=random_sentences)


In [None]:
results_met

In [126]:
# pip install rouge_score

In [127]:
rouge = load('rouge')
results_rouge = rouge.compute(predictions=predictions, references=random_sentences)
# The ROUGE score results for the order-10 character-level model
results_rouge


{'rouge1': 0.053738827752923046,
 'rouge2': 0.0020855614973262033,
 'rougeL': 0.05001561050650466,
 'rougeLsum': 0.04996879069791861}

In [128]:
# Calculates the perplexity, under GPT-2, of the predictions from the order-10 character-level model
perplexity = load("perplexity", module_type="metric")
results_perplexity = perplexity.compute(predictions=predictions, model_id='gpt2')
# The per-prediction perplexity results
results_perplexity

{'perplexities': [256.490478515625,
  333.8858337402344,
  386.294189453125,
  600.8056640625,
  880.5797729492188,
  243.31643676757812,
  103.27037811279297,
  182.84104919433594,
  146.325927734375,
  51.37561798095703,
  119.28215789794922,
  186.78843688964844,
  150.89675903320312,
  259.5313415527344,
  253.90679931640625,
  648.8135375976562,
  357.4811706542969,
  325.3968811035156,
  297.6617126464844,
  237.60568237304688,
  165.904541015625,
  165.25418090820312,
  291.7914123535156,
  411.1168518066406,
  196.31112670898438,
  129.31173706054688,
  1275.345458984375,
  189.2100830078125,
  167.20565795898438,
  288.1860046386719,
  820.6649780273438,
  918.371337890625,
  1485.72900390625,
  183.1670684814453,
  187.20440673828125,
  257.7385559082031,
  444.8812561035156,
  799.953125,
  276.9836730957031,
  312.4973449707031,
  265.4933166503906,
  686.7530517578125,
  915.882080078125,
  375.3231201171875,
  11462.9482421875,
  378.7496032714844,
  341.736572265625,
  1