# Character-level Language Models

### [Derived from a blog post by Yoav Goldberg](https://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139)

## Unsmoothed Maximum Likelihood Character Level Language Model 




### Training Code
Here is the code for training the model. `fname` is the file to read characters from. `order` is the history size to consult (i.e., the $n$ in your $n$-gram). Note that we pad the data with leading `~` characters so that the model also learns how to start a text (this plays the role of the `<START>` symbol in most language-model notation).
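As a sketch of what the training code estimates (the notation $h$ for a history of `order` characters and $c$ for the next character is introduced here for illustration), the unsmoothed maximum-likelihood estimate is a relative frequency:

$$P(c \mid h) = \frac{\mathrm{count}(h, c)}{\sum_{c'} \mathrm{count}(h, c')}$$

Because nothing is smoothed, a character never seen after $h$ gets probability zero, and a history that never occurs in the training text has no distribution at all.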


In [1]:
import pandas as pd
import csv
import random
from collections import Counter, defaultdict
import numpy as np

random.seed(630)

In [2]:
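def train_char_lm(fname, order=4):
    # Read the whole training file into one string. readlines() keeps each
    # line's trailing newline, so joining with '\n' doubles every line break.
    with open(fname) as f:
        data = '\n'.join(f.readlines())
    # lm maps each history (the previous `order` characters) to a Counter
    # of the characters that followed it.
    lm = defaultdict(Counter)
    pad = "~" * order  # start-of-text padding
    data = pad + data
    for i in range(len(data) - order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char] += 1

    def normalize(counter):
        # Turn raw counts into (char, probability) pairs.
        s = float(sum(counter.values()))
        return [(c, cnt/s) for c, cnt in counter.items()]

    outlm = {hist: normalize(chars) for hist, chars in lm.items()}
    return outlm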
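
To see what the trained table looks like, here is a minimal sketch that applies the same counting scheme to an in-memory string instead of a file. The helper name `train_char_lm_from_text` and the toy string are assumptions made for illustration; they are not part of the notebook.

```python
from collections import Counter, defaultdict


def train_char_lm_from_text(text, order=1):
    """Hypothetical helper: same counting as train_char_lm, but on a string."""
    counts = defaultdict(Counter)
    data = "~" * order + text  # pad so the model learns how to start
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        counts[history][char] += 1
    # Normalize each Counter into a list of (char, probability) pairs.
    return {
        hist: [(c, n / sum(chars.values())) for c, n in chars.items()]
        for hist, chars in counts.items()
    }


toy_lm = train_char_lm_from_text("abracadabra", order=1)
print(toy_lm["~"])  # distribution over the first character
print(toy_lm["a"])  # 'a' is followed by b twice, c once, d once
```

With `order=1`, the history `"a"` ends up with probability 0.5 for `b` and 0.25 each for `c` and `d`, while the padding history `"~"` puts all of its mass on `a`.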

In [3]:
df = pd.read_csv('../data/ROCStories.csv')

In [4]:
df = df.drop(columns=['storyid', 'storytitle'])

In [5]:
df['text'] = df.apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

In [6]:
df['text'].to_csv('../data/ROCstories.txt', index=False, header=False)

In [7]:
def read_random_sentences(file_path, n=10):
    """
    Reads the fifth sentence from n random lines in a CSV file.

    Parameters:
    - file_path: Path to the CSV file.
    - n: Number of random lines to read.

    Returns:
    - A list containing the fifth sentence from n random lines of the file.
    """
    # Initialize an empty list to hold the fifth sentences
    sentences = []

    try:
        # Open the file and create a csv.DictReader so columns can be accessed by name
        with open(file_path, 'r', newline='', encoding='utf-8') as file:
            reader = csv.DictReader(file)
            rows = list(reader)  # Convert iterator to list to get total count and allow random access
            
            total_lines = len(rows)  # Get the total number of lines in the file
            
            # Ensure we don't try to sample more lines than exist in the file
            n = min(n, total_lines)
            
            # Generate n unique random indices
            random_indices = random.sample(range(total_lines), n)
            
            # Retrieve and store the fifth sentence from each randomly selected line
            for index in random_indices:
                sentences.append(rows[index]['sentence5'])  # Access by column name

    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

    return sentences


In [8]:
# Example usage:
file_path = '../data/ROCStories.csv'  # Replace with your actual CSV file path
random_sentences = read_random_sentences(file_path, n=100)
print(random_sentences)

['Five hours later they went home exhausted and happy.', 'When she gave him the red race car he was so surprised.', 'She would have to wait days to see it!', 'She agreed to pay Angela $5 for the ball.', 'After marrying Julio, Michelle stopped buying breakfast burritos.', 'Thankfully the next day she felt better.', 'His dog chased away at the squirrel afterwards.', 'When they were done, they got dinner.', 'He was so loud that day that nobody enjoyed the conversation.', 'The little girl went to look at it and found a huge crack.', 'She scheduled Tim in for an appointment the next day.', 'They got a loan and used it to go on vacation.', 'Now Bob can move his boxes without hurting his back.', 'Tom made money from the insurance.', 'Luckily I did not get an infection.', 'When they got home he named him pokey.', 'His father caught him and grounded him for two weeks.', 'His card had no balance in it.', 'She started choking on her first drag and had to go home early.', 'Finally the drought emer

Let's train!

### Generating from the model
Generating is also very simple. To generate a letter, we take the history, look at its last `order` characters, and then sample a random letter from the corresponding distribution.

In [9]:
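def generate_letter(lm, history, order):
    history = history[-order:]
    dist = lm[history]
    x = random.random()
    for c, v in dist:
        x = x - v
        if x <= 0:
            return c
    # Floating-point round-off can leave x slightly above 0 after the loop;
    # fall back to the last character instead of returning None.
    return c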
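
The subtraction loop in `generate_letter` is a form of inverse-CDF sampling: draw a uniform number, then walk the distribution until the accumulated probability covers it. The sketch below isolates that step and checks it empirically. The function name `sample_from`, the toy distribution, and the final fallback `return` are assumptions made for this illustration.

```python
import random
from collections import Counter


def sample_from(dist, rng):
    """Sample one character from a list of (char, probability) pairs."""
    x = rng.random()
    for c, p in dist:
        x -= p
        if x <= 0:
            return c
    return dist[-1][0]  # guard against floating-point round-off


rng = random.Random(0)  # local generator so the check is repeatable
dist = [("a", 0.5), ("b", 0.3), ("c", 0.2)]
n_draws = 10_000
draws = Counter(sample_from(dist, rng) for _ in range(n_draws))
freqs = {c: draws[c] / n_draws for c, _ in dist}
print(freqs)  # empirical frequencies should be close to 0.5, 0.3, 0.2
```

The standard library's `random.choices(population, weights)` would do the same job; the explicit loop is kept here because it mirrors the notebook's code.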

To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [10]:
def generate_text(lm, order, nletters=500):
    history = "~" * order
    out = []
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)

### Generated Story Endings from Models of Different Orders

Let's try to generate text based on different language-model orders. Let's start with something silly:

### order 2

In [11]:
lm = train_char_lm("../data/ROCstories.txt", order=2)

In [12]:
print(generate_text(lm, 2))

Dave frien and mad of put embear bus. A Sylay. Miked ing to shecidentiten as wit it of woody pled Kimesid mod trend th he brousto her but feekeeks fed rew, hem ken holdfam. The wrownevere joland. Capkiderly ding to days hirlostalki. Jakin platera carke why trileeks some a bir a histat wase cout put to tor to mot som. To the it. Wen fer foup saide foldn't feaccide on lartairestraperearnettilen and anike paseve was winged toptim ang bird hoods the she mould, en mirlecatch, hapencout ch hile. Sheye


Not so great... but what if we increase the order to 4?

### order 4

In [13]:
lm = train_char_lm("../data/ROCstories.txt", order=4)

In [14]:
print(generate_text(lm, 4))

Dan's carafe put of severy have theating classed to crystale. She adult fit had somework. As the restaurant pay his friend tasterday friend innocent. Hermarket. Joe bothese she left it, he choice.

Jenny to chalk home. When he hole eyes looking about nor over presert.

"Opening his more.

"Billy he she hadn't find a bird's place music football. The was an in the thout his park was to eat on Mturkey trieve went to her parathere the radio writing. So her boss clean are out computer Peter was decid


In [15]:
print(generate_text(lm, 4))

Dan's really she competition the USA. Jack dropping by.

I country. On his construm for the strings. Her movie. Luckily, he slip to be playing not all sold me and Bradfootball the work and she team. When weekend. He was veryday to driving long. We race and And entire began to my fit promised him enought about another all was work, Mark ther to go. The clothink I regreen and she found anyone night. She went less write because this not wait the kitchen has an ex-boyfriends wedding polish section, 


This is already quite reasonable, and reads like English, with just 4 letters of history! What if we increase it to 7?

### order 7

In [16]:
lm = train_char_lm("../data/ROCstories.txt", order=7)

In [17]:
print(generate_text(lm, 7))

Dan's phone number. She thought the lice. They felt so alive with a safe and my mom had a flat iron. Dan found out, her friends how to carry. He loves to build a PC that circulated her the property. If she saw he looked it but couldn't slept in a new hair extend the sun's rays hits in money from everywhere. Joe ended up crashing dishes. He told Jack the movies yet. However, non-everything he wanted to make do with what she was carefully. Franny cut his children.

The forced to 10 when his parent


### How about 10?

In [18]:
lm = train_char_lm("../data/ROCstories.txt", order=10)

In [19]:
print(generate_text(lm, 10))

Dan's parents have offered to share, he would be enough to get her burnt dinner.

The Ravens were tied. They always loved the inexpensive playbook and all the supplies such as comic books just sat and smiled. Mom patter Malia's head and walked to a late movie was ending by the time now.

"Helunko Spelunko was a common occurrence. I was initially they did in the hot water to max. It helped quiet her down constant complain about it."

"The fruit store . He found for days to get there on Black Frid


In [20]:
from evaluate import load
bertscore = load("bertscore")


In [21]:
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]
# results = 
bertscore.compute(predictions=predictions, references=references, lang="en")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'precision': [1.0, 1.0],
 'recall': [1.0, 1.0],
 'f1': [1.0, 1.0],
 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.34.1)'}

In [22]:
predictions = []

for i in range(len(random_sentences)):
    # Reference ending for this example
    current_line = random_sentences[i]

    # Generate as many characters as the reference ending contains
    num_letters = len(current_line)

    # Generate text with the order-10 model trained above
    generated_text = generate_text(lm, 10, num_letters)

    # Append the generated text to the predictions list
    predictions.append(generated_text)


In [23]:
len(predictions)

100

In [24]:
len(random_sentences)

100

In [25]:
predictions

["Dan's parent's room and turned until Monday. Instead",
 'Dan\'s parents bought them.\n\n"Jolie\'s family were search',
 'Dan\'s parents proud.\n\n"Zach was stingi',
 "Dan's parents. His grandfather had troubl",
 'Dan\'s parents enrolled in over the fall!"\n\n"Bob was unhappy, but ',
 "Dan's parents decided to have friends ou",
 "Dan's parents were buggy. Charlie decides to ma",
 "Dan's parents and asked for help. He ",
 "Dan's parents refused to serve on panels she refused. One day",
 "Dan's parents and made a weird noise. It didn't mean she n",
 "Dan's parents took him months to live. She and the ot",
 "Dan's parents brought her one. Finally, Andrew",
 "Dan's parents remained of it was a boy! Ellen and he",
 "Dan's parents eventually I made it",
 "Dan's parents assured him the corre",
 "Dan's parents room. She swung the ball",
 "Dan's parents. His dad and the rides. One day, Charli",
 "Dan's parents were impressed. ",
 "Dan's parents who loved the attic in exchange day. I work in a 

In [26]:
random_sentences

['Five hours later they went home exhausted and happy.',
 'When she gave him the red race car he was so surprised.',
 'She would have to wait days to see it!',
 'She agreed to pay Angela $5 for the ball.',
 'After marrying Julio, Michelle stopped buying breakfast burritos.',
 'Thankfully the next day she felt better.',
 'His dog chased away at the squirrel afterwards.',
 'When they were done, they got dinner.',
 'He was so loud that day that nobody enjoyed the conversation.',
 'The little girl went to look at it and found a huge crack.',
 'She scheduled Tim in for an appointment the next day.',
 'They got a loan and used it to go on vacation.',
 'Now Bob can move his boxes without hurting his back.',
 'Tom made money from the insurance.',
 'Luckily I did not get an infection.',
 'When they got home he named him pokey.',
 'His father caught him and grounded him for two weeks.',
 'His card had no balance in it.',
 'She started choking on her first drag and had to go home early.',
 'Final

In [27]:
results_bert = bertscore.compute(predictions=predictions, references=random_sentences, lang="en")

In [28]:
results_bert

{'precision': [0.8423677086830139,
  0.8250561952590942,
  0.8393429517745972,
  0.8439666628837585,
  0.8515053391456604,
  0.8340190649032593,
  0.8516144752502441,
  0.8563029170036316,
  0.8484340906143188,
  0.8544703722000122,
  0.8436055779457092,
  0.8810375332832336,
  0.8274329900741577,
  0.8586156964302063,
  0.8293236494064331,
  0.8605955243110657,
  0.8734256029129028,
  0.8805202841758728,
  0.8340209722518921,
  0.8274343609809875,
  0.838249683380127,
  0.8437716364860535,
  0.8644999265670776,
  0.8309587836265564,
  0.8531175851821899,
  0.8337967395782471,
  0.8473888039588928,
  0.8097715973854065,
  0.8246848583221436,
  0.8332445025444031,
  0.8250089883804321,
  0.8761205077171326,
  0.831635057926178,
  0.8500640988349915,
  0.8389027118682861,
  0.852202832698822,
  0.8557407855987549,
  0.8298729062080383,
  0.8319246768951416,
  0.8586809635162354,
  0.8540977835655212,
  0.8727129101753235,
  0.8549500107765198,
  0.8360841274261475,
  0.8331066370010376,


In [29]:
np.mean(results_bert['precision']), np.mean(results_bert['recall']), np.mean(results_bert['f1'])

(0.8460567516088485, 0.8577603483200074, 0.8517862033843994)

In [30]:
# Calculates the BLEU score for the character-level language model's predictions
bleu = load("bleu")
results_bleu = bleu.compute(predictions=predictions, references=random_sentences)

In [31]:
results_bleu

{'bleu': 0.0,
 'precisions': [0.1, 0.0, 0.0, 0.0],
 'brevity_penalty': 0.9314618921275921,
 'length_ratio': 0.9337068160597572,
 'translation_length': 1000,
 'reference_length': 1071}
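
The BLEU output above can be read off by hand. The sketch below assumes the standard BLEU definition (brevity penalty times the geometric mean of the n-gram precisions); it only re-derives the reported numbers and does not call the `evaluate` library.

```python
import math

# Values copied from the reported BLEU output.
translation_length = 1000
reference_length = 1071
precisions = [0.1, 0.0, 0.0, 0.0]

# Brevity penalty: 1 if the candidate is longer than the reference,
# otherwise exp(1 - reference_length / translation_length).
if translation_length > reference_length:
    bp = 1.0
else:
    bp = math.exp(1 - reference_length / translation_length)

# Geometric mean of the n-gram precisions; any zero precision makes it zero.
if min(precisions) > 0:
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
else:
    geo_mean = 0.0

bleu_score = bp * geo_mean
print(bp, bleu_score)
```

The brevity penalty matches the reported `0.9314...`, and the overall score is zero because no bigram (or longer n-gram) in the generated text matches its reference.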

In [32]:
# Calculates the METEOR score for the character-level language model's predictions
meteor = load('meteor')
results_met = meteor.compute(predictions=predictions, references=random_sentences)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/divyasanthanam/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/divyasanthanam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/divyasanthanam/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [33]:
results_met

{'meteor': 0.05284814808372861}

In [34]:
# pip install rouge_score

In [35]:
rouge = load('rouge')
results_rouge = rouge.compute(predictions=predictions, references=random_sentences)
# The ROUGE score results for the character-level language model's predictions
results_rouge


{'rouge1': 0.052734654861345466,
 'rouge2': 0.0,
 'rougeL': 0.048020855444919686,
 'rougeLsum': 0.047767597869439904}

In [36]:
# Calculates the perplexity of the character-level language model's predictions, scored by GPT-2
perplexity = load("perplexity", module_type="metric")
results_perplexity = perplexity.compute(predictions=predictions, model_id='gpt2')


In [37]:
results_perplexity['mean_perplexity']

518.2468044662476
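
For context, and assuming the metric follows the usual definition, the perplexity of a text $x_1, \dots, x_T$ under a scoring model $p$ (here GPT-2) is the exponentiated average negative log-likelihood of its tokens:

$$\mathrm{PPL}(x) = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})\right)$$

Lower values mean the scoring model finds the text more predictable. As the key name suggests, `mean_perplexity` is presumably the average of this quantity over the 100 generated strings.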