# $n$-Gram Shakespeare Generation

## Setup

General imports.

In [1]:
import os
import re

General constants.

In [2]:
CORPUS_FILE = "data/shakespeare_input.txt"

Helper functions.

In [3]:
def print_n_lines(text: str, lines: int = 10):
    print("\n".join(text.split("\n")[:lines]))

## Dataset Preparation

We will be using the [Shakespeare Corpus](https://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt) from [Andrej Karpathy](https://cs.stanford.edu/people/karpathy/)'s [`char-rnn`](https://cs.stanford.edu/people/karpathy/char-rnn/) project. This corpus will be saved in the `data` folder as `shakespeare_input.txt`.

In [4]:
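if not os.path.isfile(CORPUS_FILE):
    import urllib.request

    # Make sure the target folder exists before downloading into it
    os.makedirs(os.path.dirname(CORPUS_FILE), exist_ok=True)
    urllib.request.urlretrieve("https://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt", CORPUS_FILE)
    print("Downloaded corpus")
else:
    print("Corpus already exists")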

Corpus already exists


Let us read the first few lines of the raw corpus.

In [5]:
with open(CORPUS_FILE, "r") as f:
    raw_corpus = f.read()

In [6]:
print_n_lines(raw_corpus)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:


We don't want the names of the speakers, and so we need to remove them.

Notice that each speaker name ends with a colon and is either
- on the first line; or
- preceded by two newlines.

We can thus make a RegEx to remove all speaker names.

In [7]:
def remove_speakers(text: str) -> str:
    # Remove first line
    text = "\n".join(text.split("\n")[1:])

    # Use RegEx to identify subsequent speakers
    text = re.sub(r"\n\n.*:", "", text)

    return text

In [8]:
speakers_removed = remove_speakers(raw_corpus)

In [9]:
print_n_lines(speakers_removed)

Before we proceed any further, hear me speak.
Speak, speak.
You are all resolved rather to die than to famish?
Resolved. resolved.
First, you know Caius Marcius is chief enemy to the people.
We know't, we know't.
Let us kill him, and we'll have corn at our own price.
Is't a verdict?
No more talking on't; let it be done: away, away!
One word, good citizens.
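
To see what the pattern does in isolation, here is a minimal sketch that applies the same function to a tiny excerpt. The `sample` string is typed in by hand for illustration (it is not read from the corpus file), and the function body is repeated only so the snippet runs on its own.

```python
import re


def remove_speakers(text: str) -> str:
    # Drop the first line, which holds the first speaker's name
    text = "\n".join(text.split("\n")[1:])
    # Drop every later "blank line + speaker name + colon"
    return re.sub(r"\n\n.*:", "", text)


# Hand-written excerpt laid out like the corpus (illustrative only)
sample = (
    "First Citizen:\n"
    "Before we proceed any further, hear me speak.\n"
    "\n"
    "All:\n"
    "Speak, speak.\n"
    "\n"
    "First Citizen:\n"
    "You are all resolved rather to die than to famish?\n"
)

cleaned = remove_speakers(sample)
print(cleaned)
```

Because `.` does not match a newline, each match stays on the single line that follows a blank line. The approach therefore assumes that every block of text after a blank line begins with a speaker name.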


Now we can join all the lines of text into a single line.

In [10]:
corpus = " ".join(speakers_removed.split("\n"))

Convert the corpus to lowercase.

In [11]:
corpus = corpus.lower()

In [12]:
print(corpus[:256])

before we proceed any further, hear me speak. speak, speak. you are all resolved rather to die than to famish? resolved. resolved. first, you know caius marcius is chief enemy to the people. we know't, we know't. let us kill him, and we'll have corn at our


Split the corpus into words (and punctuation).

In [13]:
words = re.findall(r"\w+|[^\w\s]+", corpus)

Finally, remove any punctuation by keeping only the purely alphabetic words. These will be our tokens.

In [14]:
tokens = [word for word in words if word.isalpha()]
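
As a small illustration of these two steps, the sketch below runs the same pattern and filter on one lowercased line taken from the corpus output above. The names `sample_line`, `sample_words` and `sample_tokens` are made up for this example so that they do not clash with the notebook's variables.

```python
import re

sample_line = "we know't, we know't."

# Runs of word characters, or runs of non-word, non-space characters
sample_words = re.findall(r"\w+|[^\w\s]+", sample_line)
print(sample_words)
# ['we', 'know', "'", 't', ',', 'we', 'know', "'", 't', '.']

# Keep only the purely alphabetic pieces
sample_tokens = [word for word in sample_words if word.isalpha()]
print(sample_tokens)
# ['we', 'know', 't', 'we', 'know', 't']
```

Note that the apostrophe splits a contraction such as `know't` into two tokens, `know` and `t`.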

## Preprocessing

### Getting Sentences

Define the delimiters for the sentences.

In [15]:
SENTENCE_DELIMITERS = list(".?!")

Get the sentences.

In [16]:
split_pattern = "|".join(map(re.escape, SENTENCE_DELIMITERS))
sentences = re.split(split_pattern, corpus)

Strip any leading or trailing whitespace from the sentences.

In [17]:
sentences = [sentence.strip() for sentence in sentences]

In [18]:
print(sentences[:5])

['before we proceed any further, hear me speak', 'speak, speak', 'you are all resolved rather to die than to famish', 'resolved', 'resolved']
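
The splitting step can be tried on a short string as follows. This is a minimal sketch: `sample` and `parts` are illustrative names, and the text is a shortened version of the corpus opening.

```python
import re

SENTENCE_DELIMITERS = list(".?!")

# Escape each delimiter so that "." and "?" are treated literally
split_pattern = "|".join(map(re.escape, SENTENCE_DELIMITERS))

sample = "speak, speak. you are all resolved? resolved. resolved."
parts = [part.strip() for part in re.split(split_pattern, sample)]
print(parts)
# ['speak, speak', 'you are all resolved', 'resolved', 'resolved', '']
```

The last element is an empty string because the text ends with a delimiter. If empty entries are unwanted, one option is to filter them out with `[part for part in parts if part]`.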


### Building Vocabulary

We will use the set of all tokens as our vocabulary, plus three special tokens:
- `<s>` for the start of the sentence;
- `</s>` for the end of the sentence; and
- `<unk>` for an unknown token (i.e., out of vocabulary (OOV) words).

In [19]:
vocab = set(tokens)

In [20]:
vocab.add("<s>")
vocab.add("</s>")
vocab.add("<unk>")

How large is our vocabulary?

In [21]:
len(vocab)

22549

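Build a RegEx to match tokens. The alternatives are sorted from longest to shortest so that, at each position, a longer token is tried before any shorter token that is a prefix of it.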

In [22]:
vocab_by_length = sorted(vocab, key=len, reverse=True)  # Longest tokens first
vocab_re = re.compile("|".join(map(re.escape, vocab_by_length)))
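
The effect of the ordering can be checked on a toy vocabulary. The sketch below is illustrative only: `toy_vocab` and the two compiled patterns are made-up names, not part of the notebook.

```python
import re

toy_vocab = {"the", "there", "therefore", "<s>"}

shortest_first = sorted(toy_vocab, key=len)
longest_first = sorted(toy_vocab, key=len, reverse=True)

short_re = re.compile("|".join(map(re.escape, shortest_first)))
long_re = re.compile("|".join(map(re.escape, longest_first)))

# With short alternatives first, "the" wins and the rest of the word is lost
print(short_re.findall("therefore the"))
# ['the', 'the']

# With long alternatives first, the whole word is matched
print(long_re.findall("therefore the"))
# ['therefore', 'the']
```

Python's `re` module tries alternatives from left to right and takes the first one that matches, which is why the order of the alternation matters here.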

## Creating the $n$-Gram Model

Define the value of $n$ we will be using for our $n$-Gram model.

In [23]:
N = 4

Get all the $n$-Grams from our corpus.

In [24]:
from tqdm.notebook import tqdm

from ngram import NGramModel

model = NGramModel(N)

for sentence in tqdm(sentences, desc="Processing sentences"):
    ngram = ["<s>"]
    # Use a separate name so the global `tokens` list is not overwritten
    sentence_tokens = vocab_re.findall(sentence)
    for word in sentence_tokens:
        ngram.append(word)
        if len(ngram) == N:
            model.add_ngram(tuple(ngram))
            ngram = ngram[1:]  # Remove first token

    # Handle end of sentence case
    ngram.append("</s>")
    if len(ngram) == N:
        model.add_ngram(tuple(ngram))

Processing sentences:   0%|          | 0/52783 [00:00<?, ?it/s]
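
The sliding-window logic of the loop above can be isolated in a small function. This is a sketch for illustration: `sentence_ngrams` is a hypothetical helper that does not exist in the notebook or in the `ngram` module, and a plain `Counter` stands in for `NGramModel`, whose internals are not shown here.

```python
from collections import Counter


def sentence_ngrams(sentence_tokens, n):
    """Return the n-grams of one sentence, using the same window as the loop above."""
    ngrams = []
    window = ["<s>"]
    for word in sentence_tokens:
        window.append(word)
        if len(window) == n:
            ngrams.append(tuple(window))
            window = window[1:]  # Slide the window by one token
    # Close the sentence; this only yields an n-gram if the window is full
    window.append("</s>")
    if len(window) == n:
        ngrams.append(tuple(window))
    return ngrams


print(sentence_ngrams(["before", "we", "proceed"], 4))
# [('<s>', 'before', 'we', 'proceed'), ('before', 'we', 'proceed', '</s>')]

print(sentence_ngrams(["resolved"], 4))
# []

# Stand-in for the model: count how often each n-gram occurs
counts = Counter()
for toks in (["before", "we", "proceed"], ["resolved"]):
    counts.update(sentence_ngrams(toks, 4))
```

The second call shows a consequence of this windowing: with `n = 4`, a sentence with fewer than two tokens never fills the window, so it contributes no n-gram.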

Save the model.

In [25]:
model.save("models/shakespeare.model")

Load the model.

In [26]:
model_new = NGramModel.load("models/shakespeare.model")

Test model generation.

In [27]:
out = model_new.generate_text(("<s>", "thou"), 50, seed=8192, temperature=1)
print(out)

['great', 'sized', 'monster', 'of', 'you', 'should', 'bear', 'his', 'body', 'couched', 'in', 'thine', 'this', 'love', 'you', 'gainst', 'the', 'state', 'it', 'cannot', 'be', 'sounded', 'but', 'with', 'proviso', 'and', 'exception', 'that', 'we', 'did', 'not', 'gentle', 'eros', 'there', 'is', 'nothing', 'done', 'to', 'morrow', 'morning', 'lords', 'farewell', 'share', 'the', 'glory', 'is', 'to', 'tell', 'his', 'grace']
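
The `temperature` argument is not explained in this notebook, and the implementation of `generate_text` lives in the `ngram` module, which is not shown. The sketch below is therefore only an assumption about how temperature is commonly applied when sampling the next token from n-gram counts: each count is raised to the power `1 / temperature` before sampling. The function `sample_next` and the `followers` counts are hypothetical.

```python
import random
from collections import Counter


def sample_next(counts, temperature=1.0, rng=None):
    """Sample one token from a Counter of follower counts (temperature > 0)."""
    if rng is None:
        rng = random.Random()
    candidates = list(counts)
    # temperature < 1 sharpens the distribution, temperature > 1 flattens it
    weights = [counts[tok] ** (1.0 / temperature) for tok in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Made-up counts of tokens seen after some context (illustrative only)
followers = Counter({"art": 3, "hast": 1})
rng = random.Random(8192)
samples = [sample_next(followers, temperature=1.0, rng=rng) for _ in range(5)]
print(samples)
```

With `temperature=1`, as in the call above, the tokens are drawn in proportion to their raw counts under this scheme.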
