# A Project Gutenberg Poetry Corpus: Quick Experiments

By [Allison Parrish](https://www.decontextualize.com/)

I made [a corpus of around three million lines of poetry from Project Gutenberg](https://github.com/aparrish/gutenberg-poetry-corpus), which anyone can download and use. This notebook shows a couple of quick examples of using the corpus in Python, just to get you started.

First, [download the corpus via this link](http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz), or if you're following along in your own copy of Jupyter Notebook and you have `curl` installed, run the cell below:

In [1]:
!curl -O http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 52.2M  100 52.2M    0     0  2014k      0  0:00:26  0:00:26 --:--:-- 1089k


Three million lines of poetry in just over 52 megabytes! Not bad.

The file is in gzipped [newline delimited JSON format](http://ndjson.org/): there's a JSON object on each line. You don't need to decompress the file to work with it, since Python has a handy library for working with gzipped files right in the code. The following cell will read in the file and create a list `all_lines` that contains all of these JSON objects.

In [1]:
import gzip, json
all_lines = []
for line in gzip.open("gutenberg-poetry-v001.ndjson.gz"):
    all_lines.append(json.loads(line.strip()))

Just to see what those lines look like, let's pick a handful at random:

In [2]:
import random

In [3]:
random.sample(all_lines, 8)

[{'gid': '3305', 's': 'When shall we find the spring come in,'},
 {'gid': '33156', 's': "How great, in the wild whirl of Time's pursuits,"},
 {'gid': '40344', 's': 'She sighs in desert lands:'},
 {'gid': '34870', 's': '"We came within the fosses deep, that moat'},
 {'gid': '37752', 's': 'The dense black-coated throng, and all a-strain'},
 {'gid': '1365', 's': 'One only lives.  Behold them where they lie'},
 {'gid': '32153', 's': 'With the rapturous adoration'},
 {'gid': '38877', 's': 'About them; and the horse of faery screamed'}]

Each object has a key `s` that contains the text of the line of poetry, and a key `gid` that contains the Project Gutenberg ID of the file in question. You can use this ID to look up the title and author of the book of poetry that the line came from (either using the [Project Gutenberg website](https://www.gutenberg.org/) or using pre-built metadata from, e.g., [Gutenberg, dammit](https://github.com/aparrish/gutenberg-dammit/)).

## Concordances and counts

The corpus could be useful for collecting, counting and comparing lines of poetry with certain characteristics. Here's our first experiment: find every line of poetry in the corpus with the word "flower." I do this using a regular expression that finds the string `flower` between two word boundaries, without respect to case:

In [4]:
import re
flower_lines = [line['s'] for line in all_lines if re.search(r'\bflower\b', line['s'], re.I)]

Again, just to see what we have, we'll take a random sample:

In [5]:
random.sample(flower_lines, 8)

['Blooms for you some happy flower.',
 "Low to his heart he said; 'the flower",
 'The blush is on the flower, and the bloom is on the tree,',
 'Woo and win the Sahri-flower,',
 "The very flower of Issland; 'twas a fair yet fearful scene.",
 "There's not a dew drop on the flower,",
 "Of fame, the world's alluring, phantom flower.",
 'Be it not mine to steal the cultured flower']

As a cut-up method poem, that's not bad all on its own! But let's do a little bit of Digital Humanities and make an aligned concordance of these lines, with the lines sorted alphabetically by the word following "flower," using a bit of regular expression trickery:

In [6]:
longest = max([len(x) for x in flower_lines]) # find the length of the longest line
center = longest - len("flower") # and use it to create a "center" offset that will work for all lines

sorted_flower_lines = sorted(
    [line for line in flower_lines if re.search(r"\bflower\b\s\w", line)], # only lines with word following
    key=lambda line: line[re.search(r"\bflower\b\s", line).end():]) # sort on the substring following the match

for line in sorted_flower_lines[350:400]: # change these numbers to see a different slice
    offset = center - re.search(r'\bflower\b', line, re.I).start()
    print((" "*offset)+line) # left-pad the string with spaces to align on "flower"

                                        Or why sae sweet a flower as love
                                                So sweet a flower as she."
                                                         A flower as yet unblossomed. Warmth and light
                                           Is only half in flower as yet. But why--
                                        "To gain so fair a flower as you,
                                               Cast like a flower aside?
                              (Yon scarlet fruit-bell is a flower asleep;)
                                                 As doth a flower at Apollo's touch.
                                             'Twas a pigmy flower at best,
                                               But he, the flower at head and soil at root,
                                               But he, the flower at head and soil at root,
                                        Blooms the perfect flower at last.
                       

As another experiment, let's find all of the words that occur between either "the" or "a" and the word "flower." English being the way it is, these words are pretty much guaranteed to be adjectives, so this is an ersatz but effective way of getting a (non-exhaustive) list of adjectives that are used to describe a flower in the corpus.

In [7]:
found_adj = []
for line in flower_lines:
    matches = re.findall(r"(the|a)\s(\b\w+\b)\s(\bflower\b)", line, re.I)
    for match in matches: 
        found_adj.append(match[1])

Some adjectives at random:

In [8]:
random.sample(found_adj, 12)

['milky',
 'sweetest',
 'wild',
 'fairer',
 'moon',
 'fairest',
 'blue',
 'flaming',
 'splendid',
 'golden',
 'meanest',
 'coveted']

Using the `Counter` object, we can easily count these up and find the twelve most common adjectives (used in the type of noun phrase we've identified) used to describe a flower:

In [9]:
from collections import Counter

In [10]:
Counter(found_adj).most_common(12)

[('little', 26),
 ('white', 23),
 ('sweetest', 22),
 ('wild', 19),
 ('fairest', 15),
 ('tender', 13),
 ('sweet', 11),
 ('purple', 11),
 ('meanest', 11),
 ('lovely', 10),
 ('bonnie', 10),
 ('faded', 9)]

The little white sweetest wild fairest tender sweet purple meanest lovely bonnie faded flower...

## Rhymes

Stretches of language identified as poetry characteristically exhibit some variety of rhyming, and the lines of poetry in the Gutenberg Poetry corpus are no different. Let's set ourselves a task of finding random rhyming lines in the corpus. To do this, we need to know how words are pronounced. The way that words are spelled in English doesn't really tell us anything helpful about how the word is pronounced, so we need some alternate method to get that information. The [CMU Pronouncing Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict) is one such method: it's a big database of phonetic transcriptions for many thousands of English words.

I made a Python library called [pronouncing](https://pypi.org/project/pronouncing/) to make it very easy to work with the CMU Pronouncing Dictionary in Python. You can install it like so:

In [33]:
!pip install pronouncing

[33mYou are using pip version 9.0.3, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


And then import it:

In [11]:
import pronouncing

We'll consider two lines to rhyme with each other if the last words in the lines rhyme. To test this out, we'll pick a source word, say, "flowering," and find all of the words that rhyme with it:

In [12]:
source_word = "flowering"
source_word_rhymes = pronouncing.rhymes(source_word)

In [13]:
source_word_rhymes

['cowering',
 'devouring',
 'empowering',
 'glowering',
 'powering',
 'scouring',
 'showering',
 'souring',
 'towering']

And then look through the lines of poetry in the corpus for lines that end with any of these words:

In [14]:
for line in all_lines:
    text = line['s']
    match = re.search(r'(\b\w+\b)\W*$', text)
    if match:
        last_word = match.group()
        if last_word in source_word_rhymes:
            print(text)

In the Winter you are cowering
"Oh, yes!" exclaimed John, with a towering
In the Winter you are cowering
winged things may never pass, nay, not even the cowering
Ithaca, these are wooing me against my will, and devouring
"Of Coleridge, I can not speak but with reverence. His towering
upbraid him. "Son of Tydeus," he said, "why stand you cowering
the heaviness of his heart, "why are the Achaeans again scouring
Maidens with towering
Are its waters, aye showering
In the Winter you are cowering
In the Winter you are cowering
So hunted, yet defiant, cowering
The moonlit crests of foaming waves gleam towering


Looking through all three million lines of poetry to find rhyming lines one-by-one will be pretty slow. Another approach is to use the `phones_for_word()` and `rhyming_part()` functions in the `pronouncing` library to pre-build a data structure with all of the lines in the corpus grouped with their rhymes. The `phones_for_word()` function gives you the "phones" (sounds) of how a word is pronounced; the `rhyming_part()` function gives you just the portion of a string of phones that another word must share in order for them to be considered "rhyming":

In [15]:
phones = pronouncing.phones_for_word(source_word)[0] # words may have multiple pronunciations, so this returns a list
phones

'F L AW1 ER0 IH0 NG'

In [16]:
pronouncing.rhyming_part(phones)

'AW1 ER0 IH0 NG'

The following cell builds the data structure proposed above: a dictionary that maps rhyming parts to a dictionary that maps words with that rhyming part to the lines of poetry that they're found at the end of.

In [17]:
from collections import defaultdict
by_rhyming_part = defaultdict(lambda: defaultdict(list))
for line in all_lines:
    text = line['s']
    if not(32 < len(text) < 48): # only use lines of uniform lengths
        continue
    match = re.search(r'(\b\w+\b)\W*$', text)
    if match:
        last_word = match.group()
        pronunciations = pronouncing.phones_for_word(last_word)
        if len(pronunciations) > 0:
            rhyming_part = pronouncing.rhyming_part(pronunciations[0])
            # group by rhyming phones (for rhymes) and words (to avoid duplicate words)
            by_rhyming_part[rhyming_part][last_word.lower()].append(text)

A random key/value pair from this dictionary, so you can see its structure:

In [24]:
random_rhyming_part = random.choice(list(by_rhyming_part.keys()))
random_rhyming_part, by_rhyming_part[random_rhyming_part]

('EH1 N S AH0 Z',
 defaultdict(list,
             {'commences': ['Ancient history of Portugal commences',
               'Each day some scene of woe commences'],
              'expenses': ['Will pay for all the school expenses',
               'Will pay for all the school expenses',
               'Which brought great bothers and expenses'],
              'fences': ["We've been climbing trees an' fences",
               'And men too; and why there are fences']}))

Many rhyming parts are found in multiple lines, but only with one unique word. While it's true that identical words "rhyme," it's a little disingenuous to claim that we've made a computer program that finds rhyming lines of poetry if it's mostly just finding lines that end in the same word. So we'll just find the groups from the `by_rhyming_part` dictionary that have at least two different line-ending words:

In [25]:
rhyme_groups = [group for group in by_rhyming_part.values() if len(group) >= 2]

Now, find seven rhyming couplets by selecting a random rhyming group, sampling two keys (words) from that group, and printing a random line from both groups:

In [32]:
for i in range(7):
    group = random.choice(rhyme_groups)
    words = random.sample(list(group.keys()), 2)
    print(random.choice(group[words[0]]))
    print(random.choice(group[words[1]]))

For Brighton's size compared to Nairn
The wind blaws clean about the cairn
Or vermin, or, at best, of cock purloined
There with the Romans in the camp were joined
Nor wine nor wassail could raise a vassal
You saw the day when Henry Schnetzen's castle
The Legislative Bodies to assemble
In vain would formal art dissemble
Venus's Advice to Adonis on Hunting
Growling, as was his wont, and grunting
Of our successors should in part be seated
Of ancient prudent words too much repeated
Reared by a spring to stately height, amidst
For here I read of Eden, and that in the midst


## Markov chain text generation

Markov chain text generation uses statistical information about word co-occurrence to build a model that allows you to generate text that looks similar to your source text. [Markovify](https://github.com/jsvine/markovify) is a great library for Python that makes it easy to build and generate from Markov chain models. Install it like so:

In [83]:
!pip install markovify

[33mYou are using pip version 9.0.3, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


And import it:

In [33]:
import markovify

Our goal is to use a Markov chain to generate new lines of poetry from the Gutenberg Poetry corpus. Markovify requires you to pass in your source text as a string, so first off we'll create a big string with a sample of the corpus, separated by newlines:

In [34]:
big_poem = "\n".join([line['s'] for line in random.sample(all_lines, 250000)])

(You can change the number as needed; I kept it low so that the model will build fast and not consume too much RAM.)

Build the model:

In [35]:
model = markovify.NewlineText(big_poem)

And then generate some lines:

In [36]:
for i in range(14):
    print(model.make_sentence())

Were emerald: snow new-fallen seem'd the white wisteria
And the stars have hid their white faces
Bot wel he sih hire wepe,
Red like a moon-shaft silver and the flow!
In love Heaven gave him last his country house, as if in Nature's scorn,
Of cloud grew violet; how thy fame has felt joy and uproar, can ne'er be effaced--
Thus Ráma spoke: the Vánar found,
I feel him warm, but how it steams in your arms and hands forespent with toil,
Or that starred Ethiop queen that we die in a pleasant dream.
Leave the dead anew.
Through the streets he passed,
Home through the mire;
Since, stranger! thou hast every gentle wight I pray,
Soon made the clouds, as morning walks the sea,


This is okay but the lines don't make a lot of sense, and are sometimes too long. You can constrain the length using Markovify's `.make_short_sentence()` method:

In [49]:
model.make_short_sentence(60)

'The record sound in the wood, or the glory moving on,'

I find that Markov-generated text is best when you keep it short and force juxtapositions—otherwise the reader's attention will wander. The following cell generates a series of short, haiku-esque poems of two to five Markov-generated lines, and ensures that the last line of each poem ends with a period:

In [67]:
for i in range(6):
    print()
    for i in range(random.randrange(1, 5)):
        print(model.make_short_sentence(40))
    # ensure last line has a period at the end, for closure
    print(re.sub(r"(\w)[^\w.]?$", r"\1.", model.make_short_sentence(40)))
    print()
    print("～ ❀ ～")


There and here he died,
Must be the king Theucer.

～ ❀ ～

They seemed the most beautiful;
Better the rule maintain?
Or kings be worn,
From curl-crowned forehead to my good.

～ ❀ ～

And then the words upon our sphere,
And so it runs away.
Four-and-twenty years he spake
They buried him at your length,
I do not go from her flying.

～ ❀ ～

Bot of verray covenant
amiable lady, by whom alone is giv'n.

～ ❀ ～

And there in the little earthen vessels,
And owns no softer charm
Is each to Heaven commends.

～ ❀ ～

I thought it very large.
And wish'd confusion to the lute
And, as he was kind,
As the black stars, merrily.

～ ❀ ～


## Further reading

The [README in the code repository](README.md) has a few more examples of (earlier iterations of) this corpus at work.

If you're just getting started with Python and creative language generation, check out the notes for [Reading and Writing Electronic Text](http://rwet.decontextualize.com/), a class I teach at ITP.