# Exploring the Lovecraft Corpus - Data Processing  
Ryan Folks  
vcz2aj@virginia.edu  

### Objective: The objective of this notebook is to synthesize an F0 dataset out of several sources and transform it into an F3 dataset.
---

In [200]:
import glob
import pandas as pd
import numpy as np
import nltk
from tqdm import tqdm

---

# 0) Decisions in Building this Digital Analytical Edition

Whenever I make a conscious decision in building this digital analytical edition, I want to record it so that it can be reversed or changed, and understood by others who use this edition for academic purposes.  

## Decision Log:  

1. Changed the title of <ins>Celephaïs</ins> to <ins>Celephais</ins> (with one dot over the 'i') for easier data cleaning and indexing.  
2. Changed the title of <ins>Herbert West — Reanimator</ins> to <ins>Herbert West: Reanimator</ins> because of potential confusion between the hyphen (-) and the em dash (—).  
3. Listed the date of <ins>The Crawling Chaos</ins> as 1921, but it is unclear whether it was written in 1920 or 1921.  
4. Fixed/changed the title of <ins>Dreams in the Witch-House</ins> in the lovecraft corpus to <ins>**The** Dreams in the Witch House</ins>.  
5. Listed the date of <ins>Through the Gates of the Silver Key</ins> as 1933, when in reality it was started in 1932.  
6. Fixed the title of <ins>Poetry of the Gods</ins> in the lovecraft corpus to <ins>Poetry **and** the Gods</ins>.  
7. <ins>Christmas</ins>'s date is unknown, but it is listed as 1920, since this was the date it was published in a magazine.  
8. <ins>Nathicana</ins>'s date is unknown, but it was first published in 1927.  
9. <ins>Psychopompos: A Tale in Rhyme</ins> might have been written in 1917, but it is listed as 1918.  
10. <ins>Waste Paper</ins> may have been written in late 1922, but it is listed as 1923.
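
Decisions 1 and 2 in the log above could also be applied programmatically rather than by hand. The following is only a sketch under assumptions: `normalize_title` is a hypothetical helper that does not appear in this notebook, and it assumes that stripping diacritics and replacing the em dash with a colon is acceptable for every title in the corpus.

```python
import re
import unicodedata


def normalize_title(title):
    """Hypothetical helper: lowercase a title, strip diacritics, replace em dashes."""
    # Replace an em dash (U+2014), with any surrounding spaces, by ': '.
    title = re.sub(r'\s*\u2014\s*', ': ', title)
    # NFKD splits an accented letter into base letter + combining mark;
    # dropping the combining marks leaves the plain letter.
    decomposed = unicodedata.normalize('NFKD', title)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and collapse runs of whitespace, as the story loader does.
    return ' '.join(stripped.lower().split())


print(normalize_title('Celephaïs'))
print(normalize_title('Herbert West \u2014 Reanimator'))
```

Keeping the rule in one function would make each of these decisions easy to reverse, which is the stated purpose of the log.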

## Sources:  
The sources of information that I have used.  
1. https://hplovecraft.com/writings/texts/  
2. https://en.wikipedia.org/wiki/H._P._Lovecraft_bibliography  
3. https://github.com/vilmibm/lovecraftcorpus  
4. https://github.com/ontoligent/DS5001-2022-01

---

# 1) Building the Corpus

## Extracting Stories and their Titles

In [121]:
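lovecraft_dict = {}
for path in glob.glob('data/short_stories/*.txt'):
    with open(path, 'r') as f:
        story = f.readlines()
    # The first line of each file is the title; lowercase it and collapse whitespace.
    title = ' '.join(story[0].lower().split())
    # Joining lines with ' ' turns blank lines into '\n \n'; drop the first chunk (the title).
    story = ' '.join(story).split('\n \n')[1:]
    lovecraft_dict[title] = ' '.join(story)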

In [122]:
stories_df = pd.DataFrame([lovecraft_dict.keys(), lovecraft_dict.values()]).transpose()
stories_df.columns = ['title', 'content']
stories_df.set_index('title', inplace=True)

In [123]:
stories_df

title,content
the alchemist,"High up, crowning the grassy summit of a swel..."
facts concerning the late arthur jermyn and his family,"I Life is a hideous thing, and from the back..."
azathoth,"When age fell upon the world, and wonder went..."
the beast in the cave,The horrible conclusion which had been gradua...
beyond the wall of sleep,I have often wondered if the majority of mank...
...,...
the unnamable,"With this friend, Joel Manton, I had often la..."
in the vault,Birch acquired a limitation and changed his b...
what the moon brings,It was in the spectral summer when the moon s...
the whisperer in darkness,I Bear in mind closely that I did not see an...


In [124]:
stories_df['filepath'] = [i for i in glob.glob('data/short_stories/*.txt')]

In [125]:
stories_df

title,content,filepath
the alchemist,"High up, crowning the grassy summit of a swel...",data/short_stories\alchemist.txt
facts concerning the late arthur jermyn and his family,"I Life is a hideous thing, and from the back...",data/short_stories\arthur_jermyn.txt
azathoth,"When age fell upon the world, and wonder went...",data/short_stories\azathoth.txt
the beast in the cave,The horrible conclusion which had been gradua...,data/short_stories\beast.txt
beyond the wall of sleep,I have often wondered if the majority of mank...,data/short_stories\beyond_wall_of_sleep.txt
...,...,...
the unnamable,"With this friend, Joel Manton, I had often la...",data/short_stories\unnamable.txt
in the vault,Birch acquired a limitation and changed his b...,data/short_stories\vault.txt
what the moon brings,It was in the spectral summer when the moon s...,data/short_stories\what_moon_brings.txt
the whisperer in darkness,I Bear in mind closely that I did not see an...,data/short_stories\whisperer.txt


## Getting Dates for the Stories

As far as I can tell, these dates mark when Lovecraft *completed* each story rather than when it was published. Listing them this way matters because, analytically, we wish to see how his writing developed over time.

In [126]:
dates = '''The Beast in the Cave (1905)
    The Alchemist (1908)
    The Tomb (1917)
    Dagon (1917)
    Polaris (1918)
    Beyond the Wall of Sleep (1919)
    Memory (1919)
    Old Bugs (1919)
    The Transition of Juan Romero (1919)
    The White Ship (1919)
    The Doom That Came to Sarnath (1919)
    The Statement of Randolph Carter (1919)
    The Terrible Old Man (1920)
    The Tree (1920)
    The Cats of Ulthar (1920)
    The Temple (1920)
    Facts Concerning the Late Arthur Jermyn and His Family (1920)
    The Street (1920)
    Celephais (1920)
    From Beyond (1920)
    Nyarlathotep (1920)
    The Picture in the House (1920)
    Poetry and the Gods (1920)
    Ex Oblivione (1921)
    The Nameless City (1921)
    The Quest of Iranon (1921)
    The Moon-Bog (1921)
    The Outsider (1921)
    The Other Gods (1921)
    The Music of Erich Zann (1921)
    The Crawling Chaos (1921)
    Herbert West: Reanimator (1922)
    Hypnos (1922)
    What the Moon Brings (1922)
    Azathoth (1922)
    The Hound (1922)
    The Horror at Martin's Beach (1922)
    The Lurking Fear (1922)
    The Rats in the Walls (1923)
    The Unnamable (1923)
    The Festival (1923)
    The Shunned House (1924)
    Imprisoned with the Pharaohs (1924)
    The Horror at Red Hook (1925)
    He (1925)
    In the Vault (1925)
    The Descendant (1926)
    Cool Air (1926)
    The Call of Cthulhu (1926)
    Pickman's Model (1926)
    The Silver Key (1926)
    The Strange High House in the Mist (1926)
    The Dream-Quest of Unknown Kadath (1927)
    The Case of Charles Dexter Ward (1927)
    The Colour Out of Space (1927)
    The Very Old Folk (1927)
    The Thing in the Moonlight (1927)
    The History of the Necronomicon (1927)
    Ibid (1928)
    The Dunwich Horror (1928)
    Medusa's Coil (1930)
    The Whisperer in Darkness (1930)
    At the Mountains of Madness (1931)
    The Shadow Over Innsmouth (1931)
    The Dreams in the Witch House (1932)
    Through the Gates of the Silver Key (1933)
    The Thing on the Doorstep (1933)
    The Evil Clergyman (1933)
    The Book (1933)
    The Shadow out of Time (1934)
    The Haunter of the Dark (1935)'''

In [127]:
dates = dates.split('\n')
dates = [i.lstrip() for i in dates]
dates = [i.lower() for i in dates]
dates = {i.split('(')[0].rstrip():i.split('(')[1].replace(')', '') for i in dates}
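
The cell above splits each line on `(`. As an alternative, here is a minimal sketch that parses the same `Title (Year)` lines with a regular expression and fails loudly on any line that does not match. The names `LINE_RE` and `parse_dated_titles` are hypothetical and not part of the notebook. Unlike the cell above, this version stores the year as an `int` rather than a string.

```python
import re

# Title, optional whitespace, then a four-digit year in parentheses at the end of the line.
LINE_RE = re.compile(r'^\s*(?P<title>.+?)\s*\((?P<year>\d{4})\)\s*$')


def parse_dated_titles(block):
    """Map lowercased titles to integer years, one 'Title (Year)' entry per line."""
    parsed = {}
    for line in block.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f'unparseable line: {line!r}')
        parsed[match['title'].lower()] = int(match['year'])
    return parsed


sample = '''The Beast in the Cave (1905)
    The Alchemist (1908)
    Herbert West: Reanimator (1922)'''
print(parse_dated_titles(sample))
```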

In [128]:
dates = pd.DataFrame(dates, index=[0]).transpose()
dates.columns = ['date']
dates.index.rename('title', inplace=True)

In [129]:
dates

title,date
the beast in the cave,1905
the alchemist,1908
the tomb,1917
dagon,1917
polaris,1918
...,...
the thing on the doorstep,1933
the evil clergyman,1933
the book,1933
the shadow out of time,1934


## Merging Dates and Stories

In [130]:
hp_df = stories_df.merge(dates, how='left', right_index=True, left_index=True)

In [131]:
hp_df['category'] = 'short story'
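
Because the merge above is a left join on the title index, any story whose title is spelled differently in the date list ends up with a missing date and no warning. A small check along the following lines could catch that. It is a sketch only: `unmatched_titles` is a hypothetical helper, and the sample titles illustrate the kind of mismatch recorded in the decision log.

```python
def unmatched_titles(story_titles, dated_titles):
    """Return (titles with text but no date, titles with a date but no text)."""
    stories = set(story_titles)
    dated = set(dated_titles)
    return sorted(stories - dated), sorted(dated - stories)


# Illustrative input: one title matches, one differs in spelling.
no_date, no_text = unmatched_titles(
    ['the alchemist', 'dreams in the witch-house'],
    ['the alchemist', 'the dreams in the witch house'],
)
print('missing a date:', no_date)
print('missing a text:', no_text)
```

With the notebook's own objects this would presumably be called as `unmatched_titles(stories_df.index, dates.index)`. pandas can also report the same information from the merge itself through the `indicator=True` argument of `merge`.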

## Extracting Poems and their Titles

In [132]:
poem_dict = {}
date_dict = {}
for path in glob.glob('data/poetry/*.txt'):
    with open(path, 'r', encoding='utf-8') as f:
        poem = f.readlines()
    # Line 0 holds the title; maxsplit=1 keeps titles that themselves contain ': '.
    title = poem[0].split(': ', 1)[1].replace('\n', '').lower()
    # Line 2 holds the date.
    date_dict[title] = poem[2].split(': ')[1].replace('\n', '')
    # Everything after the '+++' line is the poem text.
    poem_dict[title] = ' '.join(poem).split('+++\n')[1]
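
The loop above relies on the title being on line 0 and the date on line 2 of each poem file. Below is a sketch of a more general parser that reads every `key: value` header line up to the `+++` separator. Note that `split_front_matter` is a hypothetical helper, and the field names `title`, `author`, and `date` in the sample are assumptions, since the actual header keys are not shown in this notebook.

```python
def split_front_matter(text, delimiter='+++'):
    """Split a file into a dict of header fields and the body after the delimiter."""
    header_lines, body_lines = [], []
    in_body = False
    for line in text.splitlines():
        if not in_body and line.strip() == delimiter:
            in_body = True  # everything after the first delimiter is body
            continue
        (body_lines if in_body else header_lines).append(line)
    fields = {}
    for line in header_lines:
        # partition splits on the first ': ' only, so values may contain ': '.
        key, sep, value = line.partition(': ')
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields, '\n'.join(body_lines)


# Hypothetical file content; the key names are assumed for illustration.
sample = ('title: Nathicana\nauthor: H. P. Lovecraft\ndate: 1927\n'
          '+++\nfirst line\nsecond line\n')
fields, body = split_front_matter(sample)
print(fields['title'], fields['date'])
```

Looking fields up by name rather than by line position would keep the loader working if a header line were added or reordered.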

In [133]:
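# poems_df is only built in In [135], which recreates it without this column,
# so this assignment does not carry through (poem filepaths are empty in the later output).
poems_df['filepath'] = [i for i in glob.glob('data/poetry/*.txt')]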

In [134]:
dates_df = pd.DataFrame(date_dict, index=[0]).transpose()
dates_df.columns = ['date']

In [135]:
poems_df = pd.DataFrame(poem_dict, index=[0]).transpose()
poems_df.columns = ['content']
poems_df['category'] = 'poem'

In [136]:
print(len(poems_df), len(dates_df), len(poems_df.merge(dates_df, right_index=True, left_index=True)))
poems_df = poems_df.merge(dates_df, right_index=True, left_index=True)
poems_df.index.rename('title', inplace=True)

42 42 42


In [137]:
poems_df

title,content,category,date
a garden,"There’s an ancient, ancient garden that I see...",poem,1917
an american to mother england,England! My England! Can the surging sea\n Th...,poem,1916
arcadia,By Head Balledup\n \n O give me the life of t...,poem,1935
astrophobos,In the midnight heavens burning\n \tThro’ eth...,poem,1917
christmas,"The cottage hearth beams warm and bright,\n \...",poem,1920
dead passion's flame,"A Poem by Blank Frailty\n \n Ah, Passion, lik...",poem,1935
despair,"O’er the midnight moorlands crying,\n Thro’ t...",poem,1919
fact and fancy,"How dull the wretch, whose philosophic mind\n...",poem,1917
festival,"\tThere is snow on the ground,\n \t\tAnd the ...",poem,1925
fungi from yuggoth,I. The Book\n \n The place was dark and dusty...,poem,1930


In [138]:
hp_df = pd.concat([hp_df, poems_df])  # DataFrame.append was removed in pandas 2.0

I am going to save one file without the raw text content as the LIB file, and another that drops the category and date columns as a RAW file. This way I can conform to the F2 style and also retain the raw text as a csv.

In [139]:
LIB = hp_df.copy()
LIB.drop('content', axis=1, inplace=True)
LIB.to_csv('data/output/lovecraft_LIB.csv')

In [140]:
RAW = hp_df.copy()
RAW.drop(['category', 'date'], axis=1, inplace=True)
RAW.to_csv('data/output/lovecraft_RAW.csv')

---

# 2) From F0 to F2

In [141]:
tokenizable_df = hp_df.copy()

In [142]:
def tokenize(dataframe):
    pass


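def preprocess(dataframe):
    '''Preprocesses the text for tokenization'''
    temp = dataframe.copy()
    for i in ['\n', '\t']:
        # apply() returns a new Series, so assign it back; a space keeps adjacent words apart.
        temp['content'] = temp['content'].apply(lambda x: x.replace(i, ' '))
    return temp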



In [143]:
#tokenizable_df.content = tokenizable_df.content.str.lower()
tokenizable_df = tokenizable_df.replace(['\n', '\t'], ' ', regex=True)

In [144]:
#tokenizable_df['content'] = tokenizable_df['content'].str.replace('[^\w\s]', '', regex=True)

In [145]:
tokenizable_df

title,content,filepath,date,category
the alchemist,"High up, crowning the grassy summit of a swel...",data/short_stories\alchemist.txt,1908,short story
facts concerning the late arthur jermyn and his family,"I Life is a hideous thing, and from the back...",data/short_stories\arthur_jermyn.txt,1920,short story
azathoth,"When age fell upon the world, and wonder went...",data/short_stories\azathoth.txt,1922,short story
the beast in the cave,The horrible conclusion which had been gradua...,data/short_stories\beast.txt,1905,short story
beyond the wall of sleep,I have often wondered if the majority of mank...,data/short_stories\beyond_wall_of_sleep.txt,1919,short story
...,...,...,...,...
the wood,"They cut it down, and where the pitch-black a...",,1929,poem
"to clark ashton smith, esq., upon his phantastick tales, verses, pictures, and sculptures",A time-black tower against dim banks of cloud...,,1936,poem
"to edward john moreton drax plunkett, eighteenth baron dunsany",As when the sun above a dusky wold Springs i...,,1919,poem
"unda; or, the bride of the sea",Respectfully Dedicated with Permission to MAU...,,1915,poem


In [146]:
tokenizable_df['content_lower'] = tokenizable_df['content'].str.lower()

In [109]:
TOKENS = pd.DataFrame(columns=['title', 'sent_num', 'token_num', 'tokens_str', 'term_str'])

In [110]:
TOKENS

Unnamed: 0,title,tokens_str,term_str
0,the alchemist,THE,the
1,the alchemist,ALCHEMIST,alchemist
2,the alchemist,High,high
3,the alchemist,"up,","up,"
4,the alchemist,crowning,crowning
...,...,...,...
939,waste paper a poem of profound insignificance,Nobody,nobody
940,waste paper a poem of profound insignificance,home,home
941,waste paper a poem of profound insignificance,In,in
942,waste paper a poem of profound insignificance,the,the


In [203]:
cols = ['title', 'para_num', 'sent_num', 'token_num', 'pos_tuple', 'pos', 'token_str', 'term_str']
CORPUS = pd.DataFrame(columns=cols)

for work in tqdm(range(len(tokenizable_df))):
    work_df = pd.DataFrame(columns=cols)
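    # In [143] already replaced every '\n' with a space, so this split finds no
    # paragraph breaks and para_num stays 0 for every work (see the output below).
    paras = tokenizable_df.iloc[work].content.split('\n \n')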
    title = str(tokenizable_df.index[work])
    for para in range(len(paras)):
        sentence = nltk.sent_tokenize(paras[para])
        for sent in range(len(sentence)):
            tokens = nltk.word_tokenize(sentence[sent])
            tag = nltk.pos_tag(tokens)
            new_row = {}
            for token in range(len(tag)):
                new_row['title'] = title
                new_row['para_num'] = para
                new_row['sent_num'] = sent
                new_row['token_num'] = token
                new_row['pos_tuple'] = str(tag[token])
                new_row['pos'] = tag[token][1]
                new_row['token_str'] = tag[token][0]
                new_row['term_str'] = new_row['token_str'].lower()
                temp = pd.DataFrame(new_row, columns=cols, index=[0])
                work_df = pd.concat([work_df, temp])
    CORPUS = pd.concat([CORPUS, work_df])

100%|██████████| 109/109 [48:02<00:00, 26.45s/it]  
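
The 48-minute run is probably dominated by calling `pd.concat` once per token, since each call copies the growing frame. A common alternative is to collect plain dictionaries in a list and build the DataFrame once at the end. The sketch below shows only the row-collection step, using the standard library. `collect_token_rows` and the three `demo_*` functions are hypothetical stand-ins so the example runs without NLTK data; in the notebook one would pass `nltk.sent_tokenize`, `nltk.word_tokenize`, and `nltk.pos_tag` instead.

```python
import re


def collect_token_rows(title, text, sent_tokenize, word_tokenize, pos_tag):
    """Return one dict per token, with the same columns as the CORPUS table."""
    rows = []
    for para_num, para in enumerate(text.split('\n \n')):
        for sent_num, sent in enumerate(sent_tokenize(para)):
            tagged = pos_tag(word_tokenize(sent))
            for token_num, (token, pos) in enumerate(tagged):
                rows.append({
                    'title': title,
                    'para_num': para_num,
                    'sent_num': sent_num,
                    'token_num': token_num,
                    'pos_tuple': str((token, pos)),
                    'pos': pos,
                    'token_str': token,
                    'term_str': token.lower(),
                })
    return rows


# Simplified stand-ins for the NLTK functions (assumptions, for demonstration only).
def demo_sent_tokenize(text):
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def demo_word_tokenize(sentence):
    return re.findall(r'\w+|[^\w\s]', sentence)


def demo_pos_tag(tokens):
    return [(tok, 'UNK') for tok in tokens]


rows = collect_token_rows('demo', 'High up. It stood.',
                          demo_sent_tokenize, demo_word_tokenize, demo_pos_tag)
print(len(rows), rows[0]['term_str'])
```

Extending one list with the rows of every work and then calling `pd.DataFrame(all_rows, columns=cols)` once would produce the table in a single step. The resulting index would run from 0 upward instead of repeating 0 as in the output below.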


In [204]:
CORPUS

Unnamed: 0,title,para_num,sent_num,token_num,pos_tuple,pos,token_str,term_str
0,the alchemist,0,0,0,"('High', 'NNP')",NNP,High,high
0,the alchemist,0,0,1,"('up', 'RB')",RB,up,up
0,the alchemist,0,0,2,"(',', ',')",",",",",","
0,the alchemist,0,0,3,"('crowning', 'VBG')",VBG,crowning,crowning
0,the alchemist,0,0,4,"('the', 'DT')",DT,the,the
...,...,...,...,...,...,...,...,...
0,waste paper a poem of profound insignificance,0,59,1,"('home', 'NN')",NN,home,home
0,waste paper a poem of profound insignificance,0,59,2,"('In', 'IN')",IN,In,in
0,waste paper a poem of profound insignificance,0,59,3,"('the', 'DT')",DT,the,the
0,waste paper a poem of profound insignificance,0,59,4,"('shantih', 'NN')",NN,shantih,shantih


In [206]:
CORPUS.to_csv('data/output/lovecraft_CORPUS.csv')

## Vocab Table

In [90]:
# This code comes from Dr Alvarado's ETA github repo.
def extract_vocab(TOKENS):
    """This should also be done at the corpus level."""
    VOCAB = TOKENS.term_str.value_counts().to_frame('n')
    VOCAB.index.name = 'term_str'
    VOCAB['n_chars'] = VOCAB.index.str.len()
    VOCAB['p'] = VOCAB['n'] / VOCAB['n'].sum()  # relative frequency
    VOCAB['s'] = 1 / VOCAB['p']
    VOCAB['i'] = np.log2(VOCAB['s'])  # self-information (surprisal) in bits: -log2(p)
    VOCAB['h'] = VOCAB['p'] * VOCAB['i']  # each term's contribution to the entropy
    H = VOCAB['h'].sum()  # corpus entropy in bits; computed here but not returned
    return VOCAB
VOCAB = extract_vocab(TOKENS)
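
To make the columns of the vocabulary table concrete, here is a small worked example that computes the same quantities (`n`, `n_chars`, `p`, `i`, `h`, and the total entropy `H`) for a four-token toy input using only the standard library. `vocab_stats` is a hypothetical name used for illustration; it is not part of the notebook or of the ETA repository.

```python
import math
from collections import Counter


def vocab_stats(terms):
    """Return per-term statistics and the total entropy in bits."""
    counts = Counter(terms)
    total = sum(counts.values())
    stats = {}
    for term, n in counts.items():
        p = n / total            # relative frequency
        i = math.log2(1 / p)     # self-information in bits
        stats[term] = {'n': n, 'n_chars': len(term), 'p': p, 'i': i, 'h': p * i}
    entropy = sum(row['h'] for row in stats.values())
    return stats, entropy


stats, H = vocab_stats(['the', 'the', 'old', 'ones'])
print(stats['the'])
print(H)
```

Here `the` accounts for half of the tokens, so it carries 1 bit of information, while `old` and `ones` each carry 2 bits. Rare terms are more informative, and the frequency-weighted sum of these values is the entropy of the distribution.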

In [91]:
TOKENS

Unnamed: 0,title,tokens_str,term_str
0,the alchemist,THE,the
1,the alchemist,ALCHEMIST,alchemist
2,the alchemist,High,high
3,the alchemist,up,up
4,the alchemist,crowning,crowning
...,...,...,...
918,waste paper a poem of profound insignificance,Nobody,nobody
919,waste paper a poem of profound insignificance,home,home
920,waste paper a poem of profound insignificance,In,in
921,waste paper a poem of profound insignificance,the,the


In [92]:
TOKENS.to_csv('data/output/lovecraft_TOKENS.csv')
VOCAB.to_csv('data/output/lovecraft_VOCAB.csv')

---

Tracking pseudonyms

In [None]:
{'arcadia': 'Head Balledup',
 "dead passion's flame": 'Blank Frailty'}