# Exploring the Lovecraft Corpus - Data Processing  
Ryan Folks  
vcz2aj@virginia.edu  

### Objective: The objective of this notebook is to synthesize an F0 dataset out of several sources and transform it into an F4 dataset.
---

In [2]:
import glob
import pandas as pd
import numpy as np
import nltk
from tqdm import tqdm

---

# 0) Decisions in Building this Digital Analytical Edition

Any time I make a conscious decision in building this digital analytical edition, I want to record it so that it can be reversed or changed, and understood by others who use this edition for academic purposes.  

## Decision Log:  

1. Changed the title of <ins>Celephaïs</ins> to <ins>Celephais</ins> (a plain 'i' in place of 'ï') for easier data cleaning and indexing.  
2. Changed the title of <ins>Herbert West — Reanimator</ins> to <ins>Herbert West: Reanimator</ins> because of potential confusion between the hyphen (-) and the em dash (—).  
3. Listed the date of <ins>The Crawling Chaos</ins> as 1921, but it is unclear whether it was written in 1920 or 1921.  
4. Fixed/changed the title of <ins>Dreams in the Witch-House</ins> in the lovecraft corpus to <ins>**The** Dreams in the Witch House</ins>.  
5. Listed the date of <ins>Through the Gates of the Silver Key</ins> as 1933, when in reality it was started in 1932.  
6. Fixed the title of <ins>Poetry of the Gods</ins> in the lovecraft corpus to <ins>Poetry **and** the Gods</ins>.  
7. <ins>Christmas</ins>'s date is unknown, but it is listed as 1920, since this was the date it was published in a magazine.  
8. <ins>Nathicana</ins>'s date is unknown, but was first published in 1927.  
9. <ins>Psychopompos: A Tale in Rhyme</ins> might have been written in 1917, but it is listed as 1918.  
10. <ins>Waste Paper</ins> may have been written in late 1922, but it is listed as 1923.

## Sources:  
The sources of information that I have used.  
1. https://hplovecraft.com/writings/texts/  
2. https://en.wikipedia.org/wiki/H._P._Lovecraft_bibliography  
3. https://github.com/vilmibm/lovecraftcorpus  
4. https://github.com/ontoligent/DS5001-2022-01

---

# 1) Building the Corpus

## Extracting Stories and their Titles

In [3]:
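lovecraft_dict = {}
for path in glob.glob('data/short_stories/*.txt'):
    # Read the file with a context manager so the handle is closed.
    with open(path, 'r') as f:
        story = f.readlines()
    # The first line holds the title; normalize whitespace and case.
    title = ' '.join(story[0].lower().split())
    # readlines() keeps each '\n', so joining with ' ' turns a blank line into '\n \n'.
    story = ' '.join(story).split('\n \n')[1:]  # drop the title block
    lovecraft_dict[title] = ' '.join(story)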

In [4]:
stories_df = pd.DataFrame([lovecraft_dict.keys(), lovecraft_dict.values()]).transpose()
stories_df.columns = ['title', 'content']
stories_df.set_index('title', inplace=True)

In [5]:
stories_df

title,content
the alchemist,"High up, crowning the grassy summit of a swel..."
facts concerning the late arthur jermyn and his family,"I Life is a hideous thing, and from the back..."
azathoth,"When age fell upon the world, and wonder went..."
the beast in the cave,The horrible conclusion which had been gradua...
beyond the wall of sleep,I have often wondered if the majority of mank...
...,...
the unnamable,"With this friend, Joel Manton, I had often la..."
in the vault,Birch acquired a limitation and changed his b...
what the moon brings,It was in the spectral summer when the moon s...
the whisperer in darkness,I Bear in mind closely that I did not see an...


In [6]:
stories_df['filepath'] = [i for i in glob.glob('data/short_stories/*.txt')]

In [7]:
stories_df

title,content,filepath
the alchemist,"High up, crowning the grassy summit of a swel...",data/short_stories\alchemist.txt
facts concerning the late arthur jermyn and his family,"I Life is a hideous thing, and from the back...",data/short_stories\arthur_jermyn.txt
azathoth,"When age fell upon the world, and wonder went...",data/short_stories\azathoth.txt
the beast in the cave,The horrible conclusion which had been gradua...,data/short_stories\beast.txt
beyond the wall of sleep,I have often wondered if the majority of mank...,data/short_stories\beyond_wall_of_sleep.txt
...,...,...
the unnamable,"With this friend, Joel Manton, I had often la...",data/short_stories\unnamable.txt
in the vault,Birch acquired a limitation and changed his b...,data/short_stories\vault.txt
what the moon brings,It was in the spectral summer when the moon s...,data/short_stories\what_moon_brings.txt
the whisperer in darkness,I Bear in mind closely that I did not see an...,data/short_stories\whisperer.txt
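
The `filepath` column above is filled from a second `glob` call, so it only lines up with the rows if both calls return files in the same order and no two files share a title. A minimal sketch of one way to avoid relying on that, by keeping title, content and path together in a single pass. The function name `read_stories`, the explicit `utf-8` encoding and the demo files are assumptions for illustration, not part of the notebook.

```python
import glob
import os
import tempfile


def read_stories(pattern):
    """Return one record per file, keeping title, content and path together."""
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, 'r', encoding='utf-8') as f:  # encoding is an assumption
            lines = f.readlines()
        title = ' '.join(lines[0].lower().split())
        # Same convention as the extraction cell: a blank line shows up as
        # '\n \n' once the lines are joined with spaces; block 0 is the title.
        body = ' '.join(' '.join(lines).split('\n \n')[1:])
        records.append({'title': title, 'content': body, 'filepath': path})
    return records


# Tiny demonstration on two throwaway files (hypothetical content).
demo_dir = tempfile.mkdtemp()
for name, text in [('b.txt', 'Second Tale\n\nBody two.\n'),
                   ('a.txt', 'First  Tale\n\nBody one.\n')]:
    with open(os.path.join(demo_dir, name), 'w', encoding='utf-8') as f:
        f.write(text)

records = read_stories(os.path.join(demo_dir, '*.txt'))
```

With records in this shape, `pd.DataFrame(records).set_index('title')` would give a frame with `content` and `filepath` already aligned.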


## Getting Dates for the Stories

As far as I can tell, these dates mark when Lovecraft *completed* each story, as opposed to when it was published. Listing them this way matters more analytically, since we wish to see how his writing developed over time.

In [8]:
dates = '''The Beast in the Cave (1905)
    The Alchemist (1908)
    The Tomb (1917)
    Dagon (1917)
    Polaris (1918)
    Beyond the Wall of Sleep (1919)
    Memory (1919)
    Old Bugs (1919)
    The Transition of Juan Romero (1919)
    The White Ship (1919)
    The Doom That Came to Sarnath (1919)
    The Statement of Randolph Carter (1919)
    The Terrible Old Man (1920)
    The Tree (1920)
    The Cats of Ulthar (1920)
    The Temple (1920)
    Facts Concerning the Late Arthur Jermyn and His Family (1920)
    The Street (1920)
    Celephais (1920)
    From Beyond (1920)
    Nyarlathotep (1920)
    The Picture in the House (1920)
    Poetry and the Gods (1920)
    Ex Oblivione (1921)
    The Nameless City (1921)
    The Quest of Iranon (1921)
    The Moon-Bog (1921)
    The Outsider (1921)
    The Other Gods (1921)
    The Music of Erich Zann (1921)
    The Crawling Chaos (1921)
    Herbert West: Reanimator (1922)
    Hypnos (1922)
    What the Moon Brings (1922)
    Azathoth (1922)
    The Hound (1922)
    The Horror at Martin's Beach (1922)
    The Lurking Fear (1922)
    The Rats in the Walls (1923)
    The Unnamable (1923)
    The Festival (1923)
    The Shunned House (1924)
    Imprisoned with the Pharaohs (1924)
    The Horror at Red Hook (1925)
    He (1925)
    In the Vault (1925)
    The Descendant (1926)
    Cool Air (1926)
    The Call of Cthulhu (1926)
    Pickman's Model (1926)
    The Silver Key (1926)
    The Strange High House in the Mist (1926)
    The Dream-Quest of Unknown Kadath (1927)
    The Case of Charles Dexter Ward (1927)
    The Colour Out of Space (1927)
    The Very Old Folk (1927)
    The Thing in the Moonlight (1927)
    The History of the Necronomicon (1927)
    Ibid (1928)
    The Dunwich Horror (1928)
    Medusa's Coil (1930)
    The Whisperer in Darkness (1930)
    At the Mountains of Madness (1931)
    The Shadow Over Innsmouth (1931)
    The Dreams in the Witch House (1932)
    Through the Gates of the Silver Key (1933)
    The Thing on the Doorstep (1933)
    The Evil Clergyman (1933)
    The Book (1933)
    The Shadow out of Time (1934)
    The Haunter of the Dark (1935)'''

In [9]:
dates = dates.split('\n')
dates = [i.lstrip() for i in dates]
dates = [i.lower() for i in dates]
dates = {i.split('(')[0].rstrip():i.split('(')[1].replace(')', '') for i in dates}
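
The string splitting above assumes that every line contains exactly one `(` and that nothing follows the year. As a hedged alternative, the same "Title (YYYY)" parsing can be sketched with a regular expression that fails loudly on a malformed line. The names `LINE_RE` and `parse_dates` are hypothetical, and unlike the cell above this sketch stores the year as an integer rather than a string.

```python
import re

# "Title (YYYY)" with the year anchored at the end of the line.
LINE_RE = re.compile(r'^\s*(?P<title>.+?)\s*\((?P<year>\d{4})\)\s*$')


def parse_dates(block):
    """Map lower-cased titles to integer years; raise on lines that do not match."""
    parsed = {}
    for line in block.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f'unparseable line: {line!r}')
        parsed[match.group('title').lower()] = int(match.group('year'))
    return parsed


sample = '''The Beast in the Cave (1905)
    The Alchemist (1908)
    Herbert West: Reanimator (1922)'''
parsed = parse_dates(sample)
```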

In [10]:
dates = pd.DataFrame(dates, index=[0]).transpose()
dates.columns = ['date']
dates.index.rename('title', inplace=True)

In [11]:
dates

title,date
the beast in the cave,1905
the alchemist,1908
the tomb,1917
dagon,1917
polaris,1918
...,...
the thing on the doorstep,1933
the evil clergyman,1933
the book,1933
the shadow out of time,1934


## Merging Dates and Stories

In [12]:
hp_df = stories_df.merge(dates, how='left', right_index=True, left_index=True)
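
Because this is a left merge on title strings, any story whose title does not match an entry in `dates` exactly would silently receive a missing date. A minimal sketch of how that could be checked with the `indicator` option of `DataFrame.merge`; the two small frames are hypothetical stand-ins for `stories_df` and `dates`.

```python
import pandas as pd

# Hypothetical stand-ins for stories_df and dates.
stories = pd.DataFrame(
    {'content': ['...', '...', '...']},
    index=pd.Index(['dagon', 'the tomb', 'a title typo'], name='title'),
)
dates = pd.DataFrame(
    {'date': ['1917', '1917']},
    index=pd.Index(['dagon', 'the tomb'], name='title'),
)

# indicator=True adds a '_merge' column saying where each row came from.
checked = stories.merge(dates, how='left', left_index=True,
                        right_index=True, indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only'].index.tolist()
```

An empty `unmatched` list would suggest that every story found its date.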

In [13]:
hp_df['category'] = 'short story'

## Extracting Poems and their Titles

In [14]:
poem_dict = {}
for i in glob.glob('data/poetry/*.txt'):
    poem = open(i, 'r', encoding='utf-8').readlines()
    title = poem[0].split(': ')[1:][0].replace('\n', '')
    poem = ' '.join(poem).split('+++\n')[1]
    poem_dict[title.lower()] = poem

date_dict = {}
for i in glob.glob('data/poetry/*.txt'):
    poem = open(i, 'r', encoding='utf-8').readlines()
    title = poem[0].split(': ')[1:][0].replace('\n', '')
    date = poem[2].split(': ')[1]
    date_dict[title.lower()] = date.replace('\n', '')
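
The two loops above read every poem file twice. A hedged sketch of a single-pass parser that returns title, date and body together, under the same layout the cell relies on (title on line 0 and date on line 2, each after `': '`, with the body after a `+++` line). The function name `parse_poem` and the header keys in the sample are assumptions. Note one difference: splitting with `maxsplit=1` keeps a title that itself contains `': '`, whereas `split(': ')[1:][0]` would cut it at the second colon.

```python
def parse_poem(lines):
    """Split one poem file (as a list of lines) into title, date and body."""
    # Line 0 carries the title and line 2 the date, each after ': '.
    title = lines[0].split(': ', 1)[1].strip().lower()
    date = lines[2].split(': ', 1)[1].strip()
    # The body is everything after the '+++' separator line.
    body = ' '.join(lines).split('+++\n', 1)[1]
    return title, date, body


# Hypothetical file content; the header keys here are assumptions.
sample = [
    'title: Psychopompos: A Tale in Rhyme\n',
    'author: H. P. Lovecraft\n',
    'date: 1918\n',
    '+++\n',
    'First line of the poem\n',
]
title, date, body = parse_poem(sample)
```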

In [15]:
poems_df = pd.DataFrame(poem_dict, index=[0]).transpose()
poems_df.columns = ['content']
poems_df['category'] = 'poem'

In [16]:
poems_df['filepath'] = [i for i in glob.glob('data/poetry/*.txt')]

In [17]:
dates_df = pd.DataFrame(date_dict, index=[0]).transpose()
dates_df.columns = ['date']

In [18]:
print(len(poems_df), len(dates_df), len(poems_df.merge(dates_df, right_index=True, left_index=True)))
poems_df = poems_df.merge(dates_df, right_index=True, left_index=True)
poems_df.index.rename('title', inplace=True)

42 42 42


In [19]:
poems_df

title,content,category,filepath,date
a garden,"There’s an ancient, ancient garden that I see...",poem,data/poetry\A Garden.txt,1917
an american to mother england,England! My England! Can the surging sea\n Th...,poem,data/poetry\An American to Mother England.txt,1916
arcadia,By Head Balledup\n \n O give me the life of t...,poem,data/poetry\Arcadia.txt,1935
astrophobos,In the midnight heavens burning\n \tThro’ eth...,poem,data/poetry\Astrophobos.txt,1917
christmas,"The cottage hearth beams warm and bright,\n \...",poem,data/poetry\Christmas.txt,1920
dead passion's flame,"A Poem by Blank Frailty\n \n Ah, Passion, lik...",poem,data/poetry\Dead Passion's Flame.txt,1935
despair,"O’er the midnight moorlands crying,\n Thro’ t...",poem,data/poetry\Despair.txt,1919
fact and fancy,"How dull the wretch, whose philosophic mind\n...",poem,data/poetry\Fact and Fancy.txt,1917
festival,"\tThere is snow on the ground,\n \t\tAnd the ...",poem,data/poetry\Festival.txt,1925
fungi from yuggoth,I. The Book\n \n The place was dark and dusty...,poem,data/poetry\Fungi from Yuggoth.txt,1930


In [20]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
hp_df = pd.concat([hp_df, poems_df])

I am going to save one file without the raw text content as the LIB file, and another that drops the category and date columns as the RAW file. This way I can conform to the F2 style and also retain the raw text as a CSV.

In [21]:
LIB = hp_df.copy()
LIB.drop('content', axis=1, inplace=True)
LIB.to_csv('data/output/lovecraft_LIB.csv')

In [21]:
RAW = hp_df.copy()
RAW.drop(['category', 'date'], axis=1, inplace=True)
RAW.to_csv('data/output/lovecraft_RAW.csv', index=False)

In [22]:
# Load these dataframes back if the kernel disconnects or is shut off.
LIB = pd.read_csv('data/output/lovecraft_LIB.csv')
RAW = pd.read_csv('data/output/lovecraft_RAW.csv')

---

# 2) From F0 to F2

In [23]:
tokenizable_df = hp_df.copy()

In [24]:
def tokenize(dataframe):
    pass


def preprocess(dataframe):
    '''Preprocesses the text for tokenization'''
    temp = dataframe.copy()
    for ch in ['\n', '\t']:
        # Series.apply returns a new Series, so assign the result back.
        temp['content'] = temp['content'].apply(lambda x: x.replace(ch, ''))
    return temp



In [25]:
#tokenizable_df.content = tokenizable_df.content.str.lower()
tokenizable_df = tokenizable_df.replace(['\n', '\t'], ' ', regex=True)

In [26]:
#tokenizable_df['content'] = tokenizable_df['content'].str.replace('[^\w\s]', '', regex=True)

In [27]:
tokenizable_df

title,content,filepath,date,category
the alchemist,"High up, crowning the grassy summit of a swel...",data/short_stories\alchemist.txt,1908,short story
facts concerning the late arthur jermyn and his family,"I Life is a hideous thing, and from the back...",data/short_stories\arthur_jermyn.txt,1920,short story
azathoth,"When age fell upon the world, and wonder went...",data/short_stories\azathoth.txt,1922,short story
the beast in the cave,The horrible conclusion which had been gradua...,data/short_stories\beast.txt,1905,short story
beyond the wall of sleep,I have often wondered if the majority of mank...,data/short_stories\beyond_wall_of_sleep.txt,1919,short story
...,...,...,...,...
the wood,"They cut it down, and where the pitch-black a...",data/poetry\The Wood.txt,1929,poem
"to clark ashton smith, esq., upon his phantastick tales, verses, pictures, and sculptures",A time-black tower against dim banks of cloud...,"data/poetry\To Clark Ashton Smith, Esq., upon ...",1936,poem
"to edward john moreton drax plunkett, eighteenth baron dunsany",As when the sun above a dusky wold Springs i...,data/poetry\To Edward John Moreton Drax Plunke...,1919,poem
"unda; or, the bride of the sea",Respectfully Dedicated with Permission to MAU...,"data/poetry\Unda; or, The Bride of the Sea.txt",1915,poem


In [28]:
tokenizable_df['content_lower'] = tokenizable_df['content'].str.lower()

In [29]:
TOKENS = pd.DataFrame(columns=['title', 'sent_num', 'token_num', 'tokens_str', 'term_str'])

In [30]:
TOKENS

Unnamed: 0,title,sent_num,token_num,tokens_str,term_str


# 3) From F2 to F4

### Part of speech tagging.

This will take a while...

In [32]:
cols = ['title', 'para_num', 'sent_num', 'token_num', 'pos_tuple', 'pos', 'token_str', 'term_str']
CORPUS = pd.DataFrame(columns=cols)

for work in tqdm(range(len(tokenizable_df))):
    work_df = pd.DataFrame(columns=cols)
    # Newlines were replaced with spaces in In [25], so this split yields one "paragraph" per work.
    paras = tokenizable_df.iloc[work].content.split('\n \n')
    title = str(tokenizable_df.index[work])
    for para in range(len(paras)):
        sentence = nltk.sent_tokenize(paras[para])
        for sent in range(len(sentence)):
            tokens = nltk.word_tokenize(sentence[sent])
            tag = nltk.pos_tag(tokens)
            new_row = {}
            for token in range(len(tag)):
                new_row['title'] = title
                new_row['para_num'] = para
                new_row['sent_num'] = sent
                new_row['token_num'] = token
                new_row['pos_tuple'] = str(tag[token])
                new_row['pos'] = tag[token][1]
                new_row['token_str'] = tag[token][0]
                new_row['term_str'] = new_row['token_str'].lower()
                temp = pd.DataFrame(new_row, columns=cols, index=[0])
                work_df = pd.concat([work_df, temp])
    CORPUS = pd.concat([CORPUS, work_df])

100%|██████████| 109/109 [43:37<00:00, 24.01s/it]  
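
The loop above calls `pd.concat` once per token, which copies the growing frame each time and is a likely contributor to the long runtime. A minimal sketch of an alternative that collects plain dicts in a list and builds the frame once at the end. The function name `build_token_table` and its callable parameters are hypothetical; the stand-in tokenizers exist only so the sketch runs without NLTK data, and `para_num` and `pos_tuple` are left out for brevity.

```python
import pandas as pd


def build_token_table(works, sent_tokenize, word_tokenize, pos_tag):
    """Collect one dict per token and build the DataFrame in a single call."""
    rows = []
    for title, text in works.items():
        for sent_num, sentence in enumerate(sent_tokenize(text)):
            tagged = pos_tag(word_tokenize(sentence))
            for token_num, (token, pos) in enumerate(tagged):
                rows.append({
                    'title': title,
                    'sent_num': sent_num,
                    'token_num': token_num,
                    'pos': pos,
                    'token_str': token,
                    'term_str': token.lower(),
                })
    return pd.DataFrame(rows)


# Stand-in callables (assumptions); in the notebook these would be
# nltk.sent_tokenize, nltk.word_tokenize and nltk.pos_tag.
toy = build_token_table(
    {'demo': 'High up. The moon rose.'},
    sent_tokenize=lambda text: [s.strip() for s in text.split('.') if s.strip()],
    word_tokenize=str.split,
    pos_tag=lambda tokens: [(t, 'X') for t in tokens],
)
```

A frame built this way gets a running integer index instead of the repeated `0` index that the per-row `concat` produces.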


In [33]:
CORPUS

Unnamed: 0,title,para_num,sent_num,token_num,pos_tuple,pos,token_str,term_str
0,the alchemist,0,0,0,"('High', 'NNP')",NNP,High,high
0,the alchemist,0,0,1,"('up', 'RB')",RB,up,up
0,the alchemist,0,0,2,"(',', ',')",",",",",","
0,the alchemist,0,0,3,"('crowning', 'VBG')",VBG,crowning,crowning
0,the alchemist,0,0,4,"('the', 'DT')",DT,the,the
...,...,...,...,...,...,...,...,...
0,waste paper a poem of profound insignificance,0,59,1,"('home', 'NN')",NN,home,home
0,waste paper a poem of profound insignificance,0,59,2,"('In', 'IN')",IN,In,in
0,waste paper a poem of profound insignificance,0,59,3,"('the', 'DT')",DT,the,the
0,waste paper a poem of profound insignificance,0,59,4,"('shantih', 'NN')",NN,shantih,shantih


### Part of speech group from part of speech

In [34]:
CORPUS['pos_group'] = CORPUS['pos'].apply(lambda x: x[:2])

In [35]:
CORPUS

Unnamed: 0,title,para_num,sent_num,token_num,pos_tuple,pos,token_str,term_str,pos_group
0,the alchemist,0,0,0,"('High', 'NNP')",NNP,High,high,NN
0,the alchemist,0,0,1,"('up', 'RB')",RB,up,up,RB
0,the alchemist,0,0,2,"(',', ',')",",",",",",",","
0,the alchemist,0,0,3,"('crowning', 'VBG')",VBG,crowning,crowning,VB
0,the alchemist,0,0,4,"('the', 'DT')",DT,the,the,DT
...,...,...,...,...,...,...,...,...,...
0,waste paper a poem of profound insignificance,0,59,1,"('home', 'NN')",NN,home,home,NN
0,waste paper a poem of profound insignificance,0,59,2,"('In', 'IN')",IN,In,in,IN
0,waste paper a poem of profound insignificance,0,59,3,"('the', 'DT')",DT,the,the,DT
0,waste paper a poem of profound insignificance,0,59,4,"('shantih', 'NN')",NN,shantih,shantih,NN


In [36]:
CORPUS[CORPUS.term_str == '']

Unnamed: 0,title,para_num,sent_num,token_num,pos_tuple,pos,token_str,term_str,pos_group


No anomalies.  

Saving.

In [37]:
CORPUS.to_csv('data/output/lovecraft_CORPUS.csv', index=False)

In [38]:
# Load the corpus table if the kernel disconnects or is shut down.
CORPUS = pd.read_csv('data/output/lovecraft_CORPUS.csv')

## Vocab Table

In [55]:
# This code comes from Dr. Alvarado's ETA GitHub repo.
def extract_vocab(TOKENS):
    """This should also be done at the corpus level."""
    VOCAB = TOKENS.term_str.value_counts().to_frame('n')
    VOCAB.index.name = 'term_str'
    VOCAB['n_chars'] = VOCAB.index.str.len()
    VOCAB['p'] = VOCAB['n'] / VOCAB['n'].sum()  # relative frequency
    VOCAB['s'] = 1 / VOCAB['p']                 # inverse probability
    VOCAB['i'] = np.log2(VOCAB['s'])            # self-information in bits (negative log2 probability)
    VOCAB['h'] = VOCAB['p'] * VOCAB['i']        # per-term contribution to entropy
    return VOCAB
VOCAB = extract_vocab(CORPUS[['title', 'token_str', 'term_str']])
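
To make the `p`, `i` and `h` columns concrete, here is a small worked example on a hypothetical eight-token list, using only the standard library. The counts are chosen so that the probabilities are powers of two.

```python
import math
from collections import Counter

# Hypothetical toy token list: 'the' x4, 'moon' x2, 'sea' x1, 'bog' x1.
tokens = ['the', 'the', 'the', 'the', 'moon', 'moon', 'sea', 'bog']

counts = Counter(tokens)
total = sum(counts.values())

table = {}
for term, n in counts.items():
    p = n / total          # relative frequency
    i = math.log2(1 / p)   # self-information in bits
    table[term] = {'n': n, 'p': p, 'i': i, 'h': p * i}

# Summing the h column gives the entropy of the term distribution.
H = sum(row['h'] for row in table.values())
```

Here `the` has p = 0.5 and carries 1 bit, while `sea` and `bog` each have p = 0.125 and carry 3 bits; the entropy of the toy distribution is 1.75 bits.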

In [58]:
VOCAB = VOCAB.reset_index()

### Adding Term Rank

In [59]:
rank_dict = {}
n_list = list(VOCAB.n.unique())
for i in range(len(n_list)):
    rank_dict[n_list[i]] = i+1

In [60]:
VOCAB['term_rank'] = VOCAB['n'].apply(lambda x: rank_dict[x])
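
The dictionary above assigns the same rank to every term with the same count, which is what pandas calls a dense rank. A minimal sketch of the same idea with `Series.rank`, on a hypothetical five-term table:

```python
import pandas as pd

# Hypothetical counts; 'and' and 'of' are tied.
vocab = pd.DataFrame({
    'term_str': ['the', 'and', 'of', 'moon', 'sea'],
    'n': [10, 7, 7, 2, 1],
})

# Dense ranking: ties share a rank and no rank numbers are skipped.
vocab['term_rank'] = vocab['n'].rank(method='dense', ascending=False).astype(int)
```

This variant does not depend on the rows already being sorted by `n`.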

In [61]:
VOCAB

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank
0,the,36221,3,0.061829,16.173601,4.015569,0.248279,1
1,",",28309,1,0.048323,20.693914,4.371135,0.211228,2
2,and,20776,3,0.035465,28.197151,4.817477,0.170850,3
3,of,19688,2,0.033607,29.755384,4.895079,0.164511,4
4,.,18392,1,0.031395,31.852110,4.993317,0.156766,5
...,...,...,...,...,...,...,...,...
27565,hand-held,1,9,0.000002,585824.000000,19.160108,0.000033,421
27566,maouths,1,7,0.000002,585824.000000,19.160108,0.000033,421
27567,gammell,1,7,0.000002,585824.000000,19.160108,0.000033,421
27568,sojournings,1,11,0.000002,585824.000000,19.160108,0.000033,421


### Stemming

In [62]:
porter = nltk.stem.PorterStemmer()
snowball = nltk.stem.SnowballStemmer(language='english')
lancaster = nltk.stem.LancasterStemmer()

VOCAB['stem_porter'] = VOCAB['term_str'].apply(lambda x: porter.stem(x))
VOCAB['stem_snowball'] = VOCAB['term_str'].apply(lambda x: snowball.stem(x))
VOCAB['stem_lancaster'] = VOCAB['term_str'].apply(lambda x: lancaster.stem(x))

In [63]:
VOCAB

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,stem_lancaster
0,the,36221,3,0.061829,16.173601,4.015569,0.248279,1,the,the,the
1,",",28309,1,0.048323,20.693914,4.371135,0.211228,2,",",",",","
2,and,20776,3,0.035465,28.197151,4.817477,0.170850,3,and,and,and
3,of,19688,2,0.033607,29.755384,4.895079,0.164511,4,of,of,of
4,.,18392,1,0.031395,31.852110,4.993317,0.156766,5,.,.,.
...,...,...,...,...,...,...,...,...,...,...,...
27565,hand-held,1,9,0.000002,585824.000000,19.160108,0.000033,421,hand-held,hand-held,hand-held
27566,maouths,1,7,0.000002,585824.000000,19.160108,0.000033,421,maouth,maouth,maouth
27567,gammell,1,7,0.000002,585824.000000,19.160108,0.000033,421,gammel,gammel,gammel
27568,sojournings,1,11,0.000002,585824.000000,19.160108,0.000033,421,sojourn,sojourn,sojourn


### Adding Max POS

In [67]:
VOCAB['max_pos'] = CORPUS[['term_str', 'pos']].value_counts().unstack(fill_value=0).idxmax(1)

Unnamed: 0,term_str,max_pos
0,!,.
1,#,#
2,$,$
3,&,CC
4,',''
...,...,...
27565,—maevius,NNP
27566,—now,VBP
27567,’,NNP
27568,“,NN


In [83]:
temp = VOCAB.copy()
temp = temp.merge(pd.DataFrame(CORPUS[['term_str', 'pos']].value_counts().unstack(fill_value=0).idxmax(1), 
                        columns=['max_pos']).reset_index(),
                  on='term_str',
                  how='left')
temp['max_pos'] = temp['max_pos_y']
temp.drop(['max_pos_x', 'max_pos_y'], inplace=True, axis=1)
VOCAB = temp.copy()

In [90]:
temp = VOCAB.copy()
temp = temp.merge(pd.DataFrame(CORPUS[['term_str','pos']].value_counts().unstack().count(1), 
                        columns=['n_pos']).reset_index(),
                  on='term_str',
                  how='left')
temp['n_pos'] = temp['n_pos_y']
temp.drop(['n_pos_x', 'n_pos_y'], inplace=True, axis=1)
VOCAB = temp.copy()
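
After `reset_index()`, `VOCAB` carries an integer index, so assigning a Series indexed by `term_str` directly (as in In [67]) aligns on nothing and fills the column with NaN; the merges in In [83] and In [90] work around that. A hedged sketch of a shorter route using `Series.map`, which looks each term up in the per-term Series. The two small frames are hypothetical stand-ins for `CORPUS` and `VOCAB`.

```python
import pandas as pd

# Hypothetical stand-ins for CORPUS and VOCAB.
corpus = pd.DataFrame({
    'term_str': ['the', 'moon', 'moon', 'moon', 'rose'],
    'pos':      ['DT',  'NN',   'NN',   'VB',   'VBD'],
})
vocab = pd.DataFrame({'term_str': ['moon', 'rose', 'the']})

# Rows are terms, columns are POS tags, cells are counts.
pos_counts = corpus[['term_str', 'pos']].value_counts().unstack(fill_value=0)

# map() looks up each term_str value in the index of the given Series.
vocab['max_pos'] = vocab['term_str'].map(pos_counts.idxmax(axis=1))
vocab['n_pos'] = vocab['term_str'].map((pos_counts > 0).sum(axis=1))
```

Because `map` writes straight into the target column, no `_x`/`_y` suffixed columns need to be cleaned up afterwards.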

In [91]:
VOCAB

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,stem_lancaster,max_pos,n_pos
0,the,36221,3,0.061829,16.173601,4.015569,0.248279,1,the,the,the,DT,2
1,",",28309,1,0.048323,20.693914,4.371135,0.211228,2,",",",",",",",",1
2,and,20776,3,0.035465,28.197151,4.817477,0.170850,3,and,and,and,CC,2
3,of,19688,2,0.033607,29.755384,4.895079,0.164511,4,of,of,of,IN,2
4,.,18392,1,0.031395,31.852110,4.993317,0.156766,5,.,.,.,.,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
27565,hand-held,1,9,0.000002,585824.000000,19.160108,0.000033,421,hand-held,hand-held,hand-held,NN,1
27566,maouths,1,7,0.000002,585824.000000,19.160108,0.000033,421,maouth,maouth,maouth,NNS,1
27567,gammell,1,7,0.000002,585824.000000,19.160108,0.000033,421,gammel,gammel,gammel,NNP,1
27568,sojournings,1,11,0.000002,585824.000000,19.160108,0.000033,421,sojourn,sojourn,sojourn,NNS,1


### Stopwords

In [92]:
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns=['term_str'])
sw = sw.reset_index().set_index('term_str')
sw.columns = ['dummy']
sw.dummy = 1

In [93]:
# sw is indexed by term_str, so isin() can test membership directly.
VOCAB['stop'] = VOCAB['term_str'].isin(sw.index).astype(int)

In [94]:
VOCAB[VOCAB['stop'] == 1]

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,stem_lancaster,max_pos,n_pos,stop
0,the,36221,3,0.061829,16.173601,4.015569,0.248279,1,the,the,the,DT,2,1
2,and,20776,3,0.035465,28.197151,4.817477,0.170850,3,and,and,and,CC,2,1
3,of,19688,2,0.033607,29.755384,4.895079,0.164511,4,of,of,of,IN,2,1
5,to,11063,2,0.018885,52.953448,5.726653,0.108145,6,to,to,to,TO,2,1
6,a,10452,1,0.017842,56.048986,5.808616,0.103635,7,a,a,a,DT,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11882,re,3,2,0.000005,195274.666667,17.575145,0.000090,419,re,re,re,NNP,3,1
13859,don,2,3,0.000003,292912.000000,18.160108,0.000062,420,don,don,don,VB,1,1
15577,didn,2,4,0.000003,292912.000000,18.160108,0.000062,420,didn,didn,didn,VBP,2,1
21910,m,1,1,0.000002,585824.000000,19.160108,0.000033,421,m,m,m,JJ,1,1


In [97]:
VOCAB[VOCAB['max_pos'].isna()]

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,stem_lancaster,max_pos,n_pos,stop


### Zipf's k

In [98]:
VOCAB['zipf_k'] = VOCAB.n * VOCAB.term_rank

In [99]:
VOCAB

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,stem_lancaster,max_pos,n_pos,stop,zipf_k
0,the,36221,3,0.061829,16.173601,4.015569,0.248279,1,the,the,the,DT,2,1,36221
1,",",28309,1,0.048323,20.693914,4.371135,0.211228,2,",",",",",",",",1,0,56618
2,and,20776,3,0.035465,28.197151,4.817477,0.170850,3,and,and,and,CC,2,1,62328
3,of,19688,2,0.033607,29.755384,4.895079,0.164511,4,of,of,of,IN,2,1,78752
4,.,18392,1,0.031395,31.852110,4.993317,0.156766,5,.,.,.,.,1,0,91960
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27565,hand-held,1,9,0.000002,585824.000000,19.160108,0.000033,421,hand-held,hand-held,hand-held,NN,1,0,421
27566,maouths,1,7,0.000002,585824.000000,19.160108,0.000033,421,maouth,maouth,maouth,NNS,1,0,421
27567,gammell,1,7,0.000002,585824.000000,19.160108,0.000033,421,gammel,gammel,gammel,NNP,1,0,421
27568,sojournings,1,11,0.000002,585824.000000,19.160108,0.000033,421,sojourn,sojourn,sojourn,NNS,1,0,421


### TFIDF

#### Bag of Words

In [42]:
BOW = CORPUS.groupby(['title']+['term_str']).term_str.count().to_frame('n') 

In [43]:
len(BOW)

130334

In [44]:
BOW.sample(15)

title,term_str,n
the dream-quest of unknown kadath,sculptured,1
the rats in the walls,beings,1
the very old folk,become,2
the picture in the house,hence,1
the dunwich horror,pursuers,1
the dreams in the witch house,lodgings,1
the temple,records,1
the shunned house,sickly,2
the shunned house,schooner,1
imprisoned with the pharaohs,break,1


#### Document Term Matrix

In [45]:
DTCM = BOW.n.unstack()

In [46]:
DTCM.columns[10000:10010]

Index(['fruitful', 'fruition', 'fruitless', 'fruitlessly', 'fruits',
       'frustrated', 'frustration', 'fry', 'frye', 'fryes'],
      dtype='object', name='term_str')

In [47]:
# Compute DF (Document Frequency)
DF = DTCM.count() # THIS WORKS IF WE KEPT NULLS IN DTCM

# Compute TF (Term Frequency)
tf_method = 'sum'
print('TF method:', tf_method)
if tf_method == 'sum':
    TF = DTCM.T / DTCM.T.sum()
elif tf_method == 'max':
    TF = DTCM.T / DTCM.T.max()
elif tf_method == 'log':
    TF = np.log10(DTCM.T + 1)
elif tf_method == 'raw':
    TF = DTCM.T
elif tf_method == 'bool':
    TF = DTCM.T.astype('bool') #.astype('int')
TF = TF.T

# Compute IDF (Inverse Document Frequency)
idf_method = 'standard'
N = DTCM.shape[0]
print('IDF method:', idf_method)
if idf_method == 'standard':
    IDF = np.log10(N / DF)
elif idf_method == 'max':
    IDF = np.log10(DF.max() / DF) 
elif idf_method == 'smooth':
    IDF = np.log10((1 + N) / (1 + DF)) + 1
    
# Compute TFIDF
TFIDF = TF * IDF

TF method: sum
IDF method: standard
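
A small worked example may help in reading the TFIDF values that follow. It applies the same `sum` TF and `standard` IDF choices to a hypothetical two-document bag of words; the lower-case variable names are stand-ins for `BOW`, `DTCM`, `DF`, `TF` and `IDF`.

```python
import numpy as np
import pandas as pd

# Hypothetical bag of words for two documents, 'a' and 'b'.
bow = pd.DataFrame({
    'title':    ['a',    'a',   'b',   'b'],
    'term_str': ['moon', 'the', 'sea', 'the'],
    'n':        [1,      3,     2,     2],
}).set_index(['title', 'term_str'])

dtcm = bow['n'].unstack()            # NaN where a term is absent from a document
df = dtcm.count()                    # document frequency (counts non-null cells)
tf = (dtcm.T / dtcm.T.sum()).T       # 'sum' method: share of the document's tokens
idf = np.log10(dtcm.shape[0] / df)   # 'standard' method
tfidf = tf * idf
```

In this toy case `the` occurs in both documents, so its IDF is log10(2/2) = 0 and its TFIDF is 0 everywhere, while `moon` in document `a` gets 0.25 × log10(2). Under these choices, any term that appears in every document scores zero.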


In [48]:
TFIDF.head()

term_str,!,#,$,&,','','10,'28,'45,'46,...,κὀνις,μηδἐν,πἀντα,τὁ,—lucan,—maevius,—now,’,“,”
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
a garden,0.000822,,,,,,,,,,...,,,,,,,,0.012538,,
an american to mother england,0.00434,,,,,,,,,,...,,,,,,,,0.02048,,
arcadia,,,,,,,,,,,...,,,,,,,,,,
astrophobos,,,,,,,,,,,...,,,,,,,,0.028308,,
at the mountains of madness,8.3e-05,,,,0.000174,0.000298,,,,,...,,,,,,,,,,


In [49]:
df = pd.DataFrame(DF, columns=['df']).reset_index().copy()

temp = VOCAB.copy()
temp = temp.merge(df, on='term_str', how='left')
VOCAB = temp.copy()

In [50]:
idf = pd.DataFrame(IDF, columns=['idf']).reset_index().copy()

temp = VOCAB.copy()
temp = temp.merge(idf, on='term_str', how='left')
VOCAB = temp.copy()

#### TFIDF aggregates

In [51]:
tfidf_mean = pd.DataFrame(TFIDF[TFIDF > 0].mean().fillna(0), columns=['tfidf_mean']).reset_index().copy()

temp = VOCAB.copy()
temp = temp.merge(tfidf_mean, on='term_str', how='left')
VOCAB = temp.copy()

In [89]:
VOCAB['tfidf_max'] = TFIDF.max()

tfidf_max = pd.DataFrame(TFIDF.max(), columns=['tfidf_max']).reset_index().copy()
temp = VOCAB.copy()
temp = temp.merge(tfidf_max, on='term_str', how='left')
temp.rename(columns={'tfidf_max_y': 'tfidf_max'}, inplace=True)
temp.drop(['tfidf_max_x'], axis=1, inplace=True)
temp = temp.loc[:,~temp.columns.duplicated()]
temp
#VOCAB = temp.copy()

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,...,zipf_k,df_x,idf_x,tfidf_mean_x,tfidf_max,dfidf,dfidf2,df_y,idf_y,tfidf_mean_y
0,the,36221,3,0.061829,16.173601,4.015569,0.248279,1,the,the,...,36221,109,0.000000,0.000000,0.000000,0.000000,0.000000,109,0.000000,0.000000
1,",",28309,1,0.048323,20.693914,4.371135,0.211228,2,",",",",...,56618,109,0.000000,0.000000,0.000000,0.000000,0.000000,109,0.000000,0.000000
2,and,20776,3,0.035465,28.197151,4.817477,0.170850,3,and,and,...,62328,108,0.004003,0.000151,0.000276,0.432296,0.003966,108,0.004003,0.000151
3,of,19688,2,0.033607,29.755384,4.895079,0.164511,4,of,of,...,78752,108,0.004003,0.000127,0.000243,0.432296,0.003966,108,0.004003,0.000127
4,.,18392,1,0.031395,31.852110,4.993317,0.156766,5,.,.,...,91960,107,0.008043,0.000250,0.002298,0.860571,0.007895,107,0.008043,0.000250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27565,hand-held,1,9,0.000002,585824.000000,19.160108,0.000033,421,hand-held,hand-held,...,421,1,2.037426,0.000072,0.000072,2.037426,0.018692,1,2.037426,0.000072
27566,maouths,1,7,0.000002,585824.000000,19.160108,0.000033,421,maouth,maouth,...,421,1,2.037426,0.000103,0.000103,2.037426,0.018692,1,2.037426,0.000103
27567,gammell,1,7,0.000002,585824.000000,19.160108,0.000033,421,gammel,gammel,...,421,1,2.037426,0.000154,0.000154,2.037426,0.018692,1,2.037426,0.000154
27568,sojournings,1,11,0.000002,585824.000000,19.160108,0.000033,421,sojourn,sojourn,...,421,1,2.037426,0.000170,0.000170,2.037426,0.018692,1,2.037426,0.000170


In [69]:
VOCAB[VOCAB.tfidf_max.isna()]

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,...,idf_x,tfidf_mean_x,tfidf_max_x,tfidf_max_y,dfidf,dfidf2,df_y,idf_y,tfidf_mean_y,tfidf_max
0,the,36221,3,0.061829,16.173601,4.015569,0.248279,1,the,the,...,0.000000,0.000000,,0.000000,0.000000,0.000000,109,0.000000,0.000000,
1,",",28309,1,0.048323,20.693914,4.371135,0.211228,2,",",",",...,0.000000,0.000000,,0.000000,0.000000,0.000000,109,0.000000,0.000000,
2,and,20776,3,0.035465,28.197151,4.817477,0.170850,3,and,and,...,0.004003,0.000151,,0.000276,0.432296,0.003966,108,0.004003,0.000151,
3,of,19688,2,0.033607,29.755384,4.895079,0.164511,4,of,of,...,0.004003,0.000127,,0.000243,0.432296,0.003966,108,0.004003,0.000127,
4,.,18392,1,0.031395,31.852110,4.993317,0.156766,5,.,.,...,0.008043,0.000250,,0.002298,0.860571,0.007895,107,0.008043,0.000250,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27565,hand-held,1,9,0.000002,585824.000000,19.160108,0.000033,421,hand-held,hand-held,...,2.037426,0.000072,,0.000072,2.037426,0.018692,1,2.037426,0.000072,
27566,maouths,1,7,0.000002,585824.000000,19.160108,0.000033,421,maouth,maouth,...,2.037426,0.000103,,0.000103,2.037426,0.018692,1,2.037426,0.000103,
27567,gammell,1,7,0.000002,585824.000000,19.160108,0.000033,421,gammel,gammel,...,2.037426,0.000154,,0.000154,2.037426,0.018692,1,2.037426,0.000154,
27568,sojournings,1,11,0.000002,585824.000000,19.160108,0.000033,421,sojourn,sojourn,...,2.037426,0.000170,,0.000170,2.037426,0.018692,1,2.037426,0.000170,


In [118]:
VOCAB

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,stem_lancaster,max_pos,n_pos,stop,zipf_k,df,idf,tfidf_mean,tfidf_max_x,tfidf_max_y
0,the,36221,3,0.061829,16.173601,4.015569,0.248279,1,the,the,the,DT,2,1,36221,109,0.000000,0.000000,,0.000000
1,",",28309,1,0.048323,20.693914,4.371135,0.211228,2,",",",",",",",",1,0,56618,109,0.000000,0.000000,,0.000000
2,and,20776,3,0.035465,28.197151,4.817477,0.170850,3,and,and,and,CC,2,1,62328,108,0.004003,0.000151,,0.000276
3,of,19688,2,0.033607,29.755384,4.895079,0.164511,4,of,of,of,IN,2,1,78752,108,0.004003,0.000127,,0.000243
4,.,18392,1,0.031395,31.852110,4.993317,0.156766,5,.,.,.,.,1,0,91960,107,0.008043,0.000250,,0.002298
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27565,hand-held,1,9,0.000002,585824.000000,19.160108,0.000033,421,hand-held,hand-held,hand-held,NN,1,0,421,1,2.037426,0.000072,,0.000072
27566,maouths,1,7,0.000002,585824.000000,19.160108,0.000033,421,maouth,maouth,maouth,NNS,1,0,421,1,2.037426,0.000103,,0.000103
27567,gammell,1,7,0.000002,585824.000000,19.160108,0.000033,421,gammel,gammel,gammel,NNP,1,0,421,1,2.037426,0.000154,,0.000154
27568,sojournings,1,11,0.000002,585824.000000,19.160108,0.000033,421,sojourn,sojourn,sojourn,NNS,1,0,421,1,2.037426,0.000170,,0.000170


In [119]:
BOW['tf'] = TF.stack()
BOW['tfidf'] = TFIDF.stack()

#### DFIDF 

In [120]:
VOCAB['dfidf'] = VOCAB.df * VOCAB.idf
VOCAB['dfidf2'] = VOCAB.df/N * VOCAB.idf

In [121]:
VOCAB

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,...,n_pos,stop,zipf_k,df,idf,tfidf_mean,tfidf_max_x,tfidf_max_y,dfidf,dfidf2
0,the,36221,3,0.061829,16.173601,4.015569,0.248279,1,the,the,...,2,1,36221,109,0.000000,0.000000,,0.000000,0.000000,0.000000
1,",",28309,1,0.048323,20.693914,4.371135,0.211228,2,",",",",...,1,0,56618,109,0.000000,0.000000,,0.000000,0.000000,0.000000
2,and,20776,3,0.035465,28.197151,4.817477,0.170850,3,and,and,...,2,1,62328,108,0.004003,0.000151,,0.000276,0.432296,0.003966
3,of,19688,2,0.033607,29.755384,4.895079,0.164511,4,of,of,...,2,1,78752,108,0.004003,0.000127,,0.000243,0.432296,0.003966
4,.,18392,1,0.031395,31.852110,4.993317,0.156766,5,.,.,...,1,0,91960,107,0.008043,0.000250,,0.002298,0.860571,0.007895
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27565,hand-held,1,9,0.000002,585824.000000,19.160108,0.000033,421,hand-held,hand-held,...,1,0,421,1,2.037426,0.000072,,0.000072,2.037426,0.018692
27566,maouths,1,7,0.000002,585824.000000,19.160108,0.000033,421,maouth,maouth,...,1,0,421,1,2.037426,0.000103,,0.000103,2.037426,0.018692
27567,gammell,1,7,0.000002,585824.000000,19.160108,0.000033,421,gammel,gammel,...,1,0,421,1,2.037426,0.000154,,0.000154,2.037426,0.018692
27568,sojournings,1,11,0.000002,585824.000000,19.160108,0.000033,421,sojourn,sojourn,...,1,0,421,1,2.037426,0.000170,,0.000170,2.037426,0.018692


In [122]:
VOCAB[np.isnan(VOCAB['df'])]

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,...,n_pos,stop,zipf_k,df,idf,tfidf_mean,tfidf_max_x,tfidf_max_y,dfidf,dfidf2


In [123]:
(len(VOCAB) - len(VOCAB.dropna()))/len(VOCAB)

1.0

In [135]:
len(VOCAB[[True if '-' in x else False for x in VOCAB.term_str]])/len(VOCAB)

0.10214000725426188

In [36]:
VOCAB[[True if '-' in x else False for x in VOCAB.term_str]]

Unnamed: 0,term_str,n,n_chars,p,s,i,h,term_rank,stem_porter,stem_snowball,...,n_pos,stop,zipf_k,df,idf,tfidf_mean,tfidf_max_x,tfidf_max_y,dfidf,dfidf2
16,--,3710,2,0.006333,157.904043,7.302904,0.046249,17,--,--,...,1,0,63070,62,0.245035,0.001322,,0.005462,15.192158,0.139378
866,night-gaunts,64,12,0.000109,9153.500000,13.160108,0.001438,358,night-gaunt,night-gaunt,...,1,0,22912,2,1.736397,0.001359,,0.002351,3.472793,0.031860
2177,yog-sothoth,25,11,0.000043,23432.960000,14.516252,0.000619,397,yog-sothoth,yog-sothoth,...,4,0,9925,5,1.338456,0.000249,,0.000878,6.692282,0.061397
2390,dylath-leen,23,11,0.000039,25470.608696,14.636546,0.000575,399,dylath-leen,dylath-leen,...,2,0,9177,1,2.037426,0.001007,,0.001007,2.037426,0.018692
2843,high-priest,19,11,0.000032,30832.842105,14.912180,0.000484,403,high-priest,high-priest,...,2,0,7657,5,1.338456,0.001061,,0.003450,6.692282,0.061397
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27551,worn-down,1,9,0.000002,585824.000000,19.160108,0.000033,421,worn-down,worn-down,...,1,0,421,1,2.037426,0.000072,,0.000072,2.037426,0.018692
27555,ill-built,1,9,0.000002,585824.000000,19.160108,0.000033,421,ill-built,ill-built,...,1,0,421,1,2.037426,0.000045,,0.000045,2.037426,0.018692
27556,singular-looking,1,16,0.000002,585824.000000,19.160108,0.000033,421,singular-look,singular-look,...,1,0,421,1,2.037426,0.000069,,0.000069,2.037426,0.018692
27565,hand-held,1,9,0.000002,585824.000000,19.160108,0.000033,421,hand-held,hand-held,...,1,0,421,1,2.037426,0.000072,,0.000072,2.037426,0.018692


In [32]:
VOCAB.columns

Index(['term_str', 'n', 'n_chars', 'p', 's', 'i', 'h', 'term_rank',
       'stem_porter', 'stem_snowball', 'stem_lancaster', 'max_pos', 'n_pos',
       'stop', 'zipf_k', 'df', 'idf', 'tfidf_mean', 'tfidf_max_x',
       'tfidf_max_y', 'dfidf', 'dfidf2'],
      dtype='object')

### Saving

In [124]:
TOKENS.to_csv('data/output/lovecraft_TOKENS.csv', index=False)
VOCAB.to_csv('data/output/lovecraft_VOCAB.csv', index=False)
CORPUS.to_csv('data/output/lovecraft_CORPUS.csv', index=False)

In [41]:
# Loading tables if kernel shuts off.
TOKENS = pd.read_csv('data/output/lovecraft_TOKENS.csv')
VOCAB = pd.read_csv('data/output/lovecraft_VOCAB.csv')
CORPUS = pd.read_csv('data/output/lovecraft_CORPUS.csv')

---

In [125]:
# save all
TOKENS.to_csv('data/output/lovecraft_TOKENS.csv', index=False)
VOCAB.to_csv('data/output/lovecraft_VOCAB.csv', index=False)
RAW.to_csv('data/output/lovecraft_RAW.csv', index=False)
LIB.to_csv('data/output/lovecraft_LIB.csv', index=False)
CORPUS.to_csv('data/output/lovecraft_CORPUS.csv', index=False)

Tracking pseudonyms

In [None]:
{'arcadia' : 'Head Balledup',
 'dead passion\'s flame' : 'Blank Frailty'}