Before running this notebook, install the required modules and spaCy's English language model. Both are listed in the `requirements.txt` file and can be installed with `pip install -r requirements.txt`.

Once the required modules have been installed, import them and load spaCy's English language model by running the cell below.

In [73]:
import en_core_web_sm
import pandas as pd
import requests
import spacy
from pathlib import Path
from spacy.symbols import NORM, ORTH
nlp = en_core_web_sm.load()

Retrieve poems using the PoetryDB API. For this assignment, we're only looking at poems by Edgar Allan Poe.

In [74]:
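def get_poems_by(author):
    # request all poems by the specified author from PoetryDB
    response = requests.get(f'https://poetrydb.org/author/{author}')
    # fail early on HTTP errors instead of trying to parse an error response
    response.raise_for_status()
    # convert the JSON list of poems to a pandas dataframe and return the result
    return pd.json_normalize(response.json())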

author_of_interest = 'Edgar Allan Poe'
poem_df = get_poems_by(author_of_interest)
poem_df.head()

Unnamed: 0,title,author,lines,linecount
0,The Raven,Edgar Allan Poe,"[Once upon a midnight dreary, while I pondered...",113
1,The Bells,Edgar Allan Poe,"[Hear the sledges with the bells--, Silver bel...",113
2,Ulalume,Edgar Allan Poe,"[The skies they were ashen and sober;, The l...",94
3,To Helen,Edgar Allan Poe,"[I saw thee once--once only--years ago:, I mus...",66
4,Annabel Lee,Edgar Allan Poe,"[It was many and many a year ago,, In a king...",41
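
To make the conversion step concrete without calling the API, the sketch below applies `pd.json_normalize` to a small hand-written payload. The payload shape (a list of objects with `title`, `author`, `lines` and `linecount`) is inferred from the columns shown above. The values themselves are hypothetical placeholders, not real API output.

```python
import pandas as pd

# hypothetical payload mimicking the response shape inferred from the output above
sample_payload = [
    {
        "title": "Sample poem",
        "author": "Sample Author",
        "lines": ["first line", "second line"],
        "linecount": 2,
    }
]

# each object becomes one row; the list of lines is kept as a single cell
sample_df = pd.json_normalize(sample_payload)
print(sample_df.columns.tolist())
print(sample_df.loc[0, "lines"])
```

Because no `record_path` is given, the `lines` list is not expanded into separate rows, which is why the next step has to join it into a string.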


Before we continue, we rename the `lines` column to `document` and convert its content (a list of strings per poem) to a single string, with the lines joined by newline characters.

In [75]:
poem_df.rename(columns={"lines": "document"}, inplace=True)
poem_df['document'] = poem_df['document'].str.join('\n')
poem_df.head()

Unnamed: 0,title,author,document,linecount
0,The Raven,Edgar Allan Poe,"Once upon a midnight dreary, while I pondered,...",113
1,The Bells,Edgar Allan Poe,Hear the sledges with the bells--\nSilver bell...,113
2,Ulalume,Edgar Allan Poe,The skies they were ashen and sober;\n The le...,94
3,To Helen,Edgar Allan Poe,I saw thee once--once only--years ago:\nI must...,66
4,Annabel Lee,Edgar Allan Poe,"It was many and many a year ago,\n In a kingd...",41
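
As a small illustration of what `.str.join('\n')` does to a column of lists, here is a sketch on a hypothetical two-row series (the contents are placeholders, not poems from the corpus).

```python
import pandas as pd

# hypothetical column: each cell holds the lines of one poem as a list of strings
lines = pd.Series([["first line", "second line"], ["only line"]])

# join the strings inside each cell with a newline, giving one string per row
documents = lines.str.join("\n")
print(documents.tolist())
```

Note that cells containing non-string items would come back as missing values rather than raising an error, so it may be worth checking for NaN after this step.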


Then we add some linguistic features, like a tokenized and lemmatized version of each poem, as well as POS-tags.

In [76]:
def add_linguistic_features(documents: pd.Series) -> tuple[pd.Series, pd.Series, pd.Series, pd.Series]:
    # register special tokenization cases once, before processing any document
    special_cases = [("'tis", [{ORTH: "'t", NORM: "it"}, {ORTH: "is"}]),
                     ("'Tis", [{ORTH: "'T", NORM: "It"}, {ORTH: "is"}])]
    for special_str, special_attr in special_cases:
        nlp.tokenizer.add_special_case(special_str, special_attr)

    texts = []
    tokens = []
    lemmas = []
    pos = []

    for document in documents:
        # add spaces around double dashes to prevent tokenization issues
        text = document.replace('--', ' -- ')
        # remove underscores used for emphasis
        text = text.replace('_', '')
        # save pre-processed document
        texts.append(text)

        # collect token texts, lemmas and POS tags for this document
        doc = nlp(text)
        tokens.append([token.text for token in doc])
        lemmas.append([token.lemma_ for token in doc])
        pos.append([token.pos_ for token in doc])

    # reuse the input index so the results align with the dataframe on assignment
    index = documents.index
    return (pd.Series(texts, index=index), pd.Series(tokens, index=index),
            pd.Series(lemmas, index=index), pd.Series(pos, index=index))

poem_df['text'], poem_df['tokens'], poem_df['lemmas'], poem_df['parts-of-speech'] = add_linguistic_features(poem_df['document'])
poem_df.head()

Unnamed: 0,title,author,document,linecount,text,tokens,lemmas,parts-of-speech
0,The Raven,Edgar Allan Poe,"Once upon a midnight dreary, while I pondered,...",113,"Once upon a midnight dreary, while I pondered,...","[Once, upon, a, midnight, dreary, ,, while, I,...","[once, upon, a, midnight, dreary, ,, while, I,...","[ADV, SCONJ, DET, NOUN, NOUN, PUNCT, SCONJ, PR..."
1,The Bells,Edgar Allan Poe,Hear the sledges with the bells--\nSilver bell...,113,Hear the sledges with the bells -- \nSilver be...,"[Hear, the, sledges, with, the, bells, --, \n,...","[hear, the, sledge, with, the, bell, --, \n, s...","[VERB, DET, NOUN, ADP, DET, NOUN, PUNCT, SPACE..."
2,Ulalume,Edgar Allan Poe,The skies they were ashen and sober;\n The le...,94,The skies they were ashen and sober;\n The le...,"[The, skies, they, were, ashen, and, sober, ;,...","[the, sky, they, be, ashen, and, sober, ;, \n ...","[DET, NOUN, PRON, AUX, ADJ, CCONJ, ADJ, PUNCT,..."
3,To Helen,Edgar Allan Poe,I saw thee once--once only--years ago:\nI must...,66,I saw thee once -- once only -- years ago:\nI ...,"[I, saw, thee, once, --, once, only, --, years...","[I, see, thee, once, --, once, only, --, year,...","[PRON, VERB, PRON, ADV, PUNCT, ADV, ADV, PUNCT..."
4,Annabel Lee,Edgar Allan Poe,"It was many and many a year ago,\n In a kingd...",41,"It was many and many a year ago,\n In a kingd...","[It, was, many, and, many, a, year, ago, ,, \n...","[it, be, many, and, many, a, year, ago, ,, \n ...","[PRON, AUX, ADJ, CCONJ, ADJ, DET, NOUN, ADV, P..."
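
The two string replacements applied before tokenization can be tried in isolation. The sketch below wraps them in a hypothetical helper named `preprocess` (not part of the notebook) and runs it on the opening line of "To Helen" as shown in the `document` column, plus a made-up string with underscore emphasis.

```python
def preprocess(document: str) -> str:
    """Hypothetical helper mirroring the two replacements used in the cell above."""
    # add spaces around double dashes so they become separate tokens
    text = document.replace('--', ' -- ')
    # remove underscores used for emphasis
    return text.replace('_', '')


print(preprocess("I saw thee once--once only--years ago:"))
print(preprocess("an _emphasised_ word"))
```

The first result matches the start of the `text` column for "To Helen" in the output above.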


We also store each document in our data folder and add the filenames to our dataframe. We use a separate folder for each author to keep things tidy.

In [77]:
# save documents
filenames = []
for i, poem in poem_df.iterrows():
    # one folder per author
    author_dir = Path('data') / poem['author']
    author_dir.mkdir(exist_ok=True, parents=True)
    # double quotes outside, single quotes inside: valid on Python versions before 3.12 too
    filename = f"{i} - {poem['title']}.txt"
    with open(author_dir / filename, "w", encoding="utf-8") as f:
        f.write(poem['document'])
    filenames.append(filename)

# add filenames to dataframe
poem_df['filename'] = pd.Series(filenames, index=poem_df.index)
poem_df.head()


Unnamed: 0,title,author,document,linecount,text,tokens,lemmas,parts-of-speech,filename
0,The Raven,Edgar Allan Poe,"Once upon a midnight dreary, while I pondered,...",113,"Once upon a midnight dreary, while I pondered,...","[Once, upon, a, midnight, dreary, ,, while, I,...","[once, upon, a, midnight, dreary, ,, while, I,...","[ADV, SCONJ, DET, NOUN, NOUN, PUNCT, SCONJ, PR...",0 - The Raven.txt
1,The Bells,Edgar Allan Poe,Hear the sledges with the bells--\nSilver bell...,113,Hear the sledges with the bells -- \nSilver be...,"[Hear, the, sledges, with, the, bells, --, \n,...","[hear, the, sledge, with, the, bell, --, \n, s...","[VERB, DET, NOUN, ADP, DET, NOUN, PUNCT, SPACE...",1 - The Bells.txt
2,Ulalume,Edgar Allan Poe,The skies they were ashen and sober;\n The le...,94,The skies they were ashen and sober;\n The le...,"[The, skies, they, were, ashen, and, sober, ;,...","[the, sky, they, be, ashen, and, sober, ;, \n ...","[DET, NOUN, PRON, AUX, ADJ, CCONJ, ADJ, PUNCT,...",2 - Ulalume.txt
3,To Helen,Edgar Allan Poe,I saw thee once--once only--years ago:\nI must...,66,I saw thee once -- once only -- years ago:\nI ...,"[I, saw, thee, once, --, once, only, --, years...","[I, see, thee, once, --, once, only, --, year,...","[PRON, VERB, PRON, ADV, PUNCT, ADV, ADV, PUNCT...",3 - To Helen.txt
4,Annabel Lee,Edgar Allan Poe,"It was many and many a year ago,\n In a kingd...",41,"It was many and many a year ago,\n In a kingd...","[It, was, many, and, many, a, year, ago, ,, \n...","[it, be, many, and, many, a, year, ago, ,, \n ...","[PRON, AUX, ADJ, CCONJ, ADJ, DET, NOUN, ADV, P...",4 - Annabel Lee.txt
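
Since poem titles go straight into the file names, a title containing characters such as `/` or `?` could produce a path that is invalid on some filesystems. Whether any title in this corpus is affected is not shown here, so the following is only a precautionary sketch. The helper name `safe_filename` and the choice of replacement character are assumptions.

```python
import re


def safe_filename(index: int, title: str) -> str:
    """Hypothetical helper: build '<index> - <title>.txt' with risky characters replaced."""
    # replace characters that are reserved on common filesystems with an underscore
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return f"{index} - {cleaned}.txt"


print(safe_filename(0, "The Raven"))
print(safe_filename(1, "A title/with: odd? characters"))
```

Titles without reserved characters pass through unchanged, so the file names shown above would stay the same.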


Lastly, we reorder the columns to meet the requirements of the assignment.

In [78]:
poem_df = poem_df[['filename', 'title', 'document', 'text', 'tokens', 'lemmas', 'parts-of-speech', 'author', 'linecount']]
poem_df.head()

Unnamed: 0,filename,title,document,text,tokens,lemmas,parts-of-speech,author,linecount
0,0 - The Raven.txt,The Raven,"Once upon a midnight dreary, while I pondered,...","Once upon a midnight dreary, while I pondered,...","[Once, upon, a, midnight, dreary, ,, while, I,...","[once, upon, a, midnight, dreary, ,, while, I,...","[ADV, SCONJ, DET, NOUN, NOUN, PUNCT, SCONJ, PR...",Edgar Allan Poe,113
1,1 - The Bells.txt,The Bells,Hear the sledges with the bells--\nSilver bell...,Hear the sledges with the bells -- \nSilver be...,"[Hear, the, sledges, with, the, bells, --, \n,...","[hear, the, sledge, with, the, bell, --, \n, s...","[VERB, DET, NOUN, ADP, DET, NOUN, PUNCT, SPACE...",Edgar Allan Poe,113
2,2 - Ulalume.txt,Ulalume,The skies they were ashen and sober;\n The le...,The skies they were ashen and sober;\n The le...,"[The, skies, they, were, ashen, and, sober, ;,...","[the, sky, they, be, ashen, and, sober, ;, \n ...","[DET, NOUN, PRON, AUX, ADJ, CCONJ, ADJ, PUNCT,...",Edgar Allan Poe,94
3,3 - To Helen.txt,To Helen,I saw thee once--once only--years ago:\nI must...,I saw thee once -- once only -- years ago:\nI ...,"[I, saw, thee, once, --, once, only, --, years...","[I, see, thee, once, --, once, only, --, year,...","[PRON, VERB, PRON, ADV, PUNCT, ADV, ADV, PUNCT...",Edgar Allan Poe,66
4,4 - Annabel Lee.txt,Annabel Lee,"It was many and many a year ago,\n In a kingd...","It was many and many a year ago,\n In a kingd...","[It, was, many, and, many, a, year, ago, ,, \n...","[it, be, many, and, many, a, year, ago, ,, \n ...","[PRON, AUX, ADJ, CCONJ, ADJ, DET, NOUN, ADV, P...",Edgar Allan Poe,41


Finally, we save the dataframe as a CSV file.

In [79]:
poem_df.to_csv('corpus_data.csv', encoding='utf-8')
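
One thing to keep in mind when loading the CSV again: list-valued columns such as `tokens` are written as their string representation. The sketch below shows one possible way to restore them with `ast.literal_eval`. It uses a hypothetical one-row dataframe and an in-memory buffer rather than the real `corpus_data.csv`.

```python
import ast
import io

import pandas as pd

# hypothetical miniature of the corpus dataframe with one list-valued column
small_df = pd.DataFrame({"title": ["Sample poem"], "tokens": [["Hear", "the", "bells"]]})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
small_df.to_csv(buffer)

# without a converter, the list column is read back as a plain string
buffer.seek(0)
as_text = pd.read_csv(buffer, index_col=0)

# ast.literal_eval parses the string representation back into a list
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, converters={"tokens": ast.literal_eval})

print(type(as_text.loc[0, "tokens"]).__name__)
print(type(restored.loc[0, "tokens"]).__name__)
```

Passing `index_col=0` reads the saved index back as the index. Without it, the index would show up as an extra column named `Unnamed: 0`.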