Before running this notebook, install the required Python libraries and spaCy's English language model. These are listed in the `requirements.txt` file and can be installed with `pip install -r requirements.txt`. We recommend doing so inside a [Python virtual environment](https://www.w3schools.com/python/python_virtualenv.asp).
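
To confirm that the installation worked before running the rest of the notebook, a quick check along the following lines may help. This is only a sketch: `find_missing` and `REQUIRED_MODULES` are hypothetical names introduced here, and the module list simply mirrors the imports used in this notebook.

```python
from importlib.util import find_spec

# Top-level module names imported by this notebook (assumed list).
REQUIRED_MODULES = ["en_core_web_sm", "pandas", "requests", "spacy"]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If any module is reported as missing, re-run the `pip install` command inside the active virtual environment.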

Once the required libraries have been installed, import them by running the cell below.

In [67]:
import en_core_web_sm
import pandas as pd
import requests
import spacy
from pathlib import Path
from spacy.symbols import NORM, ORTH

We start by retrieving poems using the [PoetryDB](https://poetrydb.org/) API. For this revised assignment, we are going to look at poems by [Henry Wadsworth Longfellow](https://wikipedia.org/wiki/Henry_Wadsworth_Longfellow) and [Edgar Allan Poe](https://wikipedia.org/wiki/Edgar_Allan_Poe).

In [68]:
def get_poems_by(authors: list):
    # create empty dataframe that will be filled as we retrieve poems from the API
    poems_df = pd.DataFrame()
    # create set to avoid checking same author multiple times
    checked_authors = set()

    # retrieve known authors from PoetryDB, to avoid making invalid API requests
    valid_authors = requests.get('https://poetrydb.org/author').json()
    valid_authors = set(valid_authors['authors'])

    # iterate over authors submitted to function
    for author in authors:
        # only check authors found in PoetryDB
        if author in valid_authors:
            # only check authors that haven't been checked before
            if author not in checked_authors:
                # get the JSON response containing the poems by the specified author
                author_json = requests.get(f'https://poetrydb.org/author/{author}').json()
                # if the dataframe is still empty, overwrite it with poems of the first author we check
                if poems_df.empty:
                    poems_df = pd.json_normalize(author_json)
                # if the dataframe is not empty, concatenate poems of other authors
                else:
                    poems_df = pd.concat([poems_df, pd.json_normalize(author_json)], ignore_index=True)
                checked_authors.add(author)
            else:
                print(f"Skipping duplicate author '{author}'")
        else:
            print(f"Author '{author}' not found in PoetryDB")

    # return the final dataframe containing the poems of all authors found on PoetryDB
    return poems_df

authors_of_interest = ['Edgar Allan Poe', 'Henry Wadsworth Longfellow']
api_df = get_poems_by(authors_of_interest)
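
The validation and de-duplication logic in `get_poems_by` can be tried out without any network access. The sketch below isolates that logic in plain Python. `partition_authors` is a hypothetical helper, and the set of valid authors is a hand-written stand-in for the PoetryDB response, not real API output.

```python
def partition_authors(requested, valid_authors):
    """Split `requested` into authors to fetch, duplicates, and unknown names.

    Mirrors the order of checks in get_poems_by: a name is first checked
    against the known authors, and only then against the names already seen.
    """
    to_fetch, duplicates, unknown = [], [], []
    seen = set()
    for author in requested:
        if author not in valid_authors:
            unknown.append(author)
        elif author in seen:
            duplicates.append(author)
        else:
            seen.add(author)
            to_fetch.append(author)
    return to_fetch, duplicates, unknown


# Hand-written stand-in for the author list returned by the API.
sample_valid = {"Edgar Allan Poe", "Henry Wadsworth Longfellow"}
sample_request = ["Edgar Allan Poe", "Edgar Allan Poe", "Jane Doe"]

to_fetch, duplicates, unknown = partition_authors(sample_request, sample_valid)
print(to_fetch, duplicates, unknown)
```

Separating the bookkeeping from the HTTP calls in this way would also make it easier to test the function without contacting the API.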

Let's take a quick look at a sample of the data.

In [69]:
api_df.sample(10)

Unnamed: 0,title,author,lines,linecount
88,The Jewish Cemetery at Newport,Henry Wadsworth Longfellow,[How strange it seems! These Hebrews in their ...,60
56,The Children's Hour,Henry Wadsworth Longfellow,"[Between the dark and the daylight,, When the ...",40
81,The Poet's Calendar,Henry Wadsworth Longfellow,"[January, , Janus am I; oldest of potentates;,...",108
76,Hiawatha's Wooing,Henry Wadsworth Longfellow,"[""As unto the bow the cord is,, So unto the ma...",283
70,The Rainy Day,Henry Wadsworth Longfellow,"[The day is cold, and dark, and dreary, It rai...",15
28,Sonnet--To Science,Edgar Allan Poe,"[SCIENCE! true daughter of Old Time thou art!,...",14
3,To Helen,Edgar Allan Poe,"[I saw thee once--once only--years ago:, I mus...",66
66,AFTERNOON IN FEBRUARY,Henry Wadsworth Longfellow,"[The day is ending,, The night is descending;,...",24
34,To----,Edgar Allan Poe,"[I heed not that my earthly lot, Hath--littl...",8
71,Morituri Salutamus: Poem for the Fiftieth Anni...,Henry Wadsworth Longfellow,"[Tempora labuntur, tacitisque senescimus annis...",288


Before we continue, we rename the `lines` column to `document` and combine the lines it contains into a single string, separated by newline characters.

In [70]:
poem_df = api_df.rename(columns={'lines': 'document'})
poem_df['document'] = poem_df['document'].str.join('\n')
poem_df.sample(5)

Unnamed: 0,title,author,document,linecount
50,The Goblet of Life,Henry Wadsworth Longfellow,Filled is Life's goblet to the brim;\nAnd thou...,60
15,To Marie Louise (Shew),Edgar Allan Poe,"Not long ago, the writer of these lines,\nIn t...",27
65,Holidays,Henry Wadsworth Longfellow,The holiest of all holidays are those\nKept by...,14
56,The Children's Hour,Henry Wadsworth Longfellow,"Between the dark and the daylight,\nWhen the n...",40
12,Eulalie,Edgar Allan Poe,I dwelt alone\n In a ...,22
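
For a single record, the join step is equivalent to calling `'\n'.join(...)` on the list of lines. The following plain-Python sketch illustrates this on a shortened record. The dictionary is an abbreviated, hand-written sample rather than a full API response.

```python
# Abbreviated sample record: only the first two lines of the poem are included.
record = {
    "title": "AFTERNOON IN FEBRUARY",
    "author": "Henry Wadsworth Longfellow",
    "lines": ["The day is ending,", "The night is descending;"],
}

# Rename the key and join the list of lines into one newline-separated string.
document = "\n".join(record["lines"])
renamed = {
    "title": record["title"],
    "author": record["author"],
    "document": document,
}

print(renamed["document"])
```

The original line boundaries remain recoverable afterwards with `document.split('\n')`.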


Then we add some linguistic features: a pre-processed, a tokenized, and a lemmatized version of each poem, its part-of-speech (POS) tags, and its token count.

In [71]:
# load spaCy's English model
nlp = en_core_web_sm.load()
# Add special tokenization cases
special_cases = [
    ("'tis", [{ORTH: "'t", NORM: "it"}, {ORTH: "is"}]),
    ("'Tis", [{ORTH: "'T", NORM: "It"}, {ORTH: "is"}]),
]
for special_str, special_attr in special_cases:
    nlp.tokenizer.add_special_case(special_str, special_attr)


def add_linguistic_features(documents: pd.Series, nlp: spacy.language.Language
                            ) -> tuple[pd.Series, pd.Series, pd.Series, pd.Series, pd.Series]:
    texts = []
    tokens = []
    lemmas = []
    pos = []
    token_counts = []

    for document in documents:
        # replacing double dashes with an em dash and placing spaces around it to prevent tokenization issues
        text = document.replace('--', ' — ')
        # remove underscores used for emphasis
        text = text.replace('_', '')
        # save pre-processed document
        texts.append(text)

        # apply spaCy's NLP model to the document
        nlp_tokens = nlp(text)

        doc_tokens = []
        doc_lemmas = []
        doc_pos = []

        # iterate over the already parsed document instead of running the pipeline a second time
        for token in nlp_tokens:
            doc_tokens.append(token.text)
            doc_lemmas.append(token.lemma_)
            doc_pos.append(token.pos_)

        tokens.append(doc_tokens)
        lemmas.append(doc_lemmas)
        pos.append(doc_pos)
        token_counts.append(len(doc_tokens))

    return (
        pd.Series(texts, index=documents.index),
        pd.Series(tokens, index=documents.index),
        pd.Series(lemmas, index=documents.index),
        pd.Series(pos, index=documents.index),
        pd.Series(token_counts, index=documents.index)
    )


poem_df['text'], poem_df['tokens'], poem_df['lemmas'], poem_df['parts-of-speech'], poem_df['tokencount'] = add_linguistic_features(poem_df['document'], nlp)
poem_df.head()

Unnamed: 0,title,author,document,linecount,text,tokens,lemmas,parts-of-speech,tokencount
0,The Raven,Edgar Allan Poe,"Once upon a midnight dreary, while I pondered,...",113,"Once upon a midnight dreary, while I pondered,...","[Once, upon, a, midnight, dreary, ,, while, I,...","[once, upon, a, midnight, dreary, ,, while, I,...","[ADV, SCONJ, DET, NOUN, NOUN, PUNCT, SCONJ, PR...",1489
1,The Bells,Edgar Allan Poe,Hear the sledges with the bells--\nSilver bell...,113,Hear the sledges with the bells — \nSilver bel...,"[Hear, the, sledges, with, the, bells, —, \n, ...","[hear, the, sledge, with, the, bell, —, \n, si...","[VERB, DET, NOUN, ADP, DET, NOUN, PUNCT, SPACE...",872
2,Ulalume,Edgar Allan Poe,The skies they were ashen and sober;\n The le...,94,The skies they were ashen and sober;\n The le...,"[The, skies, they, were, ashen, and, sober, ;,...","[the, sky, they, be, ashen, and, sober, ;, \n ...","[DET, NOUN, PRON, AUX, ADJ, CCONJ, ADJ, PUNCT,...",898
3,To Helen,Edgar Allan Poe,I saw thee once--once only--years ago:\nI must...,66,I saw thee once — once only — years ago:\nI mu...,"[I, saw, thee, once, —, once, only, —, years, ...","[I, see, thee, once, —, once, only, —, year, a...","[PRON, VERB, PRON, ADV, PUNCT, ADV, ADV, PUNCT...",710
4,Annabel Lee,Edgar Allan Poe,"It was many and many a year ago,\n In a kingd...",41,"It was many and many a year ago,\n In a kingd...","[It, was, many, and, many, a, year, ago, ,, \n...","[it, be, many, and, many, a, year, ago, ,, \n ...","[PRON, AUX, ADJ, CCONJ, ADJ, DET, NOUN, ADV, P...",385
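
The pre-processing that happens before the text is handed to spaCy can be checked on its own. The sketch below repeats the two string replacements from `add_linguistic_features` in a small standalone function. The name `preprocess` is hypothetical, and the underscore example is an invented string rather than a line from the corpus.

```python
EM_DASH = "\u2014"


def preprocess(document: str) -> str:
    """Apply the same two replacements as add_linguistic_features."""
    # Replace double dashes with a spaced em dash to help the tokenizer.
    text = document.replace("--", f" {EM_DASH} ")
    # Remove underscores used for emphasis.
    return text.replace("_", "")


sample = "I saw thee once--once only--years ago:"
print(preprocess(sample))
print(repr(preprocess("Hear the sledges with the bells--\nSilver bells")))
print(preprocess("_emphasis_"))
```

Note that a double dash at the end of a line leaves a trailing space before the newline, which matches the `text` column shown for "The Bells" above.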


We also store each document in our data folder and add the filenames to our dataframe. We use a separate folder for each author to keep things tidy.

In [72]:
# create a directory for each author in the dataframe
for author in poem_df['author'].unique():
    author_dir = Path('data') / Path(author)
    author_dir.mkdir(exist_ok=True, parents=True)

# save each poem in the directory of the respective author
file_names = []
for i, poem in poem_df.iterrows():
    # we use the tokencount and linecount in the filename to distinguish poems that are both from the same author and have the same title
    file_name = f"{poem['tokencount']}_{poem['linecount']} {poem['title']}.txt"
    file_path = Path('data') / Path(poem['author']) / Path(file_name)

    file_path.write_text(poem['document'], encoding='utf-8')
    file_names.append(file_name)

# add filenames to dataframe (assigning the list directly avoids relying on index alignment)
poem_df['filename'] = file_names
poem_df.head()

Unnamed: 0,title,author,document,linecount,text,tokens,lemmas,parts-of-speech,tokencount,filename
0,The Raven,Edgar Allan Poe,"Once upon a midnight dreary, while I pondered,...",113,"Once upon a midnight dreary, while I pondered,...","[Once, upon, a, midnight, dreary, ,, while, I,...","[once, upon, a, midnight, dreary, ,, while, I,...","[ADV, SCONJ, DET, NOUN, NOUN, PUNCT, SCONJ, PR...",1489,1489_113 The Raven.txt
1,The Bells,Edgar Allan Poe,Hear the sledges with the bells--\nSilver bell...,113,Hear the sledges with the bells — \nSilver bel...,"[Hear, the, sledges, with, the, bells, —, \n, ...","[hear, the, sledge, with, the, bell, —, \n, si...","[VERB, DET, NOUN, ADP, DET, NOUN, PUNCT, SPACE...",872,872_113 The Bells.txt
2,Ulalume,Edgar Allan Poe,The skies they were ashen and sober;\n The le...,94,The skies they were ashen and sober;\n The le...,"[The, skies, they, were, ashen, and, sober, ;,...","[the, sky, they, be, ashen, and, sober, ;, \n ...","[DET, NOUN, PRON, AUX, ADJ, CCONJ, ADJ, PUNCT,...",898,898_94 Ulalume.txt
3,To Helen,Edgar Allan Poe,I saw thee once--once only--years ago:\nI must...,66,I saw thee once — once only — years ago:\nI mu...,"[I, saw, thee, once, —, once, only, —, years, ...","[I, see, thee, once, —, once, only, —, year, a...","[PRON, VERB, PRON, ADV, PUNCT, ADV, ADV, PUNCT...",710,710_66 To Helen.txt
4,Annabel Lee,Edgar Allan Poe,"It was many and many a year ago,\n In a kingd...",41,"It was many and many a year ago,\n In a kingd...","[It, was, many, and, many, a, year, ago, ,, \n...","[it, be, many, and, many, a, year, ago, ,, \n ...","[PRON, AUX, ADJ, CCONJ, ADJ, DET, NOUN, ADV, P...",385,385_41 Annabel Lee.txt
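
The filenames above embed the poem title as-is. Should a title ever contain characters that some file systems reject (for example `/` or `?`), a small sanitizing step could be added. The sketch below shows one possible approach. `safe_file_name` is a hypothetical helper that is not part of this notebook, and the set of replaced characters is an assumption based on common Windows restrictions.

```python
import re

# Characters that are commonly disallowed in filenames (assumed set).
UNSAFE_CHARS = re.compile(r'[<>:"/\\|?*]')


def safe_file_name(title: str, tokencount: int, linecount: int) -> str:
    """Build '<tokencount>_<linecount> <title>.txt' with unsafe characters replaced."""
    cleaned_title = UNSAFE_CHARS.sub("_", title)
    return f"{tokencount}_{linecount} {cleaned_title}.txt"


print(safe_file_name("To Helen", 710, 66))
print(safe_file_name("Who/What?", 10, 2))  # invented title, for illustration only
```

Titles without unsafe characters produce exactly the same filenames as the cell above.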


Lastly, we reorder the columns to meet the requirements of the assignment.

In [73]:
poem_df = poem_df[['filename', 'title', 'document', 'text', 'tokens', 'lemmas', 'parts-of-speech', 'author', 'linecount', 'tokencount']]
poem_df.head()

Unnamed: 0,filename,title,document,text,tokens,lemmas,parts-of-speech,author,linecount,tokencount
0,1489_113 The Raven.txt,The Raven,"Once upon a midnight dreary, while I pondered,...","Once upon a midnight dreary, while I pondered,...","[Once, upon, a, midnight, dreary, ,, while, I,...","[once, upon, a, midnight, dreary, ,, while, I,...","[ADV, SCONJ, DET, NOUN, NOUN, PUNCT, SCONJ, PR...",Edgar Allan Poe,113,1489
1,872_113 The Bells.txt,The Bells,Hear the sledges with the bells--\nSilver bell...,Hear the sledges with the bells — \nSilver bel...,"[Hear, the, sledges, with, the, bells, —, \n, ...","[hear, the, sledge, with, the, bell, —, \n, si...","[VERB, DET, NOUN, ADP, DET, NOUN, PUNCT, SPACE...",Edgar Allan Poe,113,872
2,898_94 Ulalume.txt,Ulalume,The skies they were ashen and sober;\n The le...,The skies they were ashen and sober;\n The le...,"[The, skies, they, were, ashen, and, sober, ;,...","[the, sky, they, be, ashen, and, sober, ;, \n ...","[DET, NOUN, PRON, AUX, ADJ, CCONJ, ADJ, PUNCT,...",Edgar Allan Poe,94,898
3,710_66 To Helen.txt,To Helen,I saw thee once--once only--years ago:\nI must...,I saw thee once — once only — years ago:\nI mu...,"[I, saw, thee, once, —, once, only, —, years, ...","[I, see, thee, once, —, once, only, —, year, a...","[PRON, VERB, PRON, ADV, PUNCT, ADV, ADV, PUNCT...",Edgar Allan Poe,66,710
4,385_41 Annabel Lee.txt,Annabel Lee,"It was many and many a year ago,\n In a kingd...","It was many and many a year ago,\n In a kingd...","[It, was, many, and, many, a, year, ago, ,, \n...","[it, be, many, and, many, a, year, ago, ,, \n ...","[PRON, AUX, ADJ, CCONJ, ADJ, DET, NOUN, ADV, P...",Edgar Allan Poe,41,385


And we save the dataframe as a CSV file.

In [74]:
poem_df.to_csv('corpus_data.csv', encoding='utf-8')
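
One thing to keep in mind when loading the CSV file again: list-valued columns such as `tokens`, `lemmas`, and `parts-of-speech` are written as their string representation, so they come back as plain strings. The sketch below demonstrates the effect with Python's built-in `csv` module and restores the list with `ast.literal_eval`. The single row is a shortened, hand-written sample.

```python
import ast
import csv
import io

# Shortened sample row with a list-valued column.
rows = [{"title": "The Raven", "tokens": ["Once", "upon", "a", "midnight"]}]

# Write the row to an in-memory CSV file; the list is stored via str().
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "tokens"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the tokens column is now a string, not a list.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_tokens = loaded[0]["tokens"]
print(type(raw_tokens).__name__, raw_tokens)

# Parse the string representation back into a Python list.
restored = ast.literal_eval(raw_tokens)
print(restored)
```

With pandas, the same conversion can be applied while reading, for example by passing `ast.literal_eval` for the relevant columns through the `converters` parameter of `pd.read_csv`.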