Dataset: https://www.kaggle.com/mylesoneill/classic-literature-in-ascii/activity
This dataset contains complete works of literature, stored as plain-text files.

This notebook parses these works and stores them in a usable form, such as a pandas DataFrame.

#### Imports

In [1]:
import pandas as pd
import numpy as np
import os
import collections
import nltk
from IPython.display import display

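Retrieving each book file from its directory and extracting its contents. When a file contains the marker `ETEXTS*Ver.04.29.93*END*`, everything up to and including that marker is discarded.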

In [2]:
# setting up directories
cur_path = os.getcwd()
arch = os.path.join(cur_path, "archive")
auths = os.path.join(arch, "AUTHORS", "AUTHORS")
fict = os.path.join(arch, "FICTION", "FICTION")
nonfict = os.path.join(arch, "NONFICTION", "NONFICTION")

works = {}
intro_str = "ETEXTS*Ver.04.29.93*END*"

def read_work(path):
    """Read one book and drop everything up to and including intro_str."""
    # the context manager closes the file handle after reading
    with open(path, mode='r', encoding='cp1252') as b_file:
        content = b_file.read()
    if intro_str in content:
        start_ind = content.index(intro_str)
        content = content[(start_ind + len(intro_str)):]
    return content

def add_works(directory):
    """Add every book in directory to works, keyed by file name."""
    for book in os.listdir(directory):
        if book == '.DS_Store':
            continue
        # full paths are used, so the working directory never changes
        works[book] = read_work(os.path.join(directory, book))

# extracting works by author (one subdirectory per author)
for auth in os.listdir(auths):
    if auth == '.DS_Store':
        continue
    add_works(os.path.join(auths, auth))

# adding fiction and nonfiction works
add_works(fict)
add_works(nonfict)
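
To see what the header-stripping step does in isolation, here is a minimal sketch on a made-up string. The helper name `strip_header` and the sample text are hypothetical and introduced only for illustration; the marker itself is the one used in the cell above.

```python
# Marker copied from the notebook; everything up to and including it is dropped.
INTRO_STR = "ETEXTS*Ver.04.29.93*END*"

def strip_header(text, marker=INTRO_STR):
    """Return the text after the first occurrence of marker, or text unchanged."""
    # str.partition splits on the first occurrence only;
    # sep is empty when the marker is absent
    head, sep, tail = text.partition(marker)
    return tail if sep else text

# Hypothetical file contents, not taken from the dataset
sample = "Some licence preamble\n" + INTRO_STR + "\nCHAPTER I\nIt was a dark night."

print(repr(strip_header(sample)))           # '\nCHAPTER I\nIt was a dark night.'
print(repr(strip_header("no marker here")))  # 'no marker here'
```

Files without the marker pass through untouched, which matches the `if intro_str in content` guard in the cell.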

Creating a dictionary that maps each year to the file names of the books written in that year. The year is a rough estimate, parsed from the numbers that appear among the first 250 tokens of each text.

In [3]:
books_and_years = collections.defaultdict(list)
for book in works.keys():
    words_in_book = nltk.tokenize.word_tokenize(works[book])
    # isdecimal() only accepts characters that int() can parse;
    # isnumeric() also matches characters such as '½', which make int() raise
    first_nums_in_book = [int(w) for w in words_in_book[:250] if w.isdecimal()]
    if len(first_nums_in_book) == 0:
        continue
    if 2022 > max(first_nums_in_book) > 1000:
        # four-digit candidates present: take the earliest one
        year = min([x for x in first_nums_in_book if x > 1000])
    elif 50 <= max(first_nums_in_book) < 1000:
        # only small numbers: treat the largest as a year before 1000
        year = max(first_nums_in_book)
    else:
        # nothing that looks like a year, so the book is left out
        continue
    books_and_years[year].append(book)
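
The branches of the year heuristic are easier to follow when pulled out into a small function. The following is a sketch only: `guess_year` and its `upper` parameter are names introduced here and are not part of the notebook, and the input lists are invented to exercise each branch.

```python
def guess_year(numbers, upper=2022):
    """Mirror the notebook's heuristic on a list of integers.

    Returns the guessed year, or None when no number looks like a year.
    """
    if not numbers:
        return None
    largest = max(numbers)
    if 1000 < largest < upper:
        # earliest four-digit candidate wins
        return min(n for n in numbers if n > 1000)
    if 50 <= largest < 1000:
        # only small numbers: the largest is taken as the year
        return largest
    return None

print(guess_year([12, 1838, 1845]))  # 1838
print(guess_year([3, 170]))          # 170
print(guess_year([7, 12]))           # None
print(guess_year([1850, 2050]))      # None, because the maximum is not below 2022
```

The last case shows a side effect of the rule: a single out-of-range number among the first tokens is enough to exclude a book, even when a plausible year is also present.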

Collecting the unique words and unique sentences for each year, pooled across all books assigned to that year.

In [4]:
years_with_words = {}
years_with_sentences = {}
for year in books_and_years.keys():
    # sets deduplicate as they grow, across all books of the year
    unique_words_in_year = set()
    unique_sents_in_year = set()
    for book in books_and_years[year]:
        words_in_book = nltk.tokenize.word_tokenize(works[book])
        # replace line breaks so each sentence sits on a single line
        sentences_in_book = [s.replace('\n', ' ') for s in nltk.tokenize.sent_tokenize(works[book])]
        unique_words_in_year.update(words_in_book)
        unique_sents_in_year.update(sentences_in_book)
    years_with_words[year] = list(unique_words_in_year)
    years_with_sentences[year] = list(unique_sents_in_year)
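
A toy run makes the result of this pooling concrete. In the sketch below the two "books" are invented, and `str.split()` stands in for `nltk.tokenize.word_tokenize` so the example runs without NLTK data; the variable names ending in `toy_` are hypothetical.

```python
# Invented stand-ins for the notebook's `works` and `books_and_years`
toy_works = {"a.txt": "Let it be", "b.txt": "let it go"}
toy_books_and_years = {1850: ["a.txt", "b.txt"]}

toy_years_with_words = {}
for year, books in toy_books_and_years.items():
    vocab = set()
    for book in books:
        # str.split() is a simplification of word_tokenize
        vocab.update(toy_works[book].split())
    toy_years_with_words[year] = vocab

print(sorted(toy_years_with_words[1850]))  # ['Let', 'be', 'go', 'it', 'let']
```

The word `it` appears once although both books contain it. `Let` and `let` are kept apart because the comparison is case-sensitive, which is also why capitalized tokens show up as separate entries in the notebook's output.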

Converting the dictionary of words to a pandas DataFrame with one row per year. Rows for years with fewer unique words are padded with missing values.

In [5]:
words_with_years = pd.DataFrame.from_dict(years_with_words, orient='index')
display(words_with_years)

year,0,1,2,3,4,5,6,7,8,9,...,38586,38587,38588,38589,38590,38591,38592,38593,38594,38595
1838,shining,pale,felt,--,expansion,sole,Let,animation,distributed,sidewalk,...,,,,,,,,,,
1895,pale,felt,--,feudal,tyrannic,dint,unlikely,exult,trained,Foul,...,,,,,,,,,,
1850,pale,felt,petted,--,feudal,Dashers,orb,half-fancy,dilation,unhappiness-,...,,,,,,,,,,
1837,slopes,vermiform,pale,felt,--,warm-blooded,admitting,slowness,entomologists,courageous,...,,,,,,,,,,
1831,dripping,pale,'Neath,rustle,Let,Wrapping,Stygian,this,Earth,hid,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170,Seeing,sects,cava,Hippocratic,Alone,distributed,expansion,definitely,Let,regains,...,,,,,,,,,,
1713,Seeing,mite,pursuit,sects,felt,--,sole,imposition,Let,ensnared,...,,,,,,,,,,
1955,lusted,pale,felt,--,654,scholars,unmanly,raise,friendship,sadness,...,,,,,,,,,,
101,pale,felt,scholars,admitting,courageous,raise,friendship,sadness,trained,deliberations,...,,,,,,,,,,
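
The trailing empty cells in the output come from how `pd.DataFrame.from_dict(..., orient='index')` handles lists of unequal length: the table is as wide as the longest list, and shorter rows are filled with missing values. A minimal sketch with invented vocabularies:

```python
import pandas as pd

# Hypothetical per-year vocabularies of unequal length
toy_words = {1838: ["pale", "felt", "sole"], 1713: ["mite"]}

toy_df = pd.DataFrame.from_dict(toy_words, orient="index")

print(toy_df.shape)                    # (2, 3): width is set by the longest list
print(toy_df.loc[1713].isna().sum())   # 2 padded cells in the shorter row
```

With real data, one year with a very large vocabulary determines the column count for every other year, so most of the table is padding.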


Serializing DataFrame to file 'words.pkl' for use in later files.

In [6]:
os.chdir(cur_path)
words_with_years.to_pickle('./words.pkl')
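
A later file can load the table back with `pd.read_pickle`. The sketch below round-trips a small invented DataFrame through a temporary directory, so it does not touch the real `words.pkl`; the data and variable names are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Small invented table in the same shape as words_with_years
small = pd.DataFrame.from_dict({1838: ["pale", "felt"], 1713: ["mite"]}, orient="index")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words.pkl")
    small.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() treats missing values in the same position as equal
print(restored.equals(small))
```

Pickle preserves the index, the column labels, and the missing values exactly, which a CSV round trip would not guarantee.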

Converting the books-and-years dictionary to a DataFrame so that it can be serialized for future use.

In [7]:
years_and_books = pd.DataFrame.from_dict(books_and_years, orient='index')
display(years_and_books)

year,0,1,2,3,4,5,6,7,8,9,...,43,44,45,46,47,48,49,50,51,52
1838,poe-ligeia-452.txt,poe-predicament-421.txt,hawthorne-lady-472.txt,hawthorne-peter-479.txt,,,,,,,...,,,,,,,,,,
1895,poe-pit-110.txt,the_pit.txt,cask.txt,telltale.txt,b-p_plan.txt,chilc10.txt,scott-ivanhoe-159.txt,redbadge.txt,,,...,,,,,,,,,,
1850,poe-morning-559.txt,poe-landscape-689.txt,poe-thousand-708.txt,poe-metzengerstein-557.txt,poe-sphinx-705.txt,poe-three-710.txt,poe-man-691.txt,poe-bon-430.txt,poe-some-657.txt,poe-man-690.txt,...,poe-thou-416.txt,poe-elizabeth-438.txt,poe-loss-455.txt,poe-literary-454.txt,poe-four-443.txt,poe-x-726.txt,hawthorne-scarlet-63.txt,hawthorne-great-466.txt,hawthorne-snow-478.txt,burton-arabian-363.txt
1837,poe-silence-656.txt,poe-sonnet-661.txt,poe-bridal-431.txt,hawthorne-prophetic-476.txt,hawthorne-dr-468.txt,origin_species.txt,,,,,...,,,,,,,,,,
1831,poe-sleeper-703.txt,poe-israfel-448.txt,poe-city-673.txt,poe-to-715.txt,poe-valley-709.txt,poe-lenore-451.txt,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170,faculties.txt,,,,,,,,,,...,,,,,,,,,,
1713,berkeley-three-745.txt,,,,,,,,,,...,,,,,,,,,,
1955,augustine-confessions-276.txt,,,,,,,,,,...,,,,,,,,,,
101,epictetus-discourses-568.txt,,,,,,,,,,...,,,,,,,,,,


Serializing the years-and-books DataFrame to 'books.pkl'.

In [8]:
years_and_books.to_pickle('./books.pkl')

Converting the sentences dictionary to a DataFrame and serializing it to 'sentences.pkl'.

In [9]:
sentences_with_years = pd.DataFrame.from_dict(years_with_sentences, orient='index')
display(sentences_with_years)
sentences_with_years.to_pickle('./sentences.pkl')

year,0,1,2,3,4,5,6,7,8,9,...,56118,56119,56120,56121,56122,56123,56124,56125,56126,56127
1838,"""Turn him out of the house!""",--shall this Conqueror be not once conquered?,And in this expectation I was not at all decei...,Amazement now struggled in my bosom with the p...,"""Did not my great-granduncle, Peter Goldthwait...",What a host of gloomy recollections will ever ...,The latter article of dress was of great impor...,"""There is no use in rubbing it, Tabitha,"" said...","""And good cause have we to remember him,"" quot...",You are avenged- they are all avenged- Nature ...,...,,,,,,,,,,
1895,The youth leaned heavily upon his friend.,"-- here, here!","The man screamed: ""Let go me!",Prince John coloured as he put the quest...,"We came at length to the foot of the descent, ...","and look to it what thou wilt do!""",``Gods and fiends!'',Is not that the place where an object up...,Sometimes he inclined to believing them all he...,Sometimes he interjected anecdotes.,...,,,,,,,,,,
1850,"She answered: ""Allah Almighty vouchsafe to the...",Now it was this latter peculiarity in his disp...,"It is, indeed, the instinct given to man by Go...","observed the Baron, dryly, and at that instant...","For even as thou entreatedst me generously, wi...",LA BRUYERE.,"Thus, in less time than I have taken to tell i...","In looking about, I discovered the interesting...",And Kamar al-Akmar and his wife Shams al-Nahar...,We mean the abrupt employment of a direct pron...,...,,,,,,,,,,
1837,"As soon as this occurred, the bees ceased to e...",Insects often resemble for the sake of protect...,Modifications in hard parts and in external pa...,Secondary Sexual Characters Variable.- I think...,"Drift timber is thrown up on most islands, eve...","LARYNX, The upper part of the windpipe opening...",This line of argument seems to have had great ...,"In considering transitions of organs, it is so...","Almost every year, one or two land-birds are b...","cried Colonel Killigrew, whose eyes had been f...",...,,,,,,,,,,
1831,"Oh, may her sleep As it is lasting, so b...","Sure thou art come O'er far-off seas, A ...",Heaven have her in its sacred keep!,Nothing there is motionless- Nothing sa...,There open fanes and gaping graves Yawn ...,-THE END- .,"Ah, by no wind those clouds are driven ...",can it be right- This window open to the...,"On desperate seas long wont to roam, ...","on yon drear and rigid bier low lies thy love,...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170,"The fact, therefore, that Erasistratus, in reg...","For, if they do rebound, how then do they pass...","Now, since the three faculties of Nature have ...",*What we now call the pulmonary artery.,For this simple vessel [i.e.,the tissues.,For I see that the Erasistrateans are at varia...,"I think, then, it has been proved to the satis...","And after the muscles, pass to the physical or...","Has Erasistratus, then, not read the book, ""On...",...,,,,,,,,,,
1713,Though indeed I deny they have an existence di...,So far you are in the right.,"In short, by whatever method you distinguish <...",And yet you asserted that you could not concei...,It cannot be denied that we perceive such cert...,"The colours, therefore, by it perceived are mo...",You may draw as many absurd consequences as yo...,"I am not for changing things into ideas, but r...",All which makes the case of <Matter> widely di...,And what is conceived is surely in the mind?,...,,,,,,,,,,
1955,"Have pity, O Lord God, lest those who pass by...",Why are they not happy?,[633] 37.,"[297] But what is like to thy Word, our Lord...",CHAPTER VII 11.,Prov.,"He ""burns"" with grief, for the things he has l...","Now, surely, those who live in gross wickednes...","But what hope he cherished, what struggles he...","Oh, if thou wouldst slay them with thy two-ed...",...,,,,,,,,,,
101,have you anything better or greater to see tha...,I would willingly take a voyage for this purpo...,And that you may not think that I show you the...,CHAPTER 5 How magnanimity is consistent wit...,for having acted conformably to nature?,For philosophers say we allow none to be free ...,"Observe, this is the beginning of philosophy, ...","Wretch, are you so blind, and don't you see th...","To live secure, to be happy, to do everything ...",And does not Antisthenes say so?,...,,,,,,,,,,
