In [5]:
import pandas as pd
import numpy as np 

import requests
from bs4 import BeautifulSoup

from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

In this notebook, I'll use the [Gutenberg](https://pypi.org/project/Gutenberg/) package to download a book of ghost stories from Project Gutenberg. I'll then slice the text so that each story is its own string, and collect those strings in a dataframe.

In [6]:
text = strip_headers(load_etext(17893)).strip()

In [7]:
soup = BeautifulSoup(text, "html.parser")

In [8]:
soup

Transcriber's note:

   Two words in this text contain macrons over double ee. These
   are denoted in the text with [=ee].

   Superscripted text is denoted by the use of the following
   markings: 12^{mi} where "mi" is superscripted.

   A Transcriber's note at the end of the text lists the changes
   made in transcription.




The Modern Library of the World's Best Books

THE BEST GHOST STORIES

Introduction by Arthur B. Reeve







The Modern Library
Publishers New York
Copyright, 1919, by
Boni &amp; Liveright, Inc.
Manufactured in the United States of America by H. Wolff





CONTENTS

                                                                       PAGE

INTRODUCTION--"THE FASCINATION OF THE GHOST STORY"  _Arthur B. Reeve_   vii

THE APPARITION OF MRS. VEAL                           _Daniel De Foe_     3

CANON ALBERIC'S SCRAP-BOOK                     _Montague Rhodes James_   18

THE HAUNTED AND THE HAUNTERS                    _Edward Bulwer-Lytton_   31

THE SILENT WOMAN

In [9]:
soup_str = str(soup)

Okay, so now I have the whole book as one big string. It is plain text rather than real HTML; the only HTML traces are escapes such as `&amp;`, which BeautifulSoup introduced. I need to slice it down to only the text of the stories, and I want each story to be its own unit. In the following cells, I will split off the introduction and footnotes, and then slice up the text so that only the stories remain, each in its own `soup_<story identifier>` variable.
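As an aside, the split-then-strip pattern used in the following cells could be wrapped in a few small helpers. This is only a sketch under my own assumptions: `take_before`, `take_after`, `strip_header`, and the toy `sample` string are hypothetical names for illustration, not part of the notebook or of any library.

```python
def take_before(text, marker):
    """Return everything before the first occurrence of marker."""
    head, sep, _ = text.partition(marker)
    if not sep:
        raise ValueError(f"marker not found: {marker!r}")
    return head


def take_after(text, marker):
    """Return everything after the first occurrence of marker."""
    _, sep, tail = text.partition(marker)
    if not sep:
        raise ValueError(f"marker not found: {marker!r}")
    return tail


def strip_header(text, header):
    """Remove a title block from the start of text, failing loudly if it is absent."""
    if not text.startswith(header):
        raise ValueError(f"header not found at start: {header!r}")
    return text[len(header):]


# Toy stand-in for the book text (hypothetical, not the real Gutenberg file).
sample = (
    "FIRST STORY\n\nBY A\n\n\nOnce upon a time."
    "\n\n\n\n\n"
    "SECOND STORY\n\nBY B\n\n\nThe end."
)

first = strip_header(take_before(sample, "\n\n\n\n\nSECOND STORY"), "FIRST STORY\n\nBY A\n\n\n")
rest = take_after(sample, "Once upon a time.\n\n\n\n\n")
second = strip_header(rest, "SECOND STORY\n\nBY B\n\n\n")

print(first)
print(second)
```

The point of the explicit errors is that `str.split(marker)[0]` and `str.replace(header, "")` both return the text unchanged when the marker or header has a typo, so a mistake can slip through unnoticed.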

In [10]:
soup_str = soup_str.split("mocking figures, arms akimbo, defying all your science")[1].replace(" to crush\nthe ghost story.\n\n\n\n\nBEST GHOST STORIES\n\n\n\n\n", '')

In [11]:
soup_str = soup_str.rsplit("\n\n\n       *       *       *       *       *\n\nFOOTNOTE:\n\n[1] Transcriber\'s Note: The original is missing text")[0]

In [12]:
soup_de_foe = soup_str.split("\n\n\nTO THE READER\n\n")[0]

In [13]:
soup_de_foe = soup_de_foe.replace("THE APPARITION OF MRS. VEAL\n\nBY DANIEL DE FOE\n\n\nTHE PREFACE\n\n", "")

In [113]:
soup_de_foe

"This relation is matter of fact, and attended with such circumstances,\nas may induce any reasonable man to believe it. It was sent by a\ngentleman, a justice of peace, at Maidstone, in Kent, and a very\nintelligent person, to his friend in London, as it is here worded; which\ndiscourse is attested by a very sober and understanding gentlewoman, a\nkinswoman of the said gentleman's, who lives in Canterbury, within a few\ndoors of the house in which the within-named Mrs. Bargrave lives; who\nbelieves his kinswoman to be of so discerning a spirit, as not to be put\nupon by any fallacy; and who positively assured him that the whole\nmatter, as it is related and laid down, is really true; and what she\nherself had in the same words, as near as may be, from Mrs. Bargrave's\nown mouth, who, she knows, had no reason to invent and publish such a\nstory, or any design to forge and tell a lie, being a woman of much\nhonesty and virtue, and her whole life a course, as it were, of piety.\nThe use 

In [14]:
soup_str = soup_str.split("\n\n\nTO THE READER\n\n")[1]

In [15]:
soup_str = soup_str.split("Walter Scott, Bart., vol. iv. p. 305, ed. 1827.]\n\n\n\n\n")[1]

In [16]:
soup_mr_james = soup_str.split("\n\n       *       *       *       *       *\n\nThe book is in the Wentworth Collection at Cambridge. The drawing was\nphotographed and then burnt by Dennistoun on the day")[0]

In [17]:
soup_mr_james = soup_mr_james.replace("CANON ALBERIC\'S SCRAP-BOOK\n\nBY MONTAGUE RHODES JAMES\n\n\n", "")

In [18]:
soup_str = soup_str.split("\n\n       *       *       *       *       *\n\nThe book is in the Wentworth Collection at Cambridge. The drawing was\nphotographed and then burnt by Dennistoun on the day")[1]

In [19]:
soup_str = soup_str.split("Sammarthani.\n\n\n\n\n")[1]

In [20]:
soup_lytton = soup_str.split("\n\n\n\n\nTHE SILENT")[0]

In [21]:
soup_lytton = soup_lytton.replace("THE HAUNTED AND THE HAUNTERS\n\nOR,\n\nTHE HOUSE AND THE BRAIN\n\nBY EDWARD BULWER-LYTTON\n\n\n", "")

In [22]:
soup_str = soup_str.split("his\ntenant has made no complaints.\n\n\n\n\n")[1]

In [23]:
soup_kompert = soup_str.split("\n\nFOOTNOTE:\n\n[D] Copyright, 1890, by Harper Bros.\n\n\n\n\n")[0]

In [24]:
soup_kompert = soup_kompert.replace("THE SILENT WOMAN[D]\n\nBY LEOPOLD KOMPERT\n\n\n", "")

In [25]:
soup_str = soup_str.split("\n\nFOOTNOTE:\n\n[D] Copyright, 1890, by Harper Bros.\n\n\n\n\n")[1]

In [26]:
soup_banshee = soup_str.split('\n\nFOOTNOTES:\n\n[E] From "True Irish Ghost Stories."\n\n[F] Scott\'s _Lady of the Lake_, notes to Canto III (edition of 1811).\n\n[G] A.G. Bradley, _Notes on some Irish Superstitions_, p. 9.\n\n[H] _Occult Review_ for September, 1913.\n\n\n\n\n')[0]

In [27]:
soup_str = soup_str.split('\n\nFOOTNOTES:\n\n[E] From "True Irish Ghost Stories."\n\n[F] Scott\'s _Lady of the Lake_, notes to Canto III (edition of 1811).\n\n[G] A.G. Bradley, _Notes on some Irish Superstitions_, p. 9.\n\n[H] _Occult Review_ for September, 1913.\n\n\n\n\n')[1]

In [28]:
soup_benson = soup_str.split("\n\n\n\n\nTHE WOMAN\'S GHOST STORY[I]\n\nBY ALGERNON BLACKWOOD\n\n")[0]

In [29]:
soup_benson = soup_benson.replace("THE MAN WHO WENT TOO FAR\n\nBY E.F. BENSON\n\n\n", "")

In [30]:
soup_str = soup_str.split("hoofs of some\nmonstrous goat that had leaped and stamped upon him.\n\n\n\n\n")[1]

In [31]:
soup_blackwood = soup_str.split('\n\nFOOTNOTE:\n\n[I] Taken by permission from "The Listener and Other Stories,"--E.P.\nDutton &amp; Co.\n\n\n\n\n')[0]

In [32]:
soup_blackwood = soup_blackwood.replace("THE WOMAN\'S GHOST STORY[I]\n\nBY ALGERNON BLACKWOOD\n\n", "")

In [33]:
soup_str = soup_str.split('\n\nFOOTNOTE:\n\n[I] Taken by permission from "The Listener and Other Stories,"--E.P.\nDutton &amp; Co.\n\n\n\n\n')[1]

In [34]:
soup_kipling = soup_str.split("\n\n\n\n\nTHE RIVAL GHOSTS\n\nBY BRANDER MATTHEWS\n\n\n")[0]

In [35]:
soup_kipling = soup_kipling.replace("THE PHANTOM \'RICKSHAW\n\nBY RUDYARD KIPLING\n\n          ", "")

In [36]:
soup_str = soup_str.split("And the last portion of my punishment is\neven now upon me.\n\n\n\n\n")[1]

In [37]:
soup_matthews = soup_str.split("\n\n\n\n\nTHE DAMNED THING\n\nBY")[0]

In [38]:
soup_matthews = soup_matthews.replace("THE RIVAL GHOSTS\n\nBY BRANDER MATTHEWS\n\n\n", "")

In [39]:
soup_str = soup_str.split("broken harshly by the hoarse roar of the\nfog-horn.\n\n\n\n\n")[1]

In [40]:
soup_bierce = soup_str.split("\n\n\n\n\nTHE INTERVAL[J]\n\nBY VINCENT O")[0]

In [41]:
soup_bierce = soup_bierce.replace("THE DAMNED THING\n\nBY AMBROSE BIERCE\n\n\nI\n\n", "")

In [42]:
soup_str = soup_str.split('the Damned Thing is of such a color!"\n\n\n\n\n')[1]

In [43]:
soup_sullivan = soup_str.split("\n\nFOOTNOTE:\n\n[J] Copyright, 1917, by The Boston Transcript Co. Copyright, 1918, by\nVincent O\'Sullivan.\n\n\n\n\n")[0]

In [44]:
soup_sullivan = soup_sullivan.replace("THE INTERVAL[J]\n\nBY VINCENT O\'SULLIVAN\n\n", "")

In [45]:
soup_str = soup_str.split("\n\nFOOTNOTE:\n\n[J] Copyright, 1917, by The Boston Transcript Co. Copyright, 1918, by\nVincent O\'Sullivan.\n\n\n\n\n")[1]

The following story in the collection was written by a white man, Ellis Parker Butler, in an extremely offensive "dialect" meant to mimic Black people. I don't want my model to be racist, so I'm leaving this story out: the next two cells split it off into `soup_NOPE`, which never gets added to the story list.

In [46]:
soup_NOPE = soup_str.split("\n\nFOOTNOTE:\n\n[K] Copyright, 1913, by The Century Company.\n\n\n\n\n")[0]

In [47]:
soup_str = soup_str.split("\n\nFOOTNOTE:\n\n[K] Copyright, 1913, by The Century Company.\n\n\n\n\n")[1]

The following story, "Some Real American Ghosts," is a grouping of "true" ghost stories from newspapers in the late 19th/early 20th century, so I'm going to split that up from one aggregate story into short individual stories.
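For comparison, splitting an aggregate like this on its headings could in principle be automated with a regular expression. The sketch below is an assumption-laden illustration on toy text, not what the cells in this notebook do: `HEADING`, `DATELINE`, `split_by_headings`, and `toy` are hypothetical names, and the pattern assumes every heading is an all-caps line preceded by three newlines and followed by a blank line. The real text has sub-headings and datelines that do not all follow that shape, so any result on it would need checking by hand.

```python
import re

# An all-caps heading line at the start of the text or after three newlines,
# followed by a blank line. The heading itself is captured.
HEADING = re.compile(r"(?:^|\n\n\n)([A-Z][A-Z .'-]+)\n\n")

# An optional parenthesised newspaper dateline at the start of a story body.
DATELINE = re.compile(r"^\([^)]*\)\n+")


def split_by_headings(text):
    """Map each heading to the body text that follows it."""
    parts = HEADING.split(text)
    # parts[0] is whatever precedes the first heading; after that,
    # headings and bodies alternate.
    titles, bodies = parts[1::2], parts[2::2]
    return {title: DATELINE.sub("", body).strip() for title, body in zip(titles, bodies)}


# Toy stand-in for the aggregate section (hypothetical text).
toy = (
    "THE GIANT GHOST\n\n(Some _Paper_, 1896)\n\n\nA giant was seen."
    "\n\n\nA GENUINE GHOST\n\n(Some _Paper_, 1884)\n\nIt walked at night."
    "\n\n\nTHE BAGGAGEMAN'S GHOST\n\nThe baggageman spoke."
)

short_stories = split_by_headings(toy)
for title, body in short_stories.items():
    print(title, "->", body)
```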

In [48]:
soup_str = soup_str.replace("SOME REAL AMERICAN GHOSTS\n\n", "")

In [49]:
soup_giant = soup_str.split("\n\n\nSOME FAMOUS GHOSTS OF THE NATIONAL CAPITOL")[0]

In [50]:
soup_giant = soup_giant.replace("THE GIANT GHOST\n\n(Philadelphia _Press_, Sept. 13, 1896)\n\n\n", "")

In [51]:
soup_str = soup_str.split("ghost without being\nable to discover any satisfactory explanation.\n\n\n")[1]

In [52]:
soup_national = soup_str.split("\n\n\nA GENUINE GHOST\n\n(Phil")[0]

In [53]:
soup_national = soup_national.replace("SOME FAMOUS GHOSTS OF THE NATIONAL CAPITOL\n\n(Philadelphia _Press_, Oct. 2, 1898)\n\n", "")

In [54]:
soup_str = soup_str.split("committee, and, if report be credited, he\nis still supervising its duties.\n\n\n")[1]

In [55]:
soup_genuine = soup_str.split("\n\n\nTHE BAGGAGEMAN\'S GHOST\n\n")[0]

In [56]:
soup_genuine = soup_genuine.replace("A GENUINE GHOST\n\n(Philadelphia _Press_, March 25, 1884)\n\nDAYTON, O., March 25.--", "")

In [57]:
soup_str = soup_str.split("with the head inclined forward and hands clasped behind.\n\n\n")[1]

In [58]:
soup_baggage = soup_str.split("\n\n\nDRUMMERS SEE A SPECTER\n\n(St Louis")[0]

In [59]:
soup_baggage = soup_baggage.replace("THE BAGGAGEMAN\'S GHOST\n\n", "")

In [60]:
soup_str = soup_str.split('rather than\ngo into the morgue again."\n\n\n')[1]

In [61]:
soup_drummer = soup_str.split("\n\n\nDR. FUNK SEES THE SPIRIT OF BEECHER")[0]

In [62]:
soup_drummer = soup_drummer.replace('DRUMMERS SEE A SPECTER\n\n(St Louis _Globe-Democrat_, Oct. 6, 1887)\n\n[The last man in the world to be accused of a belief in the supernatural\nwould be your go-ahead, hard-headed American "drummer" or traveling-man.\nYet here is a plain tale of how not one but two of the western\nfraternity saw a genuine ghost in broad daylight a few years ago.--ED.]\n\nJACKSON, MO., October 6. ', '')

In [63]:
soup_str = soup_str.split("\nisolated from the beautiful country which surrounds it.\n\n\n")[1]

In [64]:
soup_funk = soup_str.split("\n\n\nMYSTERY OF THE COINS\n\nDr")[0]

In [65]:
soup_funk = soup_funk.replace("DR. FUNK SEES THE SPIRIT OF BEECHER\n\n(New York _Herald_, April 4, 1903)\n\n", "")

In [66]:
soup_str = soup_str.split('to\nthe floor and fade away."\n\n\n')[1]

In [67]:
soup_funk_2 = soup_str.split('\n\n\nMR. BEECHER APPEASED\n\n"When')[0]

In [68]:
soup_funk_2 = soup_funk_2.replace("MYSTERY OF THE COINS\n\n", "")

In [69]:
soup_str = soup_str.split("that does not\nappear to concern the spirit of Mr. Beecher.\n\n\n")[1]

In [70]:
soup_funk_3 = soup_str.split("\n\n\nMARYLAND GHOSTS\n\n(_Baltimore American_, May")[0]

In [71]:
soup_funk_3 = soup_funk_3.replace("MR. BEECHER APPEASED\n\n", "")

In [72]:
soup_funk = ' '.join([soup_funk, soup_funk_2, soup_funk_3])

In [73]:
soup_str = soup_str.split('But none of them knew any more about the coin being in my\nsafe than I did."')[1]

In [74]:
soup_maryland = soup_str.split("\n\n\nTHE GHOST OF PEG ALLEY\'S POINT")[0]

In [75]:
soup_maryland = soup_maryland.replace("\n\n\nMARYLAND GHOSTS\n\n(_Baltimore American_, May, 1886)\n\n", "")

In [76]:
soup_str = soup_str.split("attest it, and fully corroborate each other, but without\nbeing able to suggest the slightest explanation.\n\n\n")[1]

In [77]:
soup_peg = soup_str.split("\n\n\nAN APPARITION AND DEATH\n\nThe old ")[0]

In [78]:
soup_str = soup_str.split(" for which, indeed, there was not sufficient time.\n\n\n")[1]

In [79]:
soup_apparition = soup_str.split("\n\n\nAN IDIOT GHOST WITH BRASS BUTTONS\n\n")[0]

In [80]:
soup_str = soup_str.split("her apparition and the time of her death coincided.\n\n\n")[1]

In [81]:
soup_idiot = soup_str.split("\n\n\nA MODEL GHOST STORY\n\n(Boston _Courier_,")[0]

In [82]:
soup_idiot = soup_idiot.replace("AN IDIOT GHOST WITH BRASS BUTTONS\n\n(Philadelphia _Press_, June 16, 1889)\n\n", "")

In [83]:
soup_str = soup_str.split("affair kept quiet, but\nthe captain left the house.\n\n\n")[1]

In [84]:
soup_model = soup_str.split("\n\n\nA GHOST THAT WILL NOT DOWN\n\n(Cincinnati")[0]

In [85]:
soup_model = soup_model.replace("A MODEL GHOST STORY\n\n(Boston _Courier_, Aug. 10)\n\n", "")

In [86]:
soup_str = soup_str.split("hideous expression is not very pleasant to\nlook upon.\n\n\n")[1]

In [87]:
soup_down = soup_str.split("\n\n\nTOM CYPHER\'S PHANTOM ENGINE\n\n(Seattle")[0]

In [88]:
soup_down = soup_down.replace("A GHOST THAT WILL NOT DOWN\n\n(Cincinnati _Enquirer_, Sept. 30, 1884)\n\nGRANTSVILLE, W. VA., September 30.--", "")

In [89]:
soup_str = soup_str.split("county possession which it will gladly\ndispose of at any price.\n\n\n")[1]

In [90]:
soup_cypher = soup_str.split("\n\n\nGHOSTS IN CONNECTICUT\n\n(N.Y. _Sun_, Sept.")[0]

In [91]:
soup_cypher = soup_cypher.replace("TOM CYPHER\'S PHANTOM ENGINE\n\n(Seattle _Press-Times_, Jan. 10, 1892)\n\n", "")

In [92]:
soup_str = soup_str.split("Thomas Cypher\'s spirit still hovers near Eagle gorge.\n\n\n")[1]

In [93]:
soup_ct = soup_str.split("\n\n\nTHE SPOOK OF DIAMOND ISLAND\n\n(St. Louis")[0]

In [94]:
soup_ct = soup_ct.replace("GHOSTS IN CONNECTICUT\n\n(N.Y. _Sun_, Sept. 1, 1885)\n\n", "")

In [95]:
soup_str = soup_str.split("may be a will contested in\nMiddletown one of these days.\n\n\n")[1]

In [96]:
soup_diamond = soup_str.split("\n\n\nTHE GHOST\'S FULL HOUSE\n\n(N.Y.")[0]

In [97]:
soup_diamond = soup_diamond.replace("THE SPOOK OF DIAMOND ISLAND\n\n(St. Louis _Globe-Democrat_, Sept. 18, 1888)\n\nHARDEN, Ill., Sept. 18.--", "")

In [98]:
soup_str = soup_str.split("crimson object is believed to be\nthe restless spirit of the slain man.\n\n\n")[1]

In [99]:
soup_full = soup_str

In [100]:
soup_full = soup_full.replace("THE GHOST\'S FULL HOUSE\n\n(N.Y. _Sun_, April 10, 1891)\n\n", "")

Okay! Wow, that took a while, but that's everything! The text of each story is in its own object. Now I'll make a list of those objects, turn that list into a dataframe with a column called 'text,' and save that dataframe as a csv file so I can use it in my cleaning notebook.
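With this many hand-written slices, a quick sanity check before building the dataframe can catch an empty story or a title block that never got stripped. The following is a minimal sketch under stated assumptions: `check_stories`, its `min_chars` threshold, and the toy `named` dictionary are all hypothetical, and the all-caps test is only a heuristic, so it would also flag a story that legitimately opens with an all-caps line.

```python
def check_stories(named_stories, min_chars=20):
    """Return (name, problem) pairs for stories that look suspicious."""
    problems = []
    for name, text in named_stories.items():
        if len(text.strip()) < min_chars:
            problems.append((name, "too short"))
        # A first line in all capitals usually means a title was left in place.
        first_line = text.lstrip().split("\n", 1)[0]
        if first_line.isupper():
            problems.append((name, "possible leftover heading"))
    return problems


# Toy stand-ins for the soup_* strings (hypothetical text).
named = {
    "ok": "This relation is matter of fact, and attended with such circumstances.",
    "leftover_header": "THE GHOST OF SOMEWHERE\n\nThe old house stood empty for years.",
    "empty": "",
}

problems = check_stories(named)
for name, problem in problems:
    print(f"{name}: {problem}")
```

A check like this has to run on the strings before their newlines are replaced with spaces, since it relies on the heading sitting on its own line.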

In [122]:
story_list = [soup_de_foe, soup_mr_james, soup_lytton, soup_kompert, soup_banshee,
             soup_benson, soup_blackwood, soup_kipling, soup_matthews, soup_bierce,
             soup_sullivan, soup_giant, soup_national, soup_genuine, soup_baggage,
             soup_drummer, soup_funk, soup_maryland, soup_peg, soup_apparition,
             soup_idiot, soup_model, soup_down, soup_cypher, soup_ct, soup_diamond, soup_full]

In [123]:
story_list = [story.replace("\n", " ") for story in story_list]

In [124]:
type(story_list[0])

str

This next bit of code is adapted from a helpful response to a [Stack Overflow post](https://stackoverflow.com/questions/36039919/beautifulsoup-output-to-txt-file). It writes my list of stories out to a single text file, `out.txt`, which can be used to train models.

In [129]:
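# Write every story to one file; explicit UTF-8 keeps non-ASCII characters from raising on write.
with open("out.txt", "w", encoding="utf-8") as out:
    for story in story_list:
        # Newlines inside each story were replaced with spaces above,
        # so "\n" cleanly separates one story from the next.
        out.write(story + "\n")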
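A single text file is most useful downstream if the stories can be told apart again. Here is a minimal round-trip sketch, assuming one story per line, which works because the newlines inside each story have already been replaced with spaces. The `toy_stories` list and the temporary `out_demo.txt` path are hypothetical, chosen so that the sketch does not touch the real `out.txt`.

```python
import os
import tempfile

# Toy stand-ins for story_list entries (hypothetical text, newlines already removed).
toy_stories = ["First story text.", "Second story text.", "Third story text."]

# Write to a temporary path so the sketch does not overwrite the real out.txt.
path = os.path.join(tempfile.mkdtemp(), "out_demo.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("\n".join(toy_stories))

# Read the file back and split on the separator to recover the individual stories,
# dropping any empty piece left by a trailing newline.
with open(path, encoding="utf-8") as f:
    recovered = [story for story in f.read().split("\n") if story]

print(recovered == toy_stories)
```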

In [125]:
stories = pd.DataFrame(data = story_list)

In [126]:
stories[0].loc[0]

"This relation is matter of fact, and attended with such circumstances, as may induce any reasonable man to believe it. It was sent by a gentleman, a justice of peace, at Maidstone, in Kent, and a very intelligent person, to his friend in London, as it is here worded; which discourse is attested by a very sober and understanding gentlewoman, a kinswoman of the said gentleman's, who lives in Canterbury, within a few doors of the house in which the within-named Mrs. Bargrave lives; who believes his kinswoman to be of so discerning a spirit, as not to be put upon by any fallacy; and who positively assured him that the whole matter, as it is related and laid down, is really true; and what she herself had in the same words, as near as may be, from Mrs. Bargrave's own mouth, who, she knows, had no reason to invent and publish such a story, or any design to forge and tell a lie, being a woman of much honesty and virtue, and her whole life a course, as it were, of piety. The use which we ought

In [127]:
stories = stories.rename(columns = {0 : 'text'})

In [128]:
stories.to_csv('stories.csv', index = False)