# Scraping a corpus of H.P. Lovecraft poems to be used as a distractor

The raw data is taken from https://www.hplovecraft.com/.

In [15]:
# Importing modules
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re

### Step 1
For the first step, the archive index page was scraped to collect the title and link of each of Lovecraft's works; the results were then limited to poetry.

In [2]:
# Scraping and structuring the source code.
hp_archive = requests.get("https://www.hplovecraft.com/writings/texts/").text
hp_archive = BeautifulSoup(hp_archive, "html.parser")
hp_archive

<!-- Google tag (gtag.js) -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-V0FJZGS2EZ"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'G-V0FJZGS2EZ');
</script>
<style>

@font-face
   {
   font-family: 'Packard';
   font-style: normal;
   font-weight: 400;
   src: url('https://www.hplovecraft.com/includes/packard-antique-regular.ttf');
   }

</style>
<html>
<head>
<link href="https://fonts.googleapis.com/css?family=Cardo:400,700,400italic&amp;subset=latin,greek-ext,greek,latin-ext" rel="stylesheet" type="text/css"/>
<title>Electronic Texts of H.P. Lovecraft's Works</title>
<meta content="Electronic Texts of H.P. Lovecraft's Works" name="title"/>
<meta content="IE=edge;" http-equiv="X-UA-Compatible"/>
<meta content="Electronic Texts of H.P. Lovecraft's Works" name="description"/>
<meta content="Electronic Texts of H.P. Lovecraft's Works" name="keywords"/>
<li

In [3]:
# Limiting the results to only poetry
hp_archive_all = hp_archive.find_all("a")
hp_poetry = []
for work in hp_archive_all:
    if "poetry" in str(work):
        hp_poetry.append(work)
hp_poetry = hp_poetry[2:] # removing the first two entries, since they are not poems.
hp_poetry[:5]

[<a href="poetry/p061.aspx">An American to Mother England</a>,
 <a href="poetry/p286.aspx">The Ancient Track</a>,
 <a href="poetry/p337.aspx">Arcadia</a>,
 <a href="poetry/p122.aspx">Astrophobos</a>,
 <a href="poetry/p253.aspx">The Cats</a>]
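
The substring test above matches any anchor whose HTML contains the word "poetry", which is why two non-poem entries have to be sliced off afterwards. A stricter alternative is sketched below. It assumes that every poem link follows the `poetry/pNNN.aspx` pattern seen in the output above; the sample HTML is invented for illustration and is not taken from the site.

```python
import re
from bs4 import BeautifulSoup

# Invented sample markup: one section link, one poem, one unrelated work.
sample_html = """
<a href="poetry/">Poetry</a>
<a href="poetry/p337.aspx">Arcadia</a>
<a href="fiction/x.aspx">A story that mentions poetry</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Assumed pattern for poem pages, inferred from the hrefs printed above.
poem_href = re.compile(r"poetry/p\d+\.aspx")

# Keep only anchors whose href matches the pattern in full.
poem_links = [
    a for a in soup.find_all("a", href=True)
    if poem_href.fullmatch(a["href"])
]
print([a.text for a in poem_links])
```

If the assumption about the link pattern holds, this kind of filter would make the `[2:]` slice unnecessary, but that has not been checked against the live index page.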

### Step 2
Next, the title of each poem was added to a new list.

In [4]:
# Getting a list of only the titles
hp_poetry_titles = [poem.text for poem in hp_poetry]
hp_poetry_titles[:5]

['An American to Mother England',
 'The Ancient Track',
 'Arcadia',
 'Astrophobos',
 'The Cats']

### Step 3
The same was done for the hyperlinks: each relative `href` was appended to the archive's base URL to give the full address of the poem.

In [5]:
# Getting the hyperlink to each poem
hp_poetry_links = []
for poem in hp_poetry:
    link = "https://www.hplovecraft.com/writings/texts/" + poem["href"]
    hp_poetry_links.append(link)
hp_poetry_links[:5]

['https://www.hplovecraft.com/writings/texts/poetry/p061.aspx',
 'https://www.hplovecraft.com/writings/texts/poetry/p286.aspx',
 'https://www.hplovecraft.com/writings/texts/poetry/p337.aspx',
 'https://www.hplovecraft.com/writings/texts/poetry/p122.aspx',
 'https://www.hplovecraft.com/writings/texts/poetry/p253.aspx']
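
String concatenation works here because every `href` is relative to the index page. As a slightly more defensive sketch, the standard library's `urljoin` resolves a relative link against a base URL; the names `BASE_URL` and `relative_links` are illustrative and do not appear in the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.hplovecraft.com/writings/texts/"

# Two of the relative hrefs shown in the Step 1 output.
relative_links = ["poetry/p061.aspx", "poetry/p286.aspx"]

# urljoin resolves each href against the base URL.
absolute_links = [urljoin(BASE_URL, href) for href in relative_links]
print(absolute_links)
```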

### Step 4
Using the list of hyperlinks, each poem was scraped from its webpage, cleaned up, and added to two lists: one in which line breaks are kept (for human reading) and one in which line breaks are removed (for machine reading).

In [6]:
# Extracting and cleaning the actual poems:
hp_poetry_texts_h = [] # A list of all poems where line breaks are kept, keeping it in its intended structure.
hp_poetry_texts_m = [] # A list of all poems where line breaks are removed, and poems are one continuous block of text.
for link in hp_poetry_links:
    # The next two lines fetch the page and parse its HTML
    poem = requests.get(link).text
    poem = BeautifulSoup(poem, "html.parser")

    # Replacing every image with a <br> tag
    for img in poem.find_all("img"):
        img.replace_with(poem.new_tag("br"))
    
    # Selecting the third <font face="Arial,Sans-Serif"> element, which holds the poem
    poem = poem.find_all("font", face = "Arial,Sans-Serif")[2]

    # Adding the text of the poem to the human-readable list
    hp_poetry_texts_h.append(poem.text)
    
    # The rest of the loop strips non-breaking spaces and line-break characters
    # from each child node and joins the pieces into one string
    processed_poem = []
    noise = ["\xa0", "\n", "\r"]
    for content in poem:
        line = content.get_text(separator = " ")
        for char in noise:
            line = line.replace(char, "")
        processed_poem.append(line)
    processed_poem = " ".join(processed_poem)
    # Adding the final poem to the machine-readable list
    hp_poetry_texts_m.append(processed_poem)
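
The difference between the two text forms can be illustrated without any network access. The sketch below is an alternative to the loop above, not a copy of it: it collapses every run of whitespace into a single space. The helper name `flatten_poem` and the sample string are invented for illustration.

```python
def flatten_poem(text: str) -> str:
    """Collapse line breaks, non-breaking spaces and repeated spaces into single spaces."""
    # Turn non-breaking spaces into ordinary ones, then split on any whitespace run.
    return " ".join(text.replace("\xa0", " ").split())


# Invented sample in the style of the human-readable text (CRLF breaks, indentation).
human_readable = "\r\n\r\nFirst line of a poem\r\n\xa0\xa0Second line, indented\n"
machine_readable = flatten_poem(human_readable)
print(machine_readable)
```

Unlike the loop above, this version may leave a space where the original removes a character outright, so the two approaches are not guaranteed to give identical strings.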

### Step 5
Finally, a dataframe was created from the titles, links and texts and exported as a .csv file. Subsequently, each poem was also exported as a separate .txt file.

In [7]:
# Putting it all into a table
hp_dict = {"Title":hp_poetry_titles, "Text_Original":hp_poetry_texts_h, "Text_Continuous":hp_poetry_texts_m, "Link":hp_poetry_links}
df = pd.DataFrame(hp_dict)
df.head()

| | Title | Text_Original | Text_Continuous | Link |
|---|---|---|---|---|
| 0 | An American to Mother England | \r\n\r\nEngland! My England! Can the surging s... | England! My England! Can the surging sea That ... | https://www.hplovecraft.com/writings/texts/poe... |
| 1 | The Ancient Track | \r\n\r\nThere was no hand to hold me back\r\nT... | There was no hand to hold me back That night I... | https://www.hplovecraft.com/writings/texts/poe... |
| 2 | Arcadia | \r\n\r\nBy Head Balledup\n\nO give me the life... | By Head Balledup O give me the life of the vil... | https://www.hplovecraft.com/writings/texts/poe... |
| 3 | Astrophobos | \r\n\nIn the midnight heavens burning\nThro’ e... | In the midnight heavens burning Thro’ etherea... | https://www.hplovecraft.com/writings/texts/poe... |
| 4 | The Cats | \r\n\r\nBabels of blocks to the high heavens t... | Babels of blocks to the high heavens tow’ring,... | https://www.hplovecraft.com/writings/texts/poe... |


In [8]:
# Exporting it as a CSV file
df.to_csv("HP_Lovecraft_Poems.csv")
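
By default, `to_csv` writes the row index as an unnamed first column, so reading the file back without further arguments produces an extra `Unnamed: 0` column. The sketch below shows this round trip on a small stand-in frame held in memory; `df_demo` and its contents are placeholders, not the real data.

```python
import io
import pandas as pd

# Small stand-in for the real dataframe.
df_demo = pd.DataFrame({
    "Title": ["Arcadia", "The Cats"],
    "Link": ["poetry/p337.aspx", "poetry/p253.aspx"],
})

buffer = io.StringIO()
df_demo.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a regular column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # the first column is used as the index again

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is another option when the row numbers are not needed in the file.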

In [14]:
os.makedirs("Lovecraft", exist_ok=True)  # Creates a folder for the exported poems if it doesn't already exist.
for index, poem in enumerate(hp_poetry_texts_h):
    # Mode "w" creates the file if it is missing and overwrites it otherwise.
    with open(os.path.join("Lovecraft", f"poem_{index}.txt"), "w", encoding="utf-8") as file:
        file.write(poem)
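
The exported files are named by position only, so a file can be traced back to its poem only through the dataframe. One possible variation, sketched here with placeholder texts and a temporary directory, adds a slug of the title to each file name. The helper `title_to_filename` and the naming scheme are assumptions made for illustration, not part of the original export.

```python
import re
import tempfile
from pathlib import Path


def title_to_filename(title: str, index: int) -> str:
    """Build a file name such as '000_some_title.txt' from a poem title (hypothetical scheme)."""
    # Lowercase the title and replace every run of other characters with "_".
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return f"{index:03d}_{slug}.txt"


titles = ["An American to Mother England", "The Ancient Track"]
texts = ["placeholder text one", "placeholder text two"]  # stand-ins for the poems

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "Lovecraft"
    out_dir.mkdir(parents=True, exist_ok=True)
    for index, (title, text) in enumerate(zip(titles, texts)):
        (out_dir / title_to_filename(title, index)).write_text(text, encoding="utf-8")
    written = sorted(p.name for p in out_dir.iterdir())

print(written)
```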