## Imports

In [2]:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

## Scraping all article IDs

On the Newton Project website, the correspondence listing shows 25 articles per page and spans 18 pages.

In [3]:
number_of_pages = 18
articles_per_page = 25
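
The page count can also be derived rather than hard-coded. The sketch below assumes the total of 431 letters that the assertions later in this notebook check for; `total_articles` is a name introduced here for illustration.

```python
import math

total_articles = 431      # count asserted later in this notebook
articles_per_page = 25

# Round up so the final, partially filled page is included.
number_of_pages = math.ceil(total_articles / articles_per_page)
print(number_of_pages)  # 18
```

If the site adds or removes letters, only `total_articles` would need updating.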

For each page, we fetch the HTML content and parse it with BeautifulSoup, then temporarily store the parsed result in a list.

In [4]:
html_soups = []
for page in range(number_of_pages):
    start_article_number = page * articles_per_page + 1
    url = f"http://www.newtonproject.ox.ac.uk/texts/correspondence/all?n=25&sr={start_article_number}&cat=Correspondence&tr=1&sort=shelfmark&order=asc"
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    html_soups.append(soup)
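
The `sr` query parameter appears to be a 1-based start offset, so page 0 starts at article 1, page 1 at article 26, and so on. As a rough sketch, the URLs can be built separately and inspected before any request is sent. `page_urls` is a hypothetical helper, and the URL template is copied from the cell above.

```python
def page_urls(number_of_pages, articles_per_page=25):
    """Return the listing URL for each results page (hypothetical helper)."""
    base = "http://www.newtonproject.ox.ac.uk/texts/correspondence/all"
    urls = []
    for page in range(number_of_pages):
        # 1-based offset of the first article on this page.
        start = page * articles_per_page + 1
        urls.append(
            f"{base}?n={articles_per_page}&sr={start}"
            "&cat=Correspondence&tr=1&sort=shelfmark&order=asc"
        )
    return urls


urls = page_urls(18)
print(len(urls))
print(urls[-1])
```

The last URL should carry `sr=426`, which is the first article of the 18th page.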

Using BeautifulSoup, we extract the article IDs from each HTML page. These IDs are all in `p` tags with the `metadataContent` class. To retrieve only the IDs, we use a regular expression that looks for four uppercase letters followed by five digits.

In [5]:
article_ids = []
for soup in html_soups:
    metadata = soup.find_all("p", class_="metadataContent")
    for metadatum in metadata:
        ids = re.findall(r"[A-Z]{4}[0-9]{5}", metadatum.text)
        if len(ids) > 0:
            article_ids.append(ids[0])
assert(len(article_ids) == 431)
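
To see what the pattern does in isolation, here is a small sketch that applies it to plain strings. The helper name `first_id` and both sample strings are made up for illustration; they are not taken from the website.

```python
import re

# Four uppercase letters followed by five digits, as described above.
ID_PATTERN = re.compile(r"[A-Z]{4}[0-9]{5}")


def first_id(text):
    """Return the first ID-like token in text, or None if there is none."""
    match = ID_PATTERN.search(text)
    return match.group(0) if match else None


# Both strings below are invented examples.
print(first_id("Newton Project ID: ABCD00001"))  # ABCD00001
print(first_id("No identifier here"))            # None
```

Note that the pattern is not anchored, so it would also match inside a longer run of letters and digits. That is harmless here only as long as the metadata paragraphs contain no such longer tokens.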

## Retrieve XML files

Once we have the list of IDs, we can request the corresponding XML file for each one and parse it with BeautifulSoup.

In [6]:
xml_soups = []
for article_id in article_ids:
    url = f"http://www.newtonproject.ox.ac.uk/view/texts/xml/{article_id}"
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "xml")
    xml_soups.append(soup)

assert(len(xml_soups) == 431)
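
With several hundred requests in a row, it may be worth failing early on HTTP errors and pausing between downloads. The following is a sketch under those assumptions, not the approach used above: `fetch_texts`, the half-second delay and the 30-second timeout are all choices made for this example.

```python
import time


def fetch_texts(urls, session=None, delay_seconds=0.5, timeout=30):
    """Download each URL in order and return the response bodies."""
    if session is None:
        import requests  # imported lazily so a stand-in session can be passed

        session = requests.Session()  # reuses one connection across requests
    texts = []
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
        texts.append(response.text)
        time.sleep(delay_seconds)  # pause between requests
    return texts
```

It could be called as `fetch_texts([f"http://www.newtonproject.ox.ac.uk/view/texts/xml/{i}" for i in article_ids])`, with each returned string then passed to `BeautifulSoup(text, "xml")`.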

We can now iterate over the parsed XML documents and extract the useful fields. Each letter's fields are stored as a list, and the resulting list of lists is used to build a DataFrame.

In [13]:
entries = []
for xml in xml_soups:
    # Remove abbreviations: an abbreviated word appears in two versions, e.g. the abbreviation "Sr" and the complete word "Sir".
    for abbr in xml("abbr"):
        abbr.decompose()
        
    author = xml.find("author").text.replace("\n", " ").strip(" ")
    
    # Take the first <div> as the letter body; this may not be the right element for every file.
    letter_div = xml.find("div")
    if letter_div is not None:
        # Collapse runs of whitespace into single spaces.
        letter_content = " ".join(letter_div.text.split())
    else:
        letter_content = ""
   
    original_date = xml.find("origDate").text
    original_place = xml.find("origPlace")
    if original_place is not None:
        original_place = original_place.text
    else:
        original_place = "Unknown"
        
    languages = [lang.text for lang in xml.find_all("language")]
    
    entries.append([author, original_date, original_place, languages, letter_content])
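
The "find a tag, fall back to a default if it is missing" pattern recurs for several fields, so it could be factored into one helper. This is a sketch only: `clean_text` is a hypothetical name, and unlike the cell above it also collapses whitespace for every field it is applied to.

```python
from types import SimpleNamespace


def clean_text(tag, default=""):
    """Return tag.text with whitespace collapsed, or default if tag is None."""
    if tag is None:
        return default
    return " ".join(tag.text.split())


# Stand-in for a parsed tag; the text is made up for illustration.
fake_tag = SimpleNamespace(text="  Trinity College,\n   Cambridge ")
print(clean_text(fake_tag))                 # Trinity College, Cambridge
print(clean_text(None, default="Unknown"))  # Unknown
```

Inside the loop this would read, for example, `original_place = clean_text(xml.find("origPlace"), default="Unknown")`.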

Now we can build the DataFrame, using the article IDs as the index.

In [14]:
letters = pd.DataFrame(entries, columns=["author", "original_date", "original_place", "languages", "letter_content"], index = article_ids)

Finally, we save the DataFrame in pickle format.

In [15]:
letters.to_pickle("letters.pickle")
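
One reason pickle suits this table is that the `languages` column holds Python lists, which come back as lists when the file is reloaded. The sketch below checks that round trip on a single made-up row written to a temporary directory; the ID, author and text are invented, not real data.

```python
import os
import tempfile

import pandas as pd

# One invented row with the same columns as `letters`.
sample = pd.DataFrame(
    [["Isaac Newton", "1 January 1700", "Unknown", ["English"], "Sir, I thank you."]],
    columns=["author", "original_date", "original_place", "languages", "letter_content"],
    index=["ABCD00001"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pickle")
    sample.to_pickle(path)
    restored = pd.read_pickle(path)

# The list survives the round trip as a list, not as a string.
print(restored.loc["ABCD00001", "languages"])
```

The real file can be reloaded the same way with `pd.read_pickle("letters.pickle")`. Pickle files should only be loaded from sources you trust.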