## Imports

In [38]:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

## Scraping all article IDs

On the Newton Project website, each results page lists 25 articles, and the correspondence spans 18 pages (431 letters in total).

In [2]:
number_of_pages = 18
articles_per_page = 25
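
As a quick sanity check, the page count can be derived rather than hard-coded. This is a minimal sketch, assuming the total of 431 letters that is asserted later in this notebook; `total_articles`, `pages_needed` and `start_numbers` are illustrative names, not part of the original code.

```python
import math

total_articles = 431      # assumption: the total asserted later in this notebook
articles_per_page = 25

# The last page is only partly filled, so round up.
pages_needed = math.ceil(total_articles / articles_per_page)

# The site numbers results from 1, so each page starts at page * 25 + 1.
start_numbers = [page * articles_per_page + 1 for page in range(pages_needed)]

print(pages_needed, start_numbers[:3], start_numbers[-1])
```

The last start number is the value passed as `sr` for the final page request.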

For each page, we fetch its HTML content and parse it with BeautifulSoup, then temporarily store the parsed page in a list.

In [3]:
html_soups = []
for page in range(number_of_pages):
    start_article_number = page * articles_per_page + 1
    url = f"http://www.newtonproject.ox.ac.uk/texts/correspondence/all?n=25&sr={start_article_number}&cat=Correspondence&tr=1&sort=shelfmark&order=asc"
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    html_soups.append(soup)

Using BeautifulSoup, we extract the article IDs from each HTML page. These IDs are all in `p` tags with the `metadataContent` class. To retrieve only the IDs, we use a regular expression that matches four uppercase letters followed by five digits.

In [4]:
article_ids = []
for soup in html_soups:
    metadata = soup.find_all("p", class_="metadataContent")
    for metadatum in metadata:
        ids = re.findall("[A-Z]{4}[0-9]{5}", metadatum.text)
        if len(ids) > 0:
            article_ids.append(ids[0])
assert len(article_ids) == 431
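
To see what the pattern does and does not match, here is a small sketch run on a few strings. The first ID is taken from the table in this notebook; the surrounding metadata texts are hypothetical examples, not actual page content.

```python
import re

# Four uppercase letters followed by five digits, as in the cell above.
ID_PATTERN = re.compile(r"[A-Z]{4}[0-9]{5}")

samples = [
    "NATP00307",                       # ID as it appears in the table in this notebook
    "Shelfmark: MS Add. 9597/2/18/1",  # hypothetical metadata text without an ID
    "Document ID: NATP00015",          # hypothetical text wrapping an ID
]

matches = [ID_PATTERN.findall(sample) for sample in samples]
print(matches)
```

Because `findall` searches anywhere in the string, the pattern would also match inside a longer run of letters and digits; anchoring it with `\b` on both sides is one way to tighten it if that ever becomes a problem.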

## Retrieve XML files

Once we have the list of IDs, we can request the XML file for each one and parse it with BeautifulSoup.

In [27]:
xml_soups = []
for article_id in article_ids:
    url = f"http://www.newtonproject.ox.ac.uk/view/texts/xml/{article_id}"
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "xml")
    xml_soups.append(soup)

assert len(xml_soups) == 431
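
With several hundred requests, a failed or slow response can silently produce an empty document. The sketch below shows one possible way to make the loop more defensive; the helper `fetch_xml_texts`, its parameters and the half-second pause are assumptions for illustration, not part of the original notebook. It expects a session-like object such as `requests.Session()`.

```python
import time

XML_URL = "http://www.newtonproject.ox.ac.uk/view/texts/xml/{article_id}"


def fetch_xml_texts(article_ids, session, delay_seconds=0.5, timeout=30):
    """Return a dict mapping each article ID to the raw XML text.

    `session` is any object with a requests-style `get(url, timeout=...)`
    method, e.g. `requests.Session()` (hypothetical helper).
    """
    texts = {}
    for article_id in article_ids:
        url = XML_URL.format(article_id=article_id)
        response = session.get(url, timeout=timeout)
        response.raise_for_status()   # stop on HTTP errors instead of parsing an error page
        texts[article_id] = response.text
        time.sleep(delay_seconds)     # short pause between requests to be polite to the server
    return texts
```

A call such as `fetch_xml_texts(article_ids, requests.Session())` would return the raw texts, which can then be parsed with `BeautifulSoup(text, "xml")` as in the cell above.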

We can now iterate over the parsed XML documents and extract the useful information: the author and the letter text. Each entry is stored in a list of lists that will be used to build a DataFrame.

In [42]:
entries = []
for xml in xml_soups:
    # Remove abbreviations: some words appear twice, once abbreviated (e.g. "Sr") and once in full ("Sir"), so we keep only the full form.
    for abbr in xml("abbr"):
        abbr.decompose()
        
    author = xml.find("author").text.replace("\n", " ").strip(" ")
    content = xml.find("div").text
    
    entries.append([author, content])
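
The effect of dropping the `abbr` tags is easiest to see on a tiny example. The snippet below is a sketch on a made-up fragment: the `choice`/`expan` wrapper and the sentence are assumptions about how the abbreviation and the full word are encoded, not text copied from the corpus. It uses `html.parser` only so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the abbreviation and its expansion sit side by side.
sample = (
    "<div><p>Honoured <choice><abbr>Sr</abbr><expan>Sir</expan></choice>, "
    "I received your letter.</p></div>"
)

soup = BeautifulSoup(sample, "html.parser")
before = soup.find("div").get_text()

# Remove every abbreviation so only the expanded form is left in the text.
for abbr in soup.find_all("abbr"):
    abbr.decompose()

after = soup.find("div").get_text()

print(before)
print(after)
```

Without the `decompose()` step the extracted text contains both forms back to back ("SrSir"), which is the duplication the comment in the cell above refers to.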

Now we can build the DataFrame, using the article IDs as the index.

In [44]:
pd.DataFrame(entries, columns=["author", "letter_content"], index = article_ids)

Unnamed: 0,author,letter_content
NATP00307,Isaac Newton,\n35Newton 135\nCambridg March 16th 1671.\nSir...
NATP00308,Isaac Newton,\n36Newton 236Trans 1672\n\nMr Newtons Letter ...
NATP00309,Isaac Newton,\n37Newton 337\n\nRead Jan: 11: 16712\nEntd. L...
NATP00310,Isaac Newton,\n38Newton 438.\nTrin. Coll. April 13. 72.\n\n...
NATP00311,Isaac Newton,\n\n39\nNewton 5\n39\nJune 11th 1672.\n\nRead ...
...,...,...
NATP00006,Isaac Newton,\n(3075)\nPHILOSOPHICAL TRANSACTIONS. February...
NATP00015,Ignace Gaston Pardies,"\n\nA Second Letter of P. Pardies, written to ..."
NATP00011,Robert Moray,\nSome Experiments propos'd in relation to Mr....
NATP00012,Ignace Gaston Pardies,\n\n(4087)\nA Latin Letter written to the Publ...
