## Code purpose
This code retrieves all the letters published on the Newton Project website. It first collects every letter ID, then uses each ID to fetch the corresponding XML file, which contains metadata about the letter and the text of the letter itself. It stores this information in a DataFrame, which we save in pickle format.


## Imports

In [1]:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

## Scraping all article IDs

On the Newton Project website, there are 25 articles per page and 18 pages of correspondence. We need this information for scraping, because the listing URL uses GET parameters to select which letters are displayed.

In [2]:
number_of_pages = 18
articles_per_page = 25
total_letters = 431

For each page, we fetch its full HTML content and parse it with BeautifulSoup. We then store the parsed page temporarily in a list.

In [3]:
html_soups = []
for page in range(number_of_pages):
    start_article_number = page * articles_per_page + 1
    url = f"http://www.newtonproject.ox.ac.uk/texts/correspondence/all?n=25&sr={start_article_number}&cat=Correspondence"
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    html_soups.append(soup)
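As a quick check of the paging arithmetic above, the sketch below builds the listing URLs without sending any request. The helper name `listing_urls` is hypothetical, and reading `n` as the page size and `sr` as the 1-based index of the first letter on the page is an assumption inferred from the URL used in this notebook.

```python
def listing_urls(number_of_pages=18, articles_per_page=25):
    """Build one listing URL per page (hypothetical helper, no network access)."""
    base = "http://www.newtonproject.ox.ac.uk/texts/correspondence/all"
    urls = []
    for page in range(number_of_pages):
        # Page 0 starts at letter 1, page 1 at letter 26, and so on.
        start_article_number = page * articles_per_page + 1
        urls.append(
            f"{base}?n={articles_per_page}&sr={start_article_number}&cat=Correspondence"
        )
    return urls


urls = listing_urls()
print(urls[0])   # first page, sr=1
print(urls[-1])  # last page, sr=426
```

With 431 letters in total, the last page starts at letter 426 and therefore holds only 6 letters instead of 25.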
    
    


Using BeautifulSoup, we extract the article IDs from each HTML page; these IDs let us fetch the corresponding XML files. The IDs are all in `p` tags with the `metadataContent` class. Because each of these `p` tags also contains a `strong` tag holding the front-end label "Newton Catalogue ID", we need to exclude that label. Every letter ID consists of 4 uppercase letters followed by 5 digits, so we retrieve the IDs alone with a regular expression that matches this pattern.

In [4]:
article_ids = []
for soup in html_soups:
    metadata = soup.find_all("p", class_="metadataContent")
    for metadatum in metadata:
        ids = re.findall("[A-Z]{4}[0-9]{5}", metadatum.text)
        if len(ids) > 0:
            article_ids.append(ids[0])
assert len(article_ids) == total_letters
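To illustrate why the regular expression skips the "Newton Catalogue ID" label and keeps only the ID, here is a minimal sketch run on made-up strings. The `sample_texts` values are hypothetical stand-ins for the text of `metadataContent` paragraphs, not content copied from the website.

```python
import re

# Four uppercase letters followed by five digits, as described above.
ID_PATTERN = re.compile(r"[A-Z]{4}[0-9]{5}")

# Hypothetical paragraph texts: two carry an ID, one does not.
sample_texts = [
    "Newton Catalogue ID: NATP00226",
    "Cambridge University Library, Cambridge, UK",
    "Newton Catalogue ID: MINT01077",
]

found = []
for text in sample_texts:
    match = ID_PATTERN.search(text)
    if match:  # paragraphs without an ID are simply skipped
        found.append(match.group())

print(found)  # ['NATP00226', 'MINT01077']
```

The label itself never matches because it contains no run of four uppercase letters immediately followed by five digits.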


## Retrieve XML files

Once we have the list of IDs, we can request the corresponding XML file for each one and parse it with BeautifulSoup.

In [5]:
xml_soups = []
for article_id in article_ids:
    url = f"http://www.newtonproject.ox.ac.uk/view/texts/xml/{article_id}"
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "xml")
    xml_soups.append(soup)

assert len(xml_soups) == total_letters

We can now iterate over each XML document and extract the useful fields. Each letter becomes one row in a list of lists, which is then used to build a DataFrame.

In [17]:
entries = []
for xml in xml_soups:
    category = xml.find("catDesc").text

    # Remove abbreviations: some words appear in two versions, e.g. the
    # abbreviation "Sr" and the complete word "Sir". Only the latter is kept.
    for abbr in xml("abbr"):
        abbr.decompose()

    author = xml.find("author").text.replace("\n", " ").strip(" ")
    title = xml.find("title").text
    original_date = xml.find("origDate").text

    # Taking the first <div> as the letter body may not be entirely correct.
    # Check the tag itself for None before reading .text, otherwise a
    # missing <div> raises AttributeError.
    letter_div = xml.find("div")
    if letter_div is not None:
        # Collapse every run of whitespace into a single space.
        letter_content = " ".join(letter_div.text.split())
    else:
        letter_content = ""

    # The manuscript reference stays None when the file has no <idno>.
    manuscript = xml.find("idno")
    if manuscript is not None:
        manuscript = manuscript.text

    original_place = xml.find("origPlace")
    if original_place is not None:
        original_place = original_place.text
    else:
        original_place = "Unknown"

    languages = [lang.text for lang in xml.find_all("language")]
    entries.append([author, category, title, manuscript, original_date, original_place, languages, letter_content])
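The abbreviation removal and whitespace clean-up can be tried on a small fragment without downloading anything. The sketch below deliberately swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so that it runs with no extra dependency. The `SAMPLE` fragment, its `choice`/`expan` structure, and the helper name `remove_abbr` are assumptions made for illustration; the real files may also use XML namespaces, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one word given both abbreviated and expanded.
SAMPLE = (
    "<div><p>Dear <choice><abbr>Sr</abbr><expan>Sir</expan></choice>,\n"
    "   I received\n your letter.</p></div>"
)


def remove_abbr(root):
    """Delete every <abbr> element while keeping the text that follows it."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag == "abbr":
                # ElementTree stores trailing text on the element itself,
                # so move it to the previous sibling or to the parent.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child


root = ET.fromstring(SAMPLE)
remove_abbr(root)
# Join all text nodes, then collapse runs of whitespace into single spaces.
cleaned = " ".join("".join(root.itertext()).split())
print(cleaned)  # Dear Sir, I received your letter.
```

Without the removal step, the extracted text would read "Dear SrSir, ...", which is why the abbreviations are dropped before the letter content is read.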

Now we can build the DataFrame, using the article IDs as the index.

In [23]:
letters = pd.DataFrame(entries, columns=["author", "category", "title", "manuscript", "original_date", "original_place", "languages", "letter_content"], index = article_ids)

In [24]:
letters

| | author | category | title | manuscript | original_date | original_place | languages | letter_content |
|---|---|---|---|---|---|---|---|---|
| NATP00226 | Isaac Newton | Mathematics | Letter from Newton to a friend, together with ... | MS Add. 9597/2/18/3 | 23 February 1668/9 | England | [English, Latin] | 3 Trinity College Cambridge Feb: 23d 16689 Sir... |
| NATP00227 | Isaac Newton | Mathematics | Letter from Newton to Francis Aston, dated 18 ... | MS Add. 9597/2/18/4 | 18 May 1669 | England | [English, Latin] | 4 Trinity College Cambridge May 18 1669 Franci... |
| NATP00224 | Isaac Newton | Mathematics | Letter from Newton to John Collins, dated 19 J... | MS Add. 9597/2/18/1 | 19 January 1669/70 | England | [English] | 1 Trinity College Cambridge. Ian 1669 Sir I re... |
| NATP00225 | Isaac Newton | Mathematics | Letter from Newton to John Collins, dated 6 Fe... | MS Add. 9597/2/18/2 | 6 February 1669/70 | England | [English, French] | 2 Trinity College Feb 6 1669.Cambridge. Sir Mr... |
| NATP00228 | Isaac Newton | Mathematics | Letter from Newton to John Collins, dated 18 F... | MS Add. 9597/2/18/5 | 18 February 1669/70 | England | [English] | 5. Feb 18th 166970. Sir Two days since I recei... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| NATP00279 | Isaac Newton | Mathematics | Letter from Newton to Edmund Halley, dated 3 D... | MS Add. 9597/2/18/68 | 3 December 1724 | England | [English, Latin] | 68 Dr Halley I received from you formerly a Ta... |
| MINT01077 | V Kidder Assay-Master in Ireland | Mint | Copy of report on the assay of several new Por... | T 27/24.110 | 6 Apr 1725 | England | [English] | 110 May it Please your Lordships In obedience ... |
| MINT01076 | John Scrope Treasury Secretary | Mint | Copy of referral of report on the assay of sev... | T 27/24.110 | 23 Sep 1725 | England | [English] | 110 Officers of the Mint Gentlemen Mr: Kidder ... |
| MINT01078 | John Scrope Treasury Secretary | Mint | Copy of letter recommending John Rollos for th... | T 27/24.175 | 16 Aug 1726 | England | [English] | 175. Warden, Master & Worker and Comptroller o... |


## Saving the DataFrame in pickle format

In [25]:
letters.to_pickle("letters.pickle")
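One reason pickle suits this DataFrame is that the `languages` column holds Python lists, which pickle restores as lists. The round trip below is a minimal sketch on a one-row stand-in; the `sample` frame and the temporary file name are made up for illustration, and `pd.read_pickle` is the standard pandas counterpart of `to_pickle`.

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for the real `letters` DataFrame.
sample = pd.DataFrame(
    {"author": ["Isaac Newton"], "languages": [["English", "Latin"]]},
    index=["NATP00226"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pickle")
    sample.to_pickle(path)
    restored = pd.read_pickle(path)

restored_languages = restored.loc["NATP00226", "languages"]
print(type(restored_languages).__name__, restored_languages)
```

The same pattern, `pd.read_pickle("letters.pickle")`, should load the full table saved above in a later session.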

In [26]:
letters.title.to_csv("title.csv")