# Download Sands and Mac PDFs and OCR text

This notebook was used to download data from the Sands & Mac directories so I could create a fully-searchable version.

I started by downloading all the PDFs, but later found I could get OCR data directly from online versions of the ALTO files. The page images are all available through IIIF, so in the end I didn't really need the PDFs at all.

In [2]:
import pandas as pd
from requests_cache import CachedSession
import re
from pathlib import Path
import fitz
import time
from tqdm.auto import tqdm
from bs4 import BeautifulSoup
import json

sess = CachedSession()

We're starting with a CSV file, exported from the SLV catalogue, that contains the results of a search for Sands & Mac.

In [3]:
# Load the CSV file from the catalogue
df = pd.read_csv("sands_and_mac.csv")

## Get PDF ids

To download a PDF version of a digitised item you need its identifier, and that's not easy to find.

If you're starting with an Alma identifier, you first need to get the `IE` identifier from the MARC record.

Then you need an identifier that points to the specific PDF representation of the item. The PDF links are created dynamically in the SLV image viewer, so the information is coming from somewhere, but where? I eventually realised that the viewer is loading a JSON file with what I think is data about the digitised item from Rosetta. This file contains identifiers for `small_pdf` and `master_pdf`. 
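
The `$ref` values in that JSON file aren't plain identifiers. Judging by the way they're parsed in this notebook, they look like a path into the JSON, with the identifier as the last quoted key. Here's a minimal sketch of pulling out the PDF identifier. The sample data is made up to match that assumed shape, and `id_from_ref` and `pick_pdf_id` are hypothetical helpers rather than anything provided by the SLV.

```python
import re


def id_from_ref(ref):
    """Return the last quoted key from a path-style reference, or None."""
    keys = re.findall(r'\["([^"]+)"\]', ref)
    return keys[-1] if keys else None


def pick_pdf_id(summary):
    """Prefer the small PDF, fall back to the master PDF."""
    for key in ("small_pdf", "master_pdf"):
        if key in summary:
            return id_from_ref(summary[key]["$ref"])
    return None


# Invented sample: the real file may be structured differently
sample_summary = {"small_pdf": {"$ref": '$["file"]["FL21805779"]'}}
pdf_id = pick_pdf_id(sample_summary)
```

Using a regular expression rather than splitting on `][` means a missing or oddly shaped reference gives `None` instead of raising an `IndexError`.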

In [4]:
def get_marc_record(alma_id):
    """
    Gets a text representation of an item's MARC record.
    """
    response = sess.get(
        f"https://find.slv.vic.gov.au/primaws/rest/pub/sourceRecord?docId=alma{alma_id}&vid=61SLV_INST:SLV"
    )
    return response.text


def get_marc_value(marc, tag, subfield):
    """
    Gets the value of a tag/subfield from a text version of an item's MARC record.
    """
    try:
        tag = re.search(rf"^{tag}\t.+", marc, re.M).group(0)
        subfield = re.search(rf"\${subfield}([^\$]+)", tag).group(1)
    except AttributeError:
        return None
    return subfield.strip(" .,")

def get_image_id(alma_id):
    """
    Get the IE image identifier from the MARC record.
    These ids are used to construct IIIF manifest urls.
    """
    marc = get_marc_record(alma_id)
    try:
        image_id = re.search(r"\$e(IE\d+)", marc).group(1)
    except AttributeError:
        # print(alma_id)
        image_id = ""
    return image_id

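def get_pdf_id(ie_id):
    """
    Get the identifier of the PDF representation of a digitised item.
    Uses the `small_pdf` if there is one, otherwise the `master_pdf`.
    """
    # The viewer API returns JSON that includes METS info about the item
    response = sess.get(f"https://viewerapi.slv.vic.gov.au/?entity={ie_id}&dc_arrays=1")
    data = response.json()
    try:
        pdf_id = data["summary"]["small_pdf"]["$ref"].split("][")[1].strip('"]')
    except KeyError:
        pdf_id = data["summary"]["master_pdf"]["$ref"].split("][")[1].strip('"]')
    return pdf_id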
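
To show how the regular expressions in the MARC functions work, here's a small sketch run against a made-up record. The sample text is an assumption shaped to fit the patterns used above (a tag, a tab, then `$`-prefixed subfields); the real output of the `sourceRecord` endpoint may differ, and the `956` tag is purely illustrative.

```python
import re

# Invented MARC text: tag, tab, indicators, then $-prefixed subfields
sample_marc = "\n".join(
    [
        "245\t10$aSands & McDougall's directory of Victoria :$b1915.",
        "260\t  $aMelbourne :$bSands & McDougall,$c1915.",
        "956\t  $eIE13932084",
    ]
)

# Step 1: find the line for the tag we want (re.M makes ^ match at each line start)
field = re.search(r"^260\t.+", sample_marc, re.M).group(0)

# Step 2: grab everything after the subfield code up to the next $, then trim punctuation
publisher = re.search(r"\$b([^\$]+)", field).group(1).strip(" .,")

# The IE identifier is found by searching the whole record for a $e subfield starting with IE
ie_id = re.search(r"\$e(IE\d+)", sample_marc).group(1)
```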

Add the `IE` and PDF identifiers to the dataset.

In [5]:
df["ie_id"] = df["Record ID"].apply(get_image_id)
df["pdf_id"] = df["ie_id"].apply(get_pdf_id)

In [6]:
df.head()

Unnamed: 0,Title,Author / Creator,Contributor(s),Series dates,Edition,Source,Publisher,Date,Distributor / Manufacturer,Description,...,Biographical / Historical note,Exhibited,Processing note,Language note,Language(s),User comment(s),Record ID,Persistent link,ie_id,pdf_id
0,Sands & McDougall's Melbourne and suburban dir...,,Sands & McDougall Limited.,,,,Melbourne : Sands & McDougall,1865,,614 digital files.,...,,,,,English,,9939682246307636,https://find.slv.vic.gov.au/permalink/61SLV_IN...,IE13680204,FL21805779
1,"Sands, Kenny & Co.'s commercial and general Me...",,,,,,"Melbourne : Sands, Kenny & Co.",1860,,467 digital files.,...,,,,,English,,9939682227907636,https://find.slv.vic.gov.au/permalink/61SLV_IN...,IE13647144,FL21803559
2,Sands & McDougall's Melbourne and suburban dir...,,Sands & McDougall Limited.,,,,Melbourne : Sands & McDougall,1885,,1330 digital files.,...,,,,,English,,9939666276107636,https://find.slv.vic.gov.au/permalink/61SLV_IN...,IE13746144,FL21843423
3,Sands & McDougall's directory of Victoria : 1915,,,,,,Melbourne : Sands & McDougall,1915,,3112 digital files.,...,,,,,English,,9939663929107636,https://find.slv.vic.gov.au/permalink/61SLV_IN...,IE13932084,FL21850715
4,Sands & McDougall's directory of Victoria : 1935,,,,,,Melbourne : Sands & McDougall,1935,,2236 digital files.,...,,,,,English,,9939663928207636,https://find.slv.vic.gov.au/permalink/61SLV_IN...,IE14054304,FL21811087


In [None]:
df.to_csv("sands_and_mac_ie.csv", index=False)

## Download PDFs and extract text

In [None]:
# Make sure the output directory exists
Path("sands_and_mac").mkdir(exist_ok=True)
for pdf_id in df["pdf_id"].to_list():
    pdf_url = f"https://rosetta.slv.vic.gov.au/delivery/DeliveryManagerServlet?dps_func=stream&dps_pid={pdf_id}"
    response = sess.get(pdf_url)
    # The filename is supplied in the Content-Disposition header
    filename = response.headers["Content-Disposition"].split("=")[1].strip('"')
    Path("sands_and_mac", filename).write_bytes(response.content)

In [None]:
# Loop through all the PDFs
for pdf in Path("sands_and_mac").glob("*.pdf"):
    print(pdf.name)
    pid = pdf.name.split(".")[0]
    # Create directory for volume
    data_dir = Path("sands_and_mac", pid)
    data_dir.mkdir(exist_ok=True)
    # Create directories for text and images
    text_dir = Path(data_dir, "text")
    image_dir = Path(data_dir, "images")
    text_dir.mkdir(exist_ok=True)
    image_dir.mkdir(exist_ok=True)
    # Open the PDF with PyMuPDF
    doc = fitz.open(pdf)
    for i, page in enumerate(doc):
        # Get images
        for xref in page.get_images():
            pix = fitz.Pixmap(doc, xref[0])
            image_file = Path(image_dir, f"{pid}-{i+1}.jpg")
            pix.save(image_file)
        # Get text
        text_path = Path(text_dir, f"{pid}-{i+1}.txt")
        # The sort option tries to organise the text into a natural reading order.
        # However, it doesn't always identify column boundaries,
        # so values from adjacent columns can be run together.
        text = page.get_text(sort=True)
        text_path.write_text(text)
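
One possible way around the column problem is to ask PyMuPDF for text blocks with their coordinates and order them yourself. This is a rough sketch only, untested against the Sands & Mac PDFs. It assumes evenly spaced columns and blocks in the tuple layout returned by `page.get_text("blocks")`, which starts with `x0, y0, x1, y1, text`. `order_blocks_by_column` is a hypothetical helper and the sample blocks are invented.

```python
def order_blocks_by_column(blocks, page_width, n_columns=3):
    """Sort text blocks column by column, then top to bottom within a column."""
    col_width = page_width / n_columns

    def sort_key(block):
        x0, y0 = block[0], block[1]
        # Assign the block to a column based on its left edge
        column = min(int(x0 // col_width), n_columns - 1)
        return (column, y0, x0)

    return sorted(blocks, key=sort_key)


# Invented blocks from a 600pt wide, two-column page: (x0, y0, x1, y1, text)
# With a real page these might come from page.get_text("blocks")
sample_blocks = [
    (310, 50, 590, 70, "Abbott Wm, baker"),
    (10, 80, 290, 100, "Adams Jas, grocer"),
    (10, 50, 290, 70, "Abel Thos, carpenter"),
    (310, 80, 590, 100, "Allen Geo, draper"),
]
ordered = order_blocks_by_column(sample_blocks, page_width=600, n_columns=2)
ordered_text = [block[4] for block in ordered]
```

The number of columns changes between volumes and sections, so a fixed `n_columns` is unlikely to work everywhere.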

## Download ALTO files and extract text

In [None]:
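def find_tiff_rep(data):
    """
    Get the id of the high-resolution TIFF representation from the METS data.
    Returns None if there isn't one.
    """
    for rep in data["representation"].values():
        if rep.get("entity_type") == "TIFF" and rep.get("representation_code") == "HIGH":
            return rep["id"]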

Download all the ALTO files.

In [None]:
# Loop through volumes, sorted by date
for i, vol in df.sort_values("Date").iterrows(): 
    ie_id = vol["ie_id"]
    # Path to volume data
    vol_path = [v for v in Path("sands_and_mac").glob(f"sa{vol['Date']}*") if v.is_dir()][0]
    # Path to save ALTO files
    alto_path = Path(vol_path, "alto")
    alto_path.mkdir(exist_ok=True)
    # Need to look in METS data for the right id
    response = sess.get(f"https://viewerapi.slv.vic.gov.au/?entity={ie_id}&dc_arrays=1")
    vol_data = response.json()
    tiff_id = find_tiff_rep(vol_data)
    # Loop through files in METS data looking for ALTO files
    for file in tqdm(vol_data["file"].values()):
        if file.get("entity_type") == "ALTO":
            # Download ALTO file
            alto_response = sess.get(file["url"])
            # Get the image id related to the ALTO file
            image_id = file["related_files"][tiff_id]["$ref"].split("][")[1].strip('"]')
            # Save ALTO XML file using ALTO and image ids in filename (I want to keep the link between text and image for use later)
            Path(alto_path, f"{file['id']}-{image_id}.xml").write_text(alto_response.text)
            if not alto_response.from_cache:
                time.sleep(2)
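
The way the ALTO files are linked to page images is easier to see without the network requests. Here's a sketch that pairs ALTO and image ids using a tiny invented sample. The keys mirror the ones used in the code above, but the values and the `$ref` format are assumptions, and `alto_image_pairs` is a hypothetical helper.

```python
def find_rep_id(data, entity_type="TIFF", code="HIGH"):
    """Return the id of the first matching representation, or None."""
    for rep in data["representation"].values():
        if rep.get("entity_type") == entity_type and rep.get("representation_code") == code:
            return rep["id"]
    return None


def alto_image_pairs(data):
    """List (alto_id, image_id) tuples for every ALTO file in the data."""
    tiff_id = find_rep_id(data)
    pairs = []
    for file in data["file"].values():
        if file.get("entity_type") != "ALTO":
            continue
        # The related image is referenced under the TIFF representation's id
        ref = file["related_files"][tiff_id]["$ref"]
        pairs.append((file["id"], ref.split("][")[1].strip('"]')))
    return pairs


# Invented sample data with one page image and its ALTO file
sample_data = {
    "representation": {
        "REP1": {"id": "REP1", "entity_type": "TIFF", "representation_code": "HIGH"},
    },
    "file": {
        "FL100": {"id": "FL100", "entity_type": "TIFF"},
        "FL200": {
            "id": "FL200",
            "entity_type": "ALTO",
            "url": "https://example.org/FL200.xml",
            "related_files": {"REP1": {"$ref": '$["file"]["FL100"]'}},
        },
    },
}
pairs = alto_image_pairs(sample_data)
```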
            

Extract the page text from each ALTO file.

In [10]:
# Loop through volumes, sorted by date
for i, vol in df.sort_values("Date").iterrows():
    print(vol["Date"])
    ie_id = vol["ie_id"]
    # Path to volume data
    vol_path = [v for v in Path("sands_and_mac").glob(f"sa{vol['Date']}*") if v.is_dir()][0]
    # Path where ALTO files are saved
    alto_path = Path(vol_path, "alto")
    # Path to save text from ALTO files
    alto_text = Path(vol_path, "alto-json")
    alto_text.mkdir(exist_ok=True)
    # Loop through all the ALTO XML files
    for alto_file in tqdm(alto_path.glob("*.xml")):
        soup = BeautifulSoup(alto_file.read_text(), features="xml")
        # Open a newline-delimited JSON file to save the lines of text
        with Path(alto_text, f"{alto_file.stem}.ndjson").open("w") as alto_text_file:
            # Join the words in each line into a single string and save it with the line's position and size
            for line in soup.find_all("TextLine"):
                words = [word["CONTENT"].strip() for word in line.find_all("String")]
                line_data = {
                    "text": " ".join(words),
                    "h": int(line["HEIGHT"]),
                    "w": int(line["WIDTH"]),
                    "x": int(line["HPOS"]),
                    "y": int(line["VPOS"]),
                }
                alto_text_file.write(json.dumps(line_data) + "\n")
        

1860
1865
1870
1875
1880
1885
1890
1895
1900
1905
1910
1915
1920
1925
1930
1935
1940
1945
1950
1955
1960
1965
1970
1974
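
Each `.ndjson` file has one JSON record per line of text, along with the position and size of that line on the page. As a sketch of how this might be used for searching, the code below finds matching lines and turns their bounding boxes into IIIF Image API region strings (`x,y,w,h`). It assumes the ALTO coordinates are in pixels of the IIIF image, which would need checking. `search_lines` and `iiif_region` are hypothetical helpers and the sample rows are invented.

```python
import json


def search_lines(ndjson_rows, term):
    """Yield line records whose text contains the term (case-insensitive)."""
    term = term.lower()
    for row in ndjson_rows:
        row = row.strip()
        if not row:
            continue
        record = json.loads(row)
        if term in record["text"].lower():
            yield record


def iiif_region(record):
    """Format a line's bounding box as an IIIF region: x,y,w,h."""
    return f"{record['x']},{record['y']},{record['w']},{record['h']}"


# Invented rows in the same shape as the .ndjson files written above
sample_rows = [
    '{"text": "Abbott Wm, baker", "h": 30, "w": 400, "x": 100, "y": 200}',
    '{"text": "Adams Jas, grocer", "h": 30, "w": 410, "x": 100, "y": 235}',
]
matches = list(search_lines(sample_rows, "Baker"))
regions = [iiif_region(match) for match in matches]
```

With a real file, the rows could come from `Path(...).open()`, since iterating over an open file yields one line at a time.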