# Create archived url datasets from Pandora's collections and subjects

This notebook helps you create a dataset of archived urls using Pandora's subject and collection groupings.

The Australian Web Archive makes billions of archived web pages searchable through Trove. But how would you go about constructing a search that would find websites relating to election campaigns? Fortunately you don't have to, as Pandora groups its archived web resources into subjects and collections. By using harvests of Pandora's subject hierarchy and a complete list of archived titles, this notebook makes it easy for you to create custom datasets relating to a specific topic or event.

This notebook uses pre-harvested datasets containing information about Pandora's subjects, collections and titles. New titles are added to Pandora frequently, so you might want to create your own updated versions using these notebooks:

- [Harvest Pandora subjects and collections](harvest-pandora-subject-collections.ipynb)
- [Harvest the full collection of Pandora titles](harvest-pandora-titles.ipynb)

## Using this notebook

The simplest way to get started is to browse the subject and collection groupings in [Pandora](http://pandora.nla.gov.au/). Once you've found a subject or collection of interest, just copy its identifier, either `/subject/[subject number]` or `/col/[collection number]`. You also need to decide if you want *every* title under that subject or collection, including those associated with its children, or if you only want the titles directly linked to your selected grouping.

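Then you can run either `create_subject_dataset([your subject id])` or `create_collection_dataset([your collection id])`, passing the identifier as a string such as `"/subject/3"` or `"/col/21326"`. By default, only the titles directly linked to the selected grouping are included. To include titles associated with its children, set `include_subcategories=True` and `include_collections=True` for a subject, or `include_subcollections=True` for a collection.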

## Datasets

This notebook creates a CSV-formatted dataset containing the following fields:

- `tep_id` – the Title Entry Page (TEP) identifier in the form `/tep/[TEP NUMBER]`
- `name` – name of the title
- `gathered_url` – the url that was archived
- `surt` – the surt (Sort-friendly URI Reordering Transform) is a version of the url that reverses the order of the domain components to put the top-level domain first, making it easier to group or sort resources by domain
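
To make the `surt` field more concrete, here is a minimal sketch of the reordering idea. It is a simplified illustration only, not the process used to generate the values in the harvested data: the `simple_surt` function is made up for this example, and full SURT implementations apply additional canonicalisation rules that are ignored here.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Return a simplified SURT-style key for a url (illustrative only)."""
    parts = urlsplit(url)
    # Reverse the host's components so the top-level domain comes first
    host = ",".join(reversed(parts.hostname.lower().split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{host}){path}"


urls = ["http://www.nla.gov.au/about", "http://pandora.nla.gov.au/"]
# Sorting on the reordered key keeps urls from the same domain together
for url in sorted(urls, key=simple_surt):
    print(simple_surt(url))
```

Because the key starts with the top-level domain, sorting or grouping on it clusters all the `gov.au` urls together, then all the `nla.gov.au` urls within that, and so on.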

Note that Pandora's title records can bring together different urls and domains that have pointed to a resource over time. This means that there can be multiple urls associated with each TEP. See [Harvest the full collection of Pandora titles](harvest-pandora-titles.ipynb) for more information.
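
Since one TEP can have several urls, you might want to group the rows by `tep_id` before analysing them. The sketch below uses only the standard library and a few invented sample rows; the titles and urls are placeholders, not real Pandora records. With a real dataset you would open the CSV file instead of using `io.StringIO`.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows that mimic the dataset's columns
sample_csv = """tep_id,name,gathered_url,surt
/tep/1,Example title,http://example.org/,"org,example)/"
/tep/1,Example title,http://www.example.com/,"com,example)/"
/tep/2,Another title,http://another.example.net/,"net,example,another)/"
"""

# Collect every gathered url under the TEP it belongs to
urls_by_tep = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample_csv)):
    urls_by_tep[row["tep_id"]].append(row["gathered_url"])

for tep_id, urls in urls_by_tep.items():
    print(tep_id, len(urls), urls)
```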

The dataset also includes an RO-Crate metadata file describing the dataset's contents and context.

## What can you do with a collection of archived urls?

For more information about a Pandora title, append its `tep_id` to the addresses below to construct a url for either a human-readable version in Trove or a machine-readable JSON version:

- [https://webarchive.nla.gov.au/tep/131444](https://webarchive.nla.gov.au/tep/131444) – goes to TEP web page
- [https://webarchive.nla.gov.au/bamboo-service/tep/131444](https://webarchive.nla.gov.au/bamboo-service/tep/131444) – returns JSON version of TEP
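
As the `tep_id` values already have the form `/tep/[TEP NUMBER]`, both urls can be built by simple string concatenation. This is a small sketch based on the two example urls above; the `tep_urls` helper is a name invented for illustration.

```python
BASE_URL = "https://webarchive.nla.gov.au"


def tep_urls(tep_id):
    """Build the web page and JSON urls for a tep_id such as '/tep/131444'."""
    return {
        "html": f"{BASE_URL}{tep_id}",
        "json": f"{BASE_URL}/bamboo-service{tep_id}",
    }


print(tep_urls("/tep/131444"))
```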

Once you have an archived url you can make use of the tools in the [Web Archives](https://glam-workbench.net/web-archives/) section of the GLAM Workbench to gather more data for analysis. For example:

- [Find all the archived versions of a web page using Timemaps](https://glam-workbench.net/web-archives/get-all-versions/)
- [Display changes in the text of an archived web page over time](https://glam-workbench.net/web-archives/display-changes-in-text/)
- [Harvesting collections of text from archived web pages](https://glam-workbench.net/web-archives/harvesting-text/)
- [Using screenshots to visualise change in a page over time](https://glam-workbench.net/web-archives/create-screenshots-over-time/)


In [299]:
from datetime import datetime
from pathlib import Path
import mimetypes

import ipynbname
import nbformat
import pandas as pd
from IPython.display import HTML, display
from rocrate.rocrate import ContextEntity, ROCrate
from slugify import slugify

In [300]:
dfc = pd.read_json(
    "https://github.com/GLAM-Workbench/trove-web-archives-collections/raw/main/pandora-collections.ndjson",
    lines=True,
)
dfs = pd.read_json(
    "https://github.com/GLAM-Workbench/trove-web-archives-collections/raw/main/pandora-subjects.ndjson",
    lines=True,
)
dft = pd.read_csv(
    "https://github.com/GLAM-Workbench/trove-web-archives-titles/raw/main/pandora-titles.csv"
)


def create_rocrate(subject, file_path, start_date, end_date):
    """
    Create an RO-Crate metadata file describing the downloaded dataset.
    """
    crate = ROCrate()
    crate.add_file(file_path)
    nb_path = ipynbname.path()
    nb = nbformat.read(nb_path, nbformat.NO_CONVERT)
    metadata = nb.metadata.rocrate
    nb_url = metadata.get("url", "")
    nb_properties = {
        "@type": ["File", "SoftwareSourceCode"],
        "name": metadata.get("name", ""),
        "description": metadata.get("description", ""),
        "encodingFormat": "application/x-ipynb+json",
        "codeRepository": metadata.get("codeRepository", ""),
        "url": nb_url,
    }
    crate.add(ContextEntity(crate, nb_url, properties=nb_properties))
    action_id = f"{nb_path.stem}_run"
    action_properties = {
        "@type": "CreateAction",
        "instrument": {"@id": nb_url},
        "actionStatus": {"@id": "http://schema.org/CompletedActionStatus"},
        "name": f"Run of notebook: {nb_path.name}",
        "result": {"@id": file_path.name},
        "object": [{"@id": o["url"]} for o in metadata["action"][0]["object"]],
        "query": f"{subject['id']} ({subject['name']})",
        "startDate": start_date,
        "endDate": end_date,
    }
    encoding = mimetypes.guess_type(file_path)[0]
    stats = file_path.stat()
    size = stats.st_size
    date = datetime.fromtimestamp(stats.st_mtime).strftime("%Y-%m-%d")
    # Count the lines in the file (this includes the CSV header row)
    with file_path.open("r") as f:
        rows = sum(1 for _ in f)
    crate.update_jsonld(
        {
            "@id": file_path.name,
            "dateModified": date,
            "contentSize": size,
            "size": rows,
            "encodingFormat": encoding,
        }
    )
    crate.add(ContextEntity(crate, action_id, properties=action_properties))
    crate.write(file_path.parent)
    crate.write_zip(file_path.parent)
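
The file-level details recorded in the crate (size, modification date, row count, and media type) all come from the standard library. The sketch below gathers the same values for a small temporary file so you can see what they look like without running a harvest. The `describe_file` helper and the sample file contents are invented for this example.

```python
import mimetypes
import tempfile
from datetime import datetime
from pathlib import Path


def describe_file(file_path):
    """Collect the kind of file details recorded by create_rocrate()."""
    stats = file_path.stat()
    with file_path.open("r") as f:
        # Counts every line, including the CSV header row
        rows = sum(1 for _ in f)
    return {
        "@id": file_path.name,
        "dateModified": datetime.fromtimestamp(stats.st_mtime).strftime("%Y-%m-%d"),
        "contentSize": stats.st_size,
        "size": rows,
        "encodingFormat": mimetypes.guess_type(file_path)[0],
    }


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp, "sample.csv")
    sample.write_text("tep_id,name\n/tep/1,Example\n")
    details = describe_file(sample)
    print(details)
```

Note that the value guessed for `encodingFormat` depends on the media types registered on your system.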

## Get title urls from a Pandora subject group

In [323]:
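def get_title_ids_in_collection(coll_id, include_subcollections=True):
    """
    Get the ids of the titles in a collection, optionally including
    the titles in its immediate subcollections.
    """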
    title_ids = []
    coll = dfc.loc[dfc["id"] == coll_id].iloc[0]
    title_ids += coll["titles"]
    if include_subcollections:
        for scoll_id in coll["subcollections"]:
            scoll = dfc.loc[dfc["id"] == scoll_id].iloc[0]
            title_ids += scoll["titles"]
    return title_ids


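def get_urls_by_subject(subject, include_subcategories=False, include_collections=False):
    """
    Get the title records linked to a subject, optionally including
    titles from its subcategories and associated collections.
    """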
    title_ids = []
    title_ids += subject["titles"]
    if include_subcategories:
        for subc_id in subject["subcategories"]:
            subc = dfs.loc[dfs["id"] == subc_id].iloc[0]
            title_ids += subc["titles"]
            if include_collections:
                for coll_id in subc["collections"]:
                    title_ids += get_title_ids_in_collection(coll_id)
    if include_collections:
        for coll_id in subject["collections"]:
            title_ids += get_title_ids_in_collection(coll_id)
    titles = dft.loc[dft["tep_id"].isin(title_ids)]
    return titles


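def create_subject_dataset(id, include_subcategories=False, include_collections=False):
    """
    Save the titles linked to a subject as a CSV file with
    RO-Crate metadata, and display a download link.
    """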
    start_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    subject = dfs.loc[dfs["id"] == id].iloc[0]
    
    df = get_urls_by_subject(
        subject,
        include_subcategories=include_subcategories,
        include_collections=include_collections,
    )
    
    end_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    subject_slug = slugify(f"pandora-{id}-{subject['name']}")
    output_path = Path("datasets", subject_slug)
    output_path.mkdir(exist_ok=True, parents=True)
    output_file = Path(output_path, f"pandora-{subject_slug}.csv")
    df.to_csv(output_file, index=False)
    create_rocrate(subject, output_file, start_date, end_date)
    display(
        HTML(
            f"Download dataset: <a href='datasets/{subject_slug}.zip' download>datasets/{subject_slug}.zip</a>"
        )
    )

In [324]:
create_subject_dataset(
    "/subject/3", include_subcategories=True, include_collections=True
)
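
The functions above look one level down the hierarchy: a subject's subcategories, and a collection's immediate subcollections. If the groupings you are interested in are nested more deeply (an assumption you would need to check against the harvested data), a recursive walk would pick up every level. Here is a minimal sketch using invented records held in a plain dictionary rather than the `dfc` dataframe; `collect_title_ids` is a hypothetical helper, not part of this notebook.

```python
# Invented collection records keyed by id, mimicking the `titles` and
# `subcollections` fields used above
collections = {
    "/col/1": {"titles": ["/tep/1"], "subcollections": ["/col/2"]},
    "/col/2": {"titles": ["/tep/2"], "subcollections": ["/col/3"]},
    "/col/3": {"titles": ["/tep/3", "/tep/1"], "subcollections": []},
}


def collect_title_ids(coll_id, seen=None):
    """Gather title ids from a collection and all of its descendants."""
    seen = set() if seen is None else seen
    if coll_id in seen:
        # Guard against visiting the same collection twice
        return []
    seen.add(coll_id)
    coll = collections[coll_id]
    title_ids = list(coll["titles"])
    for sub_id in coll["subcollections"]:
        title_ids += collect_title_ids(sub_id, seen)
    return title_ids


# Remove duplicates while keeping the order in which ids were found
all_ids = list(dict.fromkeys(collect_title_ids("/col/1")))
print(all_ids)
```

Duplicate ids do not affect the final dataset in this notebook, because the titles are selected with `isin()`, but removing them keeps the intermediate list easier to inspect.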

## Get title urls from a Pandora collection

In [327]:
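def get_titles_by_collection(coll, include_subcollections=True):
    """
    Get the title records linked to a collection, optionally
    including titles from its subcollections.
    """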
    title_ids = get_title_ids_in_collection(
        coll["id"], include_subcollections=include_subcollections
    )
    titles = dft.loc[dft["tep_id"].isin(title_ids)]
    return titles


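def create_collection_dataset(id, include_subcollections=False):
    """
    Save the titles linked to a collection as a CSV file with
    RO-Crate metadata, and display a download link.
    """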
    start_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    coll = dfc.loc[dfc["id"] == id].iloc[0]
    df = get_titles_by_collection(
        coll,
        include_subcollections=include_subcollections,
    )
    end_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    coll_slug = slugify(f"pandora-{id}-{coll['name']}")
    
    output_path = Path("datasets", coll_slug)
    output_path.mkdir(exist_ok=True, parents=True)
    output_file = Path(output_path, f"pandora-{coll_slug}.csv")
    df.to_csv(output_file, index=False)
    create_rocrate(coll, output_file, start_date, end_date)
    display(
        HTML(
            f"Download dataset: <a href='datasets/{coll_slug}.zip' download>datasets/{coll_slug}.zip</a>"
        )
    )

In [328]:
create_collection_dataset("/col/21326", include_subcollections=True)

----

Created by [Tim Sherratt](https://timsherratt.au/) for the [GLAM Workbench](https://glam-workbench.net/).