# Accessing the NLNZ Web archive dataset

This notebook includes the following sections:
1. Query web archive data using Memento
2. Query web archive data using CDX API
3. Access CDX and WARC files
4. Extract metadata (URLs, timestamps, MIME types)


## Install required python packages

In [None]:
# Install pre-requisites
# Quote the version specifiers so the shell does not treat ">" as output redirection
!pip -q install "warcio>=1.7.4" validators "boto3>=1.40.26" s3fs bs4 wordcloud
!pip -q install selenium chromedriver-autoinstaller # for webpage screenshots

In [None]:
# Install wa_nlnz_toolkit
!pip -q install -i https://test.pypi.org/simple/ wa-nlnz-toolkit==0.2.1

## Query web archive data using Memento

The **Memento protocol** makes it easier to find and use archived versions of web pages, even if other APIs aren't available. This gives us machine-readable information about web captures.

In the following section, we'll see how the NLNZ web archive supports the Memento protocol. Specifically, we'll look at three main features:
- TimeGate - get the version of a page closest to a date you choose.
- TimeMap - see all archived versions of a page.
- Memento - change how an archived page is shown using special URL options.

In [None]:
import wa_nlnz_toolkit as want

In [None]:
webpage = "www.natlib.govt.nz"

# default query - get latest capture
dict(want.query_memento(webpage).headers)

In [None]:
# or get a tidied-up version
want.get_memento_urls(webpage)

The *link* field contains the Memento information. In this case, it contains four link types:

- **original**: the URL that was archived (e.g., https://covid19.govt.nz/)
- **timegate**: the archive URL that resolves to the capture closest to a requested datetime (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/https://covid19.govt.nz/)
- **timemap**: the list of all available captures over time (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/https://covid19.govt.nz/)
- **memento**: the URL of a specific archived version of the webpage (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/20250728214105mp_/https://covid19.govt.nz/)
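
To make the structure of the *link* field concrete, here is a minimal sketch that splits an HTTP `Link` header into its rel types using only the standard library. The function name `parse_link_header` and the sample header are assumptions for illustration: the sample is assembled from the example URLs above, not copied from a real response, and the parser only handles quoted attribute values.

```python
import re

def parse_link_header(value):
    """Map each rel type in an HTTP Link header to its URL and attributes."""
    links = {}
    # Each entry looks like: <URL>; rel="..."; datetime="..."
    entry_pattern = r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)'
    for url, params in re.findall(entry_pattern, value):
        attrs = dict(re.findall(r'([\w-]+)="([^"]*)"', params))
        # A single entry may carry several space-separated rel values
        for rel in attrs.pop("rel", "").split():
            links[rel] = {"url": url, **attrs}
    return links

# Illustrative header built from the example URLs in this notebook
base = "https://ndhadeliver.natlib.govt.nz/webarchive"
sample = (
    '<https://covid19.govt.nz/>; rel="original", '
    f'<{base}/https://covid19.govt.nz/>; rel="timegate", '
    f'<{base}/timemap/link/https://covid19.govt.nz/>; rel="timemap", '
    f'<{base}/20250728214105mp_/https://covid19.govt.nz/>; rel="memento"; '
    'datetime="Mon, 28 Jul 2025 21:41:05 GMT"'
)

links = parse_link_header(sample)
print(links["memento"]["url"])
```

In practice the `links` attribute of a `requests` response offers a similar parsed view, so a hand-written parser like this is mainly useful for understanding the header format.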

By default, the *memento* link points to the latest capture. If a specific datetime is provided, it points to the capture closest in time to that datetime. An example is shown below.

In [None]:
import datetime


# query for a capture closest to a given datetime
dt_required = datetime.datetime(2020, 1, 1, 0, 0, 0)
dict(want.query_memento(webpage, dt=dt_required).headers)

In [None]:
# or get the tidied-up version
want.get_memento_urls(webpage, dt=dt_required)
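
In the Memento protocol (RFC 7089), a client asks a TimeGate for a particular point in time by sending an `Accept-Datetime` request header in HTTP date format. The sketch below shows how such a header could be built from a Python `datetime`. The helper name `accept_datetime_header` is hypothetical, and whether `want.query_memento` builds its request this way is an assumption, not something the toolkit documentation confirms here.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(dt):
    """Build an Accept-Datetime header dict for a TimeGate request."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    # HTTP dates are always expressed in GMT
    http_date = format_datetime(dt.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}

headers = accept_datetime_header(datetime(2020, 1, 1, 0, 0, 0))
print(headers)
```

A dictionary like this could be passed as the `headers` argument of an HTTP client request to the *timegate* URL.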

In [None]:
want.query_memento("www.niwa.co.nz").links

### Get full list of captures from _timemap_

A Memento TimeMap provides a list of captures for a given webpage. It is available from Pywb (NLNZ selective web archive) and OpenWayback systems. For Pywb, three formats are supported: link, cdxj, and json.

The example below shows a TimeMap for the given webpage from the NLNZ selective web archive.

In [None]:
webpage = "www.natlib.govt.nz"

want.get_timemap(webpage)

Note that the load_url field contains the URL used internally by Pywb, so it cannot be used directly to access a specific archived version of the page.
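
As a rough illustration of how the three TimeMap formats could be requested, the sketch below builds TimeMap URLs by following the pattern of the *timemap* link shown earlier. The helper `timemap_url` is hypothetical, and the assumption that the `cdxj` and `json` formats use the same path layout as `link` comes from Pywb's general URL scheme and has not been checked against the NLNZ deployment.

```python
WEBARCHIVE_BASE = "https://ndhadeliver.natlib.govt.nz/webarchive"
TIMEMAP_FORMATS = ("link", "cdxj", "json")

def timemap_url(original_url, fmt="link"):
    """Return the TimeMap URL for a page in the requested format."""
    if fmt not in TIMEMAP_FORMATS:
        raise ValueError(f"Unsupported TimeMap format: {fmt!r}")
    return f"{WEBARCHIVE_BASE}/timemap/{fmt}/{original_url}"

for fmt in TIMEMAP_FORMATS:
    print(timemap_url("https://covid19.govt.nz/", fmt))
```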


Memento URLs also support changing how a capture is presented by adding a modifier directly after the timestamp in the URL. For example:

- **mp_** modifier: indicates "main page" content replay.
- **id_** modifier: returns the original harvested version of the webpage.
- **if_** modifier: returns the view with web archive headers (default for NLNZ web archive).

For more information, check https://pywb.readthedocs.io/en/latest/manual/rewriter.html?highlight=id_#url-rewriting
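
The sketch below shows one way to switch the modifier on an existing memento URL, for example to go from the `mp_` replay view to the `id_` original content. The helper `swap_modifier` is a hypothetical name, and it assumes the URL contains a 14-digit timestamp optionally followed by a two-letter modifier, as in the memento example earlier in this notebook.

```python
import re

def swap_modifier(memento_url, modifier):
    """Replace (or add) the replay modifier that follows the 14-digit timestamp."""
    # Matches "/<timestamp><optional modifier>/", e.g. "/20250728214105mp_/"
    pattern = r"/(\d{14})(?:[a-z]{2}_)?/"
    return re.sub(pattern, rf"/\g<1>{modifier}/", memento_url, count=1)

memento = (
    "https://ndhadeliver.natlib.govt.nz/webarchive/"
    "20250728214105mp_/https://covid19.govt.nz/"
)
raw_url = swap_modifier(memento, "id_")
print(raw_url)
```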

## Query web archive data using CDX API

Because our OutbackCDX server is only accessible internally, the following CDX API queries are redirected by Pywb to the OutbackCDX server. As a result, some native CDX query parameters are not supported, such as setting the CDX output format.

In [None]:
webpage = "www.natlib.govt.nz"

df_captures = want.query_cdx_index(webpage)
df_captures

Note that the query results above are the same as the TimeMap. However, our function adds an "access_url" column, which contains the actual URL for each webpage snapshot.

In [None]:
# We can also query the CDX index for other types of files, such as images and videos.
# However, due to the architecture design, we cannot do a fuzzy query for these types of files.
# Instead, the query needs to start at least from the first-level subdomain.
webpage = "covid19.govt.nz/assets/"

df_captures = want.query_cdx_index(webpage, filter="mimetype:application/pdf", matchType="prefix")
df_captures["original_file_name"] = df_captures["urlkey"].str.split("/").str[-1]
df_captures
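
For readers curious about what such a query might look like on the wire, the sketch below encodes the same parameters (`url`, `matchType`, `filter`) into a query string with the standard library. The helper `cdx_query_string` is hypothetical, the endpoint path is left out because this notebook does not state it, and whether `want.query_cdx_index` encodes its parameters exactly like this is an assumption.

```python
from urllib.parse import urlencode

def cdx_query_string(url, **params):
    """Encode a CDX query (URL plus optional parameters) as a query string."""
    return urlencode({"url": url, **params})

query = cdx_query_string(
    "covid19.govt.nz/assets/",
    matchType="prefix",                 # match every URL under this path
    filter="mimetype:application/pdf",  # keep only PDF captures
)
print(query)
```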

> HANDS-ON: Query the CDX index for all PNG files from the given webpage.

In [None]:
# webpage = "covid19.govt.nz/assets/"

# df_captures = want.query_cdx_index(webpage, filter="mimetype:image/png", matchType="prefix")
# df_captures["original_file_name"] = df_captures["urlkey"].str.split("/").str[-1]
# df_captures

## Access CDX and WARC files

In the following section, we will access real WARC files and their corresponding CDX files selected from the NLNZ web archive dataset.

In [None]:
bucket_name = "ndha-public-data-ap-southeast-2"
folder_prefix = "iPRES-2025"

want.list_s3_files(bucket_name, folder_prefix)

Let's have a look at the CDX file first.

Here we have followed the standard 11-field format as described in the [CDX documentation](https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/).

These fields consist of the following:

1. N: massaged url
2. b: date
3. a: original url
4. m: mime type of original document
5. s: response code
6. k: new style checksum
7. r: redirect
8. M: meta tags 
9. S: compressed record size
10. V: compressed payload offset 
11. g: file name
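
To show how these eleven fields line up with a record, here is a minimal sketch that parses a single space-separated CDX line into a dictionary. The helper `parse_cdx_line` is a hypothetical name, and the sample line is made up for illustration (it is not taken from the dataset). The sketch also assumes a 14-digit timestamp.

```python
from datetime import datetime

CDX_FIELDS = ["N", "b", "a", "m", "s", "k", "r", "M", "S", "V", "g"]

def parse_cdx_line(line):
    """Split one 11-field CDX line into a dict keyed by field letter."""
    values = line.rstrip("\n").split(" ")
    if len(values) != len(CDX_FIELDS):
        raise ValueError(f"Expected {len(CDX_FIELDS)} fields, got {len(values)}")
    record = dict(zip(CDX_FIELDS, values))
    # Convert the capture date and the numeric fields to richer types
    record["b"] = datetime.strptime(record["b"], "%Y%m%d%H%M%S")
    record["S"] = int(record["S"])
    record["V"] = int(record["V"])
    return record

# Made-up sample record for illustration only
sample_line = (
    "nz,govt,natlib)/ 20231212233435 https://www.natlib.govt.nz/ "
    "text/html 200 EXAMPLECHECKSUM - - 1234 5678 example.warc.gz"
)
record = parse_cdx_line(sample_line)
print(record["a"], record["m"], record["b"].isoformat())
```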

The following cell reads the CDX index data into a pandas DataFrame.

In [None]:
import pandas as pd


object_key = 'iPRES-2025/test/2023-12-14_IE89493927/IE89493927.cdx'
# header=None keeps the first record as data instead of using it for column names
df = pd.read_csv(f"s3://{bucket_name}/{object_key}", sep=" ", skiprows=1, header=None,
                 names=['N', 'b', 'a', 'm', 's', 'k', 'r', 'M', 'S', 'V', 'g'])
df.head()

Using the information from the CDX file, we can extract a specific payload from the WARC file.
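
An offset is enough to locate a record because `.warc.gz` files are conventionally written as a series of independent gzip members, one per record, so a reader can seek to the offset and decompress just that member. The sketch below demonstrates the idea on an in-memory stand-in using only the standard library. The helper `read_gzip_member` is hypothetical, and this is not necessarily how `want.extract_payload` is implemented.

```python
import gzip
import io
import zlib

def read_gzip_member(stream, offset):
    """Decompress the single gzip member that starts at `offset`."""
    stream.seek(offset)
    decompressor = zlib.decompressobj(wbits=31)  # 31 = expect a gzip wrapper
    chunks = []
    while not decompressor.eof:
        block = stream.read(4096)
        if not block:
            break  # ran out of data before the member ended
        chunks.append(decompressor.decompress(block))
    return b"".join(chunks)

# Stand-in for a .warc.gz file: two records compressed as separate members
first = gzip.compress(b"first record")
second = gzip.compress(b"second record")
archive = io.BytesIO(first + second)

# The second member starts right after the first one ends
payload = read_gzip_member(archive, offset=len(first))
print(payload)
```

With a real WARC file, the decompressed bytes would still contain the WARC and HTTP headers ahead of the payload, which a library such as `warcio` can parse.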

In [None]:
html_payload = want.extract_payload("s3://ndha-public-data-ap-southeast-2/iPRES-2025/test/2023-12-14_IE89493927/FL89493929_NLNZ-20231212233435565-00000-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz",offset=3126252)

After extracting the payload, we can use the BeautifulSoup module to parse it and extract the text content.

In [None]:
from bs4 import BeautifulSoup

# Parse HTML
soup = BeautifulSoup(html_payload, "html.parser")

# Get all <p> elements as separate paragraphs
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

for para in paragraphs:
    print(para)

In [None]:
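# The parsing steps above are wrapped in `want.extract_content_html()`.
# Note: `find_warc_file_path`, `warc_file`, and `warc_offset` are not defined in this
# notebook; define them (or substitute your own WARC path and offset) before running.
# e.g.,
html_payload = want.extract_payload(find_warc_file_path(warc_file), warc_offset)
content = want.extract_content_html(html_payload)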