# Accessing the NLNZ Web Archive Dataset

## Overview

This notebook demonstrates how to access and query web archive data from the National Library of New Zealand (NLNZ). It covers three access routes: the Memento protocol, the CDX API, and direct reads of CDX and WARC files.

### Learning Objectives

1. Query web archive data using the Memento protocol
2. Access archived content using the CDX API
3. Work directly with CDX and WARC files
4. Extract and analyze metadata from archived web content

This notebook serves as an introduction to web archive access methods that will be built upon in subsequent notebooks.

## Environment Setup

### Installing Required Python Packages

The following packages are necessary for working with web archives:

In [None]:
# Install core dependencies for web archive processing
# Version specifiers are quoted so the shell does not treat ">" as output redirection
!pip -q install "warcio>=1.7.4" validators "boto3>=1.40.26" s3fs bs4 wordcloud

# Install packages for webpage screenshots (optional visualization)
!pip -q install selenium chromedriver-autoinstaller

In [None]:
# Install the NLNZ Web Archive Toolkit
!pip -q install -i https://test.pypi.org/simple/ wa-nlnz-toolkit==0.2.1

In [None]:
# Import the NLNZ Web Archive Toolkit and supporting libraries
import wa_nlnz_toolkit as want
import pandas as pd
import datetime
from bs4 import BeautifulSoup

## 1. Querying Web Archives with the Memento Protocol

### Introduction to Memento

The **Memento protocol** provides a standardized way to access archived versions of web pages across different web archives. It offers machine-readable information about web captures and simplifies the process of finding historical versions of web content.

### Key Memento Components

The NLNZ web archive supports three main Memento features:

1. **TimeGate** - Retrieves the version of a page closest to a specified date
2. **TimeMap** - Provides a list of all archived versions of a page
3. **Memento** - Represents a specific archived version with options to control presentation

Let's explore how to use these features with the NLNZ Web Archive Toolkit.

### Basic Memento Queries

We'll start by querying the latest capture of a website using the Memento protocol:

In [None]:
# Define target website
webpage = "www.natlib.govt.nz"

# Query the latest capture using Memento
# This returns the raw response headers
dict(want.query_memento(webpage).headers)

In [None]:
# Get a more structured representation of Memento URLs
want.get_memento_urls(webpage)

### Understanding Memento Link Types

The Memento response contains several important link types:

- **original**: The original URL that was archived (e.g., https://covid19.govt.nz/)
- **timegate**: The URL used to request archived versions (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/https://covid19.govt.nz/)
- **timemap**: URL that lists all available captures (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/https://covid19.govt.nz/)
- **memento**: URL of the specific archived version (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/20250728214105mp_/https://covid19.govt.nz/)

By default, the *memento* link points to the latest capture. We can also request a capture closest to a specific date:

In [None]:
# Query for a capture closest to January 1, 2020
dt_required = datetime.datetime(2020, 1, 1, 0, 0, 0)
dict(want.query_memento(webpage, dt=dt_required).headers)

In [None]:
# Get structured Memento URLs for a specific date
want.get_memento_urls(webpage, dt=dt_required)
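
Under the Memento protocol (RFC 7089), a date-specific request is made by sending an `Accept-Datetime` header to the TimeGate. The sketch below only builds the URL and header, without contacting the archive. It assumes the TimeGate base URL shown in the example links in this notebook, and it treats naive datetimes as UTC; `build_timegate_request` is a hypothetical helper, not part of the toolkit.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: TimeGate base taken from the example links in this notebook
TIMEGATE_BASE = "https://ndhadeliver.natlib.govt.nz/webarchive/"


def build_timegate_request(original_url, dt):
    """Return (timegate_url, headers) for a datetime-negotiated request."""
    # Assumption: naive datetimes are interpreted as UTC
    if dt.tzinfo is None:
        dt_utc = dt.replace(tzinfo=timezone.utc)
    else:
        dt_utc = dt.astimezone(timezone.utc)
    # Accept-Datetime uses the HTTP-date format and must be expressed in GMT
    headers = {"Accept-Datetime": format_datetime(dt_utc, usegmt=True)}
    return TIMEGATE_BASE + original_url, headers


url, headers = build_timegate_request("https://www.natlib.govt.nz/", datetime(2020, 1, 1))
print(url)
print(headers)
```

The returned pair could then be passed to an HTTP client, for example `requests.get(url, headers=headers)`.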

In [None]:
# Query another website to see its Memento links
want.query_memento("www.niwa.co.nz").links
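
Memento servers deliver the link types described earlier in a single HTTP `Link` header. As a minimal sketch of how such a header can be split into its relations, the parser below uses only the standard library. The sample header is hand-written from the example URLs in this notebook (it is not a recorded server response), and `parse_link_header` is a hypothetical helper rather than a toolkit function.

```python
import re

# One link looks like: <url>; rel="memento"; datetime="..."
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Map each rel value to a dict holding its URL and other parameters."""
    links = {}
    for url, raw_params in LINK_RE.findall(value):
        params = dict(PARAM_RE.findall(raw_params))
        extras = {k: v for k, v in params.items() if k != "rel"}
        # A single link may carry several rel values, e.g. rel="first memento"
        for rel in params.get("rel", "").split():
            links[rel] = {"url": url, **extras}
    return links


# Hand-written sample, assembled from the example URLs in this notebook
sample_header = (
    '<https://covid19.govt.nz/>; rel="original", '
    '<https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/https://covid19.govt.nz/>; '
    'rel="timemap"; type="application/link-format", '
    '<https://ndhadeliver.natlib.govt.nz/webarchive/20250728214105mp_/https://covid19.govt.nz/>; '
    'rel="memento"; datetime="Mon, 28 Jul 2025 21:41:05 GMT"'
)

links = parse_link_header(sample_header)
for rel, info in links.items():
    print(rel, info["url"])
```

A regular expression is used instead of splitting on commas because the `datetime` parameter itself contains a comma.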

### Retrieving Complete Capture History with TimeMap

The Memento TimeMap provides a comprehensive list of all captures for a given webpage. The NLNZ web archive supports multiple TimeMap formats (link, cdxj, and json).

Let's retrieve the TimeMap for our example website:

In [None]:
# Get the TimeMap for the National Library website
webpage = "www.natlib.govt.nz"
want.get_timemap(webpage)
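
The three TimeMap formats differ only in one path segment of the request URL. The sketch below builds such URLs by following the pattern of the `timemap/link/` example shown earlier. It assumes the `cdxj` and `json` formats use the same pattern, which matches PyWB's usual layout but is not confirmed here for the NLNZ service; `timemap_url` is a hypothetical helper.

```python
# Assumption: base URL taken from the example links in this notebook
ARCHIVE_BASE = "https://ndhadeliver.natlib.govt.nz/webarchive/"
TIMEMAP_FORMATS = ("link", "cdxj", "json")


def timemap_url(original_url, fmt="link"):
    """Build a TimeMap URL for one of the supported formats."""
    if fmt not in TIMEMAP_FORMATS:
        raise ValueError(f"Unsupported TimeMap format: {fmt!r}")
    return f"{ARCHIVE_BASE}timemap/{fmt}/{original_url}"


for fmt in TIMEMAP_FORMATS:
    print(timemap_url("https://covid19.govt.nz/", fmt))
```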

### URL Modifiers in Memento

The `load_url` field in the TimeMap contains URLs used internally by PyWB (the web archive replay system). These cannot be directly accessed.

PyWB supports special URL modifiers, appended to the 14-digit timestamp in a memento URL, that control how archived content is presented:

- **mp_** modifier: Shows "main page" content only
- **id_** modifier: Returns the original harvested version without rewriting
- **if_** modifier: Shows the page with web archive headers (default for NLNZ web archive)

For more details on URL rewriting options, see the [PyWB documentation](https://pywb.readthedocs.io/en/latest/manual/rewriter.html?highlight=id_#url-rewriting).
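
Because the modifier sits directly after the timestamp, switching between presentations is a matter of rewriting one part of the memento URL. The following is a minimal sketch, assuming memento URLs follow the `<base>/<14-digit timestamp><modifier>/<original URL>` shape of the example shown earlier; `with_modifier` is a hypothetical helper, not a toolkit function.

```python
import re
from datetime import datetime

# <base>/<14-digit timestamp><optional two-letter modifier + "_">/<original URL>
MEMENTO_RE = re.compile(r"^(?P<base>.*?/)(?P<ts>\d{14})(?P<mod>[a-z]{2}_)?/(?P<orig>.+)$")


def with_modifier(memento_url, modifier):
    """Return the same capture URL with a different modifier (e.g. 'id_')."""
    match = MEMENTO_RE.match(memento_url)
    if match is None:
        raise ValueError(f"Not a recognised memento URL: {memento_url}")
    return f"{match['base']}{match['ts']}{modifier}/{match['orig']}"


def capture_time(memento_url):
    """Parse the 14-digit timestamp of a memento URL into a datetime."""
    match = MEMENTO_RE.match(memento_url)
    if match is None:
        raise ValueError(f"Not a recognised memento URL: {memento_url}")
    return datetime.strptime(match["ts"], "%Y%m%d%H%M%S")


example = "https://ndhadeliver.natlib.govt.nz/webarchive/20250728214105mp_/https://covid19.govt.nz/"
print(with_modifier(example, "id_"))  # same capture, without rewriting
print(capture_time(example))
```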

## 2. Querying Web Archives with the CDX API

The CDX (Capture inDeX) API provides a more direct way to query web archive metadata. It allows for more specific filtering and returns structured data about archived captures.

### Note on NLNZ Implementation

In the NLNZ web archive, CDX API queries are redirected through PyWB to the OutbackCDX server. This means some native CDX query parameters (like output format) are not supported.

In [None]:
# Query the CDX index for the National Library website
webpage = "www.natlib.govt.nz"
df_captures = want.query_cdx_index(webpage)
df_captures

### CDX vs TimeMap

The CDX query results are similar to the TimeMap, but our toolkit adds an `access_url` column that contains the actual URL for accessing each webpage snapshot. This makes it easier to view or analyze specific captures.

### Advanced CDX Queries

The CDX API allows for more specific queries, such as filtering by MIME type or using prefix matching. This is particularly useful for finding non-HTML content like images or documents.

In [None]:
# Query for PDF files in a specific section of a website
# Note: Due to architecture limitations, we need to specify at least the first-level subdomain
webpage = "covid19.govt.nz/assets/"

# Filter for PDF files using the MIME type filter
df_captures = want.query_cdx_index(webpage, filter="mimetype:application/pdf", matchType="prefix")

# Extract original filenames from the URLs
df_captures["original_file_name"] = df_captures["urlkey"].str.split("/").str[-1]
df_captures
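
The `filter` and `matchType` arguments correspond to CDX query parameters. As a rough sketch of how such a query is encoded for HTTP, the snippet below builds the query string with the standard library. It assumes the parameters are forwarded to the server unchanged, which the toolkit may or may not do, and it deliberately leaves out the endpoint URL because this notebook does not state it; `cdx_query_string` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode


def cdx_query_string(url, **params):
    """Encode a CDX query (URL plus optional parameters) as a query string."""
    return urlencode({"url": url, **params})


query = cdx_query_string(
    "covid19.govt.nz/assets/",
    matchType="prefix",                  # match every URL starting with this prefix
    filter="mimetype:application/pdf",   # keep only PDF captures
)
print(query)

# Decoding the string recovers the original parameters
print(parse_qs(query))
```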

### Hands-On Exercise: Querying for Image Files

Try querying the CDX index for PNG image files from the same website section. Uncomment and complete the code below:

In [None]:
# Exercise: Query for PNG files
# webpage = "covid19.govt.nz/assets/"

# df_captures = want.query_cdx_index(webpage, filter="mimetype:image/png", matchType="prefix")
# df_captures["original_file_name"] = df_captures["urlkey"].str.split("/").str[-1]
# df_captures

## 3. Working with WARC Files

WARC (Web ARChive) files are the standard format for storing web archives. They contain the actual content of archived web pages along with metadata. In this section, we'll explore how to access and extract data from WARC files in the NLNZ collection.

In [None]:
# Define S3 bucket and folder containing the archive data
bucket_name = "ndha-public-data-ap-southeast-2"
folder_prefix = "iPRES-2025"

# List available files in the S3 bucket
want.list_s3_files(bucket_name, folder_prefix)

### Understanding CDX Files

CDX (Capture inDeX) files serve as indexes for WARC files, making it easier to locate specific content. They follow a standardized format described in the [CDX documentation](https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/).

The standard 11-field CDX format includes:

1. **N**: Normalized/massaged URL
2. **b**: Capture date (timestamp)
3. **a**: Original URL
4. **m**: MIME type of the document
5. **s**: HTTP response code
6. **k**: Content checksum
7. **r**: Redirect URL
8. **M**: Meta tags
9. **S**: Compressed record size
10. **V**: Compressed record offset (byte position of the record within the WARC file)
11. **g**: Source WARC filename

Let's load a CDX file and examine its structure:

In [None]:
# Load a CDX file into a pandas DataFrame
object_key = 'iPRES-2025/sample-data/covid19.govt.nz/2021-08-10_IE75130285/IE75130285.cdx'

# Column names according to the CDX standard
cdx_columns = ['N', 'b', 'a', 'm', 's', 'k', 'r', 'M', 'S', 'V', 'g']

# Skip the CDX header line; header=None keeps the first capture as data
# instead of consuming it as column names
df = pd.read_csv(f"s3://{bucket_name}/{object_key}", sep=" ", skiprows=1, header=None, names=cdx_columns)

# Display the first few rows
df.head(10)
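
A CDX file normally starts with a header line that names its own fields (for example ` CDX N b a m s k r M S V g`), so the column letters can be read from the file rather than hard-coded. The sketch below parses CDX text with the standard library only. The sample capture line uses invented values for illustration, and `parse_cdx` is a hypothetical helper.

```python
def parse_cdx(lines):
    """Parse CDX lines into (field_letters, list_of_record_dicts)."""
    lines = iter(lines)
    header = next(lines).split()
    if not header or header[0] != "CDX":
        raise ValueError("Missing CDX header line")
    fields = header[1:]
    records = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        records.append(dict(zip(fields, line.split())))
    return fields, records


# Invented sample data, for illustration only
sample_cdx = [
    " CDX N b a m s k r M S V g",
    "nz,govt,example)/ 20210809041630 https://example.govt.nz/ text/html 200 "
    "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHH - - 1234 5678 example-00000.warc.gz",
]

fields, records = parse_cdx(sample_cdx)
print(fields)
print(records[0]["g"], records[0]["V"])  # WARC filename and record offset
```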

### Extracting Content from WARC Files

Using the information from the CDX file (particularly the filename and offset), we can extract specific content from WARC files:

In [None]:
# Extract HTML payload from a WARC file using a specific offset
html_payload = want.extract_payload(
    "s3://ndha-public-data-ap-southeast-2/iPRES-2025/sample-data/covid19.govt.nz/2021-08-10_IE75130285/FL75130287_NLNZ-20210809041626170-00000-22439~kaiwae-z4~8443.warc.gz",
    offset=2593631
)
html_payload
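
Offset-based access works because a `.warc.gz` file is conventionally written as a series of independently gzipped records, so a reader can seek to a record's byte offset and decompress only that record. The sketch below demonstrates the idea on a small in-memory stand-in built with the standard library. It does not show how `want.extract_payload` is implemented, and the record contents are placeholders rather than real WARC records.

```python
import gzip
import io
import zlib


def read_gzip_member(fileobj, offset, chunk_size=64 * 1024):
    """Decompress the single gzip member that starts at the given byte offset."""
    fileobj.seek(offset)
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    chunks = []
    while not decompressor.eof:
        block = fileobj.read(chunk_size)
        if not block:
            break
        chunks.append(decompressor.decompress(block))
    return b"".join(chunks)


# Placeholder records standing in for two WARC records
record_a = b"WARC/1.0\r\nWARC-Type: response\r\n\r\nfirst record"
record_b = b"WARC/1.0\r\nWARC-Type: response\r\n\r\nsecond record"

member_a = gzip.compress(record_a)
member_b = gzip.compress(record_b)
archive = io.BytesIO(member_a + member_b)

# The second record starts where the first compressed member ends
offset_b = len(member_a)
print(read_gzip_member(archive, offset_b))
```

In practice a library such as `warcio` would then parse the decompressed bytes into WARC headers, HTTP headers, and payload.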

## 4. Processing and Analyzing Web Archive Content

Once we've extracted content from WARC files, we can process and analyze it using various tools. For HTML content, BeautifulSoup is particularly useful for parsing and extracting text.

In [None]:
# Parse the HTML payload using BeautifulSoup
soup = BeautifulSoup(html_payload, "html.parser")

# Extract text from all paragraph elements
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

# Print each paragraph
for para in paragraphs:
    print(para)
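
Once the paragraph text is available as a list of strings, a simple first analysis is a word-frequency count. The sketch below uses only the standard library. The sample paragraphs are invented for illustration, and `top_words` is a hypothetical helper rather than a toolkit function.

```python
import re
from collections import Counter


def top_words(paragraphs, n=5, min_length=4):
    """Return the n most common words of at least min_length letters."""
    words = []
    for para in paragraphs:
        # Lowercase the text and keep alphabetic tokens only
        tokens = re.findall(r"[a-z']+", para.lower())
        words.extend(t for t in tokens if len(t) >= min_length)
    return Counter(words).most_common(n)


# Invented sample text, for illustration only
sample_paragraphs = [
    "Unite against COVID-19.",
    "Find the latest COVID-19 vaccine information.",
    "Vaccine bookings are open.",
]

for word, count in top_words(sample_paragraphs, n=3):
    print(word, count)
```

With real data, the `paragraphs` list from the cell above could be passed in place of `sample_paragraphs`.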

### Using the Toolkit's Built-in Functions

The NLNZ Web Archive Toolkit provides convenience functions that combine these steps. For example, `extract_content_html()` extracts text content from HTML payloads:

In [None]:
# Example of using the toolkit's built-in functions
# Note: This is a reference example and may not run as-is without proper context

# List the sample files, then define a helper that maps a WARC filename to its S3 path
# (similar to what we'll use in later notebooks)
bucket_name = "ndha-public-data-ap-southeast-2"
folder_prefix = "iPRES-2025/sample-data/covid19.govt.nz/"
all_files = want.list_s3_files(bucket_name, folder_prefix)

def find_warc_file_path(warc_file):
    """Find the full S3 path for a given WARC filename.
    
    Args:
        warc_file (str): The WARC filename to search for
        
    Returns:
        str: Full S3 path if found, None otherwise
    """
    for s3_file in all_files:  # all_files is the S3 listing built above
        if warc_file in s3_file:
            return f"s3://{bucket_name}/{s3_file}"
    return None

# Example usage of extract_content_html (reference only)
warc_file = "FL75130287_NLNZ-20210809041626170-00000-22439~kaiwae-z4~8443.warc.gz"
warc_offset = 2593631

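# Resolve the filename first so a missing file fails with a clear message
warc_file_path = find_warc_file_path(warc_file)
if warc_file_path is None:
    raise FileNotFoundError(f"{warc_file} not found under s3://{bucket_name}/{folder_prefix}")

html_payload = want.extract_payload(warc_file_path, offset=warc_offset)
content = want.extract_content_html(html_payload)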

## Conclusion and Next Steps

This notebook has introduced the fundamental methods for accessing and working with web archives from the National Library of New Zealand:

1. **Memento Protocol** - For standardized access to archived web content
2. **CDX API** - For querying and filtering archive metadata
3. **WARC Files** - For direct access to archived content
4. **Content Extraction** - For processing and analyzing archived web pages

### What's Next?

In the following notebooks, we'll build on these foundations to:

- Explore and analyze web archive data in more depth
- Track changes in websites over time
- Extract and analyze textual content at scale
- Build advanced applications using web archive data

These techniques provide powerful tools for researchers, historians, and data scientists working with web archives.