# Exploring the NLNZ Web Archive Dataset

## Overview

This notebook demonstrates how to explore and analyze web archive data from the National Library of New Zealand (NLNZ). It builds upon the access methods covered in the previous notebook and focuses on analyzing archived web content.

### Target Website
- **Website:** covid19.govt.nz (historical site, no longer active)
- **Dataset:** covid19.govt.nz crawls (2020–2023, WARC + CDX, HTML only)

### Analysis Sections
1. **Temporal Coverage Analysis** - Examining capture frequency over time
2. **Content Size Evolution** - Tracking how the size of captured HTML pages changed
3. **Structural/URL Analysis** - Analyzing URL patterns and website structure evolution
4. **Text Content Exploration** - Extracting and analyzing textual content from archives

> **NOTE:** This notebook builds on the same Python packages used in the previous notebook.

## Environment Setup

### Installing Required Python Packages

The following packages are required for web archive analysis:

In [None]:
# Install core dependencies for web archive processing
!pip -q install "warcio>=1.7.4" validators "boto3>=1.40.26" s3fs bs4 wordcloud

# Install packages for webpage screenshots (optional visualization)
!pip -q install selenium chromedriver-autoinstaller

In [None]:
# Install the NLNZ Web Archive Toolkit
!pip -q install -i https://test.pypi.org/simple/ wa-nlnz-toolkit==0.2.1

In [None]:
# Configure environment based on execution context
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Set output directory based on execution environment
res_folder = "/content" if IN_COLAB else "./"

In [None]:
# Import required libraries
import wa_nlnz_toolkit as want  # NLNZ Web Archive Toolkit
import pandas as pd             # Data manipulation
import os                       # File operations
import numpy as np              # Numerical operations
from tqdm import tqdm           # Progress bars
from collections import Counter # For word frequency analysis

# Define target website
webpage = "covid19.govt.nz"

## 1. Temporal Coverage Analysis

This section analyzes how frequently the website was captured over time. We'll use the CDX API to query captures of the homepage and analyze the temporal distribution of these captures.

### Querying the CDX Index

In [None]:
# Query the CDX index for all captures of the target website
df_captures = want.query_cdx_index(webpage)

# Filter for successful captures only (HTTP status code 200)
df_captures = df_captures[df_captures["status"] == "200"]

# Display information about the first capture
df_captures.iloc[0]

### Temporal Distribution Analysis

The data shows that the first capture of covid19.govt.nz was on March 18, 2020, coinciding with the early stages of the COVID-19 pandemic. The capture frequency peaked in April 2020 with over 60 captures, followed by a slight decrease in May 2020.

After this initial surge, the crawl frequency decreased significantly but maintained a relatively stable level of approximately 10-20 crawls per month until mid-2023. In late 2023, the frequency further decreased to about 5 crawls per month, likely reflecting the reduced importance of the site as the pandemic situation evolved.

In [None]:
# Visualize the monthly capture frequency
want.plot_monthly_captures(df_captures)
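
As a rough illustration of what sits behind a monthly capture chart, the counts can be derived from the 14-digit capture timestamps (`YYYYMMDDhhmmss`) that appear in archive URLs. The sketch below uses only the standard library and a handful of made-up timestamps; `monthly_counts` is a hypothetical helper and does not show how `want.plot_monthly_captures` is implemented.

```python
from collections import Counter
from datetime import datetime

# Hypothetical capture timestamps in the 14-digit format seen in archive URLs
timestamps = ["20200318051641", "20200402101500", "20200415230000", "20200501080000"]

def monthly_counts(stamps):
    """Count captures per calendar month, keyed as 'YYYY-MM'."""
    months = (datetime.strptime(s, "%Y%m%d%H%M%S").strftime("%Y-%m") for s in stamps)
    return dict(sorted(Counter(months).items()))

counts = monthly_counts(timestamps)
for month, n in counts.items():
    print(month, n)
```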

In [None]:
# Display access URLs for all captures
df_captures["access_url"]

### (Optional) Visual Comparison with Screenshots

To better understand how the website evolved over time, we can capture screenshots of the archived versions from different time periods.

In [None]:
# Install Chromium browser if running in Google Colab
if IN_COLAB:
    !apt-get update
    !apt-get install -y chromium-browser

In [None]:
# Capture screenshots of the earliest and latest archived versions
# First capture (March 2020)
want.screenshot_webpage("https://ndhadeliver.natlib.govt.nz/webarchive/20200318051641/https://covid19.govt.nz/", 
                        os.path.join(res_folder, "covid19_20200318051641.png"))

# Recent capture (February 2024)
want.screenshot_webpage("https://ndhadeliver.natlib.govt.nz/webarchive/20240211071903/https://covid19.govt.nz/", 
                        os.path.join(res_folder, "covid19_20240211071903.png"))
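
To keep each screenshot file name in step with the capture it shows, both the access URL and the file name can be derived from the same timestamp. This is a minimal sketch: the URL pattern is the one used in the cell above, while `build_access_url` and `screenshot_name` are hypothetical helpers, not part of the toolkit.

```python
import os

ACCESS_PREFIX = "https://ndhadeliver.natlib.govt.nz/webarchive"

def build_access_url(timestamp, original_url):
    """Build an archive access URL from a capture timestamp and the original URL."""
    return f"{ACCESS_PREFIX}/{timestamp}/{original_url}"

def screenshot_name(timestamp, label="covid19"):
    """Derive the screenshot file name from the same timestamp as the access URL."""
    return f"{label}_{timestamp}.png"

ts = "20200318051641"
url = build_access_url(ts, "https://covid19.govt.nz/")
png = os.path.join("./", screenshot_name(ts))
print(url)
print(png)
```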

## 2. Content Size Evolution

This section tracks how the size of the website's content changed over time. We'll analyze the pre-processed web archive data that contains captured HTML pages.

In [None]:
# Define S3 bucket and folder containing the archive data
bucket_name = "ndha-public-data-ap-southeast-2"
folder_prefix = "iPRES-2025/sample-data/covid19.govt.nz/"

# List all files in the S3 bucket folder
all_files = want.list_s3_files(bucket_name, folder_prefix)

# Filter for CDX index files, sorted so crawls are processed in a stable order
cdx_files = sorted(f for f in all_files if f.endswith(".cdx"))

In [None]:
# Display the total number of crawls in the dataset
print(f"In total, there are {len(cdx_files)} crawls in the sample dataset.")

In [None]:
# Extract datetime information from CDX filenames
list_dt = [cdx_file.split("/")[-2].split("_")[0] for cdx_file in cdx_files]

# Calculate total content size for each crawl
list_size = []
for cdx_file in cdx_files:
    # Load CDX file data
    df_cdx = want.load_cdx_file_from_s3(bucket_name, cdx_file)
    
    # Filter for valid HTML responses (status 200, MIME type text/html, no JSON fragments)
    df_cdx_valid_html = df_cdx[(df_cdx["m"] == "text/html") 
                                & (df_cdx["s"] == 200)
                                & (~df_cdx["a"].str.contains("%22"))]

    # Calculate total size in MB and append to list
    list_size.append(df_cdx_valid_html["S"].sum() / 1024 / 1024)
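
The same three-part filter recurs in several cells of this notebook, so it could be factored into one small function. The sketch below runs on made-up rows that reuse the column letters from the cell above (`a` URL, `m` MIME type, `s` status, `S` compressed size); `filter_valid_html` is a hypothetical helper, not a toolkit function.

```python
import pandas as pd

def filter_valid_html(df_cdx):
    """Keep rows that are HTML, returned HTTP 200, and have no encoded quote in the URL."""
    mask = (
        (df_cdx["m"] == "text/html")
        & (df_cdx["s"] == 200)
        & (~df_cdx["a"].str.contains("%22", regex=False))
    )
    return df_cdx[mask]

# Made-up CDX rows for illustration only
df_demo = pd.DataFrame({
    "a": ["https://covid19.govt.nz/", "https://covid19.govt.nz/x%22y",
          "https://covid19.govt.nz/logo.png", "https://covid19.govt.nz/old/"],
    "m": ["text/html", "text/html", "image/png", "text/html"],
    "s": [200, 200, 200, 404],
    "S": [1048576, 2048, 4096, 512],
})

valid = filter_valid_html(df_demo)
size_mb = valid["S"].sum() / 1024 / 1024  # bytes to MB, as in the cell above
print(len(valid), size_mb)
```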

In [None]:
# Create a DataFrame with datetime and size information
df_size = pd.DataFrame({"dt": pd.to_datetime(list_dt), "compressed payload size": list_size})
df_size.set_index("dt", inplace=True)
df_size.sort_index(inplace=True)

# Visualize content size evolution over time
df_size.plot(figsize=(12, 6), 
             title="Compressed Payload Size Evolution Over Time", 
             ylabel="Total Size (MB)", 
             xlabel="Date")

## 3. Structural/URL Analysis

This section explores how the website's structure evolved over time by analyzing URL patterns and diversity.

### Analysis Focus Areas
- URL diversity and growth over time
- Identification of core vs. peripheral URLs
- URL path structure analysis
- URL lifecycle (creation and removal of pages)
- Snapshot comparison across time periods

In [None]:
# Count unique URLs for each crawl
list_unique_urls = []
for cdx_file in cdx_files:
    # Load CDX file data
    df_cdx = want.load_cdx_file_from_s3(bucket_name, cdx_file)
    
    # Filter for valid HTML responses
    df_cdx_valid_html = df_cdx[(df_cdx["m"] == "text/html") 
                                & (df_cdx["s"] == 200) 
                                & (~df_cdx["a"].str.contains("%22"))]

    # Count unique URLs in this crawl
    list_unique_urls.append(len(df_cdx_valid_html["a"].unique()))

In [None]:
# Add unique URL counts to the size DataFrame
df_size["unique urls"] = list_unique_urls

# Visualize URL count evolution over time
df_size["unique urls"].plot(figsize=(12, 6), 
                            title="Evolution of Unique URLs Over Time", 
                            ylabel="Number of Unique URLs", 
                            xlabel="Date")
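
Assigning a plain list to a DataFrame that has already been sorted relies on the list being in the same order as the sorted index. One way to avoid that dependency is to collect one record per crawl and sort once at the end. The sketch below uses made-up values and hypothetical column names (`size_mb`, `unique_urls`).

```python
import pandas as pd

# Hypothetical per-crawl results, deliberately out of chronological order
records = [
    {"dt": "20210105", "size_mb": 3.2, "unique_urls": 410},
    {"dt": "20200318", "size_mb": 0.4, "unique_urls": 25},
    {"dt": "20200601", "size_mb": 1.9, "unique_urls": 230},
]

df_metrics = pd.DataFrame(records)
df_metrics["dt"] = pd.to_datetime(df_metrics["dt"], format="%Y%m%d")

# Sorting the whole frame keeps every metric attached to its own crawl date
df_metrics = df_metrics.set_index("dt").sort_index()
print(df_metrics)
```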

### Core vs. Peripheral URLs Analysis

To identify the most important pages on the website, we'll analyze which URLs appear most frequently across all crawls. This helps distinguish between core content (consistently present) and peripheral content (temporary or less important).

In [None]:
# Collect all unique URLs across all crawls
unique_urls = np.array([])
for cdx_file in cdx_files:
    # Load CDX file data
    df_cdx = want.load_cdx_file_from_s3(bucket_name, cdx_file)
    
    # Filter for valid HTML responses
    df_cdx_valid_html = df_cdx[(df_cdx["m"] == "text/html") 
                                & (df_cdx["s"] == 200) 
                                & (~df_cdx["a"].str.contains("%22"))]

    # Append unique URLs from this crawl
    unique_urls = np.append(unique_urls, df_cdx_valid_html["a"].unique())

# Create a DataFrame with all collected URLs
df_unique_urls = pd.DataFrame(unique_urls, columns=["url"])

In [None]:
# Export URL frequency counts to CSV
df_unique_urls.value_counts().to_csv(os.path.join(res_folder, "url_counts.csv"))
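
Per-URL counts like these can be turned into a core/peripheral label by comparing each count with the number of crawls. The sketch below works on made-up crawls; the 75% cut-off (`CORE_SHARE`) is an arbitrary assumption for illustration, not a threshold taken from the dataset.

```python
from collections import Counter

# Hypothetical crawls: each set holds the unique URLs seen in one crawl
crawls = [
    {"https://covid19.govt.nz/", "https://covid19.govt.nz/alert-levels/"},
    {"https://covid19.govt.nz/", "https://covid19.govt.nz/alert-levels/",
     "https://covid19.govt.nz/news/a/"},
    {"https://covid19.govt.nz/", "https://covid19.govt.nz/traffic-lights/"},
    {"https://covid19.govt.nz/", "https://covid19.govt.nz/alert-levels/"},
]

CORE_SHARE = 0.75  # assumed cut-off: present in at least 75% of crawls

# Number of crawls in which each URL appears
presence = Counter(url for crawl in crawls for url in crawl)
labels = {
    url: "core" if n / len(crawls) >= CORE_SHARE else "peripheral"
    for url, n in presence.items()
}
for url in sorted(labels):
    print(labels[url], presence[url], url)
```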

In [None]:
# Analyze URL structure by extracting first-level path components
# Drop duplicate URLs and URLs that carry a query string
df_unique_urls = df_unique_urls.drop_duplicates(subset=["url"])
df_unique_urls = df_unique_urls[~df_unique_urls["url"].str.contains("?", regex=False)]

# Extract the first-level path component (e.g., "about" from "https://covid19.govt.nz/about/...")
# URLs with no path (no trailing slash) yield an empty string instead of raising IndexError
df_unique_urls["level_1_path"] = df_unique_urls["url"].apply(
    lambda x: x.split("/")[3] if x.count("/") >= 3 else "")

# Export all first-level path components to CSV
df_unique_urls["level_1_path"].value_counts().to_csv(os.path.join(res_folder, "level_1_paths.csv"))

# Display the top 10 first-level path components (last expression, so the notebook renders it)
df_unique_urls["level_1_path"].value_counts().head(10)
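
Splitting on `/` by position depends on every URL having a scheme and a trailing path. A more defensive alternative, sketched here with the standard library's `urllib.parse.urlsplit`, parses the URL first and then takes the first non-empty path segment; `first_path_segment` is a hypothetical helper.

```python
from urllib.parse import urlsplit

def first_path_segment(url):
    """Return the first path segment of a URL, or '' for the site root."""
    segments = [part for part in urlsplit(url).path.split("/") if part]
    return segments[0] if segments else ""

examples = [
    "https://covid19.govt.nz/about/contact/",
    "https://covid19.govt.nz/",
    "https://covid19.govt.nz",
]
for u in examples:
    print(repr(first_path_segment(u)), u)
```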

### URL Lifecycle Analysis Exercise

The following exercise demonstrates how to track when a specific URL first appeared and when it was last seen in the archive. This helps understand the lifecycle of website sections.

> **Hands-on Exercise:** Complete the code below to find when a specific URL (e.g., "https://covid19.govt.nz/traffic-lights/") first appeared in the archive.

In [None]:
# Define the URL to search for
url_to_be_found = "https://covid19.govt.nz/traffic-lights/"

# Search through all CDX files for this URL
for cdx_file in cdx_files:
    # Load CDX file data
    df_cdx = want.load_cdx_file_from_s3(bucket_name, cdx_file)
    
    # Filter for valid HTML responses
    df_cdx_valid_html = df_cdx[(df_cdx["m"] == "text/html") 
                                & (df_cdx["s"] == 200) 
                                & (~df_cdx["a"].str.contains("%22"))]

    # SOLUTION (commented out)
    # # Filter for the target URL
    # df_cdx_valid_html = df_cdx_valid_html[df_cdx_valid_html["a"] == url_to_be_found]
    # if not df_cdx_valid_html.empty:
    #     # Print the timestamp when this URL was found
    #     print(df_cdx_valid_html["b"].tolist()[0])
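
Once the URLs of each crawl are known, the same idea extends from "first appeared" to a full lifecycle (first and last crawl containing the URL). The sketch below runs on a made-up mapping of crawl date to URLs; the dates are illustrative and are not results from the dataset, and `url_lifecycle` is a hypothetical helper.

```python
# Hypothetical mapping of crawl date -> URLs captured in that crawl
crawl_urls = {
    "20200318": {"https://covid19.govt.nz/"},
    "20211203": {"https://covid19.govt.nz/", "https://covid19.govt.nz/traffic-lights/"},
    "20220601": {"https://covid19.govt.nz/", "https://covid19.govt.nz/traffic-lights/"},
    "20230101": {"https://covid19.govt.nz/"},
}

def url_lifecycle(crawls, url):
    """Return (first_seen, last_seen) crawl dates for a URL, or None if never seen."""
    # YYYYMMDD strings sort chronologically, so a plain sort is enough
    seen = sorted(date for date, urls in crawls.items() if url in urls)
    return (seen[0], seen[-1]) if seen else None

print(url_lifecycle(crawl_urls, "https://covid19.govt.nz/traffic-lights/"))
```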


## 4. Text Content Exploration

This section focuses on extracting and analyzing the textual content from archived HTML pages. This allows us to track how the website's messaging evolved over time.

### Setting Up Text Extraction

In [None]:
# Define S3 bucket and folder containing the archive data
bucket_name = "ndha-public-data-ap-southeast-2"
folder_prefix = "iPRES-2025/sample-data/covid19.govt.nz/"

# List all files in the S3 bucket folder
all_files = want.list_s3_files(bucket_name, folder_prefix)

# Helper function to locate a specific WARC file in the S3 bucket
def find_warc_file_path(warc_file):
    """Find the full S3 path for a given WARC filename.
    
    Args:
        warc_file (str): The WARC filename to search for
        
    Returns:
        str: Full S3 path if found, None otherwise
    """
    for s3_file in all_files:
        if warc_file in s3_file:
            warc_file_path = f"s3://{bucket_name}/{s3_file}"
            return warc_file_path
    return None
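
Because the helper above scans the full file list for every record, lookups can get slow on large crawls. A possible alternative is to build a dictionary once and look names up directly. The sketch below assumes the CDX file-name field matches the object's base name exactly, which may not hold for every dataset; the keys and bucket name are made up, and `build_warc_index` is a hypothetical helper.

```python
import posixpath

def build_warc_index(s3_keys, bucket):
    """Map each WARC file name to its full s3:// path for constant-time lookup."""
    return {
        posixpath.basename(key): f"s3://{bucket}/{key}"
        for key in s3_keys
        if key.endswith((".warc", ".warc.gz"))
    }

# Hypothetical object keys; the real bucket layout may differ
keys = [
    "sample/crawl_a/crawl_a.cdx",
    "sample/crawl_a/part-0001.warc.gz",
    "sample/crawl_b/part-0002.warc.gz",
]
warc_index = build_warc_index(keys, "example-bucket")
print(warc_index.get("part-0002.warc.gz"))
print(warc_index.get("missing.warc.gz"))
```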

### Single Crawl Text Extraction

First, we'll extract text content from a single CDX file to demonstrate the process.

In [None]:
# Select the first CDX file for demonstration
cdx_file = cdx_files[0]

# Load CDX file data
df_cdx = want.load_cdx_file_from_s3(bucket_name, cdx_file)

# Extract date from filename
dt = cdx_file.split("/")[-2].split("_")[0]

# Initialize lists to store content and URLs
content_all = []
url_all = []

# Process each entry in the CDX file
for idx in tqdm(range(len(df_cdx)), desc=f"Extracting contents for {dt}"):
    # Get WARC file and offset information
    warc_file = df_cdx.iloc[idx]["g"]
    offset = int(df_cdx.iloc[idx]["V"])

    # Extract HTML payload and text content
    html_payload = want.extract_payload(find_warc_file_path(warc_file), offset)
    content = want.extract_content_html(html_payload)
    content_all.append(content)
    
    # Construct access URL for this capture
    url = "https://ndhadeliver.natlib.govt.nz/webarchive/{}/{}".format(df_cdx.iloc[idx]["b"], df_cdx.iloc[idx]["a"])
    url_all.append(url)

# Remove duplicate content
content_cleaned = []
url_cleaned = []
for content, url in zip(content_all, url_all):
    content_joined = " ".join(content)
    if content_joined not in content_cleaned:
        content_cleaned.append(content_joined)
        url_cleaned.append(url)

# Save cleaned content to text files (explicit UTF-8 so non-ASCII text is written consistently)
with open(os.path.join(res_folder, f"covid19_content_cleaned_{dt}.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(content_cleaned))
    
with open(os.path.join(res_folder, f"covid19_url_cleaned_{dt}.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(url_cleaned))
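
Checking membership in a list rescans the list for every page, which becomes slow as the number of pages grows. The sketch below shows the same first-seen de-duplication with a set for the membership test. It assumes, as the cell above suggests, that each page's content is a list of text fragments; `dedupe_pages` is a hypothetical helper and the sample data is made up.

```python
def dedupe_pages(contents, urls):
    """Drop pages whose joined text was already seen, keeping first-seen order."""
    seen = set()
    kept_text, kept_urls = [], []
    for parts, url in zip(contents, urls):
        text = " ".join(parts)
        if text not in seen:
            seen.add(text)
            kept_text.append(text)
            kept_urls.append(url)
    return kept_text, kept_urls

# Made-up extraction results: each page is a list of text fragments
pages = [["Unite", "against COVID-19"], ["Alert levels"], ["Unite", "against COVID-19"]]
links = ["u1", "u2", "u3"]
texts, kept = dedupe_pages(pages, links)
print(texts, kept)
```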

### Batch Processing All Crawls

Now we'll extend the text extraction process to all crawls in the dataset.

In [None]:
# Create directories for storing extracted content
content_dir = os.path.join(res_folder, "covid19_corpus/raw/content")
url_dir = os.path.join(res_folder, "covid19_corpus/raw/url")

# Ensure directories exist
os.makedirs(content_dir, exist_ok=True)
os.makedirs(url_dir, exist_ok=True)

In [None]:
# Process each CDX file
for cdx_file in cdx_files:
    # Load CDX file data
    df_cdx = want.load_cdx_file_from_s3(bucket_name, cdx_file)
    
    # Extract date from filename
    dt = cdx_file.split("/")[-2].split("_")[0]

    # Initialize lists to store content and URLs
    content_all = []
    url_all = []
    
    # Process each entry in the CDX file
    for idx in tqdm(range(len(df_cdx)), desc=f"Extracting contents for {dt}"):
        # Get WARC file and offset information
        warc_file = df_cdx.iloc[idx]["g"]
        offset = int(df_cdx.iloc[idx]["V"])

        # Extract HTML payload and text content
        html_payload = want.extract_payload(find_warc_file_path(warc_file), offset)
        content = want.extract_content_html(html_payload)
        content_all.append(content)
        
        # Construct access URL for this capture
        url = "https://ndhadeliver.natlib.govt.nz/webarchive/{}/{}".format(df_cdx.iloc[idx]["b"], df_cdx.iloc[idx]["a"])
        url_all.append(url)

        # Log URLs with no content extracted (covers empty lists and None)
        if not content:
            print(f"No content extracted from: {url}")

    # Remove duplicate content
    content_cleaned = []
    url_cleaned = []
    for content, url in zip(content_all, url_all):
        content_joined = " ".join(content)
        if content_joined not in content_cleaned:
            content_cleaned.append(content_joined)
            url_cleaned.append(url)

    # Save cleaned content to text files (explicit UTF-8 so non-ASCII text is written consistently)
    with open(os.path.join(content_dir, f"covid19_content_cleaned_{dt}.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(content_cleaned))
        
    with open(os.path.join(url_dir, f"covid19_url_cleaned_{dt}.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(url_cleaned))

### Text Analysis

Now that we have extracted text content, we can perform various analyses to understand the website's messaging.

In [None]:
# Calculate word frequency in the extracted content
# Note: content_cleaned holds the pages of the most recently processed crawl only
word_freq = Counter(" ".join(content_cleaned).split())

# Display the 10 most common words
word_freq.most_common(10)
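
Splitting on whitespace counts "Home", "home" and "home." as different words, and common function words tend to dominate the top of the list. A light normalisation step can help. The sketch below lowercases the text, keeps only runs of letters, and drops a tiny illustrative stopword list; both the list and the sample sentences are made up, and `word_frequencies` is a hypothetical helper.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "to", "of", "a", "in", "for", "is"}  # illustrative only

def word_frequencies(texts):
    """Count lowercase word tokens across texts, skipping a small stopword list."""
    # [^\W\d_]+ matches runs of letters, including accented ones such as macrons
    tokens = re.findall(r"[^\W\d_]+", " ".join(texts).lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

sample = ["Stay home. Stay safe.", "The Alert Level is 4; stay home and save lives."]
freq = word_frequencies(sample)
print(freq.most_common(3))
```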

### Visualization with Word Clouds

Word clouds provide a visual representation of the most frequent terms in the text content.

In [None]:
# Generate a word cloud from the cleaned text content
want.create_world_cloud(content_cleaned, os.path.join(res_folder, "covid19_wordcloud.png"))

### Temporal Content Analysis

Finally, we'll track how the website's content evolved over time by generating word clouds for the homepage at different points in time.

In [None]:
# Define the URL to analyze (homepage)
url_to_be_analysed = "https://covid19.govt.nz/"

# Process each CDX file
for cdx_file in cdx_files:
    # Load CDX file data
    df_cdx = want.load_cdx_file_from_s3(bucket_name, cdx_file)
    
    # Filter for valid HTML responses
    df_cdx_valid_html = df_cdx[(df_cdx["m"] == "text/html") 
                                & (df_cdx["s"] == 200)
                                & (~df_cdx["a"].str.contains("%22"))
                                ]

    # Filter for the homepage URL
    df_cdx_valid_html = df_cdx_valid_html[df_cdx_valid_html["a"] == url_to_be_analysed]
    
    # If the homepage was found in this crawl
    if not df_cdx_valid_html.empty:
        # Get WARC file and offset information (offset cast to int, as in the extraction cells)
        warc_file = df_cdx_valid_html.iloc[0]["g"]
        warc_offset = int(df_cdx_valid_html.iloc[0]["V"])

        # Extract HTML payload and text content
        html_payload = want.extract_payload(find_warc_file_path(warc_file), warc_offset)
        content = want.extract_content_html(html_payload)

        # Extract date from filename and generate word cloud
        dt = cdx_file.split("/")[-2].split("_")[0]
        want.create_world_cloud(content, os.path.join(res_folder, f"wordcloud_{dt}.png"))
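
Word clouds are compared by eye. For a more direct comparison of two snapshots, the word counts themselves can be subtracted to list the terms that gained the most. The sketch below uses made-up homepage text and a hypothetical helper, `rising_terms`; real pages would need the same text extraction as above first.

```python
from collections import Counter

def rising_terms(earlier, later, top_n=3):
    """Return the terms whose counts grew most between two snapshots."""
    before = Counter(earlier.lower().split())
    after = Counter(later.lower().split())
    gains = after - before  # Counter subtraction keeps only positive differences
    return gains.most_common(top_n)

# Made-up homepage text for two points in time
text_2020 = "alert level lockdown stay home alert level"
text_2022 = "traffic lights vaccine pass vaccine booster vaccine"
print(rising_terms(text_2020, text_2022))
```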


## Conclusion and Next Steps

This notebook has demonstrated various techniques for exploring and analyzing web archive data from the NLNZ collection, focusing on the covid19.govt.nz website. We've covered:

1. **Temporal analysis** - Understanding when and how frequently the website was captured
2. **Content size evolution** - Tracking how the website's size changed over time
3. **URL structure analysis** - Examining the website's organization and key sections
4. **Text content extraction and analysis** - Exploring the website's messaging through text analysis

### Potential Extensions

- **Sentiment analysis** of extracted text to track public messaging tone over time
- **Topic modeling** to identify key themes and how they evolved
- **Network analysis** of internal links to understand site structure
- **Image analysis** of extracted visual content
- **Comparative analysis** with other COVID-19 information websites

These techniques can be applied to other web archives to gain insights into how websites and their content evolve over time.