# Processing Web Archive Text Corpus

This notebook demonstrates how to work with text files extracted from web archive HTML snapshots. We'll cover the following steps:

1. Setting up the environment
2. Extracting text from web archive HTML
3. Cleaning and preprocessing the text data
4. Advanced analysis with embeddings and semantic search
5. Building a question-answering system with the corpus

**Purpose**: This notebook shows how to process and analyze text extracted from web archives, enabling researchers to gain insights from historical web content.

## 1. Setting up the environment

First, let's install the required packages for web archive processing, text analysis, and machine learning.

In [None]:
# Install core web archive processing packages
# Version specifiers are quoted so the shell does not treat ">" as an output redirect
!pip -q install "warcio>=1.7.4" validators "boto3>=1.40.26" s3fs bs4 wordcloud

# Install packages for web screenshots (optional)
!pip -q install selenium chromedriver-autoinstaller

# Install packages for embeddings and semantic search
!pip -q install sentence-transformers chromadb

# Install the NLNZ Web Archive Toolkit
!pip -q install -i https://test.pypi.org/simple/ wa-nlnz-toolkit==0.2.1

In [None]:
# Detect environment and set appropriate paths
# This allows the notebook to run in different environments (local or Colab)
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Set default result folder based on environment
if IN_COLAB:
    res_folder = "/content/sample_data"
else:
    res_folder = "./sample_data"

In [None]:
# Import required libraries
import os
import re
import pandas as pd
import matplotlib.pyplot as plt
import wa_nlnz_toolkit as want
from tqdm import tqdm  # Progress bar for long-running operations
from glob import glob  # File pattern matching
from collections import Counter  # For counting word frequencies

## 2. Extracting text from web archive HTML

Web archives store content in WARC (Web ARChive) files. Here we demonstrate how to extract HTML content from these files and convert it to plain text using the `wa_nlnz_toolkit` library.

The process involves two main steps:
1. Extract the HTML payload from a WARC file using `extract_payload`
2. Parse the HTML to extract meaningful text using `extract_content_html`

In [None]:
# Example: Extract HTML payload from a WARC file
# This demonstrates accessing a specific web page snapshot from an archive
warc_file = "s3://ndha-public-data-ap-southeast-2/iPRES-2025/sample-data/covid19.govt.nz/2023-12-14_IE89493927/FL89493929_NLNZ-20231212233435565-00000-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz"
offset = 3126252  # Byte offset where the record starts in the WARC file

# Extract the HTML payload
html_payload = want.extract_payload(warc_file, offset)

# Check if extraction was successful
if html_payload:
    print(f"Successfully extracted HTML payload of {len(html_payload)} bytes")
else:
    print("Failed to extract HTML payload")

In [None]:
# Extract text content from the HTML payload
# This converts raw HTML into readable text paragraphs
if html_payload:
    # Use the toolkit's function to extract content
    paragraphs = want.extract_content_html(html_payload)
    
    # Print the first few paragraphs as a sample
    print(f"Extracted {len(paragraphs)} paragraphs. Here are the first 5:")
    for i, para in enumerate(paragraphs[:5]):
        print(f"\n{i+1}. {para}")
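
The internals of `extract_content_html` are not shown in this notebook. As a rough illustration of what an HTML-to-paragraph step involves, the sketch below uses only Python's standard `html.parser`. The names `ParagraphExtractor` and `html_to_paragraphs` are hypothetical and are not part of `wa_nlnz_toolkit`; the toolkit may apply different rules.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements, ignoring <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []       # finished paragraphs
        self._buffer = []          # text pieces of the paragraph being read
        self._in_paragraph = False
        self._skip_depth = 0       # > 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            self._flush()          # a new <p> implicitly closes the previous one
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_paragraph and self._skip_depth == 0:
            self._buffer.append(data)

    def _flush(self):
        # Collapse whitespace and keep the paragraph only if it has text
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []
        self._in_paragraph = False


def html_to_paragraphs(html):
    """Return the non-empty <p> texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return parser.paragraphs


sample_html = """
<html><head><style>p { color: red; }</style></head>
<body>
  <p>Wear a <b>face mask</b> on public transport.</p>
  <script>console.log("ignored");</script>
  <p>   </p>
  <p>Stay home if you are sick.</p>
</body></html>
"""
print(html_to_paragraphs(sample_html))
```

Real pages usually need more than this (navigation menus, footers, and cookie banners also contain text), which is why a dedicated extraction function is preferable for the actual corpus.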

## 3. Cleaning and preprocessing the text data

Raw text extracted from web archives often contains duplicates, irrelevant content, and formatting issues. This section demonstrates how to clean and preprocess this data to make it more suitable for analysis.

The main preprocessing steps include:
1. Loading raw text files
2. Removing duplicates across snapshots
3. Maintaining the relationship between content and source URLs
4. Saving the cleaned data for further analysis

In [None]:
# Define paths for raw and cleaned data
raw_folder_path = os.path.join(res_folder, "covid19_corpus/raw/")
cleaned_folder_path = os.path.join(res_folder, "covid19_corpus/cleaned/")

# Create directory structure for cleaned data
# We maintain separate directories for content and URLs
cleaned_content_dir = os.path.join(cleaned_folder_path, "content")
cleaned_url_dir = os.path.join(cleaned_folder_path, "url")
os.makedirs(cleaned_content_dir, exist_ok=True)
os.makedirs(cleaned_url_dir, exist_ok=True)

# List all text files in the raw folder
raw_content_dir = os.path.join(raw_folder_path, "content")
raw_url_dir = os.path.join(raw_folder_path, "url")
raw_content_files = [f for f in os.listdir(raw_content_dir) if f.endswith(".txt")]
print(f"Found {len(raw_content_files)} raw text files")

In [None]:
# Helper function to extract date from filename
def extract_date(fname):
    """Extract date in YYYY-MM-DD format from a filename.
    
    Args:
        fname: Filename string containing a date
        
    Returns:
        Date string in YYYY-MM-DD format or None if not found
    """
    match = re.search(r'(\d{4}-\d{2}-\d{2})', fname)
    return match.group(1) if match else None


# Process all content files chronologically
content_files = sorted(glob(os.path.join(raw_content_dir, "covid19_content_cleaned_*.txt")))
seen_contents = set()  # Track unique content across all snapshots

# Lists to store data for later analysis
list_date_tag = []
list_unique_pages = []

# Process each content file and its corresponding URL file
for content_file in content_files:
    # Extract date from filename
    date_tag = extract_date(os.path.basename(content_file))
    url_file = os.path.join(raw_url_dir, f"covid19_url_cleaned_{date_tag}.txt")

    # Skip if URL file is missing
    if not os.path.exists(url_file):
        print(f"⚠️ URL file missing for {date_tag}, skipping.")
        continue

    print(f"Processing {date_tag} ...")

    # Read content and URL files
    with open(content_file, encoding="utf-8") as f:
        contents = [line.strip() for line in f]

    with open(url_file, encoding="utf-8") as f:
        urls = [line.strip() for line in f]

    # Verify content and URL files have matching line counts
    assert len(contents) == len(urls), f"Line count mismatch in {date_tag}"

    # Filter out duplicates while preserving content-URL relationship
    unique_contents = []
    unique_urls = []

    for content, url in zip(contents, urls):
        if content not in seen_contents:
            seen_contents.add(content)
            unique_contents.append(content)
            unique_urls.append(url)

    # Save deduplicated content and URLs
    out_content = os.path.join(cleaned_content_dir, f"covid19_content_deduped_{date_tag}.txt")
    out_url = os.path.join(cleaned_url_dir, f"covid19_url_deduped_{date_tag}.txt")

    with open(out_content, "w", encoding="utf-8") as f:
        for c in unique_contents:
            f.write(c + "\n")

    with open(out_url, "w", encoding="utf-8") as f:
        for u in unique_urls:
            f.write(u + "\n")

    # Store metrics for analysis
    list_unique_pages.append(len(unique_contents))
    list_date_tag.append(date_tag)

    print(f"✅ {date_tag}: kept {len(unique_contents)} unique pages (out of {len(contents)})")
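
The loop above keeps every unique line in memory as a full string and treats lines as duplicates only when they match exactly. A possible variant, sketched below on toy data, stores a fixed-size hash of whitespace-normalised, lower-cased text instead. This is an assumption about what might suit larger corpora, not the notebook's method; `content_key`, `dedupe_snapshots`, and the example URLs are hypothetical.

```python
import hashlib


def content_key(text):
    """Hash whitespace-normalised, lower-cased text into a fixed-size key."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def dedupe_snapshots(snapshots):
    """Keep only first-seen (content, url) pairs, processing dates in order.

    snapshots: dict mapping a YYYY-MM-DD date string to a list of
               (content, url) tuples for that snapshot.
    """
    seen_keys = set()
    unique_by_date = {}
    for date_tag in sorted(snapshots):  # ISO dates sort chronologically
        kept = []
        for content, url in snapshots[date_tag]:
            key = content_key(content)
            if key not in seen_keys:
                seen_keys.add(key)
                kept.append((content, url))
        unique_by_date[date_tag] = kept
    return unique_by_date


# Toy data: the later snapshot repeats a page from the earlier one
toy_snapshots = {
    "2021-03-01": [("Alert Level 2 applies.", "https://example.org/a"),
                   ("Get tested.", "https://example.org/b")],
    "2021-02-01": [("Get tested.", "https://example.org/b"),
                   ("get  tested. ", "https://example.org/c")],
}
for date_tag, pairs in dedupe_snapshots(toy_snapshots).items():
    print(date_tag, [url for _, url in pairs])
```

Note that normalising case and whitespace merges pages the exact-match loop would keep apart, so the two approaches can report different counts.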

In [None]:
# Visualize the number of unique pages over time
# This helps understand how the website content evolved
df_unique_pages = pd.DataFrame({"date_tag": list_date_tag, "unique_pages": list_unique_pages})
df_unique_pages["date_tag"] = pd.to_datetime(df_unique_pages["date_tag"])
df_unique_pages.set_index("date_tag", inplace=True)

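# Create a time series plot
# Pass the axes to DataFrame.plot so it draws on this figure instead of opening a second one
fig, ax = plt.subplots(figsize=(12, 6))
df_unique_pages.plot(ax=ax)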
ax.set_title("Number of New Unique Pages Over Time")
ax.set_ylabel("Count")
ax.set_xlabel("Date")
plt.tight_layout()
plt.show()

## 4. Advanced analysis with embeddings and semantic search

Keyword search matches exact terms, so it misses passages that express the same idea in different words. Semantic search compares embeddings instead, which lets us find content by meaning rather than by exact word matches.

In this section, we'll:
1. Create vector embeddings for each text snippet
2. Store these embeddings in a vector database (ChromaDB)
3. Implement semantic search to find relevant content
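
To make "finding content by meaning" concrete, here is a minimal sketch of nearest-neighbour lookup over vectors using cosine similarity, one common similarity measure. The three-dimensional vectors and document names are made up for illustration, and `cosine_similarity` and `top_k` are hypothetical helpers; a vector database performs a comparable ranking at scale, possibly with a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional "embeddings"; real models produce hundreds of dimensions
toy_vectors = {
    "wage_subsidy": [0.9, 0.1, 0.0],
    "vaccine_rollout": [0.0, 0.2, 0.9],
    "business_support": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # pretend embedding of an economy-related question
for doc_id, score in top_k(query, toy_vectors):
    print(f"{doc_id}: {score:.3f}")
```

The two economy-related vectors rank above the vaccine one even though no keywords are compared, only directions in the vector space.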

First, let's import the required packages for embedding and vector storage.

In [None]:
# Import embedding and vector database libraries
from sentence_transformers import SentenceTransformer
import chromadb

# Configuration for vector database
db_collection_name = "covid_webpages"  # Name for our vector collection
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"  # Pre-trained embedding model
path_chroma = os.path.join(res_folder, "chroma_db")  # Path to store the vector database
input_content_dir = cleaned_content_dir  # Directory with cleaned text files

ChromaDB is a vector database designed for efficient storage and retrieval of embeddings. It allows us to perform similarity searches across our corpus.

In [None]:
# Initialize ChromaDB client and collection
client = chromadb.PersistentClient(path=path_chroma)
collection = client.get_or_create_collection(name=db_collection_name)

# Load the sentence transformer model for creating embeddings
model = SentenceTransformer(embedding_model)

# Get all cleaned text files
files = sorted([f for f in os.listdir(input_content_dir) if f.endswith(".txt")])

# Process each file and add its contents to the vector database
for fname in files:
    # Load content file
    content_file_path = os.path.join(input_content_dir, fname)
    with open(content_file_path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    
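    # Load corresponding URL file.
    # Build the path from the URL directory: a str.replace on the full path would
    # also rewrite the "/content/" base folder used on Colab.
    url_file_path = os.path.join(cleaned_url_dir, fname.replace("content", "url"))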
    with open(url_file_path, encoding="utf-8") as f:
        urls = [line.strip() for line in f]
    
    # Create embeddings for each line of text
    # This converts text into numerical vectors that capture semantic meaning
    embeddings = model.encode(lines, show_progress_bar=True, convert_to_numpy=True)
    datestr = extract_date(fname)
    
    # Add embeddings and metadata to ChromaDB
    try:
        collection.add(
            ids=[f"{fname}_{i}" for i in range(len(lines))],
            embeddings=embeddings.tolist(),
            documents=lines,
            metadatas=[{"filename": fname, "date": datestr, "line": i, "url": url} for i, url in enumerate(urls)]
        )
    except ValueError as err:
        print(f"⚠️ Skipped {fname} due to ValueError: {err}")

print("✅ Indexed all lines into Chroma!")
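
The loop above sends each file to the collection in a single `add` call. ChromaDB may reject very large batches, and the limit depends on the version and backend, so for big files it can help to add records in slices. The sketch below shows only the slicing logic; `batch_ranges` is a hypothetical helper and the batch size of 3 is chosen purely for the demonstration.

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


# Stand-ins for the per-file lists built in the indexing loop
lines = [f"paragraph {i}" for i in range(7)]
ids = [f"example.txt_{i}" for i in range(len(lines))]

for start, end in batch_ranges(len(lines), batch_size=3):
    # In the notebook, this is where collection.add(...) would receive
    # ids[start:end], lines[start:end], and the matching slices of the
    # embeddings and metadatas lists.
    print(start, end, ids[start:end])
```

Slicing all four lists with the same index pair keeps ids, documents, embeddings, and metadata aligned.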

Now that the vector database is ready, we can query it with natural-language questions and retrieve the closest passages.

In [None]:
def semantic_search(query, n_results=5):
    """Perform semantic search on the corpus using vector embeddings.
    
    Args:
        query: The search query as a string
        n_results: Number of results to return (default: 5)
        
    Returns:
        Dictionary containing search results from ChromaDB
    """
    # Load the embedding model
    model = SentenceTransformer(embedding_model)

    # Connect to the vector database
    client = chromadb.PersistentClient(path=path_chroma)
    collection = client.get_or_create_collection(name=db_collection_name)

    # Convert query to embedding vector
    query_emb = model.encode([query], convert_to_numpy=True).tolist()

    # Search for similar content in the database
    results = collection.query(
        query_embeddings=query_emb,
        n_results=n_results
    )

    # Display results with metadata
    for text, meta in zip(results["documents"][0], results["metadatas"][0]):
        print(f"{meta['filename']} (line {meta['line']}) — {meta['url']}")
        print(f"→ {text}\n")

    return results

In [None]:
# Example semantic search query
# This demonstrates finding content related to economic impacts without requiring exact keyword matches
res = semantic_search("What is the impact of the pandemic on the economy?")
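
The dictionary returned by `collection.query` holds one inner list per query for each field, which is why the function above indexes with `[0]`. The sketch below flattens one query's results into a list of records that is easier to load into a DataFrame. `flatten_query_results` is a hypothetical helper, and `mock_results` is a hand-written stand-in with invented values, shaped like a query result.

```python
def flatten_query_results(results, query_index=0):
    """Turn one query's results into a list of per-hit dictionaries."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    hits = []
    for doc_id, text, meta, distance in zip(ids, documents, metadatas, distances):
        # Metadata keys (filename, date, line, url) are merged into the record
        hits.append({"id": doc_id, "text": text, "distance": distance, **(meta or {})})
    return hits


# Hand-written stand-in shaped like a query result for a single query
mock_results = {
    "ids": [["covid19_content_deduped_2021-02-01.txt_0",
             "covid19_content_deduped_2021-03-01.txt_4"]],
    "documents": [["Wage subsidy applications are open.",
                   "Support is available for businesses."]],
    "metadatas": [[
        {"filename": "covid19_content_deduped_2021-02-01.txt", "date": "2021-02-01",
         "line": 0, "url": "https://example.org/wage-subsidy"},
        {"filename": "covid19_content_deduped_2021-03-01.txt", "date": "2021-03-01",
         "line": 4, "url": "https://example.org/business"},
    ]],
    "distances": [[0.41, 0.57]],
}

for hit in flatten_query_results(mock_results):
    print(hit["date"], hit["distance"], hit["url"])
```

With the real results, `pd.DataFrame(flatten_query_results(res))` would give one row per hit; smaller distances indicate closer matches.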

## 5. Building a question-answering system with the corpus

Beyond search, we can build a question-answering system that combines our vector database with language models to provide direct answers to questions about the corpus.

This approach, known as Retrieval-Augmented Generation (RAG), involves:
1. Retrieving relevant passages using semantic search
2. Using a language model to generate answers based on the retrieved content
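
The two steps above can be sketched without any framework: retrieved passages are concatenated into a context block and the question is appended. The prompt wording, the `build_stuffed_prompt` name, and the character budget are illustrative assumptions and do not reproduce LangChain's actual template.

```python
def build_stuffed_prompt(question, passages, max_chars=1500):
    """Concatenate passages into one context block, then append the question.

    Passages are added in order until the next one would push the context
    past `max_chars`. The first passage is always kept, even if it is long.
    """
    context_parts = []
    used = 0
    for passage in passages:
        if context_parts and used + len(passage) > max_chars:
            break
        context_parts.append(passage)
        used += len(passage)
    context = "\n\n".join(context_parts)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder passages standing in for retrieved documents
passages = ["A" * 10, "B" * 10, "C" * 10]
prompt = build_stuffed_prompt("What changed?", passages, max_chars=25)
print(prompt)
```

A budget like this matters for small models with short input windows, because passages that do not fit in the prompt cannot inform the answer.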

In [None]:
# Install additional packages for the QA system
!pip -q install transformers langchain-community

In [None]:
# Import libraries for building the QA system
from transformers import pipeline
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline  # community package, matching the imports below
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

In [None]:
# Set up the embedding model for retrieval
# Use a separate name so the `embedding_model` string used by semantic_search() is not overwritten
lc_embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

# Connect to the existing vector database
vectorstore = Chroma(
    persist_directory=path_chroma,
    collection_name=db_collection_name,
    embedding_function=lc_embeddings
)

# Create a retriever that will fetch relevant documents
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})  # Retrieve the single most relevant document; raise k for more context

In [None]:
# Set up a language model for answering questions
# We use a small open-source text generation model (flan-t5-base)
gen_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    tokenizer="google/flan-t5-base",
    max_length=512
)

# Wrap the pipeline in a LangChain compatible format
llm = HuggingFacePipeline(pipeline=gen_pipeline)

In [None]:
# Create a question-answering chain
# This combines retrieval and generation into a single pipeline
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" concatenates retrieved docs into the prompt
    retriever=retriever,
    return_source_documents=True  # Include source documents in the response
)

In [None]:
# Example question to demonstrate the QA system
query = "What is the impact of the pandemic on the economy?"
result = qa_chain.invoke({"query": query})

# Display the question, answer, and sources
print("🔹 Question:", query)
print("🔹 Answer:", result["result"])
print("\n🔹 Sources:")
for doc in result["source_documents"]:
    print("-", doc.metadata.get("filename"), "→", doc.page_content[:200], "...\n")

## Conclusion

In this notebook, we've demonstrated a complete workflow for processing and analyzing text extracted from web archives:

1. **Extraction**: Converting web archive HTML to plain text
2. **Preprocessing**: Cleaning and deduplicating text data
3. **Semantic Search**: Finding content based on meaning using vector embeddings
4. **Question Answering**: Building an AI system that can answer questions about the corpus

These techniques enable researchers to extract valuable insights from web archives, making historical web content more accessible and useful for analysis.

### Next Steps

- Experiment with different embedding models for improved search quality
- Apply topic modeling to discover themes in the corpus
- Integrate with larger language models for more sophisticated question answering