# Processing Web Archive Text Corpus

This notebook demonstrates how to work with text files extracted from web archive HTML snapshots. We'll cover the following steps:

1. Setting up the environment
2. Extracting text from web archive HTML
3. Cleaning and preprocessing the text data
4. Advanced analysis with embeddings and semantic search
5. Building a question-answering system with the corpus

## 1. Setting up the environment

First, let's install the required packages.

In [None]:
# Install pre-requisites
# (version specifiers are quoted so the shell does not treat ">" as an output redirect)
!pip -q install "warcio>=1.7.4" validators "boto3>=1.40.26" s3fs bs4 wordcloud
!pip -q install selenium chromedriver-autoinstaller
!pip -q install sentence-transformers chromadb # additional packages for embeddings and semantic search

# Install wa_nlnz_toolkit
!pip -q install -i https://test.pypi.org/simple/ wa-nlnz-toolkit==0.2.1

In [None]:
# Check if running on Google Colab
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Set default result folder
if IN_COLAB:
    res_folder = "/content/sample_data"
else:
    res_folder = "./sample_data"

In [None]:
import os
import re
import pandas as pd
import matplotlib.pyplot as plt
import wa_nlnz_toolkit as want
from bs4 import BeautifulSoup
from tqdm import tqdm
from glob import glob
from collections import Counter
from wordcloud import WordCloud

## 2. Extracting text from web archive HTML

First, let's remind ourselves how to extract text content from web archive HTML files. We'll use the `extract_payload` and `extract_content_html` functions from the `wa_nlnz_toolkit`.


In [None]:
# Example: Extract HTML payload from a WARC file
# Replace with your own WARC file path and offset
warc_file = "s3://ndha-public-data-ap-southeast-2/iPRES-2025/sample-data/covid19.govt.nz/2023-12-14_IE89493927/FL89493929_NLNZ-20231212233435565-00000-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz"
offset = 3126252

# Extract the HTML payload
html_payload = want.extract_payload(warc_file, offset)

# Check if we got a payload
if html_payload:
    print(f"Successfully extracted HTML payload of {len(html_payload)} bytes")
else:
    print("Failed to extract HTML payload")

In [None]:
# Extract text content from the HTML payload
if html_payload:
    # Use the toolkit's function to extract content
    paragraphs = want.extract_content_html(html_payload)
    
    # Print the first few paragraphs
    print(f"Extracted {len(paragraphs)} paragraphs. Here are the first 5:")
    for i, para in enumerate(paragraphs[:5]):
        print(f"\n{i+1}. {para}")
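
The internals of `extract_content_html` are not shown in this notebook. As a rough illustration of the idea only, the sketch below pulls paragraph text out of an HTML payload using just the standard library. The names `ParagraphExtractor` and `extract_paragraphs` are hypothetical, and the toolkit's own function may behave differently (for example, it may handle headings or list items as well).

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of <p> elements, ignoring <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []
        self._in_p = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self._buffer.append(data)

    def _flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []


def extract_paragraphs(html):
    """Hypothetical helper: return a list of non-empty paragraph strings."""
    if isinstance(html, bytes):
        html = html.decode("utf-8", errors="replace")
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text from an unclosed <p>
    return parser.paragraphs


sample = b"<html><body><p>Stay  home.</p><script>var x = 1;</script><p>Wash <b>hands</b>.</p></body></html>"
print(extract_paragraphs(sample))
```

Real archived pages are far messier than this sample, so a dedicated parser such as BeautifulSoup (already imported above) is usually the more robust choice.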

## 3. Cleaning and preprocessing the text data

Now, let's look at how to clean and preprocess the text data from multiple web archive snapshots. We'll work with the existing corpus files in the `covid19_corpus/raw/` directory.

In [None]:
# Define paths
raw_folder_path = os.path.join(res_folder, "covid19_corpus/raw/")
cleaned_folder_path = os.path.join(res_folder, "covid19_corpus/cleaned/")

# Create the cleaned directory if it doesn't exist
cleaned_content_dir = os.path.join(cleaned_folder_path, "content")
cleaned_url_dir = os.path.join(cleaned_folder_path, "url")
os.makedirs(cleaned_content_dir, exist_ok=True)
os.makedirs(cleaned_url_dir, exist_ok=True)

# List all text files in the raw folder
raw_content_dir = os.path.join(raw_folder_path, "content")
raw_url_dir = os.path.join(raw_folder_path, "url")
raw_content_files = [f for f in os.listdir(raw_content_dir) if f.endswith(".txt")]
print(f"Found {len(raw_content_files)} raw text files")

In [None]:
# Let's define a simple function to extract date from filename
def extract_date(fname):
    match = re.search(r'(\d{4}-\d{2}-\d{2})', fname)
    return match.group(1) if match else None


# Find all content files, sort by date (assuming date in filename)
content_files = sorted(glob(os.path.join(raw_content_dir, "covid19_content_cleaned_*.txt")))
seen_contents = set()

list_date_tag = []
list_unique_pages = []
for content_file in content_files:
    date_tag = extract_date(os.path.basename(content_file))
    url_file = os.path.join(raw_url_dir, f"covid19_url_cleaned_{date_tag}.txt")

    if not os.path.exists(url_file):
        print(f"⚠️ URL file missing for {date_tag}, skipping.")
        continue

    print(f"Processing {date_tag} ...")

    with open(content_file, encoding="utf-8") as f:
        contents = [line.strip() for line in f]

    with open(url_file, encoding="utf-8") as f:
        urls = [line.strip() for line in f]

    assert len(contents) == len(urls), f"Line count mismatch in {date_tag}"

    unique_contents = []
    unique_urls = []

    for content, url in zip(contents, urls):
        if content not in seen_contents:
            seen_contents.add(content)
            unique_contents.append(content)
            unique_urls.append(url)

    # Save deduplicated version for this date
    out_content = os.path.join(cleaned_content_dir, f"covid19_content_deduped_{date_tag}.txt")
    out_url = os.path.join(cleaned_url_dir, f"covid19_url_deduped_{date_tag}.txt")

    with open(out_content, "w", encoding="utf-8") as f:
        for c in unique_contents:
            f.write(c + "\n")

    with open(out_url, "w", encoding="utf-8") as f:
        for u in unique_urls:
            f.write(u + "\n")

    list_unique_pages.append(len(unique_contents))
    list_date_tag.append(date_tag)

    print(f"✅ {date_tag}: kept {len(unique_contents)} unique pages (out of {len(contents)})")
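
The key point in the loop above is that `seen_contents` is shared across all snapshots, so a page is kept only in the earliest snapshot where its text appears. The sketch below reproduces that logic on a small in-memory example. It stores SHA-256 digests instead of full strings, which is an optional variation (not what the cell above does) that may reduce memory use on large corpora. The function name `dedupe_snapshots` and the toy data are made up for illustration.

```python
import hashlib


def dedupe_snapshots(snapshots):
    """Keep each page text only the first time it appears across snapshots.

    `snapshots` maps a date string to a list of (content, url) pairs.
    Returns a dict with the same keys holding only the first-seen pairs.
    """
    seen = set()
    result = {}
    for date_tag in sorted(snapshots):  # ISO dates sort chronologically
        kept = []
        for content, url in snapshots[date_tag]:
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append((content, url))
        result[date_tag] = kept
    return result


# Toy data: the first page is unchanged between the two snapshots
toy = {
    "2023-01-01": [("Alert level 2", "/alerts"), ("Wear a mask", "/masks")],
    "2023-02-01": [("Alert level 2", "/alerts"), ("Alert level 1", "/alerts")],
}
deduped = dedupe_snapshots(toy)
print({date: len(pairs) for date, pairs in deduped.items()})
```

Note that deduplication is by exact text match, so a page that changes by even one character counts as new content.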

In [None]:
# construct a dataframe
df_unique_pages = pd.DataFrame({"date_tag": list_date_tag, "unique_pages": list_unique_pages})
df_unique_pages["date_tag"] = pd.to_datetime(df_unique_pages["date_tag"])
df_unique_pages.set_index("date_tag", inplace=True)
df_unique_pages.plot()

## 4. Advanced analysis with embeddings and semantic search

Now, let's use embeddings to perform semantic search and analysis on our corpus. 

We will use the corpus and the corresponding URLs that were preprocessed in the previous notebook (exp-02_Exploring_NLNZ_WebArchive.ipynb) and deduplicated in Section 3.

First, let's import the required packages used here.

In [None]:
import os
import re
from sentence_transformers import SentenceTransformer
import chromadb


# define the collection name
db_collection_name = "covid_webpages"

# define embedding model
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

# define chroma db path
path_chroma = os.path.join(res_folder, "chroma_db")

# define the input directories
input_content_dir = cleaned_content_dir

Here we will use Chroma DB to store and query the embeddings. Chroma DB is a vector database that allows us to store and query embeddings efficiently.
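
To make the idea of querying embeddings concrete, here is a minimal brute-force sketch of nearest-neighbour search using cosine similarity on toy 3-dimensional vectors. It is an illustration of the concept only: Chroma uses an approximate index and a configurable distance metric, and real sentence embeddings have hundreds of dimensions. The function names and vectors are invented for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, similarity) pairs for the k most similar vectors, best first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Made-up "embeddings" for three documents
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
for index, score in top_k([1.0, 0.0, 0.0], toy_vectors, k=2):
    print(f"doc {index}: similarity {score:.3f}")
```

A vector database does essentially this ranking, but with indexing structures that avoid comparing the query against every stored vector.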

In [None]:
client = chromadb.PersistentClient(path=path_chroma)
collection = client.get_or_create_collection(name=db_collection_name)

model = SentenceTransformer(embedding_model)

files = sorted([f for f in os.listdir(input_content_dir) if f.endswith(".txt")])

for fname in files:
    content_file_path = os.path.join(input_content_dir, fname)
    with open(content_file_path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    
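    # Build the URL path explicitly: replacing "content" in the full path would also
    # rewrite the "/content" prefix of res_folder when running on Colab
    url_file_path = os.path.join(cleaned_url_dir, fname.replace("_content_", "_url_"))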
    with open(url_file_path, encoding="utf-8") as f:
        urls = [line.strip() for line in f]
    
    embeddings = model.encode(lines, show_progress_bar=True, convert_to_numpy=True)
    # we can also try to encode the urls since they also contain some useful information
    # urls_updated = [" ".join(url.split("https://covid19.govt.nz")[1].split("/")).strip() for url in urls]
    # embeddings = model.encode(urls_updated, show_progress_bar=True, convert_to_numpy=True)
    datestr = extract_date(fname)
    
    # Add to Chroma
    try:
        collection.add(
            ids=[f"{fname}_{i}" for i in range(len(lines))],
            embeddings=embeddings.tolist(),
            documents=lines,
            metadatas=[{"filename": fname, "date": datestr, "line": i, "url": url} for i, url in enumerate(urls)]
        )
    except ValueError as err:
        # Report the reason so that skipped files can be investigated
        print(f"⚠️ Skipped {fname} due to ValueError: {err}")

print("✅ Indexed all lines into Chroma!")
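
The loop above adds each file to the collection in a single call, so one `ValueError` drops the whole file. One possible variation, sketched below, is to add records in smaller batches so that a failure affects fewer lines and very large files are sent in manageable pieces. The helper `batched` and the batch size of 1000 are assumptions for illustration, not values taken from the Chroma documentation.

```python
def batched(items, batch_size):
    """Yield (start_index, chunk) pairs covering `items` in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


# Small demonstration on plain data
for start, chunk in batched(["a", "b", "c", "d", "e"], 2):
    print(start, chunk)

# Inside the indexing loop, the single collection.add(...) call could become
# something like the following (names as in the cell above):
#
# for start, chunk in batched(lines, 1000):
#     stop = start + len(chunk)
#     collection.add(
#         ids=[f"{fname}_{i}" for i in range(start, stop)],
#         embeddings=embeddings[start:stop].tolist(),
#         documents=chunk,
#         metadatas=[
#             {"filename": fname, "date": datestr, "line": i, "url": urls[i]}
#             for i in range(start, stop)
#         ],
#     )
```

Keeping the start index alongside each chunk means the record IDs and line numbers stay identical to those produced by the unbatched version.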

With the vector database ready, we can now run a semantic search to find the documents most relevant to a given query.

In [None]:
def semantic_search(query, n_results=5):
    model = SentenceTransformer(embedding_model)

    client = chromadb.PersistentClient(path=path_chroma)
    collection = client.get_or_create_collection(name=db_collection_name)

    query_emb = model.encode([query], convert_to_numpy=True).tolist()

    results = collection.query(
        query_embeddings=query_emb,
        n_results=n_results
    )

    for text, meta in zip(results["documents"][0], results["metadatas"][0]):
        print(f"{meta['filename']} (line {meta['line']}) — {meta['url']}")
        print(f"→ {text[:200]}...\n")

    return results

In [None]:
res = semantic_search("What is the impact of the pandemic on the economy?")
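
The dictionary returned by `collection.query` holds one inner list per query, which is why the function above indexes `results["documents"][0]`. The sketch below flattens the hits for a single query into a list of rows that is easier to inspect or pass to `pd.DataFrame`. The helper name `results_to_rows` is hypothetical, the `mock_results` values are invented to mimic the expected shape, and it assumes distances are included in the response (treated as optional here).

```python
def results_to_rows(results):
    """Flatten the hits of the first query into a list of dicts, best match first."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    # Distances may be absent depending on what the query was asked to include
    dists = (results.get("distances") or [[None] * len(docs)])[0]
    return [
        {"url": meta.get("url"), "date": meta.get("date"), "distance": dist, "text": text}
        for text, meta, dist in zip(docs, metas, dists)
    ]


# Invented example with the same nesting as a single-query response
mock_results = {
    "ids": [["file_a_0", "file_b_3"]],
    "documents": [["Wage subsidy information", "Support for businesses"]],
    "metadatas": [[
        {"url": "https://example.org/a", "date": "2023-01-01"},
        {"url": "https://example.org/b", "date": "2023-02-01"},
    ]],
    "distances": [[0.42, 0.57]],
}
for row in results_to_rows(mock_results):
    print(row["date"], row["distance"], row["url"])
```

With the real output, `pd.DataFrame(results_to_rows(res))` would give a table that can be sorted or filtered by date.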

## 5. Building a question-answering system with the corpus

Finally, let's build a simple question-answering system using our corpus and a language model. For this section, we'll need to install additional packages.

In [None]:
# Install additional packages for the QA system
!pip -q install transformers langchain-community

In [None]:
from transformers import pipeline
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

In [None]:
# Load the same embedding model that was used to build the Chroma collection.
# A separate variable name keeps the `embedding_model` string used by semantic_search() intact.
hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Load the persistent Chroma DB
vectorstore = Chroma(
    persist_directory=path_chroma,
    collection_name=db_collection_name,
    embedding_function=hf_embeddings
)

# Build a retriever from the vectorstore directly (no need to import VectorStoreRetriever)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

In [None]:
from transformers import pipeline

# qa_pipeline = pipeline(
#     "question-answering",
#     model="distilbert-base-cased-distilled-squad",
#     tokenizer="distilbert-base-cased-distilled-squad"
# )

# llm = HuggingFacePipeline(pipeline=qa_pipeline)

# Use a small open-source text generation model
gen_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    tokenizer="google/flan-t5-base",
    max_length=512
)

llm = HuggingFacePipeline(pipeline=gen_pipeline)

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" simply concatenates retrieved docs
    retriever=retriever,
    return_source_documents=True
)
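
As the inline comment says, the "stuff" chain type places all retrieved documents into a single prompt. The sketch below shows the general idea with a plain function. The prompt wording, the helper name `build_stuff_prompt`, and the character limit are assumptions made for illustration and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, docs, max_chars=1500):
    """Concatenate retrieved texts into one context block followed by the question."""
    context = "\n\n".join(docs)
    # Crude character-based truncation so the prompt stays within a small model's input limit
    context = context[:max_chars]
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "The wage subsidy supported employers affected by COVID-19.",
    "Businesses could apply for a resurgence support payment.",
]
print(build_stuff_prompt("What support was available for businesses?", retrieved))
```

Because everything is packed into one prompt, this approach only works while the retrieved text fits in the model's context window, which is one reason to keep `k` small with a compact model such as `google/flan-t5-base`.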

In [None]:
query = "What is the impact of the pandemic on the economy?"
result = qa_chain.invoke({"query": query})

print("🔹 Question:", query)
print("🔹 Answer:", result["result"])
print("\n🔹 Sources:")
for doc in result["source_documents"]:
    print("-", doc.metadata.get("filename"), "→", doc.page_content[:200], "...\n")


## Conclusion

In this notebook, we've demonstrated how to work with text files extracted from web archive HTML snapshots. We covered:

1. Setting up the environment
2. Extracting text from web archive HTML
3. Cleaning and preprocessing the text data
4. Advanced analysis with embeddings and semantic search
5. Building a question-answering system with the corpus

These techniques can be applied to any web archive corpus to extract insights and make the data more accessible and useful for research and analysis.