# ETH Zurich Web Archive Indexing Pipeline

This notebook walks through the complete pipeline for processing WARC files and indexing them to Elasticsearch.

## Overview

The pipeline consists of 5 main steps:
1. **Extract** HTML and PDF files from WARC archives
2. **Combine** HTML files by domain and timestamp
3. **Convert** HTML to Markdown format
4. **Index** documents to Elasticsearch with embeddings
5. **Query** the indexed documents

## Prerequisites

- Ollama running locally: `ollama serve`
- Embedding model downloaded: `ollama pull all-minilm`
- `.env` file configured with Elasticsearch credentials
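
The first two prerequisites can be checked from Python before running the pipeline. The sketch below is not part of the pipeline modules: it assumes Ollama listens on its default address `http://localhost:11434` and uses the `/api/tags` endpoint, which lists locally available models. The helper names `model_in_tags` and `ollama_has_model` are hypothetical.

```python
import json
import urllib.error
import urllib.request


def model_in_tags(payload: dict, model: str) -> bool:
    """Check a parsed /api/tags response for a model, ignoring the ':tag' suffix."""
    names = [entry.get("name", "") for entry in payload.get("models", [])]
    return any(name == model or name.split(":")[0] == model for name in names)


def ollama_has_model(model: str, base_url: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers and lists the given model."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, unreachable, or it sent an unexpected reply.
        return False
    return model_in_tags(payload, model)
```

Calling `ollama_has_model("all-minilm")` returns `False` both when the server is down and when the model has not been pulled, so a failed check points back to one of the two commands above.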

## Step 0: Set Up the Environment

Load environment variables and configure settings.

In [2]:
import os
import shutil
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Read Elasticsearch configuration from .env file
es_username = os.getenv('ELASTIC_USERNAME')
es_password = os.getenv('ELASTIC_PASSWORD')
es_url = os.getenv('ES_URL', 'https://es.swissai.cscs.ch')
embedding_model = os.getenv('EMBEDDING_MODEL', 'all-minilm')
index_name = os.getenv('INDEX_NAME', 'ethz_webarchive')

# Validate credentials
if not es_username or not es_password:
    raise ValueError("Please set ELASTIC_USERNAME and ELASTIC_PASSWORD in your .env file")

print("Configuration:")
print(f"  ES URL: {es_url}")
print(f"  ES User: {es_username}")
print(f"  Embedding Model: {embedding_model}")
print(f"  Index Name: {index_name}")

Configuration:
  ES URL: https://es.swissai.cscs.ch
  ES User: lsaie-1
  Embedding Model: all-minilm
  Index Name: ethz_webarchive


## Step 1: Extract HTML and PDF from WARC Files

Extract content from WARC archives. This step:
- Parses WARC files in the data directory
- Extracts HTML pages (for text content)
- Extracts PDF files (for document content)
- Organizes files by domain

**Note:** We clean the output directory first to ensure a fresh start.
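
To illustrate the "organizes files by domain" step, here is a minimal sketch of how a captured URL could be mapped to an output path. It is not the code in `prep_warc_files`; the helper names are hypothetical, and the `<WARC file name>_<domain>` folder layout is an assumption based on the folder names printed in Step 2.

```python
from pathlib import Path
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercase host name of a URL, without any port."""
    host = urlparse(url).hostname
    return host.lower() if host else "unknown"


def output_path(out_root: str, warc_name: str, url: str) -> Path:
    """Build '<out_root>/<warc_name>_<domain>/<file name>' for one captured page."""
    path = urlparse(url).path
    if not path or path.endswith("/"):
        # Directory-style URLs get a default file name.
        name = "index.html"
    else:
        name = path.rsplit("/", 1)[-1]
    return Path(out_root) / f"{warc_name}_{domain_of(url)}" / name
```

For example, `output_path("output/html_raw/19945", "a.warc.gz", "https://geosynklinale.ch/archiv/")` yields `output/html_raw/19945/a.warc.gz_geosynklinale.ch/index.html`.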

In [2]:
from prep_warc_files import warc_to_html, warc_to_pdf

# Clean output directory
if os.path.exists("output"):
    shutil.rmtree("output")
    print("Cleaned output directory")

# Define collections to process
coll_list = ["19945"]

print("\nExtracting HTML and PDF files from WARC archives...\n")

for coll in coll_list:
    print(f"Processing collection: {coll}")
    
    # Extract HTML files
    print("  Extracting HTML...")
    warc_to_html("./data/ethz_websites_2022-2025_examples", f"output/html_raw/{coll}/")
    
    # Extract PDF files
    print("  Extracting PDFs...")
    warc_to_pdf("./data/ethz_websites_2022-2025_examples", f"output/pdf_raw/{coll}/")
    
    print(f"  ✓ Completed collection {coll}\n")

Cleaned output directory

Extracting HTML and PDF files from WARC archives...

Processing collection: 19945
  Extracting HTML...
parsing ARCHIVEIT-19945-WEEKLY-JOB2538523-SEED3004629-20250410105708162-00000-h3.warc.gz
parsing ARCHIVEIT-19945-TEST-JOB2537999-0-SEED4432726-20250409125132954-00000-uhxaspzf.warc.gz
parsing ARCHIVEIT-19945-WEEKLY-JOB2538523-SEED3004637-20250410105648171-00000-h3.warc.gz
parsing ARCHIVEIT-19945-TEST-JOB2538000-0-SEED4432727-20250409125201867-00000-9618ziof.warc.gz
-----------------------------

Count of records.
59

Count of types.
{'response': 59}

Count of warc-content.
{'application/http': 59}

Count of http-content.
{'text/html': 59}

Count of status.
{'200': 52, '404': 6, '500': 1}
  Extracting PDFs...
parsing ARCHIVEIT-19945-WEEKLY-JOB2538523-SEED3004629-20250410105708162-00000-h3.warc.gz
parsing ARCHIVEIT-19945-TEST-JOB2537999-0-SEED4432726-20250409125132954-00000-uhxaspzf.warc.gz
parsing ARCHIVEIT-19945-WEEKLY-JOB2538523-SEED3004637-20250410105648171

## Step 2: Combine HTML Files by Domain

Combine HTML files that belong to the same domain. This step:
- Groups files by domain (e.g., all files from ethz.ch)
- Keeps the latest version of each page based on timestamp
- Creates a mapping file for domain-to-URL conversion
- Saves timestamps for each file (used later for metadata)

This is important for:
- Avoiding duplicate content
- Tracking when each page was archived
- Organizing content by source domain
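
The "keep the latest version" rule can be sketched as follows. This is an illustration under assumptions, not the implementation in `combine_domains`: it assumes the capture time is the 17-digit block (YYYYMMDDHHMMSS plus milliseconds) embedded in the Archive-It file names shown in the Step 1 output, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "-20250410105708162-" and captures the first 14 digits.
_CAPTURE_TS = re.compile(r"-(\d{14})\d{3}-")


def capture_time(warc_name: str) -> datetime:
    """Parse the capture timestamp embedded in an Archive-It WARC file name."""
    match = _CAPTURE_TS.search(warc_name)
    if match is None:
        raise ValueError(f"no capture timestamp in {warc_name!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def latest_per_page(captures):
    """Keep the newest capture for each (domain, page) pair.

    `captures` is an iterable of (domain, page, warc_name) tuples.
    Returns {(domain, page): (timestamp, warc_name)}.
    """
    newest = {}
    for domain, page, warc_name in captures:
        key = (domain, page)
        ts = capture_time(warc_name)
        if key not in newest or ts > newest[key][0]:
            newest[key] = (ts, warc_name)
    return newest
```

With two captures of the same `ethz.ch` page, the one from the WARC file stamped `20250410105708162` wins over the one stamped `20250410105648171`.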

In [3]:
from combine_domains import combine_domains_by_timestamp

print("Combining HTML files by domain...\n")

for coll in coll_list:
    result = combine_domains_by_timestamp(
        input_dir=f"output/html_raw/{coll}",
        output_dir=f"output/html_combined/{coll}",
        timestamps_json_path=f"output/mappings/{coll}/timestamps.json"
    )

    print(f"\n✓ Collection {coll} Summary:")
    print(f"  Processed {result['domains_count']} domains")
    print(f"  Total files: {result['total_files']}")
    print(f"  Domains: {', '.join(result['domains'])}\n")

Combining HTML files by domain...

Domain HTML Combiner
Input:  output/html_raw/19945
Output: output/html_combined/19945
Found: geosynklinale.ch - 2025-04-09 12:51:32 - ARCHIVEIT-19945-TEST-JOB2537999-0-SEED4432726-20250409125132954-00000-uhxaspzf.warc.gz_geosynklinale.ch
Found: ethz.ch - 2025-04-10 10:56:48 - ARCHIVEIT-19945-WEEKLY-JOB2538523-SEED3004637-20250410105648171-00000-h3.warc.gz_ethz.ch
Found: youtube.com - 2025-04-09 12:52:01 - ARCHIVEIT-19945-TEST-JOB2538000-0-SEED4432727-20250409125201867-00000-9618ziof.warc.gz_youtube.com
Found: ethz.ch - 2025-04-10 10:57:08 - ARCHIVEIT-19945-WEEKLY-JOB2538523-SEED3004629-20250410105708162-00000-h3.warc.gz_ethz.ch
Found: hytac.arch.ethz.ch - 2025-04-09 12:52:01 - ARCHIVEIT-19945-TEST-JOB2538000-0-SEED4432727-20250409125201867-00000-9618ziof.warc.gz_hytac.arch.ethz.ch
Found: geolocation.onetrust.com - 2025-04-10 10:56:48 - ARCHIVEIT-19945-WEEKLY-JOB2538523-SEED3004637-20250410105648171-00000-h3.warc.gz_geolocation.onetrust.com


Found 5 u

## Step 3: Convert HTML to Markdown

Convert HTML files to clean Markdown format. This step:
- Parses HTML and extracts text content
- Converts to Markdown for better text processing
- Optionally filters domains based on an Excel file (topics mapping)
- Creates domain mappings (domain → original URL)
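
As a rough picture of what the conversion does, the sketch below turns headings and paragraphs into Markdown using only the standard library. It is a deliberately small stand-in: the real `html_combined_to_markdown` module may use a different library and handles far more markup. The class and function names here are hypothetical.

```python
from html.parser import HTMLParser

_BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}
_SKIP_TAGS = {"script", "style"}


class MiniMarkdown(HTMLParser):
    """Collect headings and paragraphs as Markdown blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._prefix = ""
        self._skip_depth = 0

    def flush(self):
        # Collapse whitespace and store the finished block, if any.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(self._prefix + text)
        self._buf = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in _SKIP_TAGS:
            self._skip_depth += 1
        elif tag in _BLOCK_TAGS:
            self.flush()
            if tag != "p":
                # "h2" becomes the Markdown prefix "## ".
                self._prefix = "#" * int(tag[1]) + " "

    def handle_endtag(self, tag):
        if tag in _SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in _BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)


def html_to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)
```

Inline tags such as `<b>` are dropped and their text kept, while the contents of `<script>` and `<style>` are ignored entirely.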

**Excel filtering:** If you provide an Excel file with allowed domains, only those domains will be processed. This is useful for focusing on specific content.
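
The allow-list filtering can be pictured as below. This sketch leaves out reading the Excel file and assumes the allowed domains are already available as a list of strings; `filter_domains` is a hypothetical helper, and exact, case-insensitive matching is an assumption about how the real filter compares names.

```python
def filter_domains(domain_dirs, allowed):
    """Split domain folder names into (kept, skipped) using an allow-list."""
    allowed_set = {name.strip().lower() for name in allowed}
    kept, skipped = [], []
    for domain in domain_dirs:
        if domain.lower() in allowed_set:
            kept.append(domain)
        else:
            skipped.append(domain)
    return kept, skipped
```

Under exact matching, a subdomain such as `hytac.arch.ethz.ch` is not covered by an `ethz.ch` entry and would need its own row in the allow-list.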

In [4]:
from html_combined_to_markdown import convert_html_combined_to_markdown

print("Converting HTML to Markdown...\n")

for coll in coll_list:
    result = convert_html_combined_to_markdown(
        input_dir=f"output/html_combined/{coll}",
        output_dir=f"output/markdown/{coll}",
        excel_path="data/2025-11-20_19945_topics.xlsx",  # Optional: filter domains
        mappings_path=f"output/mappings/{coll}/domain_mappings.json"
    )
    
    print(f"\n✓ Collection {coll} completed\n")

Converting HTML to Markdown...

HTML to Markdown Converter
Input:  output/html_combined/19945
Output: output/markdown/19945
Excel:  data/2025-11-20_19945_topics.xlsx
Loaded 3 allowed domains from Excel
Saved domain mappings to output/mappings/19945/domain_mappings.json

Processing 5 domain folders...


Processing domains: 100%|██████████| 5/5 [00:00<00:00, 24.48domain/s]


✓ Processed 3 domains
✓ Converted 50 files
✓ Skipped 6 files
✓ Output directory: output/markdown/19945

✓ Collection 19945 completed

## Step 4: Index to Elasticsearch

Index the Markdown documents to Elasticsearch with embeddings. This step:
- Reads Markdown files with metadata
- Splits documents into chunks (for better retrieval)
- Generates embeddings using Ollama (local embedding model)
- Stores documents and embeddings in Elasticsearch

**Metadata included:**
- Original URL
- Domain
- Retrieval timestamp
- Page title

**Performance notes:**
- Documents are processed in batches of 10
- This may take several minutes depending on document count
- Progress is shown for each batch
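
Chunking and batching can be sketched as follows. The batch size of 10 comes from the notes above; the chunk size, the overlap, and the character-based splitting are assumptions for illustration, since the actual strategy in `index_to_elasticsearch` is not shown in this notebook.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50):
    """Split text into overlapping character windows (sizes are illustrative)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    stop = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, stop, step)]


def batched(items, batch_size: int = 10):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With 50 documents and a batch size of 10, `batched` yields five batches, which matches the "batch 1/5" to "batch 5/5" progress lines in the output below.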

In [5]:
from index_to_elasticsearch import index_markdown_to_elasticsearch

print("Indexing documents to Elasticsearch...\n")

for coll in coll_list:
    result = index_markdown_to_elasticsearch(
        clean_index=True,  # Delete existing index first
        es_user=es_username,
        es_password=es_password,
        es_url=es_url,
        embedding_model=embedding_model,
        markdown_dir=f"output/markdown/{coll}",
        index_name=index_name,
        mappings_path=f"output/mappings/{coll}/domain_mappings.json",
        timestamps_path=f"output/mappings/{coll}/timestamps.json"
    )
    
    print(f"\n✓ Successfully indexed {result['documents_indexed']} documents")
    print(f"✓ Index: {result['index_name']}\n")

Indexing documents to Elasticsearch...

Elasticsearch Indexing Pipeline
Markdown dir: output/markdown/19945
Index name:   ethz_webarchive
ES URL:       https://es.swissai.cscs.ch
Embedding:    all-minilm
Deleted existing index: ethz_webarchive
Loaded 3 domain mappings
Loaded timestamps for 59 files

Loading markdown documents...
Found 50 markdown files


Loading markdown files: 100%|██████████| 50/50 [00:00<00:00, 3088.27it/s]

Loaded 50 documents
Saved 50 documents to output/ethz_webarchive_documents.json

Initializing Ollama embedding model...
✓ Connected to Ollama with model: all-minilm

Processing 50 documents...
This may take a while depending on document size and embedding model...

Processing batch 1/5 (10 documents)...
✓ Batch 1 completed successfully (10/50 documents)

Processing batch 2/5 (10 documents)...
✓ Batch 2 completed successfully (20/50 documents)

Processing batch 3/5 (10 documents)...
✓ Batch 3 completed successfully (30/50 documents)

Processing batch 4/5 (10 documents)...
✓ Batch 4 completed successfully (40/50 documents)

Processing batch 5/5 (10 documents)...
✓ Batch 5 completed successfully (50/50 documents)

✓ Successfully indexed 50 documents
✓ Index name: ethz_webarchive
✓ Elasticsearch URL: https://es.swissai.cscs.ch

✓ Successfully indexed 50 documents
✓ Index: ethz_webarchive



## Step 5: Query the Index

Now that documents are indexed, you can query them using semantic search.

The search:
- Converts your query to an embedding
- Finds the most similar documents using vector similarity
- Returns results with metadata (URL, domain, retrieval date)
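
The ranking idea behind the steps above can be shown with toy vectors. This sketch uses cosine similarity in plain Python; whether the index is configured for cosine or another similarity measure is an assumption, and in the pipeline the comparison runs inside Elasticsearch, not in Python. The function names are hypothetical.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, doc_vecs, k: int = 3):
    """Return the k best (score, doc_id) pairs, highest score first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, reverse=True)[:k]
```

Real embeddings have hundreds of dimensions instead of two or three, but the ranking logic is the same: embed the query, score every stored vector against it, and keep the top `k`.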

**Try different queries to explore the indexed content!**

In [None]:
from query_elasticsearch import simple_search, print_search_results

# Example query
query = "Was ist die Geosynklinale?"
top_k = 3

print(f"Searching for: '{query}'\n")

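results = simple_search(
    query=query,
    index_name=index_name,
    es_url=es_url,
    embedding_model=embedding_model,
    top_k=top_k,
    es_user=es_username,  # credentials from Step 0, as in the later query cell
    es_password=es_password
)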

print_search_results(results)

## Inspect Search Results

Inspect the first search result in detail. `results` is a Python list of dictionaries, one per result.

In [None]:
# View first result in detail
if results:
    print("First result:")
    for key, value in results[0].items():
        if key == 'text':
            print(f"  {key}: {value[:200]}...")  # Truncate long text
        else:
            print(f"  {key}: {value}")

## Try Your Own Queries

Modify the cell below to search for different topics:

In [5]:
from query_elasticsearch import simple_search, print_search_results

# Try your own query here!
my_query = "Was ist die Geosynklinale?"
my_top_k = 5

my_results = simple_search(
    query=my_query,
    index_name=index_name,
    es_url=es_url,
    embedding_model=embedding_model,
    top_k=my_top_k,
    es_user=es_username,
    es_password=es_password
)

print_search_results(my_results)

Found 5 results

[1] index
    URL: https://geosynklinale.ch/index.html
    URL Preview (browser-friendly): https://geosynklinale.ch
    Domain: geosynklinale.ch
    Retrieved: None
    Score: 1.0000
    Preview: \\n\\n\\n\\n

Seit vielen Jahren wird die Geosynklinale als ein Weihnachtsessen im Dezember organisiert.

\\n\\n\\n\\n

![](https://live.geosynklinale...

[2] index
    URL: https://geosynklinale.ch/archiv/index.html
    URL Preview (browser-friendly): https://geosynklinale.ch/archiv
    Domain: geosynklinale.ch
    Retrieved: None
    Score: 0.5622
    Preview: \\n\\n\\n\\n

Hier sind die Geosynklinale\-Einladungen aus den letzten paar Jahren.

\\n\\n\\n\\n

---
\\n\\n\\n\\n

2024
----

\\n\\n\\n\\n

![Einlad...

[3] index
    URL: https://geosynklinale.ch/index.html
    URL Preview (browser-friendly): https://geosynklinale.ch
    Domain: geosynklinale.ch
    Retrieved: None
    Score: 0.3101
    Preview: ch/naechste-geosynklinale/)

\\n* [Archiv](https://geosynklinale.ch/arc