# Lovli - Index + Source-Gating Validation (Colab A100 GPU)

This notebook indexes Norwegian laws/regulations into Qdrant Cloud, then runs the v3 source-gating validation pipeline in Colab.

**Requirements:**
- Colab GPU runtime (A100 preferred)
- `lovli-data.tar.bz2` in your Google Drive (root folder)
- Qdrant Cloud URL and API key

**Scripts run besides the indexer:**
- `scripts/build_catalog.py` (merges `data/nl` + `data/sf`; runs before indexing)
- `scripts/validate_reindex.py`
- `scripts/analyze_law_contamination.py`
- `scripts/sweep_retrieval_thresholds.py`

**Estimated time:** roughly 20-30 min for indexing, plus evaluation and sweep time that depends on the GPU.

## 1. Setup

In [None]:
import os
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"
# Set HF_TOKEN if you have one (reduces rate limit warnings):
# os.environ["HF_TOKEN"] = "your_token_here"

In [None]:
# Colab bootstrap (robust)
%cd /content
!rm -rf lovli
!git clone https://github.com/AndreasRamsli/lovli.git
%cd /content/lovli

# Install base runtime deps first (avoid upgrading Colab's pinned requests)
%pip install -q sentence-transformers qdrant-client beautifulsoup4

# Install project in editable mode WITHOUT pulling/upgrading all deps again
%pip install -q -e . --no-deps

# Hard fallback: ensure src path is importable even if editable install is flaky
import sys
from pathlib import Path
src_path = str(Path("/content/lovli/src"))
if src_path not in sys.path:
    sys.path.insert(0, src_path)

# Verify parser import
import lovli.parser as lp
print("Parser module:", lp.__file__)

In [None]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / (1024**3)
    print(f"GPU: {name} ({vram_gb:.1f} GB VRAM)")
    if "A100" not in name:
        print("  Note: Optimized for A100; other GPUs may need smaller batch sizes")
else:
    print("WARNING: No GPU detected. Go to Runtime > Change runtime type > A100 GPU")
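
The note about smaller batch sizes on other GPUs can be turned into a rough starting point. The sketch below is a hypothetical helper, not part of lovli: it assumes the batch size of 256 used in the configuration section was tuned for an 80 GB A100 and scales it linearly with available VRAM, rounding down to a power of two. Real memory use also depends on sequence length, so treat the result as a first guess.

```python
def suggest_batch_size(
    vram_gb: float,
    reference_vram_gb: float = 80.0,  # assumption: VRAM the reference batch was tuned on
    reference_batch: int = 256,       # assumption: matches EMBEDDING_BATCH_SIZE
    minimum: int = 8,
) -> int:
    """Scale the reference batch size by VRAM, rounded down to a power of two."""
    if vram_gb <= 0:
        raise ValueError("vram_gb must be positive")
    # Linear scaling, capped so larger GPUs never exceed the reference batch.
    scaled = min(reference_batch * vram_gb / reference_vram_gb, reference_batch)
    batch = minimum
    while batch * 2 <= scaled:
        batch *= 2
    return batch
```

With the `vram_gb` value computed in the cell above, `suggest_batch_size(vram_gb)` could be used to pick `EMBEDDING_BATCH_SIZE` on a non-A100 runtime.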

## 2. Configuration

Fill in your Qdrant Cloud credentials:

In [None]:
# --- FILL THESE IN ---
QDRANT_URL = "https://acc5c492-7d2c-4b95-b0c5-2931ff2ecebd.eu-west-1-0.aws.cloud.qdrant.io"
QDRANT_API_KEY = ""  # Paste your Qdrant API key here, or use getpass below
# ---------------------

if not QDRANT_API_KEY:
    import getpass
    QDRANT_API_KEY = getpass.getpass("Qdrant API key: ")

COLLECTION_NAME = "lovli_laws_v3"  # Blue/green: keep lovli_laws_v2 as rollback
EMBEDDING_MODEL = "BAAI/bge-m3"
EMBEDDING_DIMENSION = 1024
EMBEDDING_BATCH_SIZE = 256  # A100 80GB can handle large batches
INDEX_BATCH_SIZE = 500      # Upsert batch size to Qdrant

# Editorial payload guardrails
EDITORIAL_NOTES_PER_PROVISION_CAP = 3
EDITORIAL_NOTE_MAX_CHARS = 600

# Network/retry tuning for Qdrant Cloud
QDRANT_TIMEOUT_SECONDS = 120
UPSERT_MAX_RETRIES = 5
UPSERT_BACKOFF_SECONDS = 2

# Runtime env for downstream validation scripts
import os
os.environ['QDRANT_URL'] = QDRANT_URL
os.environ['QDRANT_API_KEY'] = QDRANT_API_KEY
os.environ['QDRANT_COLLECTION_NAME'] = COLLECTION_NAME
os.environ['OPENROUTER_API_KEY'] = os.environ.get('OPENROUTER_API_KEY', 'dummy')
os.environ['LANGCHAIN_TRACING_V2'] = 'false'
os.environ['LANGSMITH_TRACING'] = 'false'
os.environ['SWEEP_SKIP_INDEX_SCAN'] = 'true'

# v3 retrieval profile
os.environ['RETRIEVAL_K_INITIAL'] = '15'
os.environ['RERANKER_CONFIDENCE_THRESHOLD'] = '0.45'
os.environ['RERANKER_MIN_DOC_SCORE'] = '0.35'
os.environ['RERANKER_AMBIGUITY_MIN_GAP'] = '0.05'
os.environ['RERANKER_AMBIGUITY_TOP_SCORE_CEILING'] = '0.7'

# law routing + coherence settings
os.environ['LAW_ROUTING_ENABLED'] = 'true'
os.environ['LAW_CATALOG_PATH'] = 'data/law_catalog.json'
os.environ['LAW_COHERENCE_FILTER_ENABLED'] = 'true'
os.environ['LAW_COHERENCE_MIN_LAW_COUNT'] = '2'
os.environ['LAW_COHERENCE_SCORE_GAP'] = '0.15'

assert QDRANT_API_KEY, "Please set QDRANT_API_KEY above"
print('QDRANT_COLLECTION_NAME =', os.environ['QDRANT_COLLECTION_NAME'])
print('LAW_ROUTING_ENABLED   =', os.environ['LAW_ROUTING_ENABLED'])
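
The retry knobs in this cell (`UPSERT_MAX_RETRIES`, `UPSERT_BACKOFF_SECONDS`) suggest a backoff loop around Qdrant upserts. A minimal sketch of such a loop follows. It is not the code in `scripts/index_laws.py`; the helper name `call_with_retries` is hypothetical, and treating `max_retries` as the total number of attempts with a doubling delay is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 5,          # assumption: total attempts, like UPSERT_MAX_RETRIES
    backoff_seconds: float = 2.0,  # assumption: first delay, like UPSERT_BACKOFF_SECONDS
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            # Delay doubles on each failed attempt: 2s, 4s, 8s, ...
            sleep(backoff_seconds * 2 ** (attempt - 1))
    raise ValueError("max_retries must be at least 1")
```

An upsert could then be wrapped as `call_with_retries(lambda: client.upsert(collection_name=COLLECTION_NAME, points=points), UPSERT_MAX_RETRIES, UPSERT_BACKOFF_SECONDS)`, where `client` and `points` are placeholders for a Qdrant client and a batch of points.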

## 3. Data

Mount Google Drive and extract `lovli-data.tar.bz2` directly from Drive (no copy step).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
DATA_DIRS = ["data/nl", "data/sf"]

In [None]:
# Extract directly into the cloned repo (skip macOS ._ resource fork files).
!tar -xjf /content/drive/MyDrive/lovli-data.tar.bz2 -C /content/lovli --exclude='._*'
!ls /content/lovli/data/nl/*.xml 2>/dev/null | wc -l && ls /content/lovli/data/sf/*.xml 2>/dev/null | wc -l

# Build merged law catalog used by routing (fast path: no summaries).
%cd /content/lovli
!python scripts/build_catalog.py data/nl data/sf --no-summaries --output data/law_catalog.json
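
The `ls | wc -l` check above prints two unlabeled numbers. As an alternative, the sketch below counts the extracted XML files per directory in Python, using the `DATA_DIRS` list defined earlier. The helper name `count_xml_files` is hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable


def count_xml_files(repo_root: str, data_dirs: Iterable[str]) -> Dict[str, int]:
    """Return the number of *.xml files per data directory (0 if the directory is missing)."""
    root = Path(repo_root)
    counts = {}
    for rel in data_dirs:
        directory = root / rel
        if not directory.is_dir():
            counts[rel] = 0
            continue
        # Skip macOS "._" resource fork files, mirroring the tar --exclude above.
        counts[rel] = sum(
            1 for path in directory.glob("*.xml") if not path.name.startswith("._")
        )
    return counts
```

Calling `count_xml_files("/content/lovli", DATA_DIRS)` returns a dict keyed by directory; a count of 0 for either entry would suggest the archive did not extract where expected.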

## 4. Optional Parser Sanity Check

In [None]:
from pathlib import Path
from lovli.parser import parse_xml_file_grouped

# Optional: quick parser sanity check before long indexing runs.
sample = Path('data/nl/nl-19990326-017.xml')
if sample.exists():
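    # Reuse the editorial guardrails from the configuration cell instead of hardcoding them.
    sample_articles = list(parse_xml_file_grouped(
        sample,
        per_provision_cap=EDITORIAL_NOTES_PER_PROVISION_CAP,
        editorial_note_max_chars=EDITORIAL_NOTE_MAX_CHARS,
    ))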
    assert sample_articles, 'Sample parse returned no provisions'
    first = sample_articles[0]
    print('Parser sanity OK:', {'count': len(sample_articles), 'first_article_id': first.article_id})
else:
    print('Skipping parser sanity check (sample file missing).')

## 5. Script-Based Indexing (recommended)

In [None]:
# Configure runtime knobs consumed by scripts/index_laws.py (via Settings).
import os

os.environ['EMBEDDING_MODEL_NAME'] = EMBEDDING_MODEL
os.environ['EMBEDDING_DIMENSION'] = str(EMBEDDING_DIMENSION)
os.environ['EMBEDDING_BATCH_SIZE'] = str(EMBEDDING_BATCH_SIZE)
os.environ['INDEX_BATCH_SIZE'] = str(INDEX_BATCH_SIZE)
os.environ['EDITORIAL_NOTES_PER_PROVISION_CAP'] = str(EDITORIAL_NOTES_PER_PROVISION_CAP)
os.environ['EDITORIAL_NOTE_MAX_CHARS'] = str(EDITORIAL_NOTE_MAX_CHARS)

print('Script indexing config:')
print('  EMBEDDING_MODEL_NAME =', os.environ['EMBEDDING_MODEL_NAME'])
print('  EMBEDDING_BATCH_SIZE =', os.environ['EMBEDDING_BATCH_SIZE'])
print('  INDEX_BATCH_SIZE     =', os.environ['INDEX_BATCH_SIZE'])
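
Environment variables are always strings, which is why the cell above wraps each numeric value in `str(...)`. The sketch below shows how a script might read them back into typed values. It is an illustration only: the `IndexingConfig` class is hypothetical and is not the `Settings` class that `scripts/index_laws.py` actually uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class IndexingConfig:
    """Hypothetical typed view of the indexing-related environment variables."""

    embedding_model_name: str
    embedding_dimension: int
    embedding_batch_size: int
    index_batch_size: int

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "IndexingConfig":
        env = os.environ if env is None else env
        # A missing variable raises KeyError; a non-numeric value raises ValueError.
        return cls(
            embedding_model_name=env["EMBEDDING_MODEL_NAME"],
            embedding_dimension=int(env["EMBEDDING_DIMENSION"]),
            embedding_batch_size=int(env["EMBEDDING_BATCH_SIZE"]),
            index_batch_size=int(env["INDEX_BATCH_SIZE"]),
        )
```

Running `IndexingConfig.from_env()` after the cell above is one way to confirm that the values round-trip as intended before starting a long indexing run.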

## 6. Index with scripts/index_laws.py

In [None]:
%cd /content/lovli

# Recreate the collection and index the first directory,
# then append the second directory to the same collection.
# IPython expands {COLLECTION_NAME} from the configuration cell.
!python scripts/index_laws.py data/nl --collection {COLLECTION_NAME} --recreate
!python scripts/index_laws.py data/sf --collection {COLLECTION_NAME}

## 7. Optional Retry Pass (if indexing failed)

In [None]:
%cd /content/lovli

# Optional rerun if one of the previous indexing commands failed.
# Re-run only the failing directory.
# !python scripts/index_laws.py data/nl --collection {COLLECTION_NAME}
# !python scripts/index_laws.py data/sf --collection {COLLECTION_NAME}

print('Indexing is script-driven now; use the commands above for retries.')
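
Finding out which directory failed can also be automated. The following sketch runs one command per directory and returns the directories whose command exited with a non-zero status. The helper `run_for_each` is hypothetical, and it assumes the indexing script signals failure through its exit code, which this notebook does not confirm.

```python
import subprocess
from typing import Callable, Iterable, List


def run_for_each(
    directories: Iterable[str],
    build_command: Callable[[str], List[str]],
) -> List[str]:
    """Run one command per directory; return the directories whose command failed."""
    failed = []
    for directory in directories:
        result = subprocess.run(build_command(directory))
        if result.returncode != 0:
            failed.append(directory)
    return failed
```

For a retry pass, `build_command` could return `["python", "scripts/index_laws.py", directory, "--collection", "lovli_laws_v3"]`, without `--recreate`, matching the commented commands above.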

## 8. Ensure Payload Indexes (idempotent)

In [None]:
%cd /content/lovli
!python scripts/create_payload_indexes.py

## 9. Optional Full Rebuild (if needed)

Use this only if you want a fresh collection rebuild after major indexing changes.

In [None]:
%cd /content/lovli

# Full rebuild flow (uncomment to run):
# !python scripts/index_laws.py data/nl --collection {COLLECTION_NAME} --recreate
# !python scripts/index_laws.py data/sf --collection {COLLECTION_NAME}

print('Uncomment both commands above only if you want a full rebuild.')

## 10. Verify Collection and Run Source-Gating Validation

In [None]:
%cd /content/lovli

# Scripted metadata + smoke validation (includes collection-level sanity output).
# IPython expands {COLLECTION_NAME} from the configuration cell.
!python scripts/validate_reindex.py --collection {COLLECTION_NAME} --with-smoke

# Cross-law contamination analysis
!python -u scripts/analyze_law_contamination.py --output eval/law_contamination_report.json
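
To make the contamination metric concrete, here is one plausible definition as a sketch: a question counts as contaminated when any retrieved source belongs to a law outside the expected set. This is an assumption for illustration. The field names `expected_law_ids` and `retrieved_law_ids` are hypothetical, and the definition used by `scripts/analyze_law_contamination.py` may differ.

```python
from typing import Dict, List


def contamination_rate(results: List[Dict]) -> float:
    """Fraction of questions with at least one retrieved source from an unexpected law."""
    if not results:
        return 0.0
    contaminated = sum(
        1
        for row in results
        # A question with no retrieved sources is not counted as contaminated.
        if any(law not in row["expected_law_ids"] for law in row["retrieved_law_ids"])
    )
    return contaminated / len(results)
```

Under this definition, a lower rate means the retrieved sources stay within the laws each question is expected to cite.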

## 11. Retrieval Sweep (quick check + full run)

In [None]:
import os
%cd /content/lovli

# Quick check (optional but recommended before full run)
os.environ['SWEEP_SAMPLE_SIZE'] = '10'
!python -u scripts/sweep_retrieval_thresholds.py
os.environ.pop('SWEEP_SAMPLE_SIZE', None)

# Full run
!python -u scripts/sweep_retrieval_thresholds.py
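
The quick check above sets `SWEEP_SAMPLE_SIZE` and then pops it by hand. If the cell is interrupted between those two steps, the variable stays set and the full run silently becomes a sample run. A small context manager avoids that; the sketch below uses a hypothetical helper named `temporary_env`.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set an environment variable for the duration of a with-block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the earlier value, or remove the variable if it was unset before.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

The quick check could then run inside `with temporary_env('SWEEP_SAMPLE_SIZE', '10'):`, for example by launching the sweep script with `subprocess.run`, which inherits the modified environment.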

## 12. Artifact Overview and Metrics

In [None]:
%cd /content/lovli
!ls -lah eval

import json
from pathlib import Path

artifacts = [
    Path('data/law_catalog.json'),
    Path('eval/law_contamination_report.json'),
    Path('eval/retrieval_sweep_results.json'),
]
for p in artifacts:
    print(f'{p}:', 'exists' if p.exists() else 'missing')

report_path = Path('eval/law_contamination_report.json')
if report_path.exists():
    report = json.loads(report_path.read_text(encoding='utf-8'))
    agg = report.get('aggregate', {})
    print('\nContamination aggregate:')
    for k in [
        'total_questions',
        'contamination_rate',
        'singleton_foreign_rate',
        'unexpected_citation_rate',
        'mean_foreign_score_gap',
    ]:
        print(f'  {k}: {agg.get(k)}')

sweep_path = Path('eval/retrieval_sweep_results.json')
if sweep_path.exists():
    rows = json.loads(sweep_path.read_text(encoding='utf-8'))
    if rows:
        top = rows[0]
        print('\nTop sweep row:')
        for k in [
            'recall_at_k',
            'citation_precision',
            'unexpected_citation_rate',
            'law_contamination_rate',
            'law_coherence_filtered_count',
        ]:
            print(f'  {k}: {top.get(k)}')
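
The cell above reports `rows[0]` as the top row, which relies on the sweep script writing its results in ranked order. If that ordering is not guaranteed, the best row can be selected explicitly. The sketch below is one hypothetical ranking (highest `recall_at_k`, ties broken by lowest `law_contamination_rate`); the criterion the sweep script itself uses is not shown in this notebook.

```python
from typing import Dict, List, Optional


def best_sweep_row(rows: List[Dict]) -> Optional[Dict]:
    """Pick the row with the highest recall, breaking ties by lowest contamination."""
    if not rows:
        return None
    return max(
        rows,
        key=lambda row: (
            row.get("recall_at_k") or 0.0,
            # Negate so that a lower contamination rate ranks higher.
            -(row.get("law_contamination_rate") or 0.0),
        ),
    )
```

Calling `best_sweep_row(rows)` on the loaded sweep results returns a single dict, or `None` when the file contains no rows.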