# FAISS Index Builder (GPU)

**⚠️ This notebook is designed for a GPU environment. It falls back to CPU when no GPU is detected, but index building will be slow.**

This notebook builds FAISS indices for all embedding models:
1. Legal-BERT (`nlpaueb/legal-bert-base-uncased`)
2. GTE-Large (`thenlper/gte-large`)
3. BGE-Large (`BAAI/bge-large-en-v1.5`)

Each model is indexed over two fields:
- Content field
- Metadata field

Indices are saved to disk for later use on CPU.

In [None]:
import sys
sys.path.append('..')

import torch
import pandas as pd
import numpy as np
from pathlib import Path

from src.data_loader import load_data, prepare_data, get_documents_by_field
from src.faiss_retriever import FAISSRetriever, build_all_indices
from src.config import EMBEDDING_MODELS, INDICES_DIR, EMBEDDING_BATCH_SIZE

print("✓ Imports successful")

## 1. Check GPU Availability

In [None]:
# Check GPU availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("\n⚠️ WARNING: No GPU detected!")
    print("This notebook is designed for GPU. Index building will be slow on CPU.")
    print("Consider running on a GPU-enabled environment (Google Colab, Kaggle, etc.)")

## 2. Load and Prepare Data

In [None]:
# Load data
df = load_data()
df = prepare_data(df)

print(f"Loaded {len(df)} documents")

# Get document lists
documents_content = get_documents_by_field(df, 'content')
documents_metadata = get_documents_by_field(df, 'metadata')

print(f"Content documents: {len(documents_content)}")
print(f"Metadata documents: {len(documents_metadata)}")

## 3. Review Embedding Models

In [None]:
# Display embedding models
print("Embedding Models Configuration:")
print("="*80)

for key, config in EMBEDDING_MODELS.items():
    print(f"\n{key}:")
    print(f"  Model: {config['name']}")
    print(f"  Description: {config['description']}")
    print(f"  Max Length: {config['max_length']}")

## 4. Build All FAISS Indices

This will build 6 indices total:
- 3 models × 2 fields (content + metadata)

**Note:** This may take 30-60 minutes depending on GPU and dataset size.
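
The build plan above can be sketched as a small enumeration, which also gives a rough handle on runtime by counting embedding batches. This is only an illustration: `plan_builds` is a hypothetical helper, the model keys other than `legal-bert` are guesses (the real keys are those of `EMBEDDING_MODELS`), and the document count and batch size are made-up numbers.

```python
import math
from itertools import product

# Hypothetical keys: only "legal-bert" appears in this notebook.
# The real keys are whatever EMBEDDING_MODELS defines.
MODEL_KEYS = ["legal-bert", "gte-large", "bge-large"]
FIELDS = ["content", "metadata"]


def plan_builds(model_keys, fields, num_documents, batch_size):
    """Return the index names to build and the total number of embedding batches."""
    # Same "{field}_{model_key}" naming as the verification cell in section 5.
    names = [f"{field}_{model}" for model, field in product(model_keys, fields)]
    batches_per_index = math.ceil(num_documents / batch_size)
    return names, batches_per_index * len(names)


# Illustrative numbers, not measurements from this dataset.
names, total_batches = plan_builds(MODEL_KEYS, FIELDS, num_documents=10_000, batch_size=32)
print(names)
print(total_batches)
```

Multiplying the batch count by a measured time per batch on your own hardware gives a better runtime estimate than the 30-60 minute range.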

In [None]:
# Build all indices
# Use the GPU when one is available; otherwise fall back to CPU
use_gpu = torch.cuda.is_available()

print(f"Building indices with GPU: {use_gpu}")
print(f"Batch size: {EMBEDDING_BATCH_SIZE}")
print("\nThis will take some time...\n")

build_all_indices(
    documents_content=documents_content,
    documents_metadata=documents_metadata,
    use_gpu=use_gpu,
    batch_size=EMBEDDING_BATCH_SIZE
)

print("\n✓ All indices built and saved successfully!")

## 5. Verify Saved Indices

In [None]:
# List all saved indices
import json

print("Saved FAISS Indices:")
print("="*80)

for model_key in EMBEDDING_MODELS:
    for field in ['content', 'metadata']:
        index_dir = INDICES_DIR / f"{field}_{model_key}"
        metadata_file = index_dir / "metadata.json"
        index_file = index_dir / "index.faiss"

        if not index_dir.exists():
            print(f"\n⚠️ {field}_{model_key}: NOT FOUND")
            continue

        print(f"\n{field}_{model_key}:")

        # Report the build metadata, or flag the file if it is missing
        if metadata_file.exists():
            with open(metadata_file, 'r', encoding='utf-8') as f:
                metadata = json.load(f)
            print(f"  Model: {metadata['model_name']}")
            print(f"  Documents: {metadata['num_documents']}")
            print(f"  Dimension: {metadata['dimension']}")
        else:
            print("  ⚠️ metadata.json missing")

        # Report the index file size, or flag the file if it is missing
        if index_file.exists():
            size_mb = index_file.stat().st_size / (1024 * 1024)
            print(f"  Index size: {size_mb:.2f} MB")
        else:
            print("  ⚠️ index.faiss missing")

## 6. Test Index Loading (Quick Verification)
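
The test cell in this section prints the top results for inspection. A few structural checks on the returned values can back that up. The sketch below uses a hypothetical helper, `check_retrieval_output`, and assumes that `retrieve` returns similarity scores where higher is better; neither is confirmed by `src.faiss_retriever`. The `num_documents` argument could come from the `num_documents` field in `metadata.json`.

```python
def check_retrieval_output(indices, scores, num_documents, top_k):
    """Return a list of problems found in a retrieval result (empty if none)."""
    problems = []
    indices = [int(i) for i in indices]
    scores = [float(s) for s in scores]

    if len(indices) != len(scores):
        problems.append("indices and scores differ in length")
    if len(indices) > top_k:
        problems.append("more results than top_k")
    # FAISS pads with -1 when fewer than k neighbors are found,
    # so negative values are reported as out of range here.
    if any(not 0 <= i < num_documents for i in indices):
        problems.append("index out of range")
    if len(set(indices)) != len(indices):
        problems.append("duplicate document indices")
    # Assumption: similarity scores (higher is better).
    # A distance-based index would be sorted ascending instead.
    if any(a < b for a, b in zip(scores, scores[1:])):
        problems.append("scores not in descending order")
    return problems


print(check_retrieval_output([12, 4, 7], [0.91, 0.85, 0.80], num_documents=100, top_k=5))
print(check_retrieval_output([12, -1, 7], [0.5, 0.9, 0.4], num_documents=100, top_k=5))
```

An empty list means none of these checks failed; it does not say anything about whether the retrieved documents are relevant.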

In [None]:
# Test loading one index to verify it works
test_model = "legal-bert"
test_field = "content"

print(f"Testing index loading: {test_field}_{test_model}")

index_path = INDICES_DIR / f"{test_field}_{test_model}"
retriever = FAISSRetriever(test_model, index_path=index_path)

# Test retrieval
test_query = "What are the procedures for presidential elections?"
indices, scores, ret_time = retriever.retrieve(test_query, top_k=5)

print(f"\nTest query: {test_query}")
print(f"Retrieval time: {ret_time:.4f}s")
print("\nTop 5 results:")
for rank, (idx, score) in enumerate(zip(indices, scores), start=1):
    print(f"  {rank}. Index: {idx}, Score: {score:.4f}")

print("\n✓ Index loading and retrieval working correctly!")

## Summary

FAISS index building complete!

Created indices:
1. **Legal-BERT** (content + metadata)
   - Domain-specific legal model
   - Pretrained on legal text, so suited to legal terminology

2. **GTE-Large** (content + metadata)
   - General-purpose text embedding model
   - Strong cross-domain performance

3. **BGE-Large** (content + metadata)
   - Ranked near the top of the MTEB leaderboard at release
   - Excellent for retrieval tasks

All indices saved to: `../indices/`
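
Before moving to the CPU notebooks, it can help to confirm that every expected index directory is complete. The following is a minimal sketch using a hypothetical helper, `find_missing_indices`; it relies only on the `{field}_{model_key}` directory naming and the `index.faiss` and `metadata.json` file names used in section 5, and it demonstrates the check on a throwaway directory rather than the real `INDICES_DIR`.

```python
import json
import tempfile
from pathlib import Path


def find_missing_indices(indices_dir, model_keys, fields=("content", "metadata")):
    """Return names of expected index directories that are absent or incomplete."""
    missing = []
    for model_key in model_keys:
        for field in fields:
            index_dir = Path(indices_dir) / f"{field}_{model_key}"
            required = [index_dir / "index.faiss", index_dir / "metadata.json"]
            if not all(path.is_file() for path in required):
                missing.append(index_dir.name)
    return missing


# Demo on a temporary directory that holds one complete index.
with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp) / "content_legal-bert"
    complete.mkdir()
    (complete / "index.faiss").write_bytes(b"")  # placeholder, not a real FAISS file
    (complete / "metadata.json").write_text(json.dumps({"num_documents": 0}))
    missing = find_missing_indices(tmp, ["legal-bert"])
    print(missing)
```

In the notebook itself, the call would presumably look like `find_missing_indices(INDICES_DIR, EMBEDDING_MODELS)`, since iterating a dict yields its keys.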

Next steps:
- Run retrieval evaluation on CPU (notebook 04)
- Compare with BM25 results
- Apply reranking (notebook 05)