# FAISS Index Builder (GPU) - Google Colab Standalone

**⚠️ This notebook requires a GPU environment!**

## Running on Google Colab

1. **Enable GPU:** Runtime → Change runtime type → Hardware accelerator: GPU (T4)
2. **Upload dataset:** Upload `acts_with_metadata.tsv` to Colab files
3. **Run all cells** in order
4. **Download indices:** After completion, download the `indices.zip` file

This notebook builds FAISS indices for all embedding models:
1. Legal-BERT (`nlpaueb/legal-bert-base-uncased`)
2. GTE-Large (`thenlper/gte-large`)
3. BGE-Large (`BAAI/bge-large-en-v1.5`)
4. BGE-M3 (`BAAI/bge-m3`) - 8K context window

Each model is indexed on two fields:
- Content field
- Metadata field

**Total:** 8 indices (4 models × 2 fields)
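
The 4 × 2 layout can be enumerated directly. This is a minimal sketch, assuming the `<field>_<model_key>` directory naming used by the build cell in this notebook; `index_dir_names` is a hypothetical helper for illustration and is not part of the notebook itself.

```python
from itertools import product

# Short model keys and field names as used elsewhere in this notebook.
MODEL_KEYS = ["legal-bert", "gte-large", "bge-large", "bge-m3"]
FIELDS = ["content", "metadata"]


def index_dir_names(fields, model_keys):
    """Return one directory name per (field, model) pair."""
    return [f"{field}_{key}" for field, key in product(fields, model_keys)]


names = index_dir_names(FIELDS, MODEL_KEYS)
print(len(names))  # 8
print(names[0])    # content_legal-bert
```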

In [None]:
# Install required packages
# Note: Google Colab already has torch and CUDA pre-installed

# Check CUDA availability first
import subprocess
import sys

try:
    cuda_available = subprocess.run(['nvidia-smi'], capture_output=True).returncode == 0
    print(f"CUDA available: {cuda_available}")
except OSError:  # nvidia-smi is not installed or cannot be executed
    cuda_available = False
    print("CUDA not available")

# Install FAISS
# Official FAISS GPU builds are distributed through conda, which Colab
# does not ship by default, so this notebook installs the CPU build
print("Installing FAISS...")
!pip install -q faiss-cpu  # CPU-only build from PyPI

# Install other packages
print("Installing other dependencies...")
!pip install -q transformers sentence-transformers
!pip install -q FlagEmbedding
!pip install -q pandas numpy tqdm

print("\n✓ All packages installed successfully!")
print("\n⚠️ Note: Using faiss-cpu, so FAISS index operations run on the CPU.")
print("The embedding models (transformers) will still use the GPU for encoding.")

## 0. Install Dependencies

In [None]:
# Import necessary libraries
import torch
import pandas as pd
import numpy as np
import json
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional
from sentence_transformers import SentenceTransformer
import faiss
from tqdm.auto import tqdm

# Import BGE-M3 model
try:
    from FlagEmbedding import BGEM3FlagModel
    BGE_M3_AVAILABLE = True
except ImportError:
    BGE_M3_AVAILABLE = False
    print("Warning: FlagEmbedding not available. BGE-M3 model will not work.")

print("✓ Imports successful")

In [None]:
class FAISSIndexBuilder:
    """FAISS index builder for embedding models."""
    
    def __init__(self, model_name: str):
        """Initialize the index builder with a specific model."""
        self.model_key = model_name
        self.model_config = EMBEDDING_MODELS[model_name]
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.library = self.model_config.get("library", "sentence-transformers")
        
        print(f"Loading model: {self.model_config['name']}")
        print(f"Library: {self.library}")
        print(f"Device: {self.device}")
        
        # Load embedding model
        if self.library == "flagembedding":
            if not BGE_M3_AVAILABLE:
                raise ImportError("FlagEmbedding library required for BGE-M3")
            self.model = BGEM3FlagModel(
                self.model_config["name"],
                use_fp16=True
            )
        else:
            self.model = SentenceTransformer(
                self.model_config["name"],
                device=self.device
            )
        
        print("✓ Model loaded successfully")
    
    def encode_documents(
        self, 
        documents: List[str],
        batch_size: int = 128,
        show_progress: bool = True
    ) -> np.ndarray:
        """Encode documents into embeddings."""
        print(f"Encoding {len(documents)} documents...")
        
        if self.library == "flagembedding":
            # BGE-M3 encoding
            max_length = self.model_config.get("max_length", 8192)
            embeddings = self.model.encode(
                documents,
                batch_size=batch_size,
                max_length=max_length,
                return_dense=True,
                return_sparse=False,
                return_colbert_vecs=False
            )['dense_vecs']
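            # Normalize in float32 (fp16 output loses precision) and guard against zero vectors
            embeddings = np.asarray(embeddings, dtype=np.float32)
            norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
            embeddings = embeddings / np.clip(norms, 1e-12, None)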
        else:
            # SentenceTransformer encoding
            embeddings = self.model.encode(
                documents,
                batch_size=batch_size,
                show_progress_bar=show_progress,
                convert_to_numpy=True,
                normalize_embeddings=True,
            )
        
        print(f"Embeddings shape: {embeddings.shape}")
        return embeddings
    
    def build_index(
        self, 
        documents: List[str],
        batch_size: int = 128,
        use_gpu: bool = False
    ):
        """Build FAISS index from documents."""
        print(f"\nBuilding FAISS index for {self.model_key}...")
        
        # Encode documents (this uses GPU via PyTorch/Transformers)
        embeddings = self.encode_documents(documents, batch_size)
        
        # Create FAISS index (CPU-only with faiss-cpu package)
        dimension = embeddings.shape[1]
        index = faiss.IndexFlatIP(dimension)  # Inner product (for normalized vectors = cosine)
        
        # Check if FAISS has GPU support
        has_gpu_support = hasattr(faiss, 'StandardGpuResources')
        
        if use_gpu and has_gpu_support and torch.cuda.is_available():
            # Only use GPU FAISS if available (requires faiss-gpu package)
            print("Moving FAISS index to GPU...")
            res = faiss.StandardGpuResources()
            index = faiss.index_cpu_to_gpu(res, 0, index)
        else:
            if use_gpu and not has_gpu_support:
                print("⚠️ FAISS GPU not available (using faiss-cpu). Index operations will use CPU.")
                print("   Note: Embeddings are still computed on GPU (the slow part)!")
        
        # Add embeddings
        print("Adding embeddings to index...")
        index.add(embeddings.astype('float32'))
        
        print(f"✓ Index built: {index.ntotal} vectors")
        
        # Convert back to CPU for saving if needed
        if use_gpu and has_gpu_support and torch.cuda.is_available():
            index = faiss.index_gpu_to_cpu(index)
        
        self.index = index
        self.documents = documents
        
        return index
    
    def save_index(self, save_dir: Path):
        """Save FAISS index and metadata."""
        save_dir.mkdir(parents=True, exist_ok=True)
        
        # Save FAISS index
        index_path = save_dir / "index.faiss"
        faiss.write_index(self.index, str(index_path))
        print(f"✓ Index saved to: {index_path}")
        
        # Save metadata
        metadata = {
            "model_name": self.model_config["name"],
            "model_key": self.model_key,
            "num_documents": len(self.documents),
            "dimension": self.index.d,
            "library": self.library,
        }
        
        metadata_path = save_dir / "metadata.json"
        with open(metadata_path, 'w') as f:
            json.dump(metadata, f, indent=2)
        print(f"✓ Metadata saved to: {metadata_path}")
        
        # Save document list (for reference)
        docs_path = save_dir / "documents.txt"
        with open(docs_path, 'w', encoding='utf-8') as f:
            for doc in self.documents:
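                # Flatten embedded line breaks so line N always maps to vector N
                f.write(doc.replace('\r', ' ').replace('\n', ' ') + '\n')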
        print(f"✓ Documents list saved to: {docs_path}")

print("✓ Helper functions defined")

## Helper Functions
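
`FAISSIndexBuilder` normalizes embeddings and stores them in an `IndexFlatIP`, relying on the fact that the inner product of two unit-length vectors equals their cosine similarity. The sketch below checks that identity in plain Python; the helper names and the sample vectors are illustrative assumptions, not part of the notebook.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def inner_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    denom = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / denom


# Hypothetical 2-d vectors standing in for embeddings.
a, b = [3.0, 4.0], [4.0, 3.0]
ip_of_unit_vectors = inner_product(l2_normalize(a), l2_normalize(b))
print(round(cosine(a, b), 6), round(ip_of_unit_vectors, 6))
```

Both printed values agree, which is why normalizing once at encoding time lets the index use the cheaper inner product at search time.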

In [None]:
# Configuration
EMBEDDING_MODELS = {
    "legal-bert": {
        "name": "nlpaueb/legal-bert-base-uncased",
        "description": "Domain-specific BERT model trained on legal corpora",
        "max_length": 512,
        "library": "sentence-transformers",
    },
    "gte-large": {
        "name": "thenlper/gte-large",
        "description": "State-of-the-art general-purpose embedding model",
        "max_length": 512,
        "library": "sentence-transformers",
    },
    "bge-large": {
        "name": "BAAI/bge-large-en-v1.5",
        "description": "Top-performing model for retrieval tasks",
        "max_length": 512,
        "library": "sentence-transformers",
    },
    "bge-m3": {
        "name": "BAAI/bge-m3",
        "description": "BGE-M3: Multi-Functionality, Multi-Linguality model with 8192 token support",
        "max_length": 8192,
        "library": "flagembedding",
    },
}

# Batch size for encoding (adjust based on your GPU VRAM)
# 128 for 16GB GPU, 256 for 24GB GPU, 64 for 12GB GPU
EMBEDDING_BATCH_SIZE = 128

# Metadata fields to combine
METADATA_FIELDS = ["short_title", "keywords", "section_title", "summary"]
METADATA_SEPARATOR = " | "

# Output directory for indices
INDICES_DIR = Path("indices")
INDICES_DIR.mkdir(exist_ok=True)

print("✓ Configuration loaded")

## Configuration

## 1. Check GPU Availability

In [None]:
# Check GPU availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("\n⚠️ WARNING: No GPU detected!")
    print("This notebook is designed for GPU. Index building will be slow on CPU.")
    print("Consider running on a GPU-enabled environment (Google Colab, Kaggle, etc.)")

# Check FAISS capabilities
print(f"\nFAISS version: {faiss.__version__}")
has_gpu = hasattr(faiss, 'StandardGpuResources')
print(f"FAISS GPU support: {has_gpu}")

if not has_gpu:
    print("\n💡 Using faiss-cpu package (expected for Colab)")
    print("   ✅ GPU will still be used for encoding, which dominates compute time")
    print("   ✅ Flat-index operations on CPU are comparatively fast")
    print("   ⏱️ Expect only a few extra minutes in total from CPU indexing")

## 2. Upload and Load Dataset

**⚠️ Upload your `acts_with_metadata.tsv` file using the file upload button in Colab**
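
Before loading the full file, it can help to confirm that the upload has the columns the later cells read. This is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper, and the sample data is made up for illustration.

```python
import csv
import io

# Columns read by the later cells: the text itself plus the four metadata fields.
REQUIRED_COLUMNS = ["content", "short_title", "keywords", "section_title", "summary"]


def missing_columns(tsv_file, required=REQUIRED_COLUMNS):
    """Read only the header row of a TSV file object and list absent columns."""
    header = next(csv.reader(tsv_file, delimiter="\t"), [])
    return [col for col in required if col not in header]


# Hypothetical sample with "section_title" left out on purpose.
sample = io.StringIO(
    "content\tshort_title\tkeywords\tsummary\n"
    "Some text\tAn Act\tlaw\tA summary\n"
)
print(missing_columns(sample))  # ['section_title']
```

For the real file, the same helper could be called as `missing_columns(open("acts_with_metadata.tsv", newline="", encoding="utf-8"))`. A missing metadata column only yields an empty slot in the joined metadata string, whereas a missing `content` column stops the data preparation cell.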

In [None]:
# Option 1: Upload file manually in Colab
# Click the folder icon on the left, then upload acts_with_metadata.tsv

# Option 2: If you have it in Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
# data_file = '/content/drive/MyDrive/path/to/acts_with_metadata.tsv'

# For local Colab upload:
data_file = 'acts_with_metadata.tsv'

# Load the dataset
print("Loading dataset...")
df = pd.read_csv(data_file, sep='\t')

print(f"✓ Loaded {len(df)} documents")
print(f"\nDataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
print(df.head(2))

In [None]:
# Prepare metadata field (join 4 fields into 1)
print("Preparing metadata field...")

# Fill NaN values
for field in METADATA_FIELDS:
    if field in df.columns:
        df[field] = df[field].fillna('')

# Join metadata fields
df['metadata'] = df.apply(
    lambda row: METADATA_SEPARATOR.join([
        str(row.get(field, '')) for field in METADATA_FIELDS
    ]),
    axis=1
)

print("✓ Metadata field created")

# Get document lists
documents_content = df['content'].fillna('').astype(str).tolist()  # avoid literal 'nan' strings
documents_metadata = df['metadata'].astype(str).tolist()

print(f"\n✓ Content documents: {len(documents_content)}")
print(f"✓ Metadata documents: {len(documents_metadata)}")

# Show examples
print("\nContent example:")
print(documents_content[0][:200] + "...")
print("\nMetadata example:")
print(documents_metadata[0][:200] + "...")

## Prepare Data Fields
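
The metadata join can be illustrated without pandas. This is a sketch under the assumption that a row behaves like a dictionary; `join_metadata` and the sample row are hypothetical, while the field list and separator are copied from the configuration cell.

```python
METADATA_FIELDS = ["short_title", "keywords", "section_title", "summary"]
METADATA_SEPARATOR = " | "


def join_metadata(row: dict) -> str:
    """Join the metadata fields in order, using '' for missing or None values."""
    parts = []
    for field in METADATA_FIELDS:
        value = row.get(field)
        parts.append("" if value is None else str(value))
    return METADATA_SEPARATOR.join(parts)


# Hypothetical row; "section_title" is deliberately absent.
row = {
    "short_title": "Example Act",
    "keywords": "elections; voting",
    "summary": "Sets out voting rules.",
}
print(join_metadata(row))
```

An empty field still contributes a separator, so every metadata string has the same four-slot layout regardless of which values are present.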

## 3. Review Embedding Models

In [None]:
# Display embedding models configuration
print("Embedding Models Configuration:")
print("="*80)

for key, config in EMBEDDING_MODELS.items():
    print(f"\n{key}:")
    print(f"  Model: {config['name']}")
    print(f"  Description: {config['description']}")
    print(f"  Max Length: {config['max_length']} tokens")
    print(f"  Library: {config['library']}")

print(f"\n{'='*80}")
print(f"Batch Size: {EMBEDDING_BATCH_SIZE}")
print(f"Total Models: {len(EMBEDDING_MODELS)}")
print(f"Total Indices to Build: {len(EMBEDDING_MODELS) * 2} (content + metadata for each)")

## 4. Build All FAISS Indices

This will build 8 indices total:
- 4 models × 2 fields (content + metadata)

**Models:**
1. Legal-BERT (512 tokens, 768-dim)
2. GTE-Large (512 tokens, 1024-dim)
3. BGE-Large (512 tokens, 1024-dim)
4. BGE-M3 (8192 tokens, 1024-dim) - Long context!

**Note:** This may take 40-60 minutes depending on GPU. Monitor GPU usage with the cell below.

In [None]:
# Build all indices
use_gpu = torch.cuda.is_available()

print(f"Building indices with GPU: {use_gpu}")
print(f"Batch size: {EMBEDDING_BATCH_SIZE}")
print(f"\nThis will take some time...\n")
print("="*80)

# Build indices for each model, recording failures so the final
# message reflects what actually happened
failed_models = []

for model_idx, model_key in enumerate(EMBEDDING_MODELS.keys(), 1):
    print(f"\n{'='*80}")
    print(f"MODEL {model_idx}/{len(EMBEDDING_MODELS)}: {model_key}")
    print(f"{'='*80}\n")
    
    try:
        # Initialize builder (loads the embedding model once)
        builder = FAISSIndexBuilder(model_key)
        
        # Build content index
        print(f"\n--- Building CONTENT index for {model_key} ---")
        builder.build_index(documents_content, batch_size=EMBEDDING_BATCH_SIZE, use_gpu=use_gpu)
        content_dir = INDICES_DIR / f"content_{model_key}"
        builder.save_index(content_dir)
        
        # Build metadata index, reusing the loaded model;
        # build_index() replaces the previous index and document list
        print(f"\n--- Building METADATA index for {model_key} ---")
        builder.build_index(documents_metadata, batch_size=EMBEDDING_BATCH_SIZE, use_gpu=use_gpu)
        metadata_dir = INDICES_DIR / f"metadata_{model_key}"
        builder.save_index(metadata_dir)
        
        print(f"\n✓ Completed {model_key} ({model_idx}/{len(EMBEDDING_MODELS)})")
        
        # Clear GPU cache
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        
    except Exception as e:
        failed_models.append(model_key)
        print(f"\n❌ Error building indices for {model_key}: {str(e)}")
        import traceback
        traceback.print_exc()

print("\n" + "="*80)
if failed_models:
    print(f"⚠️ Index building failed for: {', '.join(failed_models)}")
else:
    print("✓ ALL INDICES BUILT AND SAVED SUCCESSFULLY!")
print("="*80)

In [None]:
# Check GPU memory usage
if torch.cuda.is_available():
    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"GPU Memory Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    print(f"GPU Memory Total: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    
    # Share of total memory held by PyTorch tensors (not compute utilization)
    allocated_pct = (torch.cuda.memory_allocated() / torch.cuda.get_device_properties(0).total_memory) * 100
    print(f"\nGPU Memory In Use: {allocated_pct:.1f}%")
else:
    print("No GPU available")

## (Optional) Monitor GPU Usage

The memory check above reports PyTorch's allocated and reserved GPU memory. A notebook runs one cell at a time, so run it before or after the build cell rather than during it.
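
For compute utilization, as opposed to PyTorch's memory counters, `nvidia-smi` can be queried from Python. This is a minimal sketch: the `--query-gpu` and `--format` options belong to `nvidia-smi`, while `parse_gpu_stats` and `query_gpu_stats` are hypothetical helper names, and the sample line passed to the parser is made up.

```python
import subprocess


def parse_gpu_stats(line: str) -> dict:
    """Parse one CSV line: 'utilization %, memory used MiB, memory total MiB'."""
    util, used, total = (float(part) for part in line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def query_gpu_stats():
    """Return stats for each GPU, or None if nvidia-smi is missing or fails."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        lines = [line for line in result.stdout.splitlines() if line.strip()]
        return [parse_gpu_stats(line) for line in lines]
    except (OSError, subprocess.CalledProcessError, ValueError):
        # Not installed, exited with an error, or returned unparseable values
        return None


print(parse_gpu_stats("37, 5120, 15360"))
print(query_gpu_stats())
```

On a machine without an NVIDIA driver, `query_gpu_stats()` returns `None` instead of raising, so the cell is safe to run in any environment.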

## 5. Verify Saved Indices

In [None]:
# List all saved indices
import json

print("Saved FAISS Indices:")
print("="*80)

for model_key in EMBEDDING_MODELS.keys():
    for field in ['content', 'metadata']:
        index_dir = INDICES_DIR / f"{field}_{model_key}"
        
        if index_dir.exists():
            # Load metadata
            metadata_file = index_dir / "metadata.json"
            if metadata_file.exists():
                with open(metadata_file, 'r') as f:
                    metadata = json.load(f)
                
                print(f"\n{field}_{model_key}:")
                print(f"  Model: {metadata['model_name']}")
                print(f"  Documents: {metadata['num_documents']}")
                print(f"  Dimension: {metadata['dimension']}")
                
                # Check file sizes
                index_file = index_dir / "index.faiss"
                if index_file.exists():
                    size_mb = index_file.stat().st_size / (1024 * 1024)
                    print(f"  Index size: {size_mb:.2f} MB")
        else:
            print(f"\n⚠️ {field}_{model_key}: NOT FOUND")

## 6. Download Indices (for use on local machine)

Zip all indices and download to your local machine.

In [None]:
# Create a zip file of all indices
import shutil

print("Creating zip archive of indices...")

# Create zip file
zip_path = shutil.make_archive('indices', 'zip', INDICES_DIR)

print(f"✓ Indices zipped to: {zip_path}")

# Get file size
import os
zip_size_mb = os.path.getsize(zip_path) / (1024 * 1024)
print(f"Archive size: {zip_size_mb:.2f} MB")

# In Colab, download the file
try:
    from google.colab import files
    print("\nDownloading zip file...")
    files.download(zip_path)
    print("✓ Download started!")
except ImportError:
    print("\nNot running in Colab - file saved locally as 'indices.zip'")
    print("You can download it manually from the files panel.")

In [None]:
# Quick test to verify indices work
test_model = "legal-bert"
test_field = "content"

print(f"Testing index: {test_field}_{test_model}\n")

# Load index
index_path = INDICES_DIR / f"{test_field}_{test_model}" / "index.faiss"
index = faiss.read_index(str(index_path))

# Load model for query encoding
builder = FAISSIndexBuilder(test_model)

# Test query
test_query = "What are the procedures for presidential elections?"
print(f"Test query: {test_query}\n")

# Encode query
query_embedding = builder.encode_documents([test_query], batch_size=1, show_progress=False)

# Search
k = 5
scores, indices = index.search(query_embedding.astype('float32'), k)

print(f"Top {k} results:")
for i, (idx, score) in enumerate(zip(indices[0], scores[0])):
    print(f"\n{i+1}. Document Index: {idx}")
    print(f"   Score: {score:.4f}")
    print(f"   Content: {documents_content[idx][:150]}...")

print("\n✓ Index loading and search working correctly!")

## 7. Test Index Loading (Verification)
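
A flat inner-product index answers a query by scoring it against every stored vector and keeping the `k` best. The sketch below reproduces that exhaustive search in plain Python to show what the scores and indices in the test output mean; `top_k_inner_product` and the toy vectors are illustrative assumptions, not FAISS code.

```python
import heapq


def top_k_inner_product(query, vectors, k):
    """Return the k (index, score) pairs with the highest inner product."""
    scored = (
        (idx, sum(q * v for q, v in zip(query, vec)))
        for idx, vec in enumerate(vectors)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical unit vectors standing in for document embeddings.
vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]

for idx, score in top_k_inner_product(query, vectors, k=2):
    print(idx, score)
```

With normalized embeddings the score is a cosine similarity, so values near 1 indicate close matches and values near 0 indicate unrelated text.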

## Summary

### ✅ FAISS Index Building Complete!

Created **8 indices** for:

1. **Legal-BERT** (content + metadata)
   - Domain-specific legal model
   - 768-dim embeddings, 512 max tokens

2. **GTE-Large** (content + metadata)
   - State-of-the-art general embedding
   - 1024-dim embeddings, 512 max tokens

3. **BGE-Large** (content + metadata)
   - Ranked among the top models on the MTEB leaderboard at release
   - 1024-dim embeddings, 512 max tokens

4. **BGE-M3** (content + metadata) 🆕
   - Multi-lingual, long-context model
   - 1024-dim embeddings, **8192 max tokens**
   - Supports 100+ languages

All indices saved to: `indices/` directory

### Next Steps:

1. **Download** the `indices.zip` file to your local machine
2. **Extract** to your project's `indices/` folder
3. **Run** the FAISS retrieval notebook (03) on CPU
4. **Compare** with BM25 results
5. **Apply** reranking (notebook 04)

### File Structure:
```
indices/
├── content_legal-bert/
│   ├── index.faiss
│   ├── metadata.json
│   └── documents.txt
├── content_gte-large/
├── content_bge-large/
├── content_bge-m3/
├── metadata_legal-bert/
├── metadata_gte-large/
├── metadata_bge-large/
└── metadata_bge-m3/
```

**Total Size:** ~22 GB (8 indices)
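
The size of a flat index follows from its contents: it stores every vector as raw float32, so each index takes roughly `num_vectors × dimension × 4` bytes. The sketch below applies that to the four models and two fields; the document count is a hypothetical placeholder, and the estimate ignores the small index header and the `documents.txt` files.

```python
BYTES_PER_FLOAT32 = 4
FIELDS_PER_MODEL = 2  # content + metadata

# Embedding dimensions as listed in this notebook.
DIMENSIONS = {"legal-bert": 768, "gte-large": 1024, "bge-large": 1024, "bge-m3": 1024}


def flat_index_bytes(num_vectors: int, dimension: int) -> int:
    """Approximate size of a flat float32 index, excluding its header."""
    return num_vectors * dimension * BYTES_PER_FLOAT32


def total_index_gb(num_documents: int) -> float:
    """Approximate combined size of all eight indices in gigabytes."""
    total = sum(
        flat_index_bytes(num_documents, dim) * FIELDS_PER_MODEL
        for dim in DIMENSIONS.values()
    )
    return total / 1e9


num_documents = 100_000  # hypothetical corpus size, not taken from the dataset
print(f"{total_index_gb(num_documents):.2f} GB")
```

Each document costs about 30 KB across the eight indices, so the total scales linearly with the number of rows in the dataset.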

---

**💡 Tip:** Retrieval and evaluation can now run on CPU. A GPU is only needed to speed up embedding during index building.