# 🏗️ Notebook 2: Faiss Index Creation & Quantization
**Author:** Gabriele Righi
**Project:** Dense vs Sparse Retrieval Reproducibility

## 🎯 Objective
This notebook is responsible for the **indexing phase** of the pipeline.
It takes the pre-computed document embeddings (from BGE-Base) and builds efficient **Faiss** indexes to enable fast similarity search.

## ⚙️ Key Operations
1.  **HNSW FP32:** Builds a graph-based index (Hierarchical Navigable Small World) over the uncompressed 32-bit vectors. Search is approximate, but the stored vectors keep full precision.
2.  **Quantization (INT8):** Trains a scalar quantizer (SQ8) that compresses each vector component from a 32-bit float to an 8-bit code, cutting vector storage by about 75%.
3.  **HNSW INT8:** Builds a quantized graph index (fast search, low RAM).
4.  **Flat INT8:** Builds a quantized brute-force (exhaustive search) index.
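
To make the 75% figure concrete, here is a rough per-vector storage estimate. It is a sketch under stated assumptions, not a measurement: the dimension of 768 is assumed (the notebook reads the real value from `doc_embeddings.shape[1]`), `M = 16` matches the notebook, and the HNSW link cost of `M * 2 * 4` bytes per vector is only an approximation of the graph overhead. The helper names are hypothetical.

```python
# Rough storage estimate per vector. All numbers are approximations.
DIM = 768  # assumed embedding dimension; the notebook uses doc_embeddings.shape[1]
M = 16     # HNSW neighbors per node, as configured in the notebook


def bytes_per_vector(dim, bytes_per_component, hnsw_m=None):
    """Vector payload plus an approximate HNSW link cost (if hnsw_m is given)."""
    vector_bytes = dim * bytes_per_component
    # Approximation: about 2*M neighbor ids of 4 bytes each per vector.
    link_bytes = 0 if hnsw_m is None else hnsw_m * 2 * 4
    return vector_bytes + link_bytes


def saving(before, after):
    """Fraction of storage saved when going from `before` to `after` bytes."""
    return 1 - after / before


flat_fp32 = bytes_per_vector(DIM, 4)
flat_int8 = bytes_per_vector(DIM, 1)
hnsw_fp32 = bytes_per_vector(DIM, 4, M)
hnsw_int8 = bytes_per_vector(DIM, 1, M)

print(f"Flat: {flat_fp32} -> {flat_int8} bytes ({saving(flat_fp32, flat_int8):.0%} saved)")
print(f"HNSW: {hnsw_fp32} -> {hnsw_int8} bytes ({saving(hnsw_fp32, hnsw_int8):.0%} saved)")
```

Under these assumptions the flat index saves exactly three quarters of its vector storage, while the HNSW variant saves slightly less because the graph links are not compressed.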

## 📂 Inputs & Outputs
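* **Input:** `doc_embeddings.npy` (generated in Notebook 1). `query_embeddings.npy` is loaded from the same directory but is not used for indexing.
* **Output:** `hnsw_index.faiss`, `hnsw_int8_index.faiss`, and `flat_int8_index.faiss`, saved to `/kaggle/working/nq_indexes` for the Benchmark phase (Notebook 3).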

---

In [None]:
# Use faiss-cpu: it needs no GPU/CUDA setup, which avoids installation errors
!pip install faiss-cpu sentence-transformers

In [None]:
import os
import numpy as np

# Automatically locate the output files from the previous notebook
print("🔍 Searching for output files from the previous notebook...")

base_input_path = '/kaggle/input/notebooke6a6d451f2/nq_experiments'
doc_emb_path = None
query_emb_path = None

# Automatically scan to find the correct directory
for root, dirs, files in os.walk(base_input_path):
    if 'doc_embeddings.npy' in files:
        doc_emb_path = os.path.join(root, 'doc_embeddings.npy')
        query_emb_path = os.path.join(root, 'query_embeddings.npy')
        print(f"✅ Found in: {root}")
        break

if doc_emb_path is None:
    # Fail fast: the indexing cell below cannot run without the embeddings
    raise FileNotFoundError("❌ Files not found. Ensure you added the Notebook via 'Add Data' -> 'Notebooks'.")

# Memory-map the document embeddings so the full matrix is not read into RAM at load time
doc_embeddings = np.load(doc_emb_path, mmap_mode='r')
query_embeddings = np.load(query_emb_path)
print("🚀 Data loaded successfully! We can proceed.")

In [None]:
import faiss
import time
import os
import numpy as np
import gc

# Output Configuration
output_dir = '/kaggle/working/nq_indexes'
os.makedirs(output_dir, exist_ok=True)

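# Parameters
dimension = doc_embeddings.shape[1]
M = 16                 # HNSW neighbors per node
ef_construction = 100  # candidate-list size while building the graph
ef_search = 1000       # search-time setting; defined here but not applied to any index in this notebook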

# ---------------------------------------------------------
# 1. HNSW FP32 (CRITICAL: Save this!)
# ---------------------------------------------------------
print("\n🏗️  1/3: Building HNSW FP32 (~45-60 min)...")
start = time.time()

hnsw_index = faiss.IndexHNSWFlat(dimension, M, faiss.METRIC_INNER_PRODUCT)
hnsw_index.hnsw.efConstruction = ef_construction
hnsw_index.add(doc_embeddings)

faiss.write_index(hnsw_index, os.path.join(output_dir, 'hnsw_index.faiss'))
print(f"✅ HNSW FP32 saved ({time.time()-start:.0f}s).")

# Free RAM immediately
del hnsw_index
gc.collect()

# ---------------------------------------------------------
# 2. Flat FP32 (DO NOT SAVE TO DISK)
# ---------------------------------------------------------
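# We skip saving 'flat_index.faiss' to save 8GB of disk space.
# A flat index only stores the raw vectors, so it can be rebuilt from
# doc_embeddings.npy on the fly during the search phase (about 2 seconds).
print("\n⏭️  Skipping Flat FP32 save to save disk space (it only duplicates doc_embeddings.npy).")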

# ---------------------------------------------------------
# Preparing the training sample for INT8
# ---------------------------------------------------------
print("\n⚙️  Preparing training sample for quantization...")
# Copy the first 50,000 vectors out of the memory-mapped file into RAM
train_vectors = np.array(doc_embeddings[:50000])

# ---------------------------------------------------------
# 3. HNSW INT8 (Save this!)
# ---------------------------------------------------------
print("\n🏗️  2/3: Building HNSW INT8 (SQ8)...")
start = time.time()
index_string = f"HNSW{M},SQ8"
hnsw_int8 = faiss.index_factory(dimension, index_string, faiss.METRIC_INNER_PRODUCT)
hnsw_int8.hnsw.efConstruction = ef_construction
hnsw_int8.train(train_vectors)
hnsw_int8.add(doc_embeddings)

faiss.write_index(hnsw_int8, os.path.join(output_dir, 'hnsw_int8_index.faiss'))
print(f"✅ HNSW INT8 saved ({time.time()-start:.0f}s).")

del hnsw_int8
gc.collect()

# ---------------------------------------------------------
# 4. Flat INT8 (Save this!)
# ---------------------------------------------------------
print("\n🏗️  3/3: Building Flat INT8 (SQ8)...")
start = time.time()
flat_int8 = faiss.index_factory(dimension, "SQ8", faiss.METRIC_INNER_PRODUCT)
flat_int8.train(train_vectors)
flat_int8.add(doc_embeddings)

faiss.write_index(flat_int8, os.path.join(output_dir, 'flat_int8_index.faiss'))
print(f"✅ Flat INT8 saved ({time.time()-start:.0f}s).")

print("\n🎉 FINISHED! Disk space used: ~12 GB (safe under the 20GB limit)")