# Task 1: Build FAISS Search System - SOLUTION

Implement semantic search using FAISS and sentence-transformers.

**Goals:**
- Create FAISS index for ticket corpus
- Implement search function
- Compare Flat vs IVF vs HNSW performance
- Implement metadata filtering

In [1]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import pandas as pd
import json
import time

## Load Data

Load ticket dataset from fixtures.

In [None]:
# Load tickets
with open('../fixtures/input/tickets.json', 'r') as f:
    tickets = json.load(f)

print(f"Loaded {len(tickets)} tickets")
print(f"\nSample ticket:")
print(json.dumps(tickets[0], indent=2))

## Task 1: Generate Embeddings

Use sentence-transformers to embed ticket titles and descriptions.

In [None]:
# SOLUTION

# 1. Load sentence-transformers model
model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Create list of texts (title + description for each ticket)
texts = [f"{ticket['title']}. {ticket['description']}" for ticket in tickets]

# 3. Generate embeddings with normalize_embeddings=True
# 4. Convert to float32
embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)
embeddings = embeddings.astype('float32')

print(f"Embeddings shape: {embeddings.shape}")
print(f"Embeddings dtype: {embeddings.dtype}")

# TEST - Do not modify
assert model is not None, "Model not loaded"
assert len(texts) == len(tickets), f"Expected {len(tickets)} texts, got {len(texts)}"
assert embeddings is not None, "Embeddings not generated"
assert embeddings.shape == (len(tickets), 384), f"Wrong shape: {embeddings.shape}"
assert embeddings.dtype == np.float32, f"Wrong dtype: {embeddings.dtype}"
# Check normalization
norms = np.linalg.norm(embeddings, axis=1)
assert np.allclose(norms, 1.0, atol=1e-5), "Embeddings not normalized"
print("✓ Task 1 passed")

## Task 2: Create FAISS Index

Build an `IndexFlatIP` (exact inner-product) index and add the embeddings.
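
Task 1 normalized the embeddings, and that is what makes an inner-product index a reasonable choice here: for unit-length vectors, the inner product equals cosine similarity. The following is a minimal NumPy sketch of that identity. The random vectors and the `l2_normalize` helper are illustrative assumptions, not part of the ticket data or the FAISS API.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 8)).astype("float32")  # toy data, not real embeddings
query = rng.normal(size=8).astype("float32")


def l2_normalize(x):
    """Scale each row (or a single vector) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


unit_vectors = l2_normalize(vectors)
unit_query = l2_normalize(query)

# Inner product of the unit-length vectors ...
inner_product = unit_vectors @ unit_query

# ... compared with cosine similarity computed from the raw vectors.
cosine = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

print(np.allclose(inner_product, cosine, atol=1e-5))
```

If the embeddings were not normalized, `IndexFlatIP` would still run, but the scores would depend on vector length as well as direction.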

In [None]:
# SOLUTION

# 1. Create IndexFlatIP with correct dimension
dimension = embeddings.shape[1]  # 384
index_flat = faiss.IndexFlatIP(dimension)

# 2. Add embeddings to index
index_flat.add(embeddings)

print(f"Index contains {index_flat.ntotal} vectors")

# TEST - Do not modify
assert index_flat is not None, "Index not created"
assert index_flat.ntotal == len(tickets), f"Expected {len(tickets)} vectors, got {index_flat.ntotal}"
assert index_flat.d == 384, f"Wrong dimension: {index_flat.d}"
print("✓ Task 2 passed")

## Task 3: Implement Search Function

Create function to search tickets by text query.

In [None]:
# SOLUTION

def search_tickets(query_text, k=5):
    """
    Search for similar tickets.
    
    Args:
        query_text: Text query
        k: Number of results
    
    Returns:
        List of dicts with 'ticket', 'score' keys
    """
    # 1. Encode query text with model (normalized, float32, 2D)
    query_embedding = model.encode([query_text], normalize_embeddings=True).astype('float32')
    
    # 2. Search index
    scores, indices = index_flat.search(query_embedding, k)
    
    # 3. Format results as list of dicts
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than k vectors are available
            continue
        results.append({
            'ticket': tickets[idx],
            'score': float(score)
        })
    
    return results

# TEST - Do not modify
results = search_tickets("password reset issue", k=3)
assert len(results) == 3, f"Expected 3 results, got {len(results)}"
assert 'ticket' in results[0], "Missing 'ticket' key"
assert 'score' in results[0], "Missing 'score' key"
assert isinstance(results[0]['score'], float), "Score should be float"
# Scores should be descending
scores = [r['score'] for r in results]
assert scores == sorted(scores, reverse=True), "Scores not sorted"
print("✓ Task 3 passed")

# Show results
print("\nSearch results for 'password reset issue':")
for i, result in enumerate(results):
    print(f"{i+1}. [{result['score']:.3f}] {result['ticket']['title']}")

## Task 4: Build IVF Index

Create and train an IVF (inverted file) index for faster search.
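
To build intuition for `nlist` and `nprobe`, here is a toy NumPy sketch of the inverted-file idea: vectors are grouped into `nlist` cells, and a query only scans the `nprobe` cells whose centroids are most similar to it. This is an illustration under simplifying assumptions, not how FAISS is implemented. In particular, centroids are sampled from the data instead of being learned with k-means, and `toy_ivf_search` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 16)).astype("float32")  # toy vectors, not ticket embeddings
data /= np.linalg.norm(data, axis=1, keepdims=True)

nlist, nprobe, k = 10, 5, 5

# "Training": choose one centroid per cell (FAISS runs k-means; sampling is a shortcut).
centroids = data[rng.choice(len(data), size=nlist, replace=False)]

# "Adding": assign every vector to the cell of its most similar centroid.
assignments = np.argmax(data @ centroids.T, axis=1)


def toy_ivf_search(query, nprobe):
    """Return the IDs of the top-k vectors found in the nprobe closest cells."""
    probed_cells = np.argsort(-(centroids @ query))[:nprobe]
    candidate_ids = np.flatnonzero(np.isin(assignments, probed_cells))
    scores = data[candidate_ids] @ query
    return candidate_ids[np.argsort(-scores)[:k]]


query = data[0]
exact = np.argsort(-(data @ query))[:k]   # brute force over all vectors
approx = toy_ivf_search(query, nprobe)    # scans half of the cells
full = toy_ivf_search(query, nlist)       # scans every cell

print("exact :", exact)
print("approx:", approx)
print("full  :", full)
```

Probing every cell (`nprobe == nlist`) scans the same vectors as brute force, so lowering `nprobe` is what trades recall for speed.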

In [None]:
# SOLUTION

# 1. Create quantizer (IndexFlatIP)
quantizer = faiss.IndexFlatIP(dimension)

# 2. Create IndexIVFFlat with nlist=10
nlist = 10
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)

# 3. Train on embeddings
index_ivf.train(embeddings)

# 4. Add embeddings
index_ivf.add(embeddings)

# 5. Set nprobe=5
index_ivf.nprobe = 5

print(f"IVF Index trained: {index_ivf.is_trained}")
print(f"IVF Index contains {index_ivf.ntotal} vectors")
print(f"nlist: {index_ivf.nlist}, nprobe: {index_ivf.nprobe}")

# TEST - Do not modify
assert index_ivf is not None, "IVF index not created"
assert index_ivf.is_trained, "Index not trained"
assert index_ivf.ntotal == len(tickets), f"Expected {len(tickets)} vectors"
assert index_ivf.nlist == 10, f"Expected nlist=10, got {index_ivf.nlist}"
assert index_ivf.nprobe == 5, f"Expected nprobe=5, got {index_ivf.nprobe}"
print("✓ Task 4 passed")

## Task 5: Measure Recall

Compare IVF results to Flat (ground truth).
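
Recall@k here means the fraction of the exact top-k results (from the Flat index) that the approximate index also returns. A minimal sketch of that calculation on hand-written IDs follows. The `recall_at_k` helper and the example ID lists are assumptions for illustration only.

```python
def recall_at_k(ground_truth_ids, approx_ids, k):
    """Fraction of the exact top-k IDs that the approximate search also returned."""
    truth = set(ground_truth_ids[:k])
    found = set(approx_ids[:k])
    found.discard(-1)  # FAISS pads with -1 when it returns fewer than k results
    return len(truth & found) / k


exact = [4, 8, 15, 16, 23]     # hypothetical Flat results
approx = [4, 15, 42, 8, -1]    # hypothetical IVF results, one slot unfilled

print(recall_at_k(exact, approx, k=5))  # 0.6
```

Order within the top k does not matter for this metric: only set membership is compared.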

In [None]:
# SOLUTION

# Create test queries
test_queries = [
    "cannot login",
    "payment failed",
    "slow performance"
]

k = 10

# 1. Encode test queries
query_embeddings = model.encode(test_queries, normalize_embeddings=True).astype('float32')

# 2. Search with both Flat and IVF indexes
recalls = []

for query_emb in query_embeddings:
    query_emb_2d = query_emb.reshape(1, -1)
    
    # Ground truth from Flat index
    _, flat_indices = index_flat.search(query_emb_2d, k)
    flat_set = set(flat_indices[0])
    
    # Results from IVF index
    _, ivf_indices = index_ivf.search(query_emb_2d, k)
    ivf_set = set(ivf_indices[0])
    
    # 3. Calculate recall@10 for each query
    recall = len(flat_set.intersection(ivf_set)) / k
    recalls.append(recall)

# 4. Calculate average recall
avg_recall = np.mean(recalls)

print(f"Individual recalls: {[f'{r:.4f}' for r in recalls]}")

# TEST - Do not modify
assert avg_recall > 0, "Recall not calculated"
assert avg_recall >= 0.8, f"Recall too low: {avg_recall:.3f} (should be >= 0.8)"
print(f"✓ Task 5 passed")
print(f"Average Recall@10: {avg_recall:.4f}")

## Task 6: Build HNSW Index

Create HNSW index and compare performance.

In [None]:
# SOLUTION

# 1. Create IndexHNSWFlat with M=32
M = 32
index_hnsw = faiss.IndexHNSWFlat(dimension, M, faiss.METRIC_INNER_PRODUCT)

# 2. Set efConstruction=200
index_hnsw.hnsw.efConstruction = 200

# 3. Add embeddings
index_hnsw.add(embeddings)

# 4. Set efSearch=64
index_hnsw.hnsw.efSearch = 64

print(f"HNSW Index contains {index_hnsw.ntotal} vectors")
print(f"M: {M}, efSearch: {index_hnsw.hnsw.efSearch}")

# TEST - Do not modify
assert index_hnsw is not None, "HNSW index not created"
assert index_hnsw.ntotal == len(tickets), f"Expected {len(tickets)} vectors"
assert index_hnsw.hnsw.efSearch == 64, f"Expected efSearch=64"
print("✓ Task 6 passed")

## Task 7: Benchmark All Indexes

Compare Flat, IVF, and HNSW on latency and recall.
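
Latency measured from a single pass over a handful of queries can be noisy, especially on a small corpus where each search takes microseconds. One common way to get steadier numbers is to warm up first and then take the median of many repeats. The `time_per_call_ms` helper below is a hypothetical sketch of that approach, shown with a stand-in workload instead of a FAISS index.

```python
import statistics
import time


def time_per_call_ms(fn, repeats=50, warmup=5):
    """Median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):  # warm caches before measuring
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


# Stand-in workload; any zero-argument callable works.
latency = time_per_call_ms(lambda: sum(range(10_000)))
print(f"{latency:.4f} ms per call")
```

In this notebook the callable could be something like `lambda: index_flat.search(query_embeddings, k)`, with the result divided by the number of queries to get a per-query figure.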

In [None]:
# SOLUTION

# Prepare queries
query_embeddings = model.encode(test_queries, normalize_embeddings=True).astype('float32')
k = 10

benchmark_results = {}

# Benchmark Flat index (ground truth)
start_time = time.perf_counter()  # monotonic, high-resolution clock suited to short intervals
flat_results = []
for query_emb in query_embeddings:
    query_emb_2d = query_emb.reshape(1, -1)
    _, indices = index_flat.search(query_emb_2d, k)
    flat_results.append(set(indices[0]))
flat_time = (time.perf_counter() - start_time) * 1000 / len(test_queries)  # ms per query

benchmark_results['Flat'] = {
    'latency_ms': flat_time,
    'recall': 1.0  # 100% recall against itself
}

# Benchmark IVF index
start_time = time.perf_counter()  # monotonic, high-resolution clock suited to short intervals
ivf_recalls = []
for i, query_emb in enumerate(query_embeddings):
    query_emb_2d = query_emb.reshape(1, -1)
    _, indices = index_ivf.search(query_emb_2d, k)
    ivf_set = set(indices[0])
    recall = len(flat_results[i].intersection(ivf_set)) / k
    ivf_recalls.append(recall)
ivf_time = (time.perf_counter() - start_time) * 1000 / len(test_queries)  # ms per query

benchmark_results['IVF'] = {
    'latency_ms': ivf_time,
    'recall': np.mean(ivf_recalls)
}

# Benchmark HNSW index
start_time = time.perf_counter()  # monotonic, high-resolution clock suited to short intervals
hnsw_recalls = []
for i, query_emb in enumerate(query_embeddings):
    query_emb_2d = query_emb.reshape(1, -1)
    _, indices = index_hnsw.search(query_emb_2d, k)
    hnsw_set = set(indices[0])
    recall = len(flat_results[i].intersection(hnsw_set)) / k
    hnsw_recalls.append(recall)
hnsw_time = (time.perf_counter() - start_time) * 1000 / len(test_queries)  # ms per query

benchmark_results['HNSW'] = {
    'latency_ms': hnsw_time,
    'recall': np.mean(hnsw_recalls)
}

# TEST - Do not modify
assert benchmark_results['Flat']['latency_ms'] > 0, "Flat latency not measured"
assert benchmark_results['IVF']['latency_ms'] > 0, "IVF latency not measured"
assert benchmark_results['HNSW']['latency_ms'] > 0, "HNSW latency not measured"
assert benchmark_results['Flat']['recall'] == 1.0, "Flat should have 100% recall"
assert benchmark_results['IVF']['recall'] >= 0.8, f"IVF recall too low: {benchmark_results['IVF']['recall']}"
assert benchmark_results['HNSW']['recall'] >= 0.8, f"HNSW recall too low: {benchmark_results['HNSW']['recall']}"
# HNSW should be faster than Flat
assert benchmark_results['HNSW']['latency_ms'] < benchmark_results['Flat']['latency_ms'], "HNSW should be faster than Flat"
print("✓ Task 7 passed")

# Display results
print("\nBenchmark Results:")
print(f"{'Index':<10} {'Latency (ms)':<15} {'Recall@10':<12} {'Speedup':<10}")
print("-" * 50)
flat_latency = benchmark_results['Flat']['latency_ms']
for name, metrics in benchmark_results.items():
    speedup = flat_latency / metrics['latency_ms'] if metrics['latency_ms'] > 0 else 0
    speedup_label = f"{speedup:.1f}x"  # attach the "x" before padding so it stays next to the number
    print(f"{name:<10} {metrics['latency_ms']:<15.2f} {metrics['recall']:<12.4f} {speedup_label:<10}")

## Task 8: Implement Metadata Filtering

Search with category and status filters.
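
FAISS searches vectors, not metadata, so a simple approach is post-filtering: fetch more candidates than needed, then drop the ones that fail the filter. A fixed over-fetch can come up short when the filter is very selective. The sketch below shows one possible variation that doubles the fetch size until enough hits are found or the corpus is exhausted. `filtered_search`, `brute_force`, and the toy categories are hypothetical stand-ins, with brute-force NumPy search playing the role of the index.

```python
import numpy as np


def filtered_search(search_fn, passes_filter, k, total, initial_factor=10):
    """Post-filter search results, widening the candidate pool when needed.

    search_fn(n) must return (scores, ids) for the top-n candidates, best first.
    """
    fetch = min(k * initial_factor, total)
    while True:
        scores, ids = search_fn(fetch)
        hits = [
            (float(s), int(i))
            for s, i in zip(scores, ids)
            if i >= 0 and passes_filter(int(i))  # skip -1 padding and filtered-out IDs
        ]
        if len(hits) >= k or fetch >= total:
            return hits[:k]
        fetch = min(fetch * 2, total)  # not enough survivors: look deeper


# Toy corpus: 100 vectors, only 4 of them in the "billing" category.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 8)).astype("float32")
query = vecs[0]
categories = ["billing" if i % 25 == 0 else "technical" for i in range(100)]


def brute_force(n):
    """Stand-in for an index search: exact inner-product top-n."""
    scores = vecs @ query
    order = np.argsort(-scores)[:n]
    return scores[order], order


hits = filtered_search(brute_force, lambda i: categories[i] == "billing", k=3, total=len(vecs))
for score, idx in hits:
    print(f"[{score:.3f}] id={idx} category={categories[idx]}")
```

The trade-off is extra search calls for rare categories in exchange for returning a full `k` results whenever the corpus contains them.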

In [None]:
# SOLUTION

def search_with_filter(query_text, k=5, category=None, status=None):
    """
    Search with optional metadata filters.
    
    Args:
        query_text: Query string
        k: Number of results
        category: Filter by category (optional)
        status: Filter by status (optional)
    
    Returns:
        List of filtered results
    """
    # 1. Encode query
    query_embedding = model.encode([query_text], normalize_embeddings=True).astype('float32')
    
    # 2. Search index, over-fetching k*10 candidates because filtering discards some
    #    (a very selective filter can still leave fewer than k results)
    retrieve_k = min(k * 10, len(tickets))
    scores, indices = index_flat.search(query_embedding, retrieve_k)
    
    # 3. Filter results by category and status
    filtered_results = []
    for score, idx in zip(scores[0], indices[0]):
        ticket = tickets[idx]
        
        # Apply filters
        if category is not None and ticket['category'] != category:
            continue
        if status is not None and ticket['status'] != status:
            continue
        
        filtered_results.append({
            'ticket': ticket,
            'score': float(score)
        })
        
        # 4. Return top k filtered results
        if len(filtered_results) >= k:
            break
    
    return filtered_results

# TEST - Do not modify
# Test category filter
results_billing = search_with_filter("payment", k=3, category="billing")
assert len(results_billing) > 0, "No results with billing filter"
for r in results_billing:
    assert r['ticket']['category'] == 'billing', f"Wrong category: {r['ticket']['category']}"

# Test status filter
results_open = search_with_filter("issue", k=3, status="open")
assert len(results_open) > 0, "No results with status filter"
for r in results_open:
    assert r['ticket']['status'] == 'open', f"Wrong status: {r['ticket']['status']}"

# Test combined filters
results_combined = search_with_filter("problem", k=2, category="technical", status="open")
for r in results_combined:
    assert r['ticket']['category'] == 'technical', "Wrong category"
    assert r['ticket']['status'] == 'open', "Wrong status"

print("✓ Task 8 passed")

# Show filtered results
print("\nFiltered search (category=billing):")
for i, r in enumerate(results_billing):
    print(f"{i+1}. [{r['score']:.3f}] {r['ticket']['title']}")

## Summary

You've successfully:
- ✓ Generated embeddings with sentence-transformers
- ✓ Built FAISS indexes (Flat, IVF, HNSW)
- ✓ Measured performance and recall
- ✓ Implemented metadata filtering

**Next steps:**
- Experiment with different index parameters
- Try larger datasets
- Combine FAISS with RAG in Module 6!