# Building the Recommendation Engine

## Objective
Build a fast, scalable recommendation system using FAISS for similarity search.

## What is FAISS?
**FAISS (Facebook AI Similarity Search)** is a library for efficient similarity search in high-dimensional spaces.

### Why FAISS?
- **Fast:** Finds nearest neighbors in milliseconds (even with millions of vectors)
- **Scalable:** Developed at Facebook AI Research (now Meta) and designed to scale to billions of vectors
- **Memory efficient:** Various index types for different speed/memory trade-offs

### How It Works
1. Build index: Organize 9,280 embeddings for fast lookup
2. Search: Given query embedding, find K nearest neighbors
3. Return: Top-K most similar papers
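
The three steps above can be sketched without FAISS at all. The block below is a minimal NumPy illustration of what an exact (exhaustive) search computes, using random toy vectors rather than the paper embeddings; the function name `brute_force_knn` is made up for this example and is not part of FAISS.

```python
import numpy as np


def brute_force_knn(vectors: np.ndarray, query: np.ndarray, k: int):
    """Exact k-nearest-neighbor search by squared L2 distance."""
    # Step 1 ("index"): here the index is just the raw matrix of vectors.
    # Step 2 (search): squared L2 distance from the query to every stored vector.
    diffs = vectors - query
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)
    # Step 3 (return): ids of the k smallest distances, in ascending order.
    top_k = np.argsort(sq_dists)[:k]
    return sq_dists[top_k], top_k


# Toy data: 1,000 random 16-dimensional vectors (not the real embeddings)
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(1000, 16)).astype("float32")

# Query with vector 42 itself, so it should come back as its own nearest neighbor
toy_dists, toy_ids = brute_force_knn(toy_vectors, toy_vectors[42], k=5)
print(toy_ids[0], toy_dists[0])
```

A flat FAISS index performs the same exhaustive comparison, only with heavily optimized kernels.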

## Pipeline
1. Load embeddings + paper metadata
2. Build FAISS index
3. Test recommendation quality
4. Save index for API deployment

## Expected Performance
- Index build time: < 1 second
- Search time: < 10 ms per query (target; the first query timed below took about 161 ms)
- Memory: ~30 MB
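
The memory figure can be checked with simple arithmetic. This sketch assumes the vectors are stored as raw float32 values with no compression, and uses the 9,280 × 768 embedding shape reported when the data is loaded later in this notebook.

```python
n_vectors = 9_280        # number of paper embeddings
dimension = 768          # embedding size
bytes_per_float32 = 4    # FAISS stores float32 values

raw_bytes = n_vectors * dimension * bytes_per_float32
raw_megabytes = raw_bytes / 1e6

print(f"{raw_bytes:,} bytes = {raw_megabytes:.1f} MB")
# prints: 28,508,160 bytes = 28.5 MB
```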

In [1]:
import numpy as np
import pandas as pd
import faiss
import pickle
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import time

print("✓ Imports successful")
print(f"FAISS version: {faiss.__version__}")

✓ Imports successful
FAISS version: 1.8.0


In [2]:
# Load embeddings
embeddings = np.load('../data/processed/embeddings.npy')
print(f"✓ Loaded embeddings: {embeddings.shape}")

# Load papers with metadata
df = pd.read_pickle('../data/processed/papers_with_embeddings.pkl')
print(f"✓ Loaded papers: {len(df)}")

# Verify they match
assert len(embeddings) == len(df), "Embeddings and dataframe size mismatch!"
print(f"✓ Data verified")

✓ Loaded embeddings: (9280, 768)
✓ Loaded papers: 9280
✓ Data verified


In [3]:
# FAISS works with float32 (not float64)
embeddings_float32 = embeddings.astype('float32')

# Get embedding dimension
dimension = embeddings.shape[1]
print(f"Embedding dimension: {dimension}")

# Build FAISS index
# IndexFlatL2 = exhaustive search, exact results
# Note: it reports squared L2 distances, not their square roots
# (There are faster approximate indexes, but we'll use exact for now)
print("\nBuilding FAISS index...")
start_time = time.time()

index = faiss.IndexFlatL2(dimension)
index.add(embeddings_float32)

build_time = time.time() - start_time
print(f"✓ FAISS index built in {build_time:.3f} seconds")
print(f"  Index size: {index.ntotal} vectors")

Embedding dimension: 768

Building FAISS index...
✓ FAISS index built in 0.016 seconds
  Index size: 9280 vectors


In [4]:
# Test search performance
test_idx = 100
query_vector = embeddings_float32[test_idx:test_idx+1]  # Keep 2D shape

# Search for top 10 most similar papers
k = 10  # Number of neighbors to return
print(f"Searching for {k} nearest neighbors...")

start_time = time.time()
distances, indices = index.search(query_vector, k)
search_time = time.time() - start_time

print(f"✓ Search completed in {search_time*1000:.2f} milliseconds")
print(f"\nTop {k} results:")
print(f"  Indices: {indices[0]}")
print(f"  Distances: {distances[0]}")

Searching for 10 nearest neighbors...
✓ Search completed in 161.33 milliseconds

Top 10 results:
  Indices: [ 100 5331 2250  685 6737  168 3841 1995 7068 8111]
  Distances: [ 0.       29.718075 30.052883 30.182102 30.416515 30.983072 32.332436
 32.345028 33.084488 33.532707]
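
The 161 ms above comes from a single call, which was also the first search in the session, so it may include one-time overhead rather than steady-state latency (an assumption; the notebook does not measure this). A hedged sketch of a more representative measurement, with a few warm-up calls and a median over repeated runs, is shown below. The helper name `time_search` and its parameters are hypothetical.

```python
import statistics
import time


def time_search(search_fn, query, k, warmup=3, repeats=50):
    """Return the median latency of search_fn(query, k) in milliseconds."""
    # Warm-up calls are executed but not timed
    for _ in range(warmup):
        search_fn(query, k)

    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        search_fn(query, k)
        timings_ms.append((time.perf_counter() - start) * 1000)

    # The median is less sensitive to occasional slow runs than the mean
    return statistics.median(timings_ms)


# Usage with the objects defined in the cells above (not executed here):
# median_ms = time_search(index.search, query_vector, k=10)
# print(f"Median search time: {median_ms:.2f} ms")
```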


In [5]:
def recommend_papers(query_idx, k=10, exclude_query=True):
    """
    Find similar papers given a paper index
    
    Args:
        query_idx: Index of the query paper
        k: Number of recommendations to return
        exclude_query: Whether to exclude the query paper from results
    
    Returns:
        DataFrame with recommended papers and their similarity scores
    """
    # Get query vector
    query_vector = embeddings_float32[query_idx:query_idx+1]
    
    # Search (add 1 to k if excluding query)
    search_k = k + 1 if exclude_query else k
    distances, indices = index.search(query_vector, search_k)
    
    # Get results
    result_indices = indices[0]
    result_distances = distances[0]
    
    # Exclude query if requested
    if exclude_query:
        result_indices = result_indices[1:]
        result_distances = result_distances[1:]
    
    # Convert FAISS distances to a heuristic similarity score in (0, 1]
    # (IndexFlatL2 reports squared L2 distances; smaller = more similar)
    # We use: similarity = 1 / (1 + distance), which is not cosine similarity
    similarities = 1 / (1 + result_distances)
    
    # Build result dataframe
    results = []
    for idx, dist, sim in zip(result_indices, result_distances, similarities):
        paper = df.iloc[idx]
        results.append({
            'paper_id': paper['paper_id'],
            'title': paper['title'],
            'categories': paper['categories'],
            'published': paper['published'],
            'similarity': sim,
            'distance': dist,
            'abstract': paper['abstract'][:200] + '...'  # Truncated
        })
    
    return pd.DataFrame(results)

print("✓ Recommendation function created")

✓ Recommendation function created


In [6]:
# Test with a fixed example paper (index 100)
test_idx = 100

print("Query paper:")
query_paper = df.iloc[test_idx]
print(f"  Title: {query_paper['title']}")
print(f"  Categories: {query_paper['categories']}")
print(f"  Abstract: {query_paper['abstract'][:200]}...\n")

# Get recommendations
recommendations = recommend_papers(test_idx, k=5)

print("Top 5 Recommendations:")
print("="*80)
for i, row in recommendations.iterrows():
    print(f"\n{i+1}. {row['title']}")
    print(f"   Similarity: {row['similarity']:.3f} | Distance: {row['distance']:.2f}")
    print(f"   Categories: {row['categories']}")
    print(f"   Abstract: {row['abstract']}")

Query paper:
  Title: BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
  Categories: ['cs.CL', 'cs.AI', 'cs.SE']
  Abstract: LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bi...

Top 5 Recommendations:

1. Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
   Similarity: 0.033 | Distance: 29.72
   Categories: ['cs.CV', 'cs.LG']
   Abstract: Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. While numerous bias mitigati...

2. Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators
   Similarity: 0.032 | Distance: 30.05
   Categories: ['cs.LG']
   Abstract: In the era of Model-as-a-Service, organizations increasingly rely on third-party AI models

In [7]:
# Quick test - compare FAISS distance to actual cosine similarity
test_vector = embeddings_float32[100]
similar_vector = embeddings_float32[5331]  # Top match from FAISS

# Cosine similarity (what we expect)
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(
    test_vector.reshape(1, -1), 
    similar_vector.reshape(1, -1)
)[0][0]

# FAISS distance for this pair, copied from the search output above
# (IndexFlatL2 reports squared L2 distance)
faiss_dist = 29.72

print(f"Cosine similarity: {cos_sim:.3f}")
print(f"FAISS L2 distance: {faiss_dist:.2f}")

Cosine similarity: 0.953
FAISS L2 distance: 29.72


In [8]:
# Normalize embeddings to unit length
print("Normalizing embeddings...")

# Calculate L2 norms (magnitude of each vector)
norms = np.linalg.norm(embeddings_float32, axis=1, keepdims=True)
embeddings_normalized = embeddings_float32 / norms

# Verify normalization
sample_norm = np.linalg.norm(embeddings_normalized[0])
print(f"✓ Normalized (sample vector norm: {sample_norm:.6f})")

# Rebuild FAISS index with normalized embeddings
print("\nRebuilding FAISS index with normalized embeddings...")
index_normalized = faiss.IndexFlatL2(dimension)
index_normalized.add(embeddings_normalized)

print(f"✓ Index rebuilt with {index_normalized.ntotal} normalized vectors")

Normalizing embeddings...
✓ Normalized (sample vector norm: 1.000000)

Rebuilding FAISS index with normalized embeddings...
✓ Index rebuilt with 9280 normalized vectors
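
The division above assumes no embedding has zero length; an all-zero row would produce NaN values. A minimal defensive variant is sketched below on toy data. The helper name `normalize_rows` and the `eps` threshold are illustrative choices, not part of the original pipeline.

```python
import numpy as np


def normalize_rows(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm; all-zero rows are left as zeros."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clamp tiny norms so zero vectors do not produce NaN or inf
    safe_norms = np.maximum(norms, eps)
    return (vectors / safe_norms).astype("float32")


# Toy input: one ordinary vector and one all-zero vector
toy = np.array([[3.0, 4.0], [0.0, 0.0]], dtype="float32")
toy_unit = normalize_rows(toy)
print(toy_unit)
```

FAISS also provides `faiss.normalize_L2`, which normalizes a float32 array in place.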


In [9]:
# Test search with normalized embeddings
query_vector_norm = embeddings_normalized[100:101]
distances, indices = index_normalized.search(query_vector_norm, k=6)

print("Results with normalized embeddings:")
print(f"Indices: {indices[0]}")
print(f"Distances: {distances[0]}\n")

# Now convert to cosine similarity
# IndexFlatL2 already returns squared L2 distances, so for normalized vectors:
# cosine_sim = 1 - (squared_L2_distance / 2)
similarities = 1 - distances[0] / 2

print("Converted to cosine similarity:")
for i, (idx, dist, sim) in enumerate(zip(indices[0], distances[0], similarities)):
    if i == 0:
        print(f"{i}. Paper {idx} (query itself)")
        print(f"   Distance: {dist:.6f}, Similarity: {sim:.6f}\n")
    else:
        paper = df.iloc[idx]
        print(f"{i}. {paper['title'][:60]}...")
        print(f"   Distance: {dist:.4f}, Similarity: {sim:.4f}")

Results with normalized embeddings:
Indices: [ 100 5331 2250  685 6737  168]
Distances: [0.         0.09453616 0.09735808 0.09793453 0.09915101 0.10086937]

Converted to cosine similarity:
0. Paper 100 (query itself)
   Distance: 0.000000, Similarity: 1.000000

1. Benchmarking Bias Mitigation Toward Fairness Without Harm fr...
   Distance: 0.0945, Similarity: 0.9527
2. Fault-Tolerant Evaluation for Sample-Efficient Model Perform...
   Distance: 0.0974, Similarity: 0.9513
3. GenArena: How Can We Achieve Human-Aligned Evaluation for Vi...
   Distance: 0.0979, Similarity: 0.9510
4. DICE: Discrete Interpretable Comparative Evaluation with Pro...
   Distance: 0.0992, Similarity: 0.9504
5. OmniReview: A Large-scale Benchmark and LLM-enhanced Framewo...
   Distance: 0.1009, Similarity: 0.9496
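
FAISS's `IndexFlatL2` reports squared Euclidean distances, and for unit vectors the squared distance and cosine similarity are tied by ‖a − b‖² = 2 − 2·cos(a, b). The sketch below checks that identity numerically on two random unit vectors (toy data, not the paper embeddings) and then applies it to the distance reported for the top match above. The result agrees with the 0.953 that scikit-learn computed for the same pair earlier.

```python
import numpy as np

# Two random unit vectors (toy data)
rng = np.random.default_rng(0)
a = rng.normal(size=768)
b = rng.normal(size=768)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)
squared_l2 = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
cosine_from_squared_l2 = 1 - squared_l2 / 2
print(f"direct: {cosine:.6f}, from squared L2: {cosine_from_squared_l2:.6f}")

# Applied to the top match above, whose reported distance is 0.09453616
top_match_cosine = 1 - 0.09453616 / 2
print(f"{top_match_cosine:.4f}")  # prints: 0.9527
```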


In [10]:
def recommend_papers(query_idx, k=10, exclude_query=True):
    """
    Find similar papers given a paper index
    
    Args:
        query_idx: Index of the query paper
        k: Number of recommendations to return
        exclude_query: Whether to exclude the query paper from results
    
    Returns:
        DataFrame with recommended papers and their similarity scores
    """
    # Get normalized query vector
    query_vector = embeddings_normalized[query_idx:query_idx+1]
    
    # Search (add 1 to k if excluding query)
    search_k = k + 1 if exclude_query else k
    distances, indices = index_normalized.search(query_vector, search_k)
    
    # Get results
    result_indices = indices[0]
    result_distances = distances[0]
    
    # Exclude query if requested
    if exclude_query:
        result_indices = result_indices[1:]
        result_distances = result_distances[1:]
    
    # Convert FAISS distances to cosine similarity
    # IndexFlatL2 returns squared L2 distances, so for normalized vectors:
    # cosine_sim = 1 - (squared_L2_distance / 2)
    similarities = 1 - result_distances / 2
    
    # Build result dataframe
    results = []
    for idx, dist, sim in zip(result_indices, result_distances, similarities):
        paper = df.iloc[idx]
        results.append({
            'paper_id': paper['paper_id'],
            'title': paper['title'],
            'categories': paper['categories'],
            'published': paper['published'],
            'similarity': sim,
            'distance': dist,
            'abstract': paper['abstract'][:200] + '...'
        })
    
    return pd.DataFrame(results)

print("✓ Updated recommendation function")

✓ Updated recommendation function
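
Dropping the first result with `[1:]` assumes the query paper always comes back in first place. If the dataset contains a duplicate embedding, both tie at distance 0 and the query may not be listed first. A small sketch of filtering by index instead is shown below on a made-up result row; the helper name `drop_query` is hypothetical.

```python
import numpy as np


def drop_query(indices: np.ndarray, distances: np.ndarray, query_idx: int, k: int):
    """Remove query_idx from one row of search results and keep at most k hits."""
    keep = indices != query_idx
    return indices[keep][:k], distances[keep][:k]


# Toy result row where a duplicate vector (id 7) ties with the query (id 100)
toy_indices = np.array([7, 100, 5331, 2250])
toy_distances = np.array([0.0, 0.0, 0.0945, 0.0974], dtype="float32")

kept_ids, kept_dists = drop_query(toy_indices, toy_distances, query_idx=100, k=3)
print(kept_ids)
```

Inside `recommend_papers`, this would replace the two slicing lines, with `search_k = k + 1` still guaranteeing enough candidates.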


In [11]:
# Test with the same paper
test_idx = 100

print("Query paper:")
query_paper = df.iloc[test_idx]
print(f"  Title: {query_paper['title']}")
print(f"  Categories: {query_paper['categories']}\n")

# Get recommendations
recommendations = recommend_papers(test_idx, k=5)

print("Top 5 Recommendations:")
print("="*80)
for i, row in recommendations.iterrows():
    print(f"\n{i+1}. {row['title']}")
    print(f"   Similarity: {row['similarity']:.4f}")
    print(f"   Categories: {row['categories']}")
    print(f"   Published: {row['published']}")
    print(f"   Abstract: {row['abstract']}")

Query paper:
  Title: BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
  Categories: ['cs.CL', 'cs.AI', 'cs.SE']

Top 5 Recommendations:

1. Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
   Similarity: 0.9527
   Categories: ['cs.CV', 'cs.LG']
   Published: 2026-02-03
   Abstract: Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. While numerous bias mitigati...

2. Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators
   Similarity: 0.9513
   Categories: ['cs.LG']
   Published: 2026-02-06
   Abstract: In the era of Model-as-a-Service, organizations increasingly rely on third-party AI models for rapid deployment. However, the dynamic nature of emerging AI applications, the continual introduction of ...

3. GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

In [12]:
# Save normalized embeddings
np.save('../data/processed/embeddings_normalized.npy', embeddings_normalized)
print("✓ Saved normalized embeddings")

# Save FAISS index
faiss.write_index(index_normalized, '../data/processed/papers.index')
print("✓ Saved FAISS index")

# Update metadata
import json
metadata = {
    'n_papers': len(df),
    'embedding_dim': embeddings_normalized.shape[1],
    'model': 'allenai/specter2_base',
    'normalized': True,
    'index_type': 'IndexFlatL2',
    'date_generated': pd.Timestamp.now().isoformat()
}

with open('../data/processed/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
print("✓ Updated metadata")

print(f"\n✓ All files ready for deployment!")
print(f"  embeddings_normalized.npy: {embeddings_normalized.nbytes / 1e6:.1f} MB")
print(f"  papers.index: FAISS index for fast search")
print(f"  papers_with_embeddings.pkl: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

✓ Saved normalized embeddings
✓ Saved FAISS index
✓ Updated metadata

✓ All files ready for deployment!
  embeddings_normalized.npy: 28.5 MB
  papers.index: FAISS index for fast search
  papers_with_embeddings.pkl: 20.3 MB
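
Because the index only makes sense together with the settings recorded in `metadata.json`, a consumer such as the API could verify those settings before serving requests. The following is a minimal sketch under that assumption; the function name `check_metadata` and its checks are illustrative, not part of the existing pipeline.

```python
import json
from pathlib import Path


def check_metadata(path, expected_dim, expected_index_type="IndexFlatL2"):
    """Load metadata.json and raise ValueError if it does not match expectations."""
    metadata = json.loads(Path(path).read_text())

    if metadata["embedding_dim"] != expected_dim:
        raise ValueError(
            f"Embedding dimension {metadata['embedding_dim']} != {expected_dim}"
        )
    if metadata["index_type"] != expected_index_type:
        raise ValueError(f"Unexpected index type: {metadata['index_type']}")
    # The cosine conversion is only valid for unit-length vectors
    if not metadata.get("normalized", False):
        raise ValueError("Index was not built from normalized embeddings")

    return metadata


# Usage with the file written above (not executed here):
# metadata = check_metadata('../data/processed/metadata.json', expected_dim=768)
```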


---

## Text-Based Recommendations (Future Feature)

Currently, our recommendation system works with **paper-to-paper** similarity:
```python
recommend_papers(query_idx=100, k=5)  # Find papers similar to paper #100
```

### Planned: Text Query Search

For production deployment, we'll add **text-to-paper** recommendations:
```text
User query: "transformers for natural language processing"
  ↓
SPECTER2 encodes text → embedding
  ↓
FAISS finds similar paper embeddings
  ↓
Return top K papers
```

**Implementation:** This will be added in the API (next notebook) where SPECTER2 is loaded once at startup and kept in memory for all requests.

**Why not here?** Loading SPECTER2 twice (notebook 2 + notebook 3) exceeds available memory in Jupyter.
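
To make the planned flow concrete without loading SPECTER2, here is a rough sketch of the encode, normalize, and search steps. Everything named here is an assumption for illustration: `recommend_from_text` is a hypothetical function, `fake_encoder` is a toy stand-in that is not SPECTER2, and a NumPy dot product replaces the FAISS search.

```python
import numpy as np


def recommend_from_text(query_text, encode_fn, paper_vectors, k=5):
    """Rank papers against a text query by cosine similarity.

    encode_fn: hypothetical callable mapping a string to a 1-D embedding.
    paper_vectors: 2-D array whose rows are already L2-normalized.
    """
    query = np.asarray(encode_fn(query_text), dtype="float32")
    # Normalize the query so the dot product equals cosine similarity
    query = query / max(float(np.linalg.norm(query)), 1e-12)
    scores = paper_vectors @ query
    top_k = np.argsort(-scores)[:k]
    return top_k, scores[top_k]


def fake_encoder(text, dim=8):
    """Toy stand-in for a real text encoder (NOT SPECTER2)."""
    vec = np.zeros(dim, dtype="float32")
    for i, ch in enumerate(text):
        vec[(i + ord(ch)) % dim] += 1.0
    return vec


# Toy corpus encoded with the stand-in encoder, then normalized row-wise
texts = [
    "graph neural networks",
    "transformers for natural language processing",
    "protein folding",
]
papers = np.stack([fake_encoder(t) for t in texts])
papers /= np.linalg.norm(papers, axis=1, keepdims=True)

ids, scores = recommend_from_text(
    "transformers for natural language processing", fake_encoder, papers, k=2
)
print(ids, scores)
```

In the API, `encode_fn` would wrap the SPECTER2 model loaded at startup, and the matrix product would be replaced by a search against the saved FAISS index.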

---