# Portfolio C: Semantic Search
**Build a research paper search engine**

Build a semantic search system that lets researchers find relevant scientific papers with natural language queries, including queries written in other languages. Start with a bi-encoder for fast retrieval, then add a cross-encoder that reranks the top candidates for precision.

**Dataset**: SciFact (scientific abstracts + claims)
**Your goal**: Build a search pipeline, evaluate retrieval quality, and experiment with multilingual search.

### Deliverables
- Working search pipeline (bi-encoder retrieval + optional reranking)
- Evaluation: precision@5 on at least 5 queries
- At least one experiment (multilingual, different models, reranking comparison)
- Brief model card

**Estimated time**: Sprint 1 (55 min) + Sprint 2 (90 min)

## Setup

In [None]:
!pip install -q datasets sentence-transformers faiss-cpu matplotlib seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sentence_transformers import SentenceTransformer
import faiss

## 1. Load Corpus

In [None]:
from datasets import load_dataset

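# Assumes the Hub layout where documents sit in a separate "corpus" config; check the dataset card if this fails
dataset = load_dataset("mteb/scifact", "corpus", split="corpus")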
corpus_df = dataset.to_pandas()
corpus_df.columns = ['doc_id', 'title', 'text']
corpus_df['full_text'] = corpus_df['title'] + ". " + corpus_df['text']
print(f"Corpus: {len(corpus_df)} documents")
corpus_df.head()

## 2. Encode & Index

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

start = time.time()
corpus_embeddings = model.encode(
    corpus_df['full_text'].tolist(),
    show_progress_bar=True,
    batch_size=64,
    normalize_embeddings=True,
)
print(f"Encoded {len(corpus_embeddings)} docs in {time.time()-start:.1f}s")

dimension = corpus_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(corpus_embeddings.astype('float32'))
print(f"FAISS index: {index.ntotal} vectors, {dimension}d")

In [None]:
def search(query: str, top_k: int = 5) -> pd.DataFrame:
    """Return the top_k corpus documents ranked by cosine similarity to the query."""
    # Query and corpus vectors are both normalized, so the inner product is a cosine similarity
    q_emb = model.encode([query], normalize_embeddings=True).astype('float32')
    scores, indices = index.search(q_emb, top_k)
    results = []
    for rank, (score, idx) in enumerate(zip(scores[0], indices[0]), 1):
        doc = corpus_df.iloc[idx]
        results.append({
            'rank': rank,
            'score': float(score),
            'doc_id': doc['doc_id'],
            'title': doc['title'],
            'text': doc['full_text'][:200] + '...',  # display snippet only
        })
    return pd.DataFrame(results)

# Test it
results = search("effects of vaccination on immune response")
print(results[['rank', 'score', 'title']].to_string(index=False))
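
The `score` column is easier to interpret once you see why `IndexFlatIP` behaves like cosine similarity here: both the corpus and the query are encoded with `normalize_embeddings=True`, and the inner product of two unit-length vectors equals their cosine similarity. The sketch below checks this on toy 3-d vectors in plain Python; the vectors and helper names are made up for illustration and are not real embeddings.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for embeddings (made-up values)
query_vec = [1.0, 2.0, 2.0]
doc_vec = [2.0, 0.0, 1.0]

ip_normalized = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
cos_raw = cosine(query_vec, doc_vec)

print(f"inner product of normalized vectors: {ip_normalized:.4f}")
print(f"cosine similarity of raw vectors:    {cos_raw:.4f}")
```

Both lines print the same value, so scores from `search` can be read as cosine similarities with a maximum of 1.0.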

## 3. Evaluate: Precision@5
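Define queries with expected relevant terms, then count how many of the top-5 results contain at least one of those terms. This is a keyword proxy for relevance, not a human judgment, and it only sees the 200-character snippet that `search` returns, so treat the scores as a rough signal.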

In [None]:
eval_queries = [
    {"query": "vaccine effectiveness against viral infections",
     "relevant_terms": ["vaccine", "immunization", "viral", "antibod"]},
    {"query": "genetic mutations and cancer risk",
     "relevant_terms": ["mutat", "cancer", "oncogen", "tumor", "genetic"]},
    {"query": "effects of exercise on mental health",
     "relevant_terms": ["exercis", "physical", "mental", "depress", "anxiety"]},
    {"query": "antibiotic resistance mechanisms in bacteria",
     "relevant_terms": ["antibiotic", "resistan", "bacteri", "antimicrob"]},
    {"query": "neural networks in medical imaging",
     "relevant_terms": ["neural", "deep learning", "imaging", "diagnos", "radiol"]},
]

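def is_relevant(doc_text, terms):
    """Keyword proxy: relevant if the text contains any of the term stems."""
    doc_lower = doc_text.lower()
    return any(t in doc_lower for t in terms)

p_at_5 = []
for eq in eval_queries:
    results = search(eq['query'], top_k=5)
    relevant = sum(is_relevant(row['text'], eq['relevant_terms']) for _, row in results.iterrows())
    p_at_5.append(relevant / 5)
    print(f"P@5={relevant/5:.0%} | {eq['query']}")

# Average over all queries: the baseline number for the model card
print(f"Avg P@5={np.mean(p_at_5):.0%}")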
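
If you have labeled relevance judgments instead of keyword lists, precision@k can be computed directly from document ids. The sketch below assumes you already have the ranked ids for a query and a set of ids labeled relevant for it; `precision_at_k` is a hypothetical helper and the ids are invented for illustration, not real SciFact judgments.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(retrieved_ids)[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Divide by k, not by the number retrieved, so short result lists are penalized
    return hits / k

# Hypothetical ids for illustration only
retrieved = ["d12", "d7", "d3", "d44", "d9"]
labeled_relevant = {"d7", "d9", "d100"}

print(precision_at_k(retrieved, labeled_relevant, k=5))  # 2 hits out of 5 -> 0.4
```

Because the function only needs ids, the same call works for the baseline and for any reranked result list.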

## 4. Your Turn: Improve the Pipeline

Choose one or more experiments:
- **Cross-encoder reranking** (from NB07): Retrieve top-20 with bi-encoder, rerank with `cross-encoder/ms-marco-MiniLM-L-6-v2`
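- **Multilingual search**: Switch to `BAAI/bge-m3`, re-encode the corpus, rebuild the index, and try queries in French/German
- **Different embedding model**: Try `all-mpnet-base-v2` or another model (re-encode and rebuild the index, since the embedding dimension changes)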
- **Better evaluation**: Write more queries, use tighter relevance criteria
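
For the reranking experiment, one possible shape is sketched below: take the bi-encoder's candidates, score every (query, candidate) pair, and keep the highest-scoring ones. To keep the example runnable without downloading a model, the scorer is passed in as a function. `rerank` and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in, not a cross-encoder.

```python
from typing import Callable, List, Sequence, Tuple

def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the top_k by score."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]

def overlap_scorer(pairs):
    """Stand-in scorer: number of distinct words shared by query and document."""
    return [len(set(q.lower().split()) & set(d.lower().split())) for q, d in pairs]

# With sentence-transformers installed, a real scorer could be (assumption, untested here):
#   from sentence_transformers import CrossEncoder
#   score_pairs = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2').predict
candidates = [
    "Exercise and sleep quality in adults",
    "Vaccine response in older adults",
    "Immune response after vaccine booster",
]
top = rerank("immune response to vaccine", candidates, overlap_scorer, top_k=2)
print(top)
```

In the notebook, the candidates would be the `full_text` of the top-20 documents returned by the bi-encoder, and the reranked top-5 would feed the same P@5 evaluation as the baseline.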

In [None]:
# YOUR CODE HERE — Experiment 1

In [None]:
# YOUR CODE HERE — Experiment 2 (optional)

## 5. Compare Results

In [None]:
# YOUR CODE HERE — compare baseline vs improved pipeline
# e.g., average P@5 before and after reranking
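
One way to structure the comparison is to keep per-query scores for each pipeline and report both the averages and the per-query change, which also helps fill in the best and worst query rows of the model card. This is a minimal sketch; `compare_p_at_k` is a hypothetical helper and the scores are placeholders, not measured results.

```python
from statistics import mean

def compare_p_at_k(baseline, improved):
    """Return mean scores and per-query deltas for two pipelines.

    Both arguments map query text to a precision score in [0, 1].
    Only queries present in both mappings are compared.
    """
    shared = [q for q in baseline if q in improved]
    return {
        "baseline_mean": mean(baseline[q] for q in shared),
        "improved_mean": mean(improved[q] for q in shared),
        "deltas": {q: improved[q] - baseline[q] for q in shared},
    }

# Placeholder scores for illustration only
baseline = {"query A": 0.6, "query B": 0.4}
improved = {"query A": 0.8, "query B": 0.4}

summary = compare_p_at_k(baseline, improved)
print(f"baseline avg P@5: {summary['baseline_mean']:.0%}")
print(f"improved avg P@5: {summary['improved_mean']:.0%}")
for query, delta in summary["deltas"].items():
    print(f"{delta:+.0%} | {query}")
```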

## 6. Model Card

| Field | Value |
|-------|-------|
| **Task** | Scientific paper retrieval |
| **Corpus** | SciFact (_N_ documents) |
| **Embedding model** | _model name_ |
| **Avg P@5 (baseline)** | _score_ |
| **Avg P@5 (improved)** | _score_ |
| **Best query** | _which query works best?_ |
| **Worst query** | _which query fails?_ |
| **Improvement idea** | _what you'd try next_ |