# Task 1: Reranking and Zero-Shot Classification

Build a production-ready retrieve-rerank pipeline and zero-shot classifier.

**Goals:**
- Implement retrieve-rerank pipeline
- Measure reranking improvement (MRR, NDCG)
- Build zero-shot classifier
- Handle multi-label classification
- Test on edge cases

In [None]:
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import pipeline
import numpy as np
import json
from sklearn.metrics.pairwise import cosine_similarity

## Load Data

In [None]:
# Load queries and documents
with open('../fixtures/input/queries_documents.json', 'r') as f:
    queries_data = json.load(f)

# Load classification texts
with open('../fixtures/input/classification_texts.json', 'r') as f:
    classification_data = json.load(f)

print(f"Loaded {len(queries_data)} query sets")
print(f"Loaded {len(classification_data)} classification examples")

## Task 1: Baseline Bi-Encoder Search

Implement a baseline search that ranks documents using only a bi-encoder.
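
The ranking step can be sketched on toy vectors without downloading a model. This is a minimal illustration, not the reference solution: `top_k_by_cosine` and the 2-D "embeddings" are hypothetical stand-ins for the vectors that `SentenceTransformer.encode` would return for the notebook's queries and documents.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, doc_ids, k=3):
    """Return the k doc IDs whose vectors are most cosine-similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # Normalise so that a plain dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]  # indices of the highest scores first
    return [doc_ids[i] for i in order]

# Toy 2-D vectors (assumption: real vectors would come from the bi-encoder)
toy_doc_ids = ["d1", "d2", "d3", "d4"]
toy_doc_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
toy_top3 = top_k_by_cosine([1.0, 0.1], toy_doc_vecs, toy_doc_ids, k=3)
print(toy_top3)  # ['d1', 'd2', 'd3']
```

In the exercise, the same function could be fed the encoded query and document texts; which bi-encoder checkpoint to load is left to you.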

In [None]:
# YOUR CODE HERE
# 1. Load bi-encoder model
# 2. For each query, rank documents by cosine similarity
# 3. Return top-3 document IDs for each query

bi_encoder = None  # TODO: Load model
bi_encoder_results = {}  # TODO: Store results as {query_id: [doc_ids]}

# TEST - Do not modify
assert bi_encoder is not None, "Bi-encoder not loaded"
assert len(bi_encoder_results) == len(queries_data), "Missing results"
for query_id, doc_ids in bi_encoder_results.items():
    assert len(doc_ids) == 3, f"Expected 3 results for {query_id}"
print("✓ Task 1 passed")

## Task 2: Cross-Encoder Reranking

Rerank the bi-encoder results with a cross-encoder.
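
A cross-encoder scores each (query, document) pair jointly, so reranking reduces to "score the pairs, sort by score". The sketch below assumes a scorer that takes a list of pairs and returns one score per pair, which is how `CrossEncoder.predict` is called. The helper `rerank` and the word-overlap scorer are hypothetical and exist only so the example runs without a model.

```python
def rerank(query, candidates, score_pairs, k=3):
    """Rerank (doc_id, text) candidates with a pairwise scorer; return top-k doc IDs."""
    pairs = [(query, text) for _, text in candidates]
    scores = score_pairs(pairs)  # one relevance score per (query, text) pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc_id for (doc_id, _), _ in ranked[:k]]

# Hypothetical stand-in scorer: counts query words that also occur in the document
def toy_overlap_scorer(pairs):
    return [len(set(q.lower().split()) & set(d.lower().split())) for q, d in pairs]

toy_candidates = [
    ("d1", "Bananas are yellow"),
    ("d2", "How to train a cross encoder"),
    ("d3", "Train schedules for Berlin"),
    ("d4", "Cross encoder reranking improves search"),
]
toy_reranked = rerank("train a cross encoder", toy_candidates, toy_overlap_scorer, k=3)
print(toy_reranked)  # ['d2', 'd4', 'd3']
```

With the real model loaded, passing `cross_encoder.predict` as `score_pairs` should work in place of the toy scorer.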

In [None]:
# YOUR CODE HERE
# 1. Load cross-encoder model (ms-marco-MiniLM-L-6-v2)
# 2. For each query, rerank all documents using cross-encoder
# 3. Return top-3 document IDs for each query

cross_encoder = None  # TODO: Load model
cross_encoder_results = {}  # TODO: Store results as {query_id: [doc_ids]}

# TEST - Do not modify
assert cross_encoder is not None, "Cross-encoder not loaded"
assert len(cross_encoder_results) == len(queries_data), "Missing results"
for query_id, doc_ids in cross_encoder_results.items():
    assert len(doc_ids) == 3, f"Expected 3 results for {query_id}"
print("✓ Task 2 passed")

## Task 3: Calculate MRR (Mean Reciprocal Rank)

Measure ranking quality with the MRR metric: the mean, over queries, of the reciprocal rank of the first relevant document.
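
As a worked example on toy data, one way to compute MRR is sketched below. The name `mrr_sketch` is hypothetical, and the choice to score a query as 0 when no relevant document appears in its ranked list is an assumption (a common convention, not something the fixtures dictate).

```python
def mrr_sketch(results, ground_truth):
    """Mean of 1/rank of the first relevant doc per query (0 if none is retrieved)."""
    reciprocal_ranks = []
    for query_id, ranked_ids in results.items():
        relevant = set(ground_truth.get(query_id, []))
        rr = 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank  # only the first relevant hit counts
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

toy_results = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"], "q3": ["m", "n", "o"]}
toy_truth = {"q1": ["a"], "q2": ["y"], "q3": ["p"]}
toy_mrr = mrr_sketch(toy_results, toy_truth)
print(f"{toy_mrr:.3f}")  # (1 + 1/2 + 0) / 3 = 0.500
```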

In [None]:
def calculate_mrr(results, ground_truth):
    """
    Calculate Mean Reciprocal Rank
    
    Args:
        results: Dict of {query_id: [ranked_doc_ids]}
        ground_truth: Dict of {query_id: [relevant_doc_ids]}
    
    Returns:
        float: MRR score
    """
    # YOUR CODE HERE
    # 1. For each query, find rank of first relevant document
    # 2. Calculate reciprocal rank (1/rank)
    # 3. Return mean of reciprocal ranks
    
    return 0.0  # TODO: Calculate MRR

# Prepare ground truth
ground_truth = {q['query_id']: q['relevant_docs'] for q in queries_data}

# Calculate MRR for both methods
mrr_bi_encoder = calculate_mrr(bi_encoder_results, ground_truth)
mrr_cross_encoder = calculate_mrr(cross_encoder_results, ground_truth)

print(f"Bi-Encoder MRR: {mrr_bi_encoder:.3f}")
print(f"Cross-Encoder MRR: {mrr_cross_encoder:.3f}")
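if mrr_bi_encoder > 0:  # skip the ratio while the stub still returns 0.0 (avoids ZeroDivisionError)
    print(f"Improvement: {(mrr_cross_encoder - mrr_bi_encoder) / mrr_bi_encoder * 100:.1f}%")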

# TEST - Do not modify
assert mrr_bi_encoder > 0, "Bi-encoder MRR not calculated"
assert mrr_cross_encoder > 0, "Cross-encoder MRR not calculated"
assert mrr_cross_encoder >= mrr_bi_encoder, "Cross-encoder should improve MRR"
print("✓ Task 3 passed")

## Task 4: Calculate NDCG@3

Measure ranking quality with the NDCG metric, which rewards placing relevant documents near the top of the ranking.
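
With binary relevance, NDCG@k can be sketched as follows. The name `ndcg_at_k_sketch` is hypothetical, and the ideal DCG is computed under the assumption that a perfect ranking places `min(k, number of relevant docs)` relevant documents first.

```python
import math

def ndcg_at_k_sketch(results, ground_truth, k=3):
    """Mean NDCG@k over queries, using binary relevance (1 if relevant, else 0)."""
    per_query = []
    for query_id, ranked_ids in results.items():
        relevant = set(ground_truth.get(query_id, []))
        # DCG: each relevant hit at rank i contributes 1 / log2(i + 1)
        dcg = sum(
            1.0 / math.log2(i + 1)
            for i, doc_id in enumerate(ranked_ids[:k], start=1)
            if doc_id in relevant
        )
        # IDCG: DCG of a ranking with all relevant docs (up to k) at the top
        ideal_hits = min(len(relevant), k)
        idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
        per_query.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0

toy_results = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
toy_truth = {"q1": ["a"], "q2": ["y", "z"]}
toy_ndcg = ndcg_at_k_sketch(toy_results, toy_truth, k=3)
print(f"{toy_ndcg:.3f}")  # q1 scores 1.0, q2 about 0.693, mean about 0.847
```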

In [None]:
def calculate_ndcg_at_k(results, ground_truth, k=3):
    """
    Calculate Normalized Discounted Cumulative Gain at K
    
    Args:
        results: Dict of {query_id: [ranked_doc_ids]}
        ground_truth: Dict of {query_id: [relevant_doc_ids]}
        k: Cutoff rank
    
    Returns:
        float: NDCG@k score
    """
    # YOUR CODE HERE
    # 1. For each query, create relevance vector (1 if relevant, 0 if not)
    # 2. Calculate DCG = sum(rel_i / log2(i+1)) for i=1..k
    # 3. Calculate IDCG (DCG of perfect ranking)
    # 4. NDCG = DCG / IDCG
    # 5. Return mean NDCG across queries
    
    return 0.0  # TODO: Calculate NDCG

# Calculate NDCG for both methods
ndcg_bi_encoder = calculate_ndcg_at_k(bi_encoder_results, ground_truth, k=3)
ndcg_cross_encoder = calculate_ndcg_at_k(cross_encoder_results, ground_truth, k=3)

print(f"Bi-Encoder NDCG@3: {ndcg_bi_encoder:.3f}")
print(f"Cross-Encoder NDCG@3: {ndcg_cross_encoder:.3f}")
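if ndcg_bi_encoder > 0:  # skip the ratio while the stub still returns 0.0 (avoids ZeroDivisionError)
    print(f"Improvement: {(ndcg_cross_encoder - ndcg_bi_encoder) / ndcg_bi_encoder * 100:.1f}%")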

# TEST - Do not modify
assert ndcg_bi_encoder > 0, "Bi-encoder NDCG not calculated"
assert ndcg_cross_encoder > 0, "Cross-encoder NDCG not calculated"
assert ndcg_cross_encoder >= ndcg_bi_encoder, "Cross-encoder should improve NDCG"
print("✓ Task 4 passed")

## Task 5: Zero-Shot Classification

Classify texts without training data.
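
For a single input text, the Hugging Face zero-shot pipeline returns a dict with `sequence`, `labels`, and `scores` keys. The sketch below shows one way to pull the top label out of such a result. The helper `top_label` and the hand-written result dict are hypothetical; in the notebook the dict would come from calling the pipeline on a text with its candidate labels, and the field names holding those in `classification_data` are not shown here.

```python
def top_label(zero_shot_output):
    """Return the highest-scoring label from one zero-shot pipeline result."""
    labels = zero_shot_output["labels"]
    scores = zero_shot_output["scores"]
    # Do not rely on the ordering of the lists; pick the max score explicitly
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return labels[best_index]

# Hand-written result in the pipeline's output shape (illustrative values only)
toy_output = {
    "sequence": "The team won the championship last night",
    "labels": ["sports", "politics", "technology"],
    "scores": [0.92, 0.05, 0.03],
}
toy_prediction = top_label(toy_output)
print(toy_prediction)  # sports
```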

In [None]:
# YOUR CODE HERE
# 1. Load zero-shot classification pipeline
# 2. For each text in classification_data, predict top label
# 3. Store results as {text_id: predicted_label}

zero_shot_classifier = None  # TODO: Load pipeline
classification_results = {}  # TODO: Store predictions

# TEST - Do not modify
assert zero_shot_classifier is not None, "Classifier not loaded"
assert len(classification_results) == len(classification_data), "Missing predictions"
print("✓ Task 5 passed")

## Task 6: Calculate Classification Accuracy

Measure zero-shot accuracy against ground truth.
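
Accuracy here is the share of texts whose predicted label equals the first entry of `true_labels`. A small sketch on toy data follows; `first_label_accuracy` is a hypothetical helper, and counting a missing prediction as wrong is an assumption.

```python
def first_label_accuracy(predictions, items):
    """Fraction of items whose prediction matches the first true label."""
    if not items:
        return 0.0
    correct = sum(
        predictions.get(item["text_id"]) == item["true_labels"][0]
        for item in items
    )
    return correct / len(items)

toy_items = [
    {"text_id": "t1", "true_labels": ["sports"]},
    {"text_id": "t2", "true_labels": ["politics", "economy"]},
    {"text_id": "t3", "true_labels": ["technology"]},
]
toy_predictions = {"t1": "sports", "t2": "politics", "t3": "sports"}
toy_accuracy = first_label_accuracy(toy_predictions, toy_items)
print(f"{toy_accuracy:.1%}")  # 66.7%
```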

In [None]:
# YOUR CODE HERE
# 1. Compare predictions to true_labels (first label)
# 2. Calculate accuracy

accuracy = 0.0  # TODO: Calculate accuracy

print(f"Zero-shot Accuracy: {accuracy:.1%}")

# Show predictions
for item in classification_data[:5]:
    text_id = item['text_id']
    predicted = classification_results[text_id]
    actual = item['true_labels'][0]
    match = "✓" if predicted == actual else "✗"
    print(f"{match} {text_id}: predicted={predicted}, actual={actual}")

# TEST - Do not modify
assert accuracy > 0, "Accuracy not calculated"
assert accuracy >= 0.5, f"Accuracy too low: {accuracy:.1%}"
print("✓ Task 6 passed")

## Task 7: Multi-Label Classification

Handle texts with multiple labels.
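
With `multi_label=True`, the pipeline scores each label independently, so several labels can exceed the threshold at once. The sketch below thresholds such scores and averages a per-text F1. Both helper names are hypothetical, the score dicts are hand-written, and scoring F1 as 0 whenever a denominator is zero is an assumption.

```python
def labels_above_threshold(zero_shot_output, threshold=0.5):
    """Keep every label whose independent score exceeds the threshold."""
    return [
        label
        for label, score in zip(zero_shot_output["labels"], zero_shot_output["scores"])
        if score > threshold
    ]

def multilabel_f1_sketch(predictions, ground_truth):
    """Average per-text F1 between predicted and true label sets."""
    f1_scores = []
    for text_id, true_labels in ground_truth.items():
        predicted, actual = set(predictions.get(text_id, [])), set(true_labels)
        tp = len(predicted & actual)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(actual) if actual else 0.0
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total > 0 else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0

toy_outputs = {
    "t1": {"labels": ["sports", "politics", "tech"], "scores": [0.9, 0.7, 0.1]},
    "t2": {"labels": ["tech", "science", "sports"], "scores": [0.8, 0.4, 0.2]},
}
toy_multilabel = {tid: labels_above_threshold(out) for tid, out in toy_outputs.items()}
toy_f1 = multilabel_f1_sketch(toy_multilabel, {"t1": ["sports"], "t2": ["tech", "science"]})
print(toy_multilabel)    # {'t1': ['sports', 'politics'], 't2': ['tech']}
print(f"{toy_f1:.3f}")   # 0.667
```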

In [None]:
# YOUR CODE HERE
# 1. Run the classifier with multi_label=True on every text (the test expects a prediction list per text_id)
# 2. Predict all labels with score > threshold (0.5)
# 3. Calculate F1 score for multi-label predictions

def calculate_multilabel_f1(predictions, ground_truth):
    """
    Calculate F1 for multi-label classification
    
    Args:
        predictions: Dict of {text_id: [predicted_labels]}
        ground_truth: Dict of {text_id: [true_labels]}
    
    Returns:
        float: Average F1 score
    """
    # YOUR CODE HERE
    # For each text (treat any ratio with a zero denominator as 0):
    # - Calculate precision = TP / (TP + FP)
    # - Calculate recall = TP / (TP + FN)
    # - F1 = 2 * (precision * recall) / (precision + recall)
    # Then return the mean F1 across texts
    
    return 0.0  # TODO: Calculate F1

multilabel_predictions = {}  # TODO: Predict multiple labels
multilabel_ground_truth = {item['text_id']: item['true_labels'] 
                           for item in classification_data}

f1_score = calculate_multilabel_f1(multilabel_predictions, multilabel_ground_truth)

print(f"Multi-label F1: {f1_score:.3f}")

# TEST - Do not modify
assert len(multilabel_predictions) == len(classification_data), "Missing predictions"
assert f1_score > 0, "F1 not calculated"
assert f1_score >= 0.5, f"F1 too low: {f1_score:.3f}"
print("✓ Task 7 passed")

## Task 8: Handle Edge Cases

Test the reranker and the classifier on challenging examples.
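
Edge cases can make a model call raise instead of returning a prediction, so it can help to record an outcome for every case rather than stop at the first failure. The runner below is a sketch under that assumption: `run_edge_cases` and the `text` field are hypothetical, and only the `case` key is taken from the notebook's fixtures.

```python
def run_edge_cases(cases, run_case):
    """Run run_case on each case dict; record its result or the error type."""
    outcomes = {}
    for case in cases:
        try:
            outcomes[case["case"]] = run_case(case)
        except Exception as exc:  # keep going so every case gets an entry
            outcomes[case["case"]] = f"error: {type(exc).__name__}"
    return outcomes

# Toy cases and a toy handler that rejects empty input
toy_cases = [
    {"case": "empty_text", "text": ""},
    {"case": "normal", "text": "hello world"},
]

def toy_word_count(case):
    if not case["text"]:
        raise ValueError("empty input")
    return len(case["text"].split())

toy_outcomes = run_edge_cases(toy_cases, toy_word_count)
print(toy_outcomes)  # {'empty_text': 'error: ValueError', 'normal': 2}
```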

In [None]:
# Load edge cases
with open('../fixtures/edge_cases/test_cases.json', 'r') as f:
    edge_cases = json.load(f)

# YOUR CODE HERE
# 1. Test reranking edge cases
# 2. Test classification edge cases
# 3. Identify which cases the models handle well/poorly

reranking_edge_results = {}  # TODO: Test reranking edge cases
classification_edge_results = {}  # TODO: Test classification edge cases

# Analyze results
print("Reranking Edge Cases:")
for case in edge_cases['reranking_edge_cases']:
    case_name = case['case']
    if case_name in reranking_edge_results:
        print(f"  {case_name}: {reranking_edge_results[case_name]}")

print("\nClassification Edge Cases:")
for case in edge_cases['classification_edge_cases']:
    case_name = case['case']
    if case_name in classification_edge_results:
        predicted = classification_edge_results[case_name]
        expected = case.get('expected', 'N/A')
        match = "✓" if predicted == expected else "✗"
        print(f"  {match} {case_name}: predicted={predicted}, expected={expected}")

# TEST - Do not modify
assert len(reranking_edge_results) > 0, "No reranking edge cases tested"
assert len(classification_edge_results) > 0, "No classification edge cases tested"
print("\n✓ Task 8 passed")

## Summary

You've successfully:
- ✓ Built retrieve-rerank pipeline
- ✓ Measured improvement with MRR and NDCG
- ✓ Implemented zero-shot classification
- ✓ Handled multi-label scenarios
- ✓ Tested edge cases

**Next steps:**
- Try different cross-encoder models
- Experiment with hypothesis templates
- Combine with FAISS for a production pipeline