# SHL Assessment Recommendation System
## Evaluation & Test Predictions

This notebook evaluates the Multi-Vector V2 retrieval system and generates predictions for test queries.

**System**: Multi-Vector V2 with Query Deconstruction + LLM Reranking  
**Expected Performance**: 39.11% Recall@10 on training data

## 1. Import Required Libraries

In [1]:
import sys
from pathlib import Path
import pandas as pd
from dotenv import load_dotenv

# Add project directory to path
sys.path.append(str(Path.cwd()))

from src.multi_vector_retriever_v2 import MultiVectorRetrieverV2
from src.evaluator import Evaluator, normalize_url

# Load environment variables
load_dotenv()

print("✓ Libraries imported successfully")

✓ Libraries imported successfully




## 2. Load Multi-Vector Retriever

Initialize the retriever with Gemini embeddings and sparse descriptions, the best-performing configuration (39.11% Recall@10 on training data).

In [2]:
print("Loading Multi-Vector Retriever V2...")
retriever = MultiVectorRetrieverV2(data_dir="data", gemini_dir="data/gemini")
print("✓ Retriever loaded successfully")

Loading Multi-Vector Retriever V2...
✓ Loaded Multi-Vector Retriever V2
  - 353 assessments
  - Gemini embeddings (3072-dim)
  - Query deconstruction + LLM reranking enabled
✓ Retriever loaded successfully


## 3. Evaluation on Training Data

Evaluate the system on training data to measure Recall@10 performance.
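
For reference, Recall@K for one query is the share of its ground-truth URLs that appear among the top K predictions, and the mean is the plain average over all queries. The sketch below is a minimal, standalone version of that calculation. It is not the project's `Evaluator`: the `normalize` helper is a hypothetical stand-in for `normalize_url`, and the `example.com` URLs are placeholders.

```python
from urllib.parse import urlparse


def normalize(url: str) -> str:
    """Hypothetical normalizer: lowercase, keep host + path, drop trailing slash."""
    parsed = urlparse(url.strip().lower())
    return parsed.netloc + parsed.path.rstrip("/")


def recall_at_k(predicted: list[str], relevant: list[str], k: int = 10) -> float:
    """Fraction of relevant URLs found among the first k predictions."""
    relevant_set = {normalize(u) for u in relevant}
    if not relevant_set:
        return 0.0
    top_k = {normalize(u) for u in predicted[:k]}
    return len(top_k & relevant_set) / len(relevant_set)


def mean_recall_at_k(
    predictions: dict[str, list[str]],
    ground_truth: dict[str, list[str]],
    k: int = 10,
) -> float:
    """Average Recall@K over every query in the ground truth."""
    if not ground_truth:
        return 0.0
    scores = [
        recall_at_k(predictions.get(query, []), urls, k)
        for query, urls in ground_truth.items()
    ]
    return sum(scores) / len(scores)


# Placeholder data: 5 relevant URLs, 10 predictions, 4 of which overlap.
relevant = [f"https://example.com/assessment-{i}/" for i in range(1, 6)]
predicted = [f"https://example.com/assessment-{i}" for i in range(2, 12)]
print(f"{recall_at_k(predicted, relevant):.2%}")  # prints 80.00%
```

A query with no predictions (for example, after a retrieval error) contributes a recall of 0 to the mean in this sketch, which matches how the loop below stores an empty list on failure.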

In [5]:
# Load training data
train_file = "/Users/sakshampoply/Downloads/Gen_AI Dataset/Train-Set-Table 1.csv"
print(f"Loading training data: {train_file}")
evaluator = Evaluator(train_file)
print(f"✓ Loaded {len(evaluator.ground_truth)} unique queries\n")

# Generate predictions for all training queries
print("Generating predictions on training data...")
all_predictions = {}

for i, (query, ground_truth_urls) in enumerate(evaluator.ground_truth.items(), 1):
    print(f"[{i}/{len(evaluator.ground_truth)}] {query[:80]}...")
    
    try:
        # Retrieve top-10 assessments
        results = retriever.retrieve(query, top_k=10)
        predicted_urls = [r["url"] for r in results]
        all_predictions[query] = predicted_urls
        
        # Calculate recall for this query
        recall = evaluator.recall_at_k(predicted_urls, ground_truth_urls, k=10)
        normalized_predicted = [normalize_url(url) for url in predicted_urls]
        matches = len(set(normalized_predicted).intersection(set(ground_truth_urls)))
        print(f"  Recall@10: {recall:.2%} ({matches}/{len(ground_truth_urls)} matches)")
        
    except Exception as e:
        print(f"  Error: {e}")
        all_predictions[query] = []

print("\n" + "="*80)

Loading training data: /Users/sakshampoply/Downloads/Gen_AI Dataset/Train-Set-Table 1.csv
✓ Loaded 10 unique queries

Generating predictions on training data...
[1/10] I am hiring for Java developers who can also collaborate effectively with my bus...

Query: I am hiring for Java developers who can also collaborate effectively with my business teams. Looking...

[Step 1] Deconstructing query into search facets...
  Generated 4 search queries:
    1. Java programming technical skills
    2. collaboration interpersonal skills
    3. business acumen communication skills
    4. 40 minute assessment

[Step 2] Running semantic searches for each facet...
  Search 1/4: 'Java programming technical skills'
  ... [output truncated]

### Calculate Final Metrics

In [6]:
# Calculate Mean Recall@10
mean_recall = evaluator.mean_recall_at_k(all_predictions, k=10)

print("="*80)
print("EVALUATION RESULTS")
print("="*80)
print(f"Mean Recall@10: {mean_recall:.2%}")
print()

# Per-query breakdown
print("Per-Query Results:")
print("-"*80)
for i, (query, ground_truth_urls) in enumerate(evaluator.ground_truth.items(), 1):
    if query in all_predictions:
        predicted_urls = all_predictions[query]
        recall = evaluator.recall_at_k(predicted_urls, ground_truth_urls, k=10)
        normalized_predicted = [normalize_url(url) for url in predicted_urls]
        matches = len(set(normalized_predicted).intersection(set(ground_truth_urls)))
        
        print(f"\nQuery {i}: {query[:80]}...")
        print(f"  Recall@10: {recall:.2%}")
        print(f"  Matches: {matches}/{len(ground_truth_urls)}")

print("\n" + "="*80)

EVALUATION RESULTS
Mean Recall@10: 39.89%

Per-Query Results:
--------------------------------------------------------------------------------

Query 1: I am hiring for Java developers who can also collaborate effectively with my bus...
  Recall@10: 80.00%
  Matches: 4/5

Query 2: I want to hire new graduates for a sales role in my company, the budget is for a...
  Recall@10: 33.33%
  Matches: 3/9

Query 3: I am looking for a COO for my company in China and I want to see if they are cul...
  Recall@10: 33.33%
  Matches: 2/6

Query 4: KEY RESPONSIBITILES:

Manage the sound-scape of the station through appropriate ...
  Recall@10: 40.00%
  Matches: 2/5

Query 5: Content Writer required, expert in English and SEO....
  Recall@10: 60.00%
  Matches: 3/5

Query 6: Find me 1 hour long assesment for the below job at SHL
Job Description

 Join a ...
  Recall@10: 55.56%
  Matches: 5/9

Query 7: ICICI Bank Assistant Admin, Experience required 0-2 years, test should be 30-40 ...
  Recall@10: 16.67%
  ... [output truncated]

### Save Evaluation Results
Because LLM outputs are not deterministic, recall scores vary from run to run; the mean Recall@10 has been observed in the range 0.36 to 0.7.
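
One way to quantify that run-to-run variation is to repeat the evaluation several times and report the spread alongside the mean. The sketch below assumes a hypothetical `run_once` callable that performs one full evaluation pass and returns its mean Recall@10; the scores in the demo are made-up placeholders, not measurements from this system.

```python
import statistics
from typing import Callable


def summarize_runs(run_once: Callable[[], float], n_runs: int = 5) -> dict[str, float]:
    """Call run_once n_runs times and summarize the returned scores."""
    scores = [run_once() for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        # The sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# Demo with placeholder scores standing in for three evaluation passes.
placeholder_scores = iter([0.36, 0.40, 0.44])
summary = summarize_runs(lambda: next(placeholder_scores), n_runs=3)
print(summary)
```

In this notebook, `run_once` could wrap the prediction loop above and return `evaluator.mean_recall_at_k(all_predictions, k=10)`, at the cost of one full set of LLM calls per run.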

In [7]:
# Save evaluation predictions to CSV
eval_output_file = "evaluation_predictions.csv"
eval_rows = []
for query, urls in all_predictions.items():
    for url in urls:
        eval_rows.append({"Query": query, "Assessment_url": url})

eval_df = pd.DataFrame(eval_rows)
eval_df.to_csv(eval_output_file, index=False)

print(f"✓ Evaluation predictions saved to: {eval_output_file}")
print(f"  Total rows: {len(eval_df)}")
print(f"  Queries: {len(all_predictions)}")

✓ Evaluation predictions saved to: evaluation_predictions.csv
  Total rows: 100
  Queries: 10


## 4. Generate Test Predictions

Generate predictions for the test dataset (queries without ground truth labels).

In [8]:
# Load test dataset
test_file = "/Users/sakshampoply/Downloads/Gen_AI Dataset/Test-Set-Table 1.csv"
print(f"Loading test data: {test_file}")
test_df = pd.read_csv(test_file)

# Get unique queries
test_queries = test_df["Query"].dropna().unique().tolist()
print(f"✓ Found {len(test_queries)} unique test queries\n")

# Generate predictions
print("Generating test predictions...")
test_predictions = []

for i, query in enumerate(test_queries, 1):
    print(f"[{i}/{len(test_queries)}] {query[:80]}...")
    
    try:
        # Use multi-vector retrieval
        results = retriever.retrieve(query, top_k=10)
        
        # Add predictions for this query
        for result in results:
            test_predictions.append({
                "Query": query,
                "Assessment_url": result["url"]
            })
        
        print(f"  ✓ Generated {len(results)} recommendations")
        
    except Exception as e:
        print(f"  ✗ Error: {e}")

print("\n" + "="*80)

Loading test data: /Users/sakshampoply/Downloads/Gen_AI Dataset/Test-Set-Table 1.csv
✓ Found 9 unique test queries

Generating test predictions...
[1/9] Looking to hire mid-level professionals who are proficient in Python, SQL and Ja...

Query: Looking to hire mid-level professionals who are proficient in Python, SQL and Java Script. Need an a...

[Step 1] Deconstructing query into search facets...
  Generated 4 search queries:
    1. mid-level Python SQL JavaScript technical skills
    2. problem solving analytical thinking
    3. mid-level experience professionals
    4. 60 minutes assessment package

[Step 2] Running semantic searches for each facet...
  Search 1/4: 'mid-level Python SQL JavaScript technical skills'
  ... [output truncated]

### Save Test Predictions

In [9]:
# Save test predictions to CSV
test_output_file = "test_predictions.csv"
test_predictions_df = pd.DataFrame(test_predictions)
test_predictions_df.to_csv(test_output_file, index=False)

print("="*80)
print("TEST PREDICTIONS COMPLETE")
print("="*80)
print(f"✓ Test predictions saved to: {test_output_file}")
print(f"  Total rows: {len(test_predictions_df)}")
print(f"  Unique queries: {len(test_queries)}")
print(f"  Predictions per query: {len(test_predictions_df) / len(test_queries):.1f}")
print("="*80)

TEST PREDICTIONS COMPLETE
✓ Test predictions saved to: test_predictions.csv
  Total rows: 90
  Unique queries: 9
  Predictions per query: 10.0
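
A quick sanity check on the saved file can catch queries that ended up with too many rows or, because the prediction loop skips a query when `retrieve` raises, with none at all. The following is a rough sketch using only the standard library; the function names and the 10-row limit (taken from `top_k=10`) are illustrative assumptions, not part of the project.

```python
import csv
from collections import Counter


def rows_per_query(csv_path: str, query_col: str = "Query") -> Counter:
    """Count how many prediction rows each query has in the CSV."""
    # newline="" lets the csv module handle multi-line quoted queries.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row[query_col] for row in csv.DictReader(handle))


def check_predictions(
    counts: Counter,
    expected_queries: list[str],
    max_rows: int = 10,
) -> dict[str, list[str]]:
    """Report queries that are missing from the file or exceed max_rows."""
    return {
        "missing": [q for q in expected_queries if counts[q] == 0],
        "too_many": [q for q, n in counts.items() if n > max_rows],
    }
```

For this notebook the call would look like `check_predictions(rows_per_query("test_predictions.csv"), test_queries)`; two empty lists mean every query has between 1 and 10 rows.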


## 5. Summary

**System Performance:**
- Model: Multi-Vector with Query Deconstruction + LLM Reranking
- Embeddings: Gemini embedding-001 (3072-dim)
- Mean Recall@10 on training data: 39.89% in this run (scores vary between runs)

**Key Features:**
1. Query deconstruction into multiple search facets
2. Multi-vector semantic search with Gemini embeddings
3. LLM reranking for balanced results
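
To make the first two features concrete, here is a toy sketch of a multi-facet search: each facet vector is compared against every catalog vector by cosine similarity, and each item keeps its best score across facets. This is only an illustration under stated assumptions. The internals of `MultiVectorRetrieverV2` are not shown in this notebook, the max-score merge rule is an assumption, the 2-D vectors and item names are placeholders, and the embedding and LLM reranking steps are left out entirely.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def multi_facet_search(
    facet_vectors: list[list[float]],
    catalog: dict[str, list[float]],
    top_k: int = 10,
) -> list[str]:
    """Rank catalog items by their best similarity to any facet (assumed merge rule)."""
    best: dict[str, float] = {}
    for facet in facet_vectors:
        for item, vector in catalog.items():
            score = cosine(facet, vector)
            if score > best.get(item, float("-inf")):
                best[item] = score
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:top_k]]


# Placeholder catalog and facet vectors (real embeddings are 3072-dim).
catalog = {
    "java-test": [1.0, 0.0],
    "teamwork-test": [0.0, 1.0],
    "typing-test": [-1.0, 0.0],
}
facets = [[1.0, 0.0], [0.2, 1.0]]  # e.g. a "technical" and a "collaboration" facet
top_two = multi_facet_search(facets, catalog, top_k=2)
print(top_two)  # prints ['java-test', 'teamwork-test']
```

Taking the maximum across facets lets an item that strongly matches a single facet surface even if it is unrelated to the others; summing or averaging the scores would instead favor items that partially match several facets.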

**Output Files:**
- `evaluation_predictions.csv` - Training data predictions (`Query` and `Assessment_url` columns; ground truth is not included)
- `test_predictions.csv` - Test data predictions for submission