# Experiment 1: Basic TF-IDF Approach


**Goal:** Start simple: build a TF-IDF retriever to get a baseline

**Hypothesis:** TF-IDF should capture keyword matches between queries and assessments

Let's see what we get...

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load data
df = pd.read_csv('../data/shl_individual_test_solutions.csv')
train_df = pd.read_excel('../data/Gen_AI Dataset (1).xlsx', sheet_name='Train-Set')

print(f"Assessments: {len(df)}")
print(f"Training examples: {len(train_df)}")

## Create simple documents

Just concatenate name and description for now

In [2]:
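# Create documents - keeping it simple
# fillna('') keeps missing names/descriptions from turning into the literal string "nan"
documents = []
for _, row in df[['name', 'description']].fillna('').iterrows():
    doc = f"{row['name']} {row['description']}".strip()
    documents.append(doc)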

print(f"Created {len(documents)} documents")
print(f"Sample: {documents[0][:100]}...")

In [3]:
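# Build TF-IDF
# Mostly default parameters; only the vocabulary cap (max_features) and
# unigram + bigram features (ngram_range) are set explicitly
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))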
tfidf_matrix = vectorizer.fit_transform(documents)

print(f"TF-IDF shape: {tfidf_matrix.shape}")

## Test with a query

Let's try "Java developer"

In [4]:
query = "Java developer"
query_vec = vectorizer.transform([query])
scores = cosine_similarity(query_vec, tfidf_matrix)[0]

# Get top 10
top_indices = np.argsort(scores)[-10:][::-1]

print(f"Top 10 for '{query}':")
for i, idx in enumerate(top_indices, 1):
    print(f"{i}. {df.iloc[idx]['name']} (score: {scores[idx]:.3f})")

**Observations:**
- Java assessments show up! ✓
- Scores are pretty low though (< 0.3)
- Missing some relevant tests
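One plausible reason for the low scores: with a short query, cosine similarity is roughly the share of a document's L2-normalized weight that sits on the matching terms, so longer descriptions dilute the match. The sketch below shows this on two made-up documents (`short_doc` and `long_doc` are illustrative, not from the SHL catalogue).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical documents that both contain the query term "java"
short_doc = "java test"
long_doc = ("java test covering collections concurrency streams "
            "generics and build tooling")

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform([short_doc, long_doc])

# The query has a single known term, so its vector is a unit vector on "java"
toy_query_vec = toy_vectorizer.transform(["java"])
toy_scores = cosine_similarity(toy_query_vec, toy_matrix)[0]

print(f"short doc: {toy_scores[0]:.3f}")
print(f"long doc:  {toy_scores[1]:.3f}")
```

Both documents match the query equally well in keyword terms, yet the longer one scores lower because its weight is spread over more terms. Low absolute scores are therefore not a problem by themselves; the ranking is what matters for recall.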

Let's evaluate properly on training set...

In [5]:
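# Quick evaluation
# Normalize URLs first so both sources use the same path form
# regex=False makes this a literal substring replace on every pandas version
df['normalized_url'] = df['url'].str.replace('/solutions/products/', '/products/', regex=False)
train_df['normalized_url'] = train_df['Assessment_url'].str.replace('/solutions/products/', '/products/', regex=False)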

recalls = []
for query, group in train_df.groupby('Query'):
    ground_truth = set(group['normalized_url'])
    
    # Get predictions
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, tfidf_matrix)[0]
    top_10_idx = np.argsort(scores)[-10:][::-1]
    predicted = set(df.iloc[top_10_idx]['normalized_url'])
    
    # Calculate recall
    found = len(ground_truth & predicted)
    recall = found / len(ground_truth) if len(ground_truth) > 0 else 0
    recalls.append(recall)
    
    print(f"Query: {query[:50]}... | Recall: {recall:.2f}")

print(f"\nMean Recall@10: {np.mean(recalls):.3f}")

## Results: 26.2% Mean Recall@10

**Not great...**
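For reference, the metric computed in the loop above can be pulled into a small helper. This is a minimal sketch; `recall_at_k` is a name introduced here for illustration and not part of the original notebook.

```python
def recall_at_k(relevant, ranked, k=10):
    """Fraction of the relevant items that appear among the first k ranked items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # same convention as the loop above: no ground truth counts as 0
    return len(relevant & set(ranked[:k])) / len(relevant)


# Hypothetical example: 4 relevant URLs, 2 of them retrieved in the top 10
example_relevant = {"a", "b", "c", "d"}
example_ranked = ["a", "x", "b", "y"]
print(recall_at_k(example_relevant, example_ranked))       # 0.5
print(recall_at_k(example_relevant, example_ranked, k=1))  # 0.25
```

Note that any query with more than 10 relevant assessments cannot reach a recall of 1.0 at k=10, which puts a ceiling on the mean.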

**Issues:**
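1. Only exact keyword matches work: a query term that is not in the vocabulary contributes nothing
2. No semantic understanding (synonyms and paraphrases score zero)
3. Not using training data patterns (the training set is only used for evaluation)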
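The first two issues are easy to reproduce on a toy corpus. In the sketch below (made-up documents, and a hypothetical helper `toy_similarity`), a paraphrased query shares no tokens with a clearly related document, so every score is exactly zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus for illustration only
synonym_docs = [
    "Java programming test",
    "Software engineer coding assessment",
]
synonym_vectorizer = TfidfVectorizer()
synonym_matrix = synonym_vectorizer.fit_transform(synonym_docs)


def toy_similarity(query):
    """Cosine similarity of the query against each toy document."""
    return cosine_similarity(synonym_vectorizer.transform([query]), synonym_matrix)[0]


exact = toy_similarity("java test")            # shares tokens with the first document
paraphrase = toy_similarity("developer exam")  # related meaning, no shared tokens

print("exact:     ", exact)
print("paraphrase:", paraphrase)
```

The paraphrased query maps to an all-zero vector because none of its tokens were seen during fitting, which is the gap that embeddings are meant to close.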

**Next steps:**
- Try semantic embeddings?
- Weight fields differently?
- Use LLM for query understanding?
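For the field-weighting idea, one cheap option that stays within TF-IDF is to repeat the name several times when building each document, which raises its term frequency relative to the description. This is only a sketch under that assumption: `build_weighted_documents` and `name_weight` are hypothetical names, and the toy rows are not from the real catalogue.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_weighted_documents(frame, name_weight=3):
    """Repeat the name `name_weight` times so its terms count more than the description's."""
    names = frame["name"].fillna("").astype(str)
    descriptions = frame["description"].fillna("").astype(str)
    return [" ".join([n] * name_weight + [d]) for n, d in zip(names, descriptions)]


def score_first_doc(docs, query):
    """Fit TF-IDF on docs and return the query's cosine similarity to the first one."""
    vec = TfidfVectorizer()
    matrix = vec.fit_transform(docs)
    return cosine_similarity(vec.transform([query]), matrix)[0][0]


# Hypothetical rows: "java" is in the first name and only incidental in the second description
toy_df = pd.DataFrame({
    "name": ["Java Test", "Sales Simulation"],
    "description": ["Measures programming knowledge",
                    "Role play that mentions java coffee breaks"],
})

plain = score_first_doc(build_weighted_documents(toy_df, name_weight=1), "java")
boosted = score_first_doc(build_weighted_documents(toy_df, name_weight=3), "java")
print(f"name_weight=1: {plain:.3f} | name_weight=3: {boosted:.3f}")
```

Repetition is crude (it also inflates bigram counts across the repeated names), so a cleaner variant would fit one vectorizer per field and combine the weighted matrices; whether either helps Recall@10 here would need to be measured.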

Will try embeddings next...