# Experiment 1: Basic TF-IDF Approach



**Goal:** Start simple with TF-IDF to get a baseline

**Hypothesis:** TF-IDF should capture keyword matches between queries and assessments

Let's see what we get...

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load data
df = pd.read_csv('../data/shl_individual_test_solutions.csv')
train_df = pd.read_excel('../data/Gen_AI Dataset (1).xlsx', sheet_name='Train-Set')

print(f"Assessments: {len(df)}")
print(f"Training examples: {len(train_df)}")

Assessments: 377
Training examples: 65


## Create simple documents

Just concatenate name and description for now

In [2]:
# Create documents - keeping it simple
documents = []
for _, row in df.iterrows():
    doc = f"{row['name']} {row['description']}"
    documents.append(doc)

print(f"Created {len(documents)} documents")
print(f"Sample: {documents[0][:100]}...")

Created 377 documents
Sample: Global Skills Development Report This report is designed to be given to individuals who have complet...
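
One caveat with the f-string approach: if any row has a missing description (not checked for this CSV), it ends up in the document as the literal text "nan". A minimal sketch of a NaN-safe, vectorized alternative, using a made-up two-row frame (`toy_df` and its contents are illustrative, not taken from the catalogue):

```python
import pandas as pd

# Toy stand-in for the assessment catalogue; the real `df` comes from the CSV.
toy_df = pd.DataFrame({
    "name": ["Java 8 (New)", "Global Skills Development Report"],
    "description": ["Multi-choice test on Java 8 features.", None],
})

# fillna("") keeps a missing description from turning into the token "nan";
# str.strip() removes the trailing space left when the description is empty.
toy_documents = (
    toy_df["name"].fillna("") + " " + toy_df["description"].fillna("")
).str.strip().tolist()

print(toy_documents)
```

The same expression should work on the real `df`, assuming its columns are named `name` and `description` as in the cell above.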


In [3]:
# Build TF-IDF
# Mostly default settings, except: cap the vocabulary at 5000 features and use unigrams + bigrams
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(documents)

print(f"TF-IDF shape: {tfidf_matrix.shape}")

TF-IDF shape: (377, 5000)


## Test with a query

Let's try "Java developer"

In [4]:
query = "Java developer"
query_vec = vectorizer.transform([query])
scores = cosine_similarity(query_vec, tfidf_matrix)[0]

# Get top 10
top_indices = np.argsort(scores)[-10:][::-1]

print(f"Top 10 for '{query}':")
for i, idx in enumerate(top_indices, 1):
    print(f"{i}. {df.iloc[idx]['name']} (score: {scores[idx]:.3f})")

Top 10 for 'Java developer':
1. Java 8 (New) (score: 0.380)
2. Java Platform Enterprise Edition 7 (Java EE 7) (score: 0.274)
3. Core Java (Advanced Level) (New) (score: 0.244)
4. Java Frameworks (New) (score: 0.237)
5. Java Design Patterns (New) (score: 0.206)
6. Java Web Services (New) (score: 0.188)
7. Enterprise Java Beans (New) (score: 0.178)
8. Core Java (Entry Level) (New) (score: 0.177)
9. Informatica (Developer) (New) (score: 0.168)
10. Java 2 Platform Enterprise Edition 1.4 Fundamental (score: 0.110)


**Observations:**
- Java assessments show up! ✓
- Scores are pretty low though (top hit is 0.380, everything else is below 0.3)
- Missing some relevant tests
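
One thing worth checking when relevant tests go missing: `TfidfVectorizer` matches exact tokens and does no stemming by default, so "developer" and "developers" are different features. The sketch below shows the effect on a made-up corpus (documents and variable names are illustrative). Whether this explains the misses here is an assumption that would need checking against the real catalogue.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus: only the plural "developers" ever appears.
toy_docs = [
    "Java programming test for developers",
    "Sales aptitude assessment for graduates",
    "Python programming test for developers",
]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

def score_against_first_doc(query):
    """Cosine similarity between the query and the first toy document."""
    query_vec = toy_vectorizer.transform([query])
    return float(cosine_similarity(query_vec, toy_matrix)[0][0])

singular = score_against_first_doc("Java developer")   # "developer" is out of vocabulary
plural = score_against_first_doc("Java developers")    # both tokens match

print(f"singular: {singular:.3f}  plural: {plural:.3f}")
```

In this toy setup the singular query matches on "java" alone, so it scores lower than the plural one. Stemming or lemmatization would be one way to close that gap.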

Let's evaluate properly on the training set...

In [5]:
# Quick evaluation
# Normalize URLs first
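# regex=False makes this a literal substring replacement on every pandas version
df['normalized_url'] = df['url'].str.replace('/solutions/products/', '/products/', regex=False)
train_df['normalized_url'] = train_df['Assessment_url'].str.replace('/solutions/products/', '/products/', regex=False)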

recalls = []
for query, group in train_df.groupby('Query'):
    ground_truth = set(group['normalized_url'])
    
    # Get predictions
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, tfidf_matrix)[0]
    top_10_idx = np.argsort(scores)[-10:][::-1]
    predicted = set(df.iloc[top_10_idx]['normalized_url'])
    
    # Calculate recall
    found = len(ground_truth & predicted)
    recall = found / len(ground_truth) if len(ground_truth) > 0 else 0
    recalls.append(recall)
    
    print(f"Query: {query[:50]}... | Recall: {recall:.2f}")

print(f"\nMean Recall@10: {np.mean(recalls):.3f}")

Query: Based on the JD below recommend me assessment for ... | Recall: 0.20
Query: Content Writer required, expert in English and SEO... | Recall: 0.60
Query: Find me 1 hour long assesment for the below job at... | Recall: 0.00
Query: I am hiring for Java developers who can also colla... | Recall: 0.20
Query: I am looking for a COO for my company in China and... | Recall: 0.17
Query: I want to hire a Senior Data Analyst with 5 years ... | Recall: 0.50
Query: I want to hire new graduates for a sales role in m... | Recall: 0.11
Query: ICICI Bank Assistant Admin, Experience required 0-... | Recall: 0.00
Query: KEY RESPONSIBITILES:

Manage the sound-scape of th... | Recall: 0.20
Query: We're looking for a Marketing Manager who can driv... | Recall: 0.00

Mean Recall@10: 0.198
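
For later experiments it may help to pull the metric out of the loop so every approach is scored the same way. A minimal sketch, assuming the same definition as the cell above (share of ground-truth items found in the top k); `recall_at_k` and the toy data are hypothetical names and values for illustration:

```python
import numpy as np

def recall_at_k(relevant, ranked, k=10):
    """Fraction of relevant items that appear in the first k ranked items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)

# Hypothetical toy data: two queries with hand-written rankings.
toy_truth = {"q1": {"a", "b"}, "q2": {"c"}}
toy_ranked = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}

per_query = {q: recall_at_k(toy_truth[q], toy_ranked[q], k=2) for q in toy_truth}
mean_recall = float(np.mean(list(per_query.values())))

print(per_query, mean_recall)
```

With k=2, "q1" finds one of its two relevant items and "q2" finds none, so the mean is 0.25. In the notebook, `ranked` would be the list of normalized URLs ordered by score.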
