# Experiment 2: Adding Semantic Embeddings


The previous experiment reached about 26% mean Recall@10 with TF-IDF alone. That's better than random, but not good enough.

**Idea:** Maybe semantic embeddings can capture meaning better?
- "collaborate" should match "teamwork"
- "Java" should match "programming"

Let's try Sentence-BERT!

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

df = pd.read_csv('../data/shl_individual_test_solutions.csv')
train_df = pd.read_excel('../data/Gen_AI Dataset (1).xlsx', sheet_name='Train-Set')

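# Normalize URLs: map the '/solutions/products/' path form to '/products/'
# so catalog and training-set URLs can be compared directly
df['normalized_url'] = df['url'].str.replace('/solutions/products/', '/products/', regex=False)
train_df['normalized_url'] = train_df['Assessment_url'].str.replace('/solutions/products/', '/products/', regex=False)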

In [2]:
# Load embedding model
# Using MiniLM - supposed to be good for similarity
print("Loading model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded!")

In [3]:
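# Create richer text for embeddings: "<name>. <description>."
# fillna('') keeps a missing value from being encoded as the literal string "nan"
names = df['name'].fillna('')
descriptions = df['description'].fillna('')
texts = [f"{name}. {desc}." for name, desc in zip(names, descriptions)]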

print(f"Encoding {len(texts)} texts...")
embeddings = model.encode(texts, show_progress_bar=True)
print(f"Embeddings shape: {embeddings.shape}")

## Test semantic similarity

Let's see if "collaborate" matches personality tests...

In [4]:
query = "developer who collaborates"
query_emb = model.encode([query])
semantic_scores = cosine_similarity(query_emb, embeddings)[0]

top_10_idx = np.argsort(semantic_scores)[-10:][::-1]

print(f"Top 10 semantic matches for '{query}':")
for i, idx in enumerate(top_10_idx, 1):
    print(f"{i}. {df.iloc[idx]['name']} (score: {semantic_scores[idx]:.3f})")

**Interesting!** Getting some personality tests now (OPQ, etc.)

But losing some technical tests...
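The ranking in the cell above is cosine similarity followed by an argsort. Here is a minimal NumPy sketch of that step on toy vectors. The helper name `top_k_cosine` and the 3-dimensional "embeddings" are made up for illustration, and the sketch assumes no vector is all zeros.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=10):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    # Normalize to unit length so a dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending: take the last k and reverse for best-first order
    idx = np.argsort(scores)[-k:][::-1]
    return idx, scores[idx]

# Toy 3-dimensional "embeddings" (made up for illustration)
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.2, 0.0])

idx, scores = top_k_cosine(query, docs, k=2)
print(idx, np.round(scores, 3))
```

Document 2 ranks second even though it is not a pure match on either axis. Embeddings reward partial overlap in direction in the same way, which may be why the personality tests surface for a query about collaboration.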

**Idea:** Combine TF-IDF + Semantic?

In [5]:
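# Build TF-IDF too, on the same name + description text
# fillna('') keeps a missing value from being indexed as the token "nan"
documents = [f"{name} {desc}" for name, desc in zip(df['name'].fillna(''), df['description'].fillna(''))]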
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(documents)

In [6]:
# Evaluate with hybrid: 50% TF-IDF + 50% Semantic
recalls = []

for query, group in train_df.groupby('Query'):
    ground_truth = set(group['normalized_url'])
    
    # TF-IDF scores
    query_vec = vectorizer.transform([query])
    tfidf_scores = cosine_similarity(query_vec, tfidf_matrix)[0]
    
    # Semantic scores
    query_emb = model.encode([query])
    semantic_scores = cosine_similarity(query_emb, embeddings)[0]
    
    # Combine (trying 50-50 split)
    combined_scores = 0.5 * tfidf_scores + 0.5 * semantic_scores
    
    top_10_idx = np.argsort(combined_scores)[-10:][::-1]
    predicted = set(df.iloc[top_10_idx]['normalized_url'])
    
    found = len(ground_truth & predicted)
    recall = found / len(ground_truth) if len(ground_truth) > 0 else 0
    recalls.append(recall)
    
    print(f"Recall: {recall:.2f} | Query: {query[:60]}...")

mean_recall = np.mean(recalls)
print(f"\nMean Recall@10: {mean_recall:.3f} ({mean_recall*100:.1f}%)")

## Results: 32.8% Mean Recall@10

**Better!** (+6.6 percentage points over the 26.2% TF-IDF baseline)

But still not great... on average, only about a third of the relevant assessments make the top 10
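For reference, the per-query metric from the evaluation loop can be written as a small standalone function. This is a sketch with a hypothetical helper name (`recall_at_k`) and made-up URLs. It follows the same definition as the loop: relevant items found in the top k, divided by the total number of relevant items.

```python
def recall_at_k(relevant, ranked, k=10):
    """Fraction of relevant items that appear in the first k ranked items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = relevant & set(ranked[:k])
    return len(hits) / len(relevant)

# Hypothetical URLs, for illustration only
relevant = {"/products/a", "/products/b", "/products/c"}
ranked = ["/products/x", "/products/a", "/products/y", "/products/c"]

print(recall_at_k(relevant, ranked, k=10))  # 2 of the 3 relevant items are found
print(recall_at_k(relevant, ranked, k=2))   # only 1 of 3 is in the first two results
```

Mean Recall@10 is the average of this value over all queries, so a query with many relevant assessments and a query with a single one count equally.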

**Observations:**
- Semantic helps with synonyms
- TF-IDF helps with exact matches
- Combination is better than either alone
- But still missing context...
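One thing the 50-50 blend does not control for is scale: TF-IDF cosine scores and embedding cosine scores can have different ranges, so equal weights do not necessarily mean equal influence. Below is a sketch of one possible remedy, min-max normalizing each score vector before blending and then trying a few weights. This is not what the notebook ran; `min_max`, `blend`, `alpha`, and the score values are all assumptions for illustration.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1]; a constant vector maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def blend(tfidf_scores, semantic_scores, alpha=0.5):
    """alpha weights TF-IDF, (1 - alpha) weights the semantic scores."""
    return alpha * min_max(tfidf_scores) + (1 - alpha) * min_max(semantic_scores)

# Made-up scores for four documents
tfidf = np.array([0.00, 0.05, 0.30, 0.10])
semantic = np.array([0.40, 0.55, 0.45, 0.70])

for alpha in (0.0, 0.5, 1.0):
    combined = blend(tfidf, semantic, alpha)
    # Print document indices, best first, for this weighting
    print(alpha, np.argsort(combined)[::-1])
```

In the toy data the ranking changes with `alpha`, so the weight is worth tuning on the training queries instead of fixing it at 0.5. Whether normalization helps on the real data is an open question that would need its own run.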

**Next ideas:**
- Use LLM to understand query better?
- Learn from training data patterns?
- Weight fields differently (name more important than description?)
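A cheap way to try the field-weighting idea on the TF-IDF side is to repeat the name when building each document, so its terms count more in the term frequencies. A rough sketch follows; `weighted_document` and `name_weight` are hypothetical, and the example text is made up.

```python
from collections import Counter

def weighted_document(name, description, name_weight=3):
    """Repeat the name so its terms appear name_weight times in the document text."""
    return " ".join([name] * name_weight + [description])

# Made-up catalog entry, for illustration only
doc = weighted_document("Java 8", "Measures core Java knowledge", name_weight=3)
print(doc)

# Term counts show the effect: name terms are boosted, description terms are not
counts = Counter(doc.lower().split())
print(counts["java"], counts["core"])
```

This only makes sense for count-based features like TF-IDF. For the sentence embeddings, repeating text is not a principled weight; scoring name and description separately and combining the scores would probably be the cleaner experiment.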

Will try LLM next...