# Experiment 3: LLM for Query Understanding


14% is still not good enough. Maybe the problem is that we don't understand the query itself well enough?

Example: "Java developer who collaborates"
- We need: technical tests (Java) + soft-skill tests (collaboration)
- The system doesn't know to look for both

**Hypothesis:** An LLM can extract structured requirements from a natural-language query

Let's try Groq (free, fast!)

In [1]:
from groq import Groq
import json
import os
from dotenv import load_dotenv

load_dotenv()
client = Groq(api_key=os.getenv('GROQ_API_KEY'))

In [2]:
# Test LLM extraction
query = "I need Java developers who can collaborate with business teams"

prompt = f'''Extract requirements from this job query:
"{query}"

Return JSON with:
{{
    "technical_skills": ["skill1", "skill2"],
    "soft_skills": ["skill1", "skill2"],
    "role_type": "developer/analyst/etc",
    "keywords": ["word1", "word2"]
}}'''

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)

result = response.choices[0].message.content
print(result)

Here are the extracted requirements in JSON format:

```json
{
    "technical_skills": ["Java"],
    "soft_skills": ["collaboration"],
    "role_type": "developer",
    "keywords": ["business teams", "collaboration"]
}
```

Note that the query is quite brief, so the extracted requirements are limited. If more information were provided, additional skills and keywords could be extracted.
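
The reply above wraps the JSON in explanatory prose and a code fence, so it cannot be handed to `json.loads` as-is. Below is a minimal sketch of one way to pull the object out, assuming the reply contains a single JSON object. `extract_json_object` is a hypothetical helper written for illustration, not part of the Groq SDK, and the sample reply is abbreviated from the output above.

```python
import json

def extract_json_object(reply: str) -> dict:
    """Return the first JSON object found in an LLM reply (hypothetical helper)."""
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    # raw_decode parses one JSON value starting at `start`
    # and ignores whatever text follows it
    obj, _ = json.JSONDecoder().raw_decode(reply, start)
    return obj

fence = "`" * 3  # build the fence marker so this example stays readable
reply = (
    "Here are the extracted requirements in JSON format:\n\n"
    f"{fence}json\n"
    '{"technical_skills": ["Java"], "soft_skills": ["collaboration"]}\n'
    f"{fence}\n\n"
    "Note that the query is quite brief."
)

parsed = extract_json_object(reply)
print(parsed["technical_skills"])  # ['Java']
```

Scanning for the first `{` avoids depending on whether the model adds a fence or a `json` language tag, at the cost of assuming no stray brace appears in the prose before the object.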


**Wow!** It extracted:
- Technical: Java
- Soft: collaboration
- Role: developer
- Keywords: business teams, collaboration

This could help boost relevant assessments!
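
One way the extracted skills could feed into ranking is a small additive boost for every assessment whose name mentions a skill. The sketch below uses made-up assessment names and base scores. The function name `boost_scores` and the 0.1 boost size are assumptions for illustration only.

```python
import numpy as np

def boost_scores(base_scores, names, skills, boost=0.1):
    """Add `boost` to each item whose name contains a skill (case-insensitive)."""
    scores = np.asarray(base_scores, dtype=float).copy()
    for skill in skills:
        needle = skill.lower()
        # Boolean mask: True where the skill appears as a substring of the name
        hits = np.array([needle in name.lower() for name in names], dtype=bool)
        scores[hits] += boost
    return scores

# Made-up catalogue and base relevance scores, for illustration only
names = ["Java 8 (New)", "Teamwork Styles", "Sales Aptitude"]
base = [0.30, 0.28, 0.35]

boosted = boost_scores(base, names, ["Java", "teamwork"])
ranking = np.argsort(boosted)[::-1]  # best score first
print([names[i] for i in ranking])
# ['Java 8 (New)', 'Teamwork Styles', 'Sales Aptitude']
```

Without the boost, "Sales Aptitude" would rank first here. Substring matching is crude, though: "Java" would also match a name containing "JavaScript".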

In [3]:
# Load existing system
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

df = pd.read_csv('../data/shl_individual_test_solutions.csv')
train_df = pd.read_excel('../data/Gen_AI Dataset (1).xlsx', sheet_name='Train-Set')

df['normalized_url'] = df['url'].str.replace('/solutions/products/', '/products/')
train_df['normalized_url'] = train_df['Assessment_url'].str.replace('/solutions/products/', '/products/')

# Build features
documents = [f"{row['name']} {row['description']}" for _, row in df.iterrows()]
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(documents)

model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [f"{row['name']}. {row['description']}." for _, row in df.iterrows()]
embeddings = model.encode(texts, show_progress_bar=False)

In [4]:
# Evaluate with LLM extraction
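def extract_with_llm(query):
    """Ask the LLM for skills; fall back to empty lists if the call or parsing fails."""
    empty = {"technical_skills": [], "soft_skills": []}
    try:
        prompt = f'Extract skills from: "{query}" Return JSON with technical_skills and soft_skills arrays'
        response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        text = response.choices[0].message.content
        # The reply may wrap the JSON in a ``` fence: keep the fenced part
        # and drop only a leading "json" language tag, not every "json" substring
        if '```' in text:
            text = text.split('```')[1].strip()
            if text.startswith('json'):
                text = text[len('json'):]
        data = json.loads(text)
        return data if isinstance(data, dict) else empty
    except Exception:
        # API error or unparseable reply: continue without LLM boosts
        return empty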

recalls = []
for query, group in train_df.groupby('Query'):
    ground_truth = set(group['normalized_url'])
    
    # Extract with LLM
    llm_data = extract_with_llm(query)
    
    # Base scores
    query_vec = vectorizer.transform([query])
    tfidf_scores = cosine_similarity(query_vec, tfidf_matrix)[0]
    query_emb = model.encode([query])
    semantic_scores = cosine_similarity(query_emb, embeddings)[0]
    
    # Add LLM boosts
    final_scores = 0.4 * tfidf_scores + 0.4 * semantic_scores
    
    # Iterate by position so i always lines up with final_scores and df.iloc
    for i, name in enumerate(df['name']):
        name_lower = name.lower()
        # Boost if LLM-extracted skills match
        for skill in llm_data.get('technical_skills', []):
            if skill.lower() in name_lower:
                final_scores[i] += 0.1
        for skill in llm_data.get('soft_skills', []):
            if skill.lower() in name_lower:
                final_scores[i] += 0.1
    
    top_10_idx = np.argsort(final_scores)[-10:][::-1]
    predicted = set(df.iloc[top_10_idx]['normalized_url'])
    
    recall = len(ground_truth & predicted) / len(ground_truth)
    recalls.append(recall)
    print(f"Recall: {recall:.2f}")

mean_recall = np.mean(recalls)
print(f"\nMean Recall@10 with LLM: {mean_recall:.3f} ({mean_recall*100:.1f}%)")

Recall: 0.00
Recall: 0.60
Recall: 0.22
Recall: 0.60
Recall: 0.50
Recall: 0.60
Recall: 0.11
Recall: 0.00
Recall: 0.20
Recall: 0.00

Mean Recall@10 with LLM: 0.283 (28.3%)
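
For reference, each per-query value printed above is the fraction of that query's ground-truth URLs that appear among the top 10 predictions, and the final figure is the mean over queries. Here is a small worked example with made-up URLs. `recall_at_k` is a hypothetical helper, not a function from the notebook.

```python
def recall_at_k(ground_truth, ranked_predictions, k=10):
    """Fraction of ground-truth items that appear in the top-k predictions."""
    relevant = set(ground_truth)
    if not relevant:
        return 0.0  # nothing to find; avoid dividing by zero
    top_k = set(ranked_predictions[:k])
    return len(relevant & top_k) / len(relevant)

# Made-up URLs, for illustration only
truth = ["/products/a", "/products/b", "/products/c", "/products/d", "/products/e"]
ranked = ["/products/a", "/products/x", "/products/c", "/products/y", "/products/e"]

print(recall_at_k(truth, ranked))  # 3 of 5 relevant items found -> 0.6
```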
