## Potential Talents - Part 4

----

# Job Title Similarity using LLMs-as-Rankers

### Objectives
Retrieve the most similar job titles to a query **using LLMs as deterministic rankers** and compare their rankings against the **existing embeddings + cosine** baseline.

### Constraints
- Local execution on **GTX 1080 Ti**.
- **Deterministic** decoding (no sampling): `temperature=0`, i.e. greedy decoding (`do_sample=False` in 🤗 Transformers).
- Score **only** from the provided list of 105 titles; **no generation** of new titles.
- LLM outputs are **numeric similarity scores (0–100)** in strict JSON, then ranked.
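
The output constraints above can be enforced at parse time. The following is a minimal sketch, assuming the model returns a JSON array of `{"id": ..., "score": ...}` objects with integer ids; `parse_scores` and its drop-invalid-entries policy are hypothetical, not part of the notebook.

```python
import json

def parse_scores(raw_text, allowed_ids):
    """Parse a JSON array of {"id", "score"} objects and keep only valid entries.

    Hypothetical helper: drops ids outside the candidate list and scores that
    are not integers in [0, 100]. Returns {} if the text is not a JSON array.
    """
    try:
        items = json.loads(raw_text)
    except json.JSONDecodeError:
        return {}
    if not isinstance(items, list):
        return {}

    scores = {}
    for item in items:
        if not isinstance(item, dict):
            continue
        cid, score = item.get("id"), item.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(cid, bool) or not isinstance(cid, int):
            continue
        if isinstance(score, bool) or not isinstance(score, int):
            continue
        if cid in allowed_ids and 0 <= score <= 100:
            scores[cid] = score
    return scores
```

Dropping invalid entries rather than clamping them keeps generated or out-of-range items out of the ranking; titles that end up without a score can then be re-queried or reported.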

### Models (initial)
- **FLAN-T5-Large** (🤗, encoder–decoder).
- **Phi-3-mini-4k-instruct** (🤗).  

### Method Overview
- **Baseline (already done in Part 3):** embeddings + cosine similarity → per-query rankings and scores.
- **LLM-as-Ranker:** for each `(query, title)` pair, ask the model for an integer score **0–100**.
  Batch candidates to keep the context small, parse the JSON, and rank by score.
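
The batching and ranking steps above can be sketched as follows. The helper names (`make_batches`, `build_prompt`, `rank_by_score`) and the prompt wording are illustrative assumptions, not the prompt actually used later in this notebook.

```python
def make_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def build_prompt(query, batch):
    """Build one scoring prompt for a batch of (id, title) pairs.

    The wording is illustrative only.
    """
    lines = [f"{cid}: {title}" for cid, title in batch]
    return (
        f"Rate how similar each job title is to the query '{query}' "
        "on an integer scale from 0 to 100.\n"
        'Answer only with JSON: [{"id": <id>, "score": <0-100>}, ...]\n'
        "Titles:\n" + "\n".join(lines)
    )

def rank_by_score(scores):
    """Sort {id: score} by descending score; ties are broken by ascending id."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Breaking ties by id keeps the ranking reproducible when the model assigns the same integer score to several titles, which is likely on a 0–100 scale.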
  
### Display & Evaluation
- **Notebook display:** same format as before  
  Query: <query>
   0.793 <title_raw>
   0.748 <title_raw>

- **Files:** per-model/per-query CSV with `id, score_llm, rank_llm, title_raw, title_clean`.
- **Comparison metric:** **nDCG@k** (reuse function from the embedding notebook, part 3).  
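
For reference, a minimal nDCG@k sketch is shown here. The Part 3 function is not reproduced in this notebook, so this is a standard linear-gain formulation and may differ from it; using the baseline scores as graded relevance is an assumption.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevances (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, relevance_by_id, k=10):
    """nDCG@k of `ranked_ids` against graded relevances.

    Ids missing from `relevance_by_id` count as relevance 0.
    """
    gains = [relevance_by_id.get(cid, 0.0) for cid in ranked_ids]
    ideal = sorted(relevance_by_id.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that matches the relevance order scores 1.0, and any other order scores lower. This sketch assumes non-negative relevances, so negative cosine values would need to be clipped or shifted first.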

---

## Summary / Roadmap

0) Setup & Data  
1) Load **baseline results** (embeddings + cosine) for each query  
2) Load **LLM model** (start with FLAN-T5-Large)  
3) **Prompt & batch scoring** → JSON `{id, score}`  
4) Build **LLM ranking** and **print** in the prior format; save CSV  
5) **Compare** to baseline via **nDCG@k**  
6) Repeat Steps 2–5 for **Phi-3-mini**  
7) Create a **short summary table** (per model × query)
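
Step 4 of the roadmap (build the ranking and save a CSV) could look roughly like the sketch below. It assumes scores are held as `{id: score}` and titles as id-indexed mappings; `ranking_rows` and `write_ranking_csv` are hypothetical helpers, and only the column names come from the file spec above.

```python
import csv
import io

FIELDS = ["id", "score_llm", "rank_llm", "title_raw", "title_clean"]

def ranking_rows(scores, title_raw, title_clean):
    """Turn {id: score} into ranked row dicts (rank 1 = highest score)."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        {"id": cid, "score_llm": score, "rank_llm": rank,
         "title_raw": title_raw[cid], "title_clean": title_clean[cid]}
        for rank, (cid, score) in enumerate(ordered, start=1)
    ]

def write_ranking_csv(handle, rows):
    """Write ranked rows to an open text handle using the column order above."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Example with an in-memory buffer; a real run would open a file under OUT_DIR
# with open(path, "w", newline="") instead.
buf = io.StringIO()
write_ranking_csv(buf, ranking_rows({0: 40, 1: 90},
                                    {0: "HR Mgr", 1: "Data Sci"},
                                    {0: "hr mgr", 1: "data sci"}))
```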



----


### Step 0 - Imports, config, folders

In [None]:
1+1

In [None]:
# core
import os, json, math, re, random, time, sys
import numpy as np
import pandas as pd

# HF
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

In [None]:
# reproducibility
SEED = 23
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

# paths
DATA_DIR = "data"
OUT_DIR  = "outputs"
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(OUT_DIR, exist_ok=True)

QUERIES = ["data scientist", "machine learning engineer", "backend developer", "product manager"]  # same queries from Part 3

### Step 1 - Load titles and make a clean field

In [None]:
df = pd.read_csv(os.path.join(DATA_DIR, "potential_talents.csv"))

In [None]:
titles = df["job_title"].astype(str).tolist()
len(titles), titles[:5]
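
The step title mentions a clean field, which the cells above do not build. One possible sketch follows; the normalisation rules and the `clean_title` name are assumptions, and Part 3 may have cleaned titles differently.

```python
import re

def clean_title(title):
    """Lowercase, replace punctuation runs with a space, collapse whitespace."""
    text = str(title).lower()
    # keep '+' and '#' so terms such as "c++" and "c#" survive
    text = re.sub(r"[^a-z0-9+#]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

It could then be applied with `df["title_clean"] = df["job_title"].astype(str).map(clean_title)`. Note that this pattern also drops non-ASCII letters, which may not be desirable for every dataset.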

### Step 2 - Load SBERT top-10 baseline (as-is, from Part 3)

In [None]:
# Load your SBERT baseline as produced in Part 3 (no changes to schema)
BASELINE_TOP10_CSV = os.path.join(OUT_DIR, "sbert_ranking_output.csv")
base = pd.read_csv(BASELINE_TOP10_CSV)

print(base.head(3))
print("Queries in baseline:", base["query"].unique())

### Step 3 - Pretty printer (same style as Part 3)

In [None]:
def print_ranking(query, rows_df, score_col="score", title_col="job_titles", top_k=10):
    print(f"\nQuery: {query}")
    for _, r in rows_df.head(top_k).iterrows():
        print(f"   {r[score_col]: .3f}  {r[title_col]}")


In [None]:
for query in QUERIES:
    # show only the baseline rows that belong to this query
    print_ranking(query, base[base["query"] == query])

### Step 4 - Load the first LLM (FLAN-T5-Large)

In [None]:
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))

In [None]:
MODEL_ID = "google/flan-t5-large"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
mdl = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

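# pick the device once; fall back to CPU when no GPU is visible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mdl.to(device)
mdl.eval()  # inference only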

pipe = pipeline(
    "text2text-generation",
    model=mdl,
    tokenizer=tok,
    device=0 if torch.cuda.is_available() else -1
)
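
With the pipeline loaded, a deterministic call might look like the sketch below. `generate_deterministic` is a hypothetical wrapper and the `max_new_tokens` default is an arbitrary assumption; `do_sample`, `num_beams` and `max_new_tokens` are standard generation arguments that the pipeline forwards to `generate`.

```python
def generate_deterministic(pipe, prompt, max_new_tokens=128):
    """Run one prompt through a text2text pipeline with greedy decoding."""
    outputs = pipe(
        prompt,
        do_sample=False,   # no sampling: the most likely token is picked each step
        num_beams=1,       # plain greedy search, no beam search
        max_new_tokens=max_new_tokens,
    )
    # text2text-generation pipelines return a list of dicts
    return outputs[0]["generated_text"]
```

Passing `do_sample=False` is the way to get the `temperature=0` behaviour listed in the constraints. Results should then repeat across runs on the same hardware and library versions.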