## Potential Talents - Part 4

----

# Job Title Similarity using LLMs-as-Rankers

### Objective
Given a query, ask a small LLM to score **all 104 job titles at once** (0–100, one score per line, same order), then rank the titles by score and compare the **top-10** with other results (embeddings + cosine similarity, or other LLMs).
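
One simple way to compare two top-10 lists is set overlap. The sketch below assumes each ranking is a plain list of titles ordered best-first; `overlap_at_k` and `jaccard_at_k` are illustrative names, not functions from Part 3, and the titles are made up.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Number of items shared by the first k entries of two ranked lists."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k]))


def jaccard_at_k(ranking_a, ranking_b, k=10):
    """Shared items divided by distinct items across both top-k lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


# Toy rankings (hypothetical titles, best first)
sbert_top = ["Data Analyst", "HR Specialist", "Student"]
llm_top = ["Data Analyst", "Student", "Teacher"]
print(overlap_at_k(sbert_top, llm_top, k=3))  # 2
print(jaccard_at_k(sbert_top, llm_top, k=3))  # 0.5
```

Overlap ignores position inside the top-10, so it complements a rank-aware metric such as nDCG rather than replacing it.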

### Constraints
- Local GPU: **GTX 1080 Ti**.
- **Deterministic** generation: `do_sample=False`, `num_beams=1`.

### Models (initial)
- **1:** `microsoft/phi-3-mini-4k-instruct` (4k context, small & GPU-friendly).
- **2:** `google/gemma-2-2b-it` (8k context, very small).
- (After some tests, we avoid FLAN-T5 here because of its ~512-token input limit.)
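
A rough check that both models fit on the GPU in half precision. The parameter counts below are approximate figures assumed from the public model cards, and the estimate covers weights only; activations and the KV cache need extra room.

```python
def fp16_weight_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


# Assumed, approximate parameter counts (not measured in this notebook)
APPROX_PARAMS = {
    "microsoft/phi-3-mini-4k-instruct": 3.8e9,
    "google/gemma-2-2b-it": 2.6e9,
}
GPU_GIB = 11  # nominal VRAM of a GTX 1080 Ti

for name, n_params in APPROX_PARAMS.items():
    gib = fp16_weight_gib(n_params)
    print(f"{name}: ~{gib:.1f} GiB of weights, below {GPU_GIB} GiB: {gib < GPU_GIB}")
```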

### Method
1) Load SBERT top-10 baseline (from Part 3).  
2) Load a small **causal LM**.  
3) Build a prompt that lists all **104** titles (numbered).  
4) Generate **104 lines of integers**; parse → rank; print top-10; save top-10 CSV (`query,score,job_titles`).  
5) Repeat for the 4 queries; later compute nDCG@10 and compare.
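
For the later nDCG@10 comparison, a minimal sketch is shown here. It assumes graded relevance labels are available for every title (they are not part of this notebook), uses linear gain, and expects the labels of all scored titles in ranked order so that the ideal ordering can include items outside the top-10.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given order divided by the DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels: 3 = exact match ... 0 = unrelated
print(ndcg_at_k([3, 2, 1], k=3))  # 1.0 (already in ideal order)
print(ndcg_at_k([0, 0, 3], k=3))  # 0.5 (best item ranked last)
```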



----


### Step 0 - Imports, config, folders

In [1]:
# core
import os, json, math, re, random, time, sys
import numpy as np
import pandas as pd

# HF
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM, pipeline

In [2]:
# reproducibility
SEED = 23
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

# paths
DATA_DIR = "data"
OUT_DIR  = "outputs"
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(OUT_DIR, exist_ok=True)

QUERIES = ["data scientist", "machine learning engineer", "backend developer", "product manager"]  # same queries from Part 3

### Step 1 - Load titles and make a clean field

In [3]:
df = pd.read_csv(os.path.join(DATA_DIR, "potential_talents.csv"))

In [4]:
titles = df["job_title"].astype(str).tolist()
len(titles), titles[:5]

(104,
 ['2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional',
  'Native English Teacher at EPIK (English Program in Korea)',
  'Aspiring Human Resources Professional',
  'People Development Coordinator at Ryan',
  'Advisory Board Member at Celal Bayar University'])

### Step 2 - Load SBERT top-10 baseline (as-is, from Part 3)

In [5]:
# Load your SBERT baseline as produced in Part 3 (no changes to schema)
BASELINE_TOP10_CSV = os.path.join(OUT_DIR, "sbert_ranking_output.csv")
base = pd.read_csv(BASELINE_TOP10_CSV)

print(base.head(3))
print("Queries in baseline:", base["query"].unique())

            query     score                                         job_titles
0  data scientist  0.595830  Information Systems Specialist and Programmer ...
1  data scientist  0.494619                       Human Resources Professional
2  data scientist  0.456588           Junior MES Engineer| Information Systems
Queries in baseline: ['data scientist' 'machine learning engineer' 'backend developer'
 'product manager']


### Step 3 - Pretty printer (same style as Part 3)

In [6]:
def print_ranking(query, rows_df, score_col="score", title_col="job_titles", top_k=10):
    print(f"\nQuery: {query}")
    for _, r in rows_df.head(top_k).iterrows():
        print(f"   {r[score_col]: .3f}  {r[title_col]}")


In [7]:
for query in QUERIES:
    # print only the rows for this query (the baseline CSV stacks all queries)
    print_ranking(query, base[base["query"] == query])


Query: data scientist
    0.596  Information Systems Specialist and Programmer with a love for data and organization.
    0.495  Human Resources Professional
    0.457  Junior MES Engineer| Information Systems
    0.450  Aspiring Human Resources Specialist
    0.449  Human Resources professional for the world leader in GIS software
    0.441  HR Senior Specialist
    0.433  Human Resources Generalist at ScottMadden, Inc.
    0.416  Liberal Arts Major. Aspiring Human Resources Analyst.
    0.410  Student
    0.403  Human Resources Specialist at Luxottica

### Step 4 - Load a small LLM (Phi-3-mini from Microsoft)

In [8]:
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))

torch: 2.6.0+cu124
built with CUDA: 12.4
cuda available: True
gpu: NVIDIA GeForce GTX 1080 Ti


In [9]:
MODEL_ID = "microsoft/phi-3-mini-4k-instruct"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.float16 if torch.cuda.is_available() else None,
).to("cuda" if torch.cuda.is_available() else "cpu").eval()


In [10]:
# Build a proper chat prompt
msgs = [
    {"role": "system", "content": "You are a calculator. Reply with digits only."},
    {"role": "user",   "content": "Return the number 7."}
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)


In [11]:
# Encode & generate (greedy)
inputs = tok(prompt, return_tensors="pt").to(mdl.device)
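eos = [tok.eos_token_id]
# convert_tokens_to_ids does not raise for unknown tokens; it returns the unk id (or None),
# so check the result instead of wrapping the call in try/except
end_id = tok.convert_tokens_to_ids("<|end|>")
if end_id is not None and end_id != tok.unk_token_id:
    eos.append(end_id)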

In [12]:
gen = mdl.generate(
    **inputs,
    do_sample=False,
    num_beams=1,
    max_new_tokens=3,
    eos_token_id=eos,
)


In [13]:
out = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
print(out)  # -> 7

7


### Step 5 - Turn the LLM into a job title ranker

We will turn the LLM into a ranker by asking it to assign an integer score (0–100) to each raw job title for a given query.


In [14]:
def build_prompt_all_chat(query: str, titles: list[str]) -> str:
    lines = "\n".join(f"{i+1}) {t}" for i, t in enumerate(titles))
    rubric = (
        "You are a recruiter scoring job-title similarity to the query.\n"
        "Rate each candidate with an integer 0–100 using the FULL scale:\n"
        " • 90–100 = exact/near-exact role match\n"
        " • 70–89  = same discipline or very similar role\n"
        " • 40–69  = related/adjacent\n"
        " • 10–39  = mostly unrelated\n"
        " • 0–9    = completely unrelated\n"
        "Use diverse scores; do NOT give 0 or 100 to many candidates.\n"
        "Ignore employer names, locations, programs.\n"
        "Output EXACTLY one integer per line, in the SAME ORDER as the candidates. No words, no punctuation."
    )
    # Non-extreme example
    example = "Example for 3 candidates:\n82\n41\n7"
    user = f'Query: "{query}"\n\nCandidates:\n{lines}\n\n{example}'
    msgs = [{"role": "system", "content": rubric},
            {"role": "user",   "content": user}]
    return tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)


In [15]:
def parse_scores_n(out: str, n: int) -> list[int]:
    # prefer last int per non-empty line; fallback to last N ints in whole text
    lines = [l.strip() for l in out.splitlines() if l.strip()]
    scores = []
    
    for line in lines:
        ints = re.findall(r"-?\d+", line)
        if ints:
            scores.append(int(ints[-1]))
        if len(scores) >= n:
            break
        
    if len(scores) < n:
        all_ints = [int(x) for x in re.findall(r"-?\d+", out)]
        scores = all_ints[-n:]
        
    scores = [max(0, min(100, int(s))) for s in scores]
    
    # if still short, add a padding
    if len(scores) < n:  
        scores += [0] * (n - len(scores))
    return scores[:n]
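
Because `parse_scores_n` falls back to any integers in the text and pads with zeros, a malformed generation can still yield `n` plausible-looking scores. A stricter sketch that also reports how many clean lines were found can help spot such cases; `parse_scores_strict` is a hypothetical helper, not part of the pipeline above.

```python
import re

INT_LINE = re.compile(r"\d{1,3}")


def parse_scores_strict(out: str, n: int):
    """Return (scores, n_valid) from the leading run of integer-only lines."""
    scores = []
    for line in out.splitlines():
        line = line.strip()
        if not line:
            continue
        if not INT_LINE.fullmatch(line) or int(line) > 100:
            break  # stop at the first line that is not a bare 0-100 integer
        scores.append(int(line))
        if len(scores) == n:
            break
    return scores, len(scores)


# A generation whose fifth line carries trailing text after the score
raw = '0\n0\n7\n41\n0 Query: "software engineer"\n\nCandidates:\n1) 2019 C.T.'
print(parse_scores_strict(raw, 5))  # ([0, 0, 7, 41], 4)
```

When `n_valid < n`, the run can be flagged or retried instead of silently ranking padded scores.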

In [16]:
def score_all_titles_once(query: str, titles: list[str], max_new_tokens: int = 300, build_fn=build_prompt_all_chat):
    prompt = build_fn(query, titles)
    print("Prompt tokens:", len(tok(prompt)["input_ids"]))
    inputs = tok(prompt, return_tensors="pt").to(mdl.device)
    gen = mdl.generate(
        **inputs,
        do_sample=False,
        num_beams=1,
        max_new_tokens=max_new_tokens,
        eos_token_id=[tok.eos_token_id],
        # optional: ensure we don’t stop too early
        min_new_tokens=min(104, max_new_tokens-1),
    )
    out_text = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    scores = parse_scores_n(out_text, len(titles))
    df = pd.DataFrame({"idx": range(len(titles)), "score": scores})
    df["job_titles"] = df["idx"].map(lambda i: titles[i])
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    return df, out_text
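
With integer scores, ties are common, and the order of tied titles decides who lands in the top-10. pandas documents the default `sort_values` algorithm (`kind="quicksort"`) as not stable, so a deterministic tie-break is worth making explicit. A pure-Python sketch of the idea, with `rank_with_stable_ties` as an illustrative name:

```python
def rank_with_stable_ties(titles, scores):
    """Sort by score descending; equal scores keep their original list order."""
    order = sorted(range(len(titles)), key=lambda i: (-scores[i], i))
    return [(scores[i], titles[i]) for i in order]


# Toy data (hypothetical titles)
titles_demo = ["A", "B", "C", "D", "E"]
scores_demo = [0, 0, 7, 41, 0]
print(rank_with_stable_ties(titles_demo, scores_demo))
# [(41, 'D'), (7, 'C'), (0, 'A'), (0, 'B'), (0, 'E')]
```

In pandas, the same effect can be obtained by passing `kind="stable"` to `sort_values`.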


Test **build_prompt_all_chat**, **parse_scores_n** and **score_all_titles_once**

In [17]:
test_query = "data scientist"

# A) Tiny subset to inspect everything
subset = titles[:5]

demo_prompt = build_prompt_all_chat(test_query, subset)
print("=== DEMO PROMPT (first 30 lines) ===")
print("\n".join(demo_prompt.splitlines()[:30]))
print("Token count (subset):", len(tok(demo_prompt)["input_ids"]))


=== DEMO PROMPT (first 30 lines) ===
<|system|>
You are a recruiter scoring job-title similarity to the query.
Rate each candidate with an integer 0–100 using the FULL scale:
 • 90–100 = exact/near-exact role match
 • 70–89  = same discipline or very similar role
 • 40–69  = related/adjacent
 • 10–39  = mostly unrelated
 • 0–9    = completely unrelated
Use diverse scores; do NOT give 0 or 100 to many candidates.
Ignore employer names, locations, programs.
Output EXACTLY one integer per line, in the SAME ORDER as the candidates. No words, no punctuation.<|end|>
<|user|>
Query: "data scientist"

Candidates:
1) 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
2) Native English Teacher at EPIK (English Program in Korea)
3) Aspiring Human Resources Professional
4) People Development Coordinator at Ryan
5) Advisory Board Member at Celal Bayar University

Example for 3 candidates:
82
41
7<|end|>
<|assistant|>
Token count (subset): 279


In [18]:
df_sub, raw_sub = score_all_titles_once(test_query, subset, max_new_tokens=60)
print("\n=== RAW MODEL OUTPUT (subset) ===")
print(raw_sub)

Prompt tokens: 279

=== RAW MODEL OUTPUT (subset) ===
0
0
7
41
0 Query: "software engineer specializing in machine learning"

Candidates:
1) 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and asp
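
The text after the fifth score is consistent with how `score_all_titles_once` sets `min_new_tokens=min(104, max_new_tokens-1)`: with `max_new_tokens=60` this evaluates to 59, so generation cannot end before 59 new tokens even though five short lines would do. A sketch that derives both limits from the number of titles instead; `tokens_per_line` and `slack` are assumed values, since the real token count per score line depends on the tokenizer.

```python
def generation_budget(n_titles: int, tokens_per_line: int = 3, slack: int = 8):
    """Return (min_new_tokens, max_new_tokens) scaled to the number of titles."""
    # Each score needs at least one token, so a complete answer is never shorter.
    min_new = n_titles
    # Assumed upper bound: tokens_per_line tokens per score line, plus some slack.
    max_new = n_titles * tokens_per_line + slack
    return min_new, max_new


print(min(104, 60 - 1))        # 59, the minimum forced on the 5-title subset
print(generation_budget(5))    # (5, 23)
print(generation_budget(104))  # (104, 320)
```

Under the same assumption of three tokens per line, 104 titles would need more than the default `max_new_tokens=300`, which is worth checking against the actual tokenizer.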


In [19]:
scores_sub = parse_scores_n(raw_sub, len(subset))
print("\nParsed scores (subset):", scores_sub)
print("\nPaired (score, title) in ranked order:")
# the DataFrame returned by `score_all_titles_once`, df_sub, is sorted in descending score order
for _, r in df_sub.iterrows():
    print(f"{r['score']:>3}  {r['job_titles']}")


Parsed scores (subset): [0, 0, 7, 41, 0]

Paired (score, title) in ranked order:
 41  People Development Coordinator at Ryan
  7  Aspiring Human Resources Professional
  0  2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
  0  Native English Teacher at EPIK (English Program in Korea)
  0  Advisory Board Member at Celal Bayar University


In [20]:
# B) One full run (preview only; avoids flooding output)
full_prompt = build_prompt_all_chat(test_query, titles)
print("\nToken count (full):", len(tok(full_prompt)["input_ids"]))


Token count (full): 1928
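
The prompt and the generated tokens share the same context window, so the budget can be checked before running. A minimal sketch using the token count printed above, the 4k window of Phi-3-mini-4k (4096 tokens) and the ~512-token limit mentioned for FLAN-T5; `fits_context` is an illustrative helper.

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the requested new tokens fit in the window."""
    return prompt_tokens + max_new_tokens <= context_window


print(fits_context(1928, 300, 4096))  # True
print(fits_context(1928, 300, 512))   # False
```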


In [21]:
# show full job titles: disable column-width truncation when printing DataFrames
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", 200)


In [None]:
df_full, raw_full = score_all_titles_once(test_query, titles, max_new_tokens=300)
print("\nFull run: got", len(df_full), "scores.")
print("Top-10 preview:")
print(df_full.head(10)[["score", "job_titles"]])

Prompt tokens: 1928


In [None]:
def print_ranking(query, rows_df, top_k=10):
    print(f"\nQuery: {query}")
    for _, r in rows_df.head(top_k).iterrows():
        print(f"   {r['score']/100: .3f}  {r['job_titles']}")

def run_query_full(queries: list[str], model_tag: str = "phi3_mini_4k"):
    
    out_dir = os.path.join(OUT_DIR, "llm")
    os.makedirs(out_dir, exist_ok=True)
    
    top10_blocks = []
    
    for query in queries:
        df_rank, raw = score_all_titles_once(query, titles)
        print_ranking(query, df_rank, top_k=10)

        df_q = df_rank.head(10)[["score", "job_titles"]].copy()
        df_q.insert(0, "query", query)
        top10_blocks.append(df_q)
        
    # one combined CSV for all queries        
    top10 = pd.concat(top10_blocks, ignore_index=True)
    path = os.path.join(out_dir, f"llm_top10__{model_tag}__all_queries.csv")
    top10.to_csv(path, index=False)  # keep the schema as query,score,job_titles
    print("Saved:", path)
    
    return top10, path

In [None]:
top10_all, out_path = run_query_full(QUERIES)

Prompt tokens: 1928

Query: data scientist
    1.000  Student at Humber College and Aspiring Human Resources Generalist
    1.000  Advisory Board Member at Celal Bayar University
    1.000  Aspiring Human Resources Professional
    0.900  Aspiring Human Resources Specialist
    0.890  Student at Humber College and Aspiring Human Resources Generalist
    0.740  Aspiring Human Resources Professional
    0.720  Native English Teacher at EPIK (English Program in Korea)
    0.700  2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
    0.700  HR Senior Specialist
    0.690  Student at Chapman University
Prompt tokens: 1928

Query: machine learning engineer
    1.000  Student at Humber College and Aspiring Human Resources Generalist
    1.000  Advisory Board Member at Celal Bayar University
    1.000  Aspiring Human Resources Professional
    0.900  Aspiring Human Resources Specialist
    0.890  Student at Humber College and Aspiring Human

----

### Step 6 - Experiment with another small LLM: **Gemma-2-2b-it from Google**

Free the GPU memory held by the previous model:

In [None]:
import gc

# 1) Move model to CPU first (helps release GPU contexts cleanly)
try:
    mdl.to("cpu")
except Exception:
    pass

# 2) Delete big objects
for obj in ["pipe", "mdl", "tok"]:
    if obj in globals():
        del globals()[obj]

# 3) Garbage collect + empty CUDA cache
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()

# 4) Sanity check
if torch.cuda.is_available():
    print("VRAM allocated (MB):", torch.cuda.memory_allocated() / (1024**2))
    print("VRAM reserved  (MB):", torch.cuda.memory_reserved() / (1024**2))


VRAM allocated (MB): 8.12646484375
VRAM reserved  (MB): 24.0


In [28]:
MODEL_ID = "google/gemma-2-2b-it"
HF_TOKEN = os.getenv("llm_gemma")

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.float16 if device.type=="cuda" else None,
    token=HF_TOKEN
).to(device).eval()

if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token

pipe = pipeline("text-generation", model=mdl, tokenizer=tok, device=0 if device.type=="cuda" else -1)

Device set to use cuda:0


Define a new **prompt builder** with a one-shot example and more specific rules for matching job titles. It is also meant to fall back to a user-only message when the tokenizer's chat template does not support the `system` role used with Phi-3:

In [34]:
def build_prompt_all_chat(query: str, titles: list[str]) -> str:
    # Numbered candidates
    lines = "\n".join(f"{i+1}) {t}" for i, t in enumerate(titles))

    # Rubric + strict output format
    rubric = (
        "You are a recruiter scoring job-title similarity to the query.\n"
        "Rate each candidate with an integer 0–100 using the FULL scale:\n"
        "  90–100 = exact/near-exact role match\n"
        "  70–89  = same discipline or very similar role\n"
        "  40–69  = related/adjacent\n"
        "  20–39  = mostly unrelated\n"
        "  0–19   = completely unrelated\n"
        "Rules:\n"
        " - Prefer same functional domain as the query.\n"
        " - If the query is technical (data/ML/backend), HR/People titles are mostly unrelated.\n"
        " - Titles with 'Student' or 'Aspiring' get lower scores unless they explicitly match the role.\n"
        "Ignore employer names, locations, programs.\n"
        "Return EXACTLY one integer per line, in the SAME ORDER as the candidates.\n"
        "No numbering, no words, no punctuation."
    )
    example = "Example for 3 candidates:\n82\n41\n7"
    user_text = f'Query: "{query}"\n\nCandidates:\n{lines}\n\n{example}'

    # Use system role only if the tokenizer’s chat template supports it
    tmpl = (getattr(tok, "chat_template", "") or "")
    supports_system = "system" in tmpl

    if supports_system:
        msgs = [
            {"role": "system", "content": rubric},
            {"role": "user",   "content": user_text},
        ]
    else:
        # Gemma path: fold rubric into the user turn
        msgs = [{"role": "user", "content": rubric + "\n\n" + user_text}]

    return tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)


In [35]:
# Tiny subset check (no changes elsewhere)
_subset = titles[:5]
_demo = build_prompt_all_chat("data scientist", _subset)
print("Prompt tokens:", len(tok(_demo)["input_ids"]))
df_sub, raw_sub = score_all_titles_once("data scientist", _subset, max_new_tokens=60)
print(df_sub[["score","job_titles"]])


TemplateError: System role not supported
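
Independently of the template check, note that `score_all_titles_once` was defined with `build_fn=build_prompt_all_chat` as a default argument. Python evaluates defaults once, when the function is defined, so redefining `build_prompt_all_chat` later does not change what `score_all_titles_once` calls unless `build_fn` is passed explicitly. A toy illustration with made-up function names:

```python
def builder(query):
    return f"v1:{query}"


def score(query, build_fn=builder):  # the default is captured here, at definition time
    return build_fn(query)


def builder(query):  # rebinding the name later does not touch the captured default
    return f"v2:{query}"


print(score("q"))                    # v1:q
print(score("q", build_fn=builder))  # v2:q
```

In this notebook that would mean either re-running the cell that defines `score_all_titles_once` after redefining the builder, or passing `build_fn=build_prompt_all_chat` at each call.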

In [31]:
top10_all, out_path = run_query_full(QUERIES, "gemma-2-2b-it")

TemplateError: System role not supported
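
The `TemplateError` is raised by the chat template itself when it receives a `system` message. A substring test such as `"system" in tmpl` can return `True` for such a template, because the template text has to mention the role in order to reject it. A sketch that tries the system role and falls back on failure; `apply_chat_with_fallback` is a hypothetical helper and `FakeGemmaTokenizer` is a stand-in invented so the example runs without downloading a model.

```python
def apply_chat_with_fallback(tok, rubric: str, user_text: str) -> str:
    """Try a system+user conversation; on failure, fold the rubric into the user turn."""
    with_system = [
        {"role": "system", "content": rubric},
        {"role": "user", "content": user_text},
    ]
    try:
        return tok.apply_chat_template(with_system, tokenize=False, add_generation_prompt=True)
    except Exception:  # e.g. a template error from templates that reject the system role
        user_only = [{"role": "user", "content": rubric + "\n\n" + user_text}]
        return tok.apply_chat_template(user_only, tokenize=False, add_generation_prompt=True)


class FakeGemmaTokenizer:
    """Hypothetical stand-in that rejects system messages, mimicking the error above."""

    def apply_chat_template(self, msgs, tokenize=False, add_generation_prompt=True):
        if msgs[0]["role"] == "system":
            raise ValueError("System role not supported")
        body = "".join(f"<{m['role']}>{m['content']}</{m['role']}>" for m in msgs)
        return body + ("<assistant>" if add_generation_prompt else "")


print(repr(apply_chat_with_fallback(FakeGemmaTokenizer(), "RUBRIC", "Query")))
# '<user>RUBRIC\n\nQuery</user><assistant>'
```

Catching the failure avoids depending on the wording of any particular template, at the cost of rendering the template twice for models without system-role support.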