## Potential Talents - Part 4

----

# Job Title Similarity using LLMs-as-Rankers

### Objective
Given a query, ask a small LLM to score **all 104 job titles at once** (0–100, one score per line, in the same order as the input), then rank the titles by score and compare the **top-10** with other results (embeddings + cosine similarity, or other LLMs).

### Constraints
- Local GPU: **GTX 1080 Ti**.
- **Deterministic** generation: `do_sample=False`, `num_beams=1`.

### Models (initial)
- **1:** `microsoft/phi-3-mini-4k-instruct` (4k context, small & GPU-friendly).
- **2:** `google/gemma-2-2b-it` (8k context, very small).
- **3:** `Qwen/Qwen2.5-3B-Instruct` (32k context, ~3B params, list-style outputs).
- (After some tests, FLAN-T5 is left out here because of its ~512-token input limit.)

### Method
1) Load SBERT top-10 baseline (from Part 3).  
2) Load a small **causal LM**.  
3) Build a prompt that lists all **104** titles (numbered).  
4) Generate **104 lines of integers**; parse → rank; print top-10; save top-10 CSV (`query,score,job_titles`).  
5) Repeat for the 4 queries; later compute nDCG@10 and compare.
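
The nDCG@10 comparison mentioned in step 5 can be sketched as follows. This is a minimal illustration, not the evaluation used later in the project: the helper names `dcg_at_k` and `ndcg_at_k` and the graded labels are hypothetical, and it assumes a per-query dictionary mapping each title to a relevance grade.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_titles, relevance_by_title, k=10):
    """nDCG@k for one query; titles missing from the label map count as 0."""
    gains = [relevance_by_title.get(t, 0) for t in ranked_titles]
    ideal = sorted(relevance_by_title.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded labels (0 = unrelated, 3 = exact match), for illustration only.
labels = {"Senior Data Scientist": 3, "Data Analyst": 2, "HR Coordinator": 0}
perfect = ["Senior Data Scientist", "Data Analyst", "HR Coordinator"]
worst = list(reversed(perfect))

print("perfect order:", ndcg_at_k(perfect, labels))
print("reversed order:", round(ndcg_at_k(worst, labels), 3))
```

A ranking that matches the ideal order scores 1.0; any ranking that pushes relevant titles down scores lower, which makes the metric usable for comparing SBERT and the LLM rankers on the same labels.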



----


### Step 0 - Imports, config, folders

In [1]:
# core
import os, json, math, re, random, time, sys
import numpy as np
import pandas as pd

# HF
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM, pipeline

In [2]:
# reproducibility
SEED = 23
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

# paths
DATA_DIR = "data"
OUT_DIR  = "outputs"
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(OUT_DIR, exist_ok=True)

QUERIES = ["data scientist", "machine learning engineer", "backend developer", "product manager"]  # same queries from Part 3

### Step 1 - Load titles and make a clean field

In [3]:
df = pd.read_csv(os.path.join(DATA_DIR, "potential_talents.csv"))

In [4]:
titles = df["job_title"].astype(str).tolist()
len(titles), titles[:5]

(104,
 ['2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional',
  'Native English Teacher at EPIK (English Program in Korea)',
  'Aspiring Human Resources Professional',
  'People Development Coordinator at Ryan',
  'Advisory Board Member at Celal Bayar University'])

### Step 2 - Load SBERT top-10 baseline (as-is, from the previous project part 3)

In [5]:
# Load your SBERT baseline as produced in Part 3 (no changes to schema)
BASELINE_TOP10_CSV = os.path.join(OUT_DIR, "sbert_ranking_output.csv")
base = pd.read_csv(BASELINE_TOP10_CSV)

print(base.head(3))
print("Queries in baseline:", base["query"].unique())

            query     score                                         job_titles
0  data scientist  0.595830  Information Systems Specialist and Programmer ...
1  data scientist  0.494619                       Human Resources Professional
2  data scientist  0.456588           Junior MES Engineer| Information Systems
Queries in baseline: ['data scientist' 'machine learning engineer' 'backend developer'
 'product manager']


### Step 3 - Pretty printer (same style as Part 3)

In [6]:
def print_ranking(query, rows_df, score_col="score", title_col="job_titles", top_k=10):
    print(f"\nQuery: {query}")
    for _, r in rows_df.head(top_k).iterrows():
        print(f"   {r[score_col]: .3f}  {r[title_col]}")


In [7]:
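for query in QUERIES:
    # Filter to the current query's rows. Without the filter, every query prints the
    # same first 10 rows of `base` (which is what the recorded output below shows).
    print_ranking(query, base[base["query"] == query])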


Query: data scientist
    0.596  Information Systems Specialist and Programmer with a love for data and organization.
    0.495  Human Resources Professional
    0.457  Junior MES Engineer| Information Systems
    0.450  Aspiring Human Resources Specialist
    0.449  Human Resources professional for the world leader in GIS software
    0.441  HR Senior Specialist
    0.433  Human Resources Generalist at ScottMadden, Inc.
    0.416  Liberal Arts Major. Aspiring Human Resources Analyst.
    0.410  Student
    0.403  Human Resources Specialist at Luxottica

Query: machine learning engineer
    0.596  Information Systems Specialist and Programmer with a love for data and organization.
    0.495  Human Resources Professional
    0.457  Junior MES Engineer| Information Systems
    0.450  Aspiring Human Resources Specialist
    0.449  Human Resources professional for the world leader in GIS software
    0.441  HR Senior Specialist
    0.433  Human Resources Generalist at ScottMadden, Inc.
  

### Step 4 - Load a small LLM (Phi-3-mini from Microsoft)

**Phi-3 Mini (Microsoft)**
- **Release Date**: April 23, 2024.
- **Architecture**: Decoder-only (autoregressive).
- **Parameters**: ~3.8B.
- **Layers**: 32 transformer blocks, 32 attention heads.
- **Context Window**: 4k tokens.
- **Tokenizer**: SentencePiece-like (subword BPE).
- **Objective**: Next-token prediction, trained as a general causal LM.
- **Training**: Mixture of web, code, math, scientific texts; instruction-tuned for dialogue.
- **Efficiency**: Optimized for small GPUs (runs on 8–12GB VRAM), strong FP16/INT8 support.
- **License**: MIT.
- **Notes**: Very lightweight; with greedy decoding (`do_sample=False`) the output is repeatable, which suits structured tasks on consumer GPUs.

In [8]:
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))

torch: 2.6.0+cu124
built with CUDA: 12.4
cuda available: True
gpu: NVIDIA GeForce GTX 1080 Ti


In [9]:
MODEL_ID = "microsoft/phi-3-mini-4k-instruct"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.float16 if torch.cuda.is_available() else None,
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [10]:
# Build a proper chat prompt
msgs = [
    {"role": "system", "content": "You are a calculator. Reply with digits only."},
    {"role": "user",   "content": "Return the number 7."}
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)


In [11]:
# Encode & generate (greedy)
inputs = tok(prompt, return_tensors="pt").to(mdl.device)
eos = [tok.eos_token_id]
try:
    eos.append(tok.convert_tokens_to_ids("<|end|>"))
except Exception:
    pass

In [12]:
gen = mdl.generate(
    **inputs,
    do_sample=False,
    num_beams=1,
    max_new_tokens=3,
    eos_token_id=eos,
)


In [13]:
out = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
print(out)  # -> 7

7


### Step 5 - Turn the LLM into a job title ranker

We will turn the LLM into a ranker by asking it to assign an integer score (0–100) to each raw job title for a given query.


In [14]:
def build_prompt_all_chat_phi(query: str, titles: list[str]) -> str:
    lines = "\n".join(f"{i+1}) {t}" for i, t in enumerate(titles))
    rubric = (
        "You are a recruiter scoring job-title similarity to the query.\n"
        "Rate each candidate with an integer 0–100 using the FULL scale:\n"
        " • 90–100 = exact/near-exact role match\n"
        " • 70–89  = same discipline or very similar role\n"
        " • 40–69  = related/adjacent\n"
        " • 10–39  = mostly unrelated\n"
        " • 0–9    = completely unrelated\n"
        "Use diverse scores; do NOT give 0 or 100 to many candidates.\n"
        "Ignore employer names, locations, programs.\n"
        "Output EXACTLY one integer per line, in the SAME ORDER as the candidates. No words, no punctuation."
    )
    # Non-extreme example
    example = "Example for 3 candidates:\n82\n41\n7"
    user = f'Query: "{query}"\n\nCandidates:\n{lines}\n\n{example}'
    msgs = [{"role": "system", "content": rubric},
            {"role": "user",   "content": user}]
    return tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    


In [15]:
def parse_scores_n(out: str, n: int) -> list[int]:
    # prefer last int per non-empty line; fallback to last N ints in whole text
    lines = [l.strip() for l in out.splitlines() if l.strip()]
    scores = []
    
    for line in lines:
        ints = re.findall(r"-?\d+", line)
        if ints:
            scores.append(int(ints[-1]))
        if len(scores) >= n:
            break
        
    if len(scores) < n:
        all_ints = [int(x) for x in re.findall(r"-?\d+", out)]
        scores = all_ints[-n:]
        
    scores = [max(0, min(100, int(s))) for s in scores]
    
    # if still short, add a padding
    if len(scores) < n:  
        scores += [0] * (n - len(scores))
    return scores[:n]
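
Because `parse_scores_n` pads and falls back silently, a short or chatty model output still yields `n` scores, possibly attached to the wrong titles. A small check of the raw output before parsing can flag that case. The sketch below is an assumption-laden illustration: `count_integer_lines` and `output_is_aligned` are hypothetical helpers, not part of the notebook.

```python
import re

def count_integer_lines(out_text):
    """Return (clean, total): lines that are a bare 1-3 digit integer vs. all non-empty lines."""
    lines = [l.strip() for l in out_text.splitlines() if l.strip()]
    clean = sum(1 for l in lines if re.fullmatch(r"\d{1,3}", l))
    return clean, len(lines)

def output_is_aligned(out_text, n):
    """True only if the model returned exactly n lines and each one is a bare integer."""
    clean, total = count_integer_lines(out_text)
    return clean == n and total == n

print(output_is_aligned("0\n0\n7\n41\n0", 5))   # five bare integers for five titles
print(output_is_aligned("82\n41", 5))           # too few lines
print(output_is_aligned("1) 82\n41\n7", 3))     # numbering leaked into the output
```

When the check fails, the ranking for that query is better treated as unreliable than trusted after padding.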

In [16]:
def score_all_titles_once(query: str,
                          titles: list[str],
                          max_new_tokens: int = 300,
                          build_fn=None):
    build_fn = build_fn or build_prompt_all_chat_phi
    prompt = build_fn(query, titles)
    print("Prompt tokens:", len(tok(prompt)["input_ids"]))
    inputs = tok(prompt, return_tensors="pt").to(mdl.device)
    gen = mdl.generate(
        **inputs,
        do_sample=False,
        num_beams=1,
        max_new_tokens=max_new_tokens,
        eos_token_id=[tok.eos_token_id],
        min_new_tokens=min(len(titles), max_new_tokens-1),
    )
    out_text = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    scores = parse_scores_n(out_text, len(titles))
    df = pd.DataFrame({"idx": range(len(titles)), "score": scores})
    df["job_titles"] = [titles[i] for i in df["idx"]]
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    return df, out_text
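
The `max_new_tokens=300` default is worth checking against the number of lines requested. A back-of-the-envelope sketch, assuming the tokenizer emits one token per digit plus one per newline (an assumption; measuring with `tok` on a sample output would confirm it for a given model). `worst_case_new_tokens` is a hypothetical helper.

```python
def worst_case_new_tokens(n_lines, digits_per_score=3, tokens_per_newline=1):
    """Upper bound on generated tokens if every score uses `digits_per_score` digit tokens."""
    return n_lines * (digits_per_score + tokens_per_newline)

# Budget for 104 titles under 1-, 2- and 3-digit scores.
for digits in (1, 2, 3):
    print(digits, "digit(s):", worst_case_new_tokens(104, digits))
```

Under that assumption, 104 two-digit scores need about 312 new tokens, so a 300-token cap can cut the list short and leave `parse_scores_n` to pad or fall back.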



Test **build_prompt_all_chat_phi**, **parse_scores_n** and **score_all_titles_once**:

In [17]:
test_query = "data scientist"

# A Tiny subset to inspect everything
subset = titles[:5]

demo_prompt = build_prompt_all_chat_phi(test_query, subset)
print("=== DEMO PROMPT (first 30 lines) ===")
print("\n".join(demo_prompt.splitlines()[:30]))
print("Token count (subset):", len(tok(demo_prompt)["input_ids"]))


=== DEMO PROMPT (first 30 lines) ===
<|system|>
You are a recruiter scoring job-title similarity to the query.
Rate each candidate with an integer 0–100 using the FULL scale:
 • 90–100 = exact/near-exact role match
 • 70–89  = same discipline or very similar role
 • 40–69  = related/adjacent
 • 10–39  = mostly unrelated
 • 0–9    = completely unrelated
Use diverse scores; do NOT give 0 or 100 to many candidates.
Ignore employer names, locations, programs.
Output EXACTLY one integer per line, in the SAME ORDER as the candidates. No words, no punctuation.<|end|>
<|user|>
Query: "data scientist"

Candidates:
1) 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
2) Native English Teacher at EPIK (English Program in Korea)
3) Aspiring Human Resources Professional
4) People Development Coordinator at Ryan
5) Advisory Board Member at Celal Bayar University

Example for 3 candidates:
82
41
7<|end|>
<|assistant|>
Token count (subset): 279


In [18]:
df_sub, raw_sub = score_all_titles_once(test_query, subset, max_new_tokens=60)
print("\n=== RAW MODEL OUTPUT (subset) ===")
print(raw_sub)

Prompt tokens: 279

=== RAW MODEL OUTPUT (subset) ===
0
0
7
41
0


In [19]:
scores_sub = parse_scores_n(raw_sub, len(subset))
print("\nParsed scores (subset):", scores_sub)
print("\nPaired (score, title) in ranked order:")
# the DataFrame returned by `score_all_titles_once` (df_sub) is already sorted in descending score order
for _, r in df_sub.iterrows():
    print(f"{r['score']:>3}  {r['job_titles']}")


Parsed scores (subset): [0, 0, 7, 41, 0]

Paired (score, title) in ranked order:
 41  People Development Coordinator at Ryan
  7  Aspiring Human Resources Professional
  0  2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
  0  Native English Teacher at EPIK (English Program in Korea)
  0  Advisory Board Member at Celal Bayar University


In [20]:
# B) One full run (preview only; avoids flooding output)
full_prompt = build_prompt_all_chat_phi(test_query, titles)
print("\nToken count (full):", len(tok(full_prompt)["input_ids"]))


Token count (full): 1928


In [21]:
# do not truncate long strings when printing DataFrames
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", 200)


In [22]:
df_full, raw_full = score_all_titles_once(test_query, titles, max_new_tokens=300)
print("\nFull run: got", len(df_full), "scores.")
print("Top-10 preview:")
print(df_full.head(10)[["score", "job_titles"]])

Prompt tokens: 1928

Full run: got 104 scores.
Top-10 preview:
   score                                                                                                job_titles
0    100                                         Student at Humber College and Aspiring Human Resources Generalist
1    100                                                           Advisory Board Member at Celal Bayar University
2    100                                                                     Aspiring Human Resources Professional
3     90                                                                       Aspiring Human Resources Specialist
4     89                                         Student at Humber College and Aspiring Human Resources Generalist
5     74                                                                     Aspiring Human Resources Professional
6     72                                                 Native English Teacher at EPIK (English Program in Korea)
7     70  2019 C.T

In [23]:
def print_ranking(query, rows_df, top_k=10):
    print(f"\nQuery: {query}")
    for _, r in rows_df.head(top_k).iterrows():
        print(f"   {r['score']/100: .3f}  {r['job_titles']}")

def run_query_full(queries: list[str],
                   model_tag: str = "phi3_mini_4k",
                   build_prompt_fn=None):
    out_dir = os.path.join(OUT_DIR, "llm"); os.makedirs(out_dir, exist_ok=True)
    top10_blocks = []
    for query in queries:
        df_rank, raw = score_all_titles_once(query, titles, build_fn=build_prompt_fn)
        print_ranking(query, df_rank, top_k=10)
        df_q = df_rank.head(10)[["score", "job_titles"]].copy()
        df_q.insert(0, "query", query)
        top10_blocks.append(df_q)
    top10 = pd.concat(top10_blocks, ignore_index=True)
    path = os.path.join(out_dir, f"llm_top10__{model_tag}__all_queries.csv")
    top10.to_csv(path, index=False)
    print("Saved:", path)
    return top10, path

In [None]:
top10_phi, path_phi = run_query_full(
    QUERIES,
    model_tag="phi3_mini_4k__listwise",
    build_prompt_fn=build_prompt_all_chat_phi
)


Prompt tokens: 1928

Query: data scientist
    1.000  Student at Humber College and Aspiring Human Resources Generalist
    1.000  Advisory Board Member at Celal Bayar University
    1.000  Aspiring Human Resources Professional
    0.900  Aspiring Human Resources Specialist
    0.890  Student at Humber College and Aspiring Human Resources Generalist
    0.740  Aspiring Human Resources Professional
    0.720  Native English Teacher at EPIK (English Program in Korea)
    0.700  2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
    0.700  HR Senior Specialist
    0.690  Student at Chapman University
Prompt tokens: 1928

Query: machine learning engineer
    1.000  Student at Humber College and Aspiring Human Resources Generalist
    1.000  Advisory Board Member at Celal Bayar University
    1.000  Aspiring Human Resources Professional
    0.900  Aspiring Human Resources Specialist
    0.890  Student at Humber College and Aspiring Human

----

### Step 6 - Experiment with another small LLM: **Gemma-2-2b-it from Google**

**Gemma-2-2B-IT (Google)**
- **Release Date**: July 31, 2024 (the Gemma 2 family itself launched on June 27, 2024 with the 9B and 27B models).
- **Architecture**: Decoder-only (autoregressive).
- **Parameters**: ~2.6B.
- **Layers**: ~26 transformer layers (with RoPE).
- **Context Window**: 8k tokens.
- **Tokenizer**: SentencePiece (same family as PaLM-2 / Gemini).
- **Objective**: Next-token prediction with instruction-tuning.
- **Training**: Web-scale datasets filtered for quality, multilingual corpora.
- **Efficiency**: Ultra-compact, designed for edge devices; runs well on 6–8GB GPUs.
- **License**: Gemma Terms of Use (a custom license; the checkpoint is gated on Hugging Face, hence the access token used below).
- *Notes*: Very small but instruction-tuned, produces stable integer list outputs if well-prompted.

Free some GPU allocated memory:

In [25]:
import gc

In [26]:
def free_GPU_memory():
    def print_vram(prefix=""):
        if not torch.cuda.is_available():
            print(prefix + "CUDA not available")
            return
        torch.cuda.synchronize()
        alloc = torch.cuda.memory_allocated() / (1024**2)      # MiB
        reserv = torch.cuda.memory_reserved() / (1024**2)      # MiB
        total = torch.cuda.get_device_properties(0).total_memory / (1024**2)
        print(f"\n{prefix}allocated: {alloc:.1f} MiB | reserved: {reserv:.1f} MiB | total: {total:,.0f} MiB")

    # Print memory allocation before freeing it
    print("Measure memory usage before and after freeing it")
    print_vram("Before:\n")

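    # move model to CPU + delete big refs
    try:
        mdl.to("cpu")
    except Exception:  # e.g. no model has been loaded yet
        pass
    # drop the global references so the garbage collector can reclaim them
    for name in ("pipe", "mdl", "tok", "inputs", "gen"):
        if name in globals():
            del globals()[name]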

    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    print_vram("After:\n")


In [27]:
free_GPU_memory()

Measure memory usage before and after freeing it

Before:
allocated: 7296.5 MiB | reserved: 8596.0 MiB | total: 11,264 MiB

After:
allocated: 8.1 MiB | reserved: 20.0 MiB | total: 11,264 MiB
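
The "Before" reading can be sanity-checked against the size of the FP16 weights alone. A rough sketch, assuming 2 bytes per parameter and ignoring activations, the KV cache and the CUDA context (`fp16_weights_mib` is a hypothetical helper, not part of the notebook):

```python
def fp16_weights_mib(n_params, bytes_per_param=2):
    """Approximate size of the model weights in MiB at the given precision."""
    return n_params * bytes_per_param / (1024 ** 2)

# Phi-3 Mini has roughly 3.8B parameters.
print(round(fp16_weights_mib(3.8e9)), "MiB")
```

For Phi-3 Mini this gives roughly 7,248 MiB, close to the 7,296.5 MiB allocated above, so almost all of the footprint is weights.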


In [28]:
MODEL_ID = "google/gemma-2-2b-it"
HF_TOKEN = os.getenv("llm_gemma")

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.float16 if device.type=="cuda" else None,
    token=HF_TOKEN
).to(device).eval()

if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token

pipe = pipeline("text-generation", model=mdl, tokenizer=tok, device=0 if device.type=="cuda" else -1)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


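Define a new **prompt builder** for Gemma. Compared with the Phi-3 version, it drops the banded rubric and the output example, spells the score range out in words, and ends with a `SCORES:` cue. It also drops the `system` role used with Phi-3, because Gemma's chat template does not accept one; the rubric is folded into the user turn instead: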

In [29]:

def build_prompt_all_chat_gemma(query: str, titles: list[str]) -> str:
    lines = "\n".join(f"{i+1}) {t}" for i, t in enumerate(titles))
    rubric = (
        "You are a recruiter scoring job-title similarity to the query.\n"
        "Rate each candidate with an integer between zero and one hundred using the full scale.\n"
        "Use diverse scores; avoid giving many zeros or many hundreds.\n"
        "Ignore employer names, locations, and programs.\n"
        "Return EXACTLY one integer per line in the SAME ORDER as the candidates.\n"
        "No words, no punctuation, no numbering.\n"
    )
    user_text = f'Query: "{query}"\n\nCandidates:\n{lines}\n\nSCORES:'
    # Gemma’s template may not support a system role → fold rubric into the user turn
    msgs = [{"role": "user", "content": rubric + "\n\n" + user_text}]
    return tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
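
The "fold the rubric into the user turn" idea can be written once and reused for any chat template that rejects a system role. A minimal sketch, assuming messages are plain `{"role", "content"}` dictionaries; `fold_system_into_user` is a hypothetical helper, not a `transformers` function.

```python
def fold_system_into_user(messages):
    """Merge all system messages into the first user message (returns a new list)."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    prefix = "\n\n".join(system_parts)
    for m in rest:
        if m["role"] == "user":
            m["content"] = prefix + "\n\n" + m["content"]
            return rest
    # No user turn at all: carry the instructions in a new leading user message.
    return [{"role": "user", "content": prefix}] + rest

msgs = [
    {"role": "system", "content": "Reply with digits only."},
    {"role": "user", "content": "Return the number 7."},
]
print(fold_system_into_user(msgs))
```

The result can be passed to `tok.apply_chat_template` in place of the original list, which keeps one rubric definition across models.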

In [None]:
top10_gemma, path_gemma = run_query_full(
    QUERIES,
    model_tag="gemma-2-2b-it__listwise",
    build_prompt_fn=build_prompt_all_chat_gemma
)

Prompt tokens: 1583

Query: data scientist
    0.200  2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
    0.100  Native English Teacher at EPIK (English Program in Korea)
    0.100  Aspiring Human Resources Professional
    0.100  People Development Coordinator at Ryan
    0.100  Advisory Board Member at Celal Bayar University
    0.100  Aspiring Human Resources Specialist
    0.100  Student at Humber College and Aspiring Human Resources Generalist
    0.100  HR Senior Specialist
    0.100  Student at Humber College and Aspiring Human Resources Generalist
    0.100  Seeking Human Resources HRIS and Generalist Positions
Prompt tokens: 1584

Query: machine learning engineer
    0.200  2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
    0.100  Native English Teacher at EPIK (English Program in Korea)
    0.100  Aspiring Human Resources Professional
    0.100  People Developmen

----

### Step 7 - Experiment with another small LLM: **Qwen2.5-3B-Instruct from Qwen**

**Qwen2.5-3B-Instruct (Alibaba / Qwen Team)**
- **Release Date**: September 19, 2024 (Qwen2.5 family release).
- **Architecture**: Decoder-only (autoregressive).
- **Parameters**: ~3.1B (about 2.8B excluding embeddings).
- **Layers**: 36 transformer layers; grouped-query attention with 16 query heads and 2 key/value heads.
- **Context Window**: 32k tokens (the longest of the three local models).
- **Tokenizer**: Custom BPE with multilingual coverage.
- **Objective**: Next-token prediction, instruction-tuned with ChatML formatting.
- **Training**: Massive multilingual web + code datasets, plus safety/alignment finetuning.
- **Efficiency**: Larger context needs more VRAM, but still runnable on 12GB with FP16/INT8.
- **License**: Qwen Research License (the 3B checkpoint is not released under Apache 2.0).
- *Notes*: Very strong for structured outputs (list-style, JSON); context length makes it robust for 104-title scoring.

In [31]:
free_GPU_memory()

Measure memory usage before and after freeing it

Before:
allocated: 4995.6 MiB | reserved: 5710.0 MiB | total: 11,264 MiB

After:
allocated: 8.1 MiB | reserved: 20.0 MiB | total: 11,264 MiB


In [32]:
MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.float16 if torch.cuda.is_available() else None,
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [33]:
# Qwen-specific prompt builder (ChatML-friendly, explicit N, hard stop)
def build_prompt_all_chat_qwen(query: str, titles: list[str]) -> str:
    n = len(titles)
    lines = "\n".join(f"{i}) {t}" for i, t in enumerate(titles, start=1))

    rubric = (
        "You are a recruiter scoring job-title similarity to the query.\n"
        "Score each candidate with an integer 0–100 using the FULL scale:\n"
        "  90–100 = exact/near-exact role match\n"
        "  70–89  = same discipline or very similar role\n"
        "  40–69  = related/adjacent\n"
        "  20–39  = mostly unrelated\n"
        "  0–19   = completely unrelated\n"
        "Rules:\n"
        " - Prefer same functional domain as the query.\n"
        " - If the query is technical (data/ML/backend), HR/People titles are mostly unrelated.\n"
        " - Titles with 'Student' or 'Aspiring' get lower scores unless they explicitly match the role.\n"
        f"Output EXACTLY {n} integers, one per line, in the SAME ORDER as the candidates.\n"
        "No words, no commas, no numbering, no punctuation.\n"
        f"After the {n}th line, output the token <END> and stop."
    )

    # very small one-shot to demonstrate format (3 lines + <END>)
    # keep it generic so it transfers across queries
    example = (
        "Example (3 candidates):\n"
        "Candidates:\n"
        "1) Senior Data Scientist\n"
        "2) HR Coordinator\n"
        "3) Retail Cashier\n"
        "Expected output:\n"
        "95\n"
        "20\n"
        "0\n"
        "<END>"
    )

    user_text = (
        f"{rubric}\n\n"
        f'Query: "{query}"\n\n'
        f"Candidates:\n{lines}\n\n"
        f"{example}"
    )

    # Qwen supports system; if anything fails, fall back to user-only.
    msgs_sys = [
        {"role": "system", "content": "You are precise and output only the requested numbers."},
        {"role": "user",   "content": user_text},
    ]
    msgs_user = [{"role": "user", "content": user_text}]  # fallback

    try:
        return tok.apply_chat_template(msgs_sys, tokenize=False, add_generation_prompt=True)
    except Exception:
        return tok.apply_chat_template(msgs_user, tokenize=False, add_generation_prompt=True)


In [None]:
top10_qwen, path_qwen = run_query_full(
    QUERIES,
    model_tag="qwen2.5-3b-instruct__listwise",
    build_prompt_fn=build_prompt_all_chat_qwen
)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Prompt tokens: 1774

Query: data scientist
    0.900  2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
    0.900  People Development Coordinator at Ryan
    0.900  Student at Humber College and Aspiring Human Resources Generalist
    0.900  People Development Coordinator at Ryan
    0.900  2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
    0.900  Native English Teacher at EPIK (English Program in Korea)
    0.900  Human Resources Coordinator at InterContinental Buckhead Atlanta
    0.900  Seeking Human Resources HRIS and Generalist Positions
    0.900  People Development Coordinator at Ryan
    0.900  Student at Humber College and Aspiring Human Resources Generalist
Prompt tokens: 1775

Query: machine learning engineer
    0.900  2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
    0.400  People Development Coordinator

----

----

For reference, this is the **output from ChatGPT (GPT-5)** for the query "data scientist":

Top matches (title • score/100)
- Business Intelligence and Analytics at Travelers • 0.78
- Information Systems Specialist and Programmer with a love for data and organization. • 0.62
- Junior MES Engineer | Information Systems • 0.56
- Undergraduate Research Assistant at Styczynski Lab • 0.54
- Liberal Arts Major. Aspiring Human Resources Analyst. • 0.42
- Seeking Human Resources HRIS and Generalist Positions • 0.28
- Human Resources Generalist at Loparex • 0.25
- Human Resources Specialist at Luxottica • 0.23
- HR Senior Specialist • 0.22
- Human Resources Professional • 0.20
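
To put a number on how much two top-10 lists agree, a simple overlap count and Jaccard index are often enough. The sketch below is illustrative: `normalize_title` and `top_k_overlap` are hypothetical helpers, and the inputs are only the first three titles of the SBERT and ChatGPT lists shown in this notebook.

```python
def normalize_title(title):
    """Lowercase and normalize spacing (including around '|') so near-identical titles match."""
    return " ".join(title.lower().replace("|", " | ").split())

def top_k_overlap(list_a, list_b, k=10):
    """Return (shared count, Jaccard index) for the first k titles of each list."""
    set_a = {normalize_title(t) for t in list_a[:k]}
    set_b = {normalize_title(t) for t in list_b[:k]}
    union = set_a | set_b
    shared = set_a & set_b
    return len(shared), (len(shared) / len(union) if union else 0.0)

sbert_top3 = [
    "Information Systems Specialist and Programmer with a love for data and organization.",
    "Human Resources Professional",
    "Junior MES Engineer| Information Systems",
]
chatgpt_top3 = [
    "Business Intelligence and Analytics at Travelers",
    "Information Systems Specialist and Programmer with a love for data and organization.",
    "Junior MES Engineer | Information Systems",
]
print(top_k_overlap(sbert_top3, chatgpt_top3, k=3))
```

The normalization step matters here because the same title appears with different spacing around `|` in the two lists.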


I used the following **prompt** (the same one used with Phi-3 mini):


You are a recruiter scoring job-title similarity to the query Rate each candidate with an integer 0–100 using the FULL scale: • 90–100 = exact/near-exact role match • 70–89 = same discipline or very similar role • 40–69 = related/adjacent • 10–39 = mostly unrelated • 0–9 = completely unrelated Use diverse scores; do NOT give 0 or 100 to many candidates. Ignore employer names, locations, programs. Output EXACTLY one integer per line, in the SAME ORDER as the candidates. No words, no punctuation. Example for return for 3 candidates: 82 41 7 --- Query: "data scientist" Candidates: "2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional Native English Teacher at EPIK (English Program in Korea) Aspiring Human Resources Professional People Development Coordinator at Ryan Advisory Board Member at Celal Bayar University Aspiring Human Resources Specialist Student at Humber College and Aspiring Human Resources Generalist HR Senior Specialist Student at Humber College and Aspiring Human Resources Generalist Seeking Human Resources HRIS and Generalist Positions Student at Chapman University SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR Human Resources Coordinator at InterContinental Buckhead Atlanta 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional Native English Teacher at EPIK (English Program in Korea) Aspiring Human Resources Professional People Development Coordinator at Ryan 2019 C.T. 
Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional Native English Teacher at EPIK (English Program in Korea) Aspiring Human Resources Professional People Development Coordinator at Ryan Advisory Board Member at Celal Bayar University Aspiring Human Resources Specialist Student at Humber College and Aspiring Human Resources Generalist HR Senior Specialist Aspiring Human Resources Management student seeking an internship Seeking Human Resources Opportunities Aspiring Human Resources Management student seeking an internship Seeking Human Resources Opportunities 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional Native English Teacher at EPIK (English Program in Korea) Aspiring Human Resources Professional People Development Coordinator at Ryan Advisory Board Member at Celal Bayar University Aspiring Human Resources Specialist Student at Humber College and Aspiring Human Resources Generalist HR Senior Specialist Student at Humber College and Aspiring Human Resources Generalist Seeking Human Resources HRIS and Generalist Positions Student at Chapman University SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR Human Resources Coordinator at InterContinental Buckhead Atlanta 2019 C.T. 
Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional Native English Teacher at EPIK (English Program in Korea) Aspiring Human Resources Professional People Development Coordinator at Ryan Advisory Board Member at Celal Bayar University Aspiring Human Resources Specialist Student at Humber College and Aspiring Human Resources Generalist HR Senior Specialist Student at Humber College and Aspiring Human Resources Generalist Seeking Human Resources HRIS and Generalist Positions Student at Chapman University SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR Human Resources Coordinator at InterContinental Buckhead Atlanta 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional Aspiring Human Resources Professional People Development Coordinator at Ryan Aspiring Human Resources Specialist HR Senior Specialist Seeking Human Resources HRIS and Generalist Positions Student at Chapman University SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR Human Resources Coordinator at InterContinental Buckhead Atlanta Experienced Retail Manager and aspiring Human Resources Professional Human Resources, Staffing and Recruiting Professional Human Resources Specialist at Luxottica Director of Human Resources North America, Groupe Beneteau Retired Army National Guard Recruiter, office manager, seeking a position in Human Resources. Human Resources Generalist at ScottMadden, Inc. Business Management Major and Aspiring Human Resources Manager Aspiring Human Resources Manager, seeking internship in Human Resources. Human Resources Professional Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!! 
(408) 709-2621 Aspiring Human Resources Professional | Passionate about helping to create an inclusive and engaging work environment "Human Resources| Conflict Management| Policies & Procedures|Talent Management|Benefits & Compensation" Human Resources Generalist at Schwan's Liberal Arts Major. Aspiring Human Resources Analyst. Junior MES Engineer| Information Systems Senior Human Resources Business Partner at Heil Environmental Aspiring Human Resources Professional | An energetic and Team-Focused Leader HR Manager at Endemol Shine North America Human Resources professional for the world leader in GIS software RRP Brand Portfolio Executive at JTI (Japan Tobacco International) Information Systems Specialist and Programmer with a love for data and organization. Bachelor of Science in Biology from Victoria University of Wellington Human Resources Management Major Director Human Resources at EY Undergraduate Research Assistant at Styczynski Lab Lead Official at Western Illinois University Seeking employment opportunities within Customer Service or Patient Care Admissions Representative at Community medical center long beach Seeking Human Resources Opportunities. Open to travel and relocation. Student at Westfield State University "Student at Indiana University Kokomo - Business Management - Retail Manager at Delphi Hardware and Paint" Aspiring Human Resources Professional Student Seeking Human Resources Position Aspiring Human Resources Manager | Graduating May 2020 | Seeking an Entry-Level Human Resources Position in St. Louis Human Resources Generalist at Loparex Business Intelligence and Analytics at Travelers Always set them up for Success Director Of Administration at Excellence Logging

----


### Step 8 - Experience with Kimi K2 / Moonshot AI

**Kimi K2 (K2-0711) / Moonshot AI**
- **Release Date**: July 11, 2025.
- **Architecture**: Mixture-of-Experts (MoE) Transformer with MLA (Multi-head Latent Attention)
- **Parameters**: ~1T total, ~32B activated (MoE).
- **Layers**: 61 total (including 1 dense layer); 64 attention heads; 384 experts; 8 selected experts per token; 1 shared expert.
- **Context Window**: 128K tokens for K2-0711; the K2-0905 update used below extends this to 256K tokens.
- **Tokenizer**: Custom tokenizer; vocab size 160K.
- **Objective**: Language Modeling (causal/next-token) with multi-stage post-training focused on agentic capabilities (tool use, planning) using RL variants.
- **Training**: Pre-trained on 15.5T tokens with the MuonClip optimizer; post-training includes large-scale agentic data synthesis and RLVR + self-critique.
- **Efficiency**: MoE with 32B active params; block-FP8 checkpoints; recommended engines include vLLM, SGLang, KTransformers, TensorRT-LLM.
- **License**: Modified MIT (code and weights)
- *Notes*: OpenAI/Anthropic-compatible API available via platform.moonshot.ai; recommended temperature ~ 0.6 for Instruct variants; strong tool-calling support.

In [35]:
free_GPU_memory()

Measure memory usage before and after freeing it

Before:
allocated: 6002.6 MiB | reserved: 6798.0 MiB | total: 11,264 MiB

After:
allocated: 8.1 MiB | reserved: 20.0 MiB | total: 11,264 MiB


In [36]:
from openai import OpenAI
from typing import List, Tuple
from dotenv import load_dotenv

For this model we will use the Moonshot API. Therefore, we read the API key from the `.env` file.

In [37]:
load_dotenv()
api_key = os.getenv("MOONSHOT_API_KEY")
base_url = os.getenv("MOONSHOT_API_BASE", "https://api.moonshot.ai/v1")
if not api_key:
    raise RuntimeError("MOONSHOT_API_KEY is not set. Check your .env or environment.")

client = OpenAI(api_key=api_key, base_url=base_url)

In [38]:
# Choose model (check Moonshot platform for latest stable/previews)
KIMI_MODEL_ID = os.getenv("KIMI_MODEL_ID", "kimi-k2-0905-preview")

##### Here we add a thin **Kimi (Moonshot)** adapter so the evaluation pipeline stays identical to the previous local HF models: same rubric/prompt format, same strict parser (`parse_scores_n` → exactly **N** integers 0–100), same **ranking/CSV artifacts**, and a low sampling temperature (`temperature=0.1` in the scorer below) to keep the output close to deterministic.

##### The only change is the generation call (OpenAI-compatible Chat Completions API), wrapped by `score_all_titles_once_kimi` and `run_query_full_kimi`. This preserves apples-to-apples comparisons against Reference-A (ChatGPT-5) while letting us swap models by changing only the prompt builder and the scorer entry point.
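The notebook's `parse_scores_n` is defined in an earlier step and is not reproduced here. As a rough illustration of what "exactly N integers 0–100" can mean in practice, a minimal stand-in might look like the sketch below. The name `parse_scores_n_sketch` and the truncate/pad-with-zeros policy are assumptions for illustration, not the original implementation.

```python
import re

def parse_scores_n_sketch(text: str, n: int) -> list[int]:
    """Hypothetical stand-in for parse_scores_n: return exactly n ints in [0, 100]."""
    scores = []
    for line in text.splitlines():
        # Accept only lines that consist of a bare integer (1 to 3 digits).
        m = re.fullmatch(r"\s*(\d{1,3})\s*", line)
        if m:
            scores.append(min(100, int(m.group(1))))
    scores = scores[:n]                   # drop any extra lines
    scores += [0] * (n - len(scores))     # pad missing lines with 0 (assumed policy)
    return scores

print(parse_scores_n_sketch("82\n41\nnot a score\n7\n", 4))
```

Padding with zeros keeps the downstream DataFrame shape fixed, but it also hides truncated replies, so it is worth logging how many lines were actually parsed.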

In [39]:
# Kimi prompt builder (simple, parser-friendly, same rubric)
def build_scoring_prompt_kimi(query: str, candidates: list[str]) -> str:
    n = len(candidates)
    header = (
        "You are a recruiter scoring job-title similarity to the query\n"
        "Rate each candidate with an integer 0–100 using the FULL scale:\n"
        "• 90–100 = exact/near-exact role match\n"
        "• 70–89  = same discipline or very similar role\n"
        "• 40–69  = related/adjacent\n"
        "• 10–39  = mostly unrelated\n"
        "• 0–9    = completely unrelated\n"
        "Use diverse scores; do NOT give 0 or 100 to many candidates.\n"
        "Ignore employer names, locations, programs.\n"
        f"Output EXACTLY {n} integers, one per line, in the SAME ORDER as the candidates.\n"
        "Integers ONLY (no decimals, no percentages, no words, no punctuation).\n\n"
        "Example for 3 candidates:\n82\n41\n7\n"
        "---\n"
    )
    return header + f'Query: "{query}"\nCandidates:\n' + "\n".join(candidates)

In [40]:
# Kimi scorer (adapter with same return shape as your local scorer)
def score_all_titles_once_kimi(query: str,
                               titles: list[str],
                               max_new_tokens: int = 1200,
                               build_fn=build_scoring_prompt_kimi):
    prompt = build_fn(query, titles)
    resp = client.chat.completions.create(
        model=KIMI_MODEL_ID,         
        temperature=0.1,
        top_p=1.0,
        frequency_penalty=0.2,
        max_tokens=max_new_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    # debug code
    # raw = resp.choices[0].message.content or ""
    # lines = [ln for ln in raw.splitlines() if ln.strip()]
    # print("parsed_ints:", sum(ln.strip().isdigit() for ln in lines))
    # print("unique_ints:", len({int(ln.strip()) for ln in lines if ln.strip().isdigit()}))

    out_text = resp.choices[0].message.content or ""
    scores = parse_scores_n(out_text, len(titles))   # reuse the original parser
    df = pd.DataFrame({"idx": range(len(titles)), "score": scores})
    df["job_titles"] = [titles[i] for i in df["idx"]]
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    return df, out_text
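The commented-out debug lines in the scorer count how many lines parsed as integers and how many distinct values appeared. A small helper can turn that into a reusable check for truncated or flat listwise replies. This is only a sketch; `diagnose_listwise_output` is a hypothetical name, not part of the notebook.

```python
def diagnose_listwise_output(raw: str, expected_n: int) -> dict:
    """Summarise a raw listwise reply (hypothetical helper)."""
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    # isdecimal() keeps only strings that int() can safely convert.
    ints = [int(ln) for ln in lines if ln.isdecimal()]
    return {
        "non_empty_lines": len(lines),
        "parsed_ints": len(ints),
        "unique_ints": len(set(ints)),
        "missing": max(0, expected_n - len(ints)),
    }

report = diagnose_listwise_output("82\n41\n41\nN/A\n", expected_n=5)
print(report)
```

A low `unique_ints` count or a non-zero `missing` count on the raw text returned by the scorer (`out_text`) is a sign that the ranking rests on padded or near-identical scores.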

In [41]:
# Wrapper runner mirroring the run_query_full original function
def run_query_full_kimi(queries: list[str],
                        model_tag: str = "kimi-k2-0905",
                        build_prompt_fn=build_scoring_prompt_kimi):
    out_dir = os.path.join(OUT_DIR, "llm"); os.makedirs(out_dir, exist_ok=True)
    top10_blocks = []
    for query in queries:
        df_rank, raw = score_all_titles_once_kimi(query, titles, build_fn=build_prompt_fn)
        print_ranking(query, df_rank, top_k=10)   # <- reuse pretty-printer
        df_q = df_rank.head(10)[["score", "job_titles"]].copy()
        df_q.insert(0, "query", query)
        top10_blocks.append(df_q)
    top10 = pd.concat(top10_blocks, ignore_index=True)
    path = os.path.join(out_dir, f"llm_top10__{model_tag}__all_queries.csv")
    top10.to_csv(path, index=False)
    print("Saved:", path)
    return top10, path
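Note that `score_all_titles_once_kimi` reads the module-level `client` and `KIMI_MODEL_ID`, so passing a different `model_tag` to the runner changes only the CSV filename, not the model being called. One possible way to make the provider explicit is to inject a completion function. The sketch below uses hypothetical names (`make_chat_fn`, `score_listwise`) and simple stand-ins for the prompt builder and parser so that it runs offline; it is not the notebook's implementation.

```python
from typing import Callable

def make_chat_fn(client, model_id: str, temperature: float = 0.0,
                 max_tokens: int = 1200) -> Callable[[str], str]:
    """Bind an OpenAI-compatible client and a model id into a prompt -> text function."""
    def chat(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model_id,
            temperature=temperature,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content or ""
    return chat

def score_listwise(query, titles, chat_fn, build_fn, parse_fn):
    """Provider-agnostic scorer: (title, score) pairs sorted by score, descending."""
    raw = chat_fn(build_fn(query, titles))
    scores = parse_fn(raw, len(titles))
    return sorted(zip(titles, scores), key=lambda ts: ts[1], reverse=True)

# Offline demo with stand-ins (no network call is made here).
fake_chat = lambda prompt: "10\n90\n40"
simple_build = lambda q, ts: f"Query: {q}\n" + "\n".join(ts)
simple_parse = lambda raw, n: [int(x) for x in raw.split()][:n]

ranked = score_listwise("data scientist",
                        ["HR Manager", "Data Analyst", "Programmer"],
                        fake_chat, simple_build, simple_parse)
print(ranked)
```

With real clients, the same scorer could be driven by `make_chat_fn(client, KIMI_MODEL_ID)` for Kimi or `make_chat_fn(client_ds, DEEPSEEK_MODEL_ID)` for DeepSeek, together with the notebook's own prompt builder and `parse_scores_n`.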

In [None]:
top10_kimi, path_kimi = run_query_full_kimi(
    QUERIES,
    model_tag="kimi-k2-0905-preview__listwise",
    build_prompt_fn=build_scoring_prompt_kimi
)


Query: data scientist
    0.060  Human Resources Management Major
    0.060  Human Resources Generalist at ScottMadden, Inc.
    0.050  Junior MES Engineer| Information Systems
    0.050  Director Human Resources  at EY
    0.050  Retired Army National Guard Recruiter, office manager,  seeking a position in Human Resources.
    0.050  Business Management Major and Aspiring Human Resources Manager
    0.050  Aspiring Human Resources Professional | Passionate about helping to create an inclusive and engaging work environment
    0.050  Human Resources Professional
    0.050  RRP Brand Portfolio Executive at JTI (Japan Tobacco International)
    0.050  Liberal Arts Major. Aspiring Human Resources Analyst.

Query: machine learning engineer
    0.060  Human Resources Specialist at Luxottica
    0.050  Aspiring Human Resources Manager, seeking internship in Human Resources.
    0.050  Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!!  (408) 709-2621
    0.

### Step 9 - Experience with LLaMA / Meta AI

**LLaMA 3.2 3B Instruct / Meta AI**
- **Release Date**: September 25, 2024.
- **Architecture**: Autoregressive (decoder-only) Transformer with Grouped-Query Attention (GQA) and RoPE.
- **Parameters**: ~3.21B
- **Layers**: 28 transformer layers, 24 attention heads, 8 KV heads, hidden size 3072 (per model configs) 
- **Context Window**: 128K tokens
- **Tokenizer**: Llama-3 tokenizer (BPE) with 128,256 vocab size (vs. 32K in Llama 2).
- **Objective**: Causal next-token LM; instruction models aligned for dialogue/agentic tasks.
- **Training**: Pretrained on up to ~9T tokens; for 1B/3B, distillation from Llama 3.1 8B/70B was used; knowledge cutoff Dec 2023.
- **Efficiency**: Designed for edge/on-device and small-GPU use; keeps the long 128K context for large candidate lists.
- **License**: Llama 3.2 Community License.
- *Notes*: Use recent Transformers with the official Llama 3.x chat template for best adherence to constrained outputs.

In [None]:
free_GPU_memory()

Measure memory usage before and after freeing it

Before:
allocated: 8.1 MiB | reserved: 20.0 MiB | total: 11,264 MiB

After:
allocated: 8.1 MiB | reserved: 20.0 MiB | total: 11,264 MiB


#### **\\!/** Note on running LLaMA locally (Windows, VRAM, and speed)

I first tried `Llama 3.1 8B Instruct` in FP16/BF16 on Windows. With my GPU (~11 GB VRAM), the model didn’t fit, so `device_map="auto"` offloaded layers to the CPU.
Result: **generation became very slow** (one run sat for ~28 minutes) because most of the compute happened on the CPU instead of the GPU.

I also considered 4-bit quantization to fit 8B, but on Windows the **`bitsandbytes`** wheel isn’t reliably available/officially supported. It’s easier on Linux, but I’m avoiding that detour for now.

Decision: switch to **`Llama 3.2 3B Instruct`**, which fits comfortably in ~11 GB and runs on the GPU at normal speed (no offload).
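The fit/no-fit outcome can be sanity-checked with a weights-only estimate (parameters × bytes per parameter). This is a rough sketch: it ignores the KV cache, activations, and the CUDA context, and the ~8B parameter count for the larger model is a rounded assumption.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Weights-only footprint in GiB (no KV cache, activations, or CUDA context)."""
    return n_params * bits_per_param / 8 / 2**30

VRAM_GIB = 11.0  # 11,264 MiB, as reported by the memory printouts

for name, n_params, bits in [
    ("Llama 3.1 8B, 16-bit", 8.0e9, 16),
    ("Llama 3.1 8B, 4-bit", 8.0e9, 4),
    ("Llama 3.2 3B, 16-bit", 3.21e9, 16),
]:
    gib = weight_memory_gib(n_params, bits)
    verdict = "fits" if gib < VRAM_GIB else "does not fit"
    print(f"{name}: ~{gib:.1f} GiB of weights -> {verdict} in {VRAM_GIB:.0f} GiB")
```

The 16-bit 3B estimate comes to roughly 6 GiB, which is in line with the ~6,136 MiB "allocated" figure printed before memory is freed in Step 10.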

In [46]:
# Model: 3B Instruct (works well on ~11 GB VRAM)
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID, token=os.getenv("HF_TOKEN"), use_fast=True)

mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16 if torch.cuda.is_available() else None,  # or torch.float16
    device_map="auto",
    token=os.getenv("HF_TOKEN"),
    #attn_implementation="flash_attention_2",  # let Transformers pick SDPA/eager automatically
).eval()

if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token


In [51]:
# LLaMA prompt builder (same simple rubric + 3-line example)
def build_prompt_all_chat_llama(query: str, titles: list[str]) -> str:
    n = len(titles)
    rubric = (
        "You are a recruiter scoring job-title similarity to the query.\n"
        "Rate each candidate with an integer 0–100 using the FULL scale:\n"
        "• 90–100 = exact/near-exact • 70–89 = very similar • 40–69 = related\n"
        "• 10–39 = mostly unrelated • 0–9 = unrelated\n"
        f"Output EXACTLY {n} integers, one per line, in the SAME ORDER as the candidates.\n"
        "Integers ONLY (no decimals, no words, no punctuation). Use the full range; avoid identical scores.\n\n"
        f'Query: "{query}"\n'
        "Candidates:\n" + "\n".join(titles)
    )
    messages = [
        {"role": "system", "content": "You are precise and output only the requested integers."},
        {"role": "user", "content": rubric},
    ]
    try:
        return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    except Exception:
        return rubric


#### The rationale for introducing the new pairwise scoring functions:

With small local models (e.g., LLaMA-3.2-3B on ~11 GB VRAM), the original *listwise prompt* (scoring all 104 titles at once) is brittle. The model must hold a long rubric plus 100+ candidates and emit exactly 104 integers in order. In practice this led to early stopping, example-echoing, flat/identical scores across queries, and heavy parser padding with zeros.

**The new design scores one title at a time**: one compact prompt per (query, title) pair, returning a single integer (0–100). In learning-to-rank terms this is pointwise scoring; "pairwise" here refers to the (query, title) pair, not to comparing two candidates against each other. This drastically lowers the cognitive and formatting burden, so the 3B model gives stable, query-dependent scores with greedy decoding. A simple one-int parser replaces the fragile N-line parser, and the final ranking comes from sorting the 104 individual scores.

To **balance quality vs. cost**, we’ll use a hybrid approach:
- For local small LMs → pairwise (robust and accurate on the available hardware).
- For hosted/strong LMs (DeepSeek R1, ChatGPT-class, Kimi K2, etc.) → **listwise** (one call per query is cheaper and faster, and these models follow the rubric well).
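The cost side of this trade-off is easy to quantify as a number of model calls. The small sketch below (the helper name `n_requests` is hypothetical) counts calls for the 4 queries and 104 titles used in this notebook.

```python
def n_requests(n_queries: int, n_titles: int, strategy: str) -> int:
    """Number of model calls needed to score every title for every query."""
    if strategy == "listwise":      # one prompt carries all titles
        return n_queries
    if strategy == "pairwise":      # one prompt per (query, title) pair
        return n_queries * n_titles
    raise ValueError(f"unknown strategy: {strategy!r}")

for strategy in ("listwise", "pairwise"):
    print(strategy, n_requests(n_queries=4, n_titles=104, strategy=strategy))
```

That is 4 calls versus 416, which is acceptable for a local model generating a handful of tokens per call but adds up quickly on a metered API.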


In [55]:
# 1) Single-candidate prompt (zero-shot, integers only)
def build_prompt_single_llama(query: str, title: str) -> str:
    instr = (
        "You are a recruiter scoring job-title similarity to the query.\n"
        "Return EXACTLY one integer 0–100.\n"
        "Scale:\n"
        " • 90–100 = exact/near-exact\n"
        " • 70–89  = very similar\n"
        " • 40–69  = related/adjacent\n"
        " • 10–39  = mostly unrelated\n"
        " • 0–9    = unrelated\n"
        "Return the integer ONLY (no words, no punctuation, no decimals)."
    )
    user = f'Query: "{query}"\nCandidate:\n{title}'
    messages = [
        {"role": "system", "content": instr},
        {"role": "user",   "content": user},
    ]
    try:
        return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    except Exception:
        return instr + "\n\n" + user

# 2) Parse exactly one integer
def parse_one_int(text: str) -> int:
    import re
    m = re.findall(r"-?\d+", text)
    if not m: 
        return 0
    x = int(m[-1])
    return max(0, min(100, x))

# 3) Pairwise scorer that preserves your downstream format
def score_titles_llama_pairwise(query: str, all_titles: list[str], max_new_tokens: int = 8):
    rows = []
    for i, title in enumerate(all_titles):
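        prompt = build_prompt_single_llama(query, title)
        # The chat template already inserts BOS, so skip special tokens here to avoid a duplicate.
        inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(mdl.device)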
        out_ids = mdl.generate(
            **inputs,
            do_sample=False,
            num_beams=1,
            max_new_tokens=max_new_tokens,
            pad_token_id=tok.eos_token_id,
            use_cache=True,
        )
        out_txt = tok.decode(out_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        score = parse_one_int(out_txt)
        rows.append({"idx": i, "score": score, "job_titles": title})

    df = pd.DataFrame(rows).sort_values("score", ascending=False).reset_index(drop=True)
    return df

# Runner that matches your existing CSV artifact shape
def run_query_full_llama_pairwise(queries: list[str], model_tag: str = "llama-3.2-3b-instruct-pairwise"):
    out_dir = os.path.join(OUT_DIR, "llm"); os.makedirs(out_dir, exist_ok=True)
    top10_blocks = []
    for q in queries:
        df_rank = score_titles_llama_pairwise(q, titles)
        print_ranking(q, df_rank, top_k=10)
        df_q = df_rank.head(10)[["score", "job_titles"]].copy()
        df_q.insert(0, "query", q)
        top10_blocks.append(df_q)
    top10 = pd.concat(top10_blocks, ignore_index=True)
    path = os.path.join(out_dir, f"llm_top10__{model_tag}__all_queries.csv")
    top10.to_csv(path, index=False)
    print("Saved:", path)
    return top10, path
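`parse_one_int` keeps the last integer it finds and clamps it to 0–100. The snippet below repeats that logic so it runs on its own and compares it with a hypothetical `parse_first_int` variant on a few reply shapes a small model might produce.

```python
import re

def parse_one_int(text: str) -> int:
    # Same logic as the cell above: last integer found, clamped to [0, 100].
    m = re.findall(r"-?\d+", text)
    if not m:
        return 0
    return max(0, min(100, int(m[-1])))

def parse_first_int(text: str) -> int:
    # Hypothetical variant: take the first integer instead of the last.
    m = re.search(r"-?\d+", text)
    return max(0, min(100, int(m.group()))) if m else 0

for reply in ["85", "Score: 85", "85/100", "150", ""]:
    print(repr(reply), parse_one_int(reply), parse_first_int(reply))
```

The two agree on clean replies; they differ on a reply such as `85/100`, where taking the last integer yields 100. Which choice is safer depends on how the model tends to break the "integer only" instruction.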


In [56]:
top10_llama, path_llama = run_query_full_llama_pairwise(
    QUERIES,
    model_tag="llama-3.2-3b-instruct-pairwise",
)



Query: data scientist
    0.700  Business Intelligence and Analytics at Travelers
    0.390  Junior MES Engineer| Information Systems
    0.300  Information Systems Specialist and Programmer with a love for data and organization.
    0.300  Human Resources|
Conflict Management|
Policies & Procedures|Talent Management|Benefits & Compensation
    0.100  2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
    0.100  Aspiring Human Resources Professional
    0.100  People Development Coordinator at Ryan
    0.100  Aspiring Human Resources Specialist
    0.100  Student at Humber College and Aspiring Human Resources Generalist
    0.100  People Development Coordinator at Ryan

Query: machine learning engineer
    0.400  Junior MES Engineer| Information Systems
    0.390  Business Intelligence and Analytics at Travelers
    0.300  Information Systems Specialist and Programmer with a love for data and organization.
    0.100  Seeking Human 

### Step 10 - Experience with DeepSeek R1 / DeepSeek

**LLaMA 3.1 / Meta AI**
- **Release Date**: July 23, 2024.
- **Architecture**: Autoregressive (decoder-only) Transformer with Grouped-Query Attention (GQA).
- **Parameters**: ~8B, ~70B and ~405B variants.
- **Layers**:
  - 8B: 32 transformer layers, 32 attention heads, 8 KV heads, hidden size 4096, intermediate size 14336.
  - 70B: 80 transformer layers, 64 attention heads.
- **Context Window**: 128K tokens (all 3.1 sizes).
- **Tokenizer**: Llama 3 tokenizer (tiktoken-based BPE) with 128,256 vocab size (vs. 32K in Llama 2).
- **Objective**: Next-token prediction; instruction models aligned with supervised fine-tuning (SFT) and RLHF.
- **Training**: Pretrained on ~15T+ tokens; knowledge cutoff December 2023; multilingual coverage (8 supported languages).
- **Efficiency**: GQA for scalable inference; strong ecosystem support in Transformers (FlashAttention 2, 4-bit quantization via bitsandbytes).
- **License**: Llama 3.1 Community License.
- *Notes*: Official chat template / tool-use formats provided; use Transformers >= 4.43 with Llama 3.1 prompt format for best results.

In [57]:
free_GPU_memory()

Measure memory usage before and after freeing it

Before:
allocated: 6136.0 MiB | reserved: 7314.0 MiB | total: 11,264 MiB

After:
allocated: 8.1 MiB | reserved: 20.0 MiB | total: 11,264 MiB


In [58]:
# Load env + build client
load_dotenv()
DS_API_KEY  = os.getenv("DEEPSEEK_API_KEY")
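# Default to DeepSeek's endpoint; with base_url=None the client would target OpenAI instead.
DS_API_BASE = os.getenv("DEEPSEEK_API_BASE", "https://api.deepseek.com")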

if not DS_API_KEY:
    raise RuntimeError("DEEPSEEK_API_KEY is not set. Add it to your .env")

In [59]:
client_ds = OpenAI(api_key=DS_API_KEY, base_url=DS_API_BASE)

In [60]:
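# Choose model: "deepseek-chat" is DeepSeek's general chat model; the R1 reasoning
# model is served as "deepseek-reasoner" (check the DeepSeek docs for current names)
DEEPSEEK_MODEL_ID = "deepseek-chat"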

In [61]:
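# score_all_titles_once_kimi reads the module-level `client` and `KIMI_MODEL_ID`,
# so point both at DeepSeek before reusing the Kimi runner (otherwise Kimi is called again).
client, KIMI_MODEL_ID = client_ds, DEEPSEEK_MODEL_ID

top10_DS, path_DS = run_query_full_kimi(
    QUERIES,
    model_tag="deepseek-chat__listwise",
    build_prompt_fn=build_scoring_prompt_kimi
)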


Query: data scientist
    0.060  Aspiring Human Resources Professional | An energetic and Team-Focused Leader
    0.060  Senior Human Resources Business Partner at Heil Environmental
    0.050  RRP Brand Portfolio Executive at JTI (Japan Tobacco International)
    0.050  HR Manager at Endemol Shine North America
    0.050  Director of Human Resources North America, Groupe Beneteau
    0.050  Human Resources Generalist at ScottMadden, Inc.
    0.050  Lead Official at Western Illinois University
    0.050  Human Resources Generalist at Schwan's
    0.050  Human Resources Management Major
    0.050  Undergraduate Research Assistant at Styczynski Lab

Query: machine learning engineer
    0.150  Human Resources|
Conflict Management|
Policies & Procedures|Talent Management|Benefits & Compensation
    0.080  Student at Chapman University
    0.080  Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!!  (408) 709-2621
    0.070  Human Resources, Staffing and Rec