## GPT Embeddings Baseline (OpenRouter)

This notebook reproduces the same baseline flow as `baseline.ipynb`, but **replaces TF-IDF with embeddings** from OpenRouter using the model **`openai/text-embedding-ada-002`**.

High-level steps:
- Load `Baseline/cleaned_resumes.csv`
- Auto-select the same feature columns (exclude IDs + target)
- Embed **each column separately**, then concatenate the embeddings into one feature matrix
- Train a **Random Forest** classifier and evaluate with a **confusion matrix plot**


## 1. Imports

We import standard libraries, the HTTP client (`requests`) for calling OpenRouter, and scikit-learn utilities for training + evaluation.


In [3]:
import os
import time
from pathlib import Path

import requests
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


## 2. OpenRouter config (API key + model)

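This cell:
- Reads your OpenRouter API key from environment variables (`OPENROUTER_API_KEY` first, then `OPENAI_API_KEY`)
- Defines the embedding model `openai/text-embedding-ada-002`
- Sets the OpenRouter base URL (OpenAI-compatible), overridable via `OPENROUTER_BASE_URL`
- Defines `_truncate_text`, which caps each field at `MAX_CHARS_PER_FIELD` characters

The key itself is never printed; the cell only reports which env var supplied it, plus the base URL and model name.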


In [4]:
OPENROUTER_BASE_URL = os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")
EMBEDDING_MODEL = "openai/text-embedding-ada-002"

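# To avoid extremely long payloads, we truncate each field to a character limit.
# The cap is in characters, not tokens: ada-002 accepts at most 8191 tokens per input,
# and 8000 characters of typical English text stays well below that.
# If you want to be stricter/looser, change this value.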
MAX_CHARS_PER_FIELD = 8000


def _get_openrouter_api_key():
    for env_name in ["OPENROUTER_API_KEY", "OPENAI_API_KEY"]:
        val = os.getenv(env_name)
        if val:
            return val, env_name
    raise RuntimeError(
        "Missing API key. Set OPENROUTER_API_KEY in your environment (preferred). "
        "If you are using an OpenAI-compatible key name, OPENAI_API_KEY is also checked."
    )


OPENROUTER_API_KEY, API_KEY_SOURCE = _get_openrouter_api_key()
print(f"Using API key from env var: {API_KEY_SOURCE}")
print(f"Using OpenRouter base URL: {OPENROUTER_BASE_URL}")
print(f"Embedding model: {EMBEDDING_MODEL}")


def _truncate_text(s: str, max_chars: int = MAX_CHARS_PER_FIELD) -> str:
    if s is None:
        return ""
    s = str(s)
    if len(s) <= max_chars:
        return s
    return s[:max_chars]


RuntimeError: Missing API key. Set OPENROUTER_API_KEY in your environment (preferred). If you are using an OpenAI-compatible key name, OPENAI_API_KEY is also checked.
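
The `RuntimeError` above means neither environment variable was visible to the kernel when the cell ran. One way to supply the key for the current session without writing it into the notebook is sketched below. `ensure_env_key` is a hypothetical helper, not part of the original notebook, and it assumes an interactive session where `getpass` can prompt.

```python
import os
from getpass import getpass


def ensure_env_key(name="OPENROUTER_API_KEY", prompt=getpass):
    """Prompt for the key only if the env var is missing or empty (hypothetical helper)."""
    if not os.environ.get(name):
        # getpass hides the typed value, so the key does not show up in cell output
        os.environ[name] = prompt(f"Enter {name}: ")
    return name


# Interactive usage: call ensure_env_key(), then re-run the config cell.
```

Setting the variable in the shell before launching Jupyter works just as well; the helper only covers the case where the kernel is already running.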

## 3. Load the dataset + select columns

This cell:
- Loads `cleaned_resumes.csv`, looking first in the current directory and then in the `Baseline` folder
- Auto-selects the same feature columns as `baseline.ipynb`
- Prepares `X_df` (features) and `y` (target)


In [None]:
# Locate the CSV without assuming the current working directory
candidate_paths = [Path("cleaned_resumes.csv"), Path("Baseline") / "cleaned_resumes.csv"]
for p in candidate_paths:
    if p.exists():
        data_path = p
        break
else:
    raise FileNotFoundError(f"Could not find cleaned_resumes.csv. Tried: {[str(x) for x in candidate_paths]}")

print(f"Loading: {data_path}")
df = pd.read_csv(data_path)

TARGET = "experience_level"
ID_COLS = {"name", "email", "linkedin", "github", "summary_count"}

feature_cols = [c for c in df.columns if c != TARGET and c not in ID_COLS]

# Ensure text values are strings (embeddings API expects strings).
# Fill missing values before casting: astype(str) can turn NaN into the literal "nan".
for col in feature_cols:
    df[col] = df[col].fillna("").astype(str)

X_df = df[feature_cols]
y = df[TARGET]

print(f"Dataset shape: {df.shape}")
print(f"Using {len(feature_cols)} feature columns: {feature_cols}")


## 4. OpenRouter embedding helper

This cell defines a small helper that:
- Calls the OpenRouter **OpenAI-compatible** `/embeddings` endpoint
- Batches requests (to avoid huge payloads)
- Retries on temporary failures
- Deduplicates identical strings to reduce the number of paid embedding inputs
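
The deduplication and batching steps can be tried offline before spending any API credits. The sketch below uses made-up column values and a stand-in for the embedding call (`fake_vectors`); `dedupe_and_batch` and `fan_out` are illustrative names, not functions from the notebook.

```python
def dedupe_and_batch(texts, batch_size):
    """Order-preserving dedup, then fixed-size slices of the unique texts."""
    unique_texts = list(dict.fromkeys(texts))  # dicts keep first-seen order
    batches = [
        unique_texts[i:i + batch_size]
        for i in range(0, len(unique_texts), batch_size)
    ]
    return unique_texts, batches


def fan_out(texts, unique_texts, unique_vectors):
    """Map one vector per unique text back onto every original row."""
    lookup = dict(zip(unique_texts, unique_vectors))
    return [lookup[t] for t in texts]


texts = ["python", "sql", "python", "go", "sql", "java"]  # made-up values
unique_texts, batches = dedupe_and_batch(texts, batch_size=2)

# Stand-in for the API: a 1-dimensional "embedding" equal to the text length
fake_vectors = [[float(len(t))] for t in unique_texts]
rows = fan_out(texts, unique_texts, fake_vectors)

print(unique_texts)  # ['python', 'sql', 'go', 'java']
print(batches)       # [['python', 'sql'], ['go', 'java']]
print(rows)          # [[6.0], [3.0], [6.0], [2.0], [3.0], [4.0]]
```

Six rows become four paid inputs here. The saving on real data depends on how often a column repeats the same value.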


In [None]:
def _openrouter_embeddings_request(inputs, *, model=EMBEDDING_MODEL, timeout_s=90):
    url = f"{OPENROUTER_BASE_URL.rstrip('/')}/embeddings"
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "input": inputs,
    }

    resp = requests.post(url, headers=headers, json=payload, timeout=timeout_s)
    if resp.status_code != 200:
        raise RuntimeError(f"OpenRouter embeddings error {resp.status_code}: {resp.text[:500]}")

    data = resp.json()
    items = data.get("data", [])

    # Ensure we return embeddings in the same order as inputs
    items_sorted = sorted(items, key=lambda x: x.get("index", 0))
    embeddings = [it["embedding"] for it in items_sorted]

    if len(embeddings) != len(inputs):
        raise RuntimeError(f"Expected {len(inputs)} embeddings, got {len(embeddings)}")

    return embeddings


def embed_texts_openrouter(texts, *, batch_size=64, max_retries=5, retry_sleep_s=1.5):
    # Truncate to protect against overly long fields
    texts = [_truncate_text(t) for t in texts]

    # Deduplicate to reduce API calls
    unique_texts = list(dict.fromkeys(texts))

    unique_embeddings = []
    for start in range(0, len(unique_texts), batch_size):
        batch = unique_texts[start : start + batch_size]

        for attempt in range(max_retries):
            try:
                batch_embs = _openrouter_embeddings_request(batch)
                unique_embeddings.extend(batch_embs)
                break
            except Exception:
                # Out of attempts: surface the last error to the caller
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff: 1.5s, 3s, 6s, ... with the default retry_sleep_s
                sleep_for = retry_sleep_s * (2 ** attempt)
                time.sleep(sleep_for)

    if len(unique_embeddings) != len(unique_texts):
        raise RuntimeError("Internal error: embedding count mismatch after batching")

    lookup = {t: emb for t, emb in zip(unique_texts, unique_embeddings)}
    dense = np.array([lookup[t] for t in texts], dtype=np.float32)

    return dense
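
The retry loop in `embed_texts_openrouter` sleeps `retry_sleep_s * (2 ** attempt)` seconds after each failed attempt except the last, which re-raises. A small sketch of the resulting schedule under the default arguments (`backoff_delays` is an illustrative name, not part of the notebook):

```python
def backoff_delays(max_retries=5, retry_sleep_s=1.5):
    """Delays slept between attempts; the final failed attempt raises instead of sleeping."""
    return [retry_sleep_s * (2 ** attempt) for attempt in range(max_retries - 1)]


delays = backoff_delays()
print(delays)       # [1.5, 3.0, 6.0, 12.0]
print(sum(delays))  # 22.5
```

So a batch that fails all five attempts spends 22.5 seconds sleeping, on top of the time taken by the requests themselves (each bounded by the 90-second timeout).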


## 5. Build the embedding feature matrix

We embed **each selected column** separately (same idea as multi-column TF-IDF), then concatenate all embeddings into a single matrix `X_embed`.

This keeps the resume fields separated in the representation while still producing one vector per resume.
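
To see what the concatenation does to the feature dimension, here is a minimal offline sketch. It assumes two hypothetical columns (`skills`, `education`) and replaces the API call with a zero-filled placeholder of the right width; `text-embedding-ada-002` returns 1536-dimensional vectors.

```python
import numpy as np

EMBED_DIM = 1536  # output size of text-embedding-ada-002


def fake_embed(texts, dim=EMBED_DIM):
    """Placeholder for embed_texts_openrouter: zeros of the right shape, no API call."""
    return np.zeros((len(texts), dim), dtype=np.float32)


# Hypothetical columns, three resumes each
columns = {
    "skills": ["python", "sql", "go"],
    "education": ["bsc", "msc", "phd"],
}

blocks = [fake_embed(values) for values in columns.values()]
X = np.concatenate(blocks, axis=1)  # one row per resume, blocks side by side
print(X.shape)  # (3, 3072)
```

With `k` feature columns the classifier sees `k * 1536` features, and the block for the column at position `i` occupies feature indices `i * 1536` through `(i + 1) * 1536 - 1`.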


In [None]:
column_embeds = []

for col in feature_cols:
    print(f"Embedding column: {col}")
    texts = X_df[col].tolist()
    col_vecs = embed_texts_openrouter(texts, batch_size=64)
    print(f"  -> {col_vecs.shape}")
    column_embeds.append(col_vecs)

X_embed = np.concatenate(column_embeds, axis=1)
print(f"Final embedding matrix shape: {X_embed.shape}")


## 6. Train/test split + Random Forest training

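We hold out 20% of the rows as a test set (stratified by `experience_level`, so both splits keep the same class proportions) and train a Random Forest on the remaining 80%.

Like `baseline.ipynb`, we weight **`mid`** three times as heavily as `junior` and `senior`, so that misclassified `mid` resumes cost more during training and the model is pushed to separate the middle class.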
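
As a rough illustration of what the weighting does, the sketch below computes each class's share of the total sample weight before and after applying the weights. The label counts are made up for the example and are not taken from `cleaned_resumes.csv`.

```python
from collections import Counter

class_weights = {"junior": 1.0, "mid": 3.0, "senior": 1.0}
y_demo = ["junior"] * 50 + ["mid"] * 20 + ["senior"] * 30  # made-up label counts

counts = Counter(y_demo)
raw_shares = {label: counts[label] / len(y_demo) for label in class_weights}

# Each sample contributes its class weight instead of 1
weighted = {label: counts[label] * class_weights[label] for label in class_weights}
total = sum(weighted.values())
weighted_shares = {label: round(w / total, 3) for label, w in weighted.items()}

print(raw_shares)       # {'junior': 0.5, 'mid': 0.2, 'senior': 0.3}
print(weighted_shares)  # {'junior': 0.357, 'mid': 0.429, 'senior': 0.214}
```

In this toy case `mid` goes from 20% of the samples to about 43% of the total weight, which is the quantity the trees' split criteria work with when `class_weight` is set.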


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_embed, y, test_size=0.2, random_state=42, stratify=y
)

class_weights = {'junior': 1.0, 'mid': 3.0, 'senior': 1.0}

rf = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    n_jobs=-1,
    class_weight=class_weights,
)

print(f"Training Random Forest with class weights: {class_weights}")
rf.fit(X_train, y_train)
print("Training complete")


## 7. Evaluation + confusion matrix

We evaluate the Random Forest on the test set and visualize the confusion matrix to see where it confuses `junior`, `mid`, and `senior`.
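
When reading the matrix, rows are true labels and columns are predicted labels (the scikit-learn convention), so per-class recall comes from row totals and precision from column totals. A small sketch with made-up counts, not results from this notebook:

```python
labels = ["junior", "mid", "senior"]
cm = [  # made-up counts; rows = true label, columns = predicted label
    [40, 8, 2],
    [10, 24, 6],
    [1, 9, 30],
]


def per_class_metrics(cm, labels):
    """Recall (diagonal / row total) and precision (diagonal / column total) per class."""
    metrics = {}
    for i, label in enumerate(labels):
        row_total = sum(cm[i])
        col_total = sum(row[i] for row in cm)
        metrics[label] = {
            "recall": cm[i][i] / row_total,
            "precision": cm[i][i] / col_total,
        }
    return metrics


metrics = per_class_metrics(cm, labels)
for label, m in metrics.items():
    print(f"{label}: recall={m['recall']:.2f} precision={m['precision']:.2f}")
```

In this toy matrix the off-diagonal mass sits mostly in the `mid` row and column, which is the pattern to look for when judging whether the extra `mid` weight is helping.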


In [None]:
y_pred = rf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

labels = ['junior', 'mid', 'senior']
cm = confusion_matrix(y_test, y_pred, labels=labels)

plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.title(f'Confusion Matrix: RF on ada-002 embeddings (Acc: {acc:.2%})')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
