# NeuraPath Demo Notebook
## Step 1. Define the Problem
Goal: Given a candidate’s resume (list of skills) and a target job role (aggregated from jobs in selected industries), compute:

- Fit Score (0–1 weighted overlap)
- Matched vs. missing skill categories
- Optional salary context (median/min per role cluster)
- JSON output for the frontend

Primary user story: As a job seeker, I want to see how my skills compare to a target role so I know which skills to develop next.
This notebook runs the rapid ML prototype for Resume→Role fit scoring.


## Step 2. Select a Small Dataset
We’ll subset from your large job tables to keep iteration fast.

### 2A. Pick Role Clusters (editable)
We’ll build 5 starter “roles” by filtering industries whose names contain certain keywords (case‑insensitive). Edit as needed:

| Role Key | Industry Keyword(s) | Notes |
|---|---|---|
| DATA_ANALYST | data, analytics | Data Infra & Analytics industries. |
| ML_ENGINEER | software, data security, technology | Tech/Software heavy. |
| NETWORK_ADMIN | networking, telecommunications, it system | Infra/Net focus. |
| FRONTEND_DEV | software, internet, web | Frontend / web product industries. |
| IT_SUPPORT | it system, consulting, services | IT services / support focus. |

You can change these keywords or provide a list of industry IDs if you prefer.

### 2B. Sample Size Rule
- Up to 5,000 jobs per role (or fewer if limited).
- Drop duplicates.
- Keep only job_ids that appear in job_skills.csv so we have skills to aggregate.
- Merge salary rows if available; they help produce a “market view” per role.
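
The sampling rules above can be sketched as one helper. This is a minimal sketch rather than the notebook's final code: `sample_jobs_per_role`, `max_per_role`, and `seed` are hypothetical names, and it assumes a `job_roles` frame with `job_id` and `role` columns plus a `job_skills` frame with a `job_id` column. Toy frames stand in for the real CSVs, and the salary merge is left out because the salary column names are not specified here.

```python
import pandas as pd

def sample_jobs_per_role(job_roles, job_skills, max_per_role=5000, seed=0):
    """Apply the 2B rules: dedupe, keep jobs that have skills, cap per role."""
    # Drop duplicate (job_id, role) rows.
    jr = job_roles.drop_duplicates(subset=["job_id", "role"])
    # Keep only jobs that have at least one row in job_skills.
    jr = jr[jr["job_id"].isin(job_skills["job_id"])]
    # Shuffle once, then keep at most max_per_role rows per role.
    jr = jr.sample(frac=1.0, random_state=seed)
    return jr.groupby("role").head(max_per_role).reset_index(drop=True)

# Toy data (hypothetical values, for illustration only).
toy_job_roles = pd.DataFrame({
    "job_id": [1, 1, 2, 3, 4, 5],
    "role":   ["DATA_ANALYST", "DATA_ANALYST", "DATA_ANALYST",
               "DATA_ANALYST", "IT_SUPPORT", "IT_SUPPORT"],
})
toy_job_skills = pd.DataFrame({
    "job_id":    [1, 2, 3, 4],
    "skill_abr": ["IT", "IT", "ANLS", "IT"],
})

sampled = sample_jobs_per_role(toy_job_roles, toy_job_skills, max_per_role=2)
print(sampled["role"].value_counts())
```

Job 5 is dropped because it has no skill rows, and DATA_ANALYST is capped at two jobs. Fixing `seed` keeps the subset stable across reruns.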

## Step 4. Build & Train Quickly
Below is a minimal working code skeleton that you can paste into a notebook (neurapath_demo.ipynb). Adjust DATA_DIR if your CSVs live elsewhere.

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import re

DATA_DIR = Path("/mnt/data")

# --- Load core tables ---
skills         = pd.read_csv(DATA_DIR / "skills.csv").assign(skill_name=lambda d: d.skill_name.str.strip())  # skill_abr, skill_name
job_skills     = pd.read_csv(DATA_DIR / "job_skills.csv")          # job_id, skill_abr
industries     = pd.read_csv(DATA_DIR / "industries.csv")          # industry_id, industry_name
job_industries = pd.read_csv(DATA_DIR / "job_industries.csv")      # job_id, industry_id
# salaries.csv is optional: fall back to an empty frame when the file is absent
salaries_path  = DATA_DIR / "salaries.csv"
salaries       = pd.read_csv(salaries_path) if salaries_path.exists() else pd.DataFrame()  # job_id, salary fields


### 4A. Helper: Map industries to roles
Edit ROLE_KEYWORDS to tune grouping.

In [None]:
ROLE_KEYWORDS = {
    "DATA_ANALYST":    ["data", "analytics"],
    "ML_ENGINEER":     ["software", "technology", "security"],
    "NETWORK_ADMIN":   ["network", "telecommunications", "it system"],
    "FRONTEND_DEV":    ["internet", "software", "web"],
    "IT_SUPPORT":      ["consult", "it services", "services"]
}

# build industry → role map
def assign_role_from_industry(industry_name: str) -> list:
    name_l = str(industry_name).lower()
    roles = []
    for role, kws in ROLE_KEYWORDS.items():
        if any(kw in name_l for kw in kws):
            roles.append(role)
    return roles

industries["roles"] = industries["industry_name"].apply(assign_role_from_industry)


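# explode() yields NaN for industries that matched no role; drop those rows
ind_role = industries.explode("roles").dropna(subset=["roles"])
ind_role = ind_role.rename(columns={"roles": "role"})
job_roles = (
    job_industries.merge(ind_role[["industry_id", "role"]], on="industry_id", how="inner")
    .drop_duplicates(subset=["job_id", "role"])  # a job in two industries of one role counts once
)
# job_id may map to multiple roles; keep all for now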
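
Because some keywords (for example "software") appear under more than one role, it may help to check how much the role clusters overlap before building profiles. The sketch below is one way to do that, assuming a frame shaped like `job_roles` (columns `job_id`, `role`). The toy values and the names `jobs_per_role`, `multi_role_share`, and `overlap` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the job_roles frame built above (hypothetical values).
toy_job_roles = pd.DataFrame({
    "job_id": [10, 10, 11, 12, 12, 13],
    "role":   ["ML_ENGINEER", "FRONTEND_DEV", "DATA_ANALYST",
               "ML_ENGINEER", "FRONTEND_DEV", "IT_SUPPORT"],
})

# Distinct jobs per role.
jobs_per_role = toy_job_roles.groupby("role")["job_id"].nunique()

# Share of jobs that were assigned to more than one role.
roles_per_job = toy_job_roles.groupby("job_id")["role"].nunique()
multi_role_share = (roles_per_job > 1).mean()

# Role-by-role matrix: entry (a, b) is the number of jobs mapped to both a and b.
membership = pd.crosstab(toy_job_roles["job_id"], toy_job_roles["role"]).clip(upper=1)
overlap = membership.T @ membership

print(jobs_per_role)
print(f"jobs with multiple roles: {multi_role_share:.0%}")
print(overlap)
```

If two roles share most of their jobs, their skill profiles will look nearly identical, which is a signal to tighten the keywords.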


### 4B. Build per‑role skill profiles
Aggregate skill frequency over all jobs mapped to a role, then mark a skill as “required” when at least required_pct of those jobs list it.

In [None]:
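def build_role_skill_profile(role, min_jobs=50, required_pct=0.30):
    """Skill frequencies for one role; empty frame if fewer than min_jobs jobs have skills."""
    job_ids = job_roles.loc[job_roles.role == role, "job_id"].unique()
    # Count each (job, skill) pair once; jobs without skill rows drop out here
    js = job_skills[job_skills.job_id.isin(job_ids)].drop_duplicates(["job_id", "skill_abr"])
    n_jobs = js.job_id.nunique()
    if n_jobs < max(min_jobs, 1):
        return pd.DataFrame()  # too few jobs for a stable profile
    # rename_axis + reset_index(name=...) yields the same columns on pandas 1.x and 2.x
    freq = js.skill_abr.value_counts().rename_axis("skill_abr").reset_index(name="count")
    freq["pct_jobs"] = freq["count"] / n_jobs  # fraction of jobs (0-1) listing the skill
    freq = freq.merge(skills, on="skill_abr", how="left")
    freq["required"] = freq["pct_jobs"] >= required_pct
    freq["role"] = role
    return freq.sort_values("pct_jobs", ascending=False).reset_index(drop=True)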

role_profiles = {r: build_role_skill_profile(r) for r in ROLE_KEYWORDS}


### 4C. Resume skill extraction (keyword match prototype)
This prototype derives one regex per skill category from the category name; expand it with synonyms from real resumes.

In [None]:
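# Build regex per skill category from its name (very rough prototype).
# Non-alphanumeric runs become "|", so a multi-word name matches on any of its words.
# Strip leading/trailing "|": an empty alternative would match every resume.
skills["pattern"] = (
    skills["skill_name"].fillna("").str.lower()
    .str.replace(r"[^a-z0-9]+", "|", regex=True)
    .str.strip("|")
)

def extract_resume_skills(resume_text: str) -> set:
    """Return the skill_abr codes whose name words appear in the resume text."""
    text = resume_text.lower()
    found = set()
    for _, row in skills.iterrows():
        pat = row.pattern
        if not pat:
            continue  # skill name had no usable words
        if re.search(r"\b(?:" + pat + r")\b", text):
            found.add(row.skill_abr)
    return found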


## Step 5. Evaluate with Simple Metrics
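We lack gold labels, so we bootstrap with a heuristic: a weighted overlap score in which each required skill counts required_weight times as much as an optional one.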

In [None]:
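def fit_score(resume_skills, role_profile, required_weight=2.0):
    """Weighted overlap in [0, 1]; each required skill counts required_weight times."""
    if role_profile.empty:
        return 0.0  # no profile (e.g. too few jobs), so nothing to score against
    resume_skills = set(resume_skills)  # count each resume skill once
    req = set(role_profile.loc[role_profile.required, "skill_abr"])
    opt = set(role_profile.loc[~role_profile.required, "skill_abr"])
    score = 0.0
    denom = required_weight*len(req) + len(opt)
    for s in resume_skills:
        if s in req:
            score += required_weight
        elif s in opt:
            score += 1
    return score / denom if denom else 0.0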
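
In formula form, the score that `fit_score` computes can be written as follows. The symbols are introduced here for illustration: $S$ is the set of resume skills, $R$ the role's required skills, $O$ its optional skills (disjoint from $R$ by construction), and $w$ is `required_weight`.

$$
\text{fit}(S) = \frac{w\,|S \cap R| + |S \cap O|}{w\,|R| + |O|}
$$

As a worked example, with $w = 2$, $|R| = 2$, $|O| = 3$, and a resume matching one required and two optional skills, the score is $(2 \cdot 1 + 2) / (2 \cdot 2 + 3) = 4/7 \approx 0.57$.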


Quick Diagnostic Metrics (no labels yet):

- Coverage % = len(resume_skills ∩ role_skills) / len(role_skills)
- Required coverage % = same but using required only
- Compare multiple resumes; rank roles; check if top role matches user intent

If you can hand‑tag 10 resume→role pairs, compute Precision/Recall & confusion matrix for the logistic model later.
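
The two coverage metrics above can be sketched as a small helper. `coverage_metrics` and its argument names are hypothetical, and the skill codes in the example are made up. It returns fractions in [0, 1]; multiply by 100 for percentages.

```python
def coverage_metrics(resume_skills, role_skills, required_skills):
    """Return (coverage, required_coverage) as fractions in [0, 1]."""
    resume_skills = set(resume_skills)
    role_skills = set(role_skills)
    required_skills = set(required_skills)
    # Guard against empty role sets to avoid division by zero.
    coverage = (
        len(resume_skills & role_skills) / len(role_skills) if role_skills else 0.0
    )
    required_coverage = (
        len(resume_skills & required_skills) / len(required_skills)
        if required_skills else 0.0
    )
    return coverage, required_coverage

# Toy example (hypothetical skill codes).
cov, req_cov = coverage_metrics(
    resume_skills={"ANLS", "IT", "FIN"},
    role_skills={"ANLS", "IT", "RSCH", "ENG"},
    required_skills={"ANLS", "RSCH"},
)
print(cov, req_cov)  # 0.5 0.5
```

With a real profile, `role_skills` would be the profile's `skill_abr` column and `required_skills` the subset where `required` is true.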

## Step 6. Visualize & Present Results
Minimal, readable visualizations:

In [None]:
import matplotlib.pyplot as plt

def plot_gap(resume_skills, role_profile, top_n=15):
    """Horizontal bars of the role's top skills: green = in resume, red = missing."""
    df = role_profile.copy()
    df["in_resume"] = df["skill_abr"].isin(resume_skills)
    df = df.sort_values(["required","pct_jobs"], ascending=[False,False]).head(top_n)
    labels = df["skill_name"].fillna(df["skill_abr"])  # fall back to the code if unnamed
    colors = df["in_resume"].map({True:"green", False:"red"}).tolist()
    plt.barh(labels, df["pct_jobs"] * 100, color=colors)  # pct_jobs is a 0-1 fraction
    plt.gca().invert_yaxis()
    plt.xlabel("% of jobs listing skill")
    plt.title(f"Resume vs Role: {df['role'].iloc[0]}")
    plt.show()


In [None]:
def build_fit_report(resume_id, resume_text, role):
    """Build a JSON-serializable fit report for one resume against one role."""
    rskills = extract_resume_skills(resume_text)
    prof = role_profiles[role]
    score = fit_score(rskills, prof)
    in_resume = prof.skill_abr.isin(rskills)
    matched  = prof.loc[in_resume, "skill_name"].tolist()
    # Only required skills are reported as missing; optional gaps are omitted
    missing  = prof.loc[~in_resume & prof.required, "skill_name"].tolist()
    return {
        "resume_id": resume_id,
        "role": role,
        "fit_score": round(score, 2),
        "matched_skills": matched,
        "missing_skills": missing,
    }
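
For the frontend, the dictionary returned by `build_fit_report` can be serialized with the standard `json` module. The sketch below uses a hand-written report with hypothetical values in the same shape. Passing `allow_nan=False` is a design choice here: it makes `json.dumps` raise instead of emitting `NaN`, which is not valid strict JSON and could occur if a skill has no name.

```python
import json

# Hypothetical report in the shape returned by build_fit_report.
report = {
    "resume_id": "r001",
    "role": "DATA_ANALYST",
    "fit_score": 0.57,
    "matched_skills": ["Analyst", "Information Technology"],
    "missing_skills": ["Research"],
}

# allow_nan=False raises ValueError on NaN instead of writing invalid JSON.
payload = json.dumps(report, indent=2, ensure_ascii=False, allow_nan=False)
restored = json.loads(payload)
print(payload)
```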


## Step 7. Get Feedback & Iterate
Iteration loop:

- Demo to team: do the required skill lists make sense?
- Adjust industry keywords → role definitions.
- Add manual synonym table per skill.
- Collect 5 real resumes; check which skills are missed → improve regex.
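
The manual synonym table mentioned in the loop above could look like the sketch below. Everything in it is an assumption for illustration: the `SKILL_SYNONYMS` entries, the skill codes, and the helper names `compile_synonyms` and `extract_with_synonyms` are hypothetical and should be replaced with phrases observed in real resumes.

```python
import re

# Hypothetical synonym table: skill code -> phrases that imply the skill.
SKILL_SYNONYMS = {
    "ANLS": ["data analysis", "analytics", "power bi", "sql"],
    "IT":   ["python", "cloud", "azure"],
    "FIN":  ["financial modeling", "budgeting"],
}

def compile_synonyms(synonyms):
    """Compile one case-insensitive, word-bounded regex per skill code."""
    compiled = {}
    for code, phrases in synonyms.items():
        # Escape each phrase; longest first so longer phrases are tried first.
        alts = sorted((re.escape(p) for p in phrases if p.strip()),
                      key=len, reverse=True)
        if alts:  # skip codes with no usable phrases
            compiled[code] = re.compile(
                r"(?<!\w)(?:" + "|".join(alts) + r")(?!\w)", re.IGNORECASE
            )
    return compiled

def extract_with_synonyms(resume_text, compiled):
    """Return the skill codes with at least one synonym in the text."""
    return {code for code, rx in compiled.items() if rx.search(resume_text)}

found = extract_with_synonyms(
    "SQL reporting, Power BI dashboards, Excel financial modeling",
    compile_synonyms(SKILL_SYNONYMS),
)
print(sorted(found))  # ['ANLS', 'FIN']
```

The result could be unioned with the output of `extract_resume_skills`, so that synonyms add matches without replacing the name-based ones.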

## Quick End‑to‑End Demo (Sample Use)

In [None]:
# Example resume snippet (replace with your text)
resume_txt = """
Data analysis with SQL, Power BI dashboards, Python (pandas, scikit-learn),
report automation, stakeholder reporting, Excel financial modeling, basic cloud (Azure).
"""

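# Try all roles and rank (roles with an empty profile are skipped)
reports = []
for role, prof in role_profiles.items():
    if prof.empty:
        continue
    rep = build_fit_report("r001", resume_txt, role)
    reports.append(rep)

# Explicit columns keep the sort valid even when no role produced a report
report_cols = ["resume_id", "role", "fit_score", "matched_skills", "missing_skills"]
pd.DataFrame(reports, columns=report_cols).sort_values("fit_score", ascending=False)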


In [None]:
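if reports:
    best_role = max(reports, key=lambda r: r["fit_score"])["role"]
    plot_gap(extract_resume_skills(resume_txt), role_profiles[best_role])
else:
    # max() would raise on an empty list, so report the cause instead
    print("No role profile was built; check ROLE_KEYWORDS and the input tables.")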
