# Training Pipeline: Build Role Knowledge Base & Artifacts

This notebook loads your UTF-8 CSV and builds:
- A FAISS embedding index over role guidance text
- A TF-IDF job-role matcher
- A unified skills vocabulary
- Per-role prompt skeletons

Artifacts are saved to `../artifacts/` and used by the Streamlit app.

In [1]:
import os
import sys
import pandas as pd
from pathlib import Path

ROOT = Path.cwd().resolve().parent
DATA = ROOT / 'data'
ART = ROOT / 'artifacts'
print('ROOT:', ROOT)
print('DATA:', DATA)
print('ART:', ART)

os.makedirs(ART, exist_ok=True)

ROOT: /Users/aaronrao/Desktop/projects/HCL
DATA: /Users/aaronrao/Desktop/projects/HCL/data
ART: /Users/aaronrao/Desktop/projects/HCL/artifacts


In [2]:
!pip -q install -r requirements.txt

## 1) Load CSV
Place your CSV at `../data/jobs.csv`, or set `CSV_PATH` below to wherever the file lives.

In [3]:
from pathlib import Path
import pandas as pd

CSV_PATH = Path("/Users/aaronrao/Desktop/projects/HCL/smart resume viewer/sample_jobs.csv")

# Read CSV with UTF-8 encoding
df = pd.read_csv(CSV_PATH, encoding="utf-8")

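# Drop characters that cannot be encoded as UTF-8.
# Only real strings are touched, so missing values (NaN) are not turned into the text 'nan'.
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].map(
        lambda x: x.encode('utf-8', errors='ignore').decode('utf-8') if isinstance(x, str) else x
    )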

# Check the first few rows
df.head()

| Unnamed: 0 | job_position | relevant_skills | required_qualifications | job_responsibilities | ideal_candidate_summary |
|---|---|---|---|---|---|
| 0 | Hearing Care Provider | Hearing Aid Dispensing License. Audiometric eq... | Maintain an active Hearing Aid Dispensing Lice... | Provide quality care and aftercare of dispensi... | The ideal candidate is a skilled professional ... |
| 1 | Shipping & Receiving Associate 2nd shift | Material handling. Forklift operation. Attenti... | 2+ years previous inbound/outbound logistics e... | Pack and prepare customer orders for shipping.... | The ideal candidate is detail-oriented, self-m... |
| 2 | Engineering Manager | Project management. Engineering design. Automo... |  | Plan and direct engineering directives. Establ... | The ideal candidate will have strong leadershi... |
| 3 | Cook | Culinary skills. Food preparation. Recipe adhe... | Associates degree in culinary arts. 3 years of... | Oversee the work of kitchen staff. Ensure prot... | A dedicated culinary professional with experie... |
| 4 | Principal Cloud Security Engineer/Architect | AWS Security Pillar. Cloud security architectu... | 8+ years of experience securing enterprise-sca... | Define, build, and maintain cloud security pol... | A self-starter with a strong work ethic, excel... |
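
Before building artifacts, it can help to confirm that the CSV really contains the columns shown in the preview above and to see how many cells are empty. The following is a minimal sketch using only the standard library. The `check_columns` helper and the in-memory sample rows are hypothetical and are not part of the project code.

```python
import csv
import io

# Column names taken from the df.head() preview above.
REQUIRED_COLUMNS = [
    "job_position",
    "relevant_skills",
    "required_qualifications",
    "job_responsibilities",
    "ideal_candidate_summary",
]

def check_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return (missing_columns, empty_counts) for an open CSV file object."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in required if c not in header]
    present = [c for c in required if c in header]
    empty_counts = {c: 0 for c in present}
    for row in reader:
        for c in present:
            # Short rows yield None for absent fields; treat those as empty too.
            if not (row[c] or "").strip():
                empty_counts[c] += 1
    return missing, empty_counts

# Tiny in-memory sample standing in for the real file (hypothetical rows).
sample = io.StringIO(
    "job_position,relevant_skills,required_qualifications,"
    "job_responsibilities,ideal_candidate_summary\n"
    "Cook,Culinary skills. Food preparation.,Associates degree,Oversee staff,Dedicated\n"
    "Engineering Manager,Project management.,,Plan directives,Strong leader\n"
)
missing, empty_counts = check_columns(sample)
print("missing columns:", missing)
print("empty cells per column:", empty_counts)
```

To run the same check on the real file, open it with `open(CSV_PATH, newline="", encoding="utf-8")` and pass the file object to `check_columns`.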


## 2) Build & Save Artifacts
The `JDIndex` helper produces the FAISS index, the TF-IDF matcher, the skills vocabulary, and the role prompts.

In [4]:
from pathlib import Path
import sys
import os

# Project root
ROOT = Path("/Users/aaronrao/Desktop/projects/HCL/smart resume viewer")

# Add 'app' folder to sys.path so 'components' can be imported
sys.path.insert(0, str(ROOT / "app"))

# Ensure artifacts folder exists
ART = ROOT / "artifacts"
ART.mkdir(exist_ok=True)

In [5]:
from components.jd_index import JDIndex

index = JDIndex(max_features=5000)
index.build_from_csv(
    "/Users/aaronrao/Desktop/projects/HCL/smart resume viewer/sample_jobs.csv",
    sample_size=None  # full 33k rows
)

[jd_index] Artifact directory available at: /Users/aaronrao/Desktop/projects/HCL/smart resume viewer/artifacts
📌 Reading CSV in chunks...


📂 Processing chunks: 17it [04:38, 16.37s/it]


✅ Total valid records: 22706
📌 Training classifier with progress bar...


Training epochs: 100%|██████████| 5/5 [03:03<00:00, 36.76s/it]


✅ Artifacts saved in /Users/aaronrao/Desktop/projects/HCL/smart resume viewer/artifacts, total records: 22706
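
The internals of `JDIndex` are not shown in this notebook, so the following is only an illustration of how a unified skills vocabulary could be derived from the `relevant_skills` column. It assumes skills are separated by a period followed by whitespace, as the sample rows suggest. The function names `split_skills` and `build_vocab` and the `min_count` parameter are hypothetical.

```python
import re
from collections import Counter

def split_skills(cell):
    """Split a period-separated skills string into lowercase skill names."""
    if not isinstance(cell, str):
        return []  # missing values (None, NaN) contribute nothing
    # Split on a period followed by whitespace or end of string,
    # so names such as "Node.js" stay in one piece.
    parts = re.split(r"\.(?:\s+|$)", cell)
    return [p.strip().lower() for p in parts if p.strip()]

def build_vocab(skill_cells, min_count=1):
    """Return the sorted skills that occur in at least `min_count` rows."""
    counts = Counter()
    for cell in skill_cells:
        counts.update(set(split_skills(cell)))  # count each skill once per row
    return sorted(s for s, n in counts.items() if n >= min_count)

# Hypothetical rows in the same style as the CSV preview.
rows = [
    "Material handling. Forklift operation.",
    "Project management. Material handling.",
    None,
]
vocab = build_vocab(rows)
print(vocab)
```

Raising `min_count` is one way to drop one-off phrases that are unlikely to be useful as vocabulary entries.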


## 3) Quick Sanity Test
Query the index with a short text and predict the closest role.

In [6]:
from components.jd_index import JDIndex

# Initialize and load trained artifacts
index = JDIndex()
index.load()

# Test input text
test_text = "Built classification models in Python and scikit-learn; dashboarded KPIs in SQL; A/B testing."

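# Top-3 nearest roles from FAISS.
# The scores printed below increase down the list, which suggests they are distances (lower = closer).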
print("Top-3 similar roles:")
for r in index.query(test_text, k=3):
    print(r['job_position'], r['score'])

# Predicted role with probability from the SGDClassifier
role, proba = index.match_role(test_text)
print("\nPredicted role:", role, "with probability:", proba)

✅ Artifacts loaded successfully
Top-3 similar roles:
Senior Test Engineer 1.2922947
IT Generalist/Python 1.4633856
Full-Stack Software Developer 1.4807752

Predicted role: Software Engineer with probability: 0.00038047357649675386
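
The three scores above rise from the first hit to the last, which is consistent with a distance where smaller means closer. Under that assumption, a small helper can rank hits and drop weak ones. This is a sketch only: `closest_roles` and the `max_distance` cutoff of 1.47 are hypothetical, and the dictionaries mirror the `job_position` and `score` keys used in the loop above.

```python
def closest_roles(results, max_distance=None):
    """Sort hits by ascending score, treating the score as a distance."""
    ranked = sorted(results, key=lambda r: r["score"])
    if max_distance is not None:
        ranked = [r for r in ranked if r["score"] <= max_distance]
    return ranked

# Values copied from the output above, deliberately shuffled.
hits = [
    {"job_position": "Full-Stack Software Developer", "score": 1.4807752},
    {"job_position": "Senior Test Engineer", "score": 1.2922947},
    {"job_position": "IT Generalist/Python", "score": 1.4633856},
]
best = closest_roles(hits, max_distance=1.47)
for r in best:
    print(f"{r['job_position']}: {r['score']:.3f}")
```

If the index turns out to return similarities rather than distances, the sort order and the comparison would need to be reversed.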


## 4) Next Steps
- Launch the UI: `streamlit run app/app.py`
- Provide API keys in `.env` if you want to use cloud LLMs.
- Customize prompts in `artifacts/role_prompts.json`.
- Extend ATS rules in `app/components/ats_scoring.py`.
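
For the prompt customization step, one option is to edit `artifacts/role_prompts.json` from Python rather than by hand. The file's schema is not shown in this notebook, so the sketch below assumes a flat JSON object that maps a role name to a prompt string. The `update_role_prompt` helper is hypothetical, and the demo writes to a temporary file so the real artifacts are not touched.

```python
import json
import tempfile
from pathlib import Path

def update_role_prompt(path, role, prompt):
    """Set the prompt for one role and write the JSON file back."""
    path = Path(path)
    # Assumed schema: {"<role name>": "<prompt text>", ...}
    prompts = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    prompts[role] = prompt
    path.write_text(
        json.dumps(prompts, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return prompts

# Demo against a temporary copy with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "role_prompts.json"
    demo_path.write_text(json.dumps({"Cook": "old prompt"}), encoding="utf-8")
    updated = update_role_prompt(demo_path, "Cook", "Focus on food safety experience.")
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))

print(updated)
```

Check the actual structure of the generated file before using this, since nested objects or lists would need a different update step.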