
# 04 — Feature Engineering (Phase 4)

**Goal:** Transform `cleaned_master.csv` into modeling‑ready features for salary prediction, forecasting, clustering, and recommendations.

**Inputs**
- `data/processed/cleaned_master.csv` (or fallback: uploaded `cleaned_master.csv`)

**Outputs**
- `data/processed/X_train.csv`, `X_test.csv`, `y_train.csv`, `y_test.csv`
- `feature_columns.csv` (ordered list of feature names used in X)
- `skill_vocab.json` (frozen list of skills encoded)
- `preprocessing_config.json` (parameters for reproducibility)

**Covers**
- Temporal features (`posting_year`, `posting_month`, etc.)
- Skill features (tokenization, multi‑label one‑hot for top‑K skills)
- Salary features (`salary_midpoint`, `log_salary_midpoint`, IQR capping)
- Categorical encodings (frequency / limited one‑hot)
- Geographic & work setting features
- Train/test split for downstream notebooks

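> Re-run this notebook whenever you update the cleaned dataset; it regenerates all features and artifacts. The train/test split uses a fixed `random_state`, but `days_since_posting` is measured from the run date, so that column changes between runs.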


In [2]:
%pip install -q pandas numpy matplotlib seaborn scikit-learn xgboost prophet shap plotly

Note: you may need to restart the kernel to use updated packages.


In [3]:

# Time: O(n * (k + f)); Space: O(n * (k + f))
# n = rows, k = top-K skills, f = number of final feature columns
from __future__ import annotations

import ast
import json
import math
import re
from pathlib import Path
from typing import List, Dict, Any

import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

# Paths (edit REPO_ROOT if running outside repo)
REPO_ROOT = Path('..').resolve()
DATA_PROCESSED = REPO_ROOT / 'data' / 'processed'
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

# In this environment we also save to /mnt/data for quick download
SANDBOX = Path('/mnt/data')
SANDBOX.mkdir(parents=True, exist_ok=True)

INPUT_CANDIDATES = [
    DATA_PROCESSED / 'cleaned_master.csv',
    SANDBOX / 'cleaned_master.csv',  # uploaded file fallback
]

DQ_REPORT_CANDIDATES = [
    REPO_ROOT / 'reports' / 'data_quality_report.json',
    SANDBOX / 'data_quality_report.json'
]

print('Repo root  :', REPO_ROOT)
print('Processed  :', DATA_PROCESSED)
print('Sandbox    :', SANDBOX)


Repo root  : C:\Users\tdmne\OneDrive\Desktop\Projects 2\datathon-2025
Processed  : C:\Users\tdmne\OneDrive\Desktop\Projects 2\datathon-2025\data\processed
Sandbox    : \mnt\data


## Load dataset

In [4]:

def find_first_existing(paths):
    for p in paths:
        if p.exists():
            return p
    return None

INPUT_PATH = find_first_existing(INPUT_CANDIDATES)
if INPUT_PATH is None:
    raise FileNotFoundError('Could not find cleaned_master.csv in expected locations.')

DQ_PATH = find_first_existing(DQ_REPORT_CANDIDATES)

print('Using cleaned dataset :', INPUT_PATH)
if DQ_PATH:
    print('Using DQ report       :', DQ_PATH)

master = pd.read_csv(INPUT_PATH)
print('Shape:', master.shape)
master.head(3)


Using cleaned dataset : C:\Users\tdmne\OneDrive\Desktop\Projects 2\datathon-2025\data\processed\cleaned_master.csv
Using DQ report       : C:\Users\tdmne\OneDrive\Desktop\Projects 2\datathon-2025\reports\data_quality_report.json
Shape: (36167, 29)




Unnamed: 0,work_year,job_title,job_category,salary_currency,salary,salary_in_usd,employee_residence,experience_level,employment_type,work_setting,...,required_skills,education_required,years_experience,industry,posting_date,application_deadline,job_description_length,benefits_score,company_name,salary_local
0,2022.0,Machine Learning Engineer in office,Analysis,EUR,186597.0,136086.0,US,MI,CT,Remote,...,,,,,,,,,,
1,2020.0,Statistician (Remote),ML/AI,JPY,110630.0,67982.0,JP,EX,FL,Remote,...,,,,,,,,,,
2,2022.0,Machine Learning Engineer,ML/AI,INR,61280.0,153309.0,UK,MI,CT,Hybrid,...,,,,,,,,,,


## Inspect schema & basic hygiene

In [5]:

master_columns = list(master.columns)
print('Columns:', master_columns)

# Ensure expected columns exist (create if missing)
def ensure_col(df: pd.DataFrame, name: str, default=np.nan):
    if name not in df.columns:
        df[name] = default

expected_cols = [
    'posting_date','required_skills','job_description','job_title','job_category',
    'experience_level','employment_type','company_size','company_location','employee_residence',
    'remote_ratio','salary_min','salary_max','salary_usd','salary_midpoint','skill_count'
]
for c in expected_cols:
    ensure_col(master, c)

# Parse dates if present
if pd.api.types.is_object_dtype(master['posting_date']):
    master['posting_date'] = pd.to_datetime(master['posting_date'], errors='coerce')

# Normalize text-ish columns lightly
def normalize_text(s):
    if pd.isna(s): return np.nan
    s = str(s).strip()
    s = re.sub(r'\s+', ' ', s)
    return s

for c in ['job_title','job_category','company_size','company_location','employee_residence','experience_level','employment_type']:
    master[c] = master[c].apply(normalize_text)

master.head(3)


Columns: ['work_year', 'job_title', 'job_category', 'salary_currency', 'salary', 'salary_in_usd', 'employee_residence', 'experience_level', 'employment_type', 'work_setting', 'company_location', 'company_size', '__source__', 'job_id', 'category', 'job_description', 'job_skill_set', 'salary_usd', 'remote_ratio', 'required_skills', 'education_required', 'years_experience', 'industry', 'posting_date', 'application_deadline', 'job_description_length', 'benefits_score', 'company_name', 'salary_local']


Unnamed: 0,work_year,job_title,job_category,salary_currency,salary,salary_in_usd,employee_residence,experience_level,employment_type,work_setting,...,posting_date,application_deadline,job_description_length,benefits_score,company_name,salary_local,salary_min,salary_max,salary_midpoint,skill_count
0,2022.0,Machine Learning Engineer in office,Analysis,EUR,186597.0,136086.0,US,MI,CT,Remote,...,NaT,,,,,,,,,
1,2020.0,Statistician (Remote),ML/AI,JPY,110630.0,67982.0,JP,EX,FL,Remote,...,NaT,,,,,,,,,
2,2022.0,Machine Learning Engineer,ML/AI,INR,61280.0,153309.0,UK,MI,CT,Hybrid,...,NaT,,,,,,,,,


## Read data_quality_report.json (optional)

In [6]:

dq = {}
if DQ_PATH and DQ_PATH.exists():
    try:
        dq = json.loads(Path(DQ_PATH).read_text())
    except Exception as e:
        print('Failed reading DQ report:', e)

print(json.dumps(dq, indent=2)[:1000] + ('...' if dq else ''))


{
  "rows": 36167,
  "cols": 36,
  "duplicates": 0,
  "completeness": {
    "work_year": 0.1382,
    "job_title": 1.0,
    "job_category": 0.1244,
    "salary_currency": 0.9539,
    "salary": 0.1382,
    "salary_in_usd": 0.1382,
    "employee_residence": 0.9677,
    "experience_level": 0.9539,
    "employment_type": 0.9677,
    "work_setting": 0.1382,
    "company_location": 0.9677,
    "company_size": 0.9539,
    "__source__": 1.0,
    "salary_min": 0.0,
    "salary_max": 0.0,
    "salary_usd": 0.8295,
    "remote_ratio": 0.8295,
    "posting_date": 0.8295,
    "required_skills": 0.8295,
    "years_experience": 0.8295,
    "industry": 0.8295,
    "application_deadline": 0.8295,
    "job_description": 0.0323,
    "company_name": 0.8295,
    "job_id": 0.8618,
    "category": 0.0323,
    "job_skill_set": 0.0323,
    "education_required": 0.8295,
    "job_description_length": 0.8295,
    "benefits_score": 0.8295,
    "salary_local": 0.4147,
    "salary_midpoint": 0.8295,
    "skill_count"

## Target engineering: salary_midpoint → log transform

In [7]:
# --- Robust target construction for salary_midpoint ---

def to_num(x):
    if pd.isna(x): 
        return np.nan
    s = str(x)
    # strip currency, commas, and spaces
    s = re.sub(r'[^\d.\-eE]', '', s)
    try:
        v = float(s)
        return v if np.isfinite(v) else np.nan
    except ValueError:  # float() rejects empty or malformed strings
        return np.nan

# choose source columns gracefully
has_mid = 'salary_midpoint' in master
has_usd = 'salary_usd' in master or 'salary_in_usd' in master
has_minmax = 'salary_min' in master and 'salary_max' in master

if not has_mid:
    master['salary_midpoint'] = np.nan

# build midpoint in priority order: existing → min/max → *_in_usd/usd
if master['salary_midpoint'].isna().all():
    if has_minmax:
        smin = master['salary_min'].apply(to_num)
        smax = master['salary_max'].apply(to_num)
        mid = (smin + smax) / 2.0
        master['salary_midpoint'] = mid
    if master['salary_midpoint'].isna().all() and has_usd:
        usd_col = 'salary_usd' if 'salary_usd' in master else 'salary_in_usd'
        master['salary_midpoint'] = master[usd_col].apply(to_num)

# filter invalids
master.loc[~np.isfinite(master['salary_midpoint']) | (master['salary_midpoint'] <= 0), 'salary_midpoint'] = np.nan

# if still empty, stop early with a helpful error
if master['salary_midpoint'].dropna().empty:
    raise ValueError(
        "No usable salary values found. "
        "Check that at least one of ['salary_midpoint', 'salary_min'+'salary_max', 'salary_usd'/'salary_in_usd'] "
        "exists with numeric/parsible values (remove currency symbols/commas in cleaning)."
    )

# winsorize (cap) with IQR
def iqr_cap(series: pd.Series, k: float = 1.5):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series.clip(lower, upper)

cap = iqr_cap(master['salary_midpoint'].dropna())
master['salary_midpoint_capped'] = cap.reindex(master.index)

# log1p target
master['log_salary_midpoint'] = np.log1p(master['salary_midpoint_capped'])

print(master[['salary_midpoint','salary_midpoint_capped','log_salary_midpoint']].describe())
print("non-null targets:", master['log_salary_midpoint'].notna().sum())


       salary_midpoint  salary_midpoint_capped  log_salary_midpoint
count     30000.000000            30000.000000         30000.000000
mean     118670.451700           117610.103133            11.549797
std       62229.977054            59094.547731             0.509465
min       16621.000000            16621.000000             9.718482
25%       72575.750000            72575.750000            11.192400
50%      103206.500000           103206.500000            11.544497
75%      150921.750000           150921.750000            11.924523
max      410273.000000           268440.750000            12.500389
non-null targets: 30000
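
The capped maximum in the table above matches the upper fence Q3 + 1.5 × IQR = 150921.75 + 1.5 × 78346 = 268440.75. To make the capping and log steps concrete, here is a small sketch on made-up salaries (illustrative values, not drawn from the dataset). It repeats the `iqr_cap` helper and shows that `np.expm1` inverts the `np.log1p` target.

```python
import numpy as np
import pandas as pd

def iqr_cap(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(q1 - k * iqr, q3 + k * iqr)

# Illustrative salaries with one extreme value (not real data)
toy = pd.Series([50_000.0, 60_000.0, 70_000.0, 80_000.0, 1_000_000.0])

capped = iqr_cap(toy)            # Q1=60k, Q3=80k, so the upper fence is 110k
log_target = np.log1p(capped)    # modeling target
recovered = np.expm1(log_target) # back to the capped salary scale
```

Only the extreme value is pulled in to the fence; the other four salaries pass through unchanged.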


## Temporal features from `posting_date`

In [9]:
# === Compact & reliable temporal feature builder ===
import pandas as pd
import numpy as np

# Parse posting_date column robustly
if 'posting_date' in master.columns:
    # Try ISO / standard formats first
    dt = pd.to_datetime(master['posting_date'], errors='coerce')
    
    # If many NaT (>50%), try day-first formats
    if dt.isna().mean() > 0.5:
        dt_alt = pd.to_datetime(master['posting_date'], errors='coerce', dayfirst=True)
        if dt_alt.notna().sum() > dt.notna().sum():
            dt = dt_alt
else:
    master['posting_date'] = np.nan
    dt = pd.Series(pd.NaT, index=master.index)

# Build simple temporal features
master['posting_year']       = dt.dt.year
master['posting_month']      = dt.dt.month
master['posting_quarter']    = dt.dt.quarter
master['posting_dayofweek']  = dt.dt.dayofweek
master['days_since_posting'] = (pd.Timestamp.today().normalize() - dt.dt.normalize()).dt.days

print("✅ Parsed % valid dates:", round(100 * dt.notna().mean(), 2))
master[['posting_year','posting_month','posting_quarter','posting_dayofweek','days_since_posting']].head()


✅ Parsed % valid dates: 82.95


Unnamed: 0,posting_year,posting_month,posting_quarter,posting_dayofweek,days_since_posting
0,,,,,
1,,,,,
2,,,,,
3,,,,,
4,,,,,
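
`days_since_posting` above is measured from `pd.Timestamp.today()`, so its values shift with the run date. If reproducible outputs matter, one option is to pin a reference date. The sketch below assumes a hypothetical `REFERENCE_DATE` constant that is not part of the notebook.

```python
import pandas as pd

# Hypothetical fixed reference date; choose one that suits the dataset
REFERENCE_DATE = pd.Timestamp("2025-01-01")

# Toy posting dates, including a missing value
dates = pd.to_datetime(pd.Series(["2024-12-31", "2024-01-01", None]), errors="coerce")

# Same arithmetic as the notebook, but anchored to the fixed date
days_since = (REFERENCE_DATE - dates.dt.normalize()).dt.days
```

Missing dates stay missing (`NaN`) rather than receiving an arbitrary offset.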


## Work setting / remote features

In [10]:

# Remote ratio buckets: 0, 50, 100 or NaN → categorical
def bucket_remote(r):
    try:
        r = float(r)
    except Exception:
        return 'unknown'
    if r <= 0: return 'onsite'
    if r < 100: return 'hybrid'
    return 'remote'

master['remote_bucket'] = master['remote_ratio'].apply(bucket_remote)

# Simple one-hot for remote bucket
remote_ohe = pd.get_dummies(master['remote_bucket'], prefix='remote')
master = pd.concat([master, remote_ohe], axis=1)

remote_ohe.columns.tolist()[:5], master['remote_bucket'].value_counts(dropna=False).to_dict()


(['remote_hybrid', 'remote_onsite', 'remote_remote'],
 {'remote': 16121, 'onsite': 10050, 'hybrid': 9996})
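
Note that `float(np.nan)` does not raise, so a missing `remote_ratio` fails both comparisons in `bucket_remote` and falls through to `'remote'`. This is consistent with the counts above, which sum to all 36167 rows with no `'unknown'` bucket. A NaN-aware variant could look like the sketch below (`bucket_remote_safe` is a hypothetical name, not part of the notebook).

```python
import math

def bucket_remote_safe(r) -> str:
    """Bucket a remote ratio, sending missing or unparseable values to 'unknown'."""
    try:
        r = float(r)
    except (TypeError, ValueError):
        return 'unknown'
    if math.isnan(r):  # float('nan') parses fine, so check explicitly
        return 'unknown'
    if r <= 0:
        return 'onsite'
    if r < 100:
        return 'hybrid'
    return 'remote'

examples = [bucket_remote_safe(v) for v in (0, 50, 100, float('nan'), None, 'n/a')]
```

Adopting this would add a `remote_unknown` one-hot column, so downstream column counts would change.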

## Experience level → ordinal encoding

In [12]:
# Kaggle-style experience levels: IN, EN, MI, SE, EX
exp_map = {
    'IN': 0,   # Intern
    'EN': 1,   # Entry-level / Junior
    'MI': 2,   # Mid-level
    'SE': 3,   # Senior
    'EX': 4    # Executive
}

def map_exp(x):
    if pd.isna(x):
        return np.nan
    s = str(x).strip().upper()
    return exp_map.get(s, np.nan)

master['experience_level_ord'] = master['experience_level'].apply(map_exp)
master[['experience_level','experience_level_ord']].head(10)


Unnamed: 0,experience_level,experience_level_ord
0,MI,2.0
1,EX,4.0
2,MI,2.0
3,SE,3.0
4,MI,2.0
5,MI,2.0
6,EX,4.0
7,EX,4.0
8,,
9,EN,1.0


## Frequency encoding: location, company_size, job_title (top‑limited)

In [13]:

def freq_encode(series: pd.Series) -> pd.Series:
    freq = series.value_counts(dropna=True)
    return series.map(freq).fillna(0).astype(float)

master['loc_freq'] = freq_encode(master['company_location'].astype(str))
master['company_size_freq'] = freq_encode(master['company_size'].astype(str))

# For very high‑cardinality job titles, keep top N as one‑hot, rest → 'other'
TOP_TITLES = 30
top_titles = master['job_title'].value_counts().head(TOP_TITLES).index.tolist()
master['job_title_limited'] = master['job_title'].where(master['job_title'].isin(top_titles), other='other')
job_ohe = pd.get_dummies(master['job_title_limited'], prefix='title')
master = pd.concat([master, job_ohe], axis=1)

len(top_titles), top_titles[:10]


(30,
 ['Machine Learning Engineer',
  'Data Engineer',
  'Data Scientist',
  'Data Analyst',
  'Machine Learning Researcher',
  'Autonomous Systems Engineer',
  'AI Architect',
  'Robotics Engineer',
  'AI Software Engineer',
  'AI Product Manager'])
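
As a quick illustration of what `freq_encode` returns, here is the same helper applied to a toy column (the country codes are made up). Each value is replaced by its count in the column, and missing values map to 0.

```python
import numpy as np
import pandas as pd

def freq_encode(series: pd.Series) -> pd.Series:
    freq = series.value_counts(dropna=True)        # count per category
    return series.map(freq).fillna(0).astype(float)  # unseen/missing -> 0

toy = pd.Series(['US', 'US', 'DE', np.nan])
encoded = freq_encode(toy)
```

Here `'US'` becomes 2.0, `'DE'` becomes 1.0, and the missing entry becomes 0.0.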

## Skill features (multi‑label one‑hot for top‑K)

In [14]:

def parse_skills(val) -> List[str]:
    if pd.isna(val): return []
    s = str(val).strip()
    # Handle list-like strings: "['Python', 'SQL']" OR comma separated "Python, SQL"
    if s.startswith('[') and s.endswith(']'):
        try:
            lst = ast.literal_eval(s)
            return [str(x).strip().lower() for x in lst if str(x).strip()]
        except Exception:
            pass
    # Fallback: comma‑separated
    return [t.strip().lower() for t in s.split(',') if t.strip()]

skill_lists = master['required_skills'].apply(parse_skills)
skill_freq = {}
for lst in skill_lists:
    for sk in lst:
        skill_freq[sk] = skill_freq.get(sk, 0) + 1

# Keep top‑K skills to avoid huge matrices
TOP_K_SKILLS = 100
top_skills = [s for s, _ in sorted(skill_freq.items(), key=lambda x: x[1], reverse=True)[:TOP_K_SKILLS]]

top_skill_set = set(top_skills)  # set gives O(1) membership checks

def filter_to_top(lst: List[str]) -> List[str]:
    return [s for s in lst if s in top_skill_set]

filtered_skill_lists = skill_lists.apply(filter_to_top)

mlb = MultiLabelBinarizer(classes=top_skills)
skill_ohe = pd.DataFrame(mlb.fit_transform(filtered_skill_lists),
                         columns=[f"skill__{s}" for s in mlb.classes_],
                         index=master.index)

master = pd.concat([master, skill_ohe], axis=1)

print('Top skills kept:', len(top_skills))
print(top_skills[:20])


Top skills kept: 24
['python', 'sql', 'tensorflow', 'kubernetes', 'pytorch', 'scala', 'linux', 'git', 'java', 'gcp', 'hadoop', 'r', 'tableau', 'computer vision', 'data visualization', 'spark', 'mlops', 'azure', 'deep learning', 'nlp']
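
Because the skill vocabulary is frozen, later data can be encoded against the same columns. The sketch below assumes a small hypothetical vocabulary and a few made-up rows. With `classes=` set, `MultiLabelBinarizer` keeps the column order of the list it is given; labels outside that list are filtered first, as in the cell above, since scikit-learn otherwise warns about unknown classes.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

vocab = ['python', 'sql', 'spark']  # hypothetical frozen vocabulary
new_rows = [['python', 'rust'], [], ['sql', 'python']]  # made-up skill lists

# Drop any skill that is not in the frozen vocabulary
vocab_set = set(vocab)
filtered = [[s for s in row if s in vocab_set] for row in new_rows]

mlb = MultiLabelBinarizer(classes=vocab)
encoded = pd.DataFrame(
    mlb.fit_transform(filtered),
    columns=[f"skill__{s}" for s in mlb.classes_],
)
```

Rows with no known skills encode as all zeros, and the column order matches `vocab` regardless of which skills appear in the new data.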


## Assemble final feature matrix `X` and targets `y`

In [15]:

# Baseline numerical features
num_cols = [
    'experience_level_ord','loc_freq','company_size_freq',
    'posting_year','posting_month','posting_quarter','posting_dayofweek','days_since_posting'
]

# Remote bucket OHE columns
remote_cols = [c for c in master.columns if c.startswith('remote_')]

# Title OHE columns
title_cols = [c for c in master.columns if c.startswith('title_')]

# Skill OHE columns
skill_cols = [c for c in master.columns if c.startswith('skill__')]

feature_cols = num_cols + remote_cols + title_cols + skill_cols

# Drop rows without target
X = master[feature_cols].copy()
y = master['log_salary_midpoint'].copy()

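mask = ~y.isna()
# NOTE: fillna(0) also maps a missing experience_level_ord to 0 (the code for 'IN')
# and missing posting_* values to 0
X = X.loc[mask].fillna(0)
y = y.loc[mask]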

print('X shape:', X.shape, '| y shape:', y.shape)
print('Feature sample:', feature_cols[:10])


X shape: (30000, 68) | y shape: (30000,)
Feature sample: ['experience_level_ord', 'loc_freq', 'company_size_freq', 'posting_year', 'posting_month', 'posting_quarter', 'posting_dayofweek', 'days_since_posting', 'remote_ratio', 'remote_bucket']
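
The feature sample above lists `remote_ratio` and `remote_bucket`, because the `startswith('remote_')` filter matches those raw columns as well as the one-hot columns. `remote_bucket` holds strings, which is worth keeping in mind when reading the quality checks further down. One way to select only the indicator columns is sketched below on a toy frame.

```python
import pandas as pd

# Toy frame mimicking the remote columns in `master`
toy = pd.DataFrame({
    'remote_ratio': [0, 50, 100],
    'remote_bucket': ['onsite', 'hybrid', 'remote'],
})
remote_ohe = pd.get_dummies(toy['remote_bucket'], prefix='remote')
toy = pd.concat([toy, remote_ohe], axis=1)

# Prefix matching also picks up the raw ratio and the string bucket
loose = [c for c in toy.columns if c.startswith('remote_')]

# Taking names from the one-hot frame keeps only the indicators
strict = remote_ohe.columns.tolist()
```

If `remote_ratio` is wanted as a numeric feature, it can be listed explicitly in `num_cols` instead of being captured by the prefix match.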


## Train/Test split & save artifacts

In [16]:

# Stratify by experience bins when possible
exp_bins = pd.cut(master.loc[mask, 'experience_level_ord'].fillna(-1), bins=[-1,0,1,2,3,10], labels=False, include_lowest=True)
try:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=exp_bins
    )
except ValueError:  # raised by scikit-learn when a stratum has too few members
    # Fallback if stratification fails
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

# Save to repo-structured path (if you run inside repo) and sandbox for download
def save_dual(df: pd.DataFrame | pd.Series, name: str):
    """Write `df` as CSV to both the repo path and the sandbox path."""
    out1 = DATA_PROCESSED / name  # repo-structured path
    out2 = SANDBOX / name         # sandbox for immediate download
    for out in (out1, out2):
        # header is written by default for both Series and DataFrame
        df.to_csv(out, index=False)
    return out1, out2

p1, p2 = save_dual(X_train, 'X_train.csv')
p3, p4 = save_dual(X_test,  'X_test.csv')
p5, p6 = save_dual(y_train.to_frame('y'), 'y_train.csv')
p7, p8 = save_dual(y_test.to_frame('y'),  'y_test.csv')

print('Saved:')
print(p1, '\n', p3, '\n', p5, '\n', p7)

# Save metadata artifacts
feature_cols_path_1, feature_cols_path_2 = save_dual(pd.Series(feature_cols), 'feature_columns.csv')

config = {
    "top_k_skills": int(len(skill_cols)),
    "top_k_titles": int(len(title_cols)),
    "remote_buckets": remote_cols,
    "random_state": 42,
    "target": "log_salary_midpoint",
    "iqr_cap_k": 1.5
}
(SANDBOX / 'preprocessing_config.json').write_text(json.dumps(config, indent=2))
(SANDBOX / 'skill_vocab.json').write_text(json.dumps([c.replace('skill__','') for c in skill_cols], indent=2))

print('Artifacts:')
print(' -', SANDBOX / 'feature_columns.csv')
print(' -', SANDBOX / 'skill_vocab.json')
print(' -', SANDBOX / 'preprocessing_config.json')


Saved:
C:\Users\tdmne\OneDrive\Desktop\Projects 2\datathon-2025\data\processed\X_train.csv 
 C:\Users\tdmne\OneDrive\Desktop\Projects 2\datathon-2025\data\processed\X_test.csv 
 C:\Users\tdmne\OneDrive\Desktop\Projects 2\datathon-2025\data\processed\y_train.csv 
 C:\Users\tdmne\OneDrive\Desktop\Projects 2\datathon-2025\data\processed\y_test.csv
Artifacts:
 - \mnt\data\feature_columns.csv
 - \mnt\data\skill_vocab.json
 - \mnt\data\preprocessing_config.json


## Quick quality checks

In [19]:
# === Quick quality checks ===
import numpy as np
import pandas as pd

# Helper: find all-zero columns
def nonzero_columns(df: pd.DataFrame):
    nz = df.loc[:, (df != 0).any(axis=0)]
    dropped = sorted(set(df.columns) - set(nz.columns))
    return nz, dropped

# 1️⃣ non-zero feature summary
X_nz, dropped_cols = nonzero_columns(X_train)
print(f"Non-zero features in X_train: {X_nz.shape[1]} / {X_train.shape[1]}")
print("All-zero columns dropped (inspect if unexpected):", dropped_cols)

# 2️⃣ target leakage check
assert 'log_salary_midpoint' not in X_train.columns, "⚠️ Target leaked into features!"

# 3️⃣ ensure numeric + finite values safely
# convert all to float to avoid dtype issues
X_train_num = X_train.apply(pd.to_numeric, errors='coerce').astype(float)
X_test_num  = X_test.apply(pd.to_numeric,  errors='coerce').astype(float)

# check NaNs
nan_train = X_train_num.isna().sum().sum()
nan_test  = X_test_num.isna().sum().sum()

# check infinite values (works now because dtype is float)
inf_train = np.isinf(X_train_num.to_numpy(dtype=float)).sum()
inf_test  = np.isinf(X_test_num.to_numpy(dtype=float)).sum()

print(f"NaN values → train: {nan_train}, test: {nan_test}")
print(f"Inf values → train: {inf_train}, test: {inf_test}")

# 4️⃣ sanity assertions
assert inf_train == 0 and inf_test == 0, "⚠️ Found infinite values in data!"

print("✅ Basic checks passed — all numeric and finite. Safe for modeling.")


Non-zero features in X_train: 57 / 68
All-zero columns dropped (inspect if unexpected): ['title_Data Analyst (Remote)', 'title_Data Analyst in office', 'title_Data Engineer (Remote)', 'title_Data Engineer in office', 'title_Data Scientist in office', 'title_Machine Learning Engineer (Remote)', 'title_Machine Learning Engineer in office', 'title_Statistician', 'title_Statistician (Remote)', 'title_Statistician in office', 'title_other']
NaN values → train: 24000, test: 6000
Inf values → train: 0, test: 0
✅ Basic checks passed — all numeric and finite. Safe for modeling.
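
The NaN counts above (24000 and 6000) equal the row counts of an 80/20 split of 30000 rows, which is what a single non-numeric column coerced by `pd.to_numeric(..., errors='coerce')` would produce; the string column `remote_bucket` is a likely candidate. The checks still pass because only infinite values are asserted. A stricter helper, sketched here under the hypothetical name `find_bad_columns`, reports which columns are affected.

```python
import numpy as np
import pandas as pd

def find_bad_columns(df: pd.DataFrame) -> dict:
    """Return the columns that contain NaN or infinite values after numeric coercion."""
    num = df.apply(pd.to_numeric, errors='coerce').astype(float)
    nan_cols = num.columns[num.isna().any()].tolist()
    inf_cols = num.columns[np.isinf(num).any()].tolist()
    return {'nan': nan_cols, 'inf': inf_cols}

# Toy frame: one clean column, one string column, one column with an infinity
toy = pd.DataFrame({
    'a': [1.0, 2.0],
    'bucket': ['onsite', 'remote'],
    'c': [np.inf, 0.0],
})
report = find_bad_columns(toy)
```

Asserting that both lists are empty before saving would catch a string column before it reaches the modeling notebooks.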


In [21]:
# === Save missing metadata artifacts locally ===
import json
from pathlib import Path

DATA_PROCESSED = Path("../data/processed")
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

# --- skill_vocab.json ---
skill_vocab = [c.replace('skill__', '') for c in X_train.columns if c.startswith('skill__')]
(DATA_PROCESSED / "skill_vocab.json").write_text(json.dumps(skill_vocab, indent=2))

# --- preprocessing_config.json ---
config = {
    "top_k_skills": len(skill_vocab),
    "feature_count": X_train.shape[1],
    "random_state": 42,
    "target": "log_salary_midpoint",
    "iqr_cap_k": 1.5
}
(DATA_PROCESSED / "preprocessing_config.json").write_text(json.dumps(config, indent=2))

print("✅ Saved:")
print(" -", DATA_PROCESSED / "skill_vocab.json")
print(" -", DATA_PROCESSED / "preprocessing_config.json")


✅ Saved:
 - ..\data\processed\skill_vocab.json
 - ..\data\processed\preprocessing_config.json



## Next steps

- **05_modeling_forecasting.ipynb** → Use `posting_year`, `posting_month`, and the `skill__*` indicator columns to build per‑skill monthly series.
- **06_modeling_salary_prediction.ipynb** → Load `X_train.csv`, `X_test.csv`, `y_train.csv`, `y_test.csv`; fit Linear/Ridge/XGBoost; map log‑scale predictions back to salaries with `np.expm1`.
- **07_clustering_analysis.ipynb** → Use only skill OHE + key numerics; scale & run KMeans; profile clusters.
- **08_recommendation_engine.ipynb** → Use skill embeddings or nearest-neighbor similarity for path suggestions. 
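
For the salary-prediction step, predictions made on the log scale need to be mapped back with `np.expm1`. A minimal sketch with scikit-learn's `Ridge` on synthetic data follows; the arrays are generated for illustration and are not the notebook's features.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)

# Synthetic features and salaries (illustrative only)
X_toy = rng.normal(size=(200, 3))
salary = 80_000 * np.exp(0.3 * X_toy[:, 0])

# Train on the log1p target, as this notebook prepares it
y_log = np.log1p(salary)
model = Ridge(alpha=1.0).fit(X_toy, y_log)

# Invert the transform to report predictions on the salary scale
pred_salary = np.expm1(model.predict(X_toy))
```

Error metrics can then be computed on either scale, as long as the choice is stated alongside the results.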

> Commit artifacts under `data/processed/` and the notebook under `notebooks/`.
