# Tech Career Growth Project — Preprocessing with Dummy/Indicator Features
 
The goal of this notebook is to **transform categorical variables into numeric form** so that machine learning models can use them effectively.

---

## What we will do in this notebook

1. **Load the datasets** (`merged_bls_kaggle.csv` and `merged_skill_counts.csv`).
2. **Inspect the data**  
   - View shape, column types, and sample values.
3. **Identify categorical variables** (e.g., job titles, education levels, skills).
4. **Clean categorical values**  
   - Trim whitespace  
   - Standardize naming (e.g., “Mid-West” → “Midwest”)  
   - Replace missing values with `__MISSING__`  
   - Lump rare categories into `__OTHER__`
5. **One-hot encode** the categorical columns using `pandas.get_dummies`.
6. **Verify encoding** by checking rows and dummy columns.
7. **Save outputs**  
   - Encoded dataset (`.csv`)  
   - Category map (`.json`) for consistent train/test encoding later

---

## Table of Contents

- [1. Load Dataset](#1-load-dataset)  
- [2. Inspect Data](#2-inspect-data)  
- [3. Identify Categorical Columns](#3-identify-categorical-columns)  
- [4. Clean Categorical Values](#4-clean-categorical-values)  
- [5. Handle Rare Categories](#5-handle-rare-categories)  
- [6. One-Hot Encode](#6-one-hot-encode)  
- [7. Verify Encoding](#7-verify-encoding)  
- [8. Save Encoded Data & Category Maps]
- [9. Merge Skills → BLS (by job_title_norm)]
- [10. Identify Numeric Features to Scale]
- [11. Train/Test Split (before scaling)]
- [12. Standardize Numeric Features (fit on TRAIN only)]
- [13. Save Scaled Splits & Scaler (+ feature manifest)]

---



## 1. Load Dataset

In this step we will:

- Define the file paths for both datasets (`merged_bls_kaggle.csv` and `merged_skill_counts.csv`).
- Load each dataset into its own pandas DataFrame (`df_bls`, `df_sk`).
- Display the first few rows to confirm they loaded correctly (done in Section 2).

In [126]:
# 1. Load Dataset

import pandas as pd
from pandas.api.types import CategoricalDtype
import json
import os
import re
from difflib import SequenceMatcher
import numpy as np

os.makedirs("data/encoded", exist_ok=True)

# Path to your dataset (adjust if needed)
BLS_INPUT_PATH = "data/merged_data/merged_bls_kaggle.csv"
SKILLS_INPUT_PATH = "data/merged_data/merged_skill_counts.csv"

# Load into a DataFrame
df_bls = pd.read_csv(BLS_INPUT_PATH)
df_sk = pd.read_csv(SKILLS_INPUT_PATH)

## 2. Inspect Data

In this step we will:

- Check the **shape** of the dataset (rows × columns).
- Review the **column names and data types** to see which are numeric, categorical, or boolean.
- Look at the number of **missing values** per column.
- Preview some sample rows for a quick sense of the data.

In [67]:
# 2. Inspect Data

# Shape (rows, columns)
print("bls Shape:", df_bls.shape)
print("sk Shape:", df_sk.shape)

# Column names and data types
print("\nData types:")
print("bls: ", df_bls.dtypes)
print("sk: ", df_sk.dtypes)

# Missing values per column (top 10)
print("\nMissing values:")
print("bls: ", df_bls.isnull().sum().sort_values(ascending=False).head(10))
print("sk: ", df_sk.isnull().sum().sort_values(ascending=False).head(10))

# Quick preview of first 5 rows
print("bls head:", df_bls.head())
print("sk head:", df_sk.head())

bls Shape: (22, 22)
sk Shape: (4638, 4)

Data types:
bls:  job_title                                                                     object
soc_code                                                                      object
occupation_type                                                               object
employment_2023                                                              float64
employment_2033                                                              float64
employment_distribution_percent_2023                                         float64
employment_distribution_percent_2033                                         float64
employment_change_numeric_2023-33                                            float64
employment_change_percent_2023-33                                            float64
percent_self_employed_2023                                                   float64
occupational_openings_2023-33_annual_average                                 float64
median

## 3. Identify Categorical Columns

In this step we will:

- Detect which columns are **categorical** by checking their data types.
  - Typically, columns with type `object`, `category`, or `bool` are categorical.
- Optionally, **manually add** numeric-looking columns that should be treated as categories (e.g., `soc_code`).
- Print the list of categorical columns to confirm.
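The optional manual step is not exercised by the cell that follows, so here is a minimal sketch of how it could look. The helper name `find_categorical_columns_with_extras`, the `extra_cols` parameter, and the toy column `region_code` are assumptions made for illustration; in this dataset `soc_code` is already read as `object`, so it is detected without any manual help.

```python
import pandas as pd
from pandas.api.types import (
    CategoricalDtype,
    is_bool_dtype,
    is_object_dtype,
    is_string_dtype,
)

def find_categorical_columns_with_extras(df, extra_cols=()):
    """Dtype-based detection plus a hand-picked list of extra columns (hypothetical helper)."""
    detected = []
    for col in df.columns:
        dtype = df[col].dtype
        if (is_object_dtype(dtype) or is_string_dtype(dtype)
                or isinstance(dtype, CategoricalDtype) or is_bool_dtype(dtype)):
            detected.append(col)
    # Append manually listed columns that exist and were not already detected
    extras = [c for c in extra_cols if c in df.columns and c not in detected]
    return detected + extras

# Toy frame: region_code is stored as an integer but is really a label
toy = pd.DataFrame({
    "job_title": ["a", "b"],
    "region_code": [101, 202],
    "employment_2023": [1.0, 2.0],
})

cat_cols_demo = find_categorical_columns_with_extras(toy, extra_cols=["region_code"])
print(cat_cols_demo)  # ['job_title', 'region_code']

# Casting to string keeps later text cleaning and sentinel filling uniform
toy["region_code"] = toy["region_code"].astype(str)
```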

In [69]:
# 3. Identify Categorical Columns

# Function to detect categorical columns
def find_categorical_columns(df):
    cat_cols = []
    for col in df.columns:
        dtype = df[col].dtype
        if (pd.api.types.is_object_dtype(dtype) 
            or isinstance(dtype, CategoricalDtype) 
            or pd.api.types.is_bool_dtype(dtype)):
            cat_cols.append(col)
    return cat_cols

# Detect categorical columns
bls_cat_cols = find_categorical_columns(df_bls)
print("bls categorical columns:")
print(bls_cat_cols)

sk_cat_cols = find_categorical_columns(df_sk)
print("sk categorical columns: ", sk_cat_cols)

bls categorical columns:
['job_title', 'soc_code', 'occupation_type', 'typical_education_needed_for_entry', 'work_experience_in_a_related_occupation', 'typical_on-the-job_training_needed_to_attain_competency_in_the_occupation', 'related_occupational_outlook_handbook_(ooh)_content', '_is_tech', '_tech_reason', 'job_title_norm']
sk categorical columns:  ['job_title_norm', 'canon_skill']


## 4. Clean Categorical Values

In this step we will:

- **Normalize strings**: remove leading/trailing spaces and fix inconsistent formatting.
- **Standardize naming** for a few known variants (e.g., "Mid-West" → "Midwest").
- **Fill missing values** with a placeholder (`__MISSING__`).
- Apply these cleaning rules to all categorical columns.
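The cleaning rules above can be sketched as a single Series-level function. This is an illustrative assumption rather than the project's actual cleaner: the name `clean_text_sketch` and the `VARIANT_MAP` dictionary are hypothetical, and only the "Mid-West" example comes from the text. Missing values are deliberately left as `NaN` so that a later `fillna("__MISSING__")` can catch them.

```python
import numpy as np
import pandas as pd

# Hypothetical map of known spelling variants to their canonical form
VARIANT_MAP = {"Mid-West": "Midwest"}

def clean_text_sketch(s: pd.Series) -> pd.Series:
    """Trim, collapse internal whitespace, map known variants; keep missing as NaN."""
    is_na = s.isna()
    out = s.astype(str).str.strip().str.replace(r"\s+", " ", regex=True)
    out = out.replace(VARIANT_MAP)
    # Restore NaN for originally missing entries and treat empty strings as missing
    return out.where(~is_na & (out != ""), other=np.nan)

raw = pd.Series(["  Mid-West ", "Line   item", None, ""])
cleaned = clean_text_sketch(raw).fillna("__MISSING__")
print(cleaned.tolist())  # ['Midwest', 'Line item', '__MISSING__', '__MISSING__']
```

Recording the missing mask before `astype(str)` matters: otherwise `None` and `NaN` would be turned into the literal strings `"None"` and `"nan"` and slip past the sentinel fill.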


In [75]:
# 4. Clean Categorical Values (BLS + Skills)

# NOTE: clean_text_bls must already be defined in the session; its definition
# is not shown in this export. It is assumed to take a Series of strings and
# return the cleaned Series, leaving missing values as NaN.
clean_text_sk = clean_text_bls  # reuse the same cleaner for skills

# BLS
df_bls_clean = df_bls.copy()
for c in bls_cat_cols:
    # don't apply text ops to boolean columns
    if not pd.api.types.is_bool_dtype(df_bls_clean[c]):
        df_bls_clean[c] = clean_text_bls(df_bls_clean[c])
    df_bls_clean[c] = df_bls_clean[c].fillna("__MISSING__")

# Quick check on the cleaned categorical columns
print("bls: ", df_bls_clean[bls_cat_cols].head(10))


# SKILLS
# (sk_cat_cols comes from Section 3: ["job_title_norm", "canon_skill"])
df_sk_clean = df_sk.copy()
for c in sk_cat_cols:
    df_sk_clean[c] = clean_text_sk(df_sk_clean[c])
    df_sk_clean[c] = df_sk_clean[c].fillna("__MISSING__")

print("sk: ", df_sk_clean[sk_cat_cols].head(10))

bls:                                        job_title soc_code occupation_type  \
0     Computer and information systems managers  11-3021       Line item   
1                     Computer systems analysts  15-1211       Line item   
2                 Information security analysts  15-1212       Line item   
3  Computer and information research scientists  15-1221       Line item   
4          Computer network support specialists  15-1231       Line item   
5             Computer user support specialists  15-1232       Line item   
6                   Computer network architects  15-1241       Line item   
7                       Database administrators  15-1242       Line item   
8                           Database architects  15-1243       Line item   
9   Network and computer systems administrators  15-1244       Line item   

  typical_education_needed_for_entry work_experience_in_a_related_occupation  \
0                  Bachelor's degree                         5 years or more 

## 5. Handle Rare Categories

Why: If a category appears only once or twice, it can create very sparse, brittle features.  
We’ll replace very infrequent categories with a single label `__OTHER__`.

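Guideline:
- Small dataset (like ours): keep thresholds low (e.g., `min_count=2`, `min_frac=0.01`). Note that `lump_rare` keeps a level when *either* threshold is met, and a single row out of 22 is already about 4.5% of the BLS table, so with these settings nothing in BLS is actually lumped.
- Larger/high-cardinality data (e.g., skills): raise thresholds a bit to control the number of columns.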
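To see how the two thresholds interact on a table this small, the sketch below reproduces the keep rule described in the `lump_rare` docstring (count OR share) and compares it with a stricter variant that requires both. The function name `lump_rare_demo` and the `require_both` switch are hypothetical additions for illustration only.

```python
import pandas as pd

def lump_rare_demo(s, min_count, min_frac, require_both=False):
    """Replace rare levels with '__OTHER__'; optionally require both thresholds."""
    vc = s.value_counts(dropna=False)
    by_count = vc >= min_count
    by_share = (vc / len(s)) >= min_frac
    keep_mask = (by_count & by_share) if require_both else (by_count | by_share)
    keep = vc[keep_mask].index
    return s.where(s.isin(keep), "__OTHER__")

# 22 rows, like the BLS table: one common level plus two singletons
s = pd.Series(["Bachelor's degree"] * 20 + ["Master's degree", "Associate's degree"])

either = lump_rare_demo(s, min_count=2, min_frac=0.01)
both = lump_rare_demo(s, min_count=2, min_frac=0.01, require_both=True)

n_other_either = int((either == "__OTHER__").sum())
n_other_both = int((both == "__OTHER__").sum())
print(n_other_either, n_other_both)  # 0 2
```

With the OR rule, each singleton has a share of 1/22 (about 4.5%), which clears the 1% threshold, so no level is lumped. If lumping singletons is the intent on a small table, either a higher `min_frac` or the stricter AND rule would be needed.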

In [83]:
# 5. Handle Rare Categories (Optional)

BLS_MIN_COUNT = 2
BLS_MIN_FRAC  = 0.01    # 1% of rows

SK_MIN_COUNT  = 10      # skills are high-cardinality
SK_MIN_FRAC   = 0.005   # 0.5% of rows

def lump_rare(s, min_count=2, min_frac=0.01):
    """
    Replace infrequent levels with '__OTHER__'.
    Keep a level if it appears at least `min_count` times OR
    its share is >= `min_frac`.
    """
    vc = s.value_counts(dropna=False)
    keep = vc[(vc >= min_count) | (vc / len(s) >= min_frac)].index
    return s.where(s.isin(keep), "__OTHER__")

def other_share(s):
    """Proportion of values that became '__OTHER__' (sanity check)."""
    return float((s == "__OTHER__").mean())

# ===========================
#           BLS
# ===========================
# Inputs assumed from Section 4:
#   df_bls_clean  (cleaned strings + __MISSING__)
#   bls_cat_cols  (categorical columns for BLS)

df_bls_ready = df_bls_clean.copy()

for c in bls_cat_cols:
    # skip booleans — they don't need string lumping
    if pd.api.types.is_bool_dtype(df_bls_ready[c]):
        continue
    df_bls_ready[c] = lump_rare(df_bls_ready[c], min_count=BLS_MIN_COUNT, min_frac=BLS_MIN_FRAC)

print("BLS ready (after rare-level lumping) — head:")
display(df_bls_ready[bls_cat_cols].head(10))

print("\nBLS '__OTHER__' shares:")
for c in bls_cat_cols:
    if not pd.api.types.is_bool_dtype(df_bls_ready[c]):
        print(f"  {c}: {other_share(df_bls_ready[c]):.2%}")

# ===========================
#          SKILLS
# ===========================
# Treat job_title_norm as an identifier/key (no lumping / no encoding)
id_col_sk = "job_title_norm"
if id_col_sk not in df_sk_clean.columns:
    raise KeyError(f"Expected identifier column '{id_col_sk}' not found in skills dataset.")

# encode ONLY 'canon_skill'
sk_cat_cols_enc = []
if "canon_skill" in df_sk_clean.columns:
    sk_cat_cols_enc = ["canon_skill"]
else:
    raise KeyError("Expected 'canon_skill' in skills dataset for encoding.")

df_sk_ready = df_sk_clean.copy()

# Lumping for skills: only on canon_skill
df_sk_ready["canon_skill"] = lump_rare(
    df_sk_ready["canon_skill"], 
    min_count=SK_MIN_COUNT, 
    min_frac=SK_MIN_FRAC
)

print("\nSkills ready (after rare-level lumping) — top levels:")
print("\ncanon_skill:")
print(df_sk_ready["canon_skill"].value_counts().head(15))

# Optional diagnostics
print("\nSkills '__OTHER__' shares:")
print(f"  canon_skill: {other_share(df_sk_ready['canon_skill']):.2%}")

print("\nSkills identifier uniqueness check:")
print("  unique job_title_norm:", df_sk_ready[id_col_sk].nunique())
print("  total rows:", len(df_sk_ready))

BLS ready (after rare-level lumping) — head:


Unnamed: 0,job_title,soc_code,occupation_type,typical_education_needed_for_entry,work_experience_in_a_related_occupation,typical_on-the-job_training_needed_to_attain_competency_in_the_occupation,related_occupational_outlook_handbook_(ooh)_content,_is_tech,_tech_reason,job_title_norm
0,Computer and information systems managers,11-3021,Line item,Bachelor's degree,5 years or more,__MISSING__,OOH Content,True,hard_include_or_must_have,computer and information systems managers
1,Computer systems analysts,15-1211,Line item,Bachelor's degree,__MISSING__,__MISSING__,OOH Content,True,hard_include_or_must_have,computer systems analysts
2,Information security analysts,15-1212,Line item,Bachelor's degree,Less than 5 years,__MISSING__,OOH Content,True,hard_include_or_must_have,information security analysts
3,Computer and information research scientists,15-1221,Line item,Master's degree,__MISSING__,__MISSING__,OOH Content,True,soc15_default_keep,computer and information research scientists
4,Computer network support specialists,15-1231,Line item,Associate's degree,__MISSING__,Moderate-term on-the-job training,OOH Content,True,soc15_default_keep,computer network support specialists
5,Computer user support specialists,15-1232,Line item,"Some college, no degree",__MISSING__,Moderate-term on-the-job training,OOH Content,True,hard_include_or_must_have,computer user support specialists
6,Computer network architects,15-1241,Line item,Bachelor's degree,5 years or more,__MISSING__,OOH Content,True,hard_include_or_must_have,computer network architects
7,Database administrators,15-1242,Line item,Bachelor's degree,__MISSING__,__MISSING__,OOH Content,True,hard_include_or_must_have,database administrators
8,Database architects,15-1243,Line item,Bachelor's degree,Less than 5 years,__MISSING__,OOH Content,True,hard_include_or_must_have,database architects
9,Network and computer systems administrators,15-1244,Line item,Bachelor's degree,__MISSING__,__MISSING__,OOH Content,True,soc15_default_keep,network and computer systems administrators



BLS '__OTHER__' shares:
  job_title: 0.00%
  soc_code: 0.00%
  occupation_type: 0.00%
  typical_education_needed_for_entry: 0.00%
  work_experience_in_a_related_occupation: 0.00%
  typical_on-the-job_training_needed_to_attain_competency_in_the_occupation: 0.00%
  related_occupational_outlook_handbook_(ooh)_content: 0.00%
  _tech_reason: 0.00%
  job_title_norm: 0.00%

Skills ready (after rare-level lumping) — top levels:

canon_skill:
canon_skill
Technology Literacy         1941
AI & Big Data               1259
Programming                 1161
Networks & Cybersecurity     205
Design & UX                   72
Name: count, dtype: Int64

Skills '__OTHER__' shares:
  canon_skill: 0.00%

Skills identifier uniqueness check:
  unique job_title_norm: 1941
  total rows: 4638


## 6. One-Hot Encode

We will convert each categorical column into multiple **indicator (0/1) columns** using `pandas.get_dummies`.

Notes:
- We pass `columns=...` (`bls_cat_cols` for BLS, `["canon_skill"]` for skills) so only those columns are expanded.
- `drop_first=False` keeps all categories (we can drop a baseline later for linear models).
- `dtype="uint8"` keeps memory usage small.
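As a small illustration of these options, the sketch below encodes a toy frame twice, once keeping every level and once dropping a baseline. The frame and its shortened column name `education` are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "employment_2023": [1.0, 2.0, 3.0],
    "education": ["Bachelor's degree", "Master's degree", "Bachelor's degree"],
})

# Keep every level (what this notebook does)
full = pd.get_dummies(
    toy, columns=["education"], prefix_sep="=", drop_first=False, dtype="uint8"
)

# Drop the first (alphabetically smallest) level as a baseline, e.g. for linear models
reduced = pd.get_dummies(
    toy, columns=["education"], prefix_sep="=", drop_first=True, dtype="uint8"
)

print(full.columns.tolist())
# ['employment_2023', "education=Bachelor's degree", "education=Master's degree"]
print(reduced.columns.tolist())
# ['employment_2023', "education=Master's degree"]
```

Only the listed column is expanded; `employment_2023` passes through unchanged, and each dummy is stored as a one-byte `uint8`.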

In [91]:
# 6. One-Hot Encode
# -------------------------------------------------------------------------
# 6.1 BLS — One-hot encode all BLS categorical columns
# -------------------------------------------------------------------------
df_bls_dummies = pd.get_dummies(
    df_bls_ready,
    columns=bls_cat_cols,
    prefix_sep="=",
    drop_first=False,              # keep every level incl. __MISSING__/__OTHER__
    dtype="uint8"
)

# Order columns: non-cat first, then dummy cols (sorted) -> easier to read
non_cat_cols_bls = [c for c in df_bls_ready.columns if c not in bls_cat_cols]
dummy_cols_bls   = sorted([c for c in df_bls_dummies.columns if c not in non_cat_cols_bls])
df_bls_dummies   = df_bls_dummies[non_cat_cols_bls + dummy_cols_bls]

print("BLS encoded shape:", df_bls_dummies.shape)
display(df_bls_dummies.head(5))

# Integrity check: each categorical base should sum to 1 across its dummy columns
def check_one_hot(df_encoded: pd.DataFrame, base_col: str):
    cols = [c for c in df_encoded.columns if c.startswith(base_col + "=")]
    if not cols:
        return True, []
    sums = df_encoded[cols].sum(axis=1)
    bad_idx = sums.index[sums != 1].tolist()
    return len(bad_idx) == 0, bad_idx

bls_integrity = {base: check_one_hot(df_bls_dummies, base)[0] for base in bls_cat_cols}
print("\nBLS one-hot integrity:", bls_integrity)

# -------------------------------------------------------------------------
# 6.2 Skills — One-hot encode ONLY `canon_skill` and aggregate by job_title_norm
# -------------------------------------------------------------------------
id_col_sk = "job_title_norm"
if id_col_sk not in df_sk_ready.columns:
    raise KeyError(f"Expected identifier column '{id_col_sk}' not found in skills dataset.")

# Ensure canon_skill is present and will be encoded
if "canon_skill" not in df_sk_ready.columns:
    raise KeyError("Expected 'canon_skill' column in skills dataset.")

# Encode only canon_skill (treat job_title_norm as ID; do not encode job_title_norm)
skill_dummies = pd.get_dummies(
    df_sk_ready,
    columns=["canon_skill"],
    prefix_sep="=",
    drop_first=False,
    dtype="uint8"
)
skill_dummy_cols = sorted([c for c in skill_dummies.columns if c.startswith("canon_skill=")])

print("\nSkills (row-level) encoded shape:", skill_dummies.shape, "| #skill dummies:", len(skill_dummy_cols))
display(skill_dummies[[id_col_sk] + skill_dummy_cols].head(3))

# Each row should have exactly 1 hot skill column
row_sums = skill_dummies[skill_dummy_cols].sum(axis=1)
if not (row_sums == 1).all():
    bad = np.where(row_sums != 1)[0][:10]
    raise AssertionError(f"One-hot check failed for skills rows. Example bad indices: {bad}")

# Aggregate to one row per job title (binary presence of skills)
skills_by_title = (
    skill_dummies.groupby(id_col_sk, as_index=False)[skill_dummy_cols].max()
)

print("Aggregated skills_by_title shape:", skills_by_title.shape)
display(skills_by_title.head(3))

# -------------------------------------------------------------------------
# 6.3 Save artifacts (encoded CSVs + category maps)
# -------------------------------------------------------------------------
os.makedirs("data/encoded", exist_ok=True)

# BLS outputs
bls_out_csv = "data/encoded/bls_encoded_step1.csv"
df_bls_dummies.to_csv(bls_out_csv, index=False)

# Category map for BLS (levels observed after cleaning/lumping)
bls_catmap = {}
for c in bls_cat_cols:
    if pd.api.types.is_bool_dtype(df_bls_ready[c]):
        levels = ["False", "True"]
    else:
        levels = sorted(df_bls_ready[c].dropna().astype(str).unique().tolist())
    bls_catmap[c] = levels

bls_catmap_json = "data/encoded/bls_categories_map.json"
with open(bls_catmap_json, "w") as f:
    json.dump(bls_catmap, f, indent=2)

# Skills outputs
skills_row_out      = "data/encoded/skills_row_level_step1.csv"
skills_by_title_out = "data/encoded/skills_features_by_title_step1.csv"
skills_catmap_json  = "data/encoded/skills_categories_map.json"

skill_dummies.to_csv(skills_row_out, index=False)
skills_by_title.to_csv(skills_by_title_out, index=False)
with open(skills_catmap_json, "w") as f:
    json.dump({"canon_skill": sorted(df_sk_ready["canon_skill"].astype(str).unique().tolist())}, f, indent=2)

# Summary
def mb(n):
    """Convert a byte count to megabytes, rounded to 3 decimals."""
    return round(n / (1024**2), 3)
print("\nSaved:")
print(" - BLS encoded CSV:", bls_out_csv)
print(" - BLS categories map:", bls_catmap_json)
print(" - Skills row-level CSV:", skills_row_out)
print(" - Skills features-by-title CSV:", skills_by_title_out)
print(" - Skills category map:", skills_catmap_json)
print("Memory usage (BLS encoded):", mb(df_bls_dummies.memory_usage(deep=True).sum()), "MB")
print("Num BLS dummy columns:", len(dummy_cols_bls))
print("Num skill dummy columns:", len(skill_dummy_cols))

BLS encoded shape: (22, 94)


Unnamed: 0,employment_2023,employment_2033,employment_distribution_percent_2023,employment_distribution_percent_2033,employment_change_numeric_2023-33,employment_change_percent_2023-33,percent_self_employed_2023,occupational_openings_2023-33_annual_average,median_annual_wage_dollars_2024,postings,...,typical_education_needed_for_entry=Associate's degree,typical_education_needed_for_entry=Bachelor's degree,typical_education_needed_for_entry=Master's degree,"typical_education_needed_for_entry=Some college, no degree",typical_on-the-job_training_needed_to_attain_competency_in_the_occupation=Long-term on-the-job training,typical_on-the-job_training_needed_to_attain_competency_in_the_occupation=Moderate-term on-the-job training,typical_on-the-job_training_needed_to_attain_competency_in_the_occupation=__MISSING__,work_experience_in_a_related_occupation=5 years or more,work_experience_in_a_related_occupation=Less than 5 years,work_experience_in_a_related_occupation=__MISSING__
0,613.5,720.4,0.4,0.4,106.9,17.4,1.0,54.7,171200.0,,...,0,1,0,0,0,0,1,1,0,0
1,527.2,583.7,0.3,0.3,56.5,10.7,2.6,37.3,103790.0,,...,0,1,0,0,0,0,1,0,0,1
2,180.7,239.8,0.1,0.1,59.1,32.7,1.2,17.3,124910.0,,...,0,1,0,0,0,0,1,0,1,0
3,36.6,46.0,0.0,0.0,9.4,25.6,,3.4,140910.0,,...,0,0,1,0,0,0,1,0,0,1
4,166.7,178.8,0.1,0.1,12.1,7.3,2.2,12.1,73340.0,,...,1,0,0,0,0,1,0,0,0,1



BLS one-hot integrity: {'job_title': True, 'soc_code': True, 'occupation_type': True, 'typical_education_needed_for_entry': True, 'work_experience_in_a_related_occupation': True, 'typical_on-the-job_training_needed_to_attain_competency_in_the_occupation': True, 'related_occupational_outlook_handbook_(ooh)_content': True, '_is_tech': True, '_tech_reason': True, 'job_title_norm': True}

Skills (row-level) encoded shape: (4638, 8) | #skill dummies: 5


Unnamed: 0,job_title_norm,canon_skill=AI & Big Data,canon_skill=Design & UX,canon_skill=Networks & Cybersecurity,canon_skill=Programming,canon_skill=Technology Literacy
0,0247 project associate,1,0,0,0,0
1,0247 project associate,0,0,0,0,1
2,2019 gdi and a summer intern,1,0,0,0,0


Aggregated skills_by_title shape: (1941, 6)


Unnamed: 0,job_title_norm,canon_skill=AI & Big Data,canon_skill=Design & UX,canon_skill=Networks & Cybersecurity,canon_skill=Programming,canon_skill=Technology Literacy
0,0247 project associate,1,0,0,0,1
1,2019 gdi and a summer intern,1,0,0,1,1
2,2019 human resources - industrial/organization...,1,0,0,0,1



Saved:
 - BLS encoded CSV: data/encoded/bls_encoded_step1.csv
 - BLS categories map: data/encoded/bls_categories_map.json
 - Skills row-level CSV: data/encoded/skills_row_level_step1.csv
 - Skills features-by-title CSV: data/encoded/skills_features_by_title_step1.csv
 - Skills category map: data/encoded/skills_categories_map.json
Memory usage (BLS encoded): 0.004 MB
Num BLS dummy columns: 82
Num skill dummy columns: 5


## 7. Verify Encoding (BLS + Skills)

This cell validates:
- **BLS**: each categorical field encodes to exactly one hot dummy per row; dummies are `uint8`; `__MISSING__/__OTHER__` handled; matches the saved category map.
- **Skills**: row-level `canon_skill` is truly one-hot; aggregated `skills_by_title` are binary 0/1; dummies are `uint8`; matches the saved category map.

If any check fails, it prints which rows/columns to inspect.

In [143]:
bls_catmap_fixed = {}
for c in bls_cat_cols:
    # Use only levels that actually appear after cleaning/lumping
    levels = sorted(df_bls_ready[c].dropna().astype(str).unique().tolist())
    bls_catmap_fixed[c] = levels

with open("data/encoded/bls_categories_map.json", "w") as f:
    json.dump(bls_catmap_fixed, f, indent=2)

print("Rewrote BLS categories_map with observed levels only")

# ======================
# Helpers
# ======================
def check_one_hot(df_encoded: pd.DataFrame, base_col: str):
    """Return (ok, bad_rows, cols) for a base categorical column encoded into dummies."""
    cols = [c for c in df_encoded.columns if c.startswith(base_col + "=")]
    if not cols:
        return True, [], cols
    sums = df_encoded[cols].sum(axis=1)
    bad_idx = sums.index[sums != 1].tolist()
    return (len(bad_idx) == 0), bad_idx, cols

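def hot_label(row_dummies: pd.Series, base: str):
    """Return the level encoded for `base` in one row, or '__AMBIG__' if not exactly one dummy is hot."""
    cols = [c for c in row_dummies.index if c.startswith(base + "=")]
    if not cols:
        return None
    sub = row_dummies[cols]
    # Exactly one hot column is required; a max() check alone would miss rows with several 1s
    if sub.sum() != 1:
        return "__AMBIG__"
    return sub.idxmax().split("=", 1)[1]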

def appeared_as_value(df, col, value):
    return (df[col] == value).any() if col in df.columns else False

def all_binary_uint8(df, cols):
    if not cols:
        return True, []
    dtypes_ok = all(str(dt) == "uint8" for dt in df[cols].dtypes)
    non_binary_rows = []
    if dtypes_ok:
        bad_mask = ~df[cols].isin([0,1]).all(axis=1)
        non_binary_rows = df.index[bad_mask].tolist()
    return dtypes_ok and (len(non_binary_rows) == 0), non_binary_rows

# ======================
# BLS verification
# ======================
print("\n=== BLS verification ===")
bls_dummy_cols = [c for c in df_bls_dummies.columns if any(c.startswith(base+"=") for base in bls_cat_cols)]

# 1) one-hot integrity per categorical base
bls_results = {}
for base in bls_cat_cols:
    ok, bad_rows, cols = check_one_hot(df_bls_dummies, base)
    bls_results[base] = {"ok": ok, "dummies": len(cols), "bad_rows": bad_rows[:5]}

print("One-hot integrity (BLS):")
for base, r in bls_results.items():
    print(f" - {base}: ok={r['ok']} dummies={r['dummies']} bad_rows_sample={r['bad_rows']}")

# 2) random spot-checks
np.random.seed(0)
sample_idx = np.random.choice(df_bls_ready.index, size=min(5, len(df_bls_ready)), replace=False)
print("\nRandom spot-checks (BLS):")
for i in sample_idx:
    row_report = []
    for base in bls_cat_cols:
        orig = df_bls_ready.loc[i, base]
        enc  = hot_label(df_bls_dummies.loc[i], base)
        row_report.append((base, str(orig), str(enc)))
    print(f"Row {i}:")
    for base, orig, enc in row_report:
        print(f"  {base}: orig='{orig}'  |  enc='{enc}'")

# 3) dummy dtypes + binary
dtype_ok, nonbin_rows = all_binary_uint8(df_bls_dummies, bls_dummy_cols)
print("\nBLS dtypes/binary check: ok =", dtype_ok, "| non-binary rows:", nonbin_rows[:5])

# 4) sentinel presence
print("\nSentinel presence (BLS):")
for base in bls_cat_cols:
    base_levels = [c.split("=",1)[1] for c in bls_dummy_cols if c.startswith(base + "=")]
    print(f" - {base}: "
          f"__MISSING__ in source? {appeared_as_value(df_bls_ready, base, '__MISSING__')}, "
          f"dummy present? {'__MISSING__' in base_levels};  "
          f"__OTHER__ in source? {appeared_as_value(df_bls_ready, base, '__OTHER__')}, "
          f"dummy present? {'__OTHER__' in base_levels}")

# 5) compare dummies to the (fixed) category map
with open("data/encoded/bls_categories_map.json", "r") as f:
    bls_map = json.load(f)

missing_from_dummies = {}
for base, levels in bls_map.items():
    expected = {f"{base}={lvl}" for lvl in levels}   # using observed levels only
    actual   = {c for c in df_bls_dummies.columns if c.startswith(base + "=")}
    miss = sorted(list(expected - actual))
    if miss:
        missing_from_dummies[base] = miss

if missing_from_dummies:
    print("\nWARNING: Missing expected BLS dummy columns per categories_map (after fix):")
    for base, miss in missing_from_dummies.items():
        print(f" - {base}: {miss[:10]}{' ...' if len(miss) > 10 else ''}")
else:
    print("\nCategory map match (BLS): OK")

# ======================
# Skills verification
# ======================
print("\n=== Skills verification ===")
id_col_sk = "job_title_norm"
skill_dummy_cols = [c for c in skill_dummies.columns if c.startswith("canon_skill=")]
agg_skill_cols   = [c for c in skills_by_title.columns if c.startswith("canon_skill=")]

# 1) row-level one-hot integrity for canon_skill
row_sums = skill_dummies[skill_dummy_cols].sum(axis=1) if skill_dummy_cols else pd.Series(dtype=int)
onehot_ok = (row_sums == 1).all() if len(row_sums) else True
bad_rows  = row_sums.index[row_sums != 1].tolist()[:5] if len(row_sums) else []
print("Row-level one-hot (canon_skill): ok =", onehot_ok, "| bad_rows_sample:", bad_rows)

# 2) dtypes/binary on row-level and aggregated
row_ok, row_nonbin = all_binary_uint8(skill_dummies, skill_dummy_cols)
agg_ok, agg_nonbin = all_binary_uint8(skills_by_title, agg_skill_cols)
print("Dtypes/binary (row-level):", row_ok, "| non-binary rows:", row_nonbin[:5])
print("Dtypes/binary (aggregated):", agg_ok, "| non-binary rows:", agg_nonbin[:5])

# 3) uniqueness & coverage stats
n_titles = skills_by_title[id_col_sk].nunique()
titles_with_any = (skills_by_title[agg_skill_cols].sum(axis=1) > 0).sum() if agg_skill_cols else 0
print("Aggregated rows (unique job_title_norm):", n_titles)
print("Titles with ≥1 skill:", titles_with_any, f"({titles_with_any / max(1, n_titles):.1%})")

# 4) sentinel presence for canon_skill
has_missing = (df_sk_ready["canon_skill"] == "__MISSING__").any() if "canon_skill" in df_sk_ready.columns else False
has_other   = (df_sk_ready["canon_skill"] == "__OTHER__").any() if "canon_skill" in df_sk_ready.columns else False
print("Sentinel (skills): __MISSING__ in source?", has_missing, 
      "| dummy present?", ("canon_skill=__MISSING__" in skill_dummy_cols))
print("Sentinel (skills): __OTHER__ in source?", has_other, 
      "| dummy present?", ("canon_skill=__OTHER__" in skill_dummy_cols))

# 5) compare to saved skills category map (if present)
try:
    with open("data/encoded/skills_categories_map.json", "r") as f:
        sk_map = json.load(f)
    levels = sk_map.get("canon_skill", [])
    expected = {f"canon_skill={lvl}" for lvl in levels}
    actual   = set(skill_dummy_cols)
    miss = sorted(list(expected - actual))
    if miss:
        print("\nWARNING: Missing expected skills dummy columns per categories_map:")
        print(f" - canon_skill: {miss[:10]}{' ...' if len(miss) > 10 else ''}")
    else:
        print("\nCategory map match (Skills): OK")
except FileNotFoundError:
    print("\n[Note] Skills categories_map not found — skipped map check.")

Rewrote BLS categories_map with observed levels only

=== BLS verification ===
One-hot integrity (BLS):
 - job_title: ok=True dummies=22 bad_rows_sample=[]
 - soc_code: ok=True dummies=22 bad_rows_sample=[]
 - occupation_type: ok=True dummies=1 bad_rows_sample=[]
 - typical_education_needed_for_entry: ok=True dummies=4 bad_rows_sample=[]
 - work_experience_in_a_related_occupation: ok=True dummies=3 bad_rows_sample=[]
 - typical_on-the-job_training_needed_to_attain_competency_in_the_occupation: ok=True dummies=3 bad_rows_sample=[]
 - related_occupational_outlook_handbook_(ooh)_content: ok=True dummies=2 bad_rows_sample=[]
 - _is_tech: ok=True dummies=1 bad_rows_sample=[]
 - _tech_reason: ok=True dummies=2 bad_rows_sample=[]
 - job_title_norm: ok=True dummies=22 bad_rows_sample=[]

Random spot-checks (BLS):
Row 20:
  job_title: orig='Data scientists'  |  enc='Data scientists'
  soc_code: orig='15-2051'  |  enc='15-2051'
  occupation_type: orig='Line item'  |  enc='Line item'
  typical_ed

## 8. Save Encoded Data & Category Maps

**Goal:** persist the outputs of Step 1 so you can (a) reuse them in later steps, and (b) apply the **same** encoding scheme to validation/test data.

### What we will save
- **BLS (row-level, encoded):**  
  `data/encoded/bls_encoded_step1.csv`
- **BLS category map (observed levels only):**  
  `data/encoded/bls_categories_map.json`  
  *Used later to guarantee the same dummy columns appear in the same order.*
- **Skills (row-level, with `canon_skill` one-hot):**  
  `data/encoded/skills_row_level_step1.csv`
- **Skills (aggregated to one row per `job_title_norm`):**  
  `data/encoded/skills_features_by_title_step1.csv`
- **Skills category map (observed levels only for `canon_skill`):**  
  `data/encoded/skills_categories_map.json`
- **(If you merged) BLS + Skills modeling frame:**  
  `data/encoded/bls_plus_skills_step1.csv`
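One way the saved category map could be used later is to force new data into the same dummy columns, in the same order. The sketch below is an assumption about that downstream step, not code from this notebook: `encode_with_catmap` is a hypothetical helper, and the map is an in-memory dict standing in for the JSON file.

```python
import pandas as pd

def encode_with_catmap(df_new, catmap, prefix_sep="="):
    """One-hot encode df_new so its dummy columns match a saved category map (hypothetical helper)."""
    cat_cols = [c for c in catmap if c in df_new.columns]
    encoded = pd.get_dummies(df_new, columns=cat_cols, prefix_sep=prefix_sep, dtype="uint8")
    expected = [f"{c}{prefix_sep}{lvl}" for c in cat_cols for lvl in catmap[c]]
    passthrough = [c for c in df_new.columns if c not in cat_cols]
    # reindex adds absent levels as all-zero columns, drops levels not in the map,
    # and fixes the column order
    out = encoded.reindex(columns=passthrough + expected, fill_value=0)
    return out.astype({c: "uint8" for c in expected})

# Stand-in for the contents of a saved categories map
catmap = {"education": ["Associate's degree", "Bachelor's degree", "Master's degree"]}

df_new = pd.DataFrame({
    "employment_2023": [10.0, 20.0],
    "education": ["Master's degree", "Bachelor's degree"],
})

encoded_new = encode_with_catmap(df_new, catmap)
print(encoded_new.columns.tolist())
```

A level that appears in new data but not in the map is silently dropped here, leaving that row with all zeros for the field. Mapping such values to `__OTHER__` before encoding would be a safer choice if the map contains that sentinel.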

In [138]:
import os, json
os.makedirs("data/encoded", exist_ok=True)

df_bls_dummies.to_csv("data/encoded/bls_encoded_step1.csv", index=False)
bls_catmap = {c: sorted(df_bls_ready[c].dropna().astype(str).unique().tolist()) for c in bls_cat_cols}
with open("data/encoded/bls_categories_map.json", "w") as f:
    json.dump(bls_catmap, f, indent=2)
print("Saved BLS encodings + category map.")

# Save Skills encodings (row-level + aggregated) + category map
skill_dummies.to_csv("data/encoded/skills_row_level_step1.csv", index=False)
skills_by_title.to_csv("data/encoded/skills_features_by_title_step1.csv", index=False)
with open("data/encoded/skills_categories_map.json", "w") as f:
    json.dump({"canon_skill": sorted(df_sk_ready["canon_skill"].astype(str).unique().tolist())}, f, indent=2)
print("Saved Skills encodings + category map.")

Saved BLS encodings + category map.
Saved Skills encodings + category map.


## 9. Merge Skills → BLS (by `job_title_norm`)

**Goal:** attach skills features (from the skills dataset) to the BLS feature matrix so each BLS occupation row includes skill indicators.

### Inputs
- `df_bls_dummies` — BLS one-hot encoded frame
- `df_bls_ready` — BLS cleaned/lumped source (used to reattach `job_title_norm`)
- `skills_by_title` — one row per `job_title_norm` with `canon_skill=` dummy columns (binary presence)
- `data/encoded/title_mapping_suggestions_auto.csv` (written in Step 9.2): auto-suggested title matches, saved for manual review

### Process
1. **Canonicalize merge keys** on both sides (lowercase, `&→and`, drop 4-digit years, strip punctuation, collapse spaces).
2. **Auto-map BLS titles → closest skills titles** with a hybrid similarity score  
   (`0.6 ×` token Jaccard `+ 0.4 ×` sequence similarity), keeping the rank-1 match only when its score is **at or above a threshold** (e.g., `0.60`).
3. **Deduplicate skills side by canonical key** (`sk_key`) by taking `max()` across binary skill features.
4. **Left-join** skills into BLS via the accepted mapping.  
   Fill missing skill columns with `0` (`uint8`) so rows without matches remain valid.
5. **Order columns**: ids → BLS numerics → BLS dummies → prefixed skill columns.

### Output artifacts
- **Merged modeling frame:** `data/encoded/bls_plus_skills_step1.csv`
- **Suggestions CSV (for review/edits):** `data/encoded/title_mapping_suggestions_auto.csv`
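
Process steps 3 and 4 (deduplicate by `max()`, then left-join and zero-fill) can be sketched on toy data as follows. The titles and skill values below are made up for illustration, and the frame names `skills`, `bls`, and `merged` are hypothetical.

```python
import pandas as pd

# Two rows collapse to the same canonical key "data analyst"
skills = pd.DataFrame({
    "sk_key": ["data analyst", "data analyst", "web developer"],
    "skill_programming": [0, 1, 1],
    "skill_design_ux": [0, 0, 1],
})

# Step 3: one row per sk_key; max() keeps a 1 if any duplicate row had the skill
skills_unique = skills.groupby("sk_key", as_index=False).max()

# Step 4: left-join so unmatched BLS rows survive, then fill their skills with 0
bls = pd.DataFrame({
    "bls_key": ["data analysts", "actuaries"],
    "sk_key": ["data analyst", None],  # None = no accepted mapping
})
merged = bls.merge(skills_unique, on="sk_key", how="left", validate="m:1")

skill_cols = ["skill_programming", "skill_design_ux"]
merged[skill_cols] = merged[skill_cols].fillna(0).astype("uint8")
print(merged)
```

`validate="m:1"` raises if the right-hand keys are not unique, which is why the deduplication has to happen before the join.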

In [141]:
import re
from difflib import SequenceMatcher

STOPWORDS = {
    "and","or","of","the","a","an","to","for","in","on","with","without",
    "senior","jr","junior","sr","lead","principal","associate","assistant",
    "intern","internship","i","ii","iii","iv","v","project","program","summer",
    "ms","bs","phd"
}

def canon_text(s: str) -> str:
    s = s.lower()
    s = s.replace("&", " and ")
    s = re.sub(r"\d{4}", " ", s)           # drop years
    s = re.sub(r"[^a-z0-9]+", " ", s)      # punctuation -> space
    s = re.sub(r"\s+", " ", s).strip()
    return s

def tokens(s: str) -> list:
    s = canon_text(s)
    toks = [t for t in s.split() if t not in STOPWORDS and len(t) > 1]
    return toks

def jaccard(a: set, b: set) -> float:
    if not a and not b: return 0.0
    return len(a & b) / len(a | b)

def seq_ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def hybrid_score(a_raw: str, b_raw: str) -> float:
    """Blend token Jaccard and sequence similarity for robust matching."""
    a_c, b_c = canon_text(a_raw), canon_text(b_raw)
    a_t, b_t = set(tokens(a_raw)), set(tokens(b_raw))
    return 0.6 * jaccard(a_t, b_t) + 0.4 * seq_ratio(a_c, b_c)

def canon_key_series(s: pd.Series) -> pd.Series:
    """Canonical string key for joining."""
    s = s.astype(str).str.lower()
    s = s.str.replace("&", " and ", regex=False)
    s = s.str.replace(r"\d{4}", " ", regex=True)
    s = s.str.replace(r"[^a-z0-9]+", " ", regex=True)
    s = s.str.replace(r"\s+", " ", regex=True).str.strip()
    return s

def slugify(x: str) -> str:
    x = x.lower()
    x = re.sub(r"[^\w]+", "_", x)
    x = re.sub(r"_+", "_", x).strip("_")
    return x

# --------------------------
# 9.1 Ensure BLS has merge key column
# --------------------------
id_col = "job_title_norm"
if id_col not in df_bls_dummies.columns:
    df_bls_dummies = df_bls_dummies.copy()
    df_bls_dummies[id_col] = df_bls_ready[id_col].values

# --------------------------
# 9.2 Build auto-suggestions (top-5) BLS -> Skills
# --------------------------
bls_titles = sorted(df_bls_ready[id_col].astype(str).unique())
sk_titles  = sorted(skills_by_title[id_col].astype(str).unique())

rows = []
for bt in bls_titles:
    scores = [(st, hybrid_score(bt, st)) for st in sk_titles]
    scores.sort(key=lambda x: x[1], reverse=True)
    for rank, (st, sc) in enumerate(scores[:5], 1):
        rows.append({"bls_title": bt, "suggested_sk_title": st, "rank": rank, "score": round(sc, 3)})

suggest_df = pd.DataFrame(rows)
suggest_csv = "data/encoded/title_mapping_suggestions_auto.csv"
suggest_df.to_csv(suggest_csv, index=False)
print(f"[9.2] Wrote suggestions to: {suggest_csv}")
display(suggest_df.head(10))

# --------------------------
# 9.3 Accept rank-1 suggestions above a threshold and make a mapping
# --------------------------
THRESHOLD = 0.60  # adjust 0.55–0.70 based on quality/coverage
mapping = (suggest_df[suggest_df["rank"] == 1]
           .query("score >= @THRESHOLD")
           .copy())

# Build canonical keys for mapping
mapping["bls_key"] = canon_key_series(mapping["bls_title"])
mapping["sk_key"]  = canon_key_series(mapping["suggested_sk_title"])

# Keep best-scoring row per bls_key
mapping = (mapping
           .sort_values(["bls_key", "score"], ascending=[True, False])
           .drop_duplicates(subset=["bls_key"], keep="first"))

print(f"[9.3] Auto-accepted mappings (score >= {THRESHOLD}): {len(mapping)}")
display(mapping.sort_values("score", ascending=False).head(10))

# --------------------------
# 9.4 Prepare skills features table with unique sk_key on the right
# --------------------------
skill_cols_raw = [c for c in skills_by_title.columns if c.startswith("canon_skill=")]
rename_map = {c: "skill_" + slugify(c.split("=", 1)[1]) for c in skill_cols_raw}

skills_prefixed = skills_by_title.rename(columns=rename_map).copy()
skills_prefixed["sk_key"] = canon_key_series(skills_prefixed[id_col])

skill_feature_cols = list(rename_map.values())

# Deduplicate any repeated sk_key by max across binary features
skills_right_unique = (skills_prefixed.groupby("sk_key", as_index=False)[skill_feature_cols].max())

print(f"[9.4] Skills features: {len(skill_feature_cols)} columns; unique sk_key rows: {len(skills_right_unique)}")

# --------------------------
# 9.5 Merge mapping into BLS, then attach skills by sk_key
# --------------------------
df_bls_dummies = df_bls_dummies.copy()
df_bls_dummies["bls_key"] = canon_key_series(df_bls_dummies[id_col])

# Attach chosen skills key to each BLS row (m:1)
df_bls_map = df_bls_dummies.merge(
    mapping[["bls_key", "sk_key", "score"]],
    on="bls_key",
    how="left",
    validate="m:1"
)

# Attach skills features (m:1) by unique sk_key
df_model = df_bls_map.merge(
    skills_right_unique,
    on="sk_key",
    how="left",
    validate="m:1"
)

# Fill missing skill features with 0 and keep uint8
if skill_feature_cols:
    df_model[skill_feature_cols] = df_model[skill_feature_cols].fillna(0).astype("uint8")

# --------------------------
# 9.6 Order columns & report coverage
# --------------------------
# BLS numerics = original non-categorical BLS columns
bls_num_cols   = [c for c in df_bls_ready.columns if c not in bls_cat_cols]
# BLS dummy columns (exclude id+keys)
bls_dummy_cols = [c for c in df_bls_dummies.columns if c not in bls_num_cols + [id_col, "bls_key"]]
bls_dummy_cols = sorted(bls_dummy_cols)

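ordered_cols = [id_col, "bls_key", "sk_key", "score"] + bls_num_cols + bls_dummy_cols + sorted(skill_feature_cols)
# Keep only columns present in df_model; dict.fromkeys drops repeats
# (e.g., id_col also appearing in bls_num_cols) while preserving order
ordered_cols = [c for c in dict.fromkeys(ordered_cols) if c in df_model.columns]
df_model = df_model[ordered_cols]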

print(f"[9.6] Re-merged frame shape: {df_model.shape}")
display(df_model.head(3))

# Coverage: share of BLS rows with >=1 skill attached
coverage = (df_model[skill_feature_cols].sum(axis=1) > 0).mean() if skill_feature_cols else 0.0
print(f"[9.6] Coverage after auto-mapping (threshold {THRESHOLD}): {coverage:.1%}")

# --------------------------
# 9.7 Save merged modeling frame
# --------------------------
merged_out = "data/encoded/bls_plus_skills_step1.csv"
df_model.to_csv(merged_out, index=False)
print("[9.7] Saved merged frame:", merged_out)

[9.2] Wrote suggestions to: data/encoded/title_mapping_suggestions_auto.csv


| | bls_title | suggested_sk_title | rank | score |
|---|---|---|---|---|
| 0 | actuaries | actuarial student | 1 | 0.246 |
| 1 | actuaries | actuarial assistant | 2 | 0.229 |
| 2 | actuaries | actuarial analyst- exam | 3 | 0.206 |
| 3 | actuaries | data experience | 4 | 0.2 |
| 4 | actuaries | data quality tester | 5 | 0.2 |
| 5 | computer and information research scientists | ai research scientist | 1 | 0.358 |
| 6 | computer and information research scientists | research scientist | 2 | 0.352 |
| 7 | computer and information research scientists | research scientist ii | 3 | 0.342 |
| 8 | computer and information research scientists | operations research analyst 3 | 4 | 0.341 |
| 9 | computer and information research scientists | stops research scientist 1 | 5 | 0.34 |


[9.3] Auto-accepted mappings (score >= 0.6): 5


| | bls_title | suggested_sk_title | rank | score | bls_key | sk_key |
|---|---|---|---|---|---|---|
| 45 | data scientists | data scientists analyst | 1 | 0.716 | data scientists | data scientists analyst |
| 60 | information security analysts | information security analyst | 1 | 0.693 | information security analysts | information security analyst |
| 80 | operations research analysts | operations research analyst 3 | 1 | 0.679 | operations research analysts | operations research analyst 3 |
| 90 | software quality assurance analysts and testers | software quality assurance analyst | 1 | 0.636 | software quality assurance analysts and testers | software quality assurance analyst |
| 35 | computer systems analysts | computer systems administrator | 1 | 0.605 | computer systems analysts | computer systems administrator |


[9.4] Skills features: 5 columns; unique sk_key rows: 1929
[9.6] Re-merged frame shape: (22, 104)


Unnamed: 0,job_title_norm,bls_key,sk_key,score,employment_2023,employment_2033,employment_distribution_percent_2023,employment_distribution_percent_2033,employment_change_numeric_2023-33,employment_change_percent_2023-33,...,typical_on-the-job_training_needed_to_attain_competency_in_the_occupation=Moderate-term on-the-job training,typical_on-the-job_training_needed_to_attain_competency_in_the_occupation=__MISSING__,work_experience_in_a_related_occupation=5 years or more,work_experience_in_a_related_occupation=Less than 5 years,work_experience_in_a_related_occupation=__MISSING__,skill_ai_big_data,skill_design_ux,skill_networks_cybersecurity,skill_programming,skill_technology_literacy
0,computer and information systems managers,computer and information systems managers,,,613.5,720.4,0.4,0.4,106.9,17.4,...,0,1,1,0,0,0,0,0,0,0
1,computer systems analysts,computer systems analysts,computer systems administrator,0.605,527.2,583.7,0.3,0.3,56.5,10.7,...,0,1,0,0,1,0,0,0,1,1
2,information security analysts,information security analysts,information security analyst,0.693,180.7,239.8,0.1,0.1,59.1,32.7,...,0,1,0,1,0,1,0,1,1,1


[9.6] Coverage after auto-mapping (threshold 0.6): 22.7%
[9.7] Saved merged frame: data/encoded/bls_plus_skills_step1.csv
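
As a check on the hybrid score, the top accepted pair from the `[9.3]` output (`data scientists` → `data scientists analyst`) can be recomputed by hand. This is a stripped-down sketch of the scoring in the cell above: the local `canon` function is a simplified stand-in for `canon_text`, and stopword filtering is skipped because neither title contains a stopword.

```python
import re
from difflib import SequenceMatcher


def canon(s: str) -> str:
    """Simplified stand-in for canon_text: lowercase, strip punctuation, collapse spaces."""
    s = s.lower().replace("&", " and ")
    s = re.sub(r"\d{4}", " ", s)
    s = re.sub(r"[^a-z0-9]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()


a = canon("data scientists")            # 15 characters
b = canon("data scientists analyst")    # 23 characters

a_tok, b_tok = set(a.split()), set(b.split())
jac = len(a_tok & b_tok) / len(a_tok | b_tok)   # 2 shared tokens / 3 distinct tokens
seq = SequenceMatcher(None, a, b).ratio()       # 2 * 15 matched chars / (15 + 23)
score = 0.6 * jac + 0.4 * seq

print(round(jac, 3), round(seq, 3), round(score, 3))  # 0.667 0.789 0.716
```

The result agrees with the `0.716` shown for this pair. It also shows that a title which is a strict prefix of another can still land only modestly above the `0.60` threshold.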


## 10) Identify Numeric Features to Scale

**Goal:** Decide which columns to standardize in Step 12 (with `StandardScaler`).  
We’ll target **continuous numeric** features and **exclude**:
- Identifier columns (e.g., `job_title_norm`, `bls_key`, `sk_key`)
- One-hot dummies (columns containing `"="`)
- Skill indicator columns (prefix `skill_`)

**What this cell does**
1. Chooses the working feature matrix:
   - If you merged Skills → BLS, it uses `df_model`
   - Otherwise, it falls back to `df_bls_dummies`
2. Detects numeric columns and removes IDs, one-hot dummies, and skill flags.
3. Prints the final `num_to_scale` list and some quick stats so you can sanity-check.
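
The selection rule can be sketched on a toy frame. Column names and values here are illustrative only, and the sketch assumes the dummies are stored as `uint8`. If `get_dummies` produced `bool` columns instead, `select_dtypes(include="number")` would not return them in the first place.

```python
import pandas as pd

toy = pd.DataFrame({
    "job_title_norm": ["a", "b"],                                       # identifier
    "employment_2023": [10.5, 20.0],                                    # continuous
    "education=Bachelor's degree": pd.Series([1, 0], dtype="uint8"),    # one-hot dummy
    "skill_programming": pd.Series([0, 1], dtype="uint8"),              # skill flag
})

id_cols = {"job_title_norm"}
indicator_cols = {c for c in toy.columns if "=" in c or c.startswith("skill_")}
excluded = id_cols | indicator_cols

# Numeric dtypes only, then drop identifiers and 0/1 indicators
numeric = toy.select_dtypes(include="number").columns
num_to_scale = [c for c in numeric if c not in excluded]
print(num_to_scale)
```

Indicators are left out because standardizing a 0/1 column changes its values without adding information, and it makes the column harder to read back as "present/absent".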


In [146]:
# 10.1 Pick the working frame (merged if available, otherwise BLS-only)
df = df_model if 'df_model' in globals() else df_bls_dummies

# 10.2 Identify columns to exclude from scaling
id_cols = [c for c in ["job_title_norm", "bls_key", "sk_key"] if c in df.columns]
skill_cols = [c for c in df.columns if c.startswith("skill_")]
# One-hot dummies produced by get_dummies typically have "column=value" in the name
dummy_cols = [c for c in df.columns if "=" in c] + skill_cols

# 10.3 Candidate numeric columns (ints/floats)
candidate_num = df.select_dtypes(include=["number"]).columns.tolist()

# 10.4 Final list: numeric columns minus IDs and dummies/skills
num_to_scale = [c for c in candidate_num if c not in (set(id_cols) | set(dummy_cols))]

print("Working frame shape:", df.shape)
print("Identifier columns (excluded):", id_cols)
print("Detected one-hot & skill columns (excluded):", len(dummy_cols))
print("Numeric candidates:", candidate_num)
print("\n Numeric columns to scale:", num_to_scale)

# 10.5 Quick health check on the columns to scale
if num_to_scale:
    # Count missing values in columns to scale
    na_counts = df[num_to_scale].isna().sum().sort_values(ascending=False)
    print("\nMissing values among columns to scale (top 10):")
    print(na_counts.head(10))

    # Basic descriptive stats to sanity-check ranges (show a few)
    print("\nPreview stats (first 6 columns to scale):")
    display(df[num_to_scale[:6]].describe(include='all').T)
else:
    print("\n No numeric columns selected for scaling. "
          "If this seems wrong, double-check that your continuous features weren't filtered out.")

Working frame shape: (22, 104)
Identifier columns (excluded): ['job_title_norm', 'bls_key', 'sk_key']
Detected one-hot & skill columns (excluded): 87
Numeric candidates: ['score', 'employment_2023', 'employment_2033', 'employment_distribution_percent_2023', 'employment_distribution_percent_2033', 'employment_change_numeric_2023-33', 'employment_change_percent_2023-33', 'percent_self_employed_2023', 'occupational_openings_2023-33_annual_average', 'median_annual_wage_dollars_2024', 'postings', 'median_salary_usd', 'remote_ratio_avg', '_is_tech=True', '_tech_reason=hard_include_or_must_have', '_tech_reason=soc15_default_keep', 'job_title=Actuaries', 'job_title=Computer and information research scientists', 'job_title=Computer and information systems managers', 'job_title=Computer network architects', 'job_title=Computer network support specialists', 'job_title=Computer occupations, all other', 'job_title=Computer programmers', 'job_title=Computer systems analysts', 'job_title=Computer use

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| score | 5.0 | 0.6658 | 0.044774 | 0.605 | 0.636 | 0.679 | 0.693 | 0.716 |
| employment_2023 | 22.0 | 274.140909 | 376.279275 | 2.5 | 66.175 | 153.05 | 302.8 | 1692.1 |
| employment_2033 | 22.0 | 310.763636 | 437.685794 | 2.5 | 72.775 | 165.2 | 313.95 | 1995.7 |
| employment_distribution_percent_2023 | 22.0 | 0.159091 | 0.226062 | 0.0 | 0.0 | 0.1 | 0.175 | 1.0 |
| employment_distribution_percent_2033 | 22.0 | 0.168182 | 0.243753 | 0.0 | 0.0 | 0.1 | 0.2 | 1.1 |
| employment_change_numeric_2023-33 | 22.0 | 36.636364 | 66.521436 | -13.4 | 6.6 | 11.15 | 47.625 | 303.7 |


## 11) Train/Test Split (before scaling)

**Goal:** Create a hold-out test set and avoid data leakage by splitting **before** fitting the scaler.

**Notes**
- Pick a target `y_col` you want to predict (you can change this anytime).
- We’ll also **drop any numeric columns that are entirely NaN** (e.g., `postings`, `median_salary_usd`, `remote_ratio_avg` in your run) since they carry no information.

In [154]:
# 11) Train/Test split (before scaling)

from sklearn.model_selection import train_test_split
import pandas as pd

df = df_model if 'df_model' in globals() else df_bls_dummies
assert 'num_to_scale' in globals(), "Run Step 10 first."

# Drop numeric columns that are entirely NaN
all_nan_cols = [c for c in num_to_scale if df[c].isna().all()]
if all_nan_cols:
    print("Dropping all-NaN numeric columns (no signal):", all_nan_cols)
    df = df.drop(columns=all_nan_cols)
    num_to_scale = [c for c in num_to_scale if c not in all_nan_cols]

# Choose your target
y_col = "median_annual_wage_dollars_2024"

X = df.drop(columns=[y_col]).copy()
y = df[y_col].copy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# IMPORTANT: remove target (if present) and keep only columns that exist in X
num_to_scale = [c for c in num_to_scale if c != y_col and c in X.columns]

print("Train size:", len(X_train), "| Test size:", len(X_test))
print("Columns to scale:", num_to_scale)

Train size: 17 | Test size: 5
Columns to scale: ['score', 'employment_2023', 'employment_2033', 'employment_distribution_percent_2023', 'employment_distribution_percent_2033', 'employment_change_numeric_2023-33', 'employment_change_percent_2023-33', 'percent_self_employed_2023', 'occupational_openings_2023-33_annual_average']


## 12) Standardize Numeric Features (fit on TRAIN only)

**Goal:** Put continuous features on a comparable scale (mean≈0, std≈1) using `StandardScaler`.

**What this does**
- Median-imputes numerics (TRAIN-only stats) so the scaler won’t see NaNs.
- **Fits** the scaler on TRAIN numerics and **transforms** both TRAIN and TEST.
- Leaves IDs, one-hot dummies, and skill flags **untouched**.
- Prints a quick sanity check on TRAIN means/stdevs.
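
For intuition, the same arithmetic can be written out by hand with pandas. This is a sketch on made-up numbers, not a replacement for `StandardScaler`. It assumes one numeric column named `employment`, and it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` computes and why the sanity check in the cell below also passes `ddof=0`.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"employment": [1.0, np.nan, 3.0, 5.0]})
test = pd.DataFrame({"employment": [np.nan, 7.0]})

# 1) Impute with TRAIN medians only (median of 1, 3, 5 is 3)
med = train.median()
train_f, test_f = train.fillna(med), test.fillna(med)

# 2) Fit: mean and population std come from TRAIN only
mu = train_f.mean()
sigma = train_f.std(ddof=0)

# 3) Transform both splits with the TRAIN statistics
train_z = (train_f - mu) / sigma
test_z = (test_f - mu) / sigma

print(train_z["employment"].round(3).tolist())
print(test_z["employment"].round(3).tolist())
```

TEST values are not expected to have mean 0 and std 1 after this step, because they are shifted and scaled by TRAIN statistics. With only a handful of test rows the gap can be large.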

In [156]:
# 12) Standardize numerics (TRAIN-only fit)

from sklearn.preprocessing import StandardScaler
import pandas as pd

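if not num_to_scale:
    print("No numeric columns to scale after filtering. Skipping scaling step.")
    # Define pass-through outputs so Step 13 can still run
    scaler = None
    X_train_scaled, X_test_scaled = X_train.copy(), X_test.copy()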
else:
    # Use TRAIN medians to fill NaNs in BOTH TRAIN and TEST
    train_medians = X_train[num_to_scale].median()

    X_train_num = X_train[num_to_scale].copy().fillna(train_medians)
    # reindex ensures same column order even if something got dropped upstream
    X_test_num  = X_test.reindex(columns=num_to_scale).copy().fillna(train_medians)

    scaler = StandardScaler()
    X_train_num_scaled = pd.DataFrame(
        scaler.fit_transform(X_train_num),
        columns=num_to_scale, index=X_train.index
    )
    X_test_num_scaled = pd.DataFrame(
        scaler.transform(X_test_num),
        columns=num_to_scale, index=X_test.index
    )

    # Recombine: scaled numerics + untouched columns
    X_train_scaled = pd.concat(
        [X_train.drop(columns=num_to_scale, errors="ignore"), X_train_num_scaled],
        axis=1
    )
    X_test_scaled = pd.concat(
        [X_test.drop(columns=num_to_scale, errors="ignore"), X_test_num_scaled],
        axis=1
    )

    # Sanity check: TRAIN means ~0, stds ~1
    print("Means (TRAIN, scaled):", X_train_scaled[num_to_scale].mean().round(3).to_dict())
    print("Stds  (TRAIN, scaled):", X_train_scaled[num_to_scale].std(ddof=0).round(3).to_dict())


Means (TRAIN, scaled): {'score': -0.0, 'employment_2023': 0.0, 'employment_2033': 0.0, 'employment_distribution_percent_2023': 0.0, 'employment_distribution_percent_2033': 0.0, 'employment_change_numeric_2023-33': -0.0, 'employment_change_percent_2023-33': 0.0, 'percent_self_employed_2023': -0.0, 'occupational_openings_2023-33_annual_average': -0.0}
Stds  (TRAIN, scaled): {'score': 1.0, 'employment_2023': 1.0, 'employment_2033': 1.0, 'employment_distribution_percent_2023': 1.0, 'employment_distribution_percent_2033': 1.0, 'employment_change_numeric_2023-33': 1.0, 'employment_change_percent_2023-33': 1.0, 'percent_self_employed_2023': 1.0, 'occupational_openings_2023-33_annual_average': 1.0}


## 13) Save Scaled Splits & Scaler (+ feature manifest)

**Goal:** Persist the TRAIN/TEST feature matrices, targets, and the fitted `StandardScaler`, plus a small
feature manifest so you can reliably reconstruct the same preprocessing later.

**We will save**
- `data/step2/X_train_scaled.csv`, `X_test_scaled.csv`
- `data/step2/y_train.csv`, `y_test.csv`
- `data/step2/standard_scaler.pkl`
- `data/step2/features_manifest.json` (lists of IDs, dummy/skill cols, and `num_to_scale`)

In [159]:
# 13) Save scaled splits, scaler, and a feature manifest

import os, json, pickle

os.makedirs("data/step2", exist_ok=True)

# Core artifacts
X_train_scaled.to_csv("data/step2/X_train_scaled.csv", index=False)
X_test_scaled.to_csv("data/step2/X_test_scaled.csv", index=False)
y_train.to_csv("data/step2/y_train.csv", index=False)
y_test.to_csv("data/step2/y_test.csv", index=False)

with open("data/step2/standard_scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

# Helpful manifest for reproducibility
manifest = {
    "id_cols": [c for c in ["job_title_norm", "bls_key", "sk_key"] if c in X_train_scaled.columns],
    "dummy_cols": sorted([c for c in X_train_scaled.columns if "=" in c]),
    "skill_cols": sorted([c for c in X_train_scaled.columns if c.startswith("skill_")]),
    "num_to_scale": num_to_scale,               # scaled numeric columns (order preserved in CSV)
    "target": y_col,                            # target chosen in Step 11
    "notes": "Scale with StandardScaler fitted on TRAIN; TRAIN medians used for numeric imputation."
}
with open("data/step2/features_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)

print("Saved:")
print(" - data/step2/X_train_scaled.csv")
print(" - data/step2/X_test_scaled.csv")
print(" - data/step2/y_train.csv")
print(" - data/step2/y_test.csv")
print(" - data/step2/standard_scaler.pkl")
print(" - data/step2/features_manifest.json")


Saved:
 - data/step2/X_train_scaled.csv
 - data/step2/X_test_scaled.csv
 - data/step2/y_train.csv
 - data/step2/y_test.csv
 - data/step2/standard_scaler.pkl
 - data/step2/features_manifest.json
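
A later notebook could use the manifest to confirm that a reloaded CSV still has the expected columns, and to separate identifiers from model inputs. The sketch below writes stand-in artifacts to a temporary directory so that it runs on its own. The toy values are not from this dataset, and `model_cols` is a hypothetical name.

```python
import json
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "X_train_scaled.csv")
    manifest_path = os.path.join(tmp, "features_manifest.json")

    # Stand-in artifacts with the same layout as the files saved above
    pd.DataFrame({
        "job_title_norm": ["a", "b"],
        "skill_programming": [1, 0],
        "employment_2023": [0.5, -0.5],
    }).to_csv(csv_path, index=False)
    with open(manifest_path, "w") as f:
        json.dump({
            "id_cols": ["job_title_norm"],
            "dummy_cols": [],
            "skill_cols": ["skill_programming"],
            "num_to_scale": ["employment_2023"],
        }, f)

    # Reload both artifacts
    X_loaded = pd.read_csv(csv_path)
    with open(manifest_path) as f:
        manifest = json.load(f)

# Every column named in the manifest should be present in the CSV
expected = (manifest["id_cols"] + manifest["dummy_cols"]
            + manifest["skill_cols"] + manifest["num_to_scale"])
missing = [c for c in expected if c not in X_loaded.columns]

# Identifiers are strings, so they are dropped before fitting a model
model_cols = [c for c in X_loaded.columns if c not in manifest["id_cols"]]
print("missing:", missing, "| model columns:", model_cols)
```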
