# PwC Recruiting Insight Engine – V2  
## Training, Data Integrity & Diagnostics Notebook

This notebook documents the end-to-end process for:

1. Loading the raw HR datasets (`people.csv`, `salary.csv`, `descriptions.csv`)
2. Merging them via the **Data Integrity Layer**
3. Computing **merge KPIs** and the **Merge Health Index (MHI)**
4. Gating model training based on MHI
5. Training the salary prediction model (Random Forest + preprocessing pipeline)
6. Saving artifacts (preprocessor + model) for the API + Insight Engine
7. Running a small inference sanity check

The notebook is aligned with the **V2 compact & robust architecture**, using the same modules as the production code:

- `data_integrity.merge`
- `data_integrity.kpis`
- `data_integrity.mhi`
- `data_integrity.diagnostics`
- `model.train`
- `model.inference`
- `model.artifacts`


In [None]:
import os
import sys
import pandas as pd

# Optional: display options for debugging
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

# -------------------------------------------------------------------
# Ensure project root is on sys.path so we can import internal modules
# -------------------------------------------------------------------
PROJECT_ROOT = os.path.abspath(".")  # assumes the notebook is run from the project root
print("PROJECT_ROOT:", PROJECT_ROOT)

if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

# -------------------------------------------------------------------
# Import internal modules (V2 architecture)
# -------------------------------------------------------------------
from data_integrity.merge import merge_tables
from data_integrity.kpis import compute_merge_kpis
from data_integrity.mhi import compute_mhi
from data_integrity.diagnostics import basic_merge_diagnostics

from model.train import FEATURE_COLUMNS, TARGET_COLUMN
from model.artifacts import save_artifacts
from model.inference import load_pipeline, predict_one


## 1. Load raw datasets

We load the three raw CSV files provided in the challenge:

- `people.csv` → demographic & job info
- `salary.csv` → target variable
- `descriptions.csv` → text descriptions (not used in V2 model, but used in integrity checks)

We start with basic shape checks to ensure the files are readable and consistent.


In [None]:
# Paths relative to project root
DATA_DIR = os.path.join(PROJECT_ROOT, "data")
PEOPLE_PATH = os.path.join(DATA_DIR, "people.csv")
SALARY_PATH = os.path.join(DATA_DIR, "salary.csv")
DESC_PATH = os.path.join(DATA_DIR, "descriptions.csv")

people_df = pd.read_csv(PEOPLE_PATH)
salary_df = pd.read_csv(SALARY_PATH)
desc_df = pd.read_csv(DESC_PATH)

print("people_df shape:", people_df.shape)
print("salary_df shape:", salary_df.shape)
print("desc_df shape:", desc_df.shape)

display(people_df.head())
display(salary_df.head())
display(desc_df.head())


## 2. Merge datasets via Data Integrity Layer

We do **not** merge manually here.  
Instead, we call `data_integrity.merge.merge_tables`, which encapsulates:

- join logic
- key alignment
- type coercions (e.g., ID as integer)
- defensive checks

This keeps the notebook aligned with the production pipeline.
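
For readers without access to the module, the four responsibilities listed above could look roughly like the sketch below. It is not the production `merge_tables`: the function name `merge_tables_sketch`, the join key `"id"`, the left-join strategy and the duplicate handling are all assumptions made for illustration.

```python
import pandas as pd


def merge_tables_sketch(people, salary, descriptions, key="id"):
    """Hypothetical stand-in for merge_tables; the real join logic may differ."""
    frames = {"people": people, "salary": salary, "descriptions": descriptions}
    coerced = []
    for name, frame in frames.items():
        # Defensive check: every table must carry the join key (assumed name).
        if key not in frame.columns:
            raise KeyError(f"{name} is missing the join key {key!r}")
        frame = frame.copy()
        # Type coercion: nullable integer IDs; unparseable values become <NA>.
        frame[key] = pd.to_numeric(frame[key], errors="coerce").astype("Int64")
        # Key alignment: drop rows without a usable key, then duplicate keys.
        frame = frame.dropna(subset=[key]).drop_duplicates(subset=[key])
        coerced.append(frame)

    people_c, salary_c, desc_c = coerced
    # Join logic (assumed): keep every person, attach salary and text if present.
    merged = people_c.merge(salary_c, on=key, how="left", validate="one_to_one")
    return merged.merge(desc_c, on=key, how="left", validate="one_to_one")
```

The `validate="one_to_one"` argument makes pandas raise if a key unexpectedly repeats on either side, which is one simple way to keep a merge deterministic.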


In [None]:
merged_df = merge_tables(people_df, salary_df, desc_df)

print("merged_df shape:", merged_df.shape)
display(merged_df.head())

print("\nColumns:", list(merged_df.columns))
print("\nMissing values per column:")
print(merged_df.isna().sum())


## 3. Compute merge KPIs & diagnostics

We now compute:

- **KPIs** via `compute_merge_kpis`:
  - KAS, SCS, MDC, JSR, CCR_mean, MER_norm, DFC_norm, MDS, …

- **Diagnostics** via `basic_merge_diagnostics`:
  - row counts
  - missing values
  - duplicates
  - sample preview

These feed directly into the **Merge Health Index (MHI)**.
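
As an illustration of the diagnostics listed above, a minimal sketch follows. Only the `sample_preview` key is taken from this notebook; the function name `basic_merge_diagnostics_sketch` and the other keys are hypothetical, and the production function may report different fields.

```python
import pandas as pd


def basic_merge_diagnostics_sketch(merged, preview_rows=5):
    """Hypothetical stand-in; the real function may return different keys."""
    return {
        # Row and column counts of the merged table.
        "n_rows": int(len(merged)),
        "n_columns": int(merged.shape[1]),
        # Missing values per column, as plain ints for easy printing.
        "missing_per_column": {
            col: int(n) for col, n in merged.isna().sum().items()
        },
        # Fully duplicated rows (all columns equal to an earlier row).
        "n_duplicate_rows": int(merged.duplicated().sum()),
        # First rows as a list of dicts, so pd.DataFrame(...) can rebuild them.
        "sample_preview": merged.head(preview_rows).to_dict(orient="records"),
    }
```

Returning plain Python types keeps the result easy to log or serialize alongside the KPIs.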


In [None]:
kpis = compute_merge_kpis(people_df, salary_df, desc_df, merged_df)
diagnostics = basic_merge_diagnostics(merged_df)

print("=== KPIs ===")
for k, v in kpis.items():
    print(f"{k}: {v}")

print("\n=== Diagnostics (summary) ===")
for k, v in diagnostics.items():
    if k != "sample_preview":
        print(f"{k}: {v}")

print("\nSample preview from diagnostics:")
pd.DataFrame(diagnostics["sample_preview"]).head()


## 4. Compute Merge Health Index (MHI)

Using the KPIs, we compute the **MHI** via `data_integrity.mhi.compute_mhi`.

Reminder of the structure:

- `Gate`  → binary (0/1) based on key/schema/determinism
- `Core`  → √(JSR × CCR_mean)
- `Refinement` → exponential penalty based on MER_norm, DFC_norm, drift
- `MHI`  → Gate × Core × Refinement
- `zone` → RED / YELLOW / GREEN

If **Gate = 0** or **zone = RED**, training should be aborted.
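
The structure above can be sketched in a few lines. This is an illustration only, not the production `compute_mhi`: the mapping of `KAS`, `SCS` and `MDC` to the gate, the use of `MDS` as the drift term, the penalty weight and the zone thresholds are all assumptions.

```python
import math


def compute_mhi_sketch(kpis, penalty_weight=1.0, red_below=0.5, green_from=0.8):
    """Hypothetical MHI; gate rule, weight and thresholds are assumptions."""
    # Gate: 1 only if all three gate KPIs reach 1.0 (assumed rule and KPI names).
    gate = int(all(kpis.get(name, 0.0) >= 1.0 for name in ("KAS", "SCS", "MDC")))
    # Core: sqrt(JSR * CCR_mean), i.e. the geometric mean of the two KPIs.
    core = math.sqrt(kpis["JSR"] * kpis["CCR_mean"])
    # Refinement: exponential penalty, equal to 1.0 when every penalty is 0.
    penalty = (
        kpis.get("MER_norm", 0.0)
        + kpis.get("DFC_norm", 0.0)
        + kpis.get("MDS", 0.0)  # assumed to be the drift term
    )
    refinement = math.exp(-penalty_weight * penalty)
    mhi = gate * core * refinement
    # Zone thresholds are illustrative, not the production values.
    if gate == 0 or mhi < red_below:
        zone = "RED"
    elif mhi < green_from:
        zone = "YELLOW"
    else:
        zone = "GREEN"
    return {"Gate": gate, "Core": core, "Refinement": refinement,
            "MHI": mhi, "zone": zone}
```

Because the three factors are multiplied, a failed gate forces the MHI to 0 regardless of how good the other KPIs are, which is what makes it usable as a hard stop for training.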


In [None]:
mhi_result = compute_mhi(kpis)

print("=== MHI RESULT ===")
for k, v in mhi_result.items():
    print(f"{k}: {v}")

zone = mhi_result.get("zone", "UNKNOWN")
mhi_value = mhi_result.get("MHI", None)

print("\nMHI zone:", zone)
if mhi_value is not None:
    print("MHI score:", round(mhi_value, 3))

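gate = mhi_result.get("Gate", 0)
if zone == "RED" or gate == 0:
    # Format defensively: ":.3f" would raise TypeError if MHI is missing (None).
    mhi_text = f"{mhi_value:.3f}" if mhi_value is not None else "n/a"
    raise RuntimeError(
        f"❌ Training should NOT proceed: MHI={mhi_text}, zone={zone}, Gate={gate}"
    )
else:
    print(f"\n✅ MHI is acceptable for training (zone={zone}).")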


## 5. Prepare training dataset

We now:

1. Drop rows with missing target (`Salary`)
2. Use the same `FEATURE_COLUMNS` and `TARGET_COLUMN` as the production pipeline
3. Inspect basic statistics of the final training set


In [None]:
print(f"Rows before dropping missing {TARGET_COLUMN}:", len(merged_df))

clean_df = merged_df.dropna(subset=[TARGET_COLUMN])
print(f"Rows after dropping missing {TARGET_COLUMN}:", len(clean_df))

X = clean_df[FEATURE_COLUMNS].copy()
y = clean_df[TARGET_COLUMN].copy()

print("\nFeature columns:", FEATURE_COLUMNS)
print("Target column:", TARGET_COLUMN)

display(X.head())
print("\nTarget summary:")
display(y.describe())


## 6. Train model via production training logic

To stay consistent with the V2 architecture and avoid code drift,
we **reuse the training logic** from `model.train`.

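Instead of re-implementing the pipeline here, we call `train_model()` and
capture its summary (MHI, KPIs, diagnostics). It is called without arguments,
so the `X` and `y` prepared in section 5 are for inspection only.
During the call, `train_model()`: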

- builds the preprocessing pipeline
- trains the RandomForestRegressor
- saves artifacts (`preprocessor.pkl`, `model.pkl`) into `model/artifacts/`
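
For orientation, a preprocessing pipeline plus Random Forest of the kind described above might be assembled as in the sketch below, assuming scikit-learn is the underlying library. The column names are the ones used in this notebook's inference cell; the function name `fit_sketch`, the split into `NUMERIC` and `CATEGORICAL`, the imputation strategies, the one-hot encoding and the hyperparameters are assumptions, not the contents of `model.train`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed split of the feature columns by type.
NUMERIC = ["Age", "Years of Experience"]
CATEGORICAL = ["Gender", "Education Level", "Job Title"]


def fit_sketch(X, y, random_state=42):
    """Hypothetical training step: fit a preprocessor, then a Random Forest."""
    preprocessor = ColumnTransformer([
        # Numeric columns: fill gaps with the median.
        ("num", SimpleImputer(strategy="median"), NUMERIC),
        # Categorical columns: fill gaps, then one-hot encode.
        # handle_unknown="ignore" lets unseen job titles pass at inference time.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), CATEGORICAL),
    ])
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model.fit(preprocessor.fit_transform(X), y)
    # Returned separately to mirror the two saved artifacts.
    return preprocessor, model
```

Keeping the preprocessor and the model as two objects matches the two artifact files, at the cost of having to apply `preprocessor.transform` before every `model.predict` call.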


In [None]:
from model.train import train_model

training_summary = train_model()

print("\n=== TRAINING SUMMARY (from train_model) ===")
for section, content in training_summary.items():
    print(f"\n[{section.upper()}]")
    print(content)


## 7. Load artifacts and run a small inference sanity check

Now we confirm that:

1. Artifacts were saved correctly to `model/artifacts/`
2. We can load the pipeline with `model.inference.load_pipeline`
3. We can call `predict_one()` for a sample candidate profile

This closes the loop: **Data Integrity → Training → Artifacts → Inference.**
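
The save and reload round trip checked in this step can be pictured with the sketch below. It assumes the artifacts are plain pickle files named `preprocessor.pkl` and `model.pkl`, as mentioned in section 6; the function names are hypothetical, and the real `model.artifacts` and `model.inference` modules may use a different serializer or layout.

```python
import pickle
from pathlib import Path

# File names taken from section 6; the pickle format itself is an assumption.
ARTIFACT_NAMES = ("preprocessor.pkl", "model.pkl")


def save_artifacts_sketch(preprocessor, model, artifact_dir):
    """Hypothetical save step: write both objects into artifact_dir."""
    artifact_dir = Path(artifact_dir)
    artifact_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in zip(ARTIFACT_NAMES, (preprocessor, model)):
        with open(artifact_dir / name, "wb") as fh:
            pickle.dump(obj, fh)


def load_artifacts_sketch(artifact_dir):
    """Hypothetical load step: return (preprocessor, model) or fail loudly."""
    artifact_dir = Path(artifact_dir)
    loaded = []
    for name in ARTIFACT_NAMES:
        path = artifact_dir / name
        # Fail early with a clear message if training has not run yet.
        if not path.is_file():
            raise FileNotFoundError(f"Missing artifact: {path}")
        with open(path, "rb") as fh:
            loaded.append(pickle.load(fh))
    return tuple(loaded)
```

Pickle files should only be loaded from a trusted location, since unpickling can execute arbitrary code.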


In [None]:
# Load trained pipeline
pipeline = load_pipeline()

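# Take a real row with no missing features as a candidate example
# (clean_df only guarantees a non-missing target; int(NaN) would raise)
sample_row = clean_df.dropna(subset=FEATURE_COLUMNS).iloc[0]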
candidate_example = {
    "Age": int(sample_row["Age"]),
    "Gender": str(sample_row["Gender"]),
    "Education Level": str(sample_row["Education Level"]),
    "Job Title": str(sample_row["Job Title"]),
    "Years of Experience": float(sample_row["Years of Experience"]),
}

print("Candidate example:", candidate_example)

predicted_salary = float(predict_one(pipeline, candidate_example))
print("\nPredicted salary for this example:", round(predicted_salary, 2))


## 8. Notebook summary

In this notebook we:

1. Loaded the raw HR datasets (`people`, `salary`, `descriptions`)
2. Merged them using the dedicated **Data Integrity Layer**
3. Computed **KPIs** and the **MHI**, gating training based on data quality
4. Prepared a clean training set with well-defined **features** and **target**
5. Trained the model using the production `train_model()` flow
6. Saved and reloaded artifacts for downstream use
7. Ran a small inference sanity check to validate the end-to-end pipeline

This notebook serves as:

- A **transparent, reproducible training log**
- A **technical companion** to the V2 architecture
- A **debug tool** to inspect data integrity and model behavior
