## 3. ML Framing & Metrics Justification

This section defines how the business problem is translated into a machine learning task, how success is measured, and which trade-offs guide model selection.  
It acts as the *conceptual bridge* between the Product Requirements (Deliverable 1), the Data Understanding & EDA (Deliverable 2), and the downstream modeling and deployment decisions (Deliverables 4–8).

---

### 3.1 Problem Type: Regression

The core task is to **predict the test bench time of a car given its configuration**.  
The target variable is a continuous, real-valued quantity measured in time units (e.g., minutes).

From an ML framing perspective, this naturally maps to a **supervised regression problem**, not classification or ranking.

Key characteristics justifying regression framing:

- **Continuity of the outcome**  
  Test duration is measured on a continuous scale and shifts by varying amounts with configuration changes (e.g., engine type, transmission, weight), rather than falling into discrete categories.

- **Operational relevance of magnitude**  
  The *absolute size* of the error matters: a 5-minute error has a very different operational impact than a 30-minute error.

- **Downstream decision usage**  
  Predictions are consumed by production planners to allocate test bench slots and sequence vehicles. This requires numeric estimates, not labels.

Connection to other deliverables:
- **Deliverable 1 (PRD):** Defines the prediction target explicitly as test time per vehicle.
- **Deliverable 2 (EDA):** Confirms sufficient variance and signal in features to support regression feasibility.
- **Deliverable 4 (Model Experiments):** Explores multiple regression families (linear, tree-based, ensemble).
- **Deliverable 8 (Post-Launch Plan):** Assumes continuous monitoring of prediction error over time.

Importantly, while more complex formulations (e.g., probabilistic regression or interval prediction) are conceivable, the initial framing prioritizes **point estimation** to keep the system interpretable and operationally simple in early deployment stages.

---

### 3.2 Evaluation Metrics: RMSE and MAE

Metric selection follows an *outcome-oriented* logic: metrics are chosen not for mathematical elegance alone, but for their alignment with operational risk and planning costs.

#### Root Mean Squared Error (RMSE)

RMSE is selected as the **primary evaluation metric**.

Justification:
- Penalizes larger errors more strongly due to squaring.
- Aligns with the business risk of **severe mis-scheduling**, where large underestimates or overestimates can cascade into idle benches or bottlenecks.
- Provides a single, continuous measure suitable for model comparison during experimentation.

Operational interpretation:
> All else equal, a lower RMSE indicates fewer or smaller extreme errors, and therefore fewer severe scheduling failures.

#### Mean Absolute Error (MAE)

MAE is used as a **secondary, robustness-oriented metric**.

Justification:
- Less sensitive to outliers than RMSE.
- Provides a stable estimate of *typical* prediction error.
- Read alongside RMSE, useful for diagnosing whether performance degradation is driven by rare extreme cases or by a broad increase in error across vehicles.

Why both metrics are needed:
- RMSE captures *risk sensitivity*.
- MAE captures *typical planner experience*.
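
To make the distinction concrete, the following minimal sketch computes both metrics on two sets of predictions. The numbers are made up for illustration and are not taken from the project data; the helper names `rmse` and `mae` are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: residuals are squared, so large misses dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only (assumption: not real test bench times).
y_true = [100.0, 100.0, 100.0, 100.0]
even_errors = [105.0, 95.0, 105.0, 95.0]      # four misses of 5 units each
one_big_miss = [100.0, 100.0, 100.0, 120.0]   # a single miss of 20 units

for name, y_pred in [("even errors", even_errors), ("one big miss", one_big_miss)]:
    print(f"{name}: MAE={mae(y_true, y_pred):.2f}  RMSE={rmse(y_true, y_pred):.2f}")
```

Both prediction sets have the same MAE, but the set with a single large miss has twice the RMSE. This gap between the two metrics is the signal used to separate outlier-driven error from typical error.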

Connection to other deliverables:
- **Deliverable 1 (Success Metrics):** RMSE explicitly defined as success criterion.
- **Deliverable 5 (KPI–OKR Mapping):** RMSE and MAE serve as leading indicators for downstream KPIs such as idle bench reduction.
- **Deliverable 6 (Risk & Failure Analysis):** Divergence between RMSE and MAE can signal outlier-driven failure modes.

Metrics not chosen (and why):
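- R² alone is insufficient: it is unitless, so it does not tell planners how far off a prediction is likely to be in time units.
- MAPE is unstable when actual values are small (the denominator approaches zero) and is less intuitive here, where planners reason in absolute time.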
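
The MAPE concern can be illustrated with a small sketch, assuming made-up values rather than project data. The same absolute miss produces very different percentages depending on the size of the actual value, while MAE stays unchanged. The helper names are hypothetical.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; undefined if any actual value is 0."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error in the target's own units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Every prediction below misses by exactly 2 units (illustrative numbers only).
large_targets = ([100.0, 120.0], [102.0, 122.0])
small_targets = ([1.0, 4.0], [3.0, 6.0])

for name, (y_true, y_pred) in [("large targets", large_targets),
                               ("small targets", small_targets)]:
    print(f"{name}: MAE={mae(y_true, y_pred):.2f}  MAPE={mape(y_true, y_pred):.1f}%")
```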

---

### 3.3 Baseline Definition

A baseline is essential to demonstrate that the ML system adds value beyond existing heuristics.

Two baselines are defined:

#### Baseline A: Global Mean Test Time
- Predicts the historical average test duration for all cars.
- Represents the *implicit status quo* used in manual planning.

#### Baseline B: Simple Linear Regression
- Uses a limited subset of key numerical and encoded categorical features.
- Represents a transparent, low-complexity analytical model.

Purpose of baselines:
- Establish a **minimum performance threshold**.
- Prevent overestimating the value of complex models.
- Provide a reference for model risk discussions.

Connection to other deliverables:
- **Deliverable 2 (EDA):** Informs which features are reasonable even for a simple baseline.
- **Deliverable 4 (Experiment Log):** Shows incremental value over baselines.
- **Deliverable 3 (this section):** Frames baseline choice as a product decision, not a technical afterthought.

A model that fails to outperform these baselines consistently is considered **non-viable for production**, regardless of theoretical appeal.
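
A minimal sketch of the two baselines, under simplifying assumptions: Baseline B is reduced to a single numeric feature fitted by ordinary least squares, and the toy data is invented for illustration. The function names `fit_global_mean` and `fit_simple_linear` are hypothetical and not part of the project code.

```python
import math
from statistics import mean

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_global_mean(y_train):
    """Baseline A: always predict the training mean, ignoring the configuration."""
    mu = mean(y_train)
    return lambda x: mu

def fit_simple_linear(x_train, y_train):
    """Baseline B (simplified to one feature): closed-form least-squares line."""
    x_bar, y_bar = mean(x_train), mean(y_train)
    sxx = sum((x - x_bar) ** 2 for x in x_train)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_train, y_train))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

# Toy data, invented for illustration (assumption: one numeric feature).
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [92.0, 98.0, 102.0, 108.0]
x_val, y_val = [2.5, 5.0], [101.0, 112.0]

baseline_a = fit_global_mean(y_train)
baseline_b = fit_simple_linear(x_train, y_train)

rmse_a = rmse(y_val, [baseline_a(x) for x in x_val])
rmse_b = rmse(y_val, [baseline_b(x) for x in x_val])
print(f"Baseline A RMSE: {rmse_a:.2f}")
print(f"Baseline B RMSE: {rmse_b:.2f}")
```

Any candidate model would be scored on the same validation rows and compared against both numbers before it is considered further.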

---

### 3.4 Trade-offs and Decision Rationale

ML model selection is framed as a sequence of **explicit trade-offs**, not a search for maximum accuracy in isolation.

#### Accuracy vs. Model Complexity
- More complex models (e.g., Gradient Boosting) capture nonlinear interactions.
- However, complexity increases:
  - Maintenance cost
  - Risk of overfitting
  - Difficulty of explanation to stakeholders

Decision principle:
> Prefer the *simplest model* that meets accuracy and robustness requirements.
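
One way to turn this principle into a selection rule is sketched below. It assumes each candidate carries a hand-assigned complexity rank and that acceptance thresholds for RMSE and MAE have been agreed beforehand; the `Candidate` class, the thresholds, and all numbers are placeholders, not experimental results.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    complexity: int  # hypothetical rank, 1 = simplest
    rmse: float
    mae: float

def pick_simplest_acceptable(candidates, max_rmse, max_mae):
    """Return the least complex candidate that meets both thresholds, or None."""
    acceptable = [c for c in candidates if c.rmse <= max_rmse and c.mae <= max_mae]
    # Ties in complexity are broken by the lower RMSE.
    return min(acceptable, key=lambda c: (c.complexity, c.rmse), default=None)

# Placeholder values for illustration only.
candidates = [
    Candidate("global_mean", complexity=1, rmse=12.0, mae=9.0),
    Candidate("linear", complexity=2, rmse=8.5, mae=6.0),
    Candidate("gradient_boosting", complexity=3, rmse=8.0, mae=5.8),
]

chosen = pick_simplest_acceptable(candidates, max_rmse=9.0, max_mae=6.5)
print(chosen.name if chosen else "no viable model")
```

With these placeholder numbers the rule selects the linear model: the boosted model is slightly more accurate, but the simpler one already meets both thresholds.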

#### Interpretability vs. Performance
- Linear models offer high transparency but limited expressiveness.
- Tree-based ensembles offer improved performance with partial interpretability (feature importance, SHAP).

Given the Mercedes-Benz context:
- Full black-box models are less desirable.
- Partial explainability is sufficient if decision logic remains defensible.

#### Latency vs. Throughput
- Real-time dashboards require low-latency predictions.
- Batch planning tolerates higher latency.

Framing decision:
- Models must support **near-real-time inference**.
- Training cost is secondary to inference stability.

Connection to other deliverables:
- **Deliverable 4 (Model Log):** Documents how trade-offs influenced final model choice.
- **Deliverable 6 (Risk Analysis):** Trade-offs are linked to failure anticipation.
- **Deliverable 8 (Lifecycle Plan):** Complexity affects retraining and rollback strategies.

---

### 3.5 Role within the ML Workflow

ML Framing & Metrics Justification serves as the **decision anchor** of the entire workflow:

- Translates business goals into ML objectives.
- Constrains model experimentation space.
- Ensures consistency between data analysis, modeling, evaluation, and deployment.
- Enables transparent communication with non-ML stakeholders.

Without this framing, later improvements in model performance risk becoming **locally optimal but globally misaligned** with product and operational goals.

---

### Summary

The regression framing, metric choices, baselines, and trade-offs together ensure that:
- Model performance is meaningful in operational terms.
- Decisions are explainable and defensible.
- The system aligns with Mercedes-Benz requirements for robustness, predictability, and lifecycle sustainability.

This section thus operationalizes AI Product Thinking at the point where technical modeling decisions begin.


In [1]:
"""
Deliverable 3 — ML Framing & Experimentation Constraints
========================================================

THEORY
------
This script defines the *normative decision framework*
for model experimentation.

It translates findings from Deliverable 2 into:
- data handling policies
- modeling constraints
- evaluation metrics
- governance rules

NO model is trained here.
NO performance is measured.

This file defines the "rules of the game" for Deliverable 4.

INPUTS
------
- train.csv (raw, unchanged dataset)

OUTPUTS
-------
- Printed ML framing decisions
- Explicit constraints for:
  - data handling
  - model eligibility
  - evaluation metrics
  - validation logic

TERMINAL USAGE
--------------
python ml_framing.py

EPISTEMIC SCOPE
---------------
Allowed:
- Normative decisions
- Policy definitions
- Metric justification
- Risk-based constraints

Forbidden:
- Model fitting
- Cross-validation
- Any numerical performance evaluation
"""


def main():
    print("\n=== Deliverable 3: ML Framing & Metrics Justification ===\n")

    # ------------------------------------------------------------------
    # 1. Problem Framing
    # ------------------------------------------------------------------
    print("1. Problem Type")
    print("- Supervised regression")
    print("- Continuous target: test bench duration (y)")
    print("- Static, tabular dataset")
    print("- No temporal forecasting\n")

    # ------------------------------------------------------------------
    # 2. Data Handling Policies (Derived from Deliverable 2)
    # ------------------------------------------------------------------
    print("2. Data Handling Policies")

    print("\n2.1 Missingness Policy")
    print("- No NaN values observed")
    print("- Zeros represent structural absence of options")
    print("- No imputation will be applied")
    print("- Models must handle sparsity explicitly")

    print("\n2.2 Outlier Policy")
    print("- Extreme test durations are real operational cases")
    print("- No row-level removal or clipping")
    print("- Robustness handled via loss functions or regularization")

    print("\n2.3 Rare Configuration Policy")
    print("- Rare configurations are retained")
    print("- No grouping or collapsing at data level")
    print("- Generalization risk acknowledged and monitored")

    # ------------------------------------------------------------------
    # 3. Leakage Prevention Strategy
    # ------------------------------------------------------------------
    print("\n3. Leakage Prevention Strategy")

    print("- ID column must never be used as a feature")
    print("- Repeated configurations exist (up to 9 occurrences)")
    print("- Train/validation splits must be configuration-aware")
    print("- Identical configurations must not leak across splits")

    print("Approved validation strategies:")
    print("- Group-based split using full configuration signature")
    print("- OR time-aware split if temporal order becomes available")

    # ------------------------------------------------------------------
    # 4. Feature Policy
    # ------------------------------------------------------------------
    print("\n4. Feature Policy")

    print("- Nominal features (X0–X8):")
    print("  - Treated as categorical")
    print("  - Encoding strategy must preserve interpretability")

    print("- Binary features (X10+):")
    print("  - Sparse, high-dimensional")
    print("  - No feature removal without justification")
    print("  - Regularization preferred over hard selection")

    # ------------------------------------------------------------------
    # 5. Model Eligibility Constraints
    # ------------------------------------------------------------------
    print("\n5. Model Eligibility Constraints")

    print("Eligible model families must:")
    print("- Support regression")
    print("- Handle high-dimensional sparse inputs")
    print("- Be stable under rare configurations")

    print("\nInterpretability constraint:")
    print("- Model behavior must be explainable at feature-group level")
    print("- Pure black-box models require additional justification")

    # ------------------------------------------------------------------
    # 6. Evaluation Metrics (Normative, not measured here)
    # ------------------------------------------------------------------
    print("\n6. Evaluation Metrics")

    print("Primary metric:")
    print("- RMSE (penalizes large scheduling errors)")

    print("\nSecondary metrics:")
    print("- MAE (robust to outliers)")
    print("- Error distribution diagnostics")

    print("\nMetric usage constraints:")
    print("- Metrics must be reported with confidence intervals")
    print("- Single-point estimates are insufficient")

    # ------------------------------------------------------------------
    # 7. Baseline Definition
    # ------------------------------------------------------------------
    print("\n7. Baseline Definition")

    print("Mandatory baselines for Deliverable 4:")
    print("- Global mean predictor")
    print("- Simple linear regression")

    print("Purpose of baselines:")
    print("- Sanity check")
    print("- Value justification of complex models")

    # ------------------------------------------------------------------
    # 8. Decision Criteria for Model Selection
    # ------------------------------------------------------------------
    print("\n8. Model Selection Criteria")

    print("A model is acceptable only if:")
    print("- It outperforms baselines meaningfully")
    print("- It respects leakage and grouping constraints")
    print("- Its error behavior is stable across configurations")

    print("\nTrade-off awareness:")
    print("- Slightly worse RMSE may be accepted for:")
    print("  - higher robustness")
    print("  - better interpretability")
    print("  - lower operational risk")

    print("\n=== Deliverable 3 Framing Complete ===")

if __name__ == "__main__":
    main()



=== Deliverable 3: ML Framing & Metrics Justification ===

1. Problem Type
- Supervised regression
- Continuous target: test bench duration (y)
- Static, tabular dataset
- No temporal forecasting

2. Data Handling Policies

2.1 Missingness Policy
- No NaN values observed
- Zeros represent structural absence of options
- No imputation will be applied
- Models must handle sparsity explicitly

2.2 Outlier Policy
- Extreme test durations are real operational cases
- No row-level removal or clipping
- Robustness handled via loss functions or regularization

2.3 Rare Configuration Policy
- Rare configurations are retained
- No grouping or collapsing at data level
- Generalization risk acknowledged and monitored

3. Leakage Prevention Strategy
- ID column must never be used as a feature
- Repeated configurations exist (up to 9 occurrences)
- Train/validation splits must be configuration-aware
- Identical configurations must not leak across splits
Approved validation strategies:
- Group-based

In [2]:
"""
Deliverable 3 → 4 Bridge
========================
Data Cleaning & Framing Operationalization

This script applies ONLY the normative decisions
defined in Deliverable 3.

It does NOT:
- train models
- evaluate performance
- change target semantics

It DOES:
- enforce governance constraints
- remove non-informative features
- prepare a clean, model-ready dataset

INPUT
-----
../data/train.csv  (raw, unchanged)

OUTPUT
------
../data/train_clean.csv
../data/cleaning_report.txt

TERMINAL USAGE
--------------
python clean_data.py

EPISTEMIC STATUS
----------------
Normative → Operational
(No empirical validation yet)
"""

import pandas as pd

RAW_PATH = "../data/train.csv"
OUT_PATH = "../data/train_clean.csv"
REPORT_PATH = "../data/cleaning_report.txt"

def main():
    report = []

    print("Loading raw dataset...")
    df = pd.read_csv(RAW_PATH)
    original_shape = df.shape

    report.append(f"Original dataset shape: {original_shape}")

    # --------------------------------------------------
    # 1. Remove governance-only identifiers
    # --------------------------------------------------
    if "ID" in df.columns:
        df = df.drop(columns=["ID"])
        report.append("Removed column: ID (governance / leakage risk)")
    else:
        report.append("No ID column found")

    # --------------------------------------------------
    # 2. Identify feature groups
    # --------------------------------------------------
    feature_cols = [c for c in df.columns if c.startswith("X")]
    categorical = [c for c in feature_cols if df[c].dtype == object]
    binary = [c for c in feature_cols if df[c].dropna().isin([0, 1]).all()]

    report.append(f"Categorical features retained: {len(categorical)}")
    report.append(f"Binary features before cleaning: {len(binary)}")

    # --------------------------------------------------
    # 3. Remove inactive one-hot features (always 0)
    # --------------------------------------------------
    inactive_binary = [c for c in binary if df[c].mean() == 0]

    if inactive_binary:
        df = df.drop(columns=inactive_binary)
        report.append(
            f"Removed {len(inactive_binary)} inactive binary features (always 0)"
        )
    else:
        report.append("No inactive binary features found")

    active_binary = len(binary) - len(inactive_binary)
    report.append(f"Active binary features retained: {active_binary}")

    # --------------------------------------------------
    # 4. Target integrity check
    # --------------------------------------------------
    if "y" not in df.columns:
        raise ValueError("Target column 'y' missing after cleaning!")

    if df["y"].isna().any():
        raise ValueError("Unexpected NaNs in target variable!")

    report.append("Target variable y preserved without modification")

    # --------------------------------------------------
    # 5. Final consistency checks
    # --------------------------------------------------
    final_shape = df.shape
    report.append(f"Final dataset shape: {final_shape}")

    print("Saving cleaned dataset...")
    df.to_csv(OUT_PATH, index=False)

    print("Writing cleaning report...")
    with open(REPORT_PATH, "w") as f:
        for line in report:
            f.write(line + "\n")

    print("\n=== Cleaning Complete ===")
    for line in report:
        print(line)

if __name__ == "__main__":
    main()


Loading raw dataset...
Saving cleaned dataset...
Writing cleaning report...

=== Cleaning Complete ===
Original dataset shape: (4209, 378)
Removed column: ID (governance / leakage risk)
Categorical features retained: 8
Binary features before cleaning: 368
Removed 12 inactive binary features (always 0)
Active binary features retained: 356
Target variable y preserved without modification
Final dataset shape: (4209, 365)
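
The leakage policy from Deliverable 3 (identical configurations must not appear on both sides of a split) can be sketched with the standard library alone. This is an illustrative version, not the project's validation code: `group_split` is a hypothetical helper, the rows are made up, and the column names only mimic those in `train_clean.csv`.

```python
import random
from collections import defaultdict

def group_split(rows, feature_cols, val_fraction=0.2, seed=0):
    """Split row indices so that identical configurations stay on one side."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # The full configuration signature is the tuple of all feature values.
        signature = tuple(row[c] for c in feature_cols)
        groups[signature].append(i)

    signatures = list(groups)
    random.Random(seed).shuffle(signatures)

    target = val_fraction * len(rows)
    train_idx, val_idx = [], []
    for sig in signatures:
        # Whole groups go to validation until the target size is reached.
        (val_idx if len(val_idx) < target else train_idx).extend(groups[sig])
    return sorted(train_idx), sorted(val_idx)

# Toy rows with made-up values.
rows = [
    {"X0": "a", "X10": 0, "y": 90.0},
    {"X0": "a", "X10": 0, "y": 92.0},   # same configuration as row 0
    {"X0": "b", "X10": 1, "y": 110.0},
    {"X0": "c", "X10": 0, "y": 101.0},
    {"X0": "c", "X10": 0, "y": 99.0},   # same configuration as row 3
]
feature_cols = ["X0", "X10"]
train_idx, val_idx = group_split(rows, feature_cols, val_fraction=0.4)
print("train rows:", train_idx)
print("validation rows:", val_idx)
```

Because whole groups are assigned at once, the validation share is approximate rather than exact. If scikit-learn is available, `GroupShuffleSplit` or `GroupKFold` from `sklearn.model_selection` serve the same purpose when given the signature as the group label.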
