Model Interpretability & Feature Impact - Phase 4
(For ACME Legacy Reimbursement System Project)

Author: Ayushi Bohra

1. **Why Interpretability Matters in This Project**

Since ACME’s legacy reimbursement system is undocumented and behaves like a “black box,” interpretability helps us:

* Understand which inputs influence reimbursement.
* Compare ML behavior with employee interviews & PRD business rules.
* Detect unexpected patterns or unintended logic.
* Build trust with ACME business users.
* Validate that our ML model is replicating the old system correctly.

---

2. **Key Feature Importance Findings**

In the stacked ensemble (Decision Tree + Random Forest + Gradient Boosting), the following features consistently ranked highest:

Top 3 Influential Features (Across All Models)

1️⃣ total_receipts_amount – Most Important

* Strongest and most consistent predictor
* Matches intuition: higher receipts → higher reimbursement
* Aligns with interview hints that receipt totals drive the majority of payouts
* Linear and tree-based models both rely heavily on this feature
* Consistent with the Phase 1 correlation analysis (r ≈ 0.85 with reimbursement)
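
As a reminder of what the Phase 1 figure measures, the following is a minimal sketch of a Pearson correlation between receipt totals and reimbursement. The helper name `pearson_r` and the sample values are hypothetical and are not project data; the Phase 1 analysis may have used a library routine instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and of squared deviations around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only (assumption), not taken from the ACME dataset.
receipts = [120.0, 450.0, 800.0, 1500.0, 2100.0]
reimbursement = [310.0, 560.0, 905.0, 1390.0, 1720.0]

r = pearson_r(receipts, reimbursement)
print(f"Pearson r = {r:.3f}")
```

A high correlation shows a strong linear association, which is compatible with, but does not by itself prove, high feature importance in the tree models.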

---

2️⃣ miles_traveled

* Strong secondary influence
* Tree models detected nonlinear thresholds:
  * Mileage bands (e.g., <200, 200–600, >600 miles) affected payout differently.
* Suggests the legacy logic includes a banded mileage reimbursement component
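
To illustrate how a tree model can surface a mileage threshold, here is a minimal sketch of the split search a one-level regression tree performs. The function names and the per-mile rates are hypothetical; the data is synthetic and built so that the rate changes between 180 and 250 miles.

```python
def _sse(values):
    """Sum of squared errors around the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on xs that minimizes the summed squared error
    of predicting the mean of ys on each side (a one-level regression tree)."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        sse = _sse(left) + _sse(right)
        if sse < best_sse:
            best_sse = sse
            # Place the threshold midway between the two neighbouring x values.
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold, best_sse


# Synthetic example (assumption): payout per mile drops for longer trips.
miles = [50, 100, 150, 180, 250, 300, 400, 500]
rate_per_mile = [0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25]

threshold, sse = best_split(miles, rate_per_mile)
print(f"best split at {threshold} miles (SSE = {sse:.4f})")
```

A full decision tree repeats this search recursively on each side, which is how several bands can emerge from a single feature.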

---

3️⃣ trip_duration_days

* Moderate but stable influence
* Longer trips → slightly higher reimbursement
* Suggests the legacy system includes a per-diem-like rule
* Nonlinear effects observed in tree models: short trips and long trips behave differently
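
One model-agnostic way to produce a ranking like the one above is permutation importance: shuffle one feature at a time and measure how much the prediction error grows. The sketch below is an illustration under assumptions, not the project's actual procedure; `toy_model` and its coefficients are hypothetical stand-ins for a fitted model, and the rows are randomly generated.

```python
import random


def mean_abs_error(model, rows, targets):
    """Mean absolute error of a callable model over a list of feature dicts."""
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, feature_names, n_repeats=10, seed=0):
    """Average increase in MAE when each feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_abs_error(model, rows, targets)
    importances = {}
    for name in feature_names:
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Copy each row with only the current feature replaced.
            permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
            increases.append(mean_abs_error(model, permuted, targets) - baseline)
        importances[name] = sum(increases) / n_repeats
    return importances


def toy_model(row):
    """Hypothetical stand-in for a fitted model; coefficients are made up."""
    return (0.8 * row["total_receipts_amount"]
            + 0.5 * row["miles_traveled"]
            + 20.0 * row["trip_duration_days"])


# Synthetic rows (assumption), not ACME data.
data_rng = random.Random(42)
rows = [
    {
        "total_receipts_amount": data_rng.uniform(0, 2000),
        "miles_traveled": data_rng.uniform(0, 800),
        "trip_duration_days": data_rng.randint(1, 10),
    }
    for _ in range(200)
]
targets = [toy_model(r) for r in rows]

features = ["total_receipts_amount", "miles_traveled", "trip_duration_days"]
importances = permutation_importance(toy_model, rows, targets, features)
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Permutation importance is measured on predictions rather than on model internals, so the same procedure applies to a single tree, a forest, or a stacked ensemble.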

---

3. **Impact of Derived Features**

We engineered additional features in Phase 2:

* cost_per_day
* cost_per_mile
* miles_per_day
* cost_ratio

Findings:

* These features improved model performance but were not more important than the three core input fields.
* Boosting models used them to capture subtle nonlinear patterns.

This suggests that ACME’s legacy logic rests on a few simple inputs but contains hidden nonlinearities.
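
For reference, a minimal sketch of how the ratio features could be computed from the three core fields. The formulas are inferred from the feature names and are assumptions; `cost_ratio` is left out because its definition is not stated here. The helper names `safe_div` and `add_derived_features` are hypothetical.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (e.g. a zero-mile trip)."""
    return numerator / denominator if denominator else 0.0


def add_derived_features(trip):
    """Return a copy of the trip dict with ratio features added.

    Assumed definitions (inferred from the feature names):
      cost_per_day  = total_receipts_amount / trip_duration_days
      cost_per_mile = total_receipts_amount / miles_traveled
      miles_per_day = miles_traveled / trip_duration_days
    """
    days = trip["trip_duration_days"]
    miles = trip["miles_traveled"]
    receipts = trip["total_receipts_amount"]
    return {
        **trip,
        "cost_per_day": safe_div(receipts, days),
        "cost_per_mile": safe_div(receipts, miles),
        "miles_per_day": safe_div(miles, days),
    }


# Illustrative trip (assumption), not a record from the dataset.
example = add_derived_features(
    {"trip_duration_days": 4, "miles_traveled": 200, "total_receipts_amount": 500.0}
)
print(example)
```

Because every ratio is a function of the original fields, the features add no new information; they only make certain nonlinear relationships easier for a model to pick up.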

---

4. **How the Legacy Logic Seems to Work (Inferred)**

Based on model behavior and feature contributions, the reimbursement logic appears to be a weighted combination of:

1. Direct receipts (primary driver)
2. Mileage-based reimbursement
3. A smaller adjustment based on duration

This matches multiple interview hints about how employees “felt” the system worked.
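
One way to write down this inferred structure, purely as notation and not as a recovered formula:

```latex
\hat{R} \;\approx\; w_r \cdot \text{receipts} \;+\; f(\text{miles}) \;+\; g(\text{days})
```

Here $\hat{R}$ is the predicted reimbursement, $w_r$ is an unknown weight on receipts, and $f$ and $g$ are unknown, possibly piecewise functions standing in for the mileage component and the duration adjustment. All three symbols are assumptions introduced for illustration; the analysis only supports their relative ordering (receipts first, mileage second, duration third), not their values.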

---

5. **Model Behaviors That Confirm This**

✔ Linear Model

* Relies on the strong linear relationships of receipts and miles with reimbursement
* Captures baseline business logic
* Good interpretability but slightly lower accuracy

✔ Tree-Based Models

* Detected nonlinear thresholds similar to business rules:
  * Mileage brackets
  * High-receipt bonus zones
  * Duration tiers
* Better match to legacy behavior

✔ Stacking Ensemble (Final Model)

* Balances linear insight + nonlinear precision
* Produces predictions closest to the undocumented legacy rules
* Provides the best accuracy, especially for high-value reimbursements
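
A minimal sketch of such a stacking setup using scikit-learn's `StackingRegressor`. The base learners follow the three tree-based models named earlier; the linear final estimator, all hyperparameters, and the synthetic data are assumptions made for illustration and may differ from the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): three core features and a made-up target
# with a mileage rate that changes at 200 miles.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.uniform(0, 2000, n),  # total_receipts_amount
    rng.uniform(0, 800, n),   # miles_traveled
    rng.integers(1, 11, n),   # trip_duration_days
])
mile_rate = np.where(X[:, 1] < 200, 0.5, 0.25)
y = 0.8 * X[:, 0] + mile_rate * X[:, 1] + 20.0 * X[:, 2]

stack = StackingRegressor(
    estimators=[
        ("tree", DecisionTreeRegressor(max_depth=6, random_state=0)),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("boost", GradientBoostingRegressor(random_state=0)),
    ],
    # Assumed meta-learner: a linear model combines the base predictions.
    final_estimator=LinearRegression(),
    cv=5,  # out-of-fold base predictions are used to fit the meta-learner
)

# Hold out the last 60 rows for a quick sanity check.
stack.fit(X[:240], y[:240])
r2 = stack.score(X[240:], y[240:])
print(f"held-out R^2 = {r2:.3f}")
```

The meta-learner is fit on out-of-fold predictions, which keeps it from simply trusting whichever base model overfits the training data most.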

---

6. **Key Takeaways for ACME Stakeholders**

* The legacy system heavily rewards receipt totals, with mileage as the second-largest influence.
* Duration plays a smaller role but still affects payout in a structured way.
* The new ML model mirrors legacy behavior while revealing hidden rules.
* The ensemble approach provides both:
  * Accuracy (matches legacy results)
  * Interpretability (explains why the model predicts what it predicts)

---

**Summary Explanation**

Our interpretability analysis shows that ACME’s legacy system relies primarily on receipt amounts, with mileage and trip duration contributing smaller but consistent adjustments.
Tree-based models uncovered nonlinear patterns that resemble mileage brackets and per-diem tiers described by employees.
The final ensemble model captures both linear and nonlinear logic, providing accurate predictions while revealing the hidden business rules behind the original system.



