# Solution: The Backwards Coefficients

This is the answer key for Debug Drill 03.

---

## The Bug

The colleague created a feature `spend_per_order = total_spend / (orders_total + 1)`.

This feature is **derived from the target variable** (`total_spend`), creating a mathematical relationship that causes **multicollinearity** and **target leakage**.

### Why the coefficients flip:

When you include `spend_per_order` as a feature:
- `spend_per_order` is highly correlated with `total_spend` (by definition!)
- The model gives most of the "credit" to `spend_per_order`
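- `orders_total` is then left to correct for whatever `spend_per_order` over-predicts, and in this dataset that correction comes out negative
- The underlying identity is multiplicative, `total_spend = spend_per_order × (orders_total + 1)`, but the regression can only fit an additive approximation of it. With features this entangled, the sign on `orders_total` reflects that approximation, not the real effect of placing more orders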
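
The leakage can be seen without fitting any model. The sketch below uses a handful of invented values (an assumption for illustration, not rows from `streamcart_customers.csv`) and shows that the target can be rebuilt exactly from `spend_per_order` and `orders_total`.

```python
import numpy as np

# Toy values, invented for illustration (not from the StreamCart data)
orders_total = np.array([0, 1, 4, 9, 19], dtype=float)
total_spend = np.array([0.0, 30.0, 125.0, 410.0, 900.0])

# The colleague's feature, defined exactly as in the drill
spend_per_order = total_spend / (orders_total + 1)

# Multiplying back recovers the target from the features alone
reconstructed = spend_per_order * (orders_total + 1)

print(np.allclose(reconstructed, total_spend))  # True
```

Any feature set from which the target can be reconstructed this way is answering the question for the model, which is why the fit looks so good.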

---

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt

np.random.seed(42)

DATA_URL = 'https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/data/'

df = pd.read_csv(DATA_URL + 'streamcart_customers.csv')
print(f"Loaded {len(df):,} customers")

## Demonstrating the Bug

In [None]:
# The buggy feature engineering
df['spend_per_order'] = df['total_spend'] / (df['orders_total'] + 1)

# Buggy features (includes target-derived feature)
features_buggy = [
    'tenure_months',
    'logins_last_30d', 
    'orders_total',
    'avg_order_value',
    'spend_per_order'  # BUG: This is derived from total_spend!
]

X_buggy = df[features_buggy].fillna(0)
y = df['total_spend']

X_train, X_test, y_train, y_test = train_test_split(X_buggy, y, test_size=0.2, random_state=42)

model_buggy = LinearRegression()
model_buggy.fit(X_train, y_train)

print("BUGGY MODEL:")
print(f"  R²: {r2_score(y_test, model_buggy.predict(X_test)):.3f}")
print("\nCoefficients:")
for feature, coef in zip(features_buggy, model_buggy.coef_):
    sign = "✓" if (feature != 'orders_total' or coef > 0) else "✗ WRONG SIGN"
    print(f"  {feature:20s}: {coef:+.4f}  {sign}")

## The Fix

Remove `spend_per_order` — it's derived from the target and shouldn't be a feature.

In [None]:
# Fixed features - removed the target-derived feature
features_fixed = [
    'tenure_months',
    'logins_last_30d', 
    'orders_total',
    'avg_order_value'
    # spend_per_order REMOVED - it's derived from total_spend
]

X_fixed = df[features_fixed].fillna(0)

X_train_fixed, X_test_fixed, y_train_fixed, y_test_fixed = train_test_split(
    X_fixed, y, test_size=0.2, random_state=42
)

model_fixed = LinearRegression()
model_fixed.fit(X_train_fixed, y_train_fixed)

y_pred_fixed = model_fixed.predict(X_test_fixed)

print("FIXED MODEL:")
print(f"  MAE: ${mean_absolute_error(y_test_fixed, y_pred_fixed):.2f}")
print(f"  R²: {r2_score(y_test_fixed, y_pred_fixed):.3f}")
print("\nCoefficients:")
for feature, coef in zip(features_fixed, model_fixed.coef_):
    sign = "✓" if coef > 0 else "(check domain knowledge)"
    print(f"  {feature:20s}: {coef:+.4f}  {sign}")

In [None]:
# Self-check
orders_idx = features_fixed.index('orders_total')
assert model_fixed.coef_[orders_idx] > 0, "orders_total coefficient should be positive!"

tenure_idx = features_fixed.index('tenure_months')
assert model_fixed.coef_[tenure_idx] > 0, "tenure_months coefficient should be positive!"

print("✓ All self-checks passed!")

In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Buggy model
colors_buggy = ['green' if c > 0 else 'red' for c in model_buggy.coef_]
axes[0].barh(features_buggy, model_buggy.coef_, color=colors_buggy, alpha=0.7)
axes[0].axvline(x=0, color='black', linewidth=0.5)
axes[0].set_xlabel('Coefficient')
axes[0].set_title('BUGGY: Wrong Signs (red = negative)')

# Fixed model
colors_fixed = ['green' if c > 0 else 'red' for c in model_fixed.coef_]
axes[1].barh(features_fixed, model_fixed.coef_, color=colors_fixed, alpha=0.7)
axes[1].axvline(x=0, color='black', linewidth=0.5)
axes[1].set_xlabel('Coefficient')
axes[1].set_title('FIXED: Sensible Signs (green = positive)')

plt.tight_layout()
plt.show()

## Postmortem

### What happened:
- The model showed a negative coefficient for `orders_total`, implying "more orders = less spend" — which contradicts common sense.

### Root cause:
- The feature `spend_per_order` was derived from the target variable (`total_spend / (orders_total + 1)`).
- This created **multicollinearity** — the features are mathematically related to each other and the target.
- When multicollinearity is severe, coefficients become unstable and can flip signs.
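
The instability can be illustrated on synthetic data. This is a minimal sketch under stated assumptions: `x2` is constructed as a near copy of `x1`, the true effect is 3.0 on `x1` and 0 on `x2`, and the model is refit on bootstrap resamples. None of these names or numbers come from the drill's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumption): x2 is almost identical to x1
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
y = 3.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])  # intercept + two features

def fit_ols(X, y):
    """Ordinary least squares via numpy; returns [intercept, b1, b2]."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Refit on bootstrap resamples and record the coefficients each time
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    coefs.append(fit_ols(X[idx], y[idx]))
coefs = np.array(coefs)

b1, b2 = coefs[:, 1], coefs[:, 2]
print(f"std of b1:      {b1.std():.2f}")
print(f"std of b2:      {b2.std():.2f}")
print(f"std of b1 + b2: {(b1 + b2).std():.2f}")
print(f"share of resamples where b1 < 0: {(b1 < 0).mean():.0%}")
```

The individual coefficients swing widely from one resample to the next, while their sum stays comparatively stable: the data pins down the combined effect but not how to split it between two nearly identical columns.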

### How to prevent:
- **Never create features from the target variable.**
- Check the correlation matrix for suspiciously high correlations (|r| > 0.9), both between features and between each feature and the target.
- Verify coefficient signs make domain sense before deploying.
- Ask: "Would I have this feature at prediction time, without knowing the target?"
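
The correlation check from the list above can be sketched as a small helper. `flag_high_correlations` and the column names are hypothetical, and the data is synthetic; the 0.9 threshold is the one suggested in the list.

```python
import numpy as np

def flag_high_correlations(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above threshold.

    `columns` maps a column name to a 1-D array. Hypothetical helper,
    not part of the drill's code.
    """
    names = list(columns)
    data = np.column_stack([columns[name] for name in names])
    corr = np.corrcoef(data, rowvar=False)  # Pearson correlation matrix
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                flagged.append((names[i], names[j], float(corr[i, j])))
    return flagged

# Synthetic columns (assumption): one unrelated, one near copy
rng = np.random.default_rng(0)
n = 1000
feature_a = rng.normal(size=n)
columns = {
    "feature_a": feature_a,
    "feature_b": rng.normal(size=n),
    "near_copy_of_a": 2 * feature_a + 0.1 * rng.normal(size=n),
}

flagged = flag_high_correlations(columns)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:+.3f}")
```

With a DataFrame, `df[features].corr()` produces the same matrix. Including the target column in the check also surfaces features that track the target too closely to be plausible.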

---

## Key Takeaway

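**Multicollinearity breaks coefficient interpretation.** Even if the model predicts well (high R²), an individual coefficient can no longer be read as the effect of its feature when features are highly correlated: magnitudes, and even signs, can change with small changes in the data.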
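
One common way to put a number on "highly correlated" is the variance inflation factor (VIF), a diagnostic this drill does not itself use. The sketch below is a plain-numpy version on synthetic data: each column is regressed on the others, and VIF is `1 / (1 - R²)`. The function name and the data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """VIF per column of X (n_samples, n_features): 1 / (1 - R^2),
    where R^2 comes from regressing that column on all the others."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Intercept plus every column except column j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic data (assumption): x2 nearly duplicates x1, x3 is unrelated
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign that a coefficient should not be interpreted on its own; the entangled pair here lands far above that, while the unrelated column stays close to 1.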

This is especially dangerous when features are mathematically derived from the target — it's a form of **target leakage** that can produce good metrics but nonsensical explanations.