# ANSWER KEY: Debug Drill 02 - Misleading Coefficients

**Bug:** Logistic regression coefficients are misleading because:
1. Features aren't scaled (different magnitudes)
2. Multicollinearity between features

**Key Lesson:** Always scale features before interpreting coefficients, and check for multicollinearity.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/data/streamcart_customers.csv')

## The Bug (Colleague's Code)

In [None]:
# ===== BUGGY CODE =====
features = ['tenure_months', 'logins_last_30d', 'orders_last_30d', 'support_tickets_last_30d']
X = df[features].fillna(0)
y = df['churn_30d']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# NO SCALING!
model_buggy = LogisticRegression(random_state=42, max_iter=1000)
model_buggy.fit(X_train, y_train)

print("Colleague's interpretation:")
print("="*50)
for feat, coef in zip(features, model_buggy.coef_[0]):
    print(f"{feat}: {coef:.4f}")
print("\n'tenure_months has the smallest coefficient,")
print("so it must be the least important feature!'")

## Why This Is Wrong

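**Problem 1: Different scales**
- `tenure_months`: roughly 0-60
- `logins_last_30d`: roughly 0-100+

A raw coefficient is the change in log-odds per one unit of its feature, so its size depends on the feature's units and spread. A feature with a wide range tends to get a small coefficient even when it is strongly predictive. Without scaling, coefficient magnitudes reflect scale, not importance!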

**Problem 2: Multicollinearity**
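If `logins_last_30d` and `orders_last_30d` are highly correlated, the model can split their shared signal between them almost arbitrarily. Their individual coefficients then become unstable (small changes in the data can shift their sizes and even their signs) and hard to interpret.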
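
To make Problem 1 concrete, here is a minimal sketch on synthetic data (not the StreamCart dataset). All values are made up for illustration: two independent features carry the same true effect per standard deviation, but the second is stored in units 50 times larger. Under these assumptions the feature in larger units should receive a much smaller raw coefficient, while the standardized coefficients should come out close to each other.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5000

# Two independent signals with the same true effect per standard deviation
z_a = rng.normal(size=n)
z_b = rng.normal(size=n)
log_odds = 1.0 * z_a + 1.0 * z_b
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

# Same information, different units: the second column is stored on a 50x larger scale
X_raw = np.column_stack([z_a, 50.0 * z_b])

# Fit once on raw features and once on standardized features
raw_coefs = LogisticRegression(max_iter=5000).fit(X_raw, y).coef_[0]
X_scaled = StandardScaler().fit_transform(X_raw)
scaled_coefs = LogisticRegression(max_iter=5000).fit(X_scaled, y).coef_[0]

print("raw coefficients:         ", raw_coefs.round(4))
print("standardized coefficients:", scaled_coefs.round(4))
```

Ranking features by the raw coefficients would call the second feature nearly irrelevant, even though it was generated to matter exactly as much as the first.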

In [None]:
# Check the scales
print("Feature scales (without standardization):")
print(X[features].describe().loc[['mean', 'std']])

print("\nCorrelation matrix (check for multicollinearity):")
print(X[features].corr().round(2))

## The Fix

In [None]:
# ===== FIXED CODE =====

# Step 1: Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Train with scaled features
model_fixed = LogisticRegression(random_state=42, max_iter=1000)
model_fixed.fit(X_train_scaled, y_train)

print("Correct interpretation (after scaling):")
print("="*50)
coef_df = pd.DataFrame({
    'feature': features,
    'coefficient': model_fixed.coef_[0],
    'abs_coefficient': np.abs(model_fixed.coef_[0])
}).sort_values('abs_coefficient', ascending=False)

for _, row in coef_df.iterrows():
    direction = "increases" if row['coefficient'] > 0 else "decreases"
    print(f"{row['feature']}: {row['coefficient']:.4f} ({direction} churn risk)")

print(f"\nLargest standardized coefficient: {coef_df.iloc[0]['feature']}")

In [None]:
# Step 3: If multicollinearity is high, consider removing correlated features
corr_matrix = X[features].corr().abs()

# Find pairs with correlation > 0.7
high_corr_pairs = []
for i in range(len(features)):
    for j in range(i+1, len(features)):
        if corr_matrix.iloc[i, j] > 0.7:
            high_corr_pairs.append((features[i], features[j], corr_matrix.iloc[i, j]))

if high_corr_pairs:
    print("WARNING: High correlation detected:")
    for f1, f2, corr in high_corr_pairs:
        print(f"  {f1} <-> {f2}: {corr:.2f}")
    print("Consider removing one feature from each pair.")
else:
    print("No severe multicollinearity detected.")
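
Pairwise correlations can miss collinearity that involves three or more features at once. One common complement is the variance inflation factor (VIF), which can be read off the diagonal of the inverse correlation matrix. The sketch below uses synthetic data with illustrative column names; the helper `variance_inflation_factors` is hypothetical and not part of the notebook, and cutoffs such as 5 or 10 are rules of thumb rather than hard limits.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are samples)."""
    corr = np.corrcoef(X, rowvar=False)
    # VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of corr^-1
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
n = 5000
logins = rng.normal(size=n)
orders = logins + 0.3 * rng.normal(size=n)  # strongly tied to logins
tickets = rng.normal(size=n)                # unrelated to the other two

X_demo = np.column_stack([logins, orders, tickets])
vifs = variance_inflation_factors(X_demo)
for name, vif in zip(["logins", "orders", "tickets"], vifs):
    print(f"{name}: VIF = {vif:.2f}")
```

A VIF near 1 means a feature is nearly uncorrelated with the rest; large values flag features whose coefficients are likely to be unstable.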

In [None]:
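# Self-check: standardized training features should have mean ~0 and std ~1
means = X_train_scaled.mean(axis=0)
stds = X_train_scaled.std(axis=0)
assert np.allclose(means, 0, atol=1e-6), "Scaled features are not centered"
assert np.allclose(stds, 1, atol=1e-6), "Scaled features do not have unit variance"
print("PASS: Features scaled correctly; coefficients are now comparable.")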

## Completed Postmortem

### What happened:
- Colleague interpreted raw coefficients as "importance" without scaling
- `tenure_months` appeared unimportant because of the units it is measured in, not because of its actual predictive power

### Root cause:
- Logistic regression coefficients are scale-dependent
- A coefficient of 0.01 on a 0-100 scale shifts the log-odds by the same amount over the feature's full range as a coefficient of 1.0 on a 0-1 scale
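
The arithmetic behind the second bullet can be checked directly. In this sketch the feature values are illustrative only: the same quantity is stored once on a 0-1 scale and once on a 0-100 scale, and the two coefficient/scale pairs give the same contribution to the log-odds at every point.

```python
import numpy as np

x_unit = np.linspace(0.0, 1.0, 5)   # feature stored on a 0-1 scale
x_percent = 100.0 * x_unit          # same feature stored on a 0-100 scale

contribution_unit = 1.0 * x_unit          # coefficient 1.0 on the 0-1 scale
contribution_percent = 0.01 * x_percent   # coefficient 0.01 on the 0-100 scale

print(contribution_unit)
print(contribution_percent)
print("Identical contributions:", np.allclose(contribution_unit, contribution_percent))
```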

### How to prevent:
- Always use `StandardScaler` (fit on the training split only) before training when you need to interpret coefficients
- Check the correlation matrix for multicollinearity (an absolute correlation above 0.7 is a common warning sign)
- Report "standardized coefficients" when comparing feature importance
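
One way to make the first prevention step hard to forget is to bundle scaling and the model in a single scikit-learn pipeline, so the scaler is fit on the training split only and the stored coefficients are always standardized. This is a sketch on synthetic data, not the notebook's own code; the feature names are borrowed for illustration and the data-generating process is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000
feature_names = ["tenure_months", "logins_last_30d"]  # illustrative names
X = np.column_stack([rng.uniform(0, 60, n), rng.poisson(8, n)])

# Assumed data-generating process: longer tenure and more logins lower churn
log_odds = 1.0 - 0.05 * X[:, 0] - 0.1 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_train only, then trains on the scaled values
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
std_coefs = pipe.named_steps["logisticregression"].coef_[0]
for name, coef in zip(feature_names, std_coefs):
    print(f"{name}: {coef:+.3f} log-odds per standard deviation")
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

Because the scaler lives inside the pipeline, calling `pipe.predict` or `pipe.score` on new data applies the training-set scaling automatically.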