# Week 11: Feature Engineering & Model Explainability

## 🎯 Learning Objectives

By the end of this week, you will understand:
- **Feature Engineering**: Creating predictive signals
- **Feature Selection**: Choosing what matters
- **SHAP**: Game-theoretic feature importance
- **LIME**: Local interpretable explanations

---

## Why Feature Engineering?

Raw market data is rarely predictive on its own. Feature engineering transforms it into signals a model can learn from, such as:
- Technical indicators from prices
- Rolling statistics
- Cross-sectional features
- Interaction terms
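
The last two items in the list are easiest to see on a small panel of assets. The sketch below uses made-up tickers and random prices (all names and values are illustrative assumptions) to contrast a time-series feature, a cross-sectional rank, and an interaction term.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: daily closes for three made-up tickers (rows = days, columns = assets)
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    100 * np.cumprod(1 + rng.normal(0, 0.02, size=(60, 3)), axis=0),
    columns=["AAA", "BBB", "CCC"],
)

# Time-series feature: 5-day momentum, computed per asset from its own history
momentum_5d = prices.pct_change(5)

# Rolling statistic: 20-day volatility of daily returns
volatility_20d = prices.pct_change().rolling(20).std()

# Cross-sectional feature: each day's momentum ranked across assets (percentile in (0, 1])
momentum_rank = momentum_5d.rank(axis=1, pct=True)

# Interaction term: momentum divided by volatility (a risk-adjusted momentum)
risk_adj_momentum = momentum_5d / volatility_20d

print(momentum_rank.dropna().tail(3))
```

Ranking across assets on each day removes the market-wide level of momentum, which is often the point of a cross-sectional feature: the model sees relative strength rather than absolute moves.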

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("✅ Libraries loaded!")
print("📚 Week 11: Feature Engineering & Explainability")

---

## Part 1: Financial Feature Engineering

### Common Feature Categories

1. **Price-based**: Returns, log returns, momentum
2. **Volume-based**: Volume ratios, OBV
3. **Volatility**: Rolling std, ATR, Bollinger bands
4. **Technical**: RSI, MACD, moving averages
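
Several indicators named above (log returns, OBV, ATR, MACD) can each be written in a few lines of pandas. This is a minimal sketch on its own small synthetic series; the window lengths (14, 12, 26, 9) are the conventional defaults, and the ATR uses a simple rolling mean rather than Wilder's smoothing.

```python
import numpy as np
import pandas as pd

# Small synthetic OHLCV series (illustrative values only)
rng = np.random.default_rng(1)
n = 200
close = pd.Series(100 * np.cumprod(1 + rng.normal(0, 0.02, n)))
high = close * (1 + np.abs(rng.normal(0, 0.01, n)))
low = close * (1 - np.abs(rng.normal(0, 0.01, n)))
volume = pd.Series(rng.exponential(1e6, n))

# Price-based: log returns are additive across time
log_return = np.log(close / close.shift(1))

# Volume-based: On-Balance Volume adds volume on up days and subtracts it on down days
obv = (np.sign(close.diff()).fillna(0) * volume).cumsum()

# Volatility: Average True Range (simple 14-day mean of the true range)
prev_close = close.shift(1)
true_range = pd.concat(
    [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
).max(axis=1)
atr_14 = true_range.rolling(14).mean()

# Technical: MACD line = EMA(12) - EMA(26); signal line = EMA(9) of the MACD line
ema_12 = close.ewm(span=12, adjust=False).mean()
ema_26 = close.ewm(span=26, adjust=False).mean()
macd = ema_12 - ema_26
macd_signal = macd.ewm(span=9, adjust=False).mean()
```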

In [None]:
# Generate synthetic OHLCV data
n = 1000
np.random.seed(42)

returns = np.random.randn(n) * 0.02
close = 100 * np.cumprod(1 + returns)
high = close * (1 + np.abs(np.random.randn(n) * 0.01))
low = close * (1 - np.abs(np.random.randn(n) * 0.01))
open_price = close * (1 + np.random.randn(n) * 0.005)
# Keep each bar consistent: the open must lie within [low, high]
high = np.maximum(high, open_price)
low = np.minimum(low, open_price)
volume = np.random.exponential(1e6, n)

df = pd.DataFrame({
    'open': open_price,
    'high': high,
    'low': low,
    'close': close,
    'volume': volume
})

# Create features
def create_features(df):
    features = pd.DataFrame(index=df.index)
    
    # Returns
    features['return_1d'] = df['close'].pct_change()
    features['return_5d'] = df['close'].pct_change(5)
    features['return_20d'] = df['close'].pct_change(20)
    
    # Volatility
    features['volatility_20d'] = df['close'].pct_change().rolling(20).std()
    features['volatility_5d'] = df['close'].pct_change().rolling(5).std()
    
    # Volume
    features['volume_ratio'] = df['volume'] / df['volume'].rolling(20).mean()
    
    # Moving Averages
    features['ma_ratio_5_20'] = df['close'].rolling(5).mean() / df['close'].rolling(20).mean()
    features['ma_ratio_20_50'] = df['close'].rolling(20).mean() / df['close'].rolling(50).mean()
    
    # RSI (14-day, simple rolling average of gains and losses rather than Wilder's smoothing)
    delta = df['close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    rs = gain / loss
    features['rsi'] = 100 - (100 / (1 + rs))
    
    # Bollinger Band position (-1 = lower band, +1 = upper band, for 2-std bands)
    ma20 = df['close'].rolling(20).mean()
    std20 = df['close'].rolling(20).std()
    features['bb_position'] = (df['close'] - ma20) / (2 * std20)
    
    return features

features = create_features(df)
print("Feature Statistics")
print("="*60)
print(features.describe().round(4).T[['mean', 'std', 'min', 'max']])

---

## Part 2: Feature Selection

### Methods

1. **Correlation filtering**: Remove highly correlated features
2. **Variance threshold**: Remove low-variance features
3. **Importance-based**: Use model feature importance
4. **Recursive elimination**: Iteratively remove least important
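
The methods above can be sketched with scikit-learn on toy data. The column names, the 0.95 correlation cutoff, and the variance threshold are illustrative assumptions, not recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LinearRegression

# Toy data: one real signal, one noise column, one constant, one near-duplicate
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "constant": np.full(n, 1.0),
})
demo["signal_copy"] = demo["signal"] + rng.normal(scale=0.01, size=n)
target = 2.0 * demo["signal"] + rng.normal(scale=0.1, size=n)

# Variance threshold: drop features that barely vary
vt = VarianceThreshold(threshold=1e-8).fit(demo)
kept = demo.loc[:, vt.get_support()]

# Correlation filtering: for each highly correlated pair, drop the later column
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
kept = kept.drop(columns=to_drop)

# Importance-based recursive elimination: repeatedly drop the weakest coefficient
rfe = RFE(LinearRegression(), n_features_to_select=1).fit(kept, target)
selected = kept.columns[rfe.support_].tolist()

print("Dropped for correlation:", to_drop)
print("Selected by RFE:", selected)
```

On real data, fit every selector on the training split only, so that the choice of features does not leak information from the test period.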

In [None]:
# Prepare data
features['target'] = df['close'].pct_change(5).shift(-5)  # 5-day forward return
features_clean = features.dropna()

X = features_clean.drop('target', axis=1)
y = features_clean['target']

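# Chronological split (shuffle=False) so the model never trains on rows from after the test period.
# Note: the last 5 training targets still overlap the test window; a 5-row gap would remove that.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)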

# Correlation matrix
corr_matrix = X_train.corr().abs()

# Feature importance with Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Random Forest)")
print("="*50)
print(importance.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

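# Sort ascending so the most important feature appears at the top of the horizontal bar chart
importance.sort_values('importance').plot.barh(x='feature', y='importance', ax=axes[0], legend=False)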
axes[0].set_xlabel('Importance')
axes[0].set_title('Feature Importance')

# corr_matrix holds absolute correlations, so the color scale runs from 0 to 1
im = axes[1].imshow(corr_matrix, cmap='coolwarm', vmin=0, vmax=1)
axes[1].set_xticks(range(len(X.columns)))
axes[1].set_yticks(range(len(X.columns)))
axes[1].set_xticklabels(X.columns, rotation=45, ha='right')
axes[1].set_yticklabels(X.columns)
axes[1].set_title('Absolute Feature Correlation')
plt.colorbar(im, ax=axes[1])

plt.tight_layout()
plt.show()

---

## Part 3: SHAP - SHapley Additive exPlanations

### The Idea

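SHAP values are based on Shapley values from cooperative game theory. They attribute a prediction to the input features by measuring each feature's average marginal contribution across all possible subsets of the other features.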

### Properties

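- **Local accuracy**: For each prediction, the SHAP values sum to the prediction minus the expected (average) model output
- **Consistency**: If a model changes so that a feature's marginal contribution increases or stays the same, its SHAP value does not decrease
- **Missingness**: A feature that is missing from the input receives a SHAP value of zero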
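
These properties can be stated compactly. The notation here is an assumption introduced for illustration: $F$ is the set of all features, $f_S(x)$ is the model's expected output when only the features in $S$ are known, and $\phi_i$ is the SHAP value of feature $i$. The first equation is the Shapley value; the second is local accuracy.

$$
\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x) - f_S(x) \right]
$$

$$
f(x) = \phi_0 + \sum_{i \in F} \phi_i, \qquad \phi_0 = \mathbb{E}[f(X)]
$$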

### 🤔 Simple Explanation

SHAP asks: "How much did each feature contribute to moving the prediction away from the average?" It's like splitting the credit among team members fairly.
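
To make the "fair split" concrete, the sketch below computes exact Shapley values by brute force for a hypothetical three-feature model, without the `shap` library. The toy model, the background sample, and the helper names are all assumptions for illustration. Enumerating every subset is only feasible for a handful of features, which is why practical explainers rely on faster algorithms.

```python
from itertools import combinations
from math import factorial

import numpy as np


def toy_model(X):
    """Hypothetical model: a linear term plus an interaction between features 1 and 2."""
    return 2.0 * X[:, 0] + X[:, 1] * X[:, 2]


def coalition_value(model, x, background, subset):
    """Average model output when the features in `subset` are fixed to x's values."""
    mixed = background.copy()
    cols = list(subset)
    if cols:
        mixed[:, cols] = x[cols]
    return model(mixed).mean()


def exact_shapley(model, x, background):
    """Weighted average of each feature's marginal contribution over all subsets."""
    n = x.shape[0]
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = (coalition_value(model, x, background, subset + (i,))
                        - coalition_value(model, x, background, subset))
                phi[i] += weight * gain
    return phi


rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))   # stand-in for training data
x = np.array([1.0, 2.0, -1.0])           # instance to explain

phi = exact_shapley(toy_model, x, background)
base_value = toy_model(background).mean()
prediction = toy_model(x[None, :])[0]

print("SHAP values:", phi.round(4))
print("base + sum(phi):", round(base_value + phi.sum(), 4), "prediction:", prediction)
```

The final line illustrates local accuracy: the base value plus the SHAP values reproduces the model's prediction for this instance.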

In [None]:
try:
    import shap
    
    # Use TreeExplainer for speed
    explainer = shap.TreeExplainer(rf)
    
    # Calculate SHAP values for test set (subset for speed)
    X_explain = X_test.iloc[:100]  # first 100 test rows, selected by position
    shap_values = explainer.shap_values(X_explain)
    
    print("SHAP Analysis")
    print("="*50)
    
    # Global importance
    shap_importance = np.abs(shap_values).mean(axis=0)
    for feat, imp in sorted(zip(X.columns, shap_importance), key=lambda x: -x[1]):
        print(f"  {feat}: {imp:.6f}")
    
    # Summary plot
    plt.figure(figsize=(10, 6))
    shap.summary_plot(shap_values, X_explain, show=False)
    plt.tight_layout()
    plt.show()
    
except ImportError:
    print("⚠️ SHAP not installed. Install with: pip install shap")

In [None]:
# Individual prediction explanation
try:
    sample_idx = 50
    sample = X_explain.iloc[sample_idx:sample_idx+1]
    pred = rf.predict(sample)[0]
    
    print(f"\nSingle Prediction Explanation")
    print("="*50)
    print(f"Predicted return: {pred:.4%}")
    print(f"Actual return: {y_test.iloc[sample_idx]:.4%}")
    print(f"\nFeature contributions:")
    
    for feat, val, sv in sorted(zip(X.columns, sample.values[0], shap_values[sample_idx]), 
                                 key=lambda x: -abs(x[2]))[:5]:
        direction = "↑" if sv > 0 else "↓"
        print(f"  {feat}: {val:.4f} → {direction} {sv:.6f}")
        
except NameError:
    # X_explain and shap_values are undefined if the previous cell could not import shap
    print("SHAP analysis skipped")

---

## Part 4: LIME - Local Interpretable Model-agnostic Explanations

### The Idea

LIME fits a simple, interpretable model in the neighborhood of a single prediction to approximate how the complex model behaves there.

### Steps

1. Generate perturbed samples around the instance
2. Get predictions from the complex model
3. Weight samples by proximity
4. Fit a linear model to explain locally
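
The four steps above can be sketched by hand with NumPy and scikit-learn. This is a simplified illustration, not the `lime` library's implementation: the data, the perturbation scale, and `kernel_width` are assumed values, and the real library also discretizes and rescales features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# A "complex" model to explain (stand-in data: y depends on x0 and x1, not x2)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)
black_box = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

instance = np.array([0.5, -0.5, 0.0])

# 1. Generate perturbed samples around the instance
perturbed = instance + rng.normal(scale=0.5, size=(1000, 3))

# 2. Get predictions from the complex model
preds = black_box.predict(perturbed)

# 3. Weight samples by proximity (exponential kernel on Euclidean distance)
kernel_width = 0.75
distances = np.linalg.norm(perturbed - instance, axis=1)
weights = np.exp(-(distances ** 2) / kernel_width ** 2)

# 4. Fit a weighted linear model; its coefficients are the local explanation
surrogate = Ridge(alpha=1.0).fit(perturbed, preds, sample_weight=weights)
for name, coef in zip(["x0", "x1", "x2"], surrogate.coef_):
    print(f"{name}: {coef:+.3f}")
```

The surrogate's coefficients describe the model only near `instance`; a different instance or kernel width can give a different explanation.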

### 🤔 Simple Explanation

LIME asks: "Even if the model is complex, can we explain THIS prediction with a simple linear model?" It zooms in on one prediction and finds a simple explanation.

In [None]:
try:
    import lime
    import lime.lime_tabular
    
    # Create LIME explainer
    lime_explainer = lime.lime_tabular.LimeTabularExplainer(
        X_train.values,
        feature_names=X_train.columns.tolist(),
        mode='regression'
    )
    
    # Explain a single prediction
    sample_idx = 50
    exp = lime_explainer.explain_instance(
        X_test.iloc[sample_idx].values,
        rf.predict,
        num_features=10
    )
    
    print("LIME Explanation")
    print("="*50)
    print(f"Prediction: {rf.predict(X_test.iloc[sample_idx:sample_idx+1])[0]:.4%}")
    # as_list() returns (feature condition, weight) pairs from the local linear model
    print("\nLocal Linear Model Weights (top 5):")
    for feat, coef in exp.as_list()[:5]:
        print(f"  {feat}: {coef:.6f}")
    
except ImportError:
    print("⚠️ LIME not installed. Install with: pip install lime")

---

## Interview Questions

### Feature Engineering
1. What features would you create for a momentum strategy?
2. How do you handle lookahead bias in feature creation?
3. When would you use cross-sectional vs. time-series features?

### Explainability
1. What's the difference between SHAP and LIME?
2. Why is model explainability important in finance?
3. Can you trust feature importance from Random Forest?

### Technical
1. How do you handle multicollinearity in feature selection?
2. What are the computational costs of SHAP vs. LIME?
3. How would you explain model decisions to regulators?

---

## Key Takeaways

| Topic | Key Point |
|-------|----------|
| Feature Engineering | Transform raw data into predictive signals |
| Feature Selection | Remove redundant features, keep the important ones |
| SHAP | Global & local importance, theoretically grounded |
| LIME | Model-agnostic, local linear approximation |