# Multi-Factor Alpha Engine - Exploratory Data Analysis

This notebook provides an end-to-end demonstration of the Alpha Engine pipeline with exploratory data analysis.

## Sections:
1. Data Loading and Processing
2. Feature Engineering Analysis
3. Factor Performance Analysis
4. Model Training and Evaluation
5. Portfolio Construction Demo
6. Performance Visualization

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Import alpha engine modules
import sys
sys.path.append('..')

from alpha_engine.data import load_and_process_data
from alpha_engine.features import FeatureEngine
from alpha_engine.models import ModelEnsemble
from alpha_engine.portfolio import PortfolioOptimizer
from alpha_engine.backtest import Backtester, PerformanceAnalyzer

print("✅ All imports successful!")

## 1. Data Loading and Processing

In [None]:
# Load and process data
print("Loading and processing equity data...")
data = load_and_process_data(config_path="../config.yaml", force_refresh=False)

print(f"\nDataset Overview:")
print(f"Shape: {data.shape}")
print(f"Date range: {data['Date'].min()} to {data['Date'].max()}")
print(f"Number of tickers: {data['ticker'].nunique()}")
print(f"Columns: {list(data.columns)}")

In [None]:
# Data quality analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Number of tickers over time
ticker_counts = data.groupby('Date')['ticker'].nunique()
axes[0, 0].plot(ticker_counts.index, ticker_counts.values)
axes[0, 0].set_title('Number of Tickers Over Time')
axes[0, 0].set_ylabel('Count')

# Daily return distribution
axes[0, 1].hist(data['return_1d'].dropna(), bins=100, alpha=0.7, edgecolor='black')
axes[0, 1].set_title('Daily Return Distribution')
axes[0, 1].set_xlabel('Daily Return')
axes[0, 1].set_ylabel('Frequency')

# Volume distribution (log scale)
axes[1, 0].hist(np.log(data['Volume'] + 1), bins=50, alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Log Volume Distribution')
axes[1, 0].set_xlabel('Log(Volume)')
axes[1, 0].set_ylabel('Frequency')

# Price distribution (log scale)
axes[1, 1].hist(np.log(data['Close']), bins=50, alpha=0.7, edgecolor='black')
axes[1, 1].set_title('Log Price Distribution')
axes[1, 1].set_xlabel('Log(Price)')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

## 2. Feature Engineering Analysis

In [None]:
# Engineer features
print("Engineering features...")
engine = FeatureEngine(config_path="../config.yaml")
featured_data = engine.engineer_features(data)

# Get feature columns
feature_cols = [col for col in featured_data.columns if col.endswith('_zscore')]
print(f"\nTotal features engineered: {len(feature_cols)}")
print(f"Feature categories:")

# Categorize features
categories = {
    'Value': ['book_to_market', 'earnings_yield', 'sales_to_price', 'dividend_yield'],
    'Momentum': ['momentum_12_1', 'momentum_6_1', 'short_term_reversal', 'price_trend'],
    'Quality': ['roe_proxy', 'profit_margin_proxy', 'accruals_proxy', 'earnings_quality'],
    'Size': ['market_cap_proxy', 'log_market_cap', 'relative_size'],
    'Liquidity': ['turnover', 'amihud_illiquidity', 'bid_ask_spread', 'volume_trend'],
    'Volatility': ['realized_vol_21d', 'realized_vol_63d', 'vol_of_vol', 'ewma_vol'],
    'Technical': ['rsi_14', 'macd_signal', 'bollinger_position', 'williams_r'],
    'Risk': ['beta', 'idiosyncratic_vol', 'return_skewness', 'return_kurtosis']
}

for category, features in categories.items():
    available_features = [f for f in features if f in featured_data.columns]
    print(f"  {category}: {len(available_features)} features")

In [None]:
# Feature correlation analysis
sample_features = feature_cols[:20]  # Sample of features for visualization
recent_data = featured_data[featured_data['Date'] >= '2020-01-01']
correlation_matrix = recent_data[sample_features].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0,
            square=True, fmt='.2f')
plt.title('Feature Correlation Matrix (Sample)', fontsize=16)
plt.tight_layout()
plt.show()
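
A heatmap shows the overall correlation structure, but specific redundant pairs are hard to read off it. The following is a minimal sketch that lists the most strongly correlated pairs from any correlation matrix. `top_correlated_pairs` and the toy columns `a`, `b`, `c` are hypothetical and not part of `alpha_engine`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(corr, threshold=0.8):
    """Return feature pairs with |correlation| >= threshold, strongest first."""
    cols = corr.columns
    # Strict upper triangle, so each unordered pair appears exactly once
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.Series(
        corr.to_numpy()[i, j],
        index=pd.MultiIndex.from_arrays([cols[i], cols[j]]),
    )
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy data: b is nearly a multiple of a, c is unrelated to both
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [1.0, -1.0, 1.0, -1.0, 1.0],
})
pairs = top_correlated_pairs(demo.corr(), threshold=0.8)
print(pairs)
```

In this notebook the same helper could be applied to `correlation_matrix` to flag candidate features for pruning.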

## 3. Factor Performance Analysis

In [None]:
# Analyze factor performance over time
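def calculate_factor_ic(data, factor_col, return_col='return_21d', periods=21):
    """Calculate the daily cross-sectional Information Coefficient (IC) for a factor.

    Assumes `return_col` holds the trailing `periods`-day return, so shifting it
    back by `periods` rows within each ticker yields the forward return.
    """
    # Forward returns (sort_values returns a copy, so the caller's frame is untouched)
    data = data.sort_values(['ticker', 'Date'])
    data['forward_return'] = data.groupby('ticker')[return_col].shift(-periods)
    
    # Calculate IC by date: Pearson correlation between factor and forward return
    ic_by_date = data.groupby('Date')[[factor_col, 'forward_return']].apply(
        lambda x: x[factor_col].corr(x['forward_return'])
    ).dropna()
    
    return ic_by_date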

# Calculate IC for sample factors
sample_factors = ['momentum_12_1_zscore', 'book_to_market_zscore', 
                 'realized_vol_21d_zscore', 'rsi_14_zscore']

ic_results = {}
for factor in sample_factors:
    if factor in featured_data.columns:
        ic_ts = calculate_factor_ic(featured_data, factor)
        ic_results[factor.replace('_zscore', '')] = ic_ts

# Plot IC time series
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

for i, (factor_name, ic_ts) in enumerate(ic_results.items()):
    if i < 4:
        axes[i].plot(ic_ts.index, ic_ts.values, alpha=0.7)
        axes[i].axhline(y=0, color='black', linestyle='--', alpha=0.5)
        axes[i].set_title(f'{factor_name} - Information Coefficient')
        axes[i].set_ylabel('IC')
        
        # Add rolling mean
        rolling_ic = ic_ts.rolling(252).mean()
        axes[i].plot(rolling_ic.index, rolling_ic.values, 
                    color='red', linewidth=2, label='1Y Rolling Mean')
        axes[i].legend()

plt.tight_layout()
plt.show()

# IC statistics
print("\nInformation Coefficient Statistics:")
for factor_name, ic_ts in ic_results.items():
    mean_ic = ic_ts.mean()
    std_ic = ic_ts.std()
    ir = mean_ic / std_ic if std_ic > 0 else 0
    print(f"{factor_name:20} | Mean IC: {mean_ic:6.3f} | Std IC: {std_ic:6.3f} | IR: {ir:6.3f}")
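
The IC computed above is a Pearson correlation between factor values and forward returns. A rank-based (Spearman) IC is often reported alongside it because it is less sensitive to outliers and captures any monotone relationship. The sketch below runs on synthetic data; `rank_ic_by_date` and the column names `factor` and `forward_return` are hypothetical and not part of `alpha_engine`.

```python
import numpy as np
import pandas as pd

def rank_ic_by_date(df, factor_col="factor", fwd_col="forward_return", date_col="Date"):
    """Spearman rank IC per date: Pearson correlation of within-date ranks."""
    clean = df[[date_col, factor_col, fwd_col]].dropna()
    ic = {}
    for date, group in clean.groupby(date_col):
        if len(group) < 3:  # too few names for a meaningful correlation
            continue
        ic[date] = group[factor_col].rank().corr(group[fwd_col].rank())
    return pd.Series(ic, name="rank_ic").sort_index()

# Synthetic panel: 5 dates x 50 names
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-02", periods=5, freq="B")
demo = pd.DataFrame({
    "Date": dates.repeat(50),
    "factor": rng.normal(size=250),
})
# Monotone but nonlinear link: the rank IC is 1 while the Pearson IC is below 1
demo["forward_return"] = np.exp(demo["factor"])

rank_ic = rank_ic_by_date(demo)
pearson_ic = demo.groupby("Date")[["factor", "forward_return"]].apply(
    lambda g: g["factor"].corr(g["forward_return"])
)
print(rank_ic.round(3))
print(pearson_ic.round(3))
```

Ranking within each date before correlating avoids a dependency on SciPy, which pandas needs for `Series.corr(method='spearman')`.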

## 4. Model Training and Evaluation

In [None]:
# Prepare training data
print("Preparing training data...")
X, y = engine.get_feature_matrix(
    featured_data, 
    use_zscore=True,
    start_date='2010-01-01',
    end_date='2020-01-01'
)

print(f"Training set shape: {X.shape}")
print(f"Target statistics: Mean={y.mean():.4f}, Std={y.std():.4f}")

# Train ensemble
print("\nTraining model ensemble...")
import yaml
with open('../config.yaml', 'r') as f:
    config = yaml.safe_load(f)

ensemble = ModelEnsemble(config)
ensemble.fit(X, y, blend_method='equal')

print(f"Ensemble weights: {ensemble.weights}")

In [None]:
# Model performance comparison (in-sample: metrics are computed on the training data X, y)
models_performance = {}

for model_name, model in ensemble.models.items():
    if model.is_fitted:
        predictions = model.predict(X)
        
        # Calculate metrics
        correlation = np.corrcoef(y, predictions)[0, 1]
        hit_rate = (np.sign(y) == np.sign(predictions)).mean()
        
        models_performance[model_name] = {
            'correlation': correlation,
            'hit_rate': hit_rate
        }

# Ensemble predictions
ensemble_pred = ensemble.predict(X)
ensemble_corr = np.corrcoef(y, ensemble_pred)[0, 1]
ensemble_hit_rate = (np.sign(y) == np.sign(ensemble_pred)).mean()
models_performance['ensemble'] = {
    'correlation': ensemble_corr,
    'hit_rate': ensemble_hit_rate
}

# Plot performance comparison
performance_df = pd.DataFrame(models_performance).T

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

performance_df['correlation'].plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('Model Correlation with Future Returns')
axes[0].set_ylabel('Correlation')
axes[0].tick_params(axis='x', rotation=45)

performance_df['hit_rate'].plot(kind='bar', ax=axes[1], color='lightgreen')
axes[1].set_title('Model Hit Rate (Directional Accuracy)')
axes[1].set_ylabel('Hit Rate')
axes[1].tick_params(axis='x', rotation=45)
axes[1].axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Random')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\nModel Performance Summary:")
print(performance_df.round(4))
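
The correlation and hit rate above are computed on the same data the models were fitted on, so they are in-sample figures. The sketch below computes the same two metrics on a chronological holdout. It uses synthetic data and an ordinary least squares fit via NumPy as a stand-in for the ensemble; `evaluate_predictions`, `X_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np

def evaluate_predictions(y_true, y_pred):
    """Correlation and directional hit rate between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "correlation": float(np.corrcoef(y_true, y_pred)[0, 1]),
        "hit_rate": float((np.sign(y_true) == np.sign(y_pred)).mean()),
    }

# Synthetic features and a noisy linear target (rows assumed ordered in time)
rng = np.random.default_rng(1)
n_obs, n_features = 2000, 5
X_demo = rng.normal(size=(n_obs, n_features))
true_beta = np.array([0.05, -0.03, 0.0, 0.02, 0.0])
y_demo = X_demo @ true_beta + rng.normal(scale=0.1, size=n_obs)

# Chronological split: fit on the first 80%, evaluate on the last 20%
split = int(n_obs * 0.8)
X_train, X_test = X_demo[:split], X_demo[split:]
y_train, y_test = y_demo[:split], y_demo[split:]

beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
in_sample = evaluate_predictions(y_train, X_train @ beta_hat)
out_of_sample = evaluate_predictions(y_test, X_test @ beta_hat)
print("In-sample:    ", in_sample)
print("Out-of-sample:", out_of_sample)
```

A chronological split is used rather than a random one so that no observation in the test period influences the fit.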

## 5. Portfolio Construction Demo

In [None]:
# Portfolio construction example
print("Demonstrating portfolio construction...")

# Generate sample signals
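# Latest available row per ticker (groupby().last() would instead take the last
# non-null value of each column, mixing observations from different dates)
recent_data = (
    featured_data[featured_data['Date'] >= '2023-01-01']
    .sort_values('Date')
    .drop_duplicates('ticker', keep='last')
    .set_index('ticker')
)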
sample_X = recent_data[feature_cols].fillna(0)
signals = pd.Series(ensemble.predict(sample_X), index=sample_X.index)

print(f"Generated signals for {len(signals)} stocks")
print(f"Signal statistics: Mean={signals.mean():.4f}, Std={signals.std():.4f}")

# Create return history for risk estimation
return_history = featured_data[featured_data['Date'] >= '2022-01-01'].pivot(
    index='Date', columns='ticker', values='return_1d'
).dropna(axis=1, thresh=200)  # Require at least 200 observations

# Portfolio optimization
optimizer = PortfolioOptimizer('../config.yaml')
positions, metrics = optimizer.create_dollar_neutral_portfolio(
    signals, return_history, capital=1000000
)

print(f"\nPortfolio Construction Results:")
print(f"Number of positions: {len(positions)}")
print(f"Long positions: {(positions > 0).sum()}")
print(f"Short positions: {(positions < 0).sum()}")
print(f"\nPortfolio Metrics:")
for metric, value in metrics.items():
    if isinstance(value, (int, float, np.number)):  # also accept NumPy scalar types
        print(f"  {metric}: {value:.4f}")
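
The internals of `create_dollar_neutral_portfolio` are not shown in this notebook. As a rough illustration of what dollar neutrality means for the resulting positions, the sketch below builds volatility-scaled weights that sum to zero and whose absolute values sum to the capital. This is an assumed, simplified scheme, not the optimizer's actual method; `simple_dollar_neutral` and the tickers are hypothetical.

```python
import pandas as pd

def simple_dollar_neutral(signals, vols, capital=1_000_000):
    """Signal / volatility weights, demeaned and scaled so gross exposure = capital."""
    common = signals.index.intersection(vols.index)
    raw = signals.loc[common] / vols.loc[common]
    raw = raw - raw.mean()          # longs and shorts offset: net exposure of zero
    gross = raw.abs().sum()
    if gross == 0:                  # no cross-sectional dispersion, hold nothing
        return pd.Series(0.0, index=common)
    return raw / gross * capital

demo_signals = pd.Series({"AAA": 0.02, "BBB": -0.01, "CCC": 0.00, "DDD": 0.03})
demo_vols = pd.Series({"AAA": 0.20, "BBB": 0.10, "CCC": 0.25, "DDD": 0.30})
demo_positions = simple_dollar_neutral(demo_signals, demo_vols)
print(demo_positions.round(2))
print("Net exposure:  ", round(demo_positions.sum(), 6))
print("Gross exposure:", round(demo_positions.abs().sum(), 2))
```

The same two checks, `positions.sum()` and `positions.abs().sum()`, can be run on the `positions` returned above to confirm the portfolio is close to dollar neutral.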

In [None]:
# Portfolio composition analysis
if len(positions) > 0:
    # Top long and short positions
    long_positions = positions[positions > 0].sort_values(ascending=False)
    short_positions = positions[positions < 0].sort_values()
    
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Top 10 long positions
    if len(long_positions) > 0:
        top_long = long_positions.head(10)
        axes[0].barh(range(len(top_long)), top_long.values, color='green', alpha=0.7)
        axes[0].set_yticks(range(len(top_long)))
        axes[0].set_yticklabels(top_long.index)
        axes[0].set_title('Top 10 Long Positions')
        axes[0].set_xlabel('Position Size ($)')
    
    # Top 10 short positions
    if len(short_positions) > 0:
        top_short = short_positions.head(10)
        axes[1].barh(range(len(top_short)), top_short.values, color='red', alpha=0.7)
        axes[1].set_yticks(range(len(top_short)))
        axes[1].set_yticklabels(top_short.index)
        axes[1].set_title('Top 10 Short Positions')
        axes[1].set_xlabel('Position Size ($)')
    
    plt.tight_layout()
    plt.show()
    
    # Position size distribution
    plt.figure(figsize=(10, 6))
    plt.hist(positions.values, bins=50, alpha=0.7, edgecolor='black')
    plt.axvline(x=0, color='red', linestyle='--', alpha=0.7)
    plt.title('Portfolio Position Size Distribution')
    plt.xlabel('Position Size ($)')
    plt.ylabel('Frequency')
    plt.show()

## 6. Performance Visualization

In [None]:
# Create sample backtest results for visualization
# This is a simplified example - full backtest would be run via run_pipeline.py

# Generate synthetic performance data for demonstration
np.random.seed(42)
dates = pd.date_range('2020-01-01', '2023-12-31', freq='M')
n_periods = len(dates)

# Simulate monthly returns
alpha_returns = np.random.normal(0.01, 0.04, n_periods)  # 1% monthly mean, 4% vol
benchmark_returns = np.random.normal(0.007, 0.035, n_periods)  # 0.7% monthly mean, 3.5% vol

# Create performance DataFrame
performance_data = pd.DataFrame({
    'date': dates,
    'alpha_return': alpha_returns,
    'benchmark_return': benchmark_returns
})
performance_data['alpha_cumulative'] = (1 + performance_data['alpha_return']).cumprod()
performance_data['benchmark_cumulative'] = (1 + performance_data['benchmark_return']).cumprod()
performance_data['alpha_value'] = 1000000 * performance_data['alpha_cumulative']
performance_data['benchmark_value'] = 1000000 * performance_data['benchmark_cumulative']

# Plot equity curves
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Equity curve comparison
axes[0, 0].plot(performance_data['date'], performance_data['alpha_value'], 
               label='Alpha Strategy', linewidth=2, color='blue')
axes[0, 0].plot(performance_data['date'], performance_data['benchmark_value'], 
               label='Benchmark', linewidth=2, color='orange')
axes[0, 0].set_title('Portfolio Value Comparison')
axes[0, 0].set_ylabel('Portfolio Value ($)')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Rolling Sharpe ratio
rolling_window = 12
alpha_rolling_sharpe = performance_data['alpha_return'].rolling(rolling_window).mean() / \
                      performance_data['alpha_return'].rolling(rolling_window).std() * np.sqrt(12)
benchmark_rolling_sharpe = performance_data['benchmark_return'].rolling(rolling_window).mean() / \
                          performance_data['benchmark_return'].rolling(rolling_window).std() * np.sqrt(12)

axes[0, 1].plot(performance_data['date'], alpha_rolling_sharpe, 
               label='Alpha Strategy', linewidth=2, color='blue')
axes[0, 1].plot(performance_data['date'], benchmark_rolling_sharpe, 
               label='Benchmark', linewidth=2, color='orange')
axes[0, 1].set_title('Rolling 12-Month Sharpe Ratio')
axes[0, 1].set_ylabel('Sharpe Ratio')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Monthly returns distribution
axes[1, 0].hist(performance_data['alpha_return'], bins=20, alpha=0.7, 
               label='Alpha Strategy', color='blue', edgecolor='black')
axes[1, 0].hist(performance_data['benchmark_return'], bins=20, alpha=0.7, 
               label='Benchmark', color='orange', edgecolor='black')
axes[1, 0].set_title('Monthly Returns Distribution')
axes[1, 0].set_xlabel('Monthly Return')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()

# Excess returns
excess_returns = performance_data['alpha_return'] - performance_data['benchmark_return']
axes[1, 1].plot(performance_data['date'], excess_returns, 
               linewidth=1, alpha=0.7, color='green')
axes[1, 1].axhline(y=0, color='black', linestyle='--', alpha=0.5)
axes[1, 1].set_title('Monthly Excess Returns')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Excess Return')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Performance summary
final_alpha_value = performance_data['alpha_value'].iloc[-1]
final_benchmark_value = performance_data['benchmark_value'].iloc[-1]
alpha_total_return = (final_alpha_value / 1000000) - 1
benchmark_total_return = (final_benchmark_value / 1000000) - 1
alpha_sharpe = performance_data['alpha_return'].mean() / performance_data['alpha_return'].std() * np.sqrt(12)
benchmark_sharpe = performance_data['benchmark_return'].mean() / performance_data['benchmark_return'].std() * np.sqrt(12)

print("\n📊 PERFORMANCE SUMMARY (Sample Data)")
print("=" * 45)
print(f"Alpha Strategy:")
print(f"  Total Return: {alpha_total_return:.1%}")
print(f"  Annualized Sharpe: {alpha_sharpe:.2f}")
print(f"  Final Value: ${final_alpha_value:,.0f}")
print(f"\nBenchmark:")
print(f"  Total Return: {benchmark_total_return:.1%}")
print(f"  Annualized Sharpe: {benchmark_sharpe:.2f}")
print(f"  Final Value: ${final_benchmark_value:,.0f}")
print(f"\nExcess Performance:")
print(f"  Excess Return: {alpha_total_return - benchmark_total_return:.1%}")
print(f"  Information Ratio: {excess_returns.mean() / excess_returns.std() * np.sqrt(12):.2f}")
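
The summary above reports total return, Sharpe ratio, and information ratio. Maximum drawdown is a common companion risk metric that the cell does not compute. The following is a minimal sketch; `max_drawdown` is a hypothetical helper, not part of `alpha_engine`.

```python
import pandas as pd

def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve (a value <= 0)."""
    equity = (1 + pd.Series(returns, dtype=float)).cumprod()
    # The starting capital (1.0) counts as the first peak
    running_peak = equity.cummax().clip(lower=1.0)
    drawdown = equity / running_peak - 1
    return float(drawdown.min())

example_returns = [0.10, -0.20, 0.05, -0.10, 0.30]
mdd = max_drawdown(example_returns)
print(f"Max drawdown: {mdd:.1%}")
```

Applied to this notebook, `max_drawdown(performance_data['alpha_return'])` would give the drawdown of the synthetic strategy series.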

## Conclusion

This notebook demonstrated the key components of the Multi-Factor Equity Alpha Engine:

1. **Data Processing**: Loaded and processed equity data, then checked ticker coverage and the return, volume, and price distributions
2. **Feature Engineering**: Generated 35+ factors across multiple categories
3. **Model Training**: Trained an ensemble of Ridge, XGBoost, and Neural Network models and compared them in-sample
4. **Portfolio Construction**: Applied Kelly optimization with risk constraints
5. **Performance Analysis**: Visualized returns, risk metrics, and benchmark comparison on synthetic sample data

### Next Steps:
- Run the full pipeline: `python run_pipeline.py --start 2005-01-01 --end 2024-06-30`
- Explore factor performance in different market regimes
- Analyze sector and style exposures
- Implement additional risk management techniques

### Key Metrics Target:
- **Target**: 13% CAGR, 1.3 Sharpe ratio
- **Benchmark**: S&P 1500 with ~7% CAGR, 0.6 Sharpe ratio

The engine is designed to be production-ready with modular components, comprehensive testing, and professional reporting capabilities.