# Tutorial 1: Single-Target SPY Prediction

**Learning Objectives:**
- Understand the fundamentals of quantitative trading strategy development
- Learn walk-forward backtesting to avoid look-ahead bias
- Implement feature engineering with sector ETFs
- Use xarray for standardized results handling
- Calculate risk-adjusted performance metrics

**Blue Water Macro Corp Educational Framework © 2025**

## Part 1: Setup and Data Loading

First, let's import our libraries and understand what we're trying to accomplish.

In [None]:
import sys
import os
sys.path.append('../src')

import numpy as np
import pandas as pd
import xarray as xr
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns

# Import our custom utilities
from utils_simulate import (
    simplify_teos, log_returns, p_by_year, 
    create_results_xarray, plot_xarray_results,
    calculate_performance_metrics, get_educational_help
)

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📚 Welcome to the Blue Water Macro Quantitative Trading Tutorial!")
print("🎯 Goal: Predict SPY returns using sector ETF data")

### Educational Moment: Why Log Returns?

Before we dive into data loading, let's understand a fundamental concept in quantitative finance:

In [None]:
# Get educational explanation
get_educational_help('log_returns')
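
The output of `get_educational_help` is not reproduced here, so the following is only a minimal sketch of the central idea, using made-up prices: log returns add across time, whereas simple returns have to be compounded.

```python
import numpy as np

# Hypothetical closing prices for four consecutive days (illustrative values only)
prices = np.array([100.0, 102.0, 99.0, 101.0])

# Simple returns: P_t / P_{t-1} - 1
simple_returns = prices[1:] / prices[:-1] - 1

# Log returns: ln(P_t / P_{t-1})
log_rets = np.log(prices[1:] / prices[:-1])

# Log returns sum to the log of the total price ratio ...
total_log_return = log_rets.sum()

# ... while simple returns have to be multiplied together
total_simple_return = np.prod(1 + simple_returns) - 1

print(f"Sum of log returns:       {total_log_return:.6f}")
print(f"log(P_end / P_start):     {np.log(prices[-1] / prices[0]):.6f}")
print(f"Compounded simple return: {total_simple_return:.6f}")
print(f"exp(sum of logs) - 1:     {np.exp(total_log_return) - 1:.6f}")
```

The first two printed values match, as do the last two, which is why a cumulative log-return series is converted back to a growth curve with `exp` rather than with `(1 + r).cumprod()`.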

### Load Market Data

We'll use SPDR sector ETFs as features to predict SPY (S&P 500) returns:

In [None]:
# Define our universe
TARGET_ETF = 'SPY'  # What we want to predict
FEATURE_ETFS = [
    'XLK',  # Technology
    'XLF',  # Financials
    'XLV',  # Healthcare
    'XLY',  # Consumer Discretionary
    'XLP',  # Consumer Staples
    'XLE',  # Energy
    'XLI',  # Industrials
    'XLB',  # Materials
    'XLU'   # Utilities
]

# Download data
print("📥 Downloading ETF price data...")
all_etfs = [TARGET_ETF] + FEATURE_ETFS
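data = yf.download(all_etfs, start='2015-01-01', end='2024-12-31', auto_adjust=False)

# Use adjusted closing prices
# (auto_adjust=False keeps the 'Adj Close' column, which recent yfinance versions drop by default)
prices = data['Adj Close']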
prices = simplify_teos(prices)  # Normalize timezone

print(f"✅ Downloaded {len(prices)} days of data for {len(all_etfs)} ETFs")
print(f"📊 Date range: {prices.index.min()} to {prices.index.max()}")

## Part 2: Feature Engineering and Exploration

Let's convert prices to log returns and explore the relationships between sector ETFs and SPY:

In [None]:
# Calculate log returns
returns = log_returns(prices).dropna()

# Separate features and target
X_features = returns[FEATURE_ETFS]
y_target = returns[TARGET_ETF]

print(f"📈 Features shape: {X_features.shape}")
print(f"🎯 Target shape: {y_target.shape}")

# Quick visualization
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# Plot cumulative returns (log returns compound via exp of the cumulative sum)
np.exp(returns.cumsum()).plot(ax=ax1, alpha=0.7)
ax1.set_title('Cumulative Returns: SPY vs Sector ETFs')
ax1.set_ylabel('Cumulative Return')
ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

# Plot rolling correlation with SPY
rolling_corr = X_features.rolling(252).corr(y_target).dropna()
rolling_corr.plot(ax=ax2, alpha=0.8)
ax2.set_title('Rolling 1-Year Correlation with SPY')
ax2.set_ylabel('Correlation')
ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

### Feature Analysis by Year

Let's analyze how each sector's correlation with SPY changes from year to year:

In [None]:
# Analyze feature correlations by year
print("🔍 Analyzing feature correlations by year...")
yearly_correlations = p_by_year(X_features, y_target)

# Create heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(yearly_correlations, annot=True, cmap='RdYlBu_r', center=0, 
           fmt='.3f', cbar_kws={'label': 'Pearson Correlation'})
plt.title('Annual Feature Correlations with SPY Returns')
plt.xlabel('Year')
plt.ylabel('Sector ETF')
plt.tight_layout()
plt.show()

# Find most stable predictors
mean_abs_corr = yearly_correlations.abs().mean(axis=1).sort_values(ascending=False)
print("\n🏆 Most consistent predictors (by average absolute correlation):")
for etf, corr in mean_abs_corr.head(5).items():
    print(f"  {etf}: {corr:.3f}")
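
The implementation of `p_by_year` is not shown in this tutorial. As a rough sketch of how a per-year Pearson correlation table could be built, the hypothetical helper below (`yearly_correlations_sketch`, not part of `utils_simulate`) groups the data by calendar year and correlates each feature with the target. The data is synthetic and serves only to make the example runnable.

```python
import numpy as np
import pandas as pd

def yearly_correlations_sketch(X, y):
    """Hypothetical helper: Pearson correlation of each column of X with y, per calendar year."""
    per_year = {
        year: X_year.corrwith(y.loc[X_year.index])
        for year, X_year in X.groupby(X.index.year)
    }
    # Rows are features, columns are years
    return pd.DataFrame(per_year)

# Synthetic daily data: one feature related to the target, one pure noise
rng = np.random.default_rng(0)
dates = pd.bdate_range("2022-01-03", "2023-12-29")
y = pd.Series(rng.normal(0, 0.01, len(dates)), index=dates, name="target")
X = pd.DataFrame({
    "signal": y + rng.normal(0, 0.005, len(dates)),
    "noise": rng.normal(0, 0.01, len(dates)),
}, index=dates)

table = yearly_correlations_sketch(X, y)
print(table.round(3))
```

A table in this orientation (features as rows, years as columns) is what the heatmap and the `mean(axis=1)` ranking above expect.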

## Part 3: Walk-Forward Simulation

Now we'll implement the core of quantitative backtesting: walk-forward analysis.

In [None]:
# Educational explanation
get_educational_help('walk_forward')
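
To make the idea concrete, here is a minimal sketch of an expanding-window calendar. The helper `expanding_window_calendar` is hypothetical and is not the `generate_train_predict_calender` function used below, whose internals are not shown here. It only illustrates the rule that training data must end before the date being predicted.

```python
import pandas as pd

def expanding_window_calendar(index, window_size):
    """Hypothetical helper: build (train_start, train_end, pred_date) tuples.

    Training always ends on the observation immediately before pred_date,
    so the model never sees the day it is asked to predict.
    """
    calendar = []
    for i in range(window_size, len(index)):
        train_start = index[0]      # expanding window: always start at the first date
        train_end = index[i - 1]    # last date the model may learn from
        pred_date = index[i]        # first out-of-sample date
        calendar.append((train_start, train_end, pred_date))
    return calendar

# Ten business days with a three-day minimum training window (illustrative only)
dates = pd.bdate_range("2024-01-01", periods=10)
calendar = expanding_window_calendar(dates, window_size=3)

for train_start, train_end, pred_date in calendar[:3]:
    print(f"train {train_start.date()} .. {train_end.date()} -> predict {pred_date.date()}")
```

A rolling window would differ only in `train_start`, which would move forward with `pred_date` instead of staying fixed at the first date.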

In [None]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from utils_simulate import generate_train_predict_calender

def simulate_single_target_strategy(X, y, window_size=252, window_type='expanding'):
    """
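    Walk-forward simulation for single-target prediction.

    Args:
        X: DataFrame of feature returns indexed by date
        y: Series of target returns on the same index
        window_size: Training window length passed to
            generate_train_predict_calender
        window_type: Window type passed to generate_train_predict_calender
            (default 'expanding')

    Returns:
        Dictionary of lists keyed by 'dates', 'predictions', 'actuals',
        'positions', and 'returns'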
    """
    # Create ML pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', Ridge(alpha=1.0))
    ])
    
    # Generate training/prediction calendar
    date_ranges = generate_train_predict_calender(
        pd.DataFrame(index=X.index), window_type, window_size
    )
    
    print(f"🚀 Starting simulation with {len(date_ranges)} predictions...")
    print(f"📅 Period: {date_ranges[0][0]} to {date_ranges[-1][2]}")
    
    results = {
        'dates': [],
        'predictions': [],
        'actuals': [],
        'positions': [],
        'returns': []
    }
    
    for i, (train_start, train_end, pred_date) in enumerate(date_ranges):
        # Training data
        X_train = X.loc[train_start:train_end]
        y_train = y.loc[train_start:train_end]
        
        # Prediction data
        X_pred = X.loc[[pred_date]]
        y_actual = y.loc[pred_date]
        
        # Fit model and predict
        pipeline.fit(X_train, y_train)
        prediction = pipeline.predict(X_pred)[0]
        
        # Simple position sizing: long if prediction > 0, short otherwise
        position = 1.0 if prediction > 0 else -1.0
        strategy_return = position * y_actual
        
        # Store results
        results['dates'].append(pred_date)
        results['predictions'].append(prediction)
        results['actuals'].append(y_actual)
        results['positions'].append(position)
        results['returns'].append(strategy_return)
        
        if (i + 1) % 100 == 0:
            print(f"  Progress: {i+1}/{len(date_ranges)} predictions completed")
    
    return results

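# Run simulation
# Lag the features by one day so each prediction uses only information available
# before the target day; same-day sector returns would leak the answer into the backtest.
X_lagged = X_features.shift(1).dropna()
simulation_results = simulate_single_target_strategy(X_lagged, y_target.loc[X_lagged.index])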
print("✅ Simulation completed!")

## Part 4: Results Analysis with xarray

Let's use xarray to analyze our results in a standardized way:

In [None]:
# Convert results to xarray Dataset
results_df = pd.DataFrame(simulation_results)
results_df.set_index('dates', inplace=True)

# Create xarray dataset
results_xr = create_results_xarray({
    'strategy_returns': results_df['returns'],
    'spy_returns': results_df['actuals'],
    'predictions': results_df['predictions'],
    'positions': results_df['positions']
}, time_index=results_df.index)

print("📊 Results stored in xarray Dataset:")
print(results_xr)

# Calculate performance metrics
strategy_metrics = calculate_performance_metrics(results_xr.strategy_returns)
spy_metrics = calculate_performance_metrics(results_xr.spy_returns)

print("\n🏆 Performance Comparison:")
comparison_df = pd.DataFrame({
    'Strategy': strategy_metrics,
    'SPY Buy-Hold': spy_metrics
})
print(comparison_df.round(4))
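
The internals of `calculate_performance_metrics` are not shown in this tutorial. As a hedged sketch of the kind of risk-adjusted measures such a helper typically reports, the hypothetical function below computes an annualized return, volatility, Sharpe ratio, and maximum drawdown from daily log returns. It assumes a zero risk-free rate and 252 trading days per year, and the input data is synthetic.

```python
import numpy as np

def sketch_performance_metrics(log_rets, periods_per_year=252):
    """Hypothetical stand-in for a metrics helper; takes a 1-D array of daily log returns."""
    log_rets = np.asarray(log_rets, dtype=float)

    annual_return = log_rets.mean() * periods_per_year
    annual_vol = log_rets.std(ddof=1) * np.sqrt(periods_per_year)
    # Sharpe ratio with an assumed risk-free rate of zero
    sharpe = annual_return / annual_vol if annual_vol > 0 else np.nan

    # Equity curve starting at 1.0, then the worst peak-to-trough decline
    equity = np.exp(np.concatenate([[0.0], np.cumsum(log_rets)]))
    running_peak = np.maximum.accumulate(equity)
    max_drawdown = ((equity - running_peak) / running_peak).min()

    return {
        "annual_return": annual_return,
        "annual_volatility": annual_vol,
        "sharpe_ratio": sharpe,
        "max_drawdown": max_drawdown,
    }

# Synthetic daily log returns, for illustration only
rng = np.random.default_rng(42)
metrics = sketch_performance_metrics(rng.normal(0.0004, 0.01, 1000))
for name, value in metrics.items():
    print(f"{name:>18}: {value: .4f}")
```

Comparing the Sharpe ratio and maximum drawdown side by side, rather than total return alone, is what makes the strategy and buy-and-hold columns above comparable on a risk-adjusted basis.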

### Visualization and Analysis

In [None]:
# Create comprehensive performance plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

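# 1. Cumulative returns (inputs are log returns, so compound with exp of the cumulative sum;
#    for the long/short strategy this treats position * log return as the strategy's log return)
strategy_cumret = np.exp(results_xr.strategy_returns.cumsum(dim='time'))
spy_cumret = np.exp(results_xr.spy_returns.cumsum(dim='time'))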

strategy_cumret.plot(ax=axes[0,0], label='Strategy', color='blue')
spy_cumret.plot(ax=axes[0,0], label='SPY Buy-Hold', color='red')
axes[0,0].set_title('Cumulative Returns')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# 2. Rolling Sharpe ratio (252-day)
rolling_sharpe = (results_xr.strategy_returns.rolling(time=252).mean() / 
                 results_xr.strategy_returns.rolling(time=252).std() * np.sqrt(252))
rolling_sharpe.plot(ax=axes[0,1], color='green')
axes[0,1].set_title('Rolling 1-Year Sharpe Ratio')
axes[0,1].axhline(y=1.0, color='black', linestyle='--', alpha=0.5)
axes[0,1].grid(True, alpha=0.3)

# 3. Drawdown analysis
# xarray DataArrays have no .expanding(); compute the running peak with NumPy instead
running_max = strategy_cumret.copy(data=np.maximum.accumulate(strategy_cumret.values))
drawdown = (strategy_cumret - running_max) / running_max
drawdown.plot(ax=axes[1,0], color='red')
axes[1,0].fill_between(drawdown.time.values, drawdown.values, 0, alpha=0.3, color='red')
axes[1,0].set_title('Strategy Drawdown')
axes[1,0].set_ylabel('Drawdown (fraction of peak)')
axes[1,0].grid(True, alpha=0.3)

# 4. Prediction vs actual scatter
axes[1,1].scatter(results_xr.predictions, results_xr.spy_returns, alpha=0.5)
axes[1,1].axhline(y=0, color='black', linestyle='-', alpha=0.3)
axes[1,1].axvline(x=0, color='black', linestyle='-', alpha=0.3)
axes[1,1].set_xlabel('Predictions')
axes[1,1].set_ylabel('Actual SPY Returns')
axes[1,1].set_title('Prediction Accuracy')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate prediction accuracy metrics
predictions = results_xr.predictions.values
actuals = results_xr.spy_returns.values

# Direction accuracy
direction_accuracy = np.mean(np.sign(predictions) == np.sign(actuals))
correlation = np.corrcoef(predictions, actuals)[0,1]

print("\n🎯 Prediction Metrics:")
print(f"   Direction Accuracy: {direction_accuracy:.1%}")
print(f"   Prediction-Actual Correlation: {correlation:.4f}")

## Part 5: Student Exercises

Now it's your turn to experiment and learn! Try these exercises to deepen your understanding:

### Exercise 1: Position Sizing Improvements

Replace the fixed ±1 position rule from the simulation with position sizes that scale with prediction confidence:

In [None]:
# TODO: Implement confidence-weighted position sizing
# Hint: Scale position size by absolute value of prediction

def confidence_weighted_positions(predictions, max_leverage=2.0):
    """
    Create position sizes based on prediction confidence.
    
    Your task:
    1. Calculate the absolute value of predictions (confidence)
    2. Normalize confidence to [0, max_leverage] range
    3. Apply the sign of the original prediction
    
    Returns:
        Array of position sizes
    """
    # YOUR CODE HERE
    pass

# Test your function
test_predictions = np.array([0.01, -0.02, 0.005, -0.03, 0.015])
test_positions = confidence_weighted_positions(test_predictions)
print(f"Predictions: {test_predictions}")
print(f"Positions: {test_positions}")

### Exercise 2: Feature Engineering

Add momentum indicators to improve predictions:

In [None]:
# TODO: Create momentum features
# Ideas:
# - 5-day, 20-day moving averages
# - RSI (Relative Strength Index)
# - Price momentum (current price vs N-day ago)

def create_momentum_features(prices, returns):
    """
    Create momentum-based features for prediction.
    
    Your task:
    1. Calculate short-term (5-day) and long-term (20-day) moving averages
    2. Create momentum indicators (e.g., current vs past prices)
    3. Add volatility measures (rolling standard deviation)
    
    Returns:
        DataFrame with momentum features
    """
    # YOUR CODE HERE
    pass

# Test with SPY data
# momentum_features = create_momentum_features(prices[TARGET_ETF], returns[TARGET_ETF])
# print(momentum_features.head())

### Exercise 3: Model Comparison

Compare different ML models using xarray:

In [None]:
# TODO: Compare Ridge, Random Forest, and Linear Regression
# Use xarray to store results from multiple models
# Create performance comparison table

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

models = {
    'Ridge': Ridge(alpha=1.0),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'LinearRegression': LinearRegression()
}

# YOUR CODE HERE
# 1. Run simulation for each model
# 2. Store results in xarray with 'model' dimension
# 3. Compare performance metrics
# 4. Create visualization showing all models

print("🎯 Model comparison exercise - implement your solution above!")

## Part 6: Key Takeaways

Congratulations! You've completed the single-target simulation tutorial. Here's what you learned:

### 🎓 Concepts Mastered:
1. **Log Returns**: Why they're essential for financial modeling
2. **Walk-Forward Analysis**: Preventing look-ahead bias in backtests
3. **Feature Analysis**: Understanding predictor stability over time
4. **xarray Integration**: Standardized handling of financial time series
5. **Performance Metrics**: Risk-adjusted return measurement

### 🚀 Next Steps:
- Complete the exercises above to deepen your understanding
- Move to Tutorial 2 for multi-target portfolio strategies
- Experiment with different time periods and ETF universes
- Try implementing transaction costs and slippage
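
As a starting point for the transaction-cost idea, here is a minimal sketch under simple assumptions: costs are proportional to turnover (the absolute change in position), the strategy starts from a flat position, and the cost level `cost_bps` is an arbitrary illustrative number. The helper `apply_transaction_costs` is hypothetical and not part of `utils_simulate`.

```python
import numpy as np

def apply_transaction_costs(positions, gross_returns, cost_bps=1.0):
    """Hypothetical helper: subtract a proportional cost each time the position changes."""
    positions = np.asarray(positions, dtype=float)
    gross_returns = np.asarray(gross_returns, dtype=float)

    # Turnover is the absolute change in position; the first trade starts from flat (0)
    turnover = np.abs(np.diff(positions, prepend=0.0))

    # cost_bps is an assumed cost per unit of turnover, in basis points
    costs = turnover * cost_bps / 10_000
    return gross_returns - costs

# Illustrative positions and gross strategy returns
positions = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
gross = np.array([0.004, -0.002, 0.003, 0.001, -0.001])
net = apply_transaction_costs(positions, gross, cost_bps=5.0)
print(net)
```

Flipping from long to short counts as two units of turnover, so a strategy that changes sign often pays far more than one that holds its position.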

### 📚 Additional Resources:
- [QuantNet Forums](https://quantnet.com): Connect with other quant students
- [Blue Water Macro Blog](https://bluewatermacro.com): Industry insights and research
- [xarray Documentation](https://xarray.pydata.org): Master multi-dimensional data analysis

**Ready for more advanced techniques? Proceed to Tutorial 2: Multi-Target Portfolio Strategies!**