# Bitcoin Volatility Prediction with Explainable AI 📊

## Project Overview

This notebook explores Bitcoin volatility prediction using multiple machine learning models with comprehensive visualizations and explainability analysis.

**Key Objectives:**
- Forecast next-day Bitcoin volatility
- Compare multiple ML models (Linear Regression, Random Forest, XGBoost, LSTM) against a naive baseline
- Provide model explainability through SHAP values
- Visualize patterns and predictions

**Dataset:** BTC/USDT daily candles from Binance API (2019-present)

**Target Variable:** Next-day realized volatility, proxied by the normalized daily range: (High - Low) / Open, computed on day t+1
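
As a minimal sketch of how such a target could be built from daily candles: the helper name `add_volatility_target` and the lowercase `open`/`high`/`low` columns are assumptions for illustration, not the project's actual code, while `volatility_t` and `volatility_t_plus_1` follow the column names used later in this notebook.

```python
import pandas as pd

def add_volatility_target(candles: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add today's range-based volatility and the next-day target."""
    out = candles.copy()
    # Normalized daily range: (High - Low) / Open
    out["volatility_t"] = (out["high"] - out["low"]) / out["open"]
    # shift(-1) moves tomorrow's value onto today's row
    out["volatility_t_plus_1"] = out["volatility_t"].shift(-1)
    # The last row has no next day, so it cannot be labeled
    return out.dropna(subset=["volatility_t_plus_1"]).reset_index(drop=True)

# Tiny made-up example, three days of candles
candles = pd.DataFrame({
    "open": [100.0, 102.0, 101.0],
    "high": [105.0, 104.0, 103.0],
    "low": [95.0, 100.0, 100.0],
})
labeled = add_volatility_target(candles)
print(labeled[["volatility_t", "volatility_t_plus_1"]])
```

Shifting by -1 keeps each row's features and its label on the same line, so a row only ever sees information available at the close of day t.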

---


## 1. Setup & Imports 🔧


In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Set default figure size
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 10

print("✅ Imports successful!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")


## 2. Data Collection 📥

We'll fetch Bitcoin price data from Binance's public API. The data includes:
- **Open, High, Low, Close prices**
- **Volume**
- **Daily frequency** from 2019 onwards
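
The `load_or_fetch_data` function lives in the project's `data_preparation` module and is not shown here. As a rough sketch of what such a fetch might involve, the code below parses rows in the layout returned by Binance's public `GET /api/v3/klines` endpoint. The helper names `klines_to_frame` and `fetch_daily_klines` are hypothetical, and the sample row contains made-up values so the example runs offline.

```python
import json
import urllib.parse
import urllib.request

import pandas as pd

KLINES_URL = "https://api.binance.com/api/v3/klines"

def klines_to_frame(rows: list) -> pd.DataFrame:
    """Hypothetical helper: convert raw kline rows into a typed OHLCV frame."""
    # Each row starts with: open time (ms), open, high, low, close, volume
    cols = ["open_time", "open", "high", "low", "close", "volume"]
    frame = pd.DataFrame([row[:6] for row in rows], columns=cols)
    frame["date"] = pd.to_datetime(frame.pop("open_time"), unit="ms")
    value_cols = ["open", "high", "low", "close", "volume"]
    # The API returns prices and volume as strings
    frame[value_cols] = frame[value_cols].astype(float)
    return frame[["date"] + value_cols]

def fetch_daily_klines(symbol: str, start_ms: int, limit: int = 1000) -> pd.DataFrame:
    """Hypothetical helper: fetch one page of daily candles (needs network access)."""
    query = urllib.parse.urlencode(
        {"symbol": symbol, "interval": "1d", "startTime": start_ms, "limit": limit}
    )
    with urllib.request.urlopen(f"{KLINES_URL}?{query}", timeout=30) as resp:
        return klines_to_frame(json.load(resp))

# Offline example: one hand-written row in the kline layout (values are made up)
sample = [[1546300800000, "3700.0", "3800.0", "3650.0", "3750.0", "1234.5",
           1546387199999, "0", 0, "0", "0", "0"]]
frame = klines_to_frame(sample)
print(frame)
```

A single request returns a limited number of candles, so covering 2019 to the present would mean calling the fetch helper repeatedly with an advancing `start_ms` and concatenating the pages.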


In [None]:
# Import our custom data preparation module
from data_preparation import load_or_fetch_data

# Load data (will fetch from API on first run, then cache)
print("Loading Bitcoin data...\n")
df = load_or_fetch_data(
    symbol="BTCUSDT",
    start_date="2019-01-01",
    force_refresh=False  # Set to True to force refresh from API
)

print(f"\n✅ Data loaded successfully!")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print(f"Total days: {len(df)}")
print(f"Price range: ${df['close'].min():.2f} - ${df['close'].max():.2f}")


### 2.1 Quick Data Preview


In [None]:
# Display first few rows
df.head(10)


### 2.2 Visualize Raw Price Data


In [None]:
# Price history with volume
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(16, 8), sharex=True)

# Price chart
ax1.plot(df['date'], df['close'], linewidth=1.5, color='#F7931A', label='BTC Price')
ax1.fill_between(df['date'], df['low'], df['high'], alpha=0.2, color='#F7931A')
ax1.set_ylabel('Price (USDT)', fontsize=12, fontweight='bold')
ax1.set_title('Bitcoin Price History', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Volume chart
ax2.bar(df['date'], df['volume'], color='#4A90E2', alpha=0.6, width=0.8)
ax2.set_ylabel('Volume', fontsize=12, fontweight='bold')
ax2.set_xlabel('Date', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## 3. Feature Engineering ⚙️

We'll create 50+ technical indicators including:
- **Price Features:** Returns, Moving Averages, MACD, Bollinger Bands
- **Volume Features:** Volume ratios and changes  
- **Volatility Features:** Historical volatility measures
- **Momentum Features:** RSI, ROC, Momentum indicators
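
To make these categories concrete, here is a small sketch that computes one or two features from each group. It is not the project's `create_all_features`; the helper name `add_basic_features`, the `volume_ratio_20` and `rsi_14` column names, and the window lengths are assumptions, and the RSI uses simple rolling means rather than Wilder's smoothing. The input candles are synthetic.

```python
import numpy as np
import pandas as pd

def add_basic_features(candles: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: a handful of price, volume, volatility and momentum features."""
    out = candles.copy()
    # Price: one-day simple return
    out["ret_1d"] = out["close"].pct_change()
    # Volatility: normalized daily range and its 20-day average
    out["volatility_t"] = (out["high"] - out["low"]) / out["open"]
    out["volatility_roll_20"] = out["volatility_t"].rolling(20).mean()
    # Volume: today's volume relative to its 20-day average
    out["volume_ratio_20"] = out["volume"] / out["volume"].rolling(20).mean()
    # Momentum: 14-day RSI with simple-average smoothing
    delta = out["close"].diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["rsi_14"] = 100 - 100 / (1 + avg_gain / avg_loss)
    # Rolling windows leave NaNs at the start; drop those warm-up rows
    return out.dropna().reset_index(drop=True)

# Synthetic candles for demonstration only
rng = np.random.default_rng(0)
n = 60
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.02, n)))
open_ = close * (1 + rng.normal(0, 0.005, n))
demo_candles = pd.DataFrame({
    "open": open_,
    "high": np.maximum(open_, close) * 1.01,
    "low": np.minimum(open_, close) * 0.99,
    "close": close,
    "volume": rng.uniform(100, 200, n),
})
features = add_basic_features(demo_candles)
print(features.tail(3))
```

Every feature here uses only data up to and including day t, which matters because the target is the following day's volatility.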


In [None]:
from feature_engineering import create_all_features, get_feature_columns

print("Creating features...\n")
df_features = create_all_features(df)
feature_cols = get_feature_columns()

print(f"\n✅ Feature engineering complete!")
print(f"Total features created: {len(feature_cols)}")
print(f"Usable samples: {len(df_features)}")


### 3.1 Visualize Volatility Patterns


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# 1. Volatility over time
axes[0, 0].plot(df_features['date'], df_features['volatility_t'], 
                linewidth=1, alpha=0.7, color='#E74C3C', label='Daily')
axes[0, 0].plot(df_features['date'], df_features['volatility_roll_20'], 
                linewidth=2, color='#3498DB', label='20-day MA')
axes[0, 0].set_title('Realized Volatility Over Time', fontsize=12, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Distribution
axes[0, 1].hist(df_features['volatility_t'], bins=50, color='#9B59B6', alpha=0.7)
axes[0, 1].axvline(df_features['volatility_t'].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
axes[0, 1].set_title('Volatility Distribution', fontsize=12, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3, axis='y')

# 3. vs Returns
axes[1, 0].scatter(df_features['ret_1d'], df_features['volatility_t'], alpha=0.3, s=10)
axes[1, 0].set_title('Volatility vs Daily Returns', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Daily Return')
axes[1, 0].set_ylabel('Volatility')
axes[1, 0].grid(True, alpha=0.3)

# 4. Regime
if 'volatility_regime' in df_features.columns:
    regime_colors = {'low': '#2ECC71', 'medium': '#F39C12', 'high': '#E74C3C'}
    for regime, color in regime_colors.items():
        mask = df_features['volatility_regime'] == regime
        axes[1, 1].scatter(df_features.loc[mask, 'date'], df_features.loc[mask, 'close'],
                          label=regime.capitalize(), alpha=0.5, s=5, color=color)
    axes[1, 1].set_title('Price by Volatility Regime', fontsize=12, fontweight='bold')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## 4. Train/Test Split 🔀

For time series, we use chronological splitting (not random) to prevent data leakage.
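
A chronological split can be sketched in a few lines. The function below is an illustrative stand-in, not the `time_split` from the project's `models` module; its name and signature are assumptions.

```python
import pandas as pd

def chronological_split(frame: pd.DataFrame, test_size: float = 0.2, date_col: str = "date"):
    """Hypothetical helper: hold out the most recent rows as the test set."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_size))
    cut = len(ordered) - n_test
    # Everything before the cut is training data, everything after is test data
    return ordered.iloc[:cut], ordered.iloc[cut:]

demo = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10),
    "value": range(10),
})
train_part, test_part = chronological_split(demo, test_size=0.2)
print(len(train_part), len(test_part))
print(train_part["date"].max(), "<", test_part["date"].min())
```

Because every training date precedes every test date, the model is evaluated only on periods it could not have seen, which a random shuffle would not guarantee.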


In [None]:
from models import time_split

# Prepare X and y
X = df_features[feature_cols]
y = df_features['volatility_t_plus_1']

# Time-based split (80% train, 20% test)
train_df, test_df = time_split(df_features, test_size=0.2)
X_train = train_df[feature_cols]
y_train = train_df['volatility_t_plus_1']
X_test = test_df[feature_cols]
y_test = test_df['volatility_t_plus_1']

print(f"Training samples: {len(X_train)} ({len(X_train) / len(df_features):.0%})")
print(f"Test samples: {len(X_test)} ({len(X_test) / len(df_features):.0%})")
print(f"Train date range: {train_df['date'].min()} to {train_df['date'].max()}")
print(f"Test date range: {test_df['date'].min()} to {test_df['date'].max()}")


## 5. Model Training & Evaluation 🤖

Training 5 models: Baseline, Linear Regression, Random Forest, XGBoost, and LSTM
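
The notebook does not spell out what the Baseline model is. One common choice, assumed here purely for illustration, is a persistence forecast that predicts tomorrow's volatility with today's value. The sketch also shows how the MAE, RMSE and R2 metrics can be computed with NumPy; the helper name `regression_metrics` and the toy numbers are made up.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """Hypothetical helper: MAE, RMSE and R2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Persistence baseline (an assumption): today's volatility is the forecast for tomorrow
volatility_today = np.array([0.02, 0.03, 0.05, 0.04])
volatility_tomorrow = np.array([0.03, 0.05, 0.04, 0.04])
baseline_metrics = regression_metrics(volatility_tomorrow, volatility_today)
print(baseline_metrics)
```

R2 can be negative, as it is for this toy data: that means the forecast does worse than always predicting the mean of the test targets, which is a useful sanity check when comparing models.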


In [None]:
from models import train_all_models, get_best_model

print("🚀 Training all models...\n")
results = train_all_models(X_train, y_train, X_test, y_test, feature_cols)
print("\n✅ All models trained!")


### 5.1 Model Performance Comparison


In [None]:
# Extract metrics
metrics_dict = {name: res['metrics'] for name, res in results.items()}
metrics_df = pd.DataFrame(metrics_dict).T.sort_values('RMSE')

print("📊 Model Performance Summary:\n")
display(metrics_df)

best_model_name = get_best_model(results)
print(f"\n🏆 Best Model: {best_model_name}")
print(f"   RMSE: {results[best_model_name]['metrics']['RMSE']:.6f}")
print(f"   R²:   {results[best_model_name]['metrics']['R2']:.4f}")


In [None]:
# Visual comparison
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
metrics_list = ['MAE', 'RMSE', 'R2']

for idx, metric in enumerate(metrics_list):
    values = [metrics_dict[model][metric] for model in metrics_dict.keys()]
    bars = axes[idx].bar(range(len(values)), values, alpha=0.7, edgecolor='black')
    
    # Highlight best
    # Higher is better for R2; lower is better for the error metrics
    best_idx = np.argmax(values) if metric == 'R2' else np.argmin(values)
    bars[best_idx].set_color('#2ECC71')
    
    axes[idx].set_title(f'{metric} Comparison', fontsize=12, fontweight='bold')
    axes[idx].set_xticks(range(len(values)))
    axes[idx].set_xticklabels(list(metrics_dict.keys()), rotation=45, ha='right')
    axes[idx].grid(True, alpha=0.3, axis='y')
    
    # Value labels
    for i, bar in enumerate(bars):
        height = bar.get_height()
        axes[idx].text(bar.get_x() + bar.get_width()/2., height,
                      f'{height:.4f}', ha='center', va='bottom', fontsize=9)

plt.suptitle('Model Performance Comparison', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()


## 6. Prediction Visualizations 📈


In [None]:
# Visualize predictions for best model
y_pred = results[best_model_name]['predictions']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))

# Time series
ax1.plot(test_df['date'], y_test.values, label='Actual', linewidth=1.5, alpha=0.8)
ax1.plot(test_df['date'], y_pred, label='Predicted', linewidth=1.5, alpha=0.8)
ax1.set_title(f'{best_model_name}: Actual vs Predicted', fontsize=12, fontweight='bold')
ax1.set_xlabel('Date')
ax1.set_ylabel('Volatility')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Scatter
ax2.scatter(y_test.values, y_pred, alpha=0.4, s=20)
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
ax2.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect')
ax2.set_xlabel('Actual Volatility', fontweight='bold')
ax2.set_ylabel('Predicted Volatility', fontweight='bold')
ax2.set_title('Prediction Scatter', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## 7. Feature Importance 🔍


In [None]:
if 'importance' in results[best_model_name]:
    importance_df = results[best_model_name]['importance']
    
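    print(f"🔝 Top 15 Features ({best_model_name}):\n")
    # Number by position so the ranks stay correct even if the index is not 0..n-1
    for rank, (_, row) in enumerate(importance_df.head(15).iterrows(), start=1):
        print(f"{rank:2d}. {row['feature']:<30} {row['importance']:.6f}")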
    
    # Visualize
    plt.figure(figsize=(12, 8))
    top_n = 20
    top_features = importance_df.head(top_n)
    plt.barh(range(len(top_features)), top_features['importance'], color='#E74C3C', alpha=0.7)
    plt.yticks(range(len(top_features)), top_features['feature'])
    plt.xlabel('Importance', fontsize=12, fontweight='bold')
    plt.title(f'Top {top_n} Feature Importances', fontsize=14, fontweight='bold')
    plt.gca().invert_yaxis()
    plt.grid(True, alpha=0.3, axis='x')
    plt.tight_layout()
    plt.show()
else:
    print(f"Feature importance not available for {best_model_name}")


## 8. SHAP Explainability (Optional) 🎯

Run this cell for a SHAP-based explanation of the best model. It applies only when the best model is tree-based (Random Forest or XGBoost) and takes roughly 1-2 minutes.
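
SHAP values for a regression model are typically arranged as one row per sample and one column per feature. As a sketch under that assumption, the helper below (a hypothetical name, `rank_by_mean_abs_shap`) turns such an array into a global ranking by mean absolute contribution, which is the ordering a summary plot usually conveys. The toy values are invented so the example runs without `shap` installed.

```python
import numpy as np
import pandas as pd

def rank_by_mean_abs_shap(shap_values, feature_names) -> pd.DataFrame:
    """Hypothetical helper: rank features by mean absolute SHAP value."""
    values = np.asarray(shap_values, dtype=float)
    # Average the magnitude of each feature's contribution across samples
    mean_abs = np.abs(values).mean(axis=0)
    ranking = pd.DataFrame({"feature": list(feature_names), "mean_abs_shap": mean_abs})
    return ranking.sort_values("mean_abs_shap", ascending=False).reset_index(drop=True)

# Toy array: 2 samples x 3 features (made-up numbers)
toy_shap = np.array([
    [0.10, -0.02, 0.00],
    [-0.30, 0.01, 0.05],
])
ranking = rank_by_mean_abs_shap(toy_shap, ["volatility_t", "ret_1d", "rsi_14"])
print(ranking)
```

Taking the absolute value first matters: a feature that pushes predictions strongly up on some days and strongly down on others would otherwise average out to near zero.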


In [None]:
if best_model_name in ['Random Forest', 'XGBoost']:
    try:
        import shap
        
        print("Computing SHAP values...")
        # Explain the most recent training rows (capped at 500 to limit runtime)
        background_size = min(500, len(X_train))
        X_background = X_train.tail(background_size)
        
        best_model = results[best_model_name]['model']
        explainer = shap.TreeExplainer(best_model.model)
        shap_values = explainer.shap_values(X_background)
        
        plt.figure(figsize=(12, 8))
        shap.summary_plot(shap_values, X_background, show=False)
        plt.title('SHAP Feature Importance Summary', fontsize=14, fontweight='bold', pad=20)
        plt.tight_layout()
        plt.show()
        
        print("\n✅ SHAP analysis complete!")
    except ImportError:
        print("⚠️ SHAP not installed. Install with: pip install shap")
    except Exception as e:
        print(f"⚠️ Error: {e}")
else:
    print(f"SHAP only available for tree-based models. Current: {best_model_name}")


## 9. Summary & Conclusions 📝

### Key Findings:
- Trained and compared several models for next-day Bitcoin volatility on a chronological hold-out set
- The best model and its RMSE and R² are reported in Section 5.1
- Historical volatility is the strongest predictor (see the importances in Section 7)
- Volume and momentum indicators add further signal

### Next Steps:
- Add sentiment data (Fear & Greed Index)
- Include macroeconomic features
- Ensemble multiple models
- Multi-step ahead forecasting

**⚠️ Disclaimer:** For educational purposes only. Not financial advice.
