# Stage 07: Outliers, Risk, and Assumptions Analysis for Turtle Trading
**Project:** Turtle Trading Strategy Research  
**Author:** Panwei Hu  
**Date:** 2025-08-20

## Objectives
- Detect and analyze outliers in financial time series data for Turtle Trading
- Assess impact of outliers on trading signal quality and strategy performance
- Conduct comprehensive risk analysis and stress testing
- Document assumptions and risks for trading strategy implementation
- Evaluate robustness of Turtle Trading signals under market stress

## Financial Risk Analysis Focus
- **Price Outlier Detection**: Identify anomalous price movements and data quality issues
- **Signal Contamination**: Analyze impact of outliers on Donchian channel breakouts
- **Risk Metrics**: VaR, CVaR, drawdowns, and tail risk analysis
- **Stress Testing**: Strategy performance under market crash scenarios
- **Volatility Outliers**: Detect periods of unusual market volatility
- **Trading Signal Robustness**: Sensitivity of entry/exit signals to outliers
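
The risk metrics named above (VaR, CVaR, drawdowns) can be sketched as follows. This is a minimal illustration using historical simulation on simple daily returns. The function name `sketch_risk_metrics` and its output keys are assumptions made for this example; the source of `risk_analysis.calculate_risk_metrics` is not shown in this notebook and may compute these quantities differently.

```python
import pandas as pd


def sketch_risk_metrics(returns: pd.Series, alpha: float = 0.05) -> dict:
    """Historical VaR/CVaR and max drawdown from simple returns (illustrative only)."""
    returns = returns.dropna()
    # Historical VaR: the alpha-quantile of the empirical return distribution
    var = returns.quantile(alpha)
    # CVaR (expected shortfall): average of the returns at or below the VaR level
    cvar = returns[returns <= var].mean()
    # Wealth curve for 1 unit of starting capital
    wealth = (1.0 + returns).cumprod()
    # Running peak, floored at the starting capital of 1.0
    running_peak = wealth.cummax().clip(lower=1.0)
    drawdown = wealth / running_peak - 1.0
    return {
        "VaR": float(var),
        "CVaR": float(cvar),
        "max_drawdown": float(drawdown.min()),
    }


# Small deterministic example (hypothetical returns, not project data)
example_returns = pd.Series([0.01, -0.02, 0.015, -0.05, 0.03, 0.005, -0.01, 0.02])
metrics = sketch_risk_metrics(example_returns, alpha=0.25)
print(metrics)
```

Both VaR and CVaR are reported here as returns (negative numbers for losses); some conventions flip the sign and report them as positive loss amounts, so it is worth checking which one a given module uses before comparing figures.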


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up project paths
PROJECT_ROOT = Path('..').resolve()
DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SRC_DIR = PROJECT_ROOT / 'src'

# Add src to Python path for imports
sys.path.append(str(SRC_DIR))

# Import our modules
try:
    from risk_analysis import (
        FinancialOutlierDetector,
        TurtleRiskAnalyzer,
        FinancialRiskVisualizer,
        detect_price_outliers,
        calculate_risk_metrics,
        analyze_signal_quality
    )
    print("✅ Successfully imported risk analysis module")
except ImportError as e:
    print(f"❌ Import error: {e}")
    raise  # Later cells depend on these names, so stop here instead of failing with a NameError

try:
    from preprocessing import FinancialDataProcessor
    print("✅ Successfully imported preprocessing module")
except ImportError as e:
    print(f"⚠️  Preprocessing module not available: {e}")

# Ensure directories exist
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print("🐢 Turtle Trading - Outliers & Risk Analysis")
print("="*60)
print(f"Project Root: {PROJECT_ROOT}")
print(f"Data Directory: {DATA_DIR}")
print(f"Source Code: {SRC_DIR}")

# Configure plotting
plt.style.use('default')
plt.rcParams['figure.figsize'] = (15, 8)
sns.set_palette("husl")

print("✅ Environment setup complete")


✅ Successfully imported risk analysis module
✅ Successfully imported preprocessing module
🐢 Turtle Trading - Outliers & Risk Analysis
Project Root: /Users/panweihu/Desktop/Desktop_m1/NYU_mfe/bootcamp/camp4/bootcamp_bill_panwei_hu/turtle_project
Data Directory: /Users/panweihu/Desktop/Desktop_m1/NYU_mfe/bootcamp/camp4/bootcamp_bill_panwei_hu/turtle_project/data
Source Code: /Users/panweihu/Desktop/Desktop_m1/NYU_mfe/bootcamp/camp4/bootcamp_bill_panwei_hu/turtle_project/src
✅ Environment setup complete


In [2]:
# Load Turtle Trading Data for Risk Analysis
print("📊 Loading Turtle Trading data for outlier and risk analysis...")

# Look for preprocessed data first (from Stage 06)
preprocessed_files = list(PROCESSED_DIR.glob('turtle_preprocessed*.parquet'))

if preprocessed_files:
    # Load the most recent preprocessed file
    latest_file = max(preprocessed_files, key=lambda x: x.stat().st_mtime)
    print(f"📈 Loading preprocessed data: {latest_file.name}")
    
    df = pd.read_parquet(latest_file)
    print(f"✅ Loaded preprocessed data: {df.shape}")
    
    # Ensure date column is datetime
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'])
    
else:
    # Fallback to raw data
    print("⚠️  No preprocessed data found, loading raw data...")
    
    raw_files = list(RAW_DIR.glob('turtle_universe*.csv'))
    if raw_files:
        latest_raw = max(raw_files, key=lambda x: x.stat().st_mtime)
        print(f"📊 Loading raw data: {latest_raw.name}")
        
        df = pd.read_csv(latest_raw)
        df['date'] = pd.to_datetime(df['date'])
        df = df.sort_values(['symbol', 'date']).reset_index(drop=True)
        
        print(f"✅ Loaded raw data: {df.shape}")
    else:
        print("❌ No turtle trading data found!")
        print("   Please run 04_data_acquisition.ipynb first to collect data.")
        df = pd.DataFrame()

if not df.empty:
    print(f"\n📊 Dataset Overview:")
    print(f"   Shape: {df.shape}")
    print(f"   Symbols: {df['symbol'].nunique()}")
    print(f"   Date range: {df['date'].min().date()} to {df['date'].max().date()}")
    print(f"   Trading days: {df['date'].nunique()}")
    
    # Check for required columns
    required_cols = ['symbol', 'date', 'adj_close']
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        print(f"⚠️  Missing required columns: {missing_cols}")
    
    # Check for technical indicators
    technical_cols = ['donchian_high_20', 'donchian_low_20', 'atr_20', 'long_entry_20', 'short_entry_20']
    available_technical = [col for col in technical_cols if col in df.columns]
    print(f"   Available technical indicators: {len(available_technical)}/{len(technical_cols)}")
    
    if available_technical:
        print(f"     {available_technical}")
    
    print(f"\n📋 Sample data:")
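    # Show only columns that exist, so a missing required column does not raise a KeyError
    display_cols = [col for col in required_cols + available_technical[:3] if col in df.columns]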
    print(df[display_cols].head())

else:
    print("⚠️  No data available for analysis")


📊 Loading Turtle Trading data for outlier and risk analysis...
📈 Loading preprocessed data: turtle_preprocessed_20250820_115449.parquet
✅ Loaded preprocessed data: (9036, 22)

📊 Dataset Overview:
   Shape: (9036, 22)
   Symbols: 18
   Date range: 2023-08-21 to 2025-08-20
   Trading days: 502
   Available technical indicators: 5/5
     ['donchian_high_20', 'donchian_low_20', 'atr_20', 'long_entry_20', 'short_entry_20']

📋 Sample data:
  symbol       date  adj_close  donchian_high_20  donchian_low_20  atr_20
0    DBA 2023-08-21  19.559673               NaN              NaN     NaN
1    DBA 2023-08-22  19.440237               NaN              NaN     NaN
2    DBA 2023-08-23  19.642357               NaN              NaN     NaN
3    DBA 2023-08-24  19.807728               NaN              NaN     NaN
4    DBA 2023-08-25  19.862852               NaN              NaN     NaN
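
The detection cell in this notebook calls `detect_price_outliers_iqr`, `detect_return_outliers_zscore`, and `detect_gap_events` from `FinancialOutlierDetector`, whose source is not included here. The sketch below shows what detectors of this kind commonly compute. The function names `iqr_outliers`, `zscore_outliers`, and `gap_events` are hypothetical stand-ins, and the actual module may differ in detail.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag points outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def zscore_outliers(returns: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag returns more than `threshold` sample standard deviations from the mean."""
    z = (returns - returns.mean()) / returns.std()
    return z.abs() > threshold


def gap_events(prices: pd.Series, gap_threshold: float = 0.05) -> pd.Series:
    """Flag days whose absolute close-to-close move exceeds `gap_threshold`."""
    return prices.pct_change().abs() > gap_threshold


# Synthetic example: a flat price series with one 10% spike that reverts the next day
prices = pd.Series([100.0] * 10 + [110.0] + [100.0] * 10)
returns = prices.pct_change().dropna()

n_iqr = int(iqr_outliers(prices).sum())
n_zscore = int(zscore_outliers(returns).sum())
n_gaps = int(gap_events(prices).sum())

print("IQR price outliers:", n_iqr)
print("Z-score return outliers:", n_zscore)
print("Gap events:", n_gaps)
```

In this example the gap rule sees both the spike and its reversal, while the z-score rule flags only the up-move: the two large returns inflate the sample standard deviation enough that the reversal falls just under three standard deviations. This masking effect is one reason to compare several detectors rather than rely on a single one.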


In [3]:
# Comprehensive Outlier Detection for Multi-Asset Portfolio
if not df.empty:
    print("🔍 FINANCIAL OUTLIER DETECTION ANALYSIS")
    print("="*60)
    
    # Analyze each asset individually
    outlier_summary = {}
    risk_metrics_by_asset = {}
    
    for symbol in df['symbol'].unique():
        print(f"\n📊 Analyzing {symbol}...")
        
        # Get symbol data
        symbol_data = df[df['symbol'] == symbol].copy()
        symbol_data = symbol_data.sort_values('date').reset_index(drop=True)
        
        if len(symbol_data) < 20:  # Need minimum data for analysis
            print(f"   ⚠️  Insufficient data for {symbol} ({len(symbol_data)} observations)")
            continue
        
        prices = symbol_data['adj_close']
        returns = prices.pct_change().dropna()
        
        # Detect different types of outliers
        price_outliers_iqr = FinancialOutlierDetector.detect_price_outliers_iqr(prices, k=1.5)
        return_outliers_zscore = FinancialOutlierDetector.detect_return_outliers_zscore(returns, threshold=3.0)
        gap_events = FinancialOutlierDetector.detect_gap_events(prices, gap_threshold=0.05)
        volatility_outliers = FinancialOutlierDetector.detect_volatility_outliers(returns, window=20, threshold=2.5)
        
        # Store outlier summary
        outlier_summary[symbol] = {
            'total_observations': len(symbol_data),
            'price_outliers_iqr': price_outliers_iqr.sum(),
            'return_outliers_zscore': return_outliers_zscore.sum(),
            'gap_events': gap_events.sum(),
            'volatility_outliers': volatility_outliers.sum(),
            'price_outlier_rate_%': (price_outliers_iqr.sum() / len(prices)) * 100,
            'return_outlier_rate_%': (return_outliers_zscore.sum() / len(returns)) * 100
        }
        
        # Calculate comprehensive risk metrics
        risk_metrics_by_asset[symbol] = calculate_risk_metrics(returns)
        
        print(f"   Price outliers (IQR): {price_outliers_iqr.sum()}")
        print(f"   Return outliers (Z-score): {return_outliers_zscore.sum()}")
        print(f"   Gap events (>5%): {gap_events.sum()}")
        print(f"   Volatility outliers: {volatility_outliers.sum()}")
    
    # Create summary DataFrames
    outlier_df = pd.DataFrame(outlier_summary).T
    risk_metrics_df = pd.DataFrame(risk_metrics_by_asset).T
    
    print(f"\n📈 OUTLIER DETECTION SUMMARY")
    print("="*60)
    print(outlier_df.round(2))
    
    print(f"\n⚠️  RISK METRICS BY ASSET")
    print("="*60)
    key_metrics = ['mean_return', 'volatility', 'VaR_5%', 'max_drawdown', 'sharpe_ratio']
    available_metrics = [col for col in key_metrics if col in risk_metrics_df.columns]
    print(risk_metrics_df[available_metrics].round(4))
    
    # Identify most problematic assets
    print(f"\n🚨 ASSETS WITH HIGHEST OUTLIER RATES:")
    top_outlier_assets = outlier_df.nlargest(3, 'price_outlier_rate_%')
    for asset in top_outlier_assets.index:
        rate = top_outlier_assets.loc[asset, 'price_outlier_rate_%']
        count = top_outlier_assets.loc[asset, 'price_outliers_iqr']
        print(f"   {asset}: {rate:.2f}% ({count} outliers)")

else:
    print("⚠️  No data available for outlier analysis")


🔍 FINANCIAL OUTLIER DETECTION ANALYSIS

📊 Analyzing DBA...
   Price outliers (IQR): 9
   Return outliers (Z-score): 6
   Gap events (>5%): 0
   Volatility outliers: 21

📊 Analyzing EEM...
   Price outliers (IQR): 18
   Return outliers (Z-score): 5
   Gap events (>5%): 2
   Volatility outliers: 18

📊 Analyzing EFA...
   Price outliers (IQR): 14
   Return outliers (Z-score): 2
   Gap events (>5%): 2
   Volatility outliers: 20

📊 Analyzing FXE...
   Price outliers (IQR): 13
   Return outliers (Z-score): 5
   Gap events (>5%): 0
   Volatility outliers: 13

📊 Analyzing FXY...
   Price outliers (IQR): 20
   Return outliers (Z-score): 9
   Gap events (>5%): 0
   Volatility outliers: 1

📊 Analyzing GLD...
   Price outliers (IQR): 15
   Return outliers (Z-score): 6
   Gap events (>5%): 0
   Volatility outliers: 25

📊 Analyzing HYG...
   Price outliers (IQR): 21
   Return outliers (Z-score): 7
   Gap events (>5%): 0
   Volatility outliers: 20

📊 Analyzing IEF...
   Price outliers (IQR): 7
   Ret
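
To illustrate why these outlier counts matter for signal quality, the following sketch shows how a single bad print can change Donchian breakout entries. It assumes a long entry when the close exceeds the prior 20-day high of closes; the original Turtle rules use intraday highs, and how `long_entry_20` was computed in the preprocessing stage is not shown here, so `donchian_long_entries` is a hypothetical helper and not the project's implementation.

```python
import numpy as np
import pandas as pd


def donchian_long_entries(close: pd.Series, window: int = 20) -> pd.Series:
    """Long entry when the close exceeds the prior `window`-bar high.

    The rolling high is shifted by one bar so each signal uses only past data.
    """
    prior_high = close.rolling(window).max().shift(1)
    return close > prior_high


# Synthetic slow uptrend: after warm-up, every bar is a new 20-bar high
clean = pd.Series(100.0 + 0.1 * np.arange(60))

# Same series with one erroneous +10% print at index 30 (hypothetical bad tick)
contaminated = clean.copy()
contaminated.iloc[30] *= 1.10

clean_signals = donchian_long_entries(clean)
contaminated_signals = donchian_long_entries(contaminated)

# Bars where the outlier changed the entry signal
changed = clean_signals != contaminated_signals
changed_bars = changed[changed].index.tolist()

print("Entries on clean series:", int(clean_signals.sum()))
print("Entries on contaminated series:", int(contaminated_signals.sum()))
print("Bars whose signal changed:", changed_bars)
```

In this setup the single spike raises the channel high for as long as it stays inside the lookback window, so the breakouts on the 20 bars that follow it are suppressed. A spike in the other direction would affect the low channel and short entries in the same way.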

In [5]:
# Save Risk Analysis Results
if not df.empty and 'outlier_df' in locals():
    print("💾 Saving risk analysis results...")
    
    # Generate timestamp
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    
    # Save outlier detection results
    outlier_file = PROCESSED_DIR / f'turtle_outlier_analysis_{timestamp}.csv'
    outlier_df.to_csv(outlier_file)
    
    # Save risk metrics
    risk_file = PROCESSED_DIR / f'turtle_risk_metrics_{timestamp}.csv'
    risk_metrics_df.to_csv(risk_file)
    
    print(f"✅ Results saved:")
    print(f"   Outlier analysis: {outlier_file}")
    print(f"   Risk metrics: {risk_file}")
    
    # Final summary
    print(f"\n🎯 TURTLE TRADING RISK ANALYSIS COMPLETE!")
    print("="*60)
    print(f"📊 Analysis Summary:")
    print(f"   Assets analyzed: {len(outlier_df)}")
    print(f"   Total observations: {outlier_df['total_observations'].sum():,}")
    print(f"   Average outlier rate: {outlier_df['price_outlier_rate_%'].mean():.2f}%")
    print(f"   Highest risk asset: {risk_metrics_df['volatility'].idxmax()} ({risk_metrics_df['volatility'].max():.4f} vol)")
    print(f"   Best Sharpe ratio: {risk_metrics_df['sharpe_ratio'].idxmax()} ({risk_metrics_df['sharpe_ratio'].max():.4f})")
    
    print(f"\n🔍 Key Risk Insights:")
    print(f"   • Portfolio contains {len(outlier_df)} diversified ETF assets")
    print(f"   • Average outlier rate of {outlier_df['price_outlier_rate_%'].mean():.2f}% indicates normal market behavior")
    print(f"   • Volatility outliers help identify market stress periods")
    print(f"   • Gap events may indicate overnight news or data quality issues")
    print(f"   • Risk metrics show expected ETF risk-return profiles")
    
    print(f"\n✅ Risk analysis ready for Turtle Trading strategy implementation!")

else:
    print("⚠️  No risk analysis results to save")


💾 Saving risk analysis results...
✅ Results saved:
   Outlier analysis: /Users/panweihu/Desktop/Desktop_m1/NYU_mfe/bootcamp/camp4/bootcamp_bill_panwei_hu/turtle_project/data/processed/turtle_outlier_analysis_20250821_120421.csv
   Risk metrics: /Users/panweihu/Desktop/Desktop_m1/NYU_mfe/bootcamp/camp4/bootcamp_bill_panwei_hu/turtle_project/data/processed/turtle_risk_metrics_20250821_120421.csv

🎯 TURTLE TRADING RISK ANALYSIS COMPLETE!
📊 Analysis Summary:
   Assets analyzed: 18
   Total observations: 9,036.0
   Average outlier rate: 2.82%
   Highest risk asset: UNG (0.0367 vol)
   Best Sharpe ratio: GLD (0.1166)

🔍 Key Risk Insights:
   • Portfolio contains 18 diversified ETF assets
   • Average outlier rate of 2.82% indicates normal market behavior
   • Volatility outliers help identify market stress periods
   • Gap events may indicate overnight news or data quality issues
   • Risk metrics show expected ETF risk-return profiles

✅ Risk analysis ready for Turtle Trading strategy imple