# AlphaNet Cryptocurrency Data Preparation

This notebook prepares cryptocurrency 15-minute OHLCV data for AlphaNet training, adapting the original stock methodology to crypto markets.

## Data Overview
- **Data Source**: 355 cryptocurrency parquet files with 15-minute intervals, of which 334 load successfully
- **Time Range**: ~2021-2024
- **AlphaNet Format**: 9×30 feature matrices with standardized return targets
- **Key Challenge**: Fill time gaps and adapt stock methodology to 24/7 crypto markets
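
As a rough illustration of the 9×30 format, the sketch below cuts rolling 30-bar windows from a per-symbol feature table. The nine feature columns, the `stride` value, and the array names are assumptions made for this example; the notebook's actual feature set is not shown here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical input: T bars x 9 features (the feature set is an assumption).
T, n_features, window = 100, 9, 30
features = np.arange(T * n_features, dtype=float).reshape(T, n_features)

# sliding_window_view appends the window axis last,
# so the result has shape (T - window + 1, 9, 30).
windows = sliding_window_view(features, window_shape=window, axis=0)

# Keep every `stride`-th window to limit overlap between samples (assumed value).
stride = 10
samples = windows[::stride]

print(windows.shape)  # (71, 9, 30)
print(samples.shape)  # (8, 9, 30)
```

Each `samples[i]` is one 9×30 matrix whose columns are consecutive 15-minute bars. The returned view is read-only, so copy it before modifying values.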

## 1. Environment Setup and Dependencies

In [1]:
# Install required parquet reading dependencies
import subprocess
import sys

def install_package(package):
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"✅ Successfully installed {package}")
    except Exception as e:
        print(f"❌ Failed to install {package}: {e}")

# Install parquet reading engines
print("📦 Installing parquet reading dependencies...")
install_package("pyarrow")
install_package("fastparquet")

📦 Installing parquet reading dependencies...
✅ Successfully installed pyarrow
✅ Successfully installed fastparquet


In [2]:
import pandas as pd
import numpy as np
import os
import warnings
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Optional
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor, as_completed
import psutil
import time
from joblib import Parallel, delayed

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("📚 All imports successful!")

📚 All imports successful!


## 2. Robust Data Loading Functions

In [3]:
def read_parquet_robust(file_path: str) -> pd.DataFrame:
    """
    Robust parquet file reader with multiple engine fallbacks
    
    Args:
        file_path (str): Path to the parquet file
        
    Returns:
        pd.DataFrame: Loaded dataframe or empty dataframe if failed
    """
    engines = ['pyarrow', 'fastparquet']
    
    for engine in engines:
        try:
            df = pd.read_parquet(file_path, engine=engine)
            if not df.empty:
                return df
        except Exception:
            # This engine could not read the file; try the next one
            continue
    
    print(f"⚠️ Failed to read {file_path} with all engines")
    return pd.DataFrame()

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """
    Standardize column names and data types for crypto data
    
    Actual columns: timestamp, open_price, high_price, low_price, close_price, volume, amount, count, buy_volume, buy_amount
    Expected output: timestamp, open, high, low, close, volume
    """
    if df.empty:
        return df
    
    # Column mapping for actual parquet structure
    column_mapping = {
        'open_price': 'open',
        'high_price': 'high', 
        'low_price': 'low',
        'close_price': 'close',
        # timestamp and volume are already correctly named
    }
    
    # Apply column mapping
    df = df.rename(columns=column_mapping)
    
    # Required columns after mapping
    required_cols = ['timestamp', 'open', 'high', 'low', 'close', 'volume']
    
    # Check if we have all required columns
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        print(f"⚠️ Missing columns: {missing_cols}")
        print(f"   Available columns: {list(df.columns)}")
        return pd.DataFrame()
    
    # Convert timestamp to datetime
    if df['timestamp'].dtype == 'object':
        df['timestamp'] = pd.to_datetime(df['timestamp'])
    elif df['timestamp'].dtype in ['int64', 'int32']:
        # Handle millisecond timestamps
        df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    
    # Convert price columns to float
    price_cols = ['open', 'high', 'low', 'close']
    for col in price_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # Convert volume columns to numeric (handle object types)
    volume_cols = ['volume', 'amount', 'buy_volume', 'buy_amount']
    for col in volume_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # Convert count to int
    if 'count' in df.columns:
        df['count'] = pd.to_numeric(df['count'], errors='coerce').fillna(0).astype('int64')
    
    # Sort by timestamp
    df = df.sort_values('timestamp').reset_index(drop=True)
    
    # Keep only required columns (but also preserve additional useful columns)
    # Keep buy_volume and buy_amount for potential enhanced feature engineering
    available_extra_cols = [col for col in ['amount', 'count', 'buy_volume', 'buy_amount'] if col in df.columns]
    final_cols = required_cols + available_extra_cols
    df = df[final_cols]
    
    return df

def load_crypto_data(data_dir: str = 'kline_data/train_data') -> Dict[str, pd.DataFrame]:
    """
    Load all cryptocurrency parquet files with robust error handling
    
    Returns:
        Dict[str, pd.DataFrame]: Dictionary mapping symbol to dataframe
    """
    print(f"📂 Loading cryptocurrency data from {data_dir}...")
    
    if not os.path.exists(data_dir):
        print(f"❌ Data directory not found: {data_dir}")
        return {}
    
    parquet_files = [f for f in os.listdir(data_dir) if f.endswith('.parquet')]
    print(f"📊 Found {len(parquet_files)} parquet files")
    
    # First, let's check the structure of one file
    if parquet_files:
        sample_file = parquet_files[0]
        sample_path = os.path.join(data_dir, sample_file)
        sample_df = read_parquet_robust(sample_path)
        if not sample_df.empty:
            print(f"📋 Sample file structure ({sample_file}):")
            print(f"   Columns: {list(sample_df.columns)}")
            print(f"   Shape: {sample_df.shape}")
            print(f"   Original data types: {sample_df.dtypes.to_dict()}")
    
    crypto_data = {}
    successful_loads = 0
    failed_loads = 0
    failed_symbols = []
    
    for filename in tqdm(parquet_files, desc="Loading files"):
        symbol = filename.replace('.parquet', '')
        file_path = os.path.join(data_dir, filename)
        
        try:
            # Load and standardize data
            df = read_parquet_robust(file_path)
            df = standardize_columns(df)
            
            if not df.empty and len(df) >= 100:  # Minimum data requirement
                crypto_data[symbol] = df
                successful_loads += 1
            else:
                failed_loads += 1
                failed_symbols.append(symbol)
                if len(df) < 100:
                    print(f"⚠️ {symbol}: Insufficient data ({len(df)} records)")
                    
        except Exception as e:
            failed_loads += 1
            failed_symbols.append(symbol)
            print(f"❌ Failed to load {symbol}: {str(e)[:50]}...")
    
    print(f"\n✅ Successfully loaded: {successful_loads} files")
    print(f"❌ Failed to load: {failed_loads} files")
    
    if failed_symbols:
        print(f"\n📋 Failed symbols: {', '.join(failed_symbols[:10])}{'...' if len(failed_symbols) > 10 else ''}")
        print(f"💡 Recommendation: These {failed_loads} coins will be excluded from AlphaNet training")
        print(f"   This is normal - represents {failed_loads/len(parquet_files)*100:.1f}% failure rate")
    
    return crypto_data

print("🔧 Data loading functions ready!")

🔧 Data loading functions ready!
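
To make the standardization steps concrete, here is a minimal sketch on a two-row toy frame. The values are made up and the frame only mimics the column names listed in the `standardize_columns` docstring; it is not read from the real parquet files.

```python
import pandas as pd

# Toy raw frame mimicking the parquet schema (values are made up).
raw = pd.DataFrame({
    "timestamp": [1676990700000, 1676989800000],  # epoch milliseconds, unsorted
    "open_price": ["0.6210", "0.6480"],           # prices stored as strings
    "high_price": ["0.6290", "0.6480"],
    "low_price": ["0.6145", "0.6120"],
    "close_price": ["0.6191", "bad"],             # unparseable value becomes NaN
    "volume": [1674991, 1528423],
})

# Rename the *_price columns to the short OHLC names.
clean = raw.rename(columns={
    "open_price": "open", "high_price": "high",
    "low_price": "low", "close_price": "close",
})

# Integer timestamps are interpreted as milliseconds since the epoch.
clean["timestamp"] = pd.to_datetime(clean["timestamp"], unit="ms")

# errors="coerce" turns unparseable entries into NaN instead of raising.
for col in ["open", "high", "low", "close", "volume"]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

clean = clean.sort_values("timestamp").reset_index(drop=True)
print(clean)
```

After sorting, the first row is the 14:30 bar, and its `close` is NaN because the string `"bad"` could not be parsed.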


## 3. Time Gap Detection and Filling

In [4]:
def detect_time_gaps(df: pd.DataFrame, symbol: str, expected_interval_minutes: int = 15) -> List[Dict]:
    """
    Detect time gaps in cryptocurrency data
    
    Args:
        df: DataFrame with timestamp column
        symbol: Symbol name for logging
        expected_interval_minutes: Expected interval between records
        
    Returns:
        List of gap information dictionaries
    """
    if len(df) < 2:
        return []
    
    df = df.sort_values('timestamp')
    time_diffs = df['timestamp'].diff().dropna()
    
    expected_interval = pd.Timedelta(minutes=expected_interval_minutes)
    tolerance = pd.Timedelta(seconds=30)  # 30-second tolerance
    
    gaps = []
    
    for i, diff in enumerate(time_diffs):
        if diff > expected_interval + tolerance:
            gap_minutes = diff.total_seconds() / 60
            gaps.append({
                'symbol': symbol,
                'start': df['timestamp'].iloc[i],
                'end': df['timestamp'].iloc[i+1],
                'gap_minutes': gap_minutes,
                'expected_bars': int(gap_minutes / expected_interval_minutes) - 1
            })
    
    return gaps

def fill_time_gaps(df: pd.DataFrame, symbol: str) -> pd.DataFrame:
    """
    Fill time gaps in cryptocurrency data with NA values
    
    Args:
        df: DataFrame with OHLCV data
        symbol: Symbol name for logging
        
    Returns:
        DataFrame with filled time gaps
    """
    if df.empty:
        return df
    
    original_length = len(df)
    df = df.sort_values('timestamp')
    
    # Create complete time index with 15-minute intervals
    start_time = df['timestamp'].min()
    end_time = df['timestamp'].max()
    
    # Generate complete 15-minute intervals
    complete_index = pd.date_range(
        start=start_time,
        end=end_time, 
        freq='15min'  # 15-minute intervals ('15T' is deprecated in recent pandas)
    )
    
    # Create complete DataFrame
    complete_df = pd.DataFrame({'timestamp': complete_index})
    
    # Merge with original data
    filled_df = complete_df.merge(df, on='timestamp', how='left')
    
    filled_length = len(filled_df)
    na_count = filled_df.isna().sum().sum()
    
    if filled_length > original_length:
        print(f"🔧 {symbol}: Filled {filled_length - original_length} time gaps ({na_count} NA values)")
    
    return filled_df

def analyze_data_completeness(crypto_data: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    """
    Analyze data completeness across all cryptocurrencies
    
    Returns:
        DataFrame with completeness statistics
    """
    print("📊 Analyzing data completeness...")
    
    completeness_stats = []
    
    for symbol, df in crypto_data.items():
        if df.empty:
            continue
            
        # Basic statistics
        total_records = len(df)
        date_range = (df['timestamp'].max() - df['timestamp'].min()).days
        
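        # Expected records (15-min intervals for 24/7 trading).
        # Note: .days truncates partial days, so this slightly underestimates the
        # expected count and the completeness rate can exceed 100%.
        expected_records = int(date_range * 24 * 4)  # 4 intervals per hour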
        completeness_rate = (total_records / expected_records * 100) if expected_records > 0 else 0
        
        # Detect gaps
        gaps = detect_time_gaps(df, symbol)
        
        completeness_stats.append({
            'symbol': symbol,
            'total_records': total_records,
            'date_range_days': date_range,
            'expected_records': expected_records,
            'completeness_rate': completeness_rate,
            'num_gaps': len(gaps),
            'start_date': df['timestamp'].min(),
            'end_date': df['timestamp'].max()
        })
    
    stats_df = pd.DataFrame(completeness_stats)
    
    # Summary statistics
    print(f"\n📈 Data Completeness Summary:")
    print(f"   Total symbols: {len(stats_df)}")
    print(f"   Average completeness: {stats_df['completeness_rate'].mean():.1f}%")
    print(f"   Symbols with gaps: {(stats_df['num_gaps'] > 0).sum()}")
    print(f"   Symbols with >90% completeness: {(stats_df['completeness_rate'] > 90).sum()}")
    
    return stats_df

print("⏰ Time gap detection and filling functions ready!")

⏰ Time gap detection and filling functions ready!
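
The detect-then-fill idea can be seen on a tiny series. The sketch below uses made-up data and a `reindex`-based fill rather than the merge used in `fill_time_gaps`; the `is_filled` flag column is an addition for illustration only.

```python
import pandas as pd

# Toy series: six 15-minute bars with two bars (00:30 and 00:45) removed.
full_index = pd.date_range("2024-01-01 00:00", periods=6, freq="15min")
bars = pd.DataFrame({"timestamp": full_index,
                     "close": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]})
bars = bars.drop(index=[2, 3]).reset_index(drop=True)

# Detect gaps: any step larger than the expected interval.
step = bars["timestamp"].diff()
gap_rows = bars[step > pd.Timedelta(minutes=15)]

# Fill gaps: reindex onto the complete grid; inserted rows hold NaN.
grid = pd.date_range(bars["timestamp"].min(), bars["timestamp"].max(), freq="15min")
filled = (bars.set_index("timestamp")
              .reindex(grid)
              .rename_axis("timestamp")
              .reset_index())
filled["is_filled"] = filled["close"].isna()

print(len(gap_rows), len(filled), int(filled["is_filled"].sum()))  # 1 6 2
```

One gap of 45 minutes corresponds to two missing bars, which matches the `expected_bars` arithmetic in `detect_time_gaps` (45 / 15 - 1 = 2).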


## 4. Load and Process Data

In [5]:
# Load all cryptocurrency data
crypto_data = load_crypto_data()

print(f"\n🎯 Loaded data for {len(crypto_data)} cryptocurrencies")

# Show sample data
if crypto_data:
    sample_symbol = list(crypto_data.keys())[0] 
    sample_df = crypto_data[sample_symbol]
    print(f"\n📋 Sample data from {sample_symbol}:")
    print(sample_df.head())
    print(f"\nData types:")
    print(sample_df.dtypes)

📂 Loading cryptocurrency data from kline_data/train_data...
📊 Found 355 parquet files
⚠️ Failed to read kline_data/train_data/SUSDT.parquet with all engines


⚠️ Failed to read kline_data/train_data/SUSDT.parquet with all engines
⚠️ SUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/ANIMEUSDT.parquet with all engines
⚠️ ANIMEUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/SOLVUSDT.parquet with all engines
⚠️ SOLVUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/AI16ZUSDT.parquet with all engines
⚠️ AI16ZUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/MELANIAUSDT.parquet with all engines
⚠️ MELANIAUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/BIOUSDT.parquet with all engines
⚠️ BIOUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/VINEUSDT.parquet with all engines
⚠️ VINEUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/COOKIEUSDT.parquet with all engines
⚠️ COOKIEUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/AVAAIUSDT.parquet with all engines
⚠️ AVAAIUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/ARCUSDT.parquet with all engines
⚠️ ARCUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/ZEREBROUSDT.parquet with all engines
⚠️ ZEREBROUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/VTHOUSDT.parquet with all engines
⚠️ VTHOUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/DUSDT.parquet with all engines
⚠️ DUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/PROMUSDT.parquet with all engines
⚠️ PROMUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/SONICUSDT.parquet with all engines
⚠️ SONICUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/SWARMSUSDT.parquet with all engines
⚠️ SWARMSUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/TRUMPUSDT.parquet with all engines
⚠️ TRUMPUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/ALCHUSDT.parquet with all engines
⚠️ ALCHUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/GRIFFAINUSDT.parquet with all engines
⚠️ GRIFFAINUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/PIPPINUSDT.parquet with all engines
⚠️ PIPPINUSDT: Insufficient data (0 records)
⚠️ Failed to read kline_data/train_data/VVVUSDT.parquet with all engines
⚠️ VVVUSDT: Insufficient data (0 records)

Loading files: 100%|██████████| 355/355 [00:28<00:00, 12.37it/s]

✅ Successfully loaded: 334 files
❌ Failed to load: 21 files

📋 Failed symbols: SUSDT, ANIMEUSDT, SOLVUSDT, AI16ZUSDT, MELANIAUSDT, BIOUSDT, VINEUSDT, COOKIEUSDT, AVAAIUSDT, ARCUSDT...
💡 Recommendation: These 21 coins will be excluded from AlphaNet training
   This is normal - represents 5.9% failure rate

🎯 Loaded data for 334 cryptocurrencies

📋 Sample data from STXUSDT:
            timestamp    open    high     low   close   volume        amount  \
0 2023-02-21 14:30:00  0.6480  0.6480  0.6120  0.6210  1528423  9.520539e+05   
1 2023-02-21 14:45:00  0.6210  0.6290  0.6145  0.6191  1674991  1.040635e+06   
2 2023-02-21 15:00:00  0.6191  0.6214  0.6085  0.6110  1173522  7.220518e+05   
3 2023-02-21 15:15:00  0.6109  0.6133  0.5996  0.6078  1326027  8.045360e+05   
4 2023-02-21 15:30:00  0.6078  0.6180  0.5962  0.6142  1195182  7.283026e+05   

   count  buy_volume   buy_amount  
0   3910      647599  403575.5675  
1   4304      822325  511118.5320  
2   3231      622631  383221.5037




In [6]:
# Analyze data completeness before gap filling
completeness_stats = analyze_data_completeness(crypto_data)

# Show symbols with gaps
symbols_with_gaps = completeness_stats[completeness_stats['num_gaps'] > 0]
if not symbols_with_gaps.empty:
    print(f"\n⚠️ Symbols with time gaps:")
    print(symbols_with_gaps[['symbol', 'completeness_rate', 'num_gaps']].head(10))
else:
    print(f"\n✅ No time gaps detected!")

📊 Analyzing data completeness...

📈 Data Completeness Summary:
   Total symbols: 334
   Average completeness: 100.4%
   Symbols with gaps: 1
   Symbols with >90% completeness: 333

⚠️ Symbols with time gaps:
     symbol  completeness_rate  num_gaps
88  TLMUSDT          86.719574         1
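
An average completeness above 100% can arise because `Timedelta.days` drops partial days, which makes the expected record count too small. The sketch below contrasts that estimate with an exact grid count. The helper name `expected_bar_count` is hypothetical, and the calculation assumes bars are aligned to the 15-minute grid.

```python
import pandas as pd

def expected_bar_count(start: pd.Timestamp, end: pd.Timestamp, minutes: int = 15) -> int:
    """Number of grid points from start to end inclusive (hypothetical helper)."""
    return int((end - start) / pd.Timedelta(minutes=minutes)) + 1

start = pd.Timestamp("2024-01-01 00:00")
end = pd.Timestamp("2024-01-02 12:00")   # 1.5 days later

# Day-truncated estimate: 1 whole day * 24 h * 4 bars = 96
truncated = (end - start).days * 24 * 4

# Exact count: 36 h * 4 bars + 1 (both endpoints included) = 145
exact = expected_bar_count(start, end)

print(truncated, exact)  # 96 145
```

A gap-free series over this span holds 145 bars, so the truncated estimate would report about 151% completeness while the exact count reports 100%.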


In [7]:
# Fill time gaps for symbols that need it
print("🔧 Filling time gaps...")

filled_data = {}
symbols_with_gaps_list = completeness_stats[completeness_stats['num_gaps'] > 0]['symbol'].tolist()

for symbol, df in tqdm(crypto_data.items(), desc="Processing symbols"):
    if symbol in symbols_with_gaps_list:
        filled_df = fill_time_gaps(df, symbol)
        filled_data[symbol] = filled_df
    else:
        filled_data[symbol] = df

# Update crypto_data with filled data
crypto_data = filled_data

print(f"\n✅ Gap filling completed for {len(symbols_with_gaps_list)} symbols")

🔧 Filling time gaps...


Processing symbols: 100%|██████████| 334/334 [00:00<00:00, 54799.62it/s]

🔧 TLMUSDT: Filled 16165 time gaps (145485 NA values)

✅ Gap filling completed for 1 symbols





## 4.5. MAD Outlier Treatment

Apply Median Absolute Deviation (MAD) based outlier treatment to improve data quality:
- **Method**: Winsorization (replace extremes with threshold values)
- **Scope**: Price columns (open, high, low, close) only
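- **Threshold**: median ± 3.0 × 1.4826 × MAD, where the scaled MAD approximates one standard deviation for normally distributed data (chosen as a conservative setting for crypto volatility)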
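
A small worked example may help fix the arithmetic. The five prices below are made up; the formula follows the threshold described above with k = 3.0.

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up values with one extreme
k = 3.0

median = prices.median()                   # 3.0
mad = (prices - median).abs().median()     # deviations 2, 1, 0, 1, 97 -> MAD = 1.0
half_width = k * 1.4826 * mad              # 4.4478
lower, upper = median - half_width, median + half_width

# Winsorize: values outside [lower, upper] are replaced by the nearest bound.
clipped = prices.clip(lower=lower, upper=upper)
print(round(lower, 4), round(upper, 4))    # -1.4478 7.4478
print(clipped.tolist())
```

Only the value 100.0 lies outside the band, so it is replaced by the upper bound while the other four values are unchanged.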

In [8]:
def calculate_mad_thresholds(series: pd.Series, k: float = 3.0) -> Tuple[float, float]:
    """
    Calculate MAD-based outlier thresholds
    
    Args:
        series: Pandas Series of numeric data
        k: MAD multiplier (typically 2.5-3.5; after the 1.4826 scaling this is roughly 2.5σ-3.5σ for normally distributed data)
        
    Returns:
        Tuple of (lower_threshold, upper_threshold)
    """
    # Remove NaN values for calculation
    clean_series = series.dropna()
    
    if len(clean_series) < 10:  # Need sufficient data points
        return float('-inf'), float('inf')
    
    # Calculate median
    median_val = clean_series.median()
    
    # Calculate MAD (Median Absolute Deviation)
    mad = (clean_series - median_val).abs().median()
    
    # MAD to standard deviation conversion factor
    mad_to_std = 1.4826
    
    # Calculate thresholds
    threshold_range = k * mad * mad_to_std
    lower_threshold = median_val - threshold_range
    upper_threshold = median_val + threshold_range
    
    return lower_threshold, upper_threshold

def winsorize_outliers(series: pd.Series, lower_thresh: float, upper_thresh: float) -> Tuple[pd.Series, int]:
    """
    Apply winsorization to a series using given thresholds
    
    Args:
        series: Pandas Series to winsorize
        lower_thresh: Lower threshold value
        upper_thresh: Upper threshold value
        
    Returns:
        Tuple of (winsorized_series, num_outliers_treated)
    """
    # Count outliers before treatment
    lower_outliers = (series < lower_thresh).sum()
    upper_outliers = (series > upper_thresh).sum()
    total_outliers = lower_outliers + upper_outliers
    
    # Apply winsorization
    winsorized = series.copy()
    winsorized = winsorized.clip(lower=lower_thresh, upper=upper_thresh)
    
    return winsorized, total_outliers

def mad_outlier_treatment(df: pd.DataFrame, symbol: str, 
                         target_columns: Tuple[str, ...] = ('open', 'high', 'low', 'close'),
                         k: float = 3.0) -> pd.DataFrame:
    """
    Apply MAD-based outlier treatment to specified columns
    
    Args:
        df: DataFrame with price data
        symbol: Symbol name for logging
        target_columns: Columns to apply MAD treatment
        k: MAD multiplier threshold
        
    Returns:
        DataFrame with outliers treated
    """
    if df.empty:
        return df
    
    df_treated = df.copy()
    treatment_stats = {}
    
    for col in target_columns:
        if col not in df.columns:
            continue
            
        # Calculate MAD thresholds
        lower_thresh, upper_thresh = calculate_mad_thresholds(df[col], k=k)
        
        if lower_thresh == float('-inf'):  # Skip if insufficient data
            continue
            
        # Apply winsorization
        df_treated[col], outliers_count = winsorize_outliers(
            df[col], lower_thresh, upper_thresh
        )
        
        treatment_stats[col] = {
            'outliers_treated': outliers_count,
            'lower_threshold': lower_thresh,
            'upper_threshold': upper_thresh,
            'outlier_rate': outliers_count / len(df) * 100
        }
    
    # Log treatment results
    total_outliers = sum([stats['outliers_treated'] for stats in treatment_stats.values()])
    if total_outliers > 0:
        outlier_rate = total_outliers / (len(df) * len(target_columns)) * 100
        print(f"🔧 {symbol}: Treated {total_outliers} outliers ({outlier_rate:.2f}% of data points)")
        
        # Show detailed stats for high outlier rate
        if outlier_rate > 5.0:
            for col, stats in treatment_stats.items():
                if stats['outliers_treated'] > 0:
                    print(f"   {col}: {stats['outliers_treated']} outliers ({stats['outlier_rate']:.2f}%)")
    
    return df_treated

def process_all_symbols_mad(crypto_data: Dict[str, pd.DataFrame], 
                           k: float = 3.0) -> Dict[str, pd.DataFrame]:
    """
    Apply MAD outlier treatment to all cryptocurrency symbols
    
    Args:
        crypto_data: Dictionary mapping symbol to DataFrame
        k: MAD multiplier threshold
        
    Returns:
        Dictionary with MAD-treated data
    """
    print(f"🔧 Applying MAD outlier treatment (k={k}) to all symbols...")
    
    treated_data = {}
    processed_symbols = 0
    
    for symbol, df in tqdm(crypto_data.items(), desc="MAD treatment"):
        try:
            treated_df = mad_outlier_treatment(df, symbol, k=k)
            treated_data[symbol] = treated_df
            processed_symbols += 1
            
        except Exception as e:
            print(f"❌ Failed MAD treatment for {symbol}: {str(e)[:50]}...")
            # Keep original data if treatment fails
            treated_data[symbol] = df
    
    print(f"\n✅ MAD treatment completed for {processed_symbols} symbols")
    print(f"💡 Treatment parameters: k={k} (≈{k:.1f}σ threshold)")
    print(f"📊 Method: Winsorization applied to price columns only")
    
    return treated_data

# Additional utility function for MAD analysis
def analyze_mad_impact(original_data: Dict[str, pd.DataFrame], 
                      treated_data: Dict[str, pd.DataFrame],
                      sample_symbols: int = 5) -> None:
    """
    Analyze the impact of MAD treatment on data distribution
    
    Args:
        original_data: Original cryptocurrency data
        treated_data: MAD-treated cryptocurrency data  
        sample_symbols: Number of symbols to analyze in detail
    """
    print(f"📊 Analyzing MAD treatment impact (sample of {sample_symbols} symbols)...")
    
    symbols_to_analyze = list(original_data.keys())[:sample_symbols]
    
    for symbol in symbols_to_analyze:
        if symbol not in treated_data:
            continue
            
        orig_df = original_data[symbol]
        treat_df = treated_data[symbol]
        
        print(f"\n🔍 Analysis for {symbol}:")
        
        for col in ['open', 'high', 'low', 'close']:
            if col not in orig_df.columns:
                continue
                
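            orig_std = orig_df[col].std()
            treat_std = treat_df[col].std()
            # Guard against a zero original std
            std_reduction = (1 - treat_std / orig_std) * 100 if orig_std else float('nan')
            
            # Count only genuine changes: NaN != NaN is True, so gap-filled rows
            # would otherwise be counted as changed
            both_valid = orig_df[col].notna() & treat_df[col].notna()
            changes = ((orig_df[col] != treat_df[col]) & both_valid).sum()
            change_rate = changes / len(orig_df) * 100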
            
            print(f"   {col}: {changes} values changed ({change_rate:.2f}%), std reduced by {std_reduction:.1f}%")

print("🔧 MAD outlier treatment functions ready!")

🔧 MAD outlier treatment functions ready!
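
A single MAD band over raw price levels can flag an entire price regime rather than isolated spikes when a series spends most of its history at one level and part of it at another. The sketch below illustrates this on synthetic data and compares it with a band over log returns. The return-based variant is an alternative shown for comparison, not what this notebook does, and the helper `mad_band` and the price path are made up for the example.

```python
import numpy as np
import pandas as pd

def mad_band(x: pd.Series, k: float = 3.0) -> tuple:
    """Median +/- k * 1.4826 * MAD (hypothetical helper, same formula as above)."""
    med = x.median()
    mad = (x - med).abs().median()
    return med - k * 1.4826 * mad, med + k * 1.4826 * mad

# Synthetic price path (made up): 800 bars near 1.0, then 200 bars near 3.0.
t = np.arange(1000)
base = np.where(t < 800, 1.0, 3.0)
price = pd.Series(base * (1 + 0.01 * np.sin(t / 5)))

# Level-based band: the whole high-price regime falls outside it.
lo, hi = mad_band(price)
level_flags = int(((price < lo) | (price > hi)).sum())

# Return-based band: only the single jump between regimes is flagged.
log_ret = np.log(price).diff().dropna()
r_lo, r_hi = mad_band(log_ret)
return_flags = int(((log_ret < r_lo) | (log_ret > r_hi)).sum())

print(level_flags, return_flags)  # 200 1
```

On this toy path the level-based band flags 20% of all bars, while the return-based band flags one bar. Whether a return-based or rolling-window threshold suits the real data depends on how the features are built downstream.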


In [9]:
# Apply MAD outlier treatment to all symbols
# Keep a reference to the pre-treatment data for analysis.
# A shallow copy is enough here because mad_outlier_treatment returns new DataFrames.
crypto_data_original = crypto_data.copy()

# Apply MAD treatment with k=3.0 (conservative threshold)
crypto_data_mad_treated = process_all_symbols_mad(crypto_data, k=3.0)

# Analyze the impact of MAD treatment
analyze_mad_impact(crypto_data_original, crypto_data_mad_treated, sample_symbols=5)

# Update main data with MAD-treated version
crypto_data = crypto_data_mad_treated

print(f"\n🎯 MAD outlier treatment completed!")
print(f"📊 Data now ready for quality filtering with improved outlier handling")

🔧 Applying MAD outlier treatment (k=3.0) to all symbols...


🔧 ARPAUSDT: Treated 29218 outliers (6.51% of data points)
   open: 7299 outliers (6.50%)
   high: 7266 outliers (6.47%)
   low: 7354 outliers (6.55%)
   close: 7299 outliers (6.50%)
🔧 MAVUSDT: Treated 11872 outliers (5.61% of data points)
   open: 2954 outliers (5.58%)
   high: 2976 outliers (5.62%)
   low: 2970 outliers (5.61%)
   close: 2972 outliers (5.62%)
🔧 POWRUSDT: Treated 1306 outliers (0.79% of data points)
🔧 SOLUSDT: Treated 56175 outliers (10.01% of data points)
   open: 14042 outliers (10.01%)
   high: 13907 outliers (9.92%)
   low: 14183 outliers (10.11%)
   close: 14043 outliers (10.01%)
🔧 MINAUSDT: Treated 4583 outliers (1.72% of data points)
🔧 STGUSDT: Treated 4495 outliers (1.36% of data points)
🔧 QNTUSDT: Treated 3232 outliers (1.05% of data points)
🔧 ASTRUSDT: Treated 39911 outliers (15.14% of data points)
   open: 9982 outliers (15.14%)
   high: 9970 outliers (15.13%)
   low: 9983 outliers (15.15%)
   close: 9976 outliers (15.14%)
🔧 ENSUSDT: Treated 24359 outliers (

🔧 OCEANUSDT: Treated 23217 outliers (4.75% of data points)
🔧 FILUSDT: Treated 186355 outliers (33.22% of data points)
   open: 46589 outliers (33.22%)
   high: 46590 outliers (33.22%)
   low: 46588 outliers (33.22%)
   close: 46588 outliers (33.22%)
🔧 POLUSDT: Treated 3148 outliers (7.51% of data points)
   open: 793 outliers (7.57%)
   high: 778 outliers (7.42%)
   low: 787 outliers (7.51%)
   close: 790 outliers (7.54%)


🔧 CRVUSDT: Treated 65671 outliers (11.71% of data points)
   open: 16453 outliers (11.73%)
   high: 16464 outliers (11.74%)
   low: 16299 outliers (11.62%)
   close: 16455 outliers (11.73%)
🔧 JUPUSDT: Treated 789 outliers (0.61% of data points)
🔧 YGGUSDT: Treated 14112 outliers (7.14% of data points)
   open: 3528 outliers (7.14%)
   high: 3525 outliers (7.14%)
   low: 3530 outliers (7.15%)
   close: 3529 outliers (7.14%)
🔧 KOMAUSDT: Treated 11 outliers (0.14% of data points)
🔧 ARBUSDT: Treated 2400 outliers (0.96% of data points)
🔧 FLUXUSDT: Treated 2756 outliers (6.01% of data points)
   open: 688 outliers (6.00%)
   high: 677 outliers (5.91%)
   low: 704 outliers (6.14%)
   close: 687 outliers (5.99%)
🔧 RAYUSDT: Treated 11866 outliers (6.84% of data points)
   open: 2971 outliers (6.85%)
   high: 3016 outliers (6.95%)
   low: 2918 outliers (6.72%)
   close: 2961 outliers (6.82%)
🔧 DOTUSDT: Treated 143967 outliers (25.66% of data points)
   open: 36000 outliers (25.67%)
   high: 3597

🔧 VETUSDT: Treated 113744 outliers (20.27% of data points)
   open: 28434 outliers (20.27%)
   high: 28411 outliers (20.26%)
   low: 28462 outliers (20.29%)
   close: 28437 outliers (20.28%)
🔧 ATAUSDT: Treated 80547 outliers (17.21% of data points)
   open: 20147 outliers (17.22%)
   high: 20028 outliers (17.12%)
   low: 20225 outliers (17.29%)
   close: 20147 outliers (17.22%)
🔧 AGIXUSDT: Treated 38614 outliers (20.32% of data points)
   open: 9654 outliers (20.32%)
   high: 9666 outliers (20.35%)
   low: 9640 outliers (20.29%)
   close: 9654 outliers (20.32%)


🔧 ZILUSDT: Treated 101538 outliers (18.10% of data points)
   open: 25387 outliers (18.10%)
   high: 25410 outliers (18.12%)
   low: 25350 outliers (18.07%)
   close: 25391 outliers (18.10%)
🔧 BANUSDT: Treated 298 outliers (1.80% of data points)
🔧 ACEUSDT: Treated 22736 outliers (15.61% of data points)
   open: 5706 outliers (15.67%)
   high: 5693 outliers (15.64%)
   low: 5634 outliers (15.47%)
   close: 5703 outliers (15.66%)
🔧 OMGUSDT: Treated 77614 outliers (13.83% of data points)
   open: 19389 outliers (13.82%)
   high: 19527 outliers (13.92%)
   low: 19303 outliers (13.76%)
   close: 19395 outliers (13.83%)
🔧 WLDUSDT: Treated 25871 outliers (12.80% of data points)
   open: 6467 outliers (12.80%)
   high: 6472 outliers (12.81%)
   low: 6465 outliers (12.80%)
   close: 6467 outliers (12.80%)
🔧 RUNEUSDT: Treated 7621 outliers (1.36% of data points)
🔧 AAVEUSDT: Treated 112451 outliers (20.04% of data points)
   open: 28106 outliers (20.04%)
   high: 28135 outliers (20.06%)
   low: 2

🔧 BANDUSDT: Treated 154790 outliers (27.59% of data points)
   open: 38703 outliers (27.59%)
   high: 38672 outliers (27.57%)
   low: 38712 outliers (27.60%)
   close: 38703 outliers (27.59%)
🔧 LPTUSDT: Treated 36562 outliers (8.30% of data points)
   open: 9142 outliers (8.31%)
   high: 9098 outliers (8.27%)
   low: 9177 outliers (8.34%)
   close: 9145 outliers (8.31%)
🔧 REZUSDT: Treated 18883 outliers (20.06% of data points)
   open: 4715 outliers (20.04%)
   high: 4699 outliers (19.97%)
   low: 4755 outliers (20.21%)
   close: 4714 outliers (20.04%)


🔧 CELOUSDT: Treated 87159 outliers (19.05% of data points)
   open: 21790 outliers (19.05%)
   high: 21810 outliers (19.07%)
   low: 21770 outliers (19.03%)
   close: 21789 outliers (19.05%)
🔧 VANRYUSDT: Treated 3666 outliers (3.26% of data points)
🔧 TIAUSDT: Treated 5201 outliers (3.17% of data points)
🔧 WUSDT: Treated 13584 outliers (13.00% of data points)
   open: 3403 outliers (13.03%)
   high: 3385 outliers (12.96%)
   low: 3392 outliers (12.99%)
   close: 3404 outliers (13.04%)
🔧 BEAMXUSDT: Treated 1496 outliers (0.95% of data points)
🔧 ACTUSDT: Treated 10 outliers (0.05% of data points)
🔧 RAYSOLUSDT: Treated 11 outliers (0.14% of data points)
🔧 BOMEUSDT: Treated 1217 outliers (1.09% of data points)
🔧 TRUUSDT: Treated 2887 outliers (1.13% of data points)
🔧 SUSHIUSDT: Treated 160269 outliers (28.57% of data points)
   open: 40068 outliers (28.57%)
   high: 40067 outliers (28.57%)
   low: 40070 outliers (28.57%)
   close: 40064 outliers (28.56%)
🔧 KDAUSDT: Treated 8498 outliers (21

MAD treatment:  77%|███████▋  | 256/334 [00:00<00:00, 271.83it/s]

🔧 LITUSDT: Treated 126054 outliers (23.24% of data points)
   open: 31513 outliers (23.24%)
   high: 31520 outliers (23.25%)
   low: 31508 outliers (23.24%)
   close: 31513 outliers (23.24%)
🔧 QTUMUSDT: Treated 109622 outliers (19.54% of data points)
   open: 27397 outliers (19.53%)
   high: 27413 outliers (19.54%)
   low: 27418 outliers (19.55%)
   close: 27394 outliers (19.53%)
🔧 DRIFTUSDT: Treated 760 outliers (3.73% of data points)
🔧 IMXUSDT: Treated 13084 outliers (3.23% of data points)
🔧 REEFUSDT: Treated 137139 outliers (25.36% of data points)
   open: 34284 outliers (25.36%)
   high: 34279 outliers (25.35%)
   low: 34290 outliers (25.36%)
   close: 34286 outliers (25.36%)
🔧 NEOUSDT: Treated 111810 outliers (19.93% of data points)
   open: 27954 outliers (19.93%)
   high: 27946 outliers (19.92%)
   low: 27956 outliers (19.93%)
   close: 27954 outliers (19.93%)
🔧 PONKEUSDT: Treated 5 outliers (0.02% of data points)
🔧 STMXUSDT: Treated 120150 outliers (22.67% of data points)
   op

MAD treatment:  93%|█████████▎| 311/334 [00:01<00:00, 267.22it/s]

🔧 BNBUSDT: Treated 6580 outliers (1.17% of data points)
🔧 ORBSUSDT: Treated 7 outliers (0.00% of data points)
🔧 CATIUSDT: Treated 907 outliers (2.31% of data points)
🔧 BNXUSDT: Treated 2532 outliers (0.97% of data points)
🔧 CHZUSDT: Treated 42949 outliers (7.77% of data points)
   open: 10739 outliers (7.77%)
   high: 10758 outliers (7.78%)
   low: 10712 outliers (7.75%)
   close: 10740 outliers (7.77%)
🔧 SSVUSDT: Treated 3790 outliers (1.46% of data points)
🔧 ATOMUSDT: Treated 89623 outliers (15.97% of data points)
   open: 22406 outliers (15.98%)
   high: 22497 outliers (16.04%)
   low: 22314 outliers (15.91%)
   close: 22406 outliers (15.98%)
🔧 ALGOUSDT: Treated 113042 outliers (20.15% of data points)
   open: 28278 outliers (20.16%)
   high: 28248 outliers (20.14%)
   low: 28256 outliers (20.15%)
   close: 28260 outliers (20.15%)
🔧 ZETAUSDT: Treated 19266 outliers (15.05% of data points)
   open: 4820 outliers (15.06%)
   high: 4784 outliers (14.95%)
   low: 4836 outliers (15.11%)


MAD treatment: 100%|██████████| 334/334 [00:01<00:00, 270.97it/s]

🔧 RPLUSDT: Treated 3488 outliers (8.03% of data points)
   open: 869 outliers (8.00%)
   high: 871 outliers (8.02%)
   low: 878 outliers (8.08%)
   close: 870 outliers (8.01%)
🔧 YFIUSDT: Treated 169777 outliers (30.26% of data points)
   open: 42450 outliers (30.27%)
   high: 42419 outliers (30.24%)
   low: 42462 outliers (30.27%)
   close: 42446 outliers (30.26%)
🔧 DUSKUSDT: Treated 32616 outliers (7.80% of data points)
   open: 8149 outliers (7.79%)
   high: 8170 outliers (7.81%)
   low: 8151 outliers (7.79%)
   close: 8146 outliers (7.79%)
🔧 LISTAUSDT: Treated 2618 outliers (3.51% of data points)
\n✅ MAD treatment completed for 334 symbols
💡 Treatment parameters: k=3.0 (≈3.0σ threshold)
📊 Method: Winsorization applied to price columns only
📊 Analyzing MAD treatment impact (sample of 5 symbols)...
\n🔍 Analysis for STXUSDT:
   open: 0 values changed (0.00%), std reduced by 0.0%
   high: 0 values changed (0.00%), std reduced by 0.0%
   low: 0 values changed (0.00%), std reduced by 0.0%
   close: 326 values changed (0.79%), std reduced by 9.3%
\n🔍 Analysis for SOLUSDT:
   open: 14042 values changed (10.01%), std reduced by 8.4%
   high: 13907 values changed (9.92%), std reduced by 8.3%
   low: 14183 values changed (10.11%), std reduced by 8.5%
   close: 14043 values changed (10.01%), std reduced by 8.4%
\n🎯 MAD outlier treatment completed!
📊 Data now ready for quality filtering with improved outlier handling
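
For reference, the winsorization reported in the log above (k=3.0, price columns only) can be sketched as follows. This is a minimal illustration, not the notebook's own treatment function: the helper name `mad_winsorize` is hypothetical, and the 1.4826 scaling factor (which makes the MAD comparable to a standard deviation for normally distributed data) is an assumption suggested by the "≈3.0σ" note.

```python
import numpy as np
import pandas as pd

def mad_winsorize(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Hypothetical helper: clip values to median ± k * (1.4826 * MAD)."""
    median = values.median()
    mad = (values - median).abs().median()
    scaled_mad = 1.4826 * mad  # assumed scaling, so k is roughly "k sigma"
    if np.isnan(scaled_mad) or scaled_mad == 0:
        # Degenerate spread: nothing sensible to clip against
        return values.copy()
    return values.clip(lower=median - k * scaled_mad,
                       upper=median + k * scaled_mad)

# Toy series with one obvious spike
prices = pd.Series([10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 50.0])
clipped = mad_winsorize(prices, k=3.0)
print(clipped.round(4).tolist())
```

Only the spike is pulled back to the upper bound; values inside the band are returned unchanged.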


## 5. Data Filtering and Quality Control

Filter cryptocurrency data based on AlphaNet requirements:
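- Remove coins with insufficient data
- Remove coins whose data quality score (completeness, positive prices, OHLC consistency, non-negative volume) is too low
- Flag extreme bar-to-bar price movements (potential manipulation or errors); flagged rows are kept, not removed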
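
As a quick illustration of the extreme-movement check, the sketch below flags bars whose close moves more than a threshold relative to the previous bar. The helper name `flag_extreme_moves` and the toy prices are hypothetical; the 50% default mirrors the threshold used in this section.

```python
import pandas as pd

def flag_extreme_moves(close: pd.Series, max_change_pct: float = 50.0) -> pd.Series:
    """Hypothetical helper: True where the bar-to-bar move exceeds the threshold."""
    pct = close.pct_change(fill_method=None) * 100
    # NaN (the first bar) compares as False, so it is never flagged
    return pct.abs() > max_change_pct

close = pd.Series([100.0, 101.0, 160.0, 158.0, 70.0])
flags = flag_extreme_moves(close)
print(flags.tolist())
```

Returning a separate boolean Series leaves the input untouched, so the caller decides whether to drop, mask, or merely count the flagged bars.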

In [10]:
def filter_extreme_movements(df: pd.DataFrame, symbol: str, max_change_pct: float = 50.0) -> pd.DataFrame:
    """
    Flag bars with extreme price movements that might indicate errors or manipulation
    
    Rows are flagged, not removed: when at least one extreme move is found, the
    returned frame carries a boolean 'extreme_movement' column.
    
    Args:
        df: DataFrame with OHLCV data
        symbol: Symbol name for logging
        max_change_pct: Maximum allowed absolute percentage change between consecutive bars
        
    Returns:
        The input DataFrame if nothing was flagged, otherwise a copy with the flag column
    """
    if df.empty or len(df) < 2:
        return df
    
    # Bar-to-bar close returns in percent, kept in a local Series so the
    # caller's DataFrame is never mutated
    returns = df['close'].pct_change() * 100
    
    # NaN returns (e.g. the first bar) compare as False, so they are never flagged
    extreme_mask = returns.abs() > max_change_pct
    extreme_count = int(extreme_mask.sum())
    
    if extreme_count > 0:
        print(f"⚠️ {symbol}: Found {extreme_count} extreme movements (>{max_change_pct}%), marking as suspicious")
        
        # Keep the rows but flag them; downstream code decides whether to drop or mask
        df = df.copy()
        df['extreme_movement'] = extreme_mask
    
    return df

def calculate_data_quality_score(df: pd.DataFrame) -> float:
    """
    Calculate a data quality score based on various factors
    
    Returns:
        Quality score between 0 and 1 (1 = best quality)
    """
    if df.empty:
        return 0.0
    
    # Factors for quality scoring
    factors = []
    
    # 1. Data completeness (non-NA values)
    completeness = 1 - (df.isna().sum().sum() / (len(df) * len(df.columns)))
    factors.append(completeness)
    
    # 2. Price validity (prices must be strictly positive)
    price_cols = ['open', 'high', 'low', 'close']
    has_prices = all(col in df.columns for col in price_cols)
    if has_prices:
        price_validity = 1 - ((df[price_cols] <= 0).sum().sum() / (len(df) * len(price_cols)))
        factors.append(price_validity)
    
    # 3. OHLC logic consistency (high >= low, etc.)
    if has_prices:
        valid_ohlc = (
            (df['high'] >= df['low']) & 
            (df['high'] >= df['open']) & 
            (df['high'] >= df['close']) &
            (df['low'] <= df['open']) &
            (df['low'] <= df['close'])
        ).sum()
        ohlc_consistency = valid_ohlc / len(df)
        factors.append(ohlc_consistency)
    
    # 4. Volume validity (non-negative)
    if 'volume' in df.columns:
        volume_validity = (df['volume'] >= 0).sum() / len(df)
        factors.append(volume_validity)
    
    # Equal-weight average of the available factors
    quality_score = float(np.mean(factors))
    
    return quality_score

def filter_crypto_data(crypto_data: Dict[str, pd.DataFrame], 
                      min_records: int = 1000,
                      min_quality_score: float = 0.8,
                      max_extreme_change: float = 50.0) -> Dict[str, pd.DataFrame]:
    """
    Apply comprehensive filtering to cryptocurrency data
    
    Args:
        crypto_data: Dictionary of DataFrames
        min_records: Minimum number of records required
        min_quality_score: Minimum quality score required
        max_extreme_change: Maximum allowed percentage change
        
    Returns:
        Filtered dictionary of DataFrames
    """
    print("🔍 Applying data quality filters...")
    
    filtered_data = {}
    filtered_out = {
        'insufficient_data': [],
        'low_quality': [], 
        'extreme_movements': []
    }
    
    for symbol, df in tqdm(crypto_data.items(), desc="Quality filtering"):
        # Filter 1: Minimum data requirement
        if len(df) < min_records:
            filtered_out['insufficient_data'].append((symbol, len(df)))
            continue
        
        # Filter 2: Data quality score
        quality_score = calculate_data_quality_score(df)
        if quality_score < min_quality_score:
            filtered_out['low_quality'].append((symbol, quality_score))
            continue
        
        # Filter 3: Extreme movements
        df_filtered = filter_extreme_movements(df, symbol, max_extreme_change)
        
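        # Count flagged bars, then drop the helper column
        extreme_count = 0
        if 'extreme_movement' in df_filtered.columns:
            extreme_count = int(df_filtered['extreme_movement'].sum())
            df_filtered = df_filtered.drop('extreme_movement', axis=1)  # Remove flag column
        
        # Flagged symbols are recorded for reference only; they are still accepted
        if extreme_count > 0:
            filtered_out['extreme_movements'].append((symbol, extreme_count))
        
        # Accept the data
        filtered_data[symbol] = df_filtered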
        
    # Print filtering results
    print(f"\n📊 Filtering Results:")
    print(f"   Original symbols: {len(crypto_data)}")
    print(f"   Passed filtering: {len(filtered_data)}")
    print(f"   Filtered out - Insufficient data: {len(filtered_out['insufficient_data'])}")
    print(f"   Filtered out - Low quality: {len(filtered_out['low_quality'])}")
    
    if filtered_out['insufficient_data']:
        print(f"\n⚠️ Insufficient data symbols (need >= {min_records}):")
        for symbol, count in filtered_out['insufficient_data'][:10]:
            print(f"   {symbol}: {count} records")
    
    if filtered_out['low_quality']:
        print(f"\n⚠️ Low quality symbols (score < {min_quality_score}):")
        for symbol, score in filtered_out['low_quality'][:10]:
            print(f"   {symbol}: {score:.3f} quality score")
    
    return filtered_data

print("🔧 Data filtering functions ready!")

🔧 Data filtering functions ready!


In [11]:
# Apply data filtering
filtered_crypto_data = filter_crypto_data(
    crypto_data,
    min_records=1000,       # Require at least 1000 records (~10 days of 15-min data)
    min_quality_score=0.8,  # Require 80% data quality score
    max_extreme_change=50.0 # Flag movements >50% between bars
)

print(f"\n🎯 Final dataset: {len(filtered_crypto_data)} high-quality cryptocurrencies")

🔍 Applying data quality filters...


Quality filtering:  11%|█         | 37/334 [00:00<00:00, 366.95it/s]

⚠️ ASTRUSDT: Found 1 extreme movements (>50.0%), marking as suspicious
⚠️ 1000SHIBUSDT: Found 1 extreme movements (>50.0%), marking as suspicious


Quality filtering:  54%|█████▍    | 181/334 [00:00<00:00, 439.40it/s]

⚠️ MOODENGUSDT: Found 1 extreme movements (>50.0%), marking as suspicious
⚠️ RUNEUSDT: Found 1 extreme movements (>50.0%), marking as suspicious
⚠️ AUCTIONUSDT: Found 1 extreme movements (>50.0%), marking as suspicious


Quality filtering:  81%|████████  | 269/334 [00:00<00:00, 421.85it/s]

⚠️ UNFIUSDT: Found 2 extreme movements (>50.0%), marking as suspicious
⚠️ CHZUSDT: Found 1 extreme movements (>50.0%), marking as suspicious


Quality filtering: 100%|██████████| 334/334 [00:00<00:00, 417.02it/s]

⚠️ AGLDUSDT: Found 1 extreme movements (>50.0%), marking as suspicious
\n📊 Filtering Results:
   Original symbols: 334
   Passed filtering: 330
   Filtered out - Insufficient data: 4
   Filtered out - Low quality: 0
\n⚠️ Insufficient data symbols (need >1000):
   DFUSDT: 113 records
   PHAUSDT: 114 records
   DEXEUSDT: 690 records
   HIVEUSDT: 786 records
\n🎯 Final dataset: 330 high-quality cryptocurrencies





## 6. Feature Engineering: 9×30 Data Pictures

Create the core AlphaNet feature format: 9×30 matrices where:
- **9 features**: Open, High, Low, Close, Volume, VWAP, Returns, Volume Ratio, Buy Pressure
- **30 time periods**: Historical 15-minute bars (7.5 hours of data)
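
The windowing idea can be sketched with NumPy's `sliding_window_view` (available from NumPy 1.20). This is an illustrative alternative under stated assumptions, not the notebook's implementation: the helper name `make_pictures` is hypothetical, it also emits the final window ending at the last row, and it returns the row index of each kept picture's last bar so that timestamps can be looked up by index even when NaN windows are skipped.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_pictures(features: np.ndarray, lookback: int = 30):
    """Hypothetical helper: (T, F) feature rows -> (N, F, lookback) pictures.

    Also returns, for each kept picture, the row index of its last bar.
    """
    n_rows, n_features = features.shape
    if n_rows < lookback:
        return np.empty((0, n_features, lookback)), np.empty(0, dtype=int)

    # Shape (T - lookback + 1, F, lookback): the window axis is appended last
    windows = sliding_window_view(features, window_shape=lookback, axis=0)
    last_row = np.arange(lookback - 1, n_rows)

    # Drop every window that contains at least one NaN
    # (boolean indexing returns a copy, not a view)
    keep = ~np.isnan(windows).any(axis=(1, 2))
    return windows[keep], last_row[keep]

# Toy input: 5 rows, 2 features, one NaN in the first row
toy = np.arange(10, dtype=float).reshape(5, 2)
toy[0, 0] = np.nan
pictures, last_row = make_pictures(toy, lookback=3)
print(pictures.shape, last_row)
```

With 9 feature columns and `lookback=30`, each picture has the 9×30 layout described above.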

In [12]:
def calculate_technical_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate the 9 technical features for AlphaNet (Ultra Memory Optimized)
    
    Features:
    1. Open price (divided by the previous bar's close)
    2. High price (divided by the previous bar's close)
    3. Low price (divided by the previous bar's close)
    4. Close price (divided by the previous bar's close)
    5. Volume (divided by its 20-bar rolling mean)
    6. VWAP ratio (close divided by the 20-bar rolling VWAP)
    7. Returns (bar-to-bar fractional change in close)
    8. Volume Ratio (volume divided by its 20-span exponential moving average)
    9. Buy Pressure (using buy_volume if available, otherwise OHLC approximation)
    
    Returns:
        DataFrame with 'timestamp' plus the 9 feature columns, or an empty
        DataFrame if fewer than 30 usable rows are available
    """
    if df.empty or len(df) < 30:
        return pd.DataFrame()
    
    # Work on a copy so the caller's DataFrame is left untouched
    features_df = df.copy()
    
    # CRITICAL: Ensure all numeric columns are properly converted
    numeric_cols = ['open', 'high', 'low', 'close', 'volume']
    for col in numeric_cols:
        if col in features_df.columns:
            features_df[col] = pd.to_numeric(features_df[col], errors='coerce')
    
    # Also convert additional columns if they exist
    additional_numeric_cols = ['amount', 'buy_volume', 'buy_amount']
    for col in additional_numeric_cols:
        if col in features_df.columns:
            features_df[col] = pd.to_numeric(features_df[col], errors='coerce')
    
    # Remove rows with NaN values in essential columns
    essential_cols = ['open', 'high', 'low', 'close', 'volume']
    features_df = features_df.dropna(subset=essential_cols)
    
    if len(features_df) < 30:
        return pd.DataFrame()
    
    # Feature 1-4: OHLC (normalize by previous close)
    prev_close = features_df['close'].shift(1)
    features_df['open_norm'] = features_df['open'] / prev_close
    features_df['high_norm'] = features_df['high'] / prev_close
    features_df['low_norm'] = features_df['low'] / prev_close
    features_df['close_norm'] = features_df['close'] / prev_close
    
    # Feature 5: Volume (normalize by recent moving average)
    volume_ma = features_df['volume'].rolling(window=20, min_periods=1).mean()
    features_df['volume_norm'] = features_df['volume'] / volume_ma
    
    # Feature 6: VWAP ratio
    typical_price = (features_df['high'] + features_df['low'] + features_df['close']) / 3
    vwap_numerator = (typical_price * features_df['volume']).rolling(window=20, min_periods=1).sum()
    vwap_denominator = features_df['volume'].rolling(window=20, min_periods=1).sum()
    vwap = vwap_numerator / vwap_denominator
    features_df['vwap_ratio'] = features_df['close'] / vwap
    
    # Feature 7: Returns (equals close_norm - 1, so features 4 and 7 carry the same information)
    features_df['returns'] = features_df['close'].pct_change()
    
    # Feature 8: Volume Ratio (current vs exponential moving average)
    volume_ema = features_df['volume'].ewm(span=20).mean()
    features_df['volume_ratio'] = features_df['volume'] / volume_ema
    
    # Feature 9: Buy Pressure
    # OHLC approximation: position of the close within the bar's high-low range
    hl_range = features_df['high'] - features_df['low']
    close_position = features_df['close'] - features_df['low']
    ohlc_pressure = (close_position / hl_range).where(hl_range > 0, 0.5)
    
    if 'buy_volume' in features_df.columns:
        # Buy share of total volume; zero-volume or missing bars fall back to a neutral 0.5
        positive_volume = features_df['volume'].where(features_df['volume'] > 0)
        buy_pressure = features_df['buy_volume'] / positive_volume
        features_df['buy_pressure'] = buy_pressure.fillna(0.5).clip(0, 1)
    else:
        features_df['buy_pressure'] = ohlc_pressure
    
    # Select the 9 final features
    feature_columns = [
        'open_norm', 'high_norm', 'low_norm', 'close_norm',
        'volume_norm', 'vwap_ratio', 'returns', 'volume_ratio', 'buy_pressure'
    ]
    
    # Keep only required columns to save memory
    result_df = features_df[['timestamp'] + feature_columns].copy()
    
    # Handle any infinite or NaN values
    result_df = result_df.replace([np.inf, -np.inf], np.nan)
    
    # Clean up temporary variables to free memory
    del features_df, prev_close, volume_ma, typical_price, vwap_numerator, vwap_denominator, vwap, volume_ema
    
    return result_df

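def create_data_pictures(df: pd.DataFrame, lookback_periods: int = 30) -> np.ndarray:
    """
    Create 9×30 data pictures for AlphaNet (Ultra Memory Optimized)
    
    Each picture is the transpose of `lookback_periods` consecutive feature rows,
    so the result has shape (n_pictures, 9, lookback_periods). Windows containing
    any NaN are skipped, which means picture k does not necessarily start at row k.
    Returns an empty array if the input is too short or does not have exactly
    9 feature columns.
    """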
    if len(df) < lookback_periods + 1:
        return np.array([])
    
    # Get feature columns (exclude timestamp)
    feature_cols = [col for col in df.columns if col != 'timestamp']
    
    if len(feature_cols) != 9:
        return np.array([])
    
    # Extract feature values
    feature_data = df[feature_cols].values
    
    # Create sliding windows more efficiently
    data_pictures = []
    
    for i in range(lookback_periods, len(feature_data)):
        window = feature_data[i-lookback_periods:i]
        picture = window.T
        
        if not np.isnan(picture).any():
            data_pictures.append(picture)
    
    if not data_pictures:
        return np.array([])
    
    return np.stack(data_pictures, axis=0)

def process_symbols_in_batches(crypto_data: Dict[str, pd.DataFrame], 
                              batch_size: int = 20) -> Dict[str, dict]:
    """
    Process cryptocurrency symbols in batches to manage memory usage (ULTRA CONSERVATIVE)
    
    Args:
        crypto_data: Dictionary mapping symbol to DataFrame
        batch_size: Number of symbols to process in each batch (default 20, kept small for stability)
        
    Returns:
        Dictionary mapping symbol to feature arrays
    """
    print(f"🔧 Processing {len(crypto_data)} symbols in ULTRA-CONSERVATIVE batches of {batch_size}...")
    
    symbols = list(crypto_data.keys())
    symbol_features = {}
    successful_processing = 0
    failed_processing = 0
    
    # Process in smaller batches with more aggressive memory management
    for batch_start in range(0, len(symbols), batch_size):
        batch_end = min(batch_start + batch_size, len(symbols))
        batch_symbols = symbols[batch_start:batch_end]
        
        print(f"\n📦 Processing batch {batch_start//batch_size + 1}/{(len(symbols)-1)//batch_size + 1}: symbols {batch_start+1}-{batch_end}")
        print(f"💾 Memory before batch: {get_memory_usage():.1f} MB")
        
        # Process current batch
        batch_results = {}
        
        for symbol in tqdm(batch_symbols, desc=f"Batch {batch_start//batch_size + 1}"):
            try:
                df = crypto_data[symbol]
                
                # Calculate technical features
                features_df = calculate_technical_features(df)
                
                if features_df.empty:
                    failed_processing += 1
                    continue
                
                # Create data pictures
                data_pictures = create_data_pictures(features_df, lookback_periods=30)
                
                if data_pictures.size > 0:
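                    # Timestamps are taken positionally from row 30 onward. The mapping is only
                    # consistent if create_data_pictures skips no NaN windows mid-series
                    batch_results[symbol] = {
                        'features': data_pictures,
                        'timestamps': features_df['timestamp'].iloc[30:].values[:len(data_pictures)]
                    }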
                    successful_processing += 1
                else:
                    failed_processing += 1
                
                # Clean up to free memory IMMEDIATELY
                del features_df, data_pictures
                
            except Exception as e:
                failed_processing += 1
                print(f"❌ Failed to process {symbol}: {str(e)[:50]}...")
                continue
        
        # Add batch results to main dictionary
        symbol_features.update(batch_results)
        
        # AGGRESSIVE memory cleanup
        del batch_results
        import gc
        gc.collect()
        
        print(f"✅ Batch completed: {len(batch_symbols)} symbols processed")
        print(f"💾 Memory after batch: {get_memory_usage():.1f} MB")
        print(f"📊 Running totals: {successful_processing} successful, {failed_processing} failed")
    
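    # Guard against division by zero when no symbols were processed at all
    total_processed = successful_processing + failed_processing
    success_rate = successful_processing / total_processed * 100 if total_processed else 0.0
    
    print(f"\n🎯 All batches completed!")
    print(f"✅ Successfully processed: {successful_processing} symbols")
    print(f"❌ Failed to process: {failed_processing} symbols")
    print(f"📊 Success rate: {success_rate:.1f}%")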
    
    if symbol_features:
        sample_shape = list(symbol_features.values())[0]['features'].shape
        print(f"📊 Sample feature shape: {sample_shape}")
    
    return symbol_features

# Memory monitoring utility
def get_memory_usage():
    """Return the resident set size (RSS) of this Python process in MiB"""
    import psutil
    import os
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024

print("🎨 Ultra memory-optimized feature engineering functions ready!")

🎨 Ultra memory-optimized feature engineering functions ready!
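
The OHLC buy-pressure fallback used above places the close within the bar's high-low range and returns a neutral 0.5 for bars with zero range. A minimal standalone sketch follows; the helper name `close_location` is hypothetical, and `np.divide` with `where=` is used here so the zero-range bars are never divided at all.

```python
import numpy as np

def close_location(high, low, close, neutral: float = 0.5) -> np.ndarray:
    """Hypothetical helper: (close - low) / (high - low), neutral if the range is zero."""
    high, low, close = (np.asarray(a, dtype=float) for a in (high, low, close))
    hl_range = high - low
    out = np.full_like(hl_range, neutral)
    # Only positions with a positive range are written; the rest keep `neutral`
    np.divide(close - low, hl_range, out=out, where=hl_range > 0)
    return out

bp = close_location(high=[11.0, 10.0, 12.0],
                    low=[9.0, 10.0, 10.0],
                    close=[10.5, 10.0, 10.0])
print(bp)
```

A close at the high gives 1.0, a close at the low gives 0.0, and a flat bar stays at 0.5.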


In [13]:
# AGGRESSIVE MEMORY CLEANUP AND MONITORING
import gc
import psutil
import os

def aggressive_memory_cleanup():
    """Perform aggressive memory cleanup"""
    print("🧹 Starting aggressive memory cleanup...")
    
    # System-wide used memory in GiB (all processes, not only this kernel)
    initial_memory = psutil.virtual_memory().used / (1024**3)
    print(f"   Initial memory usage: {initial_memory:.1f} GB")
    
    # Force garbage collection multiple times
    for i in range(3):
        collected = gc.collect()
        print(f"   GC round {i+1}: {collected} objects collected")
    
    # gc.collect() can only release unreachable objects; data still referenced by
    # live variables (such as crypto_data) stays resident
    
    # Final memory check (system-wide)
    final_memory = psutil.virtual_memory().used / (1024**3)
    freed_memory = initial_memory - final_memory
    print(f"   Final memory usage: {final_memory:.1f} GB")
    print(f"   Memory freed: {freed_memory:.1f} GB")
    
    return final_memory

def check_memory_safety(required_gb=8):
    """Check if we have enough memory for processing"""
    available_gb = psutil.virtual_memory().available / (1024**3)
    print(f"🔍 Memory safety check:")
    print(f"   Available memory: {available_gb:.1f} GB")
    print(f"   Required memory: {required_gb} GB")
    
    if available_gb < required_gb:
        print(f"⚠️ WARNING: Insufficient memory! Consider kernel restart.")
        return False
    else:
        print(f"✅ Memory check passed!")
        return True

# Perform cleanup and safety checks
current_memory = aggressive_memory_cleanup()
memory_safe = check_memory_safety(required_gb=8)

if not memory_safe:
    print("🆘 RECOMMENDATION: Restart kernel and run sections 1-5 again")
    print("   This will ensure clean memory state for feature engineering")
else:
    print("🚀 Proceeding with ultra-conservative feature engineering...")
    
    # Process with MAXIMUM memory conservation
    print(f"💾 Pre-processing memory: {get_memory_usage():.1f} MB")
    
    try:
        # Start with extremely small batch for safety
        print("🧪 Testing with minimal batch size (3 symbols)...")
        test_data = dict(list(filtered_crypto_data.items())[:3])
        test_features = process_symbols_in_batches(test_data, batch_size=3)
        
        if test_features:
            print("✅ Minimal test successful!")
            
            # Cleanup test data immediately
            del test_data, test_features
            gc.collect()
            
            # Process with very conservative batches
            print("🔧 Processing all symbols with minimal batch size...")
            symbol_features = process_symbols_in_batches(
                filtered_crypto_data, 
                batch_size=20  # Very small batch size for maximum safety
            )
        else:
            print("❌ Even minimal test failed - kernel restart required")
            symbol_features = {}
            
    except Exception as e:
        print(f"💥 Processing failed: {str(e)[:100]}...")
        print("🆘 CRITICAL: Kernel restart required for stable processing")
        symbol_features = {}

print(f"💾 Final memory usage: {get_memory_usage():.1f} MB")

# Show results if successful
if symbol_features:
    sample_symbol = list(symbol_features.keys())[0]
    sample_features = symbol_features[sample_symbol]['features']
    print(f"\n📊 SUCCESS! Sample features from {sample_symbol}:")
    print(f"   Shape: {sample_features.shape}")
    print(f"   Data type: {sample_features.dtype}")
    print(f"   Symbols processed: {len(symbol_features)}")
    
    # Show feature statistics for validation
    print(f"\n📈 Feature statistics (first sample):")
    first_sample = sample_features[0]
    for i, feature_name in enumerate(['open_norm', 'high_norm', 'low_norm', 'close_norm', 
                                     'volume_norm', 'vwap_ratio', 'returns', 'volume_ratio', 'buy_pressure']):
        feature_data = first_sample[i]
        print(f"   {feature_name}: mean={feature_data.mean():.4f}, std={feature_data.std():.4f}")
else:
    print("❌ Feature creation failed - RESTART KERNEL and re-run sections 1-5")

🧹 Starting aggressive memory cleanup...
   Initial memory usage: 15.0 GB
   GC round 1: 0 objects collected
   GC round 2: 0 objects collected
   GC round 3: 0 objects collected
   Final memory usage: 15.0 GB
   Memory freed: 0.0 GB
🔍 Memory safety check:
   Available memory: 19.4 GB
   Required memory: 8 GB
✅ Memory check passed!
🚀 Proceeding with ultra-conservative feature engineering...
💾 Pre-processing memory: 5097.1 MB
🧪 Testing with minimal batch size (3 symbols)...
🔧 Processing 3 symbols in ULTRA-CONSERVATIVE batches of 3...
\n📦 Processing batch 1/1: symbols 1-3
💾 Memory before batch: 5097.1 MB


Batch 1: 100%|██████████| 3/3 [00:00<00:00,  6.56it/s]


✅ Batch completed: 3 symbols processed
💾 Memory after batch: 5600.4 MB
📊 Running totals: 3 successful, 0 failed
\n🎯 All batches completed!
✅ Successfully processed: 3 symbols
❌ Failed to process: 0 symbols
📊 Success rate: 100.0%
📊 Sample feature shape: (65159, 9, 30)
✅ Minimal test successful!
🔧 Processing all symbols with minimal batch size...
🔧 Processing 330 symbols in ULTRA-CONSERVATIVE batches of 20...
\n📦 Processing batch 1/17: symbols 1-20
💾 Memory before batch: 5600.4 MB


Batch 1: 100%|██████████| 20/20 [00:02<00:00,  7.07it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 8272.3 MB
📊 Running totals: 20 successful, 0 failed
\n📦 Processing batch 2/17: symbols 21-40
💾 Memory before batch: 8272.3 MB


Batch 2: 100%|██████████| 20/20 [00:02<00:00,  7.11it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 11350.8 MB
📊 Running totals: 40 successful, 0 failed
\n📦 Processing batch 3/17: symbols 41-60
💾 Memory before batch: 11350.8 MB


Batch 3: 100%|██████████| 20/20 [00:02<00:00,  8.93it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 13751.1 MB
📊 Running totals: 60 successful, 0 failed
\n📦 Processing batch 4/17: symbols 61-80
💾 Memory before batch: 13751.1 MB


Batch 4: 100%|██████████| 20/20 [00:02<00:00,  6.87it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 12570.9 MB
📊 Running totals: 80 successful, 0 failed
\n📦 Processing batch 5/17: symbols 81-100
💾 Memory before batch: 12570.9 MB


Batch 5: 100%|██████████| 20/20 [00:02<00:00,  9.47it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 14807.7 MB
📊 Running totals: 100 successful, 0 failed
\n📦 Processing batch 6/17: symbols 101-120
💾 Memory before batch: 14807.7 MB


Batch 6: 100%|██████████| 20/20 [00:03<00:00,  5.44it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 15624.0 MB
📊 Running totals: 120 successful, 0 failed
\n📦 Processing batch 7/17: symbols 121-140
💾 Memory before batch: 15623.1 MB


Batch 7: 100%|██████████| 20/20 [00:03<00:00,  6.39it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 3306.8 MB
📊 Running totals: 140 successful, 0 failed
\n📦 Processing batch 8/17: symbols 141-160
💾 Memory before batch: 3306.8 MB


Batch 8: 100%|██████████| 20/20 [00:02<00:00,  8.25it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 4849.2 MB
📊 Running totals: 160 successful, 0 failed
\n📦 Processing batch 9/17: symbols 161-180
💾 Memory before batch: 4849.2 MB


Batch 9: 100%|██████████| 20/20 [00:02<00:00,  8.39it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 6211.6 MB
📊 Running totals: 180 successful, 0 failed
\n📦 Processing batch 10/17: symbols 181-200
💾 Memory before batch: 6211.6 MB


Batch 10: 100%|██████████| 20/20 [00:03<00:00,  5.80it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 6269.6 MB
📊 Running totals: 200 successful, 0 failed
\n📦 Processing batch 11/17: symbols 201-220
💾 Memory before batch: 6269.6 MB


Batch 11: 100%|██████████| 20/20 [00:03<00:00,  6.18it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 6050.2 MB
📊 Running totals: 220 successful, 0 failed
\n📦 Processing batch 12/17: symbols 221-240
💾 Memory before batch: 6050.2 MB


Batch 12: 100%|██████████| 20/20 [00:03<00:00,  5.81it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 6359.1 MB
📊 Running totals: 240 successful, 0 failed
\n📦 Processing batch 13/17: symbols 241-260
💾 Memory before batch: 6359.1 MB


Batch 13: 100%|██████████| 20/20 [00:03<00:00,  6.01it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 6153.7 MB
📊 Running totals: 260 successful, 0 failed
\n📦 Processing batch 14/17: symbols 261-280
💾 Memory before batch: 6153.7 MB


Batch 14: 100%|██████████| 20/20 [00:03<00:00,  6.37it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 6209.9 MB
📊 Running totals: 280 successful, 0 failed
\n📦 Processing batch 15/17: symbols 281-300
💾 Memory before batch: 6209.9 MB


Batch 15: 100%|██████████| 20/20 [00:03<00:00,  5.04it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 6177.0 MB
📊 Running totals: 300 successful, 0 failed
\n📦 Processing batch 16/17: symbols 301-320
💾 Memory before batch: 6177.0 MB


Batch 16: 100%|██████████| 20/20 [00:03<00:00,  5.50it/s]


✅ Batch completed: 20 symbols processed
💾 Memory after batch: 6135.9 MB
📊 Running totals: 320 successful, 0 failed
\n📦 Processing batch 17/17: symbols 321-330
💾 Memory before batch: 6135.9 MB


Batch 17: 100%|██████████| 10/10 [00:01<00:00,  5.15it/s]

✅ Batch completed: 10 symbols processed
💾 Memory after batch: 6326.9 MB
📊 Running totals: 330 successful, 0 failed
\n🎯 All batches completed!
✅ Successfully processed: 330 symbols
❌ Failed to process: 0 symbols
📊 Success rate: 100.0%
📊 Sample feature shape: (65159, 9, 30)
💾 Final memory usage: 6326.9 MB
\n📊 SUCCESS! Sample features from STXUSDT:
   Shape: (65159, 9, 30)
   Data type: float64
   Symbols processed: 330
\n📈 Feature statistics (first sample):
   open_norm: mean=1.0002, std=0.0006
   high_norm: mean=1.0146, std=0.0171
   low_norm: mean=0.9869, std=0.0105
   close_norm: mean=1.0012, std=0.0160
   volume_norm: mean=0.9705, std=0.8395
   vwap_ratio: mean=1.0031, std=0.0186
   returns: mean=0.0012, std=0.0160
   volume_ratio: mean=0.9600, std=0.7541
   buy_pressure: mean=0.4680, std=0.0471
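
A back-of-the-envelope check of the memory footprint, using the sample shape (65159, 9, 30) and the float64 dtype reported above. The float32 figure is only a what-if; the notebook stores float64.

```python
import numpy as np

shape = (65159, 9, 30)  # sample shape printed for STXUSDT
n_elements = int(np.prod(shape))

bytes_f64 = n_elements * np.dtype(np.float64).itemsize
bytes_f32 = n_elements * np.dtype(np.float32).itemsize

mib_f64 = bytes_f64 / 1024**2
mib_f32 = bytes_f32 / 1024**2
print(f"{n_elements:,} elements: {mib_f64:.1f} MiB as float64, {mib_f32:.1f} MiB as float32")
```

At about 134 MiB per symbol of this length, a batch of 20 such symbols needs roughly 2.6 GiB, which is of the same order as the growth logged for the first batches (symbols differ in length, so this is only indicative). Because adjacent windows share 29 of their 30 rows, the materialized pictures are about 30 times larger than the underlying feature table.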





## 7. Target Variable Creation

Create standardized return targets for AlphaNet training:
- **24-hour forward returns**: Standardized across all coins at each time period (competition requirement)
- **Cross-sectional standardization**: Z-score normalization within each time period
- **Ranking optimization**: Aligned with weighted Spearman rank correlation evaluation
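
Cross-sectional standardization can be illustrated on a toy table with one row per timestamp and one column per coin. This is a sketch under assumptions: the helper name `cross_sectional_zscore`, the tickers, and the numbers are made up, and rows with too few coins or zero dispersion are set to NaN here.

```python
import numpy as np
import pandas as pd

def cross_sectional_zscore(returns: pd.DataFrame, min_count: int = 10) -> pd.DataFrame:
    """Hypothetical helper: z-score each row (timestamp) across columns (coins)."""
    means = returns.mean(axis=1)
    stds = returns.std(axis=1).replace(0.0, np.nan)  # zero spread -> undefined
    z = returns.sub(means, axis=0).div(stds, axis=0)
    # Timestamps with too few coins get no target at all
    z.loc[returns.count(axis=1) < min_count, :] = np.nan
    return z

returns = pd.DataFrame(
    {"AAA": [0.01, 0.02, np.nan],
     "BBB": [0.03, 0.02, 0.05],
     "CCC": [0.05, 0.02, np.nan]},
    index=pd.date_range("2024-01-01", periods=3, freq="15min"),
)
z = cross_sectional_zscore(returns, min_count=2)
print(z)
```

The first row becomes -1, 0, 1; the second row (identical returns) and the third row (only one coin) are left as NaN. Because a z-score is a monotonic transform within each timestamp, the cross-sectional ranking of coins is preserved.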

In [14]:
def create_target_variables_ultra_fast(crypto_data, symbol_features):
    """Ultra-fast target variable creation with vectorized operations"""

    print("📈 Creating 24-hour forward return targets (ULTRA-FAST)...")

    # Step 1: Calculate forward returns per symbol
    print("Step 1: Calculating forward returns...")
    all_returns = {}

    for symbol, df in tqdm(crypto_data.items(), desc="Forward returns"):
        if symbol not in symbol_features:
            continue

        # 24-hour forward return: 96 bars × 15 min, which assumes a gap-free 15-minute index
        returns = df['close'].pct_change(periods=96, fill_method=None).shift(-96)

        # Align with feature timestamps
        feature_timestamps = pd.to_datetime(symbol_features[symbol]['timestamps'])
        df_returns = pd.Series(returns.values, index=pd.to_datetime(df['timestamp']))

        aligned_returns = df_returns.reindex(feature_timestamps)
        valid_returns = aligned_returns.dropna()

        if len(valid_returns) > 0:
            all_returns[symbol] = valid_returns

    print(f"✅ Calculated returns for {len(all_returns)} symbols")

    # Step 2: OPTIMIZED Cross-sectional standardization
    print("Step 2: Vectorized cross-sectional standardization...")

    # Convert to DataFrame for vectorized operations
    returns_df = pd.DataFrame(all_returns)

    # Standardize across rows (timestamps) in one operation.
    # pandas .std() is the sample standard deviation (ddof=1).
    means = returns_df.mean(axis=1, skipna=True)
    stds = returns_df.std(axis=1, skipna=True)

    # Mask for valid standardization (need at least 10 values)
    valid_mask = returns_df.count(axis=1) >= 10

    # Z-score the valid rows; rows with fewer than 10 values keep their raw returns
    standardized_df = returns_df.copy()
    standardized_df[valid_mask] = returns_df[valid_mask].sub(means[valid_mask], axis=0).div(stds[valid_mask], axis=0)

    print(f"✅ Standardized {valid_mask.sum():,} timestamps")

    # Step 3: Create final target arrays
    print("Step 3: Creating target arrays...")
    target_variables = {}

    for symbol in symbol_features.keys():
        feature_length = len(symbol_features[symbol]['features'])
        feature_timestamps = pd.to_datetime(symbol_features[symbol]['timestamps'])

        # Initialize with NaN
        target_array = np.full(feature_length, np.nan, dtype=np.float32)

        # Fill with standardized values where available
        if symbol in standardized_df.columns:
            # reindex aligns on timestamp; timestamps without a value stay NaN
            target_array = standardized_df[symbol].reindex(feature_timestamps).to_numpy(dtype=np.float32)

        target_variables[symbol] = {'target_1d': target_array}

    return target_variables

# Alternative NumPy implementation (used as a fallback in the next cell)
def create_target_variables_numpy_fast(crypto_data, symbol_features):
    """NumPy-based target creation; the matrix fill and standardization still loop in Python"""

    print("📈 Creating targets with pure numpy (FASTEST)...")

    # Step 1: Prepare data structures
    symbols = [s for s in symbol_features.keys() if s in crypto_data]
    timestamp_to_idx = {}
    symbol_to_idx = {s: i for i, s in enumerate(symbols)}

    # Collect all unique timestamps
    all_timestamps = set()
    for symbol in symbols:
        timestamps = symbol_features[symbol]['timestamps']
        all_timestamps.update(timestamps)

    all_timestamps = sorted(all_timestamps)
    timestamp_to_idx = {ts: i for i, ts in enumerate(all_timestamps)}

    # Pre-allocate returns matrix
    n_symbols = len(symbols)
    n_timestamps = len(all_timestamps)
    returns_matrix = np.full((n_symbols, n_timestamps), np.nan, dtype=np.float32)

    print(f"Matrix size: {n_symbols} symbols × {n_timestamps} timestamps")

    # Step 2: Fill returns matrix
    print("Calculating forward returns...")
    for i, symbol in enumerate(tqdm(symbols)):
        df = crypto_data[symbol]

        # Calculate forward returns
        closes = df['close'].values
        forward_returns = np.full(len(closes), np.nan)
        if len(closes) > 96:
            forward_returns[:-96] = (closes[96:] / closes[:-96]) - 1

        # Map to feature timestamps
        df_timestamps = df['timestamp'].values
        feature_timestamps = symbol_features[symbol]['timestamps']

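        # Timestamp matching using searchsorted (assumes df is sorted by timestamp).
        # searchsorted returns insertion points, so keep only positions that are
        # in range and hold exactly the requested timestamp.
        df_time_idx = np.searchsorted(df_timestamps, feature_timestamps)
        in_range = df_time_idx < len(forward_returns)
        valid_mask = in_range.copy()
        valid_mask[in_range] = df_timestamps[df_time_idx[in_range]] == feature_timestamps[in_range]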

        for j, feat_ts in enumerate(feature_timestamps[valid_mask]):
            if feat_ts in timestamp_to_idx:
                ts_idx = timestamp_to_idx[feat_ts]
                df_idx = df_time_idx[valid_mask][j]
                returns_matrix[i, ts_idx] = forward_returns[df_idx]

    # Step 3: Vectorized standardization
    print("Standardizing cross-sectionally...")

    # Count valid values per timestamp
    valid_counts = np.sum(~np.isnan(returns_matrix), axis=0)

    # Standardize only timestamps with >= 10 values
    for j in range(n_timestamps):
        if valid_counts[j] >= 10:
            col = returns_matrix[:, j]
            valid_mask = ~np.isnan(col)
            if valid_mask.sum() > 1:
                mean_val = col[valid_mask].mean()
                std_val = col[valid_mask].std(ddof=1)  # sample std, matching the pandas version
                if std_val > 1e-8:
                    returns_matrix[valid_mask, j] = (col[valid_mask] - mean_val) / std_val

    # Step 4: Create output
    print("Creating final target arrays...")
    target_variables = {}

    for symbol in symbol_features.keys():
        if symbol not in symbol_to_idx:
            target_variables[symbol] = {'target_1d': np.full(len(symbol_features[symbol]['features']), np.nan, dtype=np.float32)}
            continue

        symbol_idx = symbol_to_idx[symbol]
        feature_timestamps = symbol_features[symbol]['timestamps']
        target_array = np.full(len(feature_timestamps), np.nan, dtype=np.float32)

        for i, ts in enumerate(feature_timestamps):
            if ts in timestamp_to_idx:
                target_array[i] = returns_matrix[symbol_idx, timestamp_to_idx[ts]]

        target_variables[symbol] = {'target_1d': target_array}

    return target_variables
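
The NumPy implementation above locates feature timestamps with `np.searchsorted`. That function returns insertion points, not guaranteed matches, so a timestamp that is missing from the sorted array maps to the next later row unless the match is verified. A small sketch of the check on made-up timestamps (`bar_times` and `wanted` are hypothetical names):

```python
import numpy as np

# Sorted bar timestamps with the 00:30 bar missing (toy data)
bar_times = np.array(
    ["2024-01-01T00:00", "2024-01-01T00:15", "2024-01-01T00:45"],
    dtype="datetime64[m]",
)
wanted = np.array(
    ["2024-01-01T00:15", "2024-01-01T00:30", "2024-01-01T01:00"],
    dtype="datetime64[m]",
)

pos = np.searchsorted(bar_times, wanted)  # insertion points, not guaranteed matches
in_range = pos < len(bar_times)

# Accept a position only if the bar stored there carries the requested timestamp
exact = np.zeros(len(wanted), dtype=bool)
exact[in_range] = bar_times[pos[in_range]] == wanted[in_range]

print(pos)    # [1 2 3]
print(exact)  # [ True False False]
```

Here 00:30 receives position 2, which holds the 00:45 bar, and 01:00 falls past the end of the array; only 00:15 is a genuine match.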

In [15]:
# Execute Section 7: Create 24-hour forward return targets
print("🚀 Section 7: Creating 24-hour forward return targets...")
print(f"💾 Available memory: {psutil.virtual_memory().available / (1024**3):.1f} GB")

# Check prerequisites
if 'filtered_crypto_data' not in globals():
    print("❌ ERROR: filtered_crypto_data not found. Please run sections 1-5 first!")
elif 'symbol_features' not in globals():
    print("❌ ERROR: symbol_features not found. Please run section 6 first!")
else:
    import time
    start_time = time.time()

    # Use the ultra-fast pandas version first (generally more reliable)
    try:
        print("\n📊 Using Ultra-Fast vectorized method...")
        target_variables = create_target_variables_ultra_fast(filtered_crypto_data, symbol_features)

    except Exception as e:
        print(f"⚠️ Ultra-fast method failed: {str(e)}")
        print("📊 Falling back to numpy method...")

        # Fallback to numpy version if pandas version fails
        target_variables = create_target_variables_numpy_fast(filtered_crypto_data, symbol_features)

    # Calculate processing time
    end_time = time.time()
    processing_time = end_time - start_time

    print(f"\n⏱️ Section 7 completed in {processing_time:.1f} seconds ({processing_time/60:.1f} minutes)")

    # Validate results
    total_targets = 0
    valid_symbols = 0
    for symbol, targets in target_variables.items():
        valid_targets = (~np.isnan(targets['target_1d'])).sum()
        total_targets += valid_targets
        if valid_targets > 0:
            valid_symbols += 1

    print(f"\n📈 Results summary:")
    print(f"   Total valid targets created: {total_targets:,}")
    print(f"   Symbols with valid targets: {valid_symbols}")
    if valid_symbols > 0:
        print(f"   Average targets per symbol: {total_targets/valid_symbols:.0f}")

    if total_targets == 0:
        print("\n❌ ERROR: No targets created! Check your data.")
    else:
        # Show target statistics
        # Concatenate per-symbol arrays directly instead of growing a Python list
        all_targets = np.concatenate([
            targets['target_1d'][~np.isnan(targets['target_1d'])]
            for targets in target_variables.values()
        ])

        if all_targets.size > 0:
            print(f"\n📊 Target statistics:")
            print(f"   Mean: {np.mean(all_targets):.6f} (should be ~0)")
            print(f"   Std: {np.std(all_targets):.6f} (should be ~1)")
            print(f"   Min: {np.min(all_targets):.3f}")
            print(f"   Max: {np.max(all_targets):.3f}")
            print(f"   Percentiles [5%, 25%, 50%, 75%, 95%]: {np.percentile(all_targets, [5, 25, 50, 75, 95]).round(3).tolist()}")

        print("\n✅ Section 7 completed successfully!")
        print("📊 Target variables ready for AlphaNet training")

🚀 Section 7: Creating 24-hour forward return targets...
💾 Available memory: 6.5 GB

📊 Using Ultra-Fast vectorized method...
📈 Creating 24-hour forward return targets (ULTRA-FAST)...
Step 1: Calculating forward returns...


Forward returns: 100%|██████████| 330/330 [00:02<00:00, 132.56it/s]


✅ Calculated returns for 330 symbols
Step 2: Vectorized cross-sectional standardization...
✅ Standardized 140,130 timestamps
Step 3: Creating target arrays...

⏱️ Section 7 completed in 4.3 seconds (0.1 minutes)

📈 Results summary:
   Total valid targets created: 22,184,056
   Symbols with valid targets: 330
   Average targets per symbol: 67224

📊 Target statistics:
   Mean: 0.000000 (should be ~0)
   Std: 0.996836 (should be ~1)
   Min: -12.519
   Max: 16.057
   Percentiles [5%, 25%, 50%, 75%, 95%]: [-1.249, -0.47, -0.1, 0.343, 1.527]

✅ Section 7 completed successfully!
📊 Target variables ready for AlphaNet training
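
The pooled standard deviation reported above (0.996836) sits slightly below 1. One plausible contributor, offered as an assumption rather than a verified diagnosis of this run: each row is z-scored with the sample standard deviation (`ddof=1`, the pandas default), while `np.std` on the pooled values uses the population form (`ddof=0`). With N values in every row, the pooled figure then equals sqrt((N-1)/N). A sketch on synthetic data with toy sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_symbols = 1000, 20  # toy sizes, chosen only for illustration
raw = rng.normal(size=(n_rows, n_symbols))

# Row-wise z-scores with the sample standard deviation (ddof=1), as pandas does by default
z = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, ddof=1, keepdims=True)

pooled_std = np.std(z)  # np.std defaults to the population standard deviation (ddof=0)
expected = np.sqrt((n_symbols - 1) / n_symbols)

assert np.isclose(pooled_std, expected)
print(f"pooled std = {pooled_std:.6f}, sqrt((N-1)/N) = {expected:.6f}")
```

In the real data the number of coins varies by timestamp, so the pooled value would reflect a mix of row sizes rather than a single N.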


## 8. Training Structure & Sampling

Create the final AlphaNet training format:
- **Sampling Strategy**: Skip each symbol's first 1500 feature windows (`lookback_periods`), then keep every 2nd window (30-minute spacing)
- **Train/Validation Split**: 70/30 chronological split, applied per symbol
- **Data Format**: Features and targets combined into compressed `.npz` files ready for neural network training
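
The sampling and split rules above reduce to a few lines of index arithmetic. A minimal sketch for a single symbol, assuming a hypothetical `n_features` of 1700 windows (the other values mirror the defaults used in this section):

```python
lookback_periods = 1500
sample_every = 2
train_ratio = 0.7
n_features = 1700  # hypothetical number of feature windows for one symbol

# Skip the first `lookback_periods` windows, then keep every `sample_every`-th one
sample_indices = list(range(lookback_periods, n_features, sample_every))

# Chronological split: the earliest samples go to training, the rest to validation
n_train = int(len(sample_indices) * train_ratio)
train_indices = sample_indices[:n_train]
val_indices = sample_indices[n_train:]

print(len(sample_indices), len(train_indices), len(val_indices))
print(train_indices[-1], val_indices[0])
```

Because the cut is made separately for each symbol, the validation period of a long-listed coin can overlap in calendar time with the training period of a coin listed later.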

In [16]:
# Section 8: FINAL WORKING VERSION
import os
import gc
import numpy as np
import pandas as pd
from tqdm import tqdm
import psutil
import tempfile
import time

def create_alphanet_dataset_final(symbol_features, target_variables, 
                                lookback_periods=1500, 
                                sample_every=2,
                                train_ratio=0.7):
    """
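    Build the AlphaNet train/validation arrays and save them as compressed .npz files.

    For each symbol, skip the first `lookback_periods` feature windows, keep every
    `sample_every`-th window after that, and assign the earliest `train_ratio` share
    of the kept samples to training and the remainder to validation.

    Returns:
        tuple: (train sample count, validation sample count)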
    """

    print("🚀 Starting AlphaNet dataset creation (FINAL VERSION)...")
    print(f"Config: lookback={lookback_periods}, sample_every={sample_every}, train_ratio={train_ratio}")

    # Create output directory
    os.makedirs('data/processed/', exist_ok=True)

    # Step 1: Count and validate samples
    print("\n📊 Step 1: Counting and validating samples...")
    symbol_info = []
    skipped_symbols = []

    for symbol in tqdm(symbol_features.keys(), desc="Validating symbols"):
        # Check if symbol exists in both dictionaries
        if symbol not in target_variables:
            skipped_symbols.append((symbol, "No targets"))
            continue

        # Get lengths
        n_features = len(symbol_features[symbol]['features'])
        n_targets = len(target_variables[symbol]['target_1d'])

        # Check minimum length
        if n_features < lookback_periods + 100:
            skipped_symbols.append((symbol, f"Too few features: {n_features}"))
            continue

        # Check features and targets match
        if n_features != n_targets:
            skipped_symbols.append((symbol, f"Length mismatch: {n_features} vs {n_targets}"))
            continue

        # Calculate samples
        sample_indices = list(range(lookback_periods, n_features, sample_every))
        n_samples = len(sample_indices)

        if n_samples < 20:
            skipped_symbols.append((symbol, f"Too few samples: {n_samples}"))
            continue

        # Calculate split
        train_samples = int(n_samples * train_ratio)
        val_samples = n_samples - train_samples

        symbol_info.append({
            'symbol': symbol,
            'n_samples': n_samples,
            'train_samples': train_samples,
            'val_samples': val_samples,
            'sample_indices': sample_indices
        })

    # Report skipped symbols
    if skipped_symbols:
        print(f"\n⚠️ Skipped {len(skipped_symbols)} symbols:")
        for symbol, reason in skipped_symbols[:5]:
            print(f"   {symbol}: {reason}")
        if len(skipped_symbols) > 5:
            print(f"   ... and {len(skipped_symbols)-5} more")

    # Stop early if nothing passed validation (also avoids dividing by zero below)
    if len(symbol_info) == 0:
        print("❌ No valid symbols found!")
        return 0, 0

    # Calculate totals
    total_train = sum(s['train_samples'] for s in symbol_info)
    total_val = sum(s['val_samples'] for s in symbol_info)
    total_samples = total_train + total_val

    print(f"\n✅ Valid symbols: {len(symbol_info)}")
    print(f"   Total samples: {total_samples:,}")
    print(f"   Train: {total_train:,} ({total_train/total_samples*100:.1f}%)")
    print(f"   Val: {total_val:,} ({total_val/total_samples*100:.1f}%)")

    # Step 2: Create memory-mapped arrays
    print("\n📊 Step 2: Creating memory-mapped arrays...")

    # Use temp directory
    temp_dir = tempfile.gettempdir()

    # Create file paths
    train_features_file = os.path.join(temp_dir, f'alphanet_train_features_{int(time.time())}.dat')
    train_targets_file = os.path.join(temp_dir, f'alphanet_train_targets_{int(time.time())}.dat')
    val_features_file = os.path.join(temp_dir, f'alphanet_val_features_{int(time.time())}.dat')
    val_targets_file = os.path.join(temp_dir, f'alphanet_val_targets_{int(time.time())}.dat')

    print(f"Creating arrays of size: train=({total_train}, 9, 30), val=({total_val}, 9, 30)")

    # Create memory maps
    train_features = np.memmap(train_features_file, dtype='float32', mode='w+', shape=(total_train, 9, 30))
    train_targets = np.memmap(train_targets_file, dtype='float32', mode='w+', shape=(total_train,))
    val_features = np.memmap(val_features_file, dtype='float32', mode='w+', shape=(total_val, 9, 30))
    val_targets = np.memmap(val_targets_file, dtype='float32', mode='w+', shape=(total_val,))

    # Metadata
    train_symbols = []
    train_timestamps = []
    val_symbols = []
    val_timestamps = []

    # Step 3: Process symbols
    print("\n📊 Step 3: Processing symbols one at a time...")
    train_idx = 0
    val_idx = 0
    processed = 0

    for sym_info in tqdm(symbol_info, desc="Processing symbols"):
        try:
            symbol = sym_info['symbol']
            sample_indices = sym_info['sample_indices']
            n_train = sym_info['train_samples']
            n_val = sym_info['val_samples']

            # Get data
            features = symbol_features[symbol]['features'][sample_indices]
            timestamps = symbol_features[symbol]['timestamps'][sample_indices]
            targets = target_variables[symbol]['target_1d'][sample_indices]

            # Convert to float32
            features = features.astype(np.float32)
            targets = targets.astype(np.float32)

            # Write training data
            if n_train > 0:
                train_features[train_idx:train_idx+n_train] = features[:n_train]
                train_targets[train_idx:train_idx+n_train] = targets[:n_train]
                train_symbols.extend([symbol] * n_train)
                train_timestamps.extend(timestamps[:n_train].tolist())
                train_idx += n_train

            # Write validation data
            if n_val > 0:
                val_features[val_idx:val_idx+n_val] = features[n_train:]
                val_targets[val_idx:val_idx+n_val] = targets[n_train:]
                val_symbols.extend([symbol] * n_val)
                val_timestamps.extend(timestamps[n_train:].tolist())
                val_idx += n_val

            processed += 1

            # Periodic flush
            if processed % 50 == 0:
                train_features.flush()
                train_targets.flush()
                val_features.flush()
                val_targets.flush()

        except Exception as e:
            print(f"\n⚠️ Error processing {symbol}: {str(e)}")
            continue

    # Final flush
    train_features.flush()
    train_targets.flush()
    val_features.flush()
    val_targets.flush()

    print(f"\n✅ Processed {processed} symbols successfully")
    print(f"   Actual train samples: {train_idx:,}")
    print(f"   Actual val samples: {val_idx:,}")

    # Step 4: Save to compressed files
    print("\n📊 Step 4: Saving compressed datasets...")

    # Save training data
    print("   Saving training data...")
    try:
        np.savez_compressed(
            'data/processed/alphanet_train.npz',
            features=train_features[:train_idx],
            target_1d=train_targets[:train_idx],
            symbols=np.array(train_symbols),
            timestamps=np.array(train_timestamps)
        )
        print("   ✅ Training data saved")
    except Exception as e:
        print(f"   ❌ Error saving training data: {e}")

    # Clear training memory
    del train_features, train_targets
    gc.collect()

    # Clean up temp files
    try:
        os.remove(train_features_file)
        os.remove(train_targets_file)
    except OSError:
        pass

    # Save validation data
    print("   Saving validation data...")
    try:
        np.savez_compressed(
            'data/processed/alphanet_val.npz',
            features=val_features[:val_idx],
            target_1d=val_targets[:val_idx],
            symbols=np.array(val_symbols),
            timestamps=np.array(val_timestamps)
        )
        print("   ✅ Validation data saved")
    except Exception as e:
        print(f"   ❌ Error saving validation data: {e}")

    # Clear validation memory
    del val_features, val_targets
    gc.collect()

    # Clean up temp files
    try:
        os.remove(val_features_file)
        os.remove(val_targets_file)
    except OSError:
        pass

    # Report final stats
    if os.path.exists('data/processed/alphanet_train.npz') and os.path.exists('data/processed/alphanet_val.npz'):
        train_size = os.path.getsize('data/processed/alphanet_train.npz') / (1024**2)
        val_size = os.path.getsize('data/processed/alphanet_val.npz') / (1024**2)

        print(f"\n🎉 Dataset creation completed successfully!")
        print(f"📦 File sizes:")
        print(f"   Training: {train_size:.1f} MB")
        print(f"   Validation: {val_size:.1f} MB")
        print(f"📊 Final dataset:")
        print(f"   Training samples: {train_idx:,}")
        print(f"   Validation samples: {val_idx:,}")
        print(f"   Feature shape: (N, 9, 30)")
        print(f"   Split: {train_idx/(train_idx+val_idx)*100:.1f}% / {val_idx/(train_idx+val_idx)*100:.1f}%")

    return train_idx, val_idx

# EXECUTION CODE FOR SECTION 8
print("="*60)
print("🚀 SECTION 8: Creating AlphaNet Training Dataset")
print("="*60)

# Check memory
mem = psutil.virtual_memory()
print(f"\n💾 Current memory: {mem.percent:.1f}% used, {mem.available/(1024**3):.1f}GB available")

# Check prerequisites
missing = []
if 'symbol_features' not in globals():
    missing.append('symbol_features')
if 'target_variables' not in globals():
    missing.append('target_variables')

if missing:
    print(f"\n❌ ERROR: Missing prerequisites: {', '.join(missing)}")
    print("   Please run the required sections first!")
else:
    print(f"\n✅ Prerequisites found:")
    print(f"   symbol_features: {len(symbol_features)} symbols")
    print(f"   target_variables: {len(target_variables)} symbols")

    # Clean memory before starting
    gc.collect()

    # Run dataset creation
    print("\n" + "="*60)
    start_time = time.time()

    try:
        train_count, val_count = create_alphanet_dataset_final(
            symbol_features=symbol_features,
            target_variables=target_variables,
            lookback_periods=1500,
            sample_every=2,
            train_ratio=0.7
        )

        elapsed = time.time() - start_time
        print(f"\n⏱️ Total time: {elapsed:.1f} seconds ({elapsed/60:.1f} minutes)")

        if train_count > 0 and val_count > 0:
            print("\n✅ SECTION 8 COMPLETED SUCCESSFULLY!")
            print("🚀 AlphaNet dataset is ready for training!")
        else:
            print("\n⚠️ No data was created. Check the error messages above.")

    except Exception as e:
        print(f"\n❌ FATAL ERROR: {str(e)}")
        import traceback
        traceback.print_exc()

    finally:
        # Final cleanup
        gc.collect()
        mem = psutil.virtual_memory()
        print(f"\n💾 Final memory: {mem.percent:.1f}% used, {mem.available/(1024**3):.1f}GB available")

🚀 SECTION 8: Creating AlphaNet Training Dataset

💾 Current memory: 81.2% used, 6.8GB available

✅ Prerequisites found:
   symbol_features: 330 symbols
   target_variables: 330 symbols

🚀 Starting AlphaNet dataset creation (FINAL VERSION)...
Config: lookback=1500, sample_every=2, train_ratio=0.7

📊 Step 1: Counting and validating samples...


Validating symbols: 100%|██████████| 330/330 [00:00<00:00, 3349.00it/s]



⚠️ Skipped 9 symbols:
   CGPTUSDT: Too few features: 1012
   VANAUSDT: Too few features: 1419
   LUMIAUSDT: Too few features: 1235
   AIXBTUSDT: Too few features: 1015
   MOCAUSDT: Too few features: 1427
   ... and 4 more

✅ Valid symbols: 321
   Total samples: 10,861,708
   Train: 7,603,077 (70.0%)
   Val: 3,258,631 (30.0%)

📊 Step 2: Creating memory-mapped arrays...
Creating arrays of size: train=(7603077, 9, 30), val=(3258631, 9, 30)

📊 Step 3: Processing symbols one at a time...


Processing symbols: 100%|██████████| 321/321 [02:42<00:00,  1.97it/s]



✅ Processed 321 symbols successfully
   Actual train samples: 7,603,077
   Actual val samples: 3,258,631

📊 Step 4: Saving compressed datasets...
   Saving training data...
   ✅ Training data saved
   Saving validation data...
   ✅ Validation data saved

🎉 Dataset creation completed successfully!
📦 File sizes:
   Training: 574.3 MB
   Validation: 266.7 MB
📊 Final dataset:
   Training samples: 7,603,077
   Validation samples: 3,258,631
   Feature shape: (N, 9, 30)
   Split: 70.0% / 30.0%

⏱️ Total time: 271.5 seconds (4.5 minutes)

✅ SECTION 8 COMPLETED SUCCESSFULLY!
🚀 AlphaNet dataset is ready for training!

💾 Final memory: 75.9% used, 8.7GB available
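
The dataset writer above copies targets as they are, so any NaN targets (for example, the last 96 bars of each symbol, which have no 24-hour forward return) are stored in the saved arrays. A minimal sketch of filtering them at load time follows. It writes a tiny stand-in archive with the same keys as the files saved above; the file name `alphanet_train_demo.npz` and all values are hypothetical.

```python
import os
import tempfile

import numpy as np

# Write a tiny stand-in archive with the same keys as the files saved above
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "alphanet_train_demo.npz")  # hypothetical demo file
np.savez_compressed(
    path,
    features=np.zeros((4, 9, 30), dtype=np.float32),
    target_1d=np.array([0.5, np.nan, -1.2, 0.1], dtype=np.float32),
    symbols=np.array(["AAAUSDT", "AAAUSDT", "BBBUSDT", "BBBUSDT"]),
    timestamps=np.array(["2024-01-01T00:00", "2024-01-01T00:30"] * 2, dtype="datetime64[m]"),
)

# Load the archive and drop samples whose target is NaN
with np.load(path) as data:
    features = data["features"]
    targets = data["target_1d"]
    symbols = data["symbols"]

keep = ~np.isnan(targets)
features, targets, symbols = features[keep], targets[keep], symbols[keep]

print(features.shape, targets.shape)
```

For the real files, the same mask could be applied after loading `data/processed/alphanet_train.npz` and `data/processed/alphanet_val.npz`.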
