# Short-term Crypto Price Prediction based on Order Book Dynamics

This notebook aims to implement and explore methods for short-term cryptocurrency price prediction, drawing inspiration from the research paper "Mind the Gaps: Short-term Crypto Price Prediction".

## Objective
Conduct quantitative research on order book dynamics and build predictive models to forecast future prices.

## Scope of Work
1. Download/Fetch order book data.
2. Engineer features from the order book.
3. Create ML/statistical models to predict future price changes.

## Table of Contents
1. Setup and Configuration
2. Data Acquisition
   - Method 1: Fetching from Bybit API
   - Method 2: Loading from a local dataset
3. Data Preprocessing
   - LOB Reconstruction (if necessary)
   - Resampling to Second-Level Data
   - Cleaning
4. Feature Engineering
   - Mid-Price and Spread
   - Volume-Adjusted Mid-Price (VAMP)
   - Trade Imbalance (TI)
   - Quote Imbalance (QI)
5. Target Variable Creation
6. Stationarity Checks and Transformations
7. Model Building and Evaluation
   - Data Splitting
   - Linear Regression
   - Logistic Regression (for classification)
   - Decision Tree
   - Random Forest
   - XGBoost
   - Support Vector Machine (SVM)
   - Neural Network (MLP)
8. Conclusion and Future Work

---
## 1. Setup and Configuration
Import necessary libraries and configure settings.

In [None]:
import pandas as pd
import numpy as np
import ccxt
import time
import datetime
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import SVR, SVC
import xgboost as xgb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.stattools import adfuller

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
sns.set_style('whitegrid')

### Configuration Parameters
Set your API keys for Bybit here if you choose Method 1.
Set the path to your local dataset if you choose Method 2.
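
If you prefer not to hardcode credentials in the notebook, one option is to read them from environment variables. The sketch below is a minimal example under that assumption; the helper `load_bybit_credentials` and the variable names `BYBIT_API_KEY` and `BYBIT_SECRET_KEY` are illustrative choices, not a Bybit or `ccxt` convention.

```python
import os

def load_bybit_credentials(env=os.environ):
    """Return (api_key, secret) from the environment, or empty strings if unset."""
    # Assumed variable names; rename them to match your own setup.
    api_key = env.get("BYBIT_API_KEY", "")
    secret = env.get("BYBIT_SECRET_KEY", "")
    return api_key, secret

api_key, secret = load_bybit_credentials()
print("Bybit credentials found:", bool(api_key and secret))
```

The returned values could then be assigned to the configuration constants in place of the placeholder strings.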

In [None]:
# --- Configuration ---
DATA_SOURCE_METHOD = 'local_dataset' # 'bybit_api' or 'local_dataset'

# For Bybit API (Method 1)
BYBIT_API_KEY = 'YOUR_API_KEY'
BYBIT_SECRET_KEY = 'YOUR_SECRET_KEY'
BYBIT_SYMBOL = 'BTC/USDT' # Example symbol
BYBIT_LOB_DEPTH = 25 # Number of levels for order book depth
BYBIT_TRADE_LIMIT = 1000 # Number of recent trades to fetch

# For Local Dataset (Method 2)
# Assume CSV with columns like: timestamp, best_bid_price, best_bid_qty, best_ask_price, best_ask_qty, ...
# Or trade data: timestamp, price, volume, side
LOCAL_ORDERBOOK_FILEPATH = 'path/to/your/orderbook_data.csv'
LOCAL_TRADES_FILEPATH = 'path/to/your/trades_data.csv'

# Feature Engineering & Modeling Parameters
VAMP_LIQUIDITY_CUTOFF = 60000 # In dollars, as per PDF findings
TRADE_IMBALANCE_WINDOW = '60s' # 60 seconds for TI calculation (lowercase 's'; the uppercase 'S' alias is deprecated in recent pandas)
QUOTE_IMBALANCE_LEVELS = 5 # Number of LOB levels for QI
PREDICTION_HORIZON_SECONDS = 60 # Predict price change 60 seconds ahead
TARGET_TYPE = 'regression' # 'regression' or 'classification' (for direction: up/down/neutral)

---
## 2. Data Acquisition

### Method 1: Fetching from Bybit API (using `ccxt`)
This section contains functions to fetch Limit Order Book (LOB) data and trade data from Bybit.

**Note**: Live data fetching can be complex due to rate limits, data consistency, and the need for continuous streaming for a proper LOB reconstruction. The following provides a snapshot approach. For high-frequency analysis, a dedicated WebSocket connection and data storage solution would be necessary.
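
A continuous collection strategy could start from a simple polling loop. The sketch below is one possible shape for it, not the paper's method: it flattens a `ccxt`-style order book dictionary (`bids` and `asks` as lists of `[price, qty]`, plus a millisecond `timestamp`) into one row per snapshot. The helpers `flatten_order_book` and `collect_snapshots` are hypothetical, the fetch function is a stand-in so the example runs without network access, and level `L1` is assumed here to be the best quote.

```python
import time

def flatten_order_book(order_book, levels):
    """Turn a ccxt-style order book dict into one flat row of per-level columns."""
    row = {"timestamp": order_book.get("timestamp")}
    for side, key in (("bid", "bids"), ("ask", "asks")):
        entries = order_book.get(key, [])
        for i in range(levels):
            # Levels missing from a thin book are left as None
            if i < len(entries):
                price, qty = entries[i][0], entries[i][1]
            else:
                price, qty = None, None
            row[f"{side}_price_L{i + 1}"] = price
            row[f"{side}_qty_L{i + 1}"] = qty
    return row

def collect_snapshots(fetch_fn, n_snapshots, interval_s=1.0, levels=5, sleep_fn=time.sleep):
    """Call fetch_fn() n_snapshots times, pausing interval_s seconds between calls."""
    rows = []
    for k in range(n_snapshots):
        rows.append(flatten_order_book(fetch_fn(), levels))
        if k < n_snapshots - 1:
            sleep_fn(interval_s)
    return rows

# Stand-in for a real call such as exchange.fetch_l2_order_book(symbol, limit=depth)
def fake_fetch():
    return {
        "timestamp": 1672531200000,
        "bids": [[20000.0, 1.5], [19999.5, 0.7]],
        "asks": [[20000.5, 2.0], [20001.0, 0.3]],
    }

rows = collect_snapshots(fake_fetch, n_snapshots=3, interval_s=0.0, levels=2)
print(len(rows), rows[0]["bid_price_L1"], rows[0]["ask_qty_L2"])
```

The resulting list of rows can be passed to `pd.DataFrame(rows)` to obtain a table in the same wide layout that Method 2 assumes. Polling remains a coarse approximation; it misses any quote updates between two calls.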

In [None]:
def fetch_bybit_order_book(symbol, depth):
    """Fetches the current order book snapshot from Bybit."""
    exchange = ccxt.bybit({
        'apiKey': BYBIT_API_KEY,
        'secret': BYBIT_SECRET_KEY,
        # 'enableRateLimit': True, # Optional: built-in rate limiting
    })
    try:
        order_book = exchange.fetch_l2_order_book(symbol, limit=depth)
        timestamp = order_book['timestamp']
        bids = pd.DataFrame(order_book['bids'], columns=['price', 'qty'])
        bids['side'] = 'bid'
        asks = pd.DataFrame(order_book['asks'], columns=['price', 'qty'])
        asks['side'] = 'ask'
        
        # Add cumulative volume for VAMP calculation later
        bids = bids.sort_values('price', ascending=False).reset_index(drop=True)
        asks = asks.sort_values('price', ascending=True).reset_index(drop=True)
        bids['cumulative_qty'] = bids['qty'].cumsum()
        bids['cumulative_value_usd'] = (bids['price'] * bids['qty']).cumsum() # Assuming price is in USDT for BTC/USDT
        asks['cumulative_qty'] = asks['qty'].cumsum()
        asks['cumulative_value_usd'] = (asks['price'] * asks['qty']).cumsum()

        return bids, asks, timestamp
    except Exception as e:
        print(f"Error fetching Bybit order book for {symbol}: {e}")
        return None, None, None

def fetch_bybit_trades(symbol, limit):
    """Fetches recent trades from Bybit."""
    exchange = ccxt.bybit({
        'apiKey': BYBIT_API_KEY,
        'secret': BYBIT_SECRET_KEY,
    })
    try:
        trades = exchange.fetch_trades(symbol, limit=limit)
        df_trades = pd.DataFrame(trades)
        if not df_trades.empty:
            df_trades['timestamp'] = pd.to_datetime(df_trades['timestamp'], unit='ms')
            df_trades = df_trades[['timestamp', 'price', 'amount', 'side', 'cost']] # 'cost' is amount * price
            df_trades.rename(columns={'amount': 'qty', 'cost': 'value_usd'}, inplace=True)
            return df_trades.sort_values('timestamp').reset_index(drop=True)
        return pd.DataFrame()
    except Exception as e:
        print(f"Error fetching Bybit trades for {symbol}: {e}")
        return pd.DataFrame()

# Example usage for Bybit API (if DATA_SOURCE_METHOD is 'bybit_api')
if DATA_SOURCE_METHOD == 'bybit_api':
    print("Attempting to fetch data from Bybit API...")
    # For a real analysis, you'd call these functions repeatedly and store the data.
    # This is a single snapshot example.
    bids_df, asks_df, lob_timestamp = fetch_bybit_order_book(BYBIT_SYMBOL, BYBIT_LOB_DEPTH)
    trades_df_api = fetch_bybit_trades(BYBIT_SYMBOL, BYBIT_TRADE_LIMIT)

    if bids_df is not None and asks_df is not None:
        print(f"\nOrder Book Snapshot at {pd.to_datetime(lob_timestamp, unit='ms')} (UTC):")
        print("Top 5 Bids:\n", bids_df.head())
        print("Top 5 Asks:\n", asks_df.head())
    if not trades_df_api.empty:
        print("\nRecent Trades:\n", trades_df_api.head())
    # For a full analysis, you would need to aggregate these snapshots over time
    # or use websockets to stream data. The PDF uses historical tick data.
    # Here, we'll create a placeholder for subsequent processing steps.
    # In a real scenario, you'd build up 'all_order_book_data' and 'all_trades_data' over time.
    all_order_book_data_raw = [] # List to store (timestamp, bids_df, asks_df) tuples
    all_trades_data_raw = trades_df_api # Placeholder

    # For demonstration, let's assume we collected a few snapshots
    # This part needs a proper data collection loop for a real application
    if bids_df is not None: # Add the single snapshot
        all_order_book_data_raw.append({'timestamp': lob_timestamp, 'bids': bids_df, 'asks': asks_df})
    # Create a combined DataFrame from these snapshots (simplified for demo)
    # This is where you would structure your collected LOB snapshots into a usable format.
    # For now, it's hard to proceed with features without continuous LOB data.
    # The PDF used "end of day snapshots" and "quote by quote updates"
    # which are then processed.

    # To proceed with a single snapshot for feature calculation demonstration (limited use):
    # best_bid_price = bids_df['price'].iloc[0] if not bids_df.empty else np.nan
    # best_ask_price = asks_df['price'].iloc[0] if not asks_df.empty else np.nan
    # This is insufficient for time-series modeling.
    print("\nNote: Live API fetching requires a continuous data collection strategy for robust analysis.")
    print("The current API example provides a single snapshot.")

### Method 2: Loading from a Local Dataset
Load data from a CSV file. The PDF uses three months of tick-level data from Bitstamp.
You'll need to adapt the loading based on your dataset's format.

For this example, we'll assume two files:
1.  `orderbook_data.csv`: Contains time-series of LOB snapshots (e.g., best bid/ask, and deeper levels if available).
    Columns might be: `timestamp`, `best_bid_price`, `best_bid_qty`, `best_ask_price`, `best_ask_qty`, `bid_price_L2`, `bid_qty_L2`, ... `ask_price_L5`, `ask_qty_L5`.
2.  `trades_data.csv`: Contains historical trades.
    Columns might be: `timestamp`, `price`, `qty`, `side` (e.g., 'buy' or 'sell').
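
As a starting point for replacing the synthetic sample data, the following sketch reads an order book file in the layout described above and validates it. It is written under assumptions: the helper `load_orderbook_csv` is hypothetical, the `timestamp` column is assumed to hold strings that `pd.to_datetime` can parse, and dropping crossed rows is an illustrative cleaning choice rather than something the paper prescribes.

```python
import io
import pandas as pd

REQUIRED_LOB_COLUMNS = [
    "timestamp", "best_bid_price", "best_bid_qty", "best_ask_price", "best_ask_qty",
]

def load_orderbook_csv(source):
    """Read an order book CSV, validate its columns, and index it by timestamp."""
    df = pd.read_csv(source)
    missing = [c for c in REQUIRED_LOB_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Order book file is missing columns: {missing}")
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.set_index("timestamp").sort_index()
    # Drop crossed or locked rows, where the best ask is not above the best bid
    return df[df["best_ask_price"] > df["best_bid_price"]]

# In-memory stand-in for LOCAL_ORDERBOOK_FILEPATH: unsorted, with one crossed row
sample_csv = io.StringIO(
    "timestamp,best_bid_price,best_bid_qty,best_ask_price,best_ask_qty\n"
    "2023-01-01 00:00:01,20000.0,1.0,20000.5,2.0\n"
    "2023-01-01 00:00:00,19999.0,0.5,19999.5,1.5\n"
    "2023-01-01 00:00:02,20001.0,1.0,20000.0,1.0\n"
)
df_lob_example = load_orderbook_csv(sample_csv)
print(df_lob_example)
```

Passing `LOCAL_ORDERBOOK_FILEPATH` instead of the in-memory buffer would load a real file, and a missing file would then raise `FileNotFoundError`.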

In [None]:
if DATA_SOURCE_METHOD == 'local_dataset':
    print(f"Loading data from local files: {LOCAL_ORDERBOOK_FILEPATH} and {LOCAL_TRADES_FILEPATH}")
    try:
        # --- Load Order Book Data ---
        # This is a simplified loading. Your actual data might be structured differently.
        # For example, it might be tick-by-tick updates that you need to reconstruct into LOB snapshots.
        # The PDF mentions starting with end-of-day snapshots and then using quote updates.
        # For simplicity, let's assume 'orderbook_data.csv' is already somewhat processed
        # to have best bid/ask at each timestamp, and potentially deeper levels for QI.
        # If you have raw LOB data (price levels and quantities), you'll need more complex parsing.
        
        # Placeholder: Create sample data if files don't exist, to allow the notebook to run.
        # In a real scenario, replace this with actual data loading.
        sample_timestamps = pd.to_datetime(pd.date_range(start='2023-01-01', periods=10000, freq='100ms').values)
        
        # Sample Order Book Data (replace with your actual data loading)
        data_lob = {
            'timestamp': sample_timestamps,
            'best_bid_price': np.random.uniform(20000, 20100, 10000),
            'best_bid_qty': np.random.uniform(0.1, 5, 10000),
            'best_ask_price': np.random.uniform(20101, 20200, 10000),
            'best_ask_qty': np.random.uniform(0.1, 5, 10000),
        }
        # Add deeper levels for Quote Imbalance
        for i in range(1, QUOTE_IMBALANCE_LEVELS + 1): # L1 to L5
            data_lob[f'bid_price_L{i}'] = data_lob['best_bid_price'] - i * np.random.uniform(0.5, 2, 10000)
            data_lob[f'bid_qty_L{i}'] = np.random.uniform(0.1, 3, 10000)
            data_lob[f'ask_price_L{i}'] = data_lob['best_ask_price'] + i * np.random.uniform(0.5, 2, 10000)
            data_lob[f'ask_qty_L{i}'] = np.random.uniform(0.1, 3, 10000)

        # Ensure asks are higher than bids
        data_lob['best_ask_price'] = np.maximum(data_lob['best_ask_price'], data_lob['best_bid_price'] + 0.01)
        for i in range(1, QUOTE_IMBALANCE_LEVELS + 1):
            data_lob[f'ask_price_L{i}'] = np.maximum(data_lob[f'ask_price_L{i}'], data_lob[f'bid_price_L{i}'] + (i*0.1))


        df_lob_raw = pd.DataFrame(data_lob)
        df_lob_raw['timestamp'] = pd.to_datetime(df_lob_raw['timestamp'])
        df_lob_raw.set_index('timestamp', inplace=True)
        
        # --- Load Trades Data ---
        # Sample Trades Data (replace with your actual data loading)
        data_trades = {
            'timestamp': sample_timestamps, # Align with LOB for simplicity here
            'price': ((df_lob_raw['best_bid_price'] + df_lob_raw['best_ask_price']) / 2).to_numpy(), # Plain array, so the frame is not indexed by the Series before set_index
            'qty': np.random.uniform(0.01, 1, 10000),
            'side': np.random.choice(['buy', 'sell'], 10000),
        }
        df_trades_raw = pd.DataFrame(data_trades)
        df_trades_raw['timestamp'] = pd.to_datetime(df_trades_raw['timestamp'])
        df_trades_raw['value_usd'] = df_trades_raw['price'] * df_trades_raw['qty']
        df_trades_raw.set_index('timestamp', inplace=True)
        
        print("Sample LOB data (raw head):\n", df_lob_raw.head())
        print("Sample Trades data (raw head):\n", df_trades_raw.head())

    except FileNotFoundError:
        print(f"Error: One or both data files not found. Please check paths: {LOCAL_ORDERBOOK_FILEPATH}, {LOCAL_TRADES_FILEPATH}")
        df_lob_raw = pd.DataFrame()
        df_trades_raw = pd.DataFrame()
    except Exception as e:
        print(f"Error loading local data: {e}")
        df_lob_raw = pd.DataFrame()
        df_trades_raw = pd.DataFrame()
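else:
    # 'bybit_api' is a valid choice, but the cell above yields only a single snapshot,
    # so there is no time series to preprocess. Any other value is a configuration error.
    if DATA_SOURCE_METHOD != 'bybit_api':
        print("Invalid DATA_SOURCE_METHOD selected.")
    df_lob_raw = pd.DataFrame()
    df_trades_raw = pd.DataFrame()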

---
## 3. Data Preprocessing
The PDF mentions condensing the full dataset to consider full seconds rather than every tick update to reduce data size.
We will resample the data to 1-second intervals.

**Note on LOB Reconstruction for VAMP/QI from Tick Data:**
If `df_lob_raw` is from tick-by-tick updates (not snapshots), you'd need a more sophisticated LOB reconstruction process before this step. You would maintain the state of the order book at each tick and then sample it every second. The PDF mentions: "Starting with the end of day snapshot, we then used each quote update to first update the order book, and then calculated and stored the important features".
For simplicity, if using `local_dataset`, we assume `df_lob_raw` provides snapshots or already has necessary levels available at each timestamp.
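
The reconstruction described in that quote can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each update is a tuple `(timestamp_s, side, price, qty)` where `qty` is the new total quantity resting at that price and `qty == 0` removes the level. The helpers `apply_update`, `best_bid_ask`, and `replay` are hypothetical names.

```python
def apply_update(book, side, price, qty):
    """Set the resting quantity at a price level, or remove the level when qty is 0."""
    levels = book[side]
    if qty <= 0:
        levels.pop(price, None)
    else:
        levels[price] = qty

def best_bid_ask(book):
    """Return (best_bid, best_ask); either is None when that side is empty."""
    best_bid = max(book["bid"]) if book["bid"] else None
    best_ask = min(book["ask"]) if book["ask"] else None
    return best_bid, best_ask

def replay(snapshot, updates):
    """Replay updates over a snapshot and keep the best bid/ask at the end of each second."""
    book = {"bid": dict(snapshot["bid"]), "ask": dict(snapshot["ask"])}
    per_second = {}
    for ts, side, price, qty in updates:
        apply_update(book, side, price, qty)
        # Later updates within the same second overwrite earlier ones
        per_second[int(ts)] = best_bid_ask(book)
    return per_second

snapshot = {"bid": {100.0: 1.0, 99.5: 2.0}, "ask": {100.5: 1.0, 101.0: 3.0}}
updates = [
    (0.2, "bid", 100.0, 0.0),  # best bid is cancelled
    (0.7, "ask", 100.4, 0.5),  # new best ask
    (1.3, "bid", 100.2, 1.0),  # new best bid
]
print(replay(snapshot, updates))
```

Seconds without any update do not appear in the result, so they would need to be forward-filled afterwards. Features such as VAMP or QI could be computed from `book` at the same point where the best bid/ask is stored.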

In [None]:
if not df_lob_raw.empty:
    # Resample LOB data to 1-second frequency. Use last observation in interval.
    # Adjust aggregation methods as needed (e.g., 'ohlc' for prices, 'sum' for quantities if meaningful)
    # For features like best_bid/ask, 'last' is often appropriate.
    # The PDF uses "full seconds throughout the day".
    
    agg_dict_lob = {}
    # Best bid/ask
    if 'best_bid_price' in df_lob_raw.columns: agg_dict_lob['best_bid_price'] = 'last'
    if 'best_bid_qty' in df_lob_raw.columns: agg_dict_lob['best_bid_qty'] = 'last' # Or 'sum' if ticks are updates within the second
    if 'best_ask_price' in df_lob_raw.columns: agg_dict_lob['best_ask_price'] = 'last'
    if 'best_ask_qty' in df_lob_raw.columns: agg_dict_lob['best_ask_qty'] = 'last' # Or 'sum'
    
    # Deeper levels for QI
    for i in range(1, QUOTE_IMBALANCE_LEVELS + 1):
        if f'bid_price_L{i}' in df_lob_raw.columns: agg_dict_lob[f'bid_price_L{i}'] = 'last'
        if f'bid_qty_L{i}' in df_lob_raw.columns: agg_dict_lob[f'bid_qty_L{i}'] = 'last' # Or 'sum'
        if f'ask_price_L{i}' in df_lob_raw.columns: agg_dict_lob[f'ask_price_L{i}'] = 'last'
        if f'ask_qty_L{i}' in df_lob_raw.columns: agg_dict_lob[f'ask_qty_L{i}'] = 'last' # Or 'sum'

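    df_lob_sec = df_lob_raw.resample('1s').agg(agg_dict_lob) if agg_dict_lob else pd.DataFrame()
    # Ensure core data is present; only reference columns that exist to avoid a KeyError
    core_cols = [c for c in ['best_bid_price', 'best_ask_price'] if c in df_lob_sec.columns]
    if core_cols:
        df_lob_sec = df_lob_sec.dropna(subset=core_cols)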
    print("Resampled LOB data (1-second frequency, head):\n", df_lob_sec.head())
else:
    df_lob_sec = pd.DataFrame()

if not df_trades_raw.empty:
    # Resample trades data to 1-second frequency
    # Sum quantities and values, VWAP for price
    agg_dict_trades = {
        'price': 'last', # Could also be VWAP: lambda x: (x * df_trades_raw.loc[x.index, 'qty']).sum() / df_trades_raw.loc[x.index, 'qty'].sum() if not x.empty else np.nan
        'qty': 'sum',
        'value_usd': 'sum'
    }
    df_trades_sec = df_trades_raw.resample('1s').agg(agg_dict_trades)
    
    # For trade imbalance, we need buy/sell volumes separately
    buy_volume_sec = df_trades_raw[df_trades_raw['side'] == 'buy']['qty'].resample('1s').sum().rename('buy_qty_sum')
    sell_volume_sec = df_trades_raw[df_trades_raw['side'] == 'sell']['qty'].resample('1s').sum().rename('sell_qty_sum')
    buy_value_sec = df_trades_raw[df_trades_raw['side'] == 'buy']['value_usd'].resample('1s').sum().rename('buy_value_sum')
    sell_value_sec = df_trades_raw[df_trades_raw['side'] == 'sell']['value_usd'].resample('1s').sum().rename('sell_value_sum')

    df_trades_info_sec = pd.concat([df_trades_sec, buy_volume_sec, sell_volume_sec, buy_value_sec, sell_value_sec], axis=1)
    df_trades_info_sec.fillna({'buy_qty_sum': 0, 'sell_qty_sum': 0, 'buy_value_sum':0, 'sell_value_sum':0}, inplace=True)

    print("Resampled Trades data (1-second frequency, head):\n", df_trades_info_sec.head())
else:
    df_trades_info_sec = pd.DataFrame()

# Combine LOB and Trades data based on timestamp (outer join to keep all data points)
df_combined = pd.DataFrame()
if not df_lob_sec.empty and not df_trades_info_sec.empty:
    df_combined = pd.merge(df_lob_sec, df_trades_info_sec, left_index=True, right_index=True, how='outer')
elif not df_lob_sec.empty:
    df_combined = df_lob_sec
elif not df_trades_info_sec.empty:
    df_combined = df_trades_info_sec

# Forward fill missing LOB data (common for seconds without a quote update).
# No backfill: it would copy future quotes into earlier rows (look-ahead bias).
# Leading rows that are still NaN are dropped below.
if 'best_bid_price' in df_combined.columns: # Check if LOB columns exist
    lob_cols_to_ffill = [col for col in df_lob_sec.columns if col in df_combined.columns]
    df_combined[lob_cols_to_ffill] = df_combined[lob_cols_to_ffill].ffill()

# Fill NaNs for trade volumes/values with 0 for calculation purposes
trade_qty_val_cols = ['qty', 'value_usd', 'buy_qty_sum', 'sell_qty_sum', 'buy_value_sum', 'sell_value_sum']
for col in trade_qty_val_cols:
    if col in df_combined.columns:
        df_combined[col] = df_combined[col].fillna(0) # Assign back; in-place fillna on a column selection is unreliable

# Both sides of the book are critical for mid-price; skip when the LOB columns are absent
if {'best_bid_price', 'best_ask_price'}.issubset(df_combined.columns):
    df_combined = df_combined.dropna(subset=['best_bid_price', 'best_ask_price'])
print("Combined and preprocessed data (head):\n", df_combined.head())

---
## 4. Feature Engineering
Based on the paper "Mind the Gaps".
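
The trade imbalance computed in the cell below uses equal weights for simplicity, while its comments describe a linearly weighted version in which more recent seconds count more. A vectorized sketch of that weighted variant is given here. It assumes weights 1 to N from oldest to newest over a window of N rows, one row per second, and inputs with no NaNs; the exact weighting scheme in the paper may differ, and `weighted_trade_imbalance` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def weighted_trade_imbalance(buy, sell, window):
    """Linearly weighted trade imbalance over the trailing `window` observations."""
    weights = np.arange(1, window + 1, dtype=float)  # oldest -> newest

    def weighted_sum(values):
        # Left-pad with zeros so early rows use only the history that exists
        padded = np.concatenate([np.zeros(window - 1), values])
        windows = np.lib.stride_tricks.sliding_window_view(padded, window)
        return windows @ weights

    buy_v = buy.to_numpy(dtype=float)
    sell_v = sell.to_numpy(dtype=float)
    num = weighted_sum(buy_v - sell_v)
    den = weighted_sum(buy_v + sell_v)
    # Windows with no traded value get an imbalance of 0
    ti = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return pd.Series(ti, index=buy.index, name="trade_imbalance_weighted")

buy = pd.Series([1.0, 0.0, 2.0])
sell = pd.Series([0.0, 1.0, 0.0])
print(weighted_trade_imbalance(buy, sell, window=2))
```

With `df_combined`, the call would be `weighted_trade_imbalance(df_combined['buy_value_sum'], df_combined['sell_value_sum'], 60)`. The sliding-window view avoids the slow `rolling().apply()` path at the cost of materializing an N-column view of the data.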

In [None]:
# Ensure we have the necessary base columns
if df_combined.empty or not all(col in df_combined.columns for col in ['best_bid_price', 'best_ask_price']):
    print("Skipping feature engineering due to missing base LOB data (best_bid_price, best_ask_price).")
else:
    # --- Mid-Price and Spread ---
    df_combined['mid_price'] = (df_combined['best_bid_price'] + df_combined['best_ask_price']) / 2
    df_combined['spread'] = df_combined['best_ask_price'] - df_combined['best_bid_price']

    # --- Volume-Adjusted Mid-Price (VAMP) ---
    # VAMP_v = (P_b + P_a) / 2, where v is a notional volume in dollars (VAMP_LIQUIDITY_CUTOFF).
    # The paper writes P_b = sum_i(P_b_i * V_b_i) / v, and likewise P_a for the ask side.
    # The helper below uses this interpretation: P_b (P_a) is the average price obtained when
    # walking the bid (ask) side of the book until a value of v dollars has been transacted, i.e.
    #   P_b = sum_i(P_b_i * V_b_i) / sum_i(V_b_i), with V_b_i in base-currency quantity
    #   and the levels taken until sum_i(P_b_i * V_b_i) reaches v.
    # This needs several LOB levels for every row of df_combined. With only best bid/ask,
    # VAMP is trivial or impossible to compute.
    # The sample data created for 'local_dataset' has levels L1-L5 only, which is typically
    # not enough depth to reach a $60k cutoff, so the values below are illustrative.
    # We assume columns named 'bid_price_L{i}', 'bid_qty_L{i}', 'ask_price_L{i}', 'ask_qty_L{i}'.
    
    def calculate_weighted_price_for_vamp(levels_prices, levels_qty, target_value_usd):
        """ Helper for VAMP. Calculates weighted price for one side. """
        cumulative_value = 0
        weighted_price_sum = 0
        total_qty_for_value = 0
        
        for price, qty in zip(levels_prices, levels_qty):
            if pd.isna(price) or pd.isna(qty) or qty == 0:
                continue
            
            value_at_level = price * qty
            if cumulative_value + value_at_level >= target_value_usd:
                remaining_value_needed = target_value_usd - cumulative_value
                qty_to_take = remaining_value_needed / price
                weighted_price_sum += price * qty_to_take
                total_qty_for_value += qty_to_take
                cumulative_value += remaining_value_needed
                break
            else:
                weighted_price_sum += price * qty
                total_qty_for_value += qty
                cumulative_value += value_at_level
        
        return (weighted_price_sum / total_qty_for_value) if total_qty_for_value > 0 else np.nan

    # This VAMP is illustrative and depends heavily on having granular LOB data available in df_combined
    # The sample data does not have enough depth for a $60k VAMP typically.
    # You'd need to parse your actual LOB data to provide these lists of prices/quantities for each timestamp.
    # For demonstration, using available L1-L5 from sample data:
    vamp_bids_p, vamp_bids_q, vamp_asks_p, vamp_asks_q = [], [], [], []
    for i in range(1, QUOTE_IMBALANCE_LEVELS + 1): # Using up to L5 from sample
        if f'bid_price_L{i}' in df_combined.columns: vamp_bids_p.append(df_combined[f'bid_price_L{i}'])
        if f'bid_qty_L{i}' in df_combined.columns: vamp_bids_q.append(df_combined[f'bid_qty_L{i}'])
        if f'ask_price_L{i}' in df_combined.columns: vamp_asks_p.append(df_combined[f'ask_price_L{i}'])
        if f'ask_qty_L{i}' in df_combined.columns: vamp_asks_q.append(df_combined[f'ask_qty_L{i}'])

    if vamp_bids_p and vamp_bids_q and vamp_asks_p and vamp_asks_q:
        # Transpose to iterate row-wise for apply
        vamp_bids_p_T = pd.concat(vamp_bids_p, axis=1).transpose()
        vamp_bids_q_T = pd.concat(vamp_bids_q, axis=1).transpose()
        vamp_asks_p_T = pd.concat(vamp_asks_p, axis=1).transpose()
        vamp_asks_q_T = pd.concat(vamp_asks_q, axis=1).transpose()

        Pb_vamp_list = []
        Pa_vamp_list = []
        for idx in df_combined.index:
            if idx not in vamp_bids_p_T.columns: continue # If timestamp missing after resample
            bid_prices_at_ts = vamp_bids_p_T[idx].values
            bid_qtys_at_ts = vamp_bids_q_T[idx].values
            ask_prices_at_ts = vamp_asks_p_T[idx].values
            ask_qtys_at_ts = vamp_asks_q_T[idx].values
            
            Pb_vamp_list.append(calculate_weighted_price_for_vamp(bid_prices_at_ts, bid_qtys_at_ts, VAMP_LIQUIDITY_CUTOFF))
            Pa_vamp_list.append(calculate_weighted_price_for_vamp(ask_prices_at_ts, ask_qtys_at_ts, VAMP_LIQUIDITY_CUTOFF))
        
        if Pb_vamp_list and Pa_vamp_list:
             # Align indices if there were skipped timestamps
            valid_indices = [idx for idx in df_combined.index if idx in vamp_bids_p_T.columns][:len(Pb_vamp_list)]
            df_combined.loc[valid_indices, 'Pb_vamp'] = Pb_vamp_list
            df_combined.loc[valid_indices, 'Pa_vamp'] = Pa_vamp_list
            df_combined['vamp'] = (df_combined['Pb_vamp'] + df_combined['Pa_vamp']) / 2
            df_combined['vamp_mid_diff'] = df_combined['mid_price'] - df_combined['vamp'] # As used in paper's plots
        else:
            df_combined['vamp'] = df_combined['mid_price'] # Fallback
            df_combined['vamp_mid_diff'] = 0

    else: # Fallback if not enough LOB levels
        df_combined['vamp'] = df_combined['mid_price']
        df_combined['vamp_mid_diff'] = 0
        print("Warning: VAMP calculation could not be performed due to missing deep LOB data. Using mid_price as fallback for VAMP.")


    # --- Trade Imbalance (TI) ---
    # TI = sum(Y * (Vbuy - Vsell)) / sum(Y * (Vbuy + Vsell)) over past 1 min (60s)
    # Y is linear weight, closer time has larger weight.
    # For simplicity, let's use equal weights for now (rolling sum).
    # The paper uses $Y_i$ as a linear weight. For a 60s window, weights could be 1/60, 2/60, ..., 60/60.
    # Or, more simply, recent trades get higher weight. Or just sum over window.
    # Let's use value (USD) for Vbuy and Vsell, assuming 'buy_value_sum' and 'sell_value_sum' are from df_trades_info_sec
    
    if 'buy_value_sum' in df_combined.columns and 'sell_value_sum' in df_combined.columns:
        rolling_window_size_ti = int(pd.Timedelta(TRADE_IMBALANCE_WINDOW).total_seconds()) # Ensure integer for rolling
        
        # Numerator: Sum of (buy_value - sell_value)
        diff_value = df_combined['buy_value_sum'] - df_combined['sell_value_sum']
        # Denominator: Sum of (buy_value + sell_value)
        total_value = df_combined['buy_value_sum'] + df_combined['sell_value_sum']

        # Linear weights (e.g., 1 to N for a window of N seconds); defined for reference, not applied below
        weights = np.arange(1, rolling_window_size_ti + 1)

        # Weighted rolling sum
        # Note: Pandas rolling().apply() can be slow. For performance, custom Cython or Numba might be needed for large datasets.
        # For simplicity, if not applying complex weights, a simple .sum() is faster.
        # Let's use simple sum for now as weighted rolling sum is more complex with standard pandas.
        # To implement linear weights as in the paper, you'd need a custom rolling function or a library that supports weighted rolling sums efficiently.
        # E.g., (diff_value.rolling(window=rolling_window_size_ti).apply(lambda x: np.sum(weights[:len(x)] * x) / np.sum(weights[:len(x)]), raw=True))

        numerator_ti = diff_value.rolling(window=rolling_window_size_ti, min_periods=1).sum()
        denominator_ti = total_value.rolling(window=rolling_window_size_ti, min_periods=1).sum()
        
        # 0/0 gives NaN when no trades fall in the window; treat that as balanced (0).
        # Assign the result back instead of calling fillna in place on a column selection.
        df_combined['trade_imbalance'] = (numerator_ti / denominator_ti).fillna(0)
        df_combined['trade_imbalance'] = np.clip(df_combined['trade_imbalance'], -1, 1) # TI is -1 to 1
    else:
        df_combined['trade_imbalance'] = 0
        print("Warning: Trade Imbalance calculation skipped due to missing trade value columns.")


    # --- Quote Imbalance (QI) ---
    # QI_L=k = (sum(Vb_i) - sum(Va_i)) / (sum(Vb_i) + sum(Va_i)) for i=1 to k levels
    # Using quantities at different LOB levels.
    # Collect the quantity columns that are actually present for levels 1..k
    qi_levels = range(1, QUOTE_IMBALANCE_LEVELS + 1)
    bid_qty_cols = [f'bid_qty_L{i}' for i in qi_levels if f'bid_qty_L{i}' in df_combined.columns]
    ask_qty_cols = [f'ask_qty_L{i}' for i in qi_levels if f'ask_qty_L{i}' in df_combined.columns]

    if bid_qty_cols and ask_qty_cols:
        sum_bid_qty_L = df_combined[bid_qty_cols].fillna(0).sum(axis=1)
        sum_ask_qty_L = df_combined[ask_qty_cols].fillna(0).sum(axis=1)
        numerator_qi = sum_bid_qty_L - sum_ask_qty_L
        denominator_qi = sum_bid_qty_L + sum_ask_qty_L
        # 0/0 gives NaN when both sides are empty; treat that as balanced (0). QI is -1 to 1.
        df_combined['quote_imbalance'] = (numerator_qi / denominator_qi).fillna(0).clip(-1, 1)
    else:
        df_combined['quote_imbalance'] = 0
        print("Warning: Quote Imbalance calculation skipped due to missing LOB quantity columns.")

    print("Data with engineered features (head):\n", df_combined[['mid_price', 'spread', 'vamp', 'vamp_mid_diff', 'trade_imbalance', 'quote_imbalance']].head())
    
    # Drop intermediate LOB level columns if they are no longer needed for modeling to save memory
    cols_to_drop_intermediate = []
    for i in range(1, QUOTE_IMBALANCE_LEVELS + 1):
        cols_to_drop_intermediate.extend([f'bid_price_L{i}', f'bid_qty_L{i}', f'ask_price_L{i}', f'ask_qty_L{i}'])
    cols_to_drop_intermediate.extend(['Pb_vamp', 'Pa_vamp']) # VAMP intermediate calcs
    # df_combined.drop(columns=[col for col in cols_to_drop_intermediate if col in df_combined.columns], inplace=True)

---
## 5. Target Variable Creation
We want to predict future price changes. The PDF uses look-ahead windows from 1s to 60s.
Let's use `PREDICTION_HORIZON_SECONDS`.
Target: $MidPrice_{t+\Delta t} - MidPrice_t$ (absolute change) or % change.
Or, for classification: sign of change.
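
Shifting by a fixed number of rows, as the cell below does, equals a shift of $\Delta t$ seconds only when every second is present in the index. If rows were dropped during preprocessing, a timestamp-based lookup is safer. The sketch below shows one way to do that, assuming a unique `DatetimeIndex`; `future_mid_change` is a hypothetical helper, not part of the paper.

```python
import pandas as pd

def future_mid_change(mid, horizon_s):
    """Mid-price change over exactly horizon_s seconds; NaN when t + horizon has no row."""
    horizon = pd.Timedelta(seconds=horizon_s)
    # Look up the value stored at t + horizon for every timestamp t
    future = mid.reindex(mid.index + horizon)
    future.index = mid.index
    return (future - mid).rename("target_price_change")

# Example with a missing second (00:00:03)
mid = pd.Series(
    [10.0, 11.0, 12.0, 14.0],
    index=pd.to_datetime([
        "2023-01-01 00:00:00", "2023-01-01 00:00:01",
        "2023-01-01 00:00:02", "2023-01-01 00:00:04",
    ]),
)
print(future_mid_change(mid, horizon_s=2))
```

Rows whose target timestamp is missing come back as NaN instead of silently borrowing a price from a different horizon, so they can be dropped together with the trailing rows.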

In [None]:
if 'mid_price' not in df_combined.columns or df_combined.empty:
    print("Skipping target variable creation as mid_price is missing.")
else:
    df_combined['future_mid_price'] = df_combined['mid_price'].shift(-PREDICTION_HORIZON_SECONDS)
    
    if TARGET_TYPE == 'regression':
        # Predicting the actual price change
        df_combined['target_price_change'] = df_combined['future_mid_price'] - df_combined['mid_price']
        # Or percentage change:
        # df_combined['target_price_change_pct'] = (df_combined['future_mid_price'] - df_combined['mid_price']) / df_combined['mid_price'] * 100
        TARGET_COLUMN = 'target_price_change'
    
    elif TARGET_TYPE == 'classification':
        # Predicting price movement direction (-1 for down, 0 for neutral, 1 for up)
        # The PDF uses strict inequalities for binary classification
        # And thresholds (sigma) for multiclass
        price_diff = df_combined['future_mid_price'] - df_combined['mid_price']
        
        # Simple binary: up (1) or down (-1), excluding no change
        # df_combined['target_direction'] = np.sign(price_diff)
        # df_combined = df_combined[df_combined['target_direction'] != 0] # Optional: remove neutral cases

        # Tri-class: up (2), neutral (1), down (0) (common for classification)
        # Define a threshold for "neutral" e.g., based on spread or a small fixed value
        neutral_threshold = df_combined['spread'].mean() * 0.1 # Example threshold
        df_combined['target_direction'] = 1 # Neutral
        df_combined.loc[price_diff > neutral_threshold, 'target_direction'] = 2 # Up
        df_combined.loc[price_diff < -neutral_threshold, 'target_direction'] = 0 # Down
        TARGET_COLUMN = 'target_direction'

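    # Remove rows where the future price is NaN. 'future_mid_price' is listed explicitly because the
    # classification target defaults to 1 (neutral) and would otherwise keep the last rows.
    df_final = df_combined.dropna(subset=['future_mid_price', TARGET_COLUMN]).copy()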
    
    if not df_final.empty:
        print(f"Final data with target variable '{TARGET_COLUMN}' (head):\n", df_final[['mid_price', 'future_mid_price', TARGET_COLUMN] + [col for col in ['vamp_mid_diff', 'trade_imbalance', 'quote_imbalance'] if col in df_final.columns]].head())
        print(f"\nTarget variable ({TARGET_COLUMN}) distribution:")
        if TARGET_TYPE == 'regression':
            df_final[TARGET_COLUMN].plot(kind='hist', bins=50, title='Target Price Change Distribution')
            plt.show()
            print(df_final[TARGET_COLUMN].describe())
        elif TARGET_TYPE == 'classification':
            print(df_final[TARGET_COLUMN].value_counts(normalize=True))
    else:
        print("DataFrame is empty after target creation and NaN removal. Check data or prediction horizon.")

---
## 6. Stationarity Checks and Transformations
Time series data for financial modeling often requires features to be stationary.
The target variable (price change or returns) is usually stationary.
We should check engineered features like `vamp_mid_diff`, `trade_imbalance`, `quote_imbalance`.
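
To build intuition for why differencing helps, the sketch below compares a simulated random walk with its first difference using lag-1 autocorrelation. This is an informal diagnostic on synthetic data, shown as a complement to the ADF test used in the next cell and not as a replacement for it. The series and seed are arbitrary.

```python
import numpy as np
import pandas as pd

# A random walk is a textbook non-stationary series; its increments are stationary
rng = np.random.default_rng(0)
random_walk = pd.Series(rng.normal(size=2000).cumsum(), name="level")
differenced = random_walk.diff().dropna()

print("lag-1 autocorrelation, level:      ", round(random_walk.autocorr(lag=1), 3))
print("lag-1 autocorrelation, differenced:", round(differenced.autocorr(lag=1), 3))
```

The level series stays strongly correlated with its own past, while the differenced series does not. The same `diff()` call can be applied to any feature that the ADF test flags as non-stationary.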

In [None]:
def check_stationarity(series, series_name=''):
    """Performs ADF test and prints results."""
    if series.empty or series.isnull().all():
        print(f"Series {series_name} is empty or all NaN. Skipping stationarity check.")
        return
    print(f'\nStationarity Test for {series_name}:')
    result = adfuller(series.dropna()) # Drop NaNs for test
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))
    if result[1] <= 0.05:
        print(f"Result: Likely Stationary (p-value <= 0.05) for {series_name}")
    else:
        print(f"Result: Likely Non-Stationary (p-value > 0.05) for {series_name}")

feature_columns_for_model = []
if 'df_final' in locals() and not df_final.empty:
    # Features to check and potentially use
    potential_features = ['spread', 'vamp_mid_diff', 'trade_imbalance', 'quote_imbalance']
    # Add other features like lagged mid_price changes, volatility, etc. if desired

    for col in potential_features:
        if col in df_final.columns:
            check_stationarity(df_final[col], col)
            # If a feature is found non-stationary, consider differencing:
            # df_final[f'{col}_diff'] = df_final[col].diff()
            # Then check stationarity of df_final[f'{col}_diff'] and use it in the model.
            # For simplicity, we'll use original features here but this step is crucial.
            feature_columns_for_model.append(col)
    
    if TARGET_TYPE == 'regression': # Target itself should be stationary if it's a difference
        check_stationarity(df_final[TARGET_COLUMN], TARGET_COLUMN)
    
    # Ensure no NaN in features or target chosen for modeling after any transformations
    df_model_ready = df_final[feature_columns_for_model + [TARGET_COLUMN]].copy()
    df_model_ready.dropna(inplace=True)

    if df_model_ready.empty:
        print("DataFrame became empty after dropping NaNs for model features. Check feature engineering and stationarity steps.")
    else:
        print(f"\nFeatures selected for modeling: {feature_columns_for_model}")
else:
    print("Skipping stationarity checks as df_final is not available or empty.")
    df_model_ready = pd.DataFrame()

---
## 7. Model Building and Evaluation
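
The cell below uses a single chronological 80/20 split. A walk-forward scheme, which is the idea behind scikit-learn's `TimeSeriesSplit`, evaluates on several later windows instead. The sketch below is a hand-rolled illustration of that idea under simplifying assumptions: `expanding_window_splits` is a hypothetical helper, and its fold sizes are not guaranteed to match scikit-learn's.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) with every test fold after its training data."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("Not enough samples for the requested number of splits.")
    for k in range(1, n_splits + 1):
        train_end = fold_size * k
        # The last fold absorbs any leftover samples
        test_end = n_samples if k == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(expanding_window_splits(n_samples=20, n_splits=3))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

With such a scheme, the scaler would be fitted on each training window separately so that no statistics from a later window leak into the features.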

In [None]:
if 'df_model_ready' not in locals() or df_model_ready.empty:
    print("Skipping model building as data is not ready.")
else:
    X = df_model_ready[feature_columns_for_model]
    y = df_model_ready[TARGET_COLUMN]

    # --- Data Splitting (Time Series Aware) ---
    # Using TimeSeriesSplit for cross-validation would be more robust.
    # For a simple train/test split:
    train_size_pct = 0.8
    split_idx = int(len(X) * train_size_pct)
    
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

    print(f"Training set size: {X_train.shape[0]}, Test set size: {X_test.shape[0]}")

    # --- Feature Scaling ---
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Store results
    model_results = {}

    def evaluate_model(name, model, X_test_data, y_true, y_pred):
        if TARGET_TYPE == 'regression':
            mse = mean_squared_error(y_true, y_pred)
            r2 = r2_score(y_true, y_pred)
            print(f"{name} - MSE: {mse:.4f}, R2: {r2:.4f}")
            model_results[name] = {'MSE': mse, 'R2': r2}
            
            # Plot predictions vs actuals for regression
            plt.figure(figsize=(10, 6))
            plt.scatter(y_true, y_pred, alpha=0.5, label='Predicted vs Actual')
            plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'k--', lw=2, label='Perfect Prediction')
            plt.xlabel("Actual Values")
            plt.ylabel("Predicted Values")
            plt.title(f"{name} - Predictions vs Actuals")
            plt.legend()
            plt.show()

        elif TARGET_TYPE == 'classification':
            accuracy = accuracy_score(y_true, y_pred)
            precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
            recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
            f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
            print(f"{name} - Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
            model_results[name] = {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1': f1}

    # --- Model Implementations ---

    # 1. Linear Regression (for regression target) / Logistic Regression (for classification)
    if TARGET_TYPE == 'regression':
        print("\n--- Linear Regression ---")
        model_lr = LinearRegression()
        model_lr.fit(X_train_scaled, y_train)
        y_pred_lr = model_lr.predict(X_test_scaled)
        evaluate_model("Linear Regression", model_lr, X_test_scaled, y_test, y_pred_lr)
    elif TARGET_TYPE == 'classification':
        print("\n--- Logistic Regression ---")
        # The default lbfgs solver fits a multinomial model, which suits the three-class target
        model_logr = LogisticRegression(random_state=42, max_iter=1000)
        model_logr.fit(X_train_scaled, y_train)
        y_pred_logr = model_logr.predict(X_test_scaled)
        evaluate_model("Logistic Regression", model_logr, X_test_scaled, y_test, y_pred_logr)

    # 2. Decision Tree
    print("\n--- Decision Tree ---")
    if TARGET_TYPE == 'regression':
        model_dt = DecisionTreeRegressor(random_state=42, max_depth=10, min_samples_split=10)
        model_dt.fit(X_train_scaled, y_train) # Can use unscaled X_train for trees too
        y_pred_dt = model_dt.predict(X_test_scaled)
        evaluate_model("Decision Tree Regressor", model_dt, X_test_scaled, y_test, y_pred_dt)
    elif TARGET_TYPE == 'classification':
        model_dtc = DecisionTreeClassifier(random_state=42, max_depth=10, min_samples_split=10)
        model_dtc.fit(X_train_scaled, y_train)
        y_pred_dtc = model_dtc.predict(X_test_scaled)
        evaluate_model("Decision Tree Classifier", model_dtc, X_test_scaled, y_test, y_pred_dtc)

    # 3. Random Forest
    print("\n--- Random Forest ---")
    if TARGET_TYPE == 'regression':
        model_rf = RandomForestRegressor(n_estimators=100, random_state=42, max_depth=10, min_samples_split=10, n_jobs=-1)
        model_rf.fit(X_train_scaled, y_train)
        y_pred_rf = model_rf.predict(X_test_scaled)
        evaluate_model("Random Forest Regressor", model_rf, X_test_scaled, y_test, y_pred_rf)
    elif TARGET_TYPE == 'classification':
        model_rfc = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10, min_samples_split=10, n_jobs=-1)
        model_rfc.fit(X_train_scaled, y_train)
        y_pred_rfc = model_rfc.predict(X_test_scaled)
        evaluate_model("Random Forest Classifier", model_rfc, X_test_scaled, y_test, y_pred_rfc)

    # 4. XGBoost
    print("\n--- XGBoost ---")
    if TARGET_TYPE == 'regression':
        model_xgb = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42, max_depth=7, learning_rate=0.1, n_jobs=-1)
        model_xgb.fit(X_train_scaled, y_train)
        y_pred_xgb = model_xgb.predict(X_test_scaled)
        evaluate_model("XGBoost Regressor", model_xgb, X_test_scaled, y_test, y_pred_xgb)
    elif TARGET_TYPE == 'classification':
        # Pick the objective from the number of classes seen in training.
        # XGBoost expects integer labels 0..n_classes-1 (the tri-class target uses 0, 1, 2).
        n_classes_xgb = len(np.unique(y_train))

        model_xgbc_params = {
            'n_estimators': 100, 'random_state': 42, 'max_depth': 7, 'learning_rate': 0.1, 'n_jobs': -1
        }
        if n_classes_xgb > 2:
            model_xgbc_params['objective'] = 'multi:softmax'
            model_xgbc_params['num_class'] = n_classes_xgb
        else:
            model_xgbc_params['objective'] = 'binary:logistic'

        model_xgbc = xgb.XGBClassifier(**model_xgbc_params)
        model_xgbc.fit(X_train_scaled, y_train)
        y_pred_xgbc = model_xgbc.predict(X_test_scaled)
        evaluate_model("XGBoost Classifier", model_xgbc, X_test_scaled, y_test, y_pred_xgbc)
        
    # 5. Support Vector Machine (SVM)
    # SVM can be slow on large datasets. Consider using a subset of data if training takes too long.
    print("\n--- Support Vector Machine (SVM) ---")
    if TARGET_TYPE == 'regression':
        model_svr = SVR(kernel='rbf', C=1.0, epsilon=0.1) # Parameters can be tuned
        model_svr.fit(X_train_scaled, y_train)
        y_pred_svr = model_svr.predict(X_test_scaled)
        evaluate_model("SVR", model_svr, X_test_scaled, y_test, y_pred_svr)
    elif TARGET_TYPE == 'classification':
        model_svc = SVC(kernel='rbf', C=1.0, random_state=42, probability=True) # probability=True for predict_proba if needed
        model_svc.fit(X_train_scaled, y_train)
        y_pred_svc = model_svc.predict(X_test_scaled)
        evaluate_model("SVC", model_svc, X_test_scaled, y_test, y_pred_svc)

    # 6. Neural Network (MLP)
    print("\n--- Neural Network (MLP) ---")
    def create_mlp(input_dim, num_classes=1, classification=False):
        model = Sequential()
        model.add(Dense(64, input_dim=input_dim, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(32, activation='relu'))
        model.add(Dropout(0.2))
        if classification:
            if num_classes == 1 or num_classes == 2: # Binary classification
                 model.add(Dense(1, activation='sigmoid'))
                 model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
            else: # Multiclass classification
                 model.add(Dense(num_classes, activation='softmax'))
                 model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) # use sparse if y is integer encoded
        else: # Regression
            model.add(Dense(1, activation='linear')) # Linear activation for regression
            model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])
        return model

    if TARGET_TYPE == 'regression':
        model_mlp = create_mlp(X_train_scaled.shape[1], classification=False)
        early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
        model_mlp.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_split=0.1, callbacks=[early_stop], verbose=0)
        y_pred_mlp = model_mlp.predict(X_test_scaled).flatten()
        evaluate_model("MLP Regressor", model_mlp, X_test_scaled, y_test, y_pred_mlp)
    elif TARGET_TYPE == 'classification':
        num_classes_nn = len(np.unique(y_train))
        model_mlpc = create_mlp(X_train_scaled.shape[1], num_classes=num_classes_nn, classification=True)
        early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
        # Ensure y_train is appropriate for the loss function (e.g. 0,1 for binary_crossentropy; 0,1,2 for sparse_categorical)
        model_mlpc.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_split=0.1, callbacks=[early_stop], verbose=0)
        
        y_pred_mlpc_proba = model_mlpc.predict(X_test_scaled)
        if num_classes_nn <= 2: # Binary: threshold the sigmoid output
            y_pred_mlpc = (y_pred_mlpc_proba > 0.5).astype(int).flatten()
        else: # Multiclass: take the most probable class
            y_pred_mlpc = np.argmax(y_pred_mlpc_proba, axis=1)
            
        evaluate_model("MLP Classifier", model_mlpc, X_test_scaled, y_test, y_pred_mlpc)

    # --- Display Final Results ---
    print("\n--- Model Performance Summary ---")
    results_df = pd.DataFrame(model_results).T
    print(results_df)

---
## 8. Conclusion and Future Work

This notebook provided a framework for acquiring, preprocessing, and modeling cryptocurrency order book data for short-term price prediction. Key features inspired by "Mind the Gaps" such as Mid-Price, Spread, VAMP, Trade Imbalance, and Quote Imbalance were implemented.

**Observations from this run (based on sample data/placeholder logic):**
* (Actual observations will depend on the real data and model performance)
* The VAMP feature was noted in the paper as a strong predictor. Its effectiveness here depends on the quality and depth of the LOB data used.
* Trade Imbalance and Quote Imbalance aim to capture buying versus selling pressure, from executed trades and from resting quotes respectively.

**Future Work:**
* **Robust Data Pipeline**: Implement a more robust data acquisition pipeline, especially for live API data (e.g., using WebSockets for continuous LOB updates and reconstruction).
* **Advanced Feature Engineering**:
    * Explore more sophisticated weighting for Trade Imbalance.
    * Test different VAMP liquidity cutoffs and QI levels systematically.
    * Incorporate features like realized volatility, order flow toxicity, or market impact models.
* **Hyperparameter Tuning**: Systematically tune hyperparameters for each ML model (e.g., using GridSearchCV or RandomizedSearchCV with TimeSeriesSplit).
* **Stationarity**: Rigorously ensure all features used in models are stationary. Apply transformations like differencing if needed and re-evaluate.
* **Model Ensembling/Stacking**: Combine predictions from multiple models to potentially improve performance.
* **Recurrent Neural Networks**: Explore sequence architectures such as LSTMs or GRUs, which model temporal dependence directly instead of treating each second as an independent sample.
* **Alternative Prediction Targets**: Expand on the binary and multiclass classification approaches from the paper, especially predicting one-standard-deviation price movements.
* **Backtesting Framework**: Develop a rigorous backtesting framework that accounts for transaction costs, slippage, and realistic trading conditions. The P&L metric in the paper is a good starting point.
* **Expand Dataset**: Analyze data across different crypto assets and exchanges, and longer time periods, including diverse market conditions (e.g., high volatility periods).
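
As a starting point for the backtesting item above, the following is a minimal sketch of a per-step P&L calculation with transaction costs. It does not reproduce the paper's P&L metric: the function `simple_pnl`, the position encoding (-1, 0, +1), and the flat cost per unit of position change are all assumptions made for illustration, and slippage is ignored.

```python
import numpy as np

def simple_pnl(signals, price_changes, cost_per_trade=0.0):
    """Per-step P&L of holding `signals` (-1, 0, +1) over the matching price changes.

    A cost is charged per unit of position change, including the initial entry.
    """
    signals = np.asarray(signals, dtype=float)
    price_changes = np.asarray(price_changes, dtype=float)
    gross = signals * price_changes
    # Position change at each step, starting from a flat (zero) position
    turnover = np.abs(np.diff(signals, prepend=0.0))
    return gross - cost_per_trade * turnover

signals = [1, 1, 0, -1]                  # long, long, flat, short
price_changes = [2.0, -1.0, 3.0, -4.0]   # future mid-price change for each step
pnl = simple_pnl(signals, price_changes, cost_per_trade=0.5)
print(pnl, pnl.sum())
```

In this toy example the gross P&L is 5.0 and three position changes at 0.5 each reduce it to 3.5, which shows how quickly costs can erode a short-horizon signal.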

---
### References from "Mind the Gaps" used in this notebook:
- [1] Martin, P., Line Jr., W., Feng, Y., Yang, Y., Zheng, S., Qi, S., & Zhu, B. (2022). *Mind the Gaps: Short-term Crypto Price Prediction*. Cornell University. Available at SSRN: https://ssrn.com/abstract=4351947
- [16] Prediction at time scales from one second to 60 seconds.
- [18] Volume-Adjusted Mid-Price as the ultimate short-term predictor.
- [19, 20] Data sourcing: Bitstamp, three full months, tick level.
- [21] Initial feature calculation: spread, mid-price, best bid/ask, volume-adjusted versions.
- [25, 26] Condensing dataset to full seconds.
- [48, 49] Volume-Adjusted Mid-Price (VAMP) definition and formula.
- [52] Plotting (mid-price - VAMP) against returns.
- [54, 55, 159] VAMP volume cutoffs, settling on $50k-$60k range, specifically $60k.
- [56, 57, 58] Trade Imbalance (TI) definition, formula with linear weight, range -1 to 1.
- [72] Using 1-minute window for Trade Imbalance.
- [88, 89, 94] Quote Imbalance (QI) definition, formula, range -1 to 1, using up to level 5.
- [93] QI relationship becoming more linear with deeper levels.
- [128] Trading P&L metric introduction.
- [152] Binary classification setup: strict inequalities for price change prediction.
- [171] Multiclass classification setup: one standard deviation thresholds.
- [195] Expanding data to include diverse BTC data and volatile conditions.

Note: Citation numbers in the markdown cells (e.g., `[cite: X]`) refer to page numbers or specific findings in the provided PDF "Mind-the-Gaps-Short-term-Crypto-Price-Prediction-2022.pdf".