# Pearson Correlation Coefficient (PCC) for Model Evaluation

The Pearson correlation coefficient (ρ) measures the linear correlation between two variables, in our case between the actual labels (y) and predicted values (ŷ) on the private test set. It ranges from -1 to +1, where:

- +1 indicates perfect positive linear correlation
- 0 indicates no linear correlation  
- -1 indicates perfect negative linear correlation

The formula for PCC is:

$$\rho = \frac{\mathrm{Cov}(y,\hat{y})}{\sigma_y \cdot \sigma_{\hat{y}}}$$

Where:
- Cov(y,ŷ) is the covariance between actual and predicted values
- σy is the standard deviation of actual values
- σŷ is the standard deviation of predicted values
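
The definition can be checked numerically. The snippet below is a minimal sketch, not part of the competition code: `pearson_corr` is a helper name introduced here for illustration, and the sample values are made up. It computes ρ directly from the covariance and standard deviations and compares the result with NumPy's `np.corrcoef`.

```python
import numpy as np

def pearson_corr(y_true, y_pred):
    """Pearson correlation computed as Cov(y, y_hat) / (sigma_y * sigma_y_hat)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Population covariance and population standard deviations (ddof=0 for both),
    # so the normalisation constants cancel in the ratio.
    cov = np.mean((y_true - y_true.mean()) * (y_pred - y_pred.mean()))
    return cov / (y_true.std() * y_pred.std())

# Illustrative values only, not competition data.
y = np.array([0.1, -0.2, 0.4, 0.0, -0.3])
y_hat = np.array([0.05, -0.1, 0.3, 0.1, -0.2])

rho = pearson_corr(y, y_hat)
# Cross-check against NumPy's built-in correlation matrix.
assert np.isclose(rho, np.corrcoef(y, y_hat)[0, 1])
print(f"rho = {rho:.5f}")
```

The same value is returned whether the sample or the population convention is used for covariance and standard deviation, as long as the convention is the same in numerator and denominator.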

For this competition, the evaluation metric is the Pearson correlation between our predictions and the true labels on the private test set. A higher positive correlation indicates better predictive performance, as our predictions more closely track the actual market movements.

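Key implications:
- We want our predictions to move in the same direction as the actual values
- ρ is unchanged when the predictions are shifted by a constant or multiplied by a positive constant, so the absolute scale of the predictions does not affect the score; what matters is how closely they co-vary with the actual values
- As a rule of thumb, a correlation of 0.7 or higher would indicate strong predictive performance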
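
One property worth verifying is that Pearson correlation ignores the scale and offset of the predictions. The following sketch uses synthetic data (the arrays and the `corr` helper are assumptions made for illustration) to show that rescaling or shifting predictions leaves the score unchanged, while flipping their sign negates it.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, synthetic data only
y = rng.normal(size=1000)                 # stand-in for true labels
y_hat = 0.5 * y + rng.normal(size=1000)   # noisy stand-in predictions

def corr(a, b):
    """Pearson correlation via NumPy's correlation matrix."""
    return np.corrcoef(a, b)[0, 1]

base = corr(y, y_hat)
scaled = corr(y, 100.0 * y_hat)   # multiply predictions by a positive constant
shifted = corr(y, y_hat + 3.0)    # add a constant offset
flipped = corr(y, -y_hat)         # reverse the sign of every prediction

assert np.isclose(base, scaled)
assert np.isclose(base, shifted)
assert np.isclose(flipped, -base)
```

In practice this means post-processing steps such as rescaling the final predictions cannot change the leaderboard score, whereas anything that alters how predictions co-vary with the labels can.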


In [19]:
# DRW - Crypto Market Prediction Analysis
# Import necessary libraries

import pandas as pd
import numpy as np
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
from matplotlib import pyplot as plt
from matplotlib.ticker import MaxNLocator, FormatStrFormatter, PercentFormatter
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import KFold
import os
import zipfile
import warnings
warnings.filterwarnings('ignore')

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print("Ready to analyze DRW Crypto Market Prediction data...")


Libraries imported successfully!
Ready to analyze DRW Crypto Market Prediction data...


In [8]:
# Load and examine the crypto market data
# First, let's identify the main data files (both CSV and Parquet)
data_dir = "./data/"
all_files = os.listdir(data_dir)
csv_files = [f for f in all_files if f.endswith('.csv')]
parquet_files = [f for f in all_files if f.endswith('.parquet')]

# Load the main dataset(s)
train_data = None
test_data = None
sample_submission = None

# Function to load data based on file extension
def load_data_file(filepath):
    if filepath.endswith('.csv'):
        return pd.read_csv(filepath)
    elif filepath.endswith('.parquet'):
        return pd.read_parquet(filepath)
    else:
        raise ValueError(f"Unsupported file format: {filepath}")

# Try to identify train, test, and submission files
all_data_files = csv_files + parquet_files
for file in all_data_files:
    filepath = os.path.join(data_dir, file)
    
    if 'train' in file.lower():
        train_data = load_data_file(filepath)
        
    elif 'test' in file.lower() and 'submission' not in file.lower():
        test_data = load_data_file(filepath)
        
    elif 'submission' in file.lower():
        sample_submission = load_data_file(filepath)


# Display basic information about the datasets
if train_data is not None:
    print("\nTraining Data:")
    print(f"Shape: {train_data.shape}")
    print("\nFirst few rows (showing first 7 columns):")
    print(train_data.iloc[:5, :7])

if test_data is not None:
    print("\nTest Data:")
    print(f"Shape: {test_data.shape}")
    print("\nFirst few rows (showing first 7 columns):")
    print(test_data.iloc[:5, :7])

if sample_submission is not None:
    print("\nSample Submission:")
    print(f"Shape: {sample_submission.shape}")
    print("\nFirst few rows:")
    print(sample_submission.head())

if train_data is None and test_data is None:
    print("No train or test data loaded. Please check if the data files exist.")



Training Data:
Shape: (525887, 896)

First few rows (showing first 7 columns):
                     bid_qty  ask_qty  buy_qty  sell_qty   volume        X1  \
timestamp                                                                     
2023-03-01 00:00:00   15.283    8.425  176.405    44.984  221.389  0.121263   
2023-03-01 00:01:00   38.590    2.336  525.846   321.950  847.796  0.302841   
2023-03-01 00:02:00    0.442   60.250  159.227   136.369  295.596  0.167462   
2023-03-01 00:03:00    4.865   21.016  335.742   124.963  460.705  0.072944   
2023-03-01 00:04:00   27.158    3.451   98.411    44.407  142.818  0.173820   

                           X2  
timestamp                      
2023-03-01 00:00:00 -0.417690  
2023-03-01 00:01:00 -0.049576  
2023-03-01 00:02:00 -0.291212  
2023-03-01 00:03:00 -0.436590  
2023-03-01 00:04:00 -0.213489  

Test Data:
Shape: (538150, 896)

First few rows (showing first 7 columns):
    bid_qty  ask_qty  buy_qty  sell_qty   volume        X1        

In [None]:
target_col = 'label'

if target_col in train_data.columns:
    X = train_data.drop(target_col, axis=1)
    y = train_data[target_col]

    if target_col in test_data.columns:
        X_test = test_data.drop(target_col, axis=1)
    else:
        X_test = test_data.copy() # Use .copy() to avoid SettingWithCopyWarning later

    print("Data successfully split into training and testing sets.")
    print(f"X (training features) shape: {X.shape}")
    print(f"y (training target) shape: {y.shape}")
    print(f"X_test (test features) shape: {X_test.shape}")

else:
    print(f"Error: Target column '{target_col}' not found in the training data.")
    print("Please update the 'target_col' variable with the correct column name.")


Data successfully split into training and testing sets.
X (training features) shape: (525887, 895)
y (training target) shape: (525887,)
X_test (test features) shape: (538150, 895)


In [None]:
# Define the list of features identified as most important through prior analysis.
# This combines domain-specific features and data-driven anonymous features.

features = [
    # Top features identified from the referenced Kaggle notebook
    "X863", "X856", "X344", "X598", "X862", "X385", "X852", "X603", "X860", "X674",
    "X415", "X345", "X137", "X855", "X174", "X302", "X178", "X532", "X168", "X612",
    
    # Core market microstructure features
    "bid_qty", "ask_qty", "buy_qty", "sell_qty", "volume"
]

X = X[features]
X_test = X_test[features]

print("Feature selection applied.")
print(f"New shape of X (training features): {X.shape}")
print(f"New shape of X_test (test features): {X_test.shape}")

Feature selection applied.
New shape of X (training features): (525887, 25)
New shape of X_test (test features): (538150, 25)


In [None]:
import xgboost as xgb
from sklearn.model_selection import KFold
import numpy as np

# --- XGBoost Model Training with Cross-Validation ---

# 1. Define Model Parameters
# These are starting parameters. The best values are typically found through hyperparameter tuning.
params = {
    'n_estimators': 500,          # Number of boosting rounds.
    'learning_rate': 0.05,        # Step size shrinkage to prevent overfitting.
    'max_depth': 4,               # Maximum depth of a tree.
    'subsample': 0.8,             # Fraction of samples to be used for fitting the individual base learners.
    'colsample_bytree': 0.8,      # Fraction of columns to be used when constructing each tree.
    'objective': 'reg:squarederror', # Specifies the learning task and objective.
    'random_state': 42,           # Seed for reproducibility.
    'n_jobs': -1,                 # Use all available CPU cores for training.
    'eval_metric': 'rmse'         # Validation metric, set here so it is available for early stopping.
}

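# 2. Set up Cross-Validation
# Note: shuffle=True mixes neighbouring timestamps across folds; on time-ordered
# data this can make fold scores optimistic compared with a chronological split.
N_SPLITS = 5
kf = KFold(n_splits=N_SPLITS, shuffle=True, random_state=42)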

# 3. Initialize storage for scores and predictions
oof_scores = []
test_predictions = np.zeros(len(X_test))

print(f"Starting XGBoost training with {N_SPLITS}-fold cross-validation...")


Starting XGBoost training with 5-fold cross-validation...
--- Fold 1/5 ---
Pearson Correlation for Fold 1: 0.67993
--- Fold 2/5 ---
Pearson Correlation for Fold 2: 0.68663
--- Fold 3/5 ---
Pearson Correlation for Fold 3: 0.66705
--- Fold 4/5 ---
Pearson Correlation for Fold 4: 0.67591
--- Fold 5/5 ---
Pearson Correlation for Fold 5: 0.66471

--- Cross-Validation Summary ---
Average Pearson Correlation: 0.67485
Standard Deviation of Scores: 0.00812

Submission file 'submission.csv' created successfully!
First 5 rows of the submission file:
   ID  prediction
0   1    0.273182
1   2    0.272615
2   3    0.193447
3   4   -0.107047
4   5    0.337480
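
The training cell above stops before the fold loop, although its output reports per-fold scores. The following is a minimal sketch of such a loop and is not the notebook's actual code. To stay self-contained it uses NumPy only: `LinearStandIn` is a hypothetical placeholder for any model exposing `fit` and `predict` (in the notebook this would presumably be `xgb.XGBRegressor(**params)`), the shuffled index partition stands in for `KFold`, `cross_validate_pearson` is a name introduced here, and the data is synthetic.

```python
import numpy as np

class LinearStandIn:
    """Hypothetical placeholder model: ordinary least squares with an intercept."""

    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        return np.column_stack([X, np.ones(len(X))]) @ self.coef_

def cross_validate_pearson(make_model, X, y, X_test, n_splits=5, seed=42):
    """Fit one model per fold, score it with Pearson correlation on the
    held-out fold, and average the fold models' test predictions."""
    rng = np.random.default_rng(seed)
    # Shuffled, near-equal index partition (stand-in for sklearn's KFold).
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    scores = []
    test_pred = np.zeros(len(X_test))
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = make_model().fit(X[train_idx], y[train_idx])
        val_pred = model.predict(X[val_idx])
        scores.append(np.corrcoef(y[val_idx], val_pred)[0, 1])
        # Each fold contributes an equal share to the final test prediction.
        test_pred += model.predict(X_test) / n_splits
    return np.array(scores), test_pred

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)
X_test = rng.normal(size=(100, 3))

scores, test_pred = cross_validate_pearson(LinearStandIn, X, y, X_test)
print(f"Average Pearson Correlation: {scores.mean():.5f}")
print(f"Standard Deviation of Scores: {scores.std():.5f}")
```

The sketch indexes NumPy arrays directly; with the pandas objects used in the notebook, the fold indices would be applied through `.iloc` instead. Building the submission file from `test_pred` is not covered here.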
