# Crypto Market Analysis

## Pearson Correlation Coefficient (PCC) for Model Evaluation

The Pearson correlation coefficient (ρ) measures the linear correlation between two variables, in our case between the actual labels (y) and predicted values (ŷ) on the private test set. It ranges from -1 to +1, where:

- +1 indicates perfect positive linear correlation
- 0 indicates no linear correlation  
- -1 indicates perfect negative linear correlation

# The formula for PCC is:
# 
# $\rho = \frac{Cov(y,\hat{y})}{\sigma_y \cdot \sigma_{\hat{y}}}$

Where:
- Cov(y,ŷ) is the covariance between actual and predicted values
- σy is the standard deviation of actual values
- σŷ is the standard deviation of predicted values

For this competition, the evaluation metric is the Pearson correlation between our predictions and the true labels on the private test set. A higher positive correlation indicates better predictive performance, as our predictions more closely track the actual market movements.

Key implications:
- We want our predictions to move in the same direction as actual values
- The magnitude of movements matters less than getting the direction right
- A correlation of 0.7+ would indicate strong predictive performance


In [3]:
# DRW - Crypto Market Prediction Analysis
# Import necessary libraries

import pandas as pd
import numpy as np
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
from matplotlib import pyplot as plt
from matplotlib.ticker import MaxNLocator, FormatStrFormatter, PercentFormatter
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
import joblib
import os
import zipfile
import warnings
from sklearn.base import clone
from xgboost import XGBRegressor
from scipy.stats import pearsonr


warnings.filterwarnings('ignore')

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print("Ready to analyze DRW Crypto Market Prediction data...")


Libraries imported successfully!
Ready to analyze DRW Crypto Market Prediction data...


## Load the data

In [4]:
# Load and examine the crypto market data
# First, let's identify the main data files (both CSV and Parquet)
data_dir = "./data/"
all_files = os.listdir(data_dir)
csv_files = [f for f in all_files if f.endswith('.csv')]
parquet_files = [f for f in all_files if f.endswith('.parquet')]

# Load the main dataset(s)
train_data = None
test_data = None
sample_submission = None

# Function to load data based on file extension
def load_data_file(filepath):
    if filepath.endswith('.csv'):
        return pd.read_csv(filepath)
    elif filepath.endswith('.parquet'):
        return pd.read_parquet(filepath)
    else:
        raise ValueError(f"Unsupported file format: {filepath}")

# Try to identify train, test, and submission files
all_data_files = csv_files + parquet_files
for file in all_data_files:
    filepath = os.path.join(data_dir, file)
    
    if 'train' in file.lower():
        train_data = load_data_file(filepath)
        
    elif 'test' in file.lower() and 'submission' not in file.lower():
        test_data = load_data_file(filepath)
        
    elif 'submission' in file.lower():
        sample_submission = load_data_file(filepath)


# Display basic information about the datasets
if train_data is not None:
    print("\nTraining Data:")
    print(f"Shape: {train_data.shape}")
    print("\nFirst few rows (showing first 7 columns):")
    print(train_data.iloc[:5, :7])

if test_data is not None:
    print("\nTest Data:")
    print(f"Shape: {test_data.shape}")
    print("\nFirst few rows (showing first 7 columns):")
    print(test_data.iloc[:5, :7])

if sample_submission is not None:
    print("\nSample Submission:")
    print(f"Shape: {sample_submission.shape}")
    print("\nFirst few rows:")
    print(sample_submission.head())

if train_data is None and test_data is None:
    print("No train or test data loaded. Please check if the data files exist.")



Training Data:
Shape: (525887, 896)

First few rows (showing first 7 columns):
                     bid_qty  ask_qty  buy_qty  sell_qty   volume        X1  \
timestamp                                                                     
2023-03-01 00:00:00   15.283    8.425  176.405    44.984  221.389  0.121263   
2023-03-01 00:01:00   38.590    2.336  525.846   321.950  847.796  0.302841   
2023-03-01 00:02:00    0.442   60.250  159.227   136.369  295.596  0.167462   
2023-03-01 00:03:00    4.865   21.016  335.742   124.963  460.705  0.072944   
2023-03-01 00:04:00   27.158    3.451   98.411    44.407  142.818  0.173820   

                           X2  
timestamp                      
2023-03-01 00:00:00 -0.417690  
2023-03-01 00:01:00 -0.049576  
2023-03-01 00:02:00 -0.291212  
2023-03-01 00:03:00 -0.436590  
2023-03-01 00:04:00 -0.213489  

Test Data:
Shape: (538150, 896)

First few rows (showing first 7 columns):
    bid_qty  ask_qty  buy_qty  sell_qty   volume        X1        

### Reduce Memory Usage

Next we need to reduce the amounr of RAM that the pandas dataframe is using. When loading data, panda often assignes default values line int64 or float64. We can reduce the memory usage by converting these to smaller datatypes. We simply need to iterate over the columns and check the min and max values of each column. If the min and max values are within the range of the smaller datatypes, we can convert the column to that datatype.

In [1]:
def reduce_mem_usage(dataframe, dataset):
    print('Reducing memory usage for:', dataset)
    initial_mem_usage = dataframe.memory_usage().sum() / 1024**2
    
    # Select only numeric columns
    numeric_cols = dataframe.select_dtypes(include=[np.number]).columns
    
    for col in numeric_cols:
        c_min = dataframe[col].min()
        c_max = dataframe[col].max()
        
        if np.issubdtype(dataframe[col].dtype, np.integer):
            types = [np.int8, np.int16, np.int32, np.int64]
            for t in types:
                if c_min >= np.iinfo(t).min and c_max <= np.iinfo(t).max:
                    dataframe[col] = dataframe[col].astype(t)
                    break
        elif np.issubdtype(dataframe[col].dtype, np.floating):
            types = [np.float16, np.float32, np.float64]
            for t in types:
                if c_min >= np.finfo(t).min and c_max <= np.finfo(t).max:
                    dataframe[col] = dataframe[col].astype(t)
                    break
    
    final_mem_usage = dataframe.memory_usage().sum() / 1024**2
    print('--- Memory usage before: {:.2f} MB'.format(initial_mem_usage))
    print('--- Memory usage after: {:.2f} MB'.format(final_mem_usage))
    print('--- Decreased memory usage by {:.1f}%\n'.format(100 * (initial_mem_usage - final_mem_usage) / initial_mem_usage))
    
    return dataframe

### Seperating the data into training and testing sets

The foundation for building our models is to split the data into training and testing sets. We will use the training set to train our models and the testing set to evaluate their performance.


In [5]:
target_col = 'label'

#reduce mem usage 
train_data = reduce_mem_usage(train_data, 'train_data')
test_data = reduce_mem_usage(test_data, 'test_data')

#split data into training and testing sets
if target_col in train_data.columns:
    X = train_data.drop(target_col, axis=1)
    y = train_data[target_col]

    if target_col in test_data.columns:
        X_test = test_data.drop(target_col, axis=1)
    else:
        X_test = test_data.copy() # Use .copy() to avoid SettingWithCopyWarning later

    print("Data successfully split into training and testing sets.")
    print(f"X (training features) shape: {X.shape}")
    print(f"y (training target) shape: {y.shape}")
    print(f"X_test (test features) shape: {X_test.shape}")

else:
    print(f"Error: Target column '{target_col}' not found in the training data.")
    print("Please update the 'target_col' variable with the correct column name.")


Reducing memory usage for: train_data
--- Memory usage before: 3598.94 MB
--- Memory usage after: 965.94 MB
--- Decreased memory usage by 73.2%

Reducing memory usage for: test_data
--- Memory usage before: 3678.76 MB
--- Memory usage after: 984.36 MB
--- Decreased memory usage by 73.2%

Data successfully split into training and testing sets.
X (training features) shape: (525887, 895)
y (training target) shape: (525887,)
X_test (test features) shape: (538150, 895)


### Black Box Feature Selection

In [13]:
# Define the list of features identified as most important through prior analysis.
# This combines domain-specific features and data-driven anonymous features.

features = [
    # Top features identified from the referenced Kaggle notebook
    "X863", "X856", "X344", "X598", "X862", "X385", "X852", "X603", "X860", "X674",
    "X415", "X345", "X137", "X855", "X174", "X302", "X178", "X532", "X168", "X612",
    
    # Core market microstructure features
    "bid_qty", "ask_qty", "buy_qty", "sell_qty", "volume"
]

X = X[features]
X_test = X_test[features]

print("Feature selection applied.")
print(f"New shape of X (training features): {X.shape}")
print(f"New shape of X_test (test features): {X_test.shape}")

Feature selection applied.
New shape of X (training features): (5259, 25)
New shape of X_test (test features): (5382, 25)


In [14]:

class Trainer:
    """
    A wrapper class to simplify cross-validated model training, prediction,
    and evaluation. It handles out-of-fold predictions and aggregates scores.
    """
    def __init__(self, model, cv, metric, task="regression", metric_precision=6):
        self.model = model
        self.cv = cv
        self.metric = metric
        self.task = task
        self.metric_precision = metric_precision
        self.models_ = []
        self.fold_scores = []
        self.oof_preds = None

    def fit(self, X, y):
        """
        Fits the model using the provided cross-validation strategy.
        It stores the trained model for each fold and computes out-of-fold predictions.
        """
        self.models_ = []
        self.oof_preds = np.zeros(len(X))
        
        # Check if input data is a pandas DataFrame or Series
        is_pandas = hasattr(X, 'iloc')

        for fold, (train_idx, val_idx) in enumerate(self.cv.split(X, y)):
            print(f"--- Fold {fold+1} ---")

            if is_pandas:
                X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
                X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
            else:  # Assume numpy array
                X_train, y_train = X[train_idx], y[train_idx]
                X_val, y_val = X[val_idx], y[val_idx]

            fold_model = clone(self.model)
            fold_model.fit(X_train, y_train)
            self.models_.append(fold_model)

            val_preds = fold_model.predict(X_val)
            self.oof_preds[val_idx] = val_preds

            score, _ = self.metric(y_val, val_preds)
            self.fold_scores.append(round(score, self.metric_precision))
            print(f"Fold {fold+1} Score: {self.fold_scores[-1]}")

    def predict(self, X_test):
        """
        Generates test set predictions by averaging the predictions 
        from all models trained during cross-validation.
        """
        if not self.models_:
            raise RuntimeError("The trainer has not been fitted yet. Call .fit() before .predict().")

        test_predictions = np.zeros(len(X_test))
        for model in self.models_:
            test_predictions += model.predict(X_test)
        
        return test_predictions / len(self.models_)

# Define XGBoost model parameters from your notebook
xgb_params = {
    "tree_method": "hist",
    "device": "gpu",
    "colsample_bylevel": 0.4778,
    "colsample_bynode": 0.3628,
    "colsample_bytree": 0.7107,
    "gamma": 1.7095,
    "learning_rate": 0.02213,
    "max_depth": 20,
    "max_leaves": 12,
    "min_child_weight": 16,
    "n_estimators": 1667,
    "subsample": 0.06567,
    "reg_alpha": 39.3524,
    "reg_lambda": 75.4484,
    "verbosity": 0,
    "random_state": 42,
    "n_jobs": -1
}

# Initialize dictionaries to store results for different models
fold_scores = {}
overall_scores = {}
oof_preds = {}
test_preds = {}

# Instantiate the trainer with the XGBoost regressor and parameters
xgb_trainer = Trainer(
    model=XGBRegressor(**xgb_params),
    cv=KFold(n_splits=5, shuffle=False),
    metric=pearsonr,
    task="regression",
    metric_precision=6
)

# Assuming X, y, and X_test are defined and available in your environment
# Fit the model and generate out-of-fold predictions
xgb_trainer.fit(X, y)

# Store the results
fold_scores["XGBoost"] = xgb_trainer.fold_scores
overall_scores["XGBoost"] = [pearsonr(xgb_trainer.oof_preds, y)[0]]
oof_preds["XGBoost"] = xgb_trainer.oof_preds
test_preds["XGBoost"] = xgb_trainer.predict(X_test)

# Print a summary of the training results
print("\n--- XGBoost Training Summary ---")
print(f"Fold Scores: {fold_scores['XGBoost']}")
print(f"Overall OOF Score (Pearson): {overall_scores['XGBoost'][0]:.6f}")
print(f"Test predictions shape: {test_preds['XGBoost'].shape}")

--- Fold 1 ---
Fold 1 Score: nan
--- Fold 2 ---
Fold 2 Score: nan
--- Fold 3 ---
Fold 3 Score: nan
--- Fold 4 ---
Fold 4 Score: nan
--- Fold 5 ---
Fold 5 Score: nan

--- XGBoost Training Summary ---
Fold Scores: [np.float64(nan), np.float64(nan), np.float64(nan), np.float64(nan), np.float64(nan)]
Overall OOF Score (Pearson): -0.034295
Test predictions shape: (5382,)


In [None]:
# add an ensemble model with ridge regression
X = pd.DataFrame(oof_preds)
X_test = pd.DataFrame(test_preds)

joblib.dump(X, "oof_preds.pkl")
joblib.dump(X_test, "test_preds.pkl")

ridge_params = {
    'alpha': 0.001,
    'random_state': 42,
    'tol': 1e-3,
    'fit_intercept': True,
    'positive': True
}

ridge_trainer = Trainer(
        Ridge(**ridge_params),
        cv=KFold(n_splits=5, shuffle=False),
        metric=pearsonr,
        task="regression",
        verbose=False
    )

ridge_trainer.fit(X, y)

fold_scores["Ridge (ensemble)"] = ridge_trainer.fold_scores
overall_scores["Ridge (ensemble)"] = [pearsonr(ridge_trainer.oof_preds, y)]
ridge_test_preds = ridge_trainer.predict(X_test)



NameError: name 'Ridge' is not defined