In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) The Problem Statement
The cryptocurrency market represents one of the most dynamic and rapidly evolving financial landscapes, offering a wealth of opportunities for those who can extract meaningful insights from its vast streams of data. However, market information in crypto has an inherently low signal-to-noise ratio, making it exceptionally difficult to identify predictive patterns. Price movements are shaped by a complex interplay of liquidity, order flow dynamics, sentiment shifts, and structural inefficiencies, requiring sophisticated quantitative techniques to decode.

At DRW, we have been at the forefront of financial innovation for over three decades, embracing cutting-edge technology and rigorous quantitative research to optimize trading strategies. Through Cumberland, our dedicated crypto trading arm, we were among the earliest institutional participants in the digital asset space, helping to shape market structure and improve efficiency. As one of the largest liquidity providers in crypto, we thrive on developing proprietary trading strategies that adapt to the ever-changing market environment.

In this competition, we invite you to build a model capable of predicting short-term crypto future price movements using our production feature data alongside publicly available market volume statistics. The proprietary production features we provide are integral to our trading strategies, capturing subtle market signals that help us navigate and seize opportunities in real time. Moreover, these production features, combined with public data describing the broader market state, create a rich and challenging dataset for data mining and modeling. Your task is to integrate these diverse sources of information into a single directional signal that effectively predicts crypto future price movements.

Through this challenge, we aim to replicate the real-world problems we tackle at DRW every day—leveraging advanced machine learning techniques to extract structure from noisy, high-dimensional market data. The most successful solutions will provide a learning model that efficiently incorporates both explicit patterns and implicit interactions between all data features to refine price movement predictions.

We look forward to seeing how the Kaggle community approaches this problem and how different modeling techniques can enhance our understanding of market dynamics. If you're excited by complex, high-impact challenges beyond predictive modeling, DRW offers a diverse range of opportunities at the intersection of quantitative research, technology, and trading strategy development.

### 2) Understanding the Data
#### train.parquet
The training dataset, containing all historical market data along with the corresponding labels.

- timestamp: The timestamp index representing the minute associated with each row.
- bid_qty: The total quantity buyers are willing to purchase at the best (highest) bid price at the given timestamp.
- ask_qty: The total quantity sellers are offering to sell at the best (lowest) ask price at the given timestamp.
- buy_qty: The total trading quantity executed at the best ask price during the given minute.
- sell_qty: The total trading quantity executed at the best bid price during the given minute.
- volume: The total traded volume during the minute.
- X1, ..., X890: A set of anonymized market features derived from proprietary data sources.
- label: The target variable representing the anonymized market price movement to be predicted.
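
The first rows of train.parquet displayed further down in this notebook suggest that volume is the sum of buy_qty and sell_qty. The sketch below checks that relationship on three of those rows, typed in by hand. The helper name `volume_matches_flow` and the tolerance `atol` are illustrative assumptions, and the relationship is only observed on the sample, not stated in the data description.

```python
import numpy as np
import pandas as pd

# First three rows of train.parquet as displayed by df.head() in this notebook.
sample = pd.DataFrame(
    {
        "buy_qty": [176.405, 525.846, 159.227],
        "sell_qty": [44.984, 321.95, 136.369],
        "volume": [221.389, 847.796, 295.596],
    },
    index=pd.to_datetime(
        ["2023-03-01 00:00:00", "2023-03-01 00:01:00", "2023-03-01 00:02:00"]
    ),
)


def volume_matches_flow(frame: pd.DataFrame, atol: float = 1e-6) -> pd.Series:
    """Flag rows where volume equals buy_qty + sell_qty within a tolerance."""
    total = frame["buy_qty"] + frame["sell_qty"]
    return pd.Series(
        np.isclose(frame["volume"], total, atol=atol), index=frame.index
    )


matches = volume_matches_flow(sample)
print(matches.all())
```

Running the same helper on the full DataFrame would show whether the relationship holds for every minute or only for the rows displayed here.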

#### test.parquet
The test dataset has the same feature structure as train.parquet, with the following differences:

- timestamp: To prevent future peeking, all timestamps are masked, shuffled, and replaced with a unique ID.
- label: All labels in the test set are set to 0.
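
One way to confirm the stated differences before modeling is to compare the two schemas and check that the test labels are all zero. The following is a minimal sketch on toy stand-in frames with made-up values. The helper `compare_columns` is hypothetical, and it assumes the test file still carries a label column, as the description above implies.

```python
import pandas as pd


def compare_columns(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Summarise column differences between two frames."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
        "same_order": list(train.columns) == list(test.columns),
    }


# Toy stand-ins for train.parquet and test.parquet (values are made up).
toy_train = pd.DataFrame(
    {"bid_qty": [15.283, 38.59], "X1": [0.12, 0.30], "label": [0.56, 0.53]},
    index=pd.to_datetime(["2023-03-01 00:00", "2023-03-01 00:01"]),
)
toy_test = pd.DataFrame(
    {"bid_qty": [1.0, 2.0], "X1": [0.1, 0.2], "label": [0.0, 0.0]},
    index=[1, 2],  # masked IDs in place of timestamps
)

report = compare_columns(toy_train, toy_test)
labels_all_zero = bool((toy_test["label"] == 0).all())
print(report, labels_all_zero)
```

With the real files, the two toy frames would be replaced by the results of `pd.read_parquet` on train.parquet and test.parquet.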

In [2]:
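# pandas needs one Parquet engine to read .parquet files; it prefers pyarrow when both are installed.
%pip install pyarrow
%pip install fastparquet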

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Load the training set; timestamp is the DataFrame index.
df = pd.read_parquet('train.parquet')
df.head()

timestamp,bid_qty,ask_qty,buy_qty,sell_qty,volume,X1,X2,X3,X4,X5,...,X882,X883,X884,X885,X886,X887,X888,X889,X890,label
2023-03-01 00:00:00,15.283,8.425,176.405,44.984,221.389,0.121263,-0.41769,0.005399,0.125948,0.058359,...,1.925423,1.847943,0.005676,0.190791,0.369691,0.37763,0.210153,0.159183,0.530636,0.562539
2023-03-01 00:01:00,38.59,2.336,525.846,321.95,847.796,0.302841,-0.049576,0.356667,0.481087,0.237954,...,1.928569,1.849468,0.005227,0.18466,0.363642,0.374515,0.209573,0.158963,0.530269,0.533686
2023-03-01 00:02:00,0.442,60.25,159.227,136.369,295.596,0.167462,-0.291212,0.083138,0.206881,0.101727,...,1.928047,1.849282,0.004796,0.178719,0.357689,0.371424,0.208993,0.158744,0.529901,0.546505
2023-03-01 00:03:00,4.865,21.016,335.742,124.963,460.705,0.072944,-0.43659,-0.102483,0.017551,0.007149,...,1.928621,1.849608,0.004398,0.172967,0.351832,0.368358,0.208416,0.158524,0.529534,0.357703
2023-03-01 00:04:00,27.158,3.451,98.411,44.407,142.818,0.17382,-0.213489,0.096067,0.215709,0.107133,...,1.927084,1.84895,0.004008,0.167391,0.346066,0.365314,0.207839,0.158304,0.529167,0.362452


In [5]:
df.shape

(525887, 896)
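
With the shape known, the "Data Checks to perform" step from the life cycle list can be sketched as a single summary function. This is one possible set of checks, not a prescribed one. The function name `basic_checks` and the toy frame are assumptions for illustration, and the defects in the toy frame are injected on purpose; nothing here implies the real data contains them.

```python
import numpy as np
import pandas as pd


def basic_checks(frame: pd.DataFrame) -> dict:
    """Collect a few basic data-quality counts for a DataFrame."""
    numeric = frame.select_dtypes(include="number")
    return {
        "n_rows": len(frame),
        "n_cols": frame.shape[1],
        # NaN cells across the whole frame (infinite values are not counted here).
        "n_missing": int(frame.isna().sum().sum()),
        # +inf / -inf cells in numeric columns.
        "n_infinite": int(np.isinf(numeric.to_numpy()).sum()),
        # Index values that repeat an earlier index value.
        "n_duplicate_index": int(frame.index.duplicated().sum()),
        # True if the index never decreases (ties are allowed).
        "index_sorted": bool(frame.index.is_monotonic_increasing),
        # Numeric columns with at most one distinct non-null value.
        "constant_columns": [
            c for c in numeric.columns if numeric[c].nunique(dropna=True) <= 1
        ],
    }


# Toy frame with deliberately injected defects: one NaN, one inf,
# one repeated timestamp, and one constant column.
toy = pd.DataFrame(
    {
        "bid_qty": [15.283, 38.59, np.nan, 4.865],
        "X1": [0.121263, np.inf, 0.167462, 0.072944],
        "X2": [1.0, 1.0, 1.0, 1.0],
    },
    index=pd.to_datetime(
        [
            "2023-03-01 00:00",
            "2023-03-01 00:01",
            "2023-03-01 00:01",
            "2023-03-01 00:03",
        ]
    ),
)

summary = basic_checks(toy)
print(summary)
```

Calling `basic_checks(df)` on the loaded training frame would give the same summary for the real data.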