<h3 style='background-color: crimson; color:cream'>
    Overview: For Pedro and Josh
</h3>

From the Kaggle competition description: 

In this competition, hosted by Jane Street, you'll build a model using real-world data derived from production systems, which offers a glimpse into the daily challenges of successful trading. This challenge highlights the difficulties in modeling financial markets, including fat-tailed distributions, non-stationary time series, and sudden shifts in market behavior.

When approaching modeling problems in modern financial markets, there are many reasons to believe that the problems you are trying to solve are impossible. Even if you put aside the beliefs that the prices of financial instruments rationally reflect all available information, you’ll have to grapple with time series and distributions that have properties you don’t encounter in other sorts of modeling problems. Distributions can be famously fat-tailed, time series can be non-stationary, and data can generally fail to satisfy a lot of the underlying assumptions on which very successful statistical approaches rely. Layer on all of this the fact that the financial markets are ultimately a human endeavor involving a large number of individuals and institutions that are constantly changing with advances in technology and shifts in society, and responding to economic and geopolitical issues as they arise - and you can start to get a sense of the difficulties involved!

In this challenge, we ask you to build a model using real-world data derived from some of our production systems. This data gives a very close picture of some of the things we have to do every day to be successful at trading in modern financial markets. We’ve assembled a collection of features and responders related to markets where we run automated trading strategies and are concerned about having good underlying models. To balance crafting a challenging, relevant problem that ties into our business while respecting the proprietary and highly competitive nature of our trading, you will notice that we have anonymized and lightly obfuscated some of the features and responders we present in the data. These modifications don’t change the essence of the problem at hand, but they do allow us to give you a difficult task that meaningfully illustrates the work we do at Jane Street.

Jane Street has spent decades relentlessly innovating on all aspects of our trading, and building machine learning models to aid our decision-making. These models help us actively trade thousands of financial products each day across 200+ trading venues around the world. While this challenge only presents a tiny fraction of the quantitative problems Jane Streeters work on daily, we are very interested in seeing how the Kaggle community will approach this challenge, and in engaging with you about your solutions to the problem!

<h3 style='background-color: crimson; color:cream'>
    Josh's current understanding of how the competition works
</h3>

We will develop models that predict the market. Submissions are due on January 13 and must follow the format outlined on the Kaggle website: our code has to go through a Jane Street API so that they can easily run it to get predictions. Jane Street will then test how closely those predictions align with the actual market from January to July of 2025.

In [23]:
import html
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


<h3 style='background-color: crimson; color:cream'>
    Notes on imports
</h3>
Jane Street uses polars instead of pandas. It is supposed to be faster, so it could be worth checking out.

<h3 style='background-color: crimson; color:cream'>
    Let's just have a few preliminary goals.
</h3>

<p style='background-color: orange; color:steelblue'>
    i. Come up with questions to ask in the Kaggle forum.
</p>

<p style='background-color: orange; color:steelblue'>
    ii. Get into familiar territory: train models and then test their performance. This raises a few questions that we should come up with reasonable solutions for.
</p>


In [15]:
# Let's take a look at the shape of the different parquet files.
file_path = '../train.parquet/partition_id='
for i in range(10):
    df = pd.read_parquet(file_path + str(i), engine='fastparquet')
    print(df.shape)

(1944210, 92)
(2804247, 92)
(3036873, 92)
(4016784, 92)
(5022952, 92)
(5348200, 92)
(6203912, 92)
(6335560, 92)
(6140024, 92)
(6274576, 92)


For now, naively drop every column that contains at least one NaN value. We should come up with a better way
to deal with missingness, but right now I just want something that runs.
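As a starting point for that better way, here is a minimal sketch of one alternative: forward-fill each feature within a symbol, then fall back to the column median. The toy frame and its values are made up for illustration (only the column naming mimics the competition data), and this has not been tried on the real partitions. In practice the median should come from the training rows only.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame; column names mimic the competition data, values are made up.
toy = pd.DataFrame({
    "symbol_id":  [1, 1, 1, 7, 7, 7],
    "time_id":    [0, 1, 2, 0, 1, 2],
    "feature_00": [0.5, np.nan, 0.7, np.nan, 1.0, 1.2],
    "feature_01": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Current approach: every column with at least one NaN disappears.
dropped = toy.dropna(axis="columns")

# Alternative: carry the last observed value forward within each symbol,
# then fill whatever is still missing with the column median.
feature_cols = [c for c in toy.columns if c.startswith("feature")]
filled = toy.sort_values(["symbol_id", "time_id"]).copy()
filled[feature_cols] = filled.groupby("symbol_id")[feature_cols].ffill()
filled[feature_cols] = filled[feature_cols].fillna(filled[feature_cols].median())

# Optionally keep a 0/1 flag so a model can still see where values were missing.
filled["feature_00_was_nan"] = toy["feature_00"].isna().astype(int)

print(dropped.columns.tolist())
print(filled)
```

Dropping columns loses `feature_00` entirely, while the fill keeps it and records where the gaps were.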

In [19]:
df = pd.read_parquet(file_path + '0', engine='fastparquet')
df = df.dropna(axis='columns')
df.head()

Unnamed: 0,date_id,time_id,symbol_id,weight,feature_05,feature_06,feature_07,feature_09,feature_10,feature_11,...,feature_78,responder_0,responder_1,responder_2,responder_3,responder_4,responder_5,responder_6,responder_7,responder_8
0,0,0,1,3.889038,0.851033,0.242971,0.2634,11,7,76,...,-0.281498,0.738489,-0.069556,1.380875,2.005353,0.186018,1.218368,0.775981,0.346999,0.095504
1,0,0,7,1.370613,0.676961,0.151984,0.192465,11,7,76,...,-0.302441,2.965889,1.190077,-0.523998,3.849921,2.626981,5.0,0.703665,0.216683,0.778639
2,0,0,9,2.285698,1.056285,0.187227,0.249901,11,7,76,...,-0.096792,-0.864488,-0.280303,-0.326697,0.375781,1.271291,0.099793,2.109352,0.670881,0.772828
3,0,0,10,0.690606,1.139366,0.273328,0.306549,42,5,150,...,-0.296244,0.408499,0.223992,2.294888,1.097444,1.225872,1.225376,1.114137,0.775199,-1.379516
4,0,0,14,0.44057,0.9552,0.262404,0.344457,44,3,16,...,3.418133,-0.373387,-0.502764,-0.348021,-3.928148,-1.591366,-5.0,-3.57282,-1.089123,-5.0


In [27]:
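# Predictors: the id columns, the weight column, and every surviving feature column.
predictors = ['date_id', 'time_id', 'symbol_id', 'weight']
predictors += [col for col in df.columns if 'feature' in col]
response = 'responder_6'

X = df[predictors]
y = df[response]
# Random 75/25 split; note that the test rows are not strictly later in time than the training rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit an ordinary least squares baseline and score it on the held-out rows.
linreg = LinearRegression()

linreg0 = linreg.fit(X_train, y_train)
y_pred = linreg0.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  # scikit-learn's argument order is (y_true, y_pred)
mse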

np.float32(0.7474239)
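Since the competition description stresses non-stationary time series, it may be worth checking the model on days it has never seen rather than on a random sample of rows. The sketch below is only an illustration on synthetic data: the frame, the 0.75 cutoff, and the predict-zero baseline are assumptions made here, not part of the competition setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for one partition: 100 days, 50 rows per day (values are made up).
rng = np.random.default_rng(0)
n_days, rows_per_day = 100, 50
toy = pd.DataFrame({
    "date_id": np.repeat(np.arange(n_days), rows_per_day),
    "feature_00": rng.normal(size=n_days * rows_per_day),
    "feature_01": rng.normal(size=n_days * rows_per_day),
})
toy["responder_6"] = 0.1 * toy["feature_00"] + rng.normal(size=len(toy))

# Hold out the last 25% of days instead of a random 25% of rows.
cutoff = toy["date_id"].quantile(0.75)
train = toy[toy["date_id"] <= cutoff]
test = toy[toy["date_id"] > cutoff]

features = ["feature_00", "feature_01"]
model = LinearRegression().fit(train[features], train["responder_6"])
mse_model = mean_squared_error(test["responder_6"], model.predict(test[features]))

# Baseline for context: always predict zero.
mse_zero = mean_squared_error(test["responder_6"], np.zeros(len(test)))
print(f"model MSE: {mse_model:.4f}, predict-zero MSE: {mse_zero:.4f}")
```

Comparing against a trivial baseline gives the MSE some context: a single number like the one above is hard to judge without knowing what a constant prediction would score.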