# Hull - Leak Safe Baseline

- The training data also contains the public test set: it is the last 180 days, see the [data description](https://www.kaggle.com/competitions/hull-tactical-market-prediction/data). Let's exclude this part from training entirely so that the score on the current public leaderboard is meaningful.
- We are supposed to predict a strategy (position) per day, which depends on `forward_returns` and `risk_free_rate` (and some overall effect). This made me wonder whether we need to estimate both at the same time and then derive a strategy from them. However, in [this discussion answer](https://www.kaggle.com/competitions/hull-tactical-market-prediction/discussion/608349#3299060), this optimization is done analytically (without considering penalty effects). So let's take these analytic optima as the true targets to be predicted.
- Otherwise the modelling is as simple as in the [Hull Starter Notebook](https://www.kaggle.com/code/laurentlanteigne/hull-starter-notebook): no time effect, same features, and same model.
- Note: the training dataset will be updated throughout the competition.

**All comments welcome!**

## Import & Settings

In [None]:
import os
import pathlib
import numpy as np
import pandas as pd
import polars as pl 
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
import plotly as py
init_notebook_mode(connected=True) 
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.preprocessing import StandardScaler
import kaggle_evaluation.default_inference_server
from metric import score as hull_score

In [None]:
BASE_DIR = pathlib.Path("/kaggle/input/hull-tactical-market-prediction")
SEED = 888
TEST_SKIP = 180
# same features as in hull starter nb
FEATURES = [
    "S2",
    "E2", "E3",
    "P8", "P9", "P10", "P12", "P13",
    "S1", "S5", 
    "I2",
    "U1",
    "U2",
]
INFO_COLS = ["date_id", "forward_returns", "risk_free_rate"]
# model as in hull starter
CV = 10
L1_RATIO = 0.5
ALPHAS = np.logspace(-4, 2, 100)
MAX_ITER = 1000000

## Load data

In [None]:
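data = pd.read_csv(BASE_DIR / "train.csv")
# derived features U1 and U2 (both are listed in FEATURES)
data["U1"] = data["I2"] - data["I1"]
data["U2"] = data["M11"] / ((data["I2"] + data["I9"] + data["I7"]) / 3)
# keep only the columns we need and drop rows with any missing value
data = data[FEATURES + INFO_COLS].dropna()
data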

In [None]:
max_train_date = data["date_id"].max() - TEST_SKIP
print("max train date_id:", max_train_date)

In [None]:
train = data.loc[data["date_id"] <= max_train_date].copy()
test = data.loc[data["date_id"] > max_train_date].copy()
print("train shape:", train.shape)
print("test shape:", test.shape)
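
A quick way to convince yourself that the cutoff does what is intended is to run the same split on a toy frame. In the sketch below, `split_by_date` and the toy data are made up for illustration; only the `date_id` column name and the 180-day window come from the notebook:

```python
import pandas as pd

def split_by_date(df, skip, date_col="date_id"):
    """Hold out all rows whose date is within the last `skip` date ids."""
    cutoff = df[date_col].max() - skip
    train_part = df.loc[df[date_col] <= cutoff].copy()
    test_part = df.loc[df[date_col] > cutoff].copy()
    return train_part, test_part, cutoff

# toy frame with one row per date_id (illustrative data only)
toy = pd.DataFrame({"date_id": range(1000), "x": range(1000)})
toy_train, toy_test, toy_cutoff = split_by_date(toy, skip=180)

# the hold-out must lie strictly after the training period
no_overlap = toy_train["date_id"].max() < toy_test["date_id"].min()
print("cutoff:", toy_cutoff, "| train rows:", len(toy_train), "| test rows:", len(toy_test))
print("no overlap:", no_overlap)
```

Note that the cutoff is defined on `date_id`, not on the row count, so after `dropna()` the hold-out can contain fewer than 180 rows.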

## Build target
Set the target to the best strategy on the training dataset.
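
Written out, the next cell computes the following. This is my reading of the code; $r_t$, $N$, $\bar r$, $p$ and $w_t$ are notation introduced here, only $c$ is the notebook's own name:

$$
\begin{aligned}
r_t &= \text{forward\_returns}_t - \text{risk\_free\_rate}_t \\
\bar r &= \Big(\prod_{t=1}^{N} (1 + r_t)\Big)^{1/N} - 1 \\
p &= \frac{1}{N}\sum_{t=1}^{N} \mathbf{1}[r_t > 0] \\
c &= (1 + \bar r)^{1/p} - 1 \\
w_t &= \min\big(\max(c / r_t,\ 0),\ 2\big)
\end{aligned}
$$

So, when $c > 0$, days with a negative excess return get weight 0, days with a small positive excess return hit the cap of 2, and larger positive excess returns get the weight $c / r_t$.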

In [None]:
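solution = train.copy()
# daily excess return of the market over the risk-free rate
market_excess_returns = solution['forward_returns'] - solution['risk_free_rate']
# geometric mean of the daily excess return over the training period
market_excess_cumulative = (1 + market_excess_returns).prod()
market_mean_excess_return = (market_excess_cumulative) ** (1 / len(solution)) - 1
# rescale by the share of days with a positive excess return
c = (1 + market_mean_excess_return) ** (1 / (market_excess_returns > 0).mean()) - 1
# analytic optimum from the linked discussion, clipped to [0, 2]
submission = pd.DataFrame({'prediction': (c / market_excess_returns).clip(0, 2)})
print("best score train:", hull_score(solution, submission, ''))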

In [None]:
train["target"] = submission["prediction"]

In [None]:
fig = go.Figure(data=[go.Scatter3d(
    x=train['forward_returns'],
    y=train['risk_free_rate'],
    z=train['target'],
    mode='markers',
    marker=dict(size=3)
)])
fig.update_layout(
    scene=dict(
        xaxis_title='forward_returns',
        yaxis_title='risk_free_rate',
        zaxis_title='target'
    )
)
iplot(fig)

In [None]:
# grid of excess returns for plotting (kept separate from the training Series)
excess_grid = np.linspace(-0.01, 0.04, 402)
fig = go.Figure(data=[go.Scatter(
    x=excess_grid,
    y=(c / excess_grid).clip(0, 2),
    mode='markers',
    marker=dict(size=3)
)])
fig.update_layout(
    xaxis_title="market_excess_returns",
    yaxis_title="target",
)
iplot(fig)
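
The shape of this curve can also be checked on a handful of toy numbers. The sketch below mirrors the target construction in plain NumPy; the function name `optimal_positions` and the sample returns are invented for illustration and are not part of the notebook:

```python
import numpy as np

def optimal_positions(excess_returns, lo=0.0, hi=2.0):
    """Mirror of the notebook's target: clip(c / r_t, lo, hi)."""
    r = np.asarray(excess_returns, dtype=float)
    # geometric mean of the excess return
    geo_mean = np.prod(1.0 + r) ** (1.0 / r.size) - 1.0
    # share of periods with a positive excess return
    frac_pos = np.mean(r > 0)
    c = (1.0 + geo_mean) ** (1.0 / frac_pos) - 1.0
    return np.clip(c / r, lo, hi), c

# illustrative excess returns without exact zeros
toy_returns = np.array([0.01, -0.02, 0.005, 0.03, -0.01])
toy_weights, toy_c = optimal_positions(toy_returns)

for r, w in zip(toy_returns, toy_weights):
    print(f"excess return {r:+.3f} -> position {w:.3f}")
```

With a positive `c`, negative excess returns map to 0 and the position shrinks as the positive excess return grows, which is the hyperbola-like branch in the plot.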

## Model

In [None]:
X_train = train[FEATURES].values
y_train = train["target"].values

In [None]:
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
model_cv = ElasticNetCV(l1_ratio=L1_RATIO, cv=CV, alphas=ALPHAS, max_iter=MAX_ITER)
model_cv.fit(X_train_scaled, y_train)
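# refit with the selected alpha, using the same iteration budget as the CV search
model = ElasticNet(alpha=model_cv.alpha_, l1_ratio=L1_RATIO, max_iter=MAX_ITER)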
model.fit(X_train_scaled, y_train)
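
As far as I know, `ElasticNetCV` refits on the full training data with the selected `alpha_` once the search is done, so the explicit `ElasticNet` refit should give the same coefficients as `model_cv` as long as the solver settings (for example `max_iter`) match. A minimal sketch on synthetic data, with made-up names and values, to check this:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, ElasticNetCV

# synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = X_toy @ np.array([0.5, 0.0, -0.3, 0.0, 0.1]) + rng.normal(scale=0.1, size=300)

toy_alphas = np.logspace(-3, 1, 20)
cv_model = ElasticNetCV(l1_ratio=0.5, cv=5, alphas=toy_alphas, max_iter=100_000)
cv_model.fit(X_toy, y_toy)

# explicit refit with the selected alpha and the same solver settings
refit = ElasticNet(alpha=cv_model.alpha_, l1_ratio=0.5, max_iter=100_000)
refit.fit(X_toy, y_toy)

same_coef = np.allclose(cv_model.coef_, refit.coef_, atol=1e-6)
same_intercept = np.isclose(cv_model.intercept_, refit.intercept_, atol=1e-6)
print("selected alpha:", cv_model.alpha_)
print("refit matches CV model:", bool(same_coef and same_intercept))
```

If the two agree on the real data as well, `model_cv` could be used for prediction directly; keeping the separate refit stays closer to the starter notebook.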

In [None]:
# in-sample R^2 on the training data
model.score(X_train_scaled, y_train)

## Invoke on test set

In [None]:
X_test = test[FEATURES].values
X_test_scaled = sc.transform(X_test)
y_test_pred = model.predict(X_test_scaled)
y_test_pred = np.clip(y_test_pred, 0.0, 2.0)
pd.Series(y_test_pred).describe()

In [None]:
solution = test.copy()
submission = pd.DataFrame({'prediction': y_test_pred}, index=solution.index)
print("score public test:", hull_score(solution, submission, ''))

## Submit

In [None]:
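def predict(test: pl.DataFrame) -> float:
    # convert the incoming polars batch to pandas to reuse the training feature code
    data = test.to_pandas()
    # same derived features as in training
    data["U1"] = data["I2"] - data["I1"]
    data["U2"] = data["M11"] / ((data["I2"] + data["I9"] + data["I7"]) / 3)
    X = data[FEATURES].values
    X_scaled = sc.transform(X)
    y = model.predict(X_scaled)
    # clip to [0, 2] and return the prediction for the first row as a plain float
    pred = float(np.clip(y, 0.0, 2.0)[0])
    print(f"date_id: {data['date_id'][0]} -> prediction: {pred:>.4f}")
    return pred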
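
One caveat, as far as I can tell: rows with missing values are dropped from the training data, but `predict` has no such guard, and `ElasticNet.predict` rejects NaN input. A hypothetical fallback, not part of this notebook, is to replace a missing feature by its training mean, which contributes zero after standardization. The sketch below uses plain NumPy and invented names (`predict_with_mean_fallback`, `train_mean`, `train_scale`, `coef`, `intercept`) standing in for `sc.mean_`, `sc.scale_`, `model.coef_` and `model.intercept_`:

```python
import numpy as np

def predict_with_mean_fallback(x_row, train_mean, train_scale, coef, intercept):
    """Linear prediction on standardized features; NaNs fall back to the training mean."""
    x = np.asarray(x_row, dtype=float)
    # a missing feature is replaced by its training mean (zero after scaling)
    x = np.where(np.isnan(x), train_mean, x)
    z = (x - train_mean) / train_scale
    return float(np.clip(intercept + z @ coef, 0.0, 2.0))

# made-up scaler statistics and model parameters, for illustration only
train_mean = np.array([1.0, -2.0, 0.5])
train_scale = np.array([2.0, 1.0, 0.5])
coef = np.array([0.3, -0.1, 0.2])
intercept = 0.8

complete = predict_with_mean_fallback([3.0, -2.0, 1.0], train_mean, train_scale, coef, intercept)
with_gap = predict_with_mean_fallback([3.0, np.nan, 1.0], train_mean, train_scale, coef, intercept)
all_missing = predict_with_mean_fallback([np.nan] * 3, train_mean, train_scale, coef, intercept)
print(complete, with_gap, all_missing)
```

When every feature is missing, this falls back to the clipped intercept, i.e. the model's average position. Infinite values (for example from a zero denominator in `U2`) are not handled by this sketch.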

In [None]:
inference_server = kaggle_evaluation.default_inference_server.DefaultInferenceServer(predict)

if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    inference_server.serve()
else:
    inference_server.run_local_gateway((BASE_DIR.as_posix(),))