# Hull - Cross Validation Baseline

## Motivation
- We need a good validation scheme, otherwise the competition is pure gambling.
- Note that the validation scheme ultimately needs to point us to good hyperparameters; the CV score itself does not necessarily have to be exact.
- As a model reference I use my leak-free baseline that directly targets the strategy as output, see [Hull - Leak Safe Baseline](https://www.kaggle.com/code/morodertobias/hull-leak-safe-baseline).
- A single train/valid split is not sufficient, so a larger number of splits is required.
- These splits should respect the time ordering: training data comes before validation data.
- After determining the best hyperparameters, train the final model for the leak-free public test set.
- The first results were disappointing:
    - Lots of fluctuations in the scores.
    - Unclear cross validation settings: number of splits, train/valid sizes, or aggregation function.
    - Also, the simple model from the baseline does not seem to learn anything. To increase model capacity I switched to an elastic net using all features.

## Current Cross Validation
- Make several time series splits over the training set.
- Keep the validation set size fixed. Initially I set it to the size of the test set (180), but the scores fluctuated strongly. I think the score is very unstable, maybe even too susceptible to outliers because of the mean in the Sharpe ratio. Hence I use a larger validation set size.
- To reduce the dependence on where the first split starts, subsequent validation sets partially overlap.
- Finally, I look at the median score across splits to determine the best hyperparameters.
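
The instability with short validation windows can be illustrated on synthetic data. The sketch below is only an illustration under stated assumptions: it uses a plain annualized Sharpe ratio (not the competition metric), i.i.d. normal daily returns with made-up mean and volatility, and hypothetical names (`n_sims`, `spread`). It compares how much the estimate scatters for windows of 180 and 360 rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 2000
spread = {}
for window in (180, 360):
    # synthetic i.i.d. daily excess returns (assumed: 0.05% mean, 1% volatility)
    sims = rng.normal(loc=0.0005, scale=0.01, size=(n_sims, window))
    # plain annualized Sharpe ratio per simulated window (not the competition metric)
    sharpes = sims.mean(axis=1) / sims.std(axis=1, ddof=1) * np.sqrt(252)
    # standard deviation of the estimate across simulations
    spread[window] = float(sharpes.std())

print(spread)
```

For i.i.d. returns the scatter shrinks roughly like one over the square root of the window length, so doubling the window removes only about 30% of it. Real returns are not i.i.d., so this is at best a rough lower bound on the noise to expect.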

**What do you think about this cross validation strategy? How do you validate? All comments welcome!**

## Import & Settings

In [None]:
import os
import pathlib
import warnings
import numpy as np
import pandas as pd
import polars as pl 
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import kaggle_evaluation.default_inference_server
from metric import score as hull_score
warnings.simplefilter("ignore")

In [None]:
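BASE_DIR = pathlib.Path("/kaggle/input/hull-tactical-market-prediction")
TEST_SKIP = 180  # most recent date_ids held out as local test set
SKIP_ROWS = 2000  # initial rows of train.csv that are dropped
INFO_COLS = ["date_id", "forward_returns", "risk_free_rate", "market_forward_excess_returns"]
# model as in hull starter but on all features
L1_RATIO = 0.5
ALPHAS = np.logspace(-5, 0, 50)  # 50 values, log-spaced from 1e-5 to 1
MAX_ITER = 1000000
# cv
TRAIN_SIZE = 4000  # rows per training window
VALID_SIZE = 2 * 180  # rows per validation window (twice the test size)
VALID_STRIDE = 90  # shift between consecutive splits; validation windows overlap
N_SPLITS = 28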

## Load data

In [None]:
data = pd.read_csv(BASE_DIR / "train.csv")
# feature columns: every column whose name contains M, E, I, P, V or S
FEATURES = data.filter(regex=r"(M|E|I|P|V|S)").columns.to_list()
# skip first rows
data = data.iloc[SKIP_ROWS:]
data = data[FEATURES + INFO_COLS]
data

In [None]:
max_train_date = data["date_id"].max() - TEST_SKIP
print("max_train_date:", max_train_date)

In [None]:
train = data.loc[data["date_id"] <= max_train_date].copy()
test = data.loc[data["date_id"] > max_train_date].copy()
print("train shape:", train.shape)
print("test shape:", test.shape)

## Build target
Set the target to the best strategy on the training dataset.
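
Read as formulas, the function in the next cell appears to compute the following. The symbols $r_t$, $N$, $G$, $m$, $p$, $c$ and $w_t$ are introduced here for illustration only and do not appear in the notebook:

```latex
\begin{aligned}
r_t &= \text{forward\_returns}_t - \text{risk\_free\_rate}_t \\
G &= \prod_{t=1}^{N} (1 + r_t) \\
m &= G^{1/N} - 1 \\
p &= \frac{1}{N} \sum_{t=1}^{N} \mathbf{1}[r_t > 0] \\
c &= (1 + m)^{1/p} - 1 \\
w_t &= \operatorname{clip}\!\left(\frac{c}{r_t},\, 0,\, 2\right)
\end{aligned}
```

Since $(1+m)^{1/p} = G^{1/(Np)}$ and $Np$ is the number of days with positive excess return, $c$ is the constant per-day excess return that, earned only on those days, compounds to the same $G$. When $c > 0$, the target is therefore 0 on days with negative excess return and $c / r_t$, capped at 2, on days with positive excess return.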

In [None]:
def compute_ideal_targets(df):
    """Compute the ideal positions (clipped to [0, 2]) for df and their Hull score."""
    solution = df.copy()
    market_excess_returns = solution['forward_returns'] - solution['risk_free_rate']
    market_excess_cumulative = (1 + market_excess_returns).prod()
    market_mean_excess_return = (market_excess_cumulative) ** (1 / len(solution)) - 1
    c = (1 + market_mean_excess_return) ** (1 / (market_excess_returns > 0).mean()) - 1
    submission = pd.DataFrame({'prediction': (c / market_excess_returns).clip(0, 2)})
    score = hull_score(solution, submission, '')
    return submission, score

In [None]:
__, score = compute_ideal_targets(train)
print(f"Hull score: {score:.4f}")

## Validation
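
Before generating the splits, it may help to check the arithmetic implied by the settings. The sketch below copies the four CV constants from the settings cell; the derived names (`rows_needed`, `valid_overlap`, `valid_span`, `windows_per_row`) are hypothetical and not used elsewhere in the notebook.

```python
# values copied from the settings cell of this notebook
TRAIN_SIZE, VALID_SIZE, VALID_STRIDE, N_SPLITS = 4000, 2 * 180, 90, 28

# rows needed so that the first training window starts at position >= 0
rows_needed = TRAIN_SIZE + VALID_SIZE + (N_SPLITS - 1) * VALID_STRIDE
# rows shared by two consecutive validation windows
valid_overlap = VALID_SIZE - VALID_STRIDE
# rows covered by at least one validation window
valid_span = VALID_SIZE + (N_SPLITS - 1) * VALID_STRIDE
# number of validation windows that contain a row in the interior of that span
windows_per_row = VALID_SIZE // VALID_STRIDE

print(rows_needed, valid_overlap, valid_span, windows_per_row)
# -> 6790 270 2790 4
```

Because each interior row is scored in four splits, the per-split scores are not independent, which is worth keeping in mind when reading their spread.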

In [None]:
def generate_splits():
    """Yield (train_iloc, valid_iloc) sliding windows; the last one ends at the last train row."""
    num_samples = len(train)
    tv_size = TRAIN_SIZE + VALID_SIZE
    start = num_samples - tv_size - (N_SPLITS - 1) * VALID_STRIDE
    if start < 0:
        raise ValueError("Too few samples!")
    for i in range(N_SPLITS):
        i_start = start + i * VALID_STRIDE
        i_loc_train = np.arange(i_start, i_start + TRAIN_SIZE)
        i_loc_valid = np.arange(i_start + TRAIN_SIZE, i_start + tv_size)
        yield (i_loc_train, i_loc_valid)

In [None]:
split_summary = []
for i, (train_iloc, valid_iloc) in enumerate(generate_splits()):
    split_summary.append(
        {
            "split": i,
            "train_start": train_iloc[0],
            "train_end": train_iloc[-1],
            "train_size": 1 + train_iloc[-1] - train_iloc[0],
            "valid_start": valid_iloc[0],
            "valid_end": valid_iloc[-1],
            "valid_size": 1 + valid_iloc[-1] - valid_iloc[0],
        }
    )
split_summary = pd.DataFrame(split_summary)
split_summary["train_stride"] = split_summary["train_start"] - split_summary[
    "train_start"
].shift(1)
split_summary

In [None]:
def train_model(model, train_df, test_df=None):
    y_train, __ = compute_ideal_targets(train_df)
    X_train = train_df[FEATURES]
    model.fit(X_train, y_train)
    score_train = model.score(X_train, y_train)
    y_pred_train = model.predict(X_train)
    sub_train = pd.DataFrame(
        {"prediction": y_pred_train.clip(0.0, 2.0)}, index=train_df.index
    )
    hull_train = hull_score(train_df, sub_train, "")
    scores = {
        "train_score": score_train,
        "train_hull": hull_train,
    }
    if test_df is not None:
        X_test = test_df[FEATURES]
        y_pred_test = model.predict(X_test)
        sub_test = pd.DataFrame(
            {"prediction": y_pred_test.clip(0.0, 2.0)}, index=test_df.index
        )
        hull_test = hull_score(test_df, sub_test, "")
        scores["test_hull"] = hull_test
    return model, scores

In [None]:
cv_scores = {}

for fold, (train_iloc, valid_iloc) in \
    tqdm(enumerate(generate_splits()), total=N_SPLITS):

    df_train = train.iloc[train_iloc].copy()
    df_test = train.iloc[valid_iloc].copy()

    fold_scores = {}
    for alpha in ALPHAS:
        model = Pipeline([
            ("im", SimpleImputer(strategy='median')),
            ("sc", StandardScaler()),
            ("reg", ElasticNet(alpha=alpha, l1_ratio=L1_RATIO, max_iter=MAX_ITER)),
        ])
        model, scores = train_model(model, df_train, df_test)
        fold_scores[alpha] = scores
    fold_scores = pd.DataFrame(fold_scores).T
    
    cv_scores[fold] = fold_scores

In [None]:
cv_scores = pd.concat(cv_scores).unstack(level=1)
cv_scores

## Look at scores

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))
cv_scores["test_hull"].T.plot(
    ax=ax,
    logx=True,
    legend=False,
    xlabel="alpha",
    ylabel="test hull",
    title="hull score by split",
)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(14, 5))
cv_scores["test_hull"].plot(kind="box", ax=ax)
# boxes sit at positions 1..len(ALPHAS); label them with the 0-based index into ALPHAS
ax.set_xticks(range(1, len(ALPHAS) + 1))
ax.set_xticklabels(range(len(ALPHAS)), rotation=90, ha="right")
ax.set_xlabel("alpha position")
ax.set_ylabel("Hull score")
ax.set_ylim(-0.5, 1.8)
ax.grid(True)
plt.show()

In [None]:
fig, axs = plt.subplots(nrows=2, sharex="all", sharey="all", gridspec_kw=dict(hspace=0), figsize=(8, 5))
fig.suptitle("mean/median hull scores")
ax = axs[0]
cv_scores["train_hull"].mean().plot(
    ax=ax, xlabel="alpha", marker="x", label="train", logx=True, ylabel="mean hull"
)
cv_scores["test_hull"].mean().plot(ax=ax, marker="x", label="test")
ax.legend(loc="upper right")
ax = axs[1]
cv_scores["train_hull"].median().plot(
    ax=ax,
    marker="x",
    label="train",
    logx=True,
    ylabel="median hull",
    xlabel="alpha",
)
cv_scores["test_hull"].median().plot(ax=ax, marker="x", label="test")
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))
cv_scores["test_hull"][ALPHAS[::10]].plot(
    kind="bar",
    ax=ax,
    xlabel="split",
    ylabel="test hull",
    title="hull score by split for some variants",
)
handles, labels = ax.get_legend_handles_labels()
fmt_labels = [f"{float(label):.4e}" for label in labels]
ax.legend(handles, fmt_labels, title="alpha")
ax.set_ylim([-0.2, 1.7])
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))
cv_scores["train_score"].median().plot(
    ax=ax, marker="x", label="train", logx=True, xlabel="alpha", ylabel="r2"
)
plt.legend()
plt.show()

## Determine best parameter and retrain
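
The selection in this section takes the median test score per alpha across splits and picks the maximum. The toy example below, with entirely made-up scores and hypothetical names (`scores`, `best_by_mean`, `best_by_median`), sketches why the choice of aggregate can matter: a single outlier split is enough to flip the choice made by the mean.

```python
import pandas as pd

# made-up scores: rows are splits, columns are alpha values (hypothetical numbers)
scores = pd.DataFrame(
    {
        0.001: [0.40, 0.45, 0.50, 0.42, 3.00],  # last split is an outlier
        0.010: [0.55, 0.60, 0.58, 0.62, 0.57],
    }
)

best_by_mean = scores.mean().idxmax()      # pulled towards the outlier split
best_by_median = scores.median().idxmax()  # ignores the single outlier

print(best_by_mean, best_by_median)
# -> 0.001 0.01
```

The median is not automatically the better choice; it simply trades sensitivity to extreme splits for ignoring how good or bad those splits are.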

In [None]:
test_scores = cv_scores["test_hull"].median()
alpha0 = test_scores.idxmax()
print(f"best median score: {test_scores[alpha0]:.4f} @ alpha={alpha0:.4E}")

In [None]:
df_train = train.iloc[-TRAIN_SIZE:].copy()
df_test = test
model = Pipeline([
    ("im", SimpleImputer(strategy='median')),
    ("sc", StandardScaler()),
    ("reg", ElasticNet(alpha=alpha0, l1_ratio=L1_RATIO, max_iter=MAX_ITER)),
])
model, scores = train_model(model, df_train, df_test)
pd.Series(scores)

In [None]:
X_test = test[FEATURES]
y_test_pred = model.predict(X_test)
y_test_pred = np.clip(y_test_pred, 0.0, 2.0)
pd.Series(y_test_pred).describe()

In [None]:
(
    model.named_steps["reg"].intercept_,
    model.named_steps["reg"].coef_ 
)

In [None]:
solution = test.copy()
submission = pd.DataFrame({'prediction': y_test_pred}, index=solution.index)
hull_score(solution, submission, '')

## Make submission

In [None]:
def predict(test: pl.DataFrame) -> float:
    """Predict the position for the first row of `test`, clipped to [0, 2]."""
    data = test.to_pandas()
    X = data[FEATURES]
    y = model.predict(X)
    pred = float(np.clip(y, 0.0, 2.0)[0])
    print(f"date_id: {data['date_id'][0]} -> prediction: {pred:>.4f}")
    return pred

In [None]:
inference_server = kaggle_evaluation.default_inference_server.DefaultInferenceServer(predict)

if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    inference_server.serve()
else:
    inference_server.run_local_gateway((BASE_DIR.as_posix(),))