For a given trading day $t$ and an allocation $S$, let:

- $M$ : the number of assets in the universe  
- $w_{S,t} = (w_{S,t,1}, w_{S,t,2}, \dots, w_{S,t,M})$ : the weights of allocation $S$ at time $t$  
- $r_{i,t+1}$ : the performance (or return) of asset $i$ from day $t$ to day $t+1$  

Then the realized return of allocation $S$ at $t+1$ is given by:

$$
R_{S,t+1} = \sum_{i=1}^M w_{S,t,i} \times r_{i,t+1}
$$

The prediction task is to estimate the sign of $R_{S,t+1}$.
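
As a quick numerical illustration of the formula above, the sketch below computes $R_{S,t+1}$ for a made-up allocation over three assets and derives a binary label from its sign. The weights and returns are invented for illustration and do not come from the dataset; encoding the sign as 1/0 is also an assumption.

```python
import numpy as np

# Hypothetical weights of one allocation S at day t (M = 3 assets).
w = np.array([0.5, -0.3, 0.2])
# Hypothetical asset returns from day t to day t+1.
r = np.array([0.01, 0.02, -0.005])

# Realized return: weighted sum of the asset returns.
R = float(np.dot(w, r))

# Assumed label encoding: 1 if the realized return is positive, else 0.
target = int(R > 0)

print(R, target)
```

Here the short position in the second asset loses more than the long positions gain, so the realized return is negative and the label is 0.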


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## `X_train.csv`

At every day $t$, each allocation $S$ follows this property:

$$
\forall S, \forall t : \ \sum_{i=1}^M |w_{S,t,i}| = 1
$$

The **SIGNED\_VOLUME** of an allocation $S$ at $t$ is given by:

$$
V_{S,t} = \sum_{i=1}^M w_{S,t,i} \times v_{i,t}
$$

where $v_{i,t}$ is the traded volume of stock $i$ during the trading session at timestamp $t$.


For homogeneity, these $V_{S,t}$ were rescaled in a rolling fashion to ensure comparability across different styles of allocations.

The **AVG\_DAILY\_TURNOVER** of an allocation $S$ at $t$ is given by:

$$
TURNOVER_{S,t} = \sum_{i=1}^M | w_{S,t,i} - w_{S,t-1,i} |
$$

$$
ADT_{S,t} = \mathrm{median}(TURNOVER_{S,t}, \dots, TURNOVER_{S,t-20})
$$
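
The two definitions above can be sketched with pandas as follows. The weight table is invented for illustration, and `min_periods=1` is an assumption made only so that this short toy series yields values before the 21-day window fills; how the dataset treats the first days is not stated.

```python
import pandas as pd

# Hypothetical daily weights of one allocation over 3 assets (rows = days).
weights = pd.DataFrame(
    [
        [0.5, 0.3, 0.2],
        [0.4, 0.4, 0.2],
        [0.4, 0.2, 0.4],
        [0.1, 0.5, 0.4],
    ],
    columns=["asset_1", "asset_2", "asset_3"],
)

# Turnover at day t: sum of absolute weight changes versus day t-1.
# min_count=1 keeps the first day as NaN, since it has no previous day.
turnover = weights.diff().abs().sum(axis=1, min_count=1)

# ADT: rolling median of turnover over days t-20..t, i.e. 21 values.
adt = turnover.rolling(window=21, min_periods=1).median()

print(pd.DataFrame({"turnover": turnover, "adt": adt}))
```

Because the median is robust to outliers, a single day of heavy rebalancing barely moves the ADT, which makes it a measure of the typical trading intensity of an allocation.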

In [None]:
X_train = pd.read_csv("data/X_train.csv")
X_train.head()

In [None]:
X_train.columns

In [None]:
X_train["TS"].unique()

In [None]:
X_train["ALLOCATION"].unique()

Be careful: the dates (`TS`) are not continuous, so gaps between consecutive timestamps should be expected.
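
One way to locate such gaps is sketched below on a toy frame. The integer day index is a hypothetical stand-in for the real `TS` labels, whose format is not shown here; the actual column would first need to be mapped to something orderable.

```python
import pandas as pd

# Toy frame: the integer day index is a hypothetical stand-in for the TS labels.
toy = pd.DataFrame({"TS": [1, 2, 3, 7, 8, 12]})

# Unique observed days, in chronological order.
days = toy["TS"].drop_duplicates().sort_values()

# A day is a "break" when it is more than one step after the previous observed day.
breaks = days[days.diff() > 1]

print(breaks.tolist())
```

Any feature that relies on consecutive rows, such as a rolling window or a lag, would silently span these breaks, so it may be safer to compute it within contiguous blocks only.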

## `y_train.csv`

In [None]:
y_train = pd.read_csv("data/y_train.csv")
y_train.head()

In [None]:
y_train.shape

In [None]:
X_test = pd.read_csv("data/X_test.csv")
X_test.head()

## `train.csv`

In [None]:
train = pd.read_csv("data/train.csv")

### Visualise some correlations in the data

In [None]:
corr = train[
    [col for col in train.columns if col not in ["ROW_ID", "TS", "ALLOCATION"]]
].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=False, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation matrix")
plt.show()

In [None]:
train.groupby("ALLOCATION")[
    [f"SIGNED_VOLUME_{i}" for i in range(1, 21)]
].mean().T.corr()

In [None]:
X_train.groupby("ALLOCATION")["AVG_DAILY_TURNOVER"].describe().drop(
    columns=["count"]
).sort_values(by=["std"]).describe()

In [None]:
from data_engineering import feature_engineering as fe


# range(1, 21) covers lags 1..20, matching the SIGNED_VOLUME_1..20 columns used above.
RET_features = [f"RET_{i}" for i in range(1, 21)]
SIGNED_VOLUME_features = [f"SIGNED_VOLUME_{i}" for i in range(1, 21)]
TURNOVER_features = ["AVG_DAILY_TURNOVER"]

window_sizes = [1, 3, 5, 10, 15, 20]


def feature_engineering(
    X: pd.DataFrame,
) -> pd.DataFrame:
    X = (
        X.pipe(
            fe.add_return_to_volume_ratio,
            RET_features=RET_features,
            SIGNED_VOLUME_features=SIGNED_VOLUME_features,
        )
        .pipe(
            fe.add_average_perf_features,
            RET_features=RET_features,
            window_sizes=window_sizes,
            group_col="TS",
        )
        .pipe(
            fe.add_statistical_features,
            RET_features=RET_features,
            SIGNED_VOLUME_features=SIGNED_VOLUME_features,
        )
        .pipe(
            fe.add_average_volume_features,
            SIGNED_VOLUME_features=SIGNED_VOLUME_features,
        )
        # .pipe(fe.add_cross_sectional_features, base_cols=["RET_1", "RET_3"])
    )

    return X


X_feat = feature_engineering(train)

In [None]:
corr = X_feat[
    [col for col in X_feat.columns if col not in ["ROW_ID", "TS", "ALLOCATION"]]
].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=False, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation matrix")
plt.show()

In [None]:
corr = X_feat[X_feat["ALLOCATION"] == "ALLOCATION_02"][
    [col for col in X_feat.columns if col not in ["ROW_ID", "TS", "ALLOCATION"]]
].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=False, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation matrix")
plt.show()

In [None]:
allocation = "ALLOCATION_02"

corr = X_feat[X_feat["ALLOCATION"] == allocation][
    [col for col in X_feat.columns if col not in ["ROW_ID", "TS", "ALLOCATION"]]
].corr()

(corr["target"].drop("target") * 100).abs().sort_values(ascending=False).head(10).index

In [None]:
from statsmodels.tsa.stattools import ccf

alloc1 = train[train["ALLOCATION"] == "ALLOCATION_01"].reset_index(drop=True)
alloc2 = train[train["ALLOCATION"] == "ALLOCATION_02"].reset_index(drop=True)
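# Cross-correlate the return histories of the two allocations, row by row.
# Rows are paired by position, which assumes both allocations share the same TS order.
corr_x_y = []
for i in range(min(len(alloc1), len(alloc2))):
    x = alloc1.loc[i, RET_features].to_numpy(dtype=float)  # variable 1
    y = alloc2.loc[i, RET_features].to_numpy(dtype=float)  # variable 2
    # One coefficient per lag, len(RET_features) in total.
    corr_x_y.append(ccf(x, y))

# One row per paired observation, one column per lag.
cross_corr = pd.DataFrame(corr_x_y)
cross_corr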