# Ubiquant market prediction with the `etna` library 🌋

<a href="https://github.com/tinkoff-ai/etna">
    <img src="https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white"  align='left'>
</a>

In this notebook we explore the given time series and build a baseline prediction with the [etna time series library](https://github.com/tinkoff-ai/etna/).

In [None]:
!mkdir etna-deps && cp -r ../input/etnadeps170/etna-deps-1-7-0.zip . && unzip -P 1234 etna-deps-1-7-0  1> /dev/null 2> /dev/null

In [None]:
!pip install --no-index --ignore-installed  --find-links . etna==1.7.0  1> /dev/null 2> /dev/null

In [None]:
!python --version

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import gc
from collections import defaultdict
from typing import Dict, Tuple, List

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from etna.transforms import LagTransform, TimeSeriesImputerTransform, StandardScalerTransform
from etna.models import SklearnMultiSegmentModel
from etna.metrics import MAE, SMAPE, MSE, Metric
from etna.pipeline import Pipeline
from etna.datasets import TSDataset
from etna.analysis import *

In [None]:
# https://www.kaggle.com/code/edwardcrookenden/eda-and-lgbm-baseline-feature-imp
features_to_use = [
    "f_153", "f_231", "f_225", "f_142", "f_62", "f_118", "f_74", "f_179",
    "f_78", "f_240", "f_22", "f_174", "f_165", "f_241", "f_65", "f_232",
    "f_200", "f_21", "f_221", "f_41"
]

FILE_PATH = "../input/ubiquant-market-prediction/train.csv"

df = pd.read_csv(
    FILE_PATH,
    usecols=["row_id", "time_id", "investment_id", "target"] + features_to_use,
)

# EDA

In [None]:
df.head().iloc[:, :10]

#### Let's instantiate a container for the time series we work with.  
The `etna` API requires a regular timestamp index, so we map each `time_id` to a dummy daily timestamp.

In [None]:
ts_df = df.copy()
timestamp_min = pd.to_datetime("2021-01-01")
ts_df["timestamp"] = timestamp_min + pd.to_timedelta('1 days') * ts_df["time_id"]
ts_df["segment"] = ts_df["investment_id"].apply(str)
ts_df = ts_df[["timestamp", "segment", "target"] + features_to_use]
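
The dummy-timestamp mapping above is invertible, which matters later when predictions have to be converted back to `time_id`. A minimal sketch on a made-up toy frame (the values and the `time_id_restored` column are assumptions for illustration only):

```python
import pandas as pd

# Toy frame standing in for train.csv; the values are invented for illustration.
toy = pd.DataFrame({"time_id": [0, 1, 3], "investment_id": [7, 7, 9]})

timestamp_min = pd.to_datetime("2021-01-01")
one_day = pd.to_timedelta("1 days")

# Forward mapping: time_id -> dummy daily timestamp.
toy["timestamp"] = timestamp_min + one_day * toy["time_id"]

# Inverse mapping: dummy timestamp -> time_id.
toy["time_id_restored"] = ((toy["timestamp"] - timestamp_min) / one_day).astype(int)

print(toy)
```

Note that a gap in `time_id` (here between 1 and 3) becomes a missing day in the daily index.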

In [None]:
ts = TSDataset.to_dataset(ts_df)
ts = TSDataset(ts, freq="D")

To take a quick look at the data we can call the `describe` method:

In [None]:
data_decription = ts.describe()

In [None]:
data_decription.head()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20, 10))
_ = data_decription.num_missing.hist(ax=ax[0])
_ = ax[0].set_title("Number of missing values")
_ = data_decription["length"].hist(ax=ax[1])
_ = ax[1].set_title("Time series length")

There are a lot of missing values in the time series, and they are spread widely across the history.   
As mentioned in the discussions, `time_id` is only weakly coupled with real time steps.  

##### Let's plot some sampled assets:

In [None]:
assets_to_analyse = list(np.random.default_rng(5).choice(ts.segments, size=10))

In [None]:
ts.plot(segments=assets_to_analyse)

We can see that the missing timestamps look fairly random and show no seasonal pattern, for example.  
The time series also start at different timestamps.

In [None]:
ts = TSDataset.to_dataset(ts_df[ts_df.segment.isin(assets_to_analyse)])
ts = ts.asfreq("D")
ts = ts.apply(lambda x: x.fillna(x.mean()).fillna(0), axis=0)
ts = TSDataset(ts, freq="D")

In [None]:
plot_correlation_matrix(
    TSDataset(ts.df.loc[:, pd.IndexSlice[:, "target"]], "D"),
    segments=assets_to_analyse,
    method="pearson"
)

We now have the correlation matrix between the chosen assets.  
Correlation analysis is used, for example, in pair trading strategies.  
In our case all assets are weakly correlated, except for the `1074`-`275` and `1180`-`275` pairs.
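
For intuition, here is a minimal sketch of what a pairwise Pearson correlation matrix looks like on synthetic data. It uses plain pandas rather than `plot_correlation_matrix`, and the three series `a`, `b`, `c` are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Wide frame: one column per (hypothetical) asset.
wide = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),                    # independent noise
})

# Pairwise Pearson correlation between columns.
corr = wide.corr(method="pearson")
print(corr.round(2))
```

A pair such as `a`-`b` stands out against the weakly correlated rest, which is the kind of pattern the plot above makes visible.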

In [None]:
sample_acf_plot(ts, segments=assets_to_analyse)

An autocorrelation plot is used to find correlated lags.  
This information may be important for hyperparameter tuning or feature engineering - we can use lags as features, for example.  
Under the efficient-market hypothesis, the autocorrelation coefficients should be near zero.  
As you can see, this is not quite true - it seems we can get some useful information from historical values.  
We will see that the lagged target improves performance.  
There is other [evidence](https://www.kaggle.com/code/hhgami/predict-and-use-target-shift-1-as-a-feature) of this.
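
To make the idea concrete, here is a minimal sketch of a sample autocorrelation function in plain NumPy, compared on white noise and on a synthetic AR(1) series. The helper `sample_acf` and both series are assumptions for illustration; this is not how `sample_acf_plot` is implemented.

```python
import numpy as np


def sample_acf(x: np.ndarray, max_lag: int) -> np.ndarray:
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


rng = np.random.default_rng(0)
noise = rng.normal(size=2000)

# AR(1) process: each value keeps 80% of the previous one plus fresh noise.
ar = np.zeros_like(noise)
for t in range(1, len(noise)):
    ar[t] = 0.8 * ar[t - 1] + noise[t]

acf_noise = sample_acf(noise, 5)
acf_ar = sample_acf(ar, 5)
print("white noise:", acf_noise.round(2))
print("AR(1):      ", acf_ar.round(2))
```

For white noise the coefficients at non-zero lags stay close to zero, while the AR(1) series shows slowly decaying positive values, which is the kind of signal that makes lag features worth trying.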

In [None]:
gc.collect()

# Backtesting

Let's define helper functions:

In [None]:
def df_test_to_tsdataset(df: pd.DataFrame, features_to_use: list):
    from unittest.mock import Mock

    global timestamp_min
    ts = df.copy()
    ts["time_id"] = ts["row_id"].apply(lambda x: int(x.split("_")[0]))
    ts["timestamp"] = timestamp_min + pd.to_timedelta('1 days') * ts["time_id"]
    ts["segment"] = ts["investment_id"].apply(str)
    ts["target"] = 0
    ts = ts[["timestamp", "segment", "target"] + features_to_use]
    ts = TSDataset.to_dataset(ts)
    ts_target = ts.loc[:, pd.IndexSlice[:, "target"]]
    ts_reg = ts.loc[:, pd.IndexSlice[:, features_to_use]]
    TSDataset._check_regressors = Mock
    ts = TSDataset(ts_target, freq="D", df_exog=ts_reg, known_future="all")
    return ts


In [None]:
def to_submit_transform(ts: TSDataset):
    to_submit = ts.df.loc[:, pd.IndexSlice[:, "target"]]
    to_submit = to_submit.unstack()
    to_submit = to_submit.rename("target")
    to_submit = to_submit.reset_index()
    to_submit["time_id"] = (to_submit["timestamp"] - timestamp_min) / pd.to_timedelta('1 days')
    to_submit["time_id"] = to_submit["time_id"].apply(int)
    to_submit["row_id"] = to_submit.apply(lambda x: f'{x["time_id"]}_{x["segment"]}', axis=1)
    to_submit = to_submit.set_index("row_id")
    return to_submit


In [None]:
from lightgbm.sklearn import LGBMRegressor

class LGBMMultiSegmentModel(SklearnMultiSegmentModel):

    def __init__(self, **kwargs):
        self.kwargs = kwargs
        super().__init__(
            regressor=LGBMRegressor(**kwargs)
        )       


For backtesting we set aside the last 365 points of the target (`SHIFT = 365`).  
There are a lot of missing values in the time series.  
The simplest way to work around the limitations of etna's `TSDataset` is to fill those gaps.  
Of course this is not optimal, but it is acceptable for an example.
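
A minimal sketch of this gap-filling step on a made-up two-column frame (the column names `a` and `b` and their values are assumptions for illustration): missing days are first added to the index, then each column is filled with its own mean, and columns without any observations fall back to zero.

```python
import numpy as np
import pandas as pd

# Irregular index with a two-day gap; column "b" has no observations at all.
idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"])
wide = pd.DataFrame(
    {"a": [1.0, 3.0, 5.0], "b": [np.nan, np.nan, np.nan]}, index=idx
)

# Insert the missing days as NaN rows.
wide = wide.asfreq("D")

# Fill each column with its mean, then with 0 where the mean is undefined.
filled = wide.apply(lambda col: col.fillna(col.mean()).fillna(0), axis=0)
print(filled)
```

Because the mean is computed over the whole column, this fill uses future values as well, which is one reason it is only suitable as a quick baseline.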

In [None]:
SHIFT = 365

ts = TSDataset.to_dataset(ts_df)

# There are a lot of missing values in the time series.
# The simplest way to work around TSDataset limitations is to fill those gaps.
# Of course this is not optimal, but it is acceptable for an example.
ts = ts.asfreq("D")
ts = ts.apply(lambda x: x.fillna(x.mean()).fillna(0), axis=0)

ts_target = ts.loc[:, pd.IndexSlice[:, "target"]].iloc[:-SHIFT]
ts_reg = ts.loc[:, pd.IndexSlice[:, features_to_use]]
ts = TSDataset(
    ts_target, freq="D", df_exog=ts_reg.fillna(0), known_future="all"
)

In [None]:
ts.head()

In [None]:
def corr_pearson(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return np.mean((y_true - np.mean(y_true)) * (y_pred - np.mean(y_pred))) / np.std(y_pred) / np.std(y_true)

class Corr(Metric):
    def __init__(self, mode: str = "per-segment", **kwargs):
        super().__init__(mode=mode, metric_fn=corr_pearson, **kwargs)

def corr_pearson_helper(ts_true: pd.DataFrame, ts_pred: pd.DataFrame) -> Tuple[List[float], Dict[str, float]]:
    corr_list = list()
    corr_dict = defaultdict(list)
    for idx in ts_pred.index:
        y_true = ts_true.loc[idx, pd.IndexSlice[:, "target"]].sort_index().values
        y_pred = ts_pred.loc[idx, pd.IndexSlice[:, "target"]].sort_index().values
        _corr = corr_pearson(y_true, y_pred)
        corr_dict[ts_pred.loc[idx, pd.IndexSlice[:, "fold_number"]].iat[0]].append(_corr)
        corr_list.append(_corr)
    return corr_list, corr_dict
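
The helper above computes one Pearson correlation per timestamp, taken across segments. As a sanity check, the sketch below compares the same `corr_pearson` formula with `np.corrcoef` on synthetic data; the array shapes and values are assumptions for illustration.

```python
import numpy as np


def corr_pearson(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Same formula as in the notebook: covariance over the product of standard deviations.
    cov = np.mean((y_true - np.mean(y_true)) * (y_pred - np.mean(y_pred)))
    return cov / np.std(y_pred) / np.std(y_true)


rng = np.random.default_rng(0)
y_true = rng.normal(size=(4, 50))                # 4 timestamps x 50 segments
y_pred = 0.5 * y_true + rng.normal(size=(4, 50))

# One correlation per timestamp (row), computed across segments.
per_step = [corr_pearson(t, p) for t, p in zip(y_true, y_pred)]
reference = [np.corrcoef(t, p)[0, 1] for t, p in zip(y_true, y_pred)]

print(np.round(per_step, 4))
print(np.round(reference, 4))
```

Both rows should agree up to floating-point error, since the population standard deviation in the denominator matches the `np.mean` used for the covariance.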

In [None]:
HORIZON = 24

pipe = Pipeline(
    model=LGBMMultiSegmentModel(),
    transforms=[],
    horizon=HORIZON
)

In [None]:
metrics_df, forecast_df, fold_info_df = pipe.backtest(ts, metrics=[MAE(), SMAPE(), MSE()], n_folds=5)

In [None]:
(
    metrics_df
    .groupby("segment")
    .mean()
    .reset_index()
    .drop(["segment", "fold_number"], axis=1)
    .apply(["median", "mean", "std"])
)

In [None]:
corr_per_raw, corr_dict = corr_pearson_helper(ts.df, forecast_df)

In [None]:
print(f"Total pearson mean: {np.mean(corr_per_raw)}")

Correlation distribution over folds:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 10))
for i in corr_dict:
    sns.distplot(corr_dict[i], bins=10)

### Let's add some target lags with `LagTransform`

In [None]:
HORIZON = 24

pipe = Pipeline(
    model=LGBMMultiSegmentModel(),
    transforms=[
        LagTransform("target", lags=[HORIZON + i for i in range(20)], out_column="lag")
    ],
    horizon=HORIZON
)
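
A likely reason the lags above start at `HORIZON`: when forecasting `HORIZON` steps ahead in one shot, a shorter lag would point at target values that are not yet known for the later steps. The sketch below illustrates this with plain `pandas.Series.shift` on a toy series; it is not how `LagTransform` is implemented, and the small `HORIZON = 3` is chosen only for readability.

```python
import pandas as pd

HORIZON = 3

# Ten known daily target values.
target = pd.Series(
    range(10),
    index=pd.date_range("2021-01-01", periods=10, freq="D"),
    dtype=float,
)

# Extend the index by HORIZON future days where the target is unknown (NaN).
future_index = pd.date_range(
    target.index[-1] + pd.Timedelta(days=1), periods=HORIZON, freq="D"
)
extended = target.reindex(target.index.append(future_index))

# Build a short lag and a lag equal to the horizon.
features = pd.DataFrame(
    {f"lag_{lag}": extended.shift(lag) for lag in [1, HORIZON]}
)

# Only the horizon-sized lag is fully available on the future rows.
future = features.loc[future_index]
print(future)
```

On the future rows, `lag_1` is known only for the first step, while `lag_3` is known for all three.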

In [None]:
metrics_df, forecast_df, fold_info_df = pipe.backtest(ts, metrics=[MAE(), SMAPE(), MSE()], n_folds=5)

In [None]:
(
    metrics_df
    .groupby("segment")
    .mean()
    .reset_index()
    .drop(["segment", "fold_number"], axis=1)
    .apply(["median", "mean", "std"])
)

In [None]:
corr_per_raw, corr_dict = corr_pearson_helper(ts.df, forecast_df)

Correlation distribution over folds:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 10))
for i in corr_dict:
    sns.distplot(corr_dict[i], bins=10)

In [None]:
print(f"Total pearson mean: {np.mean(corr_per_raw)}")

In [None]:
from unittest.mock import Mock

ts = TSDataset.to_dataset(ts_df)
ts = ts.apply(lambda x: x.fillna(x.mean()).fillna(0), axis=0)
ts_target = ts.loc[:, pd.IndexSlice[:, "target"]]
ts_reg = ts.loc[:, pd.IndexSlice[:, features_to_use]]
TSDataset._check_regressors = Mock
ts = TSDataset(
    ts_target, freq="D",
    df_exog=ts_reg, known_future="all"
)

In [None]:
HORIZON = 24

pipe = Pipeline(
    model=LGBMMultiSegmentModel(),
    transforms=[],
    horizon=HORIZON
)

In [None]:
pipe.fit(ts)

##### We have some improvements both in Pearson correlation and in typical regression metrics.


##### Let's make the final submission

# Final submission

In [None]:
import ubiquant
env = ubiquant.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test set and sample submission
for (test_df, sample_prediction_df) in iter_test:
    
    ts_test = df_test_to_tsdataset(test_df, features_to_use)
    ts_forecast = pipe.model.forecast(ts_test)
    to_submit = to_submit_transform(ts_forecast)
    
    sample_prediction_df = sample_prediction_df.set_index("row_id")
    sample_prediction_df = sample_prediction_df.merge(to_submit, on="row_id", how="left").reset_index()
    sample_prediction_df["target"] = sample_prediction_df["target_y"]
    sample_prediction_df = sample_prediction_df[["row_id", "target"]]
    sample_prediction_df["target"] = sample_prediction_df["target"].fillna(sample_prediction_df.target.mean()).fillna(0)
    
    env.predict(sample_prediction_df)   # register your predictions