# Tabular Playground Series - Mar 2022 with ETNA 🌋

In [None]:
# This notebook uses some unreleased features; to install the latest release instead, run "pip install -U etna"
!pip install git+https://github.com/tinkoff-ai/etna.git@3e432df98e1a8ec6d5e0a79e8d26d4220f82042a --ignore-installed -q 2> /dev/null
!pip install -I jinja2==3.0.3 -q 2> /dev/null   

<a href="https://github.com/tinkoff-ai/etna">
    <img src="https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white"  align='left'>
</a>

In this notebook we will make predictions for [Tabular Playground Series - Mar 2022](https://www.kaggle.com/competitions/tabular-playground-series-mar-2022/overview) with the [ETNA time series library](https://github.com/tinkoff-ai/etna/).

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from copy import deepcopy

In [None]:
TRAIN_PATH = "../input/tabular-playground-series-mar-2022/train.csv"
TEST_PATH = "../input/tabular-playground-series-mar-2022/test.csv"
HORIZON = 36

# Dataset

In the previous notebook for [TPS Jan 2022](https://www.kaggle.com/code/chikovalexander/tps-jan-2022-etna/edit/run/86568046) we showed how to work with time series data using `TSDataset`. Here, the dataset is created in exactly the same way, except for the new `known_future` parameter of the `TSDataset` constructor. It should list the columns of `df_exog` that are "regressors": exogenous data whose future values are known (e.g. the "direction" of the road).

In [None]:
from etna.datasets import TSDataset

In [None]:
def load_dataset(segments=None):
    train = pd.read_csv(TRAIN_PATH, parse_dates=["time"])
    test = pd.read_csv(TEST_PATH, parse_dates=["time"])
    data = pd.concat([train, test])
    
    # Rename columns to fit the ETNA format
    data = data.drop(columns=["row_id"]).rename(columns={"time": "timestamp", "congestion": "target"})
    data["segment"] = data["x"].astype(str) + "_" + data["y"].astype(str) + "_" + data["direction"]
    data = TSDataset.to_flatten(TSDataset(df=TSDataset.to_dataset(data), freq="20T").df)
    if segments is not None:
        data = data[data["segment"].isin(segments)]

    # Some feature engineering + mark the categorical columns so that CatBoost handles them automatically
    data['moment'] = (data['timestamp'].dt.hour * 3 + data['timestamp'].dt.minute // 20).astype("category") 
    data = data.drop(columns=["x", "y", "direction"])
    
    # Dataframe with targets
    df = TSDataset.to_dataset(data[["timestamp", "segment", "target"]]).iloc[:-HORIZON]
    # Dataframe with exogenous data
    df_exog = TSDataset.to_dataset(data.drop(columns=["target"]))
    ts = TSDataset(df=df, freq="20T", df_exog=df_exog, known_future=["moment"])
    return ts
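
The `moment` feature built in `load_dataset` numbers the 20-minute slots within a day. A minimal sketch of the same arithmetic in plain Python (the helper name `moment_of_day` is ours, not part of ETNA):

```python
from datetime import datetime

STEPS_PER_HOUR = 3  # the data has one observation every 20 minutes


def moment_of_day(ts: datetime) -> int:
    """Index of the 20-minute slot within the day, from 0 to 71."""
    return ts.hour * STEPS_PER_HOUR + ts.minute // 20


# Midnight, noon and the last slot of the day
print(moment_of_day(datetime(1991, 9, 30, 0, 0)))    # 0
print(moment_of_day(datetime(1991, 9, 30, 12, 0)))   # 36
print(moment_of_day(datetime(1991, 9, 30, 23, 40)))  # 71
```

Since the slot depends only on the timestamp, its future values are always known, which is why it can be passed in `known_future`.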

We no longer need to add the "regressor_" prefix to every feature we intend to use as a regressor: the dataset will **automatically update the regressors** list after each transformation.

In [None]:
ts = load_dataset()
ts.head()

# EDA


## Special Values

There are some "special values" in the dataset, which were possibly used to fill the missing values in some segments (see more in [TPSMAR22 EDA which makes sense](https://www.kaggle.com/code/ambrosm/tpsmar22-eda-which-makes-sense/notebook)).

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(range(101), TSDataset.to_flatten(ts[:,:,"target"]).target.value_counts().sort_index(), width=1,
        color=['r' if con in [15, 20, 21, 29, 34] else '#ffd700' for con in range(101)])
plt.ylabel('Count')
plt.xlabel('Congestion');

## Missing values

First, let's look at some basic information about the series in the dataset. As we can see, there are 81 missing values in each segment, so we should definitely look at them.

In [None]:
ts.describe()

We can select the imputation strategy by visualizing the imputed values. As we can see, the missing points are mostly not consecutive, so, given the high frequency of the dataset, we can impute them with the last known value.

In [None]:
from etna.analysis import plot_imputation
from etna.transforms import TimeSeriesImputerTransform

In [None]:
imputer = TimeSeriesImputerTransform(in_column="target", strategy="forward_fill")
plot_imputation(ts=ts, imputer=imputer, segments=ts.segments[:4])

In [None]:
ts.fit_transform([imputer])
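
To make the "last known value" strategy concrete, here is a small library-free sketch of forward filling. It only illustrates the idea and is not ETNA's implementation; the function name and the toy series are made up for the example.

```python
import math


def forward_fill(values):
    """Replace each NaN with the most recent non-NaN value.

    Leading NaNs stay NaN because there is nothing to copy from.
    """
    filled, last = [], math.nan
    for value in values:
        if not math.isnan(value):
            last = value
        filled.append(last)
    return filled


# Toy congestion series with isolated gaps (hypothetical numbers)
series = [35.0, math.nan, 40.0, math.nan, math.nan, 42.0]
filled_series = forward_fill(series)
print(filled_series)  # [35.0, 35.0, 40.0, 40.0, 40.0, 42.0]
```

With 20-minute data and mostly isolated gaps, the copied value is at most a few steps old, which is why this simple strategy is reasonable here.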

## Seasonality

Now, let's take a look at the time series in the dataset. 

In [None]:
ts.plot(segments=ts.segments[:10])

Here we can see the **daily** seasonality.

In [None]:
ts.plot(segments=ts.segments[:10], start="1991-08-01", end="1991-08-07")

We can also look at the autocorrelation plot. In fact, there is also **weekly** seasonality, **12-hour** seasonality (e.g. 2_2_NB) and **2-day** seasonality (e.g. 2_2_NE). The absolute values of the autocorrelation vary significantly from segment to segment, so it might be hard for a single model to capture all types of seasonality.

In [None]:
from etna.analysis import sample_acf_plot, sample_pacf_plot

In [None]:
sample_acf_plot(ts, segments=["0_2_EB", "2_2_NE", "0_0_SB", "2_2_NB"], lags=3*14*24)
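
For intuition about what the plot measures, the sample autocorrelation at a single lag can be sketched in plain Python. This is the textbook estimator applied to a synthetic daily sine wave, not the exact routine used by `sample_acf_plot`; lags are counted in 20-minute steps, so one day is `3 * 24 = 72` steps.

```python
import math


def autocorrelation(values, lag):
    """Textbook sample autocorrelation of a series at the given lag."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum(
        (values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


STEPS_PER_DAY = 3 * 24  # 20-minute steps in one day

# Synthetic series with a pure daily cycle, 14 days long (illustrative only)
toy = [math.sin(2 * math.pi * t / STEPS_PER_DAY) for t in range(STEPS_PER_DAY * 14)]

acf_day = autocorrelation(toy, STEPS_PER_DAY)            # strongly positive
acf_half_day = autocorrelation(toy, STEPS_PER_DAY // 2)  # strongly negative
print(round(acf_day, 2), round(acf_half_day, 2))
```

A peak at lag 72 therefore signals daily seasonality, a peak at lag 504 weekly seasonality, and so on.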

## Correlations

There might be correlations between the directions within one road. To spot them, we can plot the correlation matrix. As we can see, there is a correlation between the SB and NB directions within the road.

**Note**: I tried to use the lags and the label of the most correlated segment as features in the model, but it didn't help. Maybe someone will find out how to use them smartly.

In [None]:
from etna.analysis import plot_correlation_matrix

In [None]:
road = "0_1"
road_segments= [segment for segment in ts.segments if segment.startswith(road)]
plot_correlation_matrix(
    TSDataset(ts.df.loc[:, pd.IndexSlice[:, "target"]], "20T"), # Bug workaround: normally you can just pass "ts" here
    segments=road_segments,
    method="pearson"
)

Also, there is a strong correlation between neighboring roads in one direction (e.g. 1_2_NB and 1_3_NB). It might be helpful to somehow show the model the neighborhood of the segments.

In [None]:
direction = "NB"
direction_segments= [segment for segment in ts.segments if segment.endswith(direction)]
plot_correlation_matrix(
    TSDataset(ts.df.loc[:, pd.IndexSlice[:, "target"]], "20T"),
    segments=direction_segments,
    method="pearson"
)

## Clustering

There might be roads with the same patterns; let's check.

In [None]:
from etna.analysis import plot_clusters
from etna.clustering import EuclideanClustering

In [None]:
model = EuclideanClustering()
model.build_distance_matrix(ts=ts)
model.build_clustering_algo(n_clusters=8, linkage="average") # number of clusters = number of directions
segment2cluster = model.fit_predict()
centroids = model.get_centroids()

In [None]:
plot_clusters(ts=ts, segment2cluster=segment2cluster, centroids_df=centroids)

We see here two big clusters and several smaller ones.

**Note**: I tried to use the cluster labels as a feature in the model, but it also didn't help.

# Feature engineering

In [None]:
from etna.transforms import TimeSeriesImputerTransform 
from etna.transforms import StandardScalerTransform, YeoJohnsonTransform 
from etna.transforms import LagTransform, FourierTransform 
from etna.transforms import (MeanTransform, StdTransform, MinTransform,
                             MaxTransform, MedianTransform, MADTransform)
from etna.transforms import SegmentEncoderTransform

In [None]:
# Imputation
imputer = TimeSeriesImputerTransform(in_column="target", strategy="forward_fill")

# Preprocessing
power = YeoJohnsonTransform(in_column="target")
scaler = StandardScalerTransform(in_column="target")

# Lags and seasonalities
seasonal_lags = [3 * 7 * 24 * i for i in range(1, 9)] + [3 * 24 * i for i in range(1, 9)] + [3 * 12 * i for i in range(1, 9)]
lags = LagTransform(in_column="target", lags=seasonal_lags, out_column="lag")

# Rolling statistics
statistics_transforms = [MeanTransform, StdTransform, MinTransform,
                         MaxTransform, MedianTransform, MADTransform]
names = ["mean", "std", "min", "max", "median", "mad"]
seasonal_statistics = [
    transform(in_column="lag_504", window=-1, seasonality=3 * 7 * 24, out_column=name+"_w")
    for transform, name in zip(statistics_transforms, names)
]
seasonal_statistics += [
    transform(in_column="lag_72", window=-1, seasonality=3 * 7 * 24, out_column=name+"_d")
    for transform, name in zip(statistics_transforms, names)
]
seasonal_statistics += [
    transform(in_column="lag_504", window=4, seasonality=3 * 7 * 24, out_column=name+"_short_w")
    for transform, name in zip(statistics_transforms, names)
]
seasonal_statistics += [
    transform(in_column="lag_72", window=7, seasonality=3 * 7 * 24, out_column=name+"_short_d")
    for transform, name in zip(statistics_transforms, names)
]


# Segment mark
segment_encoder = SegmentEncoderTransform()

transforms = [imputer, power, scaler, lags, *seasonal_statistics, segment_encoder]
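
The lag values above are counted in 20-minute steps. The following sketch converts them to wall-clock time and checks that the smallest lag equals the forecast horizon of 36 steps, so every lag feature can be computed for the whole forecast window. The names `STEP` and `lag_to_timedelta` are introduced only for this illustration.

```python
from datetime import timedelta

STEP = timedelta(minutes=20)  # dataset frequency
HORIZON = 36                  # forecast length in steps, as in the notebook

weekly_lags = [3 * 7 * 24 * i for i in range(1, 9)]  # 1 to 8 weeks back
daily_lags = [3 * 24 * i for i in range(1, 9)]       # 1 to 8 days back
half_day_lags = [3 * 12 * i for i in range(1, 9)]    # 12 to 96 hours back
all_lags = weekly_lags + daily_lags + half_day_lags


def lag_to_timedelta(lag: int) -> timedelta:
    """Convert a lag in 20-minute steps to a time offset."""
    return lag * STEP


print(lag_to_timedelta(504))  # 7 days, 0:00:00
print(lag_to_timedelta(72))   # 1 day, 0:00:00
print(lag_to_timedelta(36))   # 12:00:00
print(min(all_lags) >= HORIZON)  # True: no lag reaches into the forecast window
```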

In [None]:
ts = load_dataset()

In [None]:
ts.fit_transform(deepcopy(transforms))
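
The rolling statistics above take a `seasonality` argument. A rough sketch of what a seasonal window means, assuming it picks `window` points spaced `seasonality` steps apart and ending at the current point, with `window=-1` meaning "use the whole history" (this is our reading of the parameters, not a copy of ETNA's code):

```python
def seasonal_rolling_mean(values, window, seasonality):
    """Mean over points t, t - s, t - 2s, ... (at most `window` of them).

    Assumption for illustration: window=-1 means all available history.
    """
    result = []
    for t in range(len(values)):
        indices = range(t, -1, -seasonality)  # t, t - s, t - 2s, ...
        if window != -1:
            indices = indices[:window]
        picked = [values[i] for i in indices]
        result.append(sum(picked) / len(picked))
    return result


toy = [float(v) for v in range(10)]  # 0.0, 1.0, ..., 9.0

short = seasonal_rolling_mean(toy, window=2, seasonality=3)
full = seasonal_rolling_mean(toy, window=-1, seasonality=3)
print(short[7])  # mean of values at t=7 and t=4 -> 5.5
print(full[7])   # mean of values at t=7, 4 and 1 -> 4.0
```

Applied to `lag_504` with a weekly seasonality, such a window averages the same time of day and day of week over previous weeks.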

# Feature Importance

Roads in the dataset behave differently, which implies that feature importance might vary between roads. Let's generate the set of features that will be used later in the model and look at the top 20 most important ones for each segment.

In [None]:
from etna.analysis import StatisticsRelevanceTable, ModelRelevanceTable, plot_feature_relevance
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Bug workaround; we will fix it later
ts.df = ts.df.astype(float)
ts.df = ts.df.dropna() 
ts = TSDataset(df=ts[:,["0_2_EB", "2_2_NE", "0_0_SB", "2_2_NB"],:], freq="20T")

In our library, we use 2 approaches to evaluate feature relevance.

The first one is the feature relevance from tree-based models:

In [None]:
plot_feature_relevance(
    ts=ts,
    relevance_table=ModelRelevanceTable(),
    normalized=True,
    relevance_aggregation_mode="per-segment",
    top_k=20,
    segments=["0_2_EB", "2_2_NE", "0_0_SB", "2_2_NB"],
    relevance_params=dict(model=RandomForestRegressor(n_estimators=10))
)

The second one is based on statistical tests (the feature relevance here is a q-value, so the lower, the better):

In [None]:
plot_feature_relevance(
    ts=ts,
    normalized=True,
    relevance_table=StatisticsRelevanceTable(),
    relevance_aggregation_mode="per-segment",
    top_k=20,
    segments=["0_2_EB", "2_2_NE", "0_0_SB", "2_2_NB"],
)

As we can see, the order of the features differs between segments, so the best way might be to use a separate model for each segment. However, that is significantly slower, so we won't do it in this notebook, but you might try it yourself.

As we are going to use a multi-segment model, we actually need the aggregated feature relevance!

In [None]:
plot_feature_relevance(
    ts=ts,
    relevance_table=ModelRelevanceTable(),
    normalized=True,
    relevance_aggregation_mode="mean",
    top_k=20,
    segments=["0_2_EB", "2_2_NE", "0_0_SB", "2_2_NB"],
    relevance_params=dict(model=RandomForestRegressor(n_estimators=10))
)

In [None]:
plot_feature_relevance(
    ts=ts,
    normalized=True,
    relevance_table=StatisticsRelevanceTable(),
    relevance_aggregation_mode="mean",
    top_k=20,
    segments=["0_2_EB", "2_2_NE", "0_0_SB", "2_2_NB"],
)

**Note**: Both methods suggest that the **rolling statistics** are the most important features.

# Model

Now, let's create a baseline.

In [None]:
from etna.pipeline import Pipeline
from etna.models import CatBoostModelMultiSegment

# Backtest

An important part of any project is correct validation.

In [None]:
pipeline = Pipeline(model=CatBoostModelMultiSegment(), 
                   transforms=transforms, 
                   horizon=HORIZON)

In [None]:
from etna.metrics import MAE

In [None]:
ts = load_dataset(segments=sorted(ts.segments)[:10])

First, let's use the last three folds of the classical time series cross-validation strategy.

In [None]:
metrics, forecast, _ = pipeline.backtest(ts=ts, metrics=[MAE()], n_folds=3, n_jobs=3)

In [None]:
metrics.mean()["MAE"]

Don't be upset. In fact, we need to forecast a Monday afternoon, so why are we validating on the last 36 hours? We need to validate on the last three Monday afternoons!

In [None]:
from etna.pipeline import FoldMask

In [None]:
fold_mask_1 = FoldMask(first_train_timestamp=None,
                       last_train_timestamp="1991-09-23 11:40:00",
                       target_timestamps=pd.date_range(start="1991-09-23 12:00:00", end="1991-09-23 23:40:00", freq="20T"))
fold_mask_2 = FoldMask(first_train_timestamp=None,
                       last_train_timestamp="1991-09-16 11:40:00",
                       target_timestamps=pd.date_range(start="1991-09-16 12:00:00", end="1991-09-16 23:40:00", freq="20T"))
fold_mask_3 = FoldMask(first_train_timestamp=None,
                       last_train_timestamp="1991-09-09 11:40:00",
                       target_timestamps=pd.date_range(start="1991-09-09 12:00:00", end="1991-09-09 23:40:00", freq="20T"))
folds = [fold_mask_3, fold_mask_2, fold_mask_1]
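
As a sanity check on the fold boundaries, the sketch below rebuilds the three validation windows with the standard library only. It confirms that the chosen dates are Mondays and that each window holds exactly 36 timestamps, from 12:00 to 23:40. The helper `afternoon_targets` is hypothetical and not part of ETNA.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=20)  # dataset frequency
HORIZON = 36                  # 12 hours of 20-minute steps


def afternoon_targets(day: datetime) -> list:
    """Timestamps of the afternoon (12:00 to 23:40) of the given day."""
    start = day.replace(hour=12, minute=0)
    return [start + i * STEP for i in range(HORIZON)]


mondays = [datetime(1991, 9, 9), datetime(1991, 9, 16), datetime(1991, 9, 23)]

for day in mondays:
    targets = afternoon_targets(day)
    last_train = targets[0] - STEP  # last timestamp available for training
    print(
        day.date(),
        "Monday" if day.weekday() == 0 else "not Monday",
        "| train until", last_train.time(),
        "| targets", targets[0].time(), "to", targets[-1].time(),
        "| count", len(targets),
    )
```

Each fold thus mirrors the real task: train on everything up to 11:40 on a Monday and predict the rest of that day.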

In [None]:
metrics_mondays, _, _ = pipeline.backtest(ts=ts, metrics=[MAE()], n_folds=folds, n_jobs=3)

In [None]:
metrics_mondays.mean()["MAE"]

This looks much better correlated with the leaderboard (LB)!

# Ensemble

Now, let's build the final solution. We will use an ensemble of CatBoost models with different random seeds to make the forecast more robust.

In [None]:
from etna.ensembles import VotingEnsemble

In [None]:
seeds = [None, 13, 121, 11041999, 3141, 235813, 1501]
pipelines = [Pipeline(model=CatBoostModelMultiSegment(random_seed=seed),
                      transforms=transforms,
                      horizon=HORIZON)
             for seed in seeds]
ensemble = VotingEnsemble(pipelines=pipelines, n_jobs=5)
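
Conceptually, a voting ensemble blends the member forecasts point by point. The sketch below shows that idea with a plain (optionally weighted) average; it assumes uniform weights when none are given and uses made-up numbers, so treat it as an illustration rather than a description of ETNA internals.

```python
def average_forecasts(forecasts, weights=None):
    """Weighted point-wise average of several forecasts of equal length."""
    if weights is None:
        weights = [1.0] * len(forecasts)  # assumption: uniform weights
    total = sum(weights)
    horizon = len(forecasts[0])
    return [
        sum(w * forecast[t] for w, forecast in zip(weights, forecasts)) / total
        for t in range(horizon)
    ]


# Three hypothetical member forecasts for the same three timestamps
member_forecasts = [
    [40.0, 42.0, 44.0],
    [41.0, 43.0, 42.0],
    [39.0, 41.0, 46.0],
]
blended = average_forecasts(member_forecasts)
print(blended)  # [40.0, 42.0, 44.0]
```

Averaging models that differ only in their random seed mainly reduces the variance coming from the training procedure.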

# Forecast

In [None]:
from etna.analysis import plot_forecast

In [None]:
ts = load_dataset()
ensemble.fit(ts)
forecast = ensemble.forecast()

In [None]:
plot_forecast(forecast_ts=forecast, train_ts=ts, n_train_samples=3*7*24, segments=sorted(ts.segments)[:10])

# Submission

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
def make_submission(forecast):
    forecast = TSDataset.to_flatten(forecast[:,:,"target"])
    
    test = pd.read_csv(TEST_PATH, parse_dates=["time"])
    test = test.rename(columns={"time": "timestamp"})
    test["segment"] = test["x"].apply(str) + "_" + test["y"].apply(str) + "_" + test["direction"]
    test = pd.merge(test, forecast, on=["timestamp", "segment"])
    test = test.rename(columns={"target": "congestion"})
    submission = test[["row_id", "congestion"]]
    
    # Postprocessing (see https://www.kaggle.com/code/ambrosm/tpsmar22-generalizing-the-special-values for an explanation)
    
    # Read and prepare the training data
    train = pd.read_csv(TRAIN_PATH, parse_dates=['time'])
    train['hour'] = train['time'].dt.hour
    train['minute'] = train['time'].dt.minute
    
    # Compute the quantiles of workday afternoons in September except Labor Day
    sep = train[(train.time.dt.hour >= 12) & (train.time.dt.weekday < 5) &
                (train.time.dt.dayofyear >= 246)]
    lower = sep.groupby(['hour', 'minute', 'x', 'y', 'direction']).congestion.quantile(0.2).values
    upper = sep.groupby(['hour', 'minute', 'x', 'y', 'direction']).congestion.quantile(0.8).values

    # Clip the submission data to the quantiles
    submission_out = submission.copy()
    submission_out['congestion'] = submission.congestion.clip(lower, upper)

    # Display some statistics
    mae = mean_absolute_error(submission.congestion, submission_out.congestion)
    print(f'Mean absolute modification: {mae:.4f}')
    print(f"Submission was below lower bound: {(submission.congestion <= lower - 0.5).sum()}")
    print(f"Submission was above upper bound: {(submission.congestion > upper + 0.5).sum()}")

    # Round targets
    submission_out['congestion'] = submission_out["congestion"].round().astype(int)
    submission_out.to_csv('submission.csv', index=False)

In [None]:
make_submission(forecast)
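
The postprocessing inside `make_submission` boils down to two steps: clip each prediction to a per-slot interval, then round to an integer. A tiny sketch of those two steps on made-up numbers (the bounds here are hypothetical; in the notebook they are the 0.2 and 0.8 quantiles of September workday afternoons):

```python
def clip_and_round(predictions, lower, upper):
    """Clip each prediction to [lower, upper] element-wise, then round."""
    clipped = [
        min(max(pred, low), high)
        for pred, low, high in zip(predictions, lower, upper)
    ]
    return [round(value) for value in clipped]


# Hypothetical predictions and bounds for three rows of the submission
preds = [31.4, 55.2, 47.8]
lower = [35.0, 40.0, 45.0]
upper = [45.0, 50.0, 52.0]

result = clip_and_round(preds, lower, upper)
print(result)  # [35, 50, 48]
```

The first value is raised to its lower bound, the second is lowered to its upper bound, and the third is already inside its interval and only gets rounded.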

Phew, the baseline is ready. But work is still in progress, so wait for updates soon!