Based on the paper https://proceedings.mlr.press/v139/xu21h/xu21h.pdf

The study introduces a **distribution-free technique** for constructing prediction intervals (PIs) for
**dynamic time series data**. The EnbPI method uses a **bootstrap ensemble estimator** to construct
sequential PIs. Unlike classical conformal prediction methods, EnbPI **does not require data
exchangeability** and has been custom-built for time series.
The exchangeability assumption states that the joint distribution of the observations is unchanged
when they are reordered, so the order in which they appear in the dataset does not matter. This does
not hold for time series, where the order of the data points is crucial. Because EnbPI does not rely
on exchangeability, it is well suited to time series analysis.
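
To make the exchangeability point concrete, here is a minimal sketch on synthetic data. The AR(1) process, its coefficient `phi`, and the helper `lag1_autocorr` are assumptions chosen for illustration and are not part of the hotel dataset. If order did not matter, shuffling the series would leave its dependence structure unchanged; instead, the lag-1 autocorrelation collapses towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series: each value depends on the previous one.
n, phi = 2000, 0.9
noise = rng.normal(size=n)
series = np.zeros(n)
for t in range(1, n):
    series[t] = phi * series[t - 1] + noise[t]


def lag1_autocorr(x: np.ndarray) -> float:
    """Sample correlation between consecutive values x[t-1] and x[t]."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])


# Reordering the observations destroys the temporal dependence.
shuffled = rng.permutation(series)

acf_original = lag1_autocorr(series)
acf_shuffled = lag1_autocorr(shuffled)
print(f"lag-1 autocorrelation, original order: {acf_original:.2f}")
print(f"lag-1 autocorrelation, shuffled order: {acf_shuffled:.2f}")
```
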
PIs generated by EnbPI attain finite-sample, approximately valid marginal coverage for broad classes
of regression functions and time series, under the mild assumption of strongly mixing stochastic errors.
EnbPI is also computationally efficient and avoids overfitting because it requires neither data splitting
nor training multiple ensemble estimators. It scales to producing arbitrarily many PIs sequentially
and suits a wide range of regression functions.
Time series data is dynamic and often non-stationary, meaning its statistical properties can change
over time. While various regression functions exist for predicting time series, such as those using
boosted trees or neural network structures, these existing methods often struggle to construct
accurate PIs. Typically, they can only create reliable intervals by placing restrictive assumptions on
the underlying distribution of the time series, which may not always be appropriate or feasible.
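
The bootstrap-ensemble idea described in this introduction can be sketched with NumPy alone. This is a simplified illustration under stated assumptions, not the paper's exact algorithm and not Fortuna's implementation: the data are synthetic, the base model is ordinary least squares, predictions are aggregated with a plain mean, and the residual set is kept fixed rather than updated as new observations arrive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series (an assumption for illustration): linear trend plus noise.
T, T_test, B, alpha = 200, 50, 30, 0.1
t_all = np.arange(T + T_test, dtype=float)
y_all = 0.05 * t_all + rng.normal(scale=1.0, size=T + T_test)
X_all = np.column_stack([np.ones_like(t_all), t_all])  # intercept + time index
X_tr, y_tr = X_all[:T], y_all[:T]
X_te, y_te = X_all[T:], y_all[T:]

# 1) Fit one least-squares model per bootstrap sample of the training indices.
indices = rng.integers(0, T, size=(B, T))
coefs = np.stack(
    [np.linalg.lstsq(X_tr[idx], y_tr[idx], rcond=None)[0] for idx in indices]
)
train_preds = coefs @ X_tr.T  # shape (B, T)
test_preds = coefs @ X_te.T   # shape (B, T_test)

# 2) Out-of-bag mask: oob[b, i] is True when point i was not drawn for model b.
oob = np.ones((B, T), dtype=bool)
oob[np.arange(B)[:, None], indices] = False
has_oob = oob.any(axis=0)  # points left out by at least one model

# 3) Leave-one-out residuals: average only the models that never saw point i.
weights = oob / np.maximum(oob.sum(axis=0), 1)
loo_train = (weights * train_preds).sum(axis=0)
residuals = np.abs(y_tr - loo_train)[has_oob]

# 4) Interval = ensemble centre +/- the (1 - alpha) quantile of the residuals.
width = np.quantile(residuals, 1 - alpha)
center = test_preds.mean(axis=0)
lower, upper = center - width, center + width

coverage = float(np.mean((y_te >= lower) & (y_te <= upper)))
print(f"half-width: {width:.2f}, empirical test coverage: {coverage:.2f}")
```

No data are held out for calibration: every training point gets an honest residual from the models whose bootstrap sample happened to exclude it, which is why the method needs neither data splitting nor a second ensemble.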

In [9]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor

from fortuna.conformal import EnbPI
from fortuna.metric.regression import prediction_interval_coverage_probability

# Loading the data

In [10]:
DATA_WO_CANCELLATIONS_PATH = '../data/interim/data_wo_cancel.parquet'
RAW_DATA = '../data/raw/export_df.parquet'
DATA_HOTEL0_PATH = '../data/interim/data_wo_cancel_hotel0.parquet'
DATA_HOTEL1_PATH = '../data/interim/data_wo_cancel_hotel1.parquet'


In [11]:
data = pd.read_parquet(DATA_WO_CANCELLATIONS_PATH)
dataResort = pd.read_parquet(DATA_HOTEL0_PATH)
dataCity = pd.read_parquet(DATA_HOTEL1_PATH)
data.head()

Unnamed: 0,hotel_id,datum_dolaska,zemlja_gosta,kanal_prodaje_id,tip_sobe_id,cijena_nocenja,gost_id,duljina_boravka,ukupno_gostiju,raspon_dolazak_rezervacija
0,0,2015-07-01,PRT,0,0,100.0,1077152,0,2.0,161
1,0,2015-07-01,PRT,0,0,100.0,1017906,0,2.0,21
2,0,2015-07-01,GBR,0,1,64.991345,1039896,1,1.0,49
3,0,2015-07-01,GBR,1,1,74.368897,1008245,1,1.0,397
4,0,2015-07-01,GBR,2,1,130.973278,1093703,2,2.0,360


In [13]:
# Sum the guests per arrival date, then resample to a daily grid so that
# days without arrivals appear with a total of 0.
NumberOfGuestsDailyCity = dataCity['ukupno_gostiju'].groupby(dataCity['datum_dolaska']).sum()
NumberOfGuestsDailyCity = NumberOfGuestsDailyCity.resample('D').sum().to_frame()
NumberOfGuestsDailyCity.head(5)

datum_dolaska,ukupno_gostiju
2015-01-01,10.0
2015-01-02,6.0
2015-01-03,11.0
2015-01-04,8.0
2015-01-05,4.0

In [14]:
len(NumberOfGuestsDailyCity)

1096

In [15]:
len(dataCity)

46047


In [None]:
# Same daily aggregation for the resort hotel.
NumberOfGuestsDailyResort = dataResort['ukupno_gostiju'].groupby(dataResort['datum_dolaska']).sum()
NumberOfGuestsDailyResort = NumberOfGuestsDailyResort.resample('D').sum().to_frame()
NumberOfGuestsDailyResort.head()

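In [None]:
# Target: daily guest totals for the city hotel.
y = NumberOfGuestsDailyCity['ukupno_gostiju']
# Dropping the only column leaves X with the date index and no feature columns.
X = NumberOfGuestsDailyCity.drop(columns=["ukupno_gostiju"])
# Chronological 80/20 split (shuffle=False keeps the time order intact).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)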
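
Because the daily frame holds only the target column, `X` ends up without any features, and a regressor needs at least one. A minimal sketch of one way to derive some follows. It assumes calendar and lag features are appropriate for this series; the helper `add_calendar_and_lag_features`, the chosen lags, and the synthetic data are hypothetical and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for NumberOfGuestsDailyCity (illustrative values only).
index = pd.date_range("2015-01-01", periods=60, freq="D", name="datum_dolaska")
daily = pd.DataFrame(
    {"ukupno_gostiju": np.random.default_rng(0).poisson(8, size=60).astype(float)},
    index=index,
)


def add_calendar_and_lag_features(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Hypothetical helper: derive features from the date index and past targets."""
    out = frame.copy()
    out["day_of_week"] = out.index.dayofweek
    out["month"] = out.index.month
    out["lag_1"] = out[target].shift(1)  # value from the previous day
    out["lag_7"] = out[target].shift(7)  # value from the same weekday one week earlier
    return out.dropna()                  # the first 7 rows have no lag_7 value


features = add_calendar_and_lag_features(daily, "ukupno_gostiju")
y_demo = features["ukupno_gostiju"]
X_demo = features.drop(columns=["ukupno_gostiju"])
print(X_demo.shape)
```

Lag features only look backwards, so they are compatible with the chronological split used above.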

In [None]:
y.shape

### Data bootstrapping

In [None]:
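class DataFrameBootstrapper:
    """Draw bootstrap resamples (with replacement) of a pandas training set."""

    def __init__(self, n_samples: int):
        self.n_samples = n_samples  # number of bootstrap resamples to draw

    def __call__(
        self, X: pd.DataFrame, y: pd.Series
    ) -> tuple[np.ndarray, list[tuple[pd.DataFrame, pd.Series]]]:
        # One row of positional indices per resample, each as long as y.
        indices = np.random.choice(y.shape[0], size=(self.n_samples, y.shape[0]))
        # .iloc selects by position, so X and y must be pandas objects.
        return indices, [(X.iloc[idx], y.iloc[idx]) for idx in indices]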

In [None]:
n_bs_samples = 10
bs_indices, bs_train_data = DataFrameBootstrapper(n_samples=n_bs_samples)(
    X_train, y_train
)
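
The index array returned by the bootstrapper matters because it records which training points each model did not see. A minimal sketch, assuming an array with the same layout as `bs_indices` (one row of sampled positions per model); the small sizes and the names `in_bag` and `out_of_bag` are illustrative only. For large samples, roughly a fraction e^-1 (about 37%) of the points is left out of each resample.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_points = 10, 8

# Same layout as bs_indices: one row of sampled positions per bootstrap model.
indices = rng.integers(0, n_points, size=(n_models, n_points))

# in_bag[b, i] is True when training position i was drawn for model b.
in_bag = np.zeros((n_models, n_points), dtype=bool)
in_bag[np.arange(n_models)[:, None], indices] = True
out_of_bag = ~in_bag

# Number of models that never saw each point and can predict it honestly.
oob_counts = out_of_bag.sum(axis=0)
print(oob_counts)
```
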

In [None]:
bs_indices.shape

In [None]:
X_train.shape

In [None]:
bs_train_data[0][0]

In [None]:
# Boolean array: True where an index label occurs more than once in the first resample
duplicate_indices = bs_train_data[0][0].index.duplicated(keep=False)

# Print the index labels that were drawn more than once
print(bs_train_data[0][0].index[duplicate_indices])

In [None]:
bs_train_data[0][0].loc[bs_train_data[0][0].index[duplicate_indices][0]]