In [1]:
import pandas as pd

In [7]:
train_df = pd.read_parquet('./data/train.parquet')
test_df = pd.read_parquet('./data/test.parquet')
val_df = pd.read_parquet('./data/val.parquet')


print(train_df.shape, val_df.shape, test_df.shape)

(24210, 9) (3459, 9) (6918, 9)


## General approach to working with a time series

1. Plot the series and note any trend and seasonality (we have done this in `data_preparation.ipynb`).
2. Detrend the time series by removing the seasonality and drift.
3. Fit a baseline model and calculate the residuals.
4. Diagnose the residuals.
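
Steps 2 to 4 can be sketched on a small synthetic series. This is only an illustration under stated assumptions: the series, its 24-hour period, and the seasonal-naive baseline are made up for the example and are not the notebook's data or model.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (an assumption for illustration): linear drift
# plus a 24-hour cycle plus Gaussian noise.
rng = np.random.default_rng(0)
index = pd.date_range("2021-01-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
t = np.arange(len(index))
values = 0.01 * t + np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.1, len(t))
series = pd.Series(values, index=index, name="load")

# Step 2: remove the daily seasonality with a lag-24 difference, then the
# remaining drift with a lag-1 difference.
detrended = series.diff(24).diff(1).dropna()

# Step 3: seasonal-naive baseline, i.e. predict the value seen 24 hours earlier.
baseline = series.shift(24)
residual = (series - baseline).dropna()

# Step 4: two quick diagnostics. A residual mean away from zero points to
# drift the baseline does not capture; the lag-1 autocorrelation shows how
# much structure is left.
print(f"detrended mean: {detrended.mean():.3f}")
print(f"residual mean: {residual.mean():.3f}")
print(f"residual lag-1 autocorrelation: {residual.autocorr(lag=1):.3f}")
```

In this toy series the baseline ignores the drift, so the residual mean sits near the 24-hour drift (0.01 × 24) rather than near zero, which is the kind of signal the diagnosis step is meant to catch.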

## SARIMAX

SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous variables) is a statistical model for forecasting time series that have both seasonal and non-seasonal structure and that may be influenced by external (exogenous) variables. It extends the ARIMA model with seasonal terms (SARIMA) and with regressors for independent external variables, which makes it suitable for more complex forecasting scenarios. The acronym can be decomposed into:

* S: Seasonal

* AR: Autoregressive (p)

* I: Integrated (d)

* MA: Moving Average (q)

* X: Exogenous variables
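
For reference, these components can be written as a single model. The form below is the common "regression with SARIMA errors" formulation. The seasonal orders $P$, $D$, $Q$ and the seasonal period $s$ are standard notation that this notebook has not introduced, and $L$ denotes the lag operator ($L y_t = y_{t-1}$).

$$
y_t = \beta^\top x_t + u_t
$$

$$
\phi_p(L)\,\Phi_P(L^s)\,(1-L)^d\,(1-L^s)^D\,u_t = \theta_q(L)\,\Theta_Q(L^s)\,\varepsilon_t
$$

Here $x_t$ holds the exogenous variables (X), $\phi_p$ and $\theta_q$ are the non-seasonal AR and MA polynomials, $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts (S), $(1-L)^d$ and $(1-L^s)^D$ are the non-seasonal and seasonal differencing (I), and $\varepsilon_t$ is white noise.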

### Stationarity test

Stationarity means that the statistical properties of a time series, i.e. its mean, variance and autocovariance, do not change over time. Many statistical models require the series to be stationary to make effective and precise predictions. Two tests are commonly used in practice:

1. Augmented Dickey Fuller (“ADF”) test.
2. Kwiatkowski-Phillips-Schmidt-Shin (“KPSS”) test.

### ADF Test

The ADF test checks for the presence of a unit root in the series, which helps determine whether the series is stationary. The null and alternative hypotheses of this test are:

$H_0$: the series has a unit root

$H_1$: the series has no unit root

If we fail to reject the null hypothesis, the test gives no evidence against a unit root, which suggests that the series may be non-stationary.
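
To make the hypotheses concrete, the ADF test is based on a regression of the following general form. The symbols are introduced here for illustration and do not appear elsewhere in this notebook; the constant $\alpha$ and trend term $\beta t$ are optional depending on the test variant, and $k$ is the number of lagged differences.

$$
\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{k} \delta_i\, \Delta y_{t-i} + \varepsilon_t
$$

In this notation the hypotheses become $H_0: \gamma = 0$ (unit root) and $H_1: \gamma < 0$ (no unit root), so a sufficiently negative test statistic leads to rejecting $H_0$.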


### KPSS Test

KPSS is another test for checking the stationarity of a time series. Its null and alternative hypotheses are the opposite of those of the ADF test:

$H_0$: the process is trend stationary

$H_1$: the series has a unit root (series is not stationary)


### Verdict

It is better to apply both tests so that we can be more confident that the series is truly stationary. If the two tests contradict each other (as we will see now), with KPSS indicating non-stationarity and ADF indicating stationarity, the series is difference stationary. Differencing is then used to make the series stationary, and the differenced series is checked for stationarity again.
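
The way the two tests are combined can be sketched as a small decision function. `stationarity_verdict` and the 0.05 threshold are assumptions made for this illustration, not part of the notebook; the function also covers the cases the text does not spell out, using the usual interpretation of each combination.

```python
def stationarity_verdict(adf_p_value: float, kpss_p_value: float, alpha: float = 0.05) -> str:
    """Combine ADF and KPSS p-values into a verdict (illustrative helper)."""
    # ADF null: unit root. Rejecting it (small p) points to stationarity.
    adf_stationary = adf_p_value < alpha
    # KPSS null: stationarity. Failing to reject it (large p) points to stationarity.
    kpss_stationary = kpss_p_value >= alpha

    if adf_stationary and kpss_stationary:
        return "stationary"
    if not adf_stationary and not kpss_stationary:
        return "non-stationary"
    if adf_stationary:
        # ADF says stationary, KPSS says non-stationary.
        return "difference stationary: difference the series and re-test"
    # KPSS says stationary, ADF says non-stationary.
    return "trend stationary: remove the trend and re-test"


# The contradictory case discussed above: ADF rejects its null, KPSS rejects its null.
print(stationarity_verdict(adf_p_value=0.001, kpss_p_value=0.01))
```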

* ADF test

In [None]:
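from statsmodels.tsa.stattools import adfuller

# `for_sarimax` is not defined in the cells shown here; it is assumed to be a
# DataFrame with an 'outcome' column prepared earlier.
target = for_sarimax[['outcome']]

ad_fuller_result = adfuller(target)

print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')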

In [None]:
full_train_df = pd.concat([train_df, val_df])

In [10]:
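# Convert the validation index to an hourly PeriodIndex.
val_df.index = pd.DatetimeIndex(val_df.index).to_period('H')

# Combine the training and validation sets into one training frame.
full_train_df = pd.concat([train_df, val_df])

# Drop the day_sin and day_cos columns.
full_train_df.drop(columns=['day_sin', 'day_cos'], inplace=True)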

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 24210 entries, 2006-12-16 18:00:00 to 2009-09-20 11:00:00
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Global_active_power    24210 non-null  float64
 1   Global_reactive_power  24210 non-null  float64
 2   Voltage                24210 non-null  float64
 3   Global_intensity       24210 non-null  float64
 4   Sub_metering_1         24210 non-null  float64
 5   Sub_metering_2         24210 non-null  float64
 6   Sub_metering_3         24210 non-null  float64
 7   day_sin                24210 non-null  float64
 8   day_cos                24210 non-null  float64
dtypes: float64(9)
memory usage: 1.8 MB
