# On Analyzing Real-World Time Series for Forecasting: Bitcoin Price Dataset (2017-2023)

In [1]:
import os
import sys
import torch

import torch.nn as nn
import torch.optim as optim

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
sys.path.append(os.path.join(notebook_dir, '../tslearn/'))
from ml_models import MLP
from data_loader import build_bitcoin_uts
from ts_models import RandomWalk, ARIMA_model, EvaluationMetric


## Data Pipeline

1. Load Raw Data
2. Plot Raw Data
3. Get Descriptive Statistics (e.g., mean, median) of Raw Data
4. Check Stationarity of Raw Data
5. Plot Autocorrelation and Partial Autocorrelation of Raw Data
6. Initialize & Predict Random Walk Model on Raw Data
    1. Split Raw Data
7. Difference
8. Get Descriptive Statistics (e.g., mean, median) of Differenced Data
9. Check Stationarity of Differenced Data
10. Plot Autocorrelation and Partial Autocorrelation of Differenced Data
11. Initialize & Predict Random Walk `RW` Model on Raw Data
    1. Split Differenced Data
12. Initialize Autoregressive `AR(p)` Model
13. Predict Forecasts for Returns Data
14. Plot Actual Forecasts vs Predicted Forecasts for Returns Data
15. Follow-up

### Load Raw Data

In [None]:
reversed_bitcoin_ts = build_bitcoin_uts()

In [None]:
reversed_bitcoin_df = reversed_bitcoin_ts.get_as_df()
reversed_bitcoin_df

In [None]:
bitcoin_ts = reversed_bitcoin_ts.data_augment_reverse()
bitcoin_ts

In [None]:
bitcoin_df = bitcoin_ts.get_as_df()
bitcoin_df

### Plot Raw Data

- The dataset covers August 2017 to July 2023. Prices were collected from the Binance API at **one-minute intervals**; see [About Dataset](https://www.kaggle.com/datasets/jkraak/bitcoin-price-dataset).

In [None]:
bitcoin_ts.plot(tick_skip=120)

### Get Descriptive Statistics of Raw Data

In [None]:
bitcoin_ts.get_statistics()

In [None]:
bitcoin_ts.range_skewness_kurtosis()

### Check Stationarity of Raw Data

In [None]:
# both are taking a while
# bitcoin_ts.stationarity_test(bitcoin_df)

# from statsmodels.tsa.stattools import adfuller, bds

# adfuller(bitcoin_df)

- The stationarity test takes a long time on the full series. From the plot, we can see that the time series is NOT stationary, so we can difference it.
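
As a rough illustration of what differencing does, the sketch below builds a synthetic random walk with drift (not the Bitcoin data) and compares chunk means before and after first differencing. The helpers `difference` and `chunk_means` are hypothetical and are not part of the notebook's modules. This is a crude check, not a substitute for a formal test such as ADF.

```python
import random
from statistics import mean


def difference(series):
    """First difference: d[t] = x[t] - x[t-1]; the output is one element shorter."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def chunk_means(series, n_chunks=4):
    """Mean of each of n_chunks consecutive, equally sized chunks."""
    size = len(series) // n_chunks
    return [mean(series[i * size:(i + 1) * size]) for i in range(n_chunks)]


# Synthetic random walk with drift, standing in for a price series (assumption).
random.seed(0)
walk = [100.0]
for _ in range(3999):
    walk.append(walk[-1] + 0.5 + random.gauss(0, 1))

diffed = difference(walk)

# The walk's chunk means keep rising; the differenced chunk means stay near the drift.
print(chunk_means(walk))
print(chunk_means(diffed))
```

A mean that changes from chunk to chunk is a sign of non-stationarity; after differencing, the chunk means should be roughly constant.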

In [None]:
# bitcoin_series = bitcoin_ts.get_series()
# bitcoin_ts.independence_test(bitcoin_series)

- The independence test is slow as well. We assume the observations are dependent, since a non-stationary time series is correlated over time.
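
A lightweight way to probe that assumption, short of a formal independence test, is the sample autocorrelation at lag 1. The sketch below applies a hypothetical helper, `autocorr`, to a synthetic random walk and to the independent noise that generated it. Neither series is the Bitcoin data, and a lag-1 autocorrelation only detects linear dependence.

```python
import random
from statistics import mean


def autocorr(series, lag=1):
    """Sample autocorrelation of `series` at the given lag."""
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    num = sum(
        (series[t] - mu) * (series[t - lag] - mu)
        for t in range(lag, len(series))
    )
    return num / denom


random.seed(1)
steps = [random.gauss(0, 1) for _ in range(5000)]  # independent noise

# Cumulative sum of the noise gives a random walk.
walk = []
level = 0.0
for s in steps:
    level += s
    walk.append(level)

print(autocorr(walk))   # close to 1: strong dependence on the previous value
print(autocorr(steps))  # close to 0: little linear dependence
```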

### Initialize & Predict Random Walk (RW) Model on Raw Data
- RW uses the raw data because each observation depends on the previous one. The model needs this dependency, which is present in the raw data; differencing removes it.
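
To make the idea concrete, here is a minimal sketch of a random-walk (naive) forecast, where each prediction is simply the previous observed value. The function `random_walk_forecast` is hypothetical and may differ from how the notebook's `RandomWalk.predict` works.

```python
def random_walk_forecast(train, test):
    """One-step-ahead naive forecast: predict y[t] as the observed y[t-1].

    The first test point is predicted from the last training value;
    each later point is predicted from the preceding actual test value.
    """
    return [train[-1]] + list(test[:-1])


# Toy values for illustration only.
train = [10.0, 11.0, 12.5]
test = [12.0, 13.0, 12.8]

predictions = random_walk_forecast(train, test)
print(predictions)  # [12.5, 12.0, 13.0]
```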

In [None]:
bitcoin_series = bitcoin_ts.get_series()
bitcoin_series

In [None]:
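day_forecast = 60 * 24  # minutes in one day (the data is in one-minute intervals)
year_forecast = day_forecast * 365  # minutes in one year
forecasting_step = year_forecast  # hold out the final year as the test set
N = len(bitcoin_ts.get_series())
train_length = N - forecasting_step
train_length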

In [None]:
train_uts, test_uts = bitcoin_ts.get_slice(1, train_length, both_train_test=True)
train_uts, test_uts

In [None]:
train_df = train_uts.get_as_df()
train_df

In [None]:
test_df = test_uts.get_as_df()
test_df

In [None]:
# rw_model_class = RandomWalk()

# rw_predictions = rw_model_class.predict(train_df, test_df)

In [None]:
# type(rw_predictions[0]), len(rw_predictions), rw_predictions

In [None]:
# rw_mse_gsts = EvaluationMetric.eval_mse(test_df, rw_predictions, per_element=False)
# rw_rmse_gsts = EvaluationMetric.eval_rmse(test_df, rw_predictions, per_element=False)

- Both `MSE` and `RMSE` for the raw TS are high. Why?
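
One thing worth checking when interpreting these numbers is scale: MSE is in squared price units, so errors on a series measured in the tens of thousands look large even when they are small in relative terms. The sketch below uses hypothetical helpers `mse` and `rmse` (not the notebook's `EvaluationMetric`) and made-up values to show this. It is only one possible explanation, not a diagnosis of the results above.

```python
import math


def mse(actual, predicted):
    """Mean squared error, in squared units of the series."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the series."""
    return math.sqrt(mse(actual, predicted))


# Made-up values for illustration: each prediction is off by 300, about 1%.
actual = [30000.0, 30500.0, 29800.0]
predicted = [29700.0, 30200.0, 30100.0]

print(mse(actual, predicted))   # 90000.0
print(rmse(actual, predicted))  # 300.0
```

RMSE is easier to read here because it is back in price units and can be compared directly with the price level.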

In [None]:
# EvaluationMetric.plot_forecast(train_df, test_df, rw_predictions, per_element=False)
# # EvaluationMetric.plot_forecast_only(test_df, rw_predictions, per_element=True)
# # EvaluationMetric.plot_forecast_only(test_df, rw_predictions)

# EvaluationMetric.plot_predictions(test_df, rw_predictions)

### Initialize & Predict ARIMA Model of Raw Data
- ARIMA can be fit to non-stationary data: its integrated term `d` differences the series internally.
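
For reference, the order set in the commented-out cell below, $(p, d, q) = (1, 1, 1)$, corresponds to the following standard form, written with the backshift operator $B$ (where $B y_t = y_{t-1}$). The symbols $\phi_1$, $\theta_1$, and $\varepsilon_t$ are the usual textbook notation, not names from the notebook, and the form assumes no constant term.

```latex
(1 - \phi_1 B)(1 - B)\, y_t = (1 + \theta_1 B)\, \varepsilon_t
```

Equivalently, writing $w_t = y_t - y_{t-1}$ for the differenced series:

```latex
w_t = \phi_1 w_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}
```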

In [None]:
# lag_p = 1
# integrated_d = 1
# error_q = 1
# arima_model_class = ARIMA_model()
# arima_model = arima_model_class.train_arima_model(train_df, lag_p, integrated_d, error_q)

In [None]:
# # retrain false
# arima_predictions_no_retrain = arima_model_class.predict(arima_model, train_df, test_df, False, lag_p)

# # retrain true
# arima_predictions_retrain = arima_model_class.predict(arima_model, train_df, test_df, True, lag_p)

In [None]:
# arima_predictions_no_retrain

In [None]:
# arima_predictions_retrain

In [None]:
# len(test_df), len(arima_predictions_no_retrain), len(arima_predictions_retrain)

In [None]:
# arima_mse_no_retrain = EvaluationMetric.eval_mse(test_df, arima_predictions_no_retrain, per_element=False)
# arima_rmse_no_retrain = EvaluationMetric.eval_rmse(test_df, arima_predictions_no_retrain, per_element=False)

# arima_mse_retrain = EvaluationMetric.eval_mse(test_df, arima_predictions_retrain, per_element=False)
# arima_rmse_retrain = EvaluationMetric.eval_rmse(test_df, arima_predictions_retrain, per_element=False)

In [None]:
# EvaluationMetric.plot_forecast(train_df, test_df, arima_predictions_no_retrain, False)
# EvaluationMetric.plot_forecast(train_df, test_df, arima_predictions_retrain, False)
# EvaluationMetric.plot_predictions(test_df, arima_predictions_no_retrain)
# EvaluationMetric.plot_predictions(test_df, arima_predictions_retrain)

### MLP Model

- Reminder: Our data is in **one-minute intervals**.
- Predict every hour of the next day. 

In [None]:
reversed_bitcoin_df = reversed_bitcoin_ts.get_as_df()
reversed_bitcoin_df

#### Data Manipulation
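
How `data_augment_to_mvts` builds its inputs and targets is not shown in this notebook. A common approach, sketched here under that assumption, is a sliding window: each input holds `previous_steps` consecutive observations and each target holds the next `forecast_ahead` observations. The function `make_windows` is hypothetical.

```python
def make_windows(series, previous_steps, forecast_ahead):
    """Split a 1-D series into (input window, target window) pairs."""
    X, y = [], []
    last_start = len(series) - previous_steps - forecast_ahead
    for start in range(last_start + 1):
        split = start + previous_steps
        X.append(series[start:split])                   # past observations
        y.append(series[split:split + forecast_ahead])  # values to predict
    return X, y


series = list(range(10))  # toy data: 0, 1, ..., 9
X, y = make_windows(series, previous_steps=2, forecast_ahead=3)

print(X[0], y[0])  # [0, 1] [2, 3, 4]
print(len(X))      # 6
```

If the windows are built this way, `previous_steps = 2` and `forecast_ahead = 60` on one-minute data would give inputs of 2 minutes of prices and targets covering the following 60 minutes.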

In [None]:
# day_forecast = 60 * 24
# year_forecast = day_forecast * 365
# forecasting_step = year_forecast
previous_steps = 2
forecast_ahead = 60

X_train_mvts, y_train_mvts = reversed_bitcoin_ts.data_augment_to_mvts(previous_steps, forecast_ahead)
# type(X_train_mvts)

In [None]:
X_train_df = X_train_mvts.get_as_df()
X_train_df

In [None]:
y_train_df = y_train_mvts.get_as_df()
y_train_df

In [None]:
X_test_mvts, y_test_mvts = X_train_mvts.data_augment_to_test(y_train_mvts, previous_steps, forecast_ahead)

In [None]:
X_test_df = X_test_mvts.get_as_df()
X_test_df

In [None]:
y_test_df = y_test_mvts.get_as_df()
y_test_df

In [None]:
hidden_size = 100
mlp_forecast_model = MLP(previous_steps, hidden_size, forecast_ahead)
mlp_forecast_model

In [None]:
criterion = nn.MSELoss()
optimizer = optim.Adam(mlp_forecast_model.parameters())
N_epochs = 200
configs = [criterion, optimizer, N_epochs]
train_mlp_forecast_model = mlp_forecast_model.train(X_train_df, y_train_df, configs)

In [None]:
mlp_model_predictions = mlp_forecast_model.predict(X_test_df, forecasting_step)
mlp_model_predictions

In [None]:
y_test_df

In [None]:
per_element = False
EvaluationMetric.eval_mse(y_test_df, mlp_model_predictions, per_element)

### Follow-up
- What can we determine from this?
    - Raw TS
        - `RW`
        - `ARIMA-no-retrain`
        - `ARIMA-retrain`
        - `MLP`: The loss no longer decreases rapidly after 250.