# Compare - Shanghai Composite Index

1. PAPER: [Financial Time Series Forecasting with the Deep Learning Ensemble Model](https://www.mdpi.com/2227-7390/11/4/1054) by He K., et al. 2023
2. NOTE (on stock data):
    1. **Raw Data** (prices) is usually non-stationary.
    2. **Returns Data** is usually stationary; it is also referred to as the differenced data.

In [1]:
import os
import sys
# Get the current working directory of the notebook
notebook_dir = os.getcwd()

# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../tslearn/'))

from collections import namedtuple
from data_loader import build_stock_uts, build_downloaded_stock_uts
from time_series import TimeSeriesFactory
from data_plotter import InterpolatePlotter
from ts_models import AR, ARMA, ARIMA_model, EvaluationMetric

## Stock Data: Analysis

1. Load Raw Data
2. Plot Raw Data
3. Get Descriptive Statistics (e.g., mean, median, range) of Raw Data
4. Check Stationarity of Raw Data
5. Plot Autocorrelation and Partial Autocorrelation of Raw Data
6. Get Returns Data
7. Plot Returns Data
8. Get Descriptive Statistics (e.g., mean, median, range) of Returns Data
9. Check Stationarity of Returns Data
10. Plot Autocorrelation and Partial Autocorrelation of Returns Data

## Stock Data: Models

11. Split Returns Data
12. Initialize Models: `AR(p)` and `ARMA(p, q)`
13. Split Raw Data
14. Initialize Model: `ARIMA(p, d, q)`

## Stock Data: Evaluation Metrics + Plots

15. Evaluation Metrics: `MSE`, `RMSE`
16. Plot Actual Values vs Model Predictions

## Follow-up

- Notes on above

## Stock Data: Analysis

### Load Raw Data

In [None]:
# Only grab stocks whose data is available for the entire time period
start_date, end_date = "2010-01-04", "2020-02-07"
Stock = namedtuple("Stock", ["symbol", "name"])
stocks = [
    ("000001.SS", "Shanghai Composite Index")
]
independent_variable = "Close"
frequency = "1d"  # daily observations
stocks = [Stock(*s) for s in stocks]
# Build one univariate time series per stock, keyed by symbol
stocks = {
    s.symbol: build_stock_uts(
        s.symbol, s.name, independent_variable,
        start_date=start_date, end_date=end_date, frequency=frequency,
    )
    for s in stocks
}

In [None]:
values_cols = list(stocks.keys())
stock_mvts = TimeSeriesFactory.create_time_series(
    time_col="date",
    time_values=stocks[values_cols[0]].data.index,
    values_cols=values_cols,
    values=[stock.get_series() for stock in stocks.values()]
)

In [None]:
stock_symbol = '000001.SS'
stock_of_interest = stocks[stock_symbol]
type(stock_of_interest), stock_of_interest

In [None]:
stock_df = stock_of_interest.get_as_df()
stock_df

### Plot Raw Data

In [None]:
stock_of_interest.plot(tick_skip=75)

### Get Descriptive Statistics of Raw Data

In [None]:
stock_of_interest.get_statistics()

In [None]:
stock_of_interest.range_skewness_kurtosis()

### Check Stationarity of Raw Data

- With financial data, we expect the raw series to be non-stationary (i.e., the mean, the variance, or both change between two distant points).
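
As an informal illustration of the point above, here is a minimal sketch that compares the mean and variance of the first and second half of a series. The data is a simulated random walk with drift standing in for closing prices, and `halves_summary` is a hypothetical helper; it is not the notebook's `stationarity_test`, whose implementation is not shown here.

```python
import random
import statistics


def halves_summary(series):
    """Return ((mean, variance), (mean, variance)) for the two halves of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (
        (statistics.fmean(first), statistics.pvariance(first)),
        (statistics.fmean(second), statistics.pvariance(second)),
    )


# Hypothetical data: a random walk with drift, standing in for closing prices.
rng = random.Random(0)
prices = [100.0]
for _ in range(999):
    prices.append(prices[-1] + 0.5 + rng.gauss(0, 1))

(first_mean, first_var), (second_mean, second_var) = halves_summary(prices)
print(f"first half:  mean={first_mean:.2f}, var={first_var:.2f}")
print(f"second half: mean={second_mean:.2f}, var={second_var:.2f}")
```

A large gap between the two halves hints at non-stationarity, but it is only a sanity check; a formal unit-root test is still needed to draw a conclusion.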

In [None]:
stock_of_interest.stationarity_test(stock_df)

### Plot Autocorrelation and Partial Autocorrelation of Raw Data

- Not required for `AR` or `ARMA` models, as both models assume stationarity and this TS is non-stationary.

In [None]:
stock_of_interest.plot_autocorrelation(50)

- What is the above telling us?
    - Both plots show the same autocorrelations, just rendered differently.
    - Both plots are consistent with a non-stationary TS: the current value depends strongly on the previous values. We don't want this with traditional TS models like `AR` and `ARMA`.
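
To make the reading of an autocorrelation plot concrete, here is a rough sketch of the sample autocorrelation function on simulated data. `sample_acf` is a hypothetical helper written for illustration; the notebook's `plot_autocorrelation` may compute or render things differently.

```python
import random


def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (lag 0 is always 1)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Hypothetical data: white noise and the random walk built from it.
rng = random.Random(1)
noise = [rng.gauss(0, 1) for _ in range(2000)]
walk, level = [], 0.0
for e in noise:
    level += e
    walk.append(level)

acf_walk = sample_acf(walk, 20)
acf_noise = sample_acf(noise, 20)
print("random walk, lags 1, 10, 20:", [round(acf_walk[k], 3) for k in (1, 10, 20)])
print("white noise, lags 1, 10, 20:", [round(acf_noise[k], 3) for k in (1, 10, 20)])
```

The random walk keeps a high autocorrelation across many lags, which is the pattern expected from raw prices, while the white noise sits near zero at every non-zero lag.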

In [None]:
stock_of_interest.plot_partial_autocorrelation(35)

- What is the above telling us?
    - TS is non-stationary. Although the data isn't as dependent across lags as in the autocorrelation plot, the value at lag 1 still depends on the value at lag 0.

### Get Returns Data

- This should provide us with stationary data that we can pass to both `AR` and `ARMA` models.
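
For reference, a small sketch of two common return definitions. Which one `data_augment_for_returns` uses is not shown in this notebook, so both functions below are assumptions for illustration, and the prices are made up.

```python
import math


def simple_returns(prices):
    """r_t = (P_t - P_{t-1}) / P_{t-1}"""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def log_returns(prices):
    """r_t = ln(P_t / P_{t-1})"""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


closes = [100.0, 102.0, 101.0, 103.0]  # hypothetical closing prices
simple = simple_returns(closes)
logr = log_returns(closes)
print("simple returns:", [round(r, 4) for r in simple])
print("log returns:   ", [round(r, 4) for r in logr])
```

Either way the returns series is one element shorter than the price series, and log returns add up to the log of the total price ratio over the period.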

In [None]:
len(stock_of_interest.get_series()), stock_of_interest.get_series()

In [None]:
stock_returns = stock_of_interest.data_augment_for_returns()
stock_returns

### Plot Returns Data

In [None]:
stock_returns.plot(tick_skip=150)

- Returns appear to have a roughly constant mean and variance, although the spread is noticeably wider between ~2014-10-07 and 2016-05-29 and between ~2017-08-22 and 2019-04-14.

In [None]:
stock_returns_df = stock_returns.get_as_df()
stock_returns_df

### Get Descriptive Statistics of Returns Data

In [None]:
stock_returns.get_statistics()

In [None]:
stock_returns.range_skewness_kurtosis()

### Check Stationarity of Returns Data

- Data should now be stationary; the stationarity test below checks this. The independence test is only conducted on returns and has a null hypothesis that the data is independent (i.e., not dependent).
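
One common test with a null hypothesis of independence is the Ljung-Box test. The sketch below assumes that kind of test; the notebook's `independence_test` is not shown and may use something else. `ljung_box_q` is a hypothetical helper, and the data is simulated.

```python
import random


def ljung_box_q(series, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total


# 95th percentile of the chi-square distribution with 10 degrees of freedom.
CHI2_95_DF10 = 18.307

# Hypothetical data: independent noise and a clearly dependent AR(1) series.
rng = random.Random(2)
white_noise = [rng.gauss(0, 1) for _ in range(1000)]
ar1 = [0.0]
for _ in range(999):
    ar1.append(0.8 * ar1[-1] + rng.gauss(0, 1))

for name, data in [("white noise", white_noise), ("AR(1), phi=0.8", ar1)]:
    q = ljung_box_q(data, 10)
    verdict = "reject independence" if q > CHI2_95_DF10 else "fail to reject independence"
    print(f"{name}: Q={q:.1f} ({verdict})")
```

A large Q relative to the chi-square critical value means the autocorrelations up to the chosen lag are jointly too large to be explained by independent data.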

In [None]:
stock_returns.stationarity_test(stock_returns_df)

In [None]:
stock_returns.independence_test(stock_returns_df)

### Plot Autocorrelation and Partial Autocorrelation of Returns Data

In [None]:
stock_returns.plot_autocorrelation(50)

- What is the above telling us?
    - Both plots show the same autocorrelations, just rendered differently.
    - Both plots are consistent with a stationary TS: the current value shows little dependence on the previous values. This is what we want for `MA(q)`.
    - The autocorrelation drops off sharply after lag 1, thus use lag 1 for `MA(q)`, as they have in the PAPER.


In [None]:
stock_returns.plot_partial_autocorrelation(50)

- What is the above telling us?
    - TS is stationary. The value at lag 1 is not dependent on the value at lag 0. This is what we want for `AR(p)`.
    - The partial autocorrelation drops off sharply after lag 1, thus use lag 1 for `AR(p)`, as they have in the PAPER.
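
To show what a lag of 1 means for `AR(p)`, here is a minimal sketch that fits `x_t = c + phi * x_{t-1} + e_t` by ordinary least squares and produces a 5-step forecast. `fit_ar1` and `forecast_ar1` are hypothetical helpers on simulated data; the notebook's `AR` class may estimate the model differently.

```python
import random


def fit_ar1(series):
    """Least-squares estimate of (c, phi) in x_t = c + phi * x_{t-1} + e_t."""
    x_prev, x_curr = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_curr = sum(x_curr) / n
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(x_prev, x_curr))
    var = sum((p - mean_prev) ** 2 for p in x_prev)
    phi = cov / var
    return mean_curr - phi * mean_prev, phi


def forecast_ar1(c, phi, last_value, steps):
    """Iterate the fitted recursion forward, feeding each forecast back in."""
    forecasts, value = [], last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return forecasts


# Hypothetical data: an AR(1) process with c = 0 and phi = 0.6.
rng = random.Random(3)
series = [0.0]
for _ in range(4999):
    series.append(0.6 * series[-1] + rng.gauss(0, 1))

c_hat, phi_hat = fit_ar1(series)
forecasts = forecast_ar1(c_hat, phi_hat, series[-1], steps=5)
print(f"c_hat={c_hat:.3f}, phi_hat={phi_hat:.3f}")
print("5-step forecast:", [round(f, 3) for f in forecasts])
```

Because each forecast is fed back into the recursion, multi-step forecasts shrink toward the series mean as the horizon grows.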

## Stock Data: Models

### Split Differenced Data for AR(p) and ARMA(p, q) Models

- Make 5-day forecasts

In [None]:
interpolation_step = 5
N = len(stock_returns.get_series())
diff_train_length = N - interpolation_step

In [None]:
diff_train_uts, diff_test_uts = stock_returns.get_slice(1, diff_train_length, both_train_test=True)
diff_train_uts, diff_test_uts

In [None]:
diff_train_df = diff_train_uts.get_as_df()
diff_train_df

- The forecast horizon is 5 days, which is why the test data contains only 5 values.
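
The hold-out logic can be sketched with plain list slicing. `split_last_n` is a hypothetical helper and the values are made up; the notebook's `get_slice` may index differently.

```python
def split_last_n(values, horizon):
    """Keep the last `horizon` observations as the test set, the rest as train."""
    if not 0 < horizon < len(values):
        raise ValueError("horizon must be between 1 and len(values) - 1")
    return values[:-horizon], values[-horizon:]


# Hypothetical returns series of 12 observations.
returns = [0.01, -0.02, 0.003, 0.015, -0.007, 0.002,
           0.011, -0.004, 0.006, -0.013, 0.009, 0.001]
train, test = split_last_n(returns, horizon=5)
print(len(train), len(test))
```

The split is chronological rather than random, so the model never sees observations that come after the ones it is asked to forecast.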

In [None]:
diff_test_df = diff_test_uts.get_as_df()
diff_test_df

### Initialize Models: AR(p) and ARMA(p, q)

In [None]:
lag_p = 1    # AR order p (number of lagged values)
error_q = 1  # MA order q (number of lagged error terms)

In [None]:
ar_model_class = AR()
trained_ar_model = ar_model_class.train_ar_model(diff_train_df, lag_p)
trained_ar_model

In [None]:
trained_ar_model.summary()

In [None]:
arma_model_class = ARMA()
trained_arma_model = arma_model_class.train_arma_model(diff_train_df, lag_p, error_q)
trained_arma_model

In [None]:
trained_arma_model.summary()

In [None]:
print('Coefficients: %s' % trained_arma_model.params)

NOTE: Should the `Dep. Variable` in the summary be `t`, since `t` depends on `t - 1`?

In [None]:
# AR, retrain=False
ar_predictions_no_retrain = ar_model_class.predict(trained_ar_model, diff_train_df, diff_test_df, False, lag_p)

# AR, retrain=True
ar_predictions_retrain = ar_model_class.predict(trained_ar_model, diff_train_df, diff_test_df, True, lag_p)

# ARMA, retrain=False
arma_predictions_no_retrain = arma_model_class.predict(trained_arma_model, diff_train_df, diff_test_df, False, lag_p)

# ARMA, retrain=True
arma_predictions_retrain = arma_model_class.predict(trained_arma_model, diff_train_df, diff_test_df, True, lag_p)

In [None]:
len(diff_test_df), len(ar_predictions_no_retrain), len(ar_predictions_retrain), len(arma_predictions_no_retrain), len(arma_predictions_retrain)

### Split Raw Data for ARIMA(p, d, q) Model

- Make 5-day forecasts

In [None]:
interpolation_step = 5
N = len(stock_of_interest.get_series())
train_length = N - interpolation_step
train_length

In [None]:
train_uts, test_uts = stock_of_interest.get_slice(1, train_length, both_train_test=True)
train_uts, test_uts

In [None]:
train_df = train_uts.get_as_df()
train_df

In [None]:
test_df = test_uts.get_as_df()
test_df

### Initialize Models: ARIMA(p, d, q)

In [None]:
diff_d = 1

arima_model_class = ARIMA_model()
arima_model_class

In [None]:
trained_arima_model = arima_model_class.train_arima_model(train_df, lag_p, diff_d, error_q)
trained_arima_model

In [None]:
trained_arima_model.summary()

In [None]:
print('Coefficients: %s' % trained_arima_model.params)

In [None]:
# ARIMA, retrain=False
arima_predictions_no_retrain = arima_model_class.predict(trained_arima_model, train_df, test_df, False, lag_p)

# ARIMA, retrain=True
arima_predictions_retrain = arima_model_class.predict(trained_arima_model, train_df, test_df, True, lag_p)

# Sanity check: each prediction list should match the test set length
len(test_df), len(arima_predictions_no_retrain), len(arima_predictions_retrain)

## Stock Data: Evaluation Metrics + Plots

### Evaluation Metrics: MSE, RMSE

In [None]:
# AR
EvaluationMetric.eval_mse(diff_test_df, ar_predictions_no_retrain, per_element=False)
EvaluationMetric.eval_rmse(diff_test_df, ar_predictions_no_retrain, per_element=False)
EvaluationMetric.eval_mse(diff_test_df, ar_predictions_retrain, per_element=False)
EvaluationMetric.eval_rmse(diff_test_df, ar_predictions_retrain, per_element=False)

# ARMA
EvaluationMetric.eval_mse(diff_test_df, arma_predictions_no_retrain, per_element=False)
EvaluationMetric.eval_rmse(diff_test_df, arma_predictions_no_retrain, per_element=False)
EvaluationMetric.eval_mse(diff_test_df, arma_predictions_retrain, per_element=False)
EvaluationMetric.eval_rmse(diff_test_df, arma_predictions_retrain, per_element=False)

# ARIMA (trained on the raw data, so score against the raw test set)
EvaluationMetric.eval_mse(test_df, arima_predictions_no_retrain, per_element=False)
EvaluationMetric.eval_rmse(test_df, arima_predictions_no_retrain, per_element=False)
EvaluationMetric.eval_mse(test_df, arima_predictions_retrain, per_element=False)
EvaluationMetric.eval_rmse(test_df, arima_predictions_retrain, per_element=False)
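
For reference, a minimal sketch of the two metrics on made-up numbers. `mse` and `rmse` are hypothetical helpers; they are not the notebook's `EvaluationMetric` methods, which may handle inputs differently.

```python
import math


def mse(actual, predicted):
    """Mean squared error: average of (actual - predicted)^2."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the series."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical 5-step test set and forecast with a single error of 2.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 2.0, 3.0, 4.0, 7.0]
print("MSE: ", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

Here one error of 2 over five points gives an MSE of 4/5 = 0.8. Both metrics are in the units of the series being scored, so a score computed on returns and a score computed on raw prices are not directly comparable.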

### Plot Actual Values vs Model Predictions

- Need to finish plots

In [None]:
prediction_plots = InterpolatePlotter(diff_test_df, ar_predictions_no_retrain)
prediction_plots.plot_in_sample_predictions()

## Follow-up