# On Analyzing Real World Time Series for Forecasting Stock Data - Tencent
- NOT splitting the data

In [None]:
import os
import sys
# Get the current working directory of the notebook
notebook_dir = os.getcwd()

# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../tslearn/'))

from collections import namedtuple
from data_loader import build_stock_uts
from ts_models import Model, RandomWalk, PersistenceWalkForward, AR, MA, ARMA, ARIMA_model, EvaluationMetric
from time_series import TimeSeriesFactory, UnivariateTimeSeries

## Stock Data Analysis

1. Load Raw TS
2. Plot Raw TS
3. Get Descriptive Statistics (e.g., mean, median, range) of Raw TS
4. Check Stationarity of Raw TS
5. Plot Autocorrelation and Partial Autocorrelation of Raw TS
6. Get Differenced TS
7. Plot Differenced TS
8. Get Descriptive Statistics of Differenced TS
9. Check Stationarity of Differenced TS
10. Plot Autocorrelation and Partial Autocorrelation of Differenced TS
11. Initialize and Predict Random Walk `RW` Model for Raw TS
12. Plot Actual Forecasts vs Predicted Forecasts for Raw TS
13. Initialize Autoregressive Integrated Moving Average `ARIMA(p, d, q)` for Raw TS
14. Predict Forecasts for Raw TS
15. Plot Actual Forecasts vs Predicted Forecasts for Raw TS
16. Follow-up

### Load Raw TS

In [None]:
# Only grab stocks whose data is available for the entire time period
start_date, end_date = "2010-01-05", "2023-10-23"
independent_variable = "Close"

Stock = namedtuple("Stock", ["symbol", "name"])
stock_specs = [
    Stock("TCEHY", "Tencent"),
    Stock("INTC", "Intel"),
]

# Map each ticker symbol to the time series built from its closing prices
stocks = {
    s.symbol: build_stock_uts(
        s.symbol, s.name, independent_variable,
        start_date=start_date, end_date=end_date, frequency='1d',
    )
    for s in stock_specs
}

In [None]:
values_cols = list(stocks.keys())
stock_mvts = TimeSeriesFactory.create_time_series(
    time_col="date",
    time_values=stocks[values_cols[0]].data.index,
    values_cols=values_cols,
    values=[stock.get_series() for stock in stocks.values()]
)

In [None]:
stock_symbol = 'TCEHY'
type(stocks[stock_symbol]), stocks[stock_symbol]

In [None]:
stock_series = stocks[stock_symbol].get_series()
stock_series

In [None]:
stock_df = stocks[stock_symbol].get_as_df()
stock_df

### Plot Raw TS

In [None]:
stocks[stock_symbol].plot(tick_skip=100)

- The TCEHY price history on [yfinance](https://finance.yahoo.com/quote/TCEHY/history?period1=1262649600&period2=1698537600&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true) is available from Jan 4, 2010. (Tencent itself has been listed in Hong Kong since 2004.)
- What happened in 2018 that caused Tencent's closing price to fall?
    - See the [Tencent](https://www.tencent.com/en-us/about.html#about-con-2) milestones. Which milestones seem to contradict this fall?
- What happened in 2020 that caused Tencent's closing price to fall?
    - Covid-19. What happened in China during the pandemic, and how did customers and end users react to the company?
    - See the [Tencent](https://www.tencent.com/en-us/about.html#about-con-2) milestones. Which milestones seem to contradict this fall?
- What is the future of the Chinese market, and of Tencent?
- How long will it take for Tencent's closing price to reach another peak?

In [None]:
stock_df.loc['2020-01-01':'2022-01-01'].plot()

### Get Descriptive Statistics of Raw TS

In [None]:
stocks[stock_symbol].get_statistics()

In [None]:
stocks[stock_symbol].max_min_range()

### Check Stationarity of Raw TS

- We expect financial price data to be non-stationary.
    - Can we verify this non-stationarity by plotting the autocorrelation?

In [None]:
stocks[stock_symbol].stationarity_test(stock_df)
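
The internals of `stationarity_test` are not shown in this notebook. As a rough illustration of what a unit-root check does, the sketch below runs a simplified Dickey-Fuller regression (constant term, no lagged differences) with NumPy on synthetic data. The helper name `dickey_fuller_stat` and the synthetic series are assumptions made for illustration, and -2.86 is the approximate large-sample 5% critical value for the constant-only case. A statistic below that value suggests the series is stationary.

```python
import numpy as np

def dickey_fuller_stat(series):
    """t-statistic of rho in the regression diff(y)[t] = c + rho * y[t-1] + e[t]."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance of beta
    return beta[1] / np.sqrt(cov[1, 1])

CRITICAL_5PCT = -2.86  # approximate critical value (assumption: constant, no trend)

rng = np.random.default_rng(0)
walk = 100 + np.cumsum(rng.normal(size=1000))  # synthetic non-stationary series

walk_stat = dickey_fuller_stat(walk)
diff_stat = dickey_fuller_stat(np.diff(walk))

print("raw series statistic:        ", round(walk_stat, 2))
print("differenced series statistic:", round(diff_stat, 2))
print("differenced looks stationary:", diff_stat < CRITICAL_5PCT)
```

A full augmented Dickey-Fuller test also includes lagged differences and reports a p-value, so this sketch is only meant to show the idea.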

### Plot Autocorrelation and Partial Autocorrelation of Raw TS

In [None]:
stocks[stock_symbol].plot_autocorrelation(50)

- Above, the series is highly autocorrelated, which means that the k-th lag observation has some impact on the most recent observation.
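
To make the link between slow autocorrelation decay and non-stationarity concrete, the sketch below computes the sample autocorrelation by hand for two synthetic series: a random walk and white noise. The helper `sample_acf` is a hypothetical stand-in and is not the implementation behind `plot_autocorrelation`.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = x @ x
    acf = [1.0]
    for k in range(1, max_lag + 1):
        acf.append((x[k:] @ x[:-k]) / denom)
    return np.array(acf)

rng = np.random.default_rng(0)
noise = rng.normal(size=2000)   # stationary white noise
walk = np.cumsum(noise)         # non-stationary random walk

walk_acf = sample_acf(walk, max_lag=50)
noise_acf = sample_acf(noise, max_lag=50)

print("random walk, lags 1, 10, 50:", np.round(walk_acf[[1, 10, 50]], 3))
print("white noise, lags 1, 10, 50:", np.round(noise_acf[[1, 10, 50]], 3))
```

The random walk keeps a high autocorrelation over many lags, while the white-noise values stay near zero.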

In [None]:
stocks[stock_symbol].plot_partial_autocorrelation(35)

- Above, the partial autocorrelation decays rapidly at lag 2.

### Get Differenced TS
- Differencing is applied to remove the trend.

In [None]:
stock_diff = stocks[stock_symbol].data_augment_with_differencing(1)
stock_diff
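
The implementation of `data_augment_with_differencing` is not shown here. First-order differencing itself is simply y[t] - y[t-1]. A minimal sketch on a synthetic trending series (the data and variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
slope = 0.5
t = np.arange(500)
trending = slope * t + rng.normal(size=t.size)   # linear trend plus noise

# first_diff[i] = trending[i + 1] - trending[i]
first_diff = np.diff(trending, n=1)

print(len(trending), len(first_diff))            # differencing drops one observation
print("mean of differences:", round(first_diff.mean(), 3))
```

The differenced series loses one observation, and its mean sits close to the slope of the removed trend.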

In [None]:
stock_stock_diff_df = stock_diff.get_as_df()
stock_stock_diff_df

### Plot Differenced TS

In [None]:
stock_diff.plot(tick_skip=150)

- The differenced TS seems to have a constant mean and a roughly constant variance, although the variance funnels in and out a bit.

### Get Descriptive Statistics of Differenced TS

In [None]:
stock_diff.get_statistics()

In [None]:
stock_diff.max_min_range()

### Check Stationarity of Differenced TS

In [None]:
stock_diff.stationarity_test(stock_stock_diff_df)

### Plot Autocorrelation and Partial Autocorrelation of Differenced TS

In [None]:
stock_diff.plot_autocorrelation(50)

- The autocorrelation (ACor) decays exponentially at lag 1. We can test an MA(q) with q = 4, 12, 19, 31, as these are the lags where the values are close to or outside the significance line.

In [None]:
stock_diff.plot_partial_autocorrelation(50)

- The partial autocorrelation (PACor) decays exponentially at lag 1. We can test an AR(p) with p = 4, 19, 31, 39, 47, as these are the lags where the values are close to or outside the significance line.
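
The significance line on plots like these is commonly drawn at about ±1.96/√N, a 95% band under a white-noise assumption. Whether `plot_autocorrelation` uses exactly this band is an assumption. The sketch below lists the lags whose sample autocorrelation falls outside that band for a synthetic MA(1) series; `significant_lags` is a hypothetical helper.

```python
import numpy as np

def significant_lags(series, max_lag, z=1.96):
    """Return lags whose sample autocorrelation exceeds +/- z / sqrt(N)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = x @ x
    band = z / np.sqrt(len(x))
    lags = []
    for k in range(1, max_lag + 1):
        r_k = (x[k:] @ x[:-k]) / denom
        if abs(r_k) > band:
            lags.append(k)
    return lags, band

rng = np.random.default_rng(2)
e = rng.normal(size=2001)
ma1 = e[1:] + 0.8 * e[:-1]   # synthetic MA(1) process with theta = 0.8

lags, band = significant_lags(ma1, max_lag=20)
print("band: +/-", round(band, 4))
print("lags outside the band:", lags)
```

With a 95% band, roughly one lag in twenty can cross the line by chance alone, so isolated crossings at high lags are weaker evidence than a crossing at lag 1.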

### Initialize and Predict RW Model for Raw TS
- Needs fixing: the model expects train/test split data, so the full series is passed as both arguments for now.

In [None]:
rw_model_class = RandomWalk()

rw_predictions = rw_model_class.predict(stocks[stock_symbol].get_series(), stocks[stock_symbol].get_series())

In [None]:
rw_mse_gsts = EvaluationMetric.eval_mse(stocks[stock_symbol].get_series(), rw_predictions, per_element=False)
rw_mse_gsts

In [None]:
rw_rmse_gsts = EvaluationMetric.eval_rmse(stocks[stock_symbol].get_series(), rw_predictions, per_element=False)
rw_rmse_gsts
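
For reference, a random-walk (persistence) forecast predicts each value as the previous observation. The sketch below computes that forecast and its MSE and RMSE with NumPy on a tiny made-up series. It does not use `RandomWalk` or `EvaluationMetric`, whose internals are not shown in this notebook.

```python
import numpy as np

observed = np.array([10.0, 12.0, 11.0, 13.0])   # made-up values

predictions = observed[:-1]   # forecast for time t is the value at t-1
actuals = observed[1:]        # values being forecast

errors = actuals - predictions    # [2, -1, 2]
mse = np.mean(errors ** 2)        # (4 + 1 + 4) / 3 = 3.0
rmse = np.sqrt(mse)

print("MSE: ", mse)
print("RMSE:", round(rmse, 4))
```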

### Plot Actual Forecasts vs Predicted Forecasts for Raw TS

In [None]:
lags_to_test = []
EvaluationMetric.plot_forecast(stocks[stock_symbol].get_series(), rw_predictions, lags_to_test, with_lags=False)

### Initialize ARIMA(p, d, q) Model for Raw TS

- How to choose d? Compare differencing orders 1, 2, etc. by looking at the ACor plot. Is there a large change between difference orders? The series should NOT be overdifferenced.
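
One common rule of thumb (an assumption here, not something taken from this notebook) is to prefer the lowest differencing order that minimizes the standard deviation of the differenced series, and to treat a lag-1 autocorrelation near -0.5 as a sign of overdifferencing. A sketch on a synthetic random walk:

```python
import numpy as np

def lag1_autocorr(series):
    """Sample autocorrelation at lag 1."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return (x[1:] @ x[:-1]) / (x @ x)

rng = np.random.default_rng(3)
walk = np.cumsum(rng.normal(size=1000))   # synthetic series that needs d = 1

summary = {}
for d in range(3):
    diffed = np.diff(walk, n=d)           # n=0 returns the series unchanged
    summary[d] = (np.std(diffed), lag1_autocorr(diffed))
    print(f"d={d}: std={summary[d][0]:.3f}, lag-1 autocorrelation={summary[d][1]:.3f}")

# Differencing order with the smallest standard deviation
best_d = min(summary, key=lambda d: summary[d][0])
print("suggested d:", best_d)
```

For this synthetic series the standard deviation is lowest at d = 1, and at d = 2 the lag-1 autocorrelation turns strongly negative, which is the overdifferencing pattern.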

In [None]:
true_labels = stocks[stock_symbol].get_series()
true_labels

In [None]:
end = len(stock_df)

subset_of_true_labels = true_labels[:end]
len(subset_of_true_labels), subset_of_true_labels

- Maximum likelihood optimization failed to converge with the (p, d, q) orders below. What does this mean? Should we be concerned that it did NOT converge?
    1. (4, 1, 4)
    2. (4, 1, 19)
    3. (4, 1, 31)

In [None]:
# create an object from the ARIMA_model() class
arima_model_class = ARIMA_model()

# call the function to train our ARIMA model
trained_arima_models = arima_model_class.train_arima_model(subset_of_true_labels, 4, 1, 4)

### Predict Forecasts for Raw TS

In [None]:
arima_predictions = arima_model_class.predict(trained_arima_models, 1, end)
arima_predictions

In [None]:
mse_gsts = EvaluationMetric.eval_mse(subset_of_true_labels, arima_predictions)
mse_gsts

In [None]:
rmse_gsts = EvaluationMetric.eval_rmse(subset_of_true_labels, arima_predictions)
rmse_gsts

### Plot Actual Forecasts vs Predicted Forecasts for Raw TS

In [None]:
EvaluationMetric.plot_forecast(subset_of_true_labels, arima_predictions, [1], with_lags=True)

### Follow-up
- What can we determine from this?
    - We are overfitting. I think this is because we are telling our ARIMA model to train and predict on the same data.
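
A chronological train/test split is one way to avoid scoring a model on the data it was fitted to. The sketch below splits a synthetic series without shuffling and scores a walk-forward persistence forecast on the held-out part only. The helper `chronological_split`, the 80/20 ratio, and the data are assumptions for illustration.

```python
import numpy as np

def chronological_split(series, train_fraction=0.8):
    """Split a series into an earlier training part and a later test part."""
    series = np.asarray(series, dtype=float)
    split_at = int(len(series) * train_fraction)
    return series[:split_at], series[split_at:]

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(size=500))   # synthetic price-like series

train, test = chronological_split(prices)

# Walk-forward persistence: each test point is forecast by the value observed
# just before it (the last training value for the first test point).
forecasts = np.concatenate([train[-1:], test[:-1]])
test_rmse = np.sqrt(np.mean((test - forecasts) ** 2))

print(len(train), len(test))
print("out-of-sample RMSE:", round(test_rmse, 4))
```

Time order is preserved, so no observation from the test period is available when the earlier period is used for fitting.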
    
- What to consider?
    - [ ] Splitting the data into a training set and a testing set.
    - [ ] Log Likelihood
    - [ ] AIC
    - [ ] BIC
    - [ ] HQIC
    - [ ] Ljung-Box (L1) (Q)
    - [ ] Jarque-Bera (JB)
    - [ ] Prob(Q)
    - [ ] Prob(JB)
    - [ ] Heteroskedasticity (H)
    - [ ] Skew
    - [ ] Prob(H) (two-sided)
    - [ ] Kurtosis
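
Several items on this list are related: AIC, BIC, and HQIC all penalize the log likelihood by the number of fitted parameters, and lower values are preferred. The sketch below uses the standard definitions; the function name and the input numbers are made up for illustration and are not results from this notebook.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Standard AIC, BIC, and HQIC from a fitted model's log likelihood."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    hqic = 2 * n_params * math.log(math.log(n_obs)) - 2 * log_likelihood
    return {"AIC": aic, "BIC": bic, "HQIC": hqic}

# Hypothetical fits: the larger model gains little likelihood for 7 extra parameters
small = information_criteria(log_likelihood=-100.0, n_params=3, n_obs=50)
large = information_criteria(log_likelihood=-99.0, n_params=10, n_obs=50)

for name in ("AIC", "BIC", "HQIC"):
    print(f"{name}: small={small[name]:.2f}, large={large[name]:.2f}")
```

In this made-up comparison all three criteria favor the smaller model, because the small gain in log likelihood does not offset the penalty for the extra parameters.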