In [None]:
# Import the helper functions from crypto_module

from crypto_module import *

# **Third Model : ARIMA model**

### **I-Introduction**

**ARIMA** stands for **Auto-Regressive Integrated Moving Average** and combines two models:

- **Auto-Regressive model (AR)**, parameterized by a value p, which predicts the next value of a time series through a regression on its last p values.

- **Moving Average model (MA)**, parameterized by a value q, which uses past forecast errors rather than past values in a regression-like model. It aims to improve the forecasts by accounting for how far off the previous predictions were from the actual values. The forecasts are based on the errors of the last q periods before the current one.


The **ARIMA** model has a last parameter, the order of differencing d, which is the number of times we must difference the time series to make it stationary. **Stationarity is a must-have property** of our time series: it is what allows us to apply the two components (AR, MA) of the ARIMA model.


To sum up, we aim to predict the closing price by using an **ARIMA(p,d,q)** model. From now on, we focus on determining the parameters **p, d, q**.
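
For reference, one common way to write the model is given below. The notation is an assumption introduced here, not something taken from `crypto_module`: $B$ is the backshift operator ($B y_t = y_{t-1}$), $\phi_i$ and $\theta_j$ are the AR and MA coefficients, $\varepsilon_t$ is the forecast error at time $t$, and $c$ is a constant.

$$
y'_t = (1 - B)^d \, y_t, \qquad
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
$$

The series is first differenced d times, and the AR and MA terms are then applied to the differenced series $y'_t$.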

### **II-Loading and visualizing the data**

We use the API to load the data. We focus on the daily Bitcoin prices in USD between 2021 and 2022.

In [None]:
symbol = 'BTC/USD'
interval = '1day'
start_date = '2021-01-01 00:00:00'
end_date = '2022-12-31 00:00:00'

In [None]:
data = load_data(symbol, start_date, end_date, interval)

data.head(n = 10)

visualize_data(data, symbol, interval)

The previous plot shows high volatility in the price. Let's plot the price along with its moving statistics. A stationary time series is expected to have statistics that stay constant over time.

In [None]:
data = add_arima_indicators(data, "close", period = 14)

arima_viz_with_indicator(data, symbol, interval)

As we can see, this time series is not stationary.
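
To make the idea of moving statistics concrete, here is a minimal, dependency-free sketch. It assumes the indicators are a rolling mean and a rolling standard deviation; what `add_arima_indicators` actually computes is not shown in this notebook, and `rolling_stats` is a hypothetical helper written only for illustration.

```python
from statistics import mean, pstdev


def rolling_stats(values, window):
    """Return (rolling_means, rolling_stds) over a sliding window.

    Each list has len(values) - window + 1 entries, one per full window.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    means, stds = [], []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        means.append(mean(chunk))
        stds.append(pstdev(chunk))
    return means, stds


# Toy prices with an upward drift (illustrative values, not market data).
prices = [10, 11, 12, 14, 17, 21]
means, stds = rolling_stats(prices, window=3)
print(means)
print(stds)
```

On a stationary series both lists would hover around fixed levels; here the rolling mean keeps climbing, which is the kind of drift the plot above is meant to reveal.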

Let's apply a logarithmic transformation to the data in order to reduce the magnitude of the variations.

In [None]:
log_data = add_arima_indicators(data, "close", period = 14, ln = True) 

arima_viz_with_indicator(log_data, symbol, interval)

The amplitude of the variations is much reduced, as we can see on the axis.
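
One way to see why the logarithm helps: it turns relative (percentage) moves into additive ones, so a +10% move has the same size whatever the price level. The sketch below uses made-up prices, and `log_returns` is a hypothetical helper that is not part of `crypto_module`.

```python
import math


def log_returns(prices):
    """Differences of log-prices, i.e. ln(p_t / p_{t-1})."""
    logs = [math.log(p) for p in prices]
    return [b - a for a, b in zip(logs, logs[1:])]


# A +10% move at two very different price levels (illustrative values).
low = [100.0, 110.0]
high = [10_000.0, 11_000.0]

print(low[1] - low[0], high[1] - high[0])    # raw changes differ by a factor of 100
print(log_returns(low), log_returns(high))   # log changes are both close to ln(1.1)
```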

### **III-Stationarity test**

We observed a downward trend and a non-constant variance in the previous plots. Our series is far from stationary.

Two tests can help us confirm whether the series is stationary or not:

- The **Augmented Dickey-Fuller test (ADF test)**: H0 = the series is non-stationary. The more negative the ADF statistic, the stronger the rejection of the null hypothesis that the time series is non-stationary.

- The **Kwiatkowski–Phillips–Schmidt–Shin test (KPSS test)**: H0 = the series is stationary. The higher the test statistic, the stronger the rejection of the null hypothesis that the series is stationary.
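
Because the two tests have opposite null hypotheses, their p-values must be read together. The sketch below spells out one possible reading at a 5% level. `interpret` is a hypothetical helper and the labels are a simplification, not the output of `adf_test` or `kpss_test`.

```python
def interpret(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF and KPSS p-values into a single verdict."""
    adf_rejects = adf_pvalue < alpha    # evidence against non-stationarity
    kpss_rejects = kpss_pvalue < alpha  # evidence against stationarity
    if adf_rejects and not kpss_rejects:
        return "stationary"
    if kpss_rejects and not adf_rejects:
        return "non-stationary"
    # The two tests disagree with each other: no clear conclusion.
    return "inconclusive"


print(interpret(adf_pvalue=0.60, kpss_pvalue=0.01))
print(interpret(adf_pvalue=0.01, kpss_pvalue=0.10))
```

Only the first two cases give a clear answer; when the tests disagree, further inspection (differencing or detrending, for instance) is usually needed.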


Let's run both tests on the original series, then on its logarithm.

In [None]:
adf_test(data, "close")
kpss_test(data, 'close')

In [None]:
adf_test(log_data, "close")
kpss_test(log_data, "close")

The ADF test fails to reject its null hypothesis, while the KPSS test rejects its null hypothesis with a very small p-value, in both cases. This firmly confirms the non-stationarity of the time series and of its log.

Since the ADF test gave better results in the log case, we keep the log series for the rest of the analysis.

### **IV-Differencing order**

To calibrate our ARIMA model and achieve stationarity, we need to difference our data at least once.

To do so, we difference the data d times and run the ADF and KPSS tests until we obtain p-values respectively lower and higher than 5%, so that we can assume stationarity.
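
The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of `find_diff_order`: `difference`, `adf_kpss_check` and `find_order` are hypothetical names, and the statsmodels-based check is imported lazily so that the loop itself can be tried without statsmodels installed.

```python
def difference(values):
    """First difference: x_t - x_{t-1}."""
    return [b - a for a, b in zip(values, values[1:])]


def adf_kpss_check(values, alpha=0.05):
    """True if ADF rejects (p < alpha) and KPSS does not (p > alpha)."""
    # Imported here so the rest of the sketch runs without statsmodels.
    from statsmodels.tsa.stattools import adfuller, kpss
    adf_p = adfuller(values)[1]
    kpss_p = kpss(values, regression="c", nlags="auto")[1]
    return adf_p < alpha and kpss_p > alpha


def find_order(values, is_stationary=adf_kpss_check, max_d=2):
    """Difference until `is_stationary` accepts; return (d, differenced series)."""
    current = list(values)
    for d in range(max_d + 1):
        if is_stationary(current):
            return d, current
        current = difference(current)
    raise ValueError(f"series still not stationary after {max_d} differences")


# Toy stand-in for the statistical tests: accept only a constant series.
def is_constant(values):
    return max(values) - min(values) < 1e-9


d_toy, diffed_toy = find_order([0, 1, 2, 3, 4, 5], is_stationary=is_constant)
print(d_toy, diffed_toy)
```

With the toy checker, a linear trend needs exactly one difference. On real data the default `adf_kpss_check` would be used instead.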

In [None]:
d, diff_data = find_diff_order(log_data, "close", to_print = False)


As the result shows, a **first-order** differencing passes both the ADF and KPSS tests.

We got our first parameter, **d = 1**, with the function we implemented.

Let's look at how the data behaves after first-order differencing.

In [None]:
visualize_data(diff_data, symbol, interval)
print(f"The order of differencing is {d}")

It looks far more stationary than the original series.

### **V-Auto-Regression and Moving Average parameters**

In this part we aim to determine the p and q parameters of our ARIMA model.

To this end, we plot the **Auto-Correlation Function (ACF)** and the **Partial Auto-Correlation Function (PACF)**.

- The **ACF** plot displays the correlation coefficients between a time series and its lagged values. It shows how the present value of the series is related to its previous values.
- The **PACF** plot displays the correlation between the series and a given lag once the effect of the shorter lags has been removed. For example, at lag 3 it measures the part of the relationship between n(k) and n(k-3) that is not explained by n(k-1) and n(k-2).

We look for significant points outside the shaded area and for a geometric decay, which would indicate a time series where ARIMA may be appropriate.
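
To show what the two plots contain, here is a minimal pure-Python sketch of the sample ACF, of the PACF obtained from it with the Durbin-Levinson recursion, and of the usual approximate 95% band of ±1.96/√n. It is an illustration only: `acf` and `pacf_from_acf` are hypothetical helpers, and `acf_pacf` from `crypto_module` may compute or display things differently.

```python
import math


def acf(values, max_lag):
    """Sample autocorrelations r_0..r_max_lag (biased estimator)."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def pacf_from_acf(r):
    """Partial autocorrelations from autocorrelations (Durbin-Levinson)."""
    pacf = [1.0]
    prev = []  # prev[j] holds phi_{k-1, j+1}
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Toy series that flips sign at every step (illustrative, not market data).
series = [1.0, -1.0] * 50
r = acf(series, max_lag=5)
band = 1.96 / math.sqrt(len(series))  # approximate 95% band under white noise
significant = [k for k in range(1, len(r)) if abs(r[k]) > band]
print(significant)
print(pacf_from_acf(r))
```

Lags whose coefficient falls outside the band correspond to the points outside the shaded area in the plots.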



##### **01- ACF and PACF plots**

In [None]:
acf_pacf(diff_data, 'close')

The ACF and PACF plots show some significant points (the ninth lag, for example), but no geometric decay.

What we are seeing is likely an ARIMA(0, 1, 0) model, meaning our differenced data is what is known as **“white noise”** and our original data is a **“random walk”**. In that case, the best prediction we can make of the current value is the previous value.
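
A small simulation makes the link between the two notions explicit. The sketch below is a toy example with Gaussian steps and a fixed seed, not a model of Bitcoin prices: it builds a random walk by accumulating white noise, checks that differencing recovers the noise, and shows the forecast an ARIMA(0, 1, 0) without drift would give. `random_walk` and `naive_forecast` are hypothetical helpers.

```python
import random


def random_walk(n_steps, start=0.0, sigma=1.0, seed=0):
    """Return (walk, steps): the walk is the running sum of white-noise steps."""
    rng = random.Random(seed)
    steps = [rng.gauss(0.0, sigma) for _ in range(n_steps)]
    walk = [start]
    for step in steps:
        walk.append(walk[-1] + step)
    return walk, steps


def naive_forecast(history, horizon):
    """ARIMA(0, 1, 0) without drift: every future value equals the last one."""
    return [history[-1]] * horizon


walk, steps = random_walk(50)
recovered = [b - a for a, b in zip(walk, walk[1:])]  # first difference
print(max(abs(x - s) for x, s in zip(recovered, steps)))  # numerically zero
print(naive_forecast(walk, horizon=3))
```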

Before moving forward, we need to check these hypotheses.


##### **02- Is our data white noise?**

If the variables are independent and identically distributed with a mean of zero, the time series is white noise. Let's check this possibility.

We'll proceed in two steps:

- Split the dataset into two parts and compare the distributions
- Compute the Ljung-Box test, for which H0 = the data is independently distributed.
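
The first step can be sketched as a comparison of summary statistics between the two halves of the series. This is a rough illustration: `half_summary` is a hypothetical helper, and `fast_dist_check` may compare the distributions in another way (histograms, for instance).

```python
from statistics import mean, pvariance


def half_summary(values):
    """Mean and variance of the first and second halves of a series."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return {
        "first": (mean(first), pvariance(first)),
        "second": (mean(second), pvariance(second)),
    }


# Toy series with a trend (illustrative): the halves have different means.
summary = half_summary([1, 2, 3, 4, 5, 6, 7, 8])
print(summary)
```

For white noise, both halves should show a mean close to zero and similar variances; a clear gap between them argues against the white-noise hypothesis.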

###### **02a- SPLITTING THE DATA**

In [None]:
fast_dist_check(diff_data, 'close')

BAD NEWS: as we can see, the distributions are quite similar in the two samples. We cannot rule out the hypothesis that our differenced data is white noise.

Now let's perform the Ljung-Box test.

###### **02b- COMPUTING LJUNG-BOX TEST**

In [None]:
LB_test(diff_data, 'close')

The Ljung-Box test fails to reject the null hypothesis: we find no evidence against the differenced data being white noise.
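
For readers who want to see what the test measures, the Ljung-Box statistic is Q = n(n+2) Σ r_k² / (n−k), summed over lags k = 1..h, where r_k is the sample autocorrelation at lag k. Under H0 it approximately follows a chi-square distribution with h degrees of freedom. The sketch below computes Q only. `ljung_box_q` is a hypothetical helper rather than the `LB_test` used above, and a p-value would normally come from a statistics library.

```python
def ljung_box_q(values, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    denom = sum(c * c for c in centered)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total


# Strongly autocorrelated toy series (sign flips at every step).
q_toy = ljung_box_q([1.0, -1.0] * 50, max_lag=10)
CHI2_95_DF10 = 18.307  # 5% critical value of chi-square with 10 degrees of freedom
print(q_toy > CHI2_95_DF10)  # a large Q rejects independence
```

A series that is close to white noise gives a small Q, below the critical value, which is the situation described above.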

##### **03- Using Auto-ARIMA**

The “Auto-ARIMA” function from the pmdarima package fits and tests a selection of models and returns the one with the lowest **Akaike Information Criterion (AIC)** value.
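
As a reminder of what is being minimized, AIC = 2k − 2 ln(L), where k is the number of estimated parameters and L the maximized likelihood: extra parameters are only worth it if they improve the fit enough. The sketch below applies this rule to placeholder numbers that are NOT results from this notebook. `aic` and `best_order` are hypothetical helpers, not the pmdarima API.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def best_order(candidates):
    """candidates maps (p, d, q) -> (log_likelihood, n_params)."""
    return min(candidates, key=lambda order: aic(*candidates[order]))


# Placeholder log-likelihoods, made up purely for illustration.
toy_candidates = {
    (0, 1, 0): (-100.0, 1),
    (1, 1, 0): (-99.8, 2),
    (1, 1, 1): (-99.5, 3),
}
print(best_order(toy_candidates))
```

In this toy case the richer models fit slightly better, but not by enough to offset the penalty on the number of parameters.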

In [None]:
auto_arima(data, "close")

The Auto-ARIMA process states that the best fit for the data is an ARIMA(). The result is the same for every information criterion we can use in the process (hqic, bic, oob..). It reinforces the idea that our differenced data is white noise.

### **VI- FORECASTING FUTURE VALUES**

Since we obtained a significant correlation at the ninth lag, we will still try to predict future values with an auto-regression order and a moving average order of 10. Let's see the results!

In [None]:
model(data, "close", 30, end_date, symbol, interval, 10, 1, 10)

### **VII-CONCLUSIONS**

The forecast is visibly poor, which reinforces the idea that the price is more likely to be a random walk.

The model fails to reproduce the sudden price jumps that we observe in reality.
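
One possible way to back this visual impression with a number is to compare the forecast error with that of the random-walk baseline, which simply repeats the last observed value. The sketch below is a suggestion under that assumption; `rmse` and `beats_naive` are hypothetical helpers and the values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def beats_naive(history, actual, model_forecast):
    """True if the model's RMSE is lower than the last-value baseline's."""
    naive = [history[-1]] * len(actual)
    return rmse(actual, model_forecast) < rmse(actual, naive)


# Made-up numbers, for illustration only.
history = [1.0, 2.0, 3.0]
actual = [4.0, 5.0]
print(beats_naive(history, actual, model_forecast=[4.0, 5.0]))
print(beats_naive(history, actual, model_forecast=[0.0, 0.0]))
```

If the fitted model cannot beat this baseline on held-out data, the random-walk reading of the price is hard to dismiss.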