# Regression Model with Seasonal ARIMA

- The model is a linear regression with (Seasonal) ARIMA errors, i.e. (S)ARIMAX
- Because we are using (S)ARIMAX, the model also implicitly uses past values of the dependent variable PT08_S4_NO2 and past errors of the model as additional regression variables
- Regression Goal: Build a regression model to predict the hourly value of the PT08_S4_NO2 variable
- Regression Strategy: The variables are
    - X, a matrix of regression (independent) variables (IVs) -- Temperature (T) and Absolute Humidity (AH)
    - y, the dependent variable (DV) -- PT08_S4_NO2
    - the corresponding columns in the data -- T, AH, PT08_S4_NO2
    - SIDE NOTE: (IV) vs (DV): the IV causes an effect on the DV, while the DV depends on the IV [4]
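
One common way to write the model described above, as a sketch in standard Box-Jenkins notation (the symbols below are assumptions of this note, not taken from the notebook): the regression on X leaves an error term, and that error term follows a seasonal ARIMA process with seasonal period m.

$$
y_t = \mathbf{x}_t^\top \boldsymbol{\beta} + \eta_t
$$

$$
\phi(B)\,\Phi(B^m)\,(1 - B)^d\,(1 - B^m)^D\,\eta_t = \theta(B)\,\Theta(B^m)\,\varepsilon_t
$$

Here $y_t$ is PT08_S4_NO2 at hour $t$, $\mathbf{x}_t$ holds T and AH, $B$ is the backshift operator ($B\,\eta_t = \eta_{t-1}$), $\phi$ and $\theta$ are the non-seasonal AR and MA polynomials of orders p and q, $\Phi$ and $\Theta$ are their seasonal counterparts of orders P and Q, and $\varepsilon_t$ is white noise.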

# Imports + Load Data

In [2]:
# !pip install statsmodels

In [3]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from patsy import dmatrices  # builds the y and X matrices from a model expression
import statsmodels.graphics.tsaplots as tsa  # ACF plots
from statsmodels.regression import linear_model  # OLS regression
from statsmodels.tsa.arima.model import ARIMA  # regression with (S)ARIMA errors
from statsmodels.tsa.seasonal import seasonal_decompose

In [12]:
BASE = '/Users/brinkley97/Documents/development/'
path_to_dataset = 'book-forecasting_and_control_by_4_Gs/datasets/'
name_of_dataset = 'air_quality_uci_mod.csv'
dataset = BASE + path_to_dataset + name_of_dataset
df = pd.read_csv(dataset, header=0)
df

Unnamed: 0,DateTime,CO_GT,PT08_S1_CO,NMHC_GT,C6H6_GT,PT08_S2_NMHC,NOx_GT,PT08_S3_NOx,NO2_GT,PT08_S4_NO2,PT08_S5_O3,T,RH,AH
0,03-10-04 18:00,2.6,1360,150,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578
1,03-10-04 19:00,2.0,1292,112,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255
2,03-10-04 20:00,2.2,1402,88,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502
3,03-10-04 21:00,2.2,1376,80,9.2,948,172,1092,122,1584,1203,11.0,60.0,0.7867
4,03-10-04 22:00,1.6,1272,51,6.5,836,131,1205,116,1490,1110,11.2,59.6,0.7888
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7339,04-04-05 10:00,3.1,1314,-200,13.5,1101,472,539,190,1374,1729,21.9,29.3,0.7568
7340,04-04-05 11:00,2.4,1163,-200,11.4,1027,353,604,179,1264,1269,24.3,23.7,0.7119
7341,04-04-05 12:00,2.4,1142,-200,12.4,1063,293,603,175,1241,1092,26.9,18.3,0.6406
7342,04-04-05 13:00,2.1,1003,-200,9.5,961,235,702,156,1041,770,28.3,13.5,0.5139


# 1. Prepare the data
- [ ] Convert the DateTime column in the DataFrame into a pandas datetime column and set it as the index of the DataFrame.
- [ ] Set the frequency attribute of the index to hourly.
- [ ] Verify that there are no empty cells in any column. The per-column count of null values should be all zeroes.
- [ ] Create the training and test data sets.
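
The preparation steps above could be sketched roughly as follows. To stay self-contained, the sketch rebuilds the first five rows of the preview rather than reading the CSV. Two things are assumptions: the month-day-year reading of the timestamps, and the size of the hold-out (`n_test` is a hypothetical name; on the full data the hold-out would need to cover at least the 24 hours forecast later).

```python
import io

import pandas as pd

# First five rows from the preview above, reduced to the columns used here.
csv_text = """DateTime,PT08_S4_NO2,T,AH
03-10-04 18:00,1692,13.6,0.7578
03-10-04 19:00,1559,13.3,0.7255
03-10-04 20:00,1555,11.9,0.7502
03-10-04 21:00,1584,11.0,0.7867
03-10-04 22:00,1490,11.2,0.7888
"""
df = pd.read_csv(io.StringIO(csv_text), header=0)

# Assumption: timestamps are month-day-year, so the first row is 10 March 2004.
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%m-%d-%y %H:%M")
df = df.set_index("DateTime")

# Set an hourly frequency on the index. Any hour missing from the data
# would show up as a row of NaN values after this call.
df = df.asfreq(pd.offsets.Hour())

# Count empty cells per column; every count should be zero.
null_counts = df.isnull().sum()
print(null_counts)

# Chronological split: the last n_test rows become the test set.
n_test = 2  # illustrative only, because the sample has five rows
df_train = df.iloc[:-n_test]
df_test = df.iloc[-n_test:]
```

The split is chronological rather than random because the later steps forecast beyond the end of the training period.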

# 2. Create a Linear Regression model

- [ ] Create the model expression
- [ ] Carve out the y and X matrices
- [ ] Fit an Ordinary Least Squares Linear Regression (OLSR) model on the training dataset
- [ ] Show results
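
A minimal sketch of these four steps, using the `dmatrices` and `linear_model` imports from the top of the notebook. The training frame here is synthetic (random numbers with made-up coefficients), so the fitted values say nothing about the real air quality data. Only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from statsmodels.regression import linear_model

# Synthetic stand-in for df_train (assumption: invented values and coefficients).
rng = np.random.default_rng(0)
n = 200
df_train = pd.DataFrame({
    "T": rng.normal(15, 5, n),
    "AH": rng.normal(0.8, 0.2, n),
})
df_train["PT08_S4_NO2"] = (
    900 + 20 * df_train["T"] + 300 * df_train["AH"] + rng.normal(0, 30, n)
)

# Model expression in patsy syntax: y on the left, regressors on the right.
expr = "PT08_S4_NO2 ~ T + AH"

# Carve out y and X; patsy adds an Intercept column to X.
y_train, X_train = dmatrices(expr, df_train, return_type="dataframe")

# Fit ordinary least squares on the training data and show the results.
olsr_results = linear_model.OLS(y_train, X_train).fit()
print(olsr_results.summary())

# The residual errors are what the ACF steps in the next section examine.
olsr_resid = olsr_results.resid
```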

# 3. Estimate (S)ARIMA parameters (p, d, q) and m

- [ ] Plot the autocorrelation function (ACF/ACor) of the residual errors
    - [ ] Plot with my ACor functions [5]
- [ ] Summarize ACF plots
- [ ] Difference the time series once (i.e., d = 1)
- [ ] Replot the ACF of the once-differenced time series of residual errors
- [ ] Difference the time series again (i.e., d = 2)
- [ ] Replot the ACF of the twice-differenced time series of residual errors
- [ ] Verify that the seasonal period m is 24 hours
    - [ ] Decompose the residual errors of regression into trend, seasonality and noise by using the `seasonal_decompose()` function provided by statsmodels
    - [ ] Plot output from using the `seasonal_decompose()` function
- [ ] Apply a single seasonal difference to our already differenced time series of residual errors

# 4. Build and fit the Regression Model with Seasonal ARIMA errors
- [ ] Fit the SARIMAX model on the training data set (y_train, X_train) 
- [ ] Build and test the SARIMAX model
- [ ] Simplify our model by setting Q to 0 (i.e., we’ll try a SARIMAX(1,1,0)(0,1,0)24 model)

# 5. Prediction
- [ ] Predict the value of y (PT08_S4_NO2) for the next 24 hours beyond the end of the training data set
- [ ] Call the `get_forecast` method to get the out-of-sample forecasts
- [ ] Plot the actual value y_test from the test data set

# References

1. BOOK: [Time Series Analysis: Forecasting and Control, 5th Edition](https://www.wiley.com/en-us/Time+Series+Analysis:+Forecasting+and+Control,+5th+Edition-p-9781118675021)
2. BOOK: [Time Series Analysis, Regression and Forecasting with tutorials in Python](https://timeseriesreasoning.com/)
3. WEBSITE: [Introduction to Regression With ARIMA Errors Model](https://timeseriesreasoning.com/contents/regression-with-arima-errors-model/)
4. PAPER: [A Survey of Multimodal Probabilistic Learning](https://detraviousjbrinkley.notion.site/A-Survey-of-Multimodal-Probabilistic-Learning-for-Human-Communication-and-Emotion-Recognition-d40cab0081024276b876ae0de4204dc7) by Detravious (under construction)
5. SOFTWARE: [Calculate ACov and ACor](https://github.com/Brinkley97/book-forecasting_and_control/blob/main/part_1/2-autocorrelation_func_and_spectrum_of_stationary_process/exercises/2.4.3.ipynb) by Detravious