# SARIMA with Exogenous Variables (X)

## Introduction
- The $SARIMAX$ model further extends the $SARIMA(p,d,q)(P,D,Q)_m$ model by adding the effect of exogenous variables $X$.
- The SARIMAX model is the **most general model** in this family for forecasting time series; the other models are special cases of it.
    - Without seasonal patterns, it reduces to an ARIMAX model.
    - Without exogenous variables, it is a SARIMA model.
    - With neither seasonality nor exogenous variables, it is an ARIMA model.
- SARIMA**X** expresses the present value $y_t$ as a SARIMA model plus any number $n$ of exogenous variables $X_t^i$, as in the equation below:
    - In other words, the SARIMA**X** model simply adds a linear combination of exogenous variables to the SARIMA model.
    
$$y_t = SARIMA(p,d,q)(P,D,Q)_m + \sum_{i=1}^{n}{\beta_i X_t^i}$$

In [1]:
import statsmodels.api as sm


## Exploring the exogenous variables of the US macroeconomics dataset
- There are two ways to work with exogenous variables for time series forecasting.
    - Method 1: train multiple models with different combinations of exogenous variables and keep the one that generates the best forecasts.
    - Method 2: include all exogenous variables and select the model using the AIC, which rewards a good fit but penalizes the number of parameters and so guards against overfitting.
- The US macroeconomics dataset contains the following variables:

| Variable | Description |
|:--------:|:------------:|
| `realgdp`| Real gross domestic product (the target variable or endogenous variable)|
| `realcons`| Real personal consumption expenditure  |
| `realinv` | Real gross private domestic investment |
| `realgovt` | Real federal consumption expenditure and investment |
| `realdpi` | Real private disposable income |
| `cpi` | Consumer price index for the end of the quarter|
| `m1` | M1 nominal money stock |
| `tbilrate` | Quarterly average of the monthly 3-month Treasury bill rate |
| `unemp` | Unemployment rate |
| `pop` | Total population at the end of the quarter |
| `infl` | Inflation rate |
| `realint` | Real interest rate |
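
Method 2 relies on the AIC, defined as $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood. The sketch below shows the trade-off with two hypothetical candidates; the log-likelihoods and parameter counts are made-up numbers chosen for illustration, not results from this dataset.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates (made-up numbers, not fitted to any data).
candidates = {
    "small": {"log_likelihood": -520.0, "n_params": 4},
    "large": {"log_likelihood": -518.5, "n_params": 9},
}

scores = {name: aic(**stats) for name, stats in candidates.items()}
best = min(scores, key=scores.get)

print(scores)  # AIC per candidate
print(best)    # name of the candidate with the lowest AIC
```

Here the larger model fits slightly better, but the gain in log-likelihood is too small to offset its five extra parameters, so the smaller model has the lower AIC.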

In [3]:
macro_econ_data = sm.datasets.macrodata.load_pandas().data  
macro_econ_data.tail()

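| | year | quarter | realgdp | realcons | realinv | realgovt | realdpi | cpi | m1 | tbilrate | unemp | pop | infl | realint |
|:-:|:----:|:-------:|:-------:|:--------:|:-------:|:--------:|:-------:|:---:|:--:|:--------:|:-----:|:---:|:----:|:-------:|
| 198 | 2008.0 | 3.0 | 13324.6 | 9267.7 | 1990.693 | 991.551 | 9838.3 | 216.889 | 1474.7 | 1.17 | 6.0 | 305.27 | -3.16 | 4.33 |
| 199 | 2008.0 | 4.0 | 13141.92 | 9195.3 | 1857.661 | 1007.273 | 9920.4 | 212.174 | 1576.5 | 0.12 | 6.9 | 305.952 | -8.79 | 8.91 |
| 200 | 2009.0 | 1.0 | 12925.41 | 9209.2 | 1558.494 | 996.287 | 9926.4 | 212.671 | 1592.8 | 0.22 | 8.1 | 306.547 | 0.94 | -0.71 |
| 201 | 2009.0 | 2.0 | 12901.504 | 9189.0 | 1456.678 | 1023.528 | 10077.5 | 214.469 | 1653.6 | 0.18 | 9.2 | 307.226 | 3.37 | -3.19 |
| 202 | 2009.0 | 3.0 | 12990.341 | 9256.0 | 1486.398 | 1044.088 | 10040.6 | 216.385 | 1673.9 | 0.12 | 9.6 | 308.013 | 3.56 | -3.44 |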


## Problem with Exogenous variables
- What if you wish to predict two timesteps into the future? 
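    - A SARIMA model can do this directly, but a SARIMAX model also needs the future values of the exogenous variables, so those must be forecast too, and their forecast errors carry over into the forecast of the target.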
- The only way to avoid that situation is to predict only one timestep into the future and **wait to observe** the exogenous variable before predicting the target for another timestep into the future.
- Summary: there is no universal rule that you should predict only one timestep.
    - If your exogenous variables can be accurately predicted, you can recommend forecasting many timesteps into the future.
    - Otherwise, recommend predicting one timestep at a time, and justify the decision by explaining that errors accumulate as more predictions are made, so the forecasts lose accuracy.

## Forecasting with SARIMAX
- In this example, we will use SARIMAX to forecast the real GDP using the following exogenous variables: `realcons`, `realinv`, `realgovt`, `realdpi`, and `cpi`.

In [4]:
target = macro_econ_data['realgdp']
exog = macro_econ_data[['realcons', 'realinv', 'realgovt', 'realdpi', 'cpi']]