##Principles of Forecasting

###Forecasts Based on Conditional Expectation

Suppose we are interested in forecasting the dependent variable $y_{t+1}$ based on the independent variables $X_t$; that is, we want to predict a dependent variable in the next period ($t+1$) using information about independent variables available in the current period ($t$). One common example is forecasting $y_{t+1}$ using its $m$ most recent values, so that $X_t$ consists of $y_t, y_{t-1}, \dots, y_{t-m+1}$. Let $\widehat{y}_{t+1|t}$ denote the **forecast** of $y_{t+1}$ based on $X_t$.

We will choose a **loss function** to summarize how concerned we are if our forecast misses by a certain amount. Quadratic loss functions provide particularly convenient results, so a very common choice is the **Mean Squared Error (MSE)**:
\begin{equation}
MSE(\widehat{y}_{t+1|t}) = E\left[(y_{t+1} - \widehat{y}_{t+1|t})^2\right].
\end{equation}
It can be shown analytically that the forecast with the smallest MSE is the conditional expectation of $y_{t+1}$ given $X_t$, that is, $E(y_{t+1}|X_t)$.
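To make this result concrete, here is a minimal simulation sketch. The data generating process ($y = x^2$ plus noise) is an assumption chosen purely for illustration, so that the conditional expectation is known to be $x^2$; the alternative forecasts are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed toy process: y = x**2 + noise, so E(y | x) = x**2 by construction
x = rng.normal(size=n)
y = x**2 + rng.normal(size=n)

def mse(forecast):
    """Sample analogue of E[(y - forecast)^2]."""
    return np.mean((y - forecast) ** 2)

mse_cond_exp = mse(x**2)                  # forecast with the conditional expectation
mse_constant = mse(np.full(n, y.mean()))  # forecast that ignores x entirely
mse_shifted = mse(x**2 + 0.5)             # conditional expectation plus a constant bias

print("Conditional expectation:", mse_cond_exp)
print("Unconditional mean:     ", mse_constant)
print("Biased forecast:        ", mse_shifted)
```

In this setup the conditional expectation should give the smallest of the three values, with its MSE close to the variance of the noise term.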

###Forecasts Based on Linear Projection

Now, we restrict the class of forecasts considered to linear functions:
\begin{equation}
\widehat{y}_{t+1|t} = \mathbf{\alpha}'X_t
\end{equation}
where $\mathbf{\alpha}$ is a vector of parameters. Suppose we find a value of $\mathbf{\alpha}$ such that
\begin{equation}
E[(y_{t+1} - \mathbf{\alpha}'X_t)X_t'] = \mathbf{0}'.
\end{equation}
Then we call the forecast $\mathbf{\alpha}'X_t$ the **linear projection** of $y_{t+1}$ on $X_t$. Note that this choice of $\mathbf{\alpha}$ makes the forecast error, $(y_{t+1} - \mathbf{\alpha}'X_t)$, **orthogonal** to the independent variables, $X_t$. This linear projection produces the smallest MSE among all forecasts that are linear in $X_t$.
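The orthogonality condition can be checked numerically. The sketch below uses simulated data (the regressors, coefficients, and nonlinear term are all assumptions for illustration) and replaces population expectations with sample averages. Solving the moment condition for $\mathbf{\alpha}$ gives $\mathbf{\alpha} = E(X_tX_t')^{-1}E(X_ty_{t+1})$, which is what the code computes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Assumed toy data: a constant plus two regressors; y is mildly nonlinear in x1
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.2 * x1**2 + rng.normal(size=n)

# Sample analogue of alpha = E[X X']^{-1} E[X y]
alpha = np.linalg.solve(X.T @ X / n, X.T @ y / n)

# Orthogonality: the average of (forecast error * X) should be numerically zero
errors = y - X @ alpha
moment = errors @ X / n

# Any other linear forecast should have a larger sample MSE
mse_projection = np.mean(errors**2)
mse_perturbed = np.mean((y - X @ (alpha + np.array([0.0, 0.1, 0.0]))) ** 2)

print("Moment condition:", moment)
print("MSE (projection):", mse_projection)
print("MSE (perturbed): ", mse_perturbed)
```

Even though the true relationship here is not linear, the projection is still well defined: it is simply the best forecast within the linear class.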

###Linear Projection and Ordinary Least Squares (OLS)

The discussion above involves (probably unknown) population parameters. Recall from our previous discussions of OLS that the least squares estimator, applied to *sample* data, can be written:
\begin{equation}
\mathbf{b} = \left[\sum_{t=1}^T\mathbf{x}_t\mathbf{x}_t'\right]^{-1}\left[\sum_{t=1}^T\mathbf{x}_ty_{t+1}\right]
\end{equation}
where $\mathbf{x}_t$ and $y_{t+1}$ come from a sample of data.

If we assume that the data generating process that defines the relationship between $X_t$ and $y_{t+1}$ is **covariance stationary** and **ergodic**, then the OLS estimates, $\mathbf{b}$, converge in probability to the linear projection coefficients, $\mathbf{\alpha}$, as the sample size grows to infinity.
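As a rough illustration of this convergence, the sketch below simulates a stationary AR(1) process and applies the OLS formula above with $\mathbf{x}_t = (1, y_t)'$. The parameter values (intercept 0.5, slope 0.7), the sample sizes, and the helper names `simulate_ar1` and `ols_projection` are assumptions made for this example.

```python
import numpy as np

def simulate_ar1(T, beta0=0.5, phi=0.7, seed=0):
    """Simulate T observations of y_{t+1} = beta0 + phi * y_t + eps_{t+1}."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=T)
    y = np.empty(T)
    y[0] = beta0 / (1 - phi)  # start at the long-run mean
    for t in range(1, T):
        y[t] = beta0 + phi * y[t - 1] + eps[t]
    return y

def ols_projection(y):
    """b = [sum x_t x_t']^{-1} [sum x_t y_{t+1}] with x_t = (1, y_t)'."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    return np.linalg.solve(X.T @ X, X.T @ y[1:])

# Estimate the same model on progressively longer samples
estimates = {T: ols_projection(simulate_ar1(T)) for T in (100, 1_000, 100_000)}
for T, b in estimates.items():
    print(f"T={T:>7}: intercept={b[0]:.3f}, slope={b[1]:.3f}")
```

With the longest sample, the estimates should sit close to the assumed values of 0.5 and 0.7; shorter samples will typically be noisier.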

**Covariance stationarity** (or weak stationarity) means that the first two moments of the time series do not change over time: the mean and variance are constant, and the autocovariance between two observations depends only on how far apart they are, not on the date. In other words, the process generating the data is stable, so if we observe it at different points in time, we expect the same behavior in terms of these statistics.

**Ergodicity** means that, given enough data, the time series behavior we observe over time will reflect the “true” or average behavior across all possible realizations of the process. In other words, a single long time series is representative of the behavior of the entire process, as if we had many different versions (or “replications”) of it.
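One way to see what ergodicity buys us is to compare a time average from a single realization with an average across many realizations at one date. The sketch below does this for an AR(1) process with assumed parameters (intercept 0.5, slope 0.7, unit-variance shocks); the number of paths and their length are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, phi = 0.5, 0.7        # assumed AR(1) parameters with |phi| < 1
n_paths, T = 2_000, 2_000

# Simulate many independent realizations of the same process at once.
# Initial values are drawn from the stationary distribution.
y = np.empty((n_paths, T))
y[:, 0] = beta0 / (1 - phi) + rng.normal(size=n_paths) / np.sqrt(1 - phi**2)
for t in range(1, T):
    y[:, t] = beta0 + phi * y[:, t - 1] + rng.normal(size=n_paths)

time_average = y[0].mean()          # one realization, averaged over time
ensemble_average = y[:, -1].mean()  # many realizations, at a single date

print("Time average (one path):   ", time_average)
print("Ensemble average (one date):", ensemble_average)
print("Population mean:            ", beta0 / (1 - phi))
```

Both averages should land near the population mean. In practice we only ever observe one realization, which is why ergodicity is needed to justify estimating population moments from a single time series.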

###Forecasting with Lagged Values

One of the most common and fundamental ways to predict $y_{t+1}$ is to use an **autoregressive model**, or **autoregression**, in which the independent variables are **lags** of the dependent variable. An **AR(1)** autoregression can be written
\begin{equation}
y_{t+1} = \beta_0 +\phi y_t + \epsilon_{t+1},
\end{equation}
while an **AR(m)** autoregression can be written
\begin{equation}
y_{t+1} = \beta_0 + \phi_1 y_t + \phi_2 y_{t-1} + \phi_3 y_{t-2} + \dots + \phi_m y_{t-m+1} + \epsilon_{t+1}
\end{equation}
where $\beta_0$ is a constant term. For the AR(1) model, if $|\phi|\lt 1$ then the long-run (unconditional) expectation of the dependent variable is
\begin{equation}
E(y_t) = \frac{\beta_0}{1-\phi}
\end{equation}
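This expression follows from taking expectations of both sides of the AR(1) equation: under stationarity $E(y_{t+1}) = E(y_t) = \mu$ and $E(\epsilon_{t+1}) = 0$, so $\mu = \beta_0 + \phi\mu$. The sketch below checks this by iterating the forecast recursion forward from an arbitrary starting point; the parameter values and the starting value are hypothetical.

```python
beta0, phi = 0.07, 0.7   # hypothetical AR(1) parameters with |phi| < 1
long_run_mean = beta0 / (1 - phi)

# Iterate E(y_{t+h}) = beta0 + phi * E(y_{t+h-1}) from an arbitrary start
forecast = 2.0
path = []
for h in range(1, 51):
    forecast = beta0 + phi * forecast
    path.append(forecast)

print("Long-run mean:        ", long_run_mean)
print("50-step-ahead forecast:", path[-1])
```

The multi-step forecasts decay geometrically toward the long-run mean at rate $\phi$, which is why the condition $|\phi| < 1$ matters.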

When we add independent variables, and lags of those variables, we get an **autoregressive distributed lag model (ARDL)** which can be written
\begin{equation}
y_{t+1} = \beta_0 + \phi y_t + \beta_1X_t + \beta_2 X_{t-1} + \epsilon_{t+1}.
\end{equation}
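The timing of the regressors is easy to get wrong, so here is a minimal sketch that simulates this ARDL equation and recovers its coefficients by least squares. All parameter values and the simulated series are assumptions for illustration; the point is how each target $y_{t+1}$ is lined up with $y_t$, $X_t$, and $X_{t-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5_000
beta0, phi, beta1, beta2 = 0.2, 0.5, 0.3, -0.1  # assumed true values

# Simulate y_{t+1} = beta0 + phi*y_t + beta1*x_t + beta2*x_{t-1} + eps_{t+1}
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T - 1):
    y[t + 1] = beta0 + phi * y[t] + beta1 * x[t] + beta2 * x[t - 1] + rng.normal()

# Align each target y_{t+1} with the regressors dated t and t-1
target = y[2:]
Z = np.column_stack([np.ones(T - 2), y[1:-1], x[1:-1], x[:-2]])
coef, *_ = np.linalg.lstsq(Z, target, rcond=None)

print("Estimated (beta0, phi, beta1, beta2):", coef)
```

Because every regressor is dated $t$ or earlier, a forecast of $y_{t+1}$ from this specification uses only information available at time $t$.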

Let's look at these models using inflation data.

In [None]:
import pandas as pd
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from datetime import datetime

##AR(1) model

In [None]:
# Load the data from the CSV file
data = pd.read_csv('CPI and Oil Prompt Price.csv')

# Convert the "date" column to a datetime format
data['date'] = pd.to_datetime(data['date'], format='%m/%d/%y')

# Sort data by date to ensure chronological order
data = data.sort_values(by='date').reset_index(drop=True)

# Calculate the percent changes for CPI and oil price
data['cpi_pct_change'] = data['cpi_index'].pct_change(fill_method=None) * 100
data['oil_pct_change'] = data['oil_price'].pct_change(fill_method=None) * 100

# Drop rows with NaN values generated by the percent change calculation
data = data.dropna().reset_index(drop=True)

# Knock off the most recent inflation value (last row in cpi_pct_change)
# Save this for later comparison with the forecast
actual_inflation = data['cpi_pct_change'].iloc[-1]
data = data.iloc[:-1]  # Remove the last row

# Prepare the data for AR(1) forecasting
# We only use lagged values of 'cpi_pct_change' for this model (no oil price)
train_data = data['cpi_pct_change']

# Fit the AR(1) model
model = AutoReg(train_data, lags=1)
model_fit = model.fit()

# Extract the constant (intercept) and slope (AR(1) coefficient)
constant = model_fit.params['const']
slope = model_fit.params['cpi_pct_change.L1']

# Calculate the long-run average inflation
long_run_average_inflation = constant / (1 - slope)

# Generate the forecast for the next period
forecast = model_fit.predict(start=len(train_data), end=len(train_data))

# Compare forecast with actual inflation: squared forecast error for the single held-out observation
predicted_inflation = forecast.iloc[0]
mse = (predicted_inflation - actual_inflation) ** 2

# Calculate R-squared manually
fitted_values = model_fit.fittedvalues
y_mean = train_data.mean()
ss_total = np.sum((train_data - y_mean) ** 2)
ss_residual = np.sum((train_data - fitted_values) ** 2)
r_squared = 1 - (ss_residual / ss_total)

# Display the results
print("Predicted Inflation:", predicted_inflation)
print("Actual Inflation:", actual_inflation)
print("MSE:", mse)
print("Estimated Long-Run Average Inflation:", long_run_average_inflation)
print("R-squared:", r_squared)

Predicted Inflation: 0.21086048404452756
Actual Inflation: 0.17986699392908978
MSE: 0.0009605964295357395
Estimated Long-Run Average Inflation: 0.23219799669365376
R-squared: 0.22584374793933581


##ARDL model

In [None]:
# Add a lagged value for oil percent change and inflation percent change
data['oil_pct_change_lag1'] = data['oil_pct_change'].shift(1)
data['cpi_pct_change_lag1'] = data['cpi_pct_change'].shift(1)

# Drop any remaining NaN values due to lagging
data = data.dropna().reset_index(drop=True)

# Define the exogenous variables: current and lagged oil price changes.
# The lagged CPI change is left out here because lags=1 already adds it;
# passing it again through exog would duplicate the regressor.
X = data[['oil_pct_change', 'oil_pct_change_lag1']]

# Fit the model; AutoReg includes an intercept by default (trend='c')
model = AutoReg(data['cpi_pct_change'], lags=1, exog=X)
model_fit = model.fit()

# Set out-of-sample values for the exogenous variables (most recent observed values)
exog_oos = X.iloc[-1:].values

# Generate the forecast for the next period
forecast = model_fit.predict(start=len(data), end=len(data), exog_oos=exog_oos)

# Compare forecast with the held-out inflation value: squared forecast error for that single observation
predicted_inflation = forecast.iloc[0]
mse = (predicted_inflation - actual_inflation) ** 2

# Calculate R-squared manually
fitted_values = model_fit.fittedvalues
y_mean = data['cpi_pct_change'].mean()
ss_total = np.sum((data['cpi_pct_change'] - y_mean) ** 2)
ss_residual = np.sum((data['cpi_pct_change'] - fitted_values) ** 2)
r_squared = 1 - (ss_residual / ss_total)

# Display the results
print("Predicted Inflation with Oil Price:", predicted_inflation)
print("Actual Inflation:", actual_inflation)
print("MSE:", mse)
print("R-squared:", r_squared)


Predicted Inflation with Oil Price: 0.3666233864796053
Actual Inflation: 0.3129103545463252
MSE: 0.002885089799465566
R-squared: 0.452734965576719
