# Time Series

The basic object of forecasting is the time series, a set of observations recorded over time. In forecasting applications, the observations are typically recorded at a regular frequency, such as daily or monthly. This example uses the simulated monthly sales of a company over a period of 5 years (60 months). The time series I'm working with has a slight upward trend, meaning sales generally increase over time.

### Sales Data: 
Simulated sales for a business over time.
### Trend: 
A general direction in which the data points (sales) move over time.
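
To make "trend" concrete, here is a minimal sketch that fits a straight line to a simulated series and reads off its slope. The series is built like the sales data in this notebook (level 2000, noise of about 100, +10 per month). Using `np.polyfit` as a trend estimator is my own choice for illustration and is not part of the ARIMA workflow.

```python
import numpy as np

# Simulated monthly sales: level 2000, +10 per month, random noise
rng = np.random.default_rng(42)
months = np.arange(60)
sales = 2000 + 10 * months + rng.normal(0, 100, size=60)

# Fit a straight line; the slope is the estimated trend per month
slope, intercept = np.polyfit(months, sales, deg=1)
print(f"Estimated trend: {slope:.1f} per month (value used in the simulation: 10)")
```

The estimated slope should land near 10, the value used to generate the data, but it will not match exactly because of the noise.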

## Why Apply the ARIMA Model?
The ARIMA model is widely used for forecasting time series data where patterns, such as trends, may exist but are not obvious or deterministic. (Seasonal patterns are usually handled with the seasonal extension of the model, SARIMA.)
ARIMA helps:
- Capture how the current value depends on past values (auto-regression, AR).
- Make the data stationary using differencing (the "I" in ARIMA), which removes the trend and makes the data more stable for modeling.
- Account for the effect of past forecast errors on the current value (the moving average component, MA).

## ARIMA Breakdown:
### AR (Auto-Regressive) Terms (p): 
This part of the model captures the influence of previous time periods on the current value. For example, if you set p=5, the model will consider the sales values from the last 5 months to predict the next value.
### I (Integrated) Term (d): 
Differencing replaces each value with its change from the previous period. It is used to make the time series stationary, meaning it removes trends so that the data fluctuates around a stable mean. This is important because the AR and MA parts of ARIMA assume a stationary series; d is the number of times the differencing is applied.
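
A small sketch of what first-order differencing does to a trending series. The data is simulated here with the same shape as the sales data (an assumption made for illustration); in pandas the equivalent call would be `df['Sales'].diff()`.

```python
import numpy as np

# Simulated monthly sales with an upward trend of 10 per month
rng = np.random.default_rng(0)
months = np.arange(60)
sales = 2000 + 10 * months + rng.normal(0, 100, size=60)

# First-order differencing: diffed[i] = sales[i + 1] - sales[i]
diffed = np.diff(sales)

# Compare the fitted trend slope before and after differencing
slope_before = np.polyfit(months, sales, deg=1)[0]
slope_after = np.polyfit(months[1:], diffed, deg=1)[0]

print(f"observations before: {len(sales)}, after: {len(diffed)}")
print(f"trend slope before: {slope_before:.2f}, after: {slope_after:.2f}")
print(f"mean month-to-month change: {diffed.mean():.2f}")
```

The differenced series has one fewer observation, its fitted slope is close to zero, and its mean is roughly the monthly trend of the original series.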
### MA (Moving Average) Term (q): 
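This term models the current value as a function of past forecast errors, that is, the differences between earlier predictions and what was actually observed. With q=1, for example, the model uses the error from the previous period.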
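
To illustrate the moving average idea, here is a minimal sketch of one-step-ahead forecasts from an MA(1) model, y_t = mu + e_t + theta * e_(t-1). The function name, the values of `mu` and `theta`, and the choice to start with a zero error are all assumptions for illustration; a real fit estimates these from the data.

```python
def ma1_one_step_forecasts(series, mu, theta):
    """One-step-ahead forecasts for an MA(1) model: y_t = mu + e_t + theta * e_(t-1)."""
    forecasts = []
    prev_error = 0.0  # assumption: no error is known before the first observation
    for y in series:
        forecast = mu + theta * prev_error  # use only the previous forecast error
        forecasts.append(forecast)
        prev_error = y - forecast           # error made on this observation
    return forecasts


# Hypothetical values, chosen so the arithmetic is easy to follow
forecasts = ma1_one_step_forecasts([12.0, 9.0, 11.0], mu=10.0, theta=0.5)
print(forecasts)
```

Each forecast is nudged up or down by half of the previous error, which is the sense in which the MA term "learns" from past mistakes.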

Here ARIMA(5, 1, 0) means that the model applies 1 level of differencing to make the data stationary, uses the 5 most recent month-to-month changes in sales as predictors, and has no moving average component (q=0).
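
Written out, and assuming no constant term, the ARIMA(5, 1, 0) model can be sketched as follows. The notation is my own: $y_t$ is sales in month $t$, $\Delta y_t$ is the month-to-month change, $\phi_1, \dots, \phi_5$ are the AR coefficients, and $\varepsilon_t$ is the error term.

$$
\Delta y_t = y_t - y_{t-1}, \qquad
\Delta y_t = \phi_1 \Delta y_{t-1} + \phi_2 \Delta y_{t-2} + \phi_3 \Delta y_{t-3} + \phi_4 \Delta y_{t-4} + \phi_5 \Delta y_{t-5} + \varepsilon_t
$$

With q > 0, additional terms of the form $\theta_j \varepsilon_{t-j}$ would appear on the right-hand side.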


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from statsmodels.tsa.arima.model import ARIMA

# Simulate monthly sales data over 5 years (60 months)
np.random.seed(42)
# 'M' is the month-end frequency alias; pandas 2.2+ deprecates it in favor of 'ME'
dates = pd.date_range(start='2018-01-01', periods=60, freq='M')
sales = np.random.normal(2000, 100, size=len(dates)) + np.arange(len(dates)) * 10  # Sales with a slight upward trend

# Create a DataFrame
df = pd.DataFrame({'Date': dates, 'Sales': sales})
df.set_index('Date', inplace=True)
print(df.head())

# Plot the sales using Plotly Express with pastel colors
fig = px.line(df, x=df.index, y='Sales', title="Monthly Sales", 
              labels={'Sales': 'Sales Amount'}, line_shape='linear',
              color_discrete_sequence=px.colors.qualitative.Pastel)

fig.show()

# Create the ARIMA (Auto-Regressive Integrated Moving Average) model.
# order=(p, d, q):
#   p=5: autoregressive terms; the current value depends on the previous 5 months.
#   d=1: differencing; each value is replaced by its change from the previous month to remove the trend.
#   q=0: moving average terms; none are used here.
model = ARIMA(df['Sales'], order=(5, 1, 0))

model_fit = model.fit()

# Forecast for the next 12 months (1 year)
forecast = model_fit.forecast(steps=12)

# Plot historical sales and forecasted sales using Plotly Express
fig_forecast = px.line(df.reset_index(), x='Date', y='Sales', title="Sales Forecast for the Next 12 Months", 
                       labels={'Sales': 'Sales Amount'}, color_discrete_sequence=px.colors.qualitative.Pastel)

# Add the forecasted data to the plot; the forecast is a pandas Series indexed by the forecast dates
fig_forecast.add_scatter(x=forecast.index, y=forecast, mode='lines', name='Forecast', line=dict(color='red'))

# Show the plot with historical and forecasted sales
fig_forecast.show()

# Display model summary
print(model_fit.summary())


                  Sales
Date                   
2018-01-31  2049.671415
2018-02-28  1996.173570
2018-03-31  2084.768854
2018-04-30  2182.302986
2018-05-31  2016.584663



No frequency information was provided, so inferred frequency M will be used.





                               SARIMAX Results                                
Dep. Variable:                  Sales   No. Observations:                   60
Model:                 ARIMA(5, 1, 0)   Log Likelihood                -354.976
Date:                Thu, 05 Sep 2024   AIC                            721.953
Time:                        17:52:59   BIC                            734.418
Sample:                    01-31-2018   HQIC                           726.819
                         - 12-31-2022                                         
Covariance Type:                  opg                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.6844      0.205     -3.333      0.001      -1.087      -0.282
ar.L2         -0.5792      0.219     -2.643      0.008      -1.009      -0.150
ar.L3         -0.2974      0.221     -1.345      0.1

# Output Interpretation

## AR Coefficients (ar.L1, ar.L2, etc.): 
These are the coefficients for the auto-regressive terms. They represent the contribution of the past values (lags) to the current sales prediction.

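In my output, ar.L1, ar.L2, etc., are the coefficients on the first, second, and so on lags. Because the model uses d=1, these lags are month-to-month changes in sales rather than the sales levels themselves.
For example, ar.L1 = -0.6844 means that last month’s change in sales has a negative effect on this month’s predicted change: holding the other lags fixed, a one-unit increase last month lowers the predicted change this month by about 0.684. In other words, a rise tends to be partly reversed the following month.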
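
A worked example of that arithmetic. Only ar.L1 is taken from the summary; the higher-lag coefficients are set to zero and the sales figures are hypothetical, purely to isolate the effect of the first lag. The function name is my own.

```python
def predicted_change(past_changes, ar_coefs):
    """AR part of the model: weighted sum of the most recent changes.

    past_changes[0] is last month's change, past_changes[1] the month before, and so on.
    """
    return sum(phi * change for phi, change in zip(ar_coefs, past_changes))


# ar.L1 from the summary; lags 2 to 5 set to 0 for illustration only
ar_coefs = [-0.6844, 0.0, 0.0, 0.0, 0.0]
# Hypothetical history: sales rose by 100 last month and were flat before that
past_changes = [100.0, 0.0, 0.0, 0.0, 0.0]

last_sales = 2500.0  # hypothetical sales level last month
change = predicted_change(past_changes, ar_coefs)
next_sales = last_sales + change  # undo the differencing to get back to a sales level
print(f"predicted change: {change:.2f}, predicted sales: {next_sales:.2f}")
```

Under these assumptions, a rise of 100 last month leads the model to expect a fall of about 68 this month.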

## Z-scores and P-values: 
These help determine the statistical significance of each coefficient.

A p-value < 0.05 suggests that the coefficient is statistically significant at the 5% level. In this case, ar.L1 (p = 0.001) and ar.L2 (p = 0.008) are significant, meaning the changes in sales from 1 and 2 months ago have a meaningful relationship with the current change in sales.
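
The z-score is the coefficient divided by its standard error, and the p-value is the two-sided tail probability of that z-score under a standard normal distribution. The sketch below recomputes both from the rounded values printed in the summary, so small differences from the reported z-scores are expected. The helper name is my own.

```python
import math


def z_and_p(coef, std_err):
    """Return the z-score and the two-sided p-value under a standard normal."""
    z = coef / std_err
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Coefficients and standard errors as printed in the summary
z1, p1 = z_and_p(-0.6844, 0.205)
z2, p2 = z_and_p(-0.5792, 0.219)
print(f"ar.L1: z = {z1:.3f}, p = {p1:.4f}")
print(f"ar.L2: z = {z2:.3f}, p = {p2:.4f}")
```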

## Information Criteria (AIC, BIC, HQIC): 
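These measure how well the model fits the data while penalizing the number of parameters. They have no meaning in isolation: they are used to compare candidate models fitted to the same data (e.g., with different lag lengths), and the model with the lower value is preferred. AIC is the Akaike Information Criterion, BIC the Bayesian Information Criterion, and HQIC the Hannan-Quinn Information Criterion.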
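
The three criteria can be recomputed from the log likelihood in the summary using their textbook formulas. Two inputs are my assumptions rather than values read from the output: k = 6 parameters (5 AR coefficients plus the error variance) and n = 59 effective observations (60 minus the one lost to differencing). With those assumptions the results agree with the reported values to within rounding.

```python
import math

log_likelihood = -354.976  # from the summary
k = 6    # assumed: 5 AR coefficients + sigma2
n = 59   # assumed: 60 observations minus 1 lost to differencing

# Each criterion is a penalty for model size minus twice the log likelihood
aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n) - 2 * log_likelihood
hqic = 2 * k * math.log(math.log(n)) - 2 * log_likelihood
print(f"AIC = {aic:.3f}, BIC = {bic:.3f}, HQIC = {hqic:.3f}")
```

BIC penalizes each extra parameter more heavily than AIC at this sample size, which is why it tends to favor smaller models.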

## Sigma²: 
This represents the estimated variance of the residuals (the error term). A smaller value means the model's in-sample predictions are closer to the actual values; its square root is in the same units as sales.

