# **📌 Advanced Model Selection Pipeline for Time Series Forecasting**
This methodology focuses on **feature characteristics**, **statistical properties**, and **machine learning metrics** to determine the most suitable model for a given time series dataset.

---

## **1️⃣ Step 1: Dataset Understanding & Preprocessing**
### **1.1. Identify Univariate vs. Multivariate Time Series**
- If **only one feature** (e.g., sales, temperature) → **Univariate**  
- If **multiple features** (e.g., multiple sensors, economic indicators) → **Multivariate**

#### **Key Decision:**
- **Univariate** → SARIMA, Prophet  
- **Multivariate** → XGBoost, LSTM  

> ✅ **Example:**  
> - **Univariate:** Monthly airline passengers  
> - **Multivariate:** Predicting sales based on previous sales + promotions + holidays
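
The split above can be expressed as a small helper. This is a minimal sketch: the function name `classify_series` and its return shape are illustrative assumptions, and the candidate lists simply mirror the Key Decision bullets.

```python
def classify_series(columns, target_col):
    """Label a dataset as univariate or multivariate from its column names.

    Hypothetical helper: `columns` is any iterable of column names and
    `target_col` is the variable to forecast.
    """
    columns = list(columns)
    if target_col not in columns:
        raise ValueError(f"{target_col!r} is not one of the columns")

    # Every column other than the target counts as an additional feature
    feature_cols = [c for c in columns if c != target_col]

    if feature_cols:
        return "Multivariate", ["XGBoost", "LSTM"], feature_cols
    return "Univariate", ["SARIMA", "Prophet"], feature_cols
```

With a pandas DataFrame, the call would look like `classify_series(df.columns, "Revenue")`.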

---

## **2️⃣ Step 2: Stationarity Analysis**
### **2.1. Perform Statistical Stationarity Tests**
A time series is (weakly) **stationary** if its statistical properties (mean, variance, autocovariance) do not change over time.

#### **Key Tests:**
- **Augmented Dickey-Fuller (ADF) Test**
- **KPSS Test (Kwiatkowski-Phillips-Schmidt-Shin)**
- **Rolling Mean & Variance Check**

#### **Python Code:**
```python
from statsmodels.tsa.stattools import adfuller, kpss
import numpy as np

def check_stationarity(ts):
    adf_p = adfuller(ts)[1]
    # kpss returns (statistic, p-value, lags, critical values); keep the p-value
    kpss_p = kpss(ts, regression='c')[1]

    print(f'ADF Test p-value: {adf_p} (Stationary if < 0.05)')
    print(f'KPSS Test p-value: {kpss_p} (Stationary if > 0.05)')

    # ADF null hypothesis: unit root (non-stationary); KPSS null hypothesis: stationary
    if adf_p < 0.05 and kpss_p > 0.05:
        return "Stationary - Consider SARIMA"
    else:
        return "Non-Stationary - Consider LSTM, XGBoost, Prophet"
```

#### **Key Decision:**
- **Stationary (ADF p < 0.05, KPSS p > 0.05)** → SARIMA  
- **Non-Stationary** → LSTM, XGBoost, Prophet (trend-based models)
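
The rolling mean and variance check listed under Key Tests can be sketched numerically as well as visually. The following is a rough illustration, not an established test: the name `rolling_drift`, the window of 12, and the two summary numbers are assumptions chosen for this example, and any cut-off applied to them is a judgment call.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_drift(values, window=12):
    """Summarise how much the rolling mean and rolling variance move over time.

    Returns (mean_drift, var_ratio):
      mean_drift: range of the rolling mean divided by the overall std
      var_ratio:  largest rolling variance divided by the smallest
    """
    x = np.asarray(values, dtype=float)
    if x.size < 2 * window:
        raise ValueError("need at least two full windows of data")

    windows = sliding_window_view(x, window)  # shape: (n - window + 1, window)
    roll_mean = windows.mean(axis=1)
    roll_var = windows.var(axis=1)

    scale = x.std()
    mean_drift = (roll_mean.max() - roll_mean.min()) / scale if scale > 0 else 0.0
    var_ratio = roll_var.max() / roll_var.min() if roll_var.min() > 0 else float("inf")
    return mean_drift, var_ratio
```

Values of `mean_drift` near zero and `var_ratio` near one are what a stable series would produce; a steadily rising series pushes `mean_drift` well above that.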

---

## **3️⃣ Step 3: Trend & Seasonality Analysis**
### **3.1. Detect Trend & Seasonality**
- **Trend:** Is there a consistent increase or decrease?  
- **Seasonality:** Does the pattern repeat at regular intervals?

#### **Methods:**
- **STL Decomposition (Seasonal-Trend decomposition using LOESS)**
- **Fourier Transform for seasonality detection**
- **Autocorrelation (ACF/PACF) Analysis**

#### **Python Code:**
```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

def decompose_time_series(ts, period=12):
    decomposition = sm.tsa.seasonal_decompose(ts, model='additive', period=period)
    decomposition.plot()
    plt.show()
    
    # Assumption: a component counts as "strong" only if it varies more than the residual
    trend_strength = np.std(decomposition.trend.dropna())
    seasonality_strength = np.std(decomposition.seasonal.dropna())
    noise_strength = np.std(decomposition.resid.dropna())

    if max(trend_strength, seasonality_strength) <= noise_strength:
        return "No Strong Trend/Seasonality - Consider XGBoost"
    elif trend_strength > seasonality_strength:
        return "Strong Trend - Consider Prophet or LSTM"
    else:
        return "Strong Seasonality - Consider SARIMA"
```

#### **Key Decision:**
- **Strong Trend → Prophet, LSTM**  
- **Strong Seasonality → SARIMA**  
- **No clear pattern → XGBoost**
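
The Fourier transform method listed under Methods is not covered by `decompose_time_series`. A minimal sketch, assuming an evenly sampled series with no missing values; the function name `dominant_period` and the "share of power" summary are illustrative choices rather than a standard API.

```python
import numpy as np

def dominant_period(values):
    """Return the period (in samples) with the most spectral power and its share of total power."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()                          # remove the constant (zero-frequency) component

    power = np.abs(np.fft.rfft(x)) ** 2       # periodogram, up to a constant factor
    freqs = np.fft.rfftfreq(x.size, d=1.0)    # frequencies in cycles per sample

    power, freqs = power[1:], freqs[1:]       # skip the zero frequency
    peak = int(np.argmax(power))

    period = 1.0 / freqs[peak]
    total = power.sum()
    share = power[peak] / total if total > 0 else 0.0
    return period, share
```

A strong trend concentrates power at the lowest frequencies, so it is usually worth detrending (for example, by differencing) before reading a seasonal period off the spectrum. The detected period can then be passed as `period` to `decompose_time_series`.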

---

## **4️⃣ Step 4: Feature & Correlation Analysis**
For **multivariate time series**, assess how **independent variables influence the target**.

### **4.1. Compute Feature Correlations**
- Pearson Correlation for **linear** relationships  
- Mutual Information Score for **nonlinear** relationships  

#### **Python Code:**
```python
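import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

def correlation_analysis(df, target_column):
    # Pearson correlation is only defined for numeric columns
    correlation_matrix = df.select_dtypes(include="number").corr()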
    plt.figure(figsize=(8, 6))
    sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
    plt.show()

    target_corr = correlation_matrix[target_column].sort_values(ascending=False)
    print("Feature Correlation with Target:\n", target_corr)
```

#### **Key Decision:**
- **Low correlation (< 0.3) → XGBoost (feature importance matters)**
- **High correlation (> 0.5) → LSTM (sequence matters)**
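
`correlation_analysis` only covers the Pearson case. For the nonlinear case, a histogram-based mutual information estimate is one simple option. This is a sketch under stated assumptions: the name `binned_mutual_information` and the choice of 10 bins are illustrative, and the estimate is biased upward for small samples. scikit-learn's `mutual_info_regression` is a more robust alternative.

```python
import numpy as np

def binned_mutual_information(x, y, bins=10):
    """Histogram-based estimate of the mutual information (in nats) between x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities per cell
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)

    independent = px @ py                     # what pxy would be if x and y were independent
    nonzero = pxy > 0                         # empty cells contribute nothing
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / independent[nonzero])))
```

A feature such as `y = x ** 2` on a symmetric range has a Pearson correlation near zero but a clearly positive mutual information, which is the kind of relationship the correlation heatmap alone would miss.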

---

## **5️⃣ Step 5: Autocorrelation Analysis**
Check how past values influence future values.

#### **Methods:**
- **ACF (Autocorrelation Function)**
- **PACF (Partial Autocorrelation Function)**

#### **Python Code:**
```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

def check_autocorrelation(ts, lags=40):
    fig, ax = plt.subplots(1, 2, figsize=(12, 4))
    plot_acf(ts, lags=lags, ax=ax[0])
    plot_pacf(ts, lags=lags, ax=ax[1])
    plt.show()
```

#### **Key Decision:**
- **High autocorrelation → SARIMA**  
- **Low autocorrelation → LSTM, XGBoost**
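
"High" and "low" autocorrelation can be made more concrete by comparing each lag against an approximate white-noise band. A minimal sketch, assuming the common rule of thumb of ±1.96/√n for a 95% band; the function name `significant_lags` is illustrative.

```python
import numpy as np

def significant_lags(values, max_lag=40):
    """Return the sample ACF for lags 1..max_lag, the white-noise band, and the lags outside it."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    n = x.size
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")

    denom = np.dot(x, x)                      # lag-0 autocovariance times n
    acf = np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

    bound = 1.96 / np.sqrt(n)                 # approximate 95% band under a white-noise null
    lags = [k for k, r in enumerate(acf, start=1) if abs(r) > bound]
    return acf, bound, lags
```

Many lags outside the band, especially at multiples of a seasonal period, point toward the "high autocorrelation" branch; an empty or near-empty list points toward the "low autocorrelation" branch.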

---

## **📌 Final Decision Table**
| **Characteristic** | **SARIMA** | **Prophet** | **LSTM** | **XGBoost** |
|-------------------|-----------|-------------|----------|-------------|
| **Univariate?** | ✅ Yes | ✅ Yes | ❌ No | ❌ No |
| **Multivariate?** | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
| **Stationary?** | ✅ Yes | ❌ No | ❌ No | ❌ No |
| **Strong Trend?** | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| **Strong Seasonality?** | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
| **Autocorrelation High?** | ✅ Yes | ❌ No | ❌ No | ❌ No |
| **Handles Nonlinearity?** | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
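
The table can also be applied mechanically. The sketch below encodes each row as the set of models marked ✅ and counts, for each model, how many observed characteristics it supports. The trait keys and the function name `rank_models` are illustrative assumptions, and equal weighting of rows is a simplification.

```python
MODELS = ("SARIMA", "Prophet", "LSTM", "XGBoost")

# One entry per table row: the models marked ✅ for that characteristic
DECISION_TABLE = {
    "univariate": {"SARIMA", "Prophet"},
    "multivariate": {"LSTM", "XGBoost"},
    "stationary": {"SARIMA"},
    "strong_trend": {"Prophet", "LSTM", "XGBoost"},
    "strong_seasonality": {"SARIMA", "LSTM"},
    "high_autocorrelation": {"SARIMA"},
    "nonlinear": {"LSTM", "XGBoost"},
}

def rank_models(observed_traits):
    """Rank models by how many of the observed traits they are marked ✅ for."""
    scores = {model: 0 for model in MODELS}
    for trait in observed_traits:
        for model in DECISION_TABLE[trait]:
            scores[model] += 1
    # sorted() is stable, so ties keep the column order of the table
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, `rank_models(["univariate", "stationary", "strong_seasonality", "high_autocorrelation"])` puts SARIMA first, matching a direct reading of the table.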


In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller


In [19]:
df = pd.read_csv("../preprocessing/output/preprocessed_timeseries0.csv", index_col=0, parse_dates=True)

In [20]:
df

| Period | Revenue | Sales_quantity | Average_cost | The_average_annual_payroll_of_the_region |
|------------|----------|----------|----------|------|
| 2015-01-01 | 0.810920 | 0.812461 | 0.101596 | 0.75 |
| 2015-01-02 | 0.000000 | 0.000000 | 0.171134 | 0.00 |
| 2015-01-03 | 0.865071 | 0.875412 | 0.189211 | 0.75 |
| 2015-01-04 | 1.000000 | 0.988853 | 0.086302 | 1.00 |
| 2015-01-05 | 0.810920 | 0.812461 | 0.356200 | 0.75 |
| ... | ... | ... | ... | ... |
| 2022-01-08 | 0.810920 | 0.812461 | 0.590341 | 0.75 |
| 2022-01-09 | 0.810920 | 0.812461 | 0.590341 | 0.75 |
| 2022-01-10 | 0.810920 | 0.812461 | 0.590341 | 0.75 |
| 2022-01-11 | 0.810920 | 0.812461 | 0.590341 | 0.75 |


In [21]:
def select_model(data, target_col, exog_features=False):

    target_series = data[target_col]
    
    # Stationarity check (ADF Test on target variable)
    adf_result = adfuller(target_series)
    is_stationary = adf_result[1] < 0.05  

    # Seasonality check (using autocorrelation on target variable)
    seasonality_strength = target_series.autocorr(lag=12)  # Adjust lag based on frequency
    is_seasonal = seasonality_strength > 0.6  

    # Trend check (rolling mean difference)
    trend_strength = np.abs(target_series.rolling(window=12).mean().diff().mean())
    has_trend = trend_strength > 0.01  

    # Check if data is multivariate
    is_multivariate = len(data.columns) > 1  

    # Model selection logic
    if not is_multivariate:
        if is_stationary and not is_seasonal:
            return "SARIMA"
        elif is_seasonal and has_trend:
            return "Prophet"
        else:
            return "LSTM"
    
    if exog_features:
        return "XGBoost"
    elif is_seasonal and has_trend:
        return "Prophet"
    else:
        return "LSTM"

# Example usage
selected_model = select_model(df, target_col=df.columns[-1], exog_features=False)
print(f"Selected Model: {selected_model}")


Selected Model: LSTM


In [22]:
def select_model(data, target_col, exog_features=False):
    # Stationarity check using Augmented Dickey-Fuller (ADF) test
    adf_result = adfuller(data[target_col])
    if adf_result[1] > 0.05:
        print("Data is not stationary. Consider differencing or using a model that handles non-stationarity.")
        # If data is not stationary, consider using SARIMA or Prophet

    # Seasonality check using autocorrelation
    from statsmodels.tsa.stattools import acf
    acf_result = acf(data[target_col], nlags=20)
    if any(acf_result[1:] > 0.5):
        print("Data exhibits seasonality. Consider using SARIMA or Prophet.")

    # Trend check using rolling mean difference
    trend_result  = data[target_col].rolling(window=12).mean().diff().mean()
    if trend_result.mean() > 0.1:
        print("Data exhibits trend. Consider using ARIMA or SARIMA.")

    # Exogenous-feature check (flag passed by the caller; the DataFrame's columns are not inspected)
    if exog_features:
        print("Data includes exogenous features. Consider using LSTM or Prophet.")

    # Model selection logic
    if adf_result[1] > 0.05 and any(acf_result[1:] > 0.5):
        # Data is not stationary and exhibits seasonality
        return "SARIMA"
    elif trend_result.mean() > 0.1 and exog_features:
        # Data exhibits trend and includes exogenous features
        return "LSTM"
    elif any(acf_result[1:] > 0.5) and not exog_features:
        # Data exhibits seasonality but no exogenous features
        return "Prophet"
    else:
        # Fallback: none of the conditions above matched
        return "ARIMA"
    
selected_model = select_model(df, target_col=df.columns[-1], exog_features=False)
print(f"Selected Model: {selected_model}")

Selected Model: ARIMA


In [None]:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss
from scipy.stats import entropy
from sklearn.ensemble import RandomForestClassifier

# Function to extract features from time series
def extract_features(df, target_col):
    series = df[target_col]
    
    # Stationarity Tests
    adf_p = adfuller(series)[1]  # ADF Test p-value
    kpss_p = kpss(series, regression="c")[1]  # KPSS Test p-value
    
    # Shannon entropy of a 10-bin histogram (a coarse complexity proxy, not approximate entropy)
    approx_entropy = entropy(np.histogram(series, bins=10)[0])
    
    # Auto-correlation & Seasonality
    autocorr_lag1 = series.autocorr(lag=1)
    autocorr_lag12 = series.autocorr(lag=12)
    
    # Variance stability
    rolling_var = series.rolling(window=12).var().mean()
    
    return [adf_p, kpss_p, approx_entropy, autocorr_lag1, autocorr_lag12, rolling_var]

# Example: Train a meta-model on past datasets
X_train = []  # Feature matrix
y_train = []  # Model labels

for dataset, best_model in past_datasets:  # Assume past_datasets is pre-collected
    X_train.append(extract_features(dataset, target_col="value"))
    y_train.append(best_model)

# Train the meta-learning classifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Predict the best model for a new dataset
new_features = extract_features(df, target_col="value")
selected_model = clf.predict([new_features])[0]

print(f"Selected Model: {selected_model}")
