# Cross-Validation

[Link to the Video](https://www.youtube.com/watch?v=1rZpbvSI26c&list=PLKmQjl_R9bYd32uHImJxQSFZU5LPuXfQe&index=14)

## Intro

Cross-validation is a staple step when building any statistical or machine learning model and is ubiquitous in data science. However, in the more niche area of time series analysis and forecasting, it is very easy to carry out cross-validation incorrectly.

## What is Cross-Validation?

Cross-validation is a method for determining the best-performing model and parameters by training and testing the model on different portions of the data. The most common and basic approach is the classic train-test split. Here we split our data into a training set, which is used to fit the model, and a test set, on which the fitted model is evaluated.
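
As a minimal sketch of a single train-test split, the snippet below uses scikit-learn's `train_test_split` on toy data. The arrays `X` and `y`, the 20% test size and the `random_state` value are illustrative assumptions, not part of the airline example used later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 10 rows with a single feature
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))
```

Note that `train_test_split` shuffles rows by default, which is fine for independent rows but not for ordered data such as a time series.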

This idea can be taken one step further by repeating the train-test split several times, each time varying the data we train and test on. This process is cross-validation: every row of data is used for both training and evaluation, which helps us choose the model that is most robust across all of the available data.
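
To make the repeated splitting concrete, here is a small sketch using `KFold` on ten toy samples (the array `X` and the choice of five folds are assumptions for illustration). It prints the indices used in each fold and collects the test indices so we can see that every row is tested exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (illustrative only): 10 samples
X = np.arange(10)

kf = KFold(n_splits=5)  # no shuffling, so folds are contiguous blocks
test_indices = []

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_indices.extend(test_idx.tolist())
```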

In [2]:
# Import packages
import plotly.graph_objects as go
import pandas as pd
from sklearn.model_selection import KFold

In [3]:
def plot_cross_val(n_splits: int,
                   splitter_func,
                   df: pd.DataFrame,
                   title_text: str) -> None:
  
    """Function to plot the cross validation of various
    sklearn splitter objects."""

    split = 1
    plot_data = []

    for train_index, valid_index in splitter_func(n_splits=n_splits).split(df):
        plot_data.append([train_index, 'Train', f'{split}'])
        plot_data.append([valid_index, 'Test', f'{split}'])
        split += 1

    plot_df = pd.DataFrame(plot_data,
                           columns=['Index', 'Dataset', 'Split'])\
                           .explode('Index')

    fig = go.Figure()
    for _, group in plot_df.groupby('Split'):
        fig.add_trace(go.Scatter(x=group['Index'].loc[group['Dataset'] == 'Train'],
                                 y=group['Split'].loc[group['Dataset'] == 'Train'],
                                 name='Train',
                                 line=dict(color="blue", width=10)
                                 ))
        fig.add_trace(go.Scatter(x=group['Index'].loc[group['Dataset'] == 'Test'],
                                 y=group['Split'].loc[group['Dataset'] == 'Test'],
                                 name='Test',
                                 line=dict(color="goldenrod", width=10)
                                 ))

    fig.update_layout(template="simple_white", font=dict(size=20),
                      title_text=title_text, title_x=0.5, width=850,
                      height=450, xaxis_title='Index', yaxis_title='Split')

    legend_names = set()
    fig.for_each_trace(
        lambda trace:
        trace.update(showlegend=False)
        if (trace.name in legend_names) else legend_names.add(trace.name))

    return fig.show()

In [4]:
# Read in the data
data = pd.read_csv('../data/airline.csv')
data['Month'] = pd.to_datetime(data['Month'])

In [5]:
# Plot the cross validation
plot_cross_val(n_splits=5,
               splitter_func=KFold,
               df=data,
               title_text='Cross-Validation')

## Time Series Cross-Validation

The above cross-validation is not a valid strategy for forecasting models because of the temporal dependency in the data. For time series, we always predict into the future. However, in the above approach some splits train on data that comes later in time than the test data. This is data leakage and should be avoided at all costs.

To overcome this problem, we need to ensure the test set always has a higher index (the index is usually time for time series data) than the training set. This means our test set is always in the future relative to the data the model is fitted on.
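
The ordering requirement can be checked directly. The sketch below, on twelve toy samples, uses a hypothetical helper `trains_on_future` (not part of scikit-learn) that reports whether any split trains on an index later than the start of its test set.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Toy data (illustrative only): 12 time-ordered samples
X = np.arange(12)


def trains_on_future(splitter) -> bool:
    """Hypothetical helper: True if any split trains on data that
    comes after the start of its test set."""
    return any(train_idx.max() > test_idx.min()
               for train_idx, test_idx in splitter.split(X))


kfold_leaks = trains_on_future(KFold(n_splits=3))
tscv_leaks = trains_on_future(TimeSeriesSplit(n_splits=3))

print(f"KFold trains on future data: {kfold_leaks}")
print(f"TimeSeriesSplit trains on future data: {tscv_leaks}")
```

`TimeSeriesSplit` uses an expanding training window, so each successive split trains on all earlier observations and tests on the block that follows.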

In [6]:
# Import packages
from sklearn.model_selection import TimeSeriesSplit

# Plot the time series cross validation splits
plot_cross_val(n_splits=5,
               splitter_func=TimeSeriesSplit,
               df=data,
               title_text='Time Series Cross-Validation')

## Hyperparameter Tuning

Cross-validation is frequently used in conjunction with hyperparameter tuning to determine the optimal hyperparameter values for a model. Let's quickly go over an example of this process for a forecasting model in Python.

In [7]:
# Import packages
import plotly.express as px


def plot_time_series(df: pd.DataFrame) -> None:
    """General function to plot the passenger data."""
    
    fig = px.line(df, x='Month', y='#Passengers',
                  labels={'Month': 'Date', '#Passengers': 'Passengers'})
                  
    fig.update_layout(template="simple_white", font=dict(size=18),
                      title_text='Airline Passengers', width=650,
                      title_x=0.5, height=400)

    return fig.show()
    

In [8]:
# Read in the data
data = pd.read_csv('../data/airline.csv')
data['Month'] = pd.to_datetime(data['Month'])

In [9]:
# Plot the time series
plot_time_series(df=data)

The data has a clear trend and strong seasonality. A suitable model for this time series is the Holt-Winters exponential smoothing model, which incorporates both trend and seasonality components.

In [10]:
# Import packages
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing

In [11]:
def hyperparameter_tuning_season_cv(n_splits: int,
                                    gammas: list[float],
                                    df: pd.DataFrame) -> pd.DataFrame:                                   
    """Function to carry out cross-validation hyperparameter tuning
    for the seasonal parameter in a Holt Winters' model. """

    tscv = TimeSeriesSplit(n_splits=n_splits)
    error_list = []

    for gamma in gammas:
    
        errors = []
        
        for train_index, valid_index in tscv.split(df):
            train, valid = df.iloc[train_index], df.iloc[valid_index]
            
            model = ExponentialSmoothing(train['#Passengers'], trend='mul',
                                         seasonal='mul', seasonal_periods=12) \
                .fit(smoothing_seasonal=gamma)
                
            forecasts = model.forecast(len(valid))
            errors.append(mean_absolute_percentage_error(valid['#Passengers'], forecasts))

        error_list.append([gamma, sum(errors) / len(errors)])

    return pd.DataFrame(error_list, columns=['Gamma', 'MAPE'])

In [12]:
def plot_error_cv(df: pd.DataFrame,
                  title: str) -> None:                  
    """Bar chart to plot the errors from the different
    hyperparameters."""

    fig = px.bar(df, x='Gamma', y='MAPE')
    fig.update_layout(template="simple_white", font=dict(size=18), title_text=title,
                      width=800, title_x=0.5, height=400)

    return fig.show()

In [13]:
# Carry out cv for hyperparameter tuning for the seasonal parameter
error_df = hyperparameter_tuning_season_cv(df=data,
                                           n_splits=4,
                                           gammas=list(np.arange(0, 1.1, 0.1)))


overflow encountered in matmul


In [14]:
# Plot the tuning results
plot_error_cv(df=error_df, title='Hyperparameter Results')

As we can see from the bar chart, the lowest mean MAPE across the splits occurs at 0.7, so this appears to be the optimal value of the smoothing_seasonal hyperparameter.
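
Rather than reading the best value off the chart, it can also be selected programmatically. The sketch below assumes a results frame with the same `Gamma` and `MAPE` columns that `hyperparameter_tuning_season_cv` returns; the helper name `best_gamma` and the toy values are hypothetical and are not the results from the run above.

```python
import pandas as pd


def best_gamma(error_df: pd.DataFrame) -> float:
    """Hypothetical helper: return the Gamma with the lowest mean MAPE."""
    best_row = error_df['MAPE'].idxmin()
    return float(error_df.loc[best_row, 'Gamma'])


# Toy values for illustration only (not the tuning results above)
toy_errors = pd.DataFrame({'Gamma': [0.1, 0.5, 0.9],
                           'MAPE': [0.30, 0.12, 0.20]})

print(best_gamma(toy_errors))
```

With the real results, the call would be `best_gamma(error_df)`.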