## Cross Validation Split

Next, import the modules needed for cross validation.

In [0]:
from pyspark.sql.functions import percent_rank
from pyspark.sql import Window
from datetime import datetime, date
from pyspark.sql import Row

One key part of cross validation is splitting the data into folds for model evaluation. Below is a function that takes a full training dataframe and returns three lists holding the training, validation, and test data for each fold. The split orders the data by a specified time column, adds a `rank` column giving each row's percentile position in that ordering, and then cuts the data at fixed rank percentages. Because the cuts follow the time ordering, the validation and test rows of a fold always come later than its training rows.
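
To make the ranking step concrete, here is a minimal pure-Python sketch of what SQL-style `percent_rank` computes over an ascending ordering: `(rank - 1) / (n - 1)`, with tied values sharing the rank of their first occurrence. The helper name `percent_rank_sketch` and the sample values are assumptions for illustration only; the function below relies on Spark's own `percent_rank` rather than this code.

```python
def percent_rank_sketch(values):
    """Illustrative stand-in for percent_rank over an ascending ordering."""
    n = len(values)
    if n <= 1:
        # A single row (or no rows) always has percent rank 0
        return [0.0] * n
    ordered = sorted(values)
    # Zero-based position of the first occurrence of each value equals rank - 1
    first_pos = {}
    for pos, value in enumerate(ordered):
        first_pos.setdefault(value, pos)
    return [first_pos[value] / (n - 1) for value in values]


# Hypothetical timestamps; the earliest row gets 0.0 and the latest gets 1.0
timestamps = [10, 20, 30, 40, 50]
ranks = percent_rank_sketch(timestamps)
print(ranks)
```

The earliest row always receives rank 0 and the latest receives rank 1, which is why cutting on fixed rank percentages yields contiguous time slices.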

In [0]:
def time_series_cross_validation_split(num_folds, df, sort_order_column_name, dependent_column_name):
  
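  """
  Split a dataframe into time-ordered train, validation, and test folds.

  Parameters
  ----------
    num_folds : int
      The number of folds into which to split the data
    df : dataframe
      The base dataframe to split
    sort_order_column_name : str
      The column by which the dataframe is sorted for ranking purposes
    dependent_column_name : str
      The column name for the dependent variable in the model

  Returns
  -------
    Three lists of dataframes (train, validation, test), one entry per fold,
    or None if either column name is missing from df
  """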
  
  # Sanity check on dataframe: return None if a required column is missing
  if sort_order_column_name not in df.columns or dependent_column_name not in df.columns:
    return
  # Add rank column (an empty partitionBy() ranks over the whole dataframe in a single partition)
  df = df.withColumn("rank", percent_rank().over(Window.partitionBy().orderBy(sort_order_column_name)))
  # Define percentages on which to split data based on rank and number of folds
  fold_percentage = 1 / (num_folds + 2)
  df_train_list = []
  df_validate_list = []
  df_test_list = []
  
  # For each fold
  for i in range(1, num_folds + 1):
    # Define train df
    train_df = df.where("rank <= " + str(fold_percentage * i)).drop("rank")
    # Define validation df
    validate_df = df.where("rank <= " + str(fold_percentage * (i + 1)) + " and rank > " + str(fold_percentage * i)).drop("rank")
    # Define test df
    test_df = df.where("rank <= " + str(fold_percentage * (i + 2)) + " and rank > " + str(fold_percentage * (i + 1))).drop("rank")
    
    # Append to lists to return 
    df_train_list.append(train_df)
    df_validate_list.append(validate_df)
    df_test_list.append(test_df)
  return df_train_list, df_validate_list, df_test_list

For a three-fold validation, the function divides the data into five equal slices. In the first fold, the training data is the first fifth of the data, the validation set the second fifth, and the test set the third fifth. In the second fold, the training data is the first two fifths of the data, the validation set the third fifth, and the test set the fourth fifth. In the third fold, the training data is the first three fifths, the validation set the fourth fifth, and the test set the final fifth.
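
The slice arithmetic can be checked without Spark. The sketch below recomputes the rank cutoffs used for each fold; `fold_boundaries` is a hypothetical helper introduced only for illustration, and it mirrors the `fold_percentage` arithmetic of the function above rather than replacing it.

```python
def fold_boundaries(num_folds):
    """Return the (lower, upper) rank cutoffs of each fold's three sets."""
    fold_percentage = 1 / (num_folds + 2)
    bounds = []
    for i in range(1, num_folds + 1):
        bounds.append({
            # Training keeps every row with rank <= upper, including rank 0
            "train": (0.0, fold_percentage * i),
            # Validation and test keep rows with lower < rank <= upper
            "validate": (fold_percentage * i, fold_percentage * (i + 1)),
            "test": (fold_percentage * (i + 1), fold_percentage * (i + 2)),
        })
    return bounds


# Print the cutoffs for the three-fold case, rounded for readability
for fold, cutoffs in enumerate(fold_boundaries(3), start=1):
    rounded = {name: (round(lo, 2), round(hi, 2)) for name, (lo, hi) in cutoffs.items()}
    print(fold, rounded)
```

Each fold's training window grows by one slice while the validation and test slices move forward with it, so no fold ever evaluates on data that precedes its training rows.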

This function allows any data set to be easily split into proper training, validation, and test folds that any model can utilize for error estimation and hyperparameter tuning. It will be used in modeling sections later in the report.