## Modelling time series data in a tabular format using featurization and improving accuracy using AutoML

This notebook forecasts average daily energy consumption levels by transforming a time series dataset into a tabular format using open-source libraries. We then fit a multiclass classification model to the tabular data and use AutoML with cleanlab to boost our out-of-sample accuracy.

At a high level we will:

- Establish a baseline accuracy by fitting a Prophet forecasting model on our time series data
- Convert our time series data into a tabular format using open-source featurization libraries, then show that a multiclass classification approach can outperform Prophet.
- Use cleanlab’s AutoML platform for multiclass classification to **improve the out-of-sample accuracy of our predictions by ~8%** compared to both approaches tried so far.

## Initialize time series data for Prophet

In [None]:
import pandas as pd
pd.options.mode.chained_assignment = None 

data = pd.read_csv('PJME_hourly.csv', parse_dates=['Datetime'], index_col='Datetime')

# Resample the hourly readings in `data` to daily means
daily_data = data.resample('D').mean() 

# Prepare data for Prophet
daily_data.reset_index(inplace=True)
daily_data.columns = ['ds', 'y']
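
To make the resampling step concrete, here is a minimal sketch on a synthetic hourly series. The values and the `hourly`/`daily` names are made up for illustration; only the `PJME_MW` column name and the `ds`/`y` naming that Prophet expects come from the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for PJME_hourly.csv: 48 hourly readings over two days (made-up values)
idx = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"PJME_MW": np.arange(48, dtype=float)}, index=idx)
hourly.index.name = "Datetime"

# Average the hourly readings within each calendar day
daily = hourly.resample("D").mean().reset_index()

# Prophet expects the timestamp column to be named 'ds' and the value column 'y'
daily.columns = ["ds", "y"]
print(daily)
```

Each output row is one calendar day, so 48 hourly readings collapse to two daily means.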

## Initialize time series data for featurization into a tabular format

In [None]:
from sklearn.model_selection import train_test_split

# Reset the datetime
data["Datetime"] = data.index
data = data.reset_index(drop=True)

# Create copy for multiclass data 
df = data.copy()

# Convert the datetime column
df['Datetime'] = pd.to_datetime(df['Datetime'])  # Adjust the 'datetime' column name as necessary
df = df.sort_values('Datetime').reset_index(drop=True)


# Obtain day and hour ('Datetime' is already a datetime column at this point)
df['Date'] = df['Datetime'].dt.floor('D')
df['Hour'] = df['Datetime'].dt.hour

# Create multi-index feature df to compute time series features on
features = df.set_index(['Date', 'Hour'])  
features.drop("Datetime", inplace=True, axis=1)

# Split the data into training and testing sets, respecting the temporal order
X_train, X_test, y_train, y_test = train_test_split(features, features["PJME_MW"], test_size=0.2, shuffle=False)

# Get group lengths
train_lengths = X_train.groupby(level=0).size()
test_lengths = X_test.groupby(level=0).size()

# Obtain common length value for train/test data
train_common_length = train_lengths.mode().iloc[0]
test_common_length = test_lengths.mode().iloc[0]

# Filter train/test data to groups with same common length for featurizer
X_train = X_train.groupby(level=0).filter(lambda x: len(x) == train_common_length)
X_test = X_test.groupby(level=0).filter(lambda x: len(x) == test_common_length)

# Create quartiles based on training data to avoid leakage
quartiles = [X_train['PJME_MW'].quantile(q) for q in [0.25, 0.50, 0.75]]
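
The filtering step keeps only days whose number of hourly rows equals the most common count, so the featurizer sees equal-length series. A minimal sketch on synthetic data, where one hour is deliberately removed from the second day (the `toy` frame and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Three synthetic days of hourly data (made-up values)
idx = pd.date_range("2020-01-01", periods=72, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"PJME_MW": np.arange(72, dtype=float)}, index=idx)

# Drop one reading so the second day has only 23 rows
toy = toy.drop(idx[30])

# Build the same (Date, Hour) multi-index used in the notebook
toy["Date"] = toy.index.floor("D")
toy["Hour"] = toy.index.hour
toy = toy.set_index(["Date", "Hour"])

# Most common number of rows per day, then keep only days of that length
lengths = toy.groupby(level=0).size()
common_length = lengths.mode().iloc[0]
complete = toy.groupby(level=0).filter(lambda g: len(g) == common_length)
kept_days = complete.index.get_level_values("Date").unique()

# Thresholds computed from the retained (training-like) rows only
thresholds = [complete["PJME_MW"].quantile(q) for q in [0.25, 0.50, 0.75]]
print(common_length, len(kept_days), thresholds)
```

In this sketch the incomplete day is discarded entirely rather than padded, which mirrors what the `filter` call in the cell above does.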

## Train and Evaluate Prophet Forecasting Model

In [None]:
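# Chronological 80/20 split; the cutoff date falls at 2015-04-09
cutoff_index = int(len(daily_data) * 0.8)

# Use the first 80% of days for training and the last 20% for testing.
# Copy the slices so that columns added later do not write into daily_data.
train_df = daily_data.iloc[:cutoff_index].copy()
test_df = daily_data.iloc[cutoff_index:].copy()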

print("Training Set Shape:", train_df.shape)
print("Testing Set Shape:", test_df.shape)

In [None]:
train_df.tail()

In [None]:
test_df.head()

In [None]:
import numpy as np
from prophet import Prophet
from sklearn.metrics import accuracy_score

# Initialize model and train it on training data
model = Prophet()
model.fit(train_df)

# Create a dataframe for future predictions covering the test period
future = model.make_future_dataframe(periods=len(test_df), freq='D')
forecast = model.predict(future)

# Categorize forecasted daily values into quartiles based on the thresholds
forecast['quartile'] = pd.cut(forecast['yhat'], bins = [-np.inf] + list(quartiles) + [np.inf], labels=[1, 2, 3, 4])

# Extract the forecasted quartiles for the test period
forecasted_quartiles = forecast.iloc[-len(test_df):]['quartile'].astype(int)


# Categorize actual daily values in the test set into quartiles
test_df['quartile'] = pd.cut(test_df['y'], bins=[-np.inf] + list(quartiles) + [np.inf], labels=[1, 2, 3, 4])
actual_test_quartiles = test_df['quartile'].astype(int)


# Calculate the evaluation metrics
accuracy = accuracy_score(actual_test_quartiles, forecasted_quartiles)

# Print the evaluation metrics
print(f'Accuracy: {accuracy:.4f}')
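
The evaluation above turns a regression forecast into a classification result by binning both forecasts and actuals with the same thresholds. A small sketch with hypothetical thresholds and values (none of these numbers come from the PJME data):

```python
import numpy as np
import pandas as pd

# Hypothetical thresholds standing in for the training-set quartiles
thresholds = [10.0, 20.0, 30.0]
bins = [-np.inf] + thresholds + [np.inf]

# Made-up forecasts and actual values
forecast = pd.Series([5.0, 10.0, 25.0, 40.0])
actual = pd.Series([8.0, 12.0, 25.0, 29.0])

# pd.cut uses right-closed intervals by default, so 10.0 falls in class 1
pred_q = pd.cut(forecast, bins=bins, labels=[1, 2, 3, 4]).astype(int)
true_q = pd.cut(actual, bins=bins, labels=[1, 2, 3, 4]).astype(int)

# Accuracy is the share of days where the forecast lands in the right class
accuracy = (pred_q == true_q).mean()
print(pred_q.tolist(), true_q.tolist(), accuracy)
```

A forecast can be numerically close to the actual value and still count as a miss if the two fall on opposite sides of a threshold, which is worth keeping in mind when reading the accuracy figure.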

## Convert time series data to tabular format through featurization

In [None]:
import tsfel
from sktime.transformations.panel.tsfresh import TSFreshFeatureExtractor

# Define tsfresh feature extractor
tsfresh_trafo = TSFreshFeatureExtractor(default_fc_parameters="minimal")

# Transform the training data using the feature extractor
X_train_transformed = tsfresh_trafo.fit_transform(X_train)

# Transform the test data using the same feature extractor
X_test_transformed = tsfresh_trafo.transform(X_test)

# Retrieves a pre-defined feature configuration file to extract all available features
cfg = tsfel.get_features_by_domain()

# Function to compute tsfel features per day
def compute_features(group):
    # Each group holds one day's hourly readings, passed to TSFEL as-is;
    # fs=1 is the sampling frequency handed to the extractor
    return tsfel.time_series_features_extractor(cfg, group, fs=1, verbose=0)


# Group by the 'Date' level of the index and apply the feature computation
train_features_per_day = X_train.groupby(level='Date').apply(compute_features).reset_index(drop=True)
test_features_per_day = X_test.groupby(level='Date').apply(compute_features).reset_index(drop=True)

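# Combine each featurization into a set of combined features for our train/test data.
# Reset the tsfresh index first so both frames align row by row (one row per day)
train_combined_df = pd.concat([X_train_transformed.reset_index(drop=True), train_features_per_day], axis=1)
test_combined_df = pd.concat([X_test_transformed.reset_index(drop=True), test_features_per_day], axis=1)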

# Filter out features that are highly correlated with our target variable
# (the labels are derived from it), plus features whose correlation is undefined
column_of_interest = "PJME_MW__mean"
train_corr_with_interest = train_combined_df.corr()[column_of_interest]
false_features = train_corr_with_interest[train_corr_with_interest.isnull()].index.tolist()

columns_to_exclude = list(set(train_corr_with_interest[train_corr_with_interest.abs() > 0.8].index.tolist() + false_features))
columns_to_exclude.remove(column_of_interest)  # keep the target column itself; it is split off later

# Filtered DataFrame excluding columns with high correlation to the column of interest
train_combined_df = train_combined_df.drop(columns=columns_to_exclude)
test_combined_df = test_combined_df.drop(columns=columns_to_exclude)
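
The correlation filter can be illustrated on a toy frame. In this sketch the column names (`target_mean`, `near_copy`, `noise`, `constant`) and the data are invented; the 0.8 cutoff matches the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
target = rng.normal(size=200)

# Toy feature table: one near-duplicate of the target, one unrelated, one constant
toy = pd.DataFrame({
    "target_mean": target,
    "near_copy": target * 2.0 + rng.normal(scale=0.01, size=200),
    "noise": rng.normal(size=200),
    "constant": np.ones(200),
})

# Correlation of every column with the target column
corr = toy.corr()["target_mean"]

# A constant column has zero variance, so its correlation is NaN
undefined = corr[corr.isnull()].index.tolist()
too_correlated = corr[corr.abs() > 0.8].index.tolist()

# Drop both groups, but never the target column itself
to_drop = sorted(set(too_correlated + undefined) - {"target_mean"})
filtered = toy.drop(columns=to_drop)
print(to_drop, list(filtered.columns))
```

The list of columns to drop is derived from the training frame only and then applied to the test frame, as in the cell above, so the test data does not influence feature selection.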

In [None]:
# Define a function to classify each value into a quartile
def classify_into_quartile(value):
    if value < quartiles[0]:
        return 1  
    elif value < quartiles[1]:
        return 2  
    elif value < quartiles[2]:
        return 3  
    else:
        return 4  

X_train_transformed = train_combined_df.copy()
X_test_transformed = test_combined_df.copy()

y_train = X_train_transformed["PJME_MW__mean"]
X_train_transformed.drop("PJME_MW__mean", inplace=True, axis=1)

y_test = X_test_transformed["PJME_MW__mean"]
X_test_transformed.drop("PJME_MW__mean", inplace=True, axis=1)

y_train_labels = y_train.apply(classify_into_quartile)
y_test_labels = y_test.apply(classify_into_quartile)
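
As an optional alternative, the same labelling rule can be written in vectorized form. This is a sketch with hypothetical thresholds, not the notebook's method; `np.searchsorted` with `side="right"` reproduces the strict `<` comparisons of `classify_into_quartile`, so a value exactly on a threshold goes to the upper class. Note that this differs from `pd.cut` with its default right-closed bins, which would place such a value in the lower class.

```python
import numpy as np

# Hypothetical thresholds standing in for the training-set quartiles
thresholds = np.array([10.0, 20.0, 30.0])

def classify_loop(value):
    # Same rule as classify_into_quartile, written against the local thresholds
    if value < thresholds[0]:
        return 1
    elif value < thresholds[1]:
        return 2
    elif value < thresholds[2]:
        return 3
    return 4

values = np.array([5.0, 10.0, 19.9, 20.0, 35.0])

# Count how many thresholds are <= each value, then shift to 1-based labels
labels = np.searchsorted(thresholds, values, side="right") + 1

# The vectorized labels agree with the element-wise function
loop_labels = [classify_loop(v) for v in values]
print(labels.tolist(), loop_labels)
```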

## Train and Evaluate GradientBoostingClassifier Model on multiclass tabular data

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(
    n_estimators=150,
    learning_rate=0.1,
    max_depth=4,
    min_samples_leaf=20,
    max_features='sqrt',
    subsample=0.8,
    random_state=42
)

gbc.fit(X_train_transformed, y_train_labels)


y_pred_gbc = gbc.predict(X_test_transformed)
print(f'Accuracy: {accuracy_score(y_test_labels, y_pred_gbc):.4f}')
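
A single accuracy number hides which quartiles are confused with each other. One way to look closer, sketched here with made-up labels rather than the notebook's predictions, is a confusion table built with `pd.crosstab`:

```python
import pandas as pd

# Made-up true and predicted quartile labels
y_true = pd.Series([1, 1, 2, 2, 3, 3, 4, 4], name="actual")
y_pred = pd.Series([1, 2, 2, 2, 3, 4, 4, 4], name="predicted")

# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(y_true, y_pred)

# Recall per class: correct predictions divided by the row total
per_class_recall = {
    int(label): float(confusion.loc[label, label] / confusion.loc[label].sum())
    for label in confusion.index
}

overall_accuracy = float((y_true == y_pred).mean())
print(confusion)
print(per_class_recall, overall_accuracy)
```

With quartile labels, errors between adjacent classes (for example 3 predicted as 4) are usually less costly than errors across distant classes, and the table makes that distinction visible.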

## Read in AutoML pred probs to measure out-of-sample accuracy

In [None]:
y_pred_automl_cleanlab = pd.read_csv("quartile-multiclass-pjme-testing-data_pred_probs.csv")
y_pred_automl_cleanlab = y_pred_automl_cleanlab["Suggested Label"]
print(f'Accuracy: {accuracy_score(y_test_labels, y_pred_automl_cleanlab):.4f}')
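
The comparison above assumes that the rows of the exported predictions file are in the same order, and of the same number, as `y_test_labels`. A minimal sketch of that check, using an in-memory CSV with made-up labels in place of the real export (only the `Suggested Label` column name comes from the cell above):

```python
import io
import pandas as pd

# In-memory stand-in for the exported predictions file (made-up labels)
csv_text = "Suggested Label\n1\n2\n4\n3\n"
preds = pd.read_csv(io.StringIO(csv_text))["Suggested Label"]

# Made-up ground-truth labels in the same row order
y_true = pd.Series([1, 2, 3, 3])

# Guard against silently comparing misaligned or truncated files
if len(preds) != len(y_true):
    raise ValueError("prediction file and test labels have different lengths")

# Compare by position rather than by index to avoid index-alignment surprises
accuracy = float((preds.to_numpy() == y_true.to_numpy()).mean())
print(f"Accuracy: {accuracy:.4f}")
```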