# Exogenous Variable Selection for the Bike Sharing Dataset

This notebook shows how to use the exogenous variables we provide to improve forecasting performance on the bike-sharing dataset. The dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset).

Copy the helper scripts into the current working directory.

In [1]:
!cp ../calculate_feature_score.py .
!cp ../feature_selection.py .
!cp ../utils.py .

## Standard imports


In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
import os, warnings, requests

# Ignore warnings
warnings.filterwarnings(action='ignore')

## Download the bulk dataset from the API

Enter your API key to get access to the API

In [3]:
import getpass
api_key = getpass.getpass('Enter your Notional API key: ')

Enter your Notional API key: ········


In [4]:
url = "https://api.notional.ai/v1/series/bulk"
headers = {
  "x-notionalai-api-key": api_key,
}

response = requests.get(url, headers=headers)
response.raise_for_status()  # fail early on an invalid key or a server error

In [5]:
import urllib.request

# Make sure the target directory exists before downloading
os.makedirs('./data', exist_ok=True)

# Download the bulk parquet file from the URL returned by the API
urllib.request.urlretrieve(response.json()['result_url'], './data/all_features.parquet');

## Data preparation

Define the loss function to optimize. Here we use the root mean squared error (RMSE).

In [6]:
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
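
As a quick sanity check of the metric, the same quantity can be computed by hand on a toy example. The sketch below uses only the standard library; `rmse_plain` is an illustrative name and is not part of the notebook's helper scripts.

```python
import math

def rmse_plain(y_true, y_pred):
    """Root mean squared error computed without NumPy or scikit-learn."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors are 2 and 0, so the mean squared error is (4 + 0) / 2 = 2
toy_score = rmse_plain([3.0, 5.0], [1.0, 5.0])
```

Because RMSE is expressed in the units of the target, the losses reported later in this notebook can be read as a typical error in daily rental counts.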

Your dataset should be in tabular format with a `date/timestamp column` and a `target column`. All other columns will be considered as exogenous variables.

In [7]:
# Choose your loss function here
scoring = rmse

# Date/timestamp column
timestamp_col = 'date'

# Target column, i.e., the label
target_col = 'count'

# Path to the bulk dataset parquet file
features_parquet_path = 'data/all_features.parquet'

# The directory to store the feature evaluation results
output_dir = 'fs_results'

# Number of trials for optuna hyperparameter tuning
optuna_n_trials = 100

# Forecast length
prediction_length = 14
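
With the timestamp and target columns fixed as above, every remaining column counts as an exogenous variable. A minimal sketch of that rule follows; `split_column_roles` and the example column names are hypothetical and only illustrate the convention.

```python
def split_column_roles(columns, timestamp_col, target_col):
    """Return the columns that would be treated as exogenous variables."""
    missing = [c for c in (timestamp_col, target_col) if c not in columns]
    if missing:
        raise KeyError(f"missing required column(s): {missing}")
    return [c for c in columns if c not in (timestamp_col, target_col)]

# Hypothetical column layout, for illustration only
example_columns = ["date", "count", "temperature", "is_holiday"]
exogenous_cols = split_column_roles(example_columns, "date", "count")
```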

We read the dataset and split it into a train set and a test set. We also get the `cvs` variable, which contains the validation splits for the cross-validation process.

In [8]:
from utils import prepare_train_val_test_data

# Read the bike sharing dataset
data = pd.read_csv("data/bike_sharing_day.csv")

# Currently, only string-typed timestamp columns are supported
data[timestamp_col] = data[timestamp_col].astype(str)
data = data.sort_values(timestamp_col).reset_index(drop=True)

# Length of the test set. Set it to None to make it equal to prediction_length
test_size = None

# The ratio of the validation dataset used for cross validation
val_ratio = 0.25

# Number of cross validation folds
cv_fold = 5

# Whether to add a lag_<prediction_length> column to the dataset. This should be True in most cases.
add_lag_col = True

train_data, test_data, cvs = prepare_train_val_test_data(
    data=data, 
    target_col=target_col, 
    timestamp_col=timestamp_col, 
    test_size=test_size, 
    val_ratio=val_ratio, 
    cv_fold=cv_fold, 
    prediction_length=prediction_length, 
    add_lag_col=add_lag_col
)
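
The internals of `prepare_train_val_test_data` are not shown in this notebook. As a rough mental model only, the sketch below shows one common way to build a time-ordered hold-out split, validation folds taken from the tail of the training period, and a lag column shifted by the forecast horizon. The function names, the fold layout, and the toy sizes are assumptions, not the helper's actual behavior.

```python
def time_ordered_split(n_rows, prediction_length, val_ratio, cv_fold, test_size=None):
    """Hypothetical split: returns (train_idx, test_idx, folds) as index lists."""
    if test_size is None:
        test_size = prediction_length  # mirrors the test_size=None convention
    n_train = n_rows - test_size
    train_idx = list(range(n_train))
    test_idx = list(range(n_train, n_rows))

    # Reserve the last val_ratio share of the training rows for validation
    n_val = int(n_train * val_ratio)
    fold_size = n_val // cv_fold
    folds = []
    for k in range(cv_fold):
        val_start = n_train - n_val + k * fold_size
        val_end = val_start + fold_size
        # Each fold trains only on rows that precede its validation window
        folds.append((list(range(val_start)), list(range(val_start, val_end))))
    return train_idx, test_idx, folds


def add_lag(values, lag):
    """Shift a series by `lag` steps; the first `lag` entries have no history."""
    return ([None] * lag + list(values))[:len(values)]


train_idx, test_idx, folds = time_ordered_split(
    n_rows=100, prediction_length=14, val_ratio=0.25, cv_fold=5
)
lagged = add_lag([10, 20, 30, 40], lag=2)
```

Keeping every validation window strictly after its training rows avoids leaking future values into the fit, which is the main reason a random shuffle is not used for time series.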

## Feature Selection

Our feature selection method runs in multiple steps so that the selected features give a significant improvement and apply to a wide range of time series forecasting models, even though the method itself is built solely on the XGBoost model. To use it, follow these steps:

1. Create an instance of the `FeatureSelector` class.
2. Call the `fit` method on the instance with the required parameters. Note that this step requires a machine with a GPU.
3. Once feature selection has finished, call `get_n_best_features()` to obtain the feature subsets with the strongest predictive power.

In [9]:
from feature_selection import FeatureSelector
feature_selector = FeatureSelector()

In [8]:
feature_selector.fit(
    train_data=train_data,
    cvs=cvs,
    timestamp_col=timestamp_col,
    target_col=target_col,
    prediction_length=prediction_length,
    features_parquet_path=features_parquet_path,
    output_dir=output_dir,
    scoring=scoring,
    optuna_n_trials=optuna_n_trials,
    gpu_id=0,
    fitted=False #Refit?
)

Fine tuning Xgboost model


  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/18 [00:00<?, ?it/s]

  from pandas import MultiIndex, Int64Index
100%|██████████| 2987/2987 [16:17<00:00,  3.05it/s]


2987 features finished in 977.97 seconds


100%|██████████| 3767/3767 [22:13<00:00,  2.83it/s]


3767 features finished in 1333.40 seconds


100%|██████████| 2792/2792 [15:35<00:00,  2.98it/s]


2792 features finished in 935.60 seconds


100%|██████████| 3823/3823 [23:06<00:00,  2.76it/s]


3823 features finished in 1386.85 seconds


100%|██████████| 2847/2847 [16:20<00:00,  2.90it/s]


2847 features finished in 980.43 seconds


100%|██████████| 3895/3895 [22:48<00:00,  2.85it/s]


3895 features finished in 1368.11 seconds


100%|██████████| 3141/3141 [18:20<00:00,  2.85it/s]


3141 features finished in 1100.52 seconds


100%|██████████| 3478/3478 [20:42<00:00,  2.80it/s]


3478 features finished in 1242.48 seconds


100%|██████████| 3473/3473 [20:50<00:00,  2.78it/s]


3473 features finished in 1250.64 seconds


100%|██████████| 2930/2930 [16:19<00:00,  2.99it/s]


2930 features finished in 979.45 seconds


100%|██████████| 3694/3694 [21:39<00:00,  2.84it/s]


3694 features finished in 1299.27 seconds


100%|██████████| 2822/2822 [16:08<00:00,  2.91it/s]


2822 features finished in 968.31 seconds


100%|██████████| 3760/3760 [21:47<00:00,  2.87it/s]


3760 features finished in 1308.04 seconds


100%|██████████| 2799/2799 [16:26<00:00,  2.84it/s]


2799 features finished in 986.60 seconds


100%|██████████| 3868/3868 [22:00<00:00,  2.93it/s]


3868 features finished in 1320.21 seconds


100%|██████████| 3136/3136 [18:24<00:00,  2.84it/s]


3136 features finished in 1105.01 seconds


100%|██████████| 3467/3467 [20:33<00:00,  2.81it/s]


3467 features finished in 1233.48 seconds


100%|██████████| 3464/3464 [20:27<00:00,  2.82it/s]


3464 features finished in 1227.64 seconds


  0%|          | 0/9 [00:00<?, ?it/s]

In [10]:
feature_selector.fit(
    train_data=train_data,
    cvs=cvs,
    timestamp_col=timestamp_col,
    target_col=target_col,
    prediction_length=prediction_length,
    features_parquet_path=features_parquet_path,
    output_dir=output_dir,
    scoring=scoring,
    optuna_n_trials=optuna_n_trials,
    gpu_id=0,
    fitted=True #Refit?
)

Fine tuning Xgboost model


  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/9 [00:00<?, ?it/s]

Get the top 5 feature subsets.

In [16]:
selected_features = feature_selector.get_n_best_features(5)
selected_features

[['WS_PRCP_00000049'],
 ['WC_PRCP_00001072'],
 ['WC_AIRP_00001830'],
 ['WC_AIRP_00001830', 'WC_AIRP_00002048'],
 ['WC_TAVG_00002204']]
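
The result is a list of feature subsets, and a subset may hold more than one feature. If a flat list of distinct feature names is needed, for example to inspect only those series, a small order-preserving helper is enough. `unique_features` is an illustrative name; the input below is copied from the output above.

```python
def unique_features(feature_subsets):
    """Flatten a list of feature subsets, keeping first-seen order."""
    seen = set()
    flat = []
    for subset in feature_subsets:
        for name in subset:
            if name not in seen:
                seen.add(name)
                flat.append(name)
    return flat

subsets = [
    ['WS_PRCP_00000049'],
    ['WC_PRCP_00001072'],
    ['WC_AIRP_00001830'],
    ['WC_AIRP_00001830', 'WC_AIRP_00002048'],
    ['WC_TAVG_00002204'],
]
flat_features = unique_features(subsets)
```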

## Evaluate the performance of the selected features on different models

Import the necessary modules and helper functions.

In [12]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from utils import ARIMAModel
import xgboost as xgb

from utils import fine_tune_model, evaluate_models, add_exo_features

We will assess the performance of our selected features using four different time series forecasting models: SARIMAX, Lasso, XGBoost, and RandomForest. This evaluation aims to determine the robustness of the selected features.

We evaluate each model with and without the selected features and compare the results. The evaluation is run on the held-out test set, so the comparison reflects performance on unseen data rather than fit to the training data.

In [13]:
fine_tune_model_args = {
    'train_data': train_data, 
    'target_col': target_col, 
    'cvs': cvs, 
    'scoring': scoring, 
    'timestamp_col': timestamp_col, 
    'optuna_n_trials': 20
}

arima_model = ARIMAModel()
lasso_model = fine_tune_model('lasso', **fine_tune_model_args)
xgb_model = fine_tune_model('xgboost', **fine_tune_model_args)
rf_model = fine_tune_model('random_forest', **fine_tune_model_args)

models = [arima_model, lasso_model, xgb_model, rf_model]
evaluate_models(models, train_data, test_data, target_col, timestamp_col, scoring, prediction_length)

Model name: ARIMAModel
Loss: 1485.3088843066928
Model name: Lasso
Loss: 1812.636646795117
Model name: XGBRegressor
Loss: 1464.1101432373105
Model name: RandomForestRegressor
Loss: 1754.1757420364183
Best model: XGBRegressor
Best loss: 1464.1101432373105


In [14]:
train_data_final = add_exo_features(
    train_data, 
    timestamp_col, 
    selected_features[0], 
    features_parquet_path, 
    prediction_length
)

test_data_final = add_exo_features(
    test_data, 
    timestamp_col, 
    selected_features[0], 
    features_parquet_path, 
    prediction_length
)

In [15]:
fine_tune_model_args = {
    'train_data': train_data_final, 
    'target_col': target_col, 
    'cvs': cvs, 
    'scoring': scoring, 
    'timestamp_col': timestamp_col, 
    'optuna_n_trials': 20
}

arima_model = ARIMAModel()
lasso_model = fine_tune_model('lasso', **fine_tune_model_args)
xgb_model = fine_tune_model('xgboost', **fine_tune_model_args)
rf_model = fine_tune_model('random_forest', **fine_tune_model_args)

models = [arima_model, lasso_model, xgb_model, rf_model]
evaluate_models(models, train_data_final, test_data_final, target_col, timestamp_col, scoring, prediction_length)

Model name: ARIMAModel
Loss: 1714.8787940797754
Model name: Lasso
Loss: 1957.9165417071636
Model name: XGBRegressor
Loss: 1626.9433780483619
Model name: RandomForestRegressor
Loss: 1765.4653728198668
Best model: XGBRegressor
Best loss: 1626.9433780483619
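
To compare the two runs model by model, the relative change in RMSE can be computed directly from the losses printed above. This is a minimal sketch; the dictionaries simply copy the reported values, and a positive result means the loss went up after adding the feature.

```python
# Losses copied from the run without exogenous features
baseline = {
    "ARIMAModel": 1485.3088843066928,
    "Lasso": 1812.636646795117,
    "XGBRegressor": 1464.1101432373105,
    "RandomForestRegressor": 1754.1757420364183,
}
# Losses copied from the run with the first selected feature subset
with_feature = {
    "ARIMAModel": 1714.8787940797754,
    "Lasso": 1957.9165417071636,
    "XGBRegressor": 1626.9433780483619,
    "RandomForestRegressor": 1765.4653728198668,
}

def relative_change(before, after):
    """Relative change per model: (after - before) / before."""
    return {name: (after[name] - before[name]) / before[name] for name in before}

changes = relative_change(baseline, with_feature)
for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

The same comparison can be repeated for the other entries of `selected_features` by rerunning the two evaluation cells with a different subset.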
