# Bike Sharing Dataset Exogenous Variables Selection 

This notebook shows how to use the exogenous variables we provide to improve forecasting performance on the bike-sharing dataset. The dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset).

## Standard imports


In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import warnings, requests

# Ignore warnings
warnings.filterwarnings(action='ignore')

## Download the bulk dataset from the API

Enter your API key to access the API.

In [2]:
import getpass
api_key = getpass.getpass('Enter your Notional API key: ')

Enter your Notional API key: ········


In [11]:
url = "https://api.notional.ai/v1/series/bulk"
headers = {
  "x-notionalai-api-key": api_key,
}

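response = requests.request("GET", url, headers=headers)
# Fail early if the request was rejected (for example, an invalid API key)
response.raise_for_status()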

In [13]:
import os
import sys
import urllib.request

def download_progress(count, block_size, total_size):
    if np.random.rand() > 0.8: # Limit the amount of printing in the notebook
        # The last block can overshoot total_size, so cap the value at 100
        percent = min(100, int(count * block_size * 100 / total_size))
        sys.stdout.write('\rDownloading: %d%%' % percent)
        sys.stdout.flush()

# Make sure the target directory exists before downloading
os.makedirs('./data', exist_ok=True)
urllib.request.urlretrieve(response.json()['result_url'], './data/all_features.parquet', reporthook=download_progress)

Downloading: 99%

('./data/all_features.parquet', <http.client.HTTPMessage at 0x7fc122b6e040>)
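
The random throttle above prints at unpredictable moments and may never show the final percentage. A deterministic alternative, sketched below, prints only when the integer percentage changes. `make_progress_hook` is a hypothetical helper name, not part of `urllib`; the function it returns takes the `(count, block_size, total_size)` arguments that `urlretrieve` passes to its `reporthook`.

```python
import sys

def make_progress_hook(stream=sys.stdout):
    """Return a urlretrieve-compatible reporthook that prints each percent once.

    Hypothetical helper for illustration; not part of urllib.
    """
    last_percent = -1

    def hook(count, block_size, total_size):
        nonlocal last_percent
        if total_size <= 0:  # The server did not report a file size
            return
        # The final block can overshoot total_size, so cap at 100
        percent = min(100, count * block_size * 100 // total_size)
        if percent != last_percent:
            last_percent = percent
            stream.write('\rDownloading: %d%%' % percent)
            stream.flush()

    return hook
```

It would be passed as `reporthook=make_progress_hook()` in the `urlretrieve` call.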

## Data preparation

Define the loss function to optimize. Here we use root mean squared error (RMSE).

In [2]:
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
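
A different metric can be sketched with the same `(y_true, y_pred)` signature. This assumes the helper functions only ever call the scoring function this way and treat lower values as better; that is an assumption, since the helper code is not shown here. The name `mae` is illustrative.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error; lower is better, same call signature as rmse."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))
```

MAE penalizes large errors less heavily than RMSE, which can matter when a few days have unusually high or low demand.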

Your dataset should be in tabular format with a date/timestamp column and a target column. All other columns are treated as exogenous variables.
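
To illustrate the expected layout, the sketch below builds a tiny table and lists the columns that would count as exogenous. The columns `temp` and `holiday`, all values, and the helper `split_columns` are invented for this example.

```python
import pandas as pd

def split_columns(df, timestamp_col, target_col):
    """Return the columns that would be treated as exogenous variables.

    Hypothetical helper for illustration only.
    """
    missing = {timestamp_col, target_col} - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return [c for c in df.columns if c not in (timestamp_col, target_col)]

# Toy data: the values are invented purely to show the shape of the table
toy = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'count': [100, 120, 90],
    'temp': [0.3, 0.4, 0.2],
    'holiday': [1, 0, 0],
})
exo_cols = split_columns(toy, timestamp_col='date', target_col='count')
```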

In [3]:
# Choose your loss function here
scoring = rmse

# Date/timestamp column
timestamp_col = 'date'

# Target column, i.e. the label
target_col = 'count'

# Path to the bulk dataset parquet file
features_parquet_path = 'data/all_features.parquet'

# The directory to store the feature evaluation results
output_dir = 'fs_results'

# Number of trials for optuna hyperparameter tuning
optuna_n_trials = 10

# Forecast length
prediction_length = 14

We'll read the dataset and split it into a train set and a test set. We will also get the `cvs` variable, which contains the validation splits for the cross-validation process.
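
The exact structure of `cvs` is determined by `prepare_train_val_test_data`, which is not shown here. As a rough sketch of the general idea, time-ordered cross-validation keeps every validation block strictly after its training rows. The function name `expanding_window_splits` and the fold layout are assumptions made for illustration.

```python
import numpy as np

def expanding_window_splits(n_rows, n_folds, val_size):
    """Yield (train_idx, val_idx) pairs where validation always follows training.

    Hypothetical illustration; the helper in utils may split differently.
    """
    if n_folds * val_size >= n_rows:
        raise ValueError("Not enough rows for the requested folds")
    for fold in range(n_folds):
        # Validation blocks are laid out back to back at the end of the series
        val_end = n_rows - (n_folds - 1 - fold) * val_size
        val_start = val_end - val_size
        yield np.arange(0, val_start), np.arange(val_start, val_end)

splits = list(expanding_window_splits(n_rows=100, n_folds=5, val_size=14))
```

Each later fold trains on more history than the previous one, so no fold ever sees data from after its validation window.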

In [4]:
from utils import prepare_train_val_test_data

# Read the bike sharing dataset
data = pd.read_csv("data/bike_sharing_day.csv")

# Currently, the timestamp column must be of string type
data[timestamp_col] = data[timestamp_col].astype(str)
data = data.sort_values(timestamp_col).reset_index(drop=True)

# Length of the test set. Set it to None to make it equal to prediction_length
test_size = None

# The ratio of the validation dataset used for cross validation
val_ratio = 0.25

# Number of cross validation folds
cv_fold = 5

# Whether to add a lag_<prediction_length> column to the dataset. Should be True in most cases.
add_lag_col = True

train_data, test_data, cvs = prepare_train_val_test_data(
    data=data, 
    target_col=target_col, 
    timestamp_col=timestamp_col, 
    test_size=test_size, 
    val_ratio=val_ratio, 
    cv_fold=cv_fold, 
    prediction_length=prediction_length, 
    add_lag_col=add_lag_col
)
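
The `add_lag_col` option adds a `lag_<prediction_length>` column. One plausible way to build such a column, assuming it simply holds the target value from `prediction_length` rows earlier, is a pandas `shift`. The helper name `add_lag_column` is hypothetical, and the real implementation in `utils` may differ.

```python
import pandas as pd

def add_lag_column(df, target_col, prediction_length):
    """Return a copy of df with the target shifted down by prediction_length rows.

    Hypothetical sketch; assumes df is already sorted by time.
    """
    out = df.copy()
    out[f'lag_{prediction_length}'] = out[target_col].shift(prediction_length)
    return out

# Toy values, invented for the example
toy = pd.DataFrame({'count': [10, 20, 30, 40, 50]})
lagged = add_lag_column(toy, 'count', prediction_length=2)
```

The first `prediction_length` rows of the new column are `NaN` because no earlier value exists. A lag at least as long as the forecast horizon is already known for every step being predicted, which is why it is usually safe to include.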

## Feature Selection

Our feature selection method runs in several steps, designed so that the selected features yield a significant improvement and transfer to a wide range of time series forecasting models, even though the method itself is built solely on XGBoost. To use it, follow these steps:

1. Create an instance of the `FeatureSelector` class.
2. Call the `fit` method on that instance with the necessary parameters. Note that this process requires a machine with a GPU.
3. Once feature selection has finished, call `get_best_features()` to obtain a list of features with strong predictive power.
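
The internals of `FeatureSelector` are not shown in this notebook. As a general sketch of how a candidate feature can be scored by cross-validation, the code below compares the validation loss with and without each candidate. To stay dependency-free it swaps in an ordinary least-squares fit for XGBoost; the function names and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def cv_loss(X, y, splits):
    """Average RMSE of an ordinary least-squares fit over the given splits."""
    losses = []
    for train_idx, val_idx in splits:
        # Add an intercept column, then solve the least-squares problem
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        resid = y[val_idx] - A_val @ coef
        losses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(losses))

def score_candidates(X_base, candidates, y, splits):
    """Return {name: baseline_loss - loss_with_candidate}; positive means it helped."""
    baseline = cv_loss(X_base, y, splits)
    return {
        name: baseline - cv_loss(np.column_stack([X_base, col]), y, splits)
        for name, col in candidates.items()
    }

# Synthetic data: the target depends on 'useful' but not on 'noise'
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=(n, 1))
useful = rng.normal(size=n)
noise = rng.normal(size=n)
y = 2.0 * base[:, 0] + 3.0 * useful + rng.normal(scale=0.1, size=n)
splits = [(np.arange(0, 150), np.arange(150, 200))]
scores = score_candidates(base, {'useful': useful, 'noise': noise}, y, splits)
```

A feature whose score is clearly positive reduces the validation loss; one with a score near zero or below adds nothing beyond the baseline columns.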

In [7]:
from feature_selection import FeatureSelector
feature_selector = FeatureSelector()

In [8]:
feature_selector.fit(
    train_data=train_data,
    cvs=cvs,
    timestamp_col=timestamp_col,
    target_col=target_col,
    prediction_length=prediction_length,
    features_parquet_path=features_parquet_path,
    output_dir=output_dir,
    scoring=scoring,
    optuna_n_trials=optuna_n_trials,
    gpu_id=0,
)

  0%|          | 0/18 [00:00<?, ?it/s]

  from pandas import MultiIndex, Int64Index
 11%|█         | 330/2987 [02:32<20:25,  2.17it/s]
Traceback (most recent call last):
  File "calculate_feature_score.py", line 43, in <module>
    output = calculate_feature_score(**input_dict)
  File "calculate_feature_score.py", line 22, in calculate_feature_score
    losses = run_cv(model, train_data_exo_small, target_col,
  File "/notional_data/phuc_workspace/notional-ts-examples/utils.py", line 197, in run_cv
    model.fit(X_train, y_train)
  File "/opt/conda/envs/notional-ts/lib/python3.8/site-packages/xgboost/core.py", line 506, in inner_f
    return f(**kwargs)
  File "/opt/conda/envs/notional-ts/lib/python3.8/site-packages/xgboost/sklearn.py", line 789, in fit
    self._Booster = train(
  File "/opt/conda/envs/notional-ts/lib/python3.8/site-packages/xgboost/training.py", line 188, in train
    bst = _train_internal(params, dtrain,
  File "/opt/conda/envs/notional-ts/lib/python3.8/site-packages/xgboost/training.py", line 81, in _trai

KeyboardInterrupt: 

In [None]:
selected_features = feature_selector.get_best_features()
selected_features

## Evaluate performance of selected features on different models

Import the necessary modules and helper functions.

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from utils import ARIMAModel
import xgboost as xgb

from utils import fine_tune_model, evaluate_models, add_exo_features

We will assess the selected features using four different time series forecasting models: SARIMAX, Lasso, XGBoost, and RandomForest. This evaluation aims to determine how robust the selected features are across model types.

We evaluate each model with and without the selected features and compare the results.

In [None]:
fine_tune_model_args = {
    'train_data': train_data, 
    'target_col': target_col, 
    'cvs': cvs, 
    'scoring': scoring, 
    'timestamp_col': timestamp_col, 
    'optuna_n_trials': optuna_n_trials
}

arima_model = ARIMAModel()
lasso_model = fine_tune_model('lasso', **fine_tune_model_args)
xgb_model = fine_tune_model('xgboost', **fine_tune_model_args)
rf_model = fine_tune_model('random_forest', **fine_tune_model_args)

models = [arima_model, lasso_model, xgb_model, rf_model]
evaluate_models(models, train_data, test_data, target_col, timestamp_col, scoring)

In [None]:
# Use the names defined earlier: selected_features and features_parquet_path
train_data_final = add_exo_features(
    train_data, 
    timestamp_col, 
    selected_features, 
    features_parquet_path, 
    prediction_length
)

test_data_final = add_exo_features(
    test_data, 
    timestamp_col, 
    selected_features, 
    features_parquet_path, 
    prediction_length
)
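
How `add_exo_features` joins the features is not shown here. One plausible reading, given that it receives `prediction_length`, is that each feature is lagged by the forecast horizon and then left-joined on the timestamp column. The sketch below illustrates that idea with in-memory frames; `join_exo_features`, the column names, and the values are all hypothetical.

```python
import pandas as pd

def join_exo_features(data, features, timestamp_col, selected, prediction_length):
    """Left-join selected feature columns, lagged by prediction_length rows.

    Hypothetical stand-in for illustration; not the utils.add_exo_features code.
    """
    feats = features.sort_values(timestamp_col).reset_index(drop=True)
    # Shift so that each row only sees feature values known at forecast time
    shifted = feats[selected].shift(prediction_length)
    shifted[timestamp_col] = feats[timestamp_col]
    return data.merge(shifted, on=timestamp_col, how='left')

# Toy frames with invented values
data = pd.DataFrame({'date': ['2020-01-02', '2020-01-03'], 'count': [5, 7]})
features = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'feat_a': [1.0, 2.0, 3.0],
    'feat_b': [9.0, 8.0, 7.0],
})
joined = join_exo_features(data, features, 'date', ['feat_a'], prediction_length=1)
```

A left join keeps every row of the original data, so dates missing from the feature table show up as `NaN` rather than being dropped.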

In [None]:
fine_tune_model_args = {
    'train_data': train_data_final, 
    'target_col': target_col, 
    'cvs': cvs, 
    'scoring': scoring, 
    'timestamp_col': timestamp_col, 
    'optuna_n_trials': optuna_n_trials
}

# Note: LinearRegression is only evaluated in this cell, so it has no baseline in the previous run
lr_model = LinearRegression()
arima_model = ARIMAModel()
lasso_model = fine_tune_model('lasso', **fine_tune_model_args)
xgb_model = fine_tune_model('xgboost', **fine_tune_model_args)
rf_model = fine_tune_model('random_forest', **fine_tune_model_args)

models = [lr_model, arima_model, lasso_model, xgb_model, rf_model]
evaluate_models(models, train_data_final, test_data_final, target_col, timestamp_col, scoring)
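
To compare the two runs, one option is to compute the percent reduction in loss for each model. The return format of `evaluate_models` is not shown here, so this sketch assumes you have collected one score per model into plain dictionaries; `relative_improvement` is a hypothetical helper and the numbers are placeholders, not results from this notebook.

```python
def relative_improvement(baseline, with_features):
    """Percent reduction in loss per model; positive means the features helped."""
    return {
        model: 100.0 * (baseline[model] - with_features[model]) / baseline[model]
        for model in baseline
        if model in with_features
    }

# Placeholder numbers only; they are not results from this notebook
baseline_scores = {'lasso': 100.0, 'xgboost': 80.0}
feature_scores = {'lasso': 90.0, 'xgboost': 84.0, 'linear': 95.0}
improvement = relative_improvement(baseline_scores, feature_scores)
```

Models that appear in only one of the two runs are skipped, since there is nothing to compare them against.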