#### Rationale for choosing XGBoost

Since we have time-series data with a continuous target, this is a regression problem. We also need to capture non-linear relationships and, through engineered time features, temporal dependencies. For these reasons, we chose a tree-based model, specifically XGBoost, which performs well on tabular data and is robust to multicollinear features.  

Note: This is a personal choice and does not mean XGBoost is guaranteed to be the most accurate model for this dataset.


In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor

In [2]:
#Load datasets

train_day_df = pd.read_csv("D:/VS Code Projects/Datasets/Bike Sharing/data_processed/feature_engineered_train_day.csv")
valid_day_df = pd.read_csv("D:/VS Code Projects/Datasets/Bike Sharing/data_processed/feature_engineered_valid_day.csv")
train_hour_df = pd.read_csv("D:/VS Code Projects/Datasets/Bike Sharing/data_processed/feature_engineered_train_hour.csv")
valid_hour_df = pd.read_csv("D:/VS Code Projects/Datasets/Bike Sharing/data_processed/feature_engineered_valid_hour.csv")

In [3]:
# Parse 'date' as datetime and drop the stray index column 'Unnamed: 0' if present
datasets_list =  [train_day_df,train_hour_df,valid_day_df,valid_hour_df]
for dataset in datasets_list:
    dataset['date'] = pd.to_datetime(dataset['date'], errors= 'coerce')
    dataset.drop(columns=['Unnamed: 0'], inplace=True, errors='ignore')

print('ISSUE FIXED')
train_day_df.dtypes

ISSUE FIXED


date                    datetime64[ns]
month                            int64
weather_situation                int64
feels_like_temp_norm           float64
temp_feel_diff                 float64
wind_speed                     float64
temp_x_wind_speed              float64
casual                           int64
registered                       int64
num_rentals                      int64
dtype: object
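
Because `errors='coerce'` turns unparseable dates into `NaT` instead of raising an error, a bad value in the `date` column would pass through the cell above silently. A minimal sketch of a follow-up check, run on a small made-up frame rather than the real CSVs (the names `toy_df` and `n_bad` are illustrative):

```python
import pandas as pd

# Made-up data: the second value cannot be parsed as a date.
toy_df = pd.DataFrame({"date": ["2011-01-01", "not a date", "2011-01-03"]})
toy_df["date"] = pd.to_datetime(toy_df["date"], errors="coerce")

# Count the rows that were coerced to NaT.
n_bad = int(toy_df["date"].isna().sum())
print(f"Unparseable dates: {n_bad}")
```

Running the same count on each of the four loaded frames would confirm that no rows were silently coerced.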

In [4]:
valid_day_df.dtypes

date                    datetime64[ns]
month                            int64
weather_situation                int64
feels_like_temp_norm           float64
temp_feel_diff                 float64
wind_speed                     float64
temp_x_wind_speed              float64
casual                           int64
registered                       int64
num_rentals                      int64
dtype: object

In [5]:
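target = 'num_rentals'

# Features: every column except the target and the raw date.
# Note that 'casual' and 'registered' remain in the feature set.
X_day_train = train_day_df.drop(columns=[target, 'date'])
y_day_train = train_day_df[target]

X_day_valid = valid_day_df.drop(columns=[target, 'date'])
y_day_valid = valid_day_df[target]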



In [6]:
print("Train shape:", X_day_train.shape)
print("valid shape:", X_day_valid.shape)

Train shape: (365, 8)
valid shape: (274, 8)
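
Both shapes report 8 feature columns, so `casual` and `registered` are still among the inputs. In the original Bike Sharing data the total count is the sum of those two columns; if the same holds for `num_rentals` here, they leak the target into the features. The sketch below, run on a small made-up frame rather than the real CSVs, shows one way to check for this and drop the columns. The helper name `drop_leaky_columns` is hypothetical, and the metrics reported later in this notebook were computed with both columns included.

```python
import pandas as pd

def drop_leaky_columns(df, target="num_rentals", parts=("casual", "registered")):
    """Return the feature columns, dropping `parts` if they sum exactly to `target`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    parts = [c for c in parts if c in df.columns]
    features = df.drop(columns=[target, "date"], errors="ignore")
    if parts and (df[parts].sum(axis=1) == df[target]).all():
        # The parts reconstruct the target exactly, so they must not be inputs.
        features = features.drop(columns=parts)
    return features

# Made-up rows that mimic the column layout shown above.
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
    "month": [1, 1, 1],
    "wind_speed": [0.16, 0.25, 0.25],
    "casual": [331, 131, 120],
    "registered": [654, 670, 1229],
    "num_rentals": [985, 801, 1349],
})

X_toy = drop_leaky_columns(toy_df)
print(list(X_toy.columns))
```

If the check fires on the real data, the models would need to be refit without these columns before their scores can be read as genuine forecasting performance.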


In [7]:

X_hour_train = train_hour_df.drop(columns= [target, 'date'])
y_hour_train = train_hour_df[target]

X_hour_valid = valid_hour_df.drop(columns= [target, 'date'])
y_hour_valid =valid_hour_df[target]

In [8]:
# Configure a simple XGBoost regressor (it is fitted in the following cells)

xgb_model  = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)


In [9]:

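# Fit on the daily training set and predict on the daily validation set.
# The predictions are stored now because xgb_model is refit on the hourly data next.
xgb_model.fit(X_day_train, y_day_train)
y_day_pred = xgb_model.predict(X_day_valid)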



In [10]:

# Refit the same estimator on the hourly training set (this replaces the daily fit),
# then predict on the hourly validation set.
xgb_model.fit(X_hour_train, y_hour_train)
y_hour_pred = xgb_model.predict(X_hour_valid)

In [11]:
# Prediction metrics
def evaluate(y_true, y_pred, label):
    """Print RMSE and R² for a set of predictions."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    print(f"{label} → RMSE: {rmse:.2f}, R²: {r2:.3f}")
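
`mean_absolute_error` is imported at the top of the notebook but not used by `evaluate`. As a possible variant, the sketch below also reports MAE and returns the values so they can be compared later. The name `evaluate_metrics` and the toy inputs are illustrative, not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_metrics(y_true, y_pred, label):
    """Print and return MAE, RMSE and R² (hypothetical variant of evaluate)."""
    metrics = {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "r2": r2_score(y_true, y_pred),
    }
    print(
        f"{label} → MAE: {metrics['mae']:.2f}, "
        f"RMSE: {metrics['rmse']:.2f}, R²: {metrics['r2']:.3f}"
    )
    return metrics

# Toy inputs: one prediction out of three is off by 1.
toy_metrics = evaluate_metrics([1, 2, 3], [1, 2, 4], "Toy example")
```

MAE is in the same units as the target and is less sensitive to a few large errors than RMSE, which can help when reading the daily and hourly scores side by side.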

In [13]:
evaluate(y_day_valid, y_day_pred, "Daily XGBoost")


Daily XGBoost → RMSE: 1130.01, R²: 0.566


In [14]:
evaluate(y_hour_valid, y_hour_pred, "Hourly XGBoost")

Hourly XGBoost → RMSE: 43.86, R²: 0.957


The XGBoost model performs very well on the hourly data: with R² = 0.957 it explains most of the variance in the validation set, capturing temporal patterns (through engineered features such as hour, hour_category, and temp × humidity) as well as non-linear relationships.  

On the daily dataset, however, the model performs poorly (RMSE = 1130.01, R² = 0.566), showing almost no improvement over the baseline. As next steps, we will try other models and, where possible, tune hyperparameters to improve performance.
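
To make "improvement over the baseline" concrete, a model's scores can be placed next to those of a trivial predictor. The sketch below assumes the baseline is a mean predictor (scikit-learn's `DummyRegressor`); the baseline actually used in this project is not shown in this section and may differ. The data is synthetic and a `LinearRegression` stands in for any fitted model, so the numbers say nothing about the bike-sharing results.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)

# Synthetic stand-in data: 200 rows, 3 features, linear signal plus noise.
X = rng.normal(size=(200, 3))
y = 100 + 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5, size=200)

# Order-preserving split: first 150 rows for training, last 50 for validation.
X_train, X_valid = X[:150], X[150:]
y_train, y_valid = y[:150], y[150:]

def rmse_and_r2(model):
    """Fit on the training split and score on the validation split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_valid)
    rmse = float(np.sqrt(mean_squared_error(y_valid, pred)))
    return rmse, r2_score(y_valid, pred)

baseline_rmse, baseline_r2 = rmse_and_r2(DummyRegressor(strategy="mean"))
model_rmse, model_r2 = rmse_and_r2(LinearRegression())

print(f"Baseline → RMSE: {baseline_rmse:.2f}, R²: {baseline_r2:.3f}")
print(f"Model    → RMSE: {model_rmse:.2f}, R²: {model_r2:.3f}")
```

Reporting both rows for the daily and hourly sets would show how much of each score comes from the model rather than from the scale of the target.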
