Imports

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import optuna
import seaborn as sns
from optuna.samplers import TPESampler
from sklearn.model_selection import train_test_split, KFold
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.preprocessing import MinMaxScaler


**Data Preprocessing and Feature Engineering**

Convert Datetime: The 'datetime' column is transformed into a datetime object to enable extraction of time-related features.

Extract Time-Related Features: Essential features such as hour, day, day of the week, month, and year are extracted from the datetime column. These features are crucial for predicting bike rental demand.

Create Interaction Features: Interaction features are generated by multiplying certain columns like 'hour' with 'workingday', 'temp', and 'humidity'. This approach helps in uncovering patterns in rental behavior under varying conditions and times.

One-Hot Encoding for Categorical Features: Categorical variables such as 'season', 'weather', and 'year' are one-hot encoded into indicator columns, so the model does not read an ordinal relationship into their numeric codes. With drop_first=True, the first level of each variable is dropped and serves as the baseline.

Drop Unnecessary Columns: 'datetime' is dropped along with 'casual' and 'registered', which sum to 'count' and would leak the target. The remaining columns form the feature set (X). The target variable (y) is 'count' after a log1p transformation, log(1 + count), which reduces skewness.

In [2]:
# Load dataset
df = pd.read_csv('/kaggle/input/bike-sharing-demand/train.csv')

# Preprocess the dataset

# Convert datetime
df['datetime'] = pd.to_datetime(df['datetime'])

# Extract time-related features
df['hour'] = df['datetime'].dt.hour
df['day'] = df['datetime'].dt.day
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month
df['year'] = df['datetime'].dt.year

# Create interaction features
df['hour_workingday'] = df['hour'] * df['workingday']
df['hour_temp'] = df['hour'] * df['temp']
df['hour_humidity'] = df['hour'] * df['humidity']

# One-hot encoding for categorical features
df = pd.get_dummies(df, columns=['season', 'weather', 'year'], drop_first=True)

# Drop unnecessary columns
X = df.drop(['datetime', 'count', 'casual', 'registered'], axis=1)
y = np.log1p(df['count'])
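The preprocessing steps above can be tried on a small frame first. The sketch below uses a three-row toy DataFrame with made-up values (not competition data) to show what the `.dt` accessors, an interaction product, and `pd.get_dummies(..., drop_first=True)` produce. The names `toy` and `encoded` are illustrative only.

```python
import pandas as pd

# Toy rows with hypothetical values, not taken from the competition data
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00:00",
        "2011-07-04 08:00:00",
        "2012-12-19 17:00:00",
    ]),
    "season": [1, 3, 4],
    "workingday": [0, 0, 1],
})

# Time-related features
toy["hour"] = toy["datetime"].dt.hour
toy["day_of_week"] = toy["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6

# Interaction feature: zero on non-working days, the hour otherwise
toy["hour_workingday"] = toy["hour"] * toy["workingday"]

# One-hot encode 'season'; drop_first=True removes the first level (season_1)
encoded = pd.get_dummies(toy, columns=["season"], drop_first=True)
print(encoded.columns.tolist())
```

Because season 1 is the dropped baseline, a row with all season indicators equal to zero stands for season 1.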


**Feature Scaling and Data Partitioning**

Numeric Columns List: Creates a list of numeric columns in the dataset. These include environmental factors (temperature, humidity, windspeed), time-related features (hour, day, day of week, month), and the three interaction features.

Scaling Process: Applies MinMaxScaler to map each selected numeric column to the range 0-1. Note that the scaler is fitted on the full dataset, before the split.

Train-test split: Holds out 20% of the rows (test_size=0.2) as a test set, so the model can be evaluated on data it was not trained on.

In [3]:

# Define numeric columns for scaling
numeric_columns = ['temp', 'atemp', 'humidity', 'windspeed', 'hour', 'day', 'day_of_week', 'month', 'hour_workingday', 'hour_temp', 'hour_humidity']

# Scale numeric columns
scaler = MinMaxScaler()
X[numeric_columns] = scaler.fit_transform(X[numeric_columns])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
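The cell above fits the scaler on all rows and then splits. A common variant, sketched here on synthetic data and not the notebook's own method, splits first and fits `MinMaxScaler` on the training rows only, so that the held-out rows do not influence the scaling parameters. All names prefixed with `toy_` or `tr_`/`te_` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for two numeric features and a log-scale target
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "temp": rng.uniform(0, 40, 50),
    "humidity": rng.uniform(0, 100, 50),
})
toy_y = pd.Series(rng.uniform(0, 5, 50))

# Split first, so the scaler never sees the held-out rows
tr_X, te_X, tr_y, te_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

toy_scaler = MinMaxScaler()
tr_scaled = pd.DataFrame(
    toy_scaler.fit_transform(tr_X), columns=tr_X.columns, index=tr_X.index
)
# Held-out rows reuse the training min/max and may fall slightly outside [0, 1]
te_scaled = pd.DataFrame(
    toy_scaler.transform(te_X), columns=te_X.columns, index=te_X.index
)
```

Tree-based models are largely insensitive to this kind of monotonic rescaling, so the practical difference here is likely small. The ordering matters more for models that depend on feature magnitudes.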


**Hyperparameter Optimization with Optuna and Cross-Validation**

-Setting Up Optuna for Hyperparameter Tuning-

Optuna Framework: Optuna is utilized for hyperparameter optimization. It automates the process of finding the most effective hyperparameters for the XGBoost model.

Defining the Objective Function: An objective function is defined for the Optuna study. This function will guide the optimization process based on the model's performance.

-Hyperparameter Suggestion-

Parameter Space: The function defines a search range for each hyperparameter (n_estimators, learning_rate, max_depth, etc.), from which Optuna samples one candidate combination per trial.

Model Initialization: An XGBoost regressor model is initialized within the function with the parameters suggested by Optuna.

-K-Fold Cross-Validation-

Setup: A 5-fold cross-validation (n_folds = 5) is set up so that each hyperparameter candidate is scored on data it was not fitted on, which gives a less optimistic estimate than the training error.

Process: The training set is split into 5 folds. The model is trained on four folds and validated on the remaining one, rotating until each fold has served once as the validation set.
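The fold rotation described above can be sketched with `KFold` on synthetic data. To keep the example small, the "model" simply predicts the training-fold mean; this stand-in and the names `toy_features`, `toy_target`, and `fold_scores` are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic data: 100 rows, 3 features, target standing in for log1p(count)
rng = np.random.default_rng(42)
toy_features = rng.normal(size=(100, 3))
toy_target = rng.normal(loc=4.0, scale=1.0, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []  # one score per fold, for this evaluation only
val_sizes = []
for train_idx, val_idx in kf.split(toy_features):
    # Stand-in model: predict the training-fold mean for every validation row
    prediction = toy_target[train_idx].mean()
    rmse = np.sqrt(np.mean((toy_target[val_idx] - prediction) ** 2))
    fold_scores.append(float(rmse))
    val_sizes.append(len(val_idx))

mean_score = float(np.mean(fold_scores))
print(val_sizes, round(mean_score, 3))
```

Each row appears in exactly one validation fold, so the averaged score uses every row once as unseen data.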

-Model Evaluation-

Validation and Prediction: In each fold, the model is trained on a subset of the data and makes predictions on the validation fold.

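Performance Metric: The Root Mean Squared Logarithmic Error (RMSLE) is calculated for each fold on the original count scale. Because it compares log(1 + prediction) with log(1 + actual), it penalizes relative rather than absolute errors, which suits right-skewed count targets such as hourly rentals.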

Aggregating Results: The mean RMSLE across all folds is computed, providing an overall measure of model performance.
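Since the target is trained as log1p(count), the RMSLE on the original scale equals the plain RMSE on the log1p scale. The sketch below checks this with NumPy on a few made-up counts; the helper `rmsle` is a hypothetical name written for this example and is not the scikit-learn function used in the notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSLE: sqrt(mean((log(1 + y_pred) - log(1 + y_true))^2))."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical actual and predicted rental counts
counts_true = np.array([16.0, 40.0, 1.0, 300.0])
counts_pred = np.array([20.0, 35.0, 2.0, 280.0])

# The same values on the log1p scale, as the model sees them during training
log_true = np.log1p(counts_true)
log_pred = np.log1p(counts_pred)

rmse_on_log_scale = float(np.sqrt(np.mean((log_pred - log_true) ** 2)))
rmsle_on_counts = rmsle(np.expm1(log_true), np.expm1(log_pred))

print(rmse_on_log_scale, rmsle_on_counts)
```

The two numbers agree up to floating-point error, which is why a squared-error objective on the log1p target is a natural match for an RMSLE evaluation.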

In [4]:

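n_folds = 5

# Define objective function for Optuna
def objective(trial):
    # Collect fold scores in a per-trial list; a shared module-level list would
    # mix scores from earlier trials into every later trial's mean
    rmsle_scores = []

    # Suggest hyperparameters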
    params = {
        'n_estimators': trial.suggest_int('n_estimators', low=100, high=1000, step=100),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.1, log=True),
        'max_depth': trial.suggest_int('max_depth', low=3, high=11, step=2),
        'min_child_weight': trial.suggest_int('min_child_weight', low=1, high=7, step=2),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0, step=0.1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0, step=0.1),
        'gamma': trial.suggest_float('gamma', 0, 0.2, step=0.1),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 0.1, step=0.01),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.1, 2.0, step=0.1)
    }

    
    xgb_model = XGBRegressor(**params, objective='reg:squarederror', n_jobs=-1, random_state=42)
    
    # Initialize KFold cross-validation
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    # Perform KFold cross-validation
    for train_index, val_index in kf.split(X_train):
        X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
        y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

        # Fit the model on the training fold
        xgb_model.fit(X_train_fold, y_train_fold)

        # Make predictions on the validation fold
        y_val_pred = xgb_model.predict(X_val_fold)

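        # Revert logarithmic transformation; clip predictions at 0 first, since
        # mean_squared_log_error raises an error on negative values
        y_val_exp = np.expm1(y_val_fold)
        y_val_pred_exp = np.expm1(np.maximum(y_val_pred, 0))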

        # Calculate the RMSLE on validation fold
        rmsle = np.sqrt(mean_squared_log_error(y_val_exp, y_val_pred_exp))
        rmsle_scores.append(rmsle)

    # Calculate mean RMSLE across all folds
    mean_rmsle = np.mean(rmsle_scores)

    return mean_rmsle


**Optuna Study for Hyperparameter Optimization**

Study Setup: An Optuna study is set up with a TPESampler for efficient hyperparameter search, aimed at minimizing the objective function.

Optimization Process: The study conducts 100 trials to find the optimal hyperparameters for the XGBoost model.

-Retrieving Optimal Hyperparameters-

Extraction of Best Parameters: Post-optimization, the study's best hyperparameters are extracted.

Display: These optimal parameters are printed out for model configuration and further analysis.

In [5]:

# Initialize Optuna study
sampler = TPESampler(seed=42)
study = optuna.create_study(direction='minimize', sampler=sampler)
study.optimize(objective, n_trials=100)

# Get best hyperparameters
best_params = study.best_params
print("Best hyperparameters: ", best_params)


[I 2024-01-05 02:21:42,770] A new study created in memory with name: no-name-588f017b-db96-44b5-9f23-9622509f6fc4
[I 2024-01-05 02:21:49,170] Trial 0 finished with value: 0.27877371031310094 and parameters: {'n_estimators': 400, 'learning_rate': 0.07969454818643935, 'max_depth': 9, 'min_child_weight': 5, 'subsample': 0.6, 'colsample_bytree': 0.6, 'gamma': 0.0, 'reg_alpha': 0.09, 'reg_lambda': 1.3000000000000003}. Best is trial 0 with value: 0.27877371031310094.
[I 2024-01-05 02:22:05,926] Trial 1 finished with value: 0.49409005197523453 and parameters: {'n_estimators': 800, 'learning_rate': 0.0010994335574766201, 'max_depth': 11, 'min_child_weight': 7, 'subsample': 0.7, 'colsample_bytree': 0.6, 'gamma': 0.0, 'reg_alpha': 0.03, 'reg_lambda': 1.1}. Best is trial 0 with value: 0.27877371031310094.
[I 2024-01-05 02:22:15,641] Trial 2 finished with value: 0.4617428332100063 and parameters: {'n_estimators': 500, 'learning_rate': 0.0038234752246751854, 'max_depth': 9, 'min_child_weight': 1, '

Best hyperparameters:  {'n_estimators': 400, 'learning_rate': 0.07969454818643935, 'max_depth': 9, 'min_child_weight': 5, 'subsample': 0.6, 'colsample_bytree': 0.6, 'gamma': 0.0, 'reg_alpha': 0.09, 'reg_lambda': 1.3000000000000003}


**Training Model and Analyzing Feature Importance**

-Model Training-

Training with Optimal Parameters: The XGBoost model is retrained on the full training split (X_train, y_train) using the best hyperparameters found by the Optuna study.

Importance Analysis: The importance of each feature in the trained model is calculated to understand their impact on predictions.

RMSLE Score: The best Root Mean Squared Logarithmic Error (RMSLE) score from the Optuna study is displayed, offering a measure of the model's prediction accuracy on the logarithmic scale.

In [6]:

# Train the model with best hyperparameters
best_xgb_model = XGBRegressor(**best_params, objective='reg:squarederror', n_jobs=-1, random_state=42)
best_xgb_model.fit(X_train, y_train)

# Calculate the feature importances
importances = best_xgb_model.feature_importances_
feature_importances = pd.Series(importances, index=X_train.columns)
feature_importances_sorted = feature_importances.sort_values(ascending=False)

# Print sorted feature importances
print("Feature importances:")
print(feature_importances_sorted)

# Print best RMSLE score
best_rmsle = study.best_value
print("Best Root Mean Squared Logarithmic Error ({}-fold CV): {:.5f}".format(n_folds, best_rmsle))


Feature importances:
hour               0.342402
hour_workingday    0.157868
workingday         0.107860
hour_temp          0.087541
year_2012          0.084809
weather_3          0.046657
month              0.036971
atemp              0.020217
day_of_week        0.018295
hour_humidity      0.014071
humidity           0.013791
season_3           0.013709
season_4           0.012889
temp               0.009483
holiday            0.009271
season_2           0.008447
day                0.005712
weather_2          0.005094
windspeed          0.004911
weather_4          0.000000
dtype: float32
Best Root Mean Squared Logarithmic Error (5-fold CV): 0.27877
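One way to read the printed importances is as cumulative shares. The sketch below copies the top five values from the output above into a pandas Series and sums them; the variable names are illustrative.

```python
import pandas as pd

# Top five importances, copied from the printed output above
top_importances = pd.Series({
    "hour": 0.342402,
    "hour_workingday": 0.157868,
    "workingday": 0.107860,
    "hour_temp": 0.087541,
    "year_2012": 0.084809,
})

# Running total of importance, from the most to the fifth most important feature
cumulative_share = top_importances.cumsum()
print(cumulative_share.round(3))
```

The five features together account for roughly 78% of the total importance, and four of them involve 'hour' or 'workingday'.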


**Preprocessing the Test Dataset**

-Loading Test Data-

Data Import: The test dataset for bike-sharing demand is loaded for final evaluation.

-Data Preprocessing-

Datetime Conversion: The 'datetime' column is converted to a datetime object for feature extraction.

Extracting Time-Related Features: Key time-related features like hour, day, day of the week, month, and year are extracted from the 'datetime' column.

Creating Interaction Features: Interaction features (e.g., hour with working day, temperature, humidity) are generated exactly as for the training data.

One-Hot Encoding: Categorical features like 'season', 'weather', and 'year' are one-hot encoded for consistency with the training dataset.

Column Pruning: Only the original 'datetime' column is dropped, since the test file does not contain 'casual', 'registered', or 'count'. The remaining columns match the feature set used for training.

In [7]:

# Load the test dataset
test_df = pd.read_csv('/kaggle/input/bike-sharing-demand/test.csv')

# Preprocess the test dataset
test_df['datetime'] = pd.to_datetime(test_df['datetime'])

# Pull out time-related features
test_df['hour'] = test_df['datetime'].dt.hour
test_df['day'] = test_df['datetime'].dt.day
test_df['day_of_week'] = test_df['datetime'].dt.dayofweek
test_df['month'] = test_df['datetime'].dt.month
test_df['year'] = test_df['datetime'].dt.year

# Interaction features
test_df['hour_workingday'] = test_df['hour'] * test_df['workingday']
test_df['hour_temp'] = test_df['hour'] * test_df['temp']
test_df['hour_humidity'] = test_df['hour'] * test_df['humidity']

# One-hot encoding for categorical features
test_df = pd.get_dummies(test_df, columns=['season', 'weather', 'year'], drop_first=True)

# Drop unnecessary columns
X_test = test_df.drop(['datetime'], axis=1)
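Encoding the training and test files separately yields identical columns only when both contain the same category levels. As a defensive sketch for the hypothetical case where a level is missing from the test file, `DataFrame.reindex` can align the test columns to the training columns and fill absent indicators with 0. The frames below are toy data with made-up values.

```python
import pandas as pd

# Column order the model was trained on (hypothetical subset)
train_cols = ["temp", "weather_2", "weather_3", "weather_4"]

# Toy test frame: different column order and no 'weather_4' indicator
toy_test = pd.DataFrame({
    "weather_2": [1, 0],
    "temp": [9.8, 14.2],
    "weather_3": [0, 1],
})

# Align to the training layout; indicators absent from the test data become 0
aligned = toy_test.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

In the notebook this would correspond to reindexing with `X_train.columns`, which also guards against a silent change in column order.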


**Final Predictions on Test Dataset**

-Scaling Test Data-

Feature Scaling: The MinMax scaling applied to the training dataset's numeric features is also applied to the test dataset, ensuring consistency in data representation.

-Making Predictions-

Model Application: Predictions on the test dataset are made using the XGBoost model that was optimized with the best hyperparameters from Optuna.

Predictive Insights: The predictions estimate bike-sharing demand for each timestamp in the test data. They remain on the log1p scale until the transformation is reversed.

-Reverting Logarithmic Transformation-

Transformation Reversal: The logarithmic transformation applied to the target variable during training is reversed with expm1, the inverse of log1p. This step converts the predictions back to their original scale, so they can be read as rental counts.

In [8]:

# Apply MinMax scaling to the numeric features in the test dataset
X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])

# Make predictions on the test dataset using the best XGBoost model from Optuna
y_test_pred = best_xgb_model.predict(X_test)

# Revert the logarithmic transformation; clip at 0 first so no predicted count is negative
y_test_pred_exp = np.expm1(np.maximum(y_test_pred, 0))
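A regressor trained on a log1p target can occasionally output slightly negative values, which expm1 maps to negative counts. The sketch below, using hypothetical model outputs, shows the effect and a simple guard that clips at zero before inverting.

```python
import numpy as np

# Hypothetical model outputs on the log1p scale, one of them slightly negative
log_preds = np.array([-0.05, 0.0, 2.0])

# Plain inversion: the negative output becomes a negative "count"
raw_counts = np.expm1(log_preds)

# Clipping at 0 before inverting keeps every predicted count non-negative
safe_counts = np.expm1(np.maximum(log_preds, 0.0))

print(raw_counts)
print(safe_counts)
```

Non-negative predictions also matter for scoring, because a logarithmic error metric is undefined for negative values.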


**Creating and Saving Submission File**

-Submission Dataframe Creation-

Dataframe Assembly: A submission dataframe is created containing the original 'datetime' from the test dataset and the predicted 'count', which represents the forecasted bike-sharing demand.

-Saving to CSV-

Exporting Dataframe: The submission dataframe is saved as a CSV file named 'submission.csv'.

In [9]:

# Create a submission dataframe
submission = pd.DataFrame({
    'datetime': test_df['datetime'],
    'count': y_test_pred_exp
})

# Save submission dataframe to a CSV file
submission.to_csv('submission.csv', index=False)
