SEOUL BIKE DATA


We'll see how well we can predict hourly bike rental demand from atmospheric conditions.

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv("SeoulBikeData.csv", index_col = "Date", parse_dates = True)
df = pd.DataFrame(data)

In [16]:
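from scipy.stats.mstats import winsorize

# Convert text columns to the pandas "category" dtype.
categorical_col = df.select_dtypes("object").columns
df = df.astype({col: "category" for col in categorical_col})
numerical_col = df.select_dtypes("number").columns

# Clip the lowest and highest 10% of each numerical column.
# Note: this also clips the target and runs before the train/test split.
for col in numerical_col:
    df[col] = winsorize(df[col], limits = [0.1, 0.1])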

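select_dtypes returns a DataFrame containing only the columns of the requested dtype ("number" for numerical columns, "object" for text columns).

We will now compute the correlation of each numerical column with the target, Rented Bike Count. Rainfall and Snowfall come out as NaN, most likely because both are zero in the large majority of rows, so winsorizing at 10% leaves them constant and a constant column has no defined correlation.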

In [17]:
df.select_dtypes("number").corr()["Rented Bike Count"].sort_values(ascending=False)


Rented Bike Count           1.000000
Temperature(C)              0.578677
Hour                        0.433847
Dew point temperature(C)    0.398565
Solar Radiation (MJ/m2)     0.312809
Visibility (10m)            0.194448
Wind speed (m/s)            0.143514
Humidity(%)                -0.202771
Rainfall(mm)                     NaN
Snowfall (cm)                    NaN
Name: Rented Bike Count, dtype: float64
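The NaN entries above are what you would expect from winsorizing a column that is zero in more than 90% of its rows. The sketch below uses a synthetic stand-in column (the names rain and target and the 95% share of zeros are assumptions for illustration, not values taken from the dataset) to show that clipping at 10% can leave a single distinct value, and that the correlation with a constant column is undefined.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(0)

# Hypothetical rainfall-like column: 950 zeros and 50 positive readings.
rain = np.zeros(1000)
rain[:50] = rng.uniform(0.5, 30.0, size=50)
target = rng.normal(size=1000)  # any non-constant column works here

# Same limits as in the notebook: clip the lowest and highest 10%.
clipped = np.asarray(winsorize(rain, limits=[0.1, 0.1]))
n_distinct = len(np.unique(clipped))
print("distinct values after winsorizing:", n_distinct)

# Pearson correlation divides by the standard deviation, which is 0 here.
with np.errstate(invalid="ignore", divide="ignore"):
    corr = pd.Series(clipped).corr(pd.Series(target))
print("correlation with target:", corr)
```

If rainfall and snowfall are meant to be used as features, one option is to leave such sparse columns out of the winsorizing loop.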

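Splitting the data into train, validation, and test sets (60/20/20). shuffle=False keeps the rows in their original order, so the test set is the last 20% of the file.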

In [18]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, shuffle=False, test_size=0.2)
train, val = train_test_split(train, shuffle=False, test_size=0.25)
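To check the proportions, here is a small sketch that applies the same two calls to 100 dummy rows (the array rows is a hypothetical stand-in for df). The first call holds out the last 20% as the test set; the second takes 25% of the remaining 80% as validation, which gives 60/20/20 overall.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # hypothetical stand-in for the 8760 hourly rows

# Same arguments as in the notebook cell.
train, test = train_test_split(rows, shuffle=False, test_size=0.2)
train, val = train_test_split(train, shuffle=False, test_size=0.25)

print(len(train), len(val), len(test))
print(train[[0, -1]], val[[0, -1]], test[[0, -1]])
```

Because nothing is shuffled, every validation row comes after every training row, and every test row comes after every validation row.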

In [19]:
X_train = train.drop(columns=["Rented Bike Count"])  # Features
X_test = test.drop(columns=["Rented Bike Count"])
X_val = val.drop(columns=["Rented Bike Count"])

y_train = train["Rented Bike Count"]  # Target
y_val = val["Rented Bike Count"]
y_test = test["Rented Bike Count"]

To avoid data leakage, the scaler is fitted on the training set only and then applied to the train, validation, and test sets.

In [25]:
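from sklearn.preprocessing import StandardScaler

# Fit the feature scaler on the training rows only, then reuse it for every split.
scaler_x = StandardScaler()
scaler_x.fit(X_train.select_dtypes("number"))
X_train_scaled = scaler_x.transform(X_train.select_dtypes("number"))
X_test_scaled = scaler_x.transform(X_test.select_dtypes("number"))
X_val_scaled = scaler_x.transform(X_val.select_dtypes("number"))

# Target scaler, also fitted on the training target only.
# It is not applied below: the model is trained on the unscaled target.
scaler_y = StandardScaler()
y_df = pd.DataFrame(y_train)
scaler_y.fit(y_df.select_dtypes("number"))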
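As a minimal illustration of the fit-on-train-only rule, the sketch below scales a tiny made-up feature (all names and values are hypothetical). The scaler learns its mean and standard deviation from the training rows, and the test row is transformed with those same statistics rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature data: a small training set and one later test row.
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[10.0]])

# Learn mean and standard deviation from the training rows only.
scaler_demo = StandardScaler().fit(X_train_demo)

train_scaled = scaler_demo.transform(X_train_demo)
test_scaled = scaler_demo.transform(X_test_demo)

print("mean learned from training data:", scaler_demo.mean_[0])
print("mean of scaled training data:", train_scaled.mean())
print("scaled test value:", test_scaled[0, 0])
```

The scaled test value is far from zero because it is measured against the training distribution, which is the intended behaviour.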

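Our data is ready for modeling. We will use HistGradientBoostingRegressor. I did not encode the categorical columns because this model can handle categorical values natively. Note, however, that only the numerical columns were scaled and passed to the model, so the categorical columns are not used as features in this run.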

In [29]:
from sklearn.ensemble import HistGradientBoostingRegressor
model_1 = HistGradientBoostingRegressor(random_state=42)
model_1.fit(X_train_scaled,y_train)

In [30]:
predictions = model_1.predict(X_test_scaled)
y_predicted = predictions  # predict() already returns a 1-D NumPy array
type(y_predicted)

numpy.ndarray

scikit-learn's metric functions (MAPE, MAE, RMSE, R²) accept any array-like input, including NumPy arrays and pandas Series. Values are compared by position, so a Series index is ignored; converting y_test to a NumPy array is optional and simply makes that explicit.

In [31]:
from sklearn.metrics import mean_absolute_percentage_error, root_mean_squared_error, r2_score

y_actual = np.array(y_test)
mape = mean_absolute_percentage_error(y_actual, y_predicted)
rmse = root_mean_squared_error(y_actual, y_predicted)
r2 = r2_score(y_actual, y_predicted)
mape, rmse, r2

(1.9032305353916672, 451.3142543344771, 0.29172817958868225)

Next, we use GridSearchCV with a TimeSeriesSplit to tune the model's hyperparameters on the training set.

In [32]:
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

param_grid = {
    'learning_rate': [0.01, 0.05, 0.07, 0.09, 0.1, 0.125],        # Shrinkage per boosting step
    'max_iter': [100, 200, 300, 400, 500, 600, 700, 800, 1000],   # Number of boosting iterations
    'max_depth':  [3, 5, 7, 10, 15, 17, 20],                      # Maximum depth of each tree
    # 'min_samples_leaf': [10, 20, 30, 40],          # Control for overfitting
    # 'l2_regularization': [0.0, 0.1, 0.5, 1.0],     # Regularization strength
}

grid_search = GridSearchCV(
    estimator=HistGradientBoostingRegressor(random_state=42),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',  # Higher is better, so MAE is negated
    cv=tscv,                            # Time-ordered cross-validation folds
    verbose=2,
    n_jobs=-1                           # Use all cores
)

grid_search.fit(X_train_scaled, y_train)

print("Best Parameters:", grid_search.best_params_)

Fitting 5 folds for each of 378 candidates, totalling 1890 fits
Best Parameters: {'learning_rate': 0.01, 'max_depth': 15, 'max_iter': 500}
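The fit count in the log can be reproduced from the grid itself, and the folds produced by TimeSeriesSplit can be inspected on a toy array. This is only a sketch: param_grid_demo copies the grid above, while X_demo is a hypothetical 12-row array standing in for the training data.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, TimeSeriesSplit

param_grid_demo = {
    "learning_rate": [0.01, 0.05, 0.07, 0.09, 0.1, 0.125],
    "max_iter": [100, 200, 300, 400, 500, 600, 700, 800, 1000],
    "max_depth": [3, 5, 7, 10, 15, 17, 20],
}

# 6 learning rates x 9 iteration counts x 7 depths, each fitted once per fold.
n_splits = 5
n_candidates = len(ParameterGrid(param_grid_demo))
print("candidates:", n_candidates, "total fits:", n_candidates * n_splits)

# Each fold trains on an expanding prefix and validates on the rows right after it.
tscv_demo = TimeSeriesSplit(n_splits=n_splits)
X_demo = np.arange(12).reshape(-1, 1)
folds = [(tr.tolist(), va.tolist()) for tr, va in tscv_demo.split(X_demo)]
for i, (tr, va) in enumerate(folds, start=1):
    print(f"fold {i}: train={tr} validate={va}")
```

Unlike ordinary k-fold splitting, no fold validates on rows that precede its training rows.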


In [33]:
import joblib

# Save the model
joblib.dump(grid_search.best_estimator_, 'best_model.pkl')

['best_model.pkl']

In [34]:
# Load the model later
loaded_model = joblib.load('best_model.pkl')
predictions_new = loaded_model.predict(X_test_scaled)
y_predicted_new = predictions_new  # predict() already returns a 1-D NumPy array

In [35]:
from sklearn.metrics import mean_absolute_percentage_error, root_mean_squared_error, r2_score
mape = mean_absolute_percentage_error(y_actual, y_predicted_new)
rmse = root_mean_squared_error(y_actual, y_predicted_new)
r2 = r2_score(y_actual, y_predicted_new)
mape, rmse, r2

(1.9004201292361624, 450.42672879341563, 0.29451112399289314)