# Bike Sharing (MLflow - KServe)

This notebook walks through a complete data science workflow: data preprocessing, model training and evaluation,
hyperparameter tuning, experiment tracking with MLflow, and model deployment using Seldon and KServe. The use case is
the well-known bike sharing dataset, sourced from the UCI Machine Learning Repository.

![bike-sharing](images/bike-sharing.jpg)
(Photo by <a href="https://unsplash.com/@zaccastravels?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">ZACHARY STAINES</a> on <a href="https://unsplash.com/photos/KEhNcoCldbk?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>)

The dataset records the hourly and daily count of rental bikes between 2011 and 2012 in the Capital Bikeshare system,
supplemented with corresponding weather and seasonal data. The primary objective of this dataset is to foster research
into bike sharing systems, which are gaining significant attention due to their implications on traffic management,
environmental sustainability, and public health.

The task associated with this dataset is regression, with 17,389 instances. The overarching goal is to construct a
predictive model capable of forecasting bike rental demand. The primary target variable for prediction is the `cnt`
attribute, representing the total count of rental bikes, inclusive of both casual and registered users.

Using the other features in the dataset (such as date, season, year, month, hour, holiday, weekday, working day,
weather conditions, temperature, perceived temperature, humidity, and wind speed), you can train a model to predict this
count.

## Setting Up the Environment

The following code cells import the required dependencies. They also create a local directory for storing the training
artifacts generated during your experiment.

In [None]:
import os
import json
import datetime
import itertools
import warnings
import subprocess

from functools import partial

import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn

from tqdm import tqdm
from mlflow.models.signature import infer_signature
from mlflow import log_metric, log_param, log_artifact
from sklearn import tree
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.inspection import permutation_importance


plt.style.use("fivethirtyeight")
pd.plotting.register_matplotlib_converters()

warnings.filterwarnings('ignore')

In [None]:
import shutil

# Start from an empty artifacts directory on every run (portable, no shell call)
shutil.rmtree("model_artifacts", ignore_errors=True)
os.mkdir("model_artifacts")

## Set an MLflow Experiment

To set an experiment as active in MLflow, you can specify it either by its name using the `experiment_name` parameter or
by its ID using the `experiment_id` parameter. It's important to note that you can't specify both the experiment name
and ID simultaneously.

In [None]:
def set_experiment(exp_name):
    """Set the active MLflow experiment, creating it if it does not exist.

    Args:
      exp_name (str): The name of the experiment.
    """
    try:
        mlflow.set_experiment(exp_name)
    except Exception as e:
        # Chain the original exception so the root cause stays visible
        raise RuntimeError(f"Failed to set the experiment: {e}") from e

In [None]:
# Set the active experiment using the set_experiment helper defined above
experiment_name = "bike-sharing-exp"
set_experiment(experiment_name)

## Load the Dataset

With the preliminary setup complete, you can now load the dataset. The data is provided in CSV format, which can be
loaded conveniently with the pandas library. To get a first look at the data you will be working with, display the
first five rows using the `head()` method of the DataFrame.

In [None]:
# Load input data into pandas dataframe
bike_sharing = pd.read_csv("dataset/bike-sharing.csv")
bike_sharing.head()

## Data Preprocessing

In this phase, you will prepare the data for the subsequent stages of the analysis. This involves cleaning,
transforming, and structuring the data to ensure it is in the optimal format for your machine learning model.

In [None]:
# Remove unused columns
bike_sharing.drop(columns=["instant", "dteday", "registered", "casual"],
                  inplace=True)

# Use better names
bike_sharing.rename(
    columns={
        "yr": "year",
        "mnth": "month",
        "hr": "hour_of_day",
        "holiday": "is_holiday",
        "workingday": "is_workingday",
        "weathersit": "weather_situation",
        "temp": "temperature",
        "atemp": "feels_like_temperature",
        "hum": "humidity",
        "cnt": "rented_bikes",
    }, inplace=True)

# Convert every remaining non-float column to `float64`
cols = bike_sharing.select_dtypes(exclude=['float64']).columns
bike_sharing[cols] = bike_sharing[cols].astype('float64')

## Data Visualization

In this section, you will employ various visualization techniques to better understand the data. By creating graphical
representations of the data, you can identify patterns, trends, and correlations that might not be evident from the raw
data alone. This step is crucial in guiding the subsequent analysis and model building process.

In [None]:
hour_of_day_agg = bike_sharing.groupby(["hour_of_day"])["rented_bikes"].sum()

hour_of_day_agg.plot(
    kind="line", 
    title="Total rented bikes by hour of day",
    xticks=hour_of_day_agg.index,
    figsize=(10, 5),
)

plt.show()

## Prepare Training and Test Datasets

In this section, you will partition the data into training and test datasets. This is a crucial step in the machine
learning workflow, allowing you to train the model on a subset of the data (the training set), and then evaluate its
performance on unseen data (the test set). This process helps ensure that your model generalizes well to new data and is
not simply memorizing the training data.
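
As a rough illustration of what a random 70/30 split does, the sketch below shuffles row indices and cuts them at the 70% mark. The helper name `split_indices` is hypothetical, and this is not how `train_test_split` is implemented internally; it only shows the idea on ten toy rows.

```python
import random


def split_indices(n_samples, train_fraction=0.7, seed=42):
    """Shuffle row indices and cut them into a training and a test part."""
    indices = list(range(n_samples))
    # A fixed seed makes the shuffle reproducible between runs
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


# Ten toy rows: seven go to training, three to testing
train_idx, test_idx = split_indices(10)
print(f"Training rows: {len(train_idx)}, test rows: {len(test_idx)}")
```

Every row ends up in exactly one of the two parts, so the test rows are never seen during training.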

In [None]:
# Split the dataset randomly into 70% for training and 30% for testing.
X = bike_sharing.drop("rented_bikes", axis=1)
y = bike_sharing.rented_bikes
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)

# Use len() to count rows; DataFrame.size would return rows * columns
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

## Establishing Evaluation Metrics

Before proceeding to the training stage, you will define the evaluation metrics that will be used to assess the
performance of your model. These metrics will provide quantitative measures of the model's accuracy, helping you
understand how well the model is performing and where improvements can be made. This step is crucial in ensuring that
your model meets the desired performance standards.

### Root Mean Square Error (RMSE)

One of the evaluation metrics you will use is the Root Mean Square Error (RMSE). This metric provides a measure of the
differences between the values predicted by the model and the actual values. By taking the square root of the average of
these squared differences, RMSE can give you a sense of the magnitude of the prediction errors. Lower RMSE values
indicate a better fit of the model to the data.
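
To make the definition concrete, here is a small worked example in plain Python. The function name `rmse_plain` and the three toy values are assumptions chosen for illustration; they are not taken from the dataset.

```python
import math


def rmse_plain(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [10, 20, 30]
predicted = [12, 18, 33]

# The squared errors are 4, 4 and 9, so the RMSE is sqrt(17 / 3), roughly 2.38
example_rmse = rmse_plain(actual, predicted)
print(f"RMSE: {example_rmse:.4f}")
```

Because the errors are squared before averaging, the single error of 3 contributes more than the two errors of 2 combined.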

References: 
- https://medium.com/@xaviergeerinck/artificial-intelligence-how-to-measure-performance-accuracy-precision-recall-f1-roc-rmse-611d10e4caac
- https://www.kaggle.com/residentmario/model-fit-metrics#Root-mean-squared-error-(RMSE)

In [None]:
def rmse(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

def rmse_score(y, y_pred):
    score = rmse(y, y_pred)
    message = "RMSE score: {:.4f}".format(score)
    return score, message

### Cross-Validation RMSLE Score

Another evaluation metric you will employ is the Root Mean Squared Logarithmic Error (RMSLE) score, calculated through
cross-validation. Cross-validation is a robust technique that averages measures of prediction accuracy to derive a more
precise estimate of model performance.
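
The idea behind k-fold cross-validation can be sketched in plain Python. The generator `kfold_indices` below is a hypothetical helper, not scikit-learn's `KFold`; it assumes four shuffled folds and only illustrates how each row is used for validation exactly once.

```python
import random


def kfold_indices(n_samples, n_splits=4, seed=42):
    """Yield (training, validation) index lists for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds of near-equal size
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        validation = folds[k]
        training = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield training, validation


splits = list(kfold_indices(10))
for training, validation in splits:
    print(f"train on {len(training)} rows, validate on {len(validation)} rows")
```

A model would be fitted once per split and scored on the held-out fold; the mean and standard deviation of those scores form the cross-validated estimate.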

The RMSLE score is especially valuable in your situation as it penalizes underestimates more than overestimates.
Therefore, it is an essential metric for a bike sharing demand prediction model, ensuring that you avoid scenarios
where the available number of bikes falls short of the demand.
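
The asymmetry can be checked with a short example. The helper `rmsle_plain` and the toy demand of 100 bikes are assumptions made for illustration; the sketch compares an underestimate and an overestimate that are both off by 50 bikes.

```python
import math


def rmsle_plain(y_true, y_pred):
    """Root mean squared error computed on log(1 + value)."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Same absolute error of 50 bikes in both directions
under_error = rmsle_plain([100], [50])   # roughly 0.68
over_error = rmsle_plain([100], [150])   # roughly 0.40

print(f"Underestimate: {under_error:.4f}, overestimate: {over_error:.4f}")
```

The underestimate receives the larger penalty because the metric works on ratios rather than absolute differences.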

References: 
- https://en.wikipedia.org/wiki/Cross-validation_(statistics)
- https://www.kaggle.com/carlolepelaars/understanding-the-metric-rmsle

In [None]:
def rmsle_clip(estimator, x, y):
    """Clip negative prediction numbers before calculating RMSLE."""
    y_pred = estimator.predict(x)
    y_pred_clipped = np.clip(y_pred, a_min=0, a_max=None)
    # Take the square root explicitly; the `squared` argument is not available in every scikit-learn release
    return np.sqrt(sklearn.metrics.mean_squared_log_error(y, y_pred_clipped))

def rmsle_cv(model, X_train, y_train):
    # Pass the splitter itself: get_n_splits() returns only the integer 4, which drops the shuffling
    kf = KFold(n_splits=4, shuffle=True, random_state=42)
    # Evaluate RMSLE score by cross-validation
    rmsle = cross_val_score(model, X_train.values, y_train, scoring=rmsle_clip, cv=kf, error_score="raise")
    return rmsle

def rmsle_cv_score(model, X_train, y_train):
    score = rmsle_cv(model, X_train, y_train)
    message = "Cross-Validation RMSLE score: {:.4f} (std = {:.4f})".format(score.mean(), score.std())
    return score, message

## Feature Importance

In this section, you will analyze the importance of each feature in the dataset. Feature importance refers to techniques
that assign a score to input features based on how useful they are at predicting a target variable.

Understanding which features are most influential in predicting the target variable can provide valuable insights into
the dataset and the underlying model. This can help you interpret the model's predictions, and can guide further data
collection and feature engineering efforts.

References:
- https://medium.com/bigdatarepublic/feature-importance-whats-in-a-name-79532e59eea3

In [None]:
def model_feature_importance(model):
    feature_importance = pd.DataFrame(
        model.feature_importances_,
        index=X_train.columns,
        columns=["Importance"])

    # sort by importance
    feature_importance.sort_values(by="Importance", ascending=False, inplace=True)

    # plot
    plt.figure(figsize=(10, 4))
    sns.barplot(
        data=feature_importance.reset_index(),
        y="index",
        x="Importance",
    ).set_title("Feature Importance")

    # save image
    plt.savefig("model_artifacts/feature_importance.png", bbox_inches='tight')

## Permutation Importance

Permutation Importance is a technique used to measure feature importance. It works by randomly shuffling a single
feature in the validation data and measuring the decrease in the model's performance. The features that cause the most
significant drop in performance are considered the most important.
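
The procedure can be sketched in plain Python on a toy model. Everything here is an assumption for illustration: the helper names, the toy data, and the use of the increase in mean squared error as the performance drop (scikit-learn's `permutation_importance` uses the estimator's score instead and repeats the shuffle several times).

```python
import random


def mse_plain(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_plain(predict, rows, targets, n_features, seed=42):
    """Increase in MSE after shuffling each feature column once."""
    rng = random.Random(seed)
    baseline = mse_plain(targets, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        # Shuffle only column j, leaving all other columns untouched
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        permuted = mse_plain(targets, [predict(r) for r in shuffled_rows])
        importances.append(permuted - baseline)
    return importances


def toy_predict(row):
    # Toy model that depends on feature 0 only and ignores feature 1
    return 3 * row[0]


rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
targets = [toy_predict(r) for r in rows]
toy_importances = permutation_importance_plain(toy_predict, rows, targets, 2)
print(toy_importances)
```

Shuffling the feature the model relies on increases the error, while shuffling the ignored feature leaves it unchanged.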

This method provides a straightforward way to interpret the influence of each feature on the model's predictions. It can
help you understand which features are driving the model's decisions and where you might focus your attention for
further data analysis or feature engineering.

References:
- https://www.kaggle.com/dansbecker/permutation-importance

In [None]:
def model_permutation_importance(model):
    p_importance = permutation_importance(model, X_test, y_test, random_state=42, n_jobs=-1)

    # sort by importance
    sorted_idx = p_importance.importances_mean.argsort()[::-1]
    p_importance = pd.DataFrame(
        data=p_importance.importances[sorted_idx].T,
        columns=X_train.columns[sorted_idx]
    )

    # plot
    plt.figure(figsize=(10, 4))
    sns.barplot(
        data=p_importance,
        orient="h"
    ).set_title("Permutation Importance")

    # save image
    plt.savefig("model_artifacts/permutation_importance.png", bbox_inches="tight")

## MLflow Tracking

In this phase, you will use MLflow Tracking, a component of MLflow that logs and tracks experiment data. This includes
parameters, metrics, and artifacts of machine learning models during the training process.

MLflow Tracking provides a centralized repository for metadata associated with your experiments, making it easier to
compare different runs, reproduce results, and share findings with your team. This is a crucial step in maintaining an
organized and efficient machine learning workflow.

First, define a helper function that logs the details of each run.

References:
- https://www.mlflow.org/docs/latest/cli.html#mlflow-ui

In [None]:
# Track params and metrics
def log_mlflow_run(model, signature):
    # Auto-logging for scikit-learn estimators
    # mlflow.sklearn.autolog()

    # log the estimator name
    name = model.__class__.__name__
    mlflow.set_tag("estimator_name", name)

    # log input features
    mlflow.set_tag("features", str(X_train.columns.values.tolist()))

    # Log tracked parameters only
    mlflow.log_params({key: model.get_params()[key] for key in parameters})

    mlflow.log_metrics({
        'RMSLE_CV': score_cv.mean(),
        'RMSE': score})

    # log training loss, one step per boosting iteration
    for step, s in enumerate(model.train_score_):
        mlflow.log_metric("Train Loss", s, step=step)

    # Save model to artifacts
    mlflow.sklearn.log_model(model, "model")  # optionally: signature=signature

    # log charts
    mlflow.log_artifacts("model_artifacts")

    # misc
    # Log all model parameters
    # mlflow.log_params(model.get_params())
    mlflow.log_param("Training size", len(X_train))
    mlflow.log_param("Test size", len(X_test))

## Model Training and Hyperparameter Tuning

In this section, you will focus on training the model and tuning its hyperparameters. For this particular use case, you
will employ the following approach:

- Approach: You will use a Supervised Learning method, specifically a Decision Tree model. Decision Trees are intuitive
  and easy-to-interpret models that make decisions based on a set of rules inferred from the features.
- Tree Type: Given that the task is to predict a continuous target variable (the count of total rental bikes), you will
  use a Regression Tree.
- Technique/Ensemble Method: To improve the performance of your Decision Tree model, you will use an ensemble method
  known as Gradient Boosting. Gradient Boosting combines several weak learners (in this case, Decision Trees) to create
  a robust predictive model. It trains models in a gradual, additive, and sequential manner, with each new model
  correcting the errors made by the previous ones.
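
The sequential, additive nature of gradient boosting can be sketched in plain Python. This is a simplified illustration under stated assumptions, not scikit-learn's `GradientBoostingRegressor`: it uses squared error, a single feature, depth-one trees (stumps), and hypothetical helper names and toy data.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, losses


# Toy V-shaped demand curve over 24 hours (made-up values)
xs = [float(h) for h in range(24)]
ys = [abs(h - 12) * 10.0 for h in range(24)]
predict, losses = fit_boosted_stumps(xs, ys)
print(f"Loss after first round: {losses[0]:.2f}, after last round: {losses[-1]:.2f}")
```

A smaller `learning_rate` shrinks each correction, which typically requires more rounds but makes the fit less sensitive to any single weak learner.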

By carefully tuning the hyperparameters of your Gradient Boosting model, you can optimize its performance and ensure it
generalizes well to new data.

References:
- GBRT (Gradient Boosted Regression Trees): https://orbi.uliege.be/bitstream/2268/163521/1/slides.pdf
- Choosing a model: https://scikit-learn.org/stable/tutorial/machine_learning_map
- Machine Learning Models Explained: https://docs.paperspace.com/machine-learning/wiki/machine-learning-models-explained

In [None]:
# GBRT (Gradient Boosted Regression Tree) scikit-learn implementation 
model_class = GradientBoostingRegressor

Define the hyperparameter grid for the training process.

In [None]:
parameters = {
    "learning_rate": [0.1, 0.05, 0.01],
    "max_depth": [4, 5, 6],
    # "verbose": True,
}

To optimize the performance of your model, you will tune its hyperparameters using a method known as Grid Search.

Grid Search is a traditional method for hyperparameter tuning. It works by defining a grid of hyperparameters and then
evaluating the model performance for each point on the grid. You can think of this as an exhaustive search through a
manually specified subset of the hyperparameter space of the chosen algorithm.

By using Grid Search, you can systematically work through multiple combinations of hyperparameters to determine the
optimal values that improve the performance of the model. This process can significantly enhance the predictive accuracy
of your model.
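
As a minimal sketch of the idea, the example below runs an exhaustive search over a small grid and keeps the combination with the lowest error. The `grid_search` helper and the `toy_error` function are hypothetical; in a real search the error would come from training and evaluating a model for each combination.

```python
import itertools


def grid_search(grid, evaluate):
    """Evaluate every combination in the grid and return the best one."""
    keys = list(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


toy_grid = {
    "learning_rate": [0.1, 0.05, 0.01],
    "max_depth": [4, 5, 6],
}


def toy_error(params):
    # Stand-in for a real validation error; lowest at (0.05, 5) by construction
    return abs(params["learning_rate"] - 0.05) + abs(params["max_depth"] - 5)


best_params, best_score = grid_search(toy_grid, toy_error)
print(best_params, best_score)
```

The number of evaluations is the product of the list lengths (here 3 x 3 = 9), which is why grid search becomes expensive as more hyperparameters are added.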

References:
- More advanced tuning techniques: https://research.fb.com/efficient-tuning-of-online-systems-using-bayesian-optimization/

In [None]:
# generate parameters combinations
params_keys = parameters.keys()
params_values = [
    parameters[key] if isinstance(parameters[key], list) else [parameters[key]]
    for key in params_keys]

runs_parameters = [
    dict(zip(params_keys, combination))
    for combination in itertools.product(*params_values)]

## Model Training

Now that you have prepared the data and set up the model, the next step is to train the model. During this process, the
model will learn from the features of the training data to predict the target variable.

Model training involves adjusting the model to minimize the difference between the predicted and actual values, a
process guided by a specific learning algorithm. In your case, you are using a Gradient Boosting model, which will learn
to correct its errors in a gradual, additive, and sequential manner.

This is a crucial step in the machine learning workflow, as the quality of the model's predictions heavily depends on
the effectiveness of the training process.

In [None]:
# training loop
for i, run_parameters in enumerate(runs_parameters):
    # mlflow: stop active runs if any
    if mlflow.active_run():
        mlflow.end_run()
    # mlflow:track run
    mlflow.start_run(run_name=f"Run {i}")

    # create model instance
    model = model_class(**run_parameters)

    # train
    model.fit(X_train, y_train)

    # get evaluation scores (predict once and reuse the result)
    y_pred = model.predict(X_test)
    score, message = rmse_score(y_test, y_pred)
    score_cv, message_cv = rmsle_cv_score(model, X_train, y_train)

    # get model signature
    signature = infer_signature(model_input=X_train,
                                model_output=model.predict(X_train))

    # mlflow: log metrics
    log_mlflow_run(model, signature)

    # mlflow: end tracking
    mlflow.end_run()

    print(f"Learning Rate: {run_parameters['learning_rate']}\n"
          f"Max Depth: {run_parameters['max_depth']}\n"
          f"{message}\n"
          f"{message_cv}\n")

## Best Model Results

After training several models and tuning their hyperparameters, you will identify the model that performs the best
according to the chosen evaluation metrics.

In [None]:
best_run_df = mlflow.search_runs(order_by=['metrics.RMSLE_CV ASC'],
                                 max_results=1)
if len(best_run_df.index) == 0:
    raise Exception(f"Found no runs for experiment '{experiment_name}'")

best_run = mlflow.get_run(best_run_df.at[0, 'run_id'])
best_model_uri = f"{best_run.info.artifact_uri}/model"
best_model = mlflow.sklearn.load_model(best_model_uri)

In [None]:
# Print best run info
print("Best run info:")
print(f"Run id: {best_run.info.run_id}")
print(f"Run parameters: {best_run.data.params}")
print(f"Run score: RMSLE_CV = {best_run.data.metrics['RMSLE_CV']:.4f}")
print(f"Run model URI: {best_model_uri}")

In [None]:
model_feature_importance(best_model)

In [None]:
model_permutation_importance(best_model)

## Model Testing

Once you have identified the best model, the next step is to test its predictive performance on unseen data. This is
done using the test dataset, which has been set aside specifically for this purpose.

Testing the model's predictions allows you to evaluate how well the model generalizes to new data. This is a crucial
step in the machine learning process, as it provides a realistic estimate of the model's performance in a real-world
setting.

You will compare the model's predictions with the actual values in the test dataset and calculate your chosen evaluation
metrics. These results will give you a clear indication of the model's predictive accuracy.

In [None]:
test_predictions = X_test.copy()
# real output (rented_bikes) from test dataset
test_predictions["rented_bikes"] = y_test

# add the model's predictions for the test dataset as "predicted_rented_bikes"
test_predictions["predicted_rented_bikes"] = best_model.predict(X_test).astype(int)

# show results
test_predictions.head()

In [None]:
# plot truth vs prediction values
test_predictions.plot(
    kind="scatter",
    x="rented_bikes",
    y="predicted_rented_bikes",
    title="Rented bikes vs predicted rented bikes",
    figsize=(10, 10)
)

plt.show()

## Model Deployment

In this section of the notebook, you will deploy the trained model, bridging the gap between data analysis and
real-world use. For this, you will use KServe, an open-source platform for deploying and managing machine learning
models at scale. It provides a scalable infrastructure for serving predictions from trained models in production
environments. The backend that you'll be using for KServe is Seldon.

In [None]:
manifest = f"""
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kserve-minio-sa
secrets:
- name: {os.getenv('USER')}-objectstore-secret

---
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bike-sharing"
spec:
  predictor:
    serviceAccountName: kserve-minio-sa
    sklearn:
      protocolVersion: "v2"
      storageUri: "{best_model_uri}"
"""

os.makedirs("manifests", exist_ok=True)

with open(os.path.join("manifests", "isvc.yaml"), "w") as f:
    f.write(manifest)

In [None]:
res = subprocess.run(["kubectl", "apply", "-f", "manifests/isvc.yaml"])
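
Once the InferenceService is ready, the model can be queried over the V2 inference protocol requested in the manifest. The sketch below only builds the JSON request body and does not send it. The helper `build_v2_request`, the input name `input-0`, and the sample feature values are assumptions for illustration; the row is assumed to follow the column order of `X_train`, and the body would be posted to the service's `/v2/models/bike-sharing/infer` endpoint.

```python
import json


def build_v2_request(rows, input_name="input-0"):
    """Build a V2 inference protocol request body for a batch of rows."""
    n_rows = len(rows)
    n_features = len(rows[0])
    # The protocol expects tensor data together with its shape and datatype
    flat = [float(value) for row in rows for value in row]
    return {
        "inputs": [
            {
                "name": input_name,  # assumed name, check your serving runtime
                "shape": [n_rows, n_features],
                "datatype": "FP64",
                "data": flat,
            }
        ]
    }


# Hypothetical sample: season, year, month, hour_of_day, is_holiday, weekday,
# is_workingday, weather_situation, temperature, feels_like_temperature,
# humidity, windspeed
sample_row = [1, 0, 1, 8, 0, 1, 1, 1, 0.24, 0.29, 0.81, 0.0]
payload = build_v2_request([sample_row])
body = json.dumps(payload)
print(body)
```

The response to such a request carries an `outputs` list in the same tensor layout, from which the predicted count can be read.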