<center>
<h1>LightGBM model optimization</h1>
</center>

---


To conclude this series of experiments, we will use the optuna package to run a toy hyper-parameter search. The best model will be saved to disk. To this end we will employ a 70:15:15 split of the data for training, validation (early stopping) and a held-out test set to quantify the final performance. The final model will be re-trained on 90% of the data before export.
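
The 70:15:15 split described above can be sketched on row indices as follows. This is only an illustration with NumPy: the helper name `split_indices` and the seed are assumptions made for this example, and the actual splitting logic lives inside `LGBMModelHandler`, which may differ.

```python
import numpy as np

def split_indices(n_samples, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle row indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_val = int(round(val_frac * n_samples))
    n_test = int(round(test_frac * n_samples))
    n_train = n_samples - n_val - n_test
    train_idx = shuffled[:n_train]                 # used to fit the model
    val_idx = shuffled[n_train:n_train + n_val]    # used for early stopping
    test_idx = shuffled[n_train + n_val:]          # held out for the final score
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Keeping the test indices separate from the validation indices matters here because the validation set already influences training through early stopping.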

In [None]:
import os
import sys 
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import lightgbm as lgbm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import optuna
import mlflow
mlflow.autolog()
mlflow.set_experiment("Peptide retention time regression")

sns.set_style("darkgrid")

sys.path.append("..")
from src.data import load_data, preprocess_data
from src.models import LGBMModelHandler
from src.util import rMAE, rMSE

## Define the Optuna trial as a function that selects hyper-parameters and returns the test set rMAE

From here on, we will use the `src.models.LGBMModelHandler` class for training and evaluation.

In [None]:
data = load_data("../data/Peptides_and_iRT.tsv")

In [None]:
def objective(trial):
    
    # Categorical switch: 1 selects CountVectorizer, 0 selects TfidfVectorizer
    v = trial.suggest_categorical("vectorizer", [0, 1])
    vectorizer = CountVectorizer if v == 1 else TfidfVectorizer

    params = {
        "objective": "regression",
        "metric": "l2",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.1),
        "max_bin": trial.suggest_int("max_bin", 2, 256),
        "num_leaves": trial.suggest_int("num_leaves", 2, 51),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 2, 50),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "max_depth": trial.suggest_int("max_depth", 1, 100),
        "lambda_l1" : trial.suggest_float("lambda_l1", 0.0, 1.0),
        "lambda_l2" : trial.suggest_float("lambda_l2", 0.0, 1.0),
    }

    # Encode the sampled configuration in the model name
    param_tag = "_".join(f"{key}:{val}" for key, val in params.items())
    model_name = f"LGBM {param_tag}_{vectorizer.__name__}"

    lgbMH = LGBMModelHandler(model_name=model_name,
                             data=data,
                             val_frac=0.15,
                             test_frac=0.15,
                             vectorizer=vectorizer,
                             logging_on=False, # MLflow doesn't like being in a trial
                             model_parameters=params)
    
    lgbMH.train_eval()
    
    return lgbMH.eval()['rMAE']
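
The trial above switches between `CountVectorizer` and `TfidfVectorizer`. To make the difference concrete, here is a small sketch on made-up peptide strings. It assumes character-level tokenization (`analyzer="char"`), which is a guess for illustration; the settings that `LGBMModelHandler` actually passes to the vectorizer are not shown in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy sequences, invented for this example
peptides = ["AAGK", "LLEK", "AGLK"]

# Raw amino-acid counts per peptide
count_vec = CountVectorizer(analyzer="char", lowercase=False)
counts = count_vec.fit_transform(peptides)

# Counts re-weighted by inverse document frequency, rows scaled to unit length
tfidf_vec = TfidfVectorizer(analyzer="char", lowercase=False)
tfidf = tfidf_vec.fit_transform(peptides)

print(sorted(count_vec.vocabulary_))   # column order of both matrices
print(counts.toarray())
print(tfidf.toarray().round(2))
```

Both vectorizers share the same vocabulary; they differ only in how the entries are weighted, which is why the choice can be treated as a single categorical hyper-parameter.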

### For illustrative purposes, run 10 trials only

In [None]:
sampler = optuna.samplers.TPESampler(seed=42) 
study = optuna.create_study(direction="minimize", sampler=sampler)
study.optimize(objective, n_trials=10)

print(f"Number of finished trials: {len(study.trials)}")

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

## Re-run the best trial with more training data, and save it locally

In [None]:
# study.best_params returns a fresh dict on each access, so pop from an explicit copy
best_params = dict(study.best_params)
v = best_params.pop('vectorizer')

vectorizer = CountVectorizer if v == 1 else TfidfVectorizer
data = load_data("../data/Peptides_and_iRT.tsv")

lgbMH = LGBMModelHandler(model_name="LGBM best model",
                            data=data,
                            val_frac=0.1,
                            vectorizer=vectorizer,
                            model_parameters=best_params)

# Train first, so that the model written to disk is the fitted one
print(lgbMH.train_eval())

mlflow.lightgbm.save_model(lgbMH.model, os.path.join("..", "models", lgbMH.model_name))

### We can now load the checkpoint and re-use it

In [None]:
X = pd.concat([lgbMH.X_train, lgbMH.X_val])
y = pd.concat([lgbMH.y_train, lgbMH.y_val])

In [None]:
loaded = mlflow.lightgbm.load_model(os.path.join("..", "models", lgbMH.model_name))

In [None]:
loaded.score(lgbMH.X_val, lgbMH.y_val)
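
If the loaded object is a scikit-learn style regressor, `score` reports the coefficient of determination R². That is an assumption about what was saved: `mlflow.lightgbm.load_model` can also return a native `lightgbm.Booster`, which has no `score` method. A minimal sketch of the quantity, using made-up numbers and a hypothetical helper name:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Invented values, for illustration only
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
r2 = r2_score_manual(y_true, y_pred)
print(r2)
```

A value of 1.0 means perfect predictions, while 0.0 corresponds to always predicting the mean of `y_true`.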

In [None]:
plt.figure()

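y_true = lgbMH.y_train
y_pred = loaded.predict(lgbMH.X_train)
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=y_true, y=y_pred, marker='+', color="darkred")

# Identity line: perfect predictions would fall on it
plt.plot([-100, 150], [-100, 150], color="black", lw=0.75)

plt.xlabel("iRT measured")
plt.ylabel("iRT predicted")

plt.title("Optimized LGBM predictions vs. ground truth (training set)")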
plt.tight_layout()

plt.show()

In [None]:
# plot_importance creates its own figure when no axis is passed
lgbm.plot_importance(loaded, color='darkred', figsize=(8,10))

plt.tight_layout()
plt.show()

<h1> Conclusions </h1>

Using LightGBM and some hyper-parameter optimization, we have found a model that achieves a test set rMAE of 3.89. With MLflow, we can track all of our experiments, register models, reproduce runs, and review the inputs and outputs of each model.

Furthermore, the model can be deployed to production immediately, or loaded from a local checkpoint. MLflow also supports model serving and building Docker images with built-in commands (you will need to run `pip install mlflow[extras]`):

```> mlflow models serve -m <MODEL> --enable-mlserver```

```> mlflow models build-docker -m <MODEL> --enable-mlserver -n <IMAGE_NAME>```
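
Once a model is being served, it can be queried over HTTP. The sketch below builds a request body in the `dataframe_split` JSON format accepted by MLflow 2.x scoring servers and posts it to the `/invocations` endpoint. The URL assumes the default local host and port of `mlflow models serve`, and the feature names and values are hypothetical: the real columns depend on the vectorizer output the model was trained on.

```python
import json
import urllib.request

def build_payload(columns, rows):
    """Build a JSON body in MLflow's `dataframe_split` format."""
    return json.dumps({"dataframe_split": {"columns": columns, "data": rows}})

def request_predictions(payload, url="http://127.0.0.1:5000/invocations"):
    """POST the payload to a running scoring server and return the parsed reply."""
    req = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Hypothetical feature names and values, for illustration only
payload = build_payload(["feature_0", "feature_1"], [[0.0, 1.0], [2.0, 0.0]])
print(payload)

# Uncomment once a server has been started with `mlflow models serve`:
# print(request_predictions(payload))
```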