## Machine Learning II
### Specialization Program in Artificial Intelligence  

#### **MODEL VERSIONING USING MLFLOW**

This example comes from the MLflow [documentation](https://github.com/mlflow/mlflow/tree/master/examples/sklearn_elasticnet_wine), with some modifications to track our models using SQLite.  

To reproduce the results shown in class, follow these steps:

* Although it is not mandatory, it is **highly** recommended to create a new environment to manage the dependencies and ensure everything runs correctly.  
If we are using conda, we can create a new environment with the command:
`conda create -n mlflow python=3.8`  
Running this command creates an environment named "mlflow" (change the name if you wish) with Python 3.8.  

We can activate the new environment by running:  
`conda activate mlflow`  
_(If you chose a different name for the environment, replace "mlflow" with that name.)_


To finish setting up the environment, we need to install the following dependencies:  

  - scikit-learn==1.2.0
  - mlflow
  - pandas

To do this, we can run `pip install library_name` from the conda console.  
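
Once the packages are installed, a quick way to confirm which versions ended up in the active environment is to query the package metadata. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of MLflow or pandas.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the dependency list above
    report = installed_versions(["scikit-learn", "mlflow", "pandas"])
    for name, version in report.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If scikit-learn does not report 1.2.0, the results may differ from the ones shown in class.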

* After configuring the environment, open the conda command prompt and move to the directory where this notebook is saved. We recommend keeping it in a dedicated folder, since some supporting files will be generated as we work through the tutorial.

If you have never navigated a command console before, [here](http://www.falconmasters.com/offtopic/como-utilizar-consola-de-windows/#:~:text=Para%20acceder%20a%20ella%20lo,en%20la%20consola%20de%20windows.) is a short tutorial (in Spanish) with the most useful commands.


* Once inside the folder where this notebook is stored, we need to tell MLflow that we will use SQLite as the backend to store our registered models. To do so, run the following from the conda command prompt:  
`mlflow server --backend-store-uri sqlite:///mydb.sqlite`  
After running this command, two things will be created in the folder: a directory where MLflow's artifacts are stored and a database for the model registry.  
The following message should also appear in the console:
`INFO:waitress:Serving on http://127.0.0.1:5000`

* That IP address corresponds to our localhost, and 5000 is the port number where the MLflow UI is served.  
If we copy and paste that http address into a web browser, we can access the UI.  
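
Before pointing the notebook at the server, it can be handy to confirm that something is actually listening on that address. The sketch below uses only the standard library; `tracking_server_reachable` is a hypothetical helper name, and it only checks that the TCP port accepts connections, not that the process behind it is MLflow.

```python
import socket
from urllib.parse import urlparse


def tracking_server_reachable(uri, timeout=2.0):
    """Return True if a TCP connection to the host and port in `uri` succeeds."""
    parsed = urlparse(uri)
    host = parsed.hostname
    if host is None:
        return False
    # Fall back to the scheme's default port when the URI does not name one
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Address printed by `mlflow server` in the step above
    print(tracking_server_reachable("http://127.0.0.1:5000"))
```

If this prints `False`, the server is probably not running, or it is running on a different port.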



In [1]:
import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn

import logging

In [2]:
mlflow.set_tracking_uri("http://127.0.0.1:5000/")
mlflow.set_experiment("Wine_prediction_experiment")

<Experiment: artifact_location='mlflow-artifacts:/1', creation_time=1683674403505, experiment_id='1', last_update_time=1683674403505, lifecycle_stage='active', name='Wine_prediction_experiment', tags={}>

In [3]:
logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)


def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2
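
`eval_metrics` takes the true values first and the predictions second. For RMSE and MAE the order does not change the result, but R² is not symmetric, so swapping the arguments silently reports a different score. A small pure-Python sketch with made-up numbers illustrates this (the `rmse` and `r2` helpers here are illustrative re-implementations, not the scikit-learn functions used above):

```python
import math


def rmse(actual, pred):
    """Root mean squared error; symmetric in its two arguments."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))


def r2(actual, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot around mean(actual)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values, for illustration only
actual = [3.0, 5.0, 7.0]
pred = [4.0, 5.0, 6.0]

print(rmse(actual, pred) == rmse(pred, actual))  # True: order does not matter
print(r2(actual, pred))  # 0.75
print(r2(pred, actual))  # 0.0 (arguments swapped)
```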

In [35]:
# Imports for the customModel
import pickle
import cloudpickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from mlflow.models.signature import infer_signature


class CustomModelPredictor(mlflow.pyfunc.PythonModel):
    """
    Custom model predictor that standardizes the features and fits a
    random forest regressor on them.

    It derives from the mlflow PythonModel base class and has three
    main methods: fit, predict and load_context.
    """

    def __init__(self, model_params: dict):

        self.preprocessor = StandardScaler()
        self.model = RandomForestRegressor(**model_params)

    def load_context(self, context) -> None:
        """
        This method will not be explicitly used in this module, but it is
        necessary for mlflow to understand how to load what has been
        trained further on, by loading the needed artifacts.

        :param context: A :class:`~PythonModelContext` instance containing artifacts that the model
                        can use to perform inference.
        """

        with open(context.artifacts["preprocessor"], "rb") as f:
            self.preprocessor = pickle.load(f)
        with open(context.artifacts["estimator"], "rb") as f:
            self.model = pickle.load(f)

        return None

    def fit(self, X_train: pd.DataFrame, y_train: pd.DataFrame, X_val=None, y_val=None):
        """
        This method will take a pandas DataFrame, fit the model, save that as a
        serialized pickle object and return the signature of the model

        :param X_train: the training input samples.
        :type X_train: pd.DataFrame

        :param y_train: the training output target.
        :type y_train: pd.DataFrame

        :param X_val: the validation input samples.
        :type X_val: pd.DataFrame

        :param y_val: the validation output target.
        :type y_val: pd.DataFrame

        :return: signature of the model
        :rtype: ModelSignature
        """

        # TRAINING THE PIPELINE

        self.preprocessor.fit(X=X_train, y=y_train)
        X_train_transformed = self.preprocessor.transform(X=X_train)
        self.model.fit(X_train_transformed, y_train)

        # MODEL EVALUATION AND LOGGING METRICS

        # Train metrics
        y_train_pred = self.model.predict(X_train_transformed)
        # eval_metrics expects (actual, pred); the order matters for R2
        rmse, mae, r2 = eval_metrics(y_train, y_train_pred)
        mlflow.log_metric("train_rmse", rmse)
        mlflow.log_metric("train_r2", r2)
        mlflow.log_metric("train_mae", mae)

        if X_val is not None and y_val is not None:
            # Validation metrics
            X_val_transformed = self.preprocessor.transform(X_val)
            y_val_pred = self.model.predict(X_val_transformed)
            # eval_metrics expects (actual, pred); the order matters for R2
            rmse, mae, r2 = eval_metrics(y_val, y_val_pred)
            mlflow.log_metric("val_rmse", rmse)
            mlflow.log_metric("val_r2", r2)
            mlflow.log_metric("val_mae", mae)

        # Dumping fitted objects
        with open("./fitted_model.pkl", "wb") as f:
            cloudpickle.dump(self.model, f)
        with open("./fitted_preprocessor.pkl", "wb") as f:
            cloudpickle.dump(self.preprocessor, f)

        # Inferring signature to return
        signature = infer_signature(X_train, y_train)

        return signature

    def predict(self, context, X: pd.DataFrame):
        """
        This method will evaluate a proper input with the loaded context
        and return the predicted output.

        :param context: the PythonModelContext supplied by mlflow (not used here)
        :param X: the input DataFrame with the features of the model
        :type X: pd.DataFrame

        :return: wine quality values predicted by the model
        :rtype: np.ndarray
        """
        X_transformed = self.preprocessor.transform(X)
        predictions = self.model.predict(X_transformed)
        
        return predictions
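
`load_context` is the method MLflow calls when the logged model is loaded back: it receives a context whose `artifacts` dictionary maps the names chosen at logging time to local file paths. The round trip can be sketched without MLflow, assuming a stand-in context object (`SimpleNamespace` is only a placeholder for the real `PythonModelContext`, and `TinyPredictor` with its dictionary "scaler" is a hypothetical toy, not the class above):

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


class TinyPredictor:
    """Mimics the load_context/predict pair of a pyfunc model."""

    def __init__(self):
        self.scaler = None

    def load_context(self, context):
        # Restore the fitted object from the path stored under its artifact name
        with open(context.artifacts["preprocessor"], "rb") as f:
            self.scaler = pickle.load(f)

    def predict(self, context, xs):
        # Standardize with the parameters recovered in load_context
        return [(x - self.scaler["mean"]) / self.scaler["scale"] for x in xs]


with tempfile.TemporaryDirectory() as tmp:
    # "Training" side: dump the fitted parameters to disk
    path = os.path.join(tmp, "fitted_preprocessor.pkl")
    with open(path, "wb") as f:
        pickle.dump({"mean": 10.0, "scale": 2.0}, f)

    # "Serving" side: placeholder for the context MLflow would build
    context = SimpleNamespace(artifacts={"preprocessor": path})
    predictor = TinyPredictor()
    predictor.load_context(context)
    result = predictor.predict(context, [8.0, 10.0, 14.0])

print(result)  # [-1.0, 0.0, 2.0]
```

The key names in `artifacts` are the contract between the code that logs the model and `load_context`, so they must match on both sides.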


In [41]:
if __name__ == "__main__":
    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Read the wine-quality csv file from the URL
    csv_url = (
        "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-red.csv"
    )
    try:
        data = pd.read_csv(csv_url, sep=";")
    except Exception as e:
        logger.exception(
            "Unable to download training & test CSV, check your internet connection. Error: %s", e
        )
        # Re-raise: without the data the rest of the cell cannot run
        raise

    # Hold out 25% of the data for testing, then 25% of the remainder for validation.
    train, test = train_test_split(data, test_size=0.25, random_state=20, shuffle=True)
    train, val = train_test_split(train, test_size=0.25, random_state=20, shuffle=True)

    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    val_x = val.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]
    val_y = val[["quality"]]

    model_params = {
                    'max_depth' : 10,
                    'n_estimators' : 27
                    }

    with mlflow.start_run():
        CMP = CustomModelPredictor(model_params = model_params)
        signature = CMP.fit(train_x, train_y, val_x, val_y)

        mlflow.log_param("model_params", model_params)

        # Log the model
        # conda_env = "conda.yaml"
        conda_env = {
            "name": "mlflow-env",
            "channels": ["defaults", "anaconda", "conda-forge"],
            "dependencies": [
                "python==3.8.16",
                "cloudpickle==1.6.0",
                # Same scikit-learn version as the one installed in the setup steps
                "scikit-learn==1.2.0",
                "ortools==9.4.1874",
            ],
        }
        artifacts = {
            "preprocessor": "./fitted_preprocessor.pkl",
            "estimator": "./fitted_model.pkl",
        }

        mlflow.pyfunc.log_model(
            artifact_path="model",
            artifacts=artifacts,
            python_model=CMP,
            conda_env=conda_env,
            signature=signature,
        )
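
The two `train_test_split` calls in the cell above are applied in sequence, so the second 25% is taken from what remains after the first. Below is a quick arithmetic sketch of the resulting sizes, assuming the red-wine CSV has 1,599 rows and that a fractional test share is rounded up (which, to the best of our knowledge, is how scikit-learn behaves); `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, test_size=0.25, val_size=0.25):
    """Return (n_train, n_val, n_test) for two successive hold-out splits."""
    n_test = math.ceil(n_rows * test_size)     # first split: test set
    remaining = n_rows - n_test
    n_val = math.ceil(remaining * val_size)    # second split: validation set
    n_train = remaining - n_val
    return n_train, n_val, n_test


n_rows = 1599  # assumed number of rows in winequality-red.csv
n_train, n_val, n_test = split_sizes(n_rows)
print(n_train, n_val, n_test)  # 899 300 400
print([round(n / n_rows, 3) for n in (n_train, n_val, n_test)])  # [0.562, 0.188, 0.25]
```

Under those assumptions the model is fitted on roughly 56% of the rows, validated on about 19% and tested on 25%.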
        

In [42]:
model_production = mlflow.pyfunc.load_model('models:/ElasticnetWineModel/production')

 - mlflow (current: 2.3.2, required: mlflow==2.3)
To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.


In [43]:
predicted_qualities = model_production.predict(test_x)

(rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
print("Custom model test metrics")
print("  RMSE: %s" % rmse)
print("  MAE: %s" % mae)
print("  R2: %s" % r2)

Custom model test metrics
  RMSE: 0.6078979712332924
  MAE: 0.453315437430186
  R2: 0.3770858210818864
