# <a id='toc1_'></a>[**Introduction to MLFlow and MLOps**](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [**Introduction to MLFlow and MLOps**](#toc1_)    
  - [**Why MLFlow?**](#toc1_1_)    
  - [**What Can MLFlow Do?**](#toc1_2_)    
- [**Hands-On MLFlow**](#toc2_)    
  - [**Basic Usage: Autologging**](#toc2_1_)    
  - [**Viewing Results Through the UI**](#toc2_2_)    
  - [**Creating Experiments and Designing Logic**](#toc2_3_)    
  - [**Where Does MLFlow Store Data?**](#toc2_4_)    
  - [**Retrieving Models from MLFlow**](#toc2_5_)    
  - [**Register models**](#toc2_6_)    
  - [**Extra**](#toc2_7_)    
    - [**Nested Experiments**](#toc2_7_1_)    
    - [**Setting Up AWS Storage**](#toc2_7_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[**Why MLFlow?**](#toc0_)
![MLOps](https://raw.githubusercontent.com/dsml-bootcamp-1/nbs-6-master/refs/heads/master/s-601-602/image_ops.png)

Machine learning models go through several stages: data preprocessing, training, evaluation, deployment, and monitoring. 
Ensuring consistency and reproducibility across these stages is a crucial aspect of MLOps (Machine Learning Operations). 

MLFlow is a tool designed to streamline this process by providing a centralized system to manage and track:
- Experiments and their results (e.g., parameters, metrics)
- Models and their artifacts (e.g., saved files, plots, images)
- Deployment logic, so that models are easy to retrieve and serve

## <a id='toc1_2_'></a>[**What Can MLFlow Do?**](#toc0_)
MLFlow can store:
- **Models**: Trained models in various formats (e.g., TensorFlow, PyTorch, Scikit-Learn)
- **Parameters**: Hyperparameters used for training
- **Metrics**: Evaluation metrics (e.g., accuracy, loss)
- **Artifacts**: Additional files (e.g., images, plots, HTML reports)
- **Data**: Input and output data (e.g., CSVs, dataframes)

![MLFlow Overview](../../../../img/mlflow.png)

In [None]:
# Install MLFlow if not already installed
# !pip install mlflow

In [None]:
# Check mlflow version
import mlflow
print(mlflow.__version__)

In [None]:
# Check sklearn version
import sklearn
print(sklearn.__version__)

In [None]:
# If the version is higher than 1.0.2, then downgrade (needed for autologging)
# !pip install scikit-learn==1.0.2


# <a id='toc2_'></a>[**Hands-On MLFlow**](#toc0_)

## <a id='toc2_1_'></a>[**Basic Usage: Autologging**](#toc0_)

MLFlow provides an easy-to-use `autolog` feature. Let's start by training a simple model and see how MLFlow tracks everything.


In [1]:
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score
import plotly.express as px
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [2]:
# Load dataset
data = load_breast_cancer()
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [3]:
# Create X, y split
X = data["data"]
y = data["target"]

In [4]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
# Enable autologging for Sklearn
mlflow.autolog()

# Train a simple model
with mlflow.start_run(run_name="Random Forest"):
    # Instantiate and fit classifier
    rf = RandomForestClassifier(n_estimators=100, max_depth=2)
    rf.fit(X_train, y_train)
    rf.score(X_test, y_test)

    # Add custom metrics - ROC-AUC, PR-AUC
    pred_proba = rf.predict_proba(X_test)[:, 1]  # probability of the positive class
    mlflow.log_metric("test_roc_auc", roc_auc_score(y_test, pred_proba))
    mlflow.log_metric("test_average_precision", average_precision_score(y_test, pred_proba))

    # Any number of additional metrics can be logged the same way

2025/02/20 19:31:24 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.



## <a id='toc2_2_'></a>[**Viewing Results Through the UI**](#toc0_)

Start the MLFlow UI to visualize your logged experiments:


In [None]:
# Run this in your terminal (not in Jupyter)
# mlflow ui

# Can also change the port
# mlflow ui --port=8080


Navigate to `http://localhost:5000` (the default port, or whichever port you passed via `--port`) to see your experiments.

![MLFlow UI Screenshot](https://mlflow.org/docs/latest/_images/quickstart-our-experiment.png) 



## <a id='toc2_3_'></a>[**Creating Experiments and Designing Logic**](#toc0_)
You can explicitly create experiments and log data, custom metrics, tags and other artifacts.

In [23]:
# Inspect the training data
X_train

array([[9.029e+00, 1.733e+01, 5.879e+01, ..., 1.750e-01, 4.228e-01,
        1.175e-01],
       [2.109e+01, 2.657e+01, 1.427e+02, ..., 2.903e-01, 4.098e-01,
        1.284e-01],
       [9.173e+00, 1.386e+01, 5.920e+01, ..., 5.087e-02, 3.282e-01,
        8.490e-02],
       ...,
       [1.429e+01, 1.682e+01, 9.030e+01, ..., 3.333e-02, 2.458e-01,
        6.120e-02],
       [1.398e+01, 1.962e+01, 9.112e+01, ..., 1.827e-01, 3.179e-01,
        1.055e-01],
       [1.218e+01, 2.052e+01, 7.722e+01, ..., 7.431e-02, 2.694e-01,
        6.878e-02]])

In [16]:
# mlflow.create_experiment(name='breast_cancer')
mlflow.set_experiment('breast_cancer')
mlflow.autolog()

# Log parameters, metrics, and artifacts
with mlflow.start_run(run_name="Random Forest"):
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(pd.DataFrame(X_train, columns=data["feature_names"]), y_train)
    
    # Set run tags - features, null handling, feature selection
    mlflow.set_tag("features", data["feature_names"])
    mlflow.set_tag("null_handling", None) # keep, fill, drop
    mlflow.set_tag("feat_selection", None) # correlation, manual, statistics

    # Log predictions
    pred_proba = pd.DataFrame(clf.predict_proba(X_test))
    mlflow.log_table(pred_proba, artifact_file="results/pred_proba.json")
    # mlflow.log_table(pred_proba, artifact_file="results/pred_proba_2.csv")
    # mlflow.log_table(pred_proba, artifact_file="results/pred_proba_3.xlsx")

    
    # Log feature importance plot (Plotly figures are saved as interactive HTML)
    # fig = px.bar(x=clf.feature_importances_, y=clf.feature_names_in_)
    # mlflow.log_figure(fig, artifact_file="plots/feature_importance.html")

2025/02/20 20:44:05 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [22]:
# The fitted attribute is `feature_names_in_` (set because the model was fit on a DataFrame);
# `clf.feature_names` does not exist and raises an AttributeError
clf.feature_names_in_

In [18]:
clf.feature_importances_

array([0.04870337, 0.01359088, 0.05326975, 0.04755501, 0.00728533,
       0.01394433, 0.06800084, 0.10620999, 0.00377029, 0.00388577,
       0.02013892, 0.00472399, 0.01130301, 0.02240696, 0.00427091,
       0.00525322, 0.00938583, 0.00351326, 0.00401842, 0.00532146,
       0.07798688, 0.02174901, 0.06711483, 0.15389236, 0.01064421,
       0.02026604, 0.0318016 , 0.14466327, 0.01012018, 0.00521012])


## <a id='toc2_4_'></a>[**Where Does MLFlow Store Data?**](#toc0_)

Depending on the backend setup, MLFlow stores data in:
- **Local filesystem** (e.g., `./mlruns` directory, suitable for quick tests but slow)
- **Local SQLite Database**: Lightweight and easy to set up
- **Cloud storage**: AWS S3, Google Cloud Storage, etc., for large-scale deployments

To configure MLFlow to use a SQLite backend:


In [None]:
# Example command to run in terminal (not in Jupyter)
# `mlflow ui` accepts these two options as well
# mlflow server \
#    --backend-store-uri sqlite:///mlruns.db \
#    --default-artifact-root ./mlruns

In [10]:
# Set tracking uri
mlflow.set_tracking_uri('sqlite:///mlruns.db')


## <a id='toc2_5_'></a>[**Retrieving Models from MLFlow**](#toc0_)

Search through runs; more filtering tips are available [here](https://mlflow.org/docs/latest/search-runs.html).

In [13]:
# Search all runs whose `features` tag contains 'mean radius'
mlflow.search_runs(
    experiment_names=["breast_cancer"],
    filter_string="""
    tags.features LIKE '%mean radius%'
"""
)

| | run_id | experiment_id | status | artifact_uri | start_time | end_time | metrics.training_score | metrics.training_precision_score | metrics.training_roc_auc | metrics.training_f1_score | ... | tags.mlflow.log-model.history | tags.mlflow.loggedArtifacts | tags.mlflow.source.name | tags.mlflow.runName | tags.mlflow.user | tags.estimator_class | tags.feat_selection | tags.estimator_name | tags.mlflow.source.type | tags.features |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ddc2b734437442749e2a4257b796e957 | 1 | FINISHED | `file:///c:/Users/SabinaFirtala/Desktop/project...` | 2025-02-20 19:30:54.826000+00:00 | 2025-02-20 19:31:10.053000+00:00 | 1.0 | 1.0 | 1.0 | 1.0 | ... | `[{"run_id": "ddc2b734437442749e2a4257b796e957"...` | `[{"path": "results/pred_proba.json", "type": "...` | `c:\Users\SabinaFirtala\anaconda3\envs\lizzy_de...` | Random Forest | SabinaFirtala | sklearn.ensemble._forest.RandomForestClassifier | | RandomForestClassifier | LOCAL | `['mean radius' 'mean texture' 'mean perimeter'...` |


You can load previously saved models for inference.

In [21]:
from mlflow.tracking import MlflowClient
from pprint import pprint
client = MlflowClient()

# Get the latest registered version of the model
latest_version = client.get_latest_versions("breast_cancer_tabular")[0]
pprint(latest_version)

<ModelVersion: aliases=[], creation_timestamp=1740080701391, current_stage='None', description='', last_updated_timestamp=1740080701391, name='breast_cancer_tabular', run_id='901eca0c65d645bfae835c0038a53bc0', run_link='', source='file:///c:/Users/SabinaFirtala/Desktop/projects/ironhack/ironhack-v4-data-lessons/cohorts/15_oct_24/05_ml/optional/code_along/mlruns/1/901eca0c65d645bfae835c0038a53bc0/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=2>




In [22]:
# Load model using run_id
run_id = latest_version.run_id
model_uri = f"runs:/{run_id}/model"  # "model" is the artifact path the model was logged under
loaded_model = mlflow.sklearn.load_model(model_uri)

# Use the model for predictions
loaded_model.predict(X_test)



array([1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0])

## <a id='toc2_6_'></a>[**Register models**](#toc0_)

This can be done either through the UI or via code:

In [None]:
# Register model using runs:/ location
mlflow.register_model(model_uri=f"runs:/{run_id}/model", name="breast_cancer_tabular")


## <a id='toc2_7_'></a>[**Extra**](#toc0_)

### <a id='toc2_7_1_'></a>[**Nested Experiments**](#toc0_)
MLFlow allows nested runs for tracking hierarchical experiments. This is useful, for example, when you want to log each cross-validation fold as its own child run while logging the shared attributes (features, tags) once on the parent run.


In [None]:
from sklearn.model_selection import StratifiedKFold

# Create stratified KFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_splits = list(skf.split(X, y))

In [None]:
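# Create nested cross-validation
# Assumes cv_splits holds (train_indices, test_indices) pairs, e.g. from StratifiedKFold.split(X, y)
with mlflow.start_run(run_name="Random Forest") as parent_run:
    # Log features
    mlflow.set_tag("features", data["feature_names"])

    for i, (train_split, test_split) in enumerate(cv_splits):
        with mlflow.start_run(run_name=f"Random Forest {i}", nested=True):
            # New train-test split
            X_tr, X_te = X[train_split], X[test_split]
            y_tr, y_te = y[train_split], y[test_split]

            clf = RandomForestClassifier(n_estimators=100, random_state=42)
            clf.fit(X_tr, y_tr)

            # Use same logging as before
            pred_proba = clf.predict_proba(X_te)[:, 1]
            mlflow.log_metric("test_roc_auc", roc_auc_score(y_te, pred_proba))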


### <a id='toc2_7_2_'></a>[**Setting Up AWS Storage**](#toc0_)
You can configure MLFlow to use a PostgreSQL database hosted on AWS (for example, on Amazon RDS) as the metadata store and an AWS S3 bucket as the artifact store:


In [None]:
# Run in terminal
# mlflow server \
#     --backend-store-uri 'postgresql://user_name:password@link_to_your_aws_postgresql_db:port/database_name' \
#     --default-artifact-root s3://your-bucket-name
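
Database passwords often contain characters such as `@` or `/`, which must be URL-encoded before they are placed in the backend store URI. The helper below is a hypothetical convenience function, not part of MLFlow, and the user, password, host, and database name are placeholders.

```python
from urllib.parse import quote_plus


def build_postgres_uri(user, password, host, port, database):
    """Build a SQLAlchemy-style PostgreSQL URI with URL-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values for illustration only.
uri = build_postgres_uri(
    user="mlflow_user",
    password="p@ss/word",
    host="example-host.rds.amazonaws.com",
    port=5432,
    database="mlflow",
)
print(uri)
```

The resulting string can be passed to `--backend-store-uri`. In practice, credentials are better read from environment variables than hard-coded in a notebook.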