# <a id='toc1_'></a>[**Introduction to MLFlow and MLOps**](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [**Introduction to MLFlow and MLOps**](#toc1_)    
  - [**Why MLFlow?**](#toc1_1_)    
  - [**What Can MLFlow Do?**](#toc1_2_)    
- [**Hands-On MLFlow**](#toc2_)    
  - [**Basic Usage: Autologging**](#toc2_1_)    
  - [**Viewing Results Through the UI**](#toc2_2_)    
  - [**Creating Experiments and Designing Logic**](#toc2_3_)    
  - [**Where Does MLFlow Store Data?**](#toc2_4_)    
  - [**Retrieving Models from MLFlow**](#toc2_5_)    
  - [**Register models**](#toc2_6_)    
  - [**Extra**](#toc2_7_)    
    - [**Nested Experiments**](#toc2_7_1_)    
    - [**Setting Up AWS Storage**](#toc2_7_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[**Why MLFlow?**](#toc0_)
![MLOps](https://raw.githubusercontent.com/dsml-bootcamp-1/nbs-6-master/refs/heads/master/s-601-602/image_ops.png)

Machine learning models go through several stages: data preprocessing, training, evaluation, deployment, and monitoring. 
Ensuring consistency and reproducibility across these stages is a crucial aspect of MLOps (Machine Learning Operations). 

MLFlow is a tool designed to streamline this process by providing a centralized system to manage and track:
- Experiments and their results (e.g., parameters, metrics)
- Models and their artifacts (e.g., saved files, plots, images)
- Packaged models that are easy to retrieve and deploy

## <a id='toc1_2_'></a>[**What Can MLFlow Do?**](#toc0_)
MLFlow can store:
- **Models**: Trained models in various formats (e.g., TensorFlow, PyTorch, Scikit-Learn)
- **Parameters**: Hyperparameters used for training
- **Metrics**: Evaluation metrics (e.g., accuracy, loss)
- **Artifacts**: Additional files (e.g., images, plots, HTML reports)
- **Data**: Input and output data (e.g., CSVs, dataframes)

![MLFlow Overview](../../../../img/mlflow.png)

In [None]:
# Install MLFlow if not already installed
!pip install mlflow

In [None]:
# Check mlflow version
import mlflow
mlflow.__version__

In [None]:
# Check sklearn version
import sklearn
sklearn.__version__

In [None]:
# If the version is higher than 1.0.2, then downgrade (needed for autologging)
# !pip install scikit-learn==1.0.2

In [None]:
import numpy
numpy.__version__

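### Troubleshooting
If an import fails with the error below, it typically means that an installed package was compiled against a different NumPy version than the one currently installed. Pinning NumPy (next cell) and restarting the kernel usually resolves it.
> `ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject`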

In [None]:
!pip install numpy==1.26.4


# <a id='toc2_'></a>[**Hands-On MLFlow**](#toc0_)

## <a id='toc2_1_'></a>[**Basic Usage: Autologging**](#toc0_)

MLFlow provides an easy-to-use `autolog` feature. Let's start by training a simple model and see how MLFlow tracks everything.


In [None]:
#!pip install plotly
#!pip install -U kaleido

In [None]:
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score
import plotly.express as px
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [None]:
# Load dataset
data = load_breast_cancer()
data

In [None]:
# Create X, y split
X = pd.DataFrame(data["data"], columns=data.feature_names)
X.head()

In [None]:
y = pd.Series(data.target)
y.value_counts()

In [None]:
# Train-test split - ideal?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
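
One possible answer to the "ideal?" question in the comment above: because the two classes are not equally frequent, a stratified split keeps the class proportions the same in the train and test sets. The sketch below uses the `stratify` argument of `train_test_split`; the `_strat` variable names are assumptions chosen so that the split used in the rest of the notebook is not overwritten.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data["data"], columns=data.feature_names)
y = pd.Series(data.target)

# stratify=y preserves the class ratio of y in both subsets
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare class proportions in the full data and in the test set
print(y.value_counts(normalize=True).round(3))
print(y_test_strat.value_counts(normalize=True).round(3))
```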

In [None]:
# Enable autologging for Sklearn
mlflow.sklearn.autolog()

# Train a simple model
with mlflow.start_run():
    # Instantiate and fit classifier
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    
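    # Add custom metrics - ROC-AUC, PR-AUC
    pred = clf.predict_proba(X_test)[:, 1]
    mlflow.log_metric("ROC-AUC", roc_auc_score(y_test, pred))
    mlflow.log_metric("PR-AUC", average_precision_score(y_test, pred))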

In [None]:
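# Point MLFlow at a local directory; replace the placeholder below with
# a file URI that contains an absolute path to the directory you want to use
mlflow.set_tracking_uri("file:///")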


## <a id='toc2_2_'></a>[**Viewing Results Through the UI**](#toc0_)

Start the MLFlow UI to visualize your logged experiments:


In [None]:
# Run this in your terminal (not in Jupyter)
# mlflow ui

# Can also change the port
# mlflow ui --port=8080


Navigate to `http://localhost:5000` (or the port you chose) to see your experiments.

![MLFlow UI Screenshot](https://mlflow.org/docs/latest/_images/quickstart-our-experiment.png) 



## <a id='toc2_3_'></a>[**Creating Experiments and Designing Logic**](#toc0_)
You can explicitly create experiments and log data, custom metrics, tags and other artifacts.

In [None]:
# Set experiment name
mlflow.set_experiment("breast-cancer-classification")

In [None]:
feat_imp_df = pd.DataFrame(
        {
            "importance": list(clf.feature_importances_), 
            "feature": list(clf.feature_names_in_)
        }
    )
feat_imp_df

In [None]:
# Log parameters, metrics, and artifacts
with mlflow.start_run(run_name="Random Forest"):
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    
    # Set run tags - features, feature_no, data size
    mlflow.set_tag("feat_selection", "all")
    mlflow.set_tag("feature_no", len(X_train.columns))
    mlflow.set_tag("features", X_train.columns.to_list())

    # Log predictions
    pred = clf.predict_proba(X_test)    
    pred_df = pd.DataFrame(pred, columns=["prediction_score_0", "prediction_score_1"])
    mlflow.log_table(pred_df.reset_index(), artifact_file="results/predictions.json")
    
    # Log custom metrics manually
    mlflow.log_metric("ROC-AUC", roc_auc_score(y_test, pred_df["prediction_score_1"]))
    mlflow.log_metric("PR-AUC", average_precision_score(y_test, pred_df["prediction_score_1"]))
    
    # Log feature importance plot
    feat_imp_df = pd.DataFrame(
        {
            "importance": clf.feature_importances_, 
            "feature": clf.feature_names_in_
        }
    )
    feat_imp_df = feat_imp_df.sort_values(by="importance")
    fig = px.bar(x=feat_imp_df.importance, y=feat_imp_df.feature)
    fig.update_layout(height=800, width=600)
    mlflow.log_figure(fig, artifact_file="plots/feature_importances.png")

In [None]:
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test) 
pd.DataFrame(pred)

In [None]:
from sklearn.linear_model import LogisticRegression

# Log parameters, metrics, and artifacts
with mlflow.start_run(run_name="Logistic Regression"):
    clf = LogisticRegression()
    clf.fit(X_train.iloc[:, :-2], y_train)
    
    # Set run tags - features, feature_no, data size
    mlflow.set_tag("feat_selection", "manual")
    mlflow.set_tag("feature_no", len(X_train.iloc[:, :-2].columns))
    mlflow.set_tag("features", X_train.iloc[:, :-2].columns.to_list())


## <a id='toc2_4_'></a>[**Where Does MLFlow Store Data?**](#toc0_)

Depending on the backend setup, MLFlow stores data in:
- **Local filesystem**: e.g., the `./mlruns` directory; suitable for quick tests but slow
- **Local SQLite database**: lightweight and easy to set up
- **Cloud storage**: AWS S3, Google Cloud Storage, etc., for large-scale deployments

To configure MLFlow to use a SQLite backend:


In [None]:
# Example command to run in the terminal (not in Jupyter);
# use either `mlflow server` or `mlflow ui`
# mlflow server \
#    --backend-store-uri sqlite:///mlruns.db \
#    --default-artifact-root ./mlruns

In [None]:
# Check where experiments are saved
mlflow.get_tracking_uri()

In [None]:
# Set tracking uri
mlflow.set_tracking_uri("sqlite:///mlruns.db")
mlflow.set_experiment("breast-cancer-classification")

In [None]:
# Log parameters, metrics, and artifacts
for i in range(1, 11):
    with mlflow.start_run(run_name="Random Forest"):
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        clf.fit(X_train.iloc[:, :i], y_train)
        mlflow.sklearn.log_model(clf, artifact_path="my_model")

        # Set run tags - features, feature_no, data size
        mlflow.set_tag("feat_selection", "sequential")
        mlflow.set_tag("feature_no", len(X_train.iloc[:, :i].columns))
        mlflow.set_tag("features", X_train.iloc[:, :i].columns.to_list())

        # Log predictions
        pred = clf.predict_proba(X_test.iloc[:, :i])    
        pred_df = pd.DataFrame(pred, columns=["prediction_score_0", "prediction_score_1"])
        mlflow.log_table(pred_df.reset_index(), artifact_file="results/predictions.json")

        # Log custom metrics manually
        mlflow.log_metric("ROC-AUC", roc_auc_score(y_test, pred_df["prediction_score_1"]))
        mlflow.log_metric("PR-AUC", average_precision_score(y_test, pred_df["prediction_score_1"]))

        # Log feature importance plot
        feat_imp_df = pd.DataFrame(
            {
                "importance": clf.feature_importances_, 
                "feature": clf.feature_names_in_
            }
        )
        feat_imp_df = feat_imp_df.sort_values(by="importance")
        fig = px.bar(x=feat_imp_df.importance, y=feat_imp_df.feature)
        fig.update_layout(height=800, width=600)
        mlflow.log_figure(fig, artifact_file="plots/feature_importances.png")


## <a id='toc2_5_'></a>[**Retrieving Models from MLFlow**](#toc0_)

Search through runs to find the models worth retrieving - more filtering tips [here](https://mlflow.org/docs/latest/search-runs.html).

In [None]:
# Search all sequential feature-selection runs with ROC-AUC higher than 0.99
runs = mlflow.search_runs(
    experiment_names=["breast-cancer-classification"],
    filter_string="""metrics.`ROC-AUC` > 0.99
    AND tags.feat_selection LIKE 'sequential'
    """
)
runs

You can load previously saved models for inference.

In [None]:
# Sklearn autologging stores the model under the "model" folder, but
# it can also be logged explicitly under any artifact path
mlflow.sklearn.log_model(clf, artifact_path="my_model")

In [None]:
# Load model using run_id (replace the value below with a run ID from your own runs)
run_id = "018dd9c9aec8463a8eec9fd44ddb98be"
model_uri = f"runs:/{run_id}/model"
loaded_model = mlflow.sklearn.load_model(model_uri)

# Use the model for predictions
loaded_model.predict(X_test.iloc[:, :7])

In [None]:
loaded_model.predict_proba(X_test.iloc[:, :7])

In [None]:
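# The loaded object is a regular scikit-learn estimator, so it can be refit
# (shown for illustration only; avoid fitting on test data in practice)
loaded_model.fit(X_test.iloc[:, :-7], y_test)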

## <a id='toc2_6_'></a>[**Register models**](#toc0_)

Models can be registered either through the UI or via code:

In [None]:
# Register model using runs:/ location
mlflow.register_model(model_uri=f"runs:/{run_id}/model", name="breast-cancer-classification")


## <a id='toc2_7_'></a>[**Extra**](#toc0_)

### <a id='toc2_7_1_'></a>[**Nested Experiments**](#toc0_)
MLFlow allows nested runs for tracking hierarchical experiments. This can be useful if you want to group results from cross-validation folds in separate runs but keep the same attributes.


In [None]:
from sklearn.model_selection import StratifiedKFold

# Create stratified KFold
cross_validator = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# split() returns a one-shot generator, so store the folds in a list for reuse
cv_splits = list(cross_validator.split(X, y))

for train_indices, test_indices in cv_splits:
    print(train_indices)
    print(test_indices)

In [None]:
for i, elem in enumerate(["red", "brown", "blue"]):
    print(i, elem)

In [None]:
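# Create nested cross-validation
mlflow.sklearn.autolog(disable=False)
mlflow.set_tracking_uri("sqlite:///mlruns.db")
mlflow.set_experiment('breast-cancer-classification')

# The parent run is a top-level run; only the child runs need nested=True
with mlflow.start_run(run_name="Random Forest") as parent_run:
    # Log features
    mlflow.set_tag("features", X.columns.to_list())

    # Call split() again here because it returns a one-shot generator
    for i, (train_split, test_split) in enumerate(cross_validator.split(X, y)):
        with mlflow.start_run(run_name=f"Random Forest {i}", nested=True):
            # New train-test split for this fold
            X_fold_train, y_fold_train = X.iloc[train_split], y.iloc[train_split]
            X_fold_test, y_fold_test = X.iloc[test_split], y.iloc[test_split]

            clf = RandomForestClassifier(n_estimators=100, random_state=42)
            clf.fit(X_fold_train, y_fold_train)

            # Use same logging as before
            mlflow.set_tag("fold", i)
            pred = clf.predict_proba(X_fold_test)[:, 1]
            mlflow.log_metric("ROC-AUC", roc_auc_score(y_fold_test, pred))
            mlflow.log_metric("PR-AUC", average_precision_score(y_fold_test, pred))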


### <a id='toc2_7_2_'></a>[**Setting Up AWS Storage**](#toc0_)
You can configure MLFlow to use a PostgreSQL database hosted on AWS (for example, on RDS) as the metadata store and AWS S3 as the artifact store:


In [None]:
# Run in terminal (use either `mlflow server` or `mlflow ui`)
# mlflow server \
#     --backend-store-uri 'postgresql://user_name:password@link_to_your_aws_postgresql_db:port/database_name' \
#     --default-artifact-root s3://your-bucket-name

In [None]:
mlflow.set_tracking_uri("postgresql://user_name:password@link_to_your_aws_postgresql_db:port/database_name")
mlflow.create_experiment("name", artifact_location="s3://your-bucket-name")
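
Hard-coding credentials in a notebook is risky, and passwords containing characters such as `@` or `/` must be URL-encoded before they go into a database URI. The helper below is a hypothetical sketch, not part of MLFlow: the function name, the environment variable names, and the fallback values are all assumptions for illustration.

```python
import os
from urllib.parse import quote_plus


def build_postgres_uri(user, password, host, port, database):
    """Assemble a PostgreSQL URI, URL-encoding the user name and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Assumption: credentials come from (hypothetical) environment variables,
# with placeholder fallbacks so the snippet runs without them
tracking_uri = build_postgres_uri(
    user=os.environ.get("MLFLOW_DB_USER", "user_name"),
    password=os.environ.get("MLFLOW_DB_PASSWORD", "password"),
    host=os.environ.get("MLFLOW_DB_HOST", "link_to_your_aws_postgresql_db"),
    port=os.environ.get("MLFLOW_DB_PORT", "5432"),
    database=os.environ.get("MLFLOW_DB_NAME", "database_name"),
)

# The resulting string can then be passed to mlflow.set_tracking_uri(...)
print(build_postgres_uri("user", "p@ss/word", "host", 5432, "mlflow"))
```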