# <a id='toc1_'></a>[**Introduction to MLFlow and MLOps**](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [**Introduction to MLFlow and MLOps**](#toc1_)    
  - [**Why MLFlow?**](#toc1_1_)    
  - [**What Can MLFlow Do?**](#toc1_2_)    
- [**Hands-On MLFlow**](#toc2_)    
  - [**Basic Usage: Autologging**](#toc2_1_)    
  - [**Viewing Results Through the UI**](#toc2_2_)    
  - [**Creating Experiments and Designing Logic**](#toc2_3_)    
  - [**Where Does MLFlow Store Data?**](#toc2_4_)    
  - [**Retrieving Models from MLFlow**](#toc2_5_)    
  - [**Register models**](#toc2_6_)    
  - [**Extra**](#toc2_7_)    
    - [**Nested Experiments**](#toc2_7_1_)    
    - [**Setting Up AWS Storage**](#toc2_7_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[**Why MLFlow?**](#toc0_)
![MLOps](https://raw.githubusercontent.com/dsml-bootcamp-1/nbs-6-master/refs/heads/master/s-601-602/image_ops.png)

Machine learning models go through several stages: data preprocessing, training, evaluation, deployment, and monitoring. 
Ensuring consistency and reproducibility across these stages is a crucial aspect of MLOps (Machine Learning Operations). 

MLFlow is a tool designed to streamline this process by providing a centralized system to manage and track:
- Experiments and their results (e.g., parameters, metrics)
- Models and their artifacts (e.g., saved files, plots, images)
- Deployment logic for easy retrieval and deployment

## <a id='toc1_2_'></a>[**What Can MLFlow Do?**](#toc0_)
MLFlow can store:
- **Models**: Trained models in various formats (e.g., TensorFlow, PyTorch, Scikit-Learn)
- **Parameters**: Hyperparameters used for training
- **Metrics**: Evaluation metrics (e.g., accuracy, loss)
- **Artifacts**: Additional files (e.g., images, plots, HTML reports)
- **Data**: Input and output data (e.g., CSVs, dataframes)

![MLFlow Overview](../../../../img/mlflow.png)

In [None]:
# Install MLFlow if not already installed
!pip install mlflow

In [None]:
# Check mlflow version
import mlflow
mlflow.__version__

In [None]:
# Check sklearn version
import sklearn
sklearn.__version__

In [None]:
# If the version is higher than 1.0.2, then downgrade (needed for autologging)
# !pip install scikit-learn==1.0.2

In [None]:
import numpy
numpy.__version__

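### Troubleshooting
If an import fails with the error below, NumPy and one of the compiled packages were built against different NumPy versions. In this setup, pinning NumPy as shown (and restarting the kernel afterwards) resolved it:
> `ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject`

In [None]:
!pip install numpy==1.26.4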


# <a id='toc2_'></a>[**Hands-On MLFlow**](#toc0_)

## <a id='toc2_1_'></a>[**Basic Usage: Autologging**](#toc0_)

MLFlow provides an easy-to-use `autolog` feature. Let's start by training a simple model and see how MLFlow tracks everything.


In [None]:
#!pip install plotly
#!pip install -U kaleido

In [1]:
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score
import plotly.express as px
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [2]:
# Load dataset
data = load_breast_cancer()
data

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [3]:
# Create X, y split
X = pd.DataFrame(data["data"], columns=data.feature_names)
X.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [4]:
y = pd.Series(data.target)
y.value_counts()

1    357
0    212
dtype: int64

In [5]:
# Train-test split - ideal?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# Enable autologging for Sklearn
mlflow.sklearn.autolog()

# Train a simple model
with mlflow.start_run():
    # Instantiate and fit classifier
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    
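    # Add custom metrics - ROC-AUC, PR-AUC
    pred_scores = clf.predict_proba(X_test)[:, 1]
    mlflow.log_metric("ROC-AUC", roc_auc_score(y_test, pred_scores))
    mlflow.log_metric("PR-AUC", average_precision_score(y_test, pred_scores))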

In [11]:
# Store runs in the local ./mlruns directory ("file:///" alone points at the filesystem root)
mlflow.set_tracking_uri("./mlruns")


## <a id='toc2_2_'></a>[**Viewing Results Through the UI**](#toc0_)

Start the MLFlow UI to visualize your logged experiments:


In [None]:
# Run this in your terminal (not in Jupyter)
# mlflow ui

# Can also change the port
# mlflow ui --port=8080


Navigate to `http://localhost:5000` (or the port you passed via `--port`) to see your experiments.

![MLFlow UI Screenshot](https://mlflow.org/docs/latest/_images/quickstart-our-experiment.png) 



## <a id='toc2_3_'></a>[**Creating Experiments and Designing Logic**](#toc0_)
You can explicitly create experiments and log data, custom metrics, tags and other artifacts.

In [7]:
# Set experiment name
mlflow.set_experiment("breast-cancer-classification")

2024/11/12 11:07:00 INFO mlflow.tracking.fluent: Experiment with name 'breast-cancer-classification' does not exist. Creating a new experiment.


<Experiment: artifact_location='file:///c:/Users/sabin/Downloads/freelancing/ironhack/ironhack-v4-data-lessons/cohorts/12_nov_24_sgf/10_extraweek/code_along_nb/mlruns/122193226159064133', creation_time=1731406020997, experiment_id='122193226159064133', last_update_time=1731406020997, lifecycle_stage='active', name='breast-cancer-classification', tags={}>

In [27]:
feat_imp_df

Unnamed: 0,importance,feature
0,"[0.048703371737755234, 0.013590877656998469, 0...","[mean radius, mean texture, mean perimeter, me..."


In [31]:
feat_imp_df = pd.DataFrame(
    {
        "importance": list(clf.feature_importances_),
        "feature": list(clf.feature_names_in_),
    }
)
feat_imp_df

Unnamed: 0,importance,feature
0,0.048703,mean radius
1,0.013591,mean texture
2,0.05327,mean perimeter
3,0.047555,mean area
4,0.007285,mean smoothness
5,0.013944,mean compactness
6,0.068001,mean concavity
7,0.10621,mean concave points
8,0.00377,mean symmetry
9,0.003886,mean fractal dimension


In [None]:
# Log parameters, metrics, and artifacts
with mlflow.start_run(run_name="Random Forest"):
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    
    # Set run tags - features, feature_no, data size
    mlflow.set_tag("feat_selection", "all")
    mlflow.set_tag("feature_no", len(X_train.columns))
    mlflow.set_tag("features", X_train.columns.to_list())

    # Log predictions
    pred = clf.predict_proba(X_test)    
    pred_df = pd.DataFrame(pred, columns=["prediction_score_0", "prediction_score_1"])
    mlflow.log_table(pred_df.reset_index(), artifact_file="results/predictions.json")
    
    # Log custom metrics manually
    mlflow.log_metric("ROC-AUC", roc_auc_score(y_test, pred_df["prediction_score_1"]))
    mlflow.log_metric("PR-AUC", average_precision_score(y_test, pred_df["prediction_score_1"]))
    
    # Log feature importance plot
    feat_imp_df = pd.DataFrame(
        {
            "importance": clf.feature_importances_, 
            "feature": clf.feature_names_in_
        }
    )
    feat_imp_df = feat_imp_df.sort_values(by="importance")
    fig = px.bar(x=feat_imp_df.importance, y=feat_imp_df.feature)
    fig.update_layout(height=800, width=600)
    mlflow.log_figure(fig, artifact_file="plots/feature_importances.png")

In [19]:
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test) 
pd.DataFrame(pred)

2024/11/12 11:26:14 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '86b1cf7d3dc944ca9e0ee6c708e070f4', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


array([[0.03, 0.97],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.01, 0.99],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.84, 0.16],
       [0.65, 0.35],
       [0.06, 0.94],
       [0.06, 0.94],
       [0.98, 0.02],
       [0.07, 0.93],
       [0.88, 0.12],
       [0.01, 0.99],
       [0.99, 0.01],
       [0.05, 0.95],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.15, 0.85],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.08, 0.92],
       [0.  , 1.  ],
       [0.04, 0.96],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.  , 1.  ],
       [0.01, 0.99],
       [0.23, 0.77],
       [0.02, 0.98],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.76, 0.24],
       [0.02, 0.98],
       [1.  , 0.  ],
       [0.1 , 0.9 ],
       [0.  , 1.  ],
       [0.98, 0.02],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.22, 0.78],
       [0.01, 0.99],
       [0.07, 0.93],
       [0.04,

In [16]:
from sklearn.linear_model import LogisticRegression

# Log parameters, metrics, and artifacts
with mlflow.start_run(run_name="Logistic Regression"):
    clf = LogisticRegression()
    clf.fit(X_train.iloc[:, :-2], y_train)
    
    # Set run tags - features, feature_no, data size
    mlflow.set_tag("feat_selection", "manual")
    mlflow.set_tag("feature_no", len(X_train.iloc[:, :-2].columns))
    mlflow.set_tag("features", X_train.iloc[:, :-2].columns.to_list())

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



## <a id='toc2_4_'></a>[**Where Does MLFlow Store Data?**](#toc0_)

Depending on the backend setup, MLFlow stores data in:
- **Local filesystem**: For example, the `./mlruns` directory; suitable for quick tests, but slow
- **Local SQLite database**: Lightweight and easy to set up
- **Cloud storage**: AWS S3, Google Cloud Storage, etc., for large-scale deployments

To configure MLFlow to use a SQLite backend:


In [None]:
# Example command to run in terminal (not in Jupyter)
# `mlflow ui` accepts the same options as `mlflow server`
# mlflow server \
#    --backend-store-uri sqlite:///mlruns.db \
#    --default-artifact-root ./mlruns

In [None]:
# Check where experiments are saved
mlflow.get_tracking_uri()

In [36]:
# Set tracking uri
mlflow.set_tracking_uri("sqlite:///mlruns.db")
mlflow.set_experiment("breast-cancer-classification")

2024/11/12 12:29:04 INFO mlflow.tracking.fluent: Experiment with name 'breast-cancer-classification' does not exist. Creating a new experiment.


<Experiment: artifact_location='file:///c:/Users/sabin/Downloads/freelancing/ironhack/ironhack-v4-data-lessons/cohorts/12_nov_24_sgf/10_extraweek/code_along_nb/mlruns/1', creation_time=1731410944231, experiment_id='1', last_update_time=1731410944231, lifecycle_stage='active', name='breast-cancer-classification', tags={}>

In [42]:
# Log parameters, metrics, and artifacts
for i in range(1, 11):
    with mlflow.start_run(run_name="Random Forest"):
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        clf.fit(X_train.iloc[:, :i], y_train)
        mlflow.sklearn.log_model(clf, artifact_path="my_model")

        # Set run tags - features, feature_no, data size
        mlflow.set_tag("feat_selection", "sequential")
        mlflow.set_tag("feature_no", len(X_train.iloc[:, :i].columns))
        mlflow.set_tag("features", X_train.iloc[:, :i].columns.to_list())

        # Log predictions
        pred = clf.predict_proba(X_test.iloc[:, :i])    
        pred_df = pd.DataFrame(pred, columns=["prediction_score_0", "prediction_score_1"])
        mlflow.log_table(pred_df.reset_index(), artifact_file="results/predictions.json")

        # Log custom metrics manually
        mlflow.log_metric("ROC-AUC", roc_auc_score(y_test, pred_df["prediction_score_1"]))
        mlflow.log_metric("PR-AUC", average_precision_score(y_test, pred_df["prediction_score_1"]))

        # Log feature importance plot
        feat_imp_df = pd.DataFrame(
            {
                "importance": clf.feature_importances_, 
                "feature": clf.feature_names_in_
            }
        )
        feat_imp_df = feat_imp_df.sort_values(by="importance")
        fig = px.bar(x=feat_imp_df.importance, y=feat_imp_df.feature)
        fig.update_layout(height=800, width=600)
        mlflow.log_figure(fig, artifact_file="plots/feature_importances.png")


## <a id='toc2_5_'></a>[**Retrieving Models from MLFlow**](#toc0_)

Search through the logged runs to find the model you need. More filtering tips are available [here](https://mlflow.org/docs/latest/search-runs.html).

In [40]:
# Search all sequential feature-selection runs with ROC-AUC higher than 0.99
runs = mlflow.search_runs(
    experiment_names=["breast-cancer-classification"],
    filter_string="""metrics.`ROC-AUC` > 0.99
    AND tags.feat_selection LIKE 'sequential'
    """
)
runs

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.training_recall_score,metrics.training_precision_score,metrics.training_roc_auc,metrics.training_f1_score,...,tags.mlflow.source.type,tags.mlflow.loggedArtifacts,tags.mlflow.runName,tags.mlflow.log-model.history,tags.feat_selection,tags.estimator_class,tags.mlflow.user,tags.feature_no,tags.estimator_name,tags.features
0,259b9a2d58284fef800f9e8c501187aa,1,FINISHED,file:///c:/Users/sabin/Downloads/freelancing/i...,2024-11-12 11:36:14.415000+00:00,2024-11-12 11:36:24.701000+00:00,1.0,1.0,1.0,1.0,...,LOCAL,"[{""path"": ""results/predictions.json"", ""type"": ...",Random Forest,"[{""run_id"": ""259b9a2d58284fef800f9e8c501187aa""...",sequential,sklearn.ensemble._forest.RandomForestClassifier,spyral,10,RandomForestClassifier,"['mean radius', 'mean texture', 'mean perimete..."
1,9dc440b03ce04950970c1fa1f6183682,1,FINISHED,file:///c:/Users/sabin/Downloads/freelancing/i...,2024-11-12 11:36:03.340000+00:00,2024-11-12 11:36:14.375000+00:00,1.0,1.0,1.0,1.0,...,LOCAL,"[{""path"": ""results/predictions.json"", ""type"": ...",Random Forest,"[{""run_id"": ""9dc440b03ce04950970c1fa1f6183682""...",sequential,sklearn.ensemble._forest.RandomForestClassifier,spyral,9,RandomForestClassifier,"['mean radius', 'mean texture', 'mean perimete..."
2,2a4cd1f00b2d478bbb07fb229abe0c7b,1,FINISHED,file:///c:/Users/sabin/Downloads/freelancing/i...,2024-11-12 11:35:52.656000+00:00,2024-11-12 11:36:03.302000+00:00,1.0,1.0,1.0,1.0,...,LOCAL,"[{""path"": ""results/predictions.json"", ""type"": ...",Random Forest,"[{""run_id"": ""2a4cd1f00b2d478bbb07fb229abe0c7b""...",sequential,sklearn.ensemble._forest.RandomForestClassifier,spyral,8,RandomForestClassifier,"['mean radius', 'mean texture', 'mean perimete..."
3,018dd9c9aec8463a8eec9fd44ddb98be,1,FINISHED,file:///c:/Users/sabin/Downloads/freelancing/i...,2024-11-12 11:35:42.619000+00:00,2024-11-12 11:35:52.626000+00:00,1.0,1.0,1.0,1.0,...,LOCAL,"[{""path"": ""results/predictions.json"", ""type"": ...",Random Forest,"[{""run_id"": ""018dd9c9aec8463a8eec9fd44ddb98be""...",sequential,sklearn.ensemble._forest.RandomForestClassifier,spyral,7,RandomForestClassifier,"['mean radius', 'mean texture', 'mean perimete..."
4,6fc088081c8646458273964a9ed42dfb,1,FINISHED,file:///c:/Users/sabin/Downloads/freelancing/i...,2024-11-12 11:35:32.515000+00:00,2024-11-12 11:35:42.590000+00:00,1.0,1.0,1.0,1.0,...,LOCAL,"[{""path"": ""results/predictions.json"", ""type"": ...",Random Forest,"[{""run_id"": ""6fc088081c8646458273964a9ed42dfb""...",sequential,sklearn.ensemble._forest.RandomForestClassifier,spyral,6,RandomForestClassifier,"['mean radius', 'mean texture', 'mean perimete..."
5,14b149b3ee1d469aa9db64c784f87159,1,FINISHED,file:///c:/Users/sabin/Downloads/freelancing/i...,2024-11-12 11:35:22.741000+00:00,2024-11-12 11:35:32.492000+00:00,1.0,1.0,1.0,1.0,...,LOCAL,"[{""path"": ""results/predictions.json"", ""type"": ...",Random Forest,"[{""run_id"": ""14b149b3ee1d469aa9db64c784f87159""...",sequential,sklearn.ensemble._forest.RandomForestClassifier,spyral,5,RandomForestClassifier,"['mean radius', 'mean texture', 'mean perimete..."
6,76c2affd76ed415897e109f9745138d4,1,FINISHED,file:///c:/Users/sabin/Downloads/freelancing/i...,2024-11-12 11:35:13.450000+00:00,2024-11-12 11:35:22.712000+00:00,1.0,1.0,1.0,1.0,...,LOCAL,"[{""path"": ""results/predictions.json"", ""type"": ...",Random Forest,"[{""run_id"": ""76c2affd76ed415897e109f9745138d4""...",sequential,sklearn.ensemble._forest.RandomForestClassifier,spyral,4,RandomForestClassifier,"['mean radius', 'mean texture', 'mean perimete..."


You can load previously saved models for inference.

In [None]:
# Autologging already stores the model under the "model" artifact folder,
# but it can also be logged explicitly under any other path
mlflow.sklearn.log_model(clf, artifact_path="my_model")

In [47]:
# Load model using run_id (taken from the search results above)
run_id = "018dd9c9aec8463a8eec9fd44ddb98be"
model_uri = f"runs:/{run_id}/model"  # "model" is the artifact path used by autologging
loaded_model = mlflow.sklearn.load_model(model_uri)

# Use the model for predictions; this run was trained on the first 7 features
loaded_model.predict(X_test.iloc[:, :7])

array([1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 0, 0])

In [48]:
loaded_model.predict_proba(X_test.iloc[:, :7])

array([[0.04, 0.96],
       [1.  , 0.  ],
       [0.92, 0.08],
       [0.03, 0.97],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.86, 0.14],
       [0.66, 0.34],
       [0.  , 1.  ],
       [0.04, 0.96],
       [0.91, 0.09],
       [0.33, 0.67],
       [0.97, 0.03],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.1 , 0.9 ],
       [0.  , 1.  ],
       [0.01, 0.99],
       [1.  , 0.  ],
       [0.14, 0.86],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.01, 0.99],
       [0.02, 0.98],
       [0.09, 0.91],
       [0.04, 0.96],
       [0.04, 0.96],
       [0.  , 1.  ],
       [0.99, 0.01],
       [0.04, 0.96],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.02, 0.98],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.73, 0.27],
       [0.02, 0.98],
       [1.  , 0.  ],
       [0.26, 0.74],
       [0.  , 1.  ],
       [0.92, 0.08],
       [0.  , 1.  ],
       [0.02, 0.98],
       [0.05, 0.95],
       [0.01, 0.99],
       [0.02, 0.98],
       [0.  ,

In [50]:
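# The loaded model is a regular scikit-learn estimator, so it can be refit.
# Refit on the training data with the same 7 features; with autologging
# enabled, this creates a new run.
loaded_model.fit(X_train.iloc[:, :7], y_train)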

2024/11/12 12:54:24 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'a19b55c9f6b542d8abd3de0216aa2758', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


RandomForestClassifier(random_state=42)

## <a id='toc2_6_'></a>[**Register models**](#toc0_)

A logged model can be added to the MLFlow Model Registry either through the UI or via code. Registering the same name again creates a new version:

In [51]:
# Register model using runs:/ location
mlflow.register_model(model_uri=f"runs:/{run_id}/model", name="breast-cancer-classification")

Registered model 'breast-cancer-classification' already exists. Creating a new version of this model...
Created version '2' of model 'breast-cancer-classification'.


<ModelVersion: aliases=[], creation_timestamp=1731412821110, current_stage='None', description=None, last_updated_timestamp=1731412821110, name='breast-cancer-classification', run_id='018dd9c9aec8463a8eec9fd44ddb98be', run_link=None, source='file:///c:/Users/sabin/Downloads/freelancing/ironhack/ironhack-v4-data-lessons/cohorts/12_nov_24_sgf/10_extraweek/code_along_nb/mlruns/1/018dd9c9aec8463a8eec9fd44ddb98be/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=2>
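
Two URI schemes appear in this section: `runs:/<run_id>/<artifact_path>` for a model logged in a run, and `models:/<name>/<version>` for a registered model version. The helper functions below are hypothetical conveniences (not part of MLFlow) that only build these strings.

```python
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build the URI of a model logged inside a run."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name: str, version: int) -> str:
    """Build the URI of a specific version of a registered model."""
    return f"models:/{name}/{version}"


# Values taken from the cells above
logged_uri = run_model_uri("018dd9c9aec8463a8eec9fd44ddb98be")
registered_uri = registry_model_uri("breast-cancer-classification", 2)
print(logged_uri)
print(registered_uri)
```

Either string can be passed to `mlflow.sklearn.load_model`, provided the tracking URI points at the store that holds the model.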


## <a id='toc2_7_'></a>[**Extra**](#toc0_)

### <a id='toc2_7_1_'></a>[**Nested Experiments**](#toc0_)
MLFlow allows nested runs for tracking hierarchical experiments. This is useful, for example, when you want to log each cross-validation fold as a separate child run while grouping all folds under a single parent run that holds the shared attributes.


In [5]:
from sklearn.model_selection import StratifiedKFold

# Create stratified KFold
cross_validator = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# split() returns a one-shot generator, so store the folds in a list to reuse them later
cv_splits = list(cross_validator.split(X, y))

for train_indices, test_indices in cv_splits:
    print(train_indices)
    print(test_indices)

[  1   2   3   4   5   6   7   9  10  11  12  13  15  16  18  19  20  21
  22  23  24  25  26  27  28  30  32  33  34  35  36  37  38  39  40  41
  42  43  44  45  46  47  48  49  50  51  52  53  54  56  57  58  59  60
  62  63  64  66  67  68  69  71  72  73  74  78  81  82  83  85  87  89
  91  92  93  94  95  96  97  98  99 100 101 102 103 105 107 108 109 110
 113 116 120 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
 139 140 141 142 143 145 146 147 148 150 152 153 154 155 156 157 158 160
 162 163 164 165 166 167 170 171 172 173 174 175 176 177 178 179 180 181
 182 183 184 185 187 188 189 190 191 192 193 195 196 197 201 202 204 206
 207 208 209 210 211 212 215 216 217 218 219 220 221 222 223 224 225 226
 227 228 229 231 232 233 234 235 236 237 238 239 241 242 243 245 246 247
 248 249 250 251 252 253 254 255 256 257 259 261 262 263 265 266 267 270
 271 272 273 274 276 277 278 279 280 282 285 286 287 288 289 290 293 294
 295 296 297 299 300 301 302 303 304 305 306 307 30
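
One detail worth checking before reusing the folds: `split()` returns a generator, which can be consumed only once. The short sketch below demonstrates this on the same dataset; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X_cv, y_cv = load_breast_cancer(return_X_y=True)
cross_validator = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

split_generator = cross_validator.split(X_cv, y_cv)
first_pass = list(split_generator)   # consumes the generator
second_pass = list(split_generator)  # nothing left to iterate over

print(len(first_pass), len(second_pass))
```

Storing the folds in a list, or calling `split()` again, avoids silently looping over zero folds later on.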

In [54]:
for i, elem in enumerate(["red", "brown", "blue"]):
    print(i, elem)

0 red
1 brown
2 blue


In [10]:
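# Create nested cross-validation
mlflow.sklearn.autolog(disable=False)
mlflow.set_tracking_uri("sqlite:///mlruns.db")
mlflow.set_experiment('breast-cancer-classification')

# Materialize the folds: split() returns a one-shot generator
cv_splits = list(cross_validator.split(X, y))

# The parent run is a regular run; only the child runs are nested
with mlflow.start_run(run_name="Random Forest") as parent_run:
    # Log features
    mlflow.set_tag("features", X.columns.to_list())

    for i, (train_split, test_split) in enumerate(cv_splits):
        with mlflow.start_run(run_name=f"Random Forest {i}", nested=True):
            # New train-test split
            X_train_cv, y_train_cv = X.iloc[train_split], y.iloc[train_split]
            X_test_cv, y_test_cv = X.iloc[test_split], y.iloc[test_split]

            clf = RandomForestClassifier(n_estimators=100, random_state=42)
            clf.fit(X_train_cv, y_train_cv)

            # Use same logging as before
            mlflow.set_tag("fold", i)
            pred_scores = clf.predict_proba(X_test_cv)[:, 1]
            mlflow.log_metric("ROC-AUC", roc_auc_score(y_test_cv, pred_scores))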


### <a id='toc2_7_2_'></a>[**Setting Up AWS Storage**](#toc0_)
You can configure MLFlow to use a PostgreSQL database hosted on AWS (for example, on RDS) as the metadata store and AWS S3 as the artifact store. This requires a PostgreSQL driver such as `psycopg2`, plus `boto3` for S3 access:


In [None]:
# Run in terminal (`mlflow ui` accepts the same options)
# mlflow server \
#     --backend-store-uri 'postgresql://user_name:password@link_to_your_aws_postgresql_db:port/database_name' \
#     --default-artifact-root s3://your-bucket-name

In [None]:
mlflow.set_tracking_uri("postgresql://user_name:password@link_to_your_aws_postgresql_db:port/database_name")
mlflow.create_experiment("name", artifact_location="s3://your-bucket-name")
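
Hard-coding credentials in a notebook is easy to leak, so one option is to assemble the connection string from environment variables. The sketch below is a hypothetical helper: the function name, the environment variable names, and the fallback values are all assumptions, and the password is URL-encoded so that special characters do not break the URI.

```python
import os
from urllib.parse import quote_plus


def build_postgres_uri(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a PostgreSQL connection URI, URL-encoding the credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical environment variable names with placeholder fallbacks
tracking_uri = build_postgres_uri(
    user=os.environ.get("MLFLOW_DB_USER", "user_name"),
    password=os.environ.get("MLFLOW_DB_PASSWORD", "password"),
    host=os.environ.get("MLFLOW_DB_HOST", "localhost"),
    port=int(os.environ.get("MLFLOW_DB_PORT", "5432")),
    database=os.environ.get("MLFLOW_DB_NAME", "mlflow"),
)

example_uri = build_postgres_uri("mlflow", "p@ss/word", "db.example.com", 5432, "mlflow")
print(example_uri)
```

The resulting string can then be passed to `mlflow.set_tracking_uri` instead of a literal containing the password.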