## MLflow: Model Registry

### Introduction to the Model Registry
The MLflow Model Registry is a centralized model store, with an accompanying API and UI, for managing the lifecycle of machine learning models. It lets you organize, track, and control models as they progress through development and deployment. Its key features and concepts are outlined below.

### Key Features of the MLflow Model Registry

**Model Versioning:**

Each time you register a model under a given name, the Model Registry assigns it the next version number for that name, starting at 1. Registering again under the same name does not overwrite anything; it adds a new version.
This lets you keep track of changes and maintain a history of all model versions.

**Model Stages:**

Each model version can be assigned to one of the following stages:
- None: Default stage when a model version is first registered.
- Staging: Indicates the model is being tested or validated.
- Production: Indicates the model is ready for deployment.
- Archived: Indicates the model is no longer in use but retained for reference.

You can transition model versions between these stages to reflect their lifecycle. Note that recent MLflow releases (2.9 and later) deprecate stages in favor of model version aliases and tags.

**Model Metadata:**

The registry stores metadata about each model, such as:
- Who created the model.
- When it was created.
- Associated tags or descriptions.

This metadata helps in auditing and understanding the model's context.

**Model Lineage:**

The registry links each model version to the MLflow Tracking run that produced it.
This allows you to trace back to the data, code, and parameters used to train the model.

**Access Control:**

Depending on how the MLflow server is deployed (for example, with authentication enabled or on a managed platform), you can set permissions to control who can register, update, or transition models between stages.

**Integration with Deployment:**

Registered models can be addressed with `models:/` URIs, so deployment tools can load or serve a specific version directly from the registry.

In [139]:
# Import necessary libraries
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_classification  # used in AutomaticMLflowLogger.__init__
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

class AutomaticMLflowLogger:
    """
    AutomaticMLflowLogger class:
    Generates a synthetic classification dataset, converts it to DataFrame, splits it into train/test sets,
    and logs a LogisticRegression model with MLflow.
    """

    def __init__(self, n_samples:int, n_features:int, n_informative:int, n_redundant:int, random_state:int, n_classes:int) -> None:
        """
        Initializes an instance of AutomaticMLflowLogger.

        Parameters:
        -----------
        n_samples : int
            Total number of samples to generate (e.g., 1000).
        n_features : int
            Total number of features (e.g., 20).
        n_informative : int
            Number of informative features (e.g., 10).
        n_redundant : int
            Number of redundant features (e.g., 5).
        random_state : int
            Random seed for reproducibility (e.g., 42).
        n_classes : int
            Number of classes in the target variable (e.g., 3).
        """
        self.n_samples = n_samples
        self.n_features = n_features
        self.n_informative = n_informative
        self.n_redundant = n_redundant
        self.random_state = random_state
        self.n_classes = n_classes

        # Generate classification dataset
        self.X, self.y = make_classification(
            n_samples=n_samples,
            n_features=n_features,
            n_informative=n_informative,
            n_redundant=n_redundant,
            random_state=random_state,
            n_classes=n_classes
        )

        # Print dataset info
        print("INFORMATION ON THE GENERATED DATASET:")
        print("--------------------------------------")
        print(f"X dtype: {self.X.dtype}, y dtype: {self.y.dtype}")
        print(f"Number of samples: {n_samples}, Features: {n_features}, Classes: {n_classes}")
        print(f"First 5 rows of X:\n{self.X[:5]}")
        print(f"First 5 rows of y:\n{self.y[:5]}")

    def to_dataframe(self) -> pd.DataFrame:
        """
        Convert X and y into a pandas DataFrame.

        Returns
        -------
        pd.DataFrame
            DataFrame with feature columns named feature_0, ..., feature_n and target column named 'target'.
        """
        self.df = pd.DataFrame(data=self.X, columns=[f"feature_{n}" for n in range(self.X.shape[1])])
        self.df["target"] = self.y
        return self.df

    def split_dataframe(self, test_size:float=0.2, random_state:int=None) -> None:
        """
        Split the dataset into train and test sets using sklearn's train_test_split.

        Parameters
        ----------
        test_size : float
            Proportion of the dataset to include in the test split (default=0.2)
        random_state : int
            Random seed for reproducibility
        """
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.X, self.y, test_size=test_size, random_state=random_state
        )
        print(f"X_train shape: {self.X_train.shape}, y_train shape: {self.y_train.shape}")
        print(f"X_test shape: {self.X_test.shape}, y_test shape: {self.y_test.shape}")

    def tracking_params_with_mlflow(self, name_experiment: str, penalty="l2", solver="lbfgs", random_state=42, n_jobs=1, run_name="run"):
        """
        Train a LogisticRegression model and log it with MLflow.

        Parameters:
        -----------
        name_experiment : str
            Name of the MLflow experiment.
        penalty : str
            Regularization type (default: 'l2').
        solver : str
            Solver for LogisticRegression (default: 'lbfgs').
        random_state : int
            Random state for reproducibility (default: 42).
        n_jobs : int
            Number of parallel jobs (default: 1).
        run_name : str
            Name of the MLflow run (default: 'run').
        """
        try:
            # Set experiment
            mlflow.set_experiment(name_experiment)
            mlflow.sklearn.autolog()  # Enable automatic logging

            # Start run
            with mlflow.start_run(run_name=run_name):
                model = LogisticRegression(penalty=penalty, solver=solver, random_state=random_state, n_jobs=n_jobs)
                model.fit(self.X_train, self.y_train)

                # Predictions
                y_train_pred = model.predict(self.X_train)
                y_test_pred = model.predict(self.X_test)

                # Metrics
                train_acc = accuracy_score(self.y_train, y_train_pred)
                test_acc = accuracy_score(self.y_test, y_test_pred)
                print(f"Train Accuracy: {train_acc:.4f}")
                print(f"Test Accuracy: {test_acc:.4f}")

        except Exception as e:
            print(f"Error in MLflow logging: {e}")


In [140]:
logger = AutomaticMLflowLogger(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2, random_state=42, n_classes=3
)
df = logger.to_dataframe()
logger.split_dataframe(test_size=0.25, random_state=42)
logger.tracking_params_with_mlflow(name_experiment="Classification_Experiment", run_name="Logit_Run_1")


2025/12/29 19:56:45 INFO mlflow.tracking.fluent: Experiment with name 'Classification_Experiment' does not exist. Creating a new experiment.


INFORMATION ON THE GENERATED DATASET:
--------------------------------------
X dtype: float64, y dtype: int64
Number of samples: 1000, Features: 10, Classes: 3
First 5 rows of X:
[[-2.56891645 -0.25740861 -2.67935708  3.86481793  2.56499796 -0.73755596
  -3.33098499 -1.21337007 -1.47310497 -0.84638564]
 [ 0.62286056  0.53454361  0.01828302 -0.28338169  1.90763743 -0.34130985
   1.20623966 -1.09353229 -0.46979071 -0.18802193]
 [-0.17125115 -0.49627753  1.61334708  2.48806861 -1.67796555  0.30360427
  -2.10457904  0.71453069  3.47599878  0.6233862 ]
 [-0.87142307 -0.3339456   3.36844633  0.97215326 -0.13438845  0.21281985
   0.70089905  0.71604575 -1.30090958  3.43983124]
 [ 2.34640238 -0.69996534 -0.20325055 -0.25674549 -1.97425145  0.61966312
  -1.24795009 -1.66211471  3.92145164 -0.75949065]]
First 5 rows of y:
[1 1 0 1 0]
X_train shape: (750, 10), y_train shape: (750,)
X_test shape: (250, 10), y_test shape: (250,)
Train Accuracy: 0.7013
Test Accuracy: 0.7160
