# Log models with MLflow

You can use MLflow in Azure Machine Learning to log models. When you log a model as an MLflow model rather than as a plain artifact, an `MLmodel` file is created in the output directory. The `MLmodel` file contains all the model's metadata. You can customize the model's signature when logging the model.

## Before you start

You'll need the latest version of the **azure-ai-ml** package to run the code in this notebook. Run the cell below to verify that it is installed.

> **Note**:
> If the **azure-ai-ml** package is not installed, run `pip install azure-ai-ml` to install it.

In [1]:
pip show azure-ai-ml

Name: azure-ai-ml
Version: 1.25.0
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: /anaconda/envs/azureml_py38/lib/python3.10/site-packages
Requires: azure-common, azure-core, azure-mgmt-core, azure-monitor-opentelemetry, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions
Required-by: 
Note: you may need to restart the kernel to use updated packages.


## Connect to your workspace

With the required SDK packages installed, now you're ready to connect to your workspace.

To connect to a workspace, you need three identifiers: a subscription ID, a resource group name, and a workspace name. Because you're working on a compute instance managed by Azure Machine Learning, you can use the default values to connect to the workspace.

In [2]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential doesn't work
    credential = InteractiveBrowserCredential()

In [3]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

Found the config file in: /config.json
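
`MLClient.from_config` picks up the workspace identifiers from a `config.json` file, which is already present on the compute instance, as the output above shows. The sketch below illustrates the three keys the file is expected to hold. The helper name `load_workspace_config` and the placeholder values are assumptions made for illustration, not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys that identify a workspace in config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Read a workspace config file and check that all identifiers are present."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"config.json is missing: {', '.join(missing)}")
    return {key: config[key] for key in REQUIRED_KEYS}


# Placeholder values for illustration only; a real file holds your own identifiers.
example = {
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example))
    loaded = load_workspace_config(config_path)

print(loaded["workspace_name"])
```

If you run the notebook outside a compute instance, a file with these keys next to the notebook is one way to let `from_config` find the workspace.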


## Autologging with MLflow

When you use autologging, your model is automatically logged. The model flavor and schema are inferred.

Run the following cell to create the **train-model-autolog.py** script in the **src** folder. The script trains a classification model by using the **accident.csv** file in the same folder, which is passed as an argument. 

In [4]:
import os

# create a folder for the script files
script_folder = 'src'
os.makedirs(script_folder, exist_ok=True)
print(script_folder, 'folder created')

src folder created


In [5]:
%%writefile $script_folder/train-model-autolog.py
# import libraries
import mlflow
import os
import argparse
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import tempfile
import joblib
from pathlib import Path
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def main(args):
    # enable autologging
    mlflow.autolog()

    # read data
    df = get_data(args.training_data)

    # split data
    X_train, X_test, y_train, y_test = split_data(df)

    # train model
    model = train_model(args.reg_rate, X_train, X_test, y_train, y_test)

    eval_model(model, X_test, y_test)

# function that reads the data
def get_data(path):
    print("Reading data...")
    data = pd.read_csv(path)
    df = data.copy().dropna()
    
    return df


# function that splits the data
def split_data(df):
    print("Splitting data...")
    # Numeric transformer pipeline
    numeric_transformer = make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
    )
    
    # Categorical transformer pipeline
    categorical_transformer = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(drop='first')
    )
    
    # Define categorical and numeric columns
    cat_columns = ['Gender', 'Helmet_Used', 'Seatbelt_Used']
    num_columns = ['Age', 'Speed_of_Impact']
    
    # Combined feature transformer
    features_transformer = ColumnTransformer(
        transformers=[
            ("numeric", numeric_transformer, num_columns),
            ("categorical", categorical_transformer, cat_columns),
        ],
    )
    # Separate features and labels
    X = df[['Age', 'Gender', 'Speed_of_Impact', 'Helmet_Used', 'Seatbelt_Used']]
    y = df['Survived'].values


    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    
    # Transform train and test data
    X_train = features_transformer.fit_transform(X_train)
    X_test = features_transformer.transform(X_test)

    return X_train, X_test, y_train, y_test

# function that trains the model
def train_model(reg_rate, X_train, X_test, y_train, y_test):
    print("Training model...")
    model = LogisticRegression(C=1/reg_rate, solver="liblinear").fit(X_train, y_train)

    return model

# function that evaluates the model
def eval_model(model, X_test, y_test):
    # calculate accuracy
    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)
    print('Accuracy:', acc)

    # calculate AUC
    y_scores = model.predict_proba(X_test)
    auc = roc_auc_score(y_test,y_scores[:,1])
    print('AUC: ' + str(auc))

    # plot ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
    fig = plt.figure(figsize=(6, 4))
    # Plot the diagonal 50% line
    plt.plot([0, 1], [0, 1], 'k--')
    # Plot the FPR and TPR achieved by our model
    plt.plot(fpr, tpr)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.savefig("ROC-Curve.png")


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--training_data", dest='training_data',
                        type=str)
    parser.add_argument("--reg_rate", dest='reg_rate',
                        type=float, default=0.01)

    # parse args
    args = parser.parse_args()

    # return args
    return args

# run script
if __name__ == "__main__":
    # add space in logs
    print("\n\n")
    print("*" * 60)

    # parse args
    args = parse_args()

    # run main function
    main(args)

    # add space in logs
    print("*" * 60)
    print("\n\n")


Overwriting src/train-model-autolog.py


Now, you can submit the script as a command job.

Run the cell below to train the model. 

In [6]:
from azure.ai.ml import command

# configure job

job = command(
    code="./src",
    command="python train-model-autolog.py --training_data accident.csv",
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
    compute="captgt0071",
    display_name="accident-train-autolog",
    experiment_name="accident-training"
    )

# submit job
returned_job = ml_client.create_or_update(job)
aml_url = returned_job.studio_url
print("Monitor your job at", aml_url)

Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Monitor your job at https://ml.azure.com/runs/quirky_quince_1rwgzl654p?wsid=/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourcegroups/mlw-dp100/workspaces/dala_project&tid=aef6e45c-850f-4f38-a10b-1df3ad33cdb0


In the Studio, navigate to the **accident-train-autolog** job to explore the overview of the command job you ran. Find the logged artifacts in the **Outputs + logs** tab. Select the `model` folder to find the `MLmodel` file and explore its contents.

## Specify the flavor with autologging

You can use autologging and still specify the flavor of the model. In this example, the model's flavor is scikit-learn, so you use `mlflow.sklearn.autolog()`.

Run the following cell to create the **train-model-sklearn.py** script in the **src** folder. The script trains a classification model by using the **accident.csv** file in the same folder, which is passed as an argument. 

In [6]:
%%writefile $script_folder/train-model-sklearn.py
# import libraries
import mlflow
import os
import argparse
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import tempfile
import joblib
from pathlib import Path
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def main(args):
    # enable autologging
    mlflow.sklearn.autolog()

    # read data
    df = get_data(args.training_data)

    # split data
    X_train, X_test, y_train, y_test = split_data(df)

    # train model
    model = train_model(args.reg_rate, X_train, X_test, y_train, y_test)

    eval_model(model, X_test, y_test)

# function that reads the data
def get_data(path):
    print("Reading data...")
    data = pd.read_csv(path)
    df = data.copy().dropna()
    
    return df


# function that splits the data
def split_data(df):
    print("Splitting data...")
    # Numeric transformer pipeline
    numeric_transformer = make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
    )
    
    # Categorical transformer pipeline
    categorical_transformer = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(drop='first')
    )
    
    # Define categorical and numeric columns
    cat_columns = ['Gender', 'Helmet_Used', 'Seatbelt_Used']
    num_columns = ['Age', 'Speed_of_Impact']
    
    # Combined feature transformer
    features_transformer = ColumnTransformer(
        transformers=[
            ("numeric", numeric_transformer, num_columns),
            ("categorical", categorical_transformer, cat_columns),
        ],
    )
    # Separate features and labels
    X = df[['Age', 'Gender', 'Speed_of_Impact', 'Helmet_Used', 'Seatbelt_Used']]
    y = df['Survived'].values


    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    
    # Transform train and test data
    X_train = features_transformer.fit_transform(X_train)
    X_test = features_transformer.transform(X_test)

    return X_train, X_test, y_train, y_test

# function that trains the model
def train_model(reg_rate, X_train, X_test, y_train, y_test):
    print("Training model...")
    model = LogisticRegression(C=1/reg_rate, solver="liblinear").fit(X_train, y_train)

    return model

# function that evaluates the model
def eval_model(model, X_test, y_test):
    # calculate accuracy
    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)
    print('Accuracy:', acc)

    # calculate AUC
    y_scores = model.predict_proba(X_test)
    auc = roc_auc_score(y_test,y_scores[:,1])
    print('AUC: ' + str(auc))

    # plot ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
    fig = plt.figure(figsize=(6, 4))
    # Plot the diagonal 50% line
    plt.plot([0, 1], [0, 1], 'k--')
    # Plot the FPR and TPR achieved by our model
    plt.plot(fpr, tpr)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.savefig("ROC-Curve.png")


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--training_data", dest='training_data',
                        type=str)
    parser.add_argument("--reg_rate", dest='reg_rate',
                        type=float, default=0.01)

    # parse args
    args = parser.parse_args()

    # return args
    return args

# run script
if __name__ == "__main__":
    # add space in logs
    print("\n\n")
    print("*" * 60)

    # parse args
    args = parse_args()

    # run main function
    main(args)

    # add space in logs
    print("*" * 60)
    print("\n\n")


Overwriting src/train-model-sklearn.py


Now, you can submit the script as a command job.

Run the cell below to train the model. 

In [9]:
from azure.ai.ml import command

# configure job

job = command(
    code="./src",
    command="python train-model-sklearn.py --training_data accident.csv",
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
    compute="captgt0071",
    display_name="accident-train-sklearn",
    experiment_name="accident-training"
    )

# submit job
returned_job = ml_client.create_or_update(job)
aml_url = returned_job.studio_url
print("Monitor your job at", aml_url)




Monitor your job at https://ml.azure.com/runs/calm_book_hz88rfymdw?wsid=/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourcegroups/mlw-dp100/workspaces/dala_project&tid=aef6e45c-850f-4f38-a10b-1df3ad33cdb0


In the Studio, navigate to the **accident-train-sklearn** job to explore the overview of the command job you ran. Find the logged artifacts in the **Outputs + logs** tab. Select the `model` folder to find the `MLmodel` file and explore its contents.

Compare the `MLmodel` files of the previous two runs. You'll notice that they're the same, indicating that MLflow's autolog feature correctly inferred the model's flavor.
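
If you download both `MLmodel` files, you can also compare them programmatically. The following is a minimal sketch that uses only the standard library. It assumes that the top-level keys `run_id`, `model_uuid`, and `utc_time_created` differ on every run and should be ignored; the helper names and the two text fragments are made up for illustration and are not real job output.

```python
import difflib

# Top-level MLmodel keys assumed to change on every run (adjust as needed).
RUN_SPECIFIC_KEYS = ("run_id:", "model_uuid:", "utc_time_created:")


def comparable_lines(text):
    """Drop lines that start with a run-specific key."""
    return [line for line in text.splitlines() if not line.startswith(RUN_SPECIFIC_KEYS)]


def mlmodel_diff(text_a, text_b):
    """Return unified-diff lines for two MLmodel files, ignoring run-specific keys."""
    return list(
        difflib.unified_diff(
            comparable_lines(text_a),
            comparable_lines(text_b),
            fromfile="autolog/MLmodel",
            tofile="sklearn/MLmodel",
            lineterm="",
        )
    )


# Tiny made-up fragments that stand in for the downloaded files.
run_a = "artifact_path: model\nrun_id: aaa\nflavors:\n  sklearn: {}\n"
run_b = "artifact_path: model\nrun_id: bbb\nflavors:\n  sklearn: {}\n"

differences = mlmodel_diff(run_a, run_b)
print("identical" if not differences else "\n".join(differences))
```

An empty diff means the two files describe the same flavors and signature once the run-specific fields are set aside.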

## Customize the model with an inferred signature

You can manually log the model using `mlflow.sklearn.log_model` instead of autologging. In this script, you'll create a signature by inferring it from the test dataset and the predicted results, and then log the scikit-learn model with that signature.

Run the following cell to create the **train-model-infer.py** script in the **src** folder. The script trains a classification model by using the **accident.csv** file in the same folder, which is passed as an argument.

In [7]:
%%writefile $script_folder/train-model-infer.py
# import libraries
import mlflow
import os
import argparse
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from mlflow.models import infer_signature

# Define paths for saving model and transformer
MODEL_PATH = "model.pkl"
TRANSFORMER_PATH = "features_transformer.pkl"

def main(args):
    # read data
    df = get_data(args.training_data)

    # split data
    X_train, X_test, y_train, y_test, features_transformer = split_data(df)

    # Save the feature transformer for future use
    joblib.dump(features_transformer, TRANSFORMER_PATH)
    print(f"Feature transformer saved at: {TRANSFORMER_PATH}")

    # train model
    model = train_model(args.reg_rate, X_train, y_train)
    mlflow.log_param("regularization_rate", args.reg_rate)

    # Save trained model
    joblib.dump(model, MODEL_PATH)
    print(f"Trained model saved at: {MODEL_PATH}")

    # evaluate model
    y_hat = eval_model(model, X_test, y_test)

    # create the signature by inferring it from the datasets
    signature = infer_signature(X_test, y_hat)

    # Log the model and feature transformer in MLflow
    mlflow.sklearn.log_model(model, artifact_path="model", signature=signature)
    mlflow.log_artifact(TRANSFORMER_PATH, artifact_path="preprocessor")

    print("Model and Feature Transformer saved in MLflow!")

# function that reads the data
def get_data(path):
    print("Reading data...")
    data = pd.read_csv(path)
    df = data.copy().dropna()
    
    return df

# function that splits the data
def split_data(df):
    print("Splitting data...")
    
    # Numeric transformer pipeline
    numeric_transformer = make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
    )
    
    # Categorical transformer pipeline
    categorical_transformer = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(drop='first')
    )
    
    # Define categorical and numeric columns
    cat_columns = ['Gender', 'Helmet_Used', 'Seatbelt_Used']
    num_columns = ['Age', 'Speed_of_Impact']
    
    # Combined feature transformer
    features_transformer = ColumnTransformer(
        transformers=[
            ("numeric", numeric_transformer, num_columns),
            ("categorical", categorical_transformer, cat_columns),
        ],
    )
    
    # Separate features and labels
    X = df[['Age', 'Gender', 'Speed_of_Impact', 'Helmet_Used', 'Seatbelt_Used']]
    y = df['Survived'].values

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    
    # Transform train and test data
    X_train = features_transformer.fit_transform(X_train)
    X_test = features_transformer.transform(X_test)

    return X_train, X_test, y_train, y_test, features_transformer

# function that trains the model
def train_model(reg_rate, X_train, y_train):
    print("Training model...")

    # Train logistic regression model
    model = LogisticRegression(C=1/reg_rate, solver="liblinear")
    model.fit(X_train, y_train)

    return model

# function that evaluates the model
def eval_model(model, X_test, y_test):
    # calculate accuracy
    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)
    print('Accuracy:', acc)

    # calculate AUC
    y_scores = model.predict_proba(X_test)
    auc = roc_auc_score(y_test, y_scores[:,1])
    print('AUC:', auc)

    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("auc", auc)

    # plot ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
    plt.figure(figsize=(6, 4))
    plt.plot([0, 1], [0, 1], 'k--')  # Diagonal reference line
    plt.plot(fpr, tpr)  # ROC curve
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.savefig("ROC-Curve.png")

    return y_hat

def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--training_data", dest='training_data', type=str, required=True)
    parser.add_argument("--reg_rate", dest='reg_rate', type=float, default=0.01,
                        help="Regularization rate must be positive")

    # parse args
    args = parser.parse_args()

    return args

# run script
if __name__ == "__main__":
    print("\n\n")
    print("*" * 60)

    # parse args
    args = parse_args()

    # run main function
    main(args)

    print("*" * 60)
    print("\n\n")


Writing src/train-model-infer.py
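
The help text for `--reg_rate` says the value must be positive, but `argparse` does not enforce that by itself, and the script computes `C=1/reg_rate`, so a value of zero would fail with a division error. A minimal sketch of one way to validate the argument follows. The names `positive_float` and `build_parser` are hypothetical and do not appear in the script above.

```python
import argparse


def positive_float(value):
    """argparse type: parse a float and reject zero or negative values."""
    number = float(value)
    if number <= 0:
        raise argparse.ArgumentTypeError(f"must be positive, got {value}")
    return number


def build_parser():
    """Build a parser with the same two arguments as the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data", dest="training_data", type=str, required=True)
    parser.add_argument("--reg_rate", dest="reg_rate", type=positive_float, default=0.01,
                        help="Regularization rate must be positive")
    return parser


args = build_parser().parse_args(["--training_data", "accident.csv", "--reg_rate", "0.1"])
print(args.reg_rate)
```

With this type function, passing `--reg_rate 0` makes the parser exit with a usage error before any training starts.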


Now, you can submit the script as a command job.

Run the cell below to train the model. 

In [8]:
from azure.ai.ml import command

# configure job

job = command(
    code="./src",
    command="python train-model-infer.py --training_data accident.csv",
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
    compute="captgt0071",
    display_name="accident-train-infer",
    experiment_name="accident-training"
    )

# submit job
returned_job = ml_client.create_or_update(job)
aml_url = returned_job.studio_url
print("Monitor your job at", aml_url)


Monitor your job at https://ml.azure.com/runs/wheat_camera_mpd9f1rc74?wsid=/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourcegroups/mlw-dp100/workspaces/dala_project&tid=aef6e45c-850f-4f38-a10b-1df3ad33cdb0


In the Studio, navigate to the **accident-train-infer** job to explore the overview of the command job you ran. Find the logged artifacts in the **Outputs + logs** tab. Select the `model` folder to find the `MLmodel` file and explore its contents.

Compare the `MLmodel` file with those of the previous two runs. You'll notice that they're all the same, indicating that MLflow's autolog feature correctly inferred the model's signature too.

## Customize the model with a defined signature

You can also define the signature yourself instead of inferring it. In this script, you'll build the signature manually from column specifications and then log the scikit-learn model with `mlflow.sklearn.log_model`.

Run the following cell to create the **train-model-signature.py** script in the **src** folder. The script trains a classification model by using the **accident.csv** file in the same folder, which is passed as an argument. 

In [9]:
%%writefile $script_folder/train-model-signature.py
# import libraries
import mlflow
import os
import argparse
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from mlflow.models import infer_signature
import mlflow.sklearn
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import Schema, ColSpec

# Define paths for saving model and transformer
MODEL_PATH = "model.pkl"
TRANSFORMER_PATH = "features_transformer.pkl"

def main(args):
    # read data
    df = get_data(args.training_data)

    # split data
    X_train, X_test, y_train, y_test, features_transformer = split_data(df)

    # Save the feature transformer for future use
    joblib.dump(features_transformer, TRANSFORMER_PATH)
    print(f"Feature transformer saved at: {TRANSFORMER_PATH}")

    # train model
    model = train_model(args.reg_rate, X_train, y_train)
    mlflow.log_param("regularization_rate", args.reg_rate)

    # Save trained model
    joblib.dump(model, MODEL_PATH)
    print(f"Trained model saved at: {MODEL_PATH}")

    # evaluate model
    y_hat = eval_model(model, X_test, y_test)

    # create the signature manually
    
    input_schema = Schema([
        ColSpec("double", "Age"),  # Assuming age is a float/double
        ColSpec("string", "Gender"),  # Categorical variable
        ColSpec("double", "Speed_of_Impact"),  # Assuming speed is a float/double
        ColSpec("string", "Helmet_Used"),  # Categorical variable
        ColSpec("string", "Seatbelt_Used")  # Categorical variable
    ])

    output_schema = Schema([ColSpec("integer", "Survived")])
    signature = ModelSignature(inputs=input_schema, outputs=output_schema)
 

    # Log the model and feature transformer in MLflow
    mlflow.sklearn.log_model(model, artifact_path="model", signature=signature)
    mlflow.log_artifact(TRANSFORMER_PATH, artifact_path="preprocessor")

    print("Model and Feature Transformer saved in MLflow!")

# function that reads the data
def get_data(path):
    print("Reading data...")
    data = pd.read_csv(path)
    # Convert numeric columns to float
    data['Age'] = data['Age'].astype(float)
    data['Speed_of_Impact'] = data['Speed_of_Impact'].astype(float)
    df = data.copy().dropna()
    return df

# function that splits the data
def split_data(df):
    print("Splitting data...")
    
    # Numeric transformer pipeline
    numeric_transformer = make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
    )
    
    # Categorical transformer pipeline
    categorical_transformer = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(drop='first')
    )
    
    # Define categorical and numeric columns
    cat_columns = ['Gender', 'Helmet_Used', 'Seatbelt_Used']
    num_columns = ['Age', 'Speed_of_Impact']
    
    # Combined feature transformer
    features_transformer = ColumnTransformer(
        transformers=[
            ("numeric", numeric_transformer, num_columns),
            ("categorical", categorical_transformer, cat_columns),
        ],
    )
    
    # Separate features and labels
    X = df[['Age', 'Gender', 'Speed_of_Impact', 'Helmet_Used', 'Seatbelt_Used']]
    y = df['Survived'].values

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    
    # Transform train and test data
    X_train = features_transformer.fit_transform(X_train)
    X_test = features_transformer.transform(X_test)

    return X_train, X_test, y_train, y_test, features_transformer

# function that trains the model
def train_model(reg_rate, X_train, y_train):
    print("Training model...")

    # Train logistic regression model
    model = LogisticRegression(C=1/reg_rate, solver="liblinear")
    model.fit(X_train, y_train)

    return model

# function that evaluates the model
def eval_model(model, X_test, y_test):
    # calculate accuracy
    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)
    print('Accuracy:', acc)

    # calculate AUC
    y_scores = model.predict_proba(X_test)
    auc = roc_auc_score(y_test, y_scores[:,1])
    print('AUC:', auc)

    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("auc", auc)

    # plot ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
    plt.figure(figsize=(6, 4))
    plt.plot([0, 1], [0, 1], 'k--')  # Diagonal reference line
    plt.plot(fpr, tpr)  # ROC curve
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.savefig("ROC-Curve.png")
    mlflow.log_artifact("ROC-Curve.png")

    return y_hat

def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--training_data", dest='training_data', type=str, required=True)
    parser.add_argument("--reg_rate", dest='reg_rate', type=float, default=0.01,
                        help="Regularization rate must be positive")

    # parse args
    args = parser.parse_args()

    return args

# run script
if __name__ == "__main__":
    print("\n\n")
    print("*" * 60)

    # parse args
    args = parse_args()

    # run main function
    main(args)

    print("*" * 60)
    print("\n\n")

Writing src/train-model-signature.py


Now, you can submit the script as a command job.

Run the cell below to train the model. 

In [10]:
from azure.ai.ml import command

# configure job

job = command(
    code="./src",
    command="python train-model-signature.py --training_data accident.csv",
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
    compute="captgt0071",
    display_name="accident-train-signature",
    experiment_name="accident-training"
    )

# submit job
returned_job = ml_client.create_or_update(job)
aml_url = returned_job.studio_url
print("Monitor your job at", aml_url)




Monitor your job at https://ml.azure.com/runs/quirky_candle_jky5bkyprc?wsid=/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourcegroups/mlw-dp100/workspaces/dala_project&tid=aef6e45c-850f-4f38-a10b-1df3ad33cdb0


In the Studio, navigate to the **accident-train-signature** job to explore the overview of the command job you ran. Find the logged artifacts in the **Outputs + logs** tab. Select the `model` folder to find the `MLmodel` file and explore its contents.

Compare the `MLmodel` files with the previous runs. You'll notice that the signature is different from the previous runs. Previous runs used tensor-based signatures, whereas the latest run used a column-based signature.
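
Note that the column-based signature names the raw columns (`Age`, `Gender`, and so on), while the logged `model` was fitted on the output of `features_transformer`. One way to make the two line up is to bundle the preprocessing and the classifier in a single scikit-learn pipeline, so that `predict` accepts the raw columns directly. The following is a minimal sketch of that idea, not the approach the script takes. The toy rows and category values are made up, and the imputers are left out for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_columns = ["Gender", "Helmet_Used", "Seatbelt_Used"]
num_columns = ["Age", "Speed_of_Impact"]

# Toy rows for illustration only; they are not taken from accident.csv.
X = pd.DataFrame({
    "Age": [22.0, 35.0, 47.0, 58.0, 31.0, 64.0],
    "Gender": ["Male", "Female", "Male", "Female", "Female", "Male"],
    "Speed_of_Impact": [30.0, 80.0, 45.0, 95.0, 25.0, 110.0],
    "Helmet_Used": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Seatbelt_Used": ["Yes", "No", "No", "Yes", "Yes", "No"],
})
y = [1, 0, 1, 0, 1, 0]

features_transformer = ColumnTransformer(
    transformers=[
        ("numeric", StandardScaler(), num_columns),
        ("categorical", OneHotEncoder(drop="first"), cat_columns),
    ],
)

# Preprocessing and classifier travel together, so predict() takes raw columns.
pipeline = make_pipeline(
    features_transformer,
    LogisticRegression(C=1 / 0.01, solver="liblinear"),
)
pipeline.fit(X, y)

predictions = pipeline.predict(X)
print(predictions.shape)
```

The fitted `pipeline` could then be passed to `mlflow.sklearn.log_model` in place of the bare classifier, together with the column-based signature.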

## Register the model

When you choose a model you want to deploy, you can first register the model. 

To register the latest model, you'll refer to the name of the job run. By registering the model as an MLflow model, you can easily deploy it later.

In [14]:
returned_job

Experiment,Name,Type,Status,Details Page
accident-training,quirky_candle_jky5bkyprc,command,Starting,Link to Azure Machine Learning studio


In [13]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

job_name = returned_job.name

job_name 

'quirky_candle_jky5bkyprc'

In [16]:
f"azureml://jobs/{job_name}/outputs/artifacts/paths/model/"

'azureml://jobs/quirky_candle_jky5bkyprc/outputs/artifacts/paths/model/'
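
The path above follows the pattern `azureml://jobs/<job-name>/outputs/artifacts/paths/<folder>/`, where the folder matches the `artifact_path` used when the model was logged. A small helper can build it for any job. The function name `job_model_uri` is hypothetical and is not part of the SDK.

```python
def job_model_uri(job_name, artifact_path="model"):
    """Build the azureml:// URI of a folder in a job's artifact output."""
    if not job_name:
        raise ValueError("job_name must not be empty")
    folder = artifact_path.strip("/")
    return f"azureml://jobs/{job_name}/outputs/artifacts/paths/{folder}/"


print(job_model_uri("quirky_candle_jky5bkyprc"))
```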

In [17]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

job_name = returned_job.name

run_model = Model(
    path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/model/",
    name="mlflow-accident",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
# Register the model in the workspace
ml_client.models.create_or_update(run_model)

In the Studio, navigate to the **Models** page. In the model list, find the `mlflow-accident` model and select it to explore it further.

- In the **Details** tab of the `mlflow-accident` model, you can see that it's an `MLFLOW` type model and which job trained it.
- In the **Artifacts** tab you can find the directory with the `MLmodel` file.

If you want to explore the model's behavior further, you can **optionally** choose to deploy the model to a real-time endpoint.