# CPU-Based Fine-Tuning of the Phi-4 Small Language Model (SLM) with Azure Machine Learning

## **1. Overview**
### **[1.1 Motivations for Small Language Models (SLMs)]**
- Efficiency: SLMs have fewer parameters, so they need less memory and storage and run faster, making more efficient use of computational resources.
- Cost: SLMs are less expensive to train and deploy than larger models, which makes them affordable to a wider range of organizations and particularly suited to resource-constrained environments such as edge computing.
- Customizability: SLMs are more flexible in specialized applications and are easier to fine-tune for specific tasks than large models. Although large models have shown significant benefits, the potential of training smaller models on large-scale datasets remains largely untapped; SLMs demonstrate that smaller models trained on sufficient data can also perform well.
- Inference efficiency: Smaller models are typically more efficient at inference time and are especially suitable for deployment in real-world applications with limited resources. Efficient inference not only shortens response time but also significantly reduces computational and energy costs.
- Research accessibility: Open-source SLMs are small enough to be accessible to a wide range of researchers, especially teams that do not have the resources to handle larger models. They provide a low-cost platform for experimentation and innovation in language modeling.
- Architectural and optimization advances: SLMs incorporate a variety of architectural and performance optimizations that significantly improve computational efficiency. These optimizations enable SLMs to train quickly on common GPUs with low memory consumption.
- Open source contributions: SLM developers have made significant contributions to the open source community by publicly releasing model checkpoints and code, encouraging other researchers to build on this foundation.
- End-user applications: Thanks to their strong performance and compact size, SLMs are well suited for end-user applications and may even run on mobile devices, providing a lightweight platform for a wide range of applications.
- Training data and process: SLM training is not only efficient but also repeatable. It uses a mixture of natural language data and code data, with the aim of making pre-training more transparent and accessible.


### **[1.2 Phi-4 (Microsoft Research)]**
Phi-4 is the latest generation of self-supervised language models developed by Microsoft Research. It builds on the success of Phi-3 and improves on it in several ways. Phi-4 performs well on several public benchmarks and, in particular, makes significant progress in handling long contexts and multimodal tasks. Its support for contexts of up to 256K further improves the model's reasoning and contextual understanding.

- Phi-4-mini is a 4.2B-parameter language model (256K and 8K context lengths).
- Phi-4-small is a 9B-parameter language model (256K and 16K context lengths).
- Phi-4-medium is an 18B-parameter language model (256K and 8K context lengths).
- Phi-4 Vision is a 5B-parameter multimodal model that integrates language and vision capabilities to handle more complex cross-modal tasks.
- The release of the Phi-4 family will further advance the performance of large-scale models in real-world applications, especially in the areas of complex dialog generation and image understanding.

In this example, we will learn how to use QLoRA to fine-tune phi-4-mini-8k-instruct. QLoRA is an efficient fine-tuning technique that quantizes a pre-trained language model to 4 bits and attaches small trainable “low-rank adapters” to it. We chose to train the model on a CPU. Although training takes longer, a CPU offers better cost-effectiveness, broader availability, and ease of use for smaller training tasks and debugging, which in some cases do not require the computational power of a GPU.
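
To make the 4-bit idea concrete, here is a back-of-the-envelope estimate of how much memory the weights alone occupy at different precisions. It is a rough sketch rather than a measurement: `weight_memory_gb` is a hypothetical helper, the 4.2B parameter count is the figure quoted for Phi-4-mini above, and the result ignores activations, optimizer state, and quantization metadata.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 4.2e9  # parameter count quoted for Phi-4-mini in this article

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(num_params, bits):.1f} GB")
```

The gap between the first and last rows is the reason quantization matters on memory-constrained hardware.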

## **2. Hands-on lab**

### **[2.1 How to use Azure Machine Learning]**
To begin, I will briefly describe how to get started with Azure Machine Learning. If you are already familiar with it, you can skip this step and go directly to the model fine-tuning steps. For this experiment, we work in Azure Machine Learning studio. In this tutorial, I will demonstrate training, registering, and deploying the Phi-4 model, which will help you become familiar with the core concepts of Azure Machine Learning and its most common uses. You will learn how to run a training job on a scalable compute resource, deploy the resulting model, and test the deployment. You will create a training script that handles data preparation, training, and model registration. After training the model, you will deploy it as an endpoint and then invoke the endpoint for inference.

2.1.1 To use Azure Machine Learning, you need a workspace. If you don't have one, complete “Create resources you need to get started” to create a workspace.

2.1.2 Log in to the studio and select your workspace (if it is not already open).

2.1.3 Open or create a notebook in your workspace. If you want to copy and paste code into cells, create a new notebook. Alternatively, open a notebook from the Samples section of the studio and select Clone to add it to your Files.

2.1.4 Set up the kernel and open the notebook in Visual Studio Code (VS Code).

2.1.5 In the top bar above the open notebook, create a compute instance (if one does not already exist). If the compute instance is stopped, select “Start compute” and wait for it to run. Then make sure that the kernel in the upper right corner is Python 3.10 - SDK v2. If it is not, use the drop-down list to select that kernel.

2.1.6 If you see a banner prompting you to authenticate, select “Authenticate”.

2.1.7 You can run the notebook here or open it in VS Code to get a full Integrated Development Environment (IDE) with the power of Azure Machine Learning resources. Select “Open in VS Code” and then choose the Web or Desktop option. When started this way, VS Code attaches to the compute instance, kernel, and workspace file system.

2.1.8 Before diving into the code, you need a way to reference the workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. You will create a client object as a handle to the workspace (named workspace_ml_client in the code below) and then use it to manage resources and jobs. When you create the client in step 2.2.5, enter your subscription ID, resource group name, and workspace name.

### **[2.2 Preparation]**


2.2.1 First, install the datasets library. The second cell installs the package through the currently running Python interpreter (sys.executable), which ensures that it lands in the environment the notebook kernel actually uses.

In [None]:
!pip install datasets

In [None]:
import sys
!{sys.executable} -m pip install datasets

2.2.2 Let's prepare the dataset. In this example, we use the load_dataset function to load the HuggingFaceH4/ultrachat_200k dataset from the Hugging Face Hub and specify that only the first 2% of the train_sft split should be loaded.

In [None]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split='train_sft[:2%]')

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

2.2.3 Let's create training and test examples using a shorter version of our dataset. To instruction-tune our model, we need to convert the structured examples into a collection of tasks described by instructions. We define a formatting_function that takes a sample and returns a string in the instruction format.
We use train_test_split to split the dataset into an 80% training set and a 20% test set, and save them as train.jsonl and eval.jsonl, respectively (in JSON Lines format). These files are used for the subsequent training and evaluation.

In [None]:
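import os

# Make sure the output directory exists before writing the JSON Lines files
os.makedirs("data", exist_ok=True)

dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset['train']
train_dataset.to_json("data/train.jsonl")
test_dataset = dataset['test']
test_dataset.to_json("data/eval.jsonl")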
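
The preparation step can be illustrated without downloading anything. The sketch below uses ten made-up conversations in the `messages` layout that the training script reads, a hypothetical `formatting_function` with an illustrative tag layout (the real layout comes from the tokenizer's chat template), and the standard library in place of `train_test_split` and `to_json`.

```python
import json
import os
import random
import tempfile


def formatting_function(sample: dict) -> str:
    """Flatten one chat sample into a single instruction-style string.

    The <|role|> ... <|end|> tags are illustrative only.
    """
    return "\n".join(
        f"<|{m['role']}|>\n{m['content']}<|end|>" for m in sample["messages"]
    )


def split_rows(rows: list, test_size: float, seed: int = 0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


def write_jsonl(path: str, rows: list) -> None:
    """Write one JSON object per line (the JSON Lines format)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


# Ten made-up samples standing in for the real dataset
rows = [
    {"messages": [
        {"role": "user", "content": f"Question {i}?"},
        {"role": "assistant", "content": f"Answer {i}."},
    ]}
    for i in range(10)
]

train_rows, test_rows = split_rows(rows, test_size=0.2)
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.jsonl")
eval_path = os.path.join(out_dir, "eval.jsonl")
write_jsonl(train_path, train_rows)
write_jsonl(eval_path, test_rows)

print(len(train_rows), len(test_rows))
print(formatting_function(rows[0]))
```

Each line of the resulting files is an independent JSON object, which is what lets `load_dataset('json', ...)` in the training script read them back row by row.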

2.2.4 With the training and test sets saved in JSON Lines format, let's load the Azure ML SDK, which will help us create the necessary components.

In [None]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output
from azure.ai.ml import command, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

2.2.5 Now, let's create the workspace client. The following code attempts to connect to an Azure ML workspace using the default authentication method (DefaultAzureCredential) and the workspace configuration file. If the automatic connection fails (for example, because the configuration file is unavailable or the necessary information is missing), the code falls back to the subscription ID, resource group, and workspace name that you fill in by hand and creates the connection from those values. Either way, the code ends up with an MLClient instance for accessing the Azure Machine Learning service.

In [None]:
credential = DefaultAzureCredential()
workspace_ml_client = None
try:
    workspace_ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    subscription_id= "Enter your subscription_id"
    resource_group = "Enter your resource_group"
    workspace= "Enter your workspace name"
    workspace_ml_client = MLClient(credential, subscription_id, resource_group, workspace)
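
`MLClient.from_config` looks for a `config.json` file that describes the workspace. On a compute instance this file is normally already present; elsewhere you can write it yourself. The sketch below creates one with placeholder values (the three strings are assumptions to replace with your own details) in a temporary folder; in practice the file would sit next to the notebook or in a parent folder.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values (assumptions): replace them with your own workspace details
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Written to a temporary folder here so that the sketch does not overwrite an existing file
config_path = Path(tempfile.mkdtemp()) / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=2), encoding="utf-8")

# Read the file back to confirm that it is valid JSON with the expected keys
loaded = json.loads(config_path.read_text(encoding="utf-8"))
print(sorted(loaded))
```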

2.2.6 Here, let's create a custom training environment. The following code defines an Azure Machine Learning environment named llm-training that is based on a specified Docker image (acft-hf-nlp-cpu:latest) and includes a Conda environment file. It then uploads this environment to the Azure ML workspace via workspace_ml_client, updating the environment if it already exists and creating it if it does not. This environment will be used for the language model training job.

In [None]:
from azure.ai.ml.entities import Environment, BuildContext
env_docker_image = Environment(
    image="mcr.microsoft.com/azureml/curated/acft-hf-nlp-cpu:latest",
    conda_file="environment/conda.yml",
    name="llm-training",
    description="Environment created for llm training.",
)
workspace_ml_client.environments.create_or_update(env_docker_image)

2.2.7 Let's take a look at conda.yml. At first, my code couldn't read conda.yml properly, so I created the .yml file myself and uploaded it to the notebook folder in Azure Machine Learning studio, after which everything ran fine.

In [None]:
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=24.0
  - pip:
    - bitsandbytes==0.43.1
    - transformers~=4.41
    - peft~=0.11
    - accelerate~=0.30
    - trl==0.8.6
    - einops==0.8.0
    - datasets==2.19.1
    - wandb==0.17.0
    - mlflow==2.13.0
    - azureml-mlflow==1.56.0 
    - torchvision==0.18.0    

### **[2.3 Training]**
Let's take a look at the training script. We will use the approach introduced in the paper “QLoRA: Efficient Finetuning of Quantized LLMs” by Tim Dettmers et al. QLoRA is a technique that reduces the memory footprint of a large language model during fine-tuning without sacrificing performance. TL;DR: QLoRA works as follows:
- Quantize a pre-trained model to 4 bits and freeze it.
- Attach small, trainable adapter layers (LoRA).
- Fine-tune only the adapter layers while using the frozen quantized model as context.
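
The three steps above can be sketched in plain Python on a single tiny weight matrix. This is a toy illustration under loud assumptions: it uses simple absmax rounding to 16 levels instead of the NF4 data type with block-wise scales that QLoRA actually uses, the dimensions are tiny, and all function names are hypothetical.

```python
import random


def quantize_4bit(weights):
    """Absmax quantization of a weight matrix to 16 integer levels (-8..7).

    Simplification: real QLoRA uses the NF4 data type with block-wise scales.
    """
    absmax = max(abs(w) for row in weights for w in row) or 1.0
    scale = absmax / 7
    codes = [[max(-8, min(7, round(w / scale))) for w in row] for row in weights]
    return codes, scale


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def qlora_forward(codes, scale, lora_a, lora_b, x, alpha, r):
    """y = dequant(W) x + (alpha / r) * B (A x); only A and B would be trained."""
    base = [scale * v for v in matvec(codes, x)]
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


rng = random.Random(0)
d_out, d_in, r, alpha = 6, 8, 2, 4
w = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]

codes, scale = quantize_4bit(w)  # step 1: frozen 4-bit base weights
lora_a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # step 2: trainable
lora_b = [[0.0] * r for _ in range(d_out)]  # step 2: trainable, zero-initialized

x = [1.0] * d_in
y = qlora_forward(codes, scale, lora_a, lora_b, x, alpha, r)  # step 3 would update A and B only

trainable = r * (d_in + d_out)
print(f"trainable adapter parameters: {trainable} of {d_in * d_out} base weights")
```

Because `lora_b` starts at zero, the adapter initially leaves the output of the quantized base model unchanged. At this toy size the adapter is not much smaller than the matrix, but for a 4096 × 4096 layer with r = 8 it has 65,536 parameters against roughly 16.8 million base weights.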

2.3.1 The following code implements a complete process for fine-tuning the pre-trained language model Phi-4-mini. The model is fine-tuned efficiently with PEFT (a LoRA configuration), and the script covers training, evaluation, logging, and model saving. The training process is built on SFTTrainer, and Hugging Face's transformers and datasets libraries are used to handle the model and the data. Note that no quantization configuration is passed to from_pretrained in this CPU version, so the LoRA adapters are attached to an unquantized model.

**Note**: For CPU training we disable fp16 and bf16, use small values for “per_device_eval_batch_size” and “per_device_train_batch_size”, and disable “gradient_checkpointing” to avoid its overhead. We also remove Flash Attention, load the model in CPU-friendly float32, explicitly set “device_map” to load the model on the CPU, and reduce the maximum sequence length by setting tokenizer.model_max_length and max_seq_length to smaller values.

In [None]:
%%writefile src/train.py

import os
#import mlflow
import argparse
import sys
import logging

import datasets
from datasets import load_dataset
from peft import LoraConfig
import torch
import transformers
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig

logger = logging.getLogger(__name__)


###################
# Hyper-parameters
###################
training_config = {
    "bf16": False,
    "do_eval": False,
    "learning_rate": 5.0e-06,
    "log_level": "info",
    "logging_steps": 20,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 1,
    "max_steps": -1,
    "output_dir": "./checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_eval_batch_size": 2,
    "per_device_train_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": False,  # disabled on CPU to avoid recomputation overhead
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.1,
}

peft_config = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",
    "modules_to_save": None,
}
train_conf = TrainingArguments(**training_config)
peft_conf = LoraConfig(**peft_config)

###############
# Setup logging
###############
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = train_conf.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()

# Log on each process a small summary
logger.warning(
    f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
    + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}"
)
logger.info(f"Training/evaluation parameters {train_conf}")
logger.info(f"PEFT parameters {peft_conf}")

################
# Model Loading
################
checkpoint_path = "microsoft/Phi-4-mini-8k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="eager",  # eager attention; Flash Attention is not used on CPU
    torch_dtype=torch.float32,  # CPU-friendly full precision
    device_map="cpu",
)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
tokenizer.model_max_length = 2048
tokenizer.pad_token = tokenizer.unk_token  # use unk rather than eos token to prevent endless generation
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'

##################
# Data Processing
##################
def apply_chat_template(
    example,
    tokenizer,
):
    messages = example["messages"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example



def main(args):
    train_dataset = load_dataset('json', data_files=args.train_file, split='train')
    test_dataset = load_dataset('json', data_files=args.eval_file, split='train')
    column_names = list(train_dataset.features)

    processed_train_dataset = train_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to train_sft",
    )

    processed_test_dataset = test_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to test_sft",
    )

    ###########
    # Training
    ###########
    trainer = SFTTrainer(
        model=model,
        args=train_conf,
        peft_config=peft_conf,
        train_dataset=processed_train_dataset,
        eval_dataset=processed_test_dataset,
        max_seq_length=2048,
        dataset_text_field="text",
        tokenizer=tokenizer,
        packing=True
    )
    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()


    #############
    # Evaluation
    #############
    tokenizer.padding_side = 'left'
    metrics = trainer.evaluate()
    metrics["eval_samples"] = len(processed_test_dataset)
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)


    # ############
    # # Save model
    # ############
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(model, os.path.join(args.model_dir, "model.pt"))

def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--train-file", type=str, help="Input data for training")
    parser.add_argument("--eval-file", type=str, help="Input data for eval")
    parser.add_argument("--model-dir", type=str, default="./", help="output directory for model")
    parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
    parser.add_argument(
        "--batch-size",
        default=16,
        type=int,
        help="mini batch size for each gpu/process",
    )
    parser.add_argument("--learning-rate", default=0.001, type=float, help="learning rate")
    parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
    parser.add_argument(
        "--print-freq",
        default=200,
        type=int,
        help="frequency of printing training statistics",
    )

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()
    # call main function
    main(args)
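
The job command in step 2.3.3 passes values to this script as command-line flags. As a reminder of how `argparse` maps them, the trimmed-down parser below (a sketch with only three of the script's arguments) is fed an explicit, hypothetical argument list instead of `sys.argv`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reduced copy of the script's parser with three of its arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-file", type=str, help="Input data for training")
    parser.add_argument("--model-dir", type=str, default="./", help="output directory for model")
    parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
    return parser


# Hypothetical argument list, standing in for what the job command line supplies
args = build_parser().parse_args(["--train-file", "data/train.jsonl", "--epochs", "1"])

print(args.train_file)  # dashes in flag names become underscores
print(args.epochs)      # converted to int by type=int
print(args.model_dir)   # falls back to the default
```

In the script above, `main` reads only `train_file`, `eval_file`, and `model_dir`; the remaining flags are parsed, but the hyperparameters that actually drive training come from `training_config`.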

2.3.2 Let's create a training compute. We import the AmlCompute class to create, configure, and manage the compute cluster.

In [None]:
from azure.ai.ml.entities import AmlCompute
# If you have a specific compute size to work with, change it here. By default we use a CPU-only VM size.

compute_cluster_size = "Standard_E4ds_v4"  # 4 vCPUs, 32 GB RAM
compute_cluster = "phi4"

try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,
            size=compute_cluster_size,
            tier="Dedicated",
            max_instances=1,  # For multi node training set this to an integer value more than 1
        )
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        # Surface the actual error instead of hiding it
        print(f"Error creating the compute cluster: {e}")

If you don't know the names and sizes of the compute resources in your workspace, you can list them with the following code.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Connecting to the Azure ML Workspace
credential = DefaultAzureCredential()
ml_client = MLClient.from_config(credential)

# Get all compute resources in the workspace
compute_list = ml_client.compute.list()

# Print information about each compute resource (not every compute type has a size)
for compute in compute_list:
    print(f"Compute Name: {compute.name}, Size: {getattr(compute, 'size', 'n/a')}, Type: {compute.type}")


Some tips that may help:

- The LoRA rank does not need to be very high (e.g., r=256). In our experience, a rank of 8 or 16 is sufficient as a baseline.
- If the training dataset is small, it is better to set alpha equal to the rank. Training with alpha set to 2*rank or 4*rank is usually unstable on small datasets.
- When using LoRA, keep the learning rate small. Learning rates such as 1e-3 or 2e-4 are not recommended. We start with 8e-4 or 5e-5.
- Check that you have enough GPU memory before setting a larger batch size, because OOM (out of memory) errors may occur when the context length is as long as 8K. Gradient checkpointing and gradient accumulation let you train with a larger effective batch size.
- If you are sensitive to batch size and memory, do not insist on Adam, including low-bit Adam. Adam requires extra GPU memory to store the first and second moments. SGD (stochastic gradient descent) converges more slowly but does not take up extra GPU memory.
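
A few of these tips come down to simple arithmetic. The helpers below are hypothetical names written for this illustration: the LoRA scaling factor is the standard alpha / rank, the effective batch size counts the samples behind each optimizer step, and the optimizer-state estimate assumes two float32 values per trainable parameter for Adam and none for plain SGD without momentum.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the adapter update in standard LoRA."""
    return alpha / rank


def effective_batch_size(per_device: int, accumulation_steps: int, num_processes: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * accumulation_steps * num_processes


def optimizer_state_bytes(trainable_params: int, optimizer: str, bytes_per_value: int = 4) -> int:
    """Extra memory for optimizer state (Adam: two values per parameter, plain SGD: none)."""
    values_per_param = {"adam": 2, "sgd": 0}[optimizer]
    return trainable_params * values_per_param * bytes_per_value


# Values from the training script above: r=8, lora_alpha=16, batch size 4, no accumulation
print(lora_scaling(alpha=16, rank=8))
print(effective_batch_size(per_device=4, accumulation_steps=1))
# Same effective batch size with a quarter of the per-step memory for activations
print(effective_batch_size(per_device=1, accumulation_steps=4))
print(optimizer_state_bytes(1_000_000, "adam"))
print(optimizer_state_bytes(1_000_000, "sgd"))
```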

2.3.3 Now, let's use the training script above to submit a job to the AML compute we just created. Since we are using a CPU with less computational power and less memory, we choose smaller batches, fewer epochs, and more conservative learning rates and momentum. This reduces the computational burden during training and helps avoid running out of memory or stalling the system.

**Note**: The author initially failed to create the training environment at this step. Explicitly setting the environment version (version=“15”) when creating the environment resolved the problem, and the job then ran normally.

In [None]:
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration

job = command(
    inputs=dict(
        train_file=Input(type="uri_file", path="./data/train.jsonl"),
        eval_file=Input(type="uri_file", path="./data/eval.jsonl"),
        epoch=1,
        batchsize=2,
        lr=0.01,
        momentum=0.9,
        prtfreq=200,
        output = "./outputs"
    ),
    code="./src",  # local path where the code is stored
    compute=compute_cluster,  # name of the cluster created in step 2.3.2
    command="accelerate launch train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    environment="azureml:llm-training@latest",  # "@latest" selects the most recent registered version
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
returned_job  = workspace_ml_client.jobs.create_or_update(job)
workspace_ml_client.jobs.stream(returned_job.name)
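
The `${{inputs.<name>}}` placeholders in `command` are filled in by Azure Machine Learning when the job runs; file inputs resolve to a path on the compute node. The snippet below imitates that substitution locally with a regular expression so that you can preview the final command line. `render_command` is a hypothetical helper, not part of the SDK, and the mounted path is invented for illustration.

```python
import re


def render_command(template: str, inputs: dict) -> str:
    """Replace every ${{inputs.name}} placeholder with str(inputs[name])."""
    def substitute(match):
        return str(inputs[match.group(1)])
    return re.sub(r"\$\{\{inputs\.(\w+)\}\}", substitute, template)


template = (
    "accelerate launch train.py --train-file ${{inputs.train_file}} "
    "--epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}}"
)
inputs = {
    "train_file": "/mnt/inputs/train.jsonl",  # invented path, for illustration only
    "epoch": 1,
    "batchsize": 2,
}

print(render_command(template, inputs))
```

A placeholder whose name is missing from `inputs` raises a `KeyError` here, which is a convenient way to catch a typo in an input name before submitting the job.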

2.3.4 Let's look at the job output. The following code retrieves and prints the outputs of returned_job, looked up by job name (job_name). outputs is a dictionary-like object that holds the results of the job, through which we can view and download the files or data the job produced.

In [None]:
# check if the `trained_model` output is available
job_name = returned_job.name
print("pipeline job outputs: ", workspace_ml_client.jobs.get(job_name).outputs)

### **[2.4 Endpoint]**

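2.4.1 After fine-tuning the model, let's register it in the workspace so that we can create an endpoint. The following code registers the trained Phi-4 model as an MLflow model in the Azure ML model registry for subsequent access, management, and deployment. The path assumes that the job wrote an MLflow-format model to outputs/mlflow_model_folder.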

In [None]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

run_model = Model(
    path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/outputs/mlflow_model_folder",
    name="phi-4-finetuned",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
model = workspace_ml_client.models.create_or_update(run_model)

2.4.2 Let's create the endpoint. The following code makes sure that an online endpoint is available in Azure ML for real-time inference with the model, and that an Azure managed identity is used for authentication.

In [None]:
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

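# Placeholders (assumptions): set these to your own values
endpoint_name = "phi4-endpoint"  # must be unique within the Azure region
uai_id = ""  # resource ID of a user-assigned managed identity; leave empty to skip

# Check if the endpoint already exists in the workspace
try:
    endpoint = workspace_ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except Exception: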
    # Create an online endpoint if it doesn't exist

    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"Test endpoint for {model.name}",
        identity=IdentityConfiguration(
            type="user_assigned",
            user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],
        )
        if uai_id != ""
        else None,
    )

# Trigger the endpoint creation
try:
    workspace_ml_client.begin_create_or_update(endpoint).wait()
    print("\n---Endpoint created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Endpoint creation failed. Detailed Response:\n{err}"
    ) from err

2.4.3 After creating the endpoint, we can move on to creating the deployment. First, initialize the parameters for the online deployment: the deployment name, SKU, request timeout, and environment variables.

In [None]:
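# Initialize deployment parameters

deployment_name = "phi4-deploy"
sku_name = "Standard_NCs_v3"

REQUEST_TIMEOUT_MS = 90000

# subscription_id and resource_group are the values from step 2.2.5;
# uai_client_id is the client ID of the user-assigned managed identity.
# All three must be defined before this cell runs.
deployment_env_vars = {
    "SUBSCRIPTION_ID": subscription_id,
    "RESOURCE_GROUP_NAME": resource_group,
    "UAI_CLIENT_ID": uai_client_id,
}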

2.4.4 For inference, we use a different base image. The environment below specifies the Docker image for the deployment, and its configuration contains the routes for liveness checks, readiness checks, and scoring requests, each served on port 5001 under its own path. With this configuration, Azure ML can verify that the container is working at each lifecycle stage: starting up, ready to process requests, and processing inference requests.

In [None]:
from azure.ai.ml.entities import Model, Environment
env = Environment(
    image='mcr.microsoft.com/azureml/curated/foundation-model-inference:latest',
    inference_config={
        "liveness_route": {"port": 5001, "path": "/"},
        "readiness_route": {"port": 5001, "path": "/"},
        "scoring_route": {"port": 5001, "path": "/score"},
    },
)

2.4.5 Finally we create and deploy an online deployment in Azure Machine Learning. Configure the parameters required for the deployment, such as the deployment name, model used, compute instance type, environment settings, request timeout, health checks, etc. Finally, the code triggers the creation or update of the online deployment through the begin_create_or_update method.

In [None]:
from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment
)

deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model.id,
    instance_type=sku_name,
    instance_count=1,
    #code_configuration=code_configuration,
    environment = env,
    environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(request_timeout_ms=REQUEST_TIMEOUT_MS),
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)

# Trigger the deployment creation
try:
    workspace_ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err
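
The probe settings determine how long a container may take to become healthy. Assuming the usual probe semantics (the first check runs after `initial_delay` seconds, checks repeat every `period` seconds, and the container is given up on after `failure_threshold` consecutive failures), the values above leave a generous budget for loading a large model. `worst_case_startup_seconds` is a hypothetical helper for this rough arithmetic.

```python
def worst_case_startup_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Rough upper bound on the time before a probe gives up on a container."""
    return initial_delay + period * failure_threshold


# Values taken from the ProbeSettings in the deployment above
budget = worst_case_startup_seconds(initial_delay=500, period=100, failure_threshold=30)
print(f"{budget} s (about {budget / 60:.0f} minutes)")
```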

2.4.6 If you want to delete the deployment and the endpoint, see the following code.

In [None]:
# Delete the deployment first, then the endpoint; wait for each operation to finish
workspace_ml_client.online_deployments.begin_delete(
    name=deployment_name, endpoint_name=endpoint_name
).wait()
workspace_ml_client.online_endpoints.begin_delete(name=endpoint_name).wait()

*References:*

- https://learn.microsoft.com/zh-cn/azure/machine-learning/tutorial-azure-ml-in-a-day?view=azureml-api-2
- https://techcommunity.microsoft.com/blog/machinelearningblog/finetune-small-language-model-slm-phi-3-using-azure-machine-learning/4130399
- https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
- https://www.philschmid.de/sagemaker-falcon-180b-qlora
- https://github.com/daekeun-ml/azure-llm-fine-tuning