# <span style="font-size:0.8em;">📝 Plan of Action</span>

This notebook guides you through the end-to-end process of fine-tuning the **Qwen2.5-7B-Instruct** model into a **reasoning model** using medical data on **Azure ML**. Qwen2.5-7B-Instruct is an instruction-tuned large language model developed by Alibaba Cloud, based on their Qwen2.5-7B foundation model. It is optimized for following human instructions across a wide range of tasks, such as question answering, code generation, and language understanding. In this walkthrough, you will learn how to enhance the model's reasoning capabilities using **Reinforced Fine-Tuning (RFT)** techniques, with a focus on **GRPO** (**G**roup **R**elative **P**olicy **O**ptimization).

<img src="images/agenda.png" alt="image.png" width="1000"/>

-------------------------------------------------------------------------------------------------------

# <span style="font-size:0.8em;">⚙️ Section 1: Setup - AML Resources</span>

Install the necessary packages and CLI tools to get started with Azure Machine Learning:

- **azure-core**: Provides core utilities and HTTP infrastructure used by all Azure SDKs for Python.
- **azure-ai-ml**: The Python SDK used to interact with Azure Machine Learning for managing and running ML workflows.
- **rich**: A library for rendering richly formatted text, tables, and progress bars directly in the terminal.
- **huggingface_hub**: Lets you download, upload, and manage models and datasets from the Hugging Face Hub.
- **Azure CLI**: The `az` command-line interface used to manage Azure resources and services from your terminal.

✅ These tools form the foundation for orchestrating scalable and efficient ML workloads on Azure.

In [None]:
%%capture

# Installing Azure cli and Azure SDK for Python.
! pip install azure-core azure-ai-ml rich huggingface_hub
! curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

In this example, a **Hugging Face access token** is used to download the **Qwen2.5-7B-Instruct model**. This token is required only for the initial run to access the gated model; subsequent runs will use the cached copy of the model from the workspace.

In [None]:
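# Set your Hugging Face token here (generate one at https://huggingface.co/settings/tokens).
# The token is needed only on the first run, to download the model from Hugging Face.
# It is set through os.environ so that the value persists in the kernel process
# (a `! export` only affects a temporary subshell).
import os

os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxxx"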

The Azure Machine Learning (AML) **setup process is encapsulated** into a script that provisions all required resources in the workspace. \
By the end of the setup, the AML workspace will be fully configured with the following resources:

- **Dataset**: [MedMCQA](https://medmcqa.github.io), a large-scale, multi-subject, multiple-choice dataset for medical-domain question answering. We use a modified version of MedMCQA, restricted to question/answer pairs that have a single correct answer. The modified dataset used in the demo is in `datasets/med_mcqa`.
- **Model**: [Qwen2_5-7B-Instruct_base](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- **Compute Cluster**: `STANDARD_ND96ISR_H100_V5` cluster with at least 2 nodes
- **Environment**: Designed for large-scale, distributed GRPO training and inference of reasoning models using Azure Machine Learning, TRL, DeepSpeed, vLLM, and LoRA.

In [None]:
from aml_setup import (
    setup,
)  # This script sets up the Azure ML client, model, and environment.

ml_client, med_mcqa_data, model, compute, environment = setup()

The complete dataset can be found in `datasets/med_mcqa`. Here is a sample:

## <span style="font-size:0.8em;"> Sample DataSet</span>

📘 When solving a multiple-choice question, we ask the LLM to follow this structure:

**Think out loud**: Explain your reasoning step by step. Wrap this part in `<think>` tags.  
**Give your final answer**: Clearly state your chosen option (A, B, C, or D) and explain why it's correct. Wrap this part in `<answer>` tags.<br>
**Final Answer line**: On a new line, write `Final Answer:` followed by just one letter: A, B, C, or D.

✅ **Example Question**: 

```text
CSF Rhinorrhea occurs due to damage of:

Options:

A. Roof of orbit 
B. Cribriform plate of ethmoidal bone 
C. Frontal sinus 
D. Sphenoid bone
```
**Ideal reasoning model response:**
```text
<think>

Start by identifying the anatomical structure most commonly associated with CSF (cerebrospinal fluid) leakage. CSF rhinorrhea typically results from a breach in the skull base, especially the cribriform plate of the ethmoid bone, which is thin and located near the nasal cavity.

</think>

The cribriform plate of the ethmoid bone is the most common site of CSF leakage into the nasal cavity, leading to CSF rhinorrhea. This makes option B the correct answer. 

<answer>B</answer>

Final Answer: B
```
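The structure above lends itself to simple parsing. The following is a minimal sketch, not the notebook's own code: the helper name `parse_response` and the regular expressions are assumptions inferred from the sample response.

```python
import re

# Hypothetical patterns, inferred from the sample response above.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*([A-D])\s*</answer>")
FINAL_RE = re.compile(r"^Final Answer:\s*([A-D])\s*$", re.MULTILINE)


def parse_response(text: str) -> dict:
    """Split a model response into its reasoning and chosen option."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    final = FINAL_RE.search(text)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "answer": answer.group(1) if answer else None,
        "final_answer": final.group(1) if final else None,
    }


sample = """<think>
The cribriform plate is thin and lies next to the nasal cavity.
</think>

Option B is correct.

<answer>B</answer>

Final Answer: B
"""

parsed = parse_response(sample)
print(parsed["answer"], parsed["final_answer"])
```

A response that omits a tag yields `None` for that field, which makes malformed outputs easy to spot.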


---------------------------------------------------------------------------------------


# <span style="font-size:0.8em;">🧩 Section 2: How to train a Reasoning Model on AML Using GRPO Trainer</span>

<div style="display: flex; align-items: flex-start; gap: 32px;">
  <div style="flex: 1;">
    <p>The reasoning model training process typically includes three key components:</p>
    <ul>
      <li><strong>Sampler</strong> – Generates multiple candidate responses from the model</li>
      <li><strong>Reward Function</strong> – Evaluates and scores each response based on criteria like accuracy or structure</li>
      <li><strong>Trainer</strong> – Updates the model to reinforce high-quality outputs</li>
    </ul>
    <p>
      In this example we use the <strong>GRPO Trainer</strong> to train the Qwen2.5-7B-Instruct model into a reasoning model. We use the GRPO implementation from the TRL library.
    </p>
    <br>
    <p>
      <strong>GRPO</strong> (<strong>G</strong>roup <strong>R</strong>elative <strong>P</strong>olicy <strong>O</strong>ptimization) is a reinforcement learning technique that:
    </p>
    <ul>
      <li><em>Compares</em> multiple answers within a group</li>
      <li><em>Rewards</em> the best-performing outputs</li>
      <li><em>Penalizes</em> poor ones</li>
      <li>Applies careful updates to <em>avoid sudden changes</em></li>
    </ul>
  </div>
  <div style="flex: 1; display: flex; justify-content: center;">
    <img src="images/training_loop.png" alt="Training Loop" style="max-width:100%; width: 600px;"/>
  </div>
</div> 

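The "compare within a group" idea can be sketched in a few lines. This is an illustration of group-relative scoring only, assuming the population standard deviation and a small `eps` for numerical stability; library implementations such as TRL may use a different estimator, and the rewards below are made-up values.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four completions sampled from the same prompt.
rewards = [1.0, 0.2, 0.2, 1.0]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.2f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean get a negative advantage and are discouraged.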

## <span style="font-size:0.8em;">Why is training reasoning models easier on Azure ML?</span>  


- **AzureML natively supports reasoning model training**, with seamless integration of vLLM and scalable training workflows.

- **DeepSpeed scales effortlessly on AML**, enabling multi-node training by sharding model states across GPUs.

- **LoRA support makes fine-tuning large models lightweight and cost-efficient**, even on smaller setups.

- **Robust tracking, metrics, and debugging tools** make experimentation on AML smooth and production-ready.

## <span style="font-size:0.8em;"> GRPO Trainer Configuration</span>  

Four main configs and scripts are used to train a reasoning model with TRL.


**1. BldDemo_Reasoning_Train.py**

This is the main script for running GRPO training. The section shown below is where you control the **_base model (current policy)_**, **_reward function_**, **_dataset_**, and **_LoRA_** configuration for PEFT (Parameter-Efficient Fine-Tuning).

It's **important to note that the sampler, trainer and grader are abstracted** within the GRPO trainer implementation.


<img src="images/grpo_trainer.png" alt="image.png" width="1050"/>

**2. grpo_trainer_rewards.py**

This file defines a set of reward functions **_used to evaluate model outputs_** during training for reasoning tasks. 

For example, a `format_reward` function **_encourages the model_** to follow the correct output structure, while an `accuracy_reward` function **_promotes correct answers_**. Both reward desired behavior and penalize deviations.

_format_reward_: 


<img src="images/reward_func.png" alt="image.png" width="1050"/>

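To make the two reward types concrete, here is a hedged sketch of what such functions could look like. It is not the content of `grpo_trainer_rewards.py`: the regular expressions, the plain-string completions, and the `solution` argument name are all assumptions made for illustration.

```python
import re

# Assumed response layout, inferred from the sample response in this notebook.
FORMAT_RE = re.compile(
    r"^\s*<think>.*?</think>.*?<answer>\s*[A-D]\s*</answer>.*?Final Answer:\s*[A-D]\s*$",
    re.DOTALL,
)
FINAL_RE = re.compile(r"Final Answer:\s*([A-D])")


def format_reward(completions: list[str], **kwargs) -> list[float]:
    """Return 1.0 for completions that follow the expected structure, else 0.0."""
    return [1.0 if FORMAT_RE.match(c) else 0.0 for c in completions]


def accuracy_reward(completions: list[str], solution: list[str], **kwargs) -> list[float]:
    """Return 1.0 when the final answer letter equals the reference letter."""
    rewards = []
    for completion, expected in zip(completions, solution):
        found = FINAL_RE.search(completion)
        rewards.append(1.0 if found and found.group(1) == expected else 0.0)
    return rewards


good = "<think>Thin bone near the nose.</think>\nB fits.\n<answer>B</answer>\n\nFinal Answer: B"
bad = "The answer is B."
print(format_reward([good, bad]))
print(accuracy_reward([good, bad], solution=["B", "B"]))
```

Each function returns one score per completion, so several reward functions can be evaluated on the same batch and combined afterwards.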

**3. grpo_trainer_config.yaml**

This file defines the training configuration. Some of the key trainer config parameters are discussed below:

a. **_vllm_mode_** - TRL leverages vLLM to accelerate sampling during reasoning model training.

It is supported in two modes:

- **Server Mode**
    - vLLM runs on _dedicated_ nodes/GPUs
    - Ideal for large-scale, high-throughput sampling

- **Colocate Mode**
    - vLLM and trainer _share_ the same GPU
    - Useful for smaller setups or resource-constrained environments

<img src="images/vllm.png" alt="image.png" width="1050"/>

b. **_reward_functions_**: The reward functions **_used by the grader_**. We define a format reward and an accuracy reward. \
c. **_reward_weights_**: The weight of each reward function. We give **_more weight_** to the accuracy reward. \
d. **_report_to_**: Integrates the **_logs and metrics_** with Azure ML.


<img src="images/reward_weights.png" alt="image.png" width="1050"/>

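The effect of the reward weights can be illustrated with a small weighted sum. This sketch assumes the 80% accuracy / 20% format split mentioned in the results discussion of this notebook; the function and dictionary names are hypothetical.

```python
# Weights follow the 80% accuracy / 20% format split described in this notebook.
REWARD_WEIGHTS = {"accuracy": 0.8, "format": 0.2}


def combined_reward(scores: dict[str, float], weights: dict[str, float] = REWARD_WEIGHTS) -> float:
    """Weighted sum of the individual reward-function scores."""
    return sum(weights[name] * scores[name] for name in weights)


# A well-formatted but wrong answer still earns the format share of the reward.
print(combined_reward({"accuracy": 0.0, "format": 1.0}))
# A well-formatted, correct answer earns the full reward.
print(combined_reward({"accuracy": 1.0, "format": 1.0}))
```

Weighting accuracy more heavily means the model cannot score well by producing the right structure alone.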

**4. DeepSpeed ZeRO config**

Azure ML simplifies hardware scaling with built-in support for distributed training. Use DeepSpeed to maximize hardware efficiency across GPU clusters.

In this example, a ZeRO Stage 3 config is used, in which the **model parameters**, **gradients**, and **optimizer states** are partitioned across the GPUs.

This drastically reduces the memory required per GPU. Without ZeRO Stage 3, larger models would need significantly more GPUs.

**Efficient scaling = Lower costs + ability to train much larger models**

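A rough back-of-the-envelope calculation shows why partitioning helps. The sketch below assumes full fine-tuning with mixed-precision Adam (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer states), a round 7e9 parameter count, and 16 GPUs. It ignores activations and inference memory, and with LoRA the gradient and optimizer terms would be far smaller.

```python
def model_state_gib(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Approximate per-GPU memory (GiB) for model states under mixed-precision Adam."""
    weights, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if zero_stage >= 1:
        optim /= num_gpus    # stage 1 partitions the optimizer states
    if zero_stage >= 2:
        grads /= num_gpus    # stage 2 also partitions the gradients
    if zero_stage >= 3:
        weights /= num_gpus  # stage 3 also partitions the parameters
    return (weights + grads + optim) / 2**30


PARAMS = 7e9  # assumed round figure for a 7B-parameter model
GPUS = 16     # 2 nodes x 8 GPUs, as in this walkthrough

for stage in (0, 1, 2, 3):
    print(f"ZeRO stage {stage}: {model_state_gib(PARAMS, GPUS, stage):6.1f} GiB per GPU")
```

Under these assumptions, Stage 3 divides the whole model-state footprint by the number of GPUs, whereas Stages 1 and 2 only shrink part of it.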

<img src="images/deepspeed_config_explain.png" alt="image.png" width="600"/>


a. _offload_optimizer_device = cpu_: Offloads the optimizer states, and the optimizer step computation, to the CPU. \
b. _zero_stage_: Stage 3 optimization, which **helps scale training horizontally** for bigger models. \
c. _train_micro_batch_size_per_gpu_: The **batch size that a single GPU** processes in one forward/backward pass. 

With a small `train_micro_batch_size_per_gpu` and `offload_optimizer_device: cpu`, you can **fit bigger models** or **train on fewer GPUs** at the cost of slightly longer training times.


<img src="images/deepspeed.png" alt="image.png" Width="1050"/>

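For orientation, the settings discussed above map onto a DeepSpeed JSON config roughly as follows. This is a sketch using key names from DeepSpeed's public config schema; the values are placeholders and the notebook's actual `deepspeed_stage3_zero_config.json` may differ. The helper `effective_batch_size` is hypothetical.

```python
import json

# Placeholder values; only the key names follow DeepSpeed's config schema.
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, optimizer states
        "offload_optimizer": {"device": "cpu"},  # keep optimizer states in CPU memory
    },
}


def effective_batch_size(config: dict, world_size: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * world_size
    )


print(json.dumps(deepspeed_config, indent=2))
print(effective_batch_size(deepspeed_config, world_size=16))
```

Lowering the micro batch size reduces activation memory per GPU; gradient accumulation can then be raised to keep the effective batch size unchanged.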

# <span style="font-size:0.8em;"> 🚀 Section 3: Launch the job! </span> 
The section below shows how to kick off the PyTorch distributed command job and run it across multiple nodes. \
For this command job we pass the **GRPO trainer config, base model, dataset, and DeepSpeed configuration** as arguments.

In [None]:
from azure.ai.ml import command, Input, Output
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment, Model
from azure.ai.ml.constants import AssetTypes
from aml_setup import N_NODES

# Below is a command job that takes the GRPO config, DeepSpeed config, dataset, and model as inputs.
# It kicks off a distributed job on a GPU cluster with 2 nodes (8 x H100 each).
# Pick the config files that match the number of nodes.
grpo_config = "grpo_trainer_config.yaml" if N_NODES == 2 else "grpo_trainer_config_single_node.yaml"
deepspeed_config = (
    "deepspeed_stage3_zero_config.json"
    if N_NODES == 2
    else "deepspeed_stage3_zero_config_single_node.json"
)

# Azure ML expects literal ${{inputs.x}} / ${{outputs.x}} placeholders. Inside an f-string,
# "{{" renders as a single "{", so each placeholder is written with four braces per side.
command_str = f"""python BldDemo_Reasoning_Train.py \
    --config {grpo_config} \
    --model_name_or_path ${{{{inputs.model_dir}}}} \
    --dataset_name ${{{{inputs.dataset}}}} \
    --output_dir ${{{{outputs.checkpoint_folder}}}} \
    --final_model_save_path ${{{{outputs.mlflow_model_folder}}}} \
    --deepspeed {deepspeed_config} \
    --mlflow_task_type "chat-completion" \
    --base_model_name "{model.name}"
"""

# Model directory and dataset as job inputs.
job_input = {
    "model_dir": Input(
        path=model.path,
        type=AssetTypes.CUSTOM_MODEL,
    ),
    "dataset": Input(
        type=AssetTypes.URI_FOLDER,
        path=med_mcqa_data.path,
    ),
}

# The job outputs the finetuned model in mlflow format and the intermediate checkpoints.
job_output = {
    "mlflow_model_folder": Output(
        type=AssetTypes.CUSTOM_MODEL,
        mode="rw_mount",
    ),
    "checkpoint_folder": Output(
        type=AssetTypes.URI_FOLDER,
        mode="rw_mount",
    ),
}

# Setting up the distributed training job.
job = command(
    code="./src",
    inputs=job_input,
    command=command_str,
    environment=environment,
    compute=compute.name,
    instance_count=N_NODES,
    outputs=job_output,
    distribution={
        "type": "PyTorch",
        # set process count to the number of gpus per node
        "process_count_per_instance": 8,
    },
    experiment_name="build-demo-reasoning-training-jobs",
    display_name=f"build-demo-reasoning-train-batchsize-{N_NODES*8}",
    properties={"_azureml.LogTrainingMetricsToAzMon": "true"},
    # Environment variables to enable profiling
    environment_variables={
        "KINETO_USE_DAEMON": "1",
        "ENABLE_AZUREML_TRAINING_PROFILER": "true",
        "AZUREML_PROFILER_WAIT_DURATION_SECOND": "2",
        "AZUREML_PROFILER_RUN_DURATION_MILLISECOND": "500",
        "AZUREML_COMMON_RUNTIME_USE_APPINSIGHTS_CAPABILITY": "true",
    },
)

The block below submits the job to Azure Machine Learning (AML) for execution.

After submission, the returned **train_job object contains metadata and status information** about the job, such as its ID, current state, and output location. 

In [None]:
# 🚀 Submit the job
train_job = ml_client.jobs.create_or_update(job)
train_job

## ⏳ **Wait** for the job to finish successfully before moving on to the next section
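One way to wait programmatically is to poll the job status until it reaches a terminal state. The sketch below is a generic polling loop with a stand-in status source so it runs without a workspace; the function name `wait_for_job` and the chosen polling interval are assumptions.

```python
import time
from typing import Callable

# Job states after which the status no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal state, then return it."""
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)


# Stand-in status source so the sketch runs without a workspace.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
final = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.0)
print(final)
```

In this notebook the status source could be `lambda: ml_client.jobs.get(train_job.name).status`. Alternatively, `ml_client.jobs.stream(train_job.name)` blocks until the job ends while streaming its logs.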
<hr>

## Register and deploy the fine tuned model

The output of the training is a set of files representing the weights of the trained model. To use it for inference, we register the files as a model and then create an endpoint and a deployment for it. An endpoint handles the security, URL, and traffic-splitting aspects of inference, whereas a deployment actually hosts and runs the registered model.

You can find the assets registered in this section in the AzureML portal ([ml.azure.com](https://ml.azure.com)). Navigate to your resource group and workspace, then click the Models or Endpoints tab in the left panel. Deployments are sub-entities of endpoints and can be found on the detail page of a particular endpoint.

In [None]:
# Registering the model is necessary to deploy the model to an online endpoint.

model_output_path = f"azureml://jobs/{train_job.name}/outputs/mlflow_model_folder"
run_model = Model(
    path=model_output_path,  # model output path from the job
    name="grpo-finetuned-model",  # registered model name
    description=f"Model created from run {train_job.name}.",
    type=AssetTypes.MLFLOW_MODEL,  # registering as mlflow model
)

ft_model = ml_client.models.create_or_update(run_model)

In [None]:
online_endpoint_name = "grpo-ft-model-endpoint"
# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,  # name for the endpoint
    description="Online endpoint for the GRPO fine-tuned model",
    auth_mode="key",
)
endpoint = ml_client.begin_create_or_update(endpoint).result()  # wait until the endpoint exists

In [None]:
# A deployment will be created in the online endpoint.
deployment = ManagedOnlineDeployment(
    name="grpo-ft-model-deployment",  # name for the deployment
    endpoint_name=online_endpoint_name,  # endpoint name where model will be deployed
    model=ft_model,  # finetuned and registered model
    instance_type="Standard_ND96amsr_A100_v4",
    instance_count=1,
)
ml_client.begin_create_or_update(deployment).result()  # block until the deployment finishes

## Results and metrics

This job uses a truncated dataset and fewer iterations than are needed to see a significant improvement, in order to keep the job runtime manageable. With over 100 iterations and the full dataset (about 6 hours on 16 H100s), you may see an improvement in the accuracy metric.

As training progresses, the length of completions may increase with more iterations. Ideally, the mean completion length should stabilize over time, indicating that responses are not being capped and are of reasonable size. These metrics are of interest: 

- eval_rewards/accuracy/mean: The mean accuracy reward over the eval dataset as training progresses. Note that we calculate this over the eval dataset but don't use it to update the model.
- eval_completions/mean_length: The mean length of the completions over the eval dataset. We should see this metric increase as the model begins to reason, which usually takes more tokens.
- eval_reward: How the overall reward (80% accuracy + 20% format) moves as training progresses. We should see this increase over longer runs.
- reward: The training version of the overall reward. This should also increase as training progresses, and it is the signal that informs the training process.

You can view these in the AzureML portal under the Metrics tab of the job. Expand the panel on the left labeled "Select metrics" and search for the metrics listed above.