In [None]:
%%capture

# Install the Azure SDK packages for Python (the Azure CLI install below is optional).
! pip install azure-core azure-identity azure-ai-ml rich
#! curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

### Create AzureML Workspace connections

In [None]:
from azure.ai.ml import MLClient
from azure.identity import (
    DefaultAzureCredential,
    InteractiveBrowserCredential,
)

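try:
    credential = DefaultAzureCredential()
    # Check that the default credential can actually obtain a token.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to an interactive browser login.
    credential = InteractiveBrowserCredential()

try:
    # Uses a local config.json if one is available.
    workspace_ml_client = MLClient.from_config(credential=credential)
except Exception:
    # Otherwise, fill in the workspace details explicitly.
    workspace_ml_client = MLClient(
        credential,
        subscription_id="<SUBSCRIPTION_ID>",
        resource_group_name="<RESOURCE_GROUP>",
        workspace_name="<WORKSPACE_NAME>",
    )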

# The models, fine-tuning pipelines and environments are available in various AzureML system registries.
registry_ml_client = MLClient(credential, registry_name="azureml")
experiment_name = "grpo_chat_completion_qwen_2_5_7b_instruct"
# Get the AzureML workspace object through the public `workspaces` operations.
workspace = workspace_ml_client.workspaces.get(workspace_ml_client.workspace_name)
workspace.id

### 2. Pick a model to fine tune

`Qwen2.5` is the latest series of Qwen large language models. The Qwen2.5 release includes a number of base and instruction-tuned language models ranging from 0.5 to 72 billion parameters. If you have opened this notebook for a different model, replace the model name and version accordingly.

The fine-tuning pipeline used here supports open-source models available on Hugging Face as well as open-source models from the Azure model catalog.
Note the model ID property of the model; it is passed as input to the fine-tuning job. It is also available as the `Asset ID` field on the model details page in the Azure AI Studio Model Catalog.

To import a model from the AzureML catalog, refer to the model import section of [chat_completion_with_model_as_platform.ipynb](..\finetuning\standalone\model-as-a-platform\chat-completion\chat_completion_with_model_as_platform.ipynb).

In [None]:
# Valid Hugging Face ID of the open-source model to be used for fine-tuning
model_name = "Qwen/Qwen2.5-7B-Instruct"

The complete dataset can be found in `datasets/med_mcqa`. Here is a sample of the dataset:

## <span style="font-size:0.8em;"> Sample DataSet</span>

📘 When solving a multiple-choice question, we ask the LLM to follow this structure:

**Think out loud**: Explain your reasoning step by step. Wrap this part in `<think>` `</think>` tags.  
**Give your final answer**: Clearly state your chosen option (A, B, C, or D) and explain why it's correct. Wrap this part in `<answer>` `</answer>` tags.<br>
**Final Answer line**: On a new line, write Final Answer: followed by just one letter — A, B, C, or D. 

✅ **Example Question**: 

```text
CSF Rhinorrhea occurs due to damage of:

Options:

A. Roof of orbit 
B. Cribriform plate of ethmoidal bone 
C. Frontal sinus 
D. Sphenoid bone
```
**Ideal reasoning model response:**
```text
<think>

Start by identifying the anatomical structure most commonly associated with CSF (cerebrospinal fluid) leakage. CSF rhinorrhea typically results from a breach in the skull base, especially the cribriform plate of the ethmoid bone, which is thin and located near the nasal cavity.

</think>

The cribriform plate of the ethmoid bone is the most common site of CSF leakage into the nasal cavity, leading to CSF rhinorrhea. This makes option B the correct answer. 

<answer>B</answer>

Final Answer: B
```
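
The structure above lends itself to simple rule-based scoring. The following is a minimal sketch of how such checks could look, not the reward functions the pipeline actually uses. The helper names (`extract_final_answer`, `format_reward`, `accuracy_reward`) and the 0/1 scoring scheme are assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; the pipeline's real reward functions may differ.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
FINAL_RE = re.compile(r"^Final Answer:\s*([ABCD])\s*$", re.MULTILINE)


def extract_final_answer(response: str) -> Optional[str]:
    """Return the letter from the last 'Final Answer:' line, or None if there is none."""
    matches = FINAL_RE.findall(response)
    return matches[-1] if matches else None


def format_reward(response: str) -> float:
    """1.0 if the response has a <think> block and a valid Final Answer line, else 0.0."""
    has_think = THINK_RE.search(response) is not None
    has_final = extract_final_answer(response) is not None
    return 1.0 if has_think and has_final else 0.0


def accuracy_reward(response: str, correct_option: str) -> float:
    """1.0 if the extracted final answer matches the reference option, else 0.0."""
    return 1.0 if extract_final_answer(response) == correct_option else 0.0


sample_response = (
    "<think>\nThe cribriform plate is thin and sits next to the nasal cavity.\n</think>\n\n"
    "Option B is the most common site of leakage.\n\n"
    "<answer>B</answer>\n\n"
    "Final Answer: B"
)
print(format_reward(sample_response), accuracy_reward(sample_response, "B"))
```

Keeping format and accuracy as separate scores makes it possible to reward well-structured responses even when the chosen option is wrong.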


---------------------------------------------------------------------------------------


# <span style="font-size:0.8em;">🧩 Section 2: How to train a Reasoning Model on AML Using GRPO Trainer</span>

<div style="display: flex; align-items: flex-start; gap: 32px;">
  <div style="flex: 1;">
    <p>The reasoning model training process typically includes three key components:</p>
    <ul>
      <li><strong>Sampler</strong> – Generates multiple candidate responses from the model</li>
      <li><strong>Reward Function</strong> – Evaluates and scores each response based on criteria like accuracy or structure</li>
      <li><strong>Trainer</strong> – Updates the model to reinforce high-quality outputs</li>
    </ul>
    <p>
      In this example, we use the <strong>GRPO Trainer</strong> to train the Qwen2.5-7B-Instruct model into a reasoning model, using the GRPO implementation from the TRL library.
    </p>
    <br>
    <p>
      <strong>GRPO</strong> (<strong>G</strong>roup <strong>R</strong>elative <strong>P</strong>olicy <strong>O</strong>ptimization) is a reinforcement learning technique that:
    </p>
    <ul>
      <li><em>Compares</em> multiple answers within a group</li>
      <li><em>Rewards</em> the best-performing outputs</li>
      <li><em>Penalizes</em> poor ones</li>
      <li>Applies careful updates to <em>avoid sudden changes</em></li>
    </ul>
  </div>
  <div style="flex: 1; display: flex; justify-content: center;">
    <img src="images/training_loop.png" alt="Training Loop" style="max-width:100%; width: 600px;"/>
  </div>
</div> 
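
To make the "compare within a group" and "careful updates" ideas concrete, here is a minimal sketch in plain Python. It assumes rewards are normalized by the group mean and (population) standard deviation and that updates use a PPO-style clipped ratio; the function names are hypothetical, and real implementations such as the one in TRL operate on token-level tensors and may differ in details like the standard-deviation estimator.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Score each response relative to the other responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response in the group gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(ratio, advantage, epsilon):
    """PPO-style clipped term: limits how far one update can move the policy."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four sampled responses for one prompt: two correct (reward 1.0), two incorrect (reward 0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses above the group average receive a positive advantage and are reinforced, while those below it receive a negative one. If every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.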


## <span style="font-size:0.8em;">Why is training reasoning models easier on Azure ML?</span>  


- **AzureML natively supports reasoning model training**, with seamless integration of vLLM and scalable training workflows.

- **DeepSpeed scales effortlessly on AML**, enabling multi-node training by sharding model states across GPUs.

- **Robust tracking, metrics, and debugging tools** make experimentation on AML smooth and production-ready.
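
To illustrate what "sharding model states across GPUs" means in configuration terms, the sketch below builds a small DeepSpeed ZeRO stage 3 configuration as a Python dictionary. It is an illustrative example only and is not the contents of the `./config/zero3.json` file used later in this notebook. The `"auto"` values assume the Hugging Face Trainer integration, which fills them in from the training arguments.

```python
import json

# Illustrative ZeRO stage 3 settings; NOT the contents of ./config/zero3.json.
zero3_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        # Stage 3 partitions optimizer states, gradients and parameters across GPUs.
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Gather the full 16-bit weights when saving so the checkpoint is usable as-is.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    # "auto" assumes the Hugging Face Trainer integration resolves these values.
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

zero3_json = json.dumps(zero3_config, indent=2)
print(zero3_json)
```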

#### 2.2 Create compute

To fine-tune a model on Azure Machine Learning, you first need to create a compute resource. **Creating a compute will take 3-4 minutes.** 

For additional references, see [Azure Machine Learning in a Day](https://github.com/Azure/azureml-examples/blob/main/tutorials/azureml-in-a-day/azureml-in-a-day.ipynb). 

In [None]:
from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

compute_cluster_name = "compute-cluster-grpo-chat-completion"

try:
    compute_cluster = workspace_ml_client.compute.get(compute_cluster_name)
    print("Found existing compute target.")
except ResourceNotFoundError:
    print("Creating a new compute target...")
    compute_cluster = AmlCompute(
        name=compute_cluster_name,
        type="amlcompute",
        size="Standard_ND96amsr_A100_v4",
        idle_time_before_scale_down=120,
        min_instances=0,
        max_instances=4,
    )
    workspace_ml_client.begin_create_or_update(compute_cluster).result()

In [None]:
# Default training parameters
training_parameters = dict(
    dataset_prompt_column="problem",
    epsilon="0.5",
    eval_strategy="no",
    gradient_accumulation_steps="4",
    learning_rate="1e-06",
    max_steps="10",
    num_generations="4",
    num_iterations="1",
    num_nodes_finetune="1",
    num_train_epochs=3,
    number_of_gpu_to_use_finetuning="8",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
)

# Construct the fine-tuning parameters from the training parameters.
finetune_parameters = {**training_parameters}
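
The batch-related parameters interact with `num_generations`, so it can help to work out the arithmetic before submitting a job. The sketch below assumes that each sampled completion counts as one item in the training batch, which is how the TRL GRPO trainer is commonly described. The exact divisibility rule that TRL enforces depends on the library version, so treat the check as a sanity test rather than the trainer's own validation. The `batch_summary` helper and `example_parameters` are hypothetical names that mirror the defaults above.

```python
def batch_summary(params: dict) -> dict:
    """Derive batch sizes from the fine-tuning parameters (values may be strings)."""
    per_device = int(params.get("per_device_train_batch_size", 1))
    grad_accum = int(params.get("gradient_accumulation_steps", 1))
    gpus = int(params.get("number_of_gpu_to_use_finetuning", 1))
    nodes = int(params.get("num_nodes_finetune", 1))
    generations = int(params.get("num_generations", 1))

    completions_per_forward = per_device * gpus * nodes
    completions_per_step = completions_per_forward * grad_accum
    return {
        "completions_per_forward_pass": completions_per_forward,
        "completions_per_optimizer_step": completions_per_step,
        # Assumption: each prompt contributes `num_generations` completions.
        "unique_prompts_per_optimizer_step": completions_per_step // generations,
        "divisible_by_num_generations": completions_per_step % generations == 0,
    }


# Same values as the defaults in this notebook.
example_parameters = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": "4",
    "number_of_gpu_to_use_finetuning": "8",
    "num_nodes_finetune": "1",
    "num_generations": "4",
}
summary = batch_summary(example_parameters)
print(summary)
```

With the defaults, 2 completions per GPU on 8 GPUs and 4 accumulation steps give 64 completions per optimizer step, which corresponds to 16 distinct prompts at 4 generations each.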

In [None]:
# Build a pipeline display name that distinguishes runs by their key settings
def get_pipeline_display_name():
    # Effective batch size = per-device batch * accumulation steps * GPUs per node * nodes
    batch_size = (
        int(finetune_parameters.get("per_device_train_batch_size", 1))
        * int(finetune_parameters.get("gradient_accumulation_steps", 1))
        * int(finetune_parameters.get("number_of_gpu_to_use_finetuning", 1))
        * int(finetune_parameters.get("num_nodes_finetune", 1))
    )
    # -1 marks runs where max_prompt_length was not set explicitly
    max_prompt_length = finetune_parameters.get("max_prompt_length", -1)
    return f"{model_name}-grpo-bs{batch_size}-{max_prompt_length}"


pipeline_display_name = get_pipeline_display_name()
print(f"Display name used for the run: {pipeline_display_name}")

In [None]:
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import Input

# fetch the pipeline component
pipeline_component_func = registry_ml_client.components.get(
    name="grpo_chat_completion_pipeline", label="latest"
)


# define the pipeline job
@pipeline(name=pipeline_display_name)
def create_pipeline():
    grpo_chat_completion_pipeline = pipeline_component_func(
        huggingface_id=model_name,
        compute_model_import=compute_cluster_name,
        compute_finetune=compute_cluster_name,
        # map the dataset splits to parameters
        dataset_train_split=Input(
            type="uri_file", path="./datasets/med_mcqa/train.jsonl"
        ),
        dataset_validation_split=Input(
            type="uri_file", path="./datasets/med_mcqa/validation.jsonl"
        ),
        deepspeed_config=Input(type="uri_file", path="./config/zero3.json"),
        # Training settings
        **finetune_parameters,
    )
    return {
        # map the output of the fine tuning job to the output of pipeline job so that we can easily register the fine tuned model
        # registering the model is required to deploy the model to an online or batch endpoint
        "trained_model": grpo_chat_completion_pipeline.outputs.mlflow_model_folder
    }


pipeline_object = create_pipeline()

# don't use cached results from previous jobs
pipeline_object.settings.force_rerun = True

# set continue on step failure to False
pipeline_object.settings.continue_on_step_failure = False

In [None]:
# submit the pipeline job
created_job = workspace_ml_client.jobs.create_or_update(
    pipeline_object, experiment_name=experiment_name
)

In [None]:
# wait for the pipeline job to complete
import time

while True:
    status = workspace_ml_client.jobs.get(created_job.name).status
    print(f"Current job status: {status}")
    if status in ["Failed", "Completed", "Canceled"]:
        print(f"Job has finished with status: {status}")
        break
    print("Job is still running. Checking again in 30 seconds.")
    time.sleep(30)
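
The polling loop above runs until the job reaches a terminal state. For unattended runs, a variant with a timeout can be useful. The following is a minimal sketch under that assumption: `wait_for_terminal_status` is a hypothetical helper, and the six-hour default timeout is an arbitrary illustrative value. The status source is passed in as a callable so the helper can be exercised without a live workspace.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"Failed", "Completed", "Canceled"}


def wait_for_terminal_status(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 6 * 3600,  # illustrative default, adjust to your job
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns a terminal status or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still in status {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Example with a fake status source standing in for the AzureML job lookup.
fake_statuses = iter(["Running", "Running", "Completed"])
final_status = wait_for_terminal_status(
    lambda: next(fake_statuses), poll_seconds=0, sleep=lambda seconds: None
)
print(final_status)
```

In this notebook the callable would be `lambda: workspace_ml_client.jobs.get(created_job.name).status`. To follow the job logs instead of polling, `workspace_ml_client.jobs.stream(created_job.name)` blocks until the job finishes.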

## Register and deploy the fine tuned model

The output of the training is a set of files representing the weights of the trained model. To use it for inferencing, we register the files as a model and then create an endpoint and a deployment for it. An endpoint handles the security, URL and traffic-splitting aspects of inferencing, whereas a deployment actually hosts and runs the registered model.

You can find the assets registered in this section in the AzureML portal ([ml.azure.com](https://ml.azure.com)). Navigate to your resource group and workspace, then click the models or endpoints tab in the left panel. Deployments are sub-entities of endpoints and can be found on the detail page of a particular endpoint.

For detailed implementation of model registration and deployment, refer to the same section in [launch_grpo_command_job-med-mcqa-commented.ipynb](./launch_grpo_command_job-med-mcqa-commented.ipynb).
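
As a rough orientation, registration could be sketched as follows. This is not the implementation from the referenced notebook: the helper names and the `registered_name` argument are hypothetical, and the sketch assumes the pipeline output is named `trained_model` (as in the pipeline definition above) and contains an MLflow model.

```python
def job_output_uri(job_name: str, output_name: str = "trained_model") -> str:
    """Build the AzureML URI that points at a named output of a finished job."""
    return f"azureml://jobs/{job_name}/outputs/{output_name}"


def register_trained_model(ml_client, job_name: str, registered_name: str):
    """Register the pipeline output as an MLflow model in the workspace (sketch)."""
    # Imported lazily so job_output_uri can be used without the SDK installed.
    from azure.ai.ml.constants import AssetTypes
    from azure.ai.ml.entities import Model

    model = Model(
        path=job_output_uri(job_name),
        type=AssetTypes.MLFLOW_MODEL,
        name=registered_name,  # hypothetical name chosen by the caller
        description="GRPO fine-tuned model registered from a pipeline job output",
    )
    return ml_client.models.create_or_update(model)


print(job_output_uri("my_pipeline_job"))
```

In this notebook the call would look like `register_trained_model(workspace_ml_client, created_job.name, "<MODEL_NAME>")`, where `<MODEL_NAME>` is a placeholder for a name of your choice.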