<div style="background: linear-gradient(135deg, #0078d4 0%, #106ebe 50%, #005a9e 100%); color: white; padding: 30px; border-radius: 12px; margin: 20px 0; box-shadow: 0 4px 15px rgba(0, 120, 212, 0.3);">
    <h1 style="margin: 0; text-align: center; font-size: 2.2em; font-weight: 600; letter-spacing: 0.5px; font-family: 'Segoe UI', -apple-system, BlinkMacSystemFont, Roboto, 'Helvetica Neue', sans-serif;">
        Ignite Demo to Train, Customize, Optimize and Host Reasoning Models in AzureML
    </h1>
</div>


<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;"> Sections Breakdown </h3>
</div>

<ol style="color: #2c3e50; line-height: 1.8;">
<li>🔧 <b>Setup Workspace:</b> Configure Azure ML workspace and authenticate</li>
<li>🧠 <b>RFT Training (GRPO):</b> Fine-tune reasoning model using Group Relative Policy Optimization</li>
<li>⚡ <b>RFT Training (Reinforce++):</b> Fine-tune using critic-free reinforcement learning</li>
<li>📦 <b>Create Data Assets:</b> Convert pipeline outputs to reusable data assets</li>
<li>📊 <b>Model Performance Comparison:</b> Evaluate and compare base model vs GRPO vs Reinforce++</li>
<li>🎯 <b>Create Draft Model:</b> Train EAGLE3 draft model for speculative decoding</li>
<li>🔗 <b>Combine Draft and Base Model:</b> Package base and draft models for deployment</li>
<li>🚀 <b>Deploy Speculative Endpoint:</b> Deploy managed online endpoint with speculative decoding</li>
<li>📡 <b>Deploy Base Endpoint:</b> Deploy baseline endpoint for performance comparison</li>
<li>🧪 <b>Test Base and Speculative Decoding Endpoints:</b> Validate both endpoints with inference requests</li>
<li>📈 <b>Endpoints Performance Evaluation:</b> Compare metrics between base and speculative decoding endpoints</li>
</ol>

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">Prerequisites & Requirements</h3>
</div>



##### Compute Requirements
* **Training:** Standard_ND96isr_H100_v5, Standard_ND96amsr_A100_v4
* **Deployment:** Kubernetes cluster with GPU instances (octagpu)
##### Dataset & Models
* **Dataset:** [FinQA](https://finqasite.github.io/) - 2.8k financial reports with 8k Q&A pairs
* **Models:** [Llama-3.1-8B-Instruct-FP8](https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8), [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)


<div style="background: #e7f3ff; border: 1px solid #b3d9ff; padding: 15px; border-radius: 5px; margin: 20px 0;">
    <p style="margin: 0; color: #0066cc;">
        <strong>💡 Note:</strong> Ensure your Azure ML workspace has access to the required compute resources and GPU instances before proceeding with the training and deployment steps.
    </p>
</div>

<div style="background: linear-gradient(135deg, #0078d4 0%, #106ebe 50%, #005a9e 100%); color: white; padding: 30px; border-radius: 12px; margin: 20px 0; box-shadow: 0 4px 15px rgba(0, 120, 212, 0.3);">
    <h1 style="margin: 0; text-align: center; font-size: 2.2em; font-weight: 600; letter-spacing: 0.5px; font-family: 'Segoe UI', -apple-system, BlinkMacSystemFont, Roboto, 'Helvetica Neue', sans-serif;">
        RFT Finetuning - GRPO & Reinforce Plus Plus
    </h1>
</div>


<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">⚙️ Section 1. Setup Workspace and Register Components</h3>
</div>

<p>This section establishes connectivity to your workspace and sets up the required authentication.</p>

In [None]:
%pip install -r requirements.txt

In [None]:
import matplotlib.pyplot as plt
from scripts.utils import setup_workspace
from scripts.dataset import prepare_finqa_dataset
from scripts.run import get_run_metrics
from scripts.reinforcement_learning import run_rl_training_pipeline
from scripts.evaluation import run_evaluation_pipeline
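from scripts.speculative_decoding import (
    run_draft_model_pipeline,
    prepare_combined_model_for_deployment,
    deploy_speculative_decoding_endpoint,
    run_evaluation_speculative_decoding,  # Called in Section 10; assumed to be exported by this module.
)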
from scripts.deployment import create_managed_deployment, test_deployment

In [None]:
# Setup Azure ML workspace and registry connections
ml_client, registry_ml_client = setup_workspace(
    config_path="./config.json", registry_name="Ignite_2025_Demo"
)

<p>Prepare the dataset for fine-tuning. This step saves the train, test, and validation splits under the <code>data</code> folder.</p>

In [None]:
train_data_path, test_data_path, valid_data_path = prepare_finqa_dataset(
    ml_client, data_dir="data", register_datasets=False
)  # Prepare the FinQA dataset for training and evaluation


##### 📖 The components and pipelines used in this notebook can be installed locally by following the instructions in [Ignite Components and Pipelines](Ignite_Components_And_Pipelines/README.md).



## <span style="font-size:0.8em;"> </span>

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">🧩 Section 2. Run RFT Training Pipeline (GRPO)</h3>
</div>

<p>GRPO (Group Relative Policy Optimization) is a reinforcement learning technique for fine-tuning LLMs that scores each response relative to the other responses sampled for the same prompt, instead of relying on absolute reward values or a separate critic model.</p>
<ul><li>For every prompt, the policy generates a group of responses, a reward function or reward model scores them, and each response's advantage is its reward normalized against the group's mean and standard deviation.</li>
<li>Common use cases include instruction following improvement, mathematical reasoning enhancement, code generation optimization, and general conversational AI alignment.</li>
<li>In this notebook, we use GRPO to fine-tune an LLM on financial reasoning tasks, improving the model's ability to solve complex financial questions with step-by-step reasoning.</li>
</ul>
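
The group-relative scoring described above can be sketched in a few lines. This is a minimal illustration, not the training pipeline's implementation: the function name `group_relative_advantages`, the `eps` constant, and the example rewards are all assumptions made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one FinQA question
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get positive advantages, incorrect ones negative
```

Because the advantages are centered on the group mean, a group in which every answer receives the same reward contributes no learning signal.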

<p>
The RFT run outputs multiple model checkpoints based on the value of <b>trainer_save_freq</b>, which is defined in the config.
</p>
<p>
<i>For example, if this value is 20, a model checkpoint is stored at every 20th optimization step of the trainer.
Each model checkpoint is a fully deployable copy of the model's weights as fine-tuned up to that point.</i>
</p>
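
As a rough illustration of the save frequency, the snippet below lists the optimization steps at which checkpoints would be written. The helper `checkpoint_steps` and the value `total_steps=100` are hypothetical; the sketch does not cover whether the trainer also writes an extra checkpoint at the final step.

```python
def checkpoint_steps(total_steps, save_freq):
    """Return the optimization steps at which a checkpoint would be written."""
    if save_freq <= 0:
        raise ValueError("save_freq must be positive")
    return list(range(save_freq, total_steps + 1, save_freq))


print(checkpoint_steps(total_steps=100, save_freq=20))  # [20, 40, 60, 80, 100]
```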

In [None]:
# Run complete RL training pipeline: train model, register model
grpo_job, status, grpo_registered_model = run_rl_training_pipeline(
    ml_client=ml_client,
    registry_ml_client=registry_ml_client,
    base_model_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # Hugging Face ID of the model to be RFT fine-tuned.
    compute_cluster="k8s-a100-compute",  # Name of the Kubernetes Cluster in Workspace
    rl_method="grpo",  # RL methodology to be selected for training run.
    train_data_path=train_data_path,  # Path to training dataset
    valid_data_path=valid_data_path,  # Path to validation dataset
    config={
        "num_nodes_finetune": 1,  # Training specific arguments which can be overridden by user.
        "trainer_total_epochs": 1,
        "trainer_save_freq": 20,
    },
)

## <span style="font-size:0.8em;"> </span>

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">🧩 Section 3. Run RFT Training Pipeline (Reinforce++)</h3>
</div>

<p>Reinforce++ is a critic-free reinforcement learning framework that addresses key limitations of traditional RLHF algorithms like PPO by introducing Global Advantage Normalization instead of prompt-level normalization.</p>
<ul><li>This method eliminates the computational and memory overhead of critic networks while providing more stable and theoretically sound advantage estimation by normalizing across entire global batches rather than small prompt-specific groups.</li>
<li>Reinforce++ offers significant advantages including removal of critic network overhead, theoretically unbiased estimation (bias vanishes as batch size increases), superior stability compared to local normalization methods like GRPO/RLOO, and better resistance to overfitting in RLHF scenarios.</li>
<li>In this notebook, we use Reinforce++ to fine-tune an LLM on financial reasoning tasks, leveraging its global advantage normalization to achieve more stable policy updates and superior performance in complex agentic reasoning scenarios.</li>
</ul>
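
The difference between prompt-level and global normalization can be illustrated with a small sketch. It is deliberately simplified: it normalizes scalar rewards only and leaves out other parts of the method, such as KL shaping. The function names, the `eps` constant, and the reward values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def per_prompt_advantages(rewards_per_prompt, eps=1e-6):
    """Prompt-level normalization: each group uses its own mean and std."""
    result = []
    for group in rewards_per_prompt:
        mu, sigma = mean(group), pstdev(group)
        result.append([(r - mu) / (sigma + eps) for r in group])
    return result


def global_advantages(rewards_per_prompt, eps=1e-6):
    """Global normalization: one mean and std computed over the whole batch."""
    flat = [r for group in rewards_per_prompt for r in group]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in group] for group in rewards_per_prompt]


# Hypothetical batch: an easy prompt (mostly correct) and a hard prompt (all wrong).
batch = [[1.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(per_prompt_advantages(batch)[1])  # all zeros: the hard prompt gives no signal
print(global_advantages(batch)[1])      # all negative: the failures are still penalized
```

With prompt-level statistics, a group whose samples all share the same reward is normalized to zero, whereas batch-level statistics keep a non-zero signal for it.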

<p>
The RFT run outputs multiple model checkpoints based on the value of <b>trainer_save_freq</b>, which is defined in the config.
</p>

In [None]:
# Run complete RL training pipeline: verify datasets, register data, train model, register model
rlpp_job, status, rlpp_registered_model = run_rl_training_pipeline(
    ml_client=ml_client,
    registry_ml_client=registry_ml_client,
    base_model_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # Hugging Face ID of the model to be RFT fine-tuned.
    compute_cluster="k8s-a100-compute",  # Name of the Kubernetes Cluster in workspace.
    rl_method="reinforce_plus_plus",  # RL methodology to be selected for training run.
    train_data_path=train_data_path,  # Path to training dataset
    valid_data_path=valid_data_path,  # Path to validation dataset
    config={
        "num_nodes_finetune": 1,
        "trainer_total_epochs": 1,  # Training specific arguments which can be overridden by user.
        "trainer_save_freq": 20,
    },
)

## <span style="font-size:0.8em;"> </span>

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">📊 Section 4. Compare Model Performance across Base Model vs GRPO vs Reinforce++</h3>
</div>


<p>This section evaluates the base model and the fine-tuned GRPO and Reinforce++ models and compares their accuracy.</p>



<p><strong>Evaluation Process:</strong></p>
<ul>
<li>Tests multiple checkpoints from each training method</li>
<li>Evaluates on the held-out FinQA test split for financial reasoning accuracy</li>
<li>Provides comprehensive metrics to determine the best performing model</li>
</ul>

<p><em>üí° The evaluation will help identify which RL method produces the most effective model for financial reasoning tasks.</em></p>

<p>We will now submit the evaluation job with the GRPO and Reinforce++ model outputs.</p>
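
The evaluation config pairs an `explore_pattern_*` template containing a `{checkpoint}` placeholder with a `checkpoint_values_*` string. The sketch below shows one plausible way such a pair could be expanded into checkpoint directories. The helper `expand_checkpoint_dirs` is hypothetical, and treating the values as a comma-separated list is an assumption, not a documented behavior of the pipeline.

```python
def expand_checkpoint_dirs(explore_pattern, checkpoint_values):
    """Fill the {checkpoint} placeholder once per requested checkpoint step."""
    # Assumption: several checkpoints would be passed as "12,24,36".
    steps = [v.strip() for v in checkpoint_values.split(",") if v.strip()]
    return [explore_pattern.format(checkpoint=step) for step in steps]


print(expand_checkpoint_dirs("global_step_{checkpoint}/actor/lora_adapter/", "12"))
# ['global_step_12/actor/lora_adapter/']
```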

In [None]:
eval_job, status = (
    run_evaluation_pipeline(  # Function which invokes the model evaluation pipeline.
        ml_client=ml_client,
        registry_ml_client=registry_ml_client,
        compute_cluster="k8s-a100-compute",
        grpo_model_dir=grpo_registered_model.path,  # Output from GRPO RL provided as data asset created from earlier step.
        rlpp_model_dir=rlpp_registered_model.path,  # Output from Reinforce_plus_plus RL provided as data asset created from earlier step.
        validation_dataset_path=test_data_path,  # Path to test dataset
        run_config={
            "num_nodes": 1,  # Number of nodes to be used for evaluation run.
            "number_of_gpu_to_use": 8,  # Number of GPUs in a node to be used for evaluation run.
            "base_path_1_label": "GRPO",  # Label to identify GRPO model outputs.
            "base_path_2_label": "RLPP",  # Label to identify RLPP model outputs.
            "explore_pattern_1": "global_step_{checkpoint}/actor/lora_adapter/",
            "explore_pattern_2": "global_step_{checkpoint}/actor/lora_adapter/",
            "checkpoint_values_1": "12",
            "checkpoint_values_2": "12",
            "use_lora_adapters_1": True,
            "use_lora_adapters_2": True,
            "evaluate_base_model": True,  # Set to True to evaluate base model along with RL finetuned models.
            "hf_model_id": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # Huggingface ID of the base model
            "max_prompt_length": 8196,
            "max_response_length": 1024,
            "dtype": "bfloat16",
            "tensor_parallel_size": 4,
        },  # Configuration parameters for evaluation run.
    )
)

<p>Now, let's fetch the metrics from the evaluation run in order to compare the models.</p>

In [None]:
eval_metrics = get_run_metrics(eval_job)

In [None]:
BASE_metrics = {k: v for k, v in eval_metrics.items() if "base_model" in k}
GRPO_metrics = {k: v for k, v in eval_metrics.items() if "GRPO" in k}
RLPP_metrics = {k: v for k, v in eval_metrics.items() if "RLPP" in k}

In [None]:
# Reduce each group to one headline accuracy. `default=0` covers both an empty
# metrics dict and a dict that has no matching "min"/"max" key.
min_base_accuracy = min(
    (v for k, v in BASE_metrics.items() if "min" in k), default=0
)
max_grpo_accuracy = max(
    (v for k, v in GRPO_metrics.items() if "max" in k), default=0
)
max_rlpp_accuracy = max(
    (v for k, v in RLPP_metrics.items() if "max" in k), default=0
)

<p>GRPO vs Reinforce++ vs Base Model Performance Comparison</p>

In [None]:
categories = ["Baseline Model", "GRPO Model", "RL++ Model"]
values = [min_base_accuracy, max_grpo_accuracy, max_rlpp_accuracy]

plt.bar(categories, values, color=["blue", "orange", "green"])

# Add labels and title
plt.xlabel("Model Type", fontsize=12, labelpad=10, color="#BC1B1B")
plt.ylabel("Accuracy", fontsize=12, labelpad=10, color="#BC1B1B")
plt.title(
    "Graph Comparing Baseline, GRPO, and RL++ Model Accuracies", pad=10, color="#BC1B1B"
)

# Show plot
plt.show()

<p>The evaluation results demonstrate that both GRPO and Reinforce++ fine-tuning methods significantly improve financial reasoning performance compared to the base model. 
These accuracy metrics help identify the optimal checkpoint for deployment in the speculative decoding pipeline.</p>


<div style="background: linear-gradient(135deg, #0078d4 0%, #106ebe 50%, #005a9e 100%); color: white; padding: 30px; border-radius: 12px; margin: 20px 0; box-shadow: 0 4px 15px rgba(0, 120, 212, 0.3);">
    <h1 style="margin: 0; text-align: center; font-size: 2.2em; font-weight: 600; letter-spacing: 0.5px; font-family: 'Segoe UI', -apple-system, BlinkMacSystemFont, Roboto, 'Helvetica Neue', sans-serif;">
        Speculative Decoding
    </h1>
</div>

#### The following sections cover creating the draft model, combining the base and draft models, deploying the speculative decoding endpoint, and benchmarking the endpoints.

## <span style="font-size:0.8em;"> </span>

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">🧩 Section 5. Create Draft Model for Speculative Decoding</h3>
</div>

<p>EAGLE3 (EAGLE stands for Extrapolation Algorithm for Greater Language-model Efficiency) is the third generation of the EAGLE family of speculative decoding methods and provides significant performance improvements:</p>

<ul>
<li><strong>Direct Token Prediction with Multi-layer Fusion:</strong> Abandons feature prediction for direct token prediction using advanced multi-layer feature fusion, enabling more accurate speculation and full benefit from scaled training data</li>
<li><strong>Superior Performance:</strong> Achieves speedup ratios up to 6.5x (1.4x improvement over EAGLE-2) while maintaining identical output quality through advanced speculative decoding techniques</li>
</ul>

<p>This pipeline creates a specialized draft model that works alongside the base model to enable dramatically improved inference performance for reasoning tasks. The EAGLE3 approach is particularly effective for complex financial reasoning scenarios where maintaining accuracy while achieving significant speed improvements is crucial.</p>

<p><strong>Reference:</strong> <a href="https://arxiv.org/abs/2503.01840">https://arxiv.org/abs/2503.01840</a></p>
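
To make the draft-and-verify idea concrete, here is a toy sketch of generic greedy speculative decoding. It is not EAGLE3 itself, which drafts from fused target-model features and verifies trees of candidates. The "models" are simple hypothetical functions over integer tokens, and all names are made up for illustration.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy draft-and-verify loop; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target model checks every proposed position. A real engine
        #    does this in one batched forward pass, counted once here.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(tok)
        else:
            # Every proposal matched, so the same pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def target_next(seq):
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy draft model: agrees with the target except after token 6."""
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


spec_tokens, passes = speculative_decode(target_next, draft_next, [0], 12, k=4)

# Baseline: plain greedy decoding needs one target pass per generated token.
greedy_tokens = [0]
for _ in range(12):
    greedy_tokens.append(target_next(greedy_tokens))

print(spec_tokens == greedy_tokens, passes)  # same tokens, fewer target passes
```

Since every accepted token is checked against the target model, the output matches plain greedy decoding. The speedup comes from needing fewer target passes when the draft model guesses well.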


In [None]:
# Train EAGLE3 draft model for speculative decoding
draft_job, draft_status = run_draft_model_pipeline(
    ml_client=ml_client,
    registry_ml_client=registry_ml_client,
    compute_cluster="k8s-a100-compute",  # Name of the Kubernetes Cluster in Workspace.
    num_epochs=1,  # Number of train epochs to be run by draft trainer.
    monitor=False,  # Set to True to wait for completion.
    base_model_mlflow_path="azureml://registries/azureml-meta/models/Meta-Llama-3-8B-Instruct/versions/9",
    draft_train_data_path="./data_for_draft_model/train/sharegpt_train_small.jsonl",
)

## <span style="font-size:0.8em;"> </span>

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">🔄 Section 6. Prepare Combined Model for Deployment</h3>
</div>



<p>To create a <strong>speculative decoding endpoint</strong>, we need <strong>two models</strong> working in tandem:</p>

<ul>
    <li><strong>Base Model:</strong> The primary model (e.g., Llama-3.1-8B-Instruct-FP8) that generates high-quality outputs</li>
    <li><strong>Draft Model:</strong> The EAGLE3 model that quickly generates candidate tokens for speculation</li>
</ul>

<p><strong>Why Combine Into Single AML Model?</strong></p>

<p>We'll package both models into a <strong>single Azure ML model</strong> to:</p>
<ul>
    <li>Simplify deployment to Azure ML online endpoints</li>
    <li>Ensure both models are versioned and managed together</li>
    <li>Streamline the endpoint creation process</li>
    <li>Enable seamless speculative decoding inference</li>
</ul>


In [None]:
# Download draft model, download base model, combine and register for deployment
combined_model = prepare_combined_model_for_deployment(
    ml_client=ml_client,
    registry_ml_client=registry_ml_client,
    draft_job_name=draft_job.name,  # Previous Draft Trainer job name for downloading draft model.
    base_model_hf_id="nvidia/Llama-3.1-8B-Instruct-FP8",  # Huggingface ID of the base model paired along with draft model.
    model_name="speculative-decode-model",  # User provided model name for combined model.
)

## <span style="font-size:0.8em;"> </span>

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">🚀 Section 7. Deploy Speculative Decoding Endpoint</h3>
</div>



<p>This section creates and deploys a <strong>managed online endpoint</strong> that leverages the combined model for speculative decoding inference.</p>
<strong>What happens during deployment:</strong>
<ul>
    <li><strong>Endpoint Creation:</strong> Sets up a managed online endpoint in Azure ML.</li>
    <li><strong>Model Loading:</strong> Loads both the base model and the EAGLE3 draft model onto GPU instances and sets them up for inference.</li>
</ul>
<p>The deployment process typically takes 15-20 minutes depending on instance availability.</p>


In [None]:
# Deploy managed online endpoint with speculative decoding
endpoint_name = deploy_speculative_decoding_endpoint(
    ml_client=ml_client,  # ML Client which specifies the workspace where endpoint gets deployed.
    combined_model=combined_model,  # Reference from previous steps where combined model is created.
    instance_type="octagepu",  # Instance type defined on the Kubernetes cluster.
    compute_name="k8s-a100-compute",
)

## <span style="font-size:0.8em;"> </span>

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">🚀 Section 8. Deploy Base Model Endpoint for Comparison</h3>
</div>



<p>This section creates and deploys a <strong>managed online endpoint</strong> with just the base model for performance comparison against the speculative decoding endpoint.</p>

<strong>What happens during deployment:</strong>
<ul>
    <li><strong>Endpoint Creation:</strong> Sets up a standard managed online endpoint in Azure ML.</li>
    <li><strong>Base Model Loading:</strong> Loads only the base model onto GPU instances for standard inference.</li>
    <li><strong>Performance Baseline:</strong> Provides a baseline to measure the speedup achieved by speculative decoding.</li>
</ul>

<p>This baseline endpoint allows you to compare inference speed between standard generation and speculative decoding approaches.</p>

In [None]:
# Deploy managed online endpoint with base model
base_endpoint_name = create_managed_deployment(  # Function to create endpoint for base model.
    ml_client=ml_client,  # ML Client which specifies the workspace where endpoint gets deployed.
    model_asset_id="meta-llama/Meta-Llama-3-8B-Instruct",  # Huggingface ID of the base model.
    instance_type="Standard_ND96amsr_A100_v4",  # Compute SKU on which base model will be deployed.
)

## <span style="font-size:0.8em;"> </span>

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">🧪 Section 9. Test Deployment</h3>
</div>

<p>This section tests both the speculative decoding endpoint and base model endpoint.</p>

<strong>What happens during testing:</strong>
<ul>
    <li><strong>Endpoint Validation:</strong> Confirms both endpoints are responding correctly to inference requests.</li>
</ul>

<p>The testing process validates that the deployed models can handle requests and respond successfully.</p>

In [None]:
speculative_result = test_deployment(
    ml_client, endpoint_name
)  # Test the speculative decoding endpoint with a financial reasoning question
base_result = test_deployment(
    ml_client, base_endpoint_name
)  # Test the base model endpoint with a financial reasoning question

## <span style="font-size:0.8em;"> </span>

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h3 style="margin: 0; text-align: center;">📊 Section 10. Performance Evaluation Pipeline</h3>
</div>

<p>This section launches a comprehensive evaluation pipeline to compare performance metrics between the base model endpoint and speculative decoding endpoint.</p>


<p><strong>What happens during evaluation:</strong></p>
<ul>
    <li><strong>Performance Comparison:</strong> Analyzes speed improvements achieved by speculative decoding</li>
    <li><strong>Statistical Analysis:</strong> Provides detailed metrics and visualizations of performance gains</li>
</ul>

In [None]:
# Run evaluation job to compare base model and speculative decoding endpoints' performance
evaluation_job = run_evaluation_speculative_decoding(
    ml_client=ml_client,
    base_endpoint_name=base_endpoint_name,  # Base model endpoint from previous step.
    speculative_endpoint_name=endpoint_name,  # Speculative endpoint from previous step.
    base_model="meta-llama/Meta-Llama-3-8B-Instruct",  # HuggingFace repo ID of the model used in base endpoint, used for tokenization.
    speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",  # HuggingFace repo ID of the model used in speculative decoding endpoint, used for tokenization.
)

The following metrics are used to evaluate the performance of the endpoints:
 
- **Input Throughput (Tokens/sec)**: Measures how many input tokens per second the model/server can process.
- **Output Throughput (Tokens/sec)**: Measures how many output tokens per second the model/server can generate.
- **Request Throughput (Requests/sec)**: Measures how many complete requests the model/server can handle per second.
 
It is expected that the **speculative decoding endpoint will outperform the base model endpoint** across all these metrics, demonstrating the efficiency gains achieved through speculative decoding.
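
The three throughput metrics above can be derived from per-request token counts and the wall-clock duration of a benchmark run. The sketch below shows the arithmetic only. The `RequestRecord` type, the function name, the dictionary keys, and the sample numbers are hypothetical and are not taken from the evaluation pipeline.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    input_tokens: int   # prompt tokens sent with the request
    output_tokens: int  # tokens generated in the response


def throughput_metrics(records, wall_clock_seconds):
    """Aggregate token and request throughput over one benchmark run."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return {
        "input_tokens_per_sec": sum(r.input_tokens for r in records) / wall_clock_seconds,
        "output_tokens_per_sec": sum(r.output_tokens for r in records) / wall_clock_seconds,
        "requests_per_sec": len(records) / wall_clock_seconds,
    }


# Hypothetical run: three completed requests in 10 seconds.
records = [RequestRecord(100, 50), RequestRecord(200, 50), RequestRecord(300, 100)]
print(throughput_metrics(records, wall_clock_seconds=10.0))
```

Running the same request set against both endpoints and comparing the resulting dictionaries gives a like-for-like view of the speedup.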

<img src="metrics-base-target-spec-dec.png" alt="Performance Metrics: Base Model vs Speculative Decoding" style="max-width: 100%; height: auto;">