RL Tool Calling with Qwen3-8B


TL;DR: A model-agnostic, reproducible reinforcement learning pipeline for enhancing tool-calling capabilities, inspired by NVIDIA's Nemotron-Tool-N1 and ToolRL. Built with TRL + Unsloth (not VERL), achieving baseline-parity performance under strict budget constraints. Demonstrates successful GRPO/DAPO implementation with easy adaptability to Unsloth-supported models, using prompt/schema adapters where needed (Llama, Mistral, Phi, Gemma, etc.).


📦 Pretrained Checkpoint (Hugging Face)

The fine-tuned 1-epoch checkpoint is available on Hugging Face:


💡 Motivation

Most implementations of LLM tool-calling with RL use VERL (as in the original Nemotron and ToolRL papers). This project explores an alternative approach using TRL + Unsloth to:

  • Reduce implementation complexity - Leverage the existing HuggingFace ecosystem
  • Enable memory-efficient training - Unsloth provides 2x faster training with lower VRAM usage
  • Democratize experimentation - Achievable on a single GPU with a limited budget (<$300)

The base Qwen3-8B model already has strong tool-calling capabilities, making significant improvements challenging without extensive compute. However, this project successfully demonstrates:

  • ✅ End-to-end RL pipeline that works correctly
  • ✅ Model-agnostic design easily adaptable to any HuggingFace model
  • ✅ Reproducible setup on cloud infrastructure

🎯 Project Overview

This project adapts state-of-the-art reinforcement learning techniques to enhance tool calling performance in the Qwen3-8B model. Key aspects:

  • Model: Qwen3-8B (8.2B parameters) with LoRA adapters (43.6M trainable params, 0.53%)
  • RL Algorithm: Group Relative Policy Optimization (GRPO) with DAPO loss configuration
  • Training Framework: Unsloth + TRL for memory-efficient fine-tuning
  • Reward System: Multi-component scoring evaluating format, correctness, and reasoning quality
  • Datasets: 69,653 examples from ToolACE + XLAM function calling datasets
  • Evaluation: BFCL v4 benchmark

🔧 Key Features

  • Template-Aware, Model-Agnostic: Works out-of-the-box for Qwen3; other models may require prompt/schema adapters
  • Memory Efficiency: Unsloth integration enables training on single 80GB GPU
  • DAPO Configuration: β=0 removes the KL divergence term, eliminating the expensive reference model
  • Comprehensive Data Pipeline: Automated preprocessing for ToolACE and XLAM datasets
  • Full Observability: Weights & Biases integration for experiment tracking

💻 Hardware & Training Configuration

This project was trained on Lambda Cloud infrastructure:

Component            Specification                 Cost
GPU                  1x NVIDIA H100 (80GB PCIe)    $2.49/hr
RAM                  200GB DDR5                    -
Storage              1TB NVMe SSD                  -
Total Training Cost  ~100 hours × $2.49/hr         ~$249 for 1 epoch

Hyperparameter Selection & Memory Constraints

All training hyperparameters were specifically tuned to fit within 80GB VRAM without OOM errors:

Parameter                    Value    Rationale
per_device_train_batch_size  8        Maximum batch size for available VRAM
gradient_accumulation_steps  4        Achieves effective batch size of 32
num_generations              1        Memory-constrained generation count (vs. 4-8 in papers)
beta                         0.0      Removes reference model (saves ~40GB VRAM)
LoRA r / alpha               16 / 32  Smaller rank = smaller memory footprint
max_seq_len                  4096     Balances context length and memory

Important Note: With larger infrastructure (e.g., an 8×H100 setup as in the Nemotron paper), you could use:

  • Larger batch sizes (16-32 per device)
  • More generations per step (4-8)
  • Full model fine-tuning instead of LoRA
  • Reference model with β > 0 for better stability

These choices would likely yield better results but were infeasible under budget constraints.


🚀 Quick Start

Prerequisites

# Clone repository
git clone https://github.com/MDadopoulos/RLCompRew.git
cd RLCompRew

Requirements:

  • Python 3.12+
  • 80GB+ GPU recommended (A100/H100)
  • See pyproject.toml for full dependencies

Setup Environment

# Install uv package manager if needed
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh  # Linux/Mac
# For Windows, see official docs

# Create virtual environment and install dependencies
uv venv
uv sync

# Activate environment
#source .venv/bin/activate  # Linux/Mac
.venv\Scripts\activate   # Windows

Data Preparation

# Login to Hugging Face (required for dataset access)
huggingface-cli login

# Download and preprocess raw datasets (ToolACE + XLAM)
uv run python -m src.toolcalling_rl.data.raw_data_preprocess

# Build training dataset with prompts and format
uv run python -m src.toolcalling_rl.data.toolcall_preprocess

Processed data will be saved to data/processed/toolcall_data/ (~69K examples).
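Before launching training, it can help to eyeball the processed examples. A minimal sketch, assuming the split is stored in Hugging Face datasets format (an assumption about this repo's output; adjust the loader if it differs):

from datasets import load_from_disk

# Quick sanity check of the preprocessed split before training.
ds = load_from_disk("data/processed/toolcall_data")
print(ds)        # expect roughly 69,653 rows
print(ds[0])     # one formatted example: prompt, tool schema, and target call
# If load_from_disk returns a DatasetDict, index a split first, e.g. ds["train"][0].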

Training

# Login to Weights & Biases for experiment tracking
wandb login

# Start GRPO training
uv run python -m src.toolcalling_rl.train.rl_train \
    --cfg_path configs/train/grpo_qwen3_8b.yaml \
    --data_dir data/processed/toolcall_data

# For long-running jobs on cloud, use tmux:
# tmux new -s train
# source .venv/bin/activate
# uv run python -m src.toolcalling_rl.train.rl_train ...

Final Training Statistics

Num examples: 69,653
Num epochs: 1
Total steps: 4,353
Batch size per device: 8
Gradient accumulation: 4
Effective batch size: 32
Trainable parameters: 43,646,976 / 8,234,382,336 (0.53%)

Adapting to Other Models (Unsloth-Supported)

This pipeline is designed to work with Unsloth-supported decoder-only models. Compatibility depends on the chat/prompt template and the function-calling schema of each model family. Qwen3 works out-of-the-box; for others, create a separate YAML and adjust the prompt construction and schema as needed.

# 1) Copy the base config and rename
cp configs/train/grpo_qwen3_8b.yaml configs/train/grpo_llama3_8b.yaml

# 2) Edit the new config
vim configs/train/grpo_llama3_8b.yaml

# 3) Set the target model (pick from Unsloth-supported models)
model_name: "unsloth/Llama-3.1-8B-Instruct"
# Other examples:
#   - "unsloth/Mistral-7B-Instruct-v0.3"
#   - "unsloth/Phi-3-medium-4k-instruct"
#   - "unsloth/gemma-2-9b-it"

# 4) Adjust memory-related settings and template/schema for your target model
max_seq_len: 8192          # Reduce if you hit OOM
grpo.num_generations: 1    # Increase if you have more VRAM
grpo.per_device_train_batch_size: 4   # Tune with gradient_accumulation_steps
grpo.gradient_accumulation_steps: 8   # Keep effective batch similar
lora.r: 16                  # Increase for larger models if memory allows

# 5) Run training with the new config
uv run python -m src.toolcalling_rl.train.rl_train \
    --cfg_path configs/train/grpo_llama3_8b.yaml \
    --data_dir data/processed/toolcall_data

Compatibility: Any decoder-only model supported by Unsloth (Llama, Mistral, Phi, Gemma, Qwen, Yi, DeepSeek), with prompt/schema adapters as required.
Recommendation: For clearer RL gains, prefer models with weaker baseline tool-calling (e.g., Llama 3.1-8B, Mistral-7B) and always tune both the YAML (memory) and prompt/schema (format) to your target model.
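To make the template point concrete, here is a minimal sketch of loading a different Unsloth-supported model and rebuilding prompts from its own chat template. The model name and messages are illustrative; the actual prompt construction in this repo may differ.

from unsloth import FastLanguageModel

# Only model_name changes between configs; Unsloth handles the rest of the loading.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct",   # hypothetical target; match your YAML
    max_seq_length=4096,                          # reduce if you hit OOM
)

# Each model family ships its own chat template, so build prompts through the
# tokenizer instead of hard-coding Qwen3's format.
messages = [
    {"role": "system", "content": "You can call the provided tools to answer the user."},
    {"role": "user", "content": "What's the weather in Paris right now?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)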


🔬 Technical Implementation

LoRA Configuration

Parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation):

  • Rank (r): 16 - Balances expressiveness and memory
  • Alpha: 32 - Scaling factor for LoRA updates
  • Target Modules: All attention and MLP projections
  • Trainable %: 0.53% of base model parameters
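In code, the adapter setup above corresponds roughly to the following (a sketch using Unsloth's get_peft_model helper; the exact call in the training script may differ):

from unsloth import FastLanguageModel

# Load the base model, then attach LoRA adapters (r=16, alpha=32) on every
# attention and MLP projection, matching the configuration listed above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=4096,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
                    "gate_proj", "up_proj", "down_proj"],     # MLP projections
)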

GRPO with DAPO

Group Relative Policy Optimization (GRPO) with a DAPO-style loss configuration:

  • β = 0.0: Removes the KL divergence term entirely (standard DAPO approach)
  • Benefits:
    • No reference model needed → ~40GB VRAM savings
    • Faster training iterations
    • Recommended by Unsloth for single-GPU setups
  • Trade-off: Less stable optimization compared to β > 0
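For orientation, these choices map onto TRL's GRPOConfig roughly as follows (a sketch assuming TRL's GRPO trainer API; the authoritative settings live in configs/train/grpo_qwen3_8b.yaml):

from trl import GRPOConfig

# Memory-constrained settings mirroring the hyperparameter table above.
# beta=0.0 drops the KL penalty, so no frozen reference model is kept in VRAM.
config = GRPOConfig(
    output_dir="checkpoints/toolcalling/qwen3-8b-tool-rl",  # illustrative path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size of 32
    num_generations=1,               # 1 completion per prompt (papers use 4-8)
    beta=0.0,                        # DAPO-style: no KL term, no reference model
    report_to="wandb",
)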

Reward Function

Multi-component reward system evaluating four dimensions:

Component             Weight  Description
Format Validation     0.2     Valid JSON structure, correct tool call syntax
Function Name Match   0.2     Correct function selected from available tools
Argument Correctness  0.4     Accurate parameter values and types
Reasoning Quality     0.2     Presence of structured thinking/explanation

Rewards are normalized to the [-1, 1] range and combined linearly.
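As a rough illustration of this weighting (a sketch; the actual logic lives in src/toolcalling_rl/reward/toolcalling_rew.py, and the component scores below are hypothetical placeholders):

# Hypothetical per-component scores, each already normalized to [-1, 1].
WEIGHTS = {
    "format": 0.2,     # valid JSON / tool-call syntax
    "name": 0.2,       # correct function selected
    "args": 0.4,       # correct parameter values and types
    "reasoning": 0.2,  # structured thinking present
}

def combined_reward(components: dict[str, float]) -> float:
    """Linearly combine component scores; weights sum to 1.0, so the result stays in [-1, 1]."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

# Example: perfect format, name, and reasoning, but wrong arguments.
print(combined_reward({"format": 1.0, "name": 1.0, "args": -1.0, "reasoning": 1.0}))  # 0.2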


Evaluation

Model evaluated on BFCL v4 (Berkeley Function Calling Leaderboard) benchmark:

Test Categories Covered: multi_turn, single_turn, live, non_live

Key Findings:

  • Achieved baseline parity with the pre-trained Qwen3-8B
  • Successfully validated end-to-end RL pipeline functionality
  • Limited improvement due to:
    • Base model already strong at tool calling
    • Single epoch training (budget constraints)
    • Conservative hyperparameters (memory constraints)

Note: With multi-epoch training and a larger compute budget, further improvements are expected.

Reproducing BFCL Evaluation

Reproduce the BFCL v4 evaluation with the Gorilla leaderboard repo and the vLLM backend.

  1. Environment setup (Conda):
# If you don't have conda, install Miniconda first:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc

conda create -n BFCL python=3.10 -y
conda activate BFCL
  2. Install Gorilla BFCL:
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .

# Optional: W&B integration
cp bfcl_eval/.env.example .env
pip install -e .[wandb]
  3. Bring your checkpoints and merge LoRA:
# From your project root
cp -rp RLCompRew/checkpoints/toolcalling gorilla/berkeley-function-call-leaderboard/

Create merge_lora.py in gorilla/berkeley-function-call-leaderboard:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

if __name__ == "__main__":
    base = "unsloth/Qwen3-8B"                                   # base model the LoRA was trained on
    lora = "toolcalling/qwen3-8b-tool-rl/checkpoint-4332/"      # LoRA adapter checkpoint
    out  = "toolcalling/qwen3-8b-tool-rl/qwen3-8b-4332-merged"  # output dir for the merged model

    # Load the tokenizer saved alongside the adapter, then the base model.
    tok = AutoTokenizer.from_pretrained(lora, use_fast=True)
    mdl = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")

    # Attach the LoRA weights and fold them into the base model.
    mdl = PeftModel.from_pretrained(mdl, lora)
    mdl = mdl.merge_and_unload()

    # Save a standalone merged checkpoint that vLLM/BFCL can load directly.
    mdl.save_pretrained(out)
    tok.save_pretrained(out)
  4. Register a model entry (example) in the BFCL registry:

Locate this file in your local clone: https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/bfcl_eval/constants/model_config.py and add:

"qwen3-8b-tool-rl-FC": ModelConfig(
    model_name="unsloth/Qwen3-8B",
    display_name="Qwen3-8B-tool-rl (FC)",
    url="https://huggingface.co/unsloth/Qwen3-8B",
    org="Qwen",
    license="apache-2.0",
    model_handler=QwenFCHandler,
    is_fc_model=True,
    underscore_to_dot=False,
),
  5. Generate and evaluate with vLLM:
# Fine-tuned (merged LoRA)
bfcl generate \
  --model qwen3-8b-tool-rl-FC \
  --test-category multi_turn,single_turn,live,non_live \
  --backend vllm \
  --num-gpus 1 \
  --gpu-memory-utilization 0.9 \
  --local-model-path toolcalling/qwen3-8b-tool-rl/qwen3-8b-4332-merged

bfcl evaluate \
  --model qwen3-8b-tool-rl-FC \
  --test-category multi_turn,single_turn,live,non_live
  6. Baseline comparison (base Qwen3-8B-FC):
bfcl generate \
  --model Qwen/Qwen3-8B-FC \
  --test-category multi_turn,single_turn,live,non_live \
  --backend vllm \
  --num-gpus 1 \
  --gpu-memory-utilization 0.9

bfcl evaluate \
  --model Qwen/Qwen3-8B-FC \
  --test-category multi_turn,single_turn,live,non_live

Model Selection & Performance Analysis

Key Insight: Qwen3-8B was not an ideal choice for demonstrating RL improvements.

Why Limited Improvement?

  • Base Qwen3-8B already has strong tool-calling capabilities out-of-the-box
  • Pre-trained on extensive function-calling data
  • Limited headroom for improvement without extensive multi-epoch training
  • Already achieves high scores on BFCL v4 benchmark before RL training

Value Delivered Despite Minimal Performance Gains:

  • ✅ Validated end-to-end RL pipeline - All components work correctly
  • ✅ Created reproducible methodology - Easily applied to any HuggingFace model
  • ✅ Built reusable infrastructure - Data pipeline, reward functions, evaluation harness

Pipeline Generalizability: The codebase is model-agnostic and can be applied to:

  • Any decoder-only LLM on HuggingFace Hub (Llama, Mistral, Phi, Gemma, etc.)
  • Different model sizes (3B → 70B with appropriate hardware)
  • Other structured generation tasks beyond tool calling (JSON generation, code completion, etc.)

Lesson Learned: Model selection should consider baseline capabilities and improvement potential, not just absolute performance. However, building a reusable, well-documented, model-agnostic pipeline has value beyond single-model results.


๐Ÿ“ Project Structure

RLCompRew/
├── src/toolcalling_rl/
│   ├── data/                  # Data preprocessing pipelines
│   │   ├── raw_data_preprocess.py   # Download & clean ToolACE/XLAM
│   │   └── toolcall_preprocess.py   # Format for RL training
│   ├── train/                 # RL training implementation
│   │   └── rl_train.py              # Main GRPO training loop
│   ├── reward/                # Reward function implementations
│   │   └── toolcalling_rew.py       # Tool-calling specific rewards
│   └── utils/                 # Utility functions
│       └── tool_utils.py            # Tool parsing & validation
│
├── configs/
│   └── train/
│       └── grpo_qwen3_8b.yaml       # Main training configuration
│
├── scripts/
│   ├── setup.sh               # Environment setup for Lambda Cloud
│   └── prepare_data.sh        # Data preparation script
│
├── data/                      # Not committed due to size
│   ├── raw/                   # Downloaded datasets (ToolACE, XLAM)
│   └── processed/             # Preprocessed training data
│       └── toolcall_data/     # Final formatted dataset (69K examples)
├── pyproject.toml             # Python dependencies (uv)
└── README.md

🔮 Future Directions

With additional compute budget, promising directions include:

  • Apply to Weaker Baselines - Test on Llama 3.1-8B, Mistral-7B, or Phi-3 and smaller models to demonstrate clearer RL improvements
  • Multi-Epoch Training - Extend to 3-5 epochs to assess convergence and the performance ceiling
  • DR-GRPO Experiments - Bias-corrected GRPO variant (drops length and std normalization) for more reliable advantage estimates
  • GSPO Variant - Group Sequence Policy Optimization for improved stability
  • Ablation Studies - Test the impact of each reward component (format, name, args, reasoning)
  • Scaling Up - Apply the pipeline to Qwen2.5-14B, Qwen2.5-32B, or Llama 3.1-70B
  • Beta Sweep - Test β ∈ {0.0, 0.01, 0.05, 0.1} on a multi-GPU setup with a reference model
  • Other Structured Tasks - Adapt the pipeline for JSON generation, SQL queries, or code completion

📚 References

This project was inspired by the following work:

  • Nemotron-Research-Tool-N1 - NVIDIA Labs
    GitHub | Demonstrates RL techniques for tool calling in LLMs

  • ToolRL: Reward is All Tool Learning Needs
    GitHub | Research on reward-based optimization for tool learning

  • ToolACE - Multi-turn tool calling conversations

  • XLAM Function Calling 60K - Large-scale function calling dataset

  • BFCL v4 - Berkeley Function Calling Leaderboard evaluation benchmark


โญ If you find this project useful, please consider giving it a star! โญ
