TL;DR: A model-agnostic, reproducible reinforcement learning pipeline for enhancing tool-calling capabilities, inspired by NVIDIA's Nemotron-Tool-N1 and ToolRL. Built with TRL + Unsloth (not VERL), achieving baseline-parity performance under strict budget constraints. Demonstrates successful GRPO/DAPO implementation with easy adaptability to Unsloth-supported models, using prompt/schema adapters where needed (Llama, Mistral, Phi, Gemma, etc.).
A fine-tuned 1-epoch checkpoint is available on Hugging Face:
- Model card: mikedad/qwen3-8b-tool-rl
Most implementations of LLM tool-calling with RL use VERL (as in the original Nemotron and ToolRL papers). This project explores an alternative approach using TRL + Unsloth to:
- Reduce implementation complexity → Leverage the existing HuggingFace ecosystem
- Enable memory-efficient training → Unsloth provides 2x faster training with lower VRAM usage
- Democratize experimentation → Achievable on a single GPU with a limited budget (<$300)
The base Qwen3-8B model already has strong tool-calling capabilities, making significant improvements challenging without extensive compute. However, this project successfully demonstrates:
- ✅ End-to-end RL pipeline that works correctly
- ✅ Model-agnostic design, easily adaptable to other Unsloth-supported HuggingFace models
- ✅ Reproducible setup on cloud infrastructure
This project adapts state-of-the-art reinforcement learning techniques to enhance tool calling performance in the Qwen3-8B model. Key aspects:
- Model: Qwen3-8B (8.2B parameters) with LoRA adapters (43.6M trainable params, 0.53%)
- RL Algorithm: Group Relative Policy Optimization (GRPO) with DAPO loss configuration
- Training Framework: Unsloth + TRL for memory-efficient fine-tuning
- Reward System: Multi-component scoring evaluating format, correctness, and reasoning quality
- Datasets: 69,653 examples from ToolACE + XLAM function calling datasets
- Evaluation: BFCL v4 benchmark
- Template-Aware, Model-Agnostic: Works out-of-the-box for Qwen3; other models may require prompt/schema adapters
- Memory Efficiency: Unsloth integration enables training on single 80GB GPU
- DAPO Configuration: β=0 removes the KL divergence term, eliminating the expensive reference model
- Comprehensive Data Pipeline: Automated preprocessing for ToolACE and XLAM datasets
- Full Observability: Weights & Biases integration for experiment tracking
This project was trained on Lambda Cloud infrastructure:
| Component | Specification | Cost |
|---|---|---|
| GPU | 1x NVIDIA H100 (80GB PCIe) | $2.49/hr |
| RAM | 200GB DDR5 | Included |
| Storage | 1TB NVMe SSD | Included |
| Total Training Cost | ~100 hours × $2.49/hr | ~$249 for 1 epoch |
All training hyperparameters were specifically tuned to fit within 80GB VRAM without OOM errors:
| Parameter | Value | Rationale |
|---|---|---|
| `per_device_train_batch_size` | 8 | Maximum batch size for available VRAM |
| `gradient_accumulation_steps` | 4 | Achieves effective batch size of 32 |
| `num_generations` | 1 | Memory-constrained generation count (vs. 4-8 in the papers) |
| `beta` | 0.0 | Removes reference model (saves ~40GB VRAM) |
| LoRA `r` / `alpha` | 16 / 32 | Smaller rank = smaller memory footprint |
| `max_seq_len` | 4096 | Balances context length and memory |
Important Note: With larger infrastructure (e.g., an 8×H100 setup as in the Nemotron paper), you could use:
- Larger batch sizes (16-32 per device)
- More generations per step (4-8)
- Full model fine-tuning instead of LoRA
- Reference model with β > 0 for better stability
These choices would likely yield better results but were infeasible under budget constraints.
# Clone repository
git clone https://github.com/MDadopoulos/RLCompRew.git
cd RLCompRew

Requirements:
- Python 3.12+
- 80GB+ GPU recommended (A100/H100)
- See `pyproject.toml` for full dependencies
# Install uv package manager if needed
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh # Linux/Mac
# For Windows, see official docs
# Create virtual environment and install dependencies
uv venv
uv sync
# Activate environment
# source .venv/bin/activate   # Linux/Mac
.venv\Scripts\activate        # Windows

# Login to Hugging Face (required for dataset access)
huggingface-cli login
# Download and preprocess raw datasets (ToolACE + XLAM)
uv run python -m src.toolcalling_rl.data.raw_data_preprocess
# Build training dataset with prompts and format
uv run python -m src.toolcalling_rl.data.toolcall_preprocess

Processed data will be saved to data/processed/toolcall_data/ (~69K examples).
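For orientation, a processed record roughly pairs a tool-augmented prompt with a ground-truth call. The field names below are hypothetical; the authoritative schema is defined in `src/toolcalling_rl/data/toolcall_preprocess.py`:

```python
# Hypothetical illustration of one processed record (field names are assumptions;
# see src/toolcalling_rl/data/toolcall_preprocess.py for the actual schema).
example = {
    "prompt": "You can call the tools below.\nUser: What's the weather in Paris?",
    "tools": [{"name": "get_weather", "parameters": {"city": {"type": "string"}}}],
    "ground_truth": {"name": "get_weather", "arguments": {"city": "Paris"}},
}
```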
# Login to Weights & Biases for experiment tracking
wandb login
# Start GRPO training
uv run python -m src.toolcalling_rl.train.rl_train \
--cfg_path configs/train/grpo_qwen3_8b.yaml \
--data_dir data/processed/toolcall_data
# For long-running jobs on cloud, use tmux:
# tmux new -s train
# source .venv/bin/activate
# uv run python -m src.toolcalling_rl.train.rl_train ...

Training run summary:

Num examples: 69,653
Num epochs: 1
Total steps: 4,353
Batch size per device: 8
Gradient accumulation: 4
Effective batch size: 32
Trainable parameters: 43,646,976 / 8,234,382,336 (0.53%)
This pipeline is designed to work with Unsloth-supported decoder-only models. Compatibility depends on the chat/prompt template and the function-calling schema of each model family. Qwen3 works out-of-the-box; for others, create a separate YAML and adjust the prompt construction and schema as needed.
# 1) Copy the base config and rename
cp configs/train/grpo_qwen3_8b.yaml configs/train/grpo_llama3_8b.yaml
# 2) Edit the new config
vim configs/train/grpo_llama3_8b.yaml
# 3) Set the target model (pick from Unsloth-supported models)
model_name: "unsloth/Llama-3.1-8B-Instruct"
# Other examples:
# - "unsloth/Mistral-7B-Instruct-v0.3"
# - "unsloth/Phi-3-medium-4k-instruct"
# - "unsloth/gemma-2-9b-it"
# 4) Adjust memory-related settings and template/schema for your target model
max_seq_len: 8192 # Reduce if you hit OOM
grpo.num_generations: 1 # Increase if you have more VRAM
grpo.per_device_train_batch_size: 4 # Tune with gradient_accumulation_steps
grpo.gradient_accumulation_steps: 8 # Keep effective batch similar
lora.r: 16 # Increase for larger models if memory allows
# 5) Run training with the new config
uv run python -m src.toolcalling_rl.train.rl_train \
--cfg_path configs/train/grpo_llama3_8b.yaml \
--data_dir data/processed/toolcall_data

Compatibility: Any decoder-only model supported by Unsloth (Llama, Mistral, Phi, Gemma, Qwen, Yi, DeepSeek), with prompt/schema adapters as required (a sketch follows below).
Recommendation: For clearer RL gains, prefer models with weaker baseline tool-calling (e.g., Llama 3.1-8B, Mistral-7B) and always tune both the YAML (memory) and prompt/schema (format) to your target model.
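As a rough illustration of what such a prompt/schema adapter can look like, the sketch below leans on the Transformers chat-template API; `build_prompt` and its fallback path are assumptions for illustration, not the repo's actual code:

```python
# Sketch of a model-agnostic prompt builder (build_prompt is a hypothetical helper,
# not the repo's API); the tokenizer's chat template carries the per-model formatting.
import json
from transformers import AutoTokenizer

def build_prompt(model_name: str, tools: list, user_msg: str,
                 native_tools: bool = True) -> str:
    tok = AutoTokenizer.from_pretrained(model_name)
    messages = [{"role": "user", "content": user_msg}]
    if native_tools:
        # Tool-aware chat templates (e.g., Qwen3, Llama 3.1) render the schema themselves.
        return tok.apply_chat_template(
            messages, tools=tools, tokenize=False, add_generation_prompt=True
        )
    # For families without native tool support, inline the schema into a system message.
    messages.insert(0, {
        "role": "system",
        "content": "You can call these tools:\n" + json.dumps(tools, indent=2),
    })
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```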
Parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation); a code sketch follows this list:
- Rank (r): 16 → Balances expressiveness and memory
- Alpha: 32 → Scaling factor for LoRA updates
- Target Modules: All attention and MLP projections
- Trainable %: 0.53% of base model parameters
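A minimal sketch of how this LoRA configuration maps onto the Unsloth API (illustrative values only; the authoritative settings live in `configs/train/grpo_qwen3_8b.yaml`):

```python
# Minimal sketch of the LoRA setup with Unsloth; the exact repo code may differ.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=4096,          # matches the max_seq_len budget above
    load_in_4bit=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                         # LoRA rank
    lora_alpha=32,                # scaling factor for LoRA updates
    lora_dropout=0.0,
    target_modules=[              # all attention and MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
)
```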
Group Relative Policy Optimization (GRPO) with a DAPO-style (Decoupled Clip and Dynamic Sampling Policy Optimization) loss configuration; a TRL sketch follows this list:
- β = 0.0: Removes the KL divergence term entirely (standard DAPO approach)
- Benefits:
  - No reference model needed → ~40GB VRAM savings
  - Faster training iterations
  - Recommended by Unsloth for single-GPU setups
- Trade-off: Less stable optimization compared to β > 0
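The sketch below shows how a β = 0 GRPO run can be wired up with TRL. It is an assumption-laden outline rather than the repo's training loop (which lives in `src/toolcalling_rl/train/rl_train.py`); `num_generations` is shown at a paper-style value instead of the memory-constrained 1 used here, and the dataset loading is a placeholder.

```python
# Sketch of a DAPO-style GRPO setup with TRL: beta=0.0 drops the KL term,
# so no frozen reference model is kept in memory.
from datasets import load_from_disk
from trl import GRPOConfig, GRPOTrainer

# Placeholder: assumes the processed prompts were saved with datasets.save_to_disk.
train_dataset = load_from_disk("data/processed/toolcall_data")

def tool_format_reward(completions, **kwargs):
    # Placeholder reward; the project combines format/name/args/reasoning scores.
    return [1.0 if "<tool_call>" in c else -1.0 for c in completions]

config = GRPOConfig(
    output_dir="checkpoints/toolcalling",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    beta=0.0,                 # no KL penalty -> no reference model (~40GB VRAM saved)
    num_generations=4,        # this project used 1 due to memory limits
    max_completion_length=1024,
)

trainer = GRPOTrainer(
    model=model,              # Unsloth/PEFT-wrapped Qwen3-8B from the LoRA sketch above
    args=config,
    train_dataset=train_dataset,
    reward_funcs=[tool_format_reward],
)
trainer.train()
```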
Multi-component reward system evaluating four dimensions:
| Component | Weight | Description |
|---|---|---|
| Format Validation | 0.2 | Valid JSON structure, correct tool call syntax |
| Function Name Match | 0.2 | Correct function selected from available tools |
| Argument Correctness | 0.4 | Accurate parameter values and types |
| Reasoning Quality | 0.2 | Presence of structured thinking/explanation |
Rewards are normalized to the [-1, 1] range and combined linearly, as illustrated in the sketch below.
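To make the weighting concrete, here is an illustrative combiner, assuming tool calls are emitted inside `<tool_call>` tags and reasoning inside `<think>` tags (both assumptions; the actual rewards live in `src/toolcalling_rl/reward/toolcalling_rew.py`):

```python
# Illustrative reward combiner; <tool_call>/<think> tags and the exact-match
# argument check are assumptions, not the repo's actual scoring rules.
import json
import re

WEIGHTS = {"format": 0.2, "name": 0.2, "args": 0.4, "reasoning": 0.2}

def combined_reward(completion: str, expected_name: str, expected_args: dict) -> float:
    """Each sub-score is in {-1, 1}; weights sum to 1.0, so the result stays in [-1, 1]."""
    scores = dict.fromkeys(WEIGHTS, -1.0)
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", completion, re.S)
    if match:
        try:
            call = json.loads(match.group(1))
            scores["format"] = 1.0                       # valid JSON tool call
            if call.get("name") == expected_name:
                scores["name"] = 1.0                     # correct function selected
            if call.get("arguments") == expected_args:
                scores["args"] = 1.0                     # arguments match the reference
        except json.JSONDecodeError:
            pass
    if "<think>" in completion:
        scores["reasoning"] = 1.0                        # structured reasoning present
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```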
Model evaluated on BFCL v4 (Berkeley Function Calling Leaderboard) benchmark:
Test Categories Covered: `multi_turn`, `single_turn`, `live`, `non_live` (i.e., Live / Non-Live / Multi-Turn / Single-Turn)
Key Findings:
- Achieved baseline parity with the pre-trained Qwen3-8B
- Successfully validated end-to-end RL pipeline functionality
- Limited improvement due to:
- Base model already strong at tool calling
- Single epoch training (budget constraints)
- Conservative hyperparameters (memory constraints)
Note: With multi-epoch training and larger compute budget, further improvements are expected.
Reproduce the BFCL v4 evaluation with the Gorilla leaderboard harness and the vLLM backend.
- Environment setup (Conda):
# If you don't have conda, install Miniconda first:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda create -n BFCL python=3.10 -y
conda activate BFCL

- Install Gorilla BFCL:
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .
# Optional: W&B integration
cp bfcl_eval/.env.example .env
pip install -e .[wandb]

- Bring your checkpoints and merge LoRA:
# From your project root
cp -rp RLCompRew/checkpoints/toolcalling gorilla/berkeley-function-call-leaderboard/

Create merge_lora.py in gorilla/berkeley-function-call-leaderboard:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

if __name__ == "__main__":
    base = "unsloth/Qwen3-8B"                                  # base model weights
    lora = "toolcalling/qwen3-8b-tool-rl/checkpoint-4332/"     # trained LoRA adapter
    out = "toolcalling/qwen3-8b-tool-rl/qwen3-8b-4332-merged"  # merged output directory

    tok = AutoTokenizer.from_pretrained(lora, use_fast=True)
    mdl = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")
    mdl = PeftModel.from_pretrained(mdl, lora)  # attach the LoRA adapter to the base model
    mdl = mdl.merge_and_unload()                # fold the LoRA weights into the base weights
    mdl.save_pretrained(out)
    tok.save_pretrained(out)

- Register a model entry (example) in the BFCL registry:
Locate this file in your local clone (https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/bfcl_eval/constants/model_config.py) and add:
"qwen3-8b-tool-rl-FC": ModelConfig(
model_name="unsloth/Qwen3-8B",
display_name="Qwen3-8B-tool-rl (FC)",
url="https://huggingface.co/unsloth/Qwen3-8B",
org="Qwen",
license="apache-2.0",
model_handler=QwenFCHandler,
is_fc_model=True,
underscore_to_dot=False,
),

- Generate and evaluate with vLLM:
# Fine-tuned (merged LoRA)
bfcl generate \
--model qwen3-8b-tool-rl-FC \
--test-category multi_turn,single_turn,live,non_live \
--backend vllm \
--num-gpus 1 \
--gpu-memory-utilization 0.9 \
--local-model-path toolcalling/qwen3-8b-tool-rl/qwen3-8b-4332-merged
bfcl evaluate \
--model qwen3-8b-tool-rl-FC \
--test-category multi_turn,single_turn,live,non_live

- Baseline comparison (base Qwen3-8B-FC):
bfcl generate \
--model Qwen/Qwen3-8B-FC \
--test-category multi_turn,single_turn,live,non_live \
--backend vllm \
--num-gpus 1 \
--gpu-memory-utilization 0.9
bfcl evaluate \
--model Qwen/Qwen3-8B-FC \
--test-category multi_turn,single_turn,live,non_live

Key Insight: Qwen3-8B was not an ideal choice for demonstrating RL improvements.
Why Limited Improvement?
- Base Qwen3-8B already has strong tool-calling capabilities out-of-the-box
- Pre-trained on extensive function-calling data
- Limited headroom for improvement without extensive multi-epoch training
- Already achieves high scores on BFCL v4 benchmark before RL training
Value Delivered Despite Minimal Performance Gains:
- ✅ Validated end-to-end RL pipeline → All components work correctly
- ✅ Created reproducible methodology → Easily applied to any HuggingFace model
- ✅ Built reusable infrastructure → Data pipeline, reward functions, evaluation harness
Pipeline Generalizability: The codebase is model-agnostic and can be applied to:
- Any Unsloth-supported decoder-only LLM on the HuggingFace Hub (Llama, Mistral, Phi, Gemma, etc.)
- Different model sizes (3B → 70B with appropriate hardware)
- Other structured generation tasks beyond tool calling (JSON generation, code completion, etc.)
Lesson Learned: Model selection should consider baseline capabilities and improvement potential, not just absolute performance. However, building a reusable, well-documented, model-agnostic pipeline has value beyond single-model results.
RLCompRew/
├── src/toolcalling_rl/
│   ├── data/                        # Data preprocessing pipelines
│   │   ├── raw_data_preprocess.py   # Download & clean ToolACE/XLAM
│   │   └── toolcall_preprocess.py   # Format for RL training
│   ├── train/                       # RL training implementation
│   │   └── rl_train.py              # Main GRPO training loop
│   ├── reward/                      # Reward function implementations
│   │   └── toolcalling_rew.py       # Tool-calling specific rewards
│   └── utils/                       # Utility functions
│       └── tool_utils.py            # Tool parsing & validation
│
├── configs/
│   └── train/
│       └── grpo_qwen3_8b.yaml       # Main training configuration
│
├── scripts/
│   ├── setup.sh                     # Environment setup for Lambda Cloud
│   └── prepare_data.sh              # Data preparation script
│
├── data/                            # Not committed due to size
│   ├── raw/                         # Downloaded datasets (ToolACE, XLAM)
│   └── processed/                   # Preprocessed training data
│       └── toolcall_data/           # Final formatted dataset (69K examples)
├── pyproject.toml                   # Python dependencies (uv)
└── README.md
With additional compute budget, promising directions include:
- Apply to Weaker Baselines → Test on Llama 3.1-8B, Mistral-7B, or Phi-3 and smaller models to demonstrate clearer RL improvements
- Multi-Epoch Training → Extend to 3-5 epochs to assess convergence and the performance ceiling
- DR-GRPO Experiments → Distributional reward modeling for variance reduction
- GSPO Variant → Group Sequence Policy Optimization for improved stability
- Ablation Studies → Test the impact of each reward component (format, name, args, reasoning)
- Scaling Up → Apply the pipeline to Qwen2.5-14B, Qwen2.5-32B, or Llama 3.1-70B
- Beta Sweep → Test β ∈ {0.0, 0.01, 0.05, 0.1} on a multi-GPU setup with a reference model
- Other Structured Tasks → Adapt the pipeline for JSON generation, SQL queries, or code completion
This project was inspired by the following work:
- Nemotron-Research-Tool-N1 (NVIDIA Labs) → Demonstrates RL techniques for tool calling in LLMs
- ToolRL: Reward is All Tool Learning Needs → Research on reward-based optimization for tool learning
- ToolACE → Multi-turn tool-calling conversation dataset
- XLAM Function Calling 60K → Large-scale function-calling dataset
- BFCL v4 → Berkeley Function Calling Leaderboard evaluation benchmark
⭐ If you find this project useful, please consider giving it a star! ⭐