TL;DR: A model-agnostic, reproducible reinforcement learning pipeline for enhancing tool-calling capabilities, inspired by NVIDIA's Nemotron-Tool-N1 and ToolRL. Built with TRL + Unsloth (not VERL), achieving baseline-parity performance under strict budget constraints. Demonstrates successful GRPO/DAPO implementation with easy adaptability to Unsloth-supported models, using prompt/schema adapters where needed (Llama, Mistral, Phi, Gemma, etc.).
A fine-tuned 1-epoch checkpoint is available on Hugging Face:
- Model card: mikedad/qwen3-8b-tool-rl
Most implementations of LLM tool-calling with RL use VERL (as in the original Nemotron and ToolRL papers). This project explores an alternative approach using TRL + Unsloth to:
- Reduce implementation complexity → Leverage the existing HuggingFace ecosystem
- Enable memory-efficient training → Unsloth provides 2x faster training with lower VRAM usage
- Democratize experimentation → Achievable on a single GPU with a limited budget (<$300)
The base Qwen3-8B model already has strong tool-calling capabilities, making significant improvements challenging without extensive compute. However, this project successfully demonstrates:
- ✅ End-to-end RL pipeline that works correctly
- ✅ Model-agnostic design, easily adaptable to other Unsloth-supported HuggingFace models
- ✅ Reproducible setup on cloud infrastructure
This project adapts state-of-the-art reinforcement learning techniques to enhance tool calling performance in the Qwen3-8B model. Key aspects:
- Model: Qwen3-8B (8.2B parameters) with LoRA adapters (43.6M trainable params, 0.53%)
- RL Algorithm: Group Relative Policy Optimization (GRPO) with DAPO loss configuration
- Training Framework: Unsloth + TRL for memory-efficient fine-tuning
- Reward System: Multi-component scoring evaluating format, correctness, and reasoning quality
- Datasets: 69,653 examples from ToolACE + XLAM function calling datasets
- Evaluation: BFCL v4 benchmark
- Template-Aware, Model-Agnostic: Works out-of-the-box for Qwen3; other models may require prompt/schema adapters
- Memory Efficiency: Unsloth integration enables training on single 80GB GPU
- DAPO Configuration: β=0 removes the KL divergence term, eliminating the expensive reference model
- Comprehensive Data Pipeline: Automated preprocessing for ToolACE and XLAM datasets
- Full Observability: Weights & Biases integration for experiment tracking
This project was trained on Lambda Cloud infrastructure:
| Component | Specification | Cost |
|---|---|---|
| GPU | 1x NVIDIA H100 (80GB PCIe) | $2.49/hr |
| RAM | 200GB DDR5 | Included |
| Storage | 1TB NVMe SSD | Included |
| Total Training Cost | ~100 hours × $2.49/hr | ~$249 for 1 epoch |
All training hyperparameters were specifically tuned to fit within 80GB VRAM without OOM errors:
| Parameter | Value | Rationale |
|---|---|---|
| `per_device_train_batch_size` | 8 | Maximum batch size for available VRAM |
| `gradient_accumulation_steps` | 4 | Achieves effective batch size of 32 |
| `num_generations` | 1 | Memory-constrained generation count (vs. 4-8 in the papers) |
| `beta` | 0.0 | Removes reference model (saves ~40GB VRAM) |
| LoRA `r` / `alpha` | 16 / 32 | Smaller rank = smaller memory footprint |
| `max_seq_len` | 4096 | Balances context length and memory |
Important Note: With larger infrastructure (e.g., an 8×H100 setup as in the Nemotron paper), you could use:
- Larger batch sizes (16-32 per device)
- More generations per step (4-8)
- Full model fine-tuning instead of LoRA
- Reference model with β > 0 for better stability
These choices would likely yield better results but were infeasible under budget constraints.
# Clone repository
git clone https://github.com/MDadopoulos/RLCompRew.git
cd RLCompRew

Requirements:
- Python 3.12+
- 80GB+ GPU recommended (A100/H100)
- See `pyproject.toml` for full dependencies
# Install uv package manager if needed
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh # Linux/Mac
# For Windows, see official docs
# Create virtual environment and install dependencies
uv venv
uv sync
# Activate environment
# source .venv/bin/activate   # Linux/Mac
.venv\Scripts\activate        # Windows

# Login to Hugging Face (required for dataset access)
huggingface-cli login
# Download and preprocess raw datasets (ToolACE + XLAM)
uv run python -m src.toolcalling_rl.data.raw_data_preprocess
# Build training dataset with prompts and format
uv run python -m src.toolcalling_rl.data.toolcall_preprocess

Processed data will be saved to data/processed/toolcall_data/ (~69K examples).
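For orientation, a processed record roughly pairs a tool-augmented prompt with a ground-truth call. The field names below are hypothetical; the authoritative schema is defined in `src/toolcalling_rl/data/toolcall_preprocess.py`:

```python
# Hypothetical illustration of one processed record (field names are assumptions;
# see src/toolcalling_rl/data/toolcall_preprocess.py for the actual schema).
example = {
    "prompt": "You can call the tools below.\nUser: What's the weather in Paris?",
    "tools": [{"name": "get_weather", "parameters": {"city": {"type": "string"}}}],
    "ground_truth": {"name": "get_weather", "arguments": {"city": "Paris"}},
}
```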
# Login to Weights & Biases for experiment tracking
wandb login
# Start GRPO training
uv run python -m src.toolcalling_rl.train.rl_train \
--cfg_path configs/train/grpo_qwen3_8b.yaml \
--data_dir data/processed/toolcall_data
# For long-running jobs on cloud, use tmux:
# tmux new -s train
# source .venv/bin/activate
# uv run python -m src.toolcalling_rl.train.rl_train ...

Training run summary:

Num examples: 69,653
Num epochs: 1
Total steps: 4,353
Batch size per device: 8
Gradient accumulation: 4
Effective batch size: 32
Trainable parameters: 43,646,976 / 8,234,382,336 (0.53%)
This pipeline is designed to work with Unsloth-supported decoder-only models. Compatibility depends on the chat/prompt template and the function-calling schema of each model family. Qwen3 works out-of-the-box; for others, create a separate YAML and adjust the prompt construction and schema as needed.
# 1) Copy the base config and rename
cp configs/train/grpo_qwen3_8b.yaml configs/train/grpo_llama3_8b.yaml
# 2) Edit the new config
vim configs/train/grpo_llama3_8b.yaml
# 3) Set the target model (pick from Unsloth-supported models)
model_name: "unsloth/Llama-3.1-8B-Instruct"
# Other examples:
# - "unsloth/Mistral-7B-Instruct-v0.3"
# - "unsloth/Phi-3-medium-4k-instruct"
# - "unsloth/gemma-2-9b-it"
# 4) Adjust memory-related settings and template/schema for your target model
max_seq_len: 8192 # Reduce if you hit OOM
grpo.num_generations: 1 # Increase if you have more VRAM
grpo.per_device_train_batch_size: 4 # Tune with gradient_accumulation_steps
grpo.gradient_accumulation_steps: 8 # Keep effective batch similar
lora.r: 16 # Increase for larger models if memory allows
# 5) Run training with the new config
uv run python -m src.toolcalling_rl.train.rl_train \
--cfg_path configs/train/grpo_llama3_8b.yaml \
--data_dir data/processed/toolcall_data

Compatibility: Any decoder-only model supported by Unsloth (Llama, Mistral, Phi, Gemma, Qwen, Yi, DeepSeek), with prompt/schema adapters as required (a sketch follows below).
Recommendation: For clearer RL gains, prefer models with weaker baseline tool-calling (e.g., Llama 3.1-8B, Mistral-7B) and always tune both the YAML (memory) and prompt/schema (format) to your target model.
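As a rough illustration of what such a prompt/schema adapter can look like, the sketch below leans on the Transformers chat-template API; `build_prompt` and its fallback path are assumptions for illustration, not the repo's actual code:

```python
# Sketch of a model-agnostic prompt builder (build_prompt is a hypothetical helper,
# not the repo's API); the tokenizer's chat template carries the per-model formatting.
import json
from transformers import AutoTokenizer

def build_prompt(model_name: str, tools: list, user_msg: str,
                 native_tools: bool = True) -> str:
    tok = AutoTokenizer.from_pretrained(model_name)
    messages = [{"role": "user", "content": user_msg}]
    if native_tools:
        # Tool-aware chat templates (e.g., Qwen3, Llama 3.1) render the schema themselves.
        return tok.apply_chat_template(
            messages, tools=tools, tokenize=False, add_generation_prompt=True
        )
    # For families without native tool support, inline the schema into a system message.
    messages.insert(0, {
        "role": "system",
        "content": "You can call these tools:\n" + json.dumps(tools, indent=2),
    })
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```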
Parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation); a code sketch follows this list:
- Rank (r): 16 → Balances expressiveness and memory
- Alpha: 32 → Scaling factor for LoRA updates
- Target Modules: All attention and MLP projections
- Trainable %: 0.53% of base model parameters
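A minimal sketch of how this LoRA configuration maps onto the Unsloth API (illustrative values only; the authoritative settings live in `configs/train/grpo_qwen3_8b.yaml`):

```python
# Minimal sketch of the LoRA setup with Unsloth; the exact repo code may differ.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=4096,          # matches the max_seq_len budget above
    load_in_4bit=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                         # LoRA rank
    lora_alpha=32,                # scaling factor for LoRA updates
    lora_dropout=0.0,
    target_modules=[              # all attention and MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
)
```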
Group Relative Policy Optimization (GRPO) with a DAPO-style (Decoupled Clip and Dynamic Sampling Policy Optimization) loss configuration; a TRL sketch follows this list:
- β = 0.0: Removes the KL divergence term entirely (standard DAPO approach)
- Benefits:
  - No reference model needed → ~40GB VRAM savings
  - Faster training iterations
  - Recommended by Unsloth for single-GPU setups
- Trade-off: Less stable optimization compared to β > 0
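The sketch below shows how a β = 0 GRPO run can be wired up with TRL. It is an assumption-laden outline rather than the repo's training loop (which lives in `src/toolcalling_rl/train/rl_train.py`); `num_generations` is shown at a paper-style value instead of the memory-constrained 1 used here, and the dataset loading is a placeholder.

```python
# Sketch of a DAPO-style GRPO setup with TRL: beta=0.0 drops the KL term,
# so no frozen reference model is kept in memory.
from datasets import load_from_disk
from trl import GRPOConfig, GRPOTrainer

# Placeholder: assumes the processed prompts were saved with datasets.save_to_disk.
train_dataset = load_from_disk("data/processed/toolcall_data")

def tool_format_reward(completions, **kwargs):
    # Placeholder reward; the project combines format/name/args/reasoning scores.
    return [1.0 if "<tool_call>" in c else -1.0 for c in completions]

config = GRPOConfig(
    output_dir="checkpoints/toolcalling",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    beta=0.0,                 # no KL penalty -> no reference model (~40GB VRAM saved)
    num_generations=4,        # this project used 1 due to memory limits
    max_completion_length=1024,
)

trainer = GRPOTrainer(
    model=model,              # Unsloth/PEFT-wrapped Qwen3-8B from the LoRA sketch above
    args=config,
    train_dataset=train_dataset,
    reward_funcs=[tool_format_reward],
)
trainer.train()
```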
Multi-component reward system evaluating four dimensions:
| Component | Weight | Description |
|---|---|---|
| Format Validation | 0.2 | Valid JSON structure, correct tool call syntax |
| Function Name Match | 0.2 | Correct function selected from available tools |
| Argument Correctness | 0.4 | Accurate parameter values and types |
| Reasoning Quality | 0.2 | Presence of structured thinking/explanation |
Rewards are normalized to the [-1, 1] range and combined linearly, as illustrated in the sketch below.
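To make the weighting concrete, here is an illustrative combiner, assuming tool calls are emitted inside `<tool_call>` tags and reasoning inside `<think>` tags (both assumptions; the actual rewards live in `src/toolcalling_rl/reward/toolcalling_rew.py`):

```python
# Illustrative reward combiner; <tool_call>/<think> tags and the exact-match
# argument check are assumptions, not the repo's actual scoring rules.
import json
import re

WEIGHTS = {"format": 0.2, "name": 0.2, "args": 0.4, "reasoning": 0.2}

def combined_reward(completion: str, expected_name: str, expected_args: dict) -> float:
    """Each sub-score is in {-1, 1}; weights sum to 1.0, so the result stays in [-1, 1]."""
    scores = dict.fromkeys(WEIGHTS, -1.0)
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", completion, re.S)
    if match:
        try:
            call = json.loads(match.group(1))
            scores["format"] = 1.0                       # valid JSON tool call
            if call.get("name") == expected_name:
                scores["name"] = 1.0                     # correct function selected
            if call.get("arguments") == expected_args:
                scores["args"] = 1.0                     # arguments match the reference
        except json.JSONDecodeError:
            pass
    if "<think>" in completion:
        scores["reasoning"] = 1.0                        # structured reasoning present
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```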
Model evaluated on BFCL v4 (Berkeley Function Calling Leaderboard) benchmark:
Test Categories Covered: `multi_turn`, `single_turn`, `live`, `non_live` (i.e., Live / Non-Live / Multi-Turn / Single-Turn)
Key Findings:
- Achieved baseline parity with the pre-trained Qwen3-8B
- Successfully validated end-to-end RL pipeline functionality
- Limited improvement due to:
- Base model already strong at tool calling
- Single epoch training (budget constraints)
- Conservative hyperparameters (memory constraints)
Note: With multi-epoch training and larger compute budget, further improvements are expected.
Reproduce the BFCL v4 evaluation with the Gorilla leaderboard harness and the vLLM backend.
- Environment setup (Conda):
# If you don't have conda, install Miniconda first:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda create -n BFCL python=3.10 -y
conda activate BFCL

- Install Gorilla BFCL:
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .
# Optional: W&B integration
cp bfcl_eval/.env.example .env
pip install -e .[wandb]

- Bring your checkpoints and merge LoRA:
# From your project root
cp -rp RLCompRew/checkpoints/toolcalling gorilla/berkeley-function-call-leaderboard/

Create merge_lora.py in gorilla/berkeley-function-call-leaderboard:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

if __name__ == "__main__":
    base = "unsloth/Qwen3-8B"                                  # base model weights
    lora = "toolcalling/qwen3-8b-tool-rl/checkpoint-4332/"     # trained LoRA adapter
    out = "toolcalling/qwen3-8b-tool-rl/qwen3-8b-4332-merged"  # merged output directory

    tok = AutoTokenizer.from_pretrained(lora, use_fast=True)
    mdl = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")
    mdl = PeftModel.from_pretrained(mdl, lora)  # attach the LoRA adapter to the base model
    mdl = mdl.merge_and_unload()                # fold the LoRA weights into the base weights
    mdl.save_pretrained(out)
    tok.save_pretrained(out)

- Register a model entry (example) in the BFCL registry:
Locate this file in your local clone (https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/bfcl_eval/constants/model_config.py) and add:
"qwen3-8b-tool-rl-FC": ModelConfig(
model_name="unsloth/Qwen3-8B",
display_name="Qwen3-8B-tool-rl (FC)",
url="https://huggingface.co/unsloth/Qwen3-8B",
org="Qwen",
license="apache-2.0",
model_handler=QwenFCHandler,
is_fc_model=True,
underscore_to_dot=False,
),

- Generate and evaluate with vLLM:
# Fine-tuned (merged LoRA)
bfcl generate \
--model qwen3-8b-tool-rl-FC \
--test-category multi_turn,single_turn,live,non_live \
--backend vllm \
--num-gpus 1 \
--gpu-memory-utilization 0.9 \
--local-model-path toolcalling/qwen3-8b-tool-rl/qwen3-8b-4332-merged
bfcl evaluate \
--model qwen3-8b-tool-rl-FC \
--test-category multi_turn,single_turn,live,non_live

- Baseline comparison (base Qwen3-8B-FC):
bfcl generate \
--model Qwen/Qwen3-8B-FC \
--test-category multi_turn,single_turn,live,non_live \
--backend vllm \
--num-gpus 1 \
--gpu-memory-utilization 0.9
bfcl evaluate \
--model Qwen/Qwen3-8B-FC \
--test-category multi_turn,single_turn,live,non_live

Key Insight: Qwen3-8B was not an ideal choice for demonstrating RL improvements.
Why Limited Improvement?
- Base Qwen3-8B already has strong tool-calling capabilities out-of-the-box
- Pre-trained on extensive function-calling data
- Limited headroom for improvement without extensive multi-epoch training
- Already achieves high scores on BFCL v4 benchmark before RL training
Value Delivered Despite Minimal Performance Gains:
- ✅ Validated end-to-end RL pipeline → All components work correctly
- ✅ Created reproducible methodology → Easily applied to any HuggingFace model
- ✅ Built reusable infrastructure → Data pipeline, reward functions, evaluation harness
Pipeline Generalizability: The codebase is model-agnostic and can be applied to:
- Any Unsloth-supported decoder-only LLM on the HuggingFace Hub (Llama, Mistral, Phi, Gemma, etc.)
- Different model sizes (3B → 70B with appropriate hardware)
- Other structured generation tasks beyond tool calling (JSON generation, code completion, etc.)
Lesson Learned: Model selection should consider baseline capabilities and improvement potential, not just absolute performance. However, building a reusable, well-documented, model-agnostic pipeline has value beyond single-model results.
RLCompRew/
├── src/toolcalling_rl/
│   ├── data/                        # Data preprocessing pipelines
│   │   ├── raw_data_preprocess.py   # Download & clean ToolACE/XLAM
│   │   └── toolcall_preprocess.py   # Format for RL training
│   ├── train/                       # RL training implementation
│   │   └── rl_train.py              # Main GRPO training loop
│   ├── reward/                      # Reward function implementations
│   │   └── toolcalling_rew.py       # Tool-calling specific rewards
│   └── utils/                       # Utility functions
│       └── tool_utils.py            # Tool parsing & validation
│
├── configs/
│   └── train/
│       └── grpo_qwen3_8b.yaml       # Main training configuration
│
├── scripts/
│   ├── setup.sh                     # Environment setup for Lambda Cloud
│   └── prepare_data.sh              # Data preparation script
│
├── data/                            # Not committed due to size
│   ├── raw/                         # Downloaded datasets (ToolACE, XLAM)
│   └── processed/                   # Preprocessed training data
│       └── toolcall_data/           # Final formatted dataset (69K examples)
├── pyproject.toml                   # Python dependencies (uv)
└── README.md
With additional compute budget, promising directions include:
- Apply to Weaker Baselines → Test on Llama 3.1-8B, Mistral-7B, or Phi-3 and smaller models to demonstrate clearer RL improvements
- Multi-Epoch Training → Extend to 3-5 epochs to assess convergence and the performance ceiling
- DR-GRPO Experiments → Distributional reward modeling for variance reduction
- GSPO Variant → Group Sequence Policy Optimization for improved stability
- Ablation Studies → Test the impact of each reward component (format, name, args, reasoning)
- Scaling Up → Apply the pipeline to Qwen2.5-14B, Qwen2.5-32B, or Llama 3.1-70B
- Beta Sweep → Test β ∈ {0.0, 0.01, 0.05, 0.1} on a multi-GPU setup with a reference model
- Other Structured Tasks → Adapt the pipeline for JSON generation, SQL queries, or code completion
This project was inspired by the following work:
- Nemotron-Research-Tool-N1 (NVIDIA Labs) → Demonstrates RL techniques for tool calling in LLMs
- ToolRL: Reward is All Tool Learning Needs → Research on reward-based optimization for tool learning
- ToolACE → Multi-turn tool-calling conversation dataset
- XLAM Function Calling 60K → Large-scale function-calling dataset
- BFCL v4 → Berkeley Function Calling Leaderboard evaluation benchmark
⭐ If you find this project useful, please consider giving it a star! ⭐