As AI models, especially Large Language Models (LLMs) and Vision-Language Models (VLMs), continue to grow in scale and complexity, the need for heterogeneous and decentralized training strategies is becoming increasingly critical. Training such massive models demands enormous computational resources, which are often inaccessible to most researchers and organizations.
HPC centers around the world host a wide variety of GPUs, ranging across different vendors, architectures, and hardware configurations. However, these variations introduce compatibility and utilization challenges, often preventing AI researchers from seamlessly leveraging multiple HPC systems at once.
This platform demonstrates a practical approach to overcoming these challenges by connecting heterogeneous HPC resources in a decentralized manner using the DiLoCo algorithm. It enables collaborative, cross-platform AI model training without requiring homogeneous hardware or centralized orchestration. In particular, this platform enables two levels of heterogeneity:
- Cross-hardware heterogeneity: Train models across multiple hardware platforms, leveraging both PyTorch and DaCe integration. This provides the flexibility to exploit various compute resources regardless of vendor or GPU generation, and efficient training on exotic backends can be supported by extending DaCe.
- Non-uniform GPU distribution: HPC clusters vary widely in their node configurations (commonly with 4 or 8 GPUs per node). Our platform offers native support for varying GPU counts and node structures, allowing seamless scaling across diverse systems.
In short, this platform aims to:
- Overcome hardware heterogeneity in HPC environments
- Enable decentralized collaboration for large-model training
- Achieve efficient cross-center resource sharing
- Lower the computational and financial barriers to AI research
We have tested and validated heterogeneous training on the following platforms:
- Nvidia GPUs (L40S, A100, H100, H200)
- AMD GPUs (MI300X)
Training on a CPU backend is currently not supported.
Besides the Getting started section, you can find additional documentation in the doc folder:
- Detailed installation tutorial
- Detailed description of decentralized and heterogeneous training
- Tutorial on adding a new model
- Deployment on Cloud using SkyPilot
First clone the repository with:
git clone --recursive https://github.com/PanocularAI/symphony-learn.git
Make sure that you pull all submodules by using the --recursive flag.
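If you have already cloned the repository without --recursive, you can fetch the submodules afterwards with the standard git command:
git submodule update --init --recursive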
To facilitate the installation of the framework, you can run the Makefile to automatically set up the dependencies and environment:
make all
However, some of the commands in the Makefile may fail due to incompatibilities with your setup. In that case, please refer to the detailed Installation Guideline for installing the dependencies and for troubleshooting.
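If the Makefile does not work on your system, a minimal manual setup might look like the sketch below. This assumes the project manages its Python dependencies with uv (the training commands later in this guide use uv run); the authoritative steps are in the Installation Guideline.
uv venv   # create a local virtual environment (assumed uv-based workflow)
uv sync   # install the dependencies declared by the project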
To establish communication between different compute islands, each compute node must have a routable public IP address. If public IPs are not available, it is recommended to use Tailscale. Please follow the instructions to set up the Tailscale service on your machine.
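For reference, installing Tailscale and finding the address it assigns typically looks like the following (standard Tailscale commands; adapt them to your distribution and security policy):
curl -fsSL https://tailscale.com/install.sh | sh   # install the Tailscale client
sudo tailscale up                                  # authenticate and join your tailnet
tailscale ip -4                                    # print this node's Tailscale IPv4 address
The printed Tailscale IPv4 address can then be used wherever a routable IP is required below.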
Here we walk through a sample decentralized training of a Llama3 model on two different Nvidia islands. For a more detailed explanation and the arguments required for fully heterogeneous training (different numbers of GPUs and different vendors), please refer to Launching Training.
You need to execute the following three commands in different shell sessions; a sketch of how the address-related variables might be set follows the commands.
- Start the lighthouse engine.
RUST_BACKTRACE=1 torchft_lighthouse --bind=<public_ip>:29510 --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
- Run the training on the first island:
TORCHFT_LIGHTHOUSE=http://<public_ip>:29510 \
NGPU=1 \
LOCAL_ADDR=${LOCAL_ADDR} \
MASTER_ADDR=${MASTER_ADDR} \
MASTER_PORT=29500 \
NNODES=<num_nodes> \
ISHOST=<yes_if_master-no_if_worker> \
GLOO_SOCKET_IFNAME=<network_card> \
NCCL_SOCKET_IFNAME=<network_card> \
CONFIG_FILE="./models/llama3/train_configs/debug_model.toml" \
uv run ./run_train.sh --fault_tolerance.enable --fault_tolerance.replica_id=0 --fault_tolerance.group_size=2
- Run the training on the second island:
TORCHFT_LIGHTHOUSE=http://<public_ip>:29510 \
NGPU=1 \
LOCAL_ADDR=<local_ip> \
MASTER_ADDR=<master_c10d_ip> \
MASTER_PORT=29500 \
NNODES=<num_nodes> \
ISHOST=<yes_if_master-no_if_worker> \
GLOO_SOCKET_IFNAME=<network_card> \
NCCL_SOCKET_IFNAME=<network_card> \
CONFIG_FILE="./models/llama3/train_configs/debug_model.toml" \
uv run ./run_train.sh --fault_tolerance.enable --fault_tolerance.replica_id=1 --fault_tolerance.group_size=2
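As an illustration only, the address-related environment variables above might be populated as follows; treat this as a sketch, since the right interface and addresses depend on your network setup, and LIGHTHOUSE_IP is just a placeholder for the machine running the lighthouse:
export LIGHTHOUSE_IP=<public_ip>                       # machine running torchft_lighthouse
export TORCHFT_LIGHTHOUSE=http://$LIGHTHOUSE_IP:29510
export LOCAL_ADDR=$(tailscale ip -4)                   # or: hostname -I | awk '{print $1}'
export MASTER_ADDR=$LOCAL_ADDR                         # on the master node; on workers, use the master's address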
Currently, we have validated decentralized training of the following models:
- Llama3
- Qwen3
- GPT_OSS
- ResNets
Before running training, make sure to download the relevant tokenizer from HF, using:
uv run python torchtitan/scripts/download_hf_assets.py --repo_id <hf_repo_name> --assets tokenizer --hf_token=$HF_TOKEN
Replace <hf_repo_name> with the HF model path, such as meta-llama/Llama-3.1-8B for Llama3-8B or Qwen/Qwen3-0.6B for Qwen3-0.6B. Specifying $HF_TOKEN and requesting access to Llama3 on HF are required for downloading the Llama tokenizer.
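For example, fetching the Llama3-8B tokenizer could look like this (assuming HF_TOKEN holds an access token with access to the meta-llama repository):
export HF_TOKEN=<your_hf_access_token>
uv run python torchtitan/scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=$HF_TOKEN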
Many more models are already available in TorchTitan models and TorchTitan experiment models. Moreover, new models can easily be added by following the Adding a new model tutorial.
This work builds upon the following open-source frameworks:
- TorchTitan — a PyTorch-native platform for large-scale generative AI model training (Liang et al., ICLR 2025).
- TorchFT — a library providing fault-tolerance primitives for distributed PyTorch training (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo).
We gratefully acknowledge the PyTorch, TorchTitan, and TorchFT teams for their foundational contributions to distributed and fault-tolerant ML training infrastructures.
This project is gratefully funded by the German Federal Agency for Disruptive Innovation (SPRIND) under the Composite Learning Challenge.
