Swarm Learning is a decentralized, privacy-preserving Machine Learning framework. It uses the computing power at, or near, the distributed data sources to run the Machine Learning algorithms that train the models, and it relies on the security of a blockchain platform to share learnings with peers safely. In Swarm Learning, the model is trained at the edge, where the data is most recent and where prompt, data-driven decisions are most needed. In this completely decentralized architecture, only the insights learned are shared with the collaborating ML peers, not the raw data. Because raw data never leaves its source, data security and privacy are strengthened.
A Swarm Learning node works in collaboration with the other Swarm Learning nodes in the network. It regularly shares its learnings with the other nodes and incorporates their insights. This process continues until the Swarm Learning nodes train the model to the desired state. Users can monitor the progress of the current training as shown in the image below. It shows all running Swarm nodes, the loss, the model metric (for example, accuracy), and the overall training progress for each user ML node. Hovering over the progress bar shows the number of completed epochs and the total number of epochs.
The Swarm Learning framework is made up of components known as nodes: Swarm Learning (SL) nodes, Swarm Network (SN) nodes, Swarm Learning Command Interface (SWCI) nodes, and Swarm Operator (SWOP) nodes. Each node is modularized and runs in a separate container. The nodes represent different Swarm Learning functions, not physical server nodes.
- SL nodes run the core of Swarm Learning. An SL node works in collaboration with all the other SL nodes in the network. It regularly shares its learnings with the other nodes and incorporates their insights. SL nodes act as an interface between the user model application (ML node) and the other Swarm Learning components, and they take care of distributing and merging model weights securely.
- SN nodes form the blockchain network. The current version of Swarm Learning uses an open-source version of Ethereum as the underlying blockchain platform. The SN nodes interact with each other using this blockchain platform to maintain and track progress. They use this state and progress information to coordinate the working of the other Swarm Learning components. The blockchain can be persisted across SN restarts to preserve the network's past progress, so users can look up the blockchain and see the full history of operations. Users have the flexibility to stop Swarm after training is completed. Once a user restarts the SN network, the existing history can be accessed using the `get` or `list` command of the SWCI management interface. The Sentinel node is a special SN node. It is responsible for initializing the blockchain network and is the first node to start.
NOTE: Only metadata is written to the blockchain. The model itself is not stored in the blockchain.
- A SWOP node is an agent that manages Swarm Learning operations. SWOP is responsible for executing the tasks that are assigned to it, and a SWOP node can execute only one task at a time. SWOP helps in executing tasks such as starting and stopping Swarm runs, building and upgrading ML containers, and sharing models for training. For more information about SWOP, see Swarm Operator node (SWOP).
- The SWCI node is the command interface tool for the Swarm Learning framework. It is used to monitor the framework. SWCI nodes can connect to any of the SN nodes in a given Swarm Learning framework to manage the framework. For more information on SWCI, see Swarm Learning Command Interface.
- The SLM-UI node is the GUI management tool for the Swarm Learning framework. It has three functions: installing the Swarm Learning framework, deploying a Swarm training run, and monitoring the progress of the current training while tracking past training runs to identify the best one.
- Swarm Learning security and digital identity are handled by X.509 certificates, which also secure the communication among Swarm Learning components. Users can either generate their own certificates or directly use certificates generated by any standard security software such as SPIRE. For more information on SPIRE, see https://thebottomturtle.io/Solving-the-bottom-turtle-SPIFFE-SPIRE-Book.pdf and https://spiffe.io/.
NOTE: The Swarm Learning framework does not initialize if certificates are not provided.
- Swarm Learning components communicate with each other using a set of TCP/IP ports. For details of the ports that must be opened, see Exposed Ports.
NOTE: The participating nodes must be able to access each other's ports.
- The License Server installs and manages the license that is required to run the Swarm Learning framework. The licenses are managed by the AutoPass License Server (APLS) container. For more information, see APLS User Guide.
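The requirement that participating nodes can reach each other's ports can be checked before a run is started. The following is a minimal sketch that uses only Python's standard `socket` module. The function names are illustrative and not part of Swarm Learning, and the host and port values you pass in are your own assumptions; take the real port list from Exposed Ports.

```python
import socket
from typing import Iterable, List


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


def unreachable_ports(host: str, ports: Iterable[int],
                      timeout: float = 2.0) -> List[int]:
    """Return the ports on host that could not be reached, in input order."""
    return [p for p in ports if not port_is_reachable(host, p, timeout)]
```

Calling `unreachable_ports` with a peer's host name and the ports it exposes returns an empty list when every port accepts connections. A port is also reported as unreachable when nothing is listening on it yet, so the check is most useful after the peer's containers have started.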
Users can transform any Keras or PyTorch based ML program that is written in Python 3 into a Swarm Learning ML program by making a few simple changes to the model training code to include the `SwarmCallback` API. For more information, see any of the examples included with the Swarm Learning package.
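To illustrate why a callback needs only a small change to the training code, the sketch below shows the general callback pattern in plain Python. It is a hypothetical stand-in and does not reproduce the `SwarmCallback` API: the class name `PeriodicSyncCallback`, its `sync_frequency` parameter, and the toy `train` loop are all assumptions made for illustration. For the real API and its parameters, use the examples bundled with the Swarm Learning package.

```python
from typing import Callable, List


class PeriodicSyncCallback:
    """Hypothetical stand-in for a Swarm-style callback (NOT SwarmCallback).

    Counts training batches and calls sync_fn once every sync_frequency
    batches, which is where a real callback would exchange learnings.
    """

    def __init__(self, sync_frequency: int,
                 sync_fn: Callable[[int], None]) -> None:
        if sync_frequency < 1:
            raise ValueError("sync_frequency must be at least 1")
        self.sync_frequency = sync_frequency
        self.sync_fn = sync_fn
        self.batches_seen = 0

    def on_batch_end(self) -> None:
        self.batches_seen += 1
        if self.batches_seen % self.sync_frequency == 0:
            self.sync_fn(self.batches_seen)


def train(num_epochs: int, batches_per_epoch: int,
          callbacks: List[PeriodicSyncCallback]) -> None:
    """Toy training loop; the only Swarm-specific change is the callback."""
    for _ in range(num_epochs):
        for _ in range(batches_per_epoch):
            # A real program would run one local training step here.
            for cb in callbacks:
                cb.on_batch_end()


# Record the batch counts at which a sync would have been triggered.
sync_points: List[int] = []
train(num_epochs=2, batches_per_epoch=5,
      callbacks=[PeriodicSyncCallback(sync_frequency=4,
                                      sync_fn=sync_points.append)])
print(sync_points)  # prints [4, 8]
```

The training loop itself is unchanged apart from the extra `callbacks` argument, which is the same shape of change that Keras-style callbacks require.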
The transformed user Machine Learning program (the user ML node) can be built as a Docker container or run on the host.
NOTE: HPE recommends that users build an ML Docker container for easier and automatic deployment.
The ML node is responsible for training and iteratively updating the model. For each ML node, there is a corresponding SL node in the Swarm Learning framework, which performs the Swarm training. Each pair of ML and SL nodes must run on the same host. Training continues until the SL nodes train the model to the desired state.
NOTE: All the ML nodes must use the same ML platform, either Keras (with a TensorFlow 2 backend) or PyTorch. Using Keras for some nodes and PyTorch for others is not supported.
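To make the idea of merging model weights across peers concrete, here is a minimal sketch of one simple merge rule, a weighted average of each layer's weights. This is an illustrative assumption, not the method Swarm Learning implements; the supported merge methods are described in the Merge Methods whitepaper. The function name, the dictionary layout, and the `peer_sizes` parameter are all hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: layer name -> flat list of weight values.
Weights = Dict[str, List[float]]


def merge_weights(peer_weights: Sequence[Weights],
                  peer_sizes: Sequence[float]) -> Weights:
    """Weighted average of per-layer weights from several peers.

    peer_sizes gives each peer's relative contribution, for example the
    number of local training samples (an assumption for this sketch).
    """
    if not peer_weights or len(peer_weights) != len(peer_sizes):
        raise ValueError("need at least one peer and one size per peer")
    total = float(sum(peer_sizes))
    if total <= 0:
        raise ValueError("peer sizes must sum to a positive value")

    merged: Weights = {}
    for name in peer_weights[0]:
        length = len(peer_weights[0][name])
        merged[name] = [
            sum(w[name][i] * s for w, s in zip(peer_weights, peer_sizes)) / total
            for i in range(length)
        ]
    return merged


peer_a = {"dense": [1.0, 2.0]}
peer_b = {"dense": [3.0, 6.0]}
print(merge_weights([peer_a, peer_b], [1, 1]))  # prints {'dense': [2.0, 4.0]}
```

Each peer receives the same merged weights and continues local training from them, which is how only learnings, and never raw data, move between peers.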
- Prerequisites for Swarm Learning
- Upgrading from earlier versions
- Download and setup Swarm Learning using the SLM-UI installer
- Execute a simple predefined example - MNIST example
- Running MNIST example using SLM-UI
- Monitoring & Tracking Swarm Learning training using SLM-UI
- Frequently Asked Questions
- Troubleshooting
- Release Notes
NOTE: The "Accessing Hewlett Packard Enterprise Support" clause and the concurrent Swarm training feature mentioned in the documentation apply to enterprise customers only.
NOTE: The examples and scripts that are bundled with the Swarm UI installer may not be the latest. If you run into issues with them, use the copies from GitHub directly.
- How Swarm Learning Components interact
- Component interactions when using Reverse Proxy
- Swarm Learning Concepts
- Working of a Swarm Learning node
- Adapting ML programs for Swarm Learning
- Swarm wheels package
- Configuring Swarm Learning components
- Using SWCI
- Using SWOP
- Running Swarm Learning examples using SLM-UI
- Running Swarm Learning using CLI
- Running Swarm Learning with SE Linux
- Running Swarm Learning with Podman
- Running Swarm Learning with Spire
- Examples
- Swarm Learning diagnostics using CLI
- Centralized Swarm diagnostics using SLM-UI
- Extending Swarm Learning for new ML platforms
- Merge Methods - Whitepaper
- Uninstalling Swarm Learning using SLM-UI
Refer to Acronyms and Abbreviations for more information.
Feedback and questions are appreciated. You can use the issue tracker to report bugs on GitHub. Alternatively, join the HPE Developer Slack Workspace and start a discussion in our #hpe-swarm-learning channel.
Refer to Contributing for more information.
The distribution of Swarm Learning in this repository is for non-commercial and experimental use under this license.
See ATTRIBUTIONS and DATA LICENSE for terms and conditions for using the datasets included in this repository.