:orphan:

Fault-tolerant Training (expert)

Audience: Experts looking to enable and handle their own fault-tolerance.

Pre-requisites: Users must have first read :doc:`Fault-tolerant Training (basic) <fault_tolerant_training_basic>`.


Enable fault-tolerant behavior anywhere

To enable fault tolerance on your own cloud or cluster environment, set the PL_FAULT_TOLERANT_TRAINING environment variable to 1 when launching your script:

PL_FAULT_TOLERANT_TRAINING=1 python script.py

Although Lightning will now be fault-tolerant, you are responsible for handling all the nuances of making sure the training process is automatically restarted after a failure.
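As a rough illustration of what handling the restart yourself could look like, the sketch below wraps the launch command in a small supervisor loop. The function name ``run_with_restarts`` and its ``max_restarts`` and ``backoff_seconds`` parameters are hypothetical and not part of Lightning. The sketch only covers relaunching a process that exits with a non-zero code; it assumes your script (together with the fault-tolerant mode described in the basic guide) takes care of resuming from saved state.

```python
import os
import subprocess
import sys
import time


def run_with_restarts(cmd, max_restarts=3, backoff_seconds=5.0):
    """Run ``cmd`` with fault tolerance enabled, relaunching it after a failure.

    Hypothetical helper, not a Lightning API. Returns the exit code of the
    last attempt.
    """
    # Copy the current environment and switch on the flag described above.
    env = dict(os.environ, PL_FAULT_TOLERANT_TRAINING="1")
    returncode = 0
    for attempt in range(max_restarts + 1):
        returncode = subprocess.run(cmd, env=env).returncode
        if returncode == 0:
            break  # the process finished cleanly, nothing to restart
        if attempt < max_restarts:
            print(
                f"exit code {returncode}; restarting "
                f"({attempt + 1}/{max_restarts})",
                file=sys.stderr,
            )
            time.sleep(backoff_seconds)  # brief pause before relaunching
    return returncode
```

For example, ``run_with_restarts([sys.executable, "script.py"])`` would launch the same script as the command above and relaunch it up to three times. A production setup would more likely delegate this to the cluster scheduler or a process manager, which can also recover from a node going down; this loop cannot.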

Note

This complexity is already handled for you if you use lightning-grid.


Enable fault-tolerant behavior on your own cluster

The simplest way to enable fault-tolerant behavior is to set up lightning-grid on your on-prem cluster or cloud environment. It then handles all the nuances of fault-tolerant training at scale for you.

Email us to connect with your own cloud account:

mailto:onprem@pytorchlightning.ai