
Commit

Docs fixes (#19529)
awaelchli committed Feb 26, 2024
1 parent 2e512d4 commit 2a827f3
Showing 5 changed files with 22 additions and 23 deletions.
8 changes: 5 additions & 3 deletions docs/source-pytorch/common/checkpointing_intermediate.rst
@@ -167,9 +167,11 @@ In distributed training cases where a model is running across many machines, Lig
trainer = Trainer(strategy="ddp")
model = MyLightningModule(hparams)
trainer.fit(model)
+ # Saves only on the main process
+ # Handles strategy-specific saving logic like XLA, FSDP, DeepSpeed etc.
trainer.save_checkpoint("example.ckpt")
- Not using :meth:`~lightning.pytorch.trainer.trainer.Trainer.save_checkpoint` can lead to unexpected behavior and potential deadlock. Using other saving functions will result in all devices attempting to save the checkpoint. As a result, we highly recommend using the Trainer's save functionality.
- If using custom saving functions cannot be avoided, we recommend using the :func:`~lightning.pytorch.utilities.rank_zero.rank_zero_only` decorator to ensure saving occurs only on the main process. Note that this will only work if all ranks hold the exact same state and won't work when using
- model parallel distributed strategies such as deepspeed or sharded training.
+ By using :meth:`~lightning.pytorch.trainer.trainer.Trainer.save_checkpoint` instead of ``torch.save``, you make your code agnostic to the distributed training strategy being used.
+ It will ensure that checkpoints are saved correctly in a multi-process setting, avoiding race conditions, deadlocks and other common issues that normally require boilerplate code to handle properly.
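To make the recommendation above concrete, here is a minimal sketch (not part of the commit) contrasting the strategy-aware ``Trainer.save_checkpoint`` with a manually rank-guarded ``torch.save``. ``MyLightningModule`` and its import path are assumptions standing in for your own module.

.. code-block:: python

    import torch
    from lightning.pytorch import Trainer

    from my_project import MyLightningModule  # assumed user-defined LightningModule

    trainer = Trainer(strategy="ddp", devices=2)
    model = MyLightningModule()
    trainer.fit(model)

    # Recommended: Lightning handles the rank coordination and any strategy-specific logic.
    trainer.save_checkpoint("example.ckpt")

    # Manual alternative: every process executes this block, so you must guard by rank
    # yourself, and it breaks down for sharded strategies such as FSDP or DeepSpeed.
    if trainer.global_rank == 0:
        torch.save(model.state_dict(), "example_raw.pt")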
10 changes: 6 additions & 4 deletions docs/source-pytorch/starter/installation.rst
@@ -16,7 +16,9 @@ Install lightning inside a virtual env or conda environment with pip
python -m pip install lightning
- --------------
+ ----


******************
Install with Conda
@@ -66,17 +68,17 @@ Install future patch releases from the source. Note that the patch release conta
^^^^^^^^^^^^^^^^^^^^^^
Custom PyTorch Version
^^^^^^^^^^^^^^^^^^^^^^
- To use any PyTorch version visit the `PyTorch Installation Page <https://pytorch.org/get-started/locally/#start-locally>`_.

+ To use any PyTorch version visit the `PyTorch Installation Page <https://pytorch.org/get-started/locally/#start-locally>`_.
+ You can find the list of supported PyTorch versions in our :ref:`compatibility matrix <versioning:Compatibility matrix>`.

----


*******************************************
- Optimized for ML workflows (lightning Apps)
+ Optimized for ML workflows (Lightning Apps)
*******************************************
- If you are deploying workflows built with Lightning in production and require fewer dependencies, try using the optimized `lightning[apps]` package:
+ If you are deploying workflows built with Lightning in production and require fewer dependencies, try using the optimized ``lightning[apps]`` package:

.. code-block:: bash
11 changes: 5 additions & 6 deletions src/lightning/pytorch/callbacks/model_checkpoint.py
@@ -89,13 +89,12 @@ class ModelCheckpoint(Checkpoint):
in a deterministic manner. Default: ``None``.
save_top_k: if ``save_top_k == k``,
the best k models according to the quantity monitored will be saved.
- if ``save_top_k == 0``, no models are saved.
- if ``save_top_k == -1``, all models are saved.
+ If ``save_top_k == 0``, no models are saved.
+ If ``save_top_k == -1``, all models are saved.
Please note that the monitors are checked every ``every_n_epochs`` epochs.
- if ``save_top_k >= 2`` and the callback is called multiple
- times inside an epoch, the name of the saved file will be
- appended with a version count starting with ``v1``
- unless ``enable_version_counter`` is set to False.
+ If ``save_top_k >= 2`` and the callback is called multiple times inside an epoch, and the filename remains
+ unchanged, the name of the saved file will be appended with a version count starting with ``v1`` to avoid
+ collisions unless ``enable_version_counter`` is set to False.
mode: one of {min, max}.
If ``save_top_k != 0``, the decision to overwrite the current save file is made
based on either the maximization or the minimization of the monitored quantity.
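As a usage illustration of the ``save_top_k`` behavior described in this docstring, the sketch below (not part of the commit; the directory, filename pattern and monitored metric are illustrative) keeps the three best checkpoints:

.. code-block:: python

    from lightning.pytorch import Trainer
    from lightning.pytorch.callbacks import ModelCheckpoint

    # Keep the 3 best checkpoints ranked by the logged "val_loss" value (lower is better).
    # save_top_k=0 would disable saving; save_top_k=-1 would keep every checkpoint.
    checkpoint_callback = ModelCheckpoint(
        dirpath="checkpoints/",
        filename="best-{epoch}-{val_loss:.2f}",
        monitor="val_loss",
        mode="min",
        save_top_k=3,
        every_n_epochs=1,
    )

    trainer = Trainer(callbacks=[checkpoint_callback])

If the rendered filename repeats within an epoch, the version suffixes starting at ``v1`` mentioned above are appended unless ``enable_version_counter`` is disabled.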
4 changes: 4 additions & 0 deletions src/lightning/pytorch/core/hooks.py
@@ -85,6 +85,10 @@ def on_train_batch_end(self, outputs: STEP_OUTPUT, batch: Any, batch_idx: int) -
batch: The batched data as it is returned by the training DataLoader.
batch_idx: the index of the batch
+ Note:
+     The value ``outputs["loss"]`` here will be the normalized value w.r.t ``accumulate_grad_batches`` of the
+     loss returned from ``training_step``.
"""

def on_validation_batch_start(self, batch: Any, batch_idx: int, dataloader_idx: int = 0) -> None:
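A small sketch (not part of the commit) of where the new note about ``accumulate_grad_batches`` becomes visible; ``compute_loss`` is an assumed helper and the factor of 4 simply mirrors the illustrative ``Trainer`` setting:

.. code-block:: python

    from lightning.pytorch import LightningModule, Trainer


    class MyModule(LightningModule):  # illustrative subclass, not from the diff
        def training_step(self, batch, batch_idx):
            loss = self.compute_loss(batch)  # assumed helper returning a scalar tensor
            return loss

        def on_train_batch_end(self, outputs, batch, batch_idx):
            # With Trainer(accumulate_grad_batches=4), outputs["loss"] is the value
            # returned by training_step divided by 4, i.e. this batch's share of the
            # accumulated optimization step.
            self.print("normalized loss:", outputs["loss"])


    trainer = Trainer(accumulate_grad_batches=4)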
12 changes: 2 additions & 10 deletions src/lightning/pytorch/core/module.py
@@ -1285,20 +1285,12 @@ def optimizer_step(
Examples::
# DEFAULT
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure):
optimizer.step(closure=optimizer_closure)
+ # Add your custom logic to run directly before `optimizer.step()`
- # Learning rate warm-up
- def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure):
-     # update params
-     optimizer.step(closure=optimizer_closure)
-     # manually warm up lr without a scheduler
-     if self.trainer.global_step < 500:
-         lr_scale = min(1.0, float(self.trainer.global_step + 1) / 500.0)
-         for pg in optimizer.param_groups:
-             pg["lr"] = lr_scale * self.learning_rate
+ # Add your custom logic to run directly after `optimizer.step()`
"""
optimizer.step(closure=optimizer_closure)
