Merge branch 'master' into bugfix/logging-on-batch-start
carmocca committed Oct 15, 2021
2 parents 7d0131b + db4e770 commit 966f709
Showing 156 changed files with 3,959 additions and 1,193 deletions.
2 changes: 1 addition & 1 deletion .azure-pipelines/gpu-tests.yml
@@ -51,7 +51,7 @@ jobs:
- bash: |
python -c "fname = 'requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
pip install fairscale>=0.3.4
pip install "deepspeed==0.4.3" # FIXME: bug with >= 0.4.4
pip install deepspeed==0.5.4
pip install . --requirement requirements/devel.txt
pip list
displayName: 'Install dependencies'
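
For readability, the inline `python -c` filter in the step above unrolls to the following sketch (same logic, shown only for illustration — the pipeline keeps it as a one-liner):

```python
# Drop any horovod entry from requirements/extra.txt in place,
# mirroring the one-line filter used in the pipeline step above.
fname = "requirements/extra.txt"

with open(fname) as f:
    lines = [line for line in f.readlines() if "horovod" not in line]

with open(fname, "w") as f:
    f.writelines(lines)
```
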
1 change: 1 addition & 0 deletions .gitignore
@@ -156,3 +156,4 @@ cifar-10-batches-py
*.pt
# ctags
tags
.tags
8 changes: 4 additions & 4 deletions .pre-commit-config.yaml
@@ -46,7 +46,7 @@ repos:
- id: detect-private-key

- repo: https://github.com/asottile/pyupgrade
rev: v2.23.2
rev: v2.29.0
hooks:
- id: pyupgrade
args: [--py36-plus]
@@ -71,20 +71,20 @@ repos:
name: Format imports

- repo: https://github.com/psf/black
rev: 21.7b0
rev: 21.9b0
hooks:
- id: black
name: Format code

- repo: https://github.com/asottile/blacken-docs
rev: v1.10.0
rev: v1.11.0
hooks:
- id: blacken-docs
args: [ --line-length=120 ]
additional_dependencies: [ black==21.7b0 ]

- repo: https://github.com/executablebooks/mdformat
rev: 0.7.8
rev: 0.7.10
hooks:
- id: mdformat
additional_dependencies:
103 changes: 101 additions & 2 deletions CHANGELOG.md
@@ -5,11 +5,14 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).


## [unReleased] - 2021-MM-DD
## [1.5.0] - 2021-MM-DD

### Added


- Added support for monitoring the learning rate without schedulers in `LearningRateMonitor` ([#9786](https://github.com/PyTorchLightning/pytorch-lightning/issues/9786))
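
A minimal usage sketch for this entry; the `logging_interval` argument and the bare-optimizer setup are illustrative assumptions, not taken from the PR:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import LearningRateMonitor

# With #9786, learning rates are reported even when configure_optimizers()
# returns optimizers without any learning-rate scheduler attached.
lr_monitor = LearningRateMonitor(logging_interval="step")
trainer = Trainer(callbacks=[lr_monitor])
```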


- Register `ShardedTensor` state dict hooks in `LightningModule.__init__` if the pytorch version supports `ShardedTensor` ([#8944](https://github.com/PyTorchLightning/pytorch-lightning/pull/8944))


@@ -146,6 +149,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `ModelSummary` callback ([#9344](https://github.com/PyTorchLightning/pytorch-lightning/pull/9344))


- Added `log_images`, `log_text` and `log_table` to `WandbLogger` ([#9545](https://github.com/PyTorchLightning/pytorch-lightning/pull/9545))


- Added `PL_RECONCILE_PROCESS` environment variable to enable process reconciliation regardless of cluster environment settings ([#9389](https://github.com/PyTorchLightning/pytorch-lightning/pull/9389))


@@ -157,6 +163,11 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Added `RichModelSummary` callback ([#9546](https://github.com/PyTorchLightning/pytorch-lightning/pull/9546))

- Added a warning when an unknown key is encountered in optimizer configuration, and when `OneCycleLR` is used with `"interval": "epoch"` ([#9666](https://github.com/PyTorchLightning/pytorch-lightning/pull/9666))
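
For context, a sketch of the scheduler configuration this warning targets; the model, optimizer, and numbers below are placeholders (`OneCycleLR` steps per batch, so `"interval": "step"` is the usual choice):

```python
import torch
from pytorch_lightning import LightningModule

class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=1000)
        return {
            "optimizer": optimizer,
            # An unknown key in this dict, or OneCycleLR combined with
            # "interval": "epoch", now emits a warning (#9666).
            "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
        }
```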


- Added `DeviceStatsMonitor` callback ([#9712](https://github.com/PyTorchLightning/pytorch-lightning/pull/9712))
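
Enabling it is a one-liner; a sketch assuming the callback logs to whichever logger is attached to the Trainer:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import DeviceStatsMonitor

# Logs utilisation/memory statistics for the device(s) in use during training.
trainer = Trainer(callbacks=[DeviceStatsMonitor()])
```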


- Added `enable_progress_bar` to Trainer constructor ([#9664](https://github.com/PyTorchLightning/pytorch-lightning/pull/9664))

@@ -167,8 +178,30 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added support for `torch.use_deterministic_algorithms` ([#9121](https://github.com/PyTorchLightning/pytorch-lightning/pull/9121))


- Enabled automatic parameter tying for TPUs ([#9525](https://github.com/PyTorchLightning/pytorch-lightning/pull/9525))


- Raise a `MisconfigurationException` when trainer functions are called with `ckpt_path="best"` but `checkpoint_callback` isn't configured ([#9841](https://github.com/PyTorchLightning/pytorch-lightning/pull/9841))


- Added support for `torch.autograd.set_detect_anomaly` through `Trainer` constructor argument `detect_anomaly` ([#9848](https://github.com/PyTorchLightning/pytorch-lightning/pull/9848))
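
A sketch of the new flag; it enables `torch.autograd` anomaly detection for the run, which is slow and intended for debugging NaNs/Infs in the backward pass:

```python
from pytorch_lightning import Trainer

# Debugging aid: surfaces the forward op that produced a NaN/Inf gradient.
trainer = Trainer(detect_anomaly=True)
```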

- Added a `len` method to `LightningDataModule` ([#9895](https://github.com/PyTorchLightning/pytorch-lightning/pull/9895))

- Added `enable_model_summary` flag to Trainer ([#9699](https://github.com/PyTorchLightning/pytorch-lightning/pull/9699))


- Added `strategy` argument to Trainer ([#8597](https://github.com/PyTorchLightning/pytorch-lightning/pull/8597))
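
A sketch of the new argument; `"ddp"` and the GPU count are assumed example values, and the accepted strategy strings depend on the release:

```python
from pytorch_lightning import Trainer

# The distributed strategy is now selected explicitly via `strategy`
# rather than being overloaded onto the `accelerator` argument.
trainer = Trainer(strategy="ddp", gpus=2)
```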


### Changed

- Setting `Trainer(accelerator="ddp_cpu")` now does not spawn a subprocess if `num_processes` is kept `1` along with `num_nodes > 1` ([#9603](https://github.com/PyTorchLightning/pytorch-lightning/pull/9603)).


- Module imports are now catching `ModuleNotFoundError` instead of `ImportError` ([#9867](https://github.com/PyTorchLightning/pytorch-lightning/pull/9867))


- `pytorch_lightning.loggers.neptune.NeptuneLogger` is now consistent with the new [neptune-client](https://github.com/neptune-ai/neptune-client) API ([#6867](https://github.com/PyTorchLightning/pytorch-lightning/pull/6867)).

The old [neptune-client](https://github.com/neptune-ai/neptune-client) API is supported by `NeptuneClient` from the [neptune-contrib](https://github.com/neptune-ai/neptune-contrib) repo.
@@ -244,11 +277,36 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Changed `HorovodPlugin.all_gather` to return a `torch.Tensor` instead of a list ([#9696](https://github.com/PyTorchLightning/pytorch-lightning/pull/9696))


- Changed Trainer connectors to be protected attributes:
* Configuration Validator ([#9779](https://github.com/PyTorchLightning/pytorch-lightning/pull/9779))


- Restore `current_epoch` and `global_step` irrespective of trainer task ([#9413](https://github.com/PyTorchLightning/pytorch-lightning/pull/9413))


- Raise an exception if using `amp_level` with native `amp_backend` ([#9755](https://github.com/PyTorchLightning/pytorch-lightning/pull/9755))


- Added `filename` argument in `ModelCheckpoint.format_checkpoint_name` ([#9818](https://github.com/PyTorchLightning/pytorch-lightning/pull/9818))


- Updated the logic to check for accumulation steps with DeepSpeed ([#9826](https://github.com/PyTorchLightning/pytorch-lightning/pull/9826))


- Updated error message for interactive incompatible plugins ([#9896](https://github.com/PyTorchLightning/pytorch-lightning/pull/9896))


- Updated several places in the loops and trainer to access `training_type_plugin` directly instead of `accelerator` ([#9901](https://github.com/PyTorchLightning/pytorch-lightning/pull/9901))


### Deprecated

- Deprecated trainer argument `terminate_on_nan` in favour of `detect_anomaly` ([#9175](https://github.com/PyTorchLightning/pytorch-lightning/pull/9175))


- Deprecated `Trainer.terminate_on_nan` public attribute access ([#9849](https://github.com/PyTorchLightning/pytorch-lightning/pull/9849))


- Deprecated `LightningModule.summarize()` in favor of `pytorch_lightning.utilities.model_summary.summarize()`


@@ -288,7 +346,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Deprecated passing `progress_bar_refresh_rate` to the `Trainer` constructor in favor of adding the `ProgressBar` callback with `refresh_rate` directly to the list of callbacks, or passing `enable_progress_bar=False` to disable the progress bar ([#9616](https://github.com/PyTorchLightning/pytorch-lightning/pull/9616))
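
The migration this points to, sketched with assumed values (the callback name is as given in the entry; `refresh_rate=20` is arbitrary):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ProgressBar

# Before: Trainer(progress_bar_refresh_rate=20)
# After:  pass the callback explicitly, or disable the bar altogether.
trainer = Trainer(callbacks=[ProgressBar(refresh_rate=20)])
# trainer = Trainer(enable_progress_bar=False)
```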


- Deprecate `LightningDistributed` and move the broadcast logic to `DDPPlugin` and `DDPSpawnPlugin` directly ([#9691](https://github.com/PyTorchLightning/pytorch-lightning/pull/9691))
- Deprecated `LightningDistributed` and move the broadcast logic to `DDPPlugin` and `DDPSpawnPlugin` directly ([#9691](https://github.com/PyTorchLightning/pytorch-lightning/pull/9691))


- Deprecated passing `stochastic_weight_avg` from the `Trainer` constructor in favor of adding the `StochasticWeightAveraging` callback directly to the list of callbacks ([#8989](https://github.com/PyTorchLightning/pytorch-lightning/pull/8989))
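
The replacement usage, sketched with the callback left at its defaults (SWA-specific options such as the SWA learning rate are omitted):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import StochasticWeightAveraging

# Before: Trainer(stochastic_weight_avg=True)
# After:  add the callback directly to the callbacks list.
trainer = Trainer(callbacks=[StochasticWeightAveraging()])
```
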
@@ -297,6 +355,23 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Deprecated Accelerator collective API `barrier`, `broadcast`, and `all_gather`, call `TrainingTypePlugin` collective API directly ([#9677](https://github.com/PyTorchLightning/pytorch-lightning/pull/9677))


- Deprecated `checkpoint_callback` from the `Trainer` constructor in favour of `enable_checkpointing` ([#9754](https://github.com/PyTorchLightning/pytorch-lightning/pull/9754))


- Deprecated the `LightningModule.on_post_move_to_device` method ([#9525](https://github.com/PyTorchLightning/pytorch-lightning/pull/9525))


- Deprecated `pytorch_lightning.core.decorators.parameter_validation` in favor of `pytorch_lightning.utilities.parameter_tying.set_shared_parameters` ([#9525](https://github.com/PyTorchLightning/pytorch-lightning/pull/9525))


- Deprecated passing `weights_summary` to the `Trainer` constructor in favor of adding the `ModelSummary` callback with `max_depth` directly to the list of callbacks ([#9699](https://github.com/PyTorchLightning/pytorch-lightning/pull/9699))
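
The migration sketched; the mapping of the old string values onto `max_depth` (1 for `"top"`, -1 for `"full"`) is stated here as an assumption:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelSummary

# Before: Trainer(weights_summary="top")
# After:  max_depth=1 roughly matches "top"; max_depth=-1 matches "full".
trainer = Trainer(callbacks=[ModelSummary(max_depth=1)])
```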


- Deprecated `log_gpu_memory`, `gpu_metrics`, and util funcs in favor of `DeviceStatsMonitor` callback ([#9921](https://github.com/PyTorchLightning/pytorch-lightning/pull/9921))


- Deprecated `GPUStatsMonitor` and `XLAStatsMonitor` in favor of `DeviceStatsMonitor` callback ([#9924](https://github.com/PyTorchLightning/pytorch-lightning/pull/9924))

### Removed

- Removed deprecated `metrics` ([#8586](https://github.com/PyTorchLightning/pytorch-lightning/pull/8586/))
@@ -398,6 +473,15 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Removed `TrainerProperties` mixin and moved property definitions directly into `Trainer` ([#9495](https://github.com/PyTorchLightning/pytorch-lightning/pull/9495))


- Removed a redundant warning with `ModelCheckpoint(monitor=None)` callback ([#9875](https://github.com/PyTorchLightning/pytorch-lightning/pull/9875))


- Removed `epoch` from `trainer.logged_metrics` ([#9904](https://github.com/PyTorchLightning/pytorch-lightning/pull/9904))


- Removed `should_rank_save_checkpoint` property from Trainer ([#9433](https://github.com/PyTorchLightning/pytorch-lightning/pull/9433))


### Fixed


@@ -422,6 +506,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `BasePredictionWriter` not returning the batch_indices in a non-distributed setting ([#9432](https://github.com/PyTorchLightning/pytorch-lightning/pull/9432))


- Fixed an error when running in XLA environments with no TPU attached ([#9572](https://github.com/PyTorchLightning/pytorch-lightning/pull/9572))


- Fixed the check on logged torchmetrics whose `compute()` output is a multi-element tensor ([#9582](https://github.com/PyTorchLightning/pytorch-lightning/pull/9582))


@@ -443,9 +530,21 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `self.log(on_epoch=True)` for the `on_batch_start` and `on_train_batch_start` hooks ([#9780](https://github.com/PyTorchLightning/pytorch-lightning/pull/9780))
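
This entry is the subject of the branch being merged; a sketch of logging from these hooks (`*args` absorbs version-specific extra hook arguments, and the logged metric is a placeholder):

```python
from pytorch_lightning import LightningModule

class LitModel(LightningModule):
    def on_train_batch_start(self, batch, batch_idx, *args):
        # With #9780, on_epoch=True aggregation also works when called
        # from the on_batch_start / on_train_batch_start hooks.
        self.log("avg_batch_idx", float(batch_idx), on_step=False, on_epoch=True)
```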


- Fixed restoring training state during `trainer.fit` only ([#9413](https://github.com/PyTorchLightning/pytorch-lightning/pull/9413))


- Fixed DeepSpeed and Lightning both calling the scheduler ([#9788](https://github.com/PyTorchLightning/pytorch-lightning/pull/9788))


- Fixed missing arguments when saving hyperparameters from the parent class but not from the child class ([#9800](https://github.com/PyTorchLightning/pytorch-lightning/pull/9800))
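
The pattern this fix concerns, sketched with placeholder classes (my reading of the entry: `save_hyperparameters()` called only in the parent now also captures the child's constructor arguments):

```python
from pytorch_lightning import LightningModule

class Parent(LightningModule):
    def __init__(self, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()  # collected here, in the parent only

class Child(Parent):
    def __init__(self, lr: float = 1e-3, hidden_dim: int = 128):
        # hidden_dim is no longer missing from self.hparams after #9800.
        super().__init__(lr=lr)
```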


- Fixed DeepSpeed GPU device IDs ([#9847](https://github.com/PyTorchLightning/pytorch-lightning/pull/9847))


- Reset `val_dataloader` in `tuner/batch_size_scaling` ([#9857](https://github.com/PyTorchLightning/pytorch-lightning/pull/9857))


## [1.4.9] - 2021-09-30

- Fixed `lr_find` to generate same results on multiple calls ([#9704](https://github.com/PyTorchLightning/pytorch-lightning/pull/9704))
2 changes: 1 addition & 1 deletion benchmarks/test_basic_parity.py
@@ -159,7 +159,7 @@ def lightning_loop(cls_model, idx, device_type: str = "cuda", num_epochs=10):
# as the first run is skipped, no need to run it long
max_epochs=num_epochs if idx > 0 else 1,
enable_progress_bar=False,
weights_summary=None,
enable_model_summary=False,
gpus=1 if device_type == "cuda" else 0,
checkpoint_callback=False,
logger=False,
3 changes: 2 additions & 1 deletion dockers/nvidia/Dockerfile
@@ -32,7 +32,8 @@ RUN \
rm -rf pytorch-lightning ; \
wget https://github.com/PyTorchLightning/pytorch-lightning/archive/${LIGHTNING_VERSION}.zip --progress=bar:force:noscroll ; \
unzip ${LIGHTNING_VERSION}.zip ; \
mv pytorch-lightning-*/ pytorch-lightning ; \
mv pytorch-lightning-*/* pytorch-lightning ; \
rm -rf pytorch-lightning-* ; \
rm *.zip ; \
fi && \
# save the examples
3 changes: 2 additions & 1 deletion dockers/tpu-tests/tpu_test_cases.jsonnet
@@ -35,7 +35,8 @@ local tputests = base.BaseTest {
coverage run --source=pytorch_lightning -m pytest -v --capture=no \
tests/profiler/test_xla_profiler.py \
pytorch_lightning/utilities/xla_device.py \
tests/accelerators/test_tpu_backend.py \
tests/accelerators/test_tpu.py \
tests/callbacks/test_device_stats_monitor.py \
tests/models/test_tpu.py
test_exit_code=$?
echo "\n||| END PYTEST LOGS |||\n"
17 changes: 12 additions & 5 deletions docs/source/advanced/multi_gpu.rst
@@ -611,27 +611,34 @@ Let's say you have a batch size of 7 in your dataloader.
def train_dataloader(self):
return Dataset(..., batch_size=7)

In (DDP, Horovod) your effective batch size will be 7 * gpus * num_nodes.
In DDP, DDP_SPAWN, Deepspeed, DDP_SHARDED, or Horovod your effective batch size will be 7 * gpus * num_nodes.

.. code-block:: python

    # effective batch size = 7 * 8
    Trainer(gpus=8, accelerator="ddp|horovod")
    Trainer(gpus=8, accelerator="ddp")
    Trainer(gpus=8, accelerator="ddp_spawn")
    Trainer(gpus=8, accelerator="ddp_sharded")
    Trainer(gpus=8, accelerator="horovod")

    # effective batch size = 7 * 8 * 10
    Trainer(gpus=8, num_nodes=10, accelerator="ddp|horovod")
    Trainer(gpus=8, num_nodes=10, accelerator="ddp")
    Trainer(gpus=8, num_nodes=10, accelerator="ddp_spawn")
    Trainer(gpus=8, num_nodes=10, accelerator="ddp_sharded")
    Trainer(gpus=8, num_nodes=10, accelerator="horovod")

In DDP2, your effective batch size will be 7 * num_nodes.
In DDP2 or DP, your effective batch size will be 7 * num_nodes.
The reason is that the full batch is visible to all GPUs on the node when using DDP2.

.. code-block:: python

    # effective batch size = 7
    Trainer(gpus=8, accelerator="ddp2")
    Trainer(gpus=8, accelerator="dp")

    # effective batch size = 7 * 10
    Trainer(gpus=8, num_nodes=10, accelerator="ddp2")
    Trainer(gpus=8, accelerator="dp")

.. note:: Huge batch sizes are actually really bad for convergence. Check out:
2 changes: 1 addition & 1 deletion docs/source/advanced/training_tricks.rst
@@ -30,7 +30,7 @@ Gradient Clipping
Gradient clipping may be enabled to avoid exploding gradients. By default, this will clip the gradient norm by calling
:func:`torch.nn.utils.clip_grad_norm_` computed over all model parameters together.
If the Trainer's ``gradient_clip_algorithm`` is set to ``'value'`` (``'norm'`` by default), this will use instead
:func:`torch.nn.utils.clip_grad_norm_` for each parameter instead.
:func:`torch.nn.utils.clip_grad_value_` for each parameter instead.
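
A short sketch of turning this on from the Trainer (the values are arbitrary):

```python
from pytorch_lightning import Trainer

# Norm-based clipping (the default algorithm):
trainer = Trainer(gradient_clip_val=0.5)

# Value-based clipping, which applies clip_grad_value_ per parameter:
trainer = Trainer(gradient_clip_val=0.5, gradient_clip_algorithm="value")
```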

.. note::
If using mixed precision, the ``gradient_clip_val`` does not need to be changed as the gradients are unscaled
