Model Parallelism Training and More Logging Options
Overview
Lightning 1.1 is out! You can now train models with twice the parameters and zero code changes with the new sharded model training! We also have a new plugin for sequential model parallelism, more logging options, and a lot of improvements!
Release highlights: https://bit.ly/3gyLZpP
Learn more about sharded training: https://bit.ly/2W3hgI0
Detailed changes
Added
- Added "monitor" key to saved ModelCheckpoints (#4383)
- Added ConfusionMatrix class interface (#4348)
- Added multiclass AUROC metric (#4236)
- Added global step indexing to the checkpoint name for a better sub-epoch checkpointing experience (#3807)
- Added optimizer hooks in callbacks (#4379)
- Added option to log momentum (#4384)
- Added current_score to ModelCheckpoint.on_save_checkpoint (#4721)
- Added logging using self.log in train and evaluation for epoch end hooks (#4913)
- Added ability for DDP plugin to modify optimizer state saving (#4675)
- Added casting to python types for NumPy scalars when logging hparams (#4647)
- Added prefix argument in loggers (#4557)
- Added printing of total num of params, trainable and non-trainable params in ModelSummary (#4521)
- Added PrecisionRecallCurve, ROC, AveragePrecision class metric (#4549)
- Added custom Apex and NativeAMP as Precision plugins (#4355)
- Added DALI MNIST example (#3721)
- Added sharded plugin for DDP for multi-GPU training memory optimizations (#4773)
- Added experiment_id to the NeptuneLogger (#3462)
- Added Pytorch Geometric integration example with Lightning (#4568)
- Added all_gather method to LightningModule which allows gradient-based tensor synchronizations for use-cases such as negative sampling. (#5012)
- Enabled self.log in most functions (#4969)
- Added changeable extension variable for ModelCheckpoint (#4977)
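To illustrate the two checkpoint naming entries above (global step in the checkpoint name, changeable extension), here is a minimal sketch in plain Python. It does not call Lightning. The helper `format_checkpoint_name`, the `epoch=...-step=...` template, and the default `.ckpt` extension are assumptions made for illustration, not the library's implementation.

```python
def format_checkpoint_name(epoch: int, global_step: int, extension: str = ".ckpt") -> str:
    """Hypothetical helper: build a checkpoint file name that includes the global step."""
    # Including the global step keeps several checkpoints saved within
    # the same epoch from colliding on a single file name.
    return f"epoch={epoch}-step={global_step}{extension}"


# Two checkpoints saved during the same epoch get distinct names.
first = format_checkpoint_name(epoch=2, global_step=1000)
second = format_checkpoint_name(epoch=2, global_step=1500)

# The extension is a plain parameter, so it can be swapped out.
custom = format_checkpoint_name(epoch=2, global_step=1500, extension=".pt")
```

With an epoch-only name, both saves in epoch 2 would target the same file; indexing by step is what makes sub-epoch checkpointing practical.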
Changed
- Removed multiclass_roc and multiclass_precision_recall_curve, use roc and precision_recall_curve instead (#4549)
- Tuner algorithms will be skipped if fast_dev_run=True (#3903)
- WandbLogger does not force wandb reinit arg to True anymore and creates a run only when needed (#4648)
- Changed automatic_optimization to be a model attribute (#4602)
- Changed Simple Profiler report to order by percentage time spent + num calls (#4880)
- Simplify optimization Logic (#4984)
- Classification metrics overhaul (#4837)
- Updated fast_dev_run to accept integer representing num_batches (#4629)
- Refactored optimizer (#4658)
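The fast_dev_run change above can be pictured with a small sketch in plain Python. It does not use Lightning. The helper name `resolve_fast_dev_run` is hypothetical, and the mapping of `True` to a single batch and `False` to "disabled" is an assumption for illustration rather than the library's actual code.

```python
from typing import Union


def resolve_fast_dev_run(fast_dev_run: Union[bool, int]) -> int:
    """Hypothetical helper: number of batches a quick debug run would use (0 = disabled)."""
    # bool is a subclass of int in Python, so it must be checked first.
    if isinstance(fast_dev_run, bool):
        return 1 if fast_dev_run else 0  # assumption: True means one batch
    if isinstance(fast_dev_run, int):
        if fast_dev_run < 0:
            raise ValueError("fast_dev_run must be a bool or a non-negative integer")
        return fast_dev_run
    raise TypeError("fast_dev_run must be a bool or an int")


batches_for_flag = resolve_fast_dev_run(True)   # boolean form
batches_for_count = resolve_fast_dev_run(7)     # integer form: run 7 batches
batches_when_off = resolve_fast_dev_run(False)  # debug run disabled
```

Checking for `bool` before `int` matters here: without it, `True` would pass the integer branch and happen to return 1, while `False` would silently read as "zero batches" rather than "disabled".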
Deprecated
- Deprecated prefix argument in ModelCheckpoint (#4765)
- Deprecated the old way of assigning hyper-parameters through self.hparams = ... (#4813)
- Deprecated mode='auto' from ModelCheckpoint and EarlyStopping (#4695)
Removed
- Removed reorder parameter of the auc metric (#5004)
Fixed
- Added feature to move tensors to CPU before saving (#4309)
- Fixed LoggerConnector to have logged metrics on root device in DP (#4138)
- Auto convert tensors to contiguous format when gather_all (#4907)
- Fixed PYTHONPATH for DDP test model (#4528)
- Fixed allowing logger to support indexing (#4595)
- Fixed DDP and manual_optimization (#4976)
Contributors
@ananyahjha93, @awaelchli, @blatr, @Borda, @borisdayma, @carmocca, @ddrevicky, @george-gca, @gianscarpe, @irustandi, @janhenriklambrechts, @jeremyjordan, @justusschock, @lezwon, @rohitgr7, @s-rog, @SeanNaren, @SkafteNicki, @tadejsv, @tchaton, @williamFalcon, @zippeurfou
If we forgot someone due to not matching commit email with GitHub account, let us know :]