Upgrade PTL to 1.0.2 #1278
Conversation
Let's hold off on this until 1.0.0 is available.
Force-pushed from 9e53ccd to 09d8f23.
def compute_topk_accuracy(correct_counts, total_counts):
This function is used in asr/models/classification_models.py and asr/models/label_models.py
We might need to create a functional metric for this case.
We should not tie such metrics to PTL itself, but keep them general and simply use them in PTL metric wrappers. Otherwise it will become harder to use them outside of PTL contexts.
The PTL Metrics API is independent of PTL. Take a look at this example from their docs:

```python
from pytorch_lightning import metrics

train_accuracy = metrics.Accuracy()
valid_accuracy = metrics.Accuracy(compute_on_step=False)

for epoch in range(epochs):
    for x, y in train_data:
        y_hat = model(x)
        # training step accuracy
        batch_acc = train_accuracy(y_hat, y)

    for x, y in valid_data:
        y_hat = model(x)
        valid_accuracy(y_hat, y)

# total accuracy over all training batches
total_train_accuracy = train_accuracy.compute()

# total accuracy over all validation batches
total_valid_accuracy = valid_accuracy.compute()
```
Yes, I understand that, but why do we want to turn a basic utility function into something that wraps PTL in any form? Please revert this one and simply call it from a PTL metric class if needed.
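For instance, the utility could stay a plain, PTL-free function that a PTL `Metric` merely delegates to. A minimal sketch of that separation (the function body and list-based inputs here are assumptions for illustration, not NeMo's actual implementation):

```python
# Hypothetical sketch: keep the metric math PTL-free, wrap it separately.
# The body of compute_topk_accuracy below is assumed, not copied from NeMo.

def compute_topk_accuracy(correct_counts, total_counts):
    """Plain utility: per-k accuracy from accumulated correct/total counts."""
    return [c / t for c, t in zip(correct_counts, total_counts)]


# A PTL Metric wrapper would then only accumulate state and delegate in
# compute(), keeping the utility usable outside PTL, e.g.:
#
#   class TopKAccuracy(pytorch_lightning.metrics.Metric):
#       def compute(self):
#           return compute_topk_accuracy(self.correct_counts, self.total_counts)
```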
```python
checkpoint_callback = NeMoModelCheckpoint(
    filepath=Path(log_dir / 'checkpoints' / '{val_loss:.2f}-{epoch}'),
    save_top_k=3,
    monitor='val_loss',
```
We need to find a way to expose these parameters to the user through the YAML config.
"val_loss" is not a given - RNNT can optionally disable it cause it is non-trivial cost to compute it over entire datasets (expensive Prediction step and Joint step cost). The default should be loss because that is understood to always be available (since backprop cannot occur without it).
There is also the case that we may not choose to supply a validation set at all (some datasets don't have a validation set, or user constructed dataset does not have a split for it).
Back in 0.9.0, Lightning used to assume that "val_loss" was the key people returned from their validation step.
While moving in a more general direction makes sense for Lightning, should we do so for NeMo? Or should we just enforce good defaults for model users? If they choose not to use the default, they have to manually adjust their scripts to make use of things like the ModelCheckpoint callback.
So currently ModelPT has no access to ExpManager (it's called before ModelPT), nor can it ensure the list of callbacks is in the correct order to override it at a later point. Even if it could parse the callback list, it can't access any of the path values, since exp manager is a stateless function.
For now, it's fine to keep it val_loss and provide an exp manager override to choose the monitor. I can override that in the RNNT configs.
Also, good defaults are very model-dependent. An experiment manager should offer this flexibility without having to completely override one of its core tasks.
Yes, I think the correct way is to leave "val_loss" as the default and provide an override via its YAML config. I'll think on it more once we merge this PR.
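One possible shape for such an override, sketched below. This is purely illustrative: `CheckpointParams` and `resolve_checkpoint_params` are hypothetical names, not exp_manager's actual schema; the idea is only that defaults live in one place and a model's YAML (e.g. an RNNT config setting `monitor: loss`) is merged on top.

```python
# Hypothetical sketch of a checkpoint-params override; all names here are
# assumptions, not NeMo's real exp_manager API.
from dataclasses import dataclass


@dataclass
class CheckpointParams:
    monitor: str = 'val_loss'   # default stays val_loss
    save_top_k: int = 3
    mode: str = 'min'


def resolve_checkpoint_params(overrides: dict) -> CheckpointParams:
    """Merge YAML-sourced overrides onto the defaults, ignoring unknown keys."""
    params = CheckpointParams()
    for key, value in overrides.items():
        if hasattr(params, key):
            setattr(params, key, value)
    return params
```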
This pull request introduces 3 alerts when merging e40b472 into fd98a89 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 36b9f87 into fd98a89 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging d6d564d into fd98a89 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging ac7102a into fd98a89 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 2a0f459 into fd98a89 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 89c14ca into 87206f7 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 5999ee9 into 87206f7 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging a28a12d into 87206f7 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging bf0f265 into 87206f7 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging 4752293 into 87206f7 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging 1bc44e9 into 87206f7 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging a489ee6 into 87206f7 - view on LGTM.com new alerts:
```python
@@ -148,7 +148,10 @@ def validation_step(self, batch, batch_idx):
preds = torch.argmax(logits, axis=-1)[subtokens_mask]
labels = labels[subtokens_mask]
tp, fp, fn = self.classification_report(preds, labels)
self.classification_report(preds, labels)
```
@ericharper, can you review the changes to this file? I mostly copied from text_classification_model.py.
Specifically, can you check that the instantiation of self.classification_report makes sense in this file?
What's the difference between ['macro', 'micro', 'weighted']? Should everything just be micro?
'micro' is accuracy; all of these are different ways of aggregating tp, fp, fn per label and computing the final result. 'macro' should be the default.
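To make the distinction concrete, here is a small sketch of the three averaging modes for precision, using plain per-label count lists (the function name and signature are illustrative, not NeMo's ClassificationReport API):

```python
# Illustrative only: how 'micro', 'macro' and 'weighted' aggregate
# per-label tp/fp/fn counts into a single precision value.

def precision_by_mode(tp, fp, fn, mode='macro'):
    per_label = [t / (t + f) if (t + f) > 0 else 0.0 for t, f in zip(tp, fp)]
    if mode == 'micro':
        # pool all counts first, then divide; in single-label multi-class
        # classification this equals accuracy
        return sum(tp) / (sum(tp) + sum(fp))
    if mode == 'macro':
        # unweighted mean over labels: every label counts equally
        return sum(per_label) / len(per_label)
    if mode == 'weighted':
        # mean over labels weighted by support (tp + fn)
        support = [t + n for t, n in zip(tp, fn)]
        return sum(p * s for p, s in zip(per_label, support)) / sum(support)
    raise ValueError(f'unknown mode: {mode}')
```

The same aggregation pattern applies to recall and F1; only the per-label formula changes.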
ClassificationReport is used in 4 models:
- IntentSlotClassificationModel
- TextClassificationModel
- PunctuationCapitalizationModel
- TokenClassificationModel

Can I ask for a review on each model so we can approve the changes?
Updated all of the above models.
This pull request introduces 1 alert when merging 2e91818 into 87206f7 - view on LGTM.com new alerts:
Jenkinsfile:
```groovy
// TODO: Pytorch Lightning has some issues with restoring Metric classes, asked on the lightning slack if they can
// provide a simple solution.
// stage('L2: Parallel NLP Examples 2') {
```
'NER finetuning from pretrained Test' is currently blocked by Lightning-AI/pytorch-lightning#4195
Think you might be able to use Lightning.load_from_checkpoint(..., strict=False) to avoid the key mismatch (it uses load_state_dict(..., strict=False) under the hood).
This pull request introduces 1 alert when merging 71e8f6f into 910caf6 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging ae46780 into 910caf6 - view on LGTM.com new alerts:
Signed-off-by: ericharper <complex451@gmail.com>
This pull request introduces 1 alert when merging ae687b0 into 910caf6 - view on LGTM.com new alerts:
This pull request introduces 3 alerts when merging 6139420 into 910caf6 - view on LGTM.com new alerts:
This pull request introduces 3 alerts when merging bbcac81 into 910caf6 - view on LGTM.com new alerts:
This pull request introduces 3 alerts when merging 41e4be1 into 910caf6 - view on LGTM.com new alerts:
This pull request introduces 3 alerts when merging 247975d into 910caf6 - view on LGTM.com new alerts:
This pull request introduces 3 alerts when merging 8175ae7 into 910caf6 - view on LGTM.com new alerts:
The classification report changes LGTM.
```python
@@ -139,18 +153,18 @@ def get_precision_recall_f1(
    + '\n'
)
logging.info(report)
# logging.info(report)
```
Please remove if not needed
done
```python
# return {
#     'train_loss': loss,
#     'lr': self._optimizer.param_groups[0]['lr']
# }
```
Please remove if not needed
done
```python
tensorboard_logs = {'val_loss': avg_loss, 'exact_match': exact_match, 'f1': f1}
return {'val_loss': avg_loss, 'log': tensorboard_logs}
# tensorboard_logs = {'val_loss': avg_loss, 'exact_match': exact_match, 'f1': f1}
# return {'val_loss': avg_loss, 'log': tensorboard_logs}
```
same here
done
```python
@@ -80,7 +80,10 @@ def __init__(self, cfg: DictConfig, trainer: Trainer = None):
self.loss = self.setup_loss(class_balancing=self._cfg.dataset.class_balancing)
# setup to track metrics
self.classification_report = ClassificationReport(len(self._cfg.label_ids), label_ids=self._cfg.label_ids)
# TODO: What is the current mode?
```
'macro'
done, 'macro' is the default
Signed-off-by: ericharper <complex451@gmail.com>
This pull request introduces 2 alerts when merging 655fb48 into 910caf6 - view on LGTM.com new alerts:
Lightning Changes:
- `row_log_interval` -> `log_every_n_steps`: https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#log-every-n-steps
- `distributed_backend` -> `accelerator`: https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#accelerator
- `log_save_interval` -> `flush_logs_every_n_steps`
- `pl.callbacks.LearningRateLogger` -> `pl.callbacks.LearningRateMonitor`
- `TensorMetric` -> `Metric`: https://pytorch-lightning.readthedocs.io/en/latest/metrics.html [`WERBPE`, `WER`, `TopKClassificationAccuracy`, `ClassificationReport`, `Perplexity`]
- Use `self.log` for logging scalars instead of returning dictionaries from `training_step`, `training_epoch_end`, `validation...`, `test...`. The logger object should be used for logging anything other than scalars.

NeMo Changes:
- Added `strict` to most NeMo restoring functions to handle "Loading NLP and ASR models might result in `Missing key(s) in state_dict` error" #1297

Under the hood changes:
- `AttributeError: 'Trainer' object has no attribute 'configure_logger'`; now `Trainer.logger_connector.configure_logger`
- `configure_checkpoint_callback` -> `callback_connector.init_default_checkpoint_callback`