
new feature for profiling training runs #782

Merged: 24 commits into Lightning-AI:master on Feb 7, 2020

Conversation

@jeremyjordan (Contributor) commented Feb 2, 2020

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #753

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

@jeremyjordan (Contributor, Author)

Added tests for the simple profiler. I'm not sure of the best way to test the advanced profiler; the current output is a string, which is somewhat tricky/annoying to parse. There was a recent PR that allows getting profile stats as a dict, but it was only merged into Python a couple of weeks ago.
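For reference, a minimal sketch of the dict-style access, assuming the recent CPython change referred to is pstats.Stats.get_stats_profile() (only available from Python 3.9):

import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
sum(i * i for i in range(10_000))  # stand-in for the code being profiled
pr.disable()

# get_stats_profile() returns a StatsProfile dataclass rather than text,
# so a test could assert on fields directly instead of parsing a string
stats_profile = pstats.Stats(pr).get_stats_profile()
print(stats_profile.total_tt)  # total time, as a float
for func_name, func_profile in stats_profile.func_profiles.items():
    print(func_name, func_profile.ncalls, func_profile.cumtime)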

@jeremyjordan (Contributor, Author)

I'm planning to add more documentation regarding usage, but would like to get some input to make sure this is generally an acceptable solution.

@Borda (Member) commented Feb 3, 2020

looks promising, just need to have a deeper look... :]

@jeremyjordan (Contributor, Author)

great! I’ll go ahead and work on more documentation this evening

@Borda (Member) commented Feb 3, 2020

maybe wait for @williamFalcon's first screening opinion before you start with the fine polishing... ;]

@Borda added the feature label Feb 3, 2020
@Borda added this to the 0.7.0 milestone Feb 3, 2020
pytorch_lightning/trainer/training_loop.py

def start(self, action_name):
    if action_name in self.current_actions:
        raise ValueError(
Member:

I would consider some kind of soft information (just a logger warning) since it is a complementary service...

@jeremyjordan (Author):

I gave it some thought and would prefer to keep the exception. For the default profiled actions, this will be tested and we'll ensure that everything works properly (verifying that no tests raise the exception). If the user is profiling additional actions, I would think they would want to know when their code is incorrect. The only time this error is raised is if someone calls profiler.stop() without calling profiler.start() or vice versa.

In reality, I don't even think we should document the start() and stop() interfaces for the user (since that can quickly get very messy), following the principle of having one way to do things.

Force User Decisions To Best Practices
There are 1,000 ways to do something. However, something eventually becomes standard practice that everyone does. Thus we pick one way of doing it and force everyone to do it this way.
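For illustration, a minimal sketch of the start/stop pairing check described above (hypothetical class and message text, not the PR's exact code):

import time

class ProfilerSketch:

    def __init__(self):
        self.current_actions = {}  # action_name -> start time

    def start(self, action_name):
        if action_name in self.current_actions:
            raise ValueError(f"Attempted to start {action_name} which has already started.")
        self.current_actions[action_name] = time.monotonic()

    def stop(self, action_name):
        start_time = self.current_actions.pop(action_name, None)
        if start_time is None:
            raise ValueError(f"Attempting to stop recording an action ({action_name}) which was never started.")
        return time.monotonic() - start_time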


def describe(self):
    def print_row(action, mean, std_dev):
        print(f"{action:<20s}\t| {mean:<15}\t| {std_dev:<15}")
Member:

use logging

"""

def __init__(self, output_filename=None):
'''
Member:

use """ instead of '''

@jeremyjordan (Contributor, Author) commented Feb 4, 2020

I used the LightningTemplateModel to verify that the profiler behaves and outputs as expected. The printed summary could definitely be presented better. Should we keep the reporting in seconds or use something like milliseconds? Any other tips to clean up this summary?

Also, we should still print the summary when the user exits early using Ctrl-C to trigger a KeyboardInterrupt. Where would I need to make a change to make sure that's the case?

>>> from pl_examples.basic_examples.lightning_module_template import LightningTemplateModel
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.utilities.profiler import Profiler
>>> from argparse import Namespace
>>> hparams = {
...     "batch_size": 32,
...     "in_features": 784,
...     "hidden_dim": 128,
...     "drop_prob": 0.2,
...     "out_features": 10,
...     "learning_rate": 1e-3,
...     "data_root": "data"
... }
>>> hparams = Namespace(**hparams)
>>> model = LightningTemplateModel(hparams)
>>> profiler = Profiler()
>>> trainer = Trainer(profiler=profiler, max_epochs=1)
>>> trainer.fit(model)
INFO:root:Training data loader called.
INFO:root:Test data loader called.
INFO:root:Validation data loader called.
INFO:root:
  | Name      | Type        | Params | In sizes | Out sizes
-------------------------------------------------------------
0 | c_d1      | Linear      | 100 K  | [5, 784] | [5, 128] 
1 | c_d1_bn   | BatchNorm1d | 256    | [5, 128] | [5, 128] 
2 | c_d1_drop | Dropout     | 0      | [5, 128] | [5, 128] 
3 | c_d2      | Linear      | 1 K    | [5, 128] | [5, 10]  
Epoch 1: 100%|█| 2188/2188 [00:28<00:00, 76.55batch/s, batch_idx=1874, loss=0.210, train_loss=0.183, v_num=2, val_acc=0.952, val_loss=0.171
INFO:profiler:
Profiler Report

Action                  |  Mean duration (s)    |  Std dev.       
------------------------------------------------------------
on_epoch_start          |  7.863e-06            |  0.0            
on_batch_start          |  5.4336e-06           |  1.5331e-06     
model_forward           |  0.0016781            |  0.0002032      
model_backward          |  0.0017407            |  0.00023147     
on_after_backward       |  4.2085e-06           |  1.5902e-06     
optimizer_step          |  0.0010742            |  0.00016276     
on_batch_end            |  4.4134e-06           |  1.1413e-06     
on_epoch_end            |  5.039e-06            |  0.0            
on_train_end            |  4.06e-06             |  0.0 
1

@jeremyjordan (Contributor, Author)

Ok, I added a method to the base profiler which we can use when we need to profile calling next() on iterables (e.g. getting the next training batch).

I also stopped reporting standard deviations for durations and instead report the total time spent per action. I think this will be more informative.

Profiler Report

Action                  |  Mean duration (s)    |  Total time (s) 
-----------------------------------------------------------------
on_epoch_start          |  5.993e-06            |  5.993e-06      
get_train_batch         |  0.0087412            |  16.398         
on_batch_start          |  5.0865e-06           |  0.0095372      
model_forward           |  0.0017818            |  3.3408         
model_backward          |  0.0018283            |  3.4282         
on_after_backward       |  4.2862e-06           |  0.0080366      
optimizer_step          |  0.0011072            |  2.0759         
on_batch_end            |  4.5202e-06           |  0.0084753      
on_epoch_end            |  3.919e-06            |  3.919e-06      
on_train_end            |  5.449e-06            |  5.449e-06  
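For illustration, profiling next() calls can be done by wrapping the iterable in a generator that times each step (a sketch with a hypothetical profiler argument, not the PR's exact code):

def profile_iterable(profiler, iterable, action_name):
    # yield items from `iterable`, timing each next() call as `action_name`
    iterator = iter(iterable)
    while True:
        try:
            profiler.start(action_name)
            value = next(iterator)
            profiler.stop(action_name)
            yield value
        except StopIteration:
            # start() was already called for this step, so close the timer
            profiler.stop(action_name)
            return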

@jeremyjordan changed the title from "WIP: new feature for profiling training runs" to "new feature for profiling training runs" Feb 5, 2020
@williamFalcon (Contributor)

@jeremyjordan this is so sick haha. great job!

@jeremyjordan (Contributor, Author) commented Feb 5, 2020

@williamFalcon thanks! i'm also excited about the AdvancedProfiler, as this may help people who really want to get into the details. i'm envisioning they'll start with Profiler and switch to AdvancedProfiler if they decide they want to try optimizing a step. the output for AdvancedProfiler looks something like this for each action:

Profile stats for: get_train_batch
         4869394 function calls (4863767 primitive calls) in 19.107 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
3752/1876    0.012    0.000   19.100    0.010 {built-in method builtins.next}
     1876    0.008    0.000   19.090    0.010 dataloader.py:344(__next__)
     1876    0.075    0.000   19.082    0.010 dataloader.py:383(_next_data)
     1875    0.013    0.000   18.933    0.010 fetch.py:42(fetch)
     1875    0.086    0.000   18.494    0.010 fetch.py:44(<listcomp>)
    60000    1.797    0.000   18.408    0.000 mnist.py:80(__getitem__)
    60000    0.268    0.000   13.126    0.000 transforms.py:68(__call__)
    60000    0.184    0.000    7.104    0.000 transforms.py:93(__call__)
    60000    1.679    0.000    6.921    0.000 functional.py:42(to_tensor)
    60000    0.258    0.000    5.754    0.000 transforms.py:167(__call__)
    60000    2.856    0.000    5.495    0.000 functional.py:191(normalize)
    60000    0.727    0.000    3.128    0.000 Image.py:2612(fromarray)
    60000    0.398    0.000    2.362    0.000 Image.py:2552(frombuffer)
    60000    0.410    0.000    1.395    0.000 Image.py:734(tobytes)
    60000    1.197    0.000    1.197    0.000 {method 'div' of 'torch._C._TensorBase' objects}
    60000    0.246    0.000    1.069    0.000 Image.py:2469(new)
   120000    0.990    0.000    0.990    0.000 {built-in method as_tensor}
   120000    0.795    0.000    0.795    0.000 {method 'transpose' of 'torch._C._TensorBase' objects}
    60000    0.694    0.000    0.694    0.000 {method 'view' of 'torch._C._TensorBase' objects}
    60000    0.650    0.000    0.650    0.000 {method 'float' of 'torch._C._TensorBase' objects}
    60000    0.551    0.000    0.551    0.000 {method 'sub_' of 'torch._C._TensorBase' objects}
   120000    0.326    0.000    0.539    0.000 Image.py:572(_new)
    60000    0.500    0.000    0.500    0.000 {method 'clone' of 'torch._C._TensorBase' objects}
    60000    0.298    0.000    0.450    0.000 Image.py:461(_getencoder)
5625/1875    0.049    0.000    0.426    0.000 collate.py:42(default_collate)
    60000    0.397    0.000    0.397    0.000 {method 'div_' of 'torch._C._TensorBase' objects}
   180000    0.150    0.000    0.372    0.000 Image.py:629(__del__)
     1875    0.010    0.000    0.363    0.000 collate.py:79(<listcomp>)
   180000    0.315    0.000    0.315    0.000 Image.py:547(__init__)
     1875    0.299    0.000    0.299    0.000 {built-in method stack}
    60000    0.284    0.000    0.284    0.000 {built-in method PIL._imaging.map_buffer}
   120000    0.184    0.000    0.252    0.000 Image.py:2451(_check_size)
   560625    0.222    0.000    0.246    0.000 {built-in method builtins.isinstance}
    60000    0.239    0.000    0.239    0.000 {built-in method PIL._imaging.fill}
   180000    0.152    0.000    0.222    0.000 Image.py:591(__exit__)
    60000    0.220    0.000    0.220    0.000 {method 'encode' of 'ImagingEncoder' objects}
    60000    0.201    0.000    0.201    0.000 {built-in method from_buffer}
    60000    0.102    0.000    0.201    0.000 functional.py:30(_is_tensor_image)
    60000    0.189    0.000    0.189    0.000 {method 'numpy' of 'torch._C._TensorBase' objects}
    60000    0.090    0.000    0.126    0.000 Image.py:838(load)
420003/420002    0.088    0.000    0.088    0.000 {built-in method builtins.len}
     1876    0.002    0.000    0.073    0.000 dataloader.py:338(_next_index)
    60000    0.053    0.000    0.072    0.000 functional.py:34(_is_numpy)
   181875    0.071    0.000    0.071    0.000 {built-in method builtins.hasattr}
    60000    0.053    0.000    0.070    0.000 __init__.py:116(is_tensor)
    60000    0.070    0.000    0.070    0.000 {built-in method builtins.max}
     1876    0.047    0.000    0.070    0.000 sampler.py:198(__iter__)
    60000    0.054    0.000    0.069    0.000 functional.py:23(_is_pil_image)
    60000    0.067    0.000    0.067    0.000 {built-in method PIL._imaging.raw_encoder}
    60000    0.066    0.000    0.066    0.000 {built-in method builtins.getattr}
    60000    0.038    0.000    0.056    0.000 _util.py:9(isStringType)
    60000    0.054    0.000    0.054    0.000 {method 'setimage' of 'ImagingEncoder' objects}
   180000    0.052    0.000    0.052    0.000 Image.py:568(size)
    60000    0.041    0.000    0.041    0.000 {method 'contiguous' of 'torch._C._TensorBase' objects}
    60000    0.037    0.000    0.037    0.000 {method 'pixel_access' of 'ImagingCore' objects}
     1875    0.030    0.000    0.030    0.000 {built-in method tensor}
   120000    0.030    0.000    0.030    0.000 {method 'copy' of 'dict' objects}
    60000    0.029    0.000    0.029    0.000 {method 'ndimension' of 'torch._C._TensorBase' objects}
   120000    0.027    0.000    0.027    0.000 {method 'append' of 'list' objects}
     3750    0.003    0.000    0.024    0.000 abc.py:137(__instancecheck__)
     3750    0.012    0.000    0.022    0.000 {built-in method _abc._abc_instancecheck}
    61876    0.021    0.000    0.021    0.000 {method 'get' of 'dict' objects}
    60000    0.016    0.000    0.016    0.000 {method 'join' of 'bytes' objects}
     1875    0.002    0.000    0.010    0.000 abc.py:141(__subclasscheck__)
     1875    0.006    0.000    0.008    0.000 {built-in method _abc._abc_subclasscheck}
     1876    0.005    0.000    0.007    0.000 profiler.py:204(stop)
        1    0.000    0.000    0.004    0.004 sampler.py:103(__iter__)
        1    0.002    0.002    0.002    0.002 {method 'tolist' of 'torch._C._TensorBase' objects}
     1875    0.002    0.000    0.002    0.000 _collections_abc.py:302(__subclasshook__)
        1    0.002    0.002    0.002    0.002 {built-in method randperm}
     1875    0.001    0.000    0.001    0.000 worker.py:74(get_worker_info)
     1876    0.001    0.000    0.001    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 mnist.py:102(__len__)
        1    0.000    0.000    0.000    0.000 tensor.py:443(__len__)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 {method 'dim' of 'torch._C._TensorBase' objects}
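For context, per-action reports like the one above can be produced by keeping one cProfile.Profile per action and rendering it with pstats, roughly like this sketch (hypothetical class name; the PR's AdvancedProfiler is the real implementation):

import cProfile
import io
import pstats

class AdvancedProfilerSketch:

    def __init__(self):
        self.profiled_actions = {}  # action_name -> cProfile.Profile

    def start(self, action_name):
        if action_name not in self.profiled_actions:
            self.profiled_actions[action_name] = cProfile.Profile()
        self.profiled_actions[action_name].enable()

    def stop(self, action_name):
        self.profiled_actions[action_name].disable()

    def describe(self):
        # render each action's stats in the same shape as the report above
        for action_name, pr in self.profiled_actions.items():
            stream = io.StringIO()
            pstats.Stats(pr, stream=stream).sort_stats("cumulative").print_stats()
            print(f"Profile stats for: {action_name}")
            print(stream.getvalue())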

@Borda (Member) left a comment:

Great work, Thx!

pytorch_lightning/utilities/profiler.py
import numpy as np


def test_simple_profiler():
Contributor:

need a test for the advanced profiler

.. role:: hidden
   :class: hidden-section

Profiler
Contributor:

Maybe "Performance Profiler" ?

pytorch_lightning/trainer/trainer.py
@williamFalcon (Contributor)

@jeremyjordan amazing job! Let's make sure the docs and tests are thorough here.

In addition, I'd love to get a Medium post about this. I can co-publish with you if you want, so we can get it on a lot of publications. Email me to discuss.

@Borda (Member) commented Feb 5, 2020

You can make a #publishing channel on Slack ;)
Btw, do you think that a Medium post is a good way for citing?

@williamFalcon (Contributor)

it's good for discovery because we can also publish in Towards Data Science, KDnuggets, etc...

@jeremyjordan (Contributor, Author)

yeah i'd love to do a blog post on this! i agree that great documentation (including blog posts) is important for user adoption. happy to cross-post on my blog as well.

@jeremyjordan (Contributor, Author)

@williamFalcon i've added more documentation, a quick test for the AdvancedProfiler, and updated the Trainer to allow the usage of profiler=True. ready for review :)
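For context, enabling it then looks roughly like this (the import path for the profiler classes is still under discussion below):

from pytorch_lightning import Trainer

# profiler=True enables the default profiler; passing a profiler instance
# (e.g. the AdvancedProfiler from this PR) gives more detailed reports
trainer = Trainer(profiler=True)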

@williamFalcon (Contributor) commented Feb 6, 2020

@jeremyjordan can we make the import:

from pytorch_lightning.profilers import ProfilerA, ProfilerB

@Borda (Member) commented Feb 6, 2020

the profiler is in pytorch_lightning/utilities/profiler.py
so more likely from pytorch_lightning import ProfilerA, ProfilerB

@williamFalcon (Contributor)

@Borda we need organization in the package. i think we need a collection called profilers because it allows other profilers to enter lightning as well. much like we have loggers and callbacks

@jeremyjordan (Contributor, Author)

should we move pytorch_lightning/utilities/profiler.py to pytorch_lightning/profiler/?

@williamFalcon (Contributor)

yes. i kind of hate utility folders, it usually means we didn’t think hard about organization :)

@jeremyjordan (Contributor, Author)

awesome! i'll make the change.

haha i feel the same way, utils.py is my pet peeve.

@Borda (Member) commented Feb 6, 2020

> @Borda we need organization in the package. i think we need a collection called profilers because it allows other profilers to enter lightning as well. much like we have loggers and callbacks

sure, do you want to do it in 0.6.1 or in next release 0.7?

@williamFalcon (Contributor)

we can add the organization over time. since profilers are new, we need to do it now

@Borda (Member) commented Feb 6, 2020

Sure, just think about versioning, since restructuring the utils API would be better in a new release, like 0.7 https://github.com/PyTorchLightning/pytorch-lightning/milestone/5

@williamFalcon (Contributor)

i don’t think we have any utils api right now... the point is to avoid utils.

@williamFalcon merged commit 1cf430f into Lightning-AI:master Feb 7, 2020
@Borda modified the milestones: 0.7.0, 0.6.1 Feb 15, 2020