
Provide debugging traces and options as an ENV variable or JIT option #304

Open · parthmannan opened this issue Apr 30, 2024 · 2 comments
Labels: debugging, enhancement (New feature or request)

parthmannan commented Apr 30, 2024

🚀 Feature

An environment variable that dumps the various Thunder-provided debug traces to a log file, with levels such as:

export THUNDER_DEBUG=<option>

0 / '' : Disable
1 / 'trace' : Enable and dump the Thunder-generated trace. Can be limited to the trace after the delete-last-used pass
2 / 'nvfuser_region' : Enable and dump nvFuser-captured regions, in addition to 1
3 / 'nvfuser_code' : Enable and dump nvFuser-generated CUDA kernel code, in addition to 1 and 2
4 / 'torch_compile_debug' : Enable torch.compile debug logging (TORCH_COMPILE_DEBUG=1)

This is only a narrow example of the possible debug log levels. Each of these logs could go to a separate log file.
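
A minimal sketch of how such a switch could be parsed; every name below (the level table, the helper) is hypothetical, not an existing Thunder API:

```python
import os

# Hypothetical level table mirroring the proposal above; numeric values
# are cumulative, string aliases name a single level.
_LEVELS = {"": 0, "trace": 1, "nvfuser_region": 2, "nvfuser_code": 3, "torch_compile_debug": 4}

def _thunder_debug_level() -> int:
    """Read THUNDER_DEBUG, accepting either a digit or a named level."""
    raw = os.environ.get("THUNDER_DEBUG", "").strip().lower()
    if raw.isdigit():
        return int(raw)
    return _LEVELS.get(raw, 0)

if _thunder_debug_level() >= 1:
    pass  # dump the Thunder-generated trace to its own log file
if _thunder_debug_level() >= 4:
    os.environ.setdefault("TORCH_COMPILE_DEBUG", "1")  # forward to torch.compile
```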

Motivation

To get the trace and other debugging information today, we need to add code that captures the trace and prints it after running a model iteration with the inputs (see the sketch after the list below).

  • This is cumbersome, as the training code needs to be edited to enable tracing and re-edited when finished.
  • Finding where an iteration ends and inserting the tracing code at the appropriate location may not always be possible, since Thunder aims to compile an increasingly convoluted set of repositories. For example, when using libraries like Lightning Trainer, the user may want to just call model.train(), and editing the iteration loop can be difficult.
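
For reference, the manual approach today looks roughly like this (a sketch using Thunder's documented thunder.jit and thunder.last_traces entry points; `model` and `x` stand in for the user's module and inputs):

```python
import thunder

jitted = thunder.jit(model)   # `model` is the user's torch.nn.Module
out = jitted(x)               # one iteration must run before traces exist
# The final (execution) trace is the last entry in the returned list.
print(thunder.last_traces(jitted)[-1])
```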

cc - @mruberry

cc @carmocca @apaz-cli

crcrpar commented Jun 16, 2024

@kshitij12345 do you think

```python
def add_post_optimization_transform(cfn: Callable, transform: PostOptimizationTransform) -> Callable:
```

would let us write a simple callback that just saves the given traces?
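
If so, such a callback could be as small as the following sketch; it assumes only the `transform_trace(trace, executors_list=...)` hook visible in the snippet quoted below, and the class name and file handling are hypothetical:

```python
class SaveTraceTransform(PostOptimizationTransform):  # base class assumed importable from Thunder
    """Hypothetical transform that appends each trace it sees to a log file."""

    def __init__(self, path: str):
        self.path = path

    def transform_trace(self, trace, *, executors_list=None):
        # Append the trace's textual form, then return it unchanged so
        # downstream transforms still run on the same trace.
        with open(self.path, "a") as f:
            f.write(str(trace) + "\n")
        return trace
```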

kshitij12345 commented
I see three issues with using add_post_optimization_transform:

  • Currently, post_optimization_transform is not applied to the prologue_trace, so we won't be able to save it.
  • The transform is applied to the forward and backward traces independently, but we don't explicitly say whether a given trace is forward or backward. We could probably derive that from the trace signature, but I don't think that is a good idea.
  • If multiple post-optimization transforms are in use, the user will have to make sure this saving transform runs last; otherwise it would miss information from transforms applied after it.

```python
for transform in post_optimization_transforms:
    # NOTE: `backward_trc` could be None.
    thunder.core.utils.check_type(transform, PostOptimizationTransform)
    computation_trc = transform.transform_trace(computation_trc, executors_list=cd.executors_list)
    extraces.append(computation_trc)
    if backward_trc is not None:
        backward_trc = transform.transform_trace(backward_trc, executors_list=cd.executors_list)
        backward_traces.append(backward_trc)
```

Also, using add_post_optimization_transform would still require some changes to the training code.
