
Provide debugging traces and options as an ENV variable or JIT option #304

Open · parthmannan opened this issue Apr 30, 2024 · 2 comments
Labels: debugging, enhancement (New feature or request)

parthmannan commented Apr 30, 2024

🚀 Feature

An environment variable that dumps the various Thunder-provided debug traces to a log file, with levels such as:

export THUNDER_DEBUG=<option>

0 / '' : Disable
1 / 'trace' : Enable and dump the Thunder-generated trace. Can be limited to the trace after the delete-last-used pass
2 / 'nvfuser_region' : Enable and dump nvFuser-captured regions, in addition to 1
3 / 'nvfuser_code' : Enable and dump nvFuser-generated CUDA kernel code, in addition to 1 and 2
4 / 'torch_compile_debug' : Enable torch.compile debug logging (TORCH_COMPILE_DEBUG=1)

This is only a narrow example of the possible debug log levels. Each of these logs could go to a separate log file.
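
A minimal sketch of how such a switch could be parsed; every name below (the level table, the helper) is hypothetical, not an existing Thunder API:

```python
import os

# Hypothetical level table mirroring the proposal above; numeric values
# are cumulative, string aliases name a single level.
_LEVELS = {"": 0, "trace": 1, "nvfuser_region": 2, "nvfuser_code": 3, "torch_compile_debug": 4}

def _thunder_debug_level() -> int:
    """Read THUNDER_DEBUG, accepting either a digit or a named level."""
    raw = os.environ.get("THUNDER_DEBUG", "").strip().lower()
    if raw.isdigit():
        return int(raw)
    return _LEVELS.get(raw, 0)

if _thunder_debug_level() >= 1:
    pass  # dump the Thunder-generated trace to its own log file
if _thunder_debug_level() >= 4:
    os.environ.setdefault("TORCH_COMPILE_DEBUG", "1")  # forward to torch.compile
```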

Motivation

To get the trace and other debugging information today, we need to add code that captures the trace and prints it after running a model iteration with the inputs (see the sketch after the list below).

  • This is cumbersome, as the training code needs to be edited to enable tracing and re-edited when finished.
  • Finding where an iteration ends and inserting the tracing code at the appropriate location may not always be possible, since Thunder aims to compile an increasingly convoluted set of repositories. For example, when using libraries like Lightning Trainer, the user may want to just call model.train(), and editing the iteration loop can be difficult.
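
For reference, the manual approach today looks roughly like this (a sketch using Thunder's documented thunder.jit and thunder.last_traces entry points; `model` and `x` stand in for the user's module and inputs):

```python
import thunder

jitted = thunder.jit(model)   # `model` is the user's torch.nn.Module
out = jitted(x)               # one iteration must run before traces exist
# The final (execution) trace is the last entry in the returned list.
print(thunder.last_traces(jitted)[-1])
```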

cc - @mruberry

cc @carmocca @apaz-cli

crcrpar commented Jun 16, 2024

@kshitij12345 do you think

```python
def add_post_optimization_transform(cfn: Callable, transform: PostOptimizationTransform) -> Callable:
```

would let us write a simple callback that just saves the given traces?
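
If so, such a callback could be as small as the following sketch; it assumes only the `transform_trace(trace, executors_list=...)` hook visible in the snippet quoted below, and the class name and file handling are hypothetical:

```python
class SaveTraceTransform(PostOptimizationTransform):  # base class assumed importable from Thunder
    """Hypothetical transform that appends each trace it sees to a log file."""

    def __init__(self, path: str):
        self.path = path

    def transform_trace(self, trace, *, executors_list=None):
        # Append the trace's textual form, then return it unchanged so
        # downstream transforms still run on the same trace.
        with open(self.path, "a") as f:
            f.write(str(trace) + "\n")
        return trace
```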

kshitij12345 commented
I see three issues with using add_post_optimization_transform:

  • Currently, post_optimization_transform is not applied to the prologue_trace, so we won't be able to save it.
  • The transform is applied to the forward and backward traces independently, but we don't explicitly say whether a given trace is forward or backward. We could probably derive that from the trace signature, but I don't think that is a good idea.
  • If multiple post-optimization transforms are in use, the user will have to make sure this saving transform runs last; otherwise it would miss information from transforms applied after it.

```python
for transform in post_optimization_transforms:
    # NOTE: `backward_trc` could be None.
    thunder.core.utils.check_type(transform, PostOptimizationTransform)
    computation_trc = transform.transform_trace(computation_trc, executors_list=cd.executors_list)
    extraces.append(computation_trc)
    if backward_trc is not None:
        backward_trc = transform.transform_trace(backward_trc, executors_list=cd.executors_list)
        backward_traces.append(backward_trc)
```

Also, using add_post_optimization_transform would still require some changes to the training code.
