Import HTA to Chakra to extract synchronization dependency #1

Open
wants to merge 1 commit into
base: refactor

@JoongunPark JoongunPark commented Feb 29, 2024

Summary

This PR extracts synchronization dependencies between Chakra nodes.
To do so, it uses the CriticalPathAnalyzer from Holistic Trace Analysis (HTA) (https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/hta/analyzers/critical_path_analysis.py).

Please note:

  1. The command has changed. The rank number must now be specified with --rank.
  2. The Kineto trace must contain events whose 'cat' field is 'cuda_sync'. To enable them in the profiler, please follow the instructions in "[profiler] add option for kineto synchronization events in the trace" (pytorch/pytorch#105187).
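
For reference, a quick way to check whether a Kineto trace contains 'cuda_sync' events before running the tool might look like the sketch below. It assumes the trace is a Chrome-trace-style JSON object with a top-level "traceEvents" list; the function names and the sample events are hypothetical and are not part of this PR.

```python
import json
from collections import Counter


def count_event_categories(trace: dict) -> Counter:
    """Count events per 'cat' value in a Chrome-trace-style dict.

    Assumes events live under the top-level "traceEvents" key.
    """
    events = trace.get("traceEvents", [])
    return Counter(ev.get("cat", "<none>") for ev in events if isinstance(ev, dict))


def has_cuda_sync_events(trace: dict) -> bool:
    """Return True if at least one event has cat == 'cuda_sync'."""
    return count_event_categories(trace)["cuda_sync"] > 0


# Hand-written sample standing in for a real trace (hypothetical events).
sample_trace = json.loads("""
{"traceEvents": [
  {"name": "aten::add", "cat": "cpu_op", "ts": 1, "dur": 2},
  {"name": "cudaLaunchKernel", "cat": "cuda_runtime", "ts": 3, "dur": 1},
  {"name": "Stream Sync", "cat": "cuda_sync", "ts": 5, "dur": 4}
]}
""")

print(count_event_categories(sample_trace))
print("has cuda_sync events:", has_cuda_sync_events(sample_trace))
```

For a real trace, the dict would come from json.load() on the Kineto file instead of the inline sample.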

Test Plan

Download and Install HTA

git clone https://github.com/facebookresearch/HolisticTraceAnalysis.git
cd HolisticTraceAnalysis
git submodule update --init
pip install -r requirements.txt
pip install -e .

Run Chakra et_converter

python3 tools/trace_link.py --pytorch-et-file result/eg.rank_0.pt.trace.json --kineto-file result/kineto.rank_0_step_5.1708449344148840892.pt.trace.json --rank 0 --output-file result/rank_0.json

Test input: ResNet-50 on 2 GTX 1070 GPUs (rank 0)
eg.rank_0.pt.trace.json
kineto.rank_0_step_5.1708449344148840892.pt.trace.json

Test result: ResNet-50 on 2 GTX 1070 GPUs (rank 0)
rank_0.json

Test Result with Megatron (No Sync dependency)

With a trace that has no synchronization dependencies, I observed that this update does not change the result.

Original
sys[4] finished, 607252677 cycles
sys[5] finished, 607253196 cycles
sys[6] finished, 607253715 cycles
sys[7] finished, 607254234 cycles
sys[0] finished, 607254753 cycles
sys[1] finished, 607255272 cycles
sys[2] finished, 607255791 cycles
sys[3] finished, 607256310 cycles

New
sys[4] finished, 607252677 cycles
sys[5] finished, 607253196 cycles
sys[6] finished, 607253715 cycles
sys[7] finished, 607254234 cycles
sys[0] finished, 607254753 cycles
sys[1] finished, 607255272 cycles
sys[2] finished, 607255791 cycles
sys[3] finished, 607256310 cycles
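
A comparison like the one above can also be checked mechanically. The following is a minimal sketch that parses the "sys[N] finished, X cycles" lines from two logs and reports any differences; the helper names are hypothetical, and only two lines of each log are reproduced here.

```python
import re

# Matches lines such as "sys[4] finished, 607252677 cycles".
_FINISH_RE = re.compile(r"sys\[(\d+)\] finished, (\d+) cycles")


def parse_finish_cycles(log_text: str) -> dict:
    """Map each system id to the cycle count at which it finished."""
    return {int(m.group(1)): int(m.group(2)) for m in _FINISH_RE.finditer(log_text)}


def diff_cycles(original: dict, new: dict) -> dict:
    """Return {sys_id: (original_cycles, new_cycles)} for ids whose values differ."""
    keys = sorted(set(original) | set(new))
    return {k: (original.get(k), new.get(k)) for k in keys if original.get(k) != new.get(k)}


original_log = """sys[4] finished, 607252677 cycles
sys[5] finished, 607253196 cycles"""
new_log = """sys[4] finished, 607252677 cycles
sys[5] finished, 607253196 cycles"""

differences = diff_cycles(parse_finish_cycles(original_log), parse_finish_cycles(new_log))
print("differences:", differences)
```

An empty dictionary means the two runs finished at identical cycle counts.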

@JoongunPark JoongunPark force-pushed the refactor branch 6 times, most recently from be20a5d to e73309b Compare February 29, 2024 21:42
@JoongunPark JoongunPark changed the title Import HTA to extract synchronization dependency in Chakra Import HTA to import synchronization dependency in Chakra Feb 29, 2024
@JoongunPark JoongunPark changed the title Import HTA to import synchronization dependency in Chakra Import HTA to extract synchronization dependency Feb 29, 2024
@JoongunPark JoongunPark changed the title Import HTA to extract synchronization dependency Import HTA to Chakra to extract synchronization dependency Feb 29, 2024
@JoongunPark JoongunPark force-pushed the refactor branch 3 times, most recently from f305f72 to 8f71209 Compare March 15, 2024 19:42
train/compute/python/tools/trace_link.py
linker = TraceLinker(
    args.pytorch_et_file,
    args.kineto_file,
    args.log_level
)
linker.load_traces()
linker.enforce_inter_thread_order()
linker.enforce_sync_order(cpa)
Owner

I am not sure this is the best name for the method. Could you please justify the name or rename the method?

Author

I tried to give it a name similar to 'enforce_inter_thread_order'.

self.raw_events = None
self.sync_deps = {}

annotation = "ProfilerStep"
Owner

I want to know whether we can assume that the annotation is always 'ProfilerStep'

Author

@JoongunPark JoongunPark Mar 19, 2024

I think we can. The HTA test examples always assume it, and HTA describes the parameter the same way:

            annotation (str): a trace annotation to limit the analysis to,
                for example "ProfilerStep" would match all annotations that
                match this string (ProfilerStep#100, ProfilerStep#101 etc)

self.sync_deps = {}

annotation = "ProfilerStep"
instance_id = 0
Owner

What is the meaning of instance_id? Why is it set to zero?

Author

According to HTA, instance_id selects which instance of the annotation to consider:

            instance_id: can be either of the following
                (int) - specify which instance of the annotation to consider.
                        Defaults to the first instance.
                (Tuple(int, int)) - considers a range of annotation instances start to end,
                        inclusive of both start and end instance.
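
To illustrate how annotation and instance_id interact as described in the two docstring excerpts, here is a minimal sketch. It is not HTA's implementation: the function name is hypothetical, and matching by substring is an assumption based on the wording "would match all annotations that match this string".

```python
from typing import List, Tuple, Union


def select_annotation_instances(
    names: List[str],
    annotation: str,
    instance_id: Union[int, Tuple[int, int]] = 0,
) -> List[str]:
    """Pick annotation instances by index or by an inclusive (start, end) range.

    Assumption: an event name matches if it contains the annotation string.
    """
    matches = [name for name in names if annotation in name]
    if isinstance(instance_id, tuple):
        start, end = instance_id
        return matches[start:end + 1]  # inclusive of both start and end
    return matches[instance_id:instance_id + 1]


# Hypothetical event names in trace order.
event_names = ["ProfilerStep#100", "aten::add", "ProfilerStep#101", "ProfilerStep#102"]

first = select_annotation_instances(event_names, "ProfilerStep", 0)
later = select_annotation_instances(event_names, "ProfilerStep", (1, 2))
print(first)
print(later)
```

Under these assumptions, instance_id = 0 with annotation = "ProfilerStep" restricts the analysis to the first profiler step in the trace.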

@TaekyungHeo
Owner

The tool fails with the following command.

$ python tools/trace_link.py --pytorch-et-file ~/Downloads/megatron_et_0.json --kineto-file ~/Downloads/megatron_kineto_0.json --output-file ~/rank0.json --rank 0 
[2024-03-15 17:16:12,875] trace_file.py:36 [WARNING]: No trace file is found in /Users/theo/Downloads
[2024-03-15 17:16:12,875] trace_file.py:88 [WARNING]: failed to create rank to trace map
[2024-03-15 17:16:12,875] trace.py:735 [INFO]: ranks=[]
[2024-03-15 17:16:12,875] trace.py:741 [ERROR]: The list of ranks to be parsed is empty.
Traceback (most recent call last):
  File "/Users/theo/param-jupark/train/compute/python/tools/trace_link.py", line 1196, in <module>
    main()
  File "/Users/theo/param-jupark/train/compute/python/tools/trace_link.py", line 1177, in main
    cpa = CriticalPathAnalyzer(
  File "/Users/theo/param-jupark/train/compute/python/tools/trace_link.py", line 39, in __init__
    self.event_sync_trace = TraceAnalysis(trace_dir = kineto_file)
  File "/Users/theo/HolisticTraceAnalysis/hta/trace_analysis.py", line 37, in __init__
    self.t.load_traces(include_last_profiler_step)
  File "/Users/theo/HolisticTraceAnalysis/hta/common/trace.py", line 614, in load_traces
    self.align_and_filter_trace(include_last_profiler_step)
  File "/Users/theo/HolisticTraceAnalysis/hta/common/trace.py", line 754, in align_and_filter_trace
    self._align_all_ranks()
  File "/Users/theo/HolisticTraceAnalysis/hta/common/trace.py", line 873, in _align_all_ranks
    self.min_ts = min(trace_df["ts"].min() for trace_df in self.traces.values())
ValueError: min() arg is an empty sequence

@JoongunPark
Author

This failure occurs when HTA cannot find the trace files in the directory. Could you check whether the files are really there?
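
One way to surface this earlier would be to validate the --kineto-file path before handing it to HTA. The sketch below is only an illustration under stated assumptions: the helper name is hypothetical, it does not call HTA, and whether the returned directory and rank-to-file map fit HTA's TraceAnalysis arguments would need to be checked against the HTA documentation.

```python
import os
import tempfile
from typing import Dict, Tuple


def resolve_kineto_trace(kineto_file: str, rank: int) -> Tuple[str, Dict[int, str]]:
    """Validate a Kineto trace path; return its directory and a rank-to-file map.

    Raises FileNotFoundError with the resolved path, instead of letting a later
    step fail with "min() arg is an empty sequence".
    """
    path = os.path.abspath(os.path.expanduser(kineto_file))
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Kineto trace not found: {path}")
    return os.path.dirname(path), {rank: path}


# Demo with a temporary file standing in for a real Kineto trace.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "kineto_0.json")
    with open(demo_path, "w") as f:
        f.write('{"traceEvents": []}')
    trace_dir, trace_files = resolve_kineto_trace(demo_path, rank=0)
    print(trace_dir, trace_files)
```

Failing fast with the resolved path in the message makes it easier to tell a missing file apart from a problem inside the analysis itself.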
