
Long running OR commands should report their runtime #5003

Open
stefanottili opened this issue Apr 26, 2024 · 31 comments
Assignees
Labels
good first issue Good for newcomers

Comments

@stefanottili

Description

Every “long running” command should report how long it took to run it.
The goal is to make it easy to compare log files from different runs.
E.g.
vimdiff dir[12]/log
or
grep runtime dir2/log > dir2/runtime
diff dir[12]/runtime

Suggested Solution

Currently there seems to be a way to get ORFS to report runtime and memory usage,
but it relies on running the openroad executable for each and every step,
and it's not "built into" openroad; it relies on external scripting.
It should be possible to run "the whole flow" with one OR Tcl script and still get this information in the log.

Log                            Elapsed seconds Peak Memory/KB
1_1_yosys                                  243         971412
2_1_floorplan                               27         510688
2_2_floorplan_io                             4         363812
2_3_floorplan_tdms                           4         362360
2_4_floorplan_macro                          4         367364
2_5_floorplan_tapcell                        4         325804
2_6_floorplan_pdn                            6         381020
3_1_place_gp_skip_io                        38         461672
3_2_place_iop                                4         372996
3_3_place_gp                               613        1048224
3_4_place_resized                           94         602152
3_5_place_dp                                78         657092
4_1_cts                                    152         747132
5_1_grt                                   6137        1298604
5_2_fillcell                                 5         494576
5_3_route                                  543       11470100
6_1_merge                                   15         815860
6_report                                   198        2751824
Total                                     8169       11470100
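
The comparison described above can also be scripted. Here is a minimal Python sketch (the helper names are hypothetical, not part of ORFS) that parses an ORFS-style summary table and diffs per-stage runtimes between two runs:

```python
# Hypothetical helpers for comparing ORFS-style summary tables.
def parse_summary(text):
    """Map stage name -> (elapsed_seconds, peak_memory_kb).

    Data rows look like: "5_1_grt  6137  1298604"; header and
    non-matching lines are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
            stats[parts[0]] = (int(parts[1]), int(parts[2]))
    return stats


def diff_runtimes(a, b):
    """Per-stage elapsed-time delta (b - a) for stages present in both runs."""
    return {stage: b[stage][0] - a[stage][0] for stage in a if stage in b}
```
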

Additional Context

No response

@oharboe
Collaborator

oharboe commented May 1, 2024

OpenROAD-flow-scripts is a flow that uses OpenROAD. This flow is based on invoking OpenROAD many times, reading in the previous .odb file and writing out an updated one for each stage and reporting times and memory usage.

If I understand correctly, you describe a flow where a single .tcl script is used and the OpenROAD binary is invoked once.

OpenROAD-flow-scripts used to have a "single .tcl script" version of the flow, but it was abandoned because it wasn't being used or maintained, and there were no tests for it.

There are many flows other than OpenROAD-flow-scripts that exist on top of the OpenROAD tool. To name a few: Hammer, Zero ASIC SiliconCompiler, bazel_hdl.

None of these flows, to my knowledge, is run by a single .tcl script. The OpenROAD binary is invoked multiple times.

For these flows, OpenROAD is not the only tool whose memory usage and run times have to be tracked.

Hence, today's solution, where measuring memory usage and running times is outside the scope of the OpenROAD binary, seems to work well. There is a notable exception: memory consumption and running time are tracked and printed for each step within detailed routing.

So in short, absent a more specific description of the proposed changes, I think that today's solution is pretty good. It has evolved to the way it is for good reasons.

@stefanottili
Author

I have log files from years ago, created by running the ISPD routing testcases with OpenROAD global and detailed route in one OpenROAD Tcl script, with no need to save and reload .odb databases in between.
It would have been nice then, and even nicer now, to have runtime info in the logs alongside which commands were run; both changed over time.

Having multiple flows each implement their own runtime logging sounds to me like an argument for putting it into the tool itself.
I can’t think of a P&R tool I’ve used over the decades that didn’t log its runtimes.

@maliberty maliberty added the good first issue Good for newcomers label May 1, 2024
@maliberty
Member

I think it is a nice-to-have, though not a high priority. To be clear, this is command-level timing rather than flow-step-level timing, so it is a bit more fine-grained. In general OR doesn't have a unified approach to commands, so adding this will likely happen when there is some motivating need to give commands more common infrastructure. Dependencies like sta further complicate the implementation.

Detailed routing does already report its runtime so that should be present in your old logs.

@stefanottili
Author

stefanottili commented May 1, 2024

It's definitely not "high priority", but very much "nice to have".
Not once have I thought "gee, I wish this tool wouldn't report its runtime or memory usage" ...

Making it easy to compare runs between versions/machines/implementations by diff'ing/grep'ing/scripting is useful.
Key is to have something consistent and easy to grep for, e.g. "egrep -e CMD -e TIME openroad.log".
Not for the "progress report/internal details", but only for the "major, time-consuming steps".

There is a fine balance between "show some progress/info" and "keep the log file short and concise".

@maliberty
Member

maliberty commented May 1, 2024

We could either report whenever a command takes more than X time, or just decide which commands are interesting enough to always report. The former makes the output somewhat non-deterministic around the threshold; the latter may report small times on small designs. I favor the latter for determinism, in which case we should make a list.

@rovinski
Collaborator

rovinski commented May 1, 2024

A few comments:

  • Every command could be a long-running command under some circumstances or design input
  • As @oharboe mentioned, there used to be a method for running the flow all in one script, but it was removed because no one used it and it fell out of maintenance
  • If your GRT is taking longer than DRT (as in the OP) that usually indicates some kind of congestion or layer resource reduction issue
  • I know of many commands in EDA tools that don't report time / mem usage. Perhaps you are thinking of per-session statistics? That sounds like something doable in OR, like printing time / mem on exit.
  • On a per-command basis, there is an issue of too much verbosity. You say that you have never not wanted it; I know that @oharboe likes logs to be as clutter-free as possible, for example.
  • Time can be measured in Tcl somewhat easily: https://stackoverflow.com/questions/5242647/tcl-how-to-know-how-much-time-a-function-has-worked, memory less so
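
On that measurement point: Tcl has a built-in `time` command, while peak memory needs an OS-specific query. A minimal Python sketch of the same wrap-and-report idea, using `resource.getrusage` (Unix-only; note `ru_maxrss` is kilobytes on Linux but bytes on macOS):

```python
import resource
import time


def report_usage(label, func, *args, **kwargs):
    """Run func, then print elapsed wall time and peak RSS of this process."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the peak resident set size so far (KB on Linux, B on macOS).
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{label:<30} {elapsed:10.2f} s {peak:>12} peak-RSS")
    return result
```
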

@stefanottili
Author

Every “long running” command should report how long it took to run it.

In my use-case this info would have been useful to me.

So it sounds like our preferences differ and there is the question of the resources necessary to implement this request.

If long running commands (placement, resize, global route, antenna, detailed route) reported time (in C++), this would in no way prevent other flows from ignoring this info and logging it some other way, as today; best of both worlds ...
Considering the existing log file clutter, these few additional lines don't add much.

@maliberty
Member

Would

global_place
rtl_macro_placer 
repair_design
repair_timing
global_route
detailed_route

suffice? I suspect those are the big ticket items and wouldn't add too much verbosity.
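
One way to keep such reporting deterministic is an explicit allow-list around the big-ticket commands. A hedged Python sketch of the idea (the registry, decorator, and message format are illustrative, not OpenROAD's actual logging API):

```python
import functools
import time

# Hypothetical allow-list mirroring the commands proposed in this thread.
TIMED_COMMANDS = {
    "global_place", "rtl_macro_placer", "repair_design",
    "repair_timing", "global_route", "detailed_route",
}


def timed_command(name):
    """Wrap a command so only allow-listed ones log an overall elapsed time."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                if name in TIMED_COMMANDS:
                    print(f"[INFO] {name}: elapsed "
                          f"{time.perf_counter() - start:.2f} s")
        return wrapper
    return decorate
```

Commands outside the list run unchanged, so log verbosity stays bounded regardless of design size.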

@oharboe
Collaborator

oharboe commented May 2, 2024

repair_timing specifically can take surprisingly long, so having extra detail there would be useful.

Also, reporting timing can be very slow.

CTS is a good example. I have seen cases where CTS takes 10 minutes and report/fix timing takes 24 hours.

@QuantamHD
Collaborator

Adding time to logs makes them non-hermetic, so this should be an opt-in thing.

@oharboe
Collaborator

oharboe commented May 2, 2024

Adding time to logs makes them non-hermetic, so this should be an opt-in thing.

This is already the case for detailed routing. How do you deal with that (in bazel_hdl, presumably?)?

@maliberty
Member

repair_timing does give detail with -verbose. This would just be adding an overall time message.

@stefanottili
Author

Here's an example of why runtime and memory consumption reporting from openroad is useful.
Today's openroad on a machine with 32 threads, 56 GB memory + 14 GB swap, WSL2 Ubuntu:

Now I know not to try to detailed_route the 3M-stdcell ispd24 mempool_group testcase again.

read_def is fast
placement density displays instantaneously
rudy takes 35 sec before the cursor switches from pointer to busy
global_route -verbose didn't report runtime/memory
detailed_route reported two runtime/memory datapoints before being killed

OpenROAD v2.0-13573-g9e909da77
Features included (+) or not (-): +Charts +GPU +GUI +Python
...
Design:                   mempool_group
Die area:                 ( 0 0 ) ( 6767040 6762560 )
Number of track patterns: 20
...
[WARNING DRT-0419] No routing tracks pass through the center of Term gen_tiles\[4\].i_tile/gen_banks\[6\].mem_bank/genblk1.sram_instance/rd_out[0]
[WARNING DRT-0418] Term gen_tiles\[4\].i_tile/gen_banks\[6\].mem_bank/genblk1.sram_instance/rd_out[1] has no pins on routing grid
...
#scanned instances     = 3099210
#unique  instances     = 555
#stdCellGenAp          = 16776
#stdCellValidPlanarAp  = 0
#stdCellValidViaAp     = 11479
#stdCellPinNoAp        = 0
#stdCellPinCnt         = 10978298
#instTermValidViaApCnt = 0
#macroGenAp            = 19262
#macroValidPlanarAp    = 19262
#macroValidViaAp       = 0
#macroNoAp             = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 84:59:19, elapsed time = 03:09:21, memory = 37294.33 (MB), peak = 39299.85 (MB)
....
Number of guides:     28748042

[INFO DRT-0169] Post process guides.
[INFO DRT-0176] GCELLGRID X 0 DO 1611 STEP 4200 ;
[INFO DRT-0177] GCELLGRID Y 0 DO 1610 STEP 4200 ;
...
[INFO DRT-0181] Start track assignment.
[INFO DRT-0184] Done with 8693270 vertical wires in 33 frboxes and 14344011 horizontal wires in 33 frboxes.
[INFO DRT-0186] Done with 5901687 vertical wires in 33 frboxes and 4252651 horizontal wires in 33 frboxes.
[INFO DRT-0182] Complete track assignment.
[INFO DRT-0267] cpu time = 01:01:40, elapsed time = 00:11:57, memory = 54316.77 (MB), peak = 55276.92 (MB)
[INFO DRT-0187] Start routing data preparation.
Killed

@maliberty
Member

Not a great example, as the command didn't complete, so you wouldn't get any reporting at the end. Reporting usage during progress is not what was requested here.

@rovinski
Collaborator

rovinski commented May 3, 2024

I think what DRT needs is some automatic adjustment to the number of threads before it starts running, in case it detects that the system memory is too low. I know of some proprietary tools which do this based on system load (% CPU), but because DRT memory scales pretty linearly with the number of threads, it should hopefully be easy to do here.
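
Under that linear-scaling assumption, the clamp itself is straightforward to sketch. A hypothetical helper (the parameter names, KB units, and the idea of a fixed base cost are assumptions, not an actual DRT interface):

```python
def safe_thread_count(requested_threads, mem_per_thread_kb, total_system_kb,
                      base_kb=0):
    """Clamp thread count so base + threads * per-thread memory fits in RAM.

    Assumes roughly linear memory scaling per thread, as discussed in the
    thread; always returns at least 1 so the run can proceed.
    """
    budget = total_system_kb - base_kb
    if mem_per_thread_kb <= 0 or budget <= 0:
        return 1
    fit = budget // mem_per_thread_kb  # how many threads the budget covers
    return max(1, min(requested_threads, fit))
```
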

@oharboe
Collaborator

oharboe commented May 3, 2024

@rovinski Interesting... I think it would be a good idea to make this improvement to detailed routing, even if it would not affect the maximum memory usage in all cases.

Example from megaboom; here the maximum memory usage for global and detailed routing is about the same. Note that megaboom runs with various things disabled, which will underestimate memory usage, though by how much I don't know.

Features are disabled mainly to rein in running times to something somewhat manageable.

Log                            Elapsed seconds Peak Memory/MB
1_1_yosys                                 4043           4528
2_1_floorplan                              137           7208
2_2_floorplan_io                            13           3090
2_4_floorplan_macro                        711           7799
2_5_floorplan_tapcell                      529           2144
2_6_floorplan_pdn                          311          10416
3_1_place_gp_skip_io                        21           3365
3_2_place_iop                               28           4540
3_3_place_gp                              6015          14747
3_4_place_resized                          565           8064
3_5_place_dp                              1222          12899
4_1_cts                                    449          13227
5_1_grt                                   5246          59714
5_2_fillcell                                76          13239
5_3_route                                18100          60493
6_1_merge                                  388          29685
6_report                                  6442          40855
generate_abstract                          874          30833
Total                                    45170          60493

@maliberty
Member

It sounds quite fragile. The amount of free memory may change quite dynamically (particularly on a multi-user system). It also depends on how expensive swapping is (and how large the working set is). It's also hard to know exactly how much memory drt will use in advance.

@rovinski
Collaborator

rovinski commented May 4, 2024

I was thinking more conservatively of (threads * memory/thread) < total system memory, not even counting free memory. Beyond that, you are guaranteed to dip into swap and risk significant slowdown. In an even more conservative case, at least a warning could be issued.

Is it hard to know how much memory it uses? When I was experimenting a while ago, it seemed pretty deterministic based on the technology, design size, and number of threads.

@maliberty
Member

If I overestimate the usage and unnecessarily throttle the run I get a bug report. If I underestimate the memory and the run is slowed down or killed I get a bug report. I have to hit some goldilocks window every time. If you feel you can do that then please submit a PR (and plan to support it).

@oharboe
Collaborator

oharboe commented May 4, 2024

The user can control this via NUM_CORES now, right?

What about memory usage for the rest of the flow?

@rovinski
Collaborator

rovinski commented May 4, 2024

If I underestimate the memory and the run is slowed down or killed I get a bug report.

But that's currently the case anyway 😆. If introduced, hopefully it would happen less.

@maliberty
Member

No such promise is made today and therefore nothing needs "fixing".

@oharboe
Collaborator

oharboe commented May 4, 2024

If I underestimate the memory and the run is slowed down or killed I get a bug report.

But that's currently the case anyway 😆. If introduced, hopefully it would happen less.

I think today there is well-defined behavior where the user has to make an annoying choice that has deterministic effects: NUM_CORES.

From a bug/feature-request point of view, I think this is the least bad baseline. I can see how any change to this baseline (default behavior) practically (from a maintainer's point of view) has to be only better and never worse.

In bazel-orfs, I can easily specify NUM_CORES per stage (floorplan, place, route, etc.). The same is possible in ORFS, but it requires writing a script that runs one make command per stage (not a big deal for a flow that is memory constrained, which probably also runs for a very long time).

Perhaps the final summary report could include "marginal memory cost/core"?

The marginal memory cost per core refers to the additional memory consumption incurred by adding one more processing core to a system

That would require some extra logging by e.g. detailed routing based upon knowledge of the memory consumption.

The user can then choose to modify NUM_CORES for the problematic stage.

@oharboe
Collaborator

oharboe commented May 4, 2024

Come to think of it: if the marginal memory cost per CPU can't be estimated and reported, then it isn't possible to implement a policy that selects the "best" number of CPUs as constrained by available memory.

If the marginal speedup per core can be estimated (assuming roughly linear, for simplification) and the marginal memory cost can be estimated and reported, then the user can decide on a NUM_CORES tradeoff manually. Currently a maximum-speed policy is in place (use all cores), and a minimum memory usage could be estimated in a column. If the peak memory usage isn't reduced (grt doesn't use parallelism, does it?), then the user would know that nothing is gained by reducing parallelism.

@oharboe
Collaborator

oharboe commented May 4, 2024

In fact, it would be easy to run an experiment: do a bunch of runs with NUM_CORES set to 1 through the number of cores, and plot run times and memory usage for each stage... That should tell a lot.
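
Such a sweep reduces to fitting a line through (cores, peak-memory) samples; the slope is the marginal memory cost per core. A small self-contained least-squares sketch (it assumes the roughly linear scaling discussed above actually holds):

```python
def fit_marginal_memory(samples):
    """Least-squares fit of peak_mem ≈ base + k * cores.

    samples: list of (num_cores, peak_memory) pairs from sweep runs.
    Returns (base, k), where k is the marginal memory cost per core.
    """
    n = len(samples)
    sx = sum(c for c, _ in samples)
    sy = sum(m for _, m in samples)
    sxx = sum(c * c for c, _ in samples)
    sxy = sum(c * m for c, m in samples)
    k = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope: memory per core
    base = (sy - k * sx) / n                       # intercept: fixed cost
    return base, k
```
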

@maliberty
Member

This is all predicated on the idea that there is some fixed cost per thread and that it remains constant across iterations. I'm not sure either is true, but I am open to seeing your results.

This issue is getting taken over by an unrelated request. If people want to pursue it then it should be a separate issue.

@annapetrosyan26
Contributor

Hello, I would like to work on this issue; however, from the conversation it's not clear what the final request is. To summarize what I understood: for the following commands, runtime and memory usage should be logged from within the commands themselves, and the rest of the conversation is considered a separate request.

global_place
rtl_macro_placer 
repair_design
repair_timing
global_route
detailed_route

Could you please confirm if this is the case?

@maliberty
Member

I think that would be a good start. People can request more commands if they see long runtimes.

@QuantamHD
Collaborator

I'll be pretty forceful here and say this should be an option you need to enable, and not default behavior.

@maliberty
Member

Why? drt already does this; how are extra commands different?

@annapetrosyan26
Contributor

Hi all, I have started working on this one.
