This repository was archived by the owner on Dec 9, 2024. It is now read-only.

Updated/refactored T5 DNN #304

Merged
VourMa merged 11 commits into SegmentLinking:master from jkguiang:t5-dnn
Jul 25, 2023

Conversation

@jkguiang
Contributor

Summary

Made the following updates to the existing T5 DNN (#291):

  • Trained slightly longer
  • Saved more working points
  • Added ifdef toggles for the chi-squared and DNN cuts:
    • -DUSE_RZCHI2 toggles the r-z chi-squared cut
    • -DUSE_T5_DNN toggles the T5 DNN cuts
    • -DUSE_RPHICHI2 toggles the r-phi chi-squared cuts
    • Currently, -DUSE_RZCHI2 and -DUSE_T5_DNN are enabled (see SDL/Makefile)
  • Refactored the neural network code so that the matrix multiplication is now done within T5DNN::runInference

Timing

This PR (measured July 11th, 2023 around 9:30am PDT)

Total Timing Summary
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Event      Short           Rate
   avg      1.7      2.0      1.4      3.4      3.7      1.4      1.6      1.0      5.1      21.3      18.3+/-  5.1      22.9   explicit_cache[s=1]
   avg      4.4      3.7      1.9      5.2      5.3      1.5      3.6      2.1      6.9      34.8      28.8+/-  7.7      19.3   explicit_cache[s=2]
   avg      8.0      5.5      3.1      9.6      8.8      1.9      7.3      4.2     11.2      59.6      49.8+/- 11.9      16.9   explicit_cache[s=4]
   avg     13.9      7.4      4.8     14.0     14.2      2.4     12.4      6.5     15.7      91.3      75.0+/- 14.1      15.8   explicit_cache[s=6]
   avg     24.9     10.2      6.4     19.6     20.8      3.2     17.2      9.4     22.3     134.0     105.9+/- 25.8      17.6   explicit_cache[s=8]

Pre-DNN (measured July 11th, 2023 around 9:40am PDT)

Total Timing Summary
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Event      Short           Rate
   avg      1.6      2.0      1.4      3.4      3.4      1.4      1.3      1.0      2.2      17.8      14.8+/-  3.5      19.4   explicit_cache[s=1]
   avg      2.4      3.1      1.8      5.0      4.6      1.4      2.6      1.5      3.5      25.9      22.1+/-  4.1      13.8   explicit_cache[s=2]
   avg      6.1      5.7      3.3     10.1      8.5      1.7      6.3      3.4      6.4      51.4      43.6+/-  7.6      13.3   explicit_cache[s=4]
   avg      9.6      6.7      4.2     12.5     12.5      2.0      9.1      4.8      9.0      70.5      58.8+/- 10.2      12.2   explicit_cache[s=6]
   avg     14.7      8.0      5.1     16.1     16.9      2.4     12.3      6.0     11.6      93.3      76.1+/- 15.9      12.6   explicit_cache[s=8]

@VourMa
Contributor

VourMa commented Jul 11, 2023

I notice a major slowdown in TCs. Is this understood/expected?

@GNiendorf
Member

Could you add a command-line toggle for the DNN, like we do for the caching allocator? For example, "sdl_make_tracklooper -mcd", where -d toggles the DNN, or something like that.

Comment thread SDL/Quintuplet.cu Outdated
@jkguiang
Contributor Author

Somehow, changing the SDL Makefile fixed the TC timing increase:

This PR (measured July 14th, 2023 around 7:30am)

Total Timing Summary
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Event      Short           Rate
   avg      1.6      2.0      1.5      3.4      4.0      1.4      1.3      1.0      2.1      18.2      15.2+/-  3.5      19.7   explicit_cache[s=1]
   avg      3.7      3.4      1.9      5.4      5.4      1.5      2.8      1.7      3.5      29.3      24.1+/-  4.7      15.2   explicit_cache[s=2]
   avg      5.5      5.3      3.2      9.0      9.0      1.7      6.2      3.1      6.5      49.6      42.4+/-  8.3      12.9   explicit_cache[s=4]
   avg     10.5      6.5      4.3     12.9     12.9      2.2      9.2      4.7      8.5      71.8      59.1+/- 11.7      13.0   explicit_cache[s=6]
   avg     14.6      8.4      6.0     17.0     18.6      2.6     13.4      6.6     11.8      99.1      81.8+/- 12.0      12.8   explicit_cache[s=8]

Pre-DNN (measured July 14th, 2023 around 7:40am)

Total Timing Summary
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Event      Short           Rate
   avg      1.6      2.0      1.5      3.4      3.5      1.4      1.3      1.0      2.2      17.9      14.9+/-  3.6      19.4   explicit_cache[s=1]
   avg      2.3      3.4      1.9      5.1      4.6      1.5      2.8      1.5      3.5      26.5      22.7+/-  4.8      14.0   explicit_cache[s=2]
   avg      5.0      5.1      3.2      9.2      8.2      1.7      6.2      3.2      6.1      47.9      41.2+/-  7.7      12.5   explicit_cache[s=4]
   avg      9.1      6.8      4.4     12.8     12.3      2.1      9.6      4.8      8.9      70.7      59.5+/- 11.0      12.2   explicit_cache[s=6]
   avg     15.2      8.3      5.0     16.3     17.0      2.5     12.4      6.2     11.9      94.7      77.0+/- 13.4      12.2   explicit_cache[s=8]

I have also merged NeuralNetwork.cu with NeuralNetwork.cuh as requested in the last LST meeting. As for adding a toggle for the DNN in the SDL CLI, I have not worked on that yet. Do we need it? I do not know how trivial it is to add.

@GNiendorf
Member

For the toggle, it should only take a couple more lines of code. See how it's done for the other toggles we have: https://github.com/SegmentLinking/TrackLooper/blob/master/bin/sdl_make_tracklooper

@GNiendorf
Member

@VourMa or @slava77 Can I get your take on whether you think a toggle would be beneficial here? My concern is that after this gets pushed, the only way someone would know how to turn the DNN off properly would be to go to this PR and find the two relevant flags that need to be turned on when the DNN is toggled off. I think it is more likely that people will simply make the mistake of turning off the DNN flag without turning on the -DUSE_RPHICHI2 flag, although I'm not sure how big of a difference that flag makes. Am I missing something here @jkguiang? A toggle would do this automatically, avoid potential confusion in the future, and make things less complicated for new users. And it should take fewer than 15 lines of code to accomplish, since we already make use of other toggles, right?

Contributor

@VourMa VourMa left a comment


The PR looks good and the comments are generally minor. There are two open issues I see, beyond the current code changes:

  1. There has been a request for a command-line toggle for the usage of the DNN within the code. Even though it is not mandatory, it could be useful, since dealing with more than one compilation flag for a single operation quickly gets complicated and forgotten (hence my comment about the cleanup of the Makefile, which needs a separate PR to be brought up to date).
    Adding a toggle should not be that hard: first, one creates a separate make target, as here, with the proper extra compilation flag enabled. Then, this make target is turned on using the command line argument like this.
  2. In my opinion, the increase in timing is still there, and it was not at all clear to me what happened in the meantime such that 3 ms went down to ~0.5 ms. Could you please comment on the investigations you did to understand this? I feel it is OK to take a sub-ms hit to the timing, but we should know it and we should know why.

Comment thread SDL/Makefile
Comment thread SDL/NeuralNetwork.cuh Outdated
Comment thread SDL/NeuralNetwork.cuh Outdated
Comment thread SDL/NeuralNetwork.cuh
@slava77
Contributor

slava77 commented Jul 18, 2023

I think more likely people will just make the mistake of turning off the DNN flag without turning on the -DUSE_RPHICHI2 flag, although I'm not sure how big of a difference this flag makes. Am I missing something here @jkguiang? A toggle would do this automatically and avoid potential confusion in the future, as well as making it less complicated for new users.

is the point to be sure that both are not used? (a "toggle")

@GNiendorf
Member

GNiendorf commented Jul 18, 2023

is the point to be sure that both are not used? (a "toggle")

I think the correct behavior is that if the DNN is being used, -DUSE_RZCHI2 and -DUSE_T5_DNN should be turned on, and if the DNN is not being used, then -DUSE_RZCHI2 and -DUSE_RPHICHI2 should be turned on. At least, that keeps it consistent with the cuts that were being applied before the DNN. A toggle would do this automatically by just including a -d (for example) or not when running sdl_make_tracklooper. For example, sdl_make_tracklooper -mcd, with no need to change the Makefile yourself.
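
A sketch of how such a command-line toggle could be parsed (the option letters and the exact flag mapping are illustrative, not the actual sdl_make_tracklooper interface):

```shell
# Hypothetical sketch of a -d toggle choosing between the two flag sets
# described above. Not the real sdl_make_tracklooper script.
cat > pick_flags.sh <<'EOF'
#!/bin/bash
# Default: pre-DNN chi-squared cuts
CUTFLAGS="-DUSE_RZCHI2 -DUSE_RPHICHI2"
while getopts "mcd" opt; do
    case $opt in
        d) CUTFLAGS="-DUSE_RZCHI2 -DUSE_T5_DNN" ;;  # DNN replaces the r-phi cut
        *) ;;  # -m, -c would map to the existing toggles (omitted here)
    esac
done
echo "$CUTFLAGS"
EOF
chmod +x pick_flags.sh

./pick_flags.sh        # prints "-DUSE_RZCHI2 -DUSE_RPHICHI2"
./pick_flags.sh -mcd   # prints "-DUSE_RZCHI2 -DUSE_T5_DNN"
```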

@jkguiang
Contributor Author

jkguiang commented Jul 19, 2023

Adding a toggle should not be that hard: First, one creates a separate make target, as here, with the proper extra compilation flag enabled. Then, this make target is turned on using the command line argument like this.

If we add a make target as above, I believe we will just need to be careful about cases where we want more than one make target to be used, since I am guessing we will want the DNN-toggling make targets to work alongside any of the make targets already defined.

In any case, I agree adding a flag to sdl_make_tracklooper would make things easier.

@VourMa
Contributor

VourMa commented Jul 19, 2023

I am guessing we will want to be able to use the DNN-toggling make targets alongside any of the make targets already defined.

You're right. Then this is probably a better example?

@jkguiang
Contributor Author

In my opinion, the increase in timing is still there, and it was not at all clear to me what happened in the meantime such that 3 ms went down to ~0.5 ms. Could you please comment on the investigations you did to understand this? I feel it is OK to take a sub-ms hit to the timing, but we should know it and we should know why.

For this, I made only a brief mention of what I had done:

Somehow, by changing the SDL Makefile, the TC timing increase is fixed

Expanding on this, I had originally put the flags in the Makefile as follows:

CUTVALUEFLAG_FLAGS = -DCUT_VALUE_DEBUG -DUSE_RZCHI2 -DUSE_T5_DNN
%_cuda.o : %.cu %.cuh
	$(LD) -x cu $(PT0P8) ... $(CUTVALUEFLAG) $(DUPLICATES) $< -o $@

With the Makefile configured as above, I got the strangely long runtimes. Then, I moved the flags:

CUTVALUEFLAG_FLAGS = -DCUT_VALUE_DEBUG
CUTFLAGS = -DUSE_RZCHI2 -DUSE_T5_DNN
%_cuda.o : %.cu %.cuh
	$(LD) -x cu $(PT0P8) ... $(CUTVALUEFLAG) $(CUTFLAGS) $(DUPLICATES) $< -o $@

This somehow fixed the issue I was having. I have no idea why this changed the runtime.

@VourMa
Contributor

VourMa commented Jul 19, 2023

This somehow fixed the issue I was having. I have no idea why this changed the runtime.

In the "slow" configuration:

CUTVALUEFLAG_FLAGS = -DCUT_VALUE_DEBUG -DUSE_RZCHI2 -DUSE_T5_DNN
%_cuda.o : %.cu %.cuh
	$(LD) -x cu $(PT0P8) ... $(CUTVALUEFLAG) $(DUPLICATES) $< -o $@

I am not even sure you included the flags in the compilation. You were changing the CUTVALUEFLAG_FLAGS variable, while the one used in the nominal make target is CUTVALUEFLAG. CUTVALUEFLAG_FLAGS is only picked up by the explicit_cache_cutvalue make target:

explicit_cache_cutvalue: CUTVALUEFLAG = $(CUTVALUEFLAG_FLAGS)

If what I am saying is correct, I am not even sure what you were running: with no DNN and no RZCHI2 cut for the T5, I do not know what that does or how the validations came out fine.

In the "fast" configuration:

CUTVALUEFLAG_FLAGS = -DCUT_VALUE_DEBUG
CUTFLAGS = -DUSE_RZCHI2 -DUSE_T5_DNN
%_cuda.o : %.cu %.cuh
	$(LD) -x cu $(PT0P8) ... $(CUTVALUEFLAG) $(CUTFLAGS) $(DUPLICATES) $< -o $@

you are now correctly including the USE_RZCHI2 and USE_T5_DNN flags, so I expect this configuration to give the right results. However, when the explicit_cache_cutvalue is run, these flags are overwritten:

explicit_cache_cutvalue: CUTVALUEFLAG = $(CUTVALUEFLAG_FLAGS)

Could you please fix the above issue and double check that we are loading the flags in all cases we want them loaded?
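
The override behavior described above can be reproduced with a toy Makefile (shortened target and variable names; this is not the real SDL/Makefile): flags placed only in CUTVALUEFLAG_FLAGS never reach the nominal compile line, and the cutvalue-style target replaces CUTVALUEFLAG wholesale.

```shell
# Toy reproduction of the pitfall: the compile recipe uses CUTVALUEFLAG,
# while CUTVALUEFLAG_FLAGS only takes effect through the target-specific
# assignment on the cutvalue target (mirroring explicit_cache_cutvalue).
printf 'CUTVALUEFLAG =\nCUTVALUEFLAG_FLAGS = -DCUT_VALUE_DEBUG -DUSE_T5_DNN\ncutvalue: CUTVALUEFLAG = $(CUTVALUEFLAG_FLAGS)\ncutvalue: compile\ncompile:\n\t@echo "FLAGS=$(CUTVALUEFLAG)"\n' > Makefile.pitfall

make -f Makefile.pitfall compile    # prints "FLAGS=" (the flags are silently dropped)
make -f Makefile.pitfall cutvalue   # prints "FLAGS=-DCUT_VALUE_DEBUG -DUSE_T5_DNN"
```

This matches the observation above: flags appended only to CUTVALUEFLAG_FLAGS are left out of the nominal build entirely, and only appear when the cutvalue-style target is built.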

@jkguiang
Contributor Author

I believe I have addressed the comments made so far.

  1. I have added a toggle -N to sdl_make_tracklooper that toggles the T5 DNN
  2. In order to implement (1), I no longer put the DNN and chi-squared flags in CUTVALUEFLAG, so they should still be toggled when the make target is explicit_cache_cutvalue
  3. I verified that the toggles indeed work by putting ifdef statements for each flag in Event.cu in createQuintuplets that printed a message to stdout for each flag
  4. I have made the arguments to T5DNN::runInference const where appropriate
  5. I re-ran the timing and am now seeing smaller differences with respect to the pre-DNN timing:
Total Timing Summary
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Event      Short           Rate
   avg      1.6      2.0      1.4      3.3      3.4      1.4      1.4      1.1      2.2      17.7      14.8+/-  3.4      19.3   explicit_cache[s=1]
   avg      3.8      3.4      1.9      5.2      4.8      1.4      2.9      1.7      3.6      28.7      23.4+/-  5.1      15.0   explicit_cache[s=2]
   avg      7.8      5.0      3.0      8.9      7.8      1.7      5.8      3.3      6.6      49.8      40.3+/-  8.6      13.6   explicit_cache[s=4]
   avg      8.9      6.7      4.3     13.3     12.5      2.1      9.5      4.6      8.7      70.5      59.5+/- 10.8      12.2   explicit_cache[s=6]
   avg     14.2      8.6      5.7     17.7     17.8      2.6     13.3      6.4     11.4      97.6      80.8+/- 15.4      12.6   explicit_cache[s=8]

Finally, per the last comment:

I am not even sure you included the flags in the compilation.

This concerned me as well; however, I had verified several times that the flags were indeed being used. From (3), I now know for a fact that the flags are being toggled, and the plots do not change. We also know empirically that the flags were being used before, because the fake rate changed significantly with respect to the baseline. Nevertheless, I do not understand how this was possible.

@VourMa
Contributor

VourMa commented Jul 21, 2023

I think that during today's meeting we agreed that my explanation above makes sense from a technical point of view as to why there was a timing increase, and your solution with the toggle covers the issue I had mentioned.

I noticed two comments not addressed:

  1. The move of constant variables to the appropriate file (comment)
  2. The check of the profiling report that the declaration of the extra variable does not increase the register usage (comment)

Let me know if you plan to address them. Thanks!

@jkguiang
Contributor Author

I have moved the constants per (1). For (2), I had not used the profiler before (I have been focused on ML development), so I do not know exactly how to make the comparison. If it is a big worry, I can just remove the extra variable declarations; otherwise, I will not have time to look into how to use the profiler until later this week at the earliest. In the meantime, I have run the profiler (per the command on the wiki) and put the files here in case someone else has time to look:

http://uaf-10.t2.ucsd.edu/~jguiang/dump/PR304.nsys-rep
http://uaf-10.t2.ucsd.edu/~jguiang/dump/PR304.sqlite

@slava77
Contributor

slava77 commented Jul 24, 2023

I have moved the constants per (1). For (2), I have not used the profiler (as I have been focused on ML development), so I do not know exactly how to make the comparison. If it is a huge worry, I can just remove the extra variable declarations. Otherwise, I will not have time to look into how to use the profiler until later this week at the earliest. I have run the profiler (per the command on the wiki) and put the files here if someone else has time to look:

http://uaf-10.t2.ucsd.edu/~jguiang/dump/PR304.nsys-rep http://uaf-10.t2.ucsd.edu/~jguiang/dump/PR304.sqlite

I'd like to see ncu outputs to see code line details.

@jkguiang
Contributor Author

I'd like to see ncu outputs to see code line details.

I ran the following command on cgpu-1:

/opt/nvidia/nsight-compute/2022.2.1/ncu --set full -o PR304 -f --import-source on ./bin/sdl -n 1 -v 0 -i PU200

and put the output here:
http://uaf-10.t2.ucsd.edu/~jguiang/dump/PR304.ncu-rep

@GNiendorf
Member

GNiendorf commented Jul 24, 2023

I don't see any register usage from the is_endcap variables. @slava77

[Screenshot 2023-07-24 at 3:06:06 PM: ncu report view]

@slava77
Contributor

slava77 commented Jul 24, 2023

I don't see any register usage from the is_endcap variables

👍

Contributor

@VourMa VourMa left a comment


Thank you all for following up on this. I am merging the PR.

@VourMa VourMa merged commit a8c352a into SegmentLinking:master Jul 25, 2023
Comment thread SDL/NeuralNetwork.cuh
Comment on lines +91 to +95
mdsInGPU.anchorEta[mdIndex3], // outer T3 anchor hit 4 eta (t3_0_eta)
mdsInGPU.anchorPhi[mdIndex3], // outer T3 anchor hit 4 phi (t3_0_phi)
mdsInGPU.anchorZ[mdIndex3], // outer T3 anchor hit 3 eta (t3_0_z)
sqrtf(x3*x3 + y3*y3), // outer T3 anchor hit 3 r (t3_0_r)
float(modulesInGPU.layers[lowerModuleIndex3] + 6*is_endcap3), // outer T3 anchor hit 3 layer (t3_0_layer)
Contributor


@jkguiang
is this intentionally identical to L85-90?
we could have saved a bit on the number of weights and matrix operations, and avoided having the network learn identities.

Please check and perhaps keep a todo somewhere to possibly clean this up

Contributor Author


It was not a design decision, but it is something we could indeed clean up.

