
[rANS] Memory optimized sparse SymbolTable#4125

Merged
davidrohr merged 1 commit into AliceO2Group:dev from MichaelLettrich:O2-1203-rans-library
Aug 12, 2020

@MichaelLettrich
Collaborator

Optimized, sparse storage format for SymbolTable. Reduces storage
requirements by up to a factor of 5 and improves the cache performance of
the encoder/decoder.

Collaborator

@shahor02 shahor02 left a comment

@MichaelLettrich thanks, please rebase to dev and see the comment below.
@davidrohr this PR will supersede #4123 but will require the dictionaries to be regenerated; I'll close mine once this one is ready to merge.

Collaborator

Here there should be if (idx >= 0 && idx < mIndex.size()).
But I think a single if is enough:

uint64_t idx = static_cast<uint64_t>(index - mMin);
return (idx < mIndex.size()) ? *(mIndex[idx]) : *mEscapeSymbol;

Collaborator

You forgot the return. But why not use just one comparison instead of two?

Collaborator Author

The symbol table stores symbols in the range [mMin, mMax], so we have to handle both cases of a rare symbol being > mMax or < mMin, which requires two comparisons.

Collaborator

You are not comparing mMin <= symbol <= mMax but checking whether idx = index - mMin is < 0 or >= size. If you static_cast a negative idx to unsigned, it becomes larger than MAX_INT, so the single comparison static_cast<uint64_t>(index - mMin) < mIndex.size() covers both cases.
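
The single-comparison range check suggested above can be sketched in isolation as follows. This is a minimal illustration, not the code from SymbolTable.h: the struct name `SparseTableSketch`, the `int` payloads, and the `mEscape` member are assumptions made for the example, and it assumes `symbol - mMin` does not overflow `int64_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a table covering the symbols
// [mMin, mMin + mIndex.size()); everything else maps to an escape entry.
struct SparseTableSketch {
  int64_t mMin;             // smallest symbol stored in the table
  std::vector<int> mIndex;  // illustrative payload per in-range symbol
  int mEscape;              // illustrative payload for out-of-range symbols

  int lookup(int64_t symbol) const
  {
    // A symbol below mMin gives a negative difference. Converting it to an
    // unsigned type wraps it to a value of at least 2^63, which can never be
    // smaller than mIndex.size(). One comparison therefore rejects both
    // symbol < mMin and symbol > mMax.
    const auto idx = static_cast<uint64_t>(symbol - mMin);
    return idx < mIndex.size() ? mIndex[idx] : mEscape;
  }
};
```

With `mMin = -2` and three entries, `lookup(-2)` and `lookup(0)` hit the first and last entries, while `lookup(-3)` and `lookup(1)` both fall through to the escape value.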

Comment thread Utilities/rANS/include/rANS/internal/SymbolTable.h Outdated
@davidrohr
Collaborator

OK, I'll leave the pregenerated dictionary stuff out for now and wait for this to be merged. In the meantime I am anyway working on the crash mostly.

@MichaelLettrich MichaelLettrich force-pushed the O2-1203-rans-library branch 2 times, most recently from 84497bb to a6739ce Compare August 12, 2020 06:34
@MichaelLettrich
Collaborator Author

@MichaelLettrich thanks, please rebase to dev and see the comment below.
@davidrohr this PR will supersede #4123 but will require the dictionaries to be regenerated; I'll close mine once this one is ready to merge.

@davidrohr @shahor02 : Dictionaries stay valid. The only difference is how the symbol tables are generated from them.

@davidrohr
Collaborator

@MichaelLettrich : Are all issues fixed in the PR now, i.e. shall I try it out?

@MichaelLettrich
Collaborator Author

@davidrohr I would be very happy to see how it behaves in production, yes :)

@davidrohr
Collaborator

ok, I tried both generating and saving the dictionary as well as running with the pregenerated dictionary.
All looks OK. One should still try to create CTFs and reconstruct data from the CTF, which I didn't yet try in the full system test.

@shahor02
Collaborator

@davidrohr I've tested reconstruction from CTF (on 1 TF made of 130 pbpb collisions), looks ok.
But comparing the CTFs created with GPU_proc.ompThreads=1 and GPU_proc.ompThreads=4, I see that while the number of clusters is the same in both cases, the tracks are different:

[17013:tpc-tracker]: [13:56:35][INFO] Event has 59423876 TPC Clusters, 0 TRD Tracklets
[17013:tpc-tracker]: [13:59:06][INFO] found 581675 track(s)

vs

[15631:tpc-tracker]: [13:54:08][INFO] Event has 59423876 TPC Clusters, 0 TRD Tracklets
[15631:tpc-tracker]: [13:54:52][INFO] found 581098 track(s)

which obviously leads also to different compressed clusters.

Is this the non-reproducibility you were mentioning? I thought it referred only to the order of the tracks.

I also see that writing TPC CTF from reconstruction with GPU_proc.ompThreads=4 and with external dictionary leads to plenty of "literal" symbols written, even though the dictionary is created with the same TF:

[11157:tpc-entropy-encoder]: [13:45:33][INFO] TPC_CTF: Container of 23 blocks, size: 626774480 bytes, unused: 368783648
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 0 for 29032074 message words | NDictWords: 0 NDataWords: 5145441 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 1 for 29032074 message words | NDictWords: 0 NDataWords: 4896028 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 2 for 29032074 message words | NDictWords: 0 NDataWords: 975850 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 3 for 28564047 message words | NDictWords: 0 NDataWords: 934752 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 4 for 28564047 message words | NDictWords: 0 NDataWords: 51342 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 5 for 28564047 message words | NDictWords: 0 NDataWords: 6370701 NLiteralWords: 6
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 6 for 28564047 message words | NDictWords: 0 NDataWords: 6651347 NLiteralWords: 38
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 7 for 29032074 message words | NDictWords: 0 NDataWords: 2857513 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 8 for 29032074 message words | NDictWords: 0 NDataWords: 2553285 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 9 for 468027 message words | NDictWords: 0 NDataWords: 109811 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 10 for 468027 message words | NDictWords: 0 NDataWords: 69787 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 11 for 468027 message words | NDictWords: 0 NDataWords: 75662 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 12 for 468027 message words | NDictWords: 0 NDataWords: 262282 NLiteralWords: 207
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 13 for 468027 message words | NDictWords: 0 NDataWords: 165609 NLiteralWords: 2
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 14 for 26080375 message words | NDictWords: 0 NDataWords: 4812140 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 15 for 26080375 message words | NDictWords: 0 NDataWords: 4488208 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 16 for 26080375 message words | NDictWords: 0 NDataWords: 1237310 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 17 for 26080375 message words | NDictWords: 0 NDataWords: 10714339 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 18 for 26080375 message words | NDictWords: 0 NDataWords: 6875439 NLiteralWords: 4088
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 19 for 26080375 message words | NDictWords: 0 NDataWords: 2646969 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 20 for 26080375 message words | NDictWords: 0 NDataWords: 2494846 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 21 for 468027 message words | NDictWords: 0 NDataWords: 101568 NLiteralWords: 0
[11157:tpc-entropy-encoder]: [13:45:33][INFO] Block 22 for 5472 message words | NDictWords: 0 NDataWords: 1900 NLiteralWords: 770

@davidrohr
Collaborator

@shahor02 : Yes, that is what I meant. You can use --configKeyValues "GPU_proc.ompKernels=0" even with "GPU_proc.ompThreads=4", and then it will run with OpenMP only where the results are reproducible.
In that case, you should get the same number of tracks and also no literal symbols.

For reference, I have placed a ctf dictionary generated from 28 time frames here: https://qon.jwdt.org/nmls/tmp/ctf_dictionary.root

@MichaelLettrich MichaelLettrich requested a review from a team as a code owner August 12, 2020 13:13
@MichaelLettrich
Collaborator Author

Fixed the comparison @shahor02 mentioned. From my side, this can be merged after the tests pass.

@shahor02
Collaborator

@davidrohr actually, your file is corrupted:

Attaching file /home/shahoian/Downloads/ctf_dictionary.root as _file0...
Warning in <TFile::Init>: file /home/shahoian/Downloads/ctf_dictionary.root probably not closed, trying to recover
Warning in <TFile::Init>: no keys recovered, file has been made a Zombie
(TFile *) nullptr

Might it be that you've managed to hit ^C exactly at the moment it was being updated at next autosave?

@davidrohr
Collaborator

@shahor02 : Indeed, the file is corrupted, and basically all files I am creating now are corrupted.
I checked the file I created yesterday without the latest changes, and that file was ok.

What I see is: After the number of iterations defined via --save-dict-after the ctf writer writes the dictionary, but it doesn't close the file. Then I stop the sender, and press ctrl-c in the workflow, after all data has been fully processed and no new data is coming in.
Then, all processes terminate except for the o2-ctf-writer-workflow which is apparently stuck. Probably it still has the dictionary file open, and the file gets corrupted when the process is killed eventually.

@shahor02
Collaborator

I've just tried with this PR (in a raw-reader driven workflow) and for me everything works. The dictionary file is explicitly closed after each autosave. Are you trying with this PR or with a mixture of different branches?

@davidrohr
Collaborator

davidrohr commented Aug 12, 2020 via email

@davidrohr
Collaborator

@shahor02 : The problem apparently depends on the data. Using the small dataset I simulated yesterday, it works.
But with my older dataset (~1 week old), even if I take only a single time frame, it gets stuck while writing the TOF dictionary. I attach the console output of the ctf-writer below. You can see that the last info message is about TOF, and then it gets stuck and doesn't even go to FT0, which should come afterwards. When I remove TOF from the --only-det option, a correct CTF file is created also with the old dataset.

I am just wondering, shouldn't the writing of the dictionary be independent of the dataset (even if some data was corrupted in there)? Or is this the bug you fixed recently, and I just need to recreate my large dataset?

On another note: I fooled myself trying to create a new dictionary while a ctf_dictionary.root file was present, since the encoders would pick up the dictionary from there, and then they don't send a dictionary at all, and the ctf-writer stores an empty dictionary. Perhaps in that case one should throw a fatal error to make clear something went wrong.

Output from ctf-writer:

[97773:ctf-writer]: [19:01:56][INFO] ITS: Container of 10 blocks, size: 33342256 bytes, unused: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 0 for 718 message words | NDictWords: 85 NDataWords: 57 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 1 for 718 message words | NDictWords: 397 NDataWords: 15 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 2 for 718 message words | NDictWords: 2 NDataWords: 15 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 3 for 718 message words | NDictWords: 121371 NDataWords: 209 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 4 for 3316448 message words | NDictWords: 343 NDataWords: 295194 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 5 for 3316448 message words | NDictWords: 365 NDataWords: 213274 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 6 for 10377521 message words | NDictWords: 512 NDataWords: 2917560 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 7 for 10377521 message words | NDictWords: 65536 NDataWords: 2773541 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 8 for 10377521 message words | NDictWords: 259 NDataWords: 1466580 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 9 for 2579247 message words | NDictWords: 256 NDataWords: 479782 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] MFT: Container of 10 blocks, size: 14665728 bytes, unused: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 0 for 718 message words | NDictWords: 507 NDataWords: 75 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 1 for 718 message words | NDictWords: 397 NDataWords: 15 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 2 for 718 message words | NDictWords: 2 NDataWords: 15 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 3 for 718 message words | NDictWords: 50110 NDataWords: 196 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 4 for 391676 message words | NDictWords: 65536 NDataWords: 34164 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 5 for 391676 message words | NDictWords: 243 NDataWords: 57309 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 6 for 4380287 message words | NDictWords: 512 NDataWords: 1231329 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 7 for 4380287 message words | NDictWords: 65536 NDataWords: 999425 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 8 for 4380287 message words | NDictWords: 2048 NDataWords: 5 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 9 for 13320206 message words | NDictWords: 256 NDataWords: 1158537 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] TPC: Container of 23 blocks, size: 902484016 bytes, unused: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 0 for 80692468 message words | NDictWords: 16385 NDataWords: 14424501 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 1 for 80692468 message words | NDictWords: 961 NDataWords: 13729483 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 2 for 80692468 message words | NDictWords: 8 NDataWords: 2637268 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 3 for 79439947 message words | NDictWords: 152 NDataWords: 2561015 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 4 for 79439947 message words | NDictWords: 36 NDataWords: 149194 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 5 for 79439947 message words | NDictWords: 65536 NDataWords: 17559406 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 6 for 79439947 message words | NDictWords: 16777216 NDataWords: 18418378 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 7 for 80692468 message words | NDictWords: 57 NDataWords: 7917737 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 8 for 80692468 message words | NDictWords: 57 NDataWords: 7079822 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 9 for 1252521 message words | NDictWords: 255 NDataWords: 293348 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 10 for 1252521 message words | NDictWords: 152 NDataWords: 181246 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 11 for 1252521 message words | NDictWords: 36 NDataWords: 202358 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 12 for 1252521 message words | NDictWords: 2289967 NDataWords: 762110 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 13 for 1252521 message words | NDictWords: 8769 NDataWords: 443263 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 14 for 80207328 message words | NDictWords: 16385 NDataWords: 14821231 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 15 for 80207328 message words | NDictWords: 961 NDataWords: 13813616 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 16 for 80207328 message words | NDictWords: 8 NDataWords: 3774110 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 17 for 80207328 message words | NDictWords: 65536 NDataWords: 32981137 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 18 for 80207328 message words | NDictWords: 16777170 NDataWords: 21721466 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 19 for 80207328 message words | NDictWords: 65 NDataWords: 8118842 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 20 for 80207328 message words | NDictWords: 65 NDataWords: 7711403 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 21 for 1252521 message words | NDictWords: 157 NDataWords: 273273 NLiteralWords: 0
[97773:ctf-writer]: [19:01:56][INFO] Block 22 for 5472 message words | NDictWords: 24422 NDataWords: 1976 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] TOF: Container of 10 blocks, size: 2077168 bytes, unused: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 0 for 384 message words | NDictWords: 1189 NDataWords: 15 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 1 for 384 message words | NDictWords: 2 NDataWords: 13 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 2 for 384 message words | NDictWords: 20896 NDataWords: 51 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 3 for 384 message words | NDictWords: 1 NDataWords: 5 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 4 for 690413 message words | NDictWords: 19 NDataWords: 441 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 5 for 690413 message words | NDictWords: 64908 NDataWords: 58190 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 6 for 690413 message words | NDictWords: 1638 NDataWords: 229309 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 7 for 690413 message words | NDictWords: 96 NDataWords: 142061 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 8 for 690413 message words | NDictWords: 247 NDataWords: 5 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 9 for 0 message words | NDictWords: 0 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] FT0: Container of 8 blocks, size: 477360 bytes, unused: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 0 for 40734 message words | NDictWords: 32 NDataWords: 2028 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 1 for 40734 message words | NDictWords: 105 NDataWords: 1145 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 2 for 40734 message words | NDictWords: 2 NDataWords: 30 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 3 for 40734 message words | NDictWords: 209 NDataWords: 3555 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 4 for 168388 message words | NDictWords: 256 NDataWords: 31301 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 5 for 168388 message words | NDictWords: 74 NDataWords: 5281 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 6 for 168388 message words | NDictWords: 1791 NDataWords: 35460 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 7 for 168388 message words | NDictWords: 4096 NDataWords: 33812 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] TF#0 CTF writing is disabled
[97773:ctf-writer]: [19:01:57][INFO] bl 0 sz= 85 min/max 0/84
[97773:ctf-writer]: [19:01:57][INFO] bl 1 sz= 397 min/max 0/396
[97773:ctf-writer]: [19:01:57][INFO] bl 2 sz= 2 min/max 0/1
[97773:ctf-writer]: [19:01:57][INFO] bl 3 sz= 121371 min/max 0/121370
[97773:ctf-writer]: [19:01:57][INFO] bl 4 sz= 343 min/max 0/342
[97773:ctf-writer]: [19:01:57][INFO] bl 5 sz= 365 min/max 0/364
[97773:ctf-writer]: [19:01:57][INFO] bl 6 sz= 512 min/max 0/511
[97773:ctf-writer]: [19:01:57][INFO] bl 7 sz= 65536 min/max 0/65535
[97773:ctf-writer]: [19:01:57][INFO] bl 8 sz= 259 min/max 0/258
[97773:ctf-writer]: [19:01:57][INFO] bl 9 sz= 256 min/max 0/255
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 85 for block 0
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 397 for block 1
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 2 for block 2
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 121371 for block 3
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 343 for block 4
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 365 for block 5
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 512 for block 6
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 65536 for block 7
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 259 for block 8
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 256 for block 9
[97773:ctf-writer]: [19:01:57][INFO] Storing dictionary for ITS: Container of 10 blocks, size: 757344 bytes, unused: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 0 for 0 message words | NDictWords: 85 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 1 for 0 message words | NDictWords: 397 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 2 for 0 message words | NDictWords: 2 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 3 for 0 message words | NDictWords: 121371 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 4 for 0 message words | NDictWords: 343 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 5 for 0 message words | NDictWords: 365 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 6 for 0 message words | NDictWords: 512 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 7 for 0 message words | NDictWords: 65536 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 8 for 0 message words | NDictWords: 259 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 9 for 0 message words | NDictWords: 256 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] bl 0 sz= 507 min/max 0/506
[97773:ctf-writer]: [19:01:57][INFO] bl 1 sz= 397 min/max 0/396
[97773:ctf-writer]: [19:01:57][INFO] bl 2 sz= 2 min/max 0/1
[97773:ctf-writer]: [19:01:57][INFO] bl 3 sz= 50110 min/max 0/50109
[97773:ctf-writer]: [19:01:57][INFO] bl 4 sz= 65536 min/max 0/65535
[97773:ctf-writer]: [19:01:57][INFO] bl 5 sz= 243 min/max 0/242
[97773:ctf-writer]: [19:01:57][INFO] bl 6 sz= 512 min/max 0/511
[97773:ctf-writer]: [19:01:57][INFO] bl 7 sz= 65536 min/max 0/65535
[97773:ctf-writer]: [19:01:57][INFO] bl 8 sz= 2048 min/max 0/2047
[97773:ctf-writer]: [19:01:57][INFO] bl 9 sz= 256 min/max 0/255
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 507 for block 0
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 397 for block 1
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 2 for block 2
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 50110 for block 3
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 65536 for block 4
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 243 for block 5
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 512 for block 6
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 65536 for block 7
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 2048 for block 8
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 256 for block 9
[97773:ctf-writer]: [19:01:57][INFO] Storing dictionary for MFT: Container of 10 blocks, size: 741408 bytes, unused: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 0 for 0 message words | NDictWords: 507 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 1 for 0 message words | NDictWords: 397 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 2 for 0 message words | NDictWords: 2 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 3 for 0 message words | NDictWords: 50110 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 4 for 0 message words | NDictWords: 65536 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 5 for 0 message words | NDictWords: 243 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 6 for 0 message words | NDictWords: 512 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 7 for 0 message words | NDictWords: 65536 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 8 for 0 message words | NDictWords: 2048 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 9 for 0 message words | NDictWords: 256 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] bl 0 sz= 16385 min/max 0/16384
[97773:ctf-writer]: [19:01:57][INFO] bl 1 sz= 961 min/max 0/960
[97773:ctf-writer]: [19:01:57][INFO] bl 2 sz= 8 min/max 0/7
[97773:ctf-writer]: [19:01:57][INFO] bl 3 sz= 152 min/max 0/151
[97773:ctf-writer]: [19:01:57][INFO] bl 4 sz= 36 min/max 0/35
[97773:ctf-writer]: [19:01:57][INFO] bl 5 sz= 65536 min/max 0/65535
[97773:ctf-writer]: [19:01:57][INFO] bl 6 sz= 16777216 min/max 0/16777215
[97773:ctf-writer]: [19:01:57][INFO] bl 7 sz= 57 min/max 0/56
[97773:ctf-writer]: [19:01:57][INFO] bl 8 sz= 57 min/max 0/56
[97773:ctf-writer]: [19:01:57][INFO] bl 9 sz= 255 min/max 0/254
[97773:ctf-writer]: [19:01:57][INFO] bl 10 sz= 152 min/max 0/151
[97773:ctf-writer]: [19:01:57][INFO] bl 11 sz= 36 min/max 0/35
[97773:ctf-writer]: [19:01:57][INFO] bl 12 sz= 2289967 min/max 0/2289966
[97773:ctf-writer]: [19:01:57][INFO] bl 13 sz= 8769 min/max 0/8768
[97773:ctf-writer]: [19:01:57][INFO] bl 14 sz= 16385 min/max 0/16384
[97773:ctf-writer]: [19:01:57][INFO] bl 15 sz= 961 min/max 0/960
[97773:ctf-writer]: [19:01:57][INFO] bl 16 sz= 8 min/max 0/7
[97773:ctf-writer]: [19:01:57][INFO] bl 17 sz= 65536 min/max 0/65535
[97773:ctf-writer]: [19:01:57][INFO] bl 18 sz= 16777170 min/max 0/16777169
[97773:ctf-writer]: [19:01:57][INFO] bl 19 sz= 65 min/max 0/64
[97773:ctf-writer]: [19:01:57][INFO] bl 20 sz= 65 min/max 0/64
[97773:ctf-writer]: [19:01:57][INFO] bl 21 sz= 157 min/max 0/156
[97773:ctf-writer]: [19:01:57][INFO] bl 22 sz= 24422 min/max 0/24421
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 16385 for block 0
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 961 for block 1
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 8 for block 2
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 152 for block 3
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 36 for block 4
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 65536 for block 5
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 16777216 for block 6
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 57 for block 7
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 57 for block 8
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 255 for block 9
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 152 for block 10
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 36 for block 11
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 2289967 for block 12
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 8769 for block 13
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 16385 for block 14
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 961 for block 15
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 8 for block 16
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 65536 for block 17
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 16777170 for block 18
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 65 for block 19
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 65 for block 20
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 157 for block 21
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 24422 for block 22
[97773:ctf-writer]: [19:01:57][INFO] Storing dictionary for TPC: Container of 23 blocks, size: 144179288 bytes, unused: 18446744073709551608
[97773:ctf-writer]: [19:01:57][INFO] Block 0 for 0 message words | NDictWords: 16385 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 1 for 0 message words | NDictWords: 961 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 2 for 0 message words | NDictWords: 8 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 3 for 0 message words | NDictWords: 152 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 4 for 0 message words | NDictWords: 36 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 5 for 0 message words | NDictWords: 65536 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 6 for 0 message words | NDictWords: 16777216 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 7 for 0 message words | NDictWords: 57 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 8 for 0 message words | NDictWords: 57 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 9 for 0 message words | NDictWords: 255 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 10 for 0 message words | NDictWords: 152 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 11 for 0 message words | NDictWords: 36 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 12 for 0 message words | NDictWords: 2289967 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 13 for 0 message words | NDictWords: 8769 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 14 for 0 message words | NDictWords: 16385 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 15 for 0 message words | NDictWords: 961 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 16 for 0 message words | NDictWords: 8 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 17 for 0 message words | NDictWords: 65536 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 18 for 0 message words | NDictWords: 16777170 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 19 for 0 message words | NDictWords: 65 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 20 for 0 message words | NDictWords: 65 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 21 for 0 message words | NDictWords: 157 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 22 for 0 message words | NDictWords: 24422 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] bl 0 sz= 1189 min/max 0/1188
[97773:ctf-writer]: [19:01:57][INFO] bl 1 sz= 2 min/max 0/1
[97773:ctf-writer]: [19:01:57][INFO] bl 2 sz= 20896 min/max 0/20895
[97773:ctf-writer]: [19:01:57][INFO] bl 3 sz= 1 min/max 0/0
[97773:ctf-writer]: [19:01:57][INFO] bl 4 sz= 19 min/max 0/18
[97773:ctf-writer]: [19:01:57][INFO] bl 5 sz= 64908 min/max 0/64907
[97773:ctf-writer]: [19:01:57][INFO] bl 6 sz= 1638 min/max 0/1637
[97773:ctf-writer]: [19:01:57][INFO] bl 7 sz= 96 min/max 0/95
[97773:ctf-writer]: [19:01:57][INFO] bl 8 sz= 247 min/max 0/246
[97773:ctf-writer]: [19:01:57][INFO] bl 9 sz= 0 min/max 0/0
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 1189 for block 0
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 2 for block 1
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 20896 for block 2
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 1 for block 3
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 19 for block 4
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 64908 for block 5
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 1638 for block 6
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 96 for block 7
[97773:ctf-writer]: [19:01:57][INFO] adding dict of size 247 for block 8
[97773:ctf-writer]: [19:01:57][INFO] Storing dictionary for TOF: Container of 10 blocks, size: 356808 bytes, unused: 18446744073709551608
[97773:ctf-writer]: [19:01:57][INFO] Block 0 for 0 message words | NDictWords: 1189 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 1 for 0 message words | NDictWords: 2 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 2 for 0 message words | NDictWords: 20896 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 3 for 0 message words | NDictWords: 1 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 4 for 0 message words | NDictWords: 19 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 5 for 0 message words | NDictWords: 64908 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 6 for 0 message words | NDictWords: 1638 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 7 for 0 message words | NDictWords: 96 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 8 for 0 message words | NDictWords: 247 NDataWords: 0 NLiteralWords: 0
[97773:ctf-writer]: [19:01:57][INFO] Block 9 for 0 message words | NDictWords: 0 NDataWords: 0 NLiteralWords: 0

@shahor02
Collaborator

@davidrohr The dictionary depends on the dataset, so if some dataset triggers a bug ...
Looks like there is a memory overflow (should be alignment related): after finalization the CTF should report "unused: 0", while in your log I see a "negative" value, 18446744073709551608, which should be the reason.

[97773:ctf-writer]: [19:01:57][INFO] Storing dictionary for TPC: Container of 23 blocks, size: 144179288 bytes, unused: 18446744073709551608
[97773:ctf-writer]: [19:01:57][INFO] Block 0 for 0 message words | NDictWords: 16385 NDataWords: 0 NLiteralWords: 0
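
For readers wondering why the value is called "negative": 18446744073709551608 equals 2^64 - 8, which is what an unsigned 64-bit subtraction yields when the result would be -8. The sketch below only illustrates that arithmetic. `unusedBytes` and `signedUnused` are hypothetical helpers, not O2 functions, and the "used" figure in the check is an assumption chosen to reproduce the logged number; how the container actually computes its unused space is not shown in this thread.

```cpp
#include <cstdint>

// Hypothetical helper (not the O2 API): free space computed with unsigned arithmetic.
// If `used` exceeds `capacity`, the result wraps modulo 2^64 instead of going negative.
constexpr std::uint64_t unusedBytes(std::uint64_t capacity, std::uint64_t used)
{
  return capacity - used;
}

// Reinterpreting the wrapped value as signed reveals the actual overshoot.
constexpr std::int64_t signedUnused(std::uint64_t capacity, std::uint64_t used)
{
  return static_cast<std::int64_t>(capacity - used);
}

// Exactly full: this is the "unused: 0" expected after finalization.
static_assert(unusedBytes(144179288, 144179288) == 0, "full container");
// Assumed 8-byte overshoot: reproduces the number seen in the log.
static_assert(unusedBytes(144179288, 144179296) == 18446744073709551608ULL, "wraps to 2^64 - 8");
static_assert(signedUnused(144179288, 144179296) == -8, "8 bytes past the end");
```

Read this way, the logged value is consistent with the container being overrun by 8 bytes, which fits the suspicion that alignment is involved.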

I know more or less where to look. How can I rerun the job that created this log?

Concerning the protection against using an existing ctf_dictionary when its creation is requested: I will throw an exception from the ctf-writer init method if dictionary creation is requested and a dictionary is already found in the working dir.

@davidrohr
Collaborator

@shahor02 : Rerunning the exact data is a bit complicated since it needs DataDistribution, but I believe the sim2 dataset (I think you used it before) triggers the error as well, at least I am getting a segfault in the ctf writer.

Could you try test.sh in /data/drohr/sim2 on the Frankfurt server? (You'll have to replace HIP by CPU as device type in the script if you didn't compile O2 with HIP support).

@davidrohr
Collaborator

@shahor02 : Will you debug this now? If so, I'll stop testing on the server until you are done.

@shahor02
Collaborator

@davidrohr at the moment I am trying to debug it locally, I'll let you know if/when I have something to test on the server.

@shahor02
Collaborator

@davidrohr I've reproduced the problem locally and fixed it; it was a stupid bug... I will not need to run on the server.

@davidrohr
Collaborator

ok, thx, will you open a PR?

@shahor02
Collaborator

@davidrohr #4139 should fix the problem.

Collaborator

@davidrohr davidrohr left a comment

Failures seem to be unrelated, some issues during QC compilation related to RDH usage in EMCAL.

@shahor02
Collaborator

@davidrohr yes, EMCal has suppressed some classes their QC depends on: #4128 (comment). You decide if you want to wait for the GPU CI.

@mfasDa
Collaborator

mfasDa commented Aug 12, 2020

The fix adapting the QC for EMCAL to changes in O2 was merged in QC several hours ago. In principle a new CI build should detect it.

@davidrohr davidrohr merged commit f49d9a3 into AliceO2Group:dev Aug 12, 2020
@davidrohr
Collaborator

@mfasDa : EMCAL is still failing in QC (see e.g. #4140) with the same problem. Do we perhaps need a new tag or so?

@mfasDa
Collaborator

mfasDa commented Aug 13, 2020

@davidrohr Probably, the tag of QC was just merged this morning.

@ktf
Member

ktf commented Aug 13, 2020

@knopers8 can you have a look? The tests in alidist were passing, though.

@davidrohr
Collaborator

I think @mfasDa was correct, the new tag was only bumped in alidist this morning, so I'd ignore failing tests from tonight, and see if it still fails with new PRs.

@knopers8
Collaborator

I had a look around other PRs in O2; the failed builds were started before the bump of QC.

6 participants