[rANS] Memory optimized sparse SymbolTable#4125
Conversation
shahor02
left a comment
There was a problem hiding this comment.
@MichaelLettrich thanks, please rebase to dev and see comment below.
@davidrohr this PR will supersede the #4123 but will require dictionaries regenerated, I'll close mine once this one is ready to merge.
There was a problem hiding this comment.
here there should be if (idx >= 0 && idx < mIndex.size())
But I think one can have just 1 if by doing
uint64_t idx = static_cast<uin64_t>(index - mMin);
return (idx<mIndex.size()) ? *(mIndex[idx]) : *mEscapeSymbol;
There was a problem hiding this comment.
you forgot return. But why not using just one if comparison instead of 2?
There was a problem hiding this comment.
Because the symbol table stores symbols in the range [mMin,mMax], so we have to treat both cases for a rare symbol being > mMax or <mMin, which requires two comparisons.
There was a problem hiding this comment.
you compare not mMin<=symbol<=mMax but idx=index - mMin being <0 or >size. If you static_cast negative idx to unsigned, it will be > MAX_INT, so the comparison static_cast<uin64_t>(index - mMin)<mIndex.size()) ... will cover both cases.
|
OK, I'll leave the pregenerated dictionary stuff out for now and wait for this to be merged. In the meantime I am anyway working on the crash mostly. |
56e1f5a to
d1b5dfe
Compare
84497bb to
a6739ce
Compare
@davidrohr @shahor02 : Dictionaries stay valid. The only difference is how the symbol tables are generated from it. |
|
@MichaelLettrich : Are all issues fixed in the PR now, i.e. shall I try it out? |
|
@davidrohr I would be very happy to see how it behaves in production, yes :) |
|
ok, I tried both generating and saving the dictionary as well as running with the pregenerated dictionary. |
|
@davidrohr I've tested reconstruction from CTF (on 1 TF made of 130 pbpb collisions), looks ok. vs which obviously leads also to different compressed clusters. Is this the non-reproducibility you were mentioning? I thought it refers only on the order of tracks. I also see that writing TPC CTF from reconstruction with |
|
@shahor02 : Yes, that is what I meant. You can use --configKeyValues "GPU_proc.ompKernels=0" even with "GPU_proc.ompThreads=4", and then it will run with OpenMP only where the results are reproducible. For reference, I have placed a ctf dictionary generated from 28 time frames here: https://qon.jwdt.org/nmls/tmp/ctf_dictionary.root |
Optimized, sparse storage format for SymbolTable. Reduces storage requirements by up to factor 5x and optimizes cache performance of encoder/ decoder.
a6739ce to
9de6786
Compare
|
Fixed the comparison @shahor02 mentioned. can be merged after tests pass from my side. |
|
@davidrohr actually, your file is corrupted: Might it be that you've managed to hit |
|
@shahor02 : Indeed, the file is corrupted, and basically all files I am creating now are corrupted. What I see is: After the number of iterations defined via |
|
I've just tried with this PR (in raw-reader driven workflow) and for me everything works. The dictionary file explicitly is closed after each autosave. Are you trying with this PR or mixture of different branches? |
|
A mixture... I'll try again later. Schould be easy to add some debug output and understand where it gets stuck.
|
|
@shahor02 : The problem apparently depends on the data. Using the small dataset I simulated yesterday, it works. I am just wondering, shouldn't the writing of the dictionary be independent of the dataset (even if some data was corrupted in there?) Or is this the bug you fixed recently, and I just need to recreate my large dataset. On another note: I fooled myself trying to create a new dictionary while a Output from ctf-writer: |
|
@davidrohr The dictionary depends on dataset, so if some dataset triggers a bug ... I know more or less where to look, how can I rerun the job which created this log? Concerning the protection against using existing |
|
@shahor02 : Rerunning the exact data is a bit complicated since it needs DataDistribution, but I believe the sim2 dataset (I think you used it before) triggers the error as well, at least I am getting a segfault in the ctf writer. Could you try test.sh in /data/drohr/sim2 on the Frankfurt server? (You'll have to replace HIP by CPU as device type in the script if you didn't compile O2 with HIP support). |
|
@shahor02 : Will you debug this now, then I'll stop testing on the server until you are done. |
|
@davidrohr at the moment I am trying to debug it locally, I'll let you know if/when I have something to test on the server. |
|
@davidrohr I've reproduced the problem locally and fixed it, was stupid bug... Will not need to run on the server. |
|
ok, thx, will you open a PR? |
|
@davidrohr #4139 should fix the problem. |
davidrohr
left a comment
There was a problem hiding this comment.
Failures seem to be unrelated, some issues during QC compilation related to RDH usage in EMCAL.
|
@davidrohr yes, EMCal has suppressed some classes their QC depends on: #4128 (comment), you decide if yo want to wait for gpu CI. |
|
The fix adapting the QC for EMCAL to changes in O2 was merged in QC several hours ago. In principle a new CI build should detect it. |
|
@davidrohr Probably, the tag of QC was just merged this morning. |
|
@knopers8 can you have a look? The tests in alidist were passing, though. |
|
I think @mfasDa was correct, the new tag was only bumped in alidist this morning, so I'd ignore failing tests from tonight, and see if it still fails with new PRs. |
|
I had a look around other PRs in O2, the failed builds were started before the bump of QC |
Optimized, sparse storage format for SymbolTable. Reduces storage
requirements by up to factor 5x and optimizes cache performance of
encoder/ decoder.