Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception instead of LogError in DeepTauId #40733

Open
VinInn opened this issue Feb 9, 2023 · 23 comments
Open

Exception instead of LogError in DeepTauId #40733

VinInn opened this issue Feb 9, 2023 · 23 comments

Comments

@VinInn
Copy link
Contributor

VinInn commented Feb 9, 2023

here
https://cmssdt.cern.ch/dxr/CMSSW/source/RecoTauTag/RecoTau/plugins/DeepTauId.cc#1296
a LogError should be issued not what is an effective job abort

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 9, 2023

A new Issue was created by @VinInn Vincenzo Innocente.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

makortel commented Feb 9, 2023

assign reconstruction

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 9, 2023

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

makortel commented Feb 9, 2023

Duplicate of #40437 and #28358

I leave it to @cms-sw/reconstruction-l2 to decide if continue the discussion in #40437 or here

@mandrenguyen
Copy link
Contributor

@mbluj @azotz Can you please have a look?

@mandrenguyen
Copy link
Contributor

type tau

@cmsbuild cmsbuild added the tau label Feb 9, 2023
@mbluj
Copy link
Contributor

mbluj commented Feb 9, 2023

Value of pred variable should lie in [0,1] range by its construction. A value outside of this range is a sign of some processing issue, highly probably some memory corruption as repored in #40437 and #28358. Unfortunately the issue is non-reproducible.
Anyway, a simple replacement of an exception by a LogError seems not be the best solution as it can cause that a nonsense value (possibly NaN) is passed to further processing. If it is mandatory to issue a LogError one should also set pred to some safe default (-1, 0?) which will effectively mean that a tau candidate with this value of pred is rejected at analysis level. @kandrosov what do you think?

@kandrosov
Copy link
Contributor

Yes, I have commented in #28358, if the condition do not satisfied it means that there has been some memory corruption, so any computations after that point shouldn't be trusted. Therefore, simply reporting in LogError is not sufficient. It was confirmed in some private productions that after first event with NaN, if exception is not thrown, all consecutive events will have NaNs in deepTau scores.

If you consider aborting the whole CMSSW for that is an overkill, possible solution is to disable modules that can be potentially affected by this memory corruption. E.g. after the condition above is occurred all Tau-related modules are disabled for all consecutive events and instead of tau collection some flag that indicates that tau sequences are disabled is stored in the output root file. However, I don't know if such mechanisms is implemented within CMSSW.

@VinInn
Copy link
Contributor Author

VinInn commented Feb 9, 2023

it would be useful to have full details of the Hardware were the error occurred.
If there is a real hardware issue and nothing specifically related to the code used by DeepTau it could affect many other parts of any kind of job running there.

We may have to develop and deploy some sort of ad-hoc test or just ask site to regularly run memory-checks.

@VinInn
Copy link
Contributor Author

VinInn commented Feb 11, 2023

My understanding is that TensorFlow is used in DeepTauId.
TnsorFlow is a fat library: uses different code on different hardware: I wonder if there has been any test to compare its result on a sse only machine, avx, avx2 and axv512.
If yes, which SSE-only machine has been used?

@kandrosov
Copy link
Contributor

@VinInn hm.. I just go one of these errors in some of my private nano production jobs. It seems that NaN conditions is now occurring more frequently wrt to the past. Here is an example log file. Do you think we can extract something useful from it about the hardware?

One interesting observation: It was a crab job with a custom executable that consecutively runs CMSSW instances for each input file. This particular job had two input files and for both cmssw crashed on the first event due to the NaN problem. So the behaviour seems to be persistent on that worker node.

@VinInn
Copy link
Contributor Author

VinInn commented Feb 15, 2023

HERE WE GO.

== CMSSW: 2023-02-15 13:57:17.162144: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2

I bet we never tested on sse-only machines....
My suspicion is wrong unaligned load...

@makortel
Copy link
Contributor

The log happens to contain also the exact processor model

Current processor: model name	: Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz

in case that would be useful.

@VinInn
Copy link
Contributor Author

VinInn commented Feb 15, 2023

I failed to find a sse-only machine at cern.
If anybody can get her hands on one of those would be nice to run valgrind and/or other "memory" tools

@kandrosov
Copy link
Contributor

I found one node with Intel(R) Xeon(R) CPU X5650 @ 2.67GHz. It should also be SSE-only.
I run few times nano sequence on the same input file as in the log above, but it always runs successfully without any sign of NaNs...
Here is a valgrind output.

@VinInn
Copy link
Contributor Author

VinInn commented Feb 16, 2023

@kandrosov : would it be easy for you to check that the results of deepTauId is identical on this machine and on a AVX2/AVX512 one?

@VinInn
Copy link
Contributor Author

VinInn commented Feb 16, 2023

it's plenty of

  <what>Conditional jump or move depends on uninitialised value(s)</what>
  <stack>
    <frame>
      <ip>0x5174599</ip>
      <obj>/cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-bf41b0420bc269850b74e23486e2953a/lib/libHist.so</obj>
      <fn>TStorage::UpdateIsOnHeap(unsigned int const volatile&amp;, unsigned int volatile&amp;)</fn>

is it a false positive?

the rest is sometimes scary but clearly in python or cling...

@Dr15Jones
Copy link
Contributor

It’s a false positive. In the ROOT release area they have a filter file one can use with valgrind to ignore those messages.

@VinInn
Copy link
Contributor Author

VinInn commented Feb 16, 2023

I'm running

valgrind --tool=memcheck --leak-check=full --suppressions=${ROOTSYS}/etc/valgrind-root.supp

and btw I also see many

==78542== Conditional jump or move depends on uninitialised value(s)
==78542==    at 0x52ED3C34: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)
==78542== Use of uninitialised value of size 8
==78542==    at 0x6DB543F: sinf (in /usr/lib64/libm-2.17.so)
==78542==    by 0x52ED2ECA: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)
==78542== Conditional jump or move depends on uninitialised value(s)
==78542==    at 0x64F7A91: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/xz/5.2.5-d6fed2038c4e8d6e04531d1adba59f37/lib/liblzma.so.5.2.5)

some improper delete

==78542== Invalid free() / delete / delete[] / realloc()
==78542==    at 0x4030F5C: free (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/valgrind/3.17.0-12ba960b086b024cce391ad228855798/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==78542==    by 0x52EC1311: mkfit::ExecutionContext::~ExecutionContext() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)

and a large number of memory leaks in tensorflow....

@slava77
Copy link
Contributor

slava77 commented Feb 16, 2023

and btw I also see many

==78542== Conditional jump or move depends on uninitialised value(s)
==78542==    at 0x52ED3C34: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)

IIUC, this was fixed in #40105 in 12_6_X (merged in pre5, apparently)
At least the fixed cases were harmless and were happening in parts of the fixed size loop that were discarded downstream.

@VinInn
Copy link
Contributor Author

VinInn commented Feb 16, 2023

Thanks @slava77 , I was suspecting something like that (and sorry if I'm still runnning and old benchmark (hepix benchmark))

@slava77
Copy link
Contributor

slava77 commented Feb 21, 2023

the nan in deepTau could be blocking the rereco of 2022
https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022B-DoubleMuonLowMass-10Dec2022-00001

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants