Exception instead of LogError in DeepTauId #40733

VinInn · 2023-02-09T08:20:30Z

here
https://cmssdt.cern.ch/dxr/CMSSW/source/RecoTauTag/RecoTau/plugins/DeepTauId.cc#1296
a LogError should be issued not what is an effective job abort

cmsbuild · 2023-02-09T08:20:53Z

A new Issue was created by @VinInn Vincenzo Innocente.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel · 2023-02-09T14:17:06Z

assign reconstruction

cmsbuild · 2023-02-09T14:17:25Z

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel · 2023-02-09T14:17:48Z

Duplicate of #40437 and #28358

I leave it to @cms-sw/reconstruction-l2 to decide if continue the discussion in #40437 or here

mandrenguyen · 2023-02-09T14:47:28Z

@mbluj @azotz Can you please have a look?

mandrenguyen · 2023-02-09T14:47:39Z

type tau

mbluj · 2023-02-09T15:37:02Z

Value of pred variable should lie in [0,1] range by its construction. A value outside of this range is a sign of some processing issue, highly probably some memory corruption as repored in #40437 and #28358. Unfortunately the issue is non-reproducible.
Anyway, a simple replacement of an exception by a LogError seems not be the best solution as it can cause that a nonsense value (possibly NaN) is passed to further processing. If it is mandatory to issue a LogError one should also set pred to some safe default (-1, 0?) which will effectively mean that a tau candidate with this value of pred is rejected at analysis level. @kandrosov what do you think?

kandrosov · 2023-02-09T16:35:59Z

Yes, I have commented in #28358, if the condition do not satisfied it means that there has been some memory corruption, so any computations after that point shouldn't be trusted. Therefore, simply reporting in LogError is not sufficient. It was confirmed in some private productions that after first event with NaN, if exception is not thrown, all consecutive events will have NaNs in deepTau scores.

If you consider aborting the whole CMSSW for that is an overkill, possible solution is to disable modules that can be potentially affected by this memory corruption. E.g. after the condition above is occurred all Tau-related modules are disabled for all consecutive events and instead of tau collection some flag that indicates that tau sequences are disabled is stored in the output root file. However, I don't know if such mechanisms is implemented within CMSSW.

VinInn · 2023-02-09T18:18:39Z

it would be useful to have full details of the Hardware were the error occurred.
If there is a real hardware issue and nothing specifically related to the code used by DeepTau it could affect many other parts of any kind of job running there.

We may have to develop and deploy some sort of ad-hoc test or just ask site to regularly run memory-checks.

VinInn · 2023-02-11T14:34:47Z

My understanding is that TensorFlow is used in DeepTauId.
TnsorFlow is a fat library: uses different code on different hardware: I wonder if there has been any test to compare its result on a sse only machine, avx, avx2 and axv512.
If yes, which SSE-only machine has been used?

kandrosov · 2023-02-15T15:33:25Z

@VinInn hm.. I just go one of these errors in some of my private nano production jobs. It seems that NaN conditions is now occurring more frequently wrt to the past. Here is an example log file. Do you think we can extract something useful from it about the hardware?

One interesting observation: It was a crab job with a custom executable that consecutively runs CMSSW instances for each input file. This particular job had two input files and for both cmssw crashed on the first event due to the NaN problem. So the behaviour seems to be persistent on that worker node.

VinInn · 2023-02-15T15:37:47Z

HERE WE GO.

== CMSSW: 2023-02-15 13:57:17.162144: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2

I bet we never tested on sse-only machines....
My suspicion is wrong unaligned load...

makortel · 2023-02-15T15:39:09Z

The log happens to contain also the exact processor model

Current processor: model name	: Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz

in case that would be useful.

VinInn · 2023-02-15T15:44:56Z

https://ark.intel.com/content/www/fr/fr/ark/products/48768/intel-xeon-processor-e5645-12m-cache-2-40-ghz-5-86-gts-intel-qpi.html

VinInn · 2023-02-15T15:55:20Z

I failed to find a sse-only machine at cern.
If anybody can get her hands on one of those would be nice to run valgrind and/or other "memory" tools

kandrosov · 2023-02-15T19:10:10Z

I found one node with Intel(R) Xeon(R) CPU X5650 @ 2.67GHz. It should also be SSE-only.
I run few times nano sequence on the same input file as in the log above, but it always runs successfully without any sign of NaNs...
Here is a valgrind output.

VinInn · 2023-02-16T08:17:04Z

@kandrosov : would it be easy for you to check that the results of deepTauId is identical on this machine and on a AVX2/AVX512 one?

VinInn · 2023-02-16T08:32:39Z

it's plenty of

  <what>Conditional jump or move depends on uninitialised value(s)</what>
  <stack>
    <frame>
      <ip>0x5174599</ip>
      <obj>/cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-bf41b0420bc269850b74e23486e2953a/lib/libHist.so</obj>
      <fn>TStorage::UpdateIsOnHeap(unsigned int const volatile&amp;, unsigned int volatile&amp;)</fn>

is it a false positive?

the rest is sometimes scary but clearly in python or cling...

Dr15Jones · 2023-02-16T13:20:27Z

It’s a false positive. In the ROOT release area they have a filter file one can use with valgrind to ignore those messages.

VinInn · 2023-02-16T13:26:23Z

I'm running

valgrind --tool=memcheck --leak-check=full --suppressions=${ROOTSYS}/etc/valgrind-root.supp

and btw I also see many

==78542== Conditional jump or move depends on uninitialised value(s)
==78542==    at 0x52ED3C34: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)

==78542== Use of uninitialised value of size 8
==78542==    at 0x6DB543F: sinf (in /usr/lib64/libm-2.17.so)
==78542==    by 0x52ED2ECA: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)

==78542== Conditional jump or move depends on uninitialised value(s)
==78542==    at 0x64F7A91: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/xz/5.2.5-d6fed2038c4e8d6e04531d1adba59f37/lib/liblzma.so.5.2.5)

some improper delete

==78542== Invalid free() / delete / delete[] / realloc()
==78542==    at 0x4030F5C: free (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/valgrind/3.17.0-12ba960b086b024cce391ad228855798/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==78542==    by 0x52EC1311: mkfit::ExecutionContext::~ExecutionContext() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)

and a large number of memory leaks in tensorflow....

slava77 · 2023-02-16T13:57:35Z

and btw I also see many

==78542== Conditional jump or move depends on uninitialised value(s)
==78542==    at 0x52ED3C34: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)

IIUC, this was fixed in #40105 in 12_6_X (merged in pre5, apparently)
At least the fixed cases were harmless and were happening in parts of the fixed size loop that were discarded downstream.

VinInn · 2023-02-16T14:02:38Z

Thanks @slava77 , I was suspecting something like that (and sorry if I'm still runnning and old benchmark (hepix benchmark))

slava77 · 2023-02-21T17:42:04Z

the nan in deepTau could be blocking the rereco of 2022
https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022B-DoubleMuonLowMass-10Dec2022-00001

cmsbuild added the pending-assignment label Feb 9, 2023

cmsbuild added reconstruction-pending pending-signatures and removed pending-assignment labels Feb 9, 2023

cmsbuild added the tau label Feb 9, 2023

VinInn mentioned this issue Feb 15, 2023

CMSSW not tested on SSE-only platforms #40777

Open

kskovpen mentioned this issue Apr 14, 2023

Failures in Run 3 data reprocessing #40437

Open

makortel mentioned this issue Sep 1, 2023

RecoTracker/MkFitCore: Valgrind Massif/Memcheck "Invalid free() / delete / delete[] / realloc()" at end of job. #42700

Open

srimanob mentioned this issue Sep 25, 2023

Exception on HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1 Phase-2 workflow #42862

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exception instead of LogError in DeepTauId #40733

Exception instead of LogError in DeepTauId #40733

VinInn commented Feb 9, 2023 •

edited

cmsbuild commented Feb 9, 2023

makortel commented Feb 9, 2023

cmsbuild commented Feb 9, 2023

makortel commented Feb 9, 2023

mandrenguyen commented Feb 9, 2023

mandrenguyen commented Feb 9, 2023

mbluj commented Feb 9, 2023

kandrosov commented Feb 9, 2023

VinInn commented Feb 9, 2023

VinInn commented Feb 11, 2023

kandrosov commented Feb 15, 2023

VinInn commented Feb 15, 2023

makortel commented Feb 15, 2023

VinInn commented Feb 15, 2023

VinInn commented Feb 15, 2023

kandrosov commented Feb 15, 2023

VinInn commented Feb 16, 2023 •

edited

VinInn commented Feb 16, 2023

Dr15Jones commented Feb 16, 2023

VinInn commented Feb 16, 2023 •

edited

slava77 commented Feb 16, 2023

VinInn commented Feb 16, 2023

slava77 commented Feb 21, 2023

Exception instead of LogError in DeepTauId #40733

Exception instead of LogError in DeepTauId #40733

Comments

VinInn commented Feb 9, 2023 • edited

cmsbuild commented Feb 9, 2023

makortel commented Feb 9, 2023

cmsbuild commented Feb 9, 2023

makortel commented Feb 9, 2023

mandrenguyen commented Feb 9, 2023

mandrenguyen commented Feb 9, 2023

mbluj commented Feb 9, 2023

kandrosov commented Feb 9, 2023

VinInn commented Feb 9, 2023

VinInn commented Feb 11, 2023

kandrosov commented Feb 15, 2023

VinInn commented Feb 15, 2023

makortel commented Feb 15, 2023

VinInn commented Feb 15, 2023

VinInn commented Feb 15, 2023

kandrosov commented Feb 15, 2023

VinInn commented Feb 16, 2023 • edited

VinInn commented Feb 16, 2023

Dr15Jones commented Feb 16, 2023

VinInn commented Feb 16, 2023 • edited

slava77 commented Feb 16, 2023

VinInn commented Feb 16, 2023

slava77 commented Feb 21, 2023

VinInn commented Feb 9, 2023 •

edited

VinInn commented Feb 16, 2023 •

edited

VinInn commented Feb 16, 2023 •

edited