DQMOnlineClient-beam_dqm_sourceclient error #31896

Closed
silviodonato opened this issue Oct 22, 2020 · 24 comments

@silviodonato
Contributor

silviodonato commented Oct 22, 2020

In CMSSW_11_2_X_2020-10-21-2300, a unit test from DQM/Integration is failing.
This was already reported in some PR tests (see #31654 (comment)):

#32135 (comment)
#32036 (comment)
#31765 (comment)
#31871 (comment)
#31699 (comment)
#31689 (comment)
#31654 (comment)
#31206 (comment)
#27983 (comment)

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/cc8_amd64_gcc8/CMSSW_11_2_X_2020-10-21-2300/unitTestLogs/DQM/Integration#/

===== Test "TestDQMOnlineClient-beam_dqm_sourceclient" ====
+ [[ 1 -eq 0 ]]
+ [[ -z '' ]]
+ LOCAL_TEST_DIR=.
+ [[ -z '' ]]
+ CLIENTS_DIR=./src/DQM/Integration/python/clients
+ mkdir -p ./upload
+ cmsRun ./src/DQM/Integration/python/clients/beam_dqm_sourceclient-live_cfg.py unitTest=True
Querying DAS for files...
the query is file run=334393 dataset=/ExpressCosmics/Commissioning2019-Express-v1/FEVT lumi=1
DAS succeeded after 1 attempts 0
found files:  ['/store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/D0F052ED-9CA5-F547-BA73-2AA370D51AE8.root']
edmFileUtil --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrootd --events /store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/D0F052ED-9CA5-F547-BA73-2AA370D51AE8.root | tail -n +9 | head -n -5 | awk '{ print $3 }'
the query is file run=334393 dataset=/ExpressCosmics/Commissioning2019-Express-v1/FEVT lumi=2
DAS succeeded after 1 attempts 0
found files:  ['/store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/FE21789A-F777-0B43-A1F5-E43F1FD52D19.root']
edmFileUtil --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrootd --events /store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/FE21789A-F777-0B43-A1F5-E43F1FD52D19.root | tail -n +9 | head -n -5 | awk '{ print $3 }'
Got 2 files.
Loaded configuration file from: []
dqmRunConfig: cms.PSet(
    collectorHost = cms.untracked.string('127.0.0.1'),
    collectorPort = cms.untracked.int32(9190),
    type = cms.untracked.string('userarea')
)
Monitoring file not found, disabling.
22-Oct-2020 03:38:39 CEST  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/D0F052ED-9CA5-F547-BA73-2AA370D51AE8.root
%MSG-w XrdAdaptor:  file_open 22-Oct-2020 03:38:42 CEST pre-events
Data is served from cern.ch instead of original site eoscms
%MSG
22-Oct-2020 03:38:43 CEST  Successfully opened file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/D0F052ED-9CA5-F547-BA73-2AA370D51AE8.root
%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigis  22-Oct-2020 03:38:57 CEST Run: 334393 Event: 16285
NULL pointer to FEDRawData for FED: id 114
Note: further warnings of this type will be suppressed (this can be changed by enabling debugging printout)
%MSG
22-Oct-2020 03:39:03 CEST  Closed file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/D0F052ED-9CA5-F547-BA73-2AA370D51AE8.root
22-Oct-2020 03:39:03 CEST  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/FE21789A-F777-0B43-A1F5-E43F1FD52D19.root
%MSG-w XrdAdaptor:  file_open 22-Oct-2020 03:39:05 CEST PostProcessEvent
Data is served from cern.ch instead of original site eoscms
%MSG
22-Oct-2020 03:39:05 CEST  Successfully opened file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/FE21789A-F777-0B43-A1F5-E43F1FD52D19.root

@cmsbuild
Contributor

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@silviodonato
Contributor Author

assign DQM

@silviodonato
Contributor Author

assign dqm

@cmsbuild
Contributor

New categories assigned: dqm

@jfernan2,@andrius-k,@fioriNTU,@kmaeshima,@ErnestaP you have been requested to review this Pull request/Issue and eventually sign? Thanks

@silviodonato
Contributor Author

We don't see the error in slc7 amd64 gcc900 or other architectures, so perhaps it is just a temporary problem.

@jfernan2
Contributor

Hi @silviodonato

This is the second time I have seen this error resurface in a PR Jenkins test:
#32036 (comment)

26-Nov-2020 11:04:32 CET Successfully opened file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/FE21789A-F777-0B43-A1F5-E43F1FD52D19.root
26-Nov-2020 11:04:34 CET Writing DQM Root file: ./upload/DQM_V0001_BeamMonitor_R000334393.root
DQMFileSaver::globalEndRun()
26-Nov-2020 11:04:34 CET Closed file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/FE21789A-F777-0B43-A1F5-E43F1FD52D19.root
%MSG-w SiStripRawToDigi: SiStripRawToDigiModule:siStripDigis@endStream 26-Nov-2020 11:04:34 CET post-events
[sistrip::RawToDigiUnpacker::createDigis] warnings:
NULL pointer to FEDRawData for FED (1800)
%MSG

Fatal system signal has occurred during exit
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_11_2_X_2020-11-25-2300/src/DQM/Integration/test/runtest.sh: line 19: 14915 Aborted cmsRun $CLIENTS_DIR/$1 unitTest=True

---> test TestDQMOnlineClient-beam_dqm_sourceclient had ERRORS
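
When triaging logs like the one above, it can help to pull out the failed test names and check for the exit-time crash message automatically. The following is a minimal sketch, assuming the log uses exactly the two line formats shown in this comment; the function name `summarize_unit_test_log` is hypothetical and is not part of cms-bot or CMSSW.

```python
import re

# Line formats copied from the log excerpt above.
FAILED_TEST = re.compile(r"^---> test (\S+) had ERRORS")
EXIT_CRASH = "Fatal system signal has occurred during exit"


def summarize_unit_test_log(text):
    """Return the failed test names and whether the exit-time crash appears."""
    failed_tests = []
    crash_at_exit = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if EXIT_CRASH in line:
            crash_at_exit = True
        match = FAILED_TEST.match(line)
        if match:
            failed_tests.append(match.group(1))
    return {"failed_tests": failed_tests, "crash_at_exit": crash_at_exit}
```

A log that reports the failure without the crash message would come back with `crash_at_exit` set to `False`, which separates this intermittent abort from ordinary test failures.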

@jfernan2
Contributor

jfernan2 commented Nov 26, 2020

I am not sure whether this is related to the earlier SiStripRawToDigiModule:siStripDigis warning (it seems not, since that warning is present in several other unit tests) or to the beam_dqm_sourceclient itself.

I have just tried to reproduce it using exactly the same settings as in #32036, but failed. However, perhaps it makes sense to reopen this issue, since the problem still seems to be there.

FYI: @gennai @francescobrivio

@mmusich
Contributor

mmusich commented Nov 26, 2020

@jfernan2
just a clarification on this

What I am not sure if this is related to the SiStripRawToDigiModule:siStripDigis Warning

The warning comes from the Strip unpacker and just signals that one or more FEDs are out of DAQ [1] (1800 is the number of times the same warning is emitted).
Given that there are many SiStrip FEDs out of DAQ in the run chosen for the unit test, 334393 [2], it is not surprising to see them.
Moreover, we are running the Strip unpacker in several other online clients (including the Strip online one) without problems, so I really tend to exclude a problem with that.

[1] https://github.com/cms-sw/cmssw/blob/master/EventFilter/SiStripRawToDigi/plugins/SiStripRawToDigiUnpacker.cc#L158
[2] https://cmswbm.cern.ch/cmsdb/servlet/RunSummary?RUN=334393
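
To illustrate the "message (count)" form of the end-of-stream summary, here is a minimal sketch of a warning aggregator. It is not the actual unpacker code linked in [1]; the class `WarningSummary` and its methods are hypothetical and only mimic the summary line seen in the log.

```python
from collections import Counter


class WarningSummary:
    """Collect repeated warnings and report each one once with its count."""

    def __init__(self):
        self._counts = Counter()

    def add(self, message):
        # Record one more occurrence instead of printing the warning again.
        self._counts[message] += 1

    def lines(self):
        # One summary line per distinct message, in first-seen order.
        return [f"{message} ({count})" for message, count in self._counts.items()]


summary = WarningSummary()
for _ in range(1800):
    summary.add("NULL pointer to FEDRawData for FED")
print("\n".join(summary.lines()))
```

Read this way, the number in parentheses is a repetition count, not a FED id.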

@jfernan2
Contributor

jfernan2 commented Nov 26, 2020

Thanks @mmusich
Yeah, indeed it is not related to SiStripRawToDigiModule:siStripDigis and not even reproducible (I tried offline 10 times so far... :-P )

@jfernan2
Contributor

Another example: #27983 (comment)
Could it be correlated with the switch to CMSSW_11_3_X in the milestone?

@silviodonato silviodonato reopened this Nov 26, 2020
@silviodonato
Contributor Author

Another one example: #27983 (comment)
May it be correlated to the switch to CMSSW_11_3_X in the milestone?

I would rule out 11_3_X as the cause of this problem.

Googling "Fatal system signal has occurred during exit", I found an issue from @makortel: #32045

@silviodonato
Contributor Author

The same error was seen here #31190 (comment)

@silviodonato
Contributor Author

The error "Fatal system signal has occurred during exit" is always reproducible in the ASAN release.
This is the minimal configuration that reproduces the error:

import FWCore.ParameterSet.Config as cms

process = cms.Process("BeamMonitor")

process.source = cms.Source("PoolSource",
    eventsToProcess = cms.untracked.VEventRange("334393:1:16049-334393:1:16407", "334393:2:32684-334393:2:33028"),
    fileNames = cms.untracked.vstring(
        '/store/express/Commissioning2019/ExpressCosmics/FEVT/Express-v1/000/334/393/00000/D0F052ED-9CA5-F547-BA73-2AA370D51AE8.root',
    ),
)

process.scalersRawToDigi = cms.EDProducer("ScalersRawToDigi",
    mightGet = cms.optional.untracked.vstring,
    scalersInputTag = cms.InputTag("rawDataCollector")
)

process.OnlineDBOutputService = cms.Service("OnlineDBOutputService",
    DBParameters = cms.PSet(
        authenticationPath = cms.untracked.string('.'),
        messageLevel = cms.untracked.int32(0)
    ),
    autoCommit = cms.untracked.bool(True),
    connect = cms.string('sqlite_file:BeamSpotOnlineLegacy.db'),
    jobName = cms.untracked.string('BeamSpotOnlineLegacyTest'),
    lastLumiFile = cms.untracked.string(''),
    latency = cms.untracked.uint32(2),
    preLoadConnectionString = cms.untracked.string('sqlite_file:BeamSpotOnlineLegacy.db'),
    runNumber = cms.untracked.uint64(334393),
    saveLogsOnDB = cms.untracked.bool(False),
    toPut = cms.VPSet(cms.PSet(
        onlyAppendUpdatePolicy = cms.untracked.bool(True),
        record = cms.string('BeamSpotOnlineLegacyObjectsRcd'),
        tag = cms.string('BSOnlineLegacy_tag'),
        timetype = cms.untracked.string('Lumi')
    )),
    writeTransactionDelay = cms.untracked.uint32(0)
)


process.p = cms.Path(process.scalersRawToDigi)

It runs without errors in CMSSW_11_2_X_2020-11-25-2300.

It crashes in CMSSW_11_2_ASAN_X_2020-11-25-2300 with

Fatal system signal has occurred during exit
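
To compare the two releases in a script, one could run the same configuration in each and classify how the process ended. This is a minimal sketch, assuming a POSIX system where Python's `subprocess` reports death by signal as a negative return code; the helper names `classify_exit` and `run_and_classify` are hypothetical.

```python
import signal
import subprocess


def classify_exit(returncode):
    """Turn a subprocess return code into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative value -N means the child was killed by signal N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exit code {returncode}"


def run_and_classify(command):
    """Run a command (list of arguments) and classify how it terminated."""
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return classify_exit(completed.returncode)
```

For example, `run_and_classify(["cmsRun", "minimal_cfg.py"])` could be called in each release area, where `minimal_cfg.py` is an assumed file name for the configuration above. Note that when the job is started through a shell script, the shell reports the signal differently (as in the "Aborted" line of the earlier log), so this classification applies only to a directly spawned child.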

@silviodonato
Contributor Author

assign db

@cmsbuild
Contributor

New categories assigned: db

@ggovi you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

@silviodonato Does the crash in ASAN have any information in addition to the "Fatal system signal during exit"? (if it does, could you paste it here?)

@silviodonato
Contributor Author

@makortel
Contributor

Running Silvio's example configuration #31896 (comment) (thanks!) in gdb in CMSSW_11_2_ASAN_X_2020-11-25-2300 shows a segfault with the following stack trace

Thread 1 "cmsRun" received signal SIGSEGV, Segmentation fault.
0x00007fffea676ba9 in coral::MessageStream::doOutput() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-11-25-2300/external/slc7_amd64_gcc820/lib/liblcg_CoralBase.so
(gdb) where
#0  0x00007fffea676ba9 in coral::MessageStream::doOutput() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-11-25-2300/external/slc7_amd64_gcc820/lib/liblcg_CoralBase.so
#1  0x00007fffd68fcb9b in coral::ConnectionService::ConnectionPool::~ConnectionPool() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-11-25-2300/external/slc7_amd64_gcc820/lib/liblcg_ConnectionService.so
#2  0x00007fffd68fcd29 in coral::ConnectionService::ConnectionPool::~ConnectionPool() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-11-25-2300/external/slc7_amd64_gcc820/lib/liblcg_ConnectionService.so
#3  0x00007fffd6903360 in coral::ConnectionService::ConnectionService::~ConnectionService() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-11-25-2300/external/slc7_amd64_gcc820/lib/liblcg_ConnectionService.so
#4  0x00007fffd69033f9 in coral::ConnectionService::ConnectionService::~ConnectionService() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-11-25-2300/external/slc7_amd64_gcc820/lib/liblcg_ConnectionService.so
#5  0x00007fffea6a2fdc in coral::Context::~Context() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-11-25-2300/external/slc7_amd64_gcc820/lib/liblcg_CoralKernel.so
#6  0x00007ffff33e0ce9 in __run_exit_handlers () from /lib64/libc.so.6
#7  0x00007ffff33e0d37 in exit () from /lib64/libc.so.6
#8  0x00007ffff33c955c in __libc_start_main () from /lib64/libc.so.6
#9  0x00000000004115f9 in _start ()

@silviodonato
Contributor Author

@jfernan2
Contributor

jfernan2 commented Dec 8, 2020

The following PR should be fixing this somehow:
#32408
FYI @francescobrivio

@francescobrivio
Contributor

The following PR should be fixing this somehow:
#32408
FYI @francescobrivio

I don't think #32408 will fix this issue. It just contains some minor updates to the clients after the last MWGR tests.

@ggovi
Contributor

ggovi commented Dec 16, 2020

Addressed by #32503

@jfernan2
Contributor

+1

@silviodonato
Contributor Author

The crash has been fixed by #32503 (I tested #31896 (comment) in CMSSW_11_3_ASAN_X_2021-01-06-2300).
