DQM/Integration unit tests are failing in all releases but 12_6_X #39669

Closed
perrotta opened this issue Oct 7, 2022 · 24 comments

@perrotta
Contributor

perrotta commented Oct 7, 2022

DQM/Integration unit tests are failing in large numbers in all releases except 12_6_X, in all cases apparently independently of the PRs merged in the meantime.

I observed it starting in:
CMSSW_12_5_X_2022-10-04-1100
CMSSW_12_4_X_2022-10-03-2300
CMSSW_12_3_X_2022-09-30-1100
CMSSW_12_2_X_2022-10-03-2300

No such issue (yet?) in the master release.
In all cases no PR had been merged for the IB in which the failure first appeared; in particular, we have not merged anything in 12_2_X and 12_3_X for a while.

A typical log:

edmFileUtil --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrootd --events /store/express/Commissioning2021/ExpressCosmics/FEVT/Express-v1/000/344/518/00000/8ae6d6f6-7859-4089-84dd-4a5d89deb5df.root | tail -n +9 | head -n -5 | awk '{ print $3 }'
Error in <TNetXNGFile::Open>: [ERROR] Server responded with an error: [3011] No servers are available to read the file.

----- Begin Fatal Exception 30-Sep-2022 12:04:01 CEST-----------------------
An exception of category 'ConfigFileReadError' occurred while
   [0] Processing the python configuration file named ./src/DQM/Integration/python/clients/beam_dqm_sourceclient-live_cfg.py
Exception Message:
 unknown python problem occurred.
IndexError: list index out of range

At:
  /cvmfs/cms-ib.cern.ch/nweek-02752/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_X_2022-09-30-1100/python/DQM/Integration/config/unittestinputsource_cfi.py(107): <module>
  <frozen importlib._bootstrap>(228): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(850): exec_module
  <frozen importlib._bootstrap>(695): _load_unlocked
  <frozen importlib._bootstrap>(986): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1007): _find_and_load
  /cvmfs/cms-ib.cern.ch/nweek-02752/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_X_2022-09-30-1100/python/FWCore/ParameterSet/Config.py(722): load
  ./src/DQM/Integration/python/clients/beam_dqm_sourceclient-live_cfg.py(36): <module>

----- End Fatal Exception -------------------------------------------------
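
For readers following the traceback, here is a minimal Python sketch of how an empty result from the edmFileUtil pipeline above can end in an IndexError. It assumes, without having checked unittestinputsource_cfi.py, that the configuration splits the command output into a list and then indexes it. The helper name `third_column` and the variable names are hypothetical.

```python
def third_column(output: str) -> list:
    """Mimic `tail -n +9 | head -n -5 | awk '{ print $3 }'` on captured text."""
    lines = output.splitlines()
    body = lines[8:]   # tail -n +9: keep everything from line 9 onwards
    body = body[:-5]   # head -n -5: drop the last five lines
    # awk would print an empty string for lines with fewer than three fields;
    # this sketch skips such lines instead.
    return [parts[2] for parts in (line.split() for line in body) if len(parts) >= 3]


# When the file cannot be opened, only an error message comes back,
# so nothing survives the trimming above.
failed_output = (
    "Error in <TNetXNGFile::Open>: [ERROR] Server responded with an error: "
    "[3011] No servers are available to read the file.\n"
)
values = third_column(failed_output)

# Indexing values[0] directly would raise IndexError on the empty list;
# an explicit guard makes the real cause (no readable input file) visible.
first_value = values[0] if values else None
print(first_value)
```

Under these assumptions the IndexError is only a symptom: the underlying problem is the xrootd failure on the line before the exception.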
@cmsbuild
Contributor

cmsbuild commented Oct 7, 2022

A new Issue was created by @perrotta Andrea Perrotta.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@perrotta
Contributor Author

perrotta commented Oct 7, 2022

assign dqm,externals

@cmsbuild
Contributor

cmsbuild commented Oct 7, 2022

New categories assigned: dqm,externals

@jfernan2,@ahmad3213,@micsucmed,@iarspider,@rvenditti,@smuzaffar,@emanueleusai,@syuvivida,@aandvalenzuela,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@iarspider
Contributor

iarspider commented Oct 7, 2022

I have reproduced the issue, also with CMSSW_12_6_X_2022-10-04-1100 (no idea why it didn't fail in the IBs). However, I don't know how to fix it; we need to wait until @smuzaffar is back.

@rvenditti
Contributor

For the time being, I have just reproduced the error in CMSSW_12_3_X_2022-09-30-1100 (after changing the input dataset in https://github.com/cms-sw/cmssw/blob/master/DQM/Integration/python/config/unittestinputsource_cfi.py#L41 to avoid the xrootd error), but we have no idea what causes it. I tried running a couple of DQM clients without the unit test, and they work properly.

@smuzaffar
Contributor

smuzaffar commented Oct 10, 2022

Could it be that the dataset /ExpressCosmics/Commissioning2021-Express-v1/FEVT was recently deleted and xrootd can no longer find the file? Note that we have cached these files in the ibeos area, but one needs to use protocol=ibeos to access them; e.g. the following works:

edmFileUtil  --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos --events /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

Another solution is to backport the SITECONFIG_PATH changes #37278 (comment) to the production releases, e.g. the 12.x/11.x release cycles.
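
As an illustration of the first option, the catalog string in the working command differs from the one in the failing log in its `protocol` value. A minimal Python sketch of that substitution follows; the helper name `with_protocol` is hypothetical, and it assumes `protocol` is the only query parameter on the catalog URL.

```python
def with_protocol(catalog: str, protocol: str) -> str:
    """Return the catalog URL with its `protocol` query value replaced.

    Assumes the part after '?' holds nothing but the protocol setting.
    """
    base, _, _ = catalog.partition("?")
    return f"{base}?protocol={protocol}"


# Catalog string as it appears in the failing log.
ib_catalog = "file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrootd"
print(with_protocol(ib_catalog, "ibeos"))
```

The result is the catalog string to pass to `--catalog` so that the cached copy in the ibeos area is used.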

@smuzaffar
Contributor

@makortel, @nhduongvn, @stlammel: during the Core SW meeting we decided to backport the #37278 changes to older release cycles too. Do you see any issues with doing this? I am not sure whether all sites are ready and already have the new data catalogs from Rucio.

@makortel
Contributor

> during Core SW meeting we decided to backport #37278 changes to older release cycles too. Do you see any issues doing this ?

Yes, that is the plan (see #37278 (comment)).

> Do you see any issues doing this ?

We need to be sure that the backports won't cause trouble in the old release cycles. I had earlier collected the list of fixes that need to be included in the backport in #37278 (comment), and this week a new issue with the subsite treatment in site-local-config.xml was reported in
https://cms-talk.web.cern.ch/t/crab-test-cmssw-12-6-x-invalid-site-local-config/15423/17. As I understand it, @nhduongvn will open a PR with the fix soon.

> I am not sure if all sites are ready and already have new data catalogs from rucio

That was actually my precondition for signing #37278 that @stlammel confirmed in #37278 (comment) (although with 12_6_0_pre2 reality turned out to be more complicated).

@stlammel

So, there was a campaign earlier this year to get storage.json files in place for all sites. Two sites had held out, and the files were put in place there when this was discovered several weeks ago, as Matti wrote.
During the sub-site issue last week I found obsolete entries at two sites, and they were corrected.
The SAM test to check SITECONF is ready and will go into production with the next token update. This should detect inconsistencies before users do. (I didn't regard this as high priority, since we don't have such a test for the current SITECONF files either, but their being active reveals issues promptly.)
I would release CMSSW_12_6 and make sure everything is fine before backporting to other releases.

- Stephan

@rappoccio
Contributor

Hi, All,

This still needs attention. Is it still the case that @nhduongvn is preparing a fix here?

@nhduongvn
Contributor

Hi Sal, all,
The fix was provided and merged:
#39727

@rappoccio
Contributor

Thanks @nhduongvn, but we still need backports to 12_5 and 12_4. @makortel, is there any update there?

Otherwise, can we bypass this entirely by switching the DQM checks to a file from a more recent run that is still available? @cms-sw/dqm-l2?

@rappoccio
Contributor

> I would release CMSSW_12_6, make sure everything is fine before the backport of other releases.

@stlammel we won't release 12_6 until December; we can't really leave the IBs broken for 2 months.

@stlammel

Hello Sal, @rappoccio
I am a bit confused: the old versions, including 12_4 and 12_5, should work fine
without the backport. Only the 12_6 pre-releases are broken, and the next
pre-release will fix this.
Thanks,

- Stephan

@makortel
Contributor

Given the trouble we've had with #37278, I'm not comfortable backporting it (and all the necessary fixes) to 12_4_X or 12_5_X until data taking is over (to avoid any risk for Tier0).

That said, I think the unit tests would get fixed by just dropping the --catalog option to edmFileUtil, i.e. backporting just #39266. @smuzaffar The test machinery still sets CMS_PATH=/cvmfs/cms-ib.cern.ch, right? If that is the case, edmFileUtil will find the right storage.xml. I just tested that

CMS_PATH=/cvmfs/cms-ib.cern.ch edmFileUtil  --events /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

succeeds in CMSSW_12_5_X_2022-10-21-1100.
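
The same idea can be sketched from Python: call edmFileUtil without --catalog and set CMS_PATH in the child environment only. This is an illustration, not what the unit test configuration actually does; the function names `edmfileutil_events` and `run` are hypothetical, and only the `--events` option and the CMS_PATH value shown in this thread are used.

```python
import os
import subprocess


def edmfileutil_events(lfn, cms_path="/cvmfs/cms-ib.cern.ch"):
    """Build the argument list and environment for the call tested above."""
    # Copy the current environment and override CMS_PATH for the child only.
    env = dict(os.environ, CMS_PATH=cms_path)
    # No --catalog option: the catalog lookup is left to edmFileUtil.
    argv = ["edmFileUtil", "--events", lfn]
    return argv, env


def run(lfn):
    """Run edmFileUtil; requires a CMSSW environment (after cmsenv)."""
    argv, env = edmfileutil_events(lfn)
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=False)
```

Calling `run(...)` only makes sense inside a CMSSW work area; `edmfileutil_events` can be inspected anywhere to check what would be executed.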

@mmusich
Contributor

mmusich commented Oct 21, 2022

> Said that, I think the unit tests would get fixed by just dropping the --catalog option to edmFileUtil, i.e. backporting just #39266 .

Based on my private test (footnote 1), this won't be sufficient to fix the unit tests.

Footnotes

  1. cmsrel CMSSW_12_5_X_2022-10-21-1100
    cd CMSSW_12_5_X_2022-10-21-1100/src/
    cmsenv
    git cms-addpkg DQM/Integration
    git cherry-pick 9a056d4
    scramv1 b -j 20
    cd DQM/Integration/python/clients/
    voms-proxy-init -voms cms
    cmsRun sistrip_dqm_sourceclient-live_cfg.py unitTest=True

@smuzaffar
Contributor

Right, dropping the --catalog option does not work for 12.5 and earlier releases. One simple fix is to either use a file known to DAS (accessible via the xrootd redirectors) or use the ibeos protocol, i.e. use --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos

@mmusich
Contributor

mmusich commented Oct 23, 2022

> or use ibeos protocol i.e. use --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos

this indeed works.
I have opened the following PRs:

Let me know if some other cycles could use an update.

@makortel
Contributor

I still don't understand why just dropping the --catalog would not work. In CMSSW_12_5_X_2022-10-21-1100 I get

# this is what the test used before
$ edmFileUtil -d --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrootd /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://cms-xrd-global.cern.ch//store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

# with explicit ibeos
$ edmFileUtil -d --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

# dropping --catalog, setting CMS_PATH
$ CMS_PATH=/cvmfs/cms-ib.cern.ch edmFileUtil -d /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

The last two cases resolve to exactly the same PFN.
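
The comparison can be written down as a small Python check. The PFN strings are copied from the three outputs above; the dictionary keys and the `host` helper are hypothetical names used only for this illustration.

```python
from urllib.parse import urlsplit

LFN = ("/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/"
       "00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root")

# PFNs as printed by the three edmFileUtil -d calls above.
resolved = {
    "catalog_xrootd": "root://cms-xrd-global.cern.ch/" + LFN,
    "catalog_ibeos": "root://eoscms.cern.ch//eos/cms/store/user/cmsbuild" + LFN,
    "cms_path_only": "root://eoscms.cern.ch//eos/cms/store/user/cmsbuild" + LFN,
}


def host(pfn: str) -> str:
    """Return the server part of a root:// PFN."""
    return urlsplit(pfn).netloc


for label, pfn in resolved.items():
    print(label, host(pfn))

# Explicit ibeos and "no --catalog plus CMS_PATH" give the same PFN.
print(resolved["catalog_ibeos"] == resolved["cms_path_only"])
```

Only the original xrootd catalog goes through the global redirector; the other two read the cached copy on eoscms.cern.ch.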

Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.

Anyway, given that #39829 and #39830 are already merged, there probably isn't practical need to continue the discussion (except maybe why the merge of #39829 did not cause this issue to close).

@mmusich
Contributor

mmusich commented Oct 24, 2022

> Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.

This didn't work for me, see #39669 (comment)

@makortel
Contributor

> > Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.
>
> This didn't work for me, see #39669 (comment)

I guess that is because the recipe in #39669 (comment) did not override CMS_PATH, which I expect scram b use-ibeos runtests to do, among other things.

@smuzaffar
Contributor

Hmm, yes, dropping --catalog with the correct CMS_PATH also worked for me. No idea why I had the impression that this was not working.

@mmusich
Contributor

mmusich commented Oct 25, 2022

> no idea why I had the impression that this was not working.

That's interesting, because when I first tried dropping --catalog (i.e. backporting just #39266), I also had the distinct impression that scram b use-ibeos runtests wasn't working either. I then switched to single-client tests (as in the recipe of #39669 (comment)) to make the tests run faster.
I wonder whether something else changed in the meantime, such that scram b use-ibeos runtests now also runs OK.
At any rate, I think #39829 is the superior fix: besides letting the unit tests run, it also allows a single client to be tested directly in unit-test mode, which is what developers generally do.

@rappoccio
Contributor

Thanks a lot for the efforts here! I think we can now close the issue as the IBs are now correctly completing. Thanks everyone!
