Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

try to solve mg5 /tmp problem from ExternalLHEProducer #23004

Merged
merged 1 commit into from Apr 20, 2018

Conversation

perrozzi
Copy link
Contributor

In production (and gridpack creation) when using condor infrastructure
some failures have been observed which are related to the use of /tmp
by MG5_aMC when the $TMPDIR environment variable is not set.
A solution has been provided by @khurtado (cms-sw/genproductions#1534 (comment))
and is used for new gridpacks.
However, so far, existing gridpacks had been patched one by one.
Different MG5_aMC versions require different patches, making the operation cumbersome.
Discussing with @fabiocos and @vlimant it comes out that CMSSW doesn't touch this environment variable,
so one can effectively set the $TMPDIR upstream in the ExternalLHEProducer before the gridpack runs.
It has been cross-checked that the call of cmsenv doesn't overwrite the $TMPDIR variable

@kdlong, @agrohsje, @efeyazgan, @perrozzi

[1]

-bash-4.1$ cmsenv
-bash-4.1$ echo $TMPDIR
/tmp/pippo
-bash-4.1$ export TMPDIR=/ciao
-bash-4.1$ echo $TMPDIR
/ciao
-bash-4.1$ cmsenv
-bash-4.1$ echo $TMPDIR
/ciao

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@perrozzi
Copy link
Contributor Author

example of failure in production
https://hypernews.cern.ch/HyperNews/CMS/get/generators/3745

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @perrozzi for master.

It involves the following packages:

GeneratorInterface/LHEInterface

@cmsbuild, @efeyazgan, @perrozzi can you please review it and eventually sign? Thanks.
@alberto-sanchez, @agrohsje, @mkirsano this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@perrozzi
Copy link
Contributor Author

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 19, 2018

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/27569/console Started: 2018/04/19 14:49

@vlimant
Copy link
Contributor

vlimant commented Apr 19, 2018

My two cents, since the issue of tmp is not only in the gridpack business but could be anywhere, is to know whether TMPDIR being set is an actual requirement of cmssw environment, in which case it has to be set properly upstream of a specific module of cmssw.
Either in cmsenv or in wmcore job wrapper or in condor pilot (in increasing order of relevance).

With this PR, we kill one bird with one stone, while there are other birds out there.

@kdlong
Copy link
Contributor

kdlong commented Apr 19, 2018

I have always thought that setting it in the condor pilot made the most sense. Is there any opposition to this? If we set it in the CMSSW, doesn't it depend which CMSSW version is used by the gridpack script?

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-23004/27569/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 29
  • DQMHistoTests: Total histograms compared: 2492830
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2492653
  • DQMHistoTests: Total skipped: 176
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 23 files compared)
  • Checked 119 log files, 9 edm output root files, 29 DQM output files

@perrozzi
Copy link
Contributor Author

+1
Signing off from the GEN side. Please let us know what is the best solution computing-wise.

To answer Kenneth in a pedantic way, if:

  • the gridpack has been produced with an old CMSSW version (say version A) not setting TMPDIR,
  • it will inherit the environment variable as set in the job CMSSW (say version B)
  • and the cmsenv inside the gridpack with version A will not overwrite it (check my dummy test in the first comment).
  • if in the future we want to change the CMSSW version used for the gridpacks to version A*, and this will explicitly set TMPDIR, we can always use a different CMSSW (version C) passing it through the ExternalLHEProducer. Maximum flexibility.

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@kdlong
Copy link
Contributor

kdlong commented Apr 19, 2018

@perrozzi aaah, ok, I see. So run_generic_tarball_cvmfs.sh is still taken from the CMSSW of the job, not from the one specified in the gridpack. This looks like quite a simple and robust solution to me then.

@khurtado
Copy link

To complement on this, @vlimant created an elog thread regarding solving this issue at the pilot level for a general solution.

This was discussed today at the Submission Infrastructure meeting and there is a plan to test settingTMPDIR in the CMS pilot validation scripts (see notes), just like other variables like _CONDOR_SCRATCH_DIR or HOME are set.

@perrozzi
Copy link
Contributor Author

@khurtado thanks for the update, please keep us posted. from our side is indeed not mandatory to have it in cmssw.
may I kindly ask what you expect to be a ETA for the test?

@ddavila0
Copy link

I've merged Kenyi's patch on qa, will start testing it today and if everything goes well, we could push this to production at the beginning of the next week. Just keep in mind that the renewal of pilots will take few days (2-3) making the change likely to be available in the full pool by the end of next week.

@fabiocos
Copy link
Contributor

@perrozzi @vlimant this PR is anuwau not in contrast with possible modifications in the condor pileot, right? So merging it will not harm anyway

@perrozzi
Copy link
Contributor Author

as the comment says, if TMPDIR is unset, set it to the condor scratch area if present, otherwise fallback to /tmp
so no, in principle is not in contrast with the proposed other general modifications

@fabiocos
Copy link
Contributor

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants