Rivet 2.6.0 compiles in 10X but crash at running time #3679

Closed
xjanssen opened this issue Jan 22, 2018 · 21 comments

Comments

@xjanssen
Contributor

Hi,

I am testing the integration of Rivet 2.6.0 (and YODA 1.7.0, on which this release is based) in CMSSW 10X. I managed to build the two packages on a cmsdev machine with the usual cmsBuild command without problems. However, when I link these new versions into a CMSSW 10X release with the 'scram tool' command and try to run my test job, I get the following error:

---- Begin Fatal Exception 11-Jan-2018 15:17:01 CET-----------------------
An exception of category 'PluginLibraryLoadError' occurred while
[0] Constructing the EventProcessor
Exception Message:
unable to load /afs/cern.ch/work/x/xjanssen/cms/Rivet/10X_gcc630_Rivet260/CMSSW_10_0_0_pre3/lib/slc6_amd64_gcc630/pluginGeneratorInterfaceRivetInterface_plugins.so because dlopen: cannot load any more object with static TLS
----- End Fatal Exception -------------------------------------------------

As far as I understand, I am hitting some limitation of our runtime environment, but I am unsure how to fix or debug it, as I am not an expert in this kind of problem. The main change with respect to the previous release of Rivet is the addition of several Rivet plugins, which might be the underlying reason for hitting this limit. The standalone install of Rivet 2.6.0 (outside CMSSW) works, however, so this seems to be a feature of the CMSSW environment.

@cmsbuild
Contributor

A new Issue was created by @xjanssen Janssen Xavier.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@davidlt
Contributor

davidlt commented Jan 22, 2018

It's linked to the system dynamic loader. Rivet was probably compiled with static TLS, but the operating system keeps only a few spare slots in a vector for such libraries, and those are meant for system-specific packages. CMSSW loads hundreds of shared libraries via dlopen(), but you cannot fit more shared libraries with static TLS than there are slots in the vector.

This has been resolved for CentOS 7.2 and above.

See: cms-externals/glibc@2c052e0

One should investigate how TLS is being used in Rivet. Maybe using "-ftls-model=global-dynamic" would help. This is the default if Rivet was built for shared linking (i.e. -fPIC).
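
One way to investigate which libraries request static TLS is to look for the STATIC_TLS flag in the dynamic section of each shared object. The following is a minimal sketch, assuming binutils' `readelf` is available on the machine; the helper name `has_static_tls_flag` is made up for illustration and is not part of CMSSW or glibc.

```shell
#!/bin/sh
# Hypothetical helper (not part of CMSSW or glibc): reads the output of
# `readelf -d <library>` on stdin and succeeds if the FLAGS entry of the
# dynamic section contains STATIC_TLS.
has_static_tls_flag() {
  grep -Eq '\(FLAGS\).*STATIC_TLS'
}

# Check every shared object given on the command line.
# With no arguments the loop does nothing.
for lib in "$@"; do
  if readelf -d "$lib" 2>/dev/null | has_static_tls_flag; then
    echo "$lib: STATIC_TLS flag set"
  else
    echo "$lib: no STATIC_TLS flag"
  fi
done
```

Saved to a file, the script could be run with the plugin named in the exception message, and with the libraries it depends on, as arguments. A library reported with the flag set is a candidate for the "static TLS" slots mentioned above.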

@davidlt
Contributor

davidlt commented Jan 22, 2018

Ah, it's not Rivet, it's "pluginGeneratorInterfaceRivetInterface_plugins.so". I wonder why...

@smuzaffar
Contributor

smuzaffar commented Jan 22, 2018

@xjanssen, can you please share your spec files?

@xjanssen
Contributor Author

xjanssen commented Jan 22, 2018 via email

@smuzaffar
Contributor

Did you build and scram setup both yoda-toolfile and rivet-toolfile?
I built these versions on cmsdev02 and cannot reproduce the error. On which cmsdev machine did you build, and can you point me to the build logs directory?

@xjanssen
Contributor Author

xjanssen commented Jan 23, 2018 via email

@xjanssen
Contributor Author

Ah, of course I used CMSSW_10_0_0_pre3 and gcc630 everywhere; I forgot to update the following lines when pasting from my notes on how to do it.

@mseidel42
Contributor

@davidlt @davidlange6 It seems there was a similar problem with fireworks before, and glibc has been patched as follows: https://github.com/cms-externals/glibc/commits/cms/2.12-1.166.el6_7.3

Is it possible to increase DTV_SURPLUS further?

@davidlange6
Contributor

hi @intrepid42 - i suspect this would not be easily done (as this is the cause of problems moving towards centos7 smoothly) - but you should bring this up at a core software meeting so people can better understand what's going on (yoda/rivet does not seem like the sort of software that should run into this problem after all)

do things work on centos7?

@Dr15Jones
@smuzaffar

@mseidel42
Contributor

Hi, I just tried this:

  • Rivet 2.6.1 + Yoda 1.7.1
  • CMSSW_10_3_0_pre6
  • slc7_amd64_gcc700
  • cmsRun GeneratorInterface/RivetInterface/test/runRivetAnalyzer_cfg.py

And 4400 events get processed fine with the MC_GENERIC analysis. Xavier's rivet_CUEP8S1_CT6_Soft_cfg.py config works fine, too :)

Should we still bother about slc6, or is it likely to be phased out soon?

@mseidel42
Contributor

Just to confirm that the same steps fail on slc6:

----- Begin Fatal Exception 11-Oct-2018 13:44:01 CEST-----------------------
An exception of category 'PluginLibraryLoadError' occurred while
   [0] Constructing the EventProcessor
Exception Message:
unable to load /afs/cern.ch/work/m/mseidel/CMSSW_dev/Rivet/slc6/CMSSW_10_3_0_pre6/lib/slc6_amd64_gcc700/pluginGeneratorInterfaceRivetInterface_plugins.so because dlopen: cannot load any more object with static TLS
----- End Fatal Exception -------------------------------------------------

@mseidel42
Contributor

mseidel42 commented Oct 11, 2018

Very weird also: The ParticleLevelProducer (which calls Rivet code and is part of GeneratorInterface/RivetInterface) runs fine.

@fabiocos
Contributor

@intrepid42 in 10_4_X we should push for slc7 as new production version IMO

@mseidel42
Contributor

Ok, I created a pull request!

@mseidel42
Contributor

NB: this Rivet release is probably not super-important yet. But we definitely want to use Rivet 3.0 once it is available to get new features like processing of multiple weights

@mseidel42
Contributor

So, unfortunately, the nanoAod workflow fails on slc6 (relies on Rivet-based ParticleLevelProducer) in #4427.

Is it technically possible to use the "legacy" version 2.5.4 for slc6, using a statement like this in the spec file?

case "%{cmsplatf}" in
  slc6*)
    %{realversion}=2.5.4
    ;;
esac

This would ensure the basic functionality on slc6 (= old plugins and ParticleLevelProducer).

I think it would be acceptable to move Rivet plugin development and advanced usage (latest plugins and new features) to slc7.

I fear that we would also need to have different RivetAnalyzer code for different Rivet versions once they diverge too much... Can that be done with some #ifdef or BuildFile.xml statements?

@smuzaffar
Contributor

@intrepid42, currently the version is fixed in the spec file and cannot be changed. I will see if cmsBuild can dynamically assign a version.
Currently there are a couple of hacks to build different versions for different archs.

@mseidel42
Contributor

Thank you, that looks promising! Do you know if it works with %ifarch slc6/7?

Are there flags that we can use during compile time for the code in the RivetInterface package? We expect to integrate some new functionalities with the Rivet 3.0 upgrade (heavy-ion support, multi-weight handling), so we would need to make those parts invisible to the compiler on slc6 (that would know only the headers of Rivet 2.5.4).
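
One possible approach, sketched here under assumptions, is to derive a preprocessor definition from the Rivet version at build time so that the C++ code in RivetInterface can guard newer features with `#if`. The helper name `rivet_major_define` and the macro name `HYPOTHETICAL_RIVET_MAJOR` are invented for illustration. How such a flag would be passed to the compiler through BuildFile.xml is not shown and would need to be checked with the build-system experts.

```shell
#!/bin/sh
# Hypothetical helper: turn a Rivet version string such as "2.5.4" into a
# preprocessor definition. The macro name HYPOTHETICAL_RIVET_MAJOR is an
# assumption for illustration, not something Rivet or CMSSW defines.
rivet_major_define() {
  version="$1"
  major="${version%%.*}"   # strip everything from the first dot onwards
  echo "-DHYPOTHETICAL_RIVET_MAJOR=${major}"
}

# Example with the two version lines discussed in this thread.
rivet_major_define 2.5.4
rivet_major_define 3.0
```

With a definition like this on the compile line, code that needs the newer headers could be wrapped in `#if HYPOTHETICAL_RIVET_MAJOR >= 3 ... #endif`, so that an slc6 build against 2.5.4 never sees it.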

@smuzaffar
Contributor

You can use %{cmsplatf}; see the example here:

case "%{cmsplatf}" in

%define isslc6 %(case %{cmsplatf} in (slc6*) echo 1 ;; (*) echo 0 ;; esac)
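
The shell fragment embedded in the `%define` line can be tried on its own, outside a spec file. This is a minimal sketch: the function name `is_slc6` is chosen here for illustration, and its argument plays the role of `%{cmsplatf}`.

```shell
#!/bin/sh
# Same case statement as in the %define line, wrapped in a function so it
# can be run outside of a spec file. "$1" stands in for %{cmsplatf}.
is_slc6() {
  case "$1" in
    (slc6*) echo 1 ;;
    (*)     echo 0 ;;
  esac
}

is_slc6 slc6_amd64_gcc700   # prints 1
is_slc6 slc7_amd64_gcc700   # prints 0
```

The resulting 0 or 1 is what a spec file could then test to choose between platform-specific settings.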

@smuzaffar
Contributor

Closing this; we have disabled OPENMP for slc6 to avoid this crash.
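
To see whether a given library still pulls in an OpenMP runtime directly, one can inspect its NEEDED entries. The sketch below assumes the runtime in question is GCC's libgomp, which the thread does not state explicitly, and that `readelf` is available; the helper name `needs_libgomp` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `readelf -d <library>` output on stdin and
# succeeds if libgomp (GCC's OpenMP runtime, an assumption here) is listed
# as a directly NEEDED library.
needs_libgomp() {
  grep -Eq '\(NEEDED\).*\[libgomp\.so'
}

# Report every shared object on the command line that links libgomp.
# With no arguments the loop does nothing.
for lib in "$@"; do
  if readelf -d "$lib" 2>/dev/null | needs_libgomp; then
    echo "$lib: links against libgomp"
  fi
done
```

This only covers direct dependencies; a library loaded indirectly through another dependency would not show up in this check.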
