Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate Calibration Code to New Random Service Interface #2523

Merged
merged 2 commits into from Mar 4, 2014

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Feb 18, 2014

Migrate Calibration code to use the new interface
of the random number generator service designed to work
with the multithreaded Framework. The main interface
change is to require a StreamID or LuminosityBlockIndex
argument to the getEngine function. These objects are only
available during the event or beginLuminosityBlock method.

There is only one calibration module using the random
number interface. I do not know how to test this module
as nothing else in CMSSW references it.

Migrate Calibration code to use the new interface
of the random number generator service designed to work
with the multithreaded Framework. The main interface
change is to require a StreamID or LuminosityBlockIndex
argument to the getEngine function. These objects are only
available during the event or beginLuminosityBlock method.

There is only one calibration module using the random
number interface. I do not know how to test this module
as nothing else in CMSSW references it.
@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wddgit (W. David Dagenhart) for CMSSW_7_1_X.

Migrate Calibration Code to New Random Service Interface

It involves the following packages:

CalibMuon/DTCalibration

@cmsbuild, @Degano, @diguida, @rcastello, @nclopezo can you please review it and eventually sign? Thanks.
You can sign-off by replying to this message having '+1' in the first line of your reply.
You can reject by replying to this message having '-1' in the first line of your reply.
@nclopezo, @ktf you are the release manager for this.
You can merge this pull request by typing 'merge' in the first line of your comment.

@cmsbuild
Copy link
Contributor

-1
When I ran the RelVals I found an error in the following worklfows:
401.0 step1

runTheMatrix-results/401.0_TTbarNewMix+TTbarFSPU2+HARVESTFS/step1_TTbarNewMix+TTbarFSPU2+HARVESTFS.log
----- Begin Fatal Exception 19-Feb-2014 10:08:58 CET-----------------------
An exception of category 'FallbackFileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=MixingModule label='mixGenPU'
   [2] Calling RootInputFileSequence::initFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd.ba.infn.it//store/relval/CMSSW_5_3_6-START53_V14/RelValProdMinBias/GEN-SIM-RAW/v2/00000/4677049F-042A-E211-8525-0026189438E8.root'
   Additional Info:
      [a] Input file root://eoscms//eos/cms/store/relval/CMSSW_5_3_6-START53_V14/RelValProdMinBias/GEN-SIM-RAW/v2/00000/4677049F-042A-E211-8525-0026189438E8.root?svcClass=default could not be opened.
Fallback Input file root://xrootd.ba.infn.it//store/relval/CMSSW_5_3_6-START53_V14/RelValProdMinBias/GEN-SIM-RAW/v2/00000/4677049F-042A-E211-8525-0026189438E8.root also could not be opened.
      [b] XrdClient::Open(name='root://xrootd.ba.infn.it//store/relval/CMSSW_5_3_6-START53_V14/RelValProdMinBias/GEN-SIM-RAW/v2/00000/4677049F-042A-E211-8525-0026189438E8.root', flags=0x10, permissions=0666) => error 'cannot obtain credentials for protocol: Secgsi: ErrParseBuffer: unknown CA: cannot verify server certificate: kXGS_init: unable to get protocol object.' (errno=3010)
      [c] Current server connection: root://xrootd.ba.infn.it:1094//store/relval/CMSSW_5_3_6-START53_V14/RelValProdMinBias/GEN-SIM-RAW/v2/00000/4677049F-042A-E211-8525-0026189438E8.root
----- End Fatal Exception -------------------------------------------------

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2523/221/summary.html

@ktf
Copy link
Contributor

ktf commented Feb 19, 2014

xrootd is on holidays today...

On 19 Feb 2014, at 10:33, cmsbuild wrote:

-1
When I ran the RelVals I found an error in the following worklfows:
401.0 step1

runTheMatrix-results/401.0_TTbarNewMix+TTbarFSPU2+HARVESTFS/step1_TTbarNewMix+TTbarFSPU2+HARVESTFS.log
----- Begin Fatal Exception 19-Feb-2014 10:08:58 
CET-----------------------
An exception of category 'FallbackFileOpenError' occurred while
[0] Constructing the EventProcessor
[1] Constructing module: class=MixingModule label='mixGenPU'
[2] Calling RootInputFileSequence::initFile()
[3] Calling StorageFactory::open()
[4] Calling XrdFile::open()
Exception Message:
Failed to open the file 
'root://xrootd.ba.infn.it//store/relval/CMSSW_5_3_6-START53_V14/RelValProdMinBias/GEN-SIM-RAW/v2/00000/4677049F-042A-E211-8525-0026189438E8.root'
Additional Info:
   [a] Input file 
root://eoscms//eos/cms/store/relval/CMSSW_5_3_6-START53_V14/RelValProdMinBias/GEN-SIM-RAW/v2/00000/4677049F-042A-E211-8525-0026189438E8.root?svcClass=default 
could not be opened.
Fallback Input file 
root://xrootd.ba.infn.it//store/relval/CMSSW_5_3_6-START53_V14/RelValProdMinBias/GEN-SIM-RAW/v2/00000/4677049F-042A-E211-8525-0026189438E8.root 
also could not be opened.
   [b] 
XrdClient::Open(name='root://xrootd.ba.infn.it//store/relval/CMSSW_5_3_6-START53_V14/RelValProdMinBias/GEN-SIM-RAW/v2/00000/4677049F-042A-E211-8525-0026189438E8.root', 
flags=0x10, permissions=0666) => error 'cannot obtain credentials for 
protocol: Secgsi: ErrParseBuffer: unknown CA: cannot verify server 
certificate: kXGS_init: unable to get protocol object.' (errno=3010)
   [c] Current server connection: 
root://xrootd.ba.infn.it:1094//store/relval/CMSSW_5_3_6-START53_V14/RelValProdMinBias/GEN-SIM-RAW/v2/00000/4677049F-042A-E211-8525-0026189438E8.root
----- End Fatal Exception 
-------------------------------------------------

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2523/221/summary.html


Reply to this email directly or view it on GitHub:
#2523 (comment)

@nclopezo
Copy link
Contributor

@Dr15Jones
Copy link
Contributor

@ktf but we only go to xrootd if we first fail to find a file in eos.

@ktf
Copy link
Contributor

ktf commented Feb 19, 2014

I think it should have been "EOS is on holidays". xrootd == EOS in my mind… Anyways, problems seems to be solved:

https://hypernews.cern.ch/HyperNews/CMS/get/cernCompAnnounce/911/1.html

david can you restart the tests?

@nclopezo
Copy link
Contributor

I restarted them, the results are in my previous comment.

#2523 (comment)

@diguida
Copy link
Contributor

diguida commented Feb 19, 2014

+1
From the technical side it looks ok, I am asking for confirmation to DT AlCa contacts.

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_7_1_X IBs unless changes (tests are also fine). @nclopezo, @ktf can you please take care of it?

@diguida
Copy link
Contributor

diguida commented Feb 19, 2014

Hi @wddgit @Dr15Jones
I informed our AlCaDB contacts, you are in cc.

@ktf
Copy link
Contributor

ktf commented Feb 20, 2014

Can you please mention the contacts here as well and put a -1 until you are ok?

@diguida
Copy link
Contributor

diguida commented Feb 20, 2014

Hi @ktf
Finding them on github...

@diguida
Copy link
Contributor

diguida commented Feb 20, 2014

Hi @olschewski @passaseo @ronchese
[feel free to add the missing DT contacts, I could not find them on github]
The framework experts reported an issue with the DT alignment code that could break the multithreaded Framework.
The code is in this module:
CalibMuon/DTCalibration/test/DBTools/FakeTTrig.cc
David did yesterday this pull request with a fix. May I ask you to review it?
Briefly, it is moving most of the work from endJob to beginLuminosityBlock. The tests are passed, so technically is ok, and our guess is that it makes no difference. In this case, please check if the DB writing at the beginLumi is ok: e.g. PopCon (this test is inspired to it, I'd say) does this at endJob. You can check the possible solutions proposed by the FWK in [1].
On the other hand, the code is very old and maybe no more maintained, as this module is only referenced by one configuration file in CMSSW and that configuration is obviously obsolete and will not work anymore. If this is the case, can you consider removing the module completely? If not, can you please make the configuration work again and maintain this legacy code?

Thanks in advance for contributing to keep CMSSW well maintained

[1]

  1. Move the work done during endJob to
    beginLuminosityBlock and add a flag so that work
    is only done once in a process. If there are no other
    suggestions or objections, I will do this and just hope
    it works. (This module writes something to some
    database. I do not understand CMSSW databases well).
    I have no idea how to test this.
  2. If this is a standalone test utility that will never be run
    with other things and where replay is not an issue, we
    could simply have it instantiate its own random engine
    and seed the engine with a parameter of the module.
    Then it would not use the service at all. In general
    we try to avoid random number generation not using
    the service, but for a standalone test ....
  3. It's a hack, but I could get an engine during the
    beginLuminosityBlock method and save a pointer to
    it until endJob. But in multithreaded mode, there will
    be more than one so it will be random and
    nonreproducible as to which one of the multiple engines
    you will get and each is seeded differently.
  4. We could just remove the randomness from the module.
    I do not understand the purpose it serves.

@passaseo
Copy link

Ciao Salvatore,
Avendo preso i mano questo software da poco, non conosco bene
la funzione di questo modulo. Lo guardo con l'aiuto di Paolo, e ti faccio sapere.
Marina

On 20 febbraio 2014 19:14:04 CET, Salvatore Di Guida notifications@github.com wrote:

Hi @olschewski @passaseo @ronchese
[feel free to add the missing DT contacts, I could not find them on
github]
The framework experts reported an issue with the DT alignment code that
could break the multithreaded Framework.
The code is in this module:
CalibMuon/DTCalibration/test/DBTools/FakeTTrig.cc
David did yesterday this pull request with a fix. May I ask you to
review it?
Briefly, it is moving most of the work from endJob to
beginLuminosityBlock. The tests are passed, so technically is ok, and
our guess is that it makes no difference. In this case, please check if
the DB writing at the beginLumi is ok: e.g. PopCon (this test is
inspired to it, I'd say) does this at endJob. You can check the
possible solutions proposed by the FWK in [1].
On the other hand, the code is very old and maybe no more maintained,
as this module is only referenced by one configuration file in CMSSW
and that configuration is obviously obsolete and will not work anymore.
If this is the case, can you consider removing the module completely?
If not, can you please make the configuration work again and maintain
this legacy code?

Thanks in advance for contributing to keep CMSSW well maintained

[1]

  1. Move the work done during endJob to
    beginLuminosityBlock and add a flag so that work
    is only done once in a process. If there are no other
    suggestions or objections, I will do this and just hope
    it works. (This module writes something to some
    database. I do not understand CMSSW databases well).
    I have no idea how to test this.
  2. If this is a standalone test utility that will never be run
    with other things and where replay is not an issue, we
    could simply have it instantiate its own random engine
    and seed the engine with a parameter of the module.
    Then it would not use the service at all. In general
    we try to avoid random number generation not using
    the service, but for a standalone test ....
  3. It's a hack, but I could get an engine during the
    beginLuminosityBlock method and save a pointer to
    it until endJob. But in multithreaded mode, there will
    be more than one so it will be random and
    nonreproducible as to which one of the multiple engines
    you will get and each is seeded differently.
  4. We could just remove the randomness from the module.
    I do not understand the purpose it serves.

Reply to this email directly or view it on GitHub:
#2523 (comment)

-- Inviato dal mio tablet con K-9 Mail.


Marina Passaseo
Istituto nazionale di fisica nucleare
Via Marzolo 8
35131 Padova
Tel 0499677099


@passaseo
Copy link

Dear All,
having had a discussion with Salvatore and Paolo to clarify the problem
with the random generator, I think that there is a basic info we miss.
The BeginLuminosityBlock and the JobEnd block are supposed to
be executed in different threads? Otherwise I do not see why I should loose
the pointer to the right random generator engine.
thanks Marina

On 20/02/2014 19:14, Salvatore Di Guida wrote:

Hi @olschewski https://github.com/olschewski @passaseo
https://github.com/passaseo @ronchese https://github.com/ronchese
[feel free to add the missing DT contacts, I could not find them on
github]
The framework experts reported an issue with the DT alignment code
that could break the multithreaded Framework.
The code is in this module:
CalibMuon/DTCalibration/test/DBTools/FakeTTrig.cc
David did yesterday this pull request with a fix. May I ask you to
review it?
Briefly, it is moving most of the work from endJob to
beginLuminosityBlock. The tests are passed, so technically is ok, and
our guess is that it makes no difference. In this case, please check
if the DB writing at the beginLumi is ok: e.g. PopCon (this test is
inspired to it, I'd say) does this at endJob. You can check the
possible solutions proposed by the FWK in [1].
On the other hand, the code is very old and maybe no more maintained,
as this module is only referenced by one configuration file in CMSSW
and that configuration is obviously obsolete and will not work
anymore. If this is the case, can you consider removing the module
completely? If not, can you please make the configuration work again
and maintain this legacy code?

Thanks in advance for contributing to keep CMSSW well maintained

[1]

    1.

        Move the work done during endJob to
        beginLuminosityBlock and add a flag so that work
        is only done once in a process. If there are no other
        suggestions or objections, I will do this and just hope
        it works. (This module writes something to some
        database. I do not understand CMSSW databases well).
        I have no idea how to test this.

    2.

        If this is a standalone test utility that will never be run
        with other things and where replay is not an issue, we
        could simply have it instantiate its own random engine
        and seed the engine with a parameter of the module.
        Then it would not use the service at all. In general
        we try to avoid random number generation not using
        the service, but for a standalone test ....

    3.

        It's a hack, but I could get an engine during the
        beginLuminosityBlock method and save a pointer to
        it until endJob. But in multithreaded mode, there will
        be more than one so it will be random and
        nonreproducible as to which one of the multiple engines
        you will get and each is seeded differently.

    4.

        We could just remove the randomness from the module.
        I do not understand the purpose it serves.


Reply to this email directly or view it on GitHub
#2523 (comment).


Marina Passaseo

/Istituto Nazionale di Fisica Nucleare/

/Sezione di Padova/

/Via Marzolo, 8/

/35131 Padova/

//

/Tel: +390499677099/

/Fax: +390499677110/

@wddgit
Copy link
Contributor Author

wddgit commented Feb 21, 2014

Yes, beginLuminosityBlock and JobEnd could be on different threads. In fact there could be multiple beginLuminosityBlock's running concurrently, each on its own thread and each would have its own random engine instance. JobEnd would not run concurrently with them, but might or might not be on the same thread as one of them.

Independently of that, for replay to work properly, random numbers can only be generated in the event and beginLuminosityBlock methods and the new interface makes it much more difficult to violate this rule. This replay requirement predates this multithreading effort, although there have been a number of violations of it. This is one of the reasons replay often fails to work.

@wddgit
Copy link
Contributor Author

wddgit commented Feb 21, 2014

I am not sure if you were included in the previous discussions so I will repeat this. With the interface change no argument getEngine function will be deleted. The current version of this code will fail to compile when that function is deleted. In the future, a StreamID or LuminosityBlockIndex will be required as an argument. It is only possible to get those in the beginLuminosityBlock method and the event method of a module.

@ktf
Copy link
Contributor

ktf commented Feb 22, 2014

Not approving for now. Alca please put a -1 until responsible are happy (or get them to agree).

@diguida
Copy link
Contributor

diguida commented Feb 22, 2014

-1
As suggested by @ktf
Thanks @wddgit for the explanation: the missing piece was exactly where endJob could run. From your reply, I try to summarise here all the relevant points.
Concerning the framework behaviour:

  1. You have many beginLuminosityBlock transitions, one per Stream (the handle for processing events; it is basically a thread, but IIRC, there could be exceptions, @Dr15Jones am I right?), so they run concurrently during a job execution;
  2. endJob runs after all the beginLuminosityBlock (you should process all your events before closing the job!), but it is not guaranteed on which thread can run.

Regarding the old code:

  1. In our case, with the old implementation, we instantiate the random engine in the constructor, and we use it in the endJob for filling the payloads and write them into the DB.
//in the header
CLHEP::RandGaussQ* theGaussianDistribution;
//in the source
//constructor snippet
...
theGaussianDistribution = new CLHEP::RandGaussQ(rng->getEngine());
...
//endJob snippet
...
double gaussianSmearing = theGaussianDistribution->fire(0.,smearing);
...

As far as the RandomNumberGenerator service is concerned:

  1. In the new implementation of the RandomNumberGenerator service, there is an interface change: the no argument getEngine function will be deleted. The old version of this code will fail to compile when that function is deleted. In the future, a StreamID or LuminosityBlockIndex will be required as an argument. It is only possible to get those in the beginLuminosityBlock method and the event method of a module.
  2. In order to guarantee replays to work, random numbers can only be generated in the event and beginLuminosityBlock methods. The new interface of the RandomNumberGenerator service makes it much more difficult to violate this rule, so ulfils the requirement of guaranteeing replays to works.

Finally concerning the changes proposed here:

  1. If we simply move the RandomNumberGenerator service in the beginLuminosityBlock method, keeping the payload creation in the endJob, we will have as many engines instantiated as streams. So, according to which thread executes the endJob after all the beginLuminosityBlock transitions, that one will pick the engine corresponding to that thread. This yields not reproducible results.
  2. In the new implementation of our code, the work done during endJob is moved to beginLuminosityBlock and a flag is added so that work is only done once in a process.

Based on the last two points, I have a question to @wddgit and @Dr15Jones : is the

bool dataBaseWriteWasDone

flag thread-safe? If two threads start at the same time, they both could see it false and triggering the creation of the payload and the writing into the DB. Am I correct?

This also reminds me that this will be a "Classic" module (one instance shared across all Streams), and the transitions guaranteed to be serialised in the framework are the begin*Summary and end*Summary ones.

Thanks

@diguida
Copy link
Contributor

diguida commented Feb 22, 2014

@passaseo @ronchese
As specified by @wddgit

this module is only referenced by one configuration file in CMSSW and that configuration is obviously obsolete and will not work anymore.

Can you please make the configuration work again?
This will eventually allow us to make the test run again, so to check the correctness of the proposed changes.

Thanks

@nclopezo nclopezo modified the milestones: CMSSW_7_1_0_pre4, CMSSW_7_1_0_pre3 Feb 24, 2014
@passaseo
Copy link

Since this software is not used o written by us,
I shall forward your request to the calibration group.
ciao Marina
On 22/02/2014 15:34, Salvatore Di Guida wrote:

@passaseo https://github.com/passaseo @ronchese
https://github.com/ronchese
As specified by @wddgit https://github.com/wddgit

this module is only referenced by one configuration file in CMSSW
and that configuration is obviously obsolete and will not work
anymore.
Can you please make the configuration work again?
This will eventually allow us to make the test run again, so to
check the correctness of the proposed changes.

Thanks


Reply to this email directly or view it on GitHub
#2523 (comment).


Marina Passaseo

/Istituto Nazionale di Fisica Nucleare/

/Sezione di Padova/

/Via Marzolo, 8/

/35131 Padova/

//

/Tel: +390499677099/

/Fax: +390499677110/

@wddgit
Copy link
Contributor Author

wddgit commented Feb 24, 2014

The flag in the proposed new code is OK in a "legacy (aka classic)" type module. It would also be OK in a "one" type module that watches luminosity blocks. If we were to convert the module to be a "stream" or "global" type module, the flag would be a problem with thread safety and we would have to do something else, possibly make it an atomic. Based on what the code does, I would think we would probably make it a "one" type module. I will make that additional change if you would like. I didn't only because I was trying to do the minimum to migrate to the new random number service interface. (I've migrated more than a hundred modules already and have been doing the minimum so I could finish the migration faster, this is one of 3 modules left to be migrated)

In your good summary, I noticed only error. The number of concurrent beginLuminosityBlock methods is independent of the number of streams. Actually, when there is a legacy module in the job, it will be limited to one.

Fix test configuration. The random number service
was being configured with a syntax deprecated in
2006 and for which support was dropped in March 2013.
If run without this fix and a configuration exception
would be thrown.
@wddgit
Copy link
Contributor Author

wddgit commented Feb 25, 2014

I added to this pull request a fix for the test configuration that I said was broken. It had been broken since March 2013. With the fix, it no longer dies with a configuration error. Now on my local machine it dies with the following error (presumably because I am not at cern)

Connection on "sqlite_file:/afs/cern.ch/cms/CAF/CMSALCA/ALCA_MUONCALIB/DTCALIB/COMM09/ttrig/ttrig_ResidCorr_112281.db" cannot be established ( CORAL : "ConnectionPool::getSessionFromNewConnection" from "CORAL/Services/ConnectionService" )

I'm hoping you can approve this soon. As of this morning, this became the very last thing to complete this large migration. Even if you decide that you do not like my fix, you can rewrite in a subsequent pull request however you like or delete it.

@cmsbuild
Copy link
Contributor

Pull request #2523 was updated. @cmsbuild, @Degano, @diguida, @rcastello, @nclopezo can you please check and sign again.

@cmsbuild
Copy link
Contributor

@wddgit
Copy link
Contributor Author

wddgit commented Mar 3, 2014

I understand you are busy with other genuinely important issues, but can you please approve this pull request soon? First of all, I think the modified code will work as well as the previous version. Second, if after having more time consider this, you want to revise this file again in some different way, please feel free to do so. If you want me to modify the pull request, please let me know how and I will gladly do it. This is the last small step of a migration that involved hundreds of files and several pull requests and is important to move the multithreaded development effort forward.

@diguida
Copy link
Contributor

diguida commented Mar 3, 2014

+1

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 3, 2014

This pull request is fully signed and it will be integrated in one of the next CMSSW_7_1_X IBs unless changes (tests are also fine). @nclopezo, @ktf can you please take care of it?

@diguida
Copy link
Contributor

diguida commented Mar 3, 2014

Hi @wddgit
Thanks for taking care of this. From our side it is ok now.

ktf added a commit that referenced this pull request Mar 4, 2014
New Random Number Generator -- Migrate Calibration Code to New Random Service Interface
@ktf ktf merged commit ed3706a into cms-sw:CMSSW_7_1_X Mar 4, 2014
@nclopezo nclopezo modified the milestones: CMSSW_7_1_0_pre5, CMSSW_7_1_0_pre4 Mar 10, 2014
@wddgit wddgit deleted the migrateToNewRandomService4 branch April 25, 2014 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants