Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CTagger retraining in Scikit-Learn for 80X release #13054

Merged
merged 3 commits into from Feb 1, 2016

Conversation

smoortga
Copy link
Contributor

This PR changes training_settings.py in RecoBTag/CTagging to read in the variables of the new training.
The new training files can be found in the following PR:
cms-data/RecoBTag-CTagging#2

@mverzett @pvmulder @acaudron @imarches

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @smoortga for CMSSW_8_0_X.

It involves the following packages:

RecoBTag/CTagging

@cmsbuild, @cvuosalo, @davidlange6, @slava77 can you please review it and eventually sign? Thanks.
@ferencek, @acaudron, @pvmulder, @imarches this is something you requested to watch as well.
@slava77, @Degano, @smuzaffar you are the release manager for this.

cms-bot commands are list here #13028

@slava77
Copy link
Contributor

slava77 commented Jan 26, 2016

@smoortga please clarify when we can expect an update to the cms-data
It is better operationally to update file names in the cms-data external package and then switch to a new file once this file is distributed in the IB releases.

@smoortga
Copy link
Contributor Author

@slava77 I am not completely sure what you mean, but allow me to explain the situation a bit more clearly:
This PR is intended to replace an existing CTagger MVA training by a new more optimal one. Only two things have to change in this case:

  1. the new training .xml files have to be replaced in cms-data, for which I made the following PR: Replaced training .xml files with new trainings for 80X release cms-data/RecoBTag-CTagging#2
  2. the file RecoBTag/CTagging/python/training_settings.py needs to be adapted since the new training uses a different set of variables. Therefore I made this PR I am currently replying to

First of all one can not go without the other. Since the number of variables in training_settings.py and in the .xml files need to match.
Secondly, appearantly someone merged the cms-data PR, which caused conflicts for the cmssw PR and so it was reverted. Here are links to the relevant 'revert' PR and issue in cmsdist:
cms-data/RecoBTag-CTagging#3
cms-sw/cmsdist#2088

I am a bit lost in the current status of the PRs, but I think it is up to you to merge them simultaneously, as I said: none can succeed without the other. Let me know if action needs to be taken from my side.

Kind Regards,

Seth Moortgat

@slava77
Copy link
Contributor

slava77 commented Jan 26, 2016

Hi Seth,

My proposal is:

  • you make a cms-data PR which add a new input file name with a new name
  • you update this PR to use the new name
  • we wait for the new input file name to appear in the release distribution area
  • once the file shows up I test this PR and once it works, it gets merged into the release
  • some time later I would expect you to remove the old training file from the cms-data

@smoortga
Copy link
Contributor Author

Ok indeed a solid plan. I will do it straight away, hoping the new xml file names can be merged by tomorrow.

@cmsbuild
Copy link
Contributor

Pull request #13054 was updated. @cmsbuild, @cvuosalo, @davidlange6, @slava77 can you please check and sign again.

@mverzett
Copy link
Contributor

@slava77 @Degano could you please updates us on the merging status of both this PR and the cmsdata one? I personally got lost with the merge-revert-reopen cycles.

Thanks!

Since the training only uses vertexMass and not correctedSVMass, we put the parameter to false
@cmsbuild
Copy link
Contributor

Pull request #13054 was updated. @cmsbuild, @cvuosalo, @davidlange6, @slava77 can you please check and sign again.

@mverzett
Copy link
Contributor

@slava77 @Degano
We really would like to get this in 8.0.X. I don't know if it can be considered just an update and therefore enter in the next pre or it has to be merged before Monday. In the first case just let us know so we can relax a bit too :).

Cheers

@slava77
Copy link
Contributor

slava77 commented Jan 29, 2016

@mverzett
I think you are still on track for pre6.

@slava77
Copy link
Contributor

slava77 commented Jan 31, 2016

@cmsbuild please test

@cmsbuild
Copy link
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/10861/console

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

@slava77
Copy link
Contributor

slava77 commented Feb 1, 2016

+1

for #13054 16a258e

  • code updates are in line with the description: changed training input files and variable definitions used in it.
  • jenkins tests pass and comparisons with baseline show differences in c-tagger variables only as expected: typical value pattern changed in both CvsB and CvsL (the pattern is similar in the available workflows)
    • CvsB is now more negative at the lower values and more positive on the higher values, which looks like just the output scale change (IMO, it could have been normalized to something like 0 to 1 instead)
      plots from 134.911 (singlePhoton 2015 data, mostly light jets, supposedly)
      all_oldvsnew_runsingleph2015dwf134p911c_recojetedmreftobaseprodtofloatsassociationvector_pfcombinedcvsbjettags__rereco_obj_data_
      all_oldvsnew_runsingleph2015dwf134p911c_recojetedmreftobaseprodtofloatsassociationvector_pfcombinedcvsbjettags__rereco_obj_data_171
      compared to ttbar events (wf 25.0)
      all_oldvsnew_ttbarwf25p0c_recojetedmreftobaseprodtofloatsassociationvector_pfcombinedcvsbjettags__reco_obj_data_
    • CvsL peak (no-discrimination value?) moved up to the more positive and the broader distribution of values is more negative, again, as if the scale has changed
      134.911 workflow as above
      all_oldvsnew_runsingleph2015dwf134p911c_recojetedmreftobaseprodtofloatsassociationvector_pfcombinedcvsljettags__rereco_obj_data_
      compare to ttbar events (25.0)
      all_oldvsnew_ttbarwf25p0c_recojetedmreftobaseprodtofloatsassociationvector_pfcombinedcvsljettags__reco_obj_data_

There are still no DQM plots, apparently (no changes appear in the DQM output comparisons).

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 1, 2016

This pull request is fully signed and it will be integrated in one of the next CMSSW_8_0_X IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @Degano, @smuzaffar

@davidlange6
Copy link
Contributor

+1

cmsbuild added a commit that referenced this pull request Feb 1, 2016
CTagger retraining in Scikit-Learn for 80X release
@cmsbuild cmsbuild merged commit b0cabfa into cms-sw:CMSSW_8_0_X Feb 1, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants