Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RPCRecHits nan protection #18955

Merged
merged 2 commits into from May 31, 2017
Merged

RPCRecHits nan protection #18955

merged 2 commits into from May 31, 2017

Conversation

jhgoh
Copy link
Contributor

@jhgoh jhgoh commented May 26, 2017

(modified) RMS of timing in the RPC local reconstruction had bug in the fomula and gave nan values.

This PR fixes fomula, but also forces the variance to have non-negative value, protect from possible crash or undefined behaviour.
We found unphysical peak at time=0 in the phase-II RPCRecHit validation and this must be linked with this problem.
(https://cmsweb.cern.ch/dqm/relval/plotfairy/archive/1/RelValSingleMuPt100Extended/CMSSW_9_1_1-91X_upgrade2023_realistic_v1_D17-v1/DQMIO/RPC/RPCRecHitV/SimVsReco/HitProperty/RecHitTimeBarrel?session=0vLmrp;v=1495479251889465704;w=924;h=715)

With the limited number of events, the peak is disappeared after the fix (red=before, blue=after).
image

The current data taking is not affected. The timing information is filled only for the phase-II upgrade.

@bapavlov @mileva

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @jhgoh (Junghwan John Goh) for master.

It involves the following packages:

RecoLocalMuon/RPCRecHit

@perrotta, @cmsbuild, @slava77, @davidlange6 can you please review it and eventually sign? Thanks.
@bellan this is something you requested to watch as well.
@davidlange6 you are the release manager for this.

cms-bot commands are listed here

@slava77
Copy link
Contributor

slava77 commented May 26, 2017

@cmsbuild please test

@cmsbuild
Copy link
Contributor

cmsbuild commented May 26, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/20127/console Started: 2017/05/26 05:58

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-18955/20127/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 95 differences found in the comparisons
  • DQMHistoTests: Total files compared: 24
  • DQMHistoTests: Total histograms compared: 1833592
  • DQMHistoTests: Total failures: 17881
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 1815531
  • DQMHistoTests: Total skipped: 180
  • DQMHistoTests: Total Missing objects: 0
  • Checked 98 log files, 14 edm output root files, 24 DQM output files

@jhgoh
Copy link
Contributor Author

jhgoh commented May 26, 2017

including @calabria @jshlee to inform that we have a bugfix PR.

@@ -26,11 +26,11 @@ int RPCCluster::bx() const { return bunchx; }

bool RPCCluster::hasTime() const { return nTime > 0; }
float RPCCluster::time() const { return hasTime() ? sumTime/nTime : 0; }
float RPCCluster::timeRMS() const { return hasTime() ? sqrt((sumTime2*nTime - sumTime*sumTime)/nTime) : -1; }
float RPCCluster::timeRMS() const { return hasTime() ? sqrt(std::max(0.F, sumTime2*nTime - sumTime*sumTime)/nTime) : -1; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the problem appearing only for n=1?
It seems more appropriate to handle n=1 case explicitly.

Furthermore, applicable to yRMS as well as timeRMS: these are supposed to correspond to estimated rec hit y and t uncertainties.

  • I expect that the formula should be sqrt( 1/(n - 1) * ( sumTime2 - sumTime*sumTime/n ) ) from the unbiased estimator of the variance. There is 1/(n-1) term missing.
    • more hits should give a better cluster estimation, while the current formula e.g. for time mean = 0 gives a plain sqrt(sumTime2) which keeps growing.

Then, in addition, I do not see where the uncertainty of a single time or y measurement is added. Current version gives timeRMS = 0 in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was typo and the formula simply got wrong. Fixing it now.

@cmsbuild
Copy link
Contributor

Pull request #18955 was updated. @perrotta, @cmsbuild, @slava77, @davidlange6 can you please check and sign again.

@jhgoh
Copy link
Contributor Author

jhgoh commented May 27, 2017

@slava77 Thank you for comment. It turned out the the parenthesis were place at wrong place in the previous version.

Regarding on adding 1/(n-1) factor, I would like to keep it unchanged for now.
The return value of timeRMS() have not been used in the serious error calculation at this moment. Later it will be revised to be computed more realistic way.

@slava77
Copy link
Contributor

slava77 commented May 27, 2017

@cmsbuild please test

@slava77
Copy link
Contributor

slava77 commented May 27, 2017

@jhgoh please confirm that max(0, is needed only in cases of numerical round-off (a few printouts would help).
If there are other cases, there may be bugs in upstream code.

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-18955/20153/summary.html

Comparison Summary:

  • You potentially added 139 lines to the logs
  • Reco comparison results: 3403 differences found in the comparisons
  • DQMHistoTests: Total files compared: 24
  • DQMHistoTests: Total histograms compared: 1833592
  • DQMHistoTests: Total failures: 48760
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 1784652
  • DQMHistoTests: Total skipped: 180
  • DQMHistoTests: Total Missing objects: 0
  • Checked 98 log files, 14 edm output root files, 24 DQM output files

@jhgoh
Copy link
Contributor Author

jhgoh commented May 28, 2017

@slava77
I confirm that the nan issue comes from the numerical truncation - there are some entries with small negative value. Below is printout, see the 3rd column.

CHECKTIME -1.165753 0.000000
CHECKTIME 3.279032 -0.000005
CHECKTIME 1.504020 0.000000
CHECKTIME 1.045554 0.000000
CHECKTIME -1.314781 0.000000
CHECKTIME -3.226450 0.000003
CHECKTIME 0.904386 0.000000
CHECKTIME 0.461982 0.000000
CHECKTIME 1.686430 0.000000
CHECKTIME -1.354682 0.000000
CHECKTIME 0.047267 -0.000000

@slava77
Copy link
Contributor

slava77 commented May 28, 2017 via email

@jhgoh
Copy link
Contributor Author

jhgoh commented May 28, 2017

For the print out,

  1. removed sqrt() in the timeRMS() function in RecoLocalMuon/RPCRecHit/src/RPCCluster.cc
  • Before the fix float RPCCluster::timeRMS() const { return hasTime() ? sqrt((sumTime2*nTime - sumTime*sumTime)/nTime) : -1; }
  • Removed sqrt to check sign float RPCCluster::timeRMS() const { return hasTime() ? ((sumTime2*nTime - sumTime*sumTime)/nTime) : -1; }

And the same for the bugfixed one

  • After the fix float RPCCluster::timeRMS() const { return hasTime() ? sqrt(std::max(0.F, sumTime2*nTime - sumTime*sumTime))/nTime) : -1; }
  • Removed sqrt() and max() to check sign float RPCCluster::timeRMS() const { return hasTime() ? (sumTime2*nTime - sumTime*sumTime)/nTime/nTime : -1; }
  1. add printf in RecoLocalMuon/RPCRecHit/src/RPCRecHitStandardAlgo.cc
diff --git a/RecoLocalMuon/RPCRecHit/src/RPCRecHitStandardAlgo.cc b/RecoLocalMuon/RPCRecHit/src/RPCRecHitStandardAlgo.cc
index e29113f..a950b84 100644
--- a/RecoLocalMuon/RPCRecHit/src/RPCRecHitStandardAlgo.cc
+++ b/RecoLocalMuon/RPCRecHit/src/RPCRecHitStandardAlgo.cc
@@ -57,6 +57,7 @@ bool RPCRecHitStandardAlgo::compute(const RPCRoll& roll,
   if ( cluster.hasTime() ) {
     time = cluster.time();
     timeErr = cluster.timeRMS();
+printf("Time Avg=%f Var=%f RMS=%f\n", time, timeErr, sqrt(timeErr)); // for "before fix" case
+//printf("Time Avg=%f Var=%f RMS=%f\n", time, timeErr, sqrt(std::max(0.F,timeErr))); // for "after fix" case
   }
   else {
     time = 0;

Time average and rms are printed out, diff before the fix(left) and after the fix(right)

Time Avg=-1.323429 Var=0.000000 RMS=0.000000			Time Avg=-1.323429 Var=0.000000 RMS=0.000000
Time Avg=-2.387054 Var=0.000000 RMS=0.000000			Time Avg=-2.387054 Var=0.000000 RMS=0.000000
Time Avg=-2.971581 Var=-0.000003 RMS=-nan		      |	Time Avg=-2.971581 Var=-0.000001 RMS=0.000000
Time Avg=3.228576 Var=8.602002 RMS=2.932917		      |	Time Avg=3.228576 Var=1.720400 RMS=1.311640
Time Avg=6.458827 Var=0.000000 RMS=0.000000			Time Avg=6.458827 Var=0.000000 RMS=0.000000
Time Avg=3.873698 Var=0.000000 RMS=0.000000			Time Avg=3.873698 Var=0.000000 RMS=0.000000

Is this clear?

@slava77
Copy link
Contributor

slava77 commented May 28, 2017 via email

@perrotta
Copy link
Contributor

With some debug I verified that the fix is needed because those RPCClusters, made up from nearby RPC digis, could:

  • be formed by one single digi, and in that case obviously rms is zero modulo some possible rund-off
  • be formed by a few digis, but all with the same time or y, and even in that case rms is zero modulo some possible rund-off

The two situations above represent the vast majority of RPCClusters, as seen in my debug session.

Those RPCCluster's are made inside RPCClusterizer.cc. Now, I have two questions for the experts:

  • is it correct that several digis with identically the same time or y concur in making the cluster?
  • is it correct that in several cases there are digis with time (hasTime()=true) but not y (hasY()=false), i.e. nTime different from nY?

@jhgoh
Copy link
Contributor Author

jhgoh commented May 29, 2017

@perrotta for the 2nd question - yes, there are digis with time but without y.
The link board upgrade scenario for the present RPC system makes timing measurement but no 2D information. Therefore nTime and nY can be different.

@bapavlov
Copy link
Contributor

@perrotta

is it correct that in several cases there are digis with time (hasTime()=true)
but not y (hasY()=false), i.e. nTime different from nY?

It is possible to have digis with hasTime()=true, but with (hasY()=false).
Let me give some more details.
The existing RPC system has only one dimensional readout plane - so for the existing system we have only
one coordinate (let say X coordinate). There is a plan to upgrade the existing system with a new Link Boards,
enabling in this way the precise timing. So, hopefully after the Link Board upgrade (foreseen for 2023) the existing chambers would provide precise time (i.e. hasTime()=true) but there is no way to add second coordinate (thus hasY()=false).

An additional upgrade is foreseen for 2023, by adding a completely new chambers in stations RE3/1 and RE4/1. These chambers (RE3/1 and RE4/1) would provide precise timing (i.e. hasTime()=true) and Y coordinate (hasY()=true).

So, both the systems would co-exist (one with Y coordinate and one without Y coordinate), both with precise time.

@slava77
Copy link
Contributor

slava77 commented May 29, 2017 via email

@perrotta
Copy link
Contributor

Thank you all for the explanations. Then, if I understand it correctly, also having clusters made of several digis with identical time/y is also an expected feature of the system: correct?

@jhgoh , just to know about your plans: do you plan to add in quadrature some fixed uncertainty, which is basically the suggestion made in #18955 (comment) in order to have something more physical for the recHit uncertainties? If so, we'll wait for it before approving; otherwise, well keep the fix as such and sign this one, even if the meaning of these uncertainties remain rather limited, as also Slava said. Please clarify.

@jhgoh
Copy link
Contributor Author

jhgoh commented May 29, 2017

I would like to keep this codes even though the quantity has limited meaning. If we introduce fixed numbers here, maybe we have to tune them sometime. But it is a free parameter of simulation until we finalize actual hardware performance.

@perrotta
Copy link
Contributor

+1
Bug fix that prevents sqrt of negative numbers in case of rund-offs.
Given #18955 (comment) , #18955 (comment) (and following answers in this thread)

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @smuzaffar

@davidlange6
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit c55feb9 into cms-sw:master May 31, 2017
@jhgoh jhgoh deleted the RPCTimeFix branch June 1, 2017 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants