Update OpenBLAS build options #5091

hqucms · 2019-07-12T22:05:02Z

roll back to DYNAMIC_ARCH=0
set TARGET=CORE2 to be consistent with what's generally used in CMSSW

cmsbuild · 2019-07-12T22:05:29Z

A new Pull Request was created by @hqucms (Huilin Qu) for branch IB/CMSSW_11_0_X/gcc700.

@cmsbuild, @smuzaffar, @gudrutis, @mrodozov can you please review it and eventually sign? Thanks.
cms-bot commands are listed here

slava77 · 2019-07-12T22:21:33Z

OpenBLAS.spec

@@ -10,12 +10,12 @@ Source: https://github.com/xianyi/OpenBLAS/archive/v%{realversion}.tar.gz

 # PRESCOTT is a generic x86-64 target https://github.com/xianyi/OpenBLAS/issues/685 
 %ifarch x86_64
-make FC=gfortran BINARY=64 TARGET=PENRYN NUM_THREADS=256 DYNAMIC_ARCH=1
+make FC=gfortran BINARY=64 TARGET=CORE2 NUM_THREADS=256 DYNAMIC_ARCH=0


what does the NUM_THREADS actually correspond to in reality at run time?
Was the speedup seen in tests of deepAK8 real or a simple parallelization on a quiet node?

@slava77 What affects the number of threads used at runtime is the OPENBLAS_NUM_THREADS environment variable which is set to 1 in https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_11_0_X/gcc700/OpenBLAS-toolfile.spec#L19. NUM_THREADS is the maximum allowed number of threads and is used for resource allocation etc. I explicitly checked the CPU usage when running the tests and it was not going beyond 100%:)

And gslblas is known to be not a very high-performance BLAS routine, so the speed-up we saw after switching to OpenBLAS should not be surprising at all.

thanks for clarifications

slava77 · 2019-07-13T00:43:31Z

@cmsbuild please test

cmsbuild · 2019-07-13T00:43:56Z

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/1459/console Started: 2019/07/13 02:45

cmsbuild · 2019-07-13T14:13:04Z

-1

Tested at: 2b925b4

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd8d21/1459/summary.html

I found follow errors while testing this PR

Failed tests: UnitTests

Unit Tests:

I found errors in the following unit tests:

---> test ExpressionEvaluatorUnitTest had ERRORS

cmsbuild · 2019-07-13T14:13:06Z

Comparison job queued.

cmsbuild · 2019-07-13T16:34:49Z

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd8d21/1459/summary.html

Comparison Summary:

No significant changes to the logs found
Reco comparison results: 4 differences found in the comparisons
DQMHistoTests: Total files compared: 32
DQMHistoTests: Total histograms compared: 3081858
DQMHistoTests: Total failures: 2
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3081534
DQMHistoTests: Total skipped: 322
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
Checked 133 log files, 14 edm output root files, 32 DQM output files

slava77 · 2019-07-15T17:58:50Z

It looks like the tests passed OK.

The unit test failure seems unrelated.

There is a difference in 136.788 step3 logs, which also appears to be unrelated:
this PR test has

%MSG-e ExcessiveTime:  MuonShowerInformationProducer:muonShowerInformation  13-Jul-2019 14:12:51 CEST Run: 297557 Event: 190047277
ExcessiveTime: Module used 1147.06 seconds of time which exceeds the error threshold configured in the Timing Service of 600 seconds.

We've seen something like this non-reproducible showing up in some other external tests.

mrodozov · 2019-07-16T09:55:56Z

please test
lets try again to be sure its consistent

cmsbuild · 2019-07-16T09:56:19Z

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/1483/console Started: 2019/07/16 11:58

fabiocos · 2019-07-16T12:45:11Z

The timing report above does not seem consistent with the baseline IB result

<PerformanceReport>
<PerformanceSummary Metric="Timing">
<Metric Name="AvgEventTime" Value="22.9378"/>
<Metric Name="EventSetup Get" Value="149.031"/>
<Metric Name="EventSetup Lock" Value="0.000856161"/>
<Metric Name="EventThroughput" Value="0.041906"/>
<Metric Name="MaxEventTime" Value="83.1715"/>
<Metric Name="MinEventTime" Value="0.739687"/>
<Metric Name="NumberOfStreams" Value="1"/>
<Metric Name="NumberOfThreads" Value="1"/>
<Metric Name="TotalInitCPU" Value="229.427"/>
<Metric Name="TotalInitTime" Value="352.472"/>
<Metric Name="TotalJobCPU" Value="2043.26"/>
<Metric Name="TotalJobTime" Value="2759.96"/>
<Metric Name="TotalLoopCPU" Value="1798.22"/>
<Metric Name="TotalLoopTime" Value="2386.29"/>
</PerformanceSummary>
</PerformanceReport>

although the conditions are not identically the same (baseline is multithreaded, the PR test is not)

slava77 · 2019-07-16T23:01:29Z

Reco comparison results: 3 differences found in the comparisons

it looks like a different workflow now has a similar warning. In 136.85

%MSG-e ExcessiveTime:   CandSecondaryVertexProducer:pfInclusiveSecondaryVertexFinderTagInfosAK8PFPuppiSoftDropSubjets  16-Jul-2019 19:14:14 CEST Run: 315489 Event: 47618378
ExcessiveTime: Module used 654.235 seconds of time which exceeds the error threshold configured in the Timing Service of 600 seconds.

still, unrelated to the openblas use case.

fabiocos · 2019-07-17T07:56:04Z

We are not observing these excesses in the baseline tests as far as I can see in recent IBs, but is it clear that if the happens randomly they are difficult to track, In any case the only TimeOut failure we have recently had was almost systematic, understood and solved. @mrodozov do you have a way to check for instance the average time from recent IBs? This is one of the use cases where monitoring plots turn to be useful...

mrodozov · 2019-07-17T10:09:48Z

Hi Fabio, sry for the delay but I had to check what could be used and what we have,
we can select given workflow and step field (avg time max time etc) and plot it against time (i.e. IBs)
Thats the table of the data

https://es-cmssdt.cern.ch/kibana_rw/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now%2Fw,mode:quick,to:now%2Fw))&_a=(columns:!(package,release,step,workflow,TotalJobTime,MaxEventTime,MinEventTime,AvgEventTime),index:AWvhEvG8LPbZ6C4ywF-2,interval:auto,query:(match_all:()),sort:!('@timestamp',desc))
which can be ordered by field so one can see "for that workflow that step (say 136.85 ) show me the values ordered by time"

https://es-cmssdt.cern.ch/kibana_rw/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now%2Fw,mode:quick,to:now%2Fw))&_a=(columns:!(package,release,step,workflow,TotalJobTime,MaxEventTime,MinEventTime,AvgEventTime),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:AWvhEvG8LPbZ6C4ywF-2,key:workflow,negate:!f,type:phrase,value:'136.85'),query:(match:(workflow:(query:'136.85',type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:AWvhEvG8LPbZ6C4ywF-2,key:step.keyword,negate:!f,type:phrase,value:step2),query:(match:(step.keyword:(query:step2,type:phrase))))),index:AWvhEvG8LPbZ6C4ywF-2,interval:auto,query:(match_all:()),sort:!(MaxEventTime,desc))

@gudrutis is pushing it to have things ready to be used on Grafana (has better plotting options than Kibana)

fabiocos · 2019-07-17T14:35:47Z

+1

as agreed with @slava77 , I will open an issue to keep track of the long time issue
@mrodozov I would like to merge it tonight

mrodozov · 2019-07-17T14:55:56Z

+externals

cmsbuild · 2019-07-17T14:56:25Z

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_11_0_X/gcc700 IBs (tests are also fine). This pull request will be automatically merged.

mrodozov · 2019-07-18T09:51:52Z

please test
lets try as agreed for consistency
I also want to check the bot (it was hanging)

cmsbuild · 2019-07-18T09:52:37Z

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/1518/console Started: 2019/07/18 11:54

cmsbuild · 2019-07-18T15:20:43Z

+1
Tested at: 2b925b4
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd8d21/1518/summary.html

cmsbuild · 2019-07-18T15:20:47Z

Comparison job queued.

cmsbuild · 2019-07-18T17:42:29Z

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd8d21/1518/summary.html

Comparison Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 32
DQMHistoTests: Total histograms compared: 2622600
DQMHistoTests: Total failures: 0
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 2622298
DQMHistoTests: Total skipped: 302
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
Checked 133 log files, 14 edm output root files, 32 DQM output files

Roll back to DYNAMIC_ARCH=0 and set TARGET=CORE2

2b925b4

cmsbuild added comparison-pending externals-pending orp-pending pending-signatures tests-pending labels Jul 12, 2019

hqucms mentioned this pull request Jul 12, 2019

deepAK8 reproducibility problems after switching to openblas cms-sw/cmssw#27504

Closed

slava77 reviewed Jul 12, 2019

View reviewed changes

cmsbuild added tests-started and removed tests-pending labels Jul 13, 2019

cmsbuild added tests-rejected and removed tests-started labels Jul 13, 2019

cmsbuild added comparison-available and removed comparison-pending labels Jul 13, 2019

cmsbuild added comparison-pending tests-pending and removed comparison-available tests-rejected labels Jul 16, 2019

cmsbuild added tests-started and removed tests-pending labels Jul 16, 2019

cmsbuild added the comparison-available label Jul 16, 2019

cmsbuild added orp-approved and removed orp-pending labels Jul 17, 2019

fabiocos mentioned this pull request Jul 17, 2019

Modules taking long time during test of cmsdist with big rebuilds cms-sw/cms-bot#1169

Closed

cmsbuild added externals-approved and removed externals-pending pending-signatures labels Jul 17, 2019

cmsbuild added the fully-signed label Jul 17, 2019

cmsbuild merged commit 1c0e4c8 into cms-sw:IB/CMSSW_11_0_X/gcc700 Jul 17, 2019

hqucms deleted the patch-2 branch July 18, 2019 03:22

cmsbuild removed comparison-available tests-approved labels Jul 18, 2019

cmsbuild removed the tests-pending label Jul 18, 2019

cmsbuild added tests-approved and removed tests-started labels Jul 18, 2019

cmsbuild added comparison-available and removed comparison-pending labels Jul 18, 2019

This was referenced Oct 21, 2019

Backport OpenBLAS fixes [10_2_X] #5289

Merged

Backport OpenBLAS fixes [10_6_X] #5291

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update OpenBLAS build options #5091

Update OpenBLAS build options #5091

hqucms commented Jul 12, 2019

cmsbuild commented Jul 12, 2019

slava77 Jul 12, 2019

hqucms Jul 12, 2019

hqucms Jul 12, 2019

slava77 Jul 13, 2019

slava77 commented Jul 13, 2019

cmsbuild commented Jul 13, 2019 •

edited

cmsbuild commented Jul 13, 2019

cmsbuild commented Jul 13, 2019

cmsbuild commented Jul 13, 2019

slava77 commented Jul 15, 2019

mrodozov commented Jul 16, 2019

cmsbuild commented Jul 16, 2019 •

edited

fabiocos commented Jul 16, 2019 •

edited

slava77 commented Jul 16, 2019

fabiocos commented Jul 17, 2019

mrodozov commented Jul 17, 2019 •

edited

fabiocos commented Jul 17, 2019

mrodozov commented Jul 17, 2019

cmsbuild commented Jul 17, 2019

mrodozov commented Jul 18, 2019

cmsbuild commented Jul 18, 2019 •

edited

cmsbuild commented Jul 18, 2019

cmsbuild commented Jul 18, 2019

cmsbuild commented Jul 18, 2019

Update OpenBLAS build options #5091

Update OpenBLAS build options #5091

Conversation

hqucms commented Jul 12, 2019

cmsbuild commented Jul 12, 2019

slava77 Jul 12, 2019

Choose a reason for hiding this comment

hqucms Jul 12, 2019

Choose a reason for hiding this comment

hqucms Jul 12, 2019

Choose a reason for hiding this comment

slava77 Jul 13, 2019

Choose a reason for hiding this comment

slava77 commented Jul 13, 2019

cmsbuild commented Jul 13, 2019 • edited

cmsbuild commented Jul 13, 2019

cmsbuild commented Jul 13, 2019

cmsbuild commented Jul 13, 2019

slava77 commented Jul 15, 2019

mrodozov commented Jul 16, 2019

cmsbuild commented Jul 16, 2019 • edited

fabiocos commented Jul 16, 2019 • edited

slava77 commented Jul 16, 2019

fabiocos commented Jul 17, 2019

mrodozov commented Jul 17, 2019 • edited

fabiocos commented Jul 17, 2019

mrodozov commented Jul 17, 2019

cmsbuild commented Jul 17, 2019

mrodozov commented Jul 18, 2019

cmsbuild commented Jul 18, 2019 • edited

cmsbuild commented Jul 18, 2019

cmsbuild commented Jul 18, 2019

cmsbuild commented Jul 18, 2019

cmsbuild commented Jul 13, 2019 •

edited

cmsbuild commented Jul 16, 2019 •

edited

fabiocos commented Jul 16, 2019 •

edited

mrodozov commented Jul 17, 2019 •

edited

cmsbuild commented Jul 18, 2019 •

edited