Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update OpenBLAS build options #5091

Merged
merged 1 commit into from Jul 17, 2019
Merged

Conversation

hqucms
Copy link
Contributor

@hqucms hqucms commented Jul 12, 2019

To address cms-sw/cmssw#27504.

  • roll back to DYNAMIC_ARCH=0
  • set TARGET=CORE2 to be consistent with what's generally used in CMSSW

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @hqucms (Huilin Qu) for branch IB/CMSSW_11_0_X/gcc700.

@cmsbuild, @smuzaffar, @gudrutis, @mrodozov can you please review it and eventually sign? Thanks.
cms-bot commands are listed here

@@ -10,12 +10,12 @@ Source: https://github.com/xianyi/OpenBLAS/archive/v%{realversion}.tar.gz

# PRESCOTT is a generic x86-64 target https://github.com/xianyi/OpenBLAS/issues/685
%ifarch x86_64
make FC=gfortran BINARY=64 TARGET=PENRYN NUM_THREADS=256 DYNAMIC_ARCH=1
make FC=gfortran BINARY=64 TARGET=CORE2 NUM_THREADS=256 DYNAMIC_ARCH=0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does the NUM_THREADS actually correspond to in reality at run time?
Was the speedup seen in tests of deepAK8 real or a simple parallelization on a quiet node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slava77 What affects the number of threads used at runtime is the OPENBLAS_NUM_THREADS environment variable which is set to 1 in https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_11_0_X/gcc700/OpenBLAS-toolfile.spec#L19. NUM_THREADS is the maximum allowed number of threads and is used for resource allocation etc. I explicitly checked the CPU usage when running the tests and it was not going beyond 100%:)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And gslblas is known to be not a very high-performance BLAS routine, so the speed-up we saw after switching to OpenBLAS should not be surprising at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for clarifications

@slava77
Copy link
Contributor

slava77 commented Jul 13, 2019

@cmsbuild please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 13, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/1459/console Started: 2019/07/13 02:45

@cmsbuild
Copy link
Contributor

-1

Tested at: 2b925b4

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd8d21/1459/summary.html

I found follow errors while testing this PR

Failed tests: UnitTests

  • Unit Tests:

I found errors in the following unit tests:

---> test ExpressionEvaluatorUnitTest had ERRORS

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd8d21/1459/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 32
  • DQMHistoTests: Total histograms compared: 3081858
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3081534
  • DQMHistoTests: Total skipped: 322
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
  • Checked 133 log files, 14 edm output root files, 32 DQM output files

@slava77
Copy link
Contributor

slava77 commented Jul 15, 2019

It looks like the tests passed OK.

The unit test failure seems unrelated.

There is a difference in 136.788 step3 logs, which also appears to be unrelated:
this PR test has

%MSG-e ExcessiveTime:  MuonShowerInformationProducer:muonShowerInformation  13-Jul-2019 14:12:51 CEST Run: 297557 Event: 190047277
ExcessiveTime: Module used 1147.06 seconds of time which exceeds the error threshold configured in the Timing Service of 600 seconds.

We've seen something like this non-reproducible showing up in some other external tests.

@mrodozov
Copy link
Contributor

please test
lets try again to be sure its consistent

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 16, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/1483/console Started: 2019/07/16 11:58

@fabiocos
Copy link
Contributor

fabiocos commented Jul 16, 2019

The timing report above does not seem consistent with the baseline IB result

<PerformanceReport>
<PerformanceSummary Metric="Timing">
<Metric Name="AvgEventTime" Value="22.9378"/>
<Metric Name="EventSetup Get" Value="149.031"/>
<Metric Name="EventSetup Lock" Value="0.000856161"/>
<Metric Name="EventThroughput" Value="0.041906"/>
<Metric Name="MaxEventTime" Value="83.1715"/>
<Metric Name="MinEventTime" Value="0.739687"/>
<Metric Name="NumberOfStreams" Value="1"/>
<Metric Name="NumberOfThreads" Value="1"/>
<Metric Name="TotalInitCPU" Value="229.427"/>
<Metric Name="TotalInitTime" Value="352.472"/>
<Metric Name="TotalJobCPU" Value="2043.26"/>
<Metric Name="TotalJobTime" Value="2759.96"/>
<Metric Name="TotalLoopCPU" Value="1798.22"/>
<Metric Name="TotalLoopTime" Value="2386.29"/>
</PerformanceSummary>
</PerformanceReport>

although the conditions are not identically the same (baseline is multithreaded, the PR test is not)

@slava77
Copy link
Contributor

slava77 commented Jul 16, 2019

Reco comparison results: 3 differences found in the comparisons

it looks like a different workflow now has a similar warning. In 136.85

%MSG-e ExcessiveTime:   CandSecondaryVertexProducer:pfInclusiveSecondaryVertexFinderTagInfosAK8PFPuppiSoftDropSubjets  16-Jul-2019 19:14:14 CEST Run: 315489 Event: 47618378
ExcessiveTime: Module used 654.235 seconds of time which exceeds the error threshold configured in the Timing Service of 600 seconds.

still, unrelated to the openblas use case.

@fabiocos
Copy link
Contributor

We are not observing these excesses in the baseline tests as far as I can see in recent IBs, but is it clear that if the happens randomly they are difficult to track, In any case the only TimeOut failure we have recently had was almost systematic, understood and solved. @mrodozov do you have a way to check for instance the average time from recent IBs? This is one of the use cases where monitoring plots turn to be useful...

@mrodozov
Copy link
Contributor

mrodozov commented Jul 17, 2019

Hi Fabio, sry for the delay but I had to check what could be used and what we have,
we can select given workflow and step field (avg time max time etc) and plot it against time (i.e. IBs)
Thats the table of the data

https://es-cmssdt.cern.ch/kibana_rw/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now%2Fw,mode:quick,to:now%2Fw))&_a=(columns:!(package,release,step,workflow,TotalJobTime,MaxEventTime,MinEventTime,AvgEventTime),index:AWvhEvG8LPbZ6C4ywF-2,interval:auto,query:(match_all:()),sort:!('@timestamp',desc))
which can be ordered by field so one can see "for that workflow that step (say 136.85 ) show me the values ordered by time"

https://es-cmssdt.cern.ch/kibana_rw/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now%2Fw,mode:quick,to:now%2Fw))&_a=(columns:!(package,release,step,workflow,TotalJobTime,MaxEventTime,MinEventTime,AvgEventTime),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:AWvhEvG8LPbZ6C4ywF-2,key:workflow,negate:!f,type:phrase,value:'136.85'),query:(match:(workflow:(query:'136.85',type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:AWvhEvG8LPbZ6C4ywF-2,key:step.keyword,negate:!f,type:phrase,value:step2),query:(match:(step.keyword:(query:step2,type:phrase))))),index:AWvhEvG8LPbZ6C4ywF-2,interval:auto,query:(match_all:()),sort:!(MaxEventTime,desc))

@gudrutis is pushing it to have things ready to be used on Grafana (has better plotting options than Kibana)

@fabiocos
Copy link
Contributor

+1

as agreed with @slava77 , I will open an issue to keep track of the long time issue
@mrodozov I would like to merge it tonight

@mrodozov
Copy link
Contributor

+externals

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_11_0_X/gcc700 IBs (tests are also fine). This pull request will be automatically merged.

@cmsbuild cmsbuild merged commit 1c0e4c8 into cms-sw:IB/CMSSW_11_0_X/gcc700 Jul 17, 2019
@hqucms hqucms deleted the patch-2 branch July 18, 2019 03:22
@mrodozov
Copy link
Contributor

please test
lets try as agreed for consistency
I also want to check the bot (it was hanging)

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 18, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/1518/console Started: 2019/07/18 11:54

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd8d21/1518/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 32
  • DQMHistoTests: Total histograms compared: 2622600
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2622298
  • DQMHistoTests: Total skipped: 302
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
  • Checked 133 log files, 14 edm output root files, 32 DQM output files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants