
QC-1178 window length should be infinite if moving windows are not used #2302

Merged
Barthelemy merged 1 commit into AliceO2Group:master from knopers8:no-publish-async
May 27, 2024

@knopers8
Collaborator

This is an optimization to avoid having QC tasks on the grid publish objects more often than needed: when they don't use moving windows, it's enough to publish them at end of stream.
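
For illustration only, the idea in this description can be sketched as follows. The helper `windowLengthMs` and the constant `kInfiniteWindowMs` are hypothetical names that do not come from the QualityControl code base; the sketch just assumes that an "infinite" window can be represented by the largest value of the integer type.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical sentinel: the largest representable value stands in for "infinite".
constexpr std::uint64_t kInfiniteWindowMs = std::numeric_limits<std::uint64_t>::max();

// Hypothetical helper, not the QualityControl API: choose the window length
// in milliseconds from the configured cycle duration in seconds.
constexpr std::uint64_t windowLengthMs(bool usesMovingWindows, std::uint64_t cycleDurationSeconds)
{
  // Without moving windows there is no consumer of intermediate objects,
  // so a single publication at end of stream is enough.
  return usesMovingWindows ? cycleDurationSeconds * 1000 : kInfiniteWindowMs;
}
```

Under these assumptions, a task with a 300 s cycle and moving windows gets a 300000 ms window, while the same task without moving windows never hits a window boundary before end of stream.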

@Barthelemy Barthelemy merged commit 0d9ceee into AliceO2Group:master May 27, 2024

// setup timekeeping
mDeploymentMode = DefaultsHelpers::deploymentMode();
mTimekeeper = TimekeeperFactory::create(mDeploymentMode, mTaskConfig.cycleDurations.back().first * 1000);
Member

are you guaranteed that mTaskConfig.cycleDurations is not empty?

Collaborator Author

Thanks, something would have to go very wrong. The configuration goes through TaskRunnerFactory which does the input sanitization.
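
As a side note on the question above: calling `back()` on an empty `std::vector` is undefined behaviour, so if the upstream sanitization were ever bypassed, an explicit guard would turn that into a clear error. The sketch below is only an illustration. The function `lastCycleDurationMs` is hypothetical, and the pair layout of `CycleDurations` is assumed from the `.back().first` access in the snippet under review.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Assumed shape: a list of pairs whose first element is a cycle duration in seconds.
using CycleDurations = std::vector<std::pair<std::size_t, std::size_t>>;

// Hypothetical helper: returns the last configured cycle duration in milliseconds,
// failing loudly instead of invoking undefined behaviour on an empty list.
std::size_t lastCycleDurationMs(const CycleDurations& cycleDurations)
{
  if (cycleDurations.empty()) {
    throw std::invalid_argument("cycleDurations must not be empty");
  }
  return cycleDurations.back().first * 1000;
}
```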

@ktf
Member

ktf commented May 27, 2024

Can we have a release with this ASAP? I think it will drastically reduce the memory usage when running on the Grid, especially on slow nodes which lag behind in merging histograms.

@ktf
Member

ktf commented May 27, 2024

@chiarazampolli alternatively, can we cherry pick this in some async release?

@knopers8
Collaborator Author

Let me know your preference. A patch release will be quick to do, but maybe you want the commit to be in some async tag, in which case @chiarazampolli should know what to do.

@chiarazampolli added the async-2022-pp-apass7 (Request porting to async-2022-pp-apass7) and async-2024-pp-apass1 (Request porting to async-2024-pp-apass1) labels May 28, 2024
@chiarazampolli
Contributor

Hello,

I added the labels to have this ported to the correct tags. We can also do a test with the daily tomorrow, to see how this PR improves the situation.

Chiara

@ktf
Member

ktf commented May 28, 2024

I would start with the patch release.

knopers8 added a commit that referenced this pull request May 28, 2024
…ed (#2302)

@knopers8
Collaborator Author

alisw/alidist#5484

@knopers8 knopers8 deleted the no-publish-async branch May 28, 2024 07:58
@ktf
Member

ktf commented May 28, 2024

For the record, I also verified that having a large value for cycleDurationSeconds (see below) has the same effect and sends the histograms only at the end.

--- json_cache/20240408-183950-350871-31254--MCH_DIGITS-MCH_RECO-MCH_ERRORS-MCH_TRACKS-matchTOF-ITS-MFT-TPC-TOF-FT0-MID-EMC-PHS-ZDC-FDD-HMP-FV0-TRD-GLO_ITSTPC-GLO_MFTMCH-GLO_PRIMVTX-pidFT0-TOF.json	2024-04-11 16:01:37
+++ qc.json	2024-05-28 09:42:49
@@ -37,7 +37,7 @@
         "className": "o2::quality_control_modules::muonchambers::DigitsTask",
         "moduleName": "QcMuonChambers",
         "detectorName": "MCH",
-        "cycleDurationSeconds": "300",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -62,7 +62,7 @@
         "className": "o2::quality_control_modules::muonchambers::RofsTask",
         "moduleName": "QcMuonChambers",
         "detectorName": "MCH",
-        "cycleDurationSeconds": "300",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -78,7 +78,7 @@
         "className": "o2::quality_control_modules::muonchambers::PreclustersTask",
         "moduleName": "QcMuonChambers",
         "detectorName": "MCH",
-        "cycleDurationSeconds": "300",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -90,7 +90,7 @@
         "className": "o2::quality_control_modules::muonchambers::RofsTask",
         "moduleName": "QcMuonChambers",
         "detectorName": "MCH",
-        "cycleDurationSeconds": "300",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -106,7 +106,7 @@
         "className": "o2::quality_control_modules::muonchambers::ErrorTask",
         "moduleName": "QcMuonChambers",
         "detectorName": "MCH",
-        "cycleDurationSeconds": "600",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -119,7 +119,7 @@
         "className": "o2::quality_control_modules::muon::TracksTask",
         "moduleName": "QcMUONCommon",
         "detectorName": "MCH",
-        "cycleDurationSeconds": "180",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -146,7 +146,7 @@
         "className": "o2::quality_control_modules::tof::TOFMatchedTracks",
         "moduleName": "QcTOF",
         "detectorName": "TOF",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -181,7 +181,7 @@
         "className": "o2::quality_control_modules::its::ITSClusterTask",
         "moduleName": "QcITS",
         "detectorName": "ITS",
-        "cycleDurationSeconds": "180",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource_comment": "The other type of dataSource is \"direct\", see basic-no-sampling.json.",
         "dataSource": {
@@ -204,7 +204,7 @@
         "className": "o2::quality_control_modules::its::ITSTrackTask",
         "moduleName": "QcITS",
         "detectorName": "ITS",
-        "cycleDurationSeconds": "30",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource_comment": "The other type of dataSource is \"direct\", see basic-no-sampling.json.",
         "dataSource": {
@@ -292,7 +292,7 @@
         "className": "o2::quality_control_modules::tpc::Clusters",
         "moduleName": "QcTPC",
         "detectorName": "TPC",
-        "cycleDurationSeconds": "300",
+        "cycleDurationSeconds": "20000",
         "dataSource": {
           "type": "dataSamplingPolicy",
           "name": "tpc-clusters",
@@ -352,7 +352,7 @@
         "className": "o2::quality_control_modules::tpc::TrackClusters",
         "moduleName": "QcTPC",
         "detectorName": "TPC",
-        "cycleDurationSeconds": "300",
+        "cycleDurationSeconds": "20000",
         "dataSource": {
           "type": "direct",
           "query": "inputTracks:TPC/TRACKS/0;inputClusters:TPC/CLUSTERNATIVE;inputClusRefs:TPC/CLUSREFS/0"
@@ -405,7 +405,7 @@
         "className": "o2::quality_control_modules::tof::TaskDigits",
         "moduleName": "QcTOF",
         "detectorName": "TOF",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -437,7 +437,7 @@
         "className": "o2::quality_control_modules::ft0::RecPointsQcTask",
         "moduleName": "QcFT0",
         "detectorName": "FT0",
-        "cycleDurationSeconds": "600",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -451,7 +451,7 @@
         "className": "o2::quality_control_modules::mid::DigitsQcTask",
         "moduleName": "QcMID",
         "detectorName": "MID",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "dataSource": {
           "type": "direct",
           "query": "digits:MID/DATA;digits_rof:MID/DATAROF"
@@ -463,7 +463,7 @@
         "className": "o2::quality_control_modules::mid::ClustQcTask",
         "moduleName": "QcMID",
         "detectorName": "MID",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "dataSource": {
           "type": "direct",
           "query": "clusters:MID/TRACKCLUSTERS;clusterrofs:MID/TRCLUSROFS"
@@ -475,7 +475,7 @@
         "className": "o2::quality_control_modules::mid::TracksQcTask",
         "moduleName": "QcMID",
         "detectorName": "MID",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "dataSource": {
           "type": "direct",
           "query": "tracks:MID/TRACKS;trackrofs:MID/TRACKROFS"
@@ -487,7 +487,7 @@
         "className": "o2::quality_control_modules::emcal::CellTask",
         "moduleName": "QcEMCAL",
         "detectorName": "EMC",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -500,7 +500,7 @@
         "className": "o2::quality_control_modules::emcal::ClusterTask",
         "moduleName": "QcEMCAL",
         "detectorName": "EMC",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -537,7 +537,7 @@
         "className": "o2::quality_control_modules::emcal::BCTask",
         "moduleName": "QcEMCAL",
         "detectorName": "EMC",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -551,7 +551,7 @@
         "className": "o2::quality_control_modules::phos::ClusterQcTask",
         "moduleName": "QcPHOS",
         "detectorName": "PHS",
-        "cycleDurationSeconds": "100",
+        "cycleDurationSeconds": "20000",
         "dataSource": {
           "type": "direct",
           "query": "clusters:PHS/CLUSTERS/0;clustertr:PHS/CLUSTERTRIGREC/0"
@@ -564,7 +564,7 @@
         "className": "o2::quality_control_modules::zdc::ZDCRecDataTask",
         "moduleName": "QcZDC",
         "detectorName": "ZDC",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -593,7 +593,7 @@
         "className": "o2::quality_control_modules::fdd::RecPointsQcTask",
         "moduleName": "QcFDD",
         "detectorName": "FDD",
-        "cycleDurationSeconds": "600",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -607,7 +607,7 @@
         "className": "o2::quality_control_modules::hmpid::HmpidTaskClusters",
         "moduleName": "QcHMPID",
         "detectorName": "HMP",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -620,7 +620,7 @@
         "className": "o2::quality_control_modules::hmpid::HmpidTaskMatches",
         "moduleName": "QcHMPID",
         "detectorName": "HMP",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -633,7 +633,7 @@
         "className": "o2::quality_control_modules::fv0::DigitQcTask",
         "moduleName": "QcFV0",
         "detectorName": "FV0",
-        "cycleDurationSeconds": "600",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -653,7 +653,7 @@
         "className": "o2::quality_control_modules::trd::DigitsTask",
         "moduleName": "QcTRD",
         "detectorName": "TRD",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "dataSource": {
           "type": "direct",
           "query": "digits:TRD/DIGITS;triggers:TRD/TRKTRGRD;noiseMap:TRD/NOISEMAP/0?lifetime=condition&ccdb-path=TRD/Calib/NoiseMapMCM;chamberStatus:TRD/CHSTATUS/0?lifetime=condition&ccdb-path=TRD/Calib/HalfChamberStatusQC;fedChamberStatus:TRD/FCHSTATUS/0?lifetime=condition&ccdb-path=TRD/Calib/DCSDPsFedChamberStatus"
@@ -664,7 +664,7 @@
         "className": "o2::quality_control_modules::trd::TrackletsTask",
         "moduleName": "QcTRD",
         "detectorName": "TRD",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "dataSource": {
           "type": "direct",
           "query": "tracklets:TRD/TRACKLETS;triggers:TRD/TRKTRGRD;noiseMap:TRD/NOISEMAP/0?lifetime=condition&ccdb-path=TRD/Calib/NoiseMapMCM;chamberStatus:TRD/CHSTATUS/0?lifetime=condition&ccdb-path=TRD/Calib/HalfChamberStatusQC;fedChamberStatus:TRD/FCHSTATUS/0?lifetime=condition&ccdb-path=TRD/Calib/DCSDPsFedChamberStatus"
@@ -675,7 +675,7 @@
         "className": "o2::quality_control_modules::trd::PulseHeightTrackMatch",
         "moduleName": "QcTRD",
         "detectorName": "TRD",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "dataSource": {
           "type": "direct",
           "query": "phValues:TRD/PULSEHEIGHT"
@@ -686,7 +686,7 @@
         "className": "o2::quality_control_modules::trd::TrackingTask",
         "moduleName": "QcTRD",
         "detectorName": "TRD",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "dataSource": {
           "type": "direct",
           "query": "trackITSTPCTRD:TRD/MATCH_ITSTPC;trigITSTPCTRD:TRD/TRGREC_ITSTPC;trackTPCTRD:TRD/MATCH_TPC;trigTPCTRD:TRD/TRGREC_TPC"
@@ -702,7 +702,7 @@
         "className": "o2::quality_control_modules::glo::ITSTPCMatchingTask",
         "moduleName": "QcGLO",
         "detectorName": "GLO",
-        "cycleDurationSeconds": "3600",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -744,7 +744,7 @@
         "moduleName": "QcMUONCommon",
         "detectorName": "GLO",
         "taskName": "MUONTracks",
-        "cycleDurationSeconds": "300",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -771,7 +771,7 @@
         "className": "o2::quality_control_modules::glo::VertexingQcTask",
         "moduleName": "QcGLO",
         "detectorName": "GLO",
-        "cycleDurationSeconds": "60",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",
@@ -787,7 +787,7 @@
         "className": "o2::quality_control_modules::pid::TaskFT0TOF",
         "moduleName": "QcTOF",
         "detectorName": "TOF",
-        "cycleDurationSeconds": "10",
+        "cycleDurationSeconds": "20000",
         "maxNumberCycles": "-1",
         "dataSource": {
           "type": "direct",

@ktf
Member

ktf commented May 28, 2024

Thinking about it, there is an obvious race condition when merging histograms in async mode. The more histograms are in flight, the more time it takes to merge them, and therefore they accumulate in shared memory if they are published at a fixed rate and never dropped.

@knopers8
Collaborator Author

Why "never dropped"?

@ktf
Member

ktf commented May 29, 2024

Never dropped in the sense that the merger tries to merge every single histogram which is thrown at it. So if histograms are produced at a rate higher than the merger's ability to merge them, memory usage diverges (and eventually shared memory becomes full).
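
The divergence can be seen in a toy model. This is not QC or merger code: `backlogAfter` is a hypothetical function, and the rates are arbitrary "objects per publication cycle" chosen for illustration. It only assumes a queue that never drops anything.

```cpp
#include <algorithm>
#include <cstdint>

// Toy model: number of objects still waiting to be merged after `cycles`
// publication cycles, when nothing is ever dropped from the queue.
std::int64_t backlogAfter(int cycles, std::int64_t producedPerCycle, std::int64_t mergedPerCycle)
{
  std::int64_t backlog = 0;
  for (int i = 0; i < cycles; ++i) {
    // The queue grows by what is produced and shrinks by what the merger
    // manages to process; it cannot go below zero.
    backlog = std::max<std::int64_t>(0, backlog + producedPerCycle - mergedPerCycle);
  }
  return backlog;
}
```

In this model the backlog stays at zero whenever the merger keeps up, and grows linearly with the number of cycles as soon as production exceeds the merging rate. Publishing only once at end of stream removes the repeated production term.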

@ktf
Member

ktf commented May 29, 2024

In the particular case we were looking at, 800 MB of histograms were produced and had to be merged for every timeframe.

benedikt-voelkel pushed a commit that referenced this pull request May 29, 2024
…ed (#2302)


(cherry picked from commit 0d9ceee)
@benedikt-voelkel removed the async-2022-pp-apass7 (Request porting to async-2022-pp-apass7) and async-2024-pp-apass1 (Request porting to async-2024-pp-apass1) labels Jun 17, 2024