QC-1178 window length should be infinite if moving windows are not used #2302
This is an optimization to avoid having QC tasks on the Grid publish objects more often than needed: when they don't use moving windows, it is enough to publish them at the end of stream.
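As a rough illustration of the idea in the description, the window length could be chosen as in the sketch below. The function name `windowLengthMs`, the list of moving-window object names, and the use of the maximum `uint64_t` value to stand for "infinite" are all assumptions made for this example; the PR itself may represent this differently.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical helper, not taken from the PR.
// movingWindows: names of objects for which moving windows are requested (assumed representation).
// cycleDurationSeconds: the configured cycle duration of the task.
std::uint64_t windowLengthMs(const std::vector<std::string>& movingWindows,
                             std::uint64_t cycleDurationSeconds)
{
  if (movingWindows.empty()) {
    // No moving windows: treat the window as infinite, so that objects
    // are only published once, at the end of stream.
    return std::numeric_limits<std::uint64_t>::max();
  }
  // Moving windows in use: keep the configured cycle duration, in milliseconds.
  return cycleDurationSeconds * 1000;
}
```

With an empty list the function returns the largest representable value; with at least one moving window it returns the cycle duration converted to milliseconds.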
// setup timekeeping
mDeploymentMode = DefaultsHelpers::deploymentMode();
mTimekeeper = TimekeeperFactory::create(mDeploymentMode, mTaskConfig.cycleDurations.back().first * 1000);
Are you guaranteed that mTaskConfig.cycleDurations is not empty?
Thanks. Something would have to go very wrong for it to be empty: the configuration goes through TaskRunnerFactory, which does the input sanitization.
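For readers wondering what a defensive version of that access could look like, here is a minimal sketch. Calling `back()` on an empty `std::vector` is undefined behavior, so the helper checks first. The type alias `CycleDurations`, the helper name `lastCycleDurationMs`, and the assumption that the first element of each pair is a duration in seconds are illustrative only; this is not what the PR does, since it relies on the sanitization in TaskRunnerFactory.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the type of mTaskConfig.cycleDurations.
// Assumption: .first is a cycle duration in seconds; .second is not interpreted here.
using CycleDurations = std::vector<std::pair<std::size_t, std::size_t>>;

// Returns the last configured cycle duration in milliseconds,
// or std::nullopt if the list is empty (back() would be undefined behavior).
std::optional<std::uint64_t> lastCycleDurationMs(const CycleDurations& durations)
{
  if (durations.empty()) {
    return std::nullopt;
  }
  return static_cast<std::uint64_t>(durations.back().first) * 1000;
}
```

A caller could then fail loudly on `std::nullopt` instead of reading from an empty vector.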
|
Can we have a release with this ASAP? I think it will drastically reduce the memory usage when running on the Grid, especially on slow nodes which lag behind in merging histograms. |
|
@chiarazampolli alternatively, can we cherry-pick this into some async release? |
|
Let me know your preference. A patch release will be quick to do, but maybe you want the commit to be in some async tag, in which case @chiarazampolli should know what to do. |
|
Hello, I added the labels to have this ported to the correct tags. We can also do a test with the daily tomorrow, to see how this PR improves the situation. Chiara |
|
I would start with the patch release. |
…ed (#2302) This is an optimization to avoid having QC tasks on the Grid publish objects more often than needed: when they don't use moving windows, it is enough to publish them at the end of stream.
|
For the record, I also verified that having a large value for cycleDurationSeconds (see below) has the same effect and sends the histograms only at the end.
--- json_cache/20240408-183950-350871-31254--MCH_DIGITS-MCH_RECO-MCH_ERRORS-MCH_TRACKS-matchTOF-ITS-MFT-TPC-TOF-FT0-MID-EMC-PHS-ZDC-FDD-HMP-FV0-TRD-GLO_ITSTPC-GLO_MFTMCH-GLO_PRIMVTX-pidFT0-TOF.json 2024-04-11 16:01:37
+++ qc.json 2024-05-28 09:42:49
@@ -37,7 +37,7 @@
"className": "o2::quality_control_modules::muonchambers::DigitsTask",
"moduleName": "QcMuonChambers",
"detectorName": "MCH",
- "cycleDurationSeconds": "300",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -62,7 +62,7 @@
"className": "o2::quality_control_modules::muonchambers::RofsTask",
"moduleName": "QcMuonChambers",
"detectorName": "MCH",
- "cycleDurationSeconds": "300",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -78,7 +78,7 @@
"className": "o2::quality_control_modules::muonchambers::PreclustersTask",
"moduleName": "QcMuonChambers",
"detectorName": "MCH",
- "cycleDurationSeconds": "300",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -90,7 +90,7 @@
"className": "o2::quality_control_modules::muonchambers::RofsTask",
"moduleName": "QcMuonChambers",
"detectorName": "MCH",
- "cycleDurationSeconds": "300",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -106,7 +106,7 @@
"className": "o2::quality_control_modules::muonchambers::ErrorTask",
"moduleName": "QcMuonChambers",
"detectorName": "MCH",
- "cycleDurationSeconds": "600",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -119,7 +119,7 @@
"className": "o2::quality_control_modules::muon::TracksTask",
"moduleName": "QcMUONCommon",
"detectorName": "MCH",
- "cycleDurationSeconds": "180",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -146,7 +146,7 @@
"className": "o2::quality_control_modules::tof::TOFMatchedTracks",
"moduleName": "QcTOF",
"detectorName": "TOF",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -181,7 +181,7 @@
"className": "o2::quality_control_modules::its::ITSClusterTask",
"moduleName": "QcITS",
"detectorName": "ITS",
- "cycleDurationSeconds": "180",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource_comment": "The other type of dataSource is \"direct\", see basic-no-sampling.json.",
"dataSource": {
@@ -204,7 +204,7 @@
"className": "o2::quality_control_modules::its::ITSTrackTask",
"moduleName": "QcITS",
"detectorName": "ITS",
- "cycleDurationSeconds": "30",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource_comment": "The other type of dataSource is \"direct\", see basic-no-sampling.json.",
"dataSource": {
@@ -292,7 +292,7 @@
"className": "o2::quality_control_modules::tpc::Clusters",
"moduleName": "QcTPC",
"detectorName": "TPC",
- "cycleDurationSeconds": "300",
+ "cycleDurationSeconds": "20000",
"dataSource": {
"type": "dataSamplingPolicy",
"name": "tpc-clusters",
@@ -352,7 +352,7 @@
"className": "o2::quality_control_modules::tpc::TrackClusters",
"moduleName": "QcTPC",
"detectorName": "TPC",
- "cycleDurationSeconds": "300",
+ "cycleDurationSeconds": "20000",
"dataSource": {
"type": "direct",
"query": "inputTracks:TPC/TRACKS/0;inputClusters:TPC/CLUSTERNATIVE;inputClusRefs:TPC/CLUSREFS/0"
@@ -405,7 +405,7 @@
"className": "o2::quality_control_modules::tof::TaskDigits",
"moduleName": "QcTOF",
"detectorName": "TOF",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -437,7 +437,7 @@
"className": "o2::quality_control_modules::ft0::RecPointsQcTask",
"moduleName": "QcFT0",
"detectorName": "FT0",
- "cycleDurationSeconds": "600",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -451,7 +451,7 @@
"className": "o2::quality_control_modules::mid::DigitsQcTask",
"moduleName": "QcMID",
"detectorName": "MID",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"dataSource": {
"type": "direct",
"query": "digits:MID/DATA;digits_rof:MID/DATAROF"
@@ -463,7 +463,7 @@
"className": "o2::quality_control_modules::mid::ClustQcTask",
"moduleName": "QcMID",
"detectorName": "MID",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"dataSource": {
"type": "direct",
"query": "clusters:MID/TRACKCLUSTERS;clusterrofs:MID/TRCLUSROFS"
@@ -475,7 +475,7 @@
"className": "o2::quality_control_modules::mid::TracksQcTask",
"moduleName": "QcMID",
"detectorName": "MID",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"dataSource": {
"type": "direct",
"query": "tracks:MID/TRACKS;trackrofs:MID/TRACKROFS"
@@ -487,7 +487,7 @@
"className": "o2::quality_control_modules::emcal::CellTask",
"moduleName": "QcEMCAL",
"detectorName": "EMC",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -500,7 +500,7 @@
"className": "o2::quality_control_modules::emcal::ClusterTask",
"moduleName": "QcEMCAL",
"detectorName": "EMC",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -537,7 +537,7 @@
"className": "o2::quality_control_modules::emcal::BCTask",
"moduleName": "QcEMCAL",
"detectorName": "EMC",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -551,7 +551,7 @@
"className": "o2::quality_control_modules::phos::ClusterQcTask",
"moduleName": "QcPHOS",
"detectorName": "PHS",
- "cycleDurationSeconds": "100",
+ "cycleDurationSeconds": "20000",
"dataSource": {
"type": "direct",
"query": "clusters:PHS/CLUSTERS/0;clustertr:PHS/CLUSTERTRIGREC/0"
@@ -564,7 +564,7 @@
"className": "o2::quality_control_modules::zdc::ZDCRecDataTask",
"moduleName": "QcZDC",
"detectorName": "ZDC",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -593,7 +593,7 @@
"className": "o2::quality_control_modules::fdd::RecPointsQcTask",
"moduleName": "QcFDD",
"detectorName": "FDD",
- "cycleDurationSeconds": "600",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -607,7 +607,7 @@
"className": "o2::quality_control_modules::hmpid::HmpidTaskClusters",
"moduleName": "QcHMPID",
"detectorName": "HMP",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -620,7 +620,7 @@
"className": "o2::quality_control_modules::hmpid::HmpidTaskMatches",
"moduleName": "QcHMPID",
"detectorName": "HMP",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -633,7 +633,7 @@
"className": "o2::quality_control_modules::fv0::DigitQcTask",
"moduleName": "QcFV0",
"detectorName": "FV0",
- "cycleDurationSeconds": "600",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -653,7 +653,7 @@
"className": "o2::quality_control_modules::trd::DigitsTask",
"moduleName": "QcTRD",
"detectorName": "TRD",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"dataSource": {
"type": "direct",
"query": "digits:TRD/DIGITS;triggers:TRD/TRKTRGRD;noiseMap:TRD/NOISEMAP/0?lifetime=condition&ccdb-path=TRD/Calib/NoiseMapMCM;chamberStatus:TRD/CHSTATUS/0?lifetime=condition&ccdb-path=TRD/Calib/HalfChamberStatusQC;fedChamberStatus:TRD/FCHSTATUS/0?lifetime=condition&ccdb-path=TRD/Calib/DCSDPsFedChamberStatus"
@@ -664,7 +664,7 @@
"className": "o2::quality_control_modules::trd::TrackletsTask",
"moduleName": "QcTRD",
"detectorName": "TRD",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"dataSource": {
"type": "direct",
"query": "tracklets:TRD/TRACKLETS;triggers:TRD/TRKTRGRD;noiseMap:TRD/NOISEMAP/0?lifetime=condition&ccdb-path=TRD/Calib/NoiseMapMCM;chamberStatus:TRD/CHSTATUS/0?lifetime=condition&ccdb-path=TRD/Calib/HalfChamberStatusQC;fedChamberStatus:TRD/FCHSTATUS/0?lifetime=condition&ccdb-path=TRD/Calib/DCSDPsFedChamberStatus"
@@ -675,7 +675,7 @@
"className": "o2::quality_control_modules::trd::PulseHeightTrackMatch",
"moduleName": "QcTRD",
"detectorName": "TRD",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"dataSource": {
"type": "direct",
"query": "phValues:TRD/PULSEHEIGHT"
@@ -686,7 +686,7 @@
"className": "o2::quality_control_modules::trd::TrackingTask",
"moduleName": "QcTRD",
"detectorName": "TRD",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"dataSource": {
"type": "direct",
"query": "trackITSTPCTRD:TRD/MATCH_ITSTPC;trigITSTPCTRD:TRD/TRGREC_ITSTPC;trackTPCTRD:TRD/MATCH_TPC;trigTPCTRD:TRD/TRGREC_TPC"
@@ -702,7 +702,7 @@
"className": "o2::quality_control_modules::glo::ITSTPCMatchingTask",
"moduleName": "QcGLO",
"detectorName": "GLO",
- "cycleDurationSeconds": "3600",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -744,7 +744,7 @@
"moduleName": "QcMUONCommon",
"detectorName": "GLO",
"taskName": "MUONTracks",
- "cycleDurationSeconds": "300",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -771,7 +771,7 @@
"className": "o2::quality_control_modules::glo::VertexingQcTask",
"moduleName": "QcGLO",
"detectorName": "GLO",
- "cycleDurationSeconds": "60",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct",
@@ -787,7 +787,7 @@
"className": "o2::quality_control_modules::pid::TaskFT0TOF",
"moduleName": "QcTOF",
"detectorName": "TOF",
- "cycleDurationSeconds": "10",
+ "cycleDurationSeconds": "20000",
"maxNumberCycles": "-1",
"dataSource": {
"type": "direct", |
|
Thinking about it, there is an obvious race condition when merging histograms in async mode. The more histograms are in flight, the longer it takes to merge them, so they accumulate in shared memory if they are published at a fixed rate and never dropped. |
|
Why "never dropped"? |
|
Never dropped in the sense that the merger tries to merge every single histogram that is thrown at it. So if histograms are produced at a rate higher than the merger's ability to merge them, memory usage diverges (and eventually shared memory becomes full). |
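The divergence described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions (a fixed number of objects published per cycle, a fixed merging capacity per cycle, nothing ever dropped); the function name `backlogAfter` and its parameters are hypothetical and do not correspond to anything in the QC framework.

```cpp
#include <algorithm>
#include <cstdint>

// Toy model: each cycle the tasks publish producedPerCycle objects and the
// merger consumes at most mergedPerCycle of the pending ones. Nothing is dropped.
// Returns the number of objects still waiting after the given number of cycles.
std::uint64_t backlogAfter(std::uint64_t cycles,
                           std::uint64_t producedPerCycle,
                           std::uint64_t mergedPerCycle)
{
  std::uint64_t backlog = 0;
  for (std::uint64_t i = 0; i < cycles; ++i) {
    backlog += producedPerCycle;                  // new objects arrive
    backlog -= std::min(backlog, mergedPerCycle); // merger drains what it can
  }
  return backlog;
}
```

In this model the backlog stays at zero as long as the merger keeps up, and grows linearly with the number of cycles as soon as production exceeds merging capacity, which is the unbounded memory growth discussed in this thread.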
|
In the particular case we were looking at, 800 MB of histograms were produced and had to be merged for every timeframe.