
Debug mode for topology generation #1460

Merged
martenole merged 2 commits into AliceO2Group:master from martenole:topo
Feb 15, 2024


@martenole
Contributor

This would allow running the epn-topo-merger with the options --force-exact-node-numbers --nodes-mi50 1 --nmin-mi50 0 --nodes-mi100 1 --nmin-mi100 0 when the user sets DEBUG_TOPOLOGY_GENERATION=1. In that case, the temporary XML file would also be kept for inspection.
Pinging @davidrohr: is this fine with you?
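
A minimal sketch of how such a switch could be wired, assuming the extra options are produced by a small shell helper. The function name merger_debug_args is hypothetical and not taken from the actual gen_topo scripts; only the flags and the variable name come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the extra epn-topo-merger options when
# DEBUG_TOPOLOGY_GENERATION=1 is set, and prints nothing otherwise.
merger_debug_args() {
  if [ "${DEBUG_TOPOLOGY_GENERATION:-0}" = "1" ]; then
    echo "--force-exact-node-numbers --nodes-mi50 1 --nmin-mi50 0 --nodes-mi100 1 --nmin-mi100 0"
  fi
}

# Show what would be appended to the merger command line in the current environment.
echo "extra merger options: $(merger_debug_args)"
```

The output could then be spliced into the merger command line; with the variable unset, the command line stays unchanged.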

@davidrohr
Collaborator

This makes sense to me.
Perhaps we also want to enable some other debug options, or set some options that we usually need to add, like:

  • EPN2EOS_METAFILES_DIR
  • GEN_TOPO_OVERRIDE_TEMPDIR : this will use a specific folder instead of /tmp for some temp files (QC JSONs, etc.), which will then also be kept.
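
The temp-folder behaviour described for GEN_TOPO_OVERRIDE_TEMPDIR could look roughly like the sketch below. This is an illustration under assumptions, not the actual gen_topo code: the helper names setup_tmpdir and cleanup_tmpdir, the variables TOPO_TMPDIR and KEEP_TMP, and the qc.json file name are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only GEN_TOPO_OVERRIDE_TEMPDIR comes from the discussion.
setup_tmpdir() {
  if [ -n "${GEN_TOPO_OVERRIDE_TEMPDIR:-}" ]; then
    TOPO_TMPDIR="$GEN_TOPO_OVERRIDE_TEMPDIR"
    mkdir -p "$TOPO_TMPDIR"
    KEEP_TMP=1   # user-chosen folder: keep the files for inspection
  else
    TOPO_TMPDIR=$(mktemp -d "${TMPDIR:-/tmp}/gen_topo.XXXXXX")
    KEEP_TMP=0   # throwaway folder: remove it when done
  fi
}

cleanup_tmpdir() {
  if [ "$KEEP_TMP" = "0" ]; then
    rm -rf "$TOPO_TMPDIR"
  fi
}

setup_tmpdir
echo '{}' > "$TOPO_TMPDIR/qc.json"   # stand-in for a generated QC JSON
cleanup_tmpdir
echo "temp dir: $TOPO_TMPDIR (kept: $KEEP_TMP)"
```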

@martenole
Contributor Author

I wanted to add these points to the logFetcher tool. I have changed it so that it prepends GEN_TOPO_WORKDIR=$PWD and uses gen_topo.sh without logging. To print the workflow, one would need to add WORKFLOWMODE=print in any case, and I think EPN2EOS_METAFILES_DIR would only be needed then.

So at the moment I have the following in the logFetcher:

if [[ $TOPO == 1 ]]; then
  # Fetch all gen-topo log lines for this partition from the infrastructure node
  TOPO_LOG=$(ssh "epnlog@$INFRANODE" grep "$PARTITION" /var/log/topology/gen-topo.log)
  if [[ -z $TOPO_LOG ]]; then
    echo "No topology logs found for: partition=$PARTITION and role=$ROLENAME"
    echo "Typo in partition ID or looking in staging instead of production or vice versa?"
    exit 1
  fi
  echo "$TOPO_LOG"
  # Extract the logged command, from GEN_TOPO_HASH up to the GenTopo bin directory
  TOPO_COMMAND=$(grep -o 'GEN_TOPO_HASH.*/opt/alisw/el8/GenTopo/bin' <<< "$TOPO_LOG")
  TOPO_COMMAND+="/gen_topo.sh"
  echo -e "\n\033[0;31mIn order to debug the topology generation, run the following:\033[0m\n"
  echo "DEBUG_TOPO_GENERATION=1 GEN_TOPO_WORKDIR=\$PWD $TOPO_COMMAND"
  echo -e "\n\033[0;31mIn case you want to print the workflow, prepend the following:\033[0m\n"
  echo "WORKFLOWMODE=print EPN2EOS_METAFILES_DIR=FOO"
  echo
  exit 0
fi

which would print, for example:

[...]
20240212-124200 2kcLhNjKEhr :     topology generation failed

In order to debug the topology generation, run the following:

DEBUG_TOPO_GENERATION=1 GEN_TOPO_WORKDIR=$PWD GEN_TOPO_HASH=1 GEN_TOPO_SOURCE='epn-20240211' DDMODE='processing' GEN_TOPO_LIBRARY_FILE='production/production.desc' GEN_TOPO_WORKFLOW_NAME='synchronous-workflow-calib' WORKFLOW_DETECTORS='TOF,MCH,MFT,TRD,ZDC,CPV,TPC,FDD,MID,EMC,ITS,FV0,PHS,FT0' WORKFLOW_DETECTORS_EXCLUDE_QC='' WORKFLOW_DETECTORS_EXCLUDE_CALIB='' WORKFLOW_PARAMETERS='QC,GPU,CALIB,EVENT_DISPLAY' RECO_NUM_NODES_OVERRIDE=2 RECO_MAX_FAIL_NODES_OVERRIDE=1 MULTIPLICITY_FACTOR_RAWDECODERS=1 MULTIPLICITY_FACTOR_CTFENCODERS=1 MULTIPLICITY_FACTOR_REST=1 BEAMTYPE='pp' NHBPERTF=32 GEN_TOPO_ONTHEFLY=1 OVERRIDE_PDPSUITE_VERSION='O2PDPSuite/epn-20240211-DDv1.6.4-QCv1.133.0-flp-suite-v1.20.0-1' SET_QCJSON_VERSION='Y6K68Z8puXy8GOX2pyzhk+VLuMY8WK/nN5Gs0T01re0=' DD_DISK_FRACTION='100' SHM_MANAGER_SHMID='1' RUNTYPE=SYNTHETIC FLP_IDS='S01,S05,S02,S06,S09,S03,S13,S10,S12,S07,S11,S04,S08,S14' GEN_TOPO_DEPLOYMENT_TYPE=ALICE_STAGING ED_VERTEX_MODE=1 ARGS_EXTRA_PROCESS_o2_eve_export_workflow='--number-of_files 20' IS_SIMULATED_DATA=1   CONFIG_EXTRA_PROCESS_o2_itsmft_stf_decoder_workflow="ITSClustererParam.maxBCDiffToMaskBias=-1;MFTClustererParam.maxBCDiffToMaskBias=-1;" WORKFLOW_EXTRA_PROCESSING_STEPS="TPC_DEDX,MFT_RECO,MID_RECO,MCH_RECO,MATCH_MFTMCH,MATCH_MCHMID,MUON_SYNC_RECO,ZDC_RECO,FV0_RECO,FDD_RECO" ARGS_EXTRA_PROCESS_o2_calibration_residual_aggregator='--output-type unbinnedResid,trackParams' CALIB_TPC_SCDCALIB_SENDTRKDATA=1 GEN_TOPO_AUTOSCALE_PROCESSES=0 WORKFLOW_DETECTORS_FLP_PROCESSING=NONE RECOSHMSIZE=120259084288 DDSHMSIZE=114688 /opt/alisw/el8/GenTopo/bin/gen_topo.sh

In case you want to print the workflow, prepend the following:

WORKFLOWMODE=print EPN2EOS_METAFILES_DIR=FOO
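
The extraction step of the logFetcher snippet can be tried in isolation. The sketch below uses a shortened, made-up log line (the real entry is the long one shown above, and the exact layout of the log file is an assumption) to show how grep -o keeps only the part from GEN_TOPO_HASH up to the GenTopo bin directory.

```shell
#!/usr/bin/env bash
# Shortened, made-up log line; the trailing text stands in for anything logged after the command.
SAMPLE_LOG="20240212-124200 2kcLhNjKEhr : GEN_TOPO_HASH=1 DDMODE='processing' /opt/alisw/el8/GenTopo/bin/gen_topo.sh trailing text"

# -o prints only the matched part of the line, dropping the timestamp prefix and the tail.
TOPO_COMMAND=$(printf '%s\n' "$SAMPLE_LOG" | grep -o 'GEN_TOPO_HASH.*/opt/alisw/el8/GenTopo/bin')
TOPO_COMMAND="${TOPO_COMMAND}/gen_topo.sh"

# \$PWD is escaped so that it is expanded where the user later runs the command.
echo "DEBUG_TOPO_GENERATION=1 GEN_TOPO_WORKDIR=\$PWD $TOPO_COMMAND"
```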


@martenole
Contributor Author

Hi @davidrohr, could you check again whether this makes sense to you? I have also added a separate directory for the topology cache on staging.

@davidrohr
Collaborator

> Hi @davidrohr, could you check again whether this makes sense to you? I have also added a separate directory for the topology cache on staging.

The only thing is the long-pending JIRA ticket https://its.cern.ch/jira/browse/EPN-436 to move the gen_topo temp folder off the scratch NFS, as requested by EPN. But it is up to them to do this; until then I am fine with your change. Afterwards we will need to revisit this.

@davidrohr
Collaborator

Also, in principle it is no problem if they use the same temp folder, since I use a file lock to serialize the access. It just means the topology cache uses the same folder, but since the cache is verified against all set env variables, entries cannot be mixed up. Anyhow, it is a bit cleaner to separate them, so just go ahead.
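
As an illustration of why a shared folder is safe, here is a minimal sketch of lock-serialized cache access. It is not the actual gen_topo implementation: the function names, CACHE_DIR, and the lock file name are hypothetical, and the verification against the environment is approximated by hashing a few illustrative variables into the cache file name. It assumes flock (util-linux) and sha256sum (coreutils) are available.

```shell
#!/usr/bin/env bash
CACHE_DIR="${CACHE_DIR:-$(mktemp -d)}"   # assumed cache location

# Derive a cache key from variables that influence the topology (illustrative subset).
cache_key() {
  printf '%s\n' "GEN_TOPO_SOURCE=${GEN_TOPO_SOURCE:-}" \
                "WORKFLOW_DETECTORS=${WORKFLOW_DETECTORS:-}" \
                "GEN_TOPO_DEPLOYMENT_TYPE=${GEN_TOPO_DEPLOYMENT_TYPE:-}" \
    | sha256sum | cut -d' ' -f1
}

# Run a command while holding an exclusive lock on file descriptor 9.
with_cache_lock() {
  (
    flock -x 9
    "$@"
  ) 9>"$CACHE_DIR/.lock"
}

# Reuse the cached file only if it was produced for the same key; otherwise create it.
lookup_or_generate() {
  key=$(cache_key)
  file="$CACHE_DIR/$key.xml"
  if [ -f "$file" ]; then
    echo "cache hit"
  else
    echo "<topology for $key>" > "$file"   # placeholder for the real generation step
    echo "generated"
  fi
}

with_cache_lock lookup_or_generate
```

Because the key depends on the environment, a staging and a production request map to different cache files even when they share one folder, and the lock keeps concurrent writers from interleaving.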

@martenole
Contributor Author

Thanks! Then pinging @lkrcal here as well regarding the temp folder for the cache on the infra nodes. For now I will merge this as is.

@martenole martenole merged commit a4f859c into AliceO2Group:master Feb 15, 2024
@martenole martenole deleted the topo branch February 15, 2024 20:27
