Increasing Number of workflows with Duplicates Lumis #11956
Comments
@hassan11196 Hassan, I am not saying I am going to debug it :) But just to help WM on debugging it, would you have further details for one of these workflows? Which dataset? Which run/lumi and file have dups?
@hassan11196 Hassan, I was looking for a workflow with very few stats, and I found this cmsunified_task_EXO-Run3Summer22MiniAODv4-00662__v1_T_240213_134751_3377 which has been sitting in I implemented lumi check at 3 levels: dataset, block and file; and indeed they differ as can be seen in the following table: where we can see 4 duplicate lumis in the input dataset. In case you want to check on your side, here is the duplicate report for the input dataset (which can potentially cause duplicate lumis in the output, and it did!):
lastly, this dataset above was (potentially) produced by 3 workflows, according to request reqmgr2 api. Just for your information, the distinction of those 3 different ways to calculate lumis in the dataset are:
If you want, I can polish my python notebook and share it with you tomorrow, so that you can check the other workflows. PS: the workflow that produced the dataset above did not have any input dataset, so the duplication originated in the direct parent workflow.
Hi @amaltaro, Thank you for this investigation. I did not update the ticket, but I was able to find the files that had duplicated lumis using our scripts; I have not been able to find the lumis themselves yet. Please do share your notebook (even in an unpolished state); it will help me a lot. We now have this tool in our dashboard, which we use to find the files to invalidate for duplicate lumis. I will provide you an updated list of workflows that had duplicated lumis, along with the list of files and their run and lumi numbers.
You mentioned the same query for block and dataset. So inside the filesummaries for a dataset, are there filesummaries for each block? Another question I have is: if multiple workflows are writing to a dataset, what are the parameters that control which lumi numbers each workflow outputs? I assume Lastly, I want to minimize the time you spend on trivial things given WMCore's priorities this quarter. You can just give me pointers to find the stuff; I will gather all logs and relevant info for you in a single place and make it easy for you to narrow down the issue. Thanks a lot, Alan.
@hassan11196 Hassan, I think we covered most of this over zoom today, but please let me know if anything needs follow up.
If you planned on having multiple workflows writing to the same output dataset, then yes, you need to use However, most - if not all - of these duplicate lumis happen unintentionally, and you can either invalidate the affected files in DBS or recreate the output dataset in a v++ setup. We briefly discussed the notebook today, and here it is: I am moving this issue to waiting; once we confirm with another workflow or two that there is no problem on the WM side, I would suggest getting it closed. Please let us know how it goes. Thanks!
Hello @amaltaro https://docs.google.com/document/d/1bH6etTBucsw5F_wUKiqHBHuN1fRAlLUuGEbAaGAwd2g/edit?usp=sharing Thank you.
Hi @amaltaro For background context We reproduced this dataset as V2, but there is a discrepancy between MiniAOD and NanoAOD events and lumis.
Our initial suspicion was that it might be caused by duplicated lumis. However, our duplication check in Unified did not detect any issues. I tried the Here is a complete list of duplicated lumis: /Muon1/Run2023C-22Sep2023_v4-v2/MINIAOD
/Muon1/Run2023C-22Sep2023_v4-v2/NANOAOD
Total Duplicated Lumis MiniAOD: 372 unique lumis that are duplicated -> 865 sum of duplicated lumis. @amaltaro, can you tell me how to verify the duplicates from ROOT? I have downloaded one file.
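For anyone cross-checking the two totals above, here is a minimal sketch of one possible way to compute them. It assumes "unique lumis that are duplicated" means distinct lumi identifiers seen more than once, and "sum of duplicated lumis" means the total number of occurrences of those identifiers; the thread does not define either term, so treat both readings as assumptions. The function name `summarize_duplicates` is hypothetical and is not taken from Unified or the notebook.

```python
from collections import Counter


def summarize_duplicates(lumi_ids):
    """Return (distinct duplicated ids, total occurrences of those ids).

    `lumi_ids` is a flat iterable of hashable lumi identifiers, one entry
    per appearance in any file of the dataset (assumed input shape).
    """
    counts = Counter(lumi_ids)
    # Keep only identifiers that appear more than once.
    duplicated = {lumi: n for lumi, n in counts.items() if n > 1}
    return len(duplicated), sum(duplicated.values())


if __name__ == "__main__":
    # Lumis 1 and 3 repeat: 2 distinct duplicated ids, 5 occurrences in total.
    print(summarize_duplicates([1, 1, 2, 3, 3, 3]))
```

The identifiers only need to be hashable, so the same helper works whether the entries are bare lumi numbers or (run, lumi) tuples.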
@hassan11196 Ahmed, I have not yet checked the list you provided above.
Hi @amaltaro, I was reviewing one of the files [1] mentioned above using the DBS API and noticed that the lumis reported as duplicates have the same lumi numbers but different run numbers. It seems that the Before:
After:
So can you confirm that this was a false alarm? [1]
@hassan11196 Ahmed, if those same lumis belong to different run numbers, then it is definitely NOT a duplication. Thank you very much for spotting that. I will take this opportunity and update the python notebook in my repository (but if you prefer, feel free to share your current code and I can push that in as well). Maybe we should re-do such tests with the previous 22 (?) workflows that you reported with dup lumis?
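To illustrate the point about run numbers, here is a minimal sketch of a duplicate check keyed on the (run, lumi) pair rather than on the lumi number alone. It is not the code from the notebook or from Unified; the function name `find_duplicate_lumis` and the input shape (a dict mapping each file name to its list of (run, lumi) tuples, already fetched from DBS by some other means) are assumptions made for the example.

```python
from collections import defaultdict


def find_duplicate_lumis(file_lumis):
    """Map each duplicated (run, lumi) pair to the files that contain it.

    `file_lumis` is a dict: file name -> iterable of (run, lumi) tuples
    (assumed shape; fetching this from DBS is out of scope here).
    """
    seen = defaultdict(list)
    for file_name, pairs in file_lumis.items():
        for run, lumi in pairs:
            # The key includes the run, so equal lumi numbers in
            # different runs are never reported as duplicates.
            seen[(run, lumi)].append(file_name)
    return {key: sorted(files) for key, files in seen.items() if len(files) > 1}


if __name__ == "__main__":
    example = {
        "a.root": [(1, 10), (1, 11)],
        "b.root": [(2, 10), (1, 11)],
    }
    # Lumi 10 appears in runs 1 and 2, which is not a duplicate;
    # only (run 1, lumi 11) appears in two files.
    print(find_duplicate_lumis(example))
```

A check that compares bare lumi numbers would flag lumi 10 in this example, which is the kind of false alarm described above.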
Hi @amaltaro, Thank you |
Impact of the bug
Duplicate lumis in output files affect the output datasets. Workflows with duplicate lumis are not announced automatically and need manual operations from P&R to remove the files with duplicate lumis.
Describe the bug
There has been an increase in workflows with duplicate lumis in their outputs over the past weeks.
Monitoring Link: https://monit-grafana.cern.ch/goto/W0feXBxSR?orgId=11
It has also affected RelVal workflows, as described in this ticket
https://its.cern.ch/jira/browse/CMSPROD-165
A recent example of a workflow with duplicates
For this workflow, I have invalidated the files with duplicate lumis in DBS.
How to reproduce it
I can try submitting one of the above workflows as a backfill and see if its output also has duplicates.
Expected behavior
Output datasets should not contain files with duplicate lumis.