
Update data files for DropBoxMetadata tag #40522

Merged
1 commit merged on Jan 29, 2023

Conversation

malbouis
Contributor

PR description:

This is to update the JSON files that are used to produce the DropBoxMetadata tag. Some of the tags (SiStripBadChannel_PCL_v1_prompt, BeamSpotObjects_PCL_byRun_v1_prompt, BeamSpotObjects_PCL_byLumi_v1_prompt) had pcl synchronisation in Prep DB and this was preventing upload from Tier0 replays. This PR replaces the faulty tags and also updates a few other ones to have a different name between the Prod and Prep tags, to try to avoid such a case of wrong synchronisation in Prep DB in the future.

The difference between the new DMD tag produced with the changes introduced here and the previous one can be seen in https://cern.ch/go/LC9S

The new DMD tag has been included in the 126X_dataRun3_Express_Queue and will be picked up when a new Express GT is created.
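
For orientation, here is a hypothetical sketch of the kind of per-record entry such JSON files carry; the field names, connection placeholders and the renamed prep tag are illustrative assumptions, not the actual content of the files changed in this PR. The point is that the Prep record gets its own tag name (and a non-pcl synchronisation) so that Tier0 replay uploads are no longer blocked.

```python
# Hypothetical illustration only: field names, connection strings and the renamed
# prep tag are assumptions, not the actual content of the JSON files in this PR.
record_prod = {
    "destinationDatabase": "<prod conditions DB connection>",
    "destinationTags": {"BeamSpotObjects_PCL_byRun_v1_prompt": {}},  # unchanged prod tag, pcl sync
}
record_prep = {
    "destinationDatabase": "<prep conditions DB connection>",
    # made-up renamed tag, distinct from the prod one, with a looser synchronisation
    "destinationTags": {"BeamSpotObjects_PCL_byRun_v1_prompt_replays": {}},
}
```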

PR validation:

  • created a Candidate GT by queueing the new tag to 126X_dataRun3_Express_Queue
  • ran the RTM workflow 1001.3 with the candidate GT from above

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Not a backport.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-40522/33727

  • This PR adds an extra 12KB to the repository

@cmsbuild
Contributor

A new Pull Request was created by @malbouis for master.

It involves the following packages:

  • CondFormats/Common (db, alca)

@malbouis, @yuanchao, @cmsbuild, @saumyaphor4252, @ggovi, @tvami, @ChrisMisan, @francescobrivio can you please review it and eventually sign? Thanks.
@mmusich, @tocheng, @seemasharmafnal this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@malbouis
Contributor Author

please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b59c0c/29986/summary.html
COMMIT: 226b19c
CMSSW: CMSSW_13_0_X_2023-01-13-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/40522/29986/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 7 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3555538
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555510
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 211 log files, 162 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@mmusich
Contributor

mmusich commented Jan 13, 2023

@malbouis
I am puzzled by this PR:

(SiStripBadChannel_PCL_v1_prompt, ...) had pcl synchronisation in Prep DB and this was preventing upload from Tier0 replays.

the corresponding prod tag for the strip bad components has pcl synchronization as well:

"SiStripBadChannel_FromOfflineCalibration_GR10_v1_prompt": {}

https://cms-conddb.cern.ch/cmsDbBrowser/list/Prod/tags/SiStripBadChannel_FromOfflineCalibration_GR10_v1_prompt

and it has been set like this on purpose.
See the "manual":

[Screenshot from 2023-01-13 22-20-42 attached]

Prompt synchronization would imply that if the payload produced by the PCL has an IoV smaller than the FCSR, it is uploaded nonetheless at FCSR+1. This is not OK for a bad-component condition that changes dynamically run by run: you don't want the bad components computed on run n to be written for run n+1, since the dynamic part driven by readout noise might be temporary.
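
To make this concrete, here is a minimal sketch of the difference (a simplified, assumed model of the uploader's decision, not the actual cmsDbUpload / Tier0 uploader code) for a payload whose IoV falls below the FCSR (first condition safe run); the run numbers are made up:

```python
# Simplified model of the uploader's synchronization check; an assumption for
# illustration, not the real upload-service implementation.
def accept_upload(sync, since, fcsr):
    """Return (accepted, effective_since) for a payload with IOV 'since'
    checked against the first condition safe run 'fcsr'."""
    if sync == "any":
        return True, since                    # no check against the FCSR
    if sync == "prompt":
        # a late payload is not rejected: it is written at FCSR+1 instead
        return True, max(since, fcsr + 1)
    if sync == "pcl":
        # a late payload is rejected outright: run-n bad components must not
        # end up being applied to run n+1
        return (since > fcsr, since if since > fcsr else None)
    raise ValueError("unknown synchronization: %s" % sync)

# A Tier0 replay processes an old run, so its 'since' is far below the FCSR:
print(accept_upload("pcl", since=355000, fcsr=362000))     # (False, None): rejected
print(accept_upload("prompt", since=355000, fcsr=362000))  # (True, 362001): moved to FCSR+1
```
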
Can you elaborate why

had pcl synchronisation in Prep DB and this was preventing upload from Tier0 replays.

?
Also, shouldn't this change be made in concert with the subsystem experts?

@malbouis
Contributor Author

malbouis commented Jan 13, 2023

Hi Marco,

as far as I understand, the Prep DB tag is only used for the replays in this case. All replays do the upload to the Prep DB. If the synchronisation of the tag being uploaded by a replay is pcl, it will not be possible to do the upload as the run from the replay is in the past, therefore the change in this PR.

I don't see anything wrong with it as we are not touching the Prod DB.

Is the Prep DB tag for the strip bad components used for anything other than the replays? If so, I would recommend moving it to Prod DB.

Thanks.

@mmusich
Contributor

mmusich commented Jan 13, 2023

it will not be possible to do the upload as the run from the replay is in the past, therefore the change in this PR.

I think the intention of the original author was to have the same setup as used in production. Is it relevant that replays re-upload the same conditions on the prep tag?

Is the Prep DB tag for the strip bad components used for anything other than the replays?

I don't think so

@malbouis
Contributor Author

I think the intention of the original author was to have the same setup as used in production. Is it relevant that replays re-upload the same conditions on the prep tag?

We would like to always make sure the upload from Tier0 is successful. At the moment, we are getting an error from the Tier0 uploader when doing the replays and this is sub-optimal.

Is the Prep DB tag for the strip bad components used for anything other than the replays?

I don't think so

Then we are good, right? Since we would like to have a successful upload from the Tier0 uploader during the replays, we could use this new tag created in this PR.

@mmusich
Contributor

mmusich commented Jan 13, 2023

Then we are good, right? Since we would like to have a successful upload from the Tier0 uploader during the replays, we could use this new tag created in this PR.

well, I am still unsure why you want to have a successful upload. The uploads were rejected without any problems in the past 10+ years of operations.

@malbouis
Contributor Author

well, I am still unsure why you want to have a successful upload. The uploads were rejected without any problems in the past 10+ years of operations.

I think it is an extra layer of checks that all is as expected in a replay. If the tag is not used for anything else, I also don't see why it cannot be used for the original intent (replays).

@mmusich
Contributor

mmusich commented Jan 13, 2023

I think it is an extra layer of checks that all is as expected in a replay. If the tag is not used for anything else, I also don't see why it cannot be used for the original intent (replays).

The fact that the upload is rejected, following the rules of pcl synchronization, is exactly what is expected to happen. Unless there is a desire to check the content of the conditions at every replay (I am not sure who would be supposed to do that), I don't see the need for the upload. It also seems dangerous to have the replay and production setups diverge, as in the future someone might be tempted to change the production environment to even them out (and that would be wrong).

@malbouis
Contributor Author

I think it is an extra layer of checks that all is as expected in a replay. If the tag is not used for anything else, I also don't see why it cannot be used for the original intent (replays).

The fact that the upload is rejected, following the rules of pcl synchronization, is exactly what is expected to happen. Unless there is a desire to check the content of the conditions at every replay (I am not sure who would be supposed to do that), I don't see the need for the upload. It also seems dangerous to have the replay and production setups diverge, as in the future someone might be tempted to change the production environment to even them out (and that would be wrong).

But if the Prep DB is for the replays, I don't see why someone would want to even out Prep and Prod in the future, knowing that Prep DB is used only for the replays. Even if they do, the synchronisation of the prod tags should be kept in any case as we cannot have a tag with 'any' synch in prod. I honestly don't see how someone would try to even out the tags between prep and prod. And I still think it is more productive to have a successful upload (to eventually check the conditions uploaded to prep; it doesn't have to be at every replay, but it might be useful at some point) than just having a failure and getting used to ignoring upload failures.

@mmusich
Contributor

mmusich commented Jan 13, 2023

Even if they do, the synchronisation of the prod tags should be kept in any case as we cannot have a tag with 'any' synch in prod.

So, you plan to leave the prep tag unsynchronized? Mmh, not a very realistic setup.

@tvami
Contributor

tvami commented Jan 15, 2023

hi @mmusich

without any problems in the past 10+ years of operations

As you know, this year we had a few replays that were "successful", but when we plugged them into real operations we had issues. Those issues could have been discovered by reading the logs. This situation triggered the T0 team to take warnings/errors in the logs more seriously.

We'd prefer not to have this error come up in every replay...
If you are worried that the tags in Prep are used for something more than technical checks, we could rename them and add forReplays or something similar to the tag name.
About synchronization... we could set them to offline: we rarely make replays on runs that are before the last replayed run, so adding an IOV before the last IOV boundary will not come up, while adding an IOV for the same run number will come up a few times.
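
For illustration, a rough sketch of what such an offline policy would check, under the same caveat as the earlier sketch (an assumed model, not the real upload service): the comparison is against the tag's last existing IOV rather than against the FCSR.

```python
# Rough model of 'offline' synchronization, for illustration only (an assumption,
# not the real policy implementation): appending at or after the tag's last IOV
# is allowed, inserting before it is not.
def accept_upload_offline(since, existing_iovs):
    last = max(existing_iovs) if existing_iovs else 0
    return since >= last

# Replays normally move forward in run number, so appending is accepted; whether
# a repeated 'since' for the same run should pass is exactly the edge case
# mentioned above (run numbers are made up):
print(accept_upload_offline(356000, [355000, 355500]))  # True: after the last IOV boundary
print(accept_upload_offline(354000, [355000, 355500]))  # False: before the last IOV boundary
```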

What do you think?

@mmusich
Contributor

mmusich commented Jan 16, 2023

This situation triggered the T0 team to take warnings/errors in the logs more seriously.

OK, that's good to know.

About synchronization... we could set them to offline: we rarely make replays on runs that are before the last replayed run, so adding an IOV before the last IOV boundary will not come up, while adding an IOV for the same run number will come up a few times.

Sometimes one does "preplays" though, right? Especially when you are not really sure whether Prompt will come out right or not. I saw that happening at least a couple of times in the past.

@malbouis
Contributor Author

I think that in this specific case of the replays the 'any' synchronisation is actually appropriate. We are not interested in the overall history of the tags but rather in each replay independently.

To summarize:

  • We would like to have clean logs from the Tier0 uploader for the replays (as a good practice not to ignore errors related to replays, and by request from the Tier0 experts)

  • For this, we think the best would be to keep all the PrepDB tags with 'any' synchronisation

I would like, if possible, to move forward with this PR. Marco, please let us know if there is any objection from your side.

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@mmusich
Contributor

mmusich commented Jan 25, 2023

seems we converged,

I wouldn't say we have converged; rather, we have established that we have different perceptions of what actually matters to test in a replay. Having said that, this PR is rather inconsequential, as the actual change in the Tier0 replay behavior hinges on a change of the Global Tag used there and not on the content of the release.

@rappoccio
Contributor

hold

  • We probably should discuss this strategy; it would be good to have everyone on board with the plan.

@cmsbuild
Contributor

Pull request has been put on hold by @rappoccio
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@francescobrivio
Contributor

Ciao @mmusich,

in that way the synchronization of the condition uploader daemon is not tested.

but the synchronization checks are not a feature of the Tier0 daemon, right?
AFAIU the sync is a feature of the tags themselves and the checks are performed by the cmsDbUpload service (whether the upload request comes from uploadConditions.py or from the Tier0 daemon).
So I believe that a replay, which is by definition out of sync (at least in my mind) given it is always deployed on old data, is not really the best way to test the synchronization upload policies.

that we have different perceptions of what actually matters to test in a replay.

Yes, I think that's the case 😄 As Helena mentioned already earlier in the thread, with this PR we want to make sure the replays are able to upload conditions at each replay, i.e. to check that there is no issue with the upload itself (e.g. possible authentication failures, old passwords in Tier0, etc..., all cases that happened in 2021 after the LS2 break).

@tvami
Contributor

tvami commented Jan 25, 2023

Hi @rappoccio

Having said that, this PR is rather inconsequential,

I see this comment from Marco as no objection to getting this PR merged.

@mmusich
Contributor

mmusich commented Jan 25, 2023

but the synchronization checks are not a feature of the Tier0 daemon, right?

well, no.
The "synchronization" you are referring to is a field in the database, but that is what is checked (together with the information of the FCSR) by the uploader setup to decide whether upload or not. So it is a feature of the uploading system - in the case of a Tier-0 replay of the condtion uploader daemon.

given it is always deployed on old data, is not really the best way to test the synchronization upload policies.

why not? Part of the PCL concept hinges precisely on the "too late to be accepted" (see detailed explanation at #40522 (comment))

(e.g. possible authentication failures, old passwords in Tier0, etc..., all cases that happened in 2021 after the LS2 break).

all of this can be checked as well with an upload failure - just read the logs :)

@tvami
Contributor

tvami commented Jan 25, 2023

all of this can be checked as well with an upload failure - just read the logs :)

but the point is that if the T0 team makes errors pop up based on the logs, then all replays will be unsuccessful because of a known feature

@mmusich
Contributor

mmusich commented Jan 25, 2023

then all replays will be unsuccessful because of a known feature

Certainly the log parsing can be adapted to distinguish between an expected failure and a real failure.

@malbouis
Contributor Author

@mmusich, AlCaDB would really like to have this PR merged, as we have been dealing with many replays and this would ease their evaluation for us.
Do you have strong objections to merging this PR?
We can of course re-discuss this matter in the future and re-evaluate the decision taken here.
But for now, we would really appreciate being able to merge it as is.
Please let us know.

@mmusich
Contributor

mmusich commented Jan 27, 2023

Do you have strong objections to merging this PR?

@malbouis, as far as I can tell nothing is holding this PR.
I am still a bit dubious about the implied consequences. I hope that the change in the replay behaviour will be properly announced to the subsystem experts who might not be following the GitHub thread, and that once the policy change is enforced, the now "successful" replay uploads will be actively scrutinized, not relying only on the return code of the uploader.

@tvami
Contributor

tvami commented Jan 27, 2023

@rappoccio
given #40522 (comment), can you please unhold?

@perrotta
Contributor

unhold

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1
