Exclude NANOAOD from Unified output dataplacement for standard workfl… by todor-ivanov · Pull Request #532 · CMSCompOps/WmAgentScripts

todor-ivanov · 2020-04-16T15:54:05Z

Fixes #530
Status
Description

As part of the Rucio migration there have been a plan put into action to initially adapt the NANOAOD data tier. Which means that Unified should not subscribe data transfers when it comes to those datasets.

After our previous investigation of the code we concluded that Unified is doing output data placement through DDM for standard Workflows and through Phedex for RelVal. It seems we can benefit from the already existing parameter 'tier_to_DDM' here. And we can hard code the exclusion of NANOD for Relval in a similar way as it is done for the rest of the data tiers here: [1]

[1] https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/closor.py#L502-L518

Is it backward compatible (if not, which system it affects?)
Related PRs

NO
External dependencies / deployment changes

We may want to make it configurable for RelVal workflows too.
Mention people to look at PRs

@sharad1126 @amaltaro @vlimant

vlimant · 2020-04-17T08:27:51Z

the relval nanoaod is not the same as central production aod, they have to be available where people expect to run validation ; CERN for example

amaltaro · 2020-04-17T09:05:12Z

@nsmith- Nick, give that NANO data will be soon managed by Rucio/DM team, you might want to have a special rule/subscription for RelVal NANO.

nsmith- · 2020-04-17T16:36:40Z

Will WMAgent be able to set some DID metadata that replicates wfi.isRelval() flag? Or is that determinable via existing metadata (or the dataset name itself?) Based on

WmAgentScripts/utils.py

Line 6506 in d11ab40

def isRelval(self):

looks like it might need request details

todor-ivanov · 2020-04-21T13:30:47Z

@nsmith- your question is quite relevant. I do not think we discussed how to fetch this exact piece of information when we were discussing the output data placement in Micro Services. I was kind of always thinking of this info like a given and being easily accessible through ReqMgr.

todor-ivanov · 2020-04-21T13:33:38Z

@sharad1126 ,What is the status of this PR? Did you manage to test it somehow? And are you planning to merge it some time this week? I believe we will need it soon. Thanks in advance.

nsmith- · 2020-04-21T15:03:07Z

@todor-ivanov For the marking of RelVal datasets, I believe after discussing with @amaltaro the plan here is to set the PhEDEx group name as metadata on the container DID, and all RelVal have that group.

sharad1126 · 2020-04-21T15:05:22Z

@todor-ivanov I am not sure how to test this. you need to provide me some specific workflows for this case where we can try running the script on.

todor-ivanov · 2020-04-21T16:24:02Z

Thanks Nick @nsmith-, This was yet another aspect of the point you were making and I did not fully get it before. Yes, marking the the dataset itself as a Relval makes quite a sense, and IIUC as it cannot be done through the dataset metadata itself, you suggest that we do it with including the needed markers in the transfer submission metadata, right? I think in the current implementation, what unified is doing is more or less similar - it marks those datasets with setting the PhEDEx group to group='RelVal'.

And just to clarify, when you are mentioning container DID, you refer to a Rucio Container, right?

nsmith- · 2020-04-21T16:38:12Z

including the needed markers in the transfer submission metadata

Actually I think it needs to be in the DID creation metadata (on injection by WMAgent), as opposed to the rule metadata (what I take to be "transfer submission"--really there is no transfer submission concept in rucio, just rule creation, which may or may not require an actual transfer)
And yes, I mean the rucio container DID (=CMS dataset)

todor-ivanov · 2020-04-21T17:44:15Z

Thanks @nsmith- I think I got the picture now. Most of this discussion provides useful information for the Output Data Placement in Microservices issue though. But it was really good that you brought that up here. Thanks again.

todor-ivanov · 2020-04-21T17:55:31Z

@sharad1126 I do not think we can test it explicitly with some dedicated workflows. At least not until the testing environment for Unified is completely setup so one can do such a test with sending such workflows to testbed. The best we can do here is just you do the normal test in the so far semi testing machine (vocms0** - I do not remember which one was that) as you normally do with any other change in the code, and just check if there are no any typos or something that totally breaks the 'closor' module. What I remember you telling me about that proccess is, that you were cherrypicking commits in that machine order to do the test. Or did I understand you wrong?

amaltaro · 2020-04-21T19:35:10Z

Actually I think it needs to be in the DID creation metadata (on injection by WMAgent),

@nsmith- Nick, I did try it - I guess for block creation - and the did creation failed. Can you please check whether it's supported or not at block/container creation and let us know?

nsmith- · 2020-04-21T20:15:40Z

@amaltaro it seems, unlike for rule metadata, that for DID metadata the keys are enforced to be from a predefined set. One has to first add the metadata key via commands like https://rucio.readthedocs.io/en/latest/api/meta-data.html
Since this only happens once, should we apply this to the database you are testing against?

amaltaro · 2020-04-21T20:24:36Z

Yes, now I recall you had already mentioned it in the past week.
I think we could then add this group key (string with not more than...50 chars(?)) to the cmsrucio-int instance. I also think it makes sense to enable such metadata/key for both block and container data injection, but I leave it up to you decide at which level it makes more sense.

nsmith- · 2020-04-21T22:05:48Z

We'll carry on this metadata discussion at dmwm/WMCore#9655

sharad1126 · 2020-04-27T12:46:56Z

as it looks like from the discussion above, this is still an on going PR. @amaltaro @todor-ivanov should I convert this to draft?

todor-ivanov · 2020-04-27T13:11:09Z

@sharad1126 As Far as it concerns the Unified code, It is the whole thing. I do not think you should turn it into draft. Again, the data tiers to fall under the change here (and to be excluded from data placement) are configurable through Unified Config. You can always change them. And the question is more like - should we switch off data placement for NANOAOD now or we are about to wait?

About the discussion triggered above - it is already moved in Wmcore, as It concerns how we will distinguish the data in the Future implementation.

sharad1126 · 2020-04-27T13:15:30Z

@todor-ivanov does that mean we can test and then just merge this PR?

todor-ivanov · 2020-04-27T13:20:42Z

@sharad1126 Yes. Even if we decide to wait with disabling the NANOAD immediately, the code needs to be tested and proved to work when the time comes.

sharad1126 · 2020-04-27T13:37:00Z

@todor-ivanov from my side it looks like it is ok to merge. I don't have any specific things to test it on but for the python related checks and all, it is all ok to merge.

amaltaro

@todor-ivanov I have made a couple of comments in the code that will make this change cleaner.
@sharad1126 no, it cannot be merged and put in production because the default configuration is already skipping NANO data.

I still have a question though. If we do not place those NANOAD datasets here, won't it be a blocker to get workflows announced?

sharad1126 · 2020-04-27T13:55:13Z

@amaltaro thanks for confirming. I think that sending the data to ddm and announcing workflows are still two different things and if needed, workflows can be announced without sending data to ddm. the whole point being, one needs to take care that the workflow status is changed accordingly.

amaltaro · 2020-04-27T15:54:39Z

the whole point being, one needs to take care that the workflow status is changed accordingly.

That's exactly my point. I think unified checks on the output data placement status, and I'm afraid that not placing NANOAOD might cause workflows to get stuck in closed-out.
Of course, we can also keep an eye on things once this gets merged and follow up on any other changes needed.

vlimant · 2020-04-27T16:08:21Z

actual location of the output is only checked for DQMIO location and possibility of running harvesting.
the location of anything else is looked at but ignored for any decision on state transition.

todor-ivanov · 2020-04-29T18:00:10Z

@amaltaro @sharad1126 @vlimant Here is the code with the early interruption of the cycle. Which I agree, looks much prettier .... but in this case I think the beauty will just introduce some issues. And the way I did it the previous time was exactly to avoid the behavior Alan and Sharad were discussing above - to avoid Having 'NANOAOD' workflows that due to this check and interruption of the logic start accumulating in closed-out status.

todor-ivanov · 2020-05-05T15:24:53Z

Upon a double check I noticed which was the line I was overlooking and now I convinced myself it will work both ways. So the current version of the PR is definitely achieving the same goal as the previous one. @sharad1126 could you please test and merge it?

sharad1126 · 2020-05-06T07:51:46Z

@todor-ivanov @amaltaro I see no issues in the code. There is nothing I can test this code on and won't affect anything in unified's side. It won't stop doing anything which it was doing previously as you just introduced 2 new lists which checks and sends logs only. The other lists which has NANOAOD will still work as it is as the NANOAOD in those lists still exist.

amaltaro · 2020-05-06T08:01:52Z

Thanks @todor-ivanov , it looks good to me. Todor, can you please update the description of this PR with what exactly is being proposed here? That will clarify things to Sharad as well.

@nsmith- Nick, do we need any handshake before it can get merged?

sharad1126 · 2020-05-06T09:35:25Z

@amaltaro that would be great indeed. thanks.

todor-ivanov · 2020-05-06T17:40:01Z

Done

nsmith- · 2020-05-06T21:43:31Z

As I understand it, merging this changes production behavior. I think then we need to wait for all WMAgents to be running in Rucio-Nano mode before removing this?

…ows. Exclude NANOAOD from Unified output dataplacement for RelVal workflows. Exclude NANOAOD from Unified - add tier_to_skip in UnifiedConfig. Early DataTier check and cycle interruption && put back NANOOD in the default list tiers_to_DDM Revert privious bad membership check Renaming tier configuration lists Switching off Rucio for nonrelval tiers

todor-ivanov · 2020-05-15T16:58:52Z

@sharad1126 this one must be merged before [1]. Thanks!

[1]
#561

amaltaro · 2020-05-15T18:28:52Z

@nsmith- Nick, did you manage to create the NANO rucio subscriptions? I guess that's the only thing that is holding us on merging this PR, isn't it?

nsmith- · 2020-05-15T18:50:54Z

Ok so its not clear to me yet: if this is merged then unified will stop sending NANO replication requests to Dynamo?
If so, we should wait to merge it until at least the first WMAgent that outputs to rucio is running, and preferably until all are running.

amaltaro · 2020-05-18T09:59:08Z

@sharad1126 let me try to clarify this PR.

It provides to configuration parameters:

tiers_to_rucio_relval: will contain a list of datatiers that are injected into Rucio - by the RelVal WMAgent - thus we do not want to make any (phedex) data subscription for them.
tiers_to_rucio_nonrelval: will contain a list of datatiers that are injected into Rucio - by production WMAgents - thus we do not want to make any (phedex) data subscription for them as well.

In case they are set to an empty list, we deal with data subscription as it's done nowadays. Otherwise, datatiers listed won't get a DDM/PhEDEx subscription.

sharad1126 · 2020-05-18T10:12:39Z

@amaltaro I understand what you mean. My point was to introduce the second one later but ok, I don't mind this as well. From my side, I see no issues in this PR and we are ready to merge this once you are ready to handle the relval NANOAOD output.

nsmith- reviewed Apr 17, 2020

View reviewed changes

Comment thread Unified/closor.py Outdated

todor-ivanov force-pushed the feature_disable_NANOAOD branch from d11ab40 to dec68ab Compare April 17, 2020 18:16

todor-ivanov mentioned this pull request Apr 21, 2020

Disable NANOAOD/NANOAODSIM output data placement in Unified dmwm/WMCore#9634

Closed

amaltaro suggested changes Apr 27, 2020

View reviewed changes

Comment thread Unified/closor.py Outdated

Comment thread Unified/closor.py Outdated

todor-ivanov force-pushed the feature_disable_NANOAOD branch 2 times, most recently from de5ef60 to 1573cb1 Compare April 29, 2020 17:53

amaltaro approved these changes May 6, 2020

View reviewed changes

todor-ivanov force-pushed the feature_disable_NANOAOD branch from 1573cb1 to ba6634c Compare May 15, 2020 16:49

sharad1126 reviewed May 18, 2020

View reviewed changes

Comment thread unifiedConfiguration.json

sharad1126 mentioned this pull request May 18, 2020

Adding Rucio to DBS-Phedex consistency check #561

Merged

sharad1126 reviewed May 18, 2020

View reviewed changes

Comment thread Unified/closor.py

sharad1126 added this to the phedex to rucio transition milestone May 19, 2020

z4027163 merged commit 32b9e92 into CMSCompOps:master Jun 2, 2020

Conversation

todor-ivanov commented Apr 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vlimant commented Apr 17, 2020

Uh oh!

amaltaro commented Apr 17, 2020

Uh oh!

Uh oh!

nsmith- commented Apr 17, 2020

Uh oh!

todor-ivanov commented Apr 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

todor-ivanov commented Apr 21, 2020

Uh oh!

nsmith- commented Apr 21, 2020

Uh oh!

sharad1126 commented Apr 21, 2020

Uh oh!

todor-ivanov commented Apr 21, 2020

Uh oh!

nsmith- commented Apr 21, 2020

Uh oh!

todor-ivanov commented Apr 21, 2020

Uh oh!

todor-ivanov commented Apr 21, 2020

Uh oh!

amaltaro commented Apr 21, 2020

Uh oh!

nsmith- commented Apr 21, 2020

Uh oh!

amaltaro commented Apr 21, 2020

Uh oh!

nsmith- commented Apr 21, 2020

Uh oh!

sharad1126 commented Apr 27, 2020

Uh oh!

todor-ivanov commented Apr 27, 2020

Uh oh!

sharad1126 commented Apr 27, 2020

Uh oh!

todor-ivanov commented Apr 27, 2020

Uh oh!

sharad1126 commented Apr 27, 2020

Uh oh!

amaltaro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sharad1126 commented Apr 27, 2020

Uh oh!

amaltaro commented Apr 27, 2020

Uh oh!

vlimant commented Apr 27, 2020

Uh oh!

todor-ivanov commented Apr 29, 2020

Uh oh!

todor-ivanov commented May 5, 2020

Uh oh!

sharad1126 commented May 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amaltaro commented May 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sharad1126 commented May 6, 2020

Uh oh!

todor-ivanov commented May 6, 2020

Uh oh!

nsmith- commented May 6, 2020

Uh oh!

todor-ivanov commented May 15, 2020

Uh oh!

amaltaro commented May 15, 2020

Uh oh!

nsmith- commented May 15, 2020

Uh oh!

Uh oh!

todor-ivanov commented Apr 16, 2020 •

edited

Loading

todor-ivanov commented Apr 21, 2020 •

edited

Loading

sharad1126 commented May 6, 2020 •

edited

Loading

amaltaro commented May 6, 2020 •

edited

Loading