Skip to content
This repository was archived by the owner on Sep 12, 2024. It is now read-only.

Exclude NANOAOD from Unified output dataplacement for standard workfl…#532

Merged
z4027163 merged 1 commit intoCMSCompOps:masterfrom
todor-ivanov:feature_disable_NANOAOD
Jun 2, 2020
Merged

Exclude NANOAOD from Unified output dataplacement for standard workfl…#532
z4027163 merged 1 commit intoCMSCompOps:masterfrom
todor-ivanov:feature_disable_NANOAOD

Conversation

@todor-ivanov
Copy link
Copy Markdown
Contributor

@todor-ivanov todor-ivanov commented Apr 16, 2020

Fixes #530
Status
Description

As part of the Rucio migration there have been a plan put into action to initially adapt the NANOAOD data tier. Which means that Unified should not subscribe data transfers when it comes to those datasets.

After our previous investigation of the code we concluded that Unified is doing output data placement through DDM for standard Workflows and through Phedex for RelVal. It seems we can benefit from the already existing parameter 'tier_to_DDM' here. And we can hard code the exclusion of NANOD for Relval in a similar way as it is done for the rest of the data tiers here: [1]

[1] https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/closor.py#L502-L518

Is it backward compatible (if not, which system it affects?)
Related PRs

NO
External dependencies / deployment changes

We may want to make it configurable for RelVal workflows too.
Mention people to look at PRs

@sharad1126 @amaltaro @vlimant

@vlimant
Copy link
Copy Markdown
Contributor

vlimant commented Apr 17, 2020

the relval nanoaod is not the same as central production aod, they have to be available where people expect to run validation ; CERN for example

@amaltaro
Copy link
Copy Markdown
Contributor

@nsmith- Nick, give that NANO data will be soon managed by Rucio/DM team, you might want to have a special rule/subscription for RelVal NANO.

Comment thread Unified/closor.py Outdated
@nsmith-
Copy link
Copy Markdown

nsmith- commented Apr 17, 2020

Will WMAgent be able to set some DID metadata that replicates wfi.isRelval() flag? Or is that determinable via existing metadata (or the dataset name itself?) Based on

def isRelval(self):
looks like it might need request details

@todor-ivanov todor-ivanov force-pushed the feature_disable_NANOAOD branch from d11ab40 to dec68ab Compare April 17, 2020 18:16
@todor-ivanov
Copy link
Copy Markdown
Contributor Author

todor-ivanov commented Apr 21, 2020

@nsmith- your question is quite relevant. I do not think we discussed how to fetch this exact piece of information when we were discussing the output data placement in Micro Services. I was kind of always thinking of this info like a given and being easily accessible through ReqMgr.

@todor-ivanov
Copy link
Copy Markdown
Contributor Author

@sharad1126 ,What is the status of this PR? Did you manage to test it somehow? And are you planning to merge it some time this week? I believe we will need it soon. Thanks in advance.

@nsmith-
Copy link
Copy Markdown

nsmith- commented Apr 21, 2020

@todor-ivanov For the marking of RelVal datasets, I believe after discussing with @amaltaro the plan here is to set the PhEDEx group name as metadata on the container DID, and all RelVal have that group.

@sharad1126
Copy link
Copy Markdown
Contributor

@todor-ivanov I am not sure how to test this. you need to provide me some specific workflows for this case where we can try running the script on.

@todor-ivanov
Copy link
Copy Markdown
Contributor Author

Thanks Nick @nsmith-, This was yet another aspect of the point you were making and I did not fully get it before. Yes, marking the the dataset itself as a Relval makes quite a sense, and IIUC as it cannot be done through the dataset metadata itself, you suggest that we do it with including the needed markers in the transfer submission metadata, right? I think in the current implementation, what unified is doing is more or less similar - it marks those datasets with setting the PhEDEx group to group='RelVal'.

And just to clarify, when you are mentioning container DID, you refer to a Rucio Container, right?

@nsmith-
Copy link
Copy Markdown

nsmith- commented Apr 21, 2020

including the needed markers in the transfer submission metadata

Actually I think it needs to be in the DID creation metadata (on injection by WMAgent), as opposed to the rule metadata (what I take to be "transfer submission"--really there is no transfer submission concept in rucio, just rule creation, which may or may not require an actual transfer)
And yes, I mean the rucio container DID (=CMS dataset)

@todor-ivanov
Copy link
Copy Markdown
Contributor Author

Thanks @nsmith- I think I got the picture now. Most of this discussion provides useful information for the Output Data Placement in Microservices issue though. But it was really good that you brought that up here. Thanks again.

@todor-ivanov
Copy link
Copy Markdown
Contributor Author

@sharad1126 I do not think we can test it explicitly with some dedicated workflows. At least not until the testing environment for Unified is completely setup so one can do such a test with sending such workflows to testbed. The best we can do here is just you do the normal test in the so far semi testing machine (vocms0** - I do not remember which one was that) as you normally do with any other change in the code, and just check if there are no any typos or something that totally breaks the 'closor' module. What I remember you telling me about that proccess is, that you were cherrypicking commits in that machine order to do the test. Or did I understand you wrong?

@amaltaro
Copy link
Copy Markdown
Contributor

Actually I think it needs to be in the DID creation metadata (on injection by WMAgent),

@nsmith- Nick, I did try it - I guess for block creation - and the did creation failed. Can you please check whether it's supported or not at block/container creation and let us know?

@nsmith-
Copy link
Copy Markdown

nsmith- commented Apr 21, 2020

@amaltaro it seems, unlike for rule metadata, that for DID metadata the keys are enforced to be from a predefined set. One has to first add the metadata key via commands like https://rucio.readthedocs.io/en/latest/api/meta-data.html
Since this only happens once, should we apply this to the database you are testing against?

@amaltaro
Copy link
Copy Markdown
Contributor

Yes, now I recall you had already mentioned it in the past week.
I think we could then add this group key (string with not more than...50 chars(?)) to the cmsrucio-int instance. I also think it makes sense to enable such metadata/key for both block and container data injection, but I leave it up to you decide at which level it makes more sense.

@nsmith-
Copy link
Copy Markdown

nsmith- commented Apr 21, 2020

We'll carry on this metadata discussion at dmwm/WMCore#9655

@sharad1126
Copy link
Copy Markdown
Contributor

as it looks like from the discussion above, this is still an on going PR. @amaltaro @todor-ivanov should I convert this to draft?

@todor-ivanov
Copy link
Copy Markdown
Contributor Author

@sharad1126 As Far as it concerns the Unified code, It is the whole thing. I do not think you should turn it into draft. Again, the data tiers to fall under the change here (and to be excluded from data placement) are configurable through Unified Config. You can always change them. And the question is more like - should we switch off data placement for NANOAOD now or we are about to wait?

About the discussion triggered above - it is already moved in Wmcore, as It concerns how we will distinguish the data in the Future implementation.

@sharad1126
Copy link
Copy Markdown
Contributor

@todor-ivanov does that mean we can test and then just merge this PR?

@todor-ivanov
Copy link
Copy Markdown
Contributor Author

@sharad1126 Yes. Even if we decide to wait with disabling the NANOAD immediately, the code needs to be tested and proved to work when the time comes.

@sharad1126
Copy link
Copy Markdown
Contributor

@todor-ivanov from my side it looks like it is ok to merge. I don't have any specific things to test it on but for the python related checks and all, it is all ok to merge.

Copy link
Copy Markdown
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@todor-ivanov I have made a couple of comments in the code that will make this change cleaner.
@sharad1126 no, it cannot be merged and put in production because the default configuration is already skipping NANO data.

I still have a question though. If we do not place those NANOAD datasets here, won't it be a blocker to get workflows announced?

Comment thread Unified/closor.py Outdated
Comment thread Unified/closor.py Outdated
@sharad1126
Copy link
Copy Markdown
Contributor

@amaltaro thanks for confirming. I think that sending the data to ddm and announcing workflows are still two different things and if needed, workflows can be announced without sending data to ddm. the whole point being, one needs to take care that the workflow status is changed accordingly.

@amaltaro
Copy link
Copy Markdown
Contributor

the whole point being, one needs to take care that the workflow status is changed accordingly.

That's exactly my point. I think unified checks on the output data placement status, and I'm afraid that not placing NANOAOD might cause workflows to get stuck in closed-out.
Of course, we can also keep an eye on things once this gets merged and follow up on any other changes needed.

@vlimant
Copy link
Copy Markdown
Contributor

vlimant commented Apr 27, 2020

actual location of the output is only checked for DQMIO location and possibility of running harvesting.
the location of anything else is looked at but ignored for any decision on state transition.

@todor-ivanov todor-ivanov force-pushed the feature_disable_NANOAOD branch 2 times, most recently from de5ef60 to 1573cb1 Compare April 29, 2020 17:53
@todor-ivanov
Copy link
Copy Markdown
Contributor Author

@amaltaro @sharad1126 @vlimant Here is the code with the early interruption of the cycle. Which I agree, looks much prettier .... but in this case I think the beauty will just introduce some issues. And the way I did it the previous time was exactly to avoid the behavior Alan and Sharad were discussing above - to avoid Having 'NANOAOD' workflows that due to this check and interruption of the logic start accumulating in closed-out status.

@todor-ivanov
Copy link
Copy Markdown
Contributor Author

Upon a double check I noticed which was the line I was overlooking and now I convinced myself it will work both ways. So the current version of the PR is definitely achieving the same goal as the previous one. @sharad1126 could you please test and merge it?

@sharad1126
Copy link
Copy Markdown
Contributor

sharad1126 commented May 6, 2020

@todor-ivanov @amaltaro I see no issues in the code. There is nothing I can test this code on and won't affect anything in unified's side. It won't stop doing anything which it was doing previously as you just introduced 2 new lists which checks and sends logs only. The other lists which has NANOAOD will still work as it is as the NANOAOD in those lists still exist.

@amaltaro
Copy link
Copy Markdown
Contributor

amaltaro commented May 6, 2020

Thanks @todor-ivanov , it looks good to me. Todor, can you please update the description of this PR with what exactly is being proposed here? That will clarify things to Sharad as well.

@nsmith- Nick, do we need any handshake before it can get merged?

@sharad1126
Copy link
Copy Markdown
Contributor

@amaltaro that would be great indeed. thanks.

@todor-ivanov
Copy link
Copy Markdown
Contributor Author

Done

@nsmith-
Copy link
Copy Markdown

nsmith- commented May 6, 2020

As I understand it, merging this changes production behavior. I think then we need to wait for all WMAgents to be running in Rucio-Nano mode before removing this?

…ows.

Exclude NANOAOD from Unified output dataplacement for RelVal workflows.

Exclude NANOAOD from Unified - add tier_to_skip in UnifiedConfig.

Early DataTier check and cycle interruption && put back NANOOD in the default list tiers_to_DDM

Revert privious bad membership check

Renaming tier configuration lists

Switching off Rucio for nonrelval tiers
@todor-ivanov todor-ivanov force-pushed the feature_disable_NANOAOD branch from 1573cb1 to ba6634c Compare May 15, 2020 16:49
@todor-ivanov
Copy link
Copy Markdown
Contributor Author

@sharad1126 this one must be merged before [1]. Thanks!

[1]
#561

@amaltaro
Copy link
Copy Markdown
Contributor

@nsmith- Nick, did you manage to create the NANO rucio subscriptions? I guess that's the only thing that is holding us on merging this PR, isn't it?

@nsmith-
Copy link
Copy Markdown

nsmith- commented May 15, 2020

Ok so its not clear to me yet: if this is merged then unified will stop sending NANO replication requests to Dynamo?
If so, we should wait to merge it until at least the first WMAgent that outputs to rucio is running, and preferably until all are running.

Comment thread unifiedConfiguration.json
Comment thread Unified/closor.py
@amaltaro
Copy link
Copy Markdown
Contributor

@sharad1126 let me try to clarify this PR.

It provides to configuration parameters:

  • tiers_to_rucio_relval: will contain a list of datatiers that are injected into Rucio - by the RelVal WMAgent - thus we do not want to make any (phedex) data subscription for them.
  • tiers_to_rucio_nonrelval: will contain a list of datatiers that are injected into Rucio - by production WMAgents - thus we do not want to make any (phedex) data subscription for them as well.

In case they are set to an empty list, we deal with data subscription as it's done nowadays. Otherwise, datatiers listed won't get a DDM/PhEDEx subscription.

@sharad1126
Copy link
Copy Markdown
Contributor

@amaltaro I understand what you mean. My point was to introduce the second one later but ok, I don't mind this as well. From my side, I see no issues in this PR and we are ready to merge this once you are ready to handle the relval NANOAOD output.

@sharad1126 sharad1126 added this to the phedex to rucio transition milestone May 19, 2020
@z4027163 z4027163 merged commit 32b9e92 into CMSCompOps:master Jun 2, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Exclude NANOD from Unified Output data placement

6 participants