Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TaskPublish tries to insert illegal dataset #7359

Closed
belforte opened this issue Aug 4, 2022 · 21 comments
Closed

TaskPublish tries to insert illegal dataset #7359

belforte opened this issue Aug 4, 2022 · 21 comments

Comments

@belforte
Copy link
Member

belforte commented Aug 4, 2022

in this task
https://cmsweb.cern.ch/crabserver/ui/task/220801_190609%3Aapresyan_crab_VLLPair_VLLToTauS_MVLL1000_MS10_ctau1000_batch2_v1

CRAB eventually construct this output dataset name

/VLLPair_VLLToTauS_MVLL1000_MS10_ctau1000/lpcdihiggsboost-NanoTuples-V2p0_lpclonglived-crab_PrivateProduction_Fall18_DR_step3_VLLPair_VLLToTauS_MVLL1000_MS10_ctau1000_batch3_v1-3ee3afd6b5a1410aea6d0b4d52723d06-cd471944433cef30a1e69a7cb38aa7e8/USER

which is illegal (second piece between / is 200-char long instead of a max of 199)
so when TaskPublish tries to use it, it fails immediately at

existingDBSFiles = destReadApi.listFiles(dataset=dataset, detail=True)

when trying to check what was published already (if any) becasue DBS server rejects the dataset name

checking the dataset name above with Lexicon.dataset returns

AssertionError: '/VLLPair_VLLToTauS_MVLL1000_MS10_ctau1000/lpcdihiggsboost-NanoTuples-V2p0_lpclonglived-crab_PrivateProduction_Fall18_DR_step3_VLLPair_VLLToTauS_MVLL1000_MS10_ctau1000_batch3_v1-3ee3afd6b5a1410aea6d0b4d52723d06-cd471944433cef30a1e69a7cb38aa7e8/USER' does not match regular expression ^/[a-zA-Z0-9-]{1,99}/[a-zA-Z0-9.-]{1,199}/[A-Z-]{1,50}$

@belforte
Copy link
Member Author

belforte commented Aug 4, 2022

so the question is how

tm_publish_name = NanoTuples-V2p0_lpclonglived-crab_PrivateProduction_Fall18_DR_step3_VLLPair_VLLToTauS_MVLL1000_MS10_ctau1000_batch3_v1-3ee3afd6b5a1410aea6d0b4d52723d06-00000000000000000000000000000000

tm_output_lfn = /store/group/lpcdihiggsboost/NanoTuples/V2p0/MC_Autumn18/v1/

passed through the checks in

def _checkPublishDataName(self, kwargs, outlfn, requestname, username):

given that

i.e. groupname=lpcdihiggsboost = 15 charsandpublishname= 184 chars` so this line

outputDatasetTagToCheck = "%s-%s" % (group_user_prefix, outputDatasetTagToCheck)

should produce a string of length 15+184+1=200
and Lexicon.procdataset has a limit at 199

@belforte
Copy link
Member Author

belforte commented Aug 4, 2022

FWIW I verified that removing one char from the publishname, the listDatasets DBS query returns the expected empty list [] w/o any exception.

@belforte
Copy link
Member Author

belforte commented Aug 4, 2022

first finding. In that task submission, when RESTUserWorkflow is called, the argument
kwargs['publishgroupname'] is set to 0 (False), so the check is done using username in place of groupname and since groupname=lpcdihiggsboost is 15 char, while username=apresyan is 8 char, the check is OK.

This actually OK and to be expected since user did not set publishWithGroupName in the config. So output dataset name must indeed contain the username, not the groupname.

So why do we end up with groupname in the outputdataset name ?

outdataset is set in PostJob and stored in FMD

I checked FMD for this task at
https://cmsweb.cern.ch/crabserver/prod/filemetadata?taskname=220801_190609:apresyan_crab_VLLPair_VLLToTauS_MVLL1000_MS10_ctau1000_batch2_v1&filetype=EDM
and indeed it contains

\"outdataset\": \"/VLLPair_VLLToTauS_MVLL1000_MS10_ctau1000/lpcdihiggsboost-NanoTuples-V2p0_lpclonglived-crab_PrivateProduction_Fall18_DR_step3_VLLPair_VLLToTauS_MVLL1000_MS10_ctau1000_batch3_v1-3ee3afd6b5a1410aea6d0b4d52723d06-cd471944433cef30a1e69a7cb38aa7e8/USER\"

@belforte
Copy link
Member Author

belforte commented Aug 4, 2022

Problem is likely in PostJob and likely there since a long time, if not ever :-(

@belforte
Copy link
Member Author

belforte commented Aug 4, 2022

problem started with this commit 86cc6a5

So it is 5 months that we always put groupname in outputdataset name even if the user did not ask for it. It looks like this new naming is not bothering anybody and maybe at this point is better to live with this and simply remove the flexibility. One parameter less to worry about.

actions:

  • make sure that I undertand what exactly the previous code was doing. Do not want to miss things again.
  • if all OK, change REST to sort of enforce publishWithGroupName=True
  • remove this paramater from Client and from documentation

@belforte
Copy link
Member Author

belforte commented Aug 4, 2022

looks like in that commit lines like these

if 'CRAB_PublishGroupName' in self.job_ad and self.job_ad['CRAB_PublishGroupName']:
if file_info['outlfn'].startswith('/store/group/') and file_info['outlfn'].split('/')[3]:
group_user_prefix = file_info['outlfn'].split('/')[3]
outdataset = os.path.join('/' + primary_dataset, group_user_prefix + '-' + publishname, 'USER')
output_datasets.add(outdataset)

became like
# group name overrides username when present:
username = self.job_ad['CRAB_UserHN']
if self.dest_dir.startswith('/store/group/') and self.dest_dir.split('/')[3]:
username = self.dest_dir.split('/')[3]
module_label = file_info.get('module_label') if multiple_edm else None
outdataset = compute_outputdataset_name(primaryDS=self.job_ad['CRAB_PrimaryDataset'],
username=username,
publish_name=self.publish_name,
pset_hash=file_info['pset_hash'],
module_label=module_label)

i.e. I did not port over the check on self.job_ad['CRAB_PublishGroupName']

@belforte
Copy link
Member Author

belforte commented Aug 4, 2022

so the question is wheter to do 1. or 2.

  1. restore behaviour like until March ?
  2. keep current one make check at submit time consistent so that it prevents illegal file names to seep through

@mapellidario @novicecpp @dciangot @KatyEllis any advice ? (I like to share blame for difficult decisions !)

I am leaning for 2. as indicated above in #7359 (comment)

maybe we can talk about this next Tuesday, hopefully the failure which made me look is a rare one.

@dmwm dmwm deleted a comment from mapellidario Aug 4, 2022
@KatyEllis
Copy link

I'd say prevent the illegal names seeping through, but I may not have understood the full consequences.

@belforte
Copy link
Member Author

belforte commented Aug 5, 2022

let's recap. In the following you can assume PrimaryDataSet to be some string defined by CRAB specific ot this task.

Behavior 1: no matter where user sends the output, data is published in a dataset named like
/<PrimaryDataSet>/username-<userdefinetag>-<some-hash-added-by-CRAB>/USER
unless user does both these things:
1 send output files to /store/group/<groupname>/.....
2 sets in crabConfig. file the flag publishWithGroupName=True
in which case dataset name becomes
/<PrimaryDataSet>/groupname-<userdefinetag>-<some-hash-added-by-CRAB>/USER

Behavior 2: the configuration flag publishWithGroupName becomes irrelevant and whenever output files are sent to /store/group/<groupname>/....., output data is published in a dataset named
/<PrimaryDataSet>/groupname-<userdefinetag>-<some-hash-added-by-CRAB>/USER

WHere problem arise:
Behavior 1 was working until version 3.220315 . At that point the code which creates the dataset name for publication was changed to be the same as in 2, while name checking was left as in 1. Users did not notice/complain that their datsets started being named after the group rather than themselves. But the dataset nema check at submission time is not coherent anymore with the actual naming. Hence an illegal dataset name can sneak through.

There are two possible solutions:
S1: properly restore behavior 1, needs a small change to PostJob code. Will suddenly change dataset naming for users (again)
S2: properly code behavior 2: needs a small change to CRAB Rest, will be transparent to users. We can then deprecate the publishWithGroupName parameter, cleanup all code and documentation which refers to it and overall simplify things.

I guess one could check in Task table of CRAB Oracel DB to find out how much that parameter was used in the past. I suspect one of those features added on "one user whim" and which people can live without.

Hope I made things more clear. I agree that it is not easy to pick, but I do not want to have a "referendum" among users either.

@novicecpp
Copy link
Contributor

novicecpp commented Aug 9, 2022

I guess one could check in Task table of CRAB Oracel DB to find out how much that parameter was used in the past. I suspect one of those features added on "one user whim" and which people can live without.

SELECT COUNT(*) FROM TASKS t WHERE TM_PUBLISH_GROUPNAME != 'F' AND TM_START_TIME > TO_TIMESTAMP('2021-01-01', 'YYYY-MM-DD')
Only 47 tasks from 2021-01-01 to now.
7 tasks from 2022-01-01 to now.
0 tasks from 2022-03-25 to now.

@belforte
Copy link
Member Author

belforte commented Aug 9, 2022

so conclusion is that currently we publish as /primaryds/groupname-... even if users did not ask for it (since so few did) and nobody noticed/complained of this change. Interesting data point !

@belforte
Copy link
Member Author

belforte commented Aug 13, 2022

we reached consensus on sticking with Behavior 2. I.e. keep current dataset names and

  • remove publishWithGroupName from Client : deprecate publishWithGroupName CRABClient#5173
  • fix REST to do the proper check and remove publishWithGroupName from it at the same time need to
    • remove tm_publish_groupname from DB methods in Task.py which write it (would get ORA-01008: not all variables bound)
  • remove publishWithGroupName from documentation
  • remove tm_publish_groupname from DB methods Create.py and methods in Task.py which may try to read it
  • remove tm_publish_groupname column from database Task table

belforte added a commit to belforte/CRABServer that referenced this issue Sep 30, 2022
belforte added a commit to belforte/CRABServer that referenced this issue Sep 30, 2022
@belforte
Copy link
Member Author

hmmm.. need to remove extra option from client first, or server will complain about extra keyword.
In similar cases in the past we tried to make server compatible with both ways etc. etc., but it is a pain.
Since it is not urgent I will explore changing client first and rearrange order of boxes above.

belforte added a commit to belforte/CRABClient that referenced this issue Oct 3, 2022
belforte added a commit to dmwm/CRABClient that referenced this issue Oct 3, 2022
* deprecate publishWithGroupName as per dmwm/CRABServer#7359

* pylint

* pylint
@belforte
Copy link
Member Author

belforte commented Oct 3, 2022

CRABClient updated (tag v3.221004 ) and on its way to IB crab-dev cms-sw/cmsdist#8118

@belforte
Copy link
Member Author

belforte commented Oct 4, 2022

REST and TW changes are now in my branch and I checked that things works using new client from GH.
Will wait until new client is in production as crab-prod before making a PR.

@belforte
Copy link
Member Author

one bit was still missing in REST side. I updated #7359 (comment)
and testing now with Jenkins on test2 with crab-dev client using code from https://github.com/belforte/CRABServer/tree/deprecate-publishgroupname on both cmsweb-test2 and crab-dev-tw01

@belforte
Copy link
Member Author

belforte commented Oct 11, 2022

all good my branch is validated and ready to be merged as soon as CRAB Client v3.221004 is in production: #7411
on HOLD until then

@belforte
Copy link
Member Author

crab client update requested in cms-sw/cmsdist#8159

@belforte
Copy link
Member Author

belforte commented Nov 14, 2022

CRABClient is now updated:

belforte@lxplus765/TC3> which crab
/cvmfs/cms.cern.ch/common/crab
belforte@lxplus765/TC3> crab --version
CRAB client v3.221018

time to finalize

belforte added a commit that referenced this issue Nov 14, 2022
* stop using publishgroupname in REST. For #7359

* do not write tm_publish_groupname in DB when inserting
@belforte
Copy link
Member Author

tagged in v3.221114

@belforte
Copy link
Member Author

v3.221114 deployed in preprod and validated https://github.com/dmwm/CRABServer/releases/tag/v3.221114
Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants