Implement file dataset=/a/b/c site=XXX run=123 query using Rucio APIs #30

Closed
vkuznet opened this issue Jan 18, 2021 · 5 comments

@vkuznet
Collaborator

vkuznet commented Jan 18, 2021

Originally the support for

file dataset=/a/b/c site=XXX run=123

query was implemented through the DBS and PhEDEx APIs. First, we resolved the list of blocks for a given dataset. Then we found the files for that set of blocks and the run number, and finally we filtered the files using the PhEDEx fileReplicas API to select those on a given site.

Now we need to implement the same logic using the DBS and Rucio APIs. The question is whether Rucio has an API similar to fileReplicas that selects files only for a given site, or whether we should find another route in Rucio to accommodate this workflow.

@ericvaandering could you please comment on this?

@ericvaandering
Member

Almost. I think what you want to do is what this does: https://github.com/rucio/rucio/blob/0246888ceeb8cc12387c6aaffd398921b31da10e/lib/rucio/client/replicaclient.py#L117

You can pass either a container or a block and get all the file replicas; if you also pass an RSE, it will return only the data at that RSE.

Then you probably need to filter what Rucio returns down to the files that match the run in your example. Of course, you could query file by file or provide a list of files, but that may be less efficient or involve transferring more data.

The code shows you how to build the REST query.
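
The filtering step described above can be sketched in plain shell as an intersection of two lists: the files DBS reports for the run, and the file names Rucio reports at the RSE. This is only an illustration under stated assumptions: the file names are made-up placeholders, and `files_on_site_for_run` is a hypothetical helper, not part of DAS or Rucio.

```shell
#!/bin/bash
# Hypothetical sample data: placeholder names, not real CMS files.
# Files that DBS would report for the requested run (one per line).
dbs_files="/store/data/a.root
/store/data/b.root
/store/data/c.root"

# File names that Rucio would report as replicas at the requested RSE.
rucio_files="/store/data/b.root
/store/data/c.root
/store/data/d.root"

# Hypothetical helper: print the names from list $1 that also occur in list $2.
# grep -F treats patterns as fixed strings, -x requires whole-line matches,
# and -f reads the patterns from a file.
files_on_site_for_run() {
    tmp=$(mktemp) || return 1
    printf '%s\n' "$2" > "$tmp"
    printf '%s\n' "$1" | grep -Fx -f "$tmp"
    status=$?
    rm -f "$tmp"
    return $status
}

files_on_site_for_run "$dbs_files" "$rucio_files"
```

The result keeps the order of the first list, so the DBS ordering is preserved.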

@vkuznet
Collaborator Author

vkuznet commented Jan 20, 2021

Eric, I still need your assistance with this, as I'm getting different errors from the Rucio server. If I read the replicaclient.py code you pointed to correctly, it translates to the following plain curl call:

#!/bin/bash
opt="-s -L -k --key $HOME/.globus/userkey.pem --cert $HOME/.globus/usercert.pem"
token=`curl $opt -v https://cms-rucio-auth.cern.ch/auth/x509 2>&1 | grep "X-Rucio-Auth-Token:" | sed -e "s,< X-Rucio-Auth-Token: ,,g"`
echo "$token"
dataset=/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO
curl $opt -H "X-Rucio-Auth-Token: $token" -X POST -d '{"dids": ["scope":"cms", "name":"$dataset"], "domain": "all"}' "http://cms-rucio.cern.ch/replicas/cms/list"

Here I tried two URLs. The first, http://cms-rucio.cern.ch/replicas/cms/list, returns an internal server error, but I'm not sure whether /cms should be part of the URL, since that does not look like the case in the replicaclient.py code. So I tried without it, i.e. http://cms-rucio.cern.ch/replicas/list, which gives me a different error: {"ExceptionMessage": "Cannot decode json parameter list", "ExceptionClass": "ValueError"}.

As you know, I really need a plain URL example in order to proceed with this request. Please guide me as necessary.
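
Independently of the URL question, two things in the request body above stand out: the entry in `dids` is not wrapped in braces, so the body is not valid JSON, and `$dataset` sits inside single quotes, so the shell passes it literally instead of expanding it. Either could plausibly explain the JSON decode error. A minimal sketch of one way to build the body, where `build_payload` is a hypothetical helper and not part of Rucio or DAS:

```shell
#!/bin/bash
# Hypothetical helper: build the POST body for the replicas list call.
# Assumes the DID name contains no characters that need JSON escaping
# (double quotes, backslashes), which holds for the dataset name used here.
build_payload() {
    printf '{"dids": [{"scope": "cms", "name": "%s"}], "domain": "all"}' "$1"
}

dataset=/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO
payload=$(build_payload "$dataset")
echo "$payload"
```

The resulting string can then be handed to curl with `-d "$payload"` (double quotes, so the variable is expanded).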

@ericvaandering
Member

ericvaandering commented Jan 20, 2021 via email

@vkuznet
Collaborator Author

vkuznet commented Jan 20, 2021

Eric, thanks for spotting the JSON problem. I managed to get the output with the following sequence of steps:

#!/bin/bash
opt="-s -L -k --key $HOME/.globus/userkey.pem --cert $HOME/.globus/usercert.pem"
token=`curl $opt -v https://cms-rucio-auth.cern.ch/auth/x509 2>&1 | grep "X-Rucio-Auth-Token:" | sed -e "s,< X-Rucio-Auth-Token: ,,g"`
echo "$token"
dataset=/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO
curl $opt -H "X-Rucio-Auth-Token: $token" -X POST -d '{"dids": [{"scope":"cms", "name":"/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO"}], "domain": "all", "rse_expression": "T2_DE_DESY"}' "http://cms-rucio.cern.ch/replicas/list"

The output looks like this now:

{"adler32": "df6675e0", "name": "/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/270001/BF52B44F-51A0-3248-B13A-9052DF7B03CA.root", "rses": {"T2_DE_DESY": []}, "bytes": 3736739040, "states": {"T2_DE_DESY": "AVAILABLE"}, "pfns": {}, "scope": "cms", "md5": null}
{"adler32": "07531d4b", "name": "/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/270001/BFBAA739-795D-AF49-ACFB-1B53033E7121.root", "rses": {"T2_DE_DESY": []}, "bytes": 3755039930, "states": {"T2_DE_DESY": "AVAILABLE"}, "pfns": {}, "scope": "cms", "md5": null}
...

which I hope will be sufficient for this use case. I'll proceed with implementing the necessary bits in the DAS codebase.
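
Since the records above arrive as one JSON object per line, the file names can be pulled out with standard text tools. The sketch below uses the two records shown above as sample input; `available_files` is a hypothetical helper, and it relies on the exact `"key": "value"` spacing of this output, so a JSON-aware parser would be the more robust choice in real code.

```shell
#!/bin/bash
# Sample input: the two records copied from the output above.
output='{"adler32": "df6675e0", "name": "/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/270001/BF52B44F-51A0-3248-B13A-9052DF7B03CA.root", "rses": {"T2_DE_DESY": []}, "bytes": 3736739040, "states": {"T2_DE_DESY": "AVAILABLE"}, "pfns": {}, "scope": "cms", "md5": null}
{"adler32": "07531d4b", "name": "/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/270001/BFBAA739-795D-AF49-ACFB-1B53033E7121.root", "rses": {"T2_DE_DESY": []}, "bytes": 3755039930, "states": {"T2_DE_DESY": "AVAILABLE"}, "pfns": {}, "scope": "cms", "md5": null}'

# Hypothetical helper: read records on stdin and print the "name" field of
# every record whose state at the given site ($1) is AVAILABLE.
available_files() {
    site=$1
    grep "\"$site\": \"AVAILABLE\"" | sed -n 's/.*"name": "\([^"]*\)".*/\1/p'
}

printf '%s\n' "$output" | available_files T2_DE_DESY
```

The printed names are plain logical file names, which is the form needed to compare against the file list DBS returns for a run.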

@vkuznet
Collaborator Author

vkuznet commented Jan 21, 2021

Done. The release on cmsweb has been upgraded, and the new dasgoclient PR is here: cms-sw/cmsdist#6584

If you need a binary version of dasgoclient before it is updated on cvmfs, please take it from here:
/afs/cern.ch/user/v/valya/public/dasgoclient/dasgoclient

The new version is

Build: git=v02.04.23 go=go1.15.6 date=2021-01-21 21:15:20.46625747 +0100 CET m=+0.006210747

and your query looks like this:

./dasgoclient -query="file dataset=/JetHT/Run2018A-TkAlMinBias-12Nov2019_UL2018-v2/ALCARECO site=T2_DE_DESY run=316723"
/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/280000/25B4C3D5-03C1-F24E-9D35-E08860CBC145.root
/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/280000/4E29E31D-AA0E-8744-B558-98B35D8320E3.root
/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/280000/BAE93AF7-30F2-FC49-95FF-E584E4BE6773.root
/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/280000/FAF43300-D19D-E24E-9175-B800DBD5083C.root
/store/data/Run2018A/JetHT/ALCARECO/TkAlMinBias-12Nov2019_UL2018-v2/70001/67CF1160-478F-5E4E-9F7D-57E8E09C1E25.root

Closing the issue.

@vkuznet vkuznet closed this as completed Jan 21, 2021