
file replicas of 2GeV protodune-sp files aren't actually gone #513

Closed
StevenCTimm opened this issue Mar 3, 2024 · 9 comments
Labels
bug Something isn't working ops rucio

Comments

@StevenCTimm
Collaborator

Even though a rucio erase was done individually on every single file in this list and on the datasets above them, there is no expired_at field in the individual file metadata; the files are still out there and still have replicas.
The total size of the protodune-sp directory under /dune/persistent/staging/ hasn't really changed; it is still 33 TB, so we did not free up any space at all.

@StevenCTimm
Collaborator Author

Have rucio erased one of the files; the expired_at field shows up now.
Not sure what happened on Friday.

Will see if the file goes away tomorrow as scheduled.

[dunepro@dunesl7gpvm01 protodune-sp]$ rucio erase protodune-sp:PDSPProd4a_protoDUNE_sp_reco_stage1_p2GeV_35ms_sce_datadriven_66503427_0_20231231T235952Z.root
2024-03-03 15:58:05,782 INFO CAUTION! erase operation is irreversible after 24 hours. To cancel this operation you can run the following command:
rucio erase --undo protodune-sp:PDSPProd4a_protoDUNE_sp_reco_stage1_p2GeV_35ms_sce_datadriven_66503427_0_20231231T235952Z.root
[dunepro@dunesl7gpvm01 protodune-sp]$ rucio get-metadata protodune-sp:PDSPProd4a_protoDUNE_sp_reco_stage1_p2GeV_35ms_sce_datadriven_66503427_0_20231231T235952Z.root
access_cnt: None
accessed_at: None
account: dunepro
adler32: dee2ac0e
availability: AVAILABLE
bytes: 2059985893
campaign: None
closed_at: None
complete: None
constituent: None
created_at: 2024-01-02 19:04:12
datatype: None
deleted_at: None
did_type: FILE
eol_at: None
events: None
expired_at: 2024-03-04 21:58:05
guid: None
hidden: False
is_archive: None
is_new: False
is_open: None
length: None
lumiblocknr: None
md5: None
monotonic: False
name: PDSPProd4a_protoDUNE_sp_reco_stage1_p2GeV_35ms_sce_datadriven_66503427_0_20231231T235952Z.root
obsolete: False
panda_id: None
phys_group: None
prod_step: None
project: None
provenance: None
purge_replicas: True
run_number: None
scope: protodune-sp
stream_name: None
suppressed: False
task_id: None
transient: False
updated_at: 2024-03-03 21:58:05
version: None

@StevenCTimm
Collaborator Author

Have done another rucio erase of one of the files. The file shows expired_at; we'll see if it kicks in tomorrow as scheduled.

@StevenCTimm added the bug label Mar 5, 2024
@StevenCTimm
Collaborator Author

Martin Barisits confirms this:

Yes, I think erase doesn't work right now on files in all cases. Historically this was meant for datasets, never for files. The use case of having it work on files came up as well, but it does not seem to work properly. There is an issue about it:
rucio/rucio#5154

Issue was filed by Brandon White 2 years ago on behalf of Icarus.

@StevenCTimm
Collaborator Author

At the expired_at time, what we observe is that the expired_at flag goes away but the file doesn't.
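
One way to confirm that, for the record (a sketch using the same file as above; list-file-replicas shows whether the replica is still registered on DUNE_US_FNAL_DISK_STAGE):

rucio get-metadata protodune-sp:PDSPProd4a_protoDUNE_sp_reco_stage1_p2GeV_35ms_sce_datadriven_66503427_0_20231231T235952Z.root | grep expired_at
rucio list-file-replicas --rses DUNE_US_FNAL_DISK_STAGE protodune-sp:PDSPProd4a_protoDUNE_sp_reco_stage1_p2GeV_35ms_sce_datadriven_66503427_0_20231231T235952Z.root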

@StevenCTimm
Collaborator Author

Dimitrios says that the undertaker won't erase replicas; the reaper does that. Looking at the reaper, we see that it is able to delete large numbers of files off of DUNE_US_FNAL_DISK_STAGE, mostly the justin-logs, so there is nothing wrong with the reaper. Possibly there are locks on these files; we have asked whether there is a way to tell, short of querying the DB directly.
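
A possible way to check for leftover rules without touching the DB (a sketch with the plain rucio CLI; the rule id argument is just a placeholder for whatever the first command prints):

rucio list-rules protodune-sp:PDSPProd4a_protoDUNE_sp_reco_stage1_p2GeV_35ms_sce_datadriven_66503427_0_20231231T235952Z.root
rucio rule-info <RULE_ID>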

@bjwhite-fnal

bjwhite-fnal commented Mar 6, 2024

If the undertaker can't delete the DID, there is no reason any of the rules or replica locks would be modified. We need the undertaker to handle the removal of these DIDs properly as per rucio/rucio#5154.

@StevenCTimm
Collaborator Author

So we worked around this by using the list of retired files that we had kept, organizing them into datasets again, and making a short-lived rule; the reaper then got the files when the rule expired.
We still have about 4 datasets' worth of stuff out there under the protodune-sp hierarchy on DUNE_US_FNAL_DISK_STAGE; we should try to clean that up too because we need the space.
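
Roughly, the workaround was along these lines (a sketch; the dataset name, file names, and the one-day lifetime are illustrative, not the exact values used):

# gather the leftover files into a throwaway dataset
rucio add-dataset protodune-sp:cleanup_p2GeV_leftovers
rucio attach protodune-sp:cleanup_p2GeV_leftovers protodune-sp:FILE1.root protodune-sp:FILE2.root
# pin it to the RSE with a short lifetime (in seconds); when the rule expires
# the replicas get tombstoned and the reaper deletes them
rucio add-rule --lifetime 86400 protodune-sp:cleanup_p2GeV_leftovers 1 DUNE_US_FNAL_DISK_STAGE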

@StevenCTimm
Collaborator Author

It is now apparent that the others haven't yet been deleted because they were never part of any rule and no tombstone was set. So we are first setting tombstones for all the files that we know are part of datasets, and then we will scan the tree to get the rest. I believe these are replicas that wouldn't be findable even by doing rucio list-datasets-rse (which shows any dataset that's ever been on the RSE) and getting the replicas from all those datasets, but we'll try that too.
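
For the record, the tombstone-setting step looks roughly like this (a sketch; the dataset and file names are placeholders, and set-tombstone just marks the replica so the reaper picks it up on its next pass):

# enumerate every dataset the RSE has ever held, then list its files
rucio list-datasets-rse DUNE_US_FNAL_DISK_STAGE
rucio list-files protodune-sp:SOME_DATASET
# mark a leftover replica for deletion by the reaper
rucio-admin replicas set-tombstone --rse DUNE_US_FNAL_DISK_STAGE protodune-sp:SOME_FILE.root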

@StevenCTimm
Collaborator Author

These are cleaned now.
