Track how often files are accessed, and by whom #231

Open
aekiss opened this issue Feb 25, 2021 · 19 comments
Labels
🥞 database Related to the structure of the database itself 🧜🏽‍♀️ enhancement

Comments

@aekiss
Collaborator

aekiss commented Feb 25, 2021

Just a thought - for the big 0.1deg runs we save a lot of data on request but it's sometimes unclear how much of it actually gets used or by whom, so it's hard to tell whether some of it could be deleted to save space, or whether some diagnostics could be dropped for future runs.

To assist with managing our storage it could be useful to make querying.getvar log which files are actually getting accessed, e.g. by having the database store the total number of requests for each file, and the date of the most recent request. If the username is accessible, the total count and most recent access could be recorded per-user. The DB could then be queried to find big files that nobody needs anymore.

This data could also be useful for documenting the research impact of the cosima datasets, e.g. for grant applications.

@angus-g
Collaborator

angus-g commented Feb 25, 2021

I like this idea! I think maybe it would be better to have the logged accesses stored in a separate database, so we wouldn't have to worry about accidentally nuking the shared (big) database of all experiments. From a technical standpoint, I think sqlite should handle multiple transactional accesses to the same database pretty smoothly, given that the request volumes are pretty small in the grand scheme of things.

@aidanheerdegen
Collaborator

Permissions are the major issue.

@angus-g
Collaborator

angus-g commented Mar 1, 2021

The stats database could be group-writable, but I guess the worry is people fiddling with it?

@aekiss
Collaborator Author

aekiss commented Mar 1, 2021

if we want to track this per-user we could create a separate stats db for each user as needed, and just give them write permissions for their own one?

@aidanheerdegen
Collaborator

If it can be done I agree it would be great to have.

It does assume that everyone uses the cookbook to access the data. I'd like to think that is the case, but sometimes old habits die hard and some people prefer to do it the way they're used to. It would be good to find out either way.

Permissions issues can be around people fiddling, or corrupting the database in some way. Paola has tried something similar with a basic log file for CLeF queries and it has worked, but also can create issues when a new log file is created. That is probably not an issue here if the DB persists at the same path.

@aekiss
Collaborator Author

aekiss commented Mar 1, 2021

yes there's always the possibility that users will access data by some other method, but at least it will show us what data should definitely be retained...

@angus-g
Collaborator

angus-g commented Mar 1, 2021

I'll note that since we'd be logging to a sqlite DB, it would be quite robust -- transactions are atomic (via the rollback journal or write-ahead log) and file locking serialises concurrent writers, so simultaneous writes shouldn't corrupt it. But it is true that it's only useful if people are using the cookbook in the first place!

> if we want to track this per-user we could create a separate stats db for each user as needed, and just give them write permissions for their own one?

If we need to go this route, maybe /tmp-like permissions (u+w g+r) would be enough to stop people accidentally breaking things.
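
A rough sketch of the per-user idea, reading "u+w g+r" as mode 0o640 (owner read/write, group read-only). The `open_user_stats_db` helper, the directory layout and the table name are all hypothetical, not anything in the Cookbook.

```python
import os
import sqlite3
import stat


def open_user_stats_db(stats_dir, username):
    """Create (or open) one stats DB per user and restrict it to mode 0o640.

    Hypothetical helper: `stats_dir` and the `<username>.db` naming are
    assumptions made for illustration only.
    """
    path = os.path.join(stats_dir, f"{username}.db")
    conn = sqlite3.connect(path)
    # Write something so the file definitely exists before changing its mode.
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS access_log (ncfile TEXT)")
    # Owner may read and write; group may only read; others get nothing.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)
    return conn, path
```

Each user would still need write permission on the containing directory, since SQLite creates its journal file alongside the database.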

@aidanheerdegen
Collaborator

If you think it will work with sqlite @angus-g then give it a burl and see. Definitely wrap any DB access in a try/except block though, so failure to log doesn't prevent users from accessing the Cookbook.
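
A minimal sketch of that try/except guard. `log_access`, `safe_log_access`, the table layout and the DB path are hypothetical names for illustration, not existing Cookbook functions. The short `timeout` is also an assumption, chosen so a locked stats DB cannot stall the user for long. One caveat worth checking before relying on this: SQLite's documentation warns that its locking can be unreliable on some network filesystems.

```python
import getpass
import sqlite3
import time


def log_access(db_path, ncfile, variable, username=None):
    """Append one access record (hypothetical helper, not Cookbook API)."""
    if username is None:
        username = getpass.getuser()
    conn = sqlite3.connect(db_path, timeout=1.0)  # wait at most 1 s on a lock
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS access_log "
                "(username TEXT, accessed REAL, ncfile TEXT, variable TEXT)"
            )
            conn.execute(
                "INSERT INTO access_log VALUES (?, ?, ?, ?)",
                (username, time.time(), ncfile, variable),
            )
    finally:
        conn.close()


def safe_log_access(db_path, ncfile, variable, username=None):
    """Best-effort logging: a failure to log never reaches the caller."""
    try:
        log_access(db_path, ncfile, variable, username)
        return True
    except Exception:
        # Missing directory, read-only DB, lock timeout, etc. are all ignored.
        return False
```

The data-access function would call `safe_log_access(...)` and ignore the return value, so a broken or unwritable stats DB only costs the log entry.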

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/priorities-for-large-msu-experiments/123/18

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/track-how-recently-files-were-accessed-via-cookbook/391/1

@aekiss
Collaborator Author

aekiss commented Feb 3, 2023

This came up again at the COSIMA meeting yesterday. It would be a good capability to have. Maybe it's something ACCESS-NRI could help with?

@aekiss
Collaborator Author

aekiss commented May 16, 2024

Having this capability is becoming ever more pressing, so we can better manage our growing pile of data. Any thoughts on how we can have it implemented?

Ideally we'd have something that logs all data access via the Cookbook or Intake, storing something like

- file path 1
   - variable 1
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
   - variable 2
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
- file path 2
   - variable 1
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
   - variable 2
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
...
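
One possible way to hold the hierarchy above is a single flat table keyed on (file path, variable, username), with a count and a most-recent date per row. The sketch below is only an illustration: the table and column names are made up, and it assumes an SQLite build with upsert support (`ON CONFLICT ... DO UPDATE`, SQLite 3.24 or later), which lets the counter be bumped in one statement.

```python
import sqlite3

# Hypothetical schema: one row per (file, variable, user) combination.
SCHEMA = """
CREATE TABLE IF NOT EXISTS access_stats (
    ncfile      TEXT NOT NULL,
    variable    TEXT NOT NULL,
    username    TEXT NOT NULL,
    n_requests  INTEGER NOT NULL DEFAULT 1,
    last_access TEXT NOT NULL,
    PRIMARY KEY (ncfile, variable, username)
)
"""

# Insert a new row, or increment the count and refresh the date if it exists.
UPSERT = """
INSERT INTO access_stats (ncfile, variable, username, n_requests, last_access)
VALUES (?, ?, ?, 1, ?)
ON CONFLICT (ncfile, variable, username) DO UPDATE SET
    n_requests = n_requests + 1,
    last_access = excluded.last_access
"""


def record_access(conn, ncfile, variable, username, when):
    """Record one request; `when` is an ISO-format date string."""
    with conn:  # one transaction per recorded access
        conn.execute(UPSERT, (ncfile, variable, username, when))
```

Usage would be `conn.execute(SCHEMA)` once, then `record_access(conn, path, var, user, date)` on each request.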

@aidanheerdegen
Collaborator

I'd say you need less structure, otherwise you have to query the DB to find existing information, and then update the record.

So something much more like an activity log, and pull out the structured information through queries.

Then the question is what actions do you want to log, and what information from that action?
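
As a sketch of the activity-log approach: each access is appended as one flat row, and totals and most-recent dates are derived afterwards with a query. The table name, column names and helper functions are assumptions for illustration; dates are stored as ISO strings so that `MAX` picks the latest.

```python
import sqlite3


def make_log(conn):
    """Create the append-only log table (hypothetical layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS access_log "
        "(username TEXT, accessed TEXT, ncfile TEXT, variable TEXT)"
    )


def log(conn, username, accessed, ncfile, variable):
    """Append one event; nothing is looked up or updated."""
    with conn:
        conn.execute(
            "INSERT INTO access_log VALUES (?, ?, ?, ?)",
            (username, accessed, ncfile, variable),
        )


def summarise(conn):
    """Derive per-file, per-variable, per-user totals and latest access."""
    return conn.execute(
        "SELECT ncfile, variable, username, COUNT(*), MAX(accessed) "
        "FROM access_log "
        "GROUP BY ncfile, variable, username "
        "ORDER BY ncfile, variable, username"
    ).fetchall()
```

Writes stay cheap because they are pure appends; the cost moves to the summary query and to the size of the log.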

@aidanheerdegen
Collaborator

Yes this is something ACCESS-NRI wants to do, but you know, time and resources ....

@aekiss
Collaborator Author

aekiss commented Jun 5, 2024

It would also be helpful if it was possible for cookbook users to query the DB to find out who is using a given dataset. For example

  • to find out who else is working on related problems (to facilitate collaboration and reduce the chances of project duplication)
  • so a contributor of shared data can find out who is using their contribution (this could encourage data sharing by being motivating in itself and also relieving fears of having their contribution exploited without their knowledge or suitable credit)

@aekiss
Collaborator Author

aekiss commented Jun 5, 2024

> Then the question is what actions do you want to log, and what information from that action?

Log each variable requested in each .nc file accessed via calls to cc.querying.getvar, recording

  • username
  • date
  • .nc file path
  • variable

This would make enormous log files, which is why I suggested condensing it by storing only the total number of accesses by each user and the date of most recent access.

It would also be nice to be able to look up username to get real name and email address. Not sure if that's possible.
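
On Unix-like systems the account database sometimes carries a full name in the GECOS field, which Python exposes through the standard `pwd` module. Whether that field is populated on a given system is site-specific, and it does not normally include an email address, so the sketch below is only a starting point. The `real_name` helper is hypothetical.

```python
import pwd


def real_name(username):
    """Return the full-name part of the GECOS field, or None if unavailable."""
    try:
        gecos = pwd.getpwnam(username).pw_gecos
    except KeyError:
        # No such account in the local/site password database.
        return None
    # GECOS is conventionally comma-separated, with the full name first.
    return gecos.split(",")[0] or None
```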

@aekiss aekiss changed the title Track how often files are accessed? Track how often files are accessed, and by whom Jun 5, 2024
@aekiss aekiss added 🧜🏽‍♀️ enhancement 🥞 database Related to the structure of the database itself labels Jun 5, 2024
@rmholmes

rmholmes commented Jun 5, 2024

Just noting that most of the time I don't use cc.querying.getvar to access data from runs on ik11 etc. I usually just use xr.open_mfdataset on the raw files. I'm not sure how many others are in the same boat.

@aekiss
Collaborator Author

aekiss commented Jun 5, 2024

Thanks @rmholmes, good point. I suspect most usage is via the cookbook, but we don't actually know.

We can't hope to cover every access method, but any info on usage is better than none - e.g. we can be sure we shouldn't delete data if it's actively used via the cookbook, but if there are no cookbook users we'd need to ask around whether anyone is accessing data via some other method before deleting it.

@aekiss
Collaborator Author

aekiss commented Jun 13, 2024

For the purposes of cleanup, we can identify files that haven't been recently accessed using the -atime option of find, e.g. https://forum.access-hive.org.au/t/g-data-ik11-cleanup/2153/13, but it would be much more useful to know who accessed them.
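
A rough Python counterpart to the `find -atime` approach, for anyone wanting sizes alongside access times. The `not_accessed_since` function is a hypothetical helper; its cutoff is computed in seconds, so results can differ slightly from `find`, which counts whole 24-hour periods. Access times are only as reliable as the filesystem's atime settings, and like `find` this says nothing about who accessed a file.

```python
import os
import time


def not_accessed_since(root, days):
    """List (path, size_bytes, atime) for files not read in `days` days.

    Results are sorted largest first, so big stale files stand out.
    """
    cutoff = time.time() - days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            if st.st_atime < cutoff:
                stale.append((path, st.st_size, st.st_atime))
    return sorted(stale, key=lambda item: item[1], reverse=True)
```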
