Track how often files are accessed, and by whom #231

aekiss · 2021-02-25T03:12:58Z

Just a thought - for the big 0.1deg runs we save a lot of data on request, but it's sometimes unclear how much of it actually gets used or by whom, so it's hard to tell whether some of it could be deleted to save space, or whether some diagnostics could be dropped for future runs.

To assist with managing our storage it could be useful to make querying.getvar log which files are actually getting accessed, e.g. by having the database store the total number of requests for each file, and the date of the most recent request. If the username is accessible, the total count and most recent access could be recorded per-user. The DB could then be queried to find big files that nobody needs anymore.

This data could also be useful for documenting the research impact of the cosima datasets, e.g. for grant applications.


angus-g · 2021-02-25T22:28:24Z

I like this idea! I think maybe it would be better to have the logged accesses stored in a separate database, so we wouldn't have to worry about accidentally nuking the shared (big) database of all experiments. From a technical standpoint, I think sqlite should handle multiple transactional accesses to the same database pretty smoothly, given that the request volumes are pretty small in the grand scheme of things.

aidanheerdegen · 2021-02-25T22:38:54Z

Permissions is the major issue.

angus-g · 2021-03-01T01:42:51Z

The stats database could be group-writable, but I guess the worry is people fiddling with it?

aekiss · 2021-03-01T04:57:35Z

if we want to track this per-user we could create a separate stats db for each user as needed, and just give them write permissions for their own one?

aidanheerdegen · 2021-03-01T06:12:53Z

If it can be done I agree it would be great to have.

It does assume that everyone uses the cookbook to access the data. I'd like to think that is the case, but sometimes old habits die hard and some people prefer to do it the way they're used to. It would be good to know whether that is the case.

Permissions issues can be around people fiddling, or corrupting the database in some way. Paola has tried something similar with a basic log file for CLeF queries and it has worked, but also can create issues when a new log file is created. That is probably not an issue here if the DB persists at the same path.

aekiss · 2021-03-01T06:21:26Z

yes there's always the possibility that users will access data by some other method, but at least it will show us what data should definitely be retained...

angus-g · 2021-03-01T23:30:15Z

I'll note that since we'd be logging to a sqlite DB, it would be quite robust -- write-ahead transactions and such make it atomically handle concurrent writes without corruption, etc. But it is true that it's only useful if people are using the cookbook in the first place!

> if we want to track this per-user we could create a separate stats db for each user as needed, and just give them write permissions for their own one?

If we need to go this route, maybe /tmp-like permissions (u+w g+r) would be enough to stop people accidentally breaking things.

aidanheerdegen · 2021-03-01T23:59:54Z

If you think it will work with sqlite @angus-g then give it a burl and see. Definitely wrap any DB access in a try/except block though, so failure to log doesn't prevent users from accessing the Cookbook.
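A minimal sketch of what that try/except wrapping might look like, assuming a separate sqlite stats database as discussed above. The function name `log_access`, the table name `access_log` and its columns are all hypothetical and not part of the Cookbook; the point is only that any failure to log is downgraded to a warning.

```python
import getpass
import sqlite3
import warnings
from datetime import datetime, timezone


def log_access(stats_db, ncfile, variable, username=None):
    """Best-effort logging (hypothetical helper): record one access, never raise."""
    try:
        user = username or getpass.getuser()
        now = datetime.now(timezone.utc).isoformat()
        # Short timeout so a locked stats DB cannot stall the caller for long
        conn = sqlite3.connect(stats_db, timeout=1.0)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS access_log "
                    "(username TEXT, accessed_at TEXT, ncfile TEXT, variable TEXT)"
                )
                conn.execute(
                    "INSERT INTO access_log VALUES (?, ?, ?, ?)",
                    (user, now, ncfile, variable),
                )
        finally:
            conn.close()
        return True
    except Exception as exc:
        # Logging is optional: report the problem and let data access continue
        warnings.warn(f"access logging skipped: {exc}")
        return False
```

A data-access function could call `log_access(...)` just before returning its result and ignore the return value, so a missing, locked or read-only stats DB only produces a warning.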

access-hive-bot · 2022-11-16T22:31:41Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/priorities-for-large-msu-experiments/123/18

access-hive-bot · 2023-02-03T01:29:32Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/track-how-recently-files-were-accessed-via-cookbook/391/1

aekiss · 2023-02-03T01:29:45Z

This came up again at the COSIMA meeting yesterday. It would be a good capability to have. Maybe it's something ACCESS-NRI could help with?

aekiss · 2024-05-16T05:19:21Z

Having this capability is becoming ever more pressing, so we can better manage our growing pile of data. Any thoughts on how we can have it implemented?

Ideally we'd have something that logs all data access via the Cookbook or Intake, storing something like

- file path 1
   - variable 1
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
   - variable 2
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
- file path 2
   - variable 1
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
   - variable 2
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
...

aidanheerdegen · 2024-05-16T06:28:40Z

I'd say you need less structure, otherwise you have to query the DB to find existing information, and then update the record.

So something much more like an activity log, and pull out the structured information through queries.

Then the question is what actions do you want to log, and what information from that action?
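To illustrate the activity-log idea, here is a rough sketch in which each access is appended as one flat row, and the per-file, per-variable, per-user totals and most recent access dates are derived by a query rather than stored. The table name, column names, usernames and file paths are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory for the example; a real log would be a file
conn.execute(
    "CREATE TABLE access_log (username TEXT, accessed_at TEXT, ncfile TEXT, variable TEXT)"
)

# Append-only: logging an access is a single INSERT, with no lookup or update
rows = [
    ("alice", "2024-05-01T00:00:00", "/data/run1/ocean.nc", "temp"),
    ("alice", "2024-05-10T00:00:00", "/data/run1/ocean.nc", "temp"),
    ("bob", "2024-04-20T00:00:00", "/data/run1/ocean.nc", "salt"),
]
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?)", rows)

# The structured view (count and latest access per file/variable/user) is a query
summary = conn.execute(
    """
    SELECT ncfile, variable, username,
           COUNT(*) AS n_requests,
           MAX(accessed_at) AS last_access
    FROM access_log
    GROUP BY ncfile, variable, username
    ORDER BY ncfile, variable, username
    """
).fetchall()

for row in summary:
    print(row)
```

Because ISO 8601 timestamps sort lexicographically, `MAX(accessed_at)` on a text column gives the most recent access. The same table could also answer "who is using this file?" with a `WHERE ncfile = ?` filter.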

aidanheerdegen · 2024-05-16T06:29:13Z

Yes this is something ACCESS-NRI wants to do, but you know, time and resources ....

aekiss · 2024-06-05T22:16:09Z

It would also be helpful if it was possible for cookbook users to query the DB to find out who is using a given dataset. For example

- to find out who else is working on related problems (to facilitate collaboration and reduce the chances of project duplication)
- so a contributor of shared data can find out who is using their contribution (this could encourage data sharing by being motivating in itself and also relieving fears of having their contribution exploited without their knowledge or suitable credit)

aekiss · 2024-06-05T23:01:41Z

> Then the question is what actions do you want to log, and what information from that action?

Log each variable requested in each .nc file accessed via calls to cc.querying.getvar, recording

- username
- date
- .nc file path
- variable

This would make enormous log files, which is why I suggested condensing it by storing only the total number of accesses by each user and the date of most recent access.

It would also be nice to be able to look up username to get real name and email address. Not sure if that's possible.
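On the name lookup: on Unix systems the standard-library `pwd` module can read the GECOS field of the account database, which conventionally holds the user's full name. Whether that field is populated on any given system is an assumption, and email addresses are not part of the passwd record, so this hypothetical `real_name` helper is only a partial answer.

```python
import pwd


def real_name(username):
    """Return the full-name part of the GECOS field, or None if unavailable."""
    try:
        gecos = pwd.getpwnam(username).pw_gecos
    except KeyError:
        return None  # no such account on this system
    # GECOS is conventionally comma-separated, with the full name first
    return gecos.split(",")[0] or None
```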

rmholmes · 2024-06-05T23:20:23Z

Just noting that most of the time I don't use cc.querying.getvar to access data from runs on ik11 etc. I usually just use xr.open_mfdataset on the raw files. I'm not sure how many others are in the same boat.

aekiss · 2024-06-05T23:40:22Z

Thanks @rmholmes, good point. I suspect most usage is via the cookbook, but we don't actually know.

We can't hope to cover every access method, but any info on usage is better than none - e.g. we can be sure we shouldn't delete data if it's actively used via the cookbook, but if there are no cookbook users we'd need to ask around whether anyone is accessing data via some other method before deleting it.

aekiss · 2024-06-13T05:17:53Z

For the purposes of cleanup, we can identify files that haven't been recently accessed using the -atime option of find, e.g. https://forum.access-hive.org.au/t/g-data-ik11-cleanup/2153/13, but it would be much more useful to know who accessed them.
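For reference, a rough Python equivalent of the `find -atime` approach, which lists files whose last access time is older than a given number of days. The function name `stale_files` is hypothetical. Like `find`, it relies on the filesystem actually recording access times, which depends on mount options, and it says nothing about who accessed the file.

```python
import os
import time


def stale_files(root, days):
    """Yield (path, last_access_epoch) for files under root not accessed in `days` days."""
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                yield path, atime
```

For example, `list(stale_files(some_directory, 365))` would give candidates that have not been read for roughly a year.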

aekiss changed the title ~~Track how often files are accessed?~~ Track how often files are accessed, and by whom Jun 5, 2024

aekiss added the 🧜🏽‍♀️ enhancement and 🥞 database (Related to the structure of the database itself) labels Jun 5, 2024
