Track how often files are accessed, and by whom #231
Comments
I like this idea! I think it would be better to store the logged accesses in a separate database, so we wouldn't have to worry about accidentally nuking the shared (big) database of all experiments. From a technical standpoint, I think sqlite should handle multiple transactional accesses to the same database pretty smoothly, given that the request volumes are pretty small in the grand scheme of things.
Permissions is the major issue.
The stats database could be group-writable, but I guess the worry is people fiddling with it?
If we want to track this per-user we could create a separate stats db for each user as needed, and just give them write permissions for their own one?
If it can be done I agree it would be great to have. It does assume that everyone uses the cookbook to access the data. I'd like to think that was the case, but sometimes old habits die hard and some people prefer to do it the way they're used to. It would be good to know if that wasn't the case. Permissions issues can be around people fiddling, or corrupting the database in some way. Paola has tried something similar with a basic log file for CLeF queries and it has worked, but it can also create issues when a new log file is created. That is probably not an issue here if the DB persists at the same path.
Yes, there's always the possibility that users will access data by some other method, but at least it will show us what data should definitely be retained...
I'll note that since we'd be logging to a sqlite DB, it would be quite robust -- write-ahead transactions and such make it atomically handle concurrent writes without corruption, etc. But it is true that it's only useful if people are using the cookbook in the first place!
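For illustration, here is a minimal sketch of the separate stats database idea. The `log_access` helper, the `access_log` table and the use of the `USER` environment variable are all assumptions made for this example, not part of the cookbook. It relies on a connection timeout so that a writer waits for the lock rather than failing, and it treats logging as best-effort so a broken or read-only stats database never blocks data access.

```python
import os
import sqlite3
import time


def log_access(stats_db, ncfile, variable, username=None):
    """Append one access record. Returns True on success, False on failure.

    Hypothetical helper: the name, table and columns are illustrative only.
    """
    if username is None:
        # Assumption: the USER environment variable identifies the caller.
        username = os.environ.get("USER", "unknown")
    try:
        # timeout: wait up to 5 s if another process holds the write lock.
        conn = sqlite3.connect(stats_db, timeout=5.0)
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS access_log ("
                    "username TEXT, ncfile TEXT, variable TEXT, accessed_at REAL)"
                )
                conn.execute(
                    "INSERT INTO access_log VALUES (?, ?, ?, ?)",
                    (username, ncfile, variable, time.time()),
                )
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        # Logging is best-effort: never let it break the actual data access.
        return False
```

The sketch leaves the journal mode at its default. SQLite's write-ahead log mode requires all processes to be on the same host, so it may not suit a database that lives on a shared network filesystem.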
If we need to go this route, maybe
If you think it will work with sqlite @angus-g then give it a burl and see. Definitely wrap any DB access in a
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/priorities-for-large-msu-experiments/123/18
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/track-how-recently-files-were-accessed-via-cookbook/391/1
This came up again at the COSIMA meeting yesterday. It would be a good capability to have. Maybe it's something ACCESS-NRI could help with?
Having this capability is becoming ever more pressing, so we can better manage our growing pile of data. Any thoughts on how we can have it implemented? Ideally we'd have something that logs all data access via the Cookbook or Intake, storing something like
I'd say you need less structure, otherwise you have to query the DB to find existing information, and then update the record. So something much more like an activity log, with the structured information pulled out through queries. Then the question is what actions do you want to log, and what information from that action?
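A minimal sketch of that activity-log approach, assuming a hypothetical `access_log` table with one row per access and ISO-format date strings (which sort chronologically as text). Writes are plain appends, and the per-file, per-user totals are derived at query time rather than stored. The table, column names and sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_log ("
    "username TEXT, ncfile TEXT, variable TEXT, accessed_at TEXT)"
)

# Append-only writes: no lookup or update of existing rows is needed.
events = [
    ("alice", "ocean.nc", "temp", "2023-01-05"),
    ("alice", "ocean.nc", "salt", "2023-03-10"),
    ("bob", "ocean.nc", "temp", "2023-02-01"),
    ("bob", "ice.nc", "aice", "2022-11-20"),
]
with conn:
    conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?)", events)


def summarise(conn):
    """Derive access count and most recent access per (file, user)."""
    return conn.execute(
        "SELECT ncfile, username, COUNT(*), MAX(accessed_at) "
        "FROM access_log GROUP BY ncfile, username "
        "ORDER BY ncfile, username"
    ).fetchall()


summary = summarise(conn)
for row in summary:
    print(row)
```

The trade-off is that the log grows with every access, so the summary query does more work as the table gets larger.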
Yes this is something ACCESS-NRI wants to do, but you know, time and resources ....
It would also be helpful if it was possible for cookbook users to query the DB to find out who is using a given dataset. For example
Log each variable requested in each .nc file accessed via calls to
This would make enormous log files, which is why I suggested condensing it by storing only the total number of accesses by each user and the date of most recent access. It would also be nice to be able to look up username to get real name and email address. Not sure if that's possible.
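The condensed alternative could look something like the sketch below, which keeps a single row per user and file and bumps a counter on each access. The `access_summary` table and `record_access` helper are hypothetical names chosen for this example. It uses SQLite's `ON CONFLICT ... DO UPDATE` upsert clause, which needs SQLite 3.24 or newer, and assumes ISO-format date strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_summary ("
    "username TEXT, ncfile TEXT, n_accesses INTEGER, last_access TEXT, "
    "PRIMARY KEY (username, ncfile))"
)


def record_access(conn, username, ncfile, when):
    """Insert a new (user, file) row, or update the existing one in place."""
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO access_summary "
            "(username, ncfile, n_accesses, last_access) VALUES (?, ?, 1, ?) "
            "ON CONFLICT(username, ncfile) DO UPDATE SET "
            "n_accesses = n_accesses + 1, "
            "last_access = max(last_access, excluded.last_access)",
            (username, ncfile, when),
        )


record_access(conn, "alice", "ocean.nc", "2023-01-05")
record_access(conn, "alice", "ocean.nc", "2023-03-10")
record_access(conn, "bob", "ocean.nc", "2023-02-01")

rows = conn.execute(
    "SELECT username, ncfile, n_accesses, last_access "
    "FROM access_summary ORDER BY username"
).fetchall()
```

The table size is bounded by the number of distinct user and file pairs, at the cost of losing the individual access history.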
Just noting that most of the time I don't use
Thanks @rmholmes, good point. I suspect most usage is via the cookbook, but we don't actually know. We can't hope to cover every access method, but any info on usage is better than none - e.g. we can be sure we shouldn't delete data if it's actively used via the cookbook, but if there are no cookbook users we'd need to ask around whether anyone is accessing data via some other method before deleting it.
For the purposes of cleanup, we can identify files that haven't been recently accessed using the
Just a thought - for the big 0.1deg runs we save a lot of data on request but it's sometimes unclear how much of it actually gets used or by whom, so it's hard to tell whether some of it could be deleted to save space, or whether some diagnostics could be dropped for future runs.
To assist with managing our storage it could be useful to make querying.getvar log which files are actually getting accessed, e.g. by having the database store the total number of requests for each file, and the date of the most recent request. If the username is accessible, the total count and most recent access could be recorded per-user. The DB could then be queried to find big files that nobody needs anymore. This data could also be useful for documenting the research impact of the COSIMA datasets, e.g. for grant applications.