Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New "original filenames" methods #448

Closed
davidhassell opened this issue Sep 9, 2022 · 8 comments · Fixed by #449
Closed

New "original filenames" methods #448

davidhassell opened this issue Sep 9, 2022 · 8 comments · Fixed by #449
Labels
enhancement New feature or request netCDF read Relating to reading netCDF datasets

Comments

@davidhassell
Copy link
Collaborator

davidhassell commented Sep 9, 2022

The conversation at #365 discussed how to record the names of files that contained some or all of the data at instantiation time - typically (but not necessarily) from a read call.

I propose a new suite of original_filenames methods that allow us to define, report, and edit these "original" files - those that contained some or all of the data at instantiation time - and are independent of the get_filenames methods. get_filenames reports on any files that are used by Data objects at the time of asking, which might reported different (typically fewer) files than were being used at the time of data instantiation.

It is also the case that the daskification of the cf-python library confuses functionality of the get_filenames methods - due to the new lazy execution paradigm, the notion of files in use at the time of asking is a little fuzzy. For this reason, the get_filenames methods are likely to be deprecated. Edit: These have now been deprecated in the PR

The implementation of this will be in cfdm and entirely inherited into cf-python. See NCAS-CMS/cfdm#215 for the implementation details, which were initially proposed as (pending review):

>>> f = cfdm.read("parent_file.nc")[0]
>>> f.original_filenames()
{"parent_file.nc"}

>>> f = cfdm.read("parent_file.nc", external="external_file.nc")[0]
>>> f.original_filenames()
{"parent_file.nc", "external_file.nc"}
>>> f.original_filenames(update="file.nc")
>>> f.original_filenames()
{"parent_file.nc", "external_file.nc", "file.nc"}
>>> f.original_filenames(clear=True)
{"parent_file.nc", "external_file.nc", "file.nc"}
>>> f.original_filenames()
set()

You might want to edit the original file names if you are creating constructs ab initio (without cf.read, which will set them for you automatically), or if you are creating a Field (or Data or any other construct) that is some combination of others - in this case you might want to store the superset of original files.

Many thanks to @ThibHlln for first raising this the issue in #365.

@davidhassell davidhassell added the enhancement New feature or request label Sep 9, 2022
@sadielbartholomew
Copy link
Member

Regarding my joint reviews for #449 and the upstream cfdm implementation NCAS-CMS/cfdm#216, I had a few questions about the approach, and I thought best put them here instead of one either PR. Namely (and note I may be missing something in my understanding in asking either of these, but best check!):

1. Difference in original filename method keywords for data and for fields

The code block you provide here as an example doesn't work for the implementations, seemingly because the method on a Field has only defined a clear parameter. However, the define and update mechanism works as illustrated above but instead for any Data, i.e. a Field's Data, as summarised by this passing test snippet.

Is this correct, that we only allow update and define for Data, but clear for both, or is that an issue with the cfdm PR? If it is intended, what is the reason for this, just so we can be explicit on this thread to trace the enhancement (I can't see anything in the opening comment here that suggests why this might be)?

2. Possibility of specifying files not currently present (e.g. on the filesystem)

I was also wondering if it is the intention that the user can specify files to define and update that don't exist on their filesystem at the present? I imagine it is in fact deliberate and considered OK in light of your comment:

due to the new lazy execution paradigm, the notion of files in use at the time of asking is a little fuzzy

but just want to check! Even so, it might be worth including a warning via the logging system if a file is detected by os or pathlib or similar to be non-existent (or unreadable, even)?

@sadielbartholomew
Copy link
Member

(Sorry, should really have posted my comment above on the cfdm Issue, NCAS-CMS/cfdm#215 which is where the implementation is mostly based.)

@davidhassell
Copy link
Collaborator Author

Hi @sadielbartholomew, Thanks for #448 (comment).

  1. Difference in original filename method keywords for data and for fields

I'm thinking about this ...

Even so, it might be worth including a warning via the logging system if a file is detected by os or pathlib or similar to be non-existent (or unreadable, even)?

Perhaps a loglevel == -1 warning?

@sadielbartholomew
Copy link
Member

Perhaps a loglevel == -1 warning?

I was thinking more a 3/'DETAIL' log level as users won't be using debug (generally), is that too for exposed for your liking?

@davidhassell
Copy link
Collaborator Author

Hi Sadie,

I was thinking more a 3/'DETAIL' log level as users won't be using debug (generally), is that too for exposed for your liking?

Been thinking about this a bit more, and am now not so keen on issuing such a warning when "update" or "define" is used. If a warning is not issued then this implies that you can be sure that the files will exist and be readable at a future time when you need them, which is not guaranteed. Original filenames will typically be set by read, so we know those files exist. If the use case is setting them manually then they are likely already in a situation where they want to record the names of files which have not been actually been read and might not be available - however they know that the data relates to these files and their names are what is important, not their content.

@davidhassell
Copy link
Collaborator Author

Hi Sadie,

  1. Difference in original filename method keywords for data and for fields

My original thinking was that updating/defining filenames on construct (as opposed to Data) objects was ill-defined, because you probably wouldn't want to broadcast those names to constituent constructs (e.g. coordinates) and components (e.g. bounds).

However - this situation also occurs for the Data object, since when you set names they are not broadcast to ancillary components (e.g. index/count/list/tie_point variables).

So I'm now thinking like you and "update" and "define" should be available on all original_filenames methods; and clearly documented to make it clear that setting them does not trickle down to constituent parts.

@sadielbartholomew
Copy link
Member

Hi David, thanks for the updates on your thinking regarding the questions raised.

Regarding #448 (comment), that all makes sense to me now you justify it as such, particularly you pointing out:

If a warning is not issued then this implies that you can be sure that the files will exist and be readable at a future time when you need them, which is not guaranteed

and I am happy to stick with no warning and keep this aspect as-is! So I guess we are fine with the current implementation, but I think it is a good idea to note within the documentation that additions to original_filenames can be made regardless of whether the files under those paths exist or are readable, and files resulting from update or define are not guaranteed to exist at a given time so that should not be relied upon. Something more concise than that probably but we can work that out later!?

Otherwise we can consider my question (2) in #448 (comment) resolved.

(Will reply to your second comment separately after lunch.)

@sadielbartholomew
Copy link
Member

My original thinking was that ...

That seems sensible. Thanks for clarifying.

However - this situation also occurs for the Data object, since ...

...So I'm now thinking like you and "update" and "define" should be available on all original_filenames methods; and clearly documented to make it clear that setting them does not trickle down to constituent parts.

OK, I follow and agree with your in-hindsight logic here. I agree with that justification and it would also be more consistent and therefore user-friendly for them to have matching signatures. So shall we go with that plan?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request netCDF read Relating to reading netCDF datasets
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants