Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add API for GDAL users to be able to extract information / transform subdataset names #7261

Closed
rouault opened this issue Feb 16, 2023 · 17 comments · Fixed by #8155
Closed

Add API for GDAL users to be able to extract information / transform subdataset names #7261

rouault opened this issue Feb 16, 2023 · 17 comments · Fixed by #8155
Assignees

Comments

@rouault
Copy link
Member

rouault commented Feb 16, 2023

Cf https://github.com/qgis/QGIS/pull/51901/files#r1109095423 for the context

Perhaps GDALGetFilenameFromSubdatasetName(), GDALModifyFilenameInSubdatasetName(), etc.

@mdsumner
Copy link
Contributor

would this include getting the sds names in the beginning? (because I'd meant to pursue that, would like to be able to name them in "vrt://...?sds=var1" for example and it's a bit laborious atm)👌

@rouault
Copy link
Member Author

rouault commented Feb 17, 2023

would this include getting the sds names in the beginning?

What do you mean exactly?

@mdsumner
Copy link
Contributor

to generate the set of "NETCDF:file.nc:var1" etc from "file.nc" last I looked it's quite involved

@mdsumner
Copy link
Contributor

involved as in iterating over poDataset->GetMetadata("SUBDATASETS") and getting every second entry etc

@rouault
Copy link
Member Author

rouault commented Feb 17, 2023

involved as in iterating over poDataset->GetMetadata("SUBDATASETS") and getting every second entry etc

ah, well that aspects is in the "works" category for me. The Python bindings have typically a GetSubDatasets() helper that re-arranges things as an array of (name, description) pairs:


    def GetSubDatasets(self):
        sd_list = []

        sd = self.GetMetadata('SUBDATASETS')
        if sd is None:
            return sd_list

        i = 1
        while 'SUBDATASET_'+str(i)+'_NAME' in sd:
            sd_list.append((sd['SUBDATASET_'+str(i)+'_NAME'],
                            sd['SUBDATASET_'+str(i)+'_DESC']))
            i = i + 1
        return sd_list

@elpaso elpaso self-assigned this Feb 23, 2023
@rouault
Copy link
Member Author

rouault commented Jul 19, 2023

We'd also probably need a GDALIsSubdatasetSyntax() to check if a string is a subdataset syntax.
That could be used in conjunction with GDALGetFilenameFromSubdatasetName() to avoid VRTSimpleSource::GetFileList() to do a VSIStatExL() call that can slow on network file systems. See issue reported in https://lists.osgeo.org/pipermail/gdal-dev/2023-July/057472.html

I presume those functions should actually iterate over drivers and call function pointers at the driver level (similarly to Identify(), Open() etc)

@elpaso
Copy link
Collaborator

elpaso commented Jul 21, 2023

@rouault I started sketching the API, if I get this right the goal is to determine if the file name is possibly a subdataset and for the second function to strip the dataset information and return the file path without dataset information.

Is it correct that for performance reasons both functions should not open the dataset but only examine the file name string?

I have a few questions about the input and the implementation of the methods, you mentioned that we should loop through all the registered drivers that support subdatasets so I guess that we must handle the case when we don't know the driver in advance (because the file name is not in the form DRIVER_NAME:....... but in the form /path/to/the/file or protocol:://path/to/the/file).

The question is if it does make sense to loop through all drivers (when the driver is not known in advance) and ask the driver to determine if the file name is a subdataset without actually opening the dataset, how can the driver know if it can handle that file without actually trying to open it? I guess in a few cases it could examine the file extension but I am worried that this won't work in all situations (for example in case of API URLs).

Is there a logic I can borrow? Maybe from OpenEx?

This is the top level function I am working on:

bool CPL_STDCALL GDALIsSubdatasetSyntax(const char *pszFileName)
{
    // Iterate all drivers
    GDALDriverManager *poDM = GetGDALDriverManager();
    const int nDriverCount = poDM->GetDriverCount();
    for (int iDriver = 0; iDriver < nDriverCount; ++iDriver)
    {
        GDALDriver *poDriver = poDM->GetDriver(iDriver);
        char **papszMD = GDALGetMetadata(poDriver, nullptr);
        if (! CPLFetchBool(papszMD, GDAL_DMD_SUBDATASETS, false) || ! poDriver->pfnIsDatasetSyntax)
        {
            continue;
        }
        
        // Ask the driver if this is a subdataset descriptor
        if( poDriver->pfnIsDatasetSyntax( pszFileName ))
        {
            return TRUE;
        }
    }
    return FALSE;
}

@rouault
Copy link
Member Author

rouault commented Jul 21, 2023

Is it correct that for performance reasons both functions should not open the dataset but only examine the file name string?

yes ... ideally ... I don't have in mind situations where we'd need to open it, but perhaps I'm missing something

Is there a logic I can borrow? Maybe from OpenEx?

your looping logic looks good to me.
The key is then to implement pfnIsDatasetSyntax (should probably be name pfnIsSubdatasetSyntax) in the relevant drivers, based on the implementation of their pfnIdentify method
In most cases this should be something as simple as return STARTS_WITH_CI(pszFilename, "MY_PREFIX:")

As we might need 3 functonality (is this is a subdaset name, get the filename if there's one, get a new subsdataset URI with this filename instead), I'm wondering if we shouldn't have a single pfnGetSubdatasetInfo function pointer that would return an object that would have methods GetFilename, ModifyFilename. Potentially it could also return driver specific info as key/value pairs, similarly to QGIS decodeUri()/encodeUri()

@elpaso
Copy link
Collaborator

elpaso commented Jul 21, 2023

@rouault so I can assume that we always have a file name that starts with "MY_PREFIX:" and we don't handle unprefixed file names?

@rouault
Copy link
Member Author

rouault commented Jul 21, 2023

There might be several prefixes handled by the same driver. Eg the SENTINEL2 one. So the general logic should not make any assumption regarding this.
Subdataset names should not look like regular/ unprefixed filenmes but i don't think the general logic should be aware of the content of the passed string

@elpaso
Copy link
Collaborator

elpaso commented Jul 24, 2023

@rouault I tried the function pointer approach but I've got lost in casting fp to void* (which is forbidden).

Before I loose more time on that road, I was thinking that we should define the API a little better: as far as I understand there are two different required client use cases:

  1. given a generic string which defines a data source and a subdataset, call a method which will loop through all registered drivers and possibly return a SubdatasetInfo object (or nullptr if no driver accepts the string), the SubdatasetInfo object will have methods that return information about the subdataset and manipulate the string (strip the subdataset, replace the file path for renaming, etc.). These methods could be static and do not need a DS or driver instance.
  2. given a driver which supports subdatasets, get the SubdatasetInfo object from the driver instance (or nullptr if the driver does not support it)

Now, I understand that many methods in GDAL are implemented with function pointers (I'm not very familiar with that pattern because we don't use it much in QGIS or in other C++ projects I've been working with, like underpass), I tried anyway to define an opaque handle for the function pointer but I've got stuck with the above mentioned error.

So, I was thinking at a different approach: make SubdatasetInfo and abstract interface (with pure virtual methods) which must be implemented by the drivers that offer this functionality. When GetSubdatasetInfo() is called on a driver it can either return nullptr or a pointer to the (owned by the driver) instance of a concrete SubdatasetInfo.

Do you see any problem with this approach especially for the C interface and the bindings?
I'm not very familiar with swig but I guess it can bind to object it doesn't own.

If we stick to the use case 1 and with the separate methods API I've sketched in #7261 (comment) I think I know how to proceed but that's different from what you asked in #7261 (comment)

@rouault
Copy link
Member Author

rouault commented Jul 24, 2023

I tried the function pointer approach but I've got lost in casting fp to void* (which is forbidden).

why did you need to cast fp to void* ? And why do you actually need a file pointer in the API ? It might help if you showed your draft and the actual error you got

as far as I understand there are two different required client use cases:

I would hope we can have a single API for both use cases. I would say that even if we know the driver, it is probably not that much an issue to loop over driver to find it again. I don't anticipate those subdataset related methods to be in particular hot performance code paths.

Now, I understand that many methods in GDAL are implemented with function pointers

I agree this is a bit of a odd pattern. My understand is this was done to easily test if a driver implements an interface or not, without actually calling the method. So for example if pfnCreateCopy is defined then you can adverize a GDAL_DCAP_CREATECOPY=YES driver metadata item. And as a consequence most drivers don't need to actually subclass GDALDriver. They just instanciate a new object and set function pointers.
For that new API, I'm not sure if we would need to advertize a capability for it. But I'm not sure why we would have more issues with the function pointer approach vs a virtual C++ method. It would be slightly better to stick with the function pointer approach for new methods of GDALDriver, for the sake of uniformity

So, I was thinking at a different approach: make SubdatasetInfo and abstract interface (with pure virtual methods) which must be implemented by the drivers that offer this functionality. When GetSubdatasetInfo() is called on a driver it can either return nullptr or a pointer to the (owned by the driver) instance of a concrete SubdatasetInfo.
Do you see any problem with this approach especially for the C interface and the bindings?

Having SubdatasetInfo a class with pure virtual methods sounds OK to me. It shouldn't cause any issue for the C interface and SWIG bindings. Pretty similar to GDALDataset, GDALRasterBand, OGRLayer and so on.

@elpaso
Copy link
Collaborator

elpaso commented Jul 24, 2023

I tried the function pointer approach but I've got lost in casting fp to void* (which is forbidden).

why did you need to cast fp to void* ? And why do you actually need a file pointer in the API ? It might help if you showed

I meant function pointer.

@elpaso
Copy link
Collaborator

elpaso commented Jul 24, 2023

@rouault Here is my broken and unfinished initial attempt https://github.com/elpaso/gdal/tree/subdatasetinfo-api-func-pointers

@rouault
Copy link
Member Author

rouault commented Jul 24, 2023

I meant function pointer.

ah ok
You could likely solve the issue with forward defining struct GDALGetSubdatasetInfo that way, and making the C handle a pointer to it:

typedef struct GDALGetSubdatasetInfo *GDALGetSubdatasetInfoH;

and making the C++ class actually a struct GDALGetSubdatasetInfo {} as some compilers might not liking mixing struct & class with the same identifier.

@sgillies
Copy link
Contributor

@rouault would it be possible as part of this work to deprecate and stop promoting the use of subdataset names with extra quotes in them? For example, netcdf:"example.h5":data. It would be good for GDAL to align itself more with internet standards for addressing files and protocols and the extra quotes are weird and oddball. They also put a requirement on GDAL users downstream like this: https://github.com/qgis/QGIS/pull/51901/files#diff-b86e100d2544e883c6c84116f2712118e74b2f993dff776d1b6ec5aa9c1dc00dR2495.

@rouault
Copy link
Member Author

rouault commented Jul 24, 2023

would it be possible as part of this work to deprecate and stop promoting the use of subdataset names with extra quotes in them?

I wouldn't want to put too much on @elpaso shoulders as the amount of changes might quickly go out of control. We also have backward compatibility concerns with people potentially forging subdataset names based on their knowledge of the current implementation.
I'm not totally sure if introducing "clean" / standard friently subdataset name in existing drivers, in addition to supporting legacy syntax, would be a win from the point of view of GDAL's code base...

They also put a requirement on GDAL users downstream like this: https://github.com/qgis/QGIS/pull/51901/files#diff-b86e100d2544e883c6c84116f2712118e74b2f993dff776d1b6ec5aa9c1dc00dR2495.

This is the very aim of this ticket to make subdataset name structure opaque to QGIS and give it functions to manipulate them easily without having to know anything about netCDF, HDF5 and the like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants