
new features request for get_geodataframe method #97

Open
xldeltares opened this issue Feb 11, 2022 · 7 comments
Labels
DataCatalog & DataAdapters issues related to the DataCatalog and DataAdapters Enhancement New feature or request GeoDataFrame issue related to GeoDataFrame adapter or processing Needs refinement issue still needs refinement
Comments

@xldeltares (Contributor)

In the hydrolib plugin, a few additional functions were needed for geodataframes, such as clip, retype, slice, setting new columns, and setting an id_col. It could be discussed how these new features can be generalised to be useful for other plugins.

@xldeltares xldeltares added the Enhancement New feature or request label Feb 11, 2022
@DirkEilander (Contributor)

@xldeltares Could you elaborate (with a short sentence) on what each functionality should do, just to make it clear to everyone and make sure we're on the same page? With the "clip" functionality I guess you mean a spatial clip, but could you clarify how this differs from the current implementation? And with "slice", do you mean a filter/query on data properties?

@xldeltares (Contributor, Author)

@DirkEilander Sure, see below! Please let me know if anything needs further clarification, cheers!

Elaborated descriptions follow:

  1. clip_buffer: float and clip_predicate: str
    Arguments used for spatial clipping of the geodataframe per data layer.
    The current implementation allows buffer and clip_predicate in modelbuilder.ini per component. In dflowfm setups, we sometimes have different geometry types for the same model component, e.g. a cross-section profile can be a single point or a series of points; they hence require different spatial clipping methods.
  2. retype: dict
    Dictionary used to force data types for the geodataframe.
    The current implementation uses the default geopandas parser. In dflowfm setups, we noticed the geopandas parser was sometimes not sufficient.
  3. funcs: dict
    Dictionary used for evaluating a Python expression based on data layer attributes.
    Currently, part of this request is implemented via the unit_add and unit_mult methods in the data adapter. The requested funcs include assigning an attribute based on another: for example, in dflowfm, pump storage, required as a numerical measure for model stability, could be estimated from pump capacity, e.g. PUMP_STORAGE = gdf.eval("PUMP_CAPACITY * 100"). It also includes creating a new attribute from a given expression, e.g. specifying the feature type for the entire data layer: IS_PIPE = gdf.eval("TYPE in ['PIPE', 'pipe', 'p', 'pipes']").
    (Is this more relevant to model defaults? Model defaults never overwrite, only append -> a combination of queries and filling in nodata.)
  4. required_columns: list and required_query: str
    Arguments used to slice the geodataframe: required_columns for column-wise slicing, required_query for row-wise slicing.
    Currently, column-wise slicing is implemented as the variables argument in modelbuilder. There is no implementation of row-wise slicing.
  5. id_col: str
    Argument required to set the index column of the geodataframe. (Should a multi-index be allowed?)
    Currently, there is no implemented method for this. In dflowfm, the index must be assigned; rows with NaN in the index will be dropped.
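The requested per-item behaviour can be sketched with plain geopandas/pandas operations. This is an illustrative sketch only, not a proposed adapter implementation; the column names PUMP_CAPACITY, TYPE, and ID are taken from the examples above, the sample data is made up:

```python
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame(
    {
        "PUMP_CAPACITY": [1.5, 2.0],
        "TYPE": ["PIPE", "weir"],
        "ID": ["p1", "p2"],
    },
    geometry=[Point(0, 0), Point(1, 1)],
)

# 2. retype: force column dtypes explicitly
gdf = gdf.astype({"PUMP_CAPACITY": "float64"})

# 3. funcs: derive new attributes from existing ones
gdf["PUMP_STORAGE"] = gdf.eval("PUMP_CAPACITY * 100")
gdf["IS_PIPE"] = gdf["TYPE"].str.lower().isin(["pipe", "p", "pipes"])

# 4. required_columns / required_query: column- and row-wise slicing
gdf = gdf[["ID", "PUMP_STORAGE", "IS_PIPE", "geometry"]]
gdf = gdf.query("IS_PIPE")

# 5. id_col: set the index, dropping rows with a missing id
gdf = gdf.dropna(subset=["ID"]).set_index("ID")
```

The question in the thread is essentially where these steps should live: hardcoded in workflows, configured in the data catalog, or left to preprocessing.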

@DirkEilander (Contributor)

Thanks for the detailed explanation @xldeltares!

In general it would be good to carefully review what we expect to be done in preprocessing, what in the DataAdapter classes, and what by setup_* methods and workflows. Let's discuss this in a follow-up meeting. We need to be careful: putting too many required switches in the yml file might complicate the setup.

  • clipping: In other models we often clip based on the model region and an optional buffer, which is the same for any type of geometry being read. I'd like to learn more about your case for adding this to the yml.
  • retype & funcs: In other models these types of actions are often taken care of by workflows. If data is missing or has the wrong dtype, it is estimated based on existing columns or other (meta)data. I can understand that if the input data is really diverse, your solution of adding functions to the yml is useful. Note that adding these to the yml file requires the user to understand exactly what is required, and thus really good documentation.
  • I like the query idea, but it's good to check whether this is always best practice. As far as I know, there is currently no method in geopandas/fiona to do any filtering while reading the data, meaning the entire dataset has to be read before any filtering can be done. If the data is big, it therefore makes much more sense to do some preprocessing and group different rows into geopackage layers (which can be read independently) or even different files. ogr2ogr is still the best-performing tool for this type of vector preprocessing. We also need to think about the syntax for implementing this - do you have an example?
  • The id_col is an easy one to add. In practice this could already be handled in the workflows by using a naming convention for the id_col and renaming in the yml.

@DirkEilander (Contributor)

This issue is also linked to #51.

@DirkEilander DirkEilander added this to To do in data via automation Jul 26, 2022
@hboisgon hboisgon added the DataCatalog & DataAdapters issues related to the DataCatalog and DataAdapters label May 23, 2023
@hboisgon hboisgon added Needs refinement issue still needs refinement GeoDataFrame issue related to GeoDataFrame adapter or processing labels Jun 1, 2023
@DirkEilander (Contributor)

Sorry for letting this lie dormant for so long. @hboisgon @xldeltares, could you indicate whether this is still relevant, and if so, which of the above items specifically? This time we will put it on our planning so it is not forgotten.

@hboisgon (Contributor)

My thoughts, but @xldeltares please add your own:

  1. clipping: In get_geodataframe and get_geodataset, you can pass assert_gtype = Point to the io.open_vector function, for example. This will raise an error if not all geometries are points. For geodataframe, all files are read with this function, but for geodataset the netcdf and zarr files are not, so we could add it there. Maybe also make it a more explicit argument of these catalogs rather than a somewhat hidden kwarg?
    It won't solve your clipping problem, but it ensures the geometry type of the file used. I think if you want your functions to support multiple geometry types, then using a not-too-restrictive clip when reading, and splitting your input data into a separate gdf per geometry type before processing further, would be a way to go. Or do you think we should do more in core?
  2. retype: I actually think it might be nice to add this to all data adapters. With pandas you can provide a dtype, but you have to provide it for all columns or one type for every column, which can be a bit tedious to fill in if you only want to use a couple of columns of the dataset. @DirkEilander, what do you think?
  3. funcs: I think this should be part of the plugin workflows. In your docs, you can specify to users which variables are mandatory and which are optional, and do the computation in the workflow functions.
  4. required_columns: I think this depends on the model rather than on the dataset (i.e. wflow may need different columns than delft3d for the same input dataset). Still, this function might be useful to add somewhere in core so that each plugin/workflow can call it with its list of required columns.
    query: a query is indeed quite useful to select rows based on column attributes (e.g. I use this to query road_type values for highway or other). If it won't change the reading, maybe this is again a core function to perform the query? Not a tough one, I think.
  5. id_col: it would be really useful to set this via the data catalog. In a way you can do this in pandas with index_col, so it would be good if we could implement the same for geodataframe and geodataset. I'm not sure whether this is something to fix only in the data catalog or, as Dirk mentions, in your workflow function: decide on the index column there and pass it to the get_data method.
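For reference, the pandas behaviour mentioned in point 5 looks like this. A minimal sketch with made-up data; geopandas.read_file has no equivalent argument, so for a geodataframe the index currently has to be set after reading:

```python
import io

import pandas as pd

csv = io.StringIO("id,value\na,1\nb,2\n")

# pandas can assign the index column while reading
df = pd.read_csv(csv, index_col="id")

# for a GeoDataFrame the same result currently takes a second step, e.g.:
# gdf = gpd.read_file(path).set_index("id")
```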

So to recap on proposed changes:

  1. use assert_gtype in all readers for geodataset (geodataframe is already OK)
  2. add a new dtype property to the catalogs, like nodata, unit_mult, etc., for all adapters
  3. no
  4. maybe functions in core (vector and?) rather than in the data adapters
  5. support id_col for geodataframe and geodataset

@xldeltares (Contributor, Author)

xldeltares commented Oct 19, 2023

Hi @hboisgon and @DirkEilander,

Nice that this is being discussed again! And thank you both for thinking along; it also gives some new perspectives. :)

First of all, just to express my overall feeling: the objective of the data_adaptor is not very clear to me. Sometimes I feel the data_adaptor should just take care of reading; sometimes I feel it should also mix in preprocessing.

That said, I have a preference for having more advanced functions in the data_adaptor (both reading and preprocessing), motivated by the use of local data :). Even in the case of a global data catalog, having more advanced functions can still be beneficial, as it would support a refined data catalog acting as a "wrapper" around the current global data catalogue. With the current trend of moving from global to local-scale studies, this seems nice to have, especially now that hydromt has a variety of plugins (also less hydro-oriented ones) with different data requirements. Finally, it would help keep the model builder yml/ini focused on model-building options only: currently the data preprocessing options are either hardcoded in the workflows, or we have to mix preprocessing options in with the model-building options.

See the following regarding the requested functions:
1. clipping: No! assert_gtype: Yes! I agree not to implement an additional clipping argument, as it is better specified in the function arguments. assert_gtype has proven quite handy when working with geodataset, so I agree to implement it for all. Note that sometimes one expects LineString as gtype but gets MultiLineString. What would be the best practice to deal with this: a warning or an error?
2. retype: Yes! Preferred in the data_adaptor, but it needs to be implemented very carefully (like everything, of course); for example, when combined with nodata, how do we make it clear to the user which dtype to use for nodata?
3. funcs: No! I agree not to implement this for now, because hydromt so far mostly serves users who work with developed model toolkits. It might be nice in the future if hydromt wants to be more oriented towards exploratory users, e.g. someone who is still exploring different methodologies.
4. required_columns/query: Yes! Preferred in the data_adaptor. From a modeller's perspective, I see this as part of how the data is read, so it belongs in the data_catalogue. Use case: filtering either to reduce memory use or for quality control (keeping only the relevant columns, or dropping rows with less reliable data) could be very handy.
5. id_col: Yes! Preferred in the data_adaptor.
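One possible way to handle the LineString/MultiLineString mismatch raised in point 1, before any assert_gtype check, is to explode multi-part geometries into single parts. A sketch under that assumption, not an agreed design:

```python
import geopandas as gpd
from shapely.geometry import MultiLineString

# a layer where the data arrives as MultiLineString instead of LineString
gdf = gpd.GeoDataFrame(
    geometry=[MultiLineString([[(0, 0), (1, 1)], [(1, 1), (2, 2)]])]
)

# explode multi-part geometries so a LineString assertion would pass
single = gdf.explode(index_parts=False)
```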

Opinions open to discussion :). Thanks!
