
GNIP 92: non-spatial structured data as (pre)viewable FAIR datasets #8714

Open · 2 of 5 tasks
gannebamm opened this issue Jan 31, 2022 · 32 comments
Labels
API v2 gnip A GeoNodeImprovementProcess Issue master

Comments

@gannebamm
Contributor

gannebamm commented Jan 31, 2022

GNIP 92 - non-spatial structured data as (pre)viewable FAIR datasets

Overview

We need to store structured non-spatial datasets alongside geodata as GeoNode resources. The non-spatial datasets shall provide a simple viewer as a preview and should be usable as part of dashboards. The datasets should be findable, accessible, and provided in an interoperable way, thereby complying with the FAIR principles.

Proposed By

Florian Hoedt, Thünen-Institute Centre for Information Management

We intend to fund this development as part of an upcoming tender process. This GNIP shall start a discussion about how the developed feature could be upstreamed to the main project.

Assigned to Release

This proposal is for GeoNode 4.0

State

  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

Status Quo: Non-spatial but structured datasets like CSV/Excel files can be uploaded as documents. As document objects, these datasets inherit the resource base metadata models but cannot be viewed in a meaningful way.
As a research institute, our scientists often use PostgreSQL databases and tables to store and structure their research data. Currently, those datasets cannot be published in GeoNode in any way. We need to store/register structured non-spatial datasets besides geodata as GeoNode datasets (in the meaning of a v4.0 dataset).

Objective: Implement a new category of ResourceBase for structured non-spatial datasets. Instead of using the GeoServer importer to ingest, e.g., shapefiles into the PostGIS-enabled backend, you should be able to define a connection string to the table to use [?]. The non-spatial datasets shall provide a simple viewer as a preview and should be usable as part of dashboards.

Proposal

How to realize the above-mentioned feature is still to be discussed.

As part of an internal discussion, we thought about using PostgREST as an accessible and interoperable tabular data provider. One major aspect is to synchronise authorization mechanisms with the new service. Currently, Django and GeoServer synchronise their roles via GeoFence. Something similar should be implemented for the tabular service provider. There seem to be options to use JWT as part of the django-rest-framework to grant such authorization, as explained here: https://gitter.im/begriffs/postgrest?at=61f06b40742c3d4b21b63843
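
PostgREST validates JWTs signed with a shared secret and maps the token's `role` claim to a PostgreSQL role, so one option is for Django to mint such tokens for authenticated users. Below is a minimal, stdlib-only HS256 sketch of that idea; the function and secret names are invented for illustration, not part of any existing GeoNode or PostgREST API:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    # Base64url without padding, as the JWT spec requires.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_postgrest_jwt(secret: str, role: str, ttl_seconds: int = 3600) -> str:
    """Sign a minimal HS256 JWT whose 'role' claim PostgREST maps to a
    PostgreSQL role (the secret must match PGRST_JWT_SECRET)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(
        json.dumps({"role": role, "exp": int(time.time()) + ttl_seconds}).encode()
    )
    signing_input = f"{header}.{payload}".encode()
    signature = b64url(
        hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    )
    return f"{header}.{payload}.{signature}"


# A django-rest-framework view could issue such a token after login:
token = make_postgrest_jwt("a-shared-secret-that-is-long-enough", "web_anon")
```

In practice a library like PyJWT would do the signing, but the point is only that the secret is shared between Django and PostgREST and the `role` claim carries the authorization.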

Apart from using PostgREST as a tabular data provider, we also considered the new OGC APIs, which may provide enough functionality for this GNIP; for example, the Environmental Data Retrieval API (https://ogcapi.ogc.org/edr/).

Backwards Compatibility

It is not intended to backport this GNIP to 3.x

Future evolution

Possible future evolutions are still to be defined.

Feedback

See discussion below...

Voting

Project Steering Committee:

  • Alessio Fabiani: 👍
  • Francesco Bartoli:
  • Giovanni Allegri:
  • Toni Schoenbuchner: 👍
  • Florian Hoedt: 👍


@gannebamm gannebamm added gnip A GeoNodeImprovementProcess Issue API v2 master labels Jan 31, 2022
@afabiani
Member

afabiani commented Feb 2, 2022

@gannebamm my +1 here; the proposal is actually very good. Of course, we need to carefully choose how to convert the structured data into a "standard" format that GeoNode can use later on.
It would be nice if you could prepare/provide a few samples of possible datasets or give an idea of the complexity of the structures. You are speaking about Excel documents, but those might be very complex. We might need to envisage some hooks/harvesters able to parse and store specific formats.
We had a similar use case for the Afghanistan Risk Data portal. In that case, we had to create brand-new data structures and parsers able to ingest very specific Excel files for each hazard type.

@afabiani
Member

afabiani commented Feb 2, 2022

Added the GNIP to the wiki page

@gannebamm
Contributor Author

Thanks @afabiani for adding it to the wiki page!

It would be nice if you could prepare/provide a few samples of possible datasets or give an idea of the complexity of the structures. You are speaking about Excel documents, but those might be very complex.

here you go: soil example dataset

You can switch the metadata language to English and use the BZE_LW English version. The site.xlsx is the spatial dataset we currently upload as a point layer. The other two xlsx files (LABORATORY_DATA, HORIZON_DATA) are examples of non-spatial datasets. I know that, in the end, everything is spatial somehow, since the lab and horizon datasets explicitly or implicitly reference a sample site. Nonetheless, we would like to publish those as non-spatial datasets and enable custom applications to fetch them in an accessible and interoperable way through an API. An example of this kind of custom application can be seen at soilgrids. If you click a coordinate you will retrieve loads of additional data like this:
[image: example of additional data returned by soilgrids for a clicked coordinate]

However, most of our data is already stored in PostgreSQL databases. I know other research institutes also have working databases that could perhaps be integrated directly. If we used an ORM like SQLAlchemy, we could even open this up to a more diverse set of SQL data providers, as explained here. But maybe that is out of scope, and we should stay true and close to our current stack, which uses PostgreSQL.

I will ask my colleagues to provide some more examples.

@t-book
Contributor

t-book commented Feb 18, 2022

My +1. thanks Florian

@giohappy
Contributor

@gannebamm this proposal is the natural continuation of the conceptual change we made from "layers" to "datasets" in GeoNode.
One of the reasons for renaming these entities was exactly to make room for non-spatial datasets, which are not well represented as "layers".

Before starting the discussion about their presentation (web client, API, standards, whatever), I wonder where we imagine storing these datasets. The first option that comes to my mind is the geonode_data DB, which is the one employed by GeoServer. At the moment GeoNode has no direct connection to that DB, but I have been thinking about this for a while. If we made GeoNode "aware" of the geonode_data DB and models, we could:

  • build more advanced and custom analysis and visualization tools on vector spatial datasets
  • expose vector spatial datasets also to the functionality that will be built for non-spatial datasets
  • have a single data store for both non-spatial and vector datasets, even though the latter are directly managed by GeoServer

I know that this goes against the general advice to keep the services' models separate, and in theory we should only rely on OGC standard interfaces to query spatial datasets. But in the case of GeoNode and GeoServer:

  • they're services composing a single product, so we have total control over the two and their models
  • keeping GeoNode itself tied to the standard interfaces with GeoServer limits the functionality that could be built (or makes it far more complex and less performant)

@matthesrieke
Contributor

As I had an offline discussion with @gannebamm on the topic, I am sneaking in on the discussion.

However, most of our data is already stored in PostgreSQL databases. I know other research institutes also have working databases that could maybe just get integrated.

I think @gannebamm has a slightly different workflow in mind (correct me if I am wrong). The data would not be imported into a central data store but managed as a reference to an existing database. I guess this is the most flexible and scalable approach, as otherwise you would need to make sure to preserve the structure in geonode_data without conflicts across datasets.

On the other hand, this would bring in the requirement of some sort of default structure anyway, if you do not want to implement special visualizers for each dataset. Maybe it could also be a mappable structure, filled out by the user during import.

Maybe both scenarios (1. existing DB; 2. import into geonode_data) could be covered. The same questions would still require answers.

@gannebamm gannebamm changed the title GNIP 92: non-spatial structured data as (pre)viewable FAI(R) datasets GNIP 92: non-spatial structured data as (pre)viewable FAIR datasets Feb 28, 2022
@gannebamm
Contributor Author

Maybe the https://github.com/rvinzent/django-dynamic-models technology, evaluated and used (?) for the SOS integration (contrib module), can help with this feature request. See: https://github.com/GeoNode/geonode-contribs/issues/172

@matthesrieke
Contributor

Dear @giohappy, together with @gannebamm and @mwallschlaeger we have started to iterate on the requirements and the concept behind a non-spatial dataset feature for GeoNode. We have started a small prototype by setting up a Django app / contrib module. At the moment, uploading data is achieved by providing a CSV file with a sidecar JSON Tabular Data Resource that describes the schema and field types.

@gannebamm pointed us to the new geonode-importer module, and we were wondering if this would be a good fit for ingesting the data. It looks to be designed in a way that allows the addition of custom/new handlers. Do you think it would fit our purpose?
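
A minimal Tabular Data Resource of the kind described above could look roughly like the sketch below. The file names, fields, and sample values are invented for illustration; the Frictionless specs define the authoritative shape of the descriptor:

```python
import csv
import json
from pathlib import Path

# Hypothetical sample data standing in for an uploaded file.
csv_path = Path("laboratory_data.csv")
csv_path.write_text("site_id,ph,organic_carbon\n1,6.2,1.8\n2,5.4,2.3\n")

# Sidecar descriptor: declares field names and types so an importer
# does not have to guess column types from the CSV content.
descriptor = {
    "name": "laboratory_data",
    "path": csv_path.name,
    "profile": "tabular-data-resource",
    "schema": {
        "fields": [
            {"name": "site_id", "type": "integer"},
            {"name": "ph", "type": "number"},
            {"name": "organic_carbon", "type": "number"},
        ],
        "primaryKey": "site_id",
    },
}
Path("laboratory_data.resource.json").write_text(json.dumps(descriptor, indent=2))

# A handler can then map the declared types to database column types:
with csv_path.open() as f:
    rows = list(csv.DictReader(f))
types = {field["name"]: field["type"] for field in descriptor["schema"]["fields"]}
```

The benefit over a bare CSV upload is that the schema (types, keys) travels with the data, so the dynamic model can be created without type inference.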

@giohappy
Contributor

giohappy commented Dec 7, 2022

Dear @matthesrieke sorry for the late reply.

First of all, a GNIP for the new importer is on its way; we want to make it a community module asap. At the moment it's hosted under GeoSolutions' own repo.

As you note, the new importer lets you implement specific handlers, and it can assume complete control of the lifecycle of a resource. For example, the handler is in charge of doing any housekeeping when a resource backed by specific data and tables is deleted.
@mattiagiupponi can tell you much more about it since he's the module's author.

So, the primary use case here is to map a GeoNode resource to an external DB. If we generalize this, I'd say that this case isn't strictly related to non-spatial datasets. In our vision, a non-spatial dataset could still be served by GeoServer; that way we can benefit from all the services and WFS-based client tools that we already have. They can work for non-spatial data too.
So, to follow this through, we have two "dimensions" here:

  • implement support for non-spatial datasets on top of the existing tools (here we don't care where the data is located)
  • implement support for alternative DBs to geonode_data (this can work both for spatial and non-spatial datasets)

IMHO we should agree on the first point first, which is the subject of this GNIP.
I'm a bit concerned with creating new data models and services. We can improve the current ones (always with back compatibility in mind!), but I'd try our best to avoid adding complexity.

@gannebamm
Contributor Author

@giohappy

I am not sure if I understand the two dimensions stated.

on top of the existing tools (here we don't care where the data is located)

We care where the data is located and would like to ingest it into the PostgreSQL backend for later use.

In our vision, a non-spatial dataset could still be served by Geoserver, that way we can benefit from all the services and WFS-based client tools that we already have.

I think that, because this is not the intended way to use WFS, tools like QGIS are likely to fail to understand non-spatial data served via WFS. Did anyone test this successfully?

@afabiani @matthesrieke @t-book @francbartoli ->

Maybe we should schedule a talk to discuss this? It is getting quite complex, and I think it would help to dig deep into the pros and cons of the possible approaches and define our needs. Maybe @mattiagiupponi can provide a short intro into the importer and the non-spatial data serving capabilities of GeoServer, and @matthesrieke can describe the prototype he developed to test the approach. In the end, less complexity is always welcome.

If other community developers are interested in coming by, I can host a public meeting. Scheduling this will be rough, though.
What do you think?

@giohappy
Contributor

@gannebamm

We care where the data is located and would like to ingest it into the PostgreSQL backend for later use.

I'm not saying that this isn't relevant. My point is to distinguish the two requirements:

  1. Using an external DB, which isn't strictly related to spatial/non-spatial tables. Notice that you can already publish resources from an external DB (we do it frequently), although there isn't a tool for end users to configure it. At the moment you have to configure a new GeoServer store and then publish the layer on GeoNode, e.g. with the updatelayers command. However, it should be quite trivial to do this with the new importer.
  2. Publishing non-spatial tables. From my experience, QGIS plays nicely with non-spatial WFS feature types. I did a quick test to confirm this. The images you see below are the JSON output from WFS and the same table loaded in QGIS. I'm pushing this solution because it comes (almost) for free.
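
For reference, registering an external PostGIS store (point 1) can be scripted against the GeoServer REST API today. The sketch below only builds the XML request body; the connection details are invented, and the actual POST is left as a comment:

```python
from xml.sax.saxutils import escape


def build_postgis_datastore_xml(
    name, host, port, database, user, password, schema="public"
):
    """Build the request body for POST
    /geoserver/rest/workspaces/{workspace}/datastores, which registers an
    external PostGIS database as a new GeoServer store."""
    params = {
        "dbtype": "postgis",
        "host": host,
        "port": str(port),
        "database": database,
        "schema": schema,
        "user": user,
        "passwd": password,
    }
    entries = "".join(
        f'<entry key="{escape(key)}">{escape(value)}</entry>'
        for key, value in params.items()
    )
    return (
        f"<dataStore><name>{escape(name)}</name>"
        f"<connectionParameters>{entries}</connectionParameters></dataStore>"
    )


# Invented connection details, for illustration only:
xml = build_postgis_datastore_xml(
    "research_db", "db.example.org", 5432, "soils", "geonode", "secret"
)
# POST `xml` with Content-Type application/xml to the datastores endpoint,
# then publish the table as a layer and run `python manage.py updatelayers`.
```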

We're happy to discuss this in a call.

WFS non-spatial table JSON output
[image]

WFS non-spatial table loaded in QGIS
[image]

@gannebamm
Contributor Author

@giohappy @afabiani @matthesrieke @t-book @mattiagiupponi (and everyone else interested!)
I would like to schedule a web session to talk about the further development of this GNIP with you. I can provide a WebEx room.

There are some open slots next week for me. Please fill out this poll: https://terminplaner4.dfn.de/FOKIDXEtIVBq8sQB

@t-book
Contributor

t-book commented Jan 11, 2023

@gannebamm It looks like today is the winner? Is the meeting happening?

@gannebamm
Contributor Author

gannebamm commented Jan 11, 2023

@t-book @afabiani @giohappy thanks for the quick replies.

I asked the 52N crew if one person is enough on their side. They should answer soon. I will provide the video conference room info by mail.

@mattiagiupponi
Contributor

Hi @gannebamm
I'm sorry that I was not able to join the meeting yesterday, but @giohappy gave me an update.
I just created this small doc describing the default handler structure the importer expects to be available:
https://github.com/geosolutions-it/geonode-importer/tree/master/importer/handlers
Feel free to open new issues on the importer if something is not clear.

@matthesrieke
Contributor

Hi @mattiagiupponi thanks for adding the documentation, we will take a closer look. Would you be available for a short meeting to discuss possible technical approaches? Maybe this Thursday between 10-12am? You could also reply to me by mail (m.rieke@52north.org), so we take this offline from this thread.

@mattiagiupponi
Contributor

Hi @mattiagiupponi thanks for adding the documentation, we will take a closer look. Would you be available for a short meeting to discuss possible technical approaches? Maybe this Thursday between 10-12am? You could also reply to me by mail (m.rieke@52north.org), so we take this offline from this thread.

Hi @matthesrieke
10 am works for me, for about one hour.
I'll keep this here (for now) so we can see if someone else is interested in joining the meeting.
If that works for you, I'll send you an invitation for the meeting via email.

@matthesrieke
Contributor

Thanks @mattiagiupponi! Yes, 10 am tomorrow is fine for me. I will be joined by @autermann and @ridoo.

@gannebamm
Contributor Author

@mattiagiupponi I would like to attend, too.

@giohappy
Contributor

@matthesrieke @gannebamm we're planning to complete the transition to the new importer very soon, and make it the default importer in 4.1.x.

As you know, the new importer misses the CSV handler. We were waiting to implement a solution to replace the upload steps we have now, where the lat/lon column can be selected at upload time.
We cannot afford to implement a new UI for the custom selection of columns, so our proposal would be the following:

  • preconfigure the CSV handler with OGR X_POSSIBLE_NAMES="x,lon*" and Y_POSSIBLE_NAMES="y,lat*" options
  • accept a companion "*.csvt" file, as supported by the OGR CSV driver

This solution would provide an alternative that's not too expensive or complex to implement, and it gives us the opportunity to remove the current upload system (at the moment it's still required only for CSV files).

I'm not against the solution based on the Tabular Data Resource and VSI.
I think all these options could coexist, letting the handler pick the best one depending on the provided files, with the X_POSSIBLE_NAMES and Y_POSSIBLE_NAMES preconfigurations as a fallback.

What's your opinion?
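
As a rough illustration of what the proposed preconfiguration does: `X_POSSIBLE_NAMES`/`Y_POSSIBLE_NAMES` are comma-separated glob patterns matched against column names. The matching idea can be mimicked in a few lines of pure Python; this is a sketch of the concept, not the actual OGR CSV driver logic, and the sample header is invented:

```python
from fnmatch import fnmatch
from typing import List, Optional

# The preconfigured patterns proposed above.
X_POSSIBLE_NAMES = "x,lon*"
Y_POSSIBLE_NAMES = "y,lat*"


def find_coordinate_column(header: List[str], patterns: str) -> Optional[str]:
    """Return the first column whose name matches one of the comma-separated
    glob patterns (matched case-insensitively here)."""
    for pattern in patterns.split(","):
        for column in header:
            if fnmatch(column.lower(), pattern.lower()):
                return column
    return None


header = ["site_id", "Longitude", "Latitude", "ph"]
x_col = find_coordinate_column(header, X_POSSIBLE_NAMES)  # "Longitude"
y_col = find_coordinate_column(header, Y_POSSIBLE_NAMES)  # "Latitude"
```

If neither pattern matches anything, the file would be a candidate for treatment as a non-spatial table.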

@gannebamm
Contributor Author

@matthesrieke please take a look at @giohappy's comment. I do not see that as an issue. We would have two importer handlers: one for geospatial CSV and one dedicated to non-spatial CSV with TDR/VSI. What do you think?

@ridoo
Contributor

ridoo commented Mar 1, 2023

@gannebamm @giohappy We also see no problem. Both solutions can coexist and serve different use cases (one for simple CSV uploads and one for whole data packages). I like the "csvt solution" as well; quite pragmatic. How would you communicate the configured name patterns for geometry columns to the user?

@giohappy
Contributor

giohappy commented Mar 1, 2023

@ridoo @gannebamm unfortunately our experiments with the CSV driver options and the csvt file didn't give the expected results. Apparently, the OGR Python API does not take them into account, and this is problematic since we leverage the API to extract schema information and prepare the dynamic models.

For the moment we have implemented the basic solution, where only a fixed set of column names are recognized. There's a PR ready for an internal review, but if you want to take a look and suggest improvements you're welcome! GeoNode/geonode-importer#157

@ridoo
Contributor

ridoo commented Mar 2, 2023

@giohappy @mattiagiupponi That is a pity to read. I only played around with it on the CLI, so I cannot tell much more on this.

With our datapackage.json approach, we are now able to import CSV data as described in the tabular data descriptor. However, there are still issues to solve:

  • Pass a "fake" SLD file (see first comment below)
  • Add a tabular subtype (see second comment below)
  • Create a simple UI preview (instead of showing an empty map on the detail page)

Comment 1: I see that the style of a layer is mandatory during upload. To my current understanding, I get errors either from the rest_framework (requiring an SLD) and/or from geonode.geoserver.helpers.get_sld_for(), which tries to get a default style for the layer's name. For now, I am passing a minimal style file, which becomes available under "Styles" in GeoServer but does not appear in the GWC. I can see GeoServer throwing an exception:

02 Mar 15:52:59 DEBUG [geoserver.monitor] - Testing /gwc/rest/layers/geonode:laboratory_data.xml for monitor filtering
geoserver4thuenen_atlas  | 02 Mar 15:52:59 DEBUG [geoserver.monitor] - /geoserver/gwc/rest/layers/geonode:laboratory_data.xml was filtered from monitoring
geoserver4thuenen_atlas  | 02 Mar 15:52:59 ERROR [geoserver.rest] - Unknown layer: geonode:laboratory_data
geoserver4thuenen_atlas  | org.geowebcache.rest.exception.RestException 404 NOT_FOUND: Unknown layer: geonode:laboratory_data
geoserver4thuenen_atlas  |      at org.geowebcache.rest.controller.GWCController.findTileLayer(GWCController.java:45)
geoserver4thuenen_atlas  |      at org.geowebcache.rest.controller.TileLayerController.layerGet(TileLayerController.java:70)

For now I can ignore this, but for the future it would be nice to have a less hackish way of introducing tabular data.


Update: The error happens when GeoNode tries to invalidate the GWC. GeoServer does not know the resource; it logs a 404 but actually returns a 500 (you can see it in the browser logs). This makes GeoNode throw an exception ("too many 500 error responses").


Do you think this is the right way to pass a fake SLD file along with the upload of non-spatial/tabular data?
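
For context, a "minimal style file" of the kind mentioned above could be a tiny but well-formed SLD that renders nothing meaningful. The sketch below is one possible shape (the layer name matches the table from the log; whether GeoServer fully ignores such a placeholder style is an assumption, not something this document confirms):

```python
import xml.etree.ElementTree as ET

# A minimal, well-formed SLD 1.0 document used only as a placeholder.
MINIMAL_SLD = """<?xml version="1.0" encoding="UTF-8"?>
<StyledLayerDescriptor version="1.0.0"
    xmlns="http://www.opengis.net/sld">
  <NamedLayer>
    <Name>laboratory_data</Name>
    <UserStyle>
      <Title>Placeholder style for a non-spatial table</Title>
      <FeatureTypeStyle>
        <Rule>
          <PointSymbolizer/>
        </Rule>
      </FeatureTypeStyle>
    </UserStyle>
  </NamedLayer>
</StyledLayerDescriptor>
"""

# Sanity-check that the document parses before handing it to the uploader.
root = ET.fromstring(MINIMAL_SLD.encode("utf-8"))
assert root.tag.endswith("StyledLayerDescriptor")
```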

Comment 2: During upload, the non-spatial/tabular data becomes of type VECTOR. Calling http://localhost/geoserver/rest/layers/laboratory_data.xml gives me

<layer>
  <name>laboratory_data</name>
  <type>VECTOR</type>
  <resource class="featureType">
    <name>geonode:laboratory_data</name>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="http://localhost/geoserver/rest/workspaces/geonode/datastores/geonode_data/featuretypes/laboratory_data.xml" type="application/xml"/>
  </resource>
  <attribution>
    <logoWidth>0</logoWidth>
    <logoHeight>0</logoHeight>
  </attribution>
  <dateCreated>2023-03-02 15:49:59.20 UTC</dateCreated>
</layer>

It seems that geonode.geoserver.helpers.sync_instance_with_geoserver() maps subtype=dataStore to vector and just overrides the subtype=tabular of my instance. What do you think would be the best location to make adjustments to add a tabular type?

The PR looks OK at first glance (I could not spend too much time on it, though).

@mattiagiupponi
Contributor

mattiagiupponi commented Mar 3, 2023

Hi @ridoo
By default, an SLD style is never mandatory during the import phase, neither for the geonode-importer nor for the legacy upload system.
GeoNode always tries to create a default one by nature, which is why you get that error.

A possible approach is to use the custom_resource_manager provided by the importer.

This manager is meant to override the default one to exclude the common communication with GeoServer during the create/copy/update phases of the resource. I guess in your case you also have to override the create method so that GeoNode does not try to create the SLD style, by adding something like this:

def create(self, uuid, **kwargs) -> ResourceBase:
    return ResourceBase.objects.get(uuid=uuid)

NOTE: the layer in GeoServer should (as always) be imported and published by the previous step, importer.publish_resource. With the other handlers we let this be done by the default manager, since it syncs the GeoNode resource with the one in GeoServer.

Then override the handler's create_geonode_resource function to use the custom resource manager instead of the default one, with something like:

def create_geonode_resource(
    self, layer_name: str, alternate: str, execution_id: str, resource_type: Dataset = Dataset, files=None
):
   .......
    saved_dataset = custom_resource_manager.create(
        None,
        resource_type=resource_type,
        defaults=dict(
            name=alternate,
            workspace=workspace,
            subtype="raster",
            alternate=f"{workspace}:{alternate}",
            dirty_state=True,
            title=layer_name,
            owner=_exec.user,
            files=list(set(list(_exec.input_params.get("files", {}).values()) or list(files))),
        ),
    )

   .......
    return saved_dataset

Related to the second comment: as I'm sure we discussed, for now GeoNode is not ready to handle non-spatial resources, and it will require some work to enable it.
@giohappy can surely give you more hints on it.

@ridoo
Contributor

ridoo commented Mar 6, 2023

@mattiagiupponi thanks for the hint, I will bypass the importer's resource_manager and use my own.

Yes, we have talked about the limitations regarding non-spatial/tabular data in GeoNode. However, I was unsure if you had further thoughts about possible pitfalls and/or ideas to overcome them :).

@gannebamm
Contributor Author

@giohappy @mattiagiupponi
Since the GeoNode 4.1 release was postponed but is likely imminent, shall we try to get this new feature into the upcoming 4.1 release?

@giohappy
Contributor

giohappy commented May 8, 2023

Since the GeoNode 4.1 release was postponed but is likely imminent, shall we try to get this new feature into the upcoming 4.1 release?

@gannebamm I'm a bit lost. I don't see a PR connected to this issue, and I'm not sure if a solution has been implemented for the presentation of non-spatial datasets.

@gannebamm
Contributor Author

@ridoo Giovanni is correct. Didn't we create a PR somewhere for this feature?

@ridoo
Contributor

ridoo commented May 9, 2023

@gannebamm we did create PR #10842, which was needed to keep all unpacked files from an uploaded zip file. However, the actual work to support non-spatial (tabular) data is a bit distributed:

@ridoo
Contributor

ridoo commented Jun 16, 2023

edited by gannebamm: translated to English

After integration with 4.1.0, the following Geoserver/GeoWebCache error is thrown when publishing:

celery4thuenen_atlas_41x     | Requesting GeoFence rules on resource "geonode:horizon_data" :: Dataset
geoserver4thuenen_atlas_41x  | 13-Jun-2023 06:22:51.011 SEVERE [http-nio-8080-exec-1] org.geowebcache.rest.controller.RestExceptionHandler.handleRestException Unknown layer: geonode:horizon_data
geoserver4thuenen_atlas_41x  |  org.geowebcache.rest.exception.RestException 404 NOT_FOUND: Unknown layer: geonode:horizon_data
geoserver4thuenen_atlas_41x  |          at org.geowebcache.rest.controller.GWCController.findTileLayer(GWCController.java:45)
geoserver4thuenen_atlas_41x  |          at org.geowebcache.rest.controller.TileLayerController.layerGet(TileLayerController.java:70)
geoserver4thuenen_atlas_41x  |          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
geoserver4thuenen_atlas_41x  |          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
geoserver4thuenen_atlas_41x  |          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
geoserver4thuenen_atlas_41x  |          at java.base/java.lang.reflect.Method.invoke(Method.java:566)

..
[ Cut for brevity]
..

celery4thuenen_atlas_41x     | Pushing 10 changes into GeoFence for resource horizon_data

The records are displayed, but the preview (GetFeatureRequest) no longer works. I will investigate further.

@ridoo
Contributor

ridoo commented Jun 19, 2023

The tabular preview works again after re-adding the tabular_viewer in localConfig (it seems it was removed while merging 4.1.0).

It also appeared that there was some orphaned code handling a faked thumbnail, which caused an "action not implemented, yet" error, as the ROLLBACK action was not yet listed in the actions list of the datapackage handler.

To resolve the error, I added the ROLLBACK action (which is handled by the vector handler) and removed the orphaned code. Now everything works as before.

I am not sure what actually caused the GeoWebCache exception.
