fit_*_random_forest : allow vector-cube column selection #341

Open
clausmichele opened this issue Mar 15, 2022 · 16 comments

@clausmichele
Member

After a discussion with @ValentinaHutter @LukeWeidenwalker and @mattia6690 we concluded that:

  • The fit_*_random_forest processes have vector-cubes as input data types
  • The process should let the user select which property (or column, in general database terms) of the vector-cube holds the required data (@jdries already mentioned this in the last dev meeting)
  • This is required if the target vector-cube is not the output of aggregate_spatial but comes from a file (as is the case for UC8). Sorry if this didn't come up earlier, but I wasn't fully aware of what the target vector-cube should look like.

@m-mohr, how would you add the column selection to the process? As I picture it, a vector-cube is a table in which only the geometry column is mandatory and every other column is optional. We would like to select one of those optional columns.

For example, from this vector-cube we could select the class field from the GeoJSON properties: https://raw.githubusercontent.com/clausmichele/openeo_aggregate_spatial_vector_cubes/master/urban_forest_points.geojson
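A minimal sketch of what such a column selection could look like on a GeoJSON FeatureCollection. The helper name `select_property` is hypothetical, the inline features are a stand-in for the linked file (the string labels "urban" and "forest" and the `id` property are invented for illustration), and nothing here is part of an openEO process definition.

```python
import json

# Inline stand-in for a GeoJSON file. The coordinates come from the table in
# this thread; the class labels and the "id" property are invented.
GEOJSON_TEXT = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [11.12157, 46.06955]},
     "properties": {"class": "urban", "id": 0}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [11.14474, 46.13561]},
     "properties": {"class": "forest", "id": 100}}
  ]
}
"""


def select_property(feature_collection, name):
    """Return (geometry, value) pairs for one property of every feature."""
    pairs = []
    for feature in feature_collection["features"]:
        properties = feature.get("properties") or {}
        if name not in properties:
            raise KeyError(f"feature has no property {name!r}")
        pairs.append((feature["geometry"], properties[name]))
    return pairs


collection = json.loads(GEOJSON_TEXT)
targets = select_property(collection, "class")
labels = [value for _, value in targets]
print(labels)
```

Only the geometry and the selected property survive the selection; every other property is dropped, which is the behaviour a column-selection parameter on the fit_* processes would need to specify.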

@m-mohr
Member

m-mohr commented Mar 15, 2022

I'm also not 100% sure yet how we handle properties in such cases. In theory, such a property could become a dimension, or it could remain in the features, which would require additional processes.

Regardless of what the vector cube looks like, we likely need to add a new parameter for this use case to the fit_* processes.

@clausmichele
Member Author

clausmichele commented Mar 15, 2022

This is, for example, my version of a vector-cube: a table with the geometry field plus some other columns. This would be the output of aggregate_spatial, for instance:

| idx | class | geometry | result | result_meta |
|---|---|---|---|---|
| 0 | 0 | POINT (11.12157 46.06955) | {'B04_10m': 1522.0, 'B08_10m': 2127.5} | {'total_count': 2.0, 'valid_count': 2.0} |
| 1 | 0 | POINT (11.12171 46.06909) | {'B04_10m': 1121.5, 'B08_10m': 1547.0} | {'total_count': 2.0, 'valid_count': 2.0} |
| 2 | 0 | POINT (11.12324 46.06875) | {'B04_10m': 1045.5, 'B08_10m': 1406.0} | {'total_count': 2.0, 'valid_count': 2.0} |
| 3 | 0 | POINT (11.12326 46.06900) | {'B04_10m': 1093.5, 'B08_10m': 1497.5} | {'total_count': 2.0, 'valid_count': 2.0} |
| 4 | 0 | POINT (11.12370 46.06757) | {'B04_10m': 1455.0, 'B08_10m': 2059.5} | {'total_count': 2.0, 'valid_count': 2.0} |
| ... | ... | ... | ... | ... |
| 100 | 1 | POINT (11.14474 46.13561) | {'B04_10m': 812.5, 'B08_10m': 3255.5} | {'total_count': 2.0, 'valid_count': 2.0} |
| 101 | 1 | POINT (11.15225 46.13465) | {'B04_10m': 614.0, 'B08_10m': 2137.5} | {'total_count': 2.0, 'valid_count': 2.0} |
| 102 | 1 | POINT (11.15215 46.13448) | {'B04_10m': 559.5, 'B08_10m': 2053.0} | {'total_count': 2.0, 'valid_count': 2.0} |
| 103 | 1 | POINT (11.15340 46.13687) | {'B04_10m': 549.5, 'B08_10m': 2001.0} | {'total_count': 2.0, 'valid_count': 2.0} |
| 104 | 1 | POINT (11.15475 46.13695) | {'B04_10m': 596.0, 'B08_10m': 1806.5} | {'total_count': 2.0, 'valid_count': 2.0} |

@m-mohr m-mohr added this to the 1.3.0 milestone Mar 15, 2022
@m-mohr m-mohr added the vector label Mar 15, 2022
@edzer
Member

edzer commented Mar 15, 2022

> The process should let the user select which property (or column, in general database terms) of the vector-cube holds the required data (@jdries already mentioned this in the last dev meeting)

In our definition of a data cube, cell values are scalars (see here, sixth paragraph): we do not allow multiple variables (or attributes) at each combination of dimension values (cell = scalar). This should also hold for vector data cubes. So, before you can call the data defined by the GeoJSON file above a data cube, you first need to specify which field holds the data cube values (cell values), which should be class. You then end up with a one-dimensional vector data cube with the geometries (points) as dimension values and the class values as cell values. Hence you no longer need to select an attribute.

If the GeoJSON file above had four attributes, B1, B2, B3, B4, they could contain the values of a second dimension, band, with dimension values B1,...,B4, and be imported as a vector data cube with dimensions npoints x 4. Each combination (POINT, band) would give a single, scalar value.
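The two-dimensional case described above can be sketched in plain Python as follows. The attribute names B1 to B4 follow the comment; the numeric values, the third point's use as an example row, and the helper `cell` are made up for illustration and are not part of any openEO implementation.

```python
# Hypothetical point features with four band attributes each (values invented).
features = [
    {"geometry": "POINT (11.12157 46.06955)", "B1": 0.1, "B2": 0.2, "B3": 0.3, "B4": 0.4},
    {"geometry": "POINT (11.12171 46.06909)", "B1": 0.5, "B2": 0.6, "B3": 0.7, "B4": 0.8},
    {"geometry": "POINT (11.12324 46.06875)", "B1": 0.9, "B2": 1.0, "B3": 1.1, "B4": 1.2},
]

# Dimension labels: one label per geometry, one label per band.
band_labels = ["B1", "B2", "B3", "B4"]
geometry_labels = [feature["geometry"] for feature in features]

# Cell values: an npoints x 4 nested list, one scalar per (geometry, band) pair.
values = [[feature[band] for band in band_labels] for feature in features]
shape = (len(geometry_labels), len(band_labels))


def cell(geometry, band):
    """Look up the single scalar stored at one (geometry, band) combination."""
    return values[geometry_labels.index(geometry)][band_labels.index(band)]


print(shape)
print(cell("POINT (11.12171 46.06909)", "B2"))
```

The attribute names have moved out of the cells and into the labels of the band dimension, so every cell holds one value of one type.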

In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.

@m-mohr
Member

m-mohr commented Mar 15, 2022

> In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.

I assume "geojson file" was meant to refer to the table that Michele posted here? #341 (comment)

@clausmichele
Member Author

But then what should the output of aggregate_spatial look like if the input has dimensions x, y, bands or x, y, time? I couldn't find any other way to satisfy the requirements of the process, since we need to create a single new target dimension (not multiple) that can hold values from different bands or timesteps.
https://processes.openeo.org/#aggregate_spatial

@m-mohr
Member

m-mohr commented Mar 15, 2022

As far as I understand it, the definition of aggregate_spatial simply doesn't work with the definition that @edzer proposes.

@clausmichele
Member Author

Well, we will keep this internal representation of the vector-cube for now, since we need it for UC8. Once we have a clearer definition, we can modify it.

@m-mohr
Member

m-mohr commented Mar 15, 2022

@clausmichele I guess it would be better to flatten result and result_meta? Does that make sense?

If I understood @edzer correctly, then the data cube may look like this:

Dimensions:

  • geometry with labels: POINT (11.12157 46.06955), POINT (11.12171 46.06909), ...
  • properties (or result to follow the default target dimension in aggregate_spatial) with labels: class, B04_10m, B08_10m, total_count, valid_count

| geometry v \ properties > | class | B04_10m | B08_10m | total_count | valid_count |
|---|---|---|---|---|---|
| POINT (11.12157 46.06955) | 0 | 1522.0 | 2127.5 | 2 | 2 |
| POINT (11.12171 46.06909) | 0 | 1121.5 | 1547.0 | 2 | 2 |
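A rough sketch of the flattening suggested above, assuming the nested table is held as a list of Python dicts. The two rows are copied from the table posted earlier in the thread; the function name `flatten` and its `nested_keys` parameter are hypothetical.

```python
# Two rows in the nested layout (result and result_meta hold dicts).
rows = [
    {"class": 0, "geometry": "POINT (11.12157 46.06955)",
     "result": {"B04_10m": 1522.0, "B08_10m": 2127.5},
     "result_meta": {"total_count": 2.0, "valid_count": 2.0}},
    {"class": 0, "geometry": "POINT (11.12171 46.06909)",
     "result": {"B04_10m": 1121.5, "B08_10m": 1547.0},
     "result_meta": {"total_count": 2.0, "valid_count": 2.0}},
]


def flatten(row, nested_keys=("result", "result_meta")):
    """Lift the entries of the nested dict columns into top-level columns."""
    flat = {key: value for key, value in row.items() if key not in nested_keys}
    for key in nested_keys:
        for name, value in row.get(key, {}).items():
            if name in flat:
                raise ValueError(f"column {name!r} would be overwritten")
            flat[name] = value
    return flat


flat_rows = [flatten(row) for row in rows]

# Everything except the geometry becomes a label of the second dimension.
property_labels = [key for key in flat_rows[0] if key != "geometry"]
print(property_labels)
```

The resulting labels are class, B04_10m, B08_10m, total_count and valid_count, matching the properties dimension listed above. The flattening itself does nothing about the mixed meaning of those columns.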

We'd need to fix the description of the returned vector cube in aggregate_spatial then.

@clausmichele
Member Author

Fine for me! I've just discussed with @ValentinaHutter and she was also implementing it like this.

@m-mohr
Member

m-mohr commented Mar 15, 2022

Although it actually seems that total_count and valid_count should be per band in this case? aggregate_spatial says per geometry, but this seems pretty useless. Then the class is somewhat getting in the way...

@clausmichele
Member Author

I've computed it per geometry -> 1 point selected with two bands -> two valid pixels.
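To make the difference between the two counting conventions concrete, here is a small sketch. It assumes the pixels selected by one geometry are stored per band as a list, with `None` marking an invalid pixel; that layout, the function names and the second example are assumptions made for illustration, not the behaviour specified by aggregate_spatial.

```python
def counts_per_band(pixels_by_band):
    """Count total and valid pixels separately for every band."""
    counts = {}
    for band, values in pixels_by_band.items():
        counts[band] = {
            "total_count": len(values),
            "valid_count": sum(1 for value in values if value is not None),
        }
    return counts


def counts_per_geometry(pixels_by_band):
    """Sum the per-band counts into one pair of counts for the geometry."""
    per_band = counts_per_band(pixels_by_band)
    return {
        "total_count": sum(c["total_count"] for c in per_band.values()),
        "valid_count": sum(c["valid_count"] for c in per_band.values()),
    }


# One point selecting one pixel in each of two bands.
one_point = {"B04_10m": [1522.0], "B08_10m": [2127.5]}
print(counts_per_geometry(one_point))
print(counts_per_band(one_point))

# Same point, but with an invalid pixel in the second band (hypothetical).
with_gap = {"B04_10m": [1522.0], "B08_10m": [None]}
print(counts_per_geometry(with_gap))
```

With per-geometry counts, a missing pixel in one band only lowers the summed valid_count, so it is no longer possible to tell which band was affected.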

@ValentinaHutter

> Fine for me! I've just discussed with @ValentinaHutter and she was also implementing it like this.

Yes, to make fit_regr_random_forest work at EODC, it made more sense for us to have a separate column for every band. For now I use a predictors_vars parameter, which specifies the bands used as predictors, and a target_var parameter, which specifies the band used as the target. predictors_vars would be a list (for example ['B04', 'B08']) and target_var would be a string (for example 'ndvi', if that's the name of the band). Of course this is just for testing, and I will update the parameters later.
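A sketch of how the two parameters could split a flat table into predictors and target. The parameter names predictors_vars and target_var come from the comment above; the function `split_table`, the row layout and the way the ndvi column is computed here are assumptions for illustration and do not reflect the EODC implementation.

```python
def split_table(table, predictors_vars, target_var):
    """Split rows (dicts) into a predictor matrix X and a target vector y."""
    wanted = [*predictors_vars, target_var]
    for row in table:
        missing = [name for name in wanted if name not in row]
        if missing:
            raise KeyError(f"row is missing columns: {missing}")
    X = [[row[name] for name in predictors_vars] for row in table]
    y = [row[target_var] for row in table]
    return X, y


def with_ndvi(row):
    """Add an ndvi column computed from the red (B04) and near-infrared (B08) bands."""
    red, nir = row["B04"], row["B08"]
    return {**row, "ndvi": (nir - red) / (nir + red)}


# Band values taken from the first two rows of the table in this thread.
table = [
    with_ndvi({"geometry": "POINT (11.12157 46.06955)", "B04": 1522.0, "B08": 2127.5}),
    with_ndvi({"geometry": "POINT (11.12171 46.06909)", "B04": 1121.5, "B08": 1547.0}),
]

X, y = split_table(table, predictors_vars=["B04", "B08"], target_var="ndvi")
print(X)
print(y)
```

X and y are the two inputs a random forest regressor would be fitted on; the geometry column is carried along in the table but plays no part in the fit.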

@soxofaan
Member

soxofaan commented Mar 23, 2022

> If I understood @edzer correctly, then the data cube may look like this:
> ...

I think there is a problem here as the "class" column is a string in the example that @clausmichele posted (https://raw.githubusercontent.com/clausmichele/openeo_aggregate_spatial_vector_cubes/master/urban_forest_points.geojson), which is probably not unusual in practice. And theoretically there is also a bit of conflict between the float aggregation columns and integer count columns.
That's what @edzer was noting too:

> In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.

@m-mohr
Member

m-mohr commented Mar 23, 2022

Yes, that's what I'm also struggling with right now in general. We need more discussions on this.

@clausmichele
Member Author

Well, if the string is the issue, it can easily be converted into a number.

@soxofaan
Member

soxofaan commented Mar 23, 2022

> Well, if the string is the issue, it can easily be converted into a number.

In this example it's probably easy. But I don't think you can do that "easily" or automatically in general.
I think it indicates that there is something conceptually wrong in how we define/handle vector cubes.

You could also theoretically argue that mixing the float aggregation columns with integer count columns in the same "cube" is bad practice.
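For reference, the string-to-number conversion under discussion could look like the sketch below. The helper `encode_labels` and the example labels are hypothetical, and the approach is ordinary label encoding rather than anything defined by openEO.

```python
def encode_labels(values):
    """Map each distinct label to an integer code in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


# Invented class labels for five points.
labels = ["urban", "forest", "urban", "forest", "forest"]
codes, mapping = encode_labels(labels)

# The mapping has to be kept to turn predictions back into labels.
inverse = {code: label for label, code in mapping.items()}
decoded = [inverse[code] for code in codes]

print(codes)
print(mapping)
```

The codes depend on the order in which labels appear and imply an ordering the classes do not have, and the mapping must travel with the data. This is one reason the conversion is hard to do automatically in general.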

@m-mohr m-mohr modified the milestones: 2.0.0, 2.1.0 Mar 10, 2023