`fit_*_random_forest` : allow vector-cube column selection #341

clausmichele · 2022-03-15T08:00:29Z

After a discussion with @ValentinaHutter @LukeWeidenwalker and @mattia6690 we concluded that:

- The fit_*_random_forest processes have vector-cubes as input data types.
- The process should allow the user to select which property (or column, in general database terms) of the vector-cube holds the required data (@jdries mentioned this already in the last dev meeting).
- This is required if the target vector-cube is not the output of aggregate_spatial but comes from a file (as is the case for UC8). Sorry that this didn't come up earlier, but I wasn't fully aware of what the target vector-cube should look like.

@m-mohr how would you add the column selection to the process? My idea of a vector-cube is a table that must have the geometry column; all the rest is optional. We would like to select one column among those optional ones.

For example, from this vector-cube we could select the class field from the GeoJSON properties: https://raw.githubusercontent.com/clausmichele/openeo_aggregate_spatial_vector_cubes/master/urban_forest_points.geojson
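
As a rough illustration of the column selection described above, here is a minimal Python sketch that pulls one property out of every feature of a GeoJSON FeatureCollection. The inline features are made-up stand-ins (not the contents of the linked file), and `select_property` is a hypothetical helper, not an existing openEO process or parameter.

```python
import json

# Hypothetical stand-in for a GeoJSON FeatureCollection (not the linked file's content).
GEOJSON_TEXT = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [11.12157, 46.06955]},
     "properties": {"class": 0, "note": "a"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [11.14474, 46.13561]},
     "properties": {"class": 1, "note": "b"}}
  ]
}
"""


def select_property(feature_collection, name):
    """Return (geometry, value) pairs for one property of every feature."""
    pairs = []
    for feature in feature_collection["features"]:
        properties = feature.get("properties") or {}
        if name not in properties:
            raise KeyError(f"feature has no property {name!r}")
        pairs.append((feature["geometry"], properties[name]))
    return pairs


collection = json.loads(GEOJSON_TEXT)
selected = select_property(collection, "class")
labels = [value for _, value in selected]
print(labels)
```

The geometry is kept next to each selected value so that the result can still be matched against the predictors by geometry.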

m-mohr · 2022-03-15T11:14:21Z

I'm also not 100% sure yet how we handle properties in such cases. In theory, such a property could be a dimension or it still resides in the features, which would require additional processes.

Regardless of what the vector cube looks like, we likely need to add a new parameter for this use case to the fit_* processes.

clausmichele · 2022-03-15T11:35:33Z

This is, for example, my version of a vector-cube: a table with the geometry field plus some other columns. This would be the output of aggregate_spatial, for instance:

| idx | class | geometry | result | result_meta |
| --- | --- | --- | --- | --- |
| 0 | 0 | POINT (11.12157 46.06955) | {'B04_10m': 1522.0, 'B08_10m': 2127.5} | {'total_count': 2.0, 'valid_count': 2.0} |
| 1 | 0 | POINT (11.12171 46.06909) | {'B04_10m': 1121.5, 'B08_10m': 1547.0} | {'total_count': 2.0, 'valid_count': 2.0} |
| 2 | 0 | POINT (11.12324 46.06875) | {'B04_10m': 1045.5, 'B08_10m': 1406.0} | {'total_count': 2.0, 'valid_count': 2.0} |
| 3 | 0 | POINT (11.12326 46.06900) | {'B04_10m': 1093.5, 'B08_10m': 1497.5} | {'total_count': 2.0, 'valid_count': 2.0} |
| 4 | 0 | POINT (11.12370 46.06757) | {'B04_10m': 1455.0, 'B08_10m': 2059.5} | {'total_count': 2.0, 'valid_count': 2.0} |
| ... | ... | ... | ... | ... |
| 100 | 1 | POINT (11.14474 46.13561) | {'B04_10m': 812.5, 'B08_10m': 3255.5} | {'total_count': 2.0, 'valid_count': 2.0} |
| 101 | 1 | POINT (11.15225 46.13465) | {'B04_10m': 614.0, 'B08_10m': 2137.5} | {'total_count': 2.0, 'valid_count': 2.0} |
| 102 | 1 | POINT (11.15215 46.13448) | {'B04_10m': 559.5, 'B08_10m': 2053.0} | {'total_count': 2.0, 'valid_count': 2.0} |
| 103 | 1 | POINT (11.15340 46.13687) | {'B04_10m': 549.5, 'B08_10m': 2001.0} | {'total_count': 2.0, 'valid_count': 2.0} |
| 104 | 1 | POINT (11.15475 46.13695) | {'B04_10m': 596.0, 'B08_10m': 1806.5} | {'total_count': 2.0, 'valid_count': 2.0} |

edzer · 2022-03-15T15:13:22Z

> The process should allow the user to select which property (or column, in general database terms) of the vector-cube holds the required data (@jdries mentioned this already in the last dev meeting).

In our definition of a data cube, cell values are scalars (see here, sixth paragraph), we do not allow for multiple variables (or attributes) at each combination of dimension values (cell = scalar). For vector data cubes, this should also hold. So by defining the vector data cube from the geojson file above you need to specify first which field holds the data cube values (cell values), which should be class, before you can call it a data cube. Then you end up with a one-dimensional vector data cube with geometries (points) as dimension values, and the class values as cell values. And hence, then you don't need to select an attribute.

If the geojson file above had 4 attributes, B1, B2, B3, B4, they could contain the values of a second dimension, band, with dimension values B1,...,B4, and be imported as a vector data cube with dimensions npoints x 4. Each combination (POINT, band) would give a single, scalar value.

In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.
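
To make the "npoints x 4" case above concrete, here is a small Python sketch under stated assumptions: the features, the attribute values, and the dictionary layout of the cube are all made up for illustration and are not an openEO data structure. It only shows that each (geometry, band) combination resolves to exactly one scalar.

```python
# Hypothetical features with four numeric attributes B1..B4 (values are made up).
features = [
    {"geometry": "POINT (11.12157 46.06955)",
     "properties": {"B1": 1.0, "B2": 2.0, "B3": 3.0, "B4": 4.0}},
    {"geometry": "POINT (11.12171 46.06909)",
     "properties": {"B1": 5.0, "B2": 6.0, "B3": 7.0, "B4": 8.0}},
]


def to_vector_cube(features, band_names):
    """Build a 2-D cube (geometry x band) with one scalar per cell."""
    geometries = [feature["geometry"] for feature in features]
    values = [
        [float(feature["properties"][band]) for band in band_names]
        for feature in features
    ]
    return {
        "dims": ("geometry", "band"),
        "labels": {"geometry": geometries, "band": list(band_names)},
        "values": values,
    }


def cell(cube, geometry, band):
    """Look up the single scalar stored for one (geometry, band) combination."""
    i = cube["labels"]["geometry"].index(geometry)
    j = cube["labels"]["band"].index(band)
    return cube["values"][i][j]


cube = to_vector_cube(features, ["B1", "B2", "B3", "B4"])
shape = (len(cube["values"]), len(cube["values"][0]))
print(shape, cell(cube, "POINT (11.12171 46.06909)", "B3"))
```

The `float(...)` conversion is where attributes of different types would fail or lose meaning, which is the reason a mixed-type attribute table does not map cleanly onto a single dimension.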

m-mohr · 2022-03-15T15:16:50Z

> In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.

I assume "geojson file" was meant to refer to the table that Michele posted here? #341 (comment)

clausmichele · 2022-03-15T15:23:07Z

But then what should the output of aggregate_spatial look like if the input has dimensions x, y, bands or x, y, time? I couldn't find any other way of satisfying the requirements of the process, since we need to create a single new target dimension (not multiple) which can hold values from different bands or timesteps.
https://processes.openeo.org/#aggregate_spatial

m-mohr · 2022-03-15T15:27:37Z

As far as I understand it, the definition of aggregate_spatial simply doesn't work with the definition that @edzer proposes.

clausmichele · 2022-03-15T15:30:30Z

Well, we will keep this internal representation of the vector-cube for now since we need it for UC8. Once we have a clearer definition, we can modify it.

m-mohr · 2022-03-15T16:16:44Z

@clausmichele I guess it would be better to flatten result and result_meta? Does that make sense?

If I understood @edzer correctly, then the data cube may look like this:

Dimensions:

- geometry with labels: POINT (11.12157 46.06955), POINT (11.12171 46.06909), ...
- properties (or result to follow the default target dimension in aggregate_spatial) with labels: class, B04_10m, B08_10m, total_count, valid_count

| geometry v \ properties > | class | B04_10m | B08_10m | total_count | valid_count |
| --- | --- | --- | --- | --- | --- |
| POINT (11.12157 46.06955) | 0 | 1522.0 | 2127.5 | 2 | 2 |
| POINT (11.12171 46.06909) | 0 | 1121.5 | 1547.0 | 2 | 2 |
| … | … | … | … | … | … |

We'd need to fix the description of the returned vector cube in aggregate_spatial then.
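
The flattening suggested above could be sketched in Python roughly as follows. This is only an illustration: `flatten_row` is a hypothetical helper, the row dictionaries mirror the first two rows of the table posted earlier in this thread, and rejecting duplicate column names is an assumption rather than agreed behaviour.

```python
# Rows mirroring the first two lines of the table posted earlier in this thread.
rows = [
    {"class": 0, "geometry": "POINT (11.12157 46.06955)",
     "result": {"B04_10m": 1522.0, "B08_10m": 2127.5},
     "result_meta": {"total_count": 2.0, "valid_count": 2.0}},
    {"class": 0, "geometry": "POINT (11.12171 46.06909)",
     "result": {"B04_10m": 1121.5, "B08_10m": 1547.0},
     "result_meta": {"total_count": 2.0, "valid_count": 2.0}},
]


def flatten_row(row, nested_keys=("result", "result_meta")):
    """Promote the entries of the nested dicts to top-level columns."""
    flat = {}

    def put(key, value):
        # Assumption: a duplicate column name is an error, not an overwrite.
        if key in flat:
            raise ValueError(f"duplicate column {key!r}")
        flat[key] = value

    for key, value in row.items():
        if key in nested_keys:
            for inner_key, inner_value in value.items():
                put(inner_key, inner_value)
        else:
            put(key, value)
    return flat


flat_rows = [flatten_row(row) for row in rows]
print(list(flat_rows[0].keys()))
```

After flattening, every column holds scalars only, which is what allows the column names to be read as labels of a second dimension.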

clausmichele · 2022-03-15T16:24:08Z

Fine for me! I've just discussed with @ValentinaHutter and she was also implementing it like this.

m-mohr · 2022-03-15T16:45:21Z

Although it actually seems that total_count and valid_count should be per band in this case? aggregate_spatial says per geometry, but this seems pretty useless. Then the class is somewhat getting in the way...

clausmichele · 2022-03-16T07:58:32Z

I've computed it per geometry -> 1 point selected with two bands -> two valid pixels.
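
The difference between counting per geometry and counting per band can be sketched as follows. This is a hedged illustration, not the definition in aggregate_spatial: the function names are hypothetical, and treating None and NaN as invalid pixels is an assumption.

```python
import math


def is_valid(value):
    """Assumption: None and NaN mark invalid pixels."""
    return value is not None and not math.isnan(value)


def counts_per_band(pixels_by_band):
    """Count total and valid pixels separately for every band."""
    return {
        band: {
            "total_count": len(values),
            "valid_count": sum(1 for v in values if is_valid(v)),
        }
        for band, values in pixels_by_band.items()
    }


def counts_per_geometry(pixels_by_band):
    """Count total and valid pixels over all bands of one geometry."""
    per_band = counts_per_band(pixels_by_band)
    return {
        "total_count": sum(c["total_count"] for c in per_band.values()),
        "valid_count": sum(c["valid_count"] for c in per_band.values()),
    }


# One point geometry that hits a single pixel in each of two bands.
pixels = {"B04_10m": [1522.0], "B08_10m": [2127.5]}
per_geometry = counts_per_geometry(pixels)
per_band = counts_per_band(pixels)
print(per_geometry, per_band)
```

With one pixel in each of two bands, the per-geometry count is 2 while the per-band counts are 1 each.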

ValentinaHutter · 2022-03-16T08:20:41Z

> Fine for me! I've just discussed with @ValentinaHutter and she was also implementing it like this.

Yes, to make fit_regr_random_forest work at EODC it made more sense for us to have a separate column for every band we have. For now I use a predictors_vars parameter, which specifies the bands that are used as predictors, and a target_var parameter, which specifies the band that is used as the target. My predictors_vars would be a list (for example ['B04', 'B08']) and the target_var would be a string (for example 'ndvi', if that's the name of the band). Of course this is just for testing and I will update the parameters later.
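
A minimal Python sketch of how such a column selection could work on a flattened table is shown below. It is not the EODC implementation: the parameter names `predictors_vars` and `target_var` are taken from the comment above, while `split_predictors_target`, the row layout, and the ndvi values are made-up placeholders.

```python
# Hypothetical flattened rows; the ndvi values are placeholders.
rows = [
    {"geometry": "POINT (11.12157 46.06955)", "B04": 1522.0, "B08": 2127.5, "ndvi": 0.17},
    {"geometry": "POINT (11.12171 46.06909)", "B04": 1121.5, "B08": 1547.0, "ndvi": 0.16},
]


def split_predictors_target(rows, predictors_vars, target_var):
    """Return (X, y): one predictor row and one target value per geometry."""
    if target_var in predictors_vars:
        raise ValueError("target_var must not also be a predictor")
    for row in rows:
        missing = [name for name in [*predictors_vars, target_var] if name not in row]
        if missing:
            raise KeyError(f"missing columns: {missing}")
    X = [[row[name] for name in predictors_vars] for row in rows]
    y = [row[target_var] for row in rows]
    return X, y


X, y = split_predictors_target(rows, predictors_vars=["B04", "B08"], target_var="ndvi")
print(X, y)
```

`X` and `y` are in the row-per-sample shape that common random forest implementations expect for training.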

soxofaan · 2022-03-23T12:49:33Z

> If I understood @edzer correctly, then the data cube may look like this:
> ...

I think there is a problem here as the "class" column is a string in the example that @clausmichele posted (https://raw.githubusercontent.com/clausmichele/openeo_aggregate_spatial_vector_cubes/master/urban_forest_points.geojson), which is probably not unusual in practice. And theoretically there is also a bit of a conflict between the float aggregation columns and integer count columns.
That's what @edzer was noting too:

> In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.

m-mohr · 2022-03-23T12:52:33Z

Yes, that's what I'm also struggling with right now in general. We need more discussions on this.

clausmichele · 2022-03-23T13:01:54Z

Well, if the string is the issue, it can easily be converted into a number.
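
One simple way to do such a conversion is label encoding, sketched below in Python. The class names are hypothetical examples and `encode_labels` is an illustrative helper, not part of any openEO process.

```python
def encode_labels(labels):
    """Map each distinct string label to an integer code (sorted order)."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    codes = [mapping[label] for label in labels]
    return codes, mapping


# Hypothetical class names, for illustration only.
labels = ["urban", "forest", "forest", "urban"]
codes, mapping = encode_labels(labels)
print(codes, mapping)

# Keep the mapping so that predictions can be decoded back to strings.
inverse = {code: label for label, code in mapping.items()}
decoded = [inverse[code] for code in codes]
```

The integer codes are arbitrary identifiers with no numeric meaning, so the mapping has to travel with the data for the result to be interpretable.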

soxofaan · 2022-03-23T13:06:47Z

> Well, if the string is the issue, it can easily be converted into a number.

In this example it's probably easy. But I don't think you can do that "easily" or automatically in general.
I think it indicates that there is something conceptually wrong in how we define/handle vector cubes.

You could also theoretically argue that mixing the float aggregation columns with integer count columns in the same "cube" is bad practice.

clausmichele added bug patch labels Mar 15, 2022

m-mohr removed bug patch labels Mar 15, 2022

m-mohr added this to the 1.3.0 milestone Mar 15, 2022

m-mohr added the vector label Mar 15, 2022

m-mohr mentioned this issue Mar 21, 2022

Fixes for the random forest processes #351

Merged

m-mohr linked a pull request Mar 21, 2022 that will close this issue

Fixes for the random forest processes #351

Merged

m-mohr removed a link to a pull request Mar 21, 2022

Fixes for the random forest processes #351

Merged

m-mohr added the ML label Mar 21, 2022

clausmichele mentioned this issue Mar 21, 2022

predictors vector cube in fit_class_random_forest #349

Open

soxofaan mentioned this issue Mar 23, 2022

result_meta dimension in aggregate_spatial #356

Closed

m-mohr modified the milestones: 1.3.0, 2.0.0 Feb 1, 2023

m-mohr added the must-have label Feb 1, 2023

m-mohr modified the milestones: 2.0.0, 2.1.0 Mar 10, 2023
