Spread array of values generated by an expression to form a record #4823

wetneb · 2022-04-30T17:04:32Z

When we use an expression which returns an array in the Transform or Add column based on this column dialogs, the resulting cells are currently left blank because we cannot store an array in a single cell.
This is of course not ideal and quite confusing because there is no explanation or warning about this (#1475, #1088).

Proposed solution

This should behave like the Add column from reconciled values operation: spread multiple values in consecutive cells to form a record.
For instance, if I start with a table such as:

ID	JSON
1	{"letters":["a","b"]}
2	{"letters":["c","d","e"]}

If I add a column letters generated by the expression value.parseJson().letters, I want to get the following result:

ID	JSON	letters
1	{"letters":["a","b"]}	a
		b
2	{"letters":["c","d","e"]}	c
		d
		e

This solution could be implemented in 3.x, without being an important breaking change (since expressions which return arrays are currently useless, we can assume that it is unlikely anyone relies on the current behaviour in a workflow).

Alternatives considered

Just keep the logic as it is but improve error reporting

We could simply make it clearer at the preview stage that those values will not be stored. @thadguidry proposed the following UI:

@ostephens has noted that it would be better not to render those as errors (because the expression does not return an error itself) but rather to signal this in a different way (for instance by improving the way we distinguish between data types of returned values in this preview area).

Spread the values as columns

In issue #36, it is proposed that values are spread on columns instead. With the example above, you would get:

ID	JSON	letters 1	letters 2	letters 3
1	{"values":["a","b"]}	a	b
2	{"values":["c","d","e"]}	c	d	e

I find that less helpful because storing an array like this makes it hard to reuse the results afterwards.
The lists can have variable lengths, so the columns created by the operation depend on the data.
And this behaviour can already be obtained by successively adding columns using the expression value.parseJson().letters[0], value.parseJson().letters[1] and so on (or rather in the reverse order to get the exact result above).

Store arrays as strings in cells

@thadguidry also proposed to convert these values to strings before storing them, for instance by serializing them to JSON with jsonize().
That is indeed a sensible workaround that users can resort to at the moment, but it does not feel so convenient to reuse afterwards (although it is not as bad as when values are spread as columns in my opinion).

Make it possible for cells to store arrays natively

In PR #1586 @ostephens proposed to make it possible to store arrays in cells natively.
That is definitely an interesting approach, which could potentially even remove the need for the records mode entirely: just store lists of values in the cells instead of spreading them on consecutive rows. Hence it seems to be a much bigger project to me. What I find daunting about it is that:

if we start allowing arrays in cells, we basically need to rethink most operations and decide how they should handle arrays. What happens when you reconcile a cell which contains an array? What happens when one of the transpose operations encounters that? And so on.
unfortunately, as explained in a comment on #1586, we currently allow any Java Object to be stored as cell value, which means that we cannot rely on the compiler to make sure we are handling this new possibility in all places of the code where we should add this handling. This is also a problem for extension developers, who will not become aware of this unless they read our release notes.

That being said I would be happy to consider this change in 4.0, where we already change to a very different data model - such a change could be introduced there. That does not make the design effort less daunting of course (it is still basically the same problem as redesigning the records mode).

Give the user a choice between various of those options

The dialogs where those expressions are previewed could be adapted to give the user a choice in how they want their arrays to be stored. Those expression preview dialogs are already quite full, but if we think there can be multiple sensible things to do there, it is natural to give the user a choice. But it is not really clear to me that it is worth it, because:

spreading the values in columns could be done using the Split columns operation instead (by adapting it to accept an expression returning an array), as suggested by @dfhuynh and others
storing the arrays as strings can already be done by adding .jsonize() at the end of the expression. Arguably this is not as discoverable as an explicit option in the UI to do it for you, but I would expect people who want this to be aware of parseJson() at least (otherwise how do they intend to reuse the results?) so it is also quite likely that they know of jsonize()
storing arrays in cells is not something we currently support, so we cannot give the option to do that yet

Additional context

Related issues I am aware of: #36, #1475, #1088.

This is important in the context of the Structured Data on Commons integration project because the CommonsExtension offers GREL functions which return arrays.

The text was updated successfully, but these errors were encountered:

thadguidry · 2022-05-03T13:13:32Z

@wetneb I am in agreement with the proposal. Could you update the beginning of the proposal (or milestone) to signify that this would be a pre-4.0 feature enhancement? I understand from your notes it will not be targeting 4.0, where in 4.0 we might do further data model changes and affected code to support storing arrays directly in cells.

Note that we do also have Transpose which could be utilized after the parseJson() operation by users for spreading columnar values across (but Transpose is a bit clunky still for records in current state and hopefully we address that in 4.0).

ostephens · 2022-05-04T08:51:17Z

I can see all three options being useful here:

Spread results across rows
Spread results across columns
jsonize array to a string

I wonder if a control like we have for errors would be useful here - in the transformation dialogue the user is offered options as to what to do with errors (keep original, store error, store blank) - could we offer a set of options for arrays in a similar way so that the user chooses how to deal with them?

My biggest frustration with arrays (and the reason I originally looked at storing them in cells) is that currently it is too easy (IMO) for the user to overlook the need to convert the array to a string (or other storable representation) and end up storing nothing in the cell instead - any solution that avoids this has my support!

One argument I can see in favour of preferring the jsonize approach is that an array can contain arrays (which can contain arrays, etc.) which means that splitting across rows or columns would have to be recursive and could result in a complex set of rows/columns being added.

wetneb · 2022-05-04T09:33:37Z

Indeed, I did not think about the recursive case, that is going to be pretty fiddly to implement.
Adding an option to choose between various behaviours seems doable to me too.

ostephens · 2022-05-05T09:01:32Z

For me getting an option for jsonize option into the process would be a good and relatively straightforward start while avoiding the complications of splitting complex structures into rows/columns. So doing something simple that keeps the users options open gets my vote as a first step.

I agree with your comment above @wetneb

Those expression preview dialogs are already quite full

I have to admit that from a user perspective:

I never use 'Retransform X times'
I pretty much only change the 'on error' setting when I'm troubleshooting HTTP fetches

If we take the approach of adding user options here I'm wondering if it would make sense to move the settings that control storage options to a separate tab in the transformation dialogue (so move the 'on error' options and add proposed new 'on array' options in this new tab). Potentially also the "Retransform X times" option could live here as well?

If we are happy for scope to creep we could also consider:

Having an application level setting for default behaviour for transformations (for both 'on error' and 'on array')
Adding similar settings for what to do in the case transformations result in a Row, a Cell tuple or a Cell

trnstlntk mentioned this issue May 12, 2022

Decide upon, design and develop parsing options that appear after user has entered Commons categories OpenRefine/CommonsExtension#15

Closed

wetneb mentioned this issue Oct 1, 2022

Bug: extractFromTemplate and value.extractCategories GREL functions produce empty columns OpenRefine/CommonsExtension#61

Open

wetneb removed the Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators label Oct 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spread array of values generated by an expression to form a record #4823

Spread array of values generated by an expression to form a record #4823

wetneb commented Apr 30, 2022 •

edited

thadguidry commented May 3, 2022 •

edited

ostephens commented May 4, 2022

wetneb commented May 4, 2022

ostephens commented May 5, 2022 •

edited

Spread array of values generated by an expression to form a record #4823

Spread array of values generated by an expression to form a record #4823

Comments

wetneb commented Apr 30, 2022 • edited

Proposed solution

Alternatives considered

Just keep the logic as it is but improve error reporting

Spread the values as columns

Store arrays as strings in cells

Make it possible for cells to store arrays natively

Give the user a choice between various of those options

Additional context

thadguidry commented May 3, 2022 • edited

ostephens commented May 4, 2022

wetneb commented May 4, 2022

ostephens commented May 5, 2022 • edited

wetneb commented Apr 30, 2022 •

edited

thadguidry commented May 3, 2022 •

edited

ostephens commented May 5, 2022 •

edited