New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend cross() function to take either a cell or a value #1204

Closed
felixlohmeier opened this Issue Jun 23, 2017 · 23 comments

Comments

Projects
None yet
9 participants
@felixlohmeier
Contributor

felixlohmeier commented Jun 23, 2017

On Saturday, 13 October 2012 01:12:23 UTC+2, Tom Morris wrote:

Currently the cross() function is one of the few (only) that takes a cell as its first argument instead of value.

Can anyone think of a good reason that we shouldn't extend it to take either a cell or a value?

Tom

On Sunday, 14 October 2012 23:13:57 UTC+2, David Huynh wrote:

If I remember correctly--and I should have documented this, the reasons for taking a cell rather than an arbitrary expression were to make the caching logic easy, and also to avoid any expression that yields something other than a single literal value (e.g., an array of value, a "cell" or "row" or "record" object).

We should definitely make it consistent with the other functions if we can work out the caching logic (which has been buggy).

David

Current state:

jackyq2015 added a commit that referenced this issue Jun 26, 2017

@jackyq2015

This comment has been minimized.

Show comment
Hide comment
@jackyq2015

jackyq2015 Jun 26, 2017

Contributor

Pushed the change f03be76, it covers:

  • Take the cell value instead of the cell as the first argument
  • Added the necessary action to flush the stale cache. They were missed before.
Contributor

jackyq2015 commented Jun 26, 2017

Pushed the change f03be76, it covers:

  • Take the cell value instead of the cell as the first argument
  • Added the necessary action to flush the stale cache. They were missed before.
@tfmorris

This comment has been minimized.

Show comment
Hide comment
@tfmorris

tfmorris Jun 26, 2017

Member

@felixlohmeier Thanks for the summary. Do you have links the the original sources of the quotes? Was it one of the mailing lists or another issue?

@jackyq2015 The referenced commit appears to be to master, not a branch. Does it resolve this issue? Who reviewed it? (Apparently Thad has reviewed it post-facto based on his typo comment)

Member

tfmorris commented Jun 26, 2017

@felixlohmeier Thanks for the summary. Do you have links the the original sources of the quotes? Was it one of the mailing lists or another issue?

@jackyq2015 The referenced commit appears to be to master, not a branch. Does it resolve this issue? Who reviewed it? (Apparently Thad has reviewed it post-facto based on his typo comment)

@tfmorris

This comment has been minimized.

Show comment
Hide comment
@tfmorris

tfmorris Jun 26, 2017

Member

"Thad" being @thadguidry, of course (so he gets notified)

Member

tfmorris commented Jun 26, 2017

"Thad" being @thadguidry, of course (so he gets notified)

@felixlohmeier

This comment has been minimized.

Show comment
Hide comment
@wetneb

This comment has been minimized.

Show comment
Hide comment
@wetneb

wetneb Jun 26, 2017

Member

Nice, that was quick! If you have time to write one or two unit tests that would be fantastic too!
Also, what happens if cross gets something else than just a single literal value?

Member

wetneb commented Jun 26, 2017

Nice, that was quick! If you have time to write one or two unit tests that would be fantastic too!
Also, what happens if cross gets something else than just a single literal value?

@felixlohmeier

This comment has been minimized.

Show comment
Hide comment
@felixlohmeier

felixlohmeier Jun 27, 2017

Contributor

@jackyq2015 Yeah, thanks for the quick solution! I am afk until July, 16. Shall I close the issue now, so you can get your reward at bountysource?

Contributor

felixlohmeier commented Jun 27, 2017

@jackyq2015 Yeah, thanks for the quick solution! I am afk until July, 16. Shall I close the issue now, so you can get your reward at bountysource?

@thadguidry

This comment has been minimized.

Show comment
Hide comment
@thadguidry

thadguidry Jun 27, 2017

Member

@felixlohmeier That would be nice Felix. Jacky and I chatted about a few other things on Sunday, like perhaps adding 1 or 2 more unit tests for the cross() work...other than that, its currently shippable and I'm using the trunk build now. Any bugs or issues we can probably resolve down the line, so I'd say go ahead and close the issue so that Jacky can withdraw his funds :) He did a great job on it, and now as a bonus, because of this work, we now have a way to easily generate summed up column values now as part of a cross() function, a much requested feature that Excel does in Pivot Tables but we never did https://groups.google.com/forum/#!topic/openrefine/CfNurLR5CII :) I'll post about how to do that later on the wiki... we could probably even add that function as a column operation later on perhaps if folks want it easier.

Member

thadguidry commented Jun 27, 2017

@felixlohmeier That would be nice Felix. Jacky and I chatted about a few other things on Sunday, like perhaps adding 1 or 2 more unit tests for the cross() work...other than that, its currently shippable and I'm using the trunk build now. Any bugs or issues we can probably resolve down the line, so I'd say go ahead and close the issue so that Jacky can withdraw his funds :) He did a great job on it, and now as a bonus, because of this work, we now have a way to easily generate summed up column values now as part of a cross() function, a much requested feature that Excel does in Pivot Tables but we never did https://groups.google.com/forum/#!topic/openrefine/CfNurLR5CII :) I'll post about how to do that later on the wiki... we could probably even add that function as a column operation later on perhaps if folks want it easier.

@jackyq2015

This comment has been minimized.

Show comment
Hide comment
@jackyq2015

jackyq2015 Jun 28, 2017

Contributor

@wetneb Will add few test case for non-literal value.

Contributor

jackyq2015 commented Jun 28, 2017

@wetneb Will add few test case for non-literal value.

jackyq2015 added a commit that referenced this issue Jun 28, 2017

@felixlohmeier

This comment has been minimized.

Show comment
Hide comment
@felixlohmeier

felixlohmeier Jul 17, 2017

Contributor

Again, thank you very much @jackyq2015 for the great job!

I wonder if this is a breaking change that needs to be addressed in upcoming release notes. The original proposal by @tfmorris in 2012 was to...

extend it to take either a cell or a value

The new cross() function takes a string only, which is fine by me but breaks existing user transformation history and the VIB-BITS OpenRefine extension. Is there a chance to make the new cross() function backward compatible (to take either a cell or a value)?

I added a note in the wiki to reflect this different behavior between the stable release (2.7) and the development version.

Contributor

felixlohmeier commented Jul 17, 2017

Again, thank you very much @jackyq2015 for the great job!

I wonder if this is a breaking change that needs to be addressed in upcoming release notes. The original proposal by @tfmorris in 2012 was to...

extend it to take either a cell or a value

The new cross() function takes a string only, which is fine by me but breaks existing user transformation history and the VIB-BITS OpenRefine extension. Is there a chance to make the new cross() function backward compatible (to take either a cell or a value)?

I added a note in the wiki to reflect this different behavior between the stable release (2.7) and the development version.

@jackyq2015

This comment has been minimized.

Show comment
Hide comment
@jackyq2015

jackyq2015 Jul 18, 2017

Contributor

@felixlohmeier can you please elaborate the symptom of " breaks existing user transformation history"? Definitely we can address it. Either make it back compatible or fix the history issue you refer to.

Contributor

jackyq2015 commented Jul 18, 2017

@felixlohmeier can you please elaborate the symptom of " breaks existing user transformation history"? Definitely we can address it. Either make it back compatible or fix the history issue you refer to.

@wetneb

This comment has been minimized.

Show comment
Hide comment
@wetneb

wetneb Jul 18, 2017

Member

@jackyq2015 So, if an existing transformation script assumes that the cross function expects a cell, and the new version of that function does not accept cells anymore, this transformation script that worked for OpenRefine 2.7 will not work for a later version. (That is how I understand the problem)

Member

wetneb commented Jul 18, 2017

@jackyq2015 So, if an existing transformation script assumes that the cross function expects a cell, and the new version of that function does not accept cells anymore, this transformation script that worked for OpenRefine 2.7 will not work for a later version. (That is how I understand the problem)

@felixlohmeier

This comment has been minimized.

Show comment
Hide comment
@felixlohmeier

felixlohmeier Jul 18, 2017

Contributor

I think of OpenRefine projects that use the old cross function that expects cell.cross() and will be opened someday in a future release of OpenRefine that includes the new cross function that expects value.cross(). In this case the undo/redo function would not work properly.

Additionally, there are many tutorials out there (stackoverflow, google forum, ...) that suggest to use the cross function with cell.cross(). If they try them with the new version they will get an error.

screenshot-cross

Is it possible to extend the cross function one step further to allow both forms of user input (cell and string)?

cell.cross("My Address Book", "friend")[0].cells["address"].value
value.cross("My Address Book", "friend")[0].cells["address"].value
Contributor

felixlohmeier commented Jul 18, 2017

I think of OpenRefine projects that use the old cross function that expects cell.cross() and will be opened someday in a future release of OpenRefine that includes the new cross function that expects value.cross(). In this case the undo/redo function would not work properly.

Additionally, there are many tutorials out there (stackoverflow, google forum, ...) that suggest to use the cross function with cell.cross(). If they try them with the new version they will get an error.

screenshot-cross

Is it possible to extend the cross function one step further to allow both forms of user input (cell and string)?

cell.cross("My Address Book", "friend")[0].cells["address"].value
value.cross("My Address Book", "friend")[0].cells["address"].value
@felixlohmeier

This comment has been minimized.

Show comment
Hide comment
@felixlohmeier

felixlohmeier Jul 18, 2017

Contributor

I have another issue... maybe there is still a problem in the caching logic. I would like to use the new cross() function together with split and join:

forEach(value.split(","),v,v.cross("My Address Book","friend")[0].cells["address"].value).join(",")

This works for "mary,john" but not for "anne,mary"

screenshot-cross-without-anne

In this example, the value "anne" is not part of the recipient column. If I add a row for "anne" then "anne,mary" works too.

screenshot-cross-with-anne

I think this behavior is unexpected. It seems that the new cross() function still expects that the string input value is present as a single cell.

Contributor

felixlohmeier commented Jul 18, 2017

I have another issue... maybe there is still a problem in the caching logic. I would like to use the new cross() function together with split and join:

forEach(value.split(","),v,v.cross("My Address Book","friend")[0].cells["address"].value).join(",")

This works for "mary,john" but not for "anne,mary"

screenshot-cross-without-anne

In this example, the value "anne" is not part of the recipient column. If I add a row for "anne" then "anne,mary" works too.

screenshot-cross-with-anne

I think this behavior is unexpected. It seems that the new cross() function still expects that the string input value is present as a single cell.

@felixlohmeier felixlohmeier reopened this Jul 18, 2017

@jackyq2015

This comment has been minimized.

Show comment
Hide comment
@jackyq2015

jackyq2015 Jul 18, 2017

Contributor

I will review the document related to the cross function to make it consistent and can have the code also support cell if necessary.

For the transformation, I can still redo and undo between cross function. could you please give an example?

For the logic of cache, the key of the cache is based on:
fromProject + ";" + fromColumn + ";" + toProject + ";" + toColumn;
which means:

  • The cache will only do once if there is no change to columns
  • The cache will be done when first cross get called(in your case, on the mary cell)
  • The cross function does not and should not aware of the custom operation(in your case, split). It will treat "mary, john" as one single value. The reason it can be parsed properly is that mary AND john already cached properly. so if you put a separate row which contains anne, the "anne,mary" can be parsed properly. But the "mary,john" will be also get cached since cache is based on the column.
Contributor

jackyq2015 commented Jul 18, 2017

I will review the document related to the cross function to make it consistent and can have the code also support cell if necessary.

For the transformation, I can still redo and undo between cross function. could you please give an example?

For the logic of cache, the key of the cache is based on:
fromProject + ";" + fromColumn + ";" + toProject + ";" + toColumn;
which means:

  • The cache will only do once if there is no change to columns
  • The cache will be done when first cross get called(in your case, on the mary cell)
  • The cross function does not and should not aware of the custom operation(in your case, split). It will treat "mary, john" as one single value. The reason it can be parsed properly is that mary AND john already cached properly. so if you put a separate row which contains anne, the "anne,mary" can be parsed properly. But the "mary,john" will be also get cached since cache is based on the column.
@felixlohmeier

This comment has been minimized.

Show comment
Hide comment
@felixlohmeier

felixlohmeier Jul 18, 2017

Contributor

... can have the code also support cell if necessary

I think that would be great to make it backwards compatible.

The cross function does not and should not aware of the custom operation(in your case, split).

Thank you @jackyq2015, now I understand the caching logic. The reason why I raised this issue was the idea that combining the split, cross and join in one expression could reduce total run time. Row operations (split/join multi-valued-cells, delete rows) are very expensive and often amount to more than 75% of total run time in my projects. Do you have an idea how to support custom operations (e.g. split) together with the cross function? I am willing to give another bounty if this helps.

Contributor

felixlohmeier commented Jul 18, 2017

... can have the code also support cell if necessary

I think that would be great to make it backwards compatible.

The cross function does not and should not aware of the custom operation(in your case, split).

Thank you @jackyq2015, now I understand the caching logic. The reason why I raised this issue was the idea that combining the split, cross and join in one expression could reduce total run time. Row operations (split/join multi-valued-cells, delete rows) are very expensive and often amount to more than 75% of total run time in my projects. Do you have an idea how to support custom operations (e.g. split) together with the cross function? I am willing to give another bounty if this helps.

@jackyq2015

This comment has been minimized.

Show comment
Hide comment
@jackyq2015

jackyq2015 Jul 18, 2017

Contributor

Will find some time this weekend for the compatibility issue.

For the cache logic I think the good practice is that always do the cross on normalized data(source and destination) rather than row data to prevent the unpredictable result. Also in this way you can leverage the cache efficiently.

Contributor

jackyq2015 commented Jul 18, 2017

Will find some time this weekend for the compatibility issue.

For the cache logic I think the good practice is that always do the cross on normalized data(source and destination) rather than row data to prevent the unpredictable result. Also in this way you can leverage the cache efficiently.

@ettorerizza

This comment has been minimized.

Show comment
Hide comment
@ettorerizza

ettorerizza Jul 20, 2017

Member

As I feared, this change "breaks" the VibBits extension, which allows to make a cell.cross() visually. Too bad, it is very convenient.

screencast

Member

ettorerizza commented Jul 20, 2017

As I feared, this change "breaks" the VibBits extension, which allows to make a cell.cross() visually. Too bad, it is very convenient.

screencast

@jackyq2015

This comment has been minimized.

Show comment
Hide comment
@jackyq2015

jackyq2015 Jul 23, 2017

Contributor

@ettorerizza
Added the backward compatibility for cross function as discussed. 4950d29

Could you please test from the head?

Contributor

jackyq2015 commented Jul 23, 2017

@ettorerizza
Added the backward compatibility for cross function as discussed. 4950d29

Could you please test from the head?

@jackyq2015 jackyq2015 reopened this Jul 23, 2017

@felixlohmeier

This comment has been minimized.

Show comment
Hide comment
@felixlohmeier

felixlohmeier Jul 24, 2017

Contributor

Thank you @jackyq2015, it works.

We should update the integrated help too: line 85 in Cross.java

writer.key("params"); writer.value("string s or cell c, string projectName, string columnName");

Contributor

felixlohmeier commented Jul 24, 2017

Thank you @jackyq2015, it works.

We should update the integrated help too: line 85 in Cross.java

writer.key("params"); writer.value("string s or cell c, string projectName, string columnName");

@jackyq2015

This comment has been minimized.

Show comment
Hide comment
@jackyq2015

jackyq2015 Jul 29, 2017

Contributor

@felixlohmeier corrected the description

Contributor

jackyq2015 commented Jul 29, 2017

@felixlohmeier corrected the description

@jackyq2015 jackyq2015 closed this Jul 29, 2017

@felixlohmeier

This comment has been minimized.

Show comment
Hide comment
@felixlohmeier

felixlohmeier Aug 31, 2017

Contributor

Hi @jackyq2015 and ALL, may I ask again if there is a way to support multiple-value-cell-input in the cross function? I am willing to give another bounty. I had a look at the InterProjectModel again and I wonder if it could be possible to add an optional split function before computing the cache? From a user perspective, I would like to call the cross function with a separator as 4th argument, so that multiple values (separated by a custom separator, e.g. ␟) in an input cell get "crossed". Multiple output values should be joined with the same separator.

Example (see screenshots above):

  • Input: anne␟mary
  • Expression: cell.cross("My Address Book", "friend","␟")[0].cells["address"].value
  • Output: 17 Morning Crescent␟50 Broadway Ave

The reason why I raised this issue was the idea that combining split, cross and join in one (cached) function could reduce total run time. Row operations (split/join multi-valued-cells, delete rows) are very expensive and often amount to more than 75% of total run time in my projects.

Contributor

felixlohmeier commented Aug 31, 2017

Hi @jackyq2015 and ALL, may I ask again if there is a way to support multiple-value-cell-input in the cross function? I am willing to give another bounty. I had a look at the InterProjectModel again and I wonder if it could be possible to add an optional split function before computing the cache? From a user perspective, I would like to call the cross function with a separator as 4th argument, so that multiple values (separated by a custom separator, e.g. ␟) in an input cell get "crossed". Multiple output values should be joined with the same separator.

Example (see screenshots above):

  • Input: anne␟mary
  • Expression: cell.cross("My Address Book", "friend","␟")[0].cells["address"].value
  • Output: 17 Morning Crescent␟50 Broadway Ave

The reason why I raised this issue was the idea that combining split, cross and join in one (cached) function could reduce total run time. Row operations (split/join multi-valued-cells, delete rows) are very expensive and often amount to more than 75% of total run time in my projects.

@claussni

This comment has been minimized.

Show comment
Hide comment
@claussni

claussni Oct 26, 2017

Contributor

Implemented as optional 4th parameter to cross() function: https://github.com/claussni/OpenRefine/tree/cross-func-split

@felixlohmeier can you test this, please? Here is the expression I used:
forEach(value.cross("My Address Book", "friend", ","),r,r.cells["address"].value).join(";")

The result should be:
screenshot-2017-10-26 christmas gifts csv - openrefine

Contributor

claussni commented Oct 26, 2017

Implemented as optional 4th parameter to cross() function: https://github.com/claussni/OpenRefine/tree/cross-func-split

@felixlohmeier can you test this, please? Here is the expression I used:
forEach(value.cross("My Address Book", "friend", ","),r,r.cells["address"].value).join(";")

The result should be:
screenshot-2017-10-26 christmas gifts csv - openrefine

@felixlohmeier

This comment has been minimized.

Show comment
Hide comment
@felixlohmeier

felixlohmeier Oct 26, 2017

Contributor

result for anne should be "17 Morning Crescent"; i am going to open a new issue with example data

Contributor

felixlohmeier commented Oct 26, 2017

result for anne should be "17 Morning Crescent"; i am going to open a new issue with example data

wetneb added a commit that referenced this issue Oct 27, 2017

Allow comma separated multi-value source in cross() function
Implements support for comma separated multiple-value keys for joining
another project using the cross() function.

See #1204 (comment)

wetneb added a commit that referenced this issue Oct 27, 2017

Add optional 4th parameter to cross() function
The cross function now accepts a 4th parameter defining a regular
expression separator for splitting multi-value field values when joining
projects.

See #1204 (comment)

@ostephens ostephens added this to the 2.7.2 milestone Oct 27, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment