Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend cross() function to take either a cell or a value #1204

Closed
felixlohmeier opened this issue Jun 23, 2017 · 23 comments
Closed

Extend cross() function to take either a cell or a value #1204

felixlohmeier opened this issue Jun 23, 2017 · 23 comments
Milestone

Comments

@felixlohmeier
Copy link
Contributor

On Saturday, 13 October 2012 01:12:23 UTC+2, Tom Morris wrote:

Currently the cross() function is one of the few (only) that takes a cell as its first argument instead of value.

Can anyone think of a good reason that we shouldn't extend it to take either a cell or a value?

Tom

On Sunday, 14 October 2012 23:13:57 UTC+2, David Huynh wrote:

If I remember correctly--and I should have documented this, the reasons for taking a cell rather than an arbitrary expression were to make the caching logic easy, and also to avoid any expression that yields something other than a single literal value (e.g., an array of value, a "cell" or "row" or "record" object).

We should definitely make it consistent with the other functions if we can work out the caching logic (which has been buggy).

David

Current state:

@jackyq2015
Copy link
Contributor

Pushed the change f03be76, it covers:

  • Take the cell value instead of the cell as the first argument
  • Added the necessary action to flush the stale cache. They were missed before.

@tfmorris
Copy link
Member

@felixlohmeier Thanks for the summary. Do you have links the the original sources of the quotes? Was it one of the mailing lists or another issue?

@jackyq2015 The referenced commit appears to be to master, not a branch. Does it resolve this issue? Who reviewed it? (Apparently Thad has reviewed it post-facto based on his typo comment)

@tfmorris
Copy link
Member

"Thad" being @thadguidry, of course (so he gets notified)

@felixlohmeier
Copy link
Contributor Author

@wetneb
Copy link
Sponsor Member

wetneb commented Jun 26, 2017

Nice, that was quick! If you have time to write one or two unit tests that would be fantastic too!
Also, what happens if cross gets something else than just a single literal value?

@felixlohmeier
Copy link
Contributor Author

@jackyq2015 Yeah, thanks for the quick solution! I am afk until July, 16. Shall I close the issue now, so you can get your reward at bountysource?

@thadguidry
Copy link
Member

thadguidry commented Jun 27, 2017

@felixlohmeier That would be nice Felix. Jacky and I chatted about a few other things on Sunday, like perhaps adding 1 or 2 more unit tests for the cross() work...other than that, its currently shippable and I'm using the trunk build now. Any bugs or issues we can probably resolve down the line, so I'd say go ahead and close the issue so that Jacky can withdraw his funds :) He did a great job on it, and now as a bonus, because of this work, we now have a way to easily generate summed up column values now as part of a cross() function, a much requested feature that Excel does in Pivot Tables but we never did https://groups.google.com/forum/#!topic/openrefine/CfNurLR5CII :) I'll post about how to do that later on the wiki... we could probably even add that function as a column operation later on perhaps if folks want it easier.

@jackyq2015
Copy link
Contributor

@wetneb Will add few test case for non-literal value.

jackyq2015 added a commit that referenced this issue Jun 28, 2017
@felixlohmeier
Copy link
Contributor Author

Again, thank you very much @jackyq2015 for the great job!

I wonder if this is a breaking change that needs to be addressed in upcoming release notes. The original proposal by @tfmorris in 2012 was to...

extend it to take either a cell or a value

The new cross() function takes a string only, which is fine by me but breaks existing user transformation history and the VIB-BITS OpenRefine extension. Is there a chance to make the new cross() function backward compatible (to take either a cell or a value)?

I added a note in the wiki to reflect this different behavior between the stable release (2.7) and the development version.

@jackyq2015
Copy link
Contributor

@felixlohmeier can you please elaborate the symptom of " breaks existing user transformation history"? Definitely we can address it. Either make it back compatible or fix the history issue you refer to.

@wetneb
Copy link
Sponsor Member

wetneb commented Jul 18, 2017

@jackyq2015 So, if an existing transformation script assumes that the cross function expects a cell, and the new version of that function does not accept cells anymore, this transformation script that worked for OpenRefine 2.7 will not work for a later version. (That is how I understand the problem)

@felixlohmeier
Copy link
Contributor Author

I think of OpenRefine projects that use the old cross function that expects cell.cross() and will be opened someday in a future release of OpenRefine that includes the new cross function that expects value.cross(). In this case the undo/redo function would not work properly.

Additionally, there are many tutorials out there (stackoverflow, google forum, ...) that suggest to use the cross function with cell.cross(). If they try them with the new version they will get an error.

screenshot-cross

Is it possible to extend the cross function one step further to allow both forms of user input (cell and string)?

cell.cross("My Address Book", "friend")[0].cells["address"].value
value.cross("My Address Book", "friend")[0].cells["address"].value

@felixlohmeier
Copy link
Contributor Author

I have another issue... maybe there is still a problem in the caching logic. I would like to use the new cross() function together with split and join:

forEach(value.split(","),v,v.cross("My Address Book","friend")[0].cells["address"].value).join(",")

This works for "mary,john" but not for "anne,mary"

screenshot-cross-without-anne

In this example, the value "anne" is not part of the recipient column. If I add a row for "anne" then "anne,mary" works too.

screenshot-cross-with-anne

I think this behavior is unexpected. It seems that the new cross() function still expects that the string input value is present as a single cell.

@felixlohmeier felixlohmeier reopened this Jul 18, 2017
@jackyq2015
Copy link
Contributor

I will review the document related to the cross function to make it consistent and can have the code also support cell if necessary.

For the transformation, I can still redo and undo between cross function. could you please give an example?

For the logic of cache, the key of the cache is based on:
fromProject + ";" + fromColumn + ";" + toProject + ";" + toColumn;
which means:

  • The cache will only do once if there is no change to columns
  • The cache will be done when first cross get called(in your case, on the mary cell)
  • The cross function does not and should not aware of the custom operation(in your case, split). It will treat "mary, john" as one single value. The reason it can be parsed properly is that mary AND john already cached properly. so if you put a separate row which contains anne, the "anne,mary" can be parsed properly. But the "mary,john" will be also get cached since cache is based on the column.

@felixlohmeier
Copy link
Contributor Author

... can have the code also support cell if necessary

I think that would be great to make it backwards compatible.

The cross function does not and should not aware of the custom operation(in your case, split).

Thank you @jackyq2015, now I understand the caching logic. The reason why I raised this issue was the idea that combining the split, cross and join in one expression could reduce total run time. Row operations (split/join multi-valued-cells, delete rows) are very expensive and often amount to more than 75% of total run time in my projects. Do you have an idea how to support custom operations (e.g. split) together with the cross function? I am willing to give another bounty if this helps.

@jackyq2015
Copy link
Contributor

Will find some time this weekend for the compatibility issue.

For the cache logic I think the good practice is that always do the cross on normalized data(source and destination) rather than row data to prevent the unpredictable result. Also in this way you can leverage the cache efficiently.

@ettorerizza
Copy link
Member

As I feared, this change "breaks" the VibBits extension, which allows to make a cell.cross() visually. Too bad, it is very convenient.

screencast

@jackyq2015
Copy link
Contributor

@ettorerizza
Added the backward compatibility for cross function as discussed. 4950d29

Could you please test from the head?

@jackyq2015 jackyq2015 reopened this Jul 23, 2017
@felixlohmeier
Copy link
Contributor Author

Thank you @jackyq2015, it works.

We should update the integrated help too: line 85 in Cross.java

writer.key("params"); writer.value("string s or cell c, string projectName, string columnName");

@jackyq2015
Copy link
Contributor

@felixlohmeier corrected the description

@felixlohmeier
Copy link
Contributor Author

Hi @jackyq2015 and ALL, may I ask again if there is a way to support multiple-value-cell-input in the cross function? I am willing to give another bounty. I had a look at the InterProjectModel again and I wonder if it could be possible to add an optional split function before computing the cache? From a user perspective, I would like to call the cross function with a separator as 4th argument, so that multiple values (separated by a custom separator, e.g. ␟) in an input cell get "crossed". Multiple output values should be joined with the same separator.

Example (see screenshots above):

  • Input: anne␟mary
  • Expression: cell.cross("My Address Book", "friend","␟")[0].cells["address"].value
  • Output: 17 Morning Crescent␟50 Broadway Ave

The reason why I raised this issue was the idea that combining split, cross and join in one (cached) function could reduce total run time. Row operations (split/join multi-valued-cells, delete rows) are very expensive and often amount to more than 75% of total run time in my projects.

@claussni
Copy link
Contributor

claussni commented Oct 26, 2017

Implemented as optional 4th parameter to cross() function: https://github.com/claussni/OpenRefine/tree/cross-func-split

@felixlohmeier can you test this, please? Here is the expression I used:
forEach(value.cross("My Address Book", "friend", ","),r,r.cells["address"].value).join(";")

The result should be:
screenshot-2017-10-26 christmas gifts csv - openrefine

@felixlohmeier
Copy link
Contributor Author

result for anne should be "17 Morning Crescent"; i am going to open a new issue with example data

wetneb pushed a commit that referenced this issue Oct 27, 2017
Implements support for comma separated multiple-value keys for joining
another project using the cross() function.

See #1204 (comment)
wetneb pushed a commit that referenced this issue Oct 27, 2017
The cross function now accepts a 4th parameter defining a regular
expression separator for splitting multi-value field values when joining
projects.

See #1204 (comment)
@ostephens ostephens added this to the 2.7.2 milestone Oct 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants