Store partial reconciliation results #1231

Closed
ettorerizza opened this issue Aug 11, 2017 · 8 comments
Labels
  • large project support: Improving support for large projects (for instance, millions of rows)
  • Priority: Low: Indicates less critical issues that can be dealt with at a later stage
  • reconciliation: Related to the reconciliation operations and other features
  • Type: Feature Request: Identifies requests for new features or enhancements. These involve proposing new improvements.

Comments

@ettorerizza
Member

ettorerizza commented Aug 11, 2017

Reconciliation (e.g. with Wikidata) is sometimes a long process. At the slightest connection problem, or if OpenRefine crashes, you have to start all over again. Would it not be possible to preserve the reconciliations already made, or to store them in some kind of cache? Unless this is already the case and I am not aware of it? (I have never noticed a failed reconciliation being faster the second time.)


@wetneb
Sponsor Member

wetneb commented Aug 11, 2017

I agree this is an important problem. But it's not entirely clear to me what the solution should look like.

  • I can put a server-side cache on top of the reconciliation queries (but that would only solve the problem for Wikidata)
  • We can put a client-side cache in OpenRefine but it would probably not survive crashes. It would also need to be persistent across operations (which is not the case, for instance, for the cache I have added to the URL fetching operation).

As always with caches, we have to be careful about the size, invalidation strategy, time to live, and so on…
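To make those concerns concrete, here is a minimal sketch of a client-side cache that bounds its size, expires entries after a time to live, and can be flushed to disk so that it survives a restart. It is only an illustration: `ReconCache`, its parameters, and the JSON file layout are hypothetical and are not part of OpenRefine or of any reconciliation service.

```python
import json
import time
from collections import OrderedDict
from pathlib import Path


class ReconCache:
    """Hypothetical size-bounded cache with a time to live (not OpenRefine code)."""

    def __init__(self, path, max_entries=10000, ttl_seconds=86400, clock=time.time):
        self.path = Path(path)
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # Maps a query string to (stored_at, result), least recently used first.
        self.entries = OrderedDict()
        if self.path.exists():
            # Reload whatever was flushed before a crash or restart.
            for query, (stored_at, result) in json.loads(self.path.read_text()):
                self.entries[query] = (stored_at, result)

    def get(self, query):
        hit = self.entries.get(query)
        if hit is None:
            return None
        stored_at, result = hit
        if self.clock() - stored_at > self.ttl_seconds:
            # Invalidation by age: drop the stale entry and report a miss.
            del self.entries[query]
            return None
        self.entries.move_to_end(query)  # mark as most recently used
        return result

    def put(self, query, result):
        self.entries[query] = (self.clock(), result)
        self.entries.move_to_end(query)
        # Size bound: evict least recently used entries first.
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)

    def flush(self):
        # Persist as a JSON list of [query, [stored_at, result]] pairs.
        self.path.write_text(json.dumps(list(self.entries.items())))
```

How often to call `flush()` is the trade-off discussed above: flushing rarely is cheap but loses more work on a crash, while flushing after every batch of queries costs more I/O.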

@wetneb added the "Type: Feature Request", "large project support" and "reconciliation" labels Aug 11, 2017
@wetneb
Sponsor Member

wetneb commented Aug 11, 2017

Related issue: #580

@thadguidry
Member

thadguidry commented Aug 11, 2017

@wetneb If I recall, I think I asked David H. to consider Fetch URLs and Reconcile as long running operations and hence no automatic project save occurs. Honestly, I cannot recall...so...We need to

and could change the logic so that project saves are configurable and could occur more frequently than the default (e.g. every 4 minutes).

@wetneb
Sponsor Member

wetneb commented Aug 11, 2017

I don't think that running an autosave while the operation is running would do anything - the reconciliation results are stored in memory and they are only added to the project at the very end of the operation. Even if the column was populated gradually, you would still need to write the change to the history, and recover from an interrupted run (that could be done by faceting, but it would be quite manual).
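For illustration, the "populate gradually and recover from an interrupted run" idea could be sketched as follows. This is not how OpenRefine stores reconciliation results; `reconcile_with_checkpoints`, the `reconcile_batch` callback, and the JSON checkpoint file are all hypothetical, and values are assumed to be strings.

```python
import json
from pathlib import Path


def reconcile_with_checkpoints(values, reconcile_batch, checkpoint_path, batch_size=50):
    """Reconcile string values in batches, saving partial results after each batch.

    reconcile_batch is a hypothetical callback taking a list of values and
    returning a dict mapping each value to its (JSON-serializable) result.
    """
    path = Path(checkpoint_path)
    # Resume from an earlier, interrupted run if a checkpoint exists.
    done = json.loads(path.read_text()) if path.exists() else {}
    # Deduplicate while keeping order, and skip values already reconciled.
    pending = [v for v in dict.fromkeys(values) if v not in done]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        done.update(reconcile_batch(batch))  # may raise on a connection problem
        # Write to a temporary file, then rename, so that a crash in the
        # middle of a write does not corrupt the previous checkpoint.
        tmp = Path(str(path) + ".tmp")
        tmp.write_text(json.dumps(done))
        tmp.replace(path)
    return [done[v] for v in values]
```

If the callback raises halfway through, calling the function again with the same checkpoint path only sends the batches that had not completed.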

@thadguidry
Member

thadguidry commented Aug 11, 2017

@wetneb you're probably right. Like I said, I cannot recall, and I am no longer very familiar with our code in a lot of areas. But I can give lots of pointers to folks who have more time, so they can get familiar with our code :)

@jackyq2015
Contributor

Maybe some tweak around this line: https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/process/ProcessManager.java#L152
could save partial reconciliation results. Or, even better, using the Interceptor design pattern could make it more flexible and clean. Just a wild idea. :)

@magdmartin
Member

This will help to address #1235.

@wetneb
Sponsor Member

wetneb commented Apr 18, 2023

This is implemented in the 4.0 branch. See the videos and discussion there: https://forum.openrefine.org/t/partial-results-of-long-running-operations/482

@wetneb wetneb closed this as completed Apr 18, 2023