
Proposal: remold the storage layer to facilitate large bulk operations #8731

mwoodiupui opened this issue Mar 22, 2023 · 3 comments
@mwoodiupui (Member) commented Mar 22, 2023:

Neither a bug nor a new feature IMO, so I didn't know which to pick.

Is your feature request related to a problem? Please describe.
I have noticed some recurring patterns associated with bulk operations.

DAO methods which can return multiple results tend to do so as List. Very large lists may consume significant amounts of memory, and collecting them may defeat the DBMS's efforts to overlap storage, network, and compute operations.

Because Context encapsulates a single Session, large bulk operations have a serious issue with transaction lifetimes. Such an operation typically uses a DBMS query to produce a sequence of DSOs which are (possibly) modified one by one. Operating on a large number of DSOs can take minutes or hours, and the single transaction must be held open for the duration so as not to detach entities in the list which have not yet been considered. Yet the natural lifetime of a transaction for the actual modifications should be a single DSO (and perhaps its dependents) and is measured in milliseconds. The memory pressure from large, long-lived transactions has led to workarounds like DBConnection#uncacheEntity(), but this only shifts the burden from DSpace to the DBMS and does not address the possibility of long-held locks.
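For concreteness, the current single-Session pattern looks roughly like this (a sketch; findAll, update, uncacheEntity, and complete are existing DSpace APIs, while the per-item work is hypothetical):

```java
import java.sql.SQLException;
import java.util.Iterator;
import org.dspace.authorize.AuthorizeException;
import org.dspace.content.Item;
import org.dspace.content.service.ItemService;
import org.dspace.core.Context;

public class BulkPassToday {
    // One Context wraps one Session and one transaction, which must stay
    // open for the entire bulk pass.
    void run(ItemService itemService) throws SQLException, AuthorizeException {
        Context context = new Context();
        Iterator<Item> items = itemService.findAll(context);
        while (items.hasNext()) {
            Item item = items.next();
            // ... per-item work (hypothetical) ...
            itemService.update(context, item);
            // Workaround: evict the entity so the Session cache doesn't grow
            // without bound. This shifts the burden to the DBMS; it does not
            // shorten the transaction or release any locks.
            context.uncacheEntity(item);
        }
        context.complete(); // the single commit, after minutes or hours
    }
}
```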

For the purpose of discussion, note that bulk operations divide into two categories. Operations which update the database are subject to both issues. Operations which merely produce reports, without altering the database, avoid the transaction-lifetime conflict.

Describe the solution you'd like
Since one almost always uses a List as a generator rather than fishing around in it at random, DAOs should return Iterable or Stream when multiple results are possible. The implementation can rely on the underlying DBMS driver to produce results on demand, freeing the DBMS and driver to optimize result delivery and minimizing memory pressure.
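As a sketch of what such a DAO method could look like (the method name streamAll is hypothetical; Query#getResultStream is the standard Hibernate/JPA API and pulls rows from an open cursor on demand):

```java
import java.util.stream.Stream;
import org.dspace.content.Item;
import org.hibernate.Session;
import org.hibernate.query.Query;

public class StreamingItemDAO {
    // Hypothetical DAO method: return a lazily populated Stream instead of
    // a fully materialized List, so the driver can fetch rows as consumed.
    public Stream<Item> streamAll(Session session) {
        Query<Item> query = session.createQuery("FROM Item", Item.class);
        query.setFetchSize(100); // hint to the driver: fetch in modest batches
        return query.getResultStream(); // caller should close, e.g. try-with-resources
    }
}
```

Note that the caller must consume the Stream while the owning Session and transaction are still open, which is exactly where the two-Session arrangement below comes in.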

Using a read-only Session to produce the sequence of DSOs for consideration, and a separate read-write Session to modify individual DSOs, addresses the tension over transaction lifetimes, since each has its own transaction. A read-only Session should require only read locks at most, and may be implemented by the DBMS entirely without long-term locks, or with advisory locks. All of that locking should be internal to the DBMS and of no concern to DSpace. Context could expose Sessions rather than keeping them hidden inside DBConnection, or Sessions could be separated entirely from Context. I have been unable to think of a use for more than two Sessions, so a Context method to return (and, if necessary, create) a second should be sufficient.
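A minimal sketch of that shape, assuming the two Session handles are obtained from Context by some new accessor (hypothetical here):

```java
import java.util.stream.Stream;
import org.dspace.content.Item;
import org.hibernate.Session;
import org.hibernate.Transaction;

public class TwoSessionBulkPass {
    // readSession streams the candidates under one long-lived, read-only
    // transaction; writeSession commits each modification in its own
    // short transaction.
    void run(Session readSession, Session writeSession) {
        readSession.setDefaultReadOnly(true);
        Transaction readTx = readSession.beginTransaction();
        try (Stream<Item> items = readSession
                .createQuery("FROM Item", Item.class)
                .getResultStream()) {
            items.forEach(item -> {
                Transaction tx = writeSession.beginTransaction();
                try {
                    // Load the DSO afresh in the read-write Session.
                    Item writable = writeSession.get(Item.class, item.getID());
                    // ... modify 'writable' (hypothetical per-item work) ...
                    tx.commit(); // write-transaction lifetime: one DSO, milliseconds
                } catch (RuntimeException e) {
                    tx.rollback();
                    throw e;
                }
            });
        }
        readTx.commit(); // read-only: nothing to flush, locking stays inside the DBMS
    }
}
```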

I have implemented the two-Session approach as a quick hack for a specific problem, and it worked well.

Describe alternatives or workarounds you've considered
A bulk operation could hand over each DSO to a new Thread which supplies its own Context. A minimal implementation would simply join() the single worker thread. Pooled implementations are conceivable.
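A minimal sketch of the single-worker variant (names hypothetical; the point is that each unit of work gets a fresh Context, hence its own Session and transaction):

```java
import java.util.UUID;
import org.dspace.content.Item;
import org.dspace.content.service.ItemService;
import org.dspace.core.Context;

public class PerItemWorker implements Runnable {
    private final ItemService itemService;
    private final UUID itemId; // pass the ID across threads, never the entity

    PerItemWorker(ItemService itemService, UUID itemId) {
        this.itemService = itemService;
        this.itemId = itemId;
    }

    @Override
    public void run() {
        try {
            Context context = new Context(); // fresh Context: own Session and transaction
            Item item = itemService.find(context, itemId);
            // ... modify 'item' (hypothetical per-item work) ...
            itemService.update(context, item);
            context.complete(); // commit and release promptly
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The driving loop would start a Thread on each worker and join() it before dispatching the next DSO; a pooled variant would submit the same Runnables to an ExecutorService.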

A rough-and-ready alternative would be to immediately drain the results into a list of detached entities, which are then reloaded as required. This allows closing the transaction that wraps the query which produced the list. It seems rather inefficient, since a large result set would tend to start with entities which have already been pushed out of the Session cache and the second-level cache while the list was accumulating. This addresses the transaction-lifetime issue, but only incompletely addresses the memory pressure issue.
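A sketch of that variant, using Context#commit() to close the wrapping transaction and Context#reloadEntity() to re-attach each detached entity on demand (both existing DSpace APIs; the per-item work is hypothetical):

```java
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.dspace.authorize.AuthorizeException;
import org.dspace.content.Item;
import org.dspace.content.service.ItemService;
import org.dspace.core.Context;

public class DrainAndReload {
    void run(ItemService itemService) throws SQLException, AuthorizeException {
        Context context = new Context();

        // Drain the whole result set up front; the entities detach when the
        // wrapping transaction commits. The full list still sits in memory.
        List<Item> drained = new ArrayList<>();
        Iterator<Item> found = itemService.findAll(context);
        found.forEachRemaining(drained::add);
        context.commit(); // closes the query's transaction

        for (Item detached : drained) {
            Item item = context.reloadEntity(detached); // re-attach on demand
            // ... modify 'item' (hypothetical per-item work) ...
            itemService.update(context, item);
            context.commit(); // short, per-DSO transaction
        }
        context.complete();
    }
}
```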

@mwoodiupui added the new feature, needs discussion, backend: Hibernate, and needs triage labels on Mar 22, 2023
@mwoodiupui added this to the 8.0 milestone on Mar 22, 2023
@tdonohue (Member) commented:
Thanks @mwoodiupui. Good brainstorms.

I wonder if you could link to the example you mentioned above (if it's on GitHub somewhere):

I have implemented the two-Session approach as a quick hack for a specific problem, and it worked well.

That might provide more "clues" for future discussion. For lack of a better place to put this right now, I'm moving this to the "Needs Discussion / Analysis" column. I agree this sort of thing would be useful to analyze further; I'm just uncertain of its priority right now (but a volunteer or steward could help push this forward in 8.x or similar, if anyone gets time/interest).

@tdonohue removed the needs triage label on Mar 22, 2023
@amgciadev (Contributor) commented:
@mwoodiupui thanks a lot for adding this one!

@mwoodiupui (Member, Author) commented:
The above-mentioned hack, as a patch against dspace-api 7.4, is at https://gist.github.com/mwoodiupui/3a07b72f40b702c444335679ed0269fd

This illustrates both the two-Session approach and the use of a Stream for the result set. But I had to bypass the DAO layer to do either.
