Drive reindex using client side checkpointing for improved throughput #2195
Comments
Sub-task 1: New operation: $list-index, or $index, or $reindex-list, etc.
Input parameters:
Output:
Note:
Sub-task 2: Add a new "resourceLogicalIds" parameter that takes resource types+ids as a comma-delimited string. If both "resourceLogicalIds" and "resourceLogicalId" parameters are specified, then either error.
Sub-task 3: |
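For illustration only, here is a minimal Python sketch of how a comma-delimited "resourceLogicalIds" value might be parsed. The function name, the "Type/id" token format, and the choice to reject requests that supply both parameters are assumptions made for this sketch, not behavior confirmed in this thread.

```python
from typing import List, Optional, Tuple


def parse_resource_logical_ids(
    resource_logical_ids: Optional[str],
    resource_logical_id: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Parse a comma-delimited "Type/id" string into (type, id) pairs.

    Hypothetical helper: the token format is an assumption.
    """
    # Assumption: supplying both parameters is treated as a client error.
    if resource_logical_ids and resource_logical_id:
        raise ValueError(
            "specify only one of resourceLogicalIds or resourceLogicalId"
        )
    raw = resource_logical_ids or resource_logical_id or ""
    pairs: List[Tuple[str, str]] = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate empty tokens such as a trailing comma
        resource_type, sep, logical_id = token.partition("/")
        if not sep or not resource_type or not logical_id:
            raise ValueError(f"expected Type/id but got {token!r}")
        pairs.append((resource_type, logical_id))
    return pairs
```

A caller could pass `"Patient/123, Observation/abc"` and receive a list of two (type, id) pairs to hand to the reindex logic.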
Any thought of adding a |
A similar thought I had on sub-task 1 is that the list of resources you want is basically equivalent to a whole-system search with _total=none, but with no resource contents in the output. I opened a related feature request at #2027 and so I'd vote to get that implemented and make our standard search with no parameters (and system-defined sort) blazing fast.
seems like we just shouldn't list deleted resources in the output of this operation |
I don't think the operation should return deleted resources - they never need to be reindexed.
This mode of reindex won't be updating the reindex_tstamp, so this field isn't that useful. Note that the call to actually perform the reindex should support a list of resources, not just a single resource. This will make the calls a little more efficient (perhaps 50 at a time, which can be processed inside a single transaction). |
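As a rough sketch of the batching idea above, a client could split the identifiers it wants reindexed into groups of 50 before issuing each call. The helper name `batches` is hypothetical, and the default size of 50 simply mirrors the figure suggested in the comment.

```python
from typing import Iterable, Iterator, List


def batches(ids: Iterable[int], size: int = 50) -> Iterator[List[int]]:
    """Yield consecutive lists of at most `size` ids (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[int] = []
    for item in ids:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Final partial batch, if the total is not a multiple of `size`.
        yield batch
```

Each yielded list would correspond to one reindex request, so the server can process it inside a single transaction.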
Right, since $reindex would now accept a list of resources, I was thinking if the client knew the reindex tstamp of each resource, it could skip over resources that it knows have been reindexed after |
Ah, ok. I now understand the distinction between logical_resource_ids and logical_id. I see a couple options:
I think option 2 makes the most sense by having a pair of endpoints ($reindex and $list-index) that are used together, especially in case there is special metadata, such as the logical_resource_ids, that really doesn't fit well in the output Bundle from a normal search. Option 2 feels cleaner to me for this purpose. |
Signed-off-by: Troy Biesterfeld <tbieste@us.ibm.com>
Issue #2195 - Enable client checkpoint driven reindex
Signed-off-by: Troy Biesterfeld <tbieste@us.ibm.com>
Issue #2195 - Have client-driven reindex exit if no work to do
I re-opened #1822 for some ongoing pain with reindexing large databases, but otherwise this seems to be working. Currently, the
For large reindex jobs, this gets very verbose. For example, the fhir-bucket client only logs one message per request (and the default is like 50 resources in a single request). What we'd like is:
|
I think we should support Even if we chose not to support it (which I think would be wrong), today it comes back with a 500 internal server error whereas it should be a 400 or 405.
|
Added support for GET. |
the updated "Reindexing was completed" logic looks good:
|
I confirmed that |
I also verified that, when the reindex operation fails, we now get a log message that more clearly indicates on which resource we have failed. For example:
|
Is your feature request related to a problem? Please describe.
The current $reindex operation uses the reindex_tstamp column in LOGICAL_RESOURCES for selecting which resources to process for a particular thread. This selection process involves updating the column and using database-specific techniques to avoid concurrency issues from the resulting row-locks.
In PostgreSQL, this update leaves "tombstone" markers in the blocks which only get cleaned when the table is next vacuumed. If vacuuming is not aggressive enough, $reindex slows significantly due to the extra index blocks being scanned every time the request processor attempts to acquire a new resource to process.
Describe the solution you'd like
Although using the reindex_tstamp simplifies the client needed to drive the reindex operation (it can be as simple as a shell script running curl in a loop), better throughput could be achieved by avoiding the update statement and instead tracking progress (checkpointing) with a more sophisticated client implementation.
_count can be used to limit the number of resources selected each time. Only the logical_resource_id values would need to be returned, not the resources.
Check that the resulting throughput is greater than the current reindex operation (which can be kept), and that its throughput doesn't slow over time due to delayed vacuum of the logical_resources table.
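The checkpointing idea described above could look roughly like the following Python sketch. It is not the project's client: `drive_reindex`, `list_ids`, and `reindex` are hypothetical names, and the sketch assumes that logical_resource_id values are numeric and that the listing call returns them in ascending order after a given value.

```python
from typing import Callable, List


def drive_reindex(
    list_ids: Callable[[int, int], List[int]],
    reindex: Callable[[List[int]], None],
    count: int = 50,
    checkpoint: int = 0,
) -> int:
    """Drive a reindex from the client and return the final checkpoint.

    list_ids(after_id, count) is assumed to return up to `count`
    logical_resource_id values greater than `after_id`, in ascending order.
    reindex(ids) is assumed to reindex exactly those resources.
    """
    while True:
        ids = list_ids(checkpoint, count)
        if not ids:
            # No work left to do, so the client exits.
            return checkpoint
        reindex(ids)
        # Progress is tracked on the client; no server-side column is updated.
        checkpoint = ids[-1]
```

Because the checkpoint lives in the client, a restarted run can resume by passing the last saved value as `checkpoint` rather than relying on updates to reindex_tstamp.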