Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reindex method to MNAdmin API #1716

Closed
taojing2002 opened this issue Oct 26, 2023 · 16 comments
Closed

Add reindex method to MNAdmin API #1716

taojing2002 opened this issue Oct 26, 2023 · 16 comments
Assignees
Milestone

Comments

@taojing2002
Copy link
Contributor

Now, we disabled the reindex and reindexall methods on the olde Metacat API. They are very helpful. I am planning to add them into the Metacat Admin API. They are still only called by the Metacat administrators.

@taojing2002 taojing2002 added this to the 3.0.0 milestone Oct 26, 2023
@taojing2002 taojing2002 self-assigned this Oct 26, 2023
@mbjones
Copy link
Member

mbjones commented Nov 3, 2023

This is good, Jing. I suggest it should be a single reindex method that is configurable to index a list of PIDs or all PIDs, and that we should put this in a new MNAdmin API service that we implement just for Metacat for now. Here's a proposed set of docs for a MNAdmin.reindex() method. Feedback and changes welcomed.

Admin API

Administer repository deployments.

The admin API methods are used to administer a repository deployment, and the API MUST be restricted to authenticated subjects with administrative permissions. While the API overall may have utility across the DataONE network, the implementation of these methods will vary across different software implementations of the DataONE API, and some implementations may not need all (or any) of these facilities. The Admin service and all of its API methods are therefore optional and can be omitted from implementation, or may return NotImplemented exceptions as appropriate.

reindex(session, pid, all=FALSE)

GET /reindex[?pid={pid}, all=FALSE]]

For repositories that uses a separate index for metadata search and discovery, this method triggers a reindexing task to ensure the index is up-to-date, either for one or more pid values passed in as parameters, or for all PIDs on the node if all=TRUE. If the caller does not provide a list of PIDs to be processed and does not set all=TRUE, then the method should do nothing, to prevent accidental requests to reindex the entire corpus. For many implementations, this method would only queue an asynchronous indexing task, and not wait for the task to complete.

Parameters

  • pid a repeatable list of identifiers of objects to be reindexed
  • all a boolean value, which if set to TRUE will cause all objects in the system to be reindexed ( a potentially very expensive operation)
  • Returns Boolean TRUE if the reindex request is scheduled
  • Return type: Types.Boolean
  • Raises:
    - Exceptions.NotImplemented – The service is not implemented. (errorCode=501, detailCode=????)
    - Exceptions.ServiceFailure – (errorCode=500, detailCode=????)
    - Exceptions.InvalidRequest – The request was malformed. (errorCode=400, detailCode=????)

@taojing2002 taojing2002 changed the title Add reindex and reindexall methods to Metacat Admin API Add reindex method to MNAdmin API Nov 3, 2023
@taojing2002
Copy link
Contributor Author

Looks good! I like the idea to use an explicit parameter to indicate if Metacat indexes all pids. This can avoid accidental time-consuming processes.

I have two things needed to be clarified:

  1. The command for multiple pids. Does it looks like:
GET /reindex[?pid={pid}, pid={pid1}, pid={pid2}]
  1. What should we consider this command?
GET /reindex[?pid={pid}, all=TRUE]

@mbjones
Copy link
Member

mbjones commented Nov 3, 2023

Yeah, I thought about your point 1 a bit. Seems like there are two options:
- GET /reindex?pid=pid&pid=pid1&pid=pid2
- GET /reindex?pid=pid,pid1,pid2
The first form is, I think, what we use now, and URL-encoding libraries generally support multi-valued params and convert them to an array for us automatically. The second form may be more compact, but suffers from haveing to figure out how to escape commas in PIDs, which are legal URI characters, but would need escaping somehow to work as a delimiter. The first form uses standard form-url-encoded semantics which makes the demarcation and escaping clear. SO I would lean to the first option, but can see benefits of other forms too.

For the other question, I think all=TRUE should always indicate that you reindex all pids, and the pid parameter is irrelvant as a subset of the whole.

@taojing2002
Copy link
Contributor Author

I think the first form is better since some clients, e.g. ess-dive folks, have scripts based on this form. So it can minimize their changes if we adopt the first form.

@taojing2002
Copy link
Contributor Author

I added a NotAuthorized exception for this mn method.

@artntek
Copy link
Contributor

artntek commented Nov 30, 2023

see PR #1738

@taojing2002
Copy link
Contributor Author

taojing2002 commented Jan 23, 2024

Based on our discussion about this ticket, we revised the rest api call:

PUT /query/solr/index?[all=true] | [&pid={pid}]

PUT /query/solr/index?all=true //Reindex everything
PUT /query/solr/index?all=true&pid={pid1}  //pid1 will be ignored. Reindex everything
PUT /query/solr/index?pid={pid1}&pid={pid2} //Reindex pid1 and pid2
PUT /query/solr/index?all=false&pid={pid1}&pid={pid2} //Reindex pid1 and pid2

Methods:
reindex(session, pids[])
Parameters
pids a repeatable list of identifiers of objects to be reindexed
Returns Boolean TRUE if the reindex request is scheduled
Return type: Types.Boolean
Raises:

reindexAll(session)
Returns Boolean TRUE if the reindexall request is scheduled
Return type: Types.Boolean
Raises:

@taojing2002 taojing2002 reopened this Jan 23, 2024
@artntek
Copy link
Contributor

artntek commented Jan 24, 2024

Important to note we are treating index as a noun/resource here - it is not a verb

@mbjones
Copy link
Member

mbjones commented Jan 24, 2024

Even as a noun, it then treats the index as a subcollection of the SOLR query engine, which already tries to pass its contents off as a SOLR query string. Is there a potential conflict there?

Also, do we want reindex to be scoped only to SOLR? What if we switch from SOLR to elasticsearch in the future? Reinfecting would still be needed. Can we separate the API from the implementation? I'd prefer that.

@artntek
Copy link
Contributor

artntek commented Jan 24, 2024

Good points. Would this simpler form work, in that case?

PUT /index?[all=true] | [&pid={pid}]

@mbjones
Copy link
Member

mbjones commented Jan 30, 2024

I think I like that better. It treats the index as a resource (which it is), and makes the service implementation independent. Let's discuss during tomorrow's backend call to finalize.

@taojing2002
Copy link
Contributor Author

taojing2002 commented Jan 30, 2024

Final version:

Rest Calls:

PUT /index/{pid1}
PUT /index[/]?[all=true] | [&pid={pid}]

PUT /index?all=true //Reindex everything
PUT /index?all=true&pid={pid1}  //pid1 will be ignored. Reindex everything
PUT /index/?pid={pid1}&pid={pid2} //Reindex pid1 and pid2
PUT /index/?all=false&pid={pid1}&pid={pid2} //Reindex pid1 and pid2

Two Methods:
reindex(session, pids[])
Parameters:
pids a repeatable list of identifiers of objects to be reindexed
Returns Boolean TRUE if the reindex request is scheduled
Return type: Types.Boolean
Raises:

reindexAll(session)
Returns Boolean TRUE if the reindexall request is scheduled
Return type: Types.Boolean
Raises:

@artntek
Copy link
Contributor

artntek commented Jan 30, 2024

Did we decide not to support the following example? Or should we allow it as valid?

PUT /index/{pid1}

@taojing2002
Copy link
Contributor Author

We didn't mention to support this example.

Did we decide not to support the following example? Or should we allow it as valid?

PUT /index/{pid1}

@mbjones
Copy link
Member

mbjones commented Jan 30, 2024

I requested that we support the standard REST syntax. We discussed that on the call today @taojing2002

PUT /index/{pid1}

@taojing2002
Copy link
Contributor Author

I just remembered we discussed the issue on /identifier/{pid}. Okay, I just added it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants