Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DS-4166 Index workspace, workflow and tasks in SOLR #65

Closed
wants to merge 103 commits into from

Conversation

abollini
Copy link
Member

@abollini abollini commented Feb 14, 2019

This PR is to share the implementation of the indexing features required for the implementation of the MyDSpace.

Integration Tests have been added to the DiscoverControllerIT to demonstrate:

  1. inprogress items and tasks not popup elsewhere
  2. the special workspace configuration works as expected (the user only see her own submissions)
  3. the special workflow configuration works as expected (reviewers see only tasks that they have claimed or can claim)

PR to highlight the small changes required to the Discover (search) endpoint DSpace/RestContract#55
Namely the change of the embedded dspaceObject to rObject (result object) as now also workspace, workflow and tasks are returned

TO DO:

  • move this PR to the DSpace project after than the DS-3851 will be merged

mwoodiupui and others added 30 commits February 4, 2019 14:58
This should work without altering Solr, across Solr releases, as long
as Solr ships the necessary additional analyzers in /contrib.
Solr not DSpace.  Break loooong tags into attribute-per-line format.
sample schema.  Tidy indentation, break very long elements into
multiple lines.
…ting.

Some of these may be unnecessary, but I don't know on which we facet.
@@ -111,6 +127,8 @@ public static void main(String[] args) throws SQLException, IOException, SearchS
options.addOption(OptionBuilder.isRequired(false).withDescription(
"optimize search core").create("o"));

options.addOption("e", "readfile", true, "Read the identifier from a file");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These identifiers need to be items, best to alter the documentation here to make this clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

option removed as it is out of scope of this PR

}
indexer.updateIndex(context, ids, line.hasOption("f"), type);
} catch (Exception e) {
log.error("Error: " + e.getMessage());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error log here should also log the stacktrace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

option removed, the code doesn't exist anymore

indexContent(context, community, force);
public void updateIndex(Context context, List<UUID> ids, boolean force, int type) {
if (type != Constants.ITEM) {
throw new RuntimeException("Only ITEM is supported in this mode - type founded: " + type);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would an IllegalArgumentException not make more sense here instead of a generic "RuntimeException" as this is what it really is. An illegal Argument was passed along.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code doesn't exist anymore

}
break;
default:
throw new RuntimeException("No type known: " + type);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would an IllegalArgumentException not make more sense here instead of a generic "RuntimeException" as this is what it really is. An illegal Argument was passed along.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, 5e3164b

ids.add(wfi.getItem().getID());
}
}
List<UUID>[] arrayIDList = Util.splitList(ids, numThreads);
Copy link

@KevinVdV KevinVdV Mar 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this method is called upon multiple times from a single process the number of threads that can be active at the same time are not limited to the "numThreads" variable. This could really lead to memory leaks where we have way too many threads running at the same time. It would really be best to use a spring TaskExecutor: https://docs.spring.io/spring/docs/4.3.x/spring-framework-reference/html/scheduling.html. This will ensure that the number of threads started by the discovery process is limited to the "numThreads" variable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

support for multi-threads has been withdrawn from this PR as suggested by @cwilper #65 (comment) and will be eventually introduced in a separate PR/ticket

threads.add(thread);
}
boolean finished = false;
while (!finished) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

USe the spring TaskExecutors to find a better way to handle this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

support for multi-threads has been withdrawn from this PR as suggested by @cwilper #65 (comment) and will be eventually introduced in a separate PR/ticket

document.addField("namedresourcetype_keyword", fvalue);
}

class IndexerThread extends Thread {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should really implement the Runnable class instead of extending thread as this is the preferred way to work with threads in java.

https://stackoverflow.com/questions/541487/implements-runnable-vs-extends-thread-in-java

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the suggestion but support for multi-threads has been withdrawn from this PR as suggested by @cwilper #65 (comment) and will be eventually introduced in a separate PR/ticket

public void cleanIndex(boolean force) throws IOException, SQLException, SearchServiceException {
if (force) {
try {
getSolr().deleteByQuery(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use ":" as the query ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is safe in case the local installation has customized the search index in a way that we don't know so that it includes now things maybe managed and indexed by other classes/services.

@@ -1931,6 +2182,20 @@ protected DiscoverResult retrieveResult(Context context, DiscoverQuery query, Qu
return result;
}

public DiscoverResult.FacetResult getDiscoveryFacet(Context context, FacetField facetField,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method isn't used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed cc11f49

System.out.println(head + ":" + (idx++) + " / " + size);
}
} catch (Exception e) {
e.printStackTrace();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed, as it is already logged below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code doesn't exist anymore

indexContent(context, item, force);
context.uncacheEntity(item);
} catch (Exception ex) {
log.error("ERROR: identifier item:" + id + " identifier thread:" + head);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also log the stacktrace here ?

} catch (Exception ex) {
log.error("ERROR: identifier item:" + id + " identifier thread:" + head);
}
System.out.println(head + ":" + (idx++) + " / " + size);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this end up in the logs instead of in the system out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

he code doesn't exist anymore

@cwilper
Copy link

cwilper commented Mar 27, 2019

Regarding the multi-threaded indexing, does it makes sense to do that as a distinct JIRA/PR pair? It doesn't seem essential to the functionality being introduced here, and would seem worth dedicated consideration.

I agree that moving toward something like spring executors would probably result in more robust thread management. I also think some dedicated testing would be good...at least to show/report that there's a throughput improvement for a typical site for the chosen default number of threads.

@abollini
Copy link
Member Author

Hi all, thanks to all of you @benbosman @KevinVdV @cwilper @tdonohue for your review. I tried to put response with links in all the comment to speedup the validation process.
Please note that the PR is now open on the official DSpace repository so we should continue here our discussion DSpace#2391

@benbosman about your latest general comment #65 (comment)

We've agreed to rename BrowsableObject to IndexableObject in Java

done, see 4b85bf4 and 30899b0

We've agreed to rename "resultObject" to "indexableObject" in the discover REST result

done, see b98d8f4

We've agreed to make sure IndexableObject is implemented in Item, Collection, Community; and not in DSpaceObject

done, see 689ac4e

the changes to the Constants class are currently under discussion

I have preferred to avoid any changes on that. If we will end in replace the search.resoucetype field with a search.resourcename field we can drop the new constants at this time instead than introduce a double way to deal with this things now.
I don't see any issue in the Constants class to have ID for not DSpaceObject, it is not stated in any place that the scope of the Constants class was limited to DSpaceObject, moreover also in the authorization infrastructure of the REST API we need constants to identify the different kind of objects regardless to be them dspaceobject or not, see
https://github.com/DSpace/DSpace/blob/master/dspace-spring-rest/src/main/java/org/dspace/app/rest/repository/PoolTaskRestRepository.java#L81

If a change to the current implementation of the Constants class is needed it can be done in my opinion after the preview release or later.

@abollini abollini closed this Mar 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants