New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DS-2952 SOLR full text indexing multiple bitstreams #1595
DS-2952 SOLR full text indexing multiple bitstreams #1595
Conversation
…l text bitstreams to SOLR
+1, tested locally and this works as described. I will add review comments. @tomdesair , sorry for the delay in getting back to you on this. While testing this process, I wondered how DSpace should handle full text extraction for bitstreams with different access policies within an item. Should the full text index process require anonymous read access in order to be added to the index? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomdesair : The code looks reasonable for what it is meant to achieve. However, I'm worried about what happens when this encounters the following scenarios:
- An item with one restricted bitstream
- An item with one restricted bitstream and one public bitstream
- An item with an embargoed bitstream
- An item with one embargoed bitstream and one public bitstream
I think we need unit tests that prove out expectations, as I don't see anything in this code that deals with access restrictions on files. Private bitstreams should not be publicly searchable (obviously).
I concur with Tim's recommendations. I will be glad to re-review with those changes.
Sorry for the late reply, but I'm still on paternity leave. This PR kept the original behaviour: Index all bitstreams in the TEXT bundle. I understand the need to take into account any resource policies on the ORIGINAL bitstream, but I disagree that this should happen at the SOLR index step. If we need to support policy start and end dates in the SOLR index step, you would need to reindex all items each night. An alternative solution would be to only extract text of publicly available bitstreams in the filter-media command. This command already runs at least once a day so it will automatically support embargoed bitstreams. But I think this is a separate (but related) issue that should be tracked in a different Jira ticket and pull-request. |
I've rethought my previous suggestions based on the discussion of this ticket/PR in our DevMtg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After further thought and discussion in Today's DevMtg (http://irclogs.duraspace.org/index.php?date=2017-02-22), I've decided to withdraw my request for updates. Instead, I feel this PR is the best fix for 6.1, and am approving it.
- This fixes the main bug (only one bitstream is currently indexed)...now all bitstreams are indexed
- It doesn't make things "worse"...currently it is possible to search within restricted/embargoed bitstreams (though you won't be able to SEE the bitstreams, though might see a snippet of where you search matched). This separate issue is now tracked over in https://jira.duraspace.org/browse/DS-3498
- In reality, if the Item itself is Restricted or Embargoed (at the Item Level), we already cover those scenarios...so you won't see those in your results.
I've cherry-picked these commits to |
[DSC-1497] Add Authority SOURCE ref on the lookup results Approved-by: Stefano Maffei
This is a fix for https://jira.duraspace.org/browse/DS-2952
This fix is different from #1237 because now the code doesn't copy or create any extra files. The approach I took here is to use a SequenceInputStream that will concatenate all input streams of the various bitstreams of the TEXT bundle (containing the extracted full text): http://docs.oracle.com/javase/7/docs/api/java/io/SequenceInputStream.html
I also grouped all logic concerning the retrieval and data extraction of the relevant bitstreams so that this is no longer present in the SolrServiceImpl class (making it a bit smaller). This also made it easier to unit test my changes.
I ran all unit tests, integration tests and license checks.