DS-2952 SOLR full text indexing multiple bitstreams #1595

tomdesair · 2016-12-30T12:35:42Z

This is a fix for https://jira.duraspace.org/browse/DS-2952

This fix differs from #1237 in that the code no longer copies or creates any extra files. Instead, it uses a SequenceInputStream to concatenate the input streams of all bitstreams in the TEXT bundle (which contains the extracted full text): http://docs.oracle.com/javase/7/docs/api/java/io/SequenceInputStream.html

I also grouped all logic concerning the retrieval and data extraction of the relevant bitstreams so that this is no longer present in the SolrServiceImpl class (making it a bit smaller). This also made it easier to unit test my changes.

I ran all unit tests, integration tests and license checks.
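The approach described above can be sketched in a few lines. This is a minimal illustration only, not the code from this PR: the class name `FullTextConcatSketch` and both helper methods are hypothetical, and the newline separator between streams is an assumption based on the commit title "Only prepend new line if we have an actual input stream".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not the FullTextContentStreams class added by this PR.
class FullTextConcatSketch {

    // Joins the given streams into one lazily-read stream, without copying
    // their content to a temporary file. Null entries are skipped, and a
    // newline is inserted only between two actual streams (assumed behaviour).
    static InputStream concatenate(List<InputStream> fullTextStreams) {
        List<InputStream> parts = new ArrayList<>();
        for (InputStream stream : fullTextStreams) {
            if (stream == null) {
                continue;
            }
            if (!parts.isEmpty()) {
                parts.add(new ByteArrayInputStream("\n".getBytes(StandardCharsets.UTF_8)));
            }
            parts.add(stream);
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    // Reads a stream to the end and decodes it as UTF-8 (for demonstration only).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Because SequenceInputStream only opens the next underlying stream once the previous one is exhausted, the combined text is never held in memory or on disk as a whole, which is the property that distinguishes this approach from copying the bitstreams into an extra file.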

terrywbrady · 2017-02-01T22:12:49Z

+1, tested locally and this works as described. I will add review comments.

@tomdesair , sorry for the delay in getting back to you on this.

While testing this process, I wondered how DSpace should handle full text extraction for bitstreams with different access policies within an item. Should the full text index process require anonymous read access in order to be added to the index?

tdonohue

@tomdesair : The code looks reasonable for what it is meant to achieve. However, I'm worried about what happens when this encounters the following scenarios:

- An item with one restricted bitstream
- An item with one restricted bitstream and one public bitstream
- An item with an embargoed bitstream
- An item with one embargoed bitstream and one public bitstream

I think we need unit tests that prove out expectations, as I don't see anything in this code that deals with access restrictions on files. Private bitstreams should not be publicly searchable (obviously).

I concur with Tim's recommendations. I will be glad to re-review with those changes.

tomdesair · 2017-02-22T12:48:33Z

Sorry for the late reply, but I'm still on paternity leave.

This PR kept the original behaviour: Index all bitstreams in the TEXT bundle.

I understand the need to take into account any resource policies on the ORIGINAL bitstream, but I disagree that this should happen at the SOLR index step. If we need to support policy start and end dates in the SOLR index step, you would need to reindex all items each night.

An alternative solution would be to only extract text of publicly available bitstreams in the filter-media command. This command already runs at least once a day so it will automatically support embargoed bitstreams.

But I think this is a separate (but related) issue that should be tracked in a different Jira ticket and pull-request.
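The point about policy start and end dates can be made concrete with a small sketch. It assumes a policy is simply a date window in which a null bound means "unbounded"; the class and method names are hypothetical and this is not DSpace's resource policy API.

```java
import java.time.LocalDate;

// Hypothetical sketch of a date-bounded access policy; not DSpace code.
class PolicyWindowSketch {

    // Returns true when 'today' falls inside the policy window.
    // A null start or end date is treated as unbounded on that side.
    static boolean isActive(LocalDate start, LocalDate end, LocalDate today) {
        boolean started = (start == null) || !today.isBefore(start);
        boolean notEnded = (end == null) || !today.isAfter(end);
        return started && notEnded;
    }
}
```

The result depends on the day the check runs, so a decision made once at index time goes stale as soon as an embargo date passes. Keeping the index correct would mean re-evaluating it on a schedule, whereas a job that already runs daily performs the check at the right moment.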

I've rethought my previous suggestions based on the discussion of this ticket/PR in our DevMtg

tdonohue

After further thought and discussion in today's DevMtg (http://irclogs.duraspace.org/index.php?date=2017-02-22), I've decided to withdraw my request for updates. Instead, I feel this PR is the best fix for 6.1, and am approving it.

- This fixes the main bug (only one bitstream is currently indexed)...now all bitstreams are indexed.
- It doesn't make things "worse"...currently it is already possible to search within restricted/embargoed bitstreams (you won't be able to SEE the bitstreams, though you might see a snippet of where your search matched). This separate issue is now tracked over in https://jira.duraspace.org/browse/DS-3498
- In reality, if the Item itself is Restricted or Embargoed (at the Item level), we already cover those scenarios...so you won't see those Items in your results.

tdonohue · 2017-02-22T20:35:30Z

I've cherry-picked these commits to dspace-6_x.

tomdesair added 4 commits December 28, 2016 23:47

DS-2952: Use a SequenceInputStream to add the content of multiple full text bitstreams to SOLR

3b2d8f3

DS-2952: Small improvements to FullTextContentStreams and added a unit test for it

df1f81b

DS-2952: Only prepend new line if we have an actual input stream

c50c300

DS-2952: Added missing license

f5e07ba

terrywbrady self-assigned this Jan 11, 2017

terrywbrady self-requested a review February 1, 2017 21:26

terrywbrady removed their assignment Feb 1, 2017

terrywbrady previously approved these changes Feb 1, 2017

tdonohue added the bug label Feb 7, 2017

tdonohue added this to the 6.1 milestone Feb 7, 2017

tdonohue previously requested changes Feb 15, 2017

terrywbrady mentioned this pull request Feb 18, 2017

DS-2952: Alternative approach for PR 1595 #1647

Closed

tdonohue approved these changes Feb 22, 2017

terrywbrady merged commit 1a1b765 into DSpace:master Feb 22, 2017

tdonohue mentioned this pull request Feb 22, 2017

DS-2952: Enable ALL text-related bitstreams to be included in the full-text index. #1237

Closed

dspace-bot mentioned this pull request Aug 31, 2021

[DS-3495] Full text content in SOLR should carry the permissions of the source bitstream #6847

Open

4science-it pushed a commit to 4Science/DSpace that referenced this pull request Feb 12, 2024

Merged in DSC-1497 (pull request DSpace#1595)

94f0433

[DSC-1497] Add Authority SOURCE ref on the lookup results Approved-by: Stefano Maffei
