Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DS-2952 SOLR full text indexing multiple bitstreams #1595

Conversation

tomdesair
Copy link
Contributor

This is a fix for https://jira.duraspace.org/browse/DS-2952

This fix is different from #1237 because now the code doesn't copy or create any extra files. The approach I took here is to use a SequenceInputStream that will concatenate all input streams of the various bitstreams of the TEXT bundle (containing the extracted full text): http://docs.oracle.com/javase/7/docs/api/java/io/SequenceInputStream.html

I also grouped all logic concerning the retrieval and data extraction of the relevant bitstreams so that this is no longer present in the SolrServiceImpl class (making it a bit smaller). This also made it easier to unit test my changes.

I ran all unit tests, integration tests and license checks.

@terrywbrady terrywbrady self-assigned this Jan 11, 2017
@terrywbrady terrywbrady self-requested a review February 1, 2017 21:26
@terrywbrady terrywbrady removed their assignment Feb 1, 2017
@terrywbrady
Copy link
Contributor

+1, tested locally and this works as described. I will add review comments.

@tomdesair , sorry for the delay in getting back to you on this.

While testing this process, I wondered how DSpace should handle full text extraction for bitstreams with different access policies within an item. Should the full text index process require anonymous read access in order to be added to the index?

terrywbrady
terrywbrady previously approved these changes Feb 1, 2017
@tdonohue tdonohue added the bug label Feb 7, 2017
@tdonohue tdonohue added this to the 6.1 milestone Feb 7, 2017
Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomdesair : The code looks reasonable for what it is meant to achieve. However, I'm worried about what happens when this encounters the following scenarios:

  • An item with one restricted bitstream
  • An item with one restricted bitstream and one public bitstream
  • An item with an embargoed bitstream
  • An item with one embargoed bitstream and one public bitstream

I think we need unit tests that prove out expectations, as I don't see anything in this code that deals with access restrictions on files. Private bitstreams should not be publicly searchable (obviously).

@terrywbrady terrywbrady dismissed their stale review February 16, 2017 20:26

I concur with Tim's recommendations. I will be glad to re-review with those changes.

@tomdesair
Copy link
Contributor Author

tomdesair commented Feb 22, 2017

Sorry for the late reply, but I'm still on paternity leave.

This PR kept the original behaviour: Index all bitstreams in the TEXT bundle.

I understand the need to take into account any resource policies on the ORIGINAL bitstream, but I disagree that this should happen at the SOLR index step. If we need to support policy start and end dates in the SOLR index step, you would need to reindex all items each night.

An alternative solution would be to only extract text of publicly available bitstreams in the filter-media command. This command already runs at least once a day so it will automatically support embargoed bitstreams.

But I think this is a separate (but related) issue that should be tracked in a different Jira ticket and pull-request.

@tdonohue tdonohue dismissed their stale review February 22, 2017 16:05

I've rethought my previous suggestions based on the discussion of this ticket/PR in our DevMtg

Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After further thought and discussion in Today's DevMtg (http://irclogs.duraspace.org/index.php?date=2017-02-22), I've decided to withdraw my request for updates. Instead, I feel this PR is the best fix for 6.1, and am approving it.

  1. This fixes the main bug (only one bitstream is currently indexed)...now all bitstreams are indexed
  2. It doesn't make things "worse"...currently it is possible to search within restricted/embargoed bitstreams (though you won't be able to SEE the bitstreams, though might see a snippet of where you search matched). This separate issue is now tracked over in https://jira.duraspace.org/browse/DS-3498
  3. In reality, if the Item itself is Restricted or Embargoed (at the Item Level), we already cover those scenarios...so you won't see those in your results.

@terrywbrady terrywbrady merged commit 1a1b765 into DSpace:master Feb 22, 2017
@tdonohue
Copy link
Member

I've cherry-picked these commits to dspace-6_x.

4science-it pushed a commit to 4Science/DSpace that referenced this pull request Feb 12, 2024
[DSC-1497] Add Authority SOURCE ref on the lookup results

Approved-by: Stefano Maffei
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants