Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve S3 Bitstream Storage to Lazy download object from S3 #9477

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

abollini
Copy link
Member

@abollini abollini commented Apr 15, 2024

Description

The S3BitStoreService#get has been updated to return a custom InputStream able to lazy download the underlying S3 object when used. In this way we reduce the amount of space required on the local storage during download to a configurable value (default 5Mb).

Please note that this PR contains also an unrelated commit
2fd8680
that is needed to allow to use a recent version of Eclipse to build DSpace. Without this commit eclipse complains about The package org.w3c.dom is accessible from more than one module: , java.xml

Instructions for Reviewers

The testBitstreamPutAndGetWithAlreadyPresentBucket in dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java has been updated to verify that get and put still work properly and are consistent for different content size.

You should checkout the PR and try to download one or more files of different size, monitor the temp folder to verify that no files named s3-copy-... are left behind and that also downloading a large file each temp file is not larger than 5Mb.

You can eventually customize the size of the chunk in the bitstore.xml spring file setting the property bufferSize in the s3Store bean definition.

Checklist

  • My PR is small in size (e.g. less than 1,000 lines of code, not including comments & integration tests). Exceptions may be made if previously agreed upon.
  • My PR passes Checkstyle validation based on the Code Style Guide.
  • My PR includes Javadoc for all new (or modified) public methods and classes. It also includes Javadoc for large or complex private methods.
  • My PR passes all tests and includes new/updated Unit or Integration Tests based on the Code Testing Guide.
  • If my PR includes new libraries/dependencies (in any pom.xml), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.
  • [n/a] If my PR modifies REST API endpoints, I've opened a separate REST Contract PR related to this change.
  • If my PR includes new configurations, I've provided basic technical documentation in the PR itself.
  • [n/a] If my PR fixes an issue ticket, I've linked them together.

… and cause - The package org.w3c.dom is accessible from more than one module: <unnamed>, java.xml
@abollini abollini changed the title Lazy s3input Improve S3 Bitstream Storage to Lazy download object from S3 Apr 15, 2024
@abollini abollini added the port to dspace-7_x This PR needs to be ported to `dspace-7_x` branch for next bug-fix release label Apr 15, 2024
@tdonohue tdonohue added this to the 8.0 milestone Apr 15, 2024
Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abollini : I haven't tested this yet, but the code looks good overall. A few minor code suggestions inline below.

this.bufferSize = bufferSize;
}

public class S3LazyInputStream extends InputStream {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add JavaDocs to describe this imbedded class?

return byteRead;
}

private void downloadChunk() throws IOException, FileNotFoundException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alignment of this method is off. Can we also add a basic description to this class to describe what it's doing? This seems to be the main method.

downloadChunk();
}

int byteRead = currPos < endOfChunk ? currentChunkStream.read() : -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code alignment it odd here. It's indented more than normal

<groupId>xml-apis</groupId>
<artifactId>xml-apis</artifactId>
</exclusion>
</exclusions>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to exclude this? How is this related to the issue this PR is solving?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pnbecker : This change is unrelated to the bug, but it is described above in the description of the PR:

Please note that this PR contains also an unrelated commit
2fd8680
that is needed to allow to use a recent version of Eclipse to build DSpace. Without this commit eclipse complains about The package org.w3c.dom is accessible from more than one module: , java.xml

<groupId>xml-apis</groupId>
<artifactId>xml-apis</artifactId>
</exclusion>
</exclusions>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment at the other pom.xml

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment...this change seems to be related to an issue with using Eclipse + DSpace. It is unrelated to the S3 changes but I expect @abollini made the change to allow his Eclipse IDE to work better in developing this fix.

@pnbecker
Copy link
Member

I will have to look deeper into this. In a first quick test I was not able to download files anymore. I got a White label error page instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend: bitstore Related to file/bitstream storage bug high priority port to dspace-7_x This PR needs to be ported to `dspace-7_x` branch for next bug-fix release
Projects
Status: 👀 Under Review
Development

Successfully merging this pull request may close these issues.

None yet

3 participants