
9361 storage quotas #9409

Merged
21 commits merged into develop from 9361-storage-quotas on Sep 29, 2023
Conversation

@landreev (Contributor) commented Feb 24, 2023

What this PR does / why we need it:
(Please note the draft nature of this PR: it is a work in progress, and the model/approach needs to be reviewed before I can produce a finalized version.)

As explained in the issue, this PR will not result in a feature that's ready to be used in real life for enforcing meaningful storage quotas. What it does is add the quota check to the upload workflow; to test the concept, it provides a mechanism where the same storage quota limit can be defined for all users of the installation (and only the files directly created by the user are counted towards the quota). Again, this is not going to be useful as a practical solution, but it will allow us to further expand the quota definition mechanism once the checks are there. This was meant to be a sensible sprint-or-so-sized chunk of work. The PR also moves some file unpacking and processing code, which turns uploads into Dataverse DataFile objects, out of a static utility method and into its own dedicated Command (so that files can eventually become subject to command engine-side metering too).
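To make the model concrete, here is a minimal, self-contained sketch of the quota check concept described above. This is not the PR's actual code; all names here are hypothetical and purely illustrative.

public class QuotaCheckSketch {
    /** Hypothetical lookup: total bytes used by files this user created. */
    interface UsageLookup {
        long bytesUsedByCreator(long userId);
    }

    /**
     * Rejects an upload if saving sizeInBytes more bytes would put the user
     * over the installation-wide quota (a null quota means nothing to enforce).
     */
    static void checkQuota(UsageLookup usage, long userId, long sizeInBytes, Long quotaInBytes) {
        if (quotaInBytes == null) {
            return; // no quota defined
        }
        long used = usage.bytesUsedByCreator(userId);
        if (used + sizeInBytes > quotaInBytes) {
            throw new IllegalStateException("Upload of " + sizeInBytes
                    + " bytes would exceed the storage quota of " + quotaInBytes
                    + " bytes (already used: " + used + ")");
        }
    }
}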

Which issue(s) this PR closes:

Closes #9361

Special notes for your reviewer:

Suggestions on how to test this:

As stated elsewhere in the PR (see the long-ish comment from Mar 17, in addition to the description above), all that's implemented here are the added mechanisms in the upload pipeline that prevent a user from uploading more content once they exceed the quota. The mechanism for defining the actual quota is essentially a placeholder, just enough to test these new hooks, but not yet usable as a real-life practical solution.
So a test scenario would be to set an arbitrary limit via the :StorageQuotaSizeInBytes setting and make sure you can't upload more than that once the limit is reached.
Additionally: the second setting, :UseStorageQuotas, activates the enforcement of the quota (defaults to false).
I can answer any additional questions etc.
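To illustrate how the two settings interact, here is a hedged sketch; the plain Map stands in for the settings service, so none of this is Dataverse's actual API.

class QuotaSettingsSketch {
    // :UseStorageQuotas defaults to false; while it is false, no quota applies
    static Long effectiveQuotaInBytes(java.util.Map<String, String> settings) {
        boolean enforce = Boolean.parseBoolean(settings.getOrDefault(":UseStorageQuotas", "false"));
        if (!enforce) {
            return null; // enforcement disabled
        }
        String size = settings.get(":StorageQuotaSizeInBytes");
        return (size == null) ? null : Long.valueOf(size);
    }
}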

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@@ -1285,6 +861,11 @@ public static File saveInputStreamInTempFile(InputStream inputStream, Long fileS
throw new FileExceedsMaxSizeException (MessageFormat.format(BundleUtil.getStringFromBundle("file.addreplace.error.file_exceeds_limit"), bytesToHumanReadable(fileSize), bytesToHumanReadable(fileSizeLimit)));
}

if (storageQuotaLimit != null && fileSize > storageQuotaLimit) {
try {tempFile.toFile().delete();} catch (Exception ex) {}


🚫 [reviewdog] <com.puppycrawl.tools.checkstyle.checks.blocks.LeftCurlyCheck> reported by reviewdog 🐶
'{' at column 21 should have line break after.

try {
Charset charset = null;
/*
TODO: (?)


🚫 [reviewdog] <com.puppycrawl.tools.checkstyle.checks.whitespace.FileTabCharacterCheck> reported by reviewdog 🐶
File contains tab characters (this is the first instance).

@coveralls commented Feb 27, 2023

Coverage Status

coverage: 20.01% (-0.04%) from 20.045% when pulling 7ab736c on 9361-storage-quotas into 4e1bc5b on develop.

@scolapasta scolapasta moved this from Ready for Review ⏩ to Review 🔎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Mar 1, 2023
@scolapasta scolapasta added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Mar 1, 2023
@scolapasta (Contributor) left a comment:

Overall, this looks great - I did leave a few questions as comments, most not too important. The biggest is how we actually want to define quotas, which leads to the question of what the next steps are. We should discuss that and then add the summary here. (E.g., one option is to test this as is, and then add user-specific quotas (if that's the direction we go) as a new issue / PR; if so, we should also add some integration tests here.)

@@ -140,6 +142,36 @@ public class DataFileServiceBean implements java.io.Serializable {
*/
public static final String MIME_TYPE_PACKAGE_FILE = "application/vnd.dataverse.file-package";

public class UserStorageQuota {

Should we move this out to be its own class? (I.e., is there any reason for it to be an inner class here?)

try {
Long totalSize = (Long)query.setParameter("creatorId", user.getId()).getSingleResult();
logger.info("total size for user: "+totalSize);
return totalSize == null ? 0L : totalSize;

When would this be null (as opposed to an exception being thrown)? Is this where we, for some reason, don't have sizes stored?
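(Editor's note: one grounded answer is that a JPQL aggregate SUM returns null, rather than 0, when no rows match - e.g., for a user who has not created any DataFiles yet. A minimal sketch of the same lookup, reusing the em and user variables from the snippet above:)

// SUM over zero matching rows yields a null Long, hence the null -> 0L mapping
Long totalSize = (Long) em.createQuery(
        "SELECT SUM(o.filesize) FROM DataFile o WHERE o.creator.id=:creatorId")
        .setParameter("creatorId", user.getId())
        .getSingleResult(); // null when the user has created no files
long used = (totalSize == null) ? 0L : totalSize;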


public boolean isStorageQuotaEnforced() {
return maxTotalUploadSizeInBytes != null;
}

public Long getMaxIngestSizeInBytes() {

Should we reconsider naming this, since it does more now?

@@ -1657,4 +1689,29 @@ public Embargo findEmbargo(Long id) {
DataFile d = find(id);
return d.getEmbargo();
}

public Long getStorageUsageByCreator(AuthenticatedUser user) {
Query query = em.createQuery("SELECT SUM(o.filesize) FROM DataFile o WHERE o.creator.id=:creatorId");

This makes me think (as you point out) about how we even decide to define the limits: by user? by dataset? For example, a workaround for a per-user limit would be to create a new account and give it permission.

@@ -155,6 +160,7 @@
fileLimit="#{EditDatafilesPage.getMaxNumberOfFiles()}"
invalidSizeMessage="#{bundle['file.edit.error.file_exceeds_limit']}"
sequential="true"
previewWidth="-1"

What does this do?

@landreev (Contributor Author) replied:

TL;DR it disables the browser-side thumbnail previews.

I was delighted to discover that, as of a couple of recent versions of PrimeFaces, it is now possible to disable their own image thumbnail preview in the upload component. It used to drive me nuts - we don't really need it (because we generate our own thumbnails server-side); but, unlike with our thumbnails, there's no way to define a size limit, so attempting to upload a bunch of large enough images paralyzes the user's browser (since it's all JavaScript). And, adding insult to injury, the user assumes that it's our site that's crapping the bed... I am embarrassed to acknowledge that, despite all that righteous indignation, I somehow missed the point at which they added the ability to disable the whole thing, by setting previewWidth to -1.

@landreev landreev moved this from Review 🔎 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Mar 15, 2023
@landreev landreev moved this from IQSS Team - In Progress 💻 to Review 🔎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Mar 15, 2023
@landreev landreev moved this from Review 🔎 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Mar 15, 2023
@mreekie commented Mar 15, 2023

grooming

  • It appears this will bleed over from the March 1 sprint to the March 15 sprint.

@mreekie mreekie added the D: FixRateLimitingBehaviors Address this collection of issues that impact rate limiting label Mar 15, 2023
@mreekie commented Mar 15, 2023

grooming

  • It appears this will bleed over from the March 1 sprint to the March 15 sprint.

Leonid noted no change in size for the March 15 sprint.

@landreev (Contributor Author) commented:
> Overall, this looks great - I did leave a few questions as comments, most not too important. The biggest is how we actually want to define quotas, which leads to the question of what the next steps are. We should discuss that and then add the summary here. (E.g., one option is to test this as is, and then add user-specific quotas (if that's the direction we go) as a new issue / PR; if so, we should also add some integration tests here.)

combining this with another question:

> ... how we even decide to define the limits?

The linked issue #9361 lists 4 other issues in which specific types of quota definitions were requested: per collection, per dataset, per user, and even per-user-per-day. Per-group quotas have never been requested, I believe - and I'm not sure they are needed (per-dataset/collection quotas may be enough to cover the practical scenario of a group of users - a research project, a department, etc. - working together depositing data?).

I'm assuming the answer is that we want all of the above. And in each case the answer - how much this user can add to the storage during this specific upload session - will be based on checking all the possible quotas that may apply. Somewhat similar to how our permissions work: we check for permissions via direct role assignments, combine them with group permissions, etc. Although I very much hope this does not need to be as complex as our permission structure, with all the complicated recursive db queries there. It may be possible to keep it relatively simple because, unlike with permissions, we probably don't need/want to apply the most generous of the possibilities. I mean, you can access a restricted file with either a group or a user permission, whichever you may have. With the quotas, however, I'm assuming we can get away with having a clear precedence structure. For example, the quota on the dvobject (dataset, collection), if defined, beats everything else (if not defined, check whether the parent has a quota, and repeat). If none are defined on the dataset/collection, then the user's own quota applies (both total and per day, possibly); if that's not defined either, then check if the installation has a default user quota and/or default dataset/collection size. (Something like that - just an example of a possible scheme; see the sketch below.)
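A minimal sketch of that precedence scheme, under the stated assumptions; none of these types or methods exist in the PR - they are purely illustrative.

class QuotaResolverSketch {
    /** Hypothetical stand-in for a dataset/collection node. */
    interface DvObjectNode {
        Long getQuotaInBytes();   // quota defined directly on this object, or null
        DvObjectNode getParent(); // owning collection, or null at the root
    }

    /**
     * Walks up from the dataset through its parent collections looking for a
     * defined quota; falls back to the user quota, then the installation default.
     */
    static Long resolveQuota(DvObjectNode target, Long userQuota, Long installationDefault) {
        for (DvObjectNode o = target; o != null; o = o.getParent()) {
            if (o.getQuotaInBytes() != null) {
                return o.getQuotaInBytes(); // most specific dvobject quota wins
            }
        }
        return (userQuota != null) ? userQuota : installationDefault;
    }
}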

This PR was an attempt to create a sprint-sized amount of work that could be wrapped up and merged independently. We don't need to know how the quotas are going to be defined; we just assume that we can count on a service that will give us these numbers - how many bytes this user can add during this upload session, to this dataset (plus the total applicable quota, for display and/or error messages). All it tests is the behavior of the upload page and the API, with only enough service-side code provided to be able to test against some defined limit.

As for the next step, I don't think the 4 linked issues can be worked on separately. I.e., it doesn't seem practical to try to add a per-user quota separately from per-collection (not 100% sure though). But this is a natural next step, to actually work on the real, functional quota service.

Though there is another task that may be standalone enough: we need to work out some mechanism for counting used storage space. I don't think we can practically obtain this number in real time on every upload. Going through the physical storage every time is obviously impractical. Calculating it from the database (by adding up the filesizes plus the saved originals) is not truly accurate (it skips many other auxiliary files), although it may be a reasonable approximation - but it is still inefficient to calculate in real time on every request. So I'm assuming we'll need some system of keeping the sizes of datasets and collections cached... Figuring this out and implementing it is definitely a project of its own.
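For illustration, a sketch of the kind of cached-size bookkeeping hinted at above; entirely hypothetical (a real version would persist the counters and handle concurrency at the database level).

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class CachedStorageUseSketch {
    private final ConcurrentHashMap<Long, AtomicLong> bytesByDvObject = new ConcurrentHashMap<>();

    /** On upload, bump the cached totals for the dataset and all its ancestor collections. */
    void recordUpload(long datasetId, List<Long> ancestorCollectionIds, long sizeInBytes) {
        add(datasetId, sizeInBytes);
        for (long collectionId : ancestorCollectionIds) {
            add(collectionId, sizeInBytes);
        }
    }

    /** Cached total for a dataset or collection, without touching physical storage. */
    long cachedSizeInBytes(long dvObjectId) {
        AtomicLong counter = bytesByDvObject.get(dvObjectId);
        return (counter == null) ? 0L : counter.get();
    }

    private void add(long dvObjectId, long delta) {
        bytesByDvObject.computeIfAbsent(dvObjectId, k -> new AtomicLong()).addAndGet(delta);
    }
}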

@mreekie commented Mar 29, 2023

new sprint:

  • No change to size.
  • Was waiting; needed additional input from folks who were out.

@landreev (Contributor Author) commented Apr 4, 2023

Per conversation with @scolapasta last week, we decided it made sense to send the PR through the normal QA pipeline and merge it in its current form, before proceeding with the next steps of the overall effort. I will remove the "draft" designation.

@landreev landreev marked this pull request as ready for review April 4, 2023 15:04
@scolapasta scolapasta removed their assignment Apr 4, 2023
@landreev landreev moved this from IQSS Team - In Progress 💻 to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Apr 11, 2023
@landreev landreev moved this from Ready for Review ⏩ to Review 🔎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Apr 11, 2023
@landreev landreev moved this from Review 🔎 to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Apr 11, 2023
@scolapasta scolapasta removed the Status: Waiting issues in need of input from someone currently unavailable label Sep 11, 2023
@scolapasta scolapasta moved this from ▶ SPRINT READY to Ready for QA ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Sep 11, 2023
@scolapasta scolapasta assigned landreev and unassigned landreev and kcondon Sep 11, 2023
@scolapasta (Contributor) commented:
@landreev I'll move this milestone 6.1 issue to "Ready for Review", but I'm assigning you first so that you can handle the 6.0 merge and address any EE10 issues. When done, this one can actually go back to "Ready for QA" (barring major edits).

@scolapasta scolapasta moved this from Ready for QA ⏩ to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Sep 11, 2023
@landreev (Contributor Author) commented:
I don't think this PR is actually "ready for review" at this point. Just syncing it with develop appears to be a non-trivial task, now that it has sat for so long. So I'll move it to "in progress" to work on that.
But even after the branch is synced and builds under 6.0, I no longer feel it was a good idea to try to merge the PR as it was in the first place. The PR was created as an experiment in merging incomplete/partially implemented pieces; but the whole idea was to merge such incremental parts quickly and keep working on the functionality, so it seems safe to conclude that the experiment failed. I am wondering if we should just change the goal of the issue to delivering working, practically useful quota functionality and put the PR back in development? I don't know - let's talk about this part.

@landreev landreev moved this from Ready for Review ⏩ to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Sep 11, 2023

@landreev landreev moved this from IQSS Team - In Progress 💻 to Ready for QA ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Sep 20, 2023
@kcondon kcondon assigned kcondon and unassigned landreev Sep 26, 2023
@kcondon (Contributor) commented Sep 26, 2023

@landreev I ended up doing a quick smoke test and found that file upload to direct S3 fails:
file_upload_fail_direct_s3_quota_not_config.txt

I also checked the storage quota itself, and it seemed to work with local files and with indirect S3 upload.

@landreev (Contributor Author) commented:
> @landreev I ended up doing a quick smoke test and found that file upload to direct S3 fails: file_upload_fail_direct_s3_quota_not_config.txt
>
> I also checked the storage quota itself, and it seemed to work with local files and with indirect S3 upload.

Good catch! I will look into it.

@landreev landreev self-assigned this Sep 27, 2023
@landreev (Contributor Author) commented:
Fixing the problem with direct upload now (once again, good catch!).

@landreev (Contributor Author) commented:
Direct uploads should be working via the UI now.

@github-actions

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:9361-storage-quotas
ghcr.io/gdcc/configbaker:9361-storage-quotas

🚢 See on GHCR. Use by referencing the full name as printed above; mind the registry name.

@kcondon kcondon merged commit c26e1e7 into develop Sep 29, 2023
14 of 15 checks passed
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Ready for QA ⏩ to Done 🚀 Sep 29, 2023
@kcondon kcondon deleted the 9361-storage-quotas branch September 29, 2023 20:19
@pdurbin pdurbin added this to the 6.1 milestone Sep 30, 2023
Labels: D: FixRateLimitingBehaviors (Address this collection of issues that impact rate limiting), Size: 10 (A percentage of a sprint. 7 hours.)
Projects: No status
Development
Successfully merging this pull request may close these issues: Add GENERIC storage quota check to the file upload framework.
7 participants