Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

9725 files lookup performance #9736

Merged
merged 10 commits into from
Jul 31, 2023
Merged

9725 files lookup performance #9736

merged 10 commits into from
Jul 31, 2023

Conversation

landreev
Copy link
Contributor

What this PR does / why we need it:

A few quick refinements to the optimizations added in #9558 to address some performance issues found during the testing of the 5.14 release candidate.

Which issue(s) this PR closes:

Closes #9725

Special notes for your reviewer:

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@landreev landreev added this to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) via automation Jul 26, 2023
@scolapasta scolapasta moved this from Ready for Review ⏩ to In Review 🔎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 26, 2023
@scolapasta scolapasta self-assigned this Jul 26, 2023
@github-actions

This comment has been minimized.

// Optimization hints: retrieve all data in one query; this prevents point queries when iterating over the files
.setHint("eclipselink.left-join-fetch", "o.fileMetadatas.dataFile.ingestRequest")
.setHint("eclipselink.left-join-fetch", "o.fileMetadatas.dataFile.thumbnailForDataset")
.setHint("eclipselink.left-join-fetch", "o.fileMetadatas.dataFile.dataTables")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea whether things like this, that I think only get used when a file is displayed, would be better w/o the hint? I.e. would that allow getting datatables for the ten displayed files (as separate less efficient queries) rather than getting them for 10K files. (Bad example in that this is only for tabular files, but the same question for anything related to creating the view for a file versus deciding which files should be displayed.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very good point. I just haven't gotten around to finding a dataset with a ton of tabular files to confirm that this is indeed going to result in observable savings... But I can't think of a good reason not to drop this hint, no.
(will find a test dataset now)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, I get the part that it's not just the DataTables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, does this really mean that we don't need the hint for roleassignments either? Or rather, that the only hints we need are for the 1:1 relations on DataFile? (that would otherwise result in additional individual queries?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to report that dropping the DataTables hint does not appear to make things noticeably better for large datasets with many tab. files; but also makes things worse for those without any tab. files - individual lookup queries are then issued on every file , on account of this in the page, maybe:

 for(DataFile f : dataset.getFiles()) {
      if(f.isTabularData()) {
          hasTabular = true;
          break;
      }
}

Running out of steam with this thing a little bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The awful loop above can be fixed of course; it needs to be fixed. I was just demoralized a little by seeing things like that. The above happens 40 lines into the init method. There's almost certainly more of the same throughout the rest of the page. Things that were already in 5.13, that had gradually crept into the code, that were making things slow on some large datasets in 5.13 and are offsetting the gains from the 5.14 optimizations now. I wasn't mentally prepared for going this deep. But will keep looking into it.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah fun - needed for the download all button I guess. This type of thing definitely looks like a great little post 5.14 spin-out issue (that the SPA will also have).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I became convinced last night that this really needs to be addressed as part of this PR. At least some of it, the more glaring stuff. The "awful loop" above - it is awful because it goes through every file in the entire dataset; negating the benefits of trying to only initialize one version-worth of files. AND it doesn't appear to be necessary - because for all the useful purposes the page and its fragment only need to know if tab files are present in the current version. (There is a separate loop for the tab files in the version there).
Dropping the hint for DataTable for the version doesn't appear to be as straightforward through - the page in its current form needs all the DataTables in the version, both for checking if there are tab files, and for calculating the "original" sizes, if tab files are present. Can (and probably should) be replaced with custom queries, but this may be an effort that can be delayed - will see how it goes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... the full sentence should have been "negating the benefits of trying to only initialize one version-worth of files in some large datasets with sparseley-populated versions" of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should mention that dropping most of the other hints on OneToMany objects in that method appears to be a less problematic affair. Thank you for the tip. Now seems like such an obvious thing in retrospect, of course.

Copy link
Contributor

@scolapasta scolapasta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some small comments / questions - nothing critical. Can you also ee the question @qqmeyers asked? I'm ready to approve, but want to let @landreev have a chance to review what we wrote before that. Let me know when it's set.


// We are only performing these lookups to obtain the database id
// of the version that we are displaying, and then we will use it
// to perform a .findFind(versionId); see below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small typo in the comment

if (dataset == null) {
logger.warning("No such dataset: "+dataset);
return permissionsWrapper.notFound();
}
//retrieveDatasetVersionResponse = datasetVersionService.retrieveDatasetVersionById(dataset.getId(), version);
retrieveDatasetVersionResponse = datasetVersionService.selectRequestedVersion(dataset.getVersions(), version);
if (retrieveDatasetVersionResponse == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary? i.e. we already found the dataset, can there be one without a version here? (is this some edge case I'm not immediately seeing?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I see now that it was moved up from lower down. That said see my comment lower.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have had a real data corruption case where a dataset failed to get deleted, and was left in the database without versions. I wasn't thinking about it honestly, just feel like it's a good idea to check for nulls. (but yes, it was a copy-and-paste from further below)

@@ -1958,7 +1980,7 @@ private String init(boolean initFull) {
return permissionsWrapper.notAuthorized();
}

if (!retrieveDatasetVersionResponse.wasRequestedVersionRetrieved()) {
if (retrieveDatasetVersionResponse != null && !retrieveDatasetVersionResponse.wasRequestedVersionRetrieved()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this null check necessary? I guess its being extra safe, but you do check for mull further up, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one IS actually necessary. Because in this PR I finally fixed the bug where the page was not working when called by versionId=. So yes, there can be a legit case where retrieveDatasetVersionResponse is going to be null here.
(and I figured when the page gets called by the numeric id, if it doesn't exist, it should always be 404/not found, "no substitutes").

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@scolapasta scolapasta assigned landreev and unassigned scolapasta Jul 27, 2023
@scolapasta scolapasta moved this from In Review 🔎 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 27, 2023
@scolapasta scolapasta added this to the 5.14 milestone Jul 27, 2023
@landreev
Copy link
Contributor Author

landreev commented Jul 27, 2023

I may have made enough progress addressing the performance on the remaining "outlier" - very high file count - prod. datasets, making it significantly better than under 5.13, as to finally stop (will need to confirm in QA that the numbers I'm seeing are legit, of course).
Thanks again to @qqmyers for the tip on dropping the unnecessary relations from the optimization query.

@landreev landreev moved this from IQSS Team - In Progress 💻 to In Review 🔎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 27, 2023
@github-actions

This comment has been minimized.

@landreev
Copy link
Contributor Author

We should probably move this into QA now, so that the RC can be properly re-tested. Testing it thoroughly, especially on large datasets takes time.
When I said earlier that the performance on

"outlier" - very high file count - prod. datasets [is now] significantly better than under 5.13,

just as a reference, for that infamous 42K files dataset the page takes almost a minute to load in prod. On the perf. test instance (which is less beefy than prod.), it takes closer to 2 min. under 5.13, and it was even worse in the previous iteration of this branch. With the current code, it loads in 9 sec. on the perf. instance and I expect it to be below 5 sec. in prod. The real meaning of this is less about the optimization in the current branch being particularly good and more about how awful it was in 5.13. We apparently missed the point at which something had been added to the page that compromised the effectiveness of the earlier optimizations measures. I was wrong to even consider accepting a further drop in performance, if only on a few large datasets. We should be grateful to @ErykKul for bringing the performance of the page to our attention.

@github-actions
Copy link

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:9725-files-performance
ghcr.io/gdcc/configbaker:9725-files-performance

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@landreev landreev removed their assignment Jul 31, 2023
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from In Review 🔎 to Ready for QA ⏩ Jul 31, 2023
@scolapasta scolapasta removed their assignment Jul 31, 2023
@kcondon kcondon self-assigned this Jul 31, 2023
@kcondon kcondon merged commit d9f0952 into develop Jul 31, 2023
15 checks passed
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA ✅ to Done 🚀 Jul 31, 2023
@kcondon kcondon deleted the 9725-files-performance branch July 31, 2023 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Performance issues as a side effect of the file lookup optimizations
5 participants