Investigate performance of dataset page, on datasets with hundreds of files. #4173

Closed
landreev opened this issue Oct 3, 2017 · 4 comments

landreev commented Oct 3, 2017

Noticed this while working on #4091: I tried a few datasets with large numbers of files in production, and all of them took a very long time to load. A couple of datasets would not load at all, resulting in 500 errors. This may be somewhat urgent. Performance appeared to be worse on draft versions (the "read-only mode" vs. full database retrieval?). So, in practical terms, it may be becoming impossible for some authors to manage their datasets.

Received an independent report from Sonia last night, about a 500 on a specific dataset (https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/00097-8).

I haven't found anything specific yet. My best guess is that we are doing some inefficient or unnecessary database lookups, possibly on something that keeps growing (for example, the already existing guestbook responses?). This would explain why the performance is getting worse, and why we haven't been observing it on the dev systems.
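To make the guess above concrete, here is a minimal sketch of the kind of lookup pattern that gets slower as the data grows: one query per file versus a single aggregated query. The table and column names are purely illustrative assumptions, not the actual Dataverse schema, and nothing here is taken from the real page code; it uses an in-memory SQLite database only so it can be run standalone.

```python
import sqlite3


def build_db(n_files: int, responses_per_file: int) -> sqlite3.Connection:
    """Create a toy in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE datafile (id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE guestbookresponse (
            id INTEGER PRIMARY KEY,
            datafile_id INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO datafile (id, label) VALUES (?, ?)",
        [(i, f"file_{i}") for i in range(n_files)],
    )
    conn.executemany(
        "INSERT INTO guestbookresponse (datafile_id) VALUES (?)",
        [(i,) for i in range(n_files) for _ in range(responses_per_file)],
    )
    return conn


def counts_per_file_lookup(conn: sqlite3.Connection):
    """One query for the file list, then one more query per file (N+1 queries)."""
    queries = 1
    file_ids = [row[0] for row in conn.execute("SELECT id FROM datafile ORDER BY id")]
    counts = {}
    for file_id in file_ids:
        (counts[file_id],) = conn.execute(
            "SELECT COUNT(*) FROM guestbookresponse WHERE datafile_id = ?",
            (file_id,),
        ).fetchone()
        queries += 1
    return counts, queries


def counts_single_query(conn: sqlite3.Connection):
    """The same result from a single aggregated query."""
    rows = conn.execute(
        """
        SELECT f.id, COUNT(r.id)
        FROM datafile f
        LEFT JOIN guestbookresponse r ON r.datafile_id = f.id
        GROUP BY f.id
        """
    ).fetchall()
    return dict(rows), 1
```

With a few hundred files the per-file version issues a few hundred round trips for every page render, and each one scans a table that keeps growing in production but stays small on a dev system, which would be consistent with the symptoms described.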

kcondon commented Oct 4, 2017

Another symptom: on a dataset with many files and slow performance, clicking between the tabs (Files, Metadata, Terms, Versions) is really slow. A spinner appears each time a tab is clicked, almost as if it were reloading the file list.

landreev added a commit that referenced this issue Oct 5, 2017
…- the "performance fixes mini release".

Combines the Dataset page queries fix (#4173) and the S3/thumbnails improvements made as part of #4091.

landreev commented Oct 5, 2017

@kcondon
Made a pull request: #4177
It combines the query fixes for the abysmal performance of the dataset page with the S3/thumbnail-related improvements made as part of #4091.
Please test.
If it tests out ok, we should consider this PR as the candidate for the 4.8.1 "performance improvements" release that Danny authorized earlier.

kcondon commented Oct 6, 2017

- OK: S3 image test with 91 images loads in 5 seconds, much better than the previous 69 seconds.
- Regression tested local, S3, and Swift storage. Aside from a couple of preexisting issues, works fine.
- Tested Sonia's reported dataset, and it now loads in 3 seconds versus 57 seconds.

@dlmurphy

Per @djbrooke's suggestion, here are links to the top 5 datasets with the most files in Harvard Dataverse. Could be helpful for testing these performance improvements. Note that all but the fifth are only visible to a superuser account.

  1. 12372 files (deaccessioned): https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/URJRVY
  2. 6861 files (draft, gives me an internal server error... related?): https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EN5BKT
  3. 3750 files (draft): https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XMHYBN
  4. 3646 files (draft): https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/K8JPPL
  5. 2682 files (published): https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KRUPXZ
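For repeatable before/after comparisons on datasets like these, a small timing harness could look like the sketch below. It is only an assumption about how one might script the check, not something used in this issue: the helper names are made up, it measures the server response for the page HTML only (not browser rendering or the tab-switching spinner), and it does no authentication, so as written it would only be meaningful for the published dataset.

```python
import time
import urllib.request
from urllib.parse import urlencode

# Base URL taken from the links in this thread.
BASE_URL = "https://dataverse.harvard.edu/dataset.xhtml"


def dataset_url(persistent_id: str) -> str:
    """Build the dataset page URL for a persistent ID such as 'doi:10.7910/DVN/KRUPXZ'."""
    return f"{BASE_URL}?{urlencode({'persistentId': persistent_id}, safe=':/')}"


def fetch_page(url: str, timeout: float = 120.0) -> int:
    """Fetch the page body and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
        return response.status


def time_page_loads(persistent_ids, fetch=fetch_page, clock=time.perf_counter):
    """Return a list of (persistent_id, status, seconds) tuples.

    Errors are recorded rather than raised, so a 500 shows up in the results;
    status is None when the failure carries no HTTP code (e.g. a timeout).
    """
    results = []
    for pid in persistent_ids:
        start = clock()
        try:
            status = fetch(dataset_url(pid))
        except Exception as exc:  # urllib's HTTPError exposes the status as .code
            status = getattr(exc, "code", None)
        results.append((pid, status, clock() - start))
    return results
```

Calling `time_page_loads(["doi:10.7910/DVN/KRUPXZ"])` before and after deploying a fix would give a rough server-side comparison; the `fetch` and `clock` parameters are injectable so the function can be exercised without network access.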
