Support Large Guestbooks #3609

Closed
kcondon opened this issue Jan 31, 2017 · 16 comments

@kcondon
Contributor

kcondon commented Jan 31, 2017

This comes up from time to time when a user with a very large guestbook wants to download it but can't. See RT 246137 for an example of a repeating request. So for now this requires us to run a database query and send the results to the user whenever they need updated info.

Please note this is separate from the bug where some guestbooks can be downloaded in FF and Safari but not Chrome (#3581). At first glance they seem the same but they are not.

@djbrooke djbrooke changed the title from "Guestbook: Cannot download a very large guestbook due to slow performance and front end timeout" to "Support Large Guestbooks" Feb 15, 2017
@djbrooke djbrooke added ready and removed Backlog labels Jun 19, 2017
@pdurbin pdurbin added the User Role: Depositor Creates datasets, uploads data, etc. label Jul 12, 2017
@landreev landreev self-assigned this Aug 3, 2017
landreev added a commit that referenced this issue Aug 8, 2017
May require/benefit from further optimizations though. [#3609]
@landreev
Contributor

landreev commented Aug 8, 2017

I'd like to quickly review (with @scolapasta?) the plan for the guestbook-responses page.

landreev added a commit that referenced this issue Aug 9, 2017
…liminating the extra lookup for custom questions and answers, that was adding one extra query per every guestbookresponse in the search results. (#3609)
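The pattern this commit message describes (one extra query per guestbook response) can be illustrated with a small, self-contained sketch. This is not the Dataverse code: the two-table schema, the table and column names, and the helper functions below are all hypothetical, and SQLite stands in for the real database. It only shows the general idea of replacing a per-row lookup with a single query whose results are grouped in memory.

```python
import sqlite3
from collections import defaultdict


def make_db():
    """Build a tiny in-memory database with a hypothetical, simplified schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE response (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE custom_answer (response_id INTEGER, question TEXT, answer TEXT);
    """)
    db.executemany("INSERT INTO response VALUES (?, ?)",
                   [(1, "ann"), (2, "bob"), (3, "cy")])
    db.executemany("INSERT INTO custom_answer VALUES (?, ?, ?)",
                   [(1, "Purpose?", "teaching"),
                    (1, "Field?", "biology"),
                    (3, "Purpose?", "research")])
    return db


def answers_per_row(db):
    """Slow pattern: one extra query for every response row."""
    result = {}
    for (rid,) in db.execute("SELECT id FROM response ORDER BY id").fetchall():
        result[rid] = db.execute(
            "SELECT question, answer FROM custom_answer "
            "WHERE response_id = ? ORDER BY rowid", (rid,)
        ).fetchall()
    return result


def answers_batched(db):
    """Faster pattern: fetch all answers once and group them by response id."""
    grouped = defaultdict(list)
    rows = db.execute(
        "SELECT response_id, question, answer FROM custom_answer ORDER BY rowid"
    )
    for rid, question, answer in rows:
        grouped[rid].append((question, answer))
    ids = [rid for (rid,) in db.execute("SELECT id FROM response ORDER BY id")]
    # Responses without custom answers map to an empty list.
    return {rid: grouped.get(rid, []) for rid in ids}
```

Both functions return the same mapping; the batched version issues two queries in total regardless of how many responses there are, which matters when a guestbook has on the order of 150K entries.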
landreev added a commit that referenced this issue Aug 10, 2017
… guestbook data.

Lots of optimizations and fixes. Will add more info in #3609 explaining what's been done and how things are supposed to work now.
@landreev
Contributor

Going to make a PR and move this into code review. Will add a few more lines here explaining what's been done, how it's supposed to be working now and how to test stuff.
Will need to review with some specific members of the team whose areas of expertise are affected.

@landreev landreev mentioned this issue Aug 10, 2017
5 tasks
@landreev landreev removed their assignment Aug 10, 2017
@djbrooke djbrooke added this to the 4.8 - Large Data Upload Integration milestone Aug 10, 2017
@pdurbin pdurbin removed the User Role: Depositor Creates datasets, uploads data, etc. label Aug 15, 2017
@pdurbin pdurbin self-assigned this Aug 15, 2017
@pdurbin
Member

pdurbin commented Aug 15, 2017

I assigned myself to do some code review of pull request #4057 and my first questions are:

  • How large is a large guestbook?
  • How can one programmatically (or otherwise) create a large guestbook for testing?

Judging from https://help.hmdc.harvard.edu/Ticket/Display.html?id=246137 a large guestbook is 1288 rows (select * from guestbookresponse where dataset_id=REDACTED;).

For testing, it'll be easiest to just use a production database.

@pdurbin
Member

pdurbin commented Aug 15, 2017

As of b05a026 I noticed some "???" next to "Collected Data" when you preview a guestbook:

[Screenshot, 2017-08-15: "???" shown next to "Collected Data" in the guestbook preview]

@mheppler
Contributor

@pdurbin -- good catch, I missed that popup on the dataset pg. I will get a fix in ASAP.

@pdurbin
Member

pdurbin commented Aug 15, 2017

I wrote a little script in e869900 to help me create two thousand guestbook entries, which I was able to download just fine as of b05a026 from the pull request. Then I went back and tested dd55c08 on the develop branch and I was able to download them there too. I haven't noticed anything objectionable in the pull request apart from a place where logging could be reduced. It sounds like @mheppler is going to fix the "???" I found above. @landreev if there's anything specific you want a code reviewer to look for, please let us know. Thanks.

@pdurbin pdurbin removed their assignment Aug 15, 2017
@pdurbin
Member

pdurbin commented Aug 15, 2017

To do list:

  • fix "???" missing bundle text in Preview Guestbook popup on dataset pg
  • document new setting: :GuestbookResponsesPageDisplayLimit

@landreev
Contributor

@pdurbin
Answering the question "how large is a large guestbook": these are large:
https://dataverse.harvard.edu/manage-guestbooks.xhtml?dataverseId=99

The one with 150K responses may be the largest we have. The "Download All" button on that page will try to download the guestbook entries for the entire dataverse, meaning about 180K responses. You cannot do that in production currently (you will get a 500 error), and you cannot download the results for either of these guestbooks.

I tested my patch with the prod. database on vm5; you can now download the results for these largest guestbooks in fairly reasonable time.

As for the 1288 rows: note that this was the result of the query that Kevin ran for them on one specific dataset. The manage-guestbooks and guestbook-results pages operate on entire dataverses.

landreev added a commit that referenced this issue Aug 15, 2017
…to change the display limit on the number of guestbook entries. (#3609)
@landreev
Contributor

Added a documentation section on the display limit.

@landreev
Contributor

Summary of the changes, for QA:

The download-as-CSV functionality has been optimized for both "download all (responses for the dataverse)" on the manage-guestbooks page and "download the results for the given dataverse and guestbook" on the guestbook-results page.

Added help tip text to both pages. It explains that the downloaded results are going to be in CSV and that they are importable into Excel/Google Sheets, and it encourages users to use this method if they need to further reorganize the results and/or select the results for specific datasets, files, etc.

Fixed the filename for the download function (it was getting chopped on the first space in the name of the dataverse, losing the ".csv" extension in the process).
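A filename that gets cut off at the first space is a common symptom of an unquoted `filename` parameter in the `Content-Disposition` response header. Whether that was the actual cause here is an assumption; the comment above does not say. Under that assumption, a minimal sketch of building a safely quoted header value looks like this (the function name and the sample filename are hypothetical):

```python
def content_disposition(filename: str) -> str:
    """Build an attachment header value with the filename in double quotes.

    Quoting keeps spaces inside the name; backslashes and double quotes
    in the name itself are escaped so the quoted string stays well formed.
    """
    escaped = filename.replace("\\", "\\\\").replace('"', '\\"')
    return f'attachment; filename="{escaped}"'


# Hypothetical example name containing a space.
header_value = content_disposition("Sample Dataverse guestbook.csv")
```

With the quotes in place, a browser should treat the whole string, including the ".csv" extension, as the download name.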

Also, for very large guestbooks (for example: https://dataverse.harvard.edu/guestbook-responses.xhtml?dataverseId=99&guestbookId=9), with the current implementation you cannot even get to the "download button". The page will take a long time to load, then finally fail with a 500. That is because the current implementation tries to display all the guestbook entries on the page as well. Part of the failure is because the retrieval was not very efficient. But even with that optimized, loading 150K entries on the page is still not a good idea: It will take a long time for the browser to render, no matter what you do; and it's probably not very useful to a user either.
So the agreed-upon solution was to add a configurable limit on how many entries to show. By default, only the most recent 5000 entries are shown (in reverse chronological order), with a message explaining what's going on. The 5000 is a somewhat arbitrary number, but the page is usable with that many entries (it takes some seconds to load). The limit is configurable as a standard Dataverse setting (documented in the Configuration->Database Settings section of the guide).
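The display-limit behavior described above can be sketched as follows. This is an illustration only, not the code from the pull request: the function names, the `response_time` key, and the wording of the message are hypothetical, and a real implementation would most likely apply the ordering and limit in the database query rather than in memory.

```python
from datetime import datetime, timedelta

DEFAULT_DISPLAY_LIMIT = 5000  # default mentioned in the comment above


def entries_to_display(entries, limit=DEFAULT_DISPLAY_LIMIT):
    """Return the newest `limit` entries (newest first) and the total count."""
    newest_first = sorted(entries, key=lambda e: e["response_time"], reverse=True)
    return newest_first[:limit], len(entries)


def limit_message(shown, total):
    """Explain the truncation to the user; empty when nothing was cut off."""
    if shown >= total:
        return ""
    return f"Showing the {shown} most recent of {total} responses."


# Seven fake entries, one per day, displayed with a limit of three.
start = datetime(2017, 8, 1)
sample = [{"id": i, "response_time": start + timedelta(days=i)} for i in range(7)]
shown, total = entries_to_display(sample, limit=3)
```

In this sketch the full result set is still available to the CSV download; only the on-page listing is truncated.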
