Solr: Index all performance is too slow with full production data. #50

eaquigley · 2014-07-09T15:34:50Z

Author Name: Kevin Condon (@kcondon)
Original Redmine Issue: 3457, https://redmine.hmdc.harvard.edu/issues/3457
Original Date: 2014-01-29
Original Assignee: Philip Durbin

Preliminary testing shows index all is taking too long with full production data.

Indexing 1861 dataverses: 41 minutes

Indexing 1900 datasets: 2 hours, 15 minutes. There are 52,000+ datasets.

The above numbers were measured on dvn-3 with full production data for public dataverses and studies. Glassfish heap sizes of 512 MB and 10 GB gave the same performance.
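For a sense of scale, the dataset figure above can be extrapolated to the full collection. This is only a rough sketch: it assumes indexing time grows linearly with the number of datasets, which was not measured here, and it treats "52,000+" as exactly 52,000.

```python
# Figures reported in this issue.
sample_datasets = 1900
sample_minutes = 2 * 60 + 15   # 2 hours, 15 minutes
total_datasets = 52_000        # "52,000+", so this is a lower bound

# Assumption: indexing time scales linearly with dataset count.
seconds_per_dataset = sample_minutes * 60 / sample_datasets
estimated_hours = total_datasets * seconds_per_dataset / 3600

print(f"{seconds_per_dataset:.2f} s per dataset")
print(f"about {estimated_hours:.0f} hours for {total_datasets} datasets")
```

Under that assumption a full dataset index would take roughly two and a half days, before counting the dataverses.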

We see "java -server -jar start.jar" at https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding

-server? What does that mean?

man java says this...

   -server   Selects the Java HotSpot Server VM. For more information see
             Server-Class Machine Detection at
             http://java.sun.com/j2se/1.5.0/docs/guide/vm/server-class.html

... and if you follow that link you see this:

"Starting with J2SE 5.0, when an application starts up, the launcher can attempt to detect whether the application is running on a "server-class" machine and, if so, use the Java HotSpot Server Virtual Machine (server VM) instead of the Java HotSpot Client Virtual Machine (client VM). The aim is to improve performance even if no one configures the VM to reflect the application it's running. In general, the server VM starts up more slowly than the client VM, but over time runs more quickly."

Maybe this can help performance?

Related issue(s): #623
Redmine related issue(s): 3430, 4062

eaquigley · 2014-07-09T15:34:50Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-01-31T02:49:58Z

I was convinced that my recursive findPathSegments() method in IndexServiceBean (used to index the "subtree" facet) was the problem, but I just commented it out and indexing 34 dataverses and 554 datasets was not dramatically faster:

7m58.435s with subtree
6m24.946s without subtree

I'll have to dig into this some more...
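To quantify "not dramatically faster", here is a small sketch that parses the two `time`-style durations quoted above and computes the relative saving. The helper name `parse_duration` is made up for this illustration.

```python
import re

def parse_duration(text):
    """Convert a `time`-style string such as '7m58.435s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

with_subtree = parse_duration("7m58.435s")
without_subtree = parse_duration("6m24.946s")

# Fraction of the original run time saved by skipping the subtree facet.
saving = (with_subtree - without_subtree) / with_subtree
print(f"{with_subtree - without_subtree:.1f} s saved ({saving:.1%})")
```

Dropping the subtree facet saves a little under a fifth of the run time, so most of the cost lies elsewhere.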

eaquigley · 2014-07-09T15:34:50Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-06-23T19:05:39Z

This commit might help a bit:

cache calls to dataverseService.findRootDataverse() #4062 · IQSS/dataverse@c9151d4
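The general idea named in that commit title, looking up the root dataverse once and reusing it for every object indexed, can be sketched as follows. This is an illustration in Python, not the actual Java change: the classes `DataverseService` and `Indexer` and their methods are hypothetical stand-ins, and the real commit may cache differently.

```python
class DataverseService:
    """Hypothetical stand-in for a service whose root lookup is expensive."""

    def __init__(self):
        self.lookups = 0

    def find_root_dataverse(self):
        self.lookups += 1  # pretend this is a database query
        return {"id": 1, "alias": "root"}


class Indexer:
    """Hypothetical indexer that caches the root instead of re-querying it."""

    def __init__(self, service):
        self._service = service
        self._root = None

    def root(self):
        # Fetch the root only on first use, then reuse the cached value.
        if self._root is None:
            self._root = self._service.find_root_dataverse()
        return self._root

    def index_all(self, dataset_ids):
        return [(dataset_id, self.root()["alias"]) for dataset_id in dataset_ids]


service = DataverseService()
documents = Indexer(service).index_all(range(554))
print(service.lookups)
```

With the cache in place the lookup count stays at one no matter how many datasets are indexed, which is why a change like this can only help "a bit" if the lookup was never the main cost.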

eaquigley · 2014-07-14T23:43:13Z

@pdurbin, is this still an issue, or did those commits fix it? If it isn't an issue, feel free to close this ticket.

pdurbin · 2014-07-15T12:26:00Z

@eaquigley this is still an issue. Or at least it hasn't been confirmed to be fixed. Now that we have API calls to create both dataverses and datasets (and upload files), someone should try to put lots of data into the system and see how long an "index all" takes. It's really a matter of prioritizing this ticket. Anyone who is comfortable with APIs could write a script to load up lots of data.

pdurbin · 2014-07-15T15:44:49Z

On dvn-build it just took 58.6 seconds to index 42 dataverses and 107 datasets.

pdurbin · 2014-08-14T14:56:18Z

I just pushed this to the "final" milestone since, as I mentioned to @eaquigley, indexing will get even slower as we start indexing based on permissions in #734. After that, we should look at indexing performance.

eaquigley · 2016-03-30T16:32:54Z

Index performance improvements have been made and performance is currently reasonable. Closing this ticket per @kcondon.

pdurbin · 2016-03-30T16:40:54Z

@eaquigley @kcondon if you say so. I think it's something like 11 hours to do a full re-index of https://dataverse.harvard.edu as of Dataverse 4.3. We can always open a new issue if we'd like to attempt to make improvements in this area. Also, please note that I reference this issue at http://guides.dataverse.org/en/4.3/installation/administration.html#full-reindex ("Please note that this operation may take hours depending on the amount of data in your system") so we might want to remove that reference from the guides.

pdurbin · 2018-02-14T17:25:39Z

Related: Investigate and fix a memory leak in IndexAll #4463

pdurbin · 2023-08-24T14:19:35Z

async indexing after update command #9558

eaquigley added this to the Dataverse 4.0: In Review milestone Jul 9, 2014

eaquigley assigned pdurbin Jul 9, 2014

eaquigley mentioned this issue Jul 9, 2014

4.0 performance #181

Closed

eaquigley modified the milestones: Dataverse 4.0: Beta 3, Dataverse 4.0: In Review Jul 14, 2014

eaquigley added Status: Dev and removed Status: Design labels Jul 14, 2014

pdurbin modified the milestones: Beta 3 - Dataverse 4.0, Dataverse 4.0: Final Aug 14, 2014

pdurbin modified the milestones: Beta 9 - Dataverse 4.0, Dataverse 4.0: Final Nov 12, 2014

eaquigley modified the milestones: Beta 9 - Dataverse 4.0, Beta 10 - Dataverse 4.0 Dec 4, 2014

scolapasta modified the milestones: Beta 11 - Dataverse 4.0, Beta 10 - Dataverse 4.0 Dec 8, 2014

pdurbin mentioned this issue Dec 16, 2014

Permissions: Issue saving a dataset due to indexing slowness when many files (20,000) have been uploaded. #1174

Closed

scolapasta modified the milestones: Beta 11 - Dataverse 4.0, Dataverse 4.0: Final, TEMP Jan 23, 2015

scolapasta modified the milestones: 4.0.1, Beta 15 - Dataverse 4.0 Apr 3, 2015

scolapasta unassigned ekraffmiller Apr 4, 2015

scolapasta modified the milestones: 4.0.1, In Review - Short Term Apr 18, 2015

This was referenced Jul 2, 2015

Indexing: ability to load balance indexing tasks across multiple app (Glassfish) servers #1757

Closed

Solr: load balancing, fault tolerance, and high availability #2322

Closed

pdurbin added a commit that referenced this issue Jul 14, 2015

new script to add then revoke role #50

ab0b268

pdurbin mentioned this issue Jul 14, 2015

Performance problems with "Assign Role" or "Remove Assigned Role" #2036

Closed

pdurbin added a commit that referenced this issue Jul 16, 2015

rename script from #50 to more specific issue #2036

7e3d012

pdurbin mentioned this issue Aug 13, 2015

Use Solr for file listing on dataset page #2455

Closed

pdurbin mentioned this issue Sep 16, 2015

Indexing: "index all" should not clear Solr index or timestamps, add separate "clear" API endpoint for this #2529

Closed

pdurbin mentioned this issue Dec 16, 2015

Indexing is not forgiving enough about bad data in database, should report what's wrong #2815

Closed

scolapasta added Status: Triaged and removed Status: Dev labels Jan 28, 2016

scolapasta removed this from the Not Assigned to a Release milestone Jan 28, 2016

eaquigley closed this as completed Mar 30, 2016

pdurbin mentioned this issue Mar 30, 2016

Installation Guide improvements following rewrite for 4.2.4 #2944

Closed

28 tasks

pdurbin mentioned this issue Apr 14, 2016

Permissions: Removing a role assigned at Root, such as curator, is not immediately reflected in browse results due to slow indexing. #2697

Closed

mheppler removed the Status: Triaged label Oct 11, 2016

piyapongch mentioned this issue Mar 29, 2017

Fixed #45 - Map Funding Agency to Grant Agency #3735

Closed

5 tasks

kcondon pushed a commit that referenced this issue Apr 6, 2020

Merge pull request #50 from IQSS/develop

ad9e2f9

Update from IQSS develop

janvanmansum added a commit to janvanmansum/dataverse that referenced this issue Apr 13, 2021

Merge pull request IQSS#50 from PaulBoon/DD-375

847dc65

DD-375 Disable editing of the cvoc URL fields

pdurbin mentioned this issue Aug 24, 2023

Analyze the use of Solr for file searches in the context of the Files API extension and define an action plan for its use #9813

Closed
