Solr: Index all performance is too slow with full production data. #50

Closed
eaquigley opened this issue Jul 9, 2014 · 14 comments

@eaquigley
Contributor

eaquigley commented Jul 9, 2014


Author Name: Kevin Condon (@kcondon)
Original Redmine Issue: 3457, https://redmine.hmdc.harvard.edu/issues/3457
Original Date: 2014-01-29
Original Assignee: Philip Durbin


Preliminary testing shows index all is taking too long with full production data.

Indexing 1861 dataverses: 41 minutes

Indexing 1900 datasets: 2 hours, 15 minutes. There are 52,000+ datasets.

The above numbers were achieved on dvn-3 with full production data of public dataverses and studies. Glassfish heap sizes of 512MB and 10GB showed the same performance.
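
To put the dataset figure in perspective, here is a rough extrapolation. It is only a sketch: it assumes indexing time grows linearly with the number of datasets (the measurements above do not confirm this) and it takes 52,000 as the total, although the actual count is "52,000+".

```python
# Rough extrapolation from the measurements above.
# Assumption: indexing time grows linearly with the number of datasets.

def extrapolate_minutes(sample_count, sample_minutes, total_count):
    """Scale a measured indexing time up to a larger item count."""
    return sample_minutes / sample_count * total_count

measured_minutes = 2 * 60 + 15  # 2 hours, 15 minutes for 1900 datasets
total_minutes = extrapolate_minutes(1900, measured_minutes, 52_000)
total_hours = total_minutes / 60

print(f"Estimated full dataset index: {total_hours:.1f} hours")
```

Under those assumptions the estimate comes to roughly 61.6 hours, which is more than two and a half days for the datasets alone.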


We see "java -server -jar start.jar" at https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding

-server? What does that mean?

man java says this...

   -server   Selects the Java HotSpot Server VM. For more information see Server-Class Machine Detection at http://java.sun.com/j2se/1.5.0/docs/guide/vm/server-class.html

... and if you follow that link you see this:

"Starting with J2SE 5.0, when an application starts up, the launcher can attempt to detect whether the application is running on a "server-class" machine and, if so, use the Java HotSpot Server Virtual Machine (server VM) instead of the Java HotSpot Client Virtual Machine (client VM). The aim is to improve performance even if no one configures the VM to reflect the application it's running. In general, the server VM starts up more slowly than the client VM, but over time runs more quickly."

Maybe this can help performance?


Related issue(s): #623
Redmine related issue(s): 3430, 4062


@eaquigley
Contributor Author


Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-01-31T02:49:58Z


I was convinced that my recursive findPathSegments() method in IndexServiceBean (used to index the "subtree" facet) was the problem but I just commented it out and indexing of 34 dataverses and 554 datasets was not dramatically faster:

  • 7m58.435s with subtree
  • 6m24.946s without subtree

I'll have to dig into this some more...
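
For reference, the two timings can be compared with a few lines of Python. This is a sketch only; `parse_time` is a hypothetical helper that assumes the `XmY.YYYs` format shown in the list above.

```python
import re

def parse_time(text):
    """Convert a duration such as '7m58.435s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

with_subtree = parse_time("7m58.435s")
without_subtree = parse_time("6m24.946s")

# Time saved by skipping the subtree facet, in seconds and as a share of the run.
saved = with_subtree - without_subtree
percent = saved / with_subtree * 100

print(f"saved {saved:.1f}s ({percent:.1f}% of the run)")
```

Skipping the subtree facet saves about 93 seconds, a bit under 20% of the run, which fits the observation that it is not the dominant cost.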

@eaquigley
Contributor Author


Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-06-23T19:05:39Z


This commit might help a bit:

cache calls to dataverseService.findRootDataverse() #4062 · c9151d4 · IQSS/dataverse - c9151d4
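
The general idea behind caching a repeated lookup can be sketched as follows. This is not the code from the commit (which is Java); `find_root_dataverse` and `index_item` are hypothetical stand-ins, and the returned dictionary is made-up data used only to show that the expensive call runs once.

```python
from functools import lru_cache

lookup_count = 0  # counts how often the expensive lookup really runs

@lru_cache(maxsize=1)
def find_root_dataverse():
    """Hypothetical stand-in for an expensive lookup of the root dataverse."""
    global lookup_count
    lookup_count += 1
    return {"id": 1, "alias": "root"}

def index_item(name):
    """Hypothetical indexing step that needs the root dataverse every time."""
    root = find_root_dataverse()
    return f"{root['alias']}/{name}"

# Index many items; only the first call performs the real lookup.
paths = [index_item(f"dataset-{n}") for n in range(554)]
print(lookup_count)
```

A cache like this only helps if the lookup is a meaningful share of the per-item cost, which would explain why such a change might help "a bit" rather than dramatically.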

@eaquigley eaquigley added this to the Dataverse 4.0: In Review milestone Jul 9, 2014
@eaquigley eaquigley modified the milestones: Dataverse 4.0: Beta 3, Dataverse 4.0: In Review Jul 14, 2014
@eaquigley
Contributor Author

@pdurbin, is this still an issue, or did those commits fix it? If it isn't an issue, feel free to close this ticket.

@pdurbin
Member

pdurbin commented Jul 15, 2014

@eaquigley this is still an issue. Or at least it hasn't been confirmed to be fixed. Now that we have API calls to create both dataverses and datasets (and upload files), someone should try to put lots of data into the system and see how long an "index all" takes. It's really a matter of prioritizing this ticket. Anyone who is comfortable with APIs could write a script to load up lots of data.
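
A loading script along those lines might be shaped like the sketch below. Everything in it is an assumption made for illustration: the payload fields (`alias`, `name`), the helper names, and the `send` callable, which stands in for whatever HTTP call the installation's API actually requires. No real endpoint is shown.

```python
import json
import time

def make_dataverse_payload(n):
    """Build a hypothetical JSON body for test dataverse number n."""
    return json.dumps({"alias": f"perf-test-{n}", "name": f"Perf Test {n}"})

def bulk_load(count, send):
    """Call send(payload) for `count` generated payloads; return elapsed seconds."""
    start = time.perf_counter()
    for n in range(count):
        send(make_dataverse_payload(n))
    return time.perf_counter() - start

# Stand-in sender that only records payloads; a real script would issue
# an HTTP POST to the installation's API here.
sent = []
elapsed = bulk_load(100, sent.append)
print(f"queued {len(sent)} payloads in {elapsed:.3f}s")
```

Passing the sender in as a parameter keeps the generation loop testable without a running server. Once the data is loaded, the "index all" run can be timed separately.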

@pdurbin
Member

pdurbin commented Jul 15, 2014

On dvn-build it just took 58.6 seconds to index 42 dataverses and 107 datasets.

@pdurbin pdurbin modified the milestones: Beta 3 - Dataverse 4.0, Dataverse 4.0: Final Aug 14, 2014
@pdurbin
Member

pdurbin commented Aug 14, 2014

I just pushed this to the "final" milestone since, as I mentioned to @eaquigley, indexing will get even slower as we start indexing based on permissions in #734. After that, we should look at indexing performance.

@eaquigley
Contributor Author

Index performance improvements have been made and this is currently reasonable. Closing this ticket per @kcondon

@pdurbin
Member

pdurbin commented Mar 30, 2016

@eaquigley @kcondon if you say so. I think it's something like 11 hours to do a full re-index of https://dataverse.harvard.edu as of Dataverse 4.3. We can always open a new issue if we'd like to attempt to make improvements in this area. Also, please note that I reference this issue at http://guides.dataverse.org/en/4.3/installation/administration.html#full-reindex ("Please note that this operation may take hours depending on the amount of data in your system") so we might want to remove that reference from the guides.

@pdurbin
Member

pdurbin commented Feb 14, 2018

Related: Investigate and fix a memory leak in IndexAll #4463

kcondon pushed a commit that referenced this issue Apr 6, 2020
Update from IQSS develop
janvanmansum added a commit to janvanmansum/dataverse that referenced this issue Apr 13, 2021
DD-375 Disable editing of the cvoc URL fields
@pdurbin
Member

pdurbin commented Aug 24, 2023
