Fix Exporter Leaks #4958
There are several file descriptor leaks in 4.8.6, two of which were patched in the next release. If you'd like, we at Odum have a patched 4.8.6 WAR file that we're happy to send (or you may build your own by referring to https://github.com/IQSS/dataverse/pull/4673/files and https://github.com/IQSS/dataverse/pull/4654/files). We still hit our limit after several weeks, but RHEL kernel patches help us out there =)
I was typing a response that included links to @donsizemore's patches, but I see he's beat me to it!
Thank you @donsizemore & @djbrooke for such a lightning response! I am going to merge the code into our release branch and try it. Will let you know how it goes.
I have applied the patch and the uploads look better, but I am still wondering how just opening a dataset in the web UI opens 314 file handles to the "export_schema.org.cached" file! Is that a bug, or is it expected?
Huh. Sounds like a bug to me.
@bikramj our resident Dataverse expert identified two of the prolific culprits, but we knew there were more. We found these by running 4.8.6 through SonarQube, though at this point it's probably more expeditious to upgrade to 4.9.2 first.
@pdurbin @donsizemore, we can't upgrade to 4.9.2 at the moment because we don't have a Dataverse developer to merge our custom code into the main v4.9.2 code since Kevin Worthington left Scholars Portal.
@bikramj do you know if there is a GitHub issue for each of the features you added in your custom code? It would be nice if you didn't have to run a fork in the future.
@pdurbin I don't have a detailed issue for each of our custom changes, but the biggest chunk was the internationalization code, which is slowly being merged into the main code. Another feature was the Data Explorer, which is already implemented in the main code. We also have two or three custom features, such as user affiliation, redirection to an institution according to affiliation, and a custom splash page, which we are thinking of replacing with a separate static web page available at the root / of our main DV URL.
Did a little more investigation on this; the behavior is still present on bec8015 (i.e., post-4.9.2). Open questions:
At DANS we also suffer from the 'too many open files' resource leak. This is shown in the attached graph of the number of open file descriptors; it spans about a month, from July 17th up to August 21st. I can't pinpoint what action is causing the leak. Maybe if there were a stress test (JMeter) we could try to find it, and also incorporate this testing into the release procedure?
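Until such a stress test exists, one low-effort way to chart a leak like the one in that graph is to sample the process's descriptor count on an interval and graph the log. A minimal sketch, Linux-only since it reads `/proc` (the `fd_count` helper name and the one-minute interval are my own choices, not anything from the Dataverse codebase):

```shell
# fd_count: print the number of open file descriptors for a given PID.
# Reads /proc, so this is Linux-only.
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Sample once a minute and append to a log suitable for graphing:
#   while sleep 60; do echo "$(date +%s) $(fd_count <glassfish-pid>)"; done >> fd.log
```

Plotting the resulting two-column log makes a steady climb (a leak) easy to distinguish from normal load spikes.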
@PaulBoon I haven't checked JMeter, but both
@pameyer I get overwhelmed by the lsof output, but I do indeed see lots of export_schema.org.cached.
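Rather than eyeballing the raw lsof output, it can help to tally descriptors per path so the duplicated files float to the top. A small sketch (the `count_open_paths` helper is my own; it assumes the file path is the last lsof column, which holds for regular files):

```shell
# count_open_paths: read lsof-style output on stdin and print
# "count path" lines, with the most-duplicated path first.
count_open_paths() {
    awk '{print $NF}' | sort | uniq -c | sort -rn
}

# Usage against a live process (1765 is the Glassfish PID from this thread):
#   lsof -p 1765 | count_open_paths | head
```

A file such as export_schema.org.cached appearing hundreds of times in this summary is the signature of the leak being discussed.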
@PaulBoon Thank you for the information. I'm not currently seeing
Got a chance to do a little more investigation, and have a possible solution for the open file descriptors related to various export formats, although it may need some clean-up (and checking to see that it doesn't revert other intended behavior).
#4991 addresses file descriptors by
@pameyer as I mentioned in IRC, I made a couple of commits to your pull request. 6b4abd6 is simply formatting changes (mostly removing tabs and inconsistent brace placement, #4992), and in eed28a9 I switched a change to use the IOUtils.closeQuietly method, which you and @qqmyers seem to like. I didn't run any code. I hope this helps! Please do feel free to document the tools you're using to detect memory leaks over at
Changes mentioned in http://irclog.iq.harvard.edu/dataverse/2018-08-23#i_71871 backed out; re-check
@pameyer I approved #4991 after making a couple of tweaks (a no-op reformat in fb2603a and tweaks to the tools writeup in 5266bf3). I'm passing this to QA, but please consider writing up or otherwise explaining the best way to test. From the comment at #4991 (review) it looks like @qqmyers might be patching TDL with some of these fixes, which is great. I tried SonarQube on my laptop but didn't try Infer yet.
@pdurbin Thanks for reviewing and helping out with the cleanup. I'd thought
@pameyer ah, you're probably right. You're basically saying to hit the dataset page and check the lsof output for open files. Makes sense. Thanks for working on this!
For future reference, @kcondon and I discussed this during QA and there may be additional factors involved. I was testing and developing on CentOS 7 (in docker-aio), and also saw this behavior on a different CentOS 7 system running an earlier branch; both of these systems showed a file descriptor leak and an amplification effect (1 request leading to > 150 open descriptors). Both of these systems were also using the local filesystem storage driver (as opposed to S3 or Swift). On the CentOS 6 system used for QA, the descriptor leak was observed without amplification (1 request leading to 1 open descriptor). With this branch (now merged), the export descriptor leak wasn't seen anymore. So the cause of the amplification is still unknown; it is potentially related to differences in JVM, OS, or system configuration. I'm not planning to troubleshoot more at the moment, but there's a chance this information will be useful in the future.
Is the Dataverse Glassfish process supposed to have a lot of open files in the OS?
Following is the output of lsof on one of our production servers running Glassfish with PID 1765.
Our installation is Dataverse v4.8.6 on a CentOS 7.5 32-core VM with datafiles on an NFS storage server. We are running into an issue where some users trying to upload large files (~2 GB) cause Glassfish to get stuck randomly, and it never comes back without a hard reset of the VM!
I see the following in the kernel logs.
I suspect the issue is Glassfish opening a lot of file handles but not closing them properly, eventually stalling NFS!
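One quick check for that suspicion is whether the process is actually exhausting its descriptor limit before NFS stalls. A hedged sketch for Linux, reading `/proc` (the `fd_headroom` helper name is mine; 1765 is the Glassfish PID mentioned above):

```shell
# fd_headroom: show a process's open-descriptor usage versus its soft limit.
# Linux-only (relies on /proc/<pid>/fd and /proc/<pid>/limits).
fd_headroom() {
    pid=$1
    used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid $pid: $used of $limit descriptors in use"
}

# Usage: fd_headroom 1765
```

If usage sits near the soft limit, raising the limit only delays the hang; finding and closing the leaked streams (as the patches above do) is the real fix.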
Following is the output of lsof for a recently created dataset with 4 small files in it.