
Fix Exporter Leaks #4958

Closed
bikramj opened this issue Aug 14, 2018 · 23 comments

@bikramj
Contributor

bikramj commented Aug 14, 2018

Is the Dataverse glassfish process supposed to have a lot of open files in the OS?
The following is the output of lsof on one of our production servers running glassfish with PID 1765:

[root@dv1 ~]# lsof | grep 1765 | wc -l
579999

Our installation is Dataverse v4.8.6 on a CentOS 7.5 32-core VM with datafiles on an NFS storage server. We are running into an issue where some users try to upload large files (~2 GB), glassfish gets stuck randomly, and it never comes back without a hard reset of the VM!
I see the following in the kernel logs:

Aug 14 08:37:35 dv1 kernel: INFO: task java:2129 blocked for more than 120 seconds.
Aug 14 08:37:35 dv1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 14 08:37:35 dv1 kernel: java            D ffff9c74de7a5ee0 0  2129      1 0x00000080
Aug 14 08:37:35 dv1 kernel: Call Trace:
Aug 14 08:37:35 dv1 kernel: [<ffffffffb1711fb0>] ? bit_wait+0x50/0x50
Aug 14 08:37:35 dv1 kernel: [<ffffffffb1714029>] schedule+0x29/0x70
Aug 14 08:37:35 dv1 kernel: [<ffffffffb1711999>] schedule_timeout+0x239/0x2c0
Aug 14 08:37:35 dv1 kernel: [<ffffffffb106814e>] ? kvm_clock_get_cycles+0x1e/0x20
Aug 14 08:37:35 dv1 kernel: [<ffffffffb10f7ed2>] ? ktime_get_ts64+0x52/0xf0
Aug 14 08:37:35 dv1 kernel: [<ffffffffb106814e>] ? kvm_clock_get_cycles+0x1e/0x20
Aug 14 08:37:35 dv1 kernel: [<ffffffffb10f7ed2>] ? ktime_get_ts64+0x52/0xf0
Aug 14 08:37:35 dv1 kernel: [<ffffffffb1711fb0>] ? bit_wait+0x50/0x50
Aug 14 08:37:35 dv1 kernel: [<ffffffffb171353d>] io_schedule_timeout+0xad/0x130
Aug 14 08:37:35 dv1 kernel: [<ffffffffb17135d8>] io_schedule+0x18/0x20
Aug 14 08:37:35 dv1 kernel: [<ffffffffb1711fc1>] bit_wait_io+0x11/0x50
Aug 14 08:37:35 dv1 kernel: [<ffffffffb1711ae7>] __wait_on_bit+0x67/0x90
Aug 14 08:37:35 dv1 kernel: [<ffffffffb11936d1>] wait_on_page_bit+0x81/0xa0
Aug 14 08:37:35 dv1 kernel: [<ffffffffb10bc6d0>] ? wake_bit_function+0x40/0x40
Aug 14 08:37:35 dv1 kernel: [<ffffffffb1193801>] __filemap_fdatawait_range+0x111/0x190
Aug 14 08:37:35 dv1 kernel: [<ffffffffb1193894>] filemap_fdatawait_range+0x14/0x30
Aug 14 08:37:35 dv1 kernel: [<ffffffffb1195996>] filemap_write_and_wait_range+0x56/0x90
Aug 14 08:37:35 dv1 kernel: [<ffffffffc07b63b7>] nfs4_file_fsync+0x87/0x170 [nfsv4]
Aug 14 08:37:35 dv1 kernel: [<ffffffffb124f30e>] vfs_fsync+0x2e/0x40
Aug 14 08:37:35 dv1 kernel: [<ffffffffc07b65de>] nfs4_file_flush+0x5e/0x90 [nfsv4]
Aug 14 08:37:35 dv1 kernel: [<ffffffffb12188f7>] filp_close+0x37/0x90
Aug 14 08:37:35 dv1 kernel: [<ffffffffb123b6ec>] __close_fd+0x8c/0xb0
Aug 14 08:37:35 dv1 kernel: [<ffffffffb121a4b3>] SyS_close+0x23/0x50
Aug 14 08:37:35 dv1 kernel: [<ffffffffb1720795>] system_call_fastpath+0x1c/0x21
Aug 14 08:39:20 dv1 kernel: nfs: server  not responding, still trying
Aug 14 08:39:20 dv1 kernel: nfs: server  not responding, still trying

I suspect the issue is Glassfish opening a lot of file handles but not closing them properly, eventually stalling NFS!

The following is the lsof output for a recently created dataset with 4 small files in it:

[root@dv1 ~]# lsof | grep SP2/IC7729 | wc -l
5184
@donsizemore
Contributor

There are several file descriptor leaks in 4.8.6, two of which were patched in the next release.

If you'd like, we at Odum have a patched 4.8.6 warfile that we're happy to send (or you may build your own by referring to https://github.com/IQSS/dataverse/pull/4673/files and https://github.com/IQSS/dataverse/pull/4654/files).

We still hit our limit after several weeks, but RHEL kernel patches help us out there =)
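The two referenced PRs aren't reproduced here, but the kind of leak they patch is typically the "stream opened but not closed on every code path" pattern. Below is a minimal, hypothetical Java sketch of that pattern and the try-with-resources fix; the class and method names are invented for illustration and are not the actual Dataverse export code.

```java
// Hypothetical sketch of a file descriptor leak and its fix; the names below
// (ExportCacheExample, Exporter, writeCachedExport) are illustrative only.
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExportCacheExample {

    // Minimal stand-in for an exporter, just to keep the sketch self-contained.
    interface Exporter {
        void writeTo(OutputStream out) throws IOException;
    }

    // Leaky version: if exporter.writeTo() throws, close() is never reached and
    // the file descriptor stays open for the life of the JVM.
    static void writeCachedExportLeaky(Path cacheFile, Exporter exporter) throws IOException {
        OutputStream out = Files.newOutputStream(cacheFile);
        exporter.writeTo(out);
        out.close();
    }

    // Fixed version: try-with-resources closes the stream on every path,
    // including the exception path.
    static void writeCachedExport(Path cacheFile, Exporter exporter) throws IOException {
        try (OutputStream out = Files.newOutputStream(cacheFile)) {
            exporter.writeTo(out);
        }
    }
}
```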

@djbrooke
Contributor

I was typing a response that included links to @donsizemore's patches, but I see he's beaten me to it!

@bikramj
Contributor Author

bikramj commented Aug 15, 2018

Thank you @donsizemore & @djbrooke for such a lightning response! I am going to merge the code into our release branch and try it. Will let you know how it goes.

bikramj pushed a commit to scholarsportal/SP-dataverse that referenced this issue Aug 15, 2018
@bikramj
Contributor Author

bikramj commented Aug 15, 2018

I have applied the patch and uploads look better, but I am still wondering how just opening a dataset in the web UI opens 314 file handles to the "export_schema.org.cached" file! Is that a bug, or is it expected?
e.g.:

[root@dv1]# lsof | grep QBNPEH
java      22349            glassfish 1187r      REG               0,43       874 8321499284 /mnt/dataverse-files2/SP2/QBNPEH/export_schema.org.cached (172.23:/media/dataverse-files2)
java      22349 22353      glassfish 1187r      REG               0,43       874 8321499284 /mnt/dataverse-files2/SP2/QBNPEH/export_schema.org.cached (172.23:/media/dataverse-files2)
java      22349 22354      glassfish 1187r      REG               0,43       874 8321499284 /mnt/dataverse-files2/SP2/QBNPEH/export_schema.org.cached (172.23:/media/dataverse-files2)
.
.
.
[root@dv1]# lsof | grep QBNPEH | wc -l
314

@pdurbin
Member

pdurbin commented Aug 15, 2018

Huh. Sounds like a bug to me.

@donsizemore
Contributor

@bikramj our resident Dataverse expert identified two of the most prolific culprits, but we knew there were more. We found these by running 4.8.6 through SonarQube, though at this point it's probably more expeditious to upgrade to 4.9.2 first.

@bikramj
Contributor Author

bikramj commented Aug 15, 2018

@pdurbin @donsizemore, we can't upgrade to 4.9.2 at the moment because, since Kevin Worthington left Scholars Portal, we don't have a Dataverse developer to merge our custom code into the main v4.9.2 code.
For now, I will keep a watch on uploads.

@pdurbin
Member

pdurbin commented Aug 15, 2018

@bikramj do you know if there is a GitHub issue for each of the features you added in your custom code? It would be nice if you didn't have to run a fork in the future.

@bikramj
Contributor Author

bikramj commented Aug 15, 2018

@pdurbin I don't have a detailed issue for each of our custom changes. The biggest chunk was the internationalization code, which is slowly being merged into the main code. Another feature was the Data Explorer, which is already implemented in the main code. We also have 2-3 custom features, like user affiliation and redirection to the institution according to affiliation, plus a custom splash page that we are thinking of replacing with a separate static webpage available at the root / of our main DV URL.
I am wondering if there is a way to manage our custom code rather than maintaining a separate codebase?

@pameyer
Contributor

pameyer commented Aug 21, 2018

Did a little more investigation on this; the behavior is still present on bec8015 (i.e., post-4.9.2). Unauthenticated curl (or ab) calls to the dataset page result in ~150 additional "open for read" file descriptors for export_schema.org.cached; no other cached export appears to be causing problems. In a development/test environment, it's possible to crash glassfish with enough reloads of the dataset page (heap dumps turned out to be dominated by unexpired sessions and were unhelpful for diagnosing).

Open questions:

  • location of the problematic open statement(s)
  • why the same cached export file is being opened ~150 times by a single page load
  • whether / how many other areas could be generating similar problems.

@PaulBoon
Contributor

At DANS we also suffer from the 'too many open files' resource leak.
We also run a fork of 4.8.6, and I backported the fixes for the two known file descriptor leaks in the Dataverse 4.8.6 export code (#4654 and #4673), but the leaking continues.

[chart: number of open file descriptors over time]

This is shown in the attached graph of the number of open file descriptors over about a month, from July 17th up to August 21st.
We monitor this and restart glassfish if it gets too high; at 4096 it would become problematic.

I can't pinpoint which action is causing the leakage; maybe with a stress test (JMeter) we could try to find it, and also incorporate this testing into the release procedure?
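As a complement to lsof, the JVM can report its own open-descriptor count, which is easy to poll and graph. The sketch below is only an illustration: it assumes an OpenJDK/HotSpot JVM on Linux (where the OS MXBean can be cast to com.sun.management.UnixOperatingSystemMXBean), and it would have to run inside the Glassfish JVM (or be pointed at it over JMX) to measure Glassfish rather than itself. The 4096 threshold is the level mentioned above.

```java
// Sketch of polling the JVM's own open file descriptor count. Assumes an
// OpenJDK/HotSpot JVM on a Unix-like OS; run standalone it only reports the
// monitoring process's own count, not Glassfish's.
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdWatch {

    /** Returns the current open file descriptor count, or -1 if unavailable. */
    static long openFileDescriptorCount() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1; // bean not exposed on this JVM/OS
    }

    public static void main(String[] args) throws InterruptedException {
        final long warnThreshold = 4096; // the level described above as problematic
        while (true) {
            long open = openFileDescriptorCount();
            System.out.println("open file descriptors: " + open);
            if (open > warnThreshold) {
                System.err.println("WARNING: descriptor count above " + warnThreshold);
            }
            Thread.sleep(60_000); // poll once a minute
        }
    }
}
```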

@pameyer
Contributor

pameyer commented Aug 21, 2018

@PaulBoon GET requests to the dataset page (dataset.xhtml) are one cause I've been able to identify; I haven't yet identified the code path responsible. Have you identified any open descriptors in addition to the open-read descriptors for export_schema.org.cached (which would suggest additional causes)?

I haven't checked JMeter, but both ab (stress testing) and curl (small scale testing) seem to work - so I suspect that JMeter would produce similar results.

@PaulBoon
Contributor

@pameyer I get overwhelmed by the lsof output, but I do indeed see lots of export_schema.org.cached.
I selected one dataset and found that it had 645 lines for export_schema.org.cached and 129 for export_dataverse_json.cached (about a 5:1 ratio).

@pameyer
Contributor

pameyer commented Aug 21, 2018

@PaulBoon Thank you for the information. I'm not currently seeing export_dataverse_json.cached open, so there's a possibility that this has a fix on the branch I'm working on - I'll keep an eye out for it to make sure though.

@pameyer
Contributor

pameyer commented Aug 21, 2018

Got a chance to do a little more investigation, and have a possible solution for the open file descriptors related to various export formats - although it may need some clean-up (and checking to see that it doesn't revert other intended behavior).

@pameyer pameyer mentioned this issue Aug 22, 2018
@pameyer
Contributor

pameyer commented Aug 22, 2018

#4991 addresses file descriptor leaks caused by GET calls to the dataset page (dataset.xhtml) and the native export dataset API (api/datasets/export). Prior to this PR, these GET calls (curl or ab) resulted in increasing numbers of open-read file descriptors for the cached export files, evaluated by lsof | grep $thing_we_used_to_call_dataset_identifier | wc -l. #4991 may not address all resource leaks.
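For readers who only want the shape of the change: the open-for-read descriptors suggest a cached export file being opened for streaming and never closed. The sketch below shows that read-side pattern and the obvious fix; it is a hedged illustration with invented names, not the code actually changed in #4991.

```java
// Hypothetical read-side sketch; CachedExportStreamer and its methods are
// invented names, not the Dataverse classes touched by #4991.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class CachedExportStreamer {

    // Leaky pattern: the stream over the cached export is never closed, so each
    // request pins one read descriptor on the cached file (e.g.
    // export_schema.org.cached) until the JVM exits.
    static void streamLeaky(Path cachedExport, OutputStream response) throws IOException {
        InputStream in = Files.newInputStream(cachedExport);
        copy(in, response);
        // missing: in.close()
    }

    // Fixed pattern: try-with-resources releases the descriptor whether the
    // copy succeeds or throws.
    static void stream(Path cachedExport, OutputStream response) throws IOException {
        try (InputStream in = Files.newInputStream(cachedExport)) {
            copy(in, response);
        }
    }

    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}
```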

@bikramj
Contributor Author

bikramj commented Aug 22, 2018

@PaulBoon @pameyer @pdurbin
Thank you, Gentlemen, for looking into it and putting up a fix.

@pdurbin
Member

pdurbin commented Aug 23, 2018

@pameyer as I mentioned in IRC I made a couple of commits to your pull request. 6b4abd6 is simply formatting changes (mostly removing tabs and fixing inconsistent brace placement, #4992), and in eed28a9 I switched a change over to the IOUtils.closeQuietly method that you and @qqmyers seem to like. I didn't run any code. I hope this helps! Please do feel free to document the tools you're using to detect memory leaks over at doc/sphinx-guides/source/developers/tools.rst in this pull request! I've heard of Sonar but Infer ( https://github.com/facebook/infer ) is new to me.
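For context on the closeQuietly change: IOUtils.closeQuietly comes from Apache Commons IO; it ignores null arguments and swallows the IOException that close() can throw, which makes it convenient in finally blocks around older code where try-with-resources isn't a drop-in. The snippet below is just a sketch of the idiom, not the actual commit.

```java
// Sketch of the IOUtils.closeQuietly idiom; the method countBytes is invented
// for the example and is not code from the pull request.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.commons.io.IOUtils;

public class CloseQuietlyExample {

    static long countBytes(Path cachedExport) throws IOException {
        InputStream in = null;
        try {
            in = Files.newInputStream(cachedExport);
            long count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } finally {
            // closeQuietly tolerates null and suppresses any IOException from
            // close(), so cleanup cannot mask an exception thrown by the read.
            IOUtils.closeQuietly(in);
        }
    }
}
```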

@pameyer
Contributor

pameyer commented Aug 24, 2018

Changes mentioned in http://irclog.iq.harvard.edu/dataverse/2018-08-23#i_71871 have been backed out; re-checked dataset.xhtml for file descriptor leaks; added some docs.

@pdurbin
Member

pdurbin commented Aug 24, 2018

@pameyer I approved #4991 after making a couple tweaks (no-op reformat in fb2603a and tweaks to the tools writeup in 5266bf3). I'm passing this to QA but please consider writing up or otherwise explaining the best way to test. From the comment at #4991 (review) it looks like @qqmyers might be patching TDL with some of these fixes, which is great. I tried SonarQube on my laptop but didn't try Infer yet.

@pameyer
Contributor

pameyer commented Aug 24, 2018

@pdurbin Thanks for reviewing and helping out with the cleanup.

I'd thought
#4958 (comment) had enough info to test - could you let me know what I'm missing there?

@pdurbin
Member

pdurbin commented Aug 24, 2018

@pameyer ah, you're probably right. You're basically saying to hit the dataset page and check lsof output for open files. Makes sense. Thanks for working on this!

@kcondon kcondon self-assigned this Aug 24, 2018
@kcondon kcondon closed this as completed Aug 28, 2018
@djbrooke djbrooke changed the title Lot of open files by dataverse glassfish process! Fix Exporter Leaks Aug 28, 2018
@pameyer
Contributor

pameyer commented Aug 29, 2018

For future reference, @kcondon and I discussed this during QA and there may be additional factors involved. I was testing and developing on CentOS 7 (in docker-aio), and also saw this behavior on a different CentOS 7 system running an earlier branch; both of these systems showed a file descriptor leak and an amplification effect (1 request leading to > 150 open descriptors). Both of these systems were also using the local filesystem storage driver (as opposed to S3 or Swift). On the CentOS 6 system used for QA, the descriptor leak was observed without amplification (1 request leading to 1 open descriptor). With this branch (now merged), the export descriptor leak wasn't seen anymore.

So the cause of the amplification is still unknown; potentially related to differences in JVM, OS or system configuration. I'm not planning to troubleshoot more at the moment, but there's a chance this information will be useful in the future.
