Lidarr.exe process hung on FreeBSD 11.2 (FreeNAS 11.2) #602

Closed
rmccu opened this issue Jan 25, 2019 · 23 comments

rmccu commented Jan 25, 2019

Describe the bug
The Lidarr.exe process periodically becomes hung on FreeNAS 11.2 (FreeBSD 11.2).
The process looks like this (from ps auxwwww): /usr/local/bin/mono /usr/local/share/Lidarr/Lidarr.exe --nobrowser --data=/usr/local/lidarr (mono-sgen)

It is running in a FreeBSD iocage jail.

I have tried kill -9 on the process, but it does not exit. When I tell the host OS to reboot, it cannot, because it is unable to end this process.

A URL for the log is provided below; I believe it froze at 19-1-24 05:04:06.2.

To Reproduce
Steps to reproduce the behavior: Unknown

Expected behavior
The process should not hang; I should be able to restart the service and reboot the system.

Screenshots
N/A

Logs
https://pastebin.com/t7j5qb0K

System info (please complete the following information):

  • Lidarr Version: 0.5.0.583
  • Operating System: FreeNAS 11.2 (FreeBSD 11.2)
  • .net Framework (Windows) or mono (macOS/Linux) Version: 5.10.1.57

Additional context
N/A

AB#321

@rootwyrm

Note: this is FreeNAS-specific, not FreeBSD. I actively test on 11.2 and 12.0 and have not observed this behavior, nor am I able to reproduce it. FreeNAS has its own separate base system, which carries a number of patches not present in FreeBSD.

I would need the output of truss -f -p ${mono_pid} all the way up to the failure point to be able to identify where exactly the failure is occurring.

rmccu commented Jan 26, 2019

I don't know how to reproduce this behavior. It has happened 2 or 3 times in the last ~2 months. It also does not happen while I am actively doing things via the web UI; it appears to hang during background processing.

FWIW, I ran that command against the currently running process, and it looks like it is producing a bunch of errors: https://pastebin.com/xnnbrt4S

Is this a red herring? Or relevant?

@rootwyrm

Yes, that is extremely helpful. The pointer here is a flood of UMTX_OP_WAIT_UINT_PRIVATE calls, indicating the process is spinning uselessly trying to acquire a lock in kernel space - most likely related to a file operation or a high rate of file operations. This does NOT reproduce on FreeBSD 11.2 or 12.0 with ZFS.

There are complaints about numerous other applications that access files frequently running into the same problem, making this clearly Yet Another FreeNAS-Specific Issue.

rmccu commented Jan 26, 2019

I am not even sure what to put in a ticket to FreeNAS about this.

@rootwyrm

Honestly, I wouldn't even bother. There's a pile of bugs for this exact behavior across various applications, and they've just been closing them without actually fixing them - a lot of falsely pointing the finger at the applications. You could attempt to tune vfs and iocage settings to see if that alleviates the problem, but I don't think it would help here. This is ultimately a kernel-level issue.

rmccu commented Jan 27, 2019

I went ahead and posted on the FreeNAS forums: https://forums.freenas.org/index.php?threads/lidarr-exe-process-hangs-occassionally-process-trying-to-get-a-lock-in-kernel-space.73264/

I will likely open a ticket in their Redmine system for this issue, but I don't know enough technically to describe the real underlying problem. Is this something you would be willing to assist with?

rmccu commented Jan 28, 2019

FWIW, this happened again. Here is the ps & truss output:

[root@lidarr /]# ps auxwwww
USER     PID  %CPU %MEM    VSZ    RSS TT  STAT STARTED       TIME COMMAND
lidarr 51252 100.0  1.3 496644 448348  -  IsJ  Sat14   1196:11.17 /usr/local/bin/mono /usr/local/share/Lidarr/Lidarr.exe --nobrowser --data=/usr/local/lidarr (mono-sgen)
root    5990   0.0  0.0   6544   2356  -  IsJ  Thu12      0:00.06 dhclient: epair0b [priv] (dhclient)
_dhcp   6009   0.0  0.0   6544   2476  -  DCsJ Thu12    117:50.15 dhclient: epair0b (dhclient)
root    6216   0.0  0.0   6412   2456  -  SsJ  Thu12      0:00.69 /usr/sbin/syslogd -c -ss
root    6272   0.0  0.0   6464   2432  -  IsJ  Thu12      0:01.52 /usr/sbin/cron -J 15 -s
root   82953   0.0  0.0   6464   2436  -  IJ   03:01      0:00.00 cron: running job (cron)
root   82957   0.0  0.0   7064   2796  -  IsJ  03:01      0:00.00 /bin/sh - /usr/sbin/periodic daily
root   83823   0.0  0.0   4236   1964  -  IJ   03:01      0:00.00 lockf -t 0 /var/run/periodic.daily.lock /bin/sh /usr/sbin/periodic LOCKED daily
root   83824   0.0  0.0   7064   2796  -  IJ   03:01      0:00.01 /bin/sh /usr/sbin/periodic LOCKED daily
root   83833   0.0  0.0   7064   2804  -  IJ   03:01      0:00.01 /bin/sh /usr/sbin/periodic LOCKED daily
root   83834   0.0  0.0   6388   2184  -  IJ   03:01      0:00.00 mail -E -s lidarr daily run output root
root   83920   0.0  0.0   7064   2800  -  IJ   03:01      0:00.00 /bin/sh /etc/periodic/daily/450.status-security
root   83921   0.0  0.0   7064   2796  -  IJ   03:01      0:00.00 /bin/sh - /usr/sbin/periodic security
root   83923   0.0  0.0   4236   1964  -  IJ   03:01      0:00.00 lockf -t 0 /var/run/periodic.security.lock /bin/sh /usr/sbin/periodic LOCKED security
root   83924   0.0  0.0   7064   2796  -  IJ   03:01      0:00.01 /bin/sh /usr/sbin/periodic LOCKED security
root   83932   0.0  0.0   7064   2804  -  IJ   03:01      0:00.01 /bin/sh /usr/sbin/periodic LOCKED security
root   83933   0.0  0.0   6388   2152  -  IJ   03:01      0:00.00 mail -E -s lidarr daily security run output root
root   88428   0.0  0.0   7064   2820  -  IJ   03:13      0:00.00 /bin/sh - /usr/local/etc/periodic/security/460.pkg-checksum
root   88431   0.0  0.0   7064   2820  -  IJ   03:13      0:00.00 /bin/sh - /usr/local/etc/periodic/security/460.pkg-checksum
root   88432   0.0  0.0  13572   8312  -  IJ   03:13      0:00.00 /usr/local/sbin/pkg check -qsa
root   88433   0.0  0.0   6324   2096  -  IJ   03:13      0:00.00 sed -e s/ checksum mismatch for//
root   88434   0.0  0.0   6288   2000  -  ICJ  03:13      0:00.00 tee /dev/stderr
root   88435   0.0  0.0   6508   2244  -  IJ   03:13      0:00.00 wc -l
root   88436   0.0  0.0  13572   9152  -  DJ   03:13     98:42.15 /usr/local/sbin/pkg check -qsa
root   80857   0.0  0.0   7872   4024  0  SJ   09:05      0:00.01 bash
root   80885   0.0  0.0   6992   2908  0  R+J  09:05      0:00.00 ps auxwwww
[root@lidarr /]# truss -f -p 51252
^Ctruss: Unexpect stop in waitpid: Interrupted system call
[root@lidarr /]#

Note that I had to kill the truss command after waiting about 30 seconds with no output.

Here are the last 25 lines of /usr/local/lidarr/logs/Lidarr.txt:

19-1-27 13:29:31.2|Info|ExistingOtherExtraImporter|Found 0 existing other extra files
19-1-27 13:29:31.2|Info|ExistingExtraFileService|Found 0 extra files
19-1-27 13:29:31.2|Info|RefreshArtistService|Updating Info for Bleachers
19-1-27 13:29:31.8|Info|DiskScanService|Scanning Bleachers
19-1-27 13:29:32.1|Info|DiskScanService|Completed scanning disk for Bleachers
19-1-27 13:29:32.1|Info|ExistingMetadataImporter|Found 0 existing metadata files
19-1-27 13:29:32.1|Info|ExistingLyricImporter|Found 0 existing lyric files
19-1-27 13:29:32.1|Info|ExistingOtherExtraImporter|Found 0 existing other extra files
19-1-27 13:29:32.1|Info|ExistingExtraFileService|Found 0 extra files
19-1-27 13:29:32.1|Info|RefreshArtistService|Updating Info for Blues Traveler
19-1-27 13:29:32.3|Info|DiskScanService|Scanning Blues Traveler
19-1-27 13:29:32.8|Info|DiskScanService|Completed scanning disk for Blues Traveler
19-1-27 13:29:32.8|Info|ExistingMetadataImporter|Found 0 existing metadata files
19-1-27 13:29:32.8|Info|ExistingLyricImporter|Found 0 existing lyric files
19-1-27 13:29:32.8|Info|ExistingOtherExtraImporter|Found 0 existing other extra files
19-1-27 13:29:32.8|Info|ExistingExtraFileService|Found 0 extra files
19-1-27 13:29:32.8|Info|RefreshArtistService|Updating Info for Bob Dylan
19-1-27 13:29:33.0|Info|DiskScanService|Scanning Bob Dylan
19-1-27 13:29:35.9|Info|DiskScanService|Completed scanning disk for Bob Dylan
19-1-27 13:29:38.5|Info|ExistingMetadataImporter|Found 0 existing metadata files
19-1-27 13:29:38.5|Info|ExistingLyricImporter|Found 0 existing lyric files
19-1-27 13:29:38.6|Info|ExistingOtherExtraImporter|Found 0 existing other extra files
19-1-27 13:29:38.6|Info|ExistingExtraFileService|Found 0 extra files
19-1-27 13:29:38.6|Info|RefreshArtistService|Updating Info for Bob Seger
19-1-27 13:29:39.3|Info|DiskScanService|Scanning Bob Seger

@rootwyrm

This definitely points to a kernel-level failure under high file loads, a common failure mode on FreeNAS. Roughly how many files do you have for artists starting with a number (0-9), A, or B? Also, what other applications are you running? This will help inform whether the problem can be mitigated with tuning.

rmccu commented Jan 28, 2019

Here is a count of FLAC files for artists whose name starts with a number, A, or B:

root@freenas[/mnt/Tank/Media/Music]# find . -type f -name "*.flac" | grep "^\./[0-9]\|^\./A\|^\./B" | wc -l
     870

I don't think I have files other than FLAC in there.

I am not running anything else in this iocage jail. This is what ps displays after a reboot/reset:

[root@lidarr /]# ps auxwww
USER    PID %CPU %MEM    VSZ    RSS TT  STAT STARTED    TIME COMMAND
root   5893  0.0  0.0   6544   2356  -  IsJ  12:49   0:00.00 dhclient: epair0b [priv] (dhclient)
_dhcp  6009  0.0  0.0   6544   2476  -  ICsJ 12:49   0:00.00 dhclient: epair0b (dhclient)
root   6229  0.0  0.0   6412   2412  -  IsJ  12:49   0:00.00 /usr/sbin/syslogd -c -ss
lidarr 6278  0.0  0.9 360532 305120  -  SsJ  12:49   5:45.62 /usr/local/bin/mono /usr/local/share/Lidarr/Lidarr.exe --nobrowser --data=/usr/local/lidarr (mono-sgen)
root   6285  0.0  0.0   6464   2432  -  IsJ  12:49   0:00.00 /usr/sbin/cron -J 15 -s
root   9571  0.0  0.0   7872   4024  0  SJ   13:01   0:00.01 bash
root   9573  0.0  0.0   6992   2884  0  R+J  13:01   0:00.00 ps auxwww

On my FreeNAS server, I have separate iocage jails set up for the following:

  • Lidarr
  • OpenVPN
  • Plex
  • Radarr
  • SABnzbd
  • Sonarr
  • Transmission

I have a few other jails, but they are not running.

@rootwyrm

Okay, so making some fairly conservative assumptions, Lidarr should probably only be holding on the order of ~225 files open at any one point, worst case. If it were just the number of files, I'd have no problem reproducing with a much larger file set. So something on FreeNAS is causing the locks to spin too slowly or get hung up in sleep. The extremely high CPU utilization is actually pretty normal, because that's just how mono rolls in general.

The fix here is probably going to require the MTX_POOL_SLEEP_SIZE changes from r343329, because those slow locks are probably exhausting the pool catastrophically. (That change increases the sleep pool from 128 to 1024.) It was only MFC'd to 11-STABLE 5 days ago, though. There's also a GEOM fix (r343334), which changes some mutexes to atomics, that could heavily influence this as well, at least with regard to spinning locks. Unfortunately, testing that requires a complete release build from 11-STABLE.

rmccu commented Jan 29, 2019

This sounds like something FreeNAS would need to add to the next incremental release for 11.2. I would like to open a ticket on their ticketing site but am not entirely sure how to word the request. Let me take a stab at it, though; I will post here when I am done.

@rootwyrm

You should just need to reference those two commits as r343329 and r343334 - iX sponsored the r343334 commit anyway. (Well, the original in -CURRENT, but same difference.) If it is pool exhaustion as I suspect, r343329 should change the behavior enough to actually debug what's going on here. Once the system is dead hung, it's just not possible to get usable data.

Looking at a truss on my working system, I do see a very high number of _umtx_op(*addr,UMTX_OP_WAIT_UINT_PRIVATE,int,int,*addr) calls, but they are waking up properly. They aren't exclusive to Lidarr, either - other mono applications that aren't NzbDrone-based show similar behavior. If I get some free time later this week, I'll see if I can find a heavy stressor to test mono with (though I'm honestly having a hard time thinking of anything worse than my Lidarr use case).

rmccu commented Jan 29, 2019

Thank you @rootwyrm, you have gone above and beyond here helping me with a problem that is not your/Lidarr's problem.

rmccu commented Jan 29, 2019

Opened Issue #72545.

rmccu commented Jan 29, 2019

Received a response on #72545:

Ryan, I don't believe those patches are related to your problem, whatever it is. Second patch is completely unrelated to ZFS, improving only raw SSD benchmarks. First may slightly improve performance if you get to hundreds of thousands of IOPS, but it should not affect any ZFS hangs or other fatal issues. Those patches will be present in FreeNAS 11.3, but I see no particular reason to hurry with their backport to 11.2, especially this late in its U2 cycle.

@rootwyrm

Commented on the FreeNAS issue; I don't agree with the assessment that this can wait for 11.3, because 11.3 doesn't have a date yet. However, we're at their mercy there, so all we can do is hope they listen.

As discussed with @ta264 on Discord, Lidarr also appears to hold files open excessively. This would definitely exacerbate the issue, especially on systems with slow spinners like we have here. So there are two root causes.

Issue 1: On FreeNAS, the overall activity is leading to catastrophic pool depletion. Once pool depletion occurs, the system is basically unrecoverable unless you are using a bhyve instance with a fully isolated kernel. My hypothesis is that commits r343329 and r343334 should alleviate the lock problem enough to hopefully tune around it. To be clear, this is due to the lazy design of NzbDrone - Sonarr does the same thing - so multiple NzbDrones which scan regularly will make it worse.
Theoretically, a bhyve VM for Lidarr might alleviate this enough to keep things running, but there are other issues associated with going to bhyve, so it's a question of trade-offs.

Issue 2: Lidarr holds files open much longer than necessary - but it doesn't leak FDs. There's no evidence to suggest Lidarr is actually exhausting FDs, but the lazy freeing and reliance on the Mono GC cause it to lean on the kernel excessively, as we're seeing here. Because Lidarr tends to work on much larger file sets than Sonarr/Radarr, not holding files open is much more critical. This behavior dates back to NzbDrone, as I mentioned above. @ta264's rewrite should address that behavior, but we'll need to verify. Unfortunately, I'm fighting through a lang/mono problem on the 12.0 box I have allocated for nightly, so I won't be able to test as quickly as I'd like.

ta264 commented Jan 29, 2019

Most of this is over my head, but just FYI: keeping the file handle open too long is a Lidarr-specific issue (reading MP3/FLAC tags), so if you have the same problems in Sonarr or Radarr, my change won't fix them.

@rootwyrm

I may have misunderstood what you said you had changed, then. The biggest concern is fopen() without fclose(), relying on mono's GC to clean it up afterwards. Mono's GC can only do so much, and putting more load on it that way is going to cause serious problems down the road for larger libraries.

Unfortunately, I don't know anything at all about .NET, so I don't even know where to look for the fopen() operations. If your change actively closes file access after scan/update, that will indeed fix it.

rmccu commented Jan 29, 2019

@ta264 I do not see this issue on Sonarr or Radarr. It seems to only affect Lidarr.

@rootwyrm I used to write a ton of C# code, but it has been about 6 years. Should I see if I can find where files are being opened but not closed and create a pull request to address this?
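
If I do, I assume the pattern to look for is roughly this - a sketch from memory with made-up names, not actual Lidarr code:

using System.IO;

class TagScanner
{
    // The problematic pattern: the stream is opened and never explicitly
    // closed, so the OS file handle stays open until Mono's GC happens to
    // finalize the object at some unpredictable later point.
    static int ReadFirstByteLeaky(string path)
    {
        var stream = new FileStream(path, FileMode.Open, FileAccess.Read);
        return stream.ReadByte(); // handle lingers until finalization
    }

    // The fix: 'using' guarantees Dispose() (and the underlying close)
    // as soon as the scope exits, so the number of open handles is
    // bounded by the scan loop rather than by GC timing.
    static int ReadFirstByteScoped(string path)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            return stream.ReadByte();
        }
    }
}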

Qstick commented Jan 30, 2019

ta264's file disposal fix is here:
81895bd, and it is getting closer to being merged into nightly.

There are also three memory fixes recently merged into nightly, related to an incorrect SQLite DB cache, a datamapper lazy-load keeping the parent mapper alive, and a workaround for a Mono gzip issue in the HTTP connection code that caused a leak.

It will be interesting to see how this looks after all of these are integrated.
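
For the second of those fixes, the leak pattern is roughly this - an illustrative sketch with made-up names, not the actual datamapper code:

using System;
using System.Collections.Generic;

class ArtistMapper
{
    // Large per-mapper state, e.g. a row cache.
    private readonly Dictionary<int, string> _cache = new Dictionary<int, string>();

    // Leaky pattern: the lambda closes over 'this' (via _cache), so any
    // caller holding the Lazy<string> keeps the entire mapper - cache and
    // all - reachable and alive.
    public Lazy<string> NameFor(int id)
    {
        return new Lazy<string>(() => _cache.TryGetValue(id, out var n) ? n : "unknown");
    }

    // Fix: resolve what the closure needs up front, so the returned value
    // no longer references the parent mapper.
    public Lazy<string> NameForDetached(int id)
    {
        string name = _cache.TryGetValue(id, out var n) ? n : "unknown";
        return new Lazy<string>(() => name);
    }
}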

rmccu commented Jan 30, 2019

Thanks all, I look forward to trying these fixes.

rmccu commented Feb 1, 2019

@rootwyrm I posted a couple of responses in the iXsystems/FreeNAS issue. Not sure if you are following it.

ta264 commented May 20, 2022

Mono is dead, long live .NET

ta264 closed this as completed May 20, 2022
github-actions bot locked as resolved and limited conversation to collaborators May 12, 2023