
sslyze calls deadlocking due to combination of threads/processes/logging #157

Closed
konklone opened this issue Nov 5, 2017 · 4 comments · Fixed by #158

konklone commented Nov 5, 2017

There are no more defunct processes after #151, and our bulk scans run for much longer before they hit trouble, but eventually they do just get stuck. Or as this describes it:

If a process is forked while an io lock is being held, the child process will deadlock on the next call to flush.

After looking at the stuck processes' trace with gdb, I'm convinced I'm facing the same issue described here:
https://stackoverflow.com/questions/39884898/large-amount-of-multiprocessing-process-causing-deadlock

And that this bug, opened in 2009 and still quite actively discussed in October 2017, is the cause:
https://bugs.python.org/issue6721

The folks on that bug thread seem to be converging on a fix that is specific to logging calls and buffered IO, which I suspect would be enough to fix our case. There's also some related discussion on that bug, with Guido indicating he believes something should be done for the GC interrupt case.
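For concreteness, here's a minimal repro sketch of that failure mode (my own illustration, not taken from the bug thread). It assumes a POSIX system using the default "fork" start method, and it only hangs probabilistically, when a fork happens to land while the logging thread holds the handler's IO lock:

import logging
import multiprocessing
import threading

logging.basicConfig(level=logging.INFO)


def spam_logs():
    # Keeps grabbing the handler's internal lock by writing log lines.
    while True:
        logging.info("parent thread writing")


def child_task():
    # If the fork happened while the parent thread held the logging/IO lock,
    # this call (or the implicit flush at child exit) can hang forever.
    logging.info("child writing")


if __name__ == "__main__":
    threading.Thread(target=spam_logs, daemon=True).start()
    for _ in range(1000):
        p = multiprocessing.Process(target=child_task)
        p.start()
        p.join()  # hangs once a child inherits a held lock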

There are a few ways to work around this I can think of:

  • Use sslyze's SynchronousScanner instead of the ConcurrentScanner. However, this is both much slower and results in a distinct memory leak, as noted in Using SSLyze as a Python library instead of shelling out #151.
  • Don't ever write to stdout from the child processes. However, this runs up against my not-totally-complete understanding here -- I'm not sure whether it's my own logging calls or something inside sslyze that is the problem. Workers at the domain-scan level run as threads via a ThreadPoolExecutor, whereas I believe it's SSLyze's ConcurrentScanner that forks off processes, so I may have limited control here. The only reference to using stdout in SSLyze's core is this emergency shutdown message, so I am not sure where in SSLyze this might be happening.
  • Use this Python module, which lets you register an after-fork hook that cleans up any held locks the child copied over. It's not on PyPI and would need to be installed from the repo (pip supports GitHub repo syntax).

Right now, I'm leaning toward the 3rd option, the Python module. I'll try it out and see how it goes.
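For illustration, here's a minimal sketch of the after-fork-hook idea. Note this is not the module linked above: it uses os.register_at_fork, which only exists in the standard library from Python 3.7 on, so it's just the same concept expressed with stdlib calls:

import logging
import os


def _reset_logging_locks():
    # Runs in the child immediately after fork(): recreate the lock on every
    # handler attached to the root logger so the child never waits on a lock
    # that was held by a (now nonexistent) parent thread.
    for handler in logging.getLogger().handlers:
        handler.createLock()


if hasattr(os, "register_at_fork"):  # standard library, Python 3.7+
    os.register_at_fork(after_in_child=_reset_logging_locks)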


konklone commented Nov 5, 2017

Ah, here's a possible candidate: SSLyze reads trust stores from disk at some point (IO that I believe involves a lock) to verify certificate chains against them:

https://github.com/nabla-c0d3/sslyze/blob/f0610b73da965ab4872375b4eebc316a0a2a71eb/sslyze/plugins/utils/trust_store/trust_store.py#L84-L85

It's possible that the certificate info plugin is the culprit here.


konklone commented Nov 5, 2017

Hmm, also noting that the atfork-lock-release module only supports Python 3 (right now domain-scan and sslyze both maintain Python 2.7 support as well), and it's also a C module that, given its one-off non-PyPI nature, could complicate Lambda packaging.

I guess I'll take a shot at cutting out certificate info scanning for now, but that doesn't seem sustainable.


konklone commented Nov 5, 2017

I finally got a working py-bt trace in gdb, thanks to @jsf9k's work on getting this set up via Docker. Here's a hung process:

<built-in method flush of _io.BufferedWriter object at remote 0x7f8c7a209bb8>
  <built-in method flush of _io.TextIOWrapper object at remote 0x7f8c7a1c8688>
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/multiprocessing/process.py", line 269, in _bootstrap
    sys.stderr.flush()
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/multiprocessing/popen_fork.py", line 74, in _launch
    code = process_obj._bootstrap()
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/site-packages/sslyze/concurrent_scanner.py", line 118, in _check_and_create_process
    process.start()
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/site-packages/sslyze/concurrent_scanner.py", line 94, in queue_scan_command
    self._check_and_create_process(server_info.hostname)
  File "/home/scanner/scanners/sslyze.py", line 389, in queue
    return scanner.queue_scan_command(server_info, command)
  File "/home/scanner/scanners/sslyze.py", line 399, in scan_parallel
    queue(Sslv20ScanCommand())
  File "/home/scanner/scanners/sslyze.py", line 162, in run_sslyze
    sslv2, sslv3, tlsv1, tlsv1_1, tlsv1_2, certs = scan_parallel(scanner, server_info, options)
  File "/home/scanner/scanners/sslyze.py", line 74, in scan
    data = run_sslyze(scan_domain, options)
  File "./scan", line 120, in process_scan
    rows = list(scanner.scan(domain, options))
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/concurrent/futures/thread.py", line 66, in _worker
    work_item.run()
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()

And it's specifically hung during multiprocessing's process cleanup, where it flushes stderr:

(gdb) py-list
 264                sys.stderr.write('Process %s:\n' % self.name)
 265                traceback.print_exc()
 266            finally:
 267                util.info('process exiting with exitcode %d' % exitcode)
 268                sys.stdout.flush()
>269                sys.stderr.flush()
 270    
 271            return exitcode
 272    
 273    #
 274    # We subclass bytes to avoid accidental transmission of auth keys over network

When I hook into the parent process, its backtrace suggests it's deadlocked trying to shut down a thread inside the ThreadPoolExecutor:

(gdb) py-bt
Traceback (most recent call first):
  <built-in method acquire of _thread.lock object at remote 0x7f8c73a11fb8>
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/concurrent/futures/thread.py", line 144, in shutdown
    t.join()
  File "/opt/pyenv/versions/3.6.1-debug/lib/python3.6/concurrent/futures/_base.py", line 581, in __exit__
    self.shutdown(wait=True)
  File "./scan", line 142, in scan_domains
    executor.map(process_scan, tasks)
  File "./scan", line 84, in run
    scan_domains(scans, domains)
  File "./scan", line 178, in <module>
    run(options)

And the specific lines:

(gdb) py-list
1067            # If the lock is acquired, the C code is done, and self._stop() is
1068            # called.  That sets ._is_stopped to True, and ._tstate_lock to None.
1069            lock = self._tstate_lock
1070            if lock is None:  # already determined that the C code is done
1071                assert self._is_stopped
>1072            elif lock.acquire(block, timeout):
1073                lock.release()
1074                self._stop()
1075    
1076        @property
1077        def name(self):

I am not sure how to resolve this. One option would be to limit the domain-scan worker threads to 1, so that the fork always happens in a predictable state and there's no chance of one thread forking off a child process while another thread holds a lock. That would slow things down, but it would at least be predictable; a rough sketch of that change is below.
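Sketching that workaround, reusing the process_scan/tasks names visible in the ./scan traceback above (the real wiring may differ):

from concurrent.futures import ThreadPoolExecutor

# With max_workers=1 there is never a sibling worker thread mid-write when
# SSLyze's ConcurrentScanner forks, so a child can't inherit an IO/logging
# lock held by another worker (the main thread could still hold one, though).
with ThreadPoolExecutor(max_workers=1) as executor:
    for result in executor.map(process_scan, tasks):
        pass  # process_scan and tasks as in ./scan's scan_domains()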


konklone commented Nov 5, 2017

Checking out this very helpful suggestion:
https://bugs.python.org/issue28382#msg298900

Also, a reliable fix is to use the "forkserver" (or "spawn", but it is much slower) method:
https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

Adding multiprocessing.set_start_method('forkserver') to the main scan runner's __main__ execution block, as suggested by the docs, would seem to be an elegant solution.

This approach creates a "fork server": essentially a vanilla master process from which new child processes can be requested at any time. That makes child processes clean slates and eliminates the chance of them inheriting a held lock from the parent (which may have many threads running at once).
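Roughly, the change in the scan runner would look like this (a sketch only; run() and options stand in for the existing entry point and parsed CLI options seen in the ./scan traceback):

import multiprocessing


def run(options):
    # ... existing scan logic (scan_domains etc.) goes here ...
    pass


if __name__ == '__main__':
    options = {}  # parsed CLI options in the real scan runner
    # set_start_method must run before any Process/Pool is created, so that
    # later children (including those SSLyze's ConcurrentScanner spawns) come
    # from a clean fork-server process rather than being forked directly from
    # this multi-threaded parent.
    multiprocessing.set_start_method('forkserver')
    run(options)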
