Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random process crashes on the (almost) lastest portable #929

Closed
jirutka opened this issue Aug 21, 2019 · 61 comments
Closed

Random process crashes on the (almost) lastest portable #929

jirutka opened this issue Aug 21, 2019 · 61 comments
Assignees

Comments

@jirutka
Copy link
Contributor

jirutka commented Aug 21, 2019

I’m running OpenSMTPD built from 772da22 (+ this patch for musl compatibility) with OpenSSL 1.1.1c* on Alpine Linux 3.10 (musl libc) with filter opensmtpd-filter-rewrite-from. It’s randomly crashes during email relaying.

This is all it logs when running with -v:

Aug 21 14:23:36 no-reply mail.info smtpd[3456]: 8bf482cee62b7ea0 mta connecting address=smtp://10.38.1.1:25 host=relay.in.*****.cz
Aug 21 14:23:36 no-reply mail.info smtpd[3456]: 8bf482cee62b7ea0 mta connected
Aug 21 14:23:36 no-reply mail.debug smtpd[3456]: debug: mta-routing: route [] <-> 10.38.1.1 (relay.in.*****.cz) is now valid.
Aug 21 14:23:36 no-reply mail.debug smtpd[3456]: debug: mta: connecting with [connector:[]->[relay:relay.in.*****.cz,port=25,smtp,mx,heloname=no-reply.in.*****.cz],0x20000]
Aug 21 14:23:36 no-reply mail.debug smtpd[3456]: debug: mta: cancelling connector timeout
Aug 21 14:23:36 no-reply mail.debug smtpd[3456]: debug: mta: enough connections already
Aug 21 14:23:36 no-reply mail.debug smtpd[3456]: debug: mta: 0x556c2071d0e0: handling next task for relay [relay:relay.in.*****.cz,port=25,smtp,mx,heloname=no-reply.in.*****.cz]
Aug 21 14:23:36 no-reply mail.debug smtpd[3454]: debug: control -> pony express: pipe closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3454]: debug: control agent exiting
Aug 21 14:23:36 no-reply mail.debug smtpd[3453]: debug: ca -> pony express: pipe closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3453]: debug: ca agent exiting
Aug 21 14:23:36 no-reply mail.debug smtpd[3446]: debug: parent -> control: pipe closed
Aug 21 14:23:36 no-reply mail.crit smtpd[3446]: smtpd: process control socket closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3458]: debug: scheduler -> control: pipe closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3458]: debug: scheduler agent exiting
Aug 21 14:23:36 no-reply mail.debug smtpd[3457]: debug: queue -> pony express: pipe closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3457]: debug: queue agent exiting
Aug 21 14:23:36 no-reply mail.debug smtpd[3455]: debug: lka -> pony express: pipe closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3455]: debug: lookup agent exiting

smtpd.conf:

# This is the smtpd server system-wide configuration file.
# See smtpd.conf(5) for more information.

# Full domain name of this mail server.
server_name = "no-reply.in.****.cz"

# Address of the relay server.
relay_host = "relay.in.****.cz"

# Table of hosts/networks that are allowed to relay mails.
table internal { "10.38.1.0/24", "127.0.0.1", "::1", "local" }

# Table of user credentials allowed to relay mails for any recipients.
table passwd "/etc/smtpd/passwd.table"

pki $server_name cert "/etc/ssl/smtpd/server.crt"
pki $server_name key "/etc/ssl/smtpd/server.key"

# Specify a maximum message size of n bytes.
# The argument may contain a multiplier, as documented in scan_scaled(3).
# The default maximum message size is 35MB if none is specified.
smtp max-message-size 1M

filter "rewrite-from" proc-exec "/etc/smtpd/filter-rewrite-from no-reply@****.cz"

listen on socket filter "rewrite-from"
listen on lo hostname $server_name filter "rewrite-from"

# Submission port for relaying mails from the internal network to the outside
# world. This port is accessible only from the internal network.
listen on eth0 port 587 tls \
	hostname $server_name \
	auth-optional <passwd> \
	received-auth \
	mask-src \
	filter "rewrite-from" \
	tag "internal"

action "discard" mda "/bin/true" virtual { "@" = "nobody" }
action "relay" relay host $relay_host helo $server_name

match from src <internal> \
	for domain { "drop.me", "example.org", "local.host" } \
	action "discard"

match from src <internal> \
	for domain "****.cz" \
	action "relay"

match auth action "relay"

* I wanted to build it against the last stable LibreSSL, but it doesn’t build due to some missing symbols…

/cc @poolpOrg

@poolpOrg
Copy link
Member

Can you reproduce without your filter ?

@poolpOrg poolpOrg self-assigned this Aug 21, 2019
@jirutka
Copy link
Contributor Author

jirutka commented Aug 21, 2019

I don’t know how to reproduce it, it didn’t crash when I was trying to reproduce it. And I cannot let it run in production without that filter.

@jirutka
Copy link
Contributor Author

jirutka commented Aug 21, 2019

At morning it crashed with a different error message:

Aug 21 08:59:32 no-reply mail.crit smtpd[2536]: smtpd: process ca socket closed

@opensmtpd-github
Copy link
Member

opensmtpd-github commented Aug 21, 2019 via email

@jirutka
Copy link
Contributor Author

jirutka commented Aug 21, 2019

I can provide you SSH access to some VM with Alpine installed if it would help you.

@poolpOrg
Copy link
Member

That would be helpful yes if it can send mail to the outside, i'll setup something between that vm and my machines to try reproducing.

@jirutka
Copy link
Contributor Author

jirutka commented Aug 24, 2019

gilles@185.8.164.31

@jirutka
Copy link
Contributor Author

jirutka commented Aug 29, 2019

Just a note, I’ve created the VM in the playground environment, so it will be available only for 1 month.

@jirutka
Copy link
Contributor Author

jirutka commented Aug 29, 2019

I’ve encountered a different crash, now from SSL library:

Aug 28 21:29:50 no-reply mail.info smtpd[4129]: 3d81d7318b47cbc7 smtp connected address=10.***** host=*****
Aug 28 21:29:50 no-reply mail.debug smtpd[4128]: debug: looking up pki "no-reply.in.*****.cz"
Aug 28 21:29:50 no-reply mail.debug smtpd[4129]: debug: session_start_ssl: switching to SSL
Aug 28 21:29:50 no-reply mail.debug smtpd[4129]: debug: pony: rsae_priv_enc
Aug 28 21:29:50 no-reply mail.debug smtpd[4129]: debug: SSL library error: io_dispatch_accept_tls:SSL_accept: error:140943F2:SSL routines:ssl3_read_bytes:sslv3 alert unexpected message
Aug 28 21:29:50 no-reply mail.info smtpd[4129]: 3d81d7318b47cbc7 smtp disconnected reason="io-error: error:140943F2:SSL routines:ssl3_read_bytes:sslv3 alert unexpected message"

This version of OpenSMTPD is very unreliable for me. :(

@jirutka
Copy link
Contributor Author

jirutka commented Aug 31, 2019

@poolpOrg, I’ve updated to 7275812 (the current latest portable), but no change at all, it’s still randomly crashing.

@poolpOrg
Copy link
Member

poolpOrg commented Sep 1, 2019

Yes, I haven't had time to check yet but this is high on my list

@jirutka
Copy link
Contributor Author

jirutka commented Sep 16, 2019

Any change…?

@poolpOrg
Copy link
Member

not yet but i've just entered my "opensource" week so I'm on it

@poolpOrg
Copy link
Member

I'll address the two issues separately:

Regarding the initial crash, I've tried but I'm unable to reproduce on an alpine setup similar to yours both in terms of using OpenSSL and in terms of setup using a relay host. I'll keep it running a while and taking real trafic but not knowing if your filter caused it is an issue for troubleshooting as we've committed four days ago a diff to fix a race condition that could crash in filter layer. Do you still experience the crash ?

Regarding the SSL library stuff, you mentioned a different crash but your sample logs doesn't show a crash, what it shows is a TLS negotiation error. Did OpenSMTPD crash after that negotiation ?

@jirutka
Copy link
Contributor Author

jirutka commented Sep 16, 2019

…not knowing if your filter caused it is an issue for troubleshooting…

If the filter caused a crash, I would expect OpenSMTPD to log it (“lost processor: dynproc:00000001 exited abnormally” or “misbehaving filter”), but I don’t see any message regarding the filter.

we've committed four days ago a diff to fix a race condition that could crash in filter layer.

I’ll update my instance to the latest portable and let you know if something changed.

what it shows is a TLS negotiation error. Did OpenSMTPD crash after that negotiation?

IIRC it crashed right after this message, but I’m not 100% sure now. I’ve already fixed the problem on the client side trying to use SSLv3.

@poolpOrg
Copy link
Member

If the filter caused a crash, I would expect OpenSMTPD to log it (“lost processor: dynproc:00000001 exited abnormally” or “misbehaving filter”), but I don’t see any message regarding the filter.

Not necessarily, when I said "if your filter caused a crash" I didn't mean it necessarily as your filter crashing, it could also be that your filter does something which causes OpenSMTPD itself to crash.

This is what happened in the race condition we fixed this week, where if a session was disconnected for some reason while a filter was working, upon response of the filter, the session would no longer exist and OpenSMTPD would fatal() assuming a corruption.

This was a crash in the daemon, so not logged as a lost processor, but which would only ever happen with a filter processing a specific phase.

I’ll update my instance to the latest portable and let you know if something changed.

good, I have committed your diff to includes.h so you don't need it anymore

what it shows is a TLS negotiation error. Did OpenSMTPD crash after that negotiation?

IIRC it crashed right after this message, but I’m not 100% sure now. I’ve already fixed the problem on the client side trying to use SSLv3.

OK, let me know how it goes, as far as I'm concerned, I have produced multiple TLS errors ranging from no TLS on the TLS listener, random data on the TLS listener, wrong ciphers on the TLS listener, and all I get is a disconnect with no crash.

@jirutka
Copy link
Contributor Author

jirutka commented Sep 17, 2019

Today it crashed four times and every time with the following message (with different token and sid ofc):

lookup: Invalid filter message type: filter-dataline|b6d5dbf02113cec4|a8cfd22834b47a7f|Received:

I didn’t see this message before, so it’s new after the update.

I don’t understand what’s wrong with this particular message. I’m not modifying it in any way, just printing what I read on stdin.

Relevant lines from the filter:

"filter" == $1 {
        if (NF < 7) {                                               
                die("invalid filter command: expected >6 fields!")
        }
        sid = $6
        token = $7
        line = substr($0, length($1$2$3$4$5$6$7) + 8)

        # continue with next rule...
} 
"filter|smtp-in|data-line" == $1_$4_$5 {
        ...                       
        print("filter-dataline", token, sid, line)
}

@poolpOrg
Copy link
Member

I think you're still running the code from 7275812 and not the latest code from the portable branch.

The error message you've shown can only trigger for that line if you're running the code from 7275812 as it contains a typo which I fixed in the very next commit:

https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/smtpd/lka_proc.c.diff?r1=1.9&r2=1.10&f=h

Can you update and confirm ?

@jirutka
Copy link
Contributor Author

jirutka commented Sep 17, 2019

Ah, sorry, you’re right! I really did update OpenSMTPD, but to the package built from 7275812, not the latest one – I forgot to install the update after updating the package some time ago and didn’t notice the version number now.

Anyway, now I’m really running on the latest portable. I will let you know tomorrow if it crashed.

@poolpOrg
Copy link
Member

cool

@poolpOrg
Copy link
Member

Hi,

Any news ?

@jirutka
Copy link
Contributor Author

jirutka commented Sep 19, 2019

Sep 18 11:35:08 no-reply mail.info smtpd[5449]: 9aeb49dd6a265771 smtp connected address=10.xxxx host=xxxxx
Sep 18 11:35:08 no-reply mail.debug smtpd[5448]: debug: looking up pki "no-reply.in.xxxx.cz"
Sep 18 11:35:08 no-reply mail.debug smtpd[5449]: debug: session_start_ssl: switching to SSL
Sep 18 11:35:08 no-reply mail.debug smtpd[5449]: debug: pony: rsae_priv_enc
Sep 18 11:35:08 no-reply mail.debug smtpd[5449]: debug: SSL library error: io_dispatch_accept_tls:SSL_accept: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown
Sep 18 11:35:08 no-reply mail.info smtpd[5449]: 9aeb49dd6a265771 smtp disconnected reason="io-error: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown"
Sep 18 11:35:09 no-reply mail.debug smtpd[5450]: debug: queue -> pony express: pipe closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5450]: debug: queue agent exiting
Sep 18 11:35:09 no-reply mail.debug smtpd[5451]: debug: scheduler -> queue: pipe closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5447]: debug: control -> queue: pipe closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5447]: debug: control agent exiting
Sep 18 11:35:09 no-reply mail.debug smtpd[5451]: debug: scheduler agent exiting
Sep 18 11:35:09 no-reply mail.debug smtpd[5446]: debug: ca -> control: pipe closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5446]: debug: ca agent exiting
Sep 18 11:35:09 no-reply mail.debug smtpd[5441]: debug: parent -> queue: pipe closed
Sep 18 11:35:09 no-reply mail.crit smtpd[5441]: smtpd: process queue socket closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5448]: debug: lka -> queue: pipe closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5448]: debug: lookup agent exiting
Sep 18 11:35:09 no-reply daemon.warn supervise-daemon[5439]: /usr/sbin/smtpd, pid 5441, exited with return code 1

This one came out of nowhere:

Sep 18 15:37:04 no-reply mail.debug smtpd[5837]: debug: queue -> pony express: pipe closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5837]: debug: queue agent exiting
Sep 18 15:37:04 no-reply mail.debug smtpd[5834]: debug: control -> pony express: pipe closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5834]: debug: control agent exiting
Sep 18 15:37:04 no-reply mail.debug smtpd[5832]: debug: parent -> queue: pipe closed
Sep 18 15:37:04 no-reply mail.crit smtpd[5832]: smtpd: process queue socket closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5838]: debug: scheduler -> queue: pipe closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5838]: debug: scheduler agent exiting
Sep 18 15:37:04 no-reply mail.debug smtpd[5835]: debug: lka -> pony express: pipe closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5835]: debug: lookup agent exiting
Sep 18 15:37:04 no-reply mail.debug smtpd[5833]: debug: ca -> pony express: pipe closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5833]: debug: ca agent exiting
Sep 18 15:37:04 no-reply daemon.warn supervise-daemon[5439]: /usr/sbin/smtpd, pid 5832, exited with return code 1
Sep 19 14:39:23 no-reply mail.info smtpd[5913]: 17be4dd4d0fd84bb smtp connected address=10.38.3.182 host=xxxx.cz
Sep 19 14:39:23 no-reply mail.debug smtpd[5912]: debug: looking up pki "no-reply.in.fit.xxxxx.cz"
Sep 19 14:39:23 no-reply mail.debug smtpd[5913]: debug: session_start_ssl: switching to SSL
Sep 19 14:39:23 no-reply mail.debug smtpd[5914]: debug: queue -> pony express: pipe closed
Sep 19 14:39:23 no-reply mail.debug smtpd[5914]: debug: queue agent exiting
Sep 19 14:39:23 no-reply mail.debug smtpd[5912]: debug: lka -> queue: pipe closed
Sep 19 14:39:23 no-reply mail.debug smtpd[5910]: debug: ca -> pony express: pipe closed
Sep 19 14:39:23 no-reply mail.debug smtpd[5910]: debug: ca agent exiting
Sep 19 14:39:23 no-reply mail.debug smtpd[5911]: debug: control -> queue: pipe closed
Sep 19 14:39:23 no-reply mail.debug smtpd[5915]: debug: scheduler -> queue: pipe closed
Sep 19 14:39:23 no-reply mail.debug smtpd[5911]: debug: control agent exiting
Sep 19 14:39:23 no-reply mail.debug smtpd[5912]: debug: lookup agent exiting
Sep 19 14:39:23 no-reply mail.debug smtpd[5915]: debug: scheduler agent exiting
Sep 19 14:39:23 no-reply mail.debug smtpd[5909]: debug: parent -> queue: pipe closed
Sep 19 14:39:23 no-reply mail.crit smtpd[5909]: smtpd: process queue socket closed
Sep 19 14:39:23 no-reply daemon.warn supervise-daemon[5439]: /usr/sbin/smtpd, pid 5909, exited with return code 1

@jirutka
Copy link
Contributor Author

jirutka commented Sep 30, 2019

@poolpOrg, any news?

@poolpOrg
Copy link
Member

nope, got context switched but this is still being investigated, whole october is on stabilization

@poolpOrg
Copy link
Member

poolpOrg commented Oct 3, 2019

I have two possible candidates for this bug which would explain why I don't observe them.

Investigation still in progress.

@poolpOrg
Copy link
Member

poolpOrg commented Oct 3, 2019

OK, I'm out of ideas.

I've setup a brand new alpine machine, built OpenSMTPD with OpenSSL 1.1.1c, I tested incoming and outgoing TLS, I tested DNS-resolved MX and relay host MX over TLS, received and sent thousands of mails and did not observe a single crash.

I need to be able to replicate your setup otherwise we're out of luck fixing this, I've seen your config but this is not enough for me to reproduce:

  • what version of Alpine are you using ?
  • does the crash happen when contacting specific hosts ?

@jdelic
Copy link

jdelic commented Oct 3, 2019

@jirutka

any chance of running this with debug symbols compiled in and core dumps enabled?

@jirutka
Copy link
Contributor Author

jirutka commented Oct 3, 2019

I’m using Alpine v3.10, there’s currently OpenSSL 1.1.1d. You can install the exact OpenSMTPD binary that I’m running from my packages:

cd /etc/apk/keys
wget https://raw.githubusercontent.com/jirutka/user-aports/v3.10/.keys/jakub@jirutka.cz-56d0d9fd.rsa.pub

echo "@jirutka https://repo.jirutka.cz/alpine/v3.10/user" > /etc/apk/repositories

apk add opensmtpd@jirutka opensmtpd-dbg@jirutka

It’s built from this APKBUILD.

does the crash happen when contacting specific hosts ?

It works only as a relay for clients on the local network and relays all the mails via single relay server. I don’t know if it happens only with some clients, I may check it, but there are only about 2-3 active clients anyway…

any chance of running this with debug symbols compiled in and core dumps enabled?

I thought that core dumps are generated only when the process crash, e.g. with segfault, isn’t it? This doesn’t seem to be the case, according to the logs OpenSMTPD just decides to quit…

(I’ve installed debug symbols and enabled core dumps now.)

@jdelic
Copy link

jdelic commented Oct 8, 2019

@jirutka

I thought that core dumps are generated only when the process crash, e.g. with segfault, isn’t it? This doesn’t seem to be the case, according to the logs OpenSMTPD just decides to quit…

Sorry, I was under the impression that the filter crashes and that takes OpenSMTPD with it. You're right of course.

@poolpOrg
Copy link
Member

poolpOrg commented Nov 4, 2019

There's a possibility the crash was related to the fixes I applied, I'll wait for @jirutka to let us know if he still experiences a crash before rolling the new release.

@jirutka
Copy link
Contributor Author

jirutka commented Nov 4, 2019

There is a good chance that the problem you were experiencing is fixed.

No, I’ve just upgraded and it’s still killing itself. >_<

Btw. are you maintaining Alpine Linux opensmtpd package?

I’m not its official maintainer, but I’m Alpine dev and I’m contributing to the opensmtpd package.

@poolpOrg
Copy link
Member

poolpOrg commented Nov 4, 2019

Are you around and can you join IRC ?

I have an hour to spare, if you are able to reproduce the crash and generate a core dump, then I can work from that to debug and produce a fix asap.

@jirutka
Copy link
Contributor Author

jirutka commented Nov 4, 2019

Are you around and can you join IRC ?

Yes.

if you are able to reproduce the crash and generate a core dump

There’s no crash! That’s the whole point all the time, OpenSMTPD is not crashing, it’s quitting itself without logging the reason.

@ngortheone
Copy link
Contributor

ngortheone commented Nov 4, 2019 via email

@poolpOrg
Copy link
Member

poolpOrg commented Nov 4, 2019

nah this is more subtle, there is a child process crashing, the exit 01 message is because the parent process exits with status 1 if it detects a child has disappeared.

I'm on #OpenSMTPD @ freenode, hit me up and we'll figure this out fast.

@jirutka
Copy link
Contributor Author

jirutka commented Nov 4, 2019

Check dmesg. You might see segfault there

Uh, you’re right! There really are segfaults. So I have somehow broken logging on this system and haven’t noticed it until now (always looked only into /var/log/dmesg). But then I don’t understand why core dumps are not generated; I’ve already verified that core dumps work on this system. o.O

[796819.036664] smtpd[16067]: segfault at 55d26ea3a0ba ip 00007f41c4120527 sp 00007ffe09466600 error 6 in ld-musl-x86_64.so.1[7f41c40ff000+46000]
[796819.036675] Code: 47 28 48 85 c0 74 08 48 ff c5 48 89 c7 eb ef 48 0f bf 47 50 48 89 fb 48 6b c0 58 48 29 c3 4c 8d 63 4c 4c 89 e7 e8 15 94 01 00 <66> 29 6b 52 75 0c 48 89 df 5b 5d 41 5c e9 17 eb fd ff 5b 4c 89 e7
[796819.895016] smtpd[16077]: segfault at 55667af30dba ip 00007f401f774527 sp 00007fffb0579e20 error 6 in ld-musl-x86_64.so.1[7f401f753000+46000]
[796819.895030] Code: 47 28 48 85 c0 74 08 48 ff c5 48 89 c7 eb ef 48 0f bf 47 50 48 89 fb 48 6b c0 58 48 29 c3 4c 8d 63 4c 4c 89 e7 e8 15 94 01 00 <66> 29 6b 52 75 0c 48 89 df 5b 5d 41 5c e9 17 eb fd ff 5b 4c 89 e7
[797052.980022] smtpd[16087]: segfault at 555c3f46231a ip 00007f6688ccf527 sp 00007ffd60eaa1b0 error 6 in ld-musl-x86_64.so.1[7f6688cae000+46000]
[797052.980036] Code: 47 28 48 85 c0 74 08 48 ff c5 48 89 c7 eb ef 48 0f bf 47 50 48 89 fb 48 6b c0 58 48 29 c3 4c 8d 63 4c 4c 89 e7 e8 15 94 01 00 <66> 29 6b 52 75 0c 48 89 df 5b 5d 41 5c e9 17 eb fd ff 5b 4c 89 e7
[798451.917013] smtpd[16097]: segfault at ffff0000 ip 00000000ffff0000 sp 00007fffe0b02008 error 14 in smtpd[5628e277e000+9000]
[798451.917061] Code: Bad RIP value.

@poolpOrg
Copy link
Member

poolpOrg commented Nov 4, 2019

you may have them disabled by default, can you try to ulimit -c unlimited to increase the core size ?

@jirutka
Copy link
Contributor Author

jirutka commented Nov 4, 2019

you may have them disabled by default, can you try to ulimit -c unlimited to increase the core size ?

I’ve already set this for opensmtpd. I suspect OpenRC’s supervise-daemon that it maybe doesn’t propagate it correctly. :/

@poolpOrg
Copy link
Member

poolpOrg commented Nov 4, 2019

With help from @jirutka the issue seems to be properly understood:

A recent change in musl's freeaddrinfo() causes it to no longer be NULL-safe and segv when passed a NULL pointer. This explains why the crashes happen at various places, it also explains why I couldn't reproduce on an alpine myself (with a musl prior to the change) and on other systems (NULL-safe in glibc, BSD, ...).

Right now, @jirutka is running with a diff to confirm the issue but given that I read the implementations for freeaddrinfo() on different libc and that we have been able to reproduce the crash with surrounding logs, I'm 100% confident this is the issue.

For the next release, I think the only solution for Alpine is to ship a patch in their package to NULL-protect all freeaddrinfo() calls until we sort out if this is going to stay or be reverted, or at the very least until the next major release if we decide to NULL-protect upstream.

@ngortheone
Copy link
Contributor

Awesome diagnosis @poolpOrg

Currently OpenSMTPD package in alpine is unmaintained (Jonathan Curran told me that he is no longer maintaining the package)

It was on my to-do list to update Alpine's package, so I can add the patch too.

@jirutka
Copy link
Contributor Author

jirutka commented Nov 4, 2019

Thanks a lot to @poolpOrg for diagnosis!

Currently OpenSMTPD package in alpine is unmaintained (Jonathan Curran told me that he is no longer maintaining the package)

Okay, I’m gonna take over maintainership.

@ngortheone
Copy link
Contributor

Awesome @jirutka, thanks a lot

algitbot pushed a commit to alpinelinux/aports that referenced this issue Nov 4, 2019
Ihor Antonov in OpenSMTPD/OpenSMTPD#929 (comment):
> Currently OpenSMTPD package in alpine is unmaintained (Jonathan Curran
> told me that he is no longer maintaining the package)
@jirutka
Copy link
Contributor Author

jirutka commented Nov 4, 2019

@poolpOrg, I have bad news – it crashed even with non-NULL argument to freeaddrinfo:

Nov  4 19:53:21 no-reply mail.debug smtpd[17411]: debug: mta: 0x559f70b0f2a0: handling next task for relay [relay:relay.in.*****.cz,port=25,smtp,mx,heloname=no-reply.in.*****.cz]
Nov  4 19:53:21 no-reply mail.info smtpd[17411]: 48bd7c7361438deb mta delivery evpid=aab8c324325d8504 from=<no-reply@*****.cz> to=<*****@*****.cz> rcpt=<-> source="10.38.3.167" relay="10.38.3.14 (relay.in.*****.cz)" delay=0s result="Ok" stat="250 2.0.0 xA4Ir
Nov  4 19:53:21 no-reply mail.debug smtpd[17411]: debug: mta: 0x559f70b0f2a0: no task for relay [relay:relay.in.*****.cz,port=25,smtp,mx,heloname=no-reply.in.*****.cz]
Nov  4 19:53:21 no-reply mail.debug smtpd[17411]: mta: debug: last connection: hanging on for 10s
Nov  4 19:53:21 no-reply mail.debug smtpd[17411]: debug: mta: flush for aab8c324325d8504 (-> *****@*****.cz)
Nov  4 19:53:22 no-reply mail.debug smtpd[17411]: debug: mta: ... timeout for [relay:relay.in.*****.cz,port=25,smtp,mx,heloname=no-reply.in.*****.cz]
Nov  4 19:53:22 no-reply mail.debug smtpd[17411]: debug: mta: draining [relay:relay.in.*****.cz,port=25,smtp,mx,heloname=no-reply.in.*****.cz] refcount=2, ntask=0, nconnector=1, nconn=1
Nov  4 19:53:22 no-reply mail.debug smtpd[17411]: debug: mta: all done for [relay:relay.in.*****.cz,port=25,smtp,mx,heloname=no-reply.in.*****.cz]
Nov  4 19:53:22 no-reply mail.debug smtpd[17411]: mta: timeout for session hangon
Nov  4 19:53:22 no-reply mail.debug smtpd[17411]: debug: mta: 0x559f70b0f2a0: no task for relay [relay:relay.in.*****.cz,port=25,smtp,mx,heloname=no-reply.in.*****.cz]
Nov  4 19:53:22 no-reply mail.debug smtpd[17411]: mta: debug: last connection: hanging on for 9s
Nov  4 19:53:22 no-reply mail.debug smtpd[17411]: #### before: 0x559f70b0dc80
Nov  4 19:53:22 no-reply mail.debug smtpd[17412]: debug: queue -> pony express: pipe closed
Nov  4 19:53:22 no-reply mail.debug smtpd[17412]: debug: queue agent exiting
Nov  4 19:53:22 no-reply mail.debug smtpd[17413]: debug: scheduler -> queue: pipe closed
Nov  4 19:53:22 no-reply mail.debug smtpd[17413]: debug: scheduler agent exiting
Nov  4 19:53:22 no-reply mail.debug smtpd[17409]: debug: control -> queue: pipe closed
Nov  4 19:53:22 no-reply mail.debug smtpd[17409]: debug: control agent exiting
Nov  4 19:53:22 no-reply mail.debug smtpd[17408]: debug: ca -> control: pipe closed
Nov  4 19:53:22 no-reply mail.debug smtpd[17407]: debug: parent -> queue: pipe closed
Nov  4 19:53:22 no-reply mail.debug smtpd[17408]: debug: ca agent exiting
Nov  4 19:53:22 no-reply mail.crit smtpd[17407]: smtpd: process queue socket closed
Nov  4 19:53:22 no-reply mail.debug smtpd[17410]: debug: lka -> queue: pipe closed
Nov  4 19:53:22 no-reply mail.debug smtpd[17410]: debug: lookup agent exiting
Core was generated by `smtpd: pony express          '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f3624097527 in ?? ()
(gdb) where
#0  0x00007f3624097527 in ?? ()
#1  0x0000559f70b05260 in ?? ()
#2  0x0000559f70b0dc80 in ?? ()
#3  0x0000559f70b0dc80 in ?? ()
#4  0x0000559f6f8f6c09 in smtp_getaddrinfo_cb (arg=0x559f70cdc728, gaierrno=<optimized out>, ai0=0x1)
    at ../../smtpd/smtp_session.c:689
#5  0x0000000000000000 in ?? ()

(I still cannot get rid of ??; I’ve already installed -dbg for musl, openssl, libasr and tried to manually load them to gdb, but it’s still the same.)

OpenSMTPD built from 1c1bdb6 with this patch and libasr 1.0.3 with this patch.

@jirutka
Copy link
Contributor Author

jirutka commented Nov 4, 2019

I still cannot get rid of ??; I’ve already installed -dbg for musl, openssl, libasr and tried to manually load them to gdb, but it’s still the same.

Aha! Passing /usr/lib/debug//usr/sbin/smtpd.debug instead of /usr/sbin/smtpd to gdb was the problem.

Core was generated by `smtpd: pony express          '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  freeaddrinfo (p=<optimized out>, p@entry=0x559f70b0dc80) at src/network/freeaddrinfo.c:14
14	src/network/freeaddrinfo.c: No such file or directory.
(gdb) where
#0  freeaddrinfo (p=<optimized out>, p@entry=0x559f70b0dc80) at src/network/freeaddrinfo.c:14
#1  0x0000559f6f8f6c09 in smtp_getaddrinfo_cb (arg=0x559f70b05260, gaierrno=<optimized out>, 
    ai0=0x559f70b0dc80) at ../../smtpd/smtp_session.c:689
#2  0x0000559f6f8f1cc6 in resolver_dispatch_result (proc=<optimized out>, imsg=0x7ffd728b6710)
    at ../../smtpd/resolver.c:346
#3  0x0000559f6f8ed5e0 in pony_imsg (p=<optimized out>, imsg=<optimized out>) at ../../smtpd/pony.c:66
#4  0x0000559f6f8fb9a0 in imsg_dispatch (p=0x559f70b40e80, imsg=0x7ffd728b6710)
    at ../../smtpd/smtpd.c:1851
#5  0x0000559f6f8dfa84 in mproc_dispatch (fd=<optimized out>, event=<optimized out>, arg=0x559f70b40e80)
    at ../../smtpd/mproc.c:208
#6  0x00007f3623ba606c in ?? () from /usr/lib/libevent-2.1.so.6
#7  0x00007f3623ba6767 in event_base_loop () from /usr/lib/libevent-2.1.so.6
#8  0x0000559f6f8ed7eb in pony () at ../../smtpd/pony.c:208
#9  0x0000559f6f8cd16f in main (argc=<optimized out>, argv=<optimized out>) at ../../smtpd/smtpd.c:828

src/network/freeaddrinfo.c:

...
void freeaddrinfo(struct addrinfo *p)
{
        size_t cnt;
        for (cnt=1; p->ai_next; cnt++, p=p->ai_next);
        struct aibuf *b = (void *)((char *)p - offsetof(struct aibuf, ai));
        b -= b->slot;
        LOCK(b->lock);
        if (!(b->ref -= cnt)) free(b);  // <--- HERE
        else UNLOCK(b->lock);
}

@poolpOrg
Copy link
Member

poolpOrg commented Nov 4, 2019

Not such a bad news, we have found the culprit function for sure, what have not found yet are all the possible ways for it to crash :-)

Your stack shows that smtp_getaddrinfo_cb receives a struct addrinfo *ai0 at address 0x559f70b0dc80 and freeaddrinfo() gets called with the same address so we're good on that front, this is not a pointer corruption issue.

What I think is happening is that the struct addrinfo * we allocate and craft is different from what musl expects and so when it does:

struct aibuf *b = [...];
b -= b->slot;
[...]
free(b);

it probably ends up calling free() on an invalid pointer.

I'll check the difference in struct addrinfo *and struct aibuf *, but worst case scenario will be that we provide a custom freeaddrinfo() function if we detect that we're using musl's

@poolpOrg
Copy link
Member

poolpOrg commented Nov 4, 2019

One possible way to test would be to make sure smtp_session.c has the two following includes:

#include <stdlib.h>
#include <netdb.h>

Then throw the following at the end of the file:

void
openbsd_freeaddrinfo(struct addrinfo *ai)
{
	struct addrinfo *p;

	do {
		p = ai;
		ai = ai->ai_next;
		free(p->ai_canonname);
		free(p);
	} while (ai);
}

And replacing the crashing freeaddrinfo() call with openbsd_freeaddrinfo() and check if it works better, but given that I read the code it's quite obvious it will ;-)

@poolpOrg
Copy link
Member

poolpOrg commented Nov 4, 2019

@jirutka

I have investigated the issue and have a full understanding of the problem.

The bottom line is that musl's getaddrinfo() and freeaddrinfo() do not behave at all like other libc. It is not possible to craft a struct addrinfo * that's portable between other libc and musl because the latter doesn't really use it as a linked list but relies on an underlying array and pointer arithmetics.

This means that you may ONLY call freeaddrinfo() on a struct addrinfo * you obtained from a previous getaddrinfo() call and nothing else. The opposite is also true, we can't free the struct addrinfo * by iterating over the linked list because then we'll be missing the underlying array used by musl and leak memory.

So the only viable fix is that we consider struct addrinfo uses as either coming from getaddrinfo() and requiring a call to freeaddrinfo(), or coming from self-crafting and requiring self-releasing.

My assessment is that there are only two places where we need self-releasing, one is resolver.c and the other the smtp_session.c bit you crashed on.

The portable branch has a potential fix, can you update and report how it goes ?

@jirutka
Copy link
Contributor Author

jirutka commented Nov 5, 2019

The portable branch has a potential fix, can you update and report how it goes ?

It looks very promising, ~40 relayed mails and no crash so far!

@poolpOrg
Copy link
Member

poolpOrg commented Nov 5, 2019

Still stable after all these hours ?

If so, I'll tag the release tonight :-)

@jirutka
Copy link
Contributor Author

jirutka commented Nov 5, 2019

144 mails relayed and still stable! 👍

@poolpOrg
Copy link
Member

poolpOrg commented Nov 5, 2019

I'll let you close the issue, will tag 6.6.1p1 shortly, thanks for helping

@jirutka
Copy link
Contributor Author

jirutka commented Nov 5, 2019

Thanks!

@jirutka jirutka closed this as completed Nov 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants