
dnstap logging significantly affects unbound performance (regression in 1.11) #305

Closed
abulimov opened this issue Sep 9, 2020 · 7 comments

Comments


abulimov commented Sep 9, 2020

Unbound 1.11 is the first release that switched away from libfstrm (#164, #264), and we observe a number of regressions with dnstap logging over a unix socket (the first one reported as #304).

This is quite a big one.

In our production environment we've noticed a direct correlation between how quickly dnstap logs are consumed and unbound's performance with 1.11.

Strangely enough, the quicker the dnstap stream was consumed by the dnstap service, the more CPU unbound was using.

We first noticed it on a few machines with especially powerful hardware, where it went absolutely crazy: when we sped up the dnstap service ~1.5x by simply skipping half of the samples, the CPU used by unbound went up 10x, from 200% to 2000%.

While investigating the issue with synthetic load on different hardware, I can observe that simply having the dnstap socket consumed by a dnstap process significantly increases the CPU used by unbound.

For example:

unbound 1.11, dnstap logging over unix socket, dnstap service running:
unbound uses ~120% of CPU under synthetic load

When we stop the dnstap service, unbound immediately uses less CPU (~80%).

Same server, same load, with unbound 1.9.6: unbound's CPU usage is stable at around ~80%, with or without the dnstap service running and consuming logs (as one would expect).

I understand that dnstap uses a bidirectional protocol, and when there is no consumer running, unbound doesn't send any dnstap samples.
But before 1.11, sending samples had no significant impact on unbound itself, and we've been using dnstap logging with unbound for a few years now.


edmonds commented Sep 9, 2020

If you're on Linux, it would be interesting if you could use the perf tools (perf record, perf report, etc.) to measure the performance of the system as a whole when performing your synthetic tests. That might be helpful in identifying hotspots.
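For reference, a minimal sketch of how such a profile could be captured on Linux (the process name, sampling duration, and the FlameGraph scripts are assumptions, not something specified in this thread):

```shell
# Record 60 seconds of call-graph samples from the running unbound
# process (requires perf and usually root; values are placeholders).
perf record -g -p "$(pidof unbound)" -- sleep 60

# Inspect hotspots interactively.
perf report

# Or render a flame graph with Brendan Gregg's FlameGraph scripts
# (assumed to be available on PATH).
perf script | stackcollapse-perf.pl | flamegraph.pl > unbound.svg
```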


abulimov commented Sep 10, 2020

I cannot provide system-wide perf records, but here are perf CPU flame graphs for unbound process for all combinations (1.11.0, 1.9.6, with and without dnstap service consuming the logs). It's pretty obvious that 1.11 spends much more CPU sending dnstap logs.

(I had to put the SVGs in a ZIP, as GitHub doesn't allow uploading SVGs for some reason.)
unbound_perf_flamegraphs.zip


edmonds commented Sep 10, 2020

Hi, Alexander:

I'm the original author of fstrm (which was used in Unbound for dnstap support until 1.11.0). I wasn't involved in the Frame Streams re-implementation work in Unbound, but looking at your perf-1.11-tap flame graph and zooming in on dt_msg_queue_submit():


It looks like it's doing a lot of write()s. Looking at the dt_msg_queue_submit() function:

unbound/dnstap/dtstream.c

Lines 167 to 226 in 753487f

void
dt_msg_queue_submit(struct dt_msg_queue* mq, void* buf, size_t len)
{
	int wakeup = 0;
	struct dt_msg_entry* entry;
	/* check conditions */
	if(!buf) return;
	if(len == 0) {
		/* it is not possible to log entries with zero length,
		 * because the framestream protocol does not carry it.
		 * However the protobuf serialization does not create zero
		 * length datagrams for dnstap, so this should not happen. */
		free(buf);
		return;
	}
	if(!mq) {
		free(buf);
		return;
	}
	/* allocate memory for queue entry */
	entry = malloc(sizeof(*entry));
	if(!entry) {
		log_err("out of memory logging dnstap");
		free(buf);
		return;
	}
	entry->next = NULL;
	entry->buf = buf;
	entry->len = len;
	/* acquire lock */
	lock_basic_lock(&mq->lock);
	/* list was empty, wakeup dtio */
	if(mq->first == NULL)
		wakeup = 1;
	/* see if it is going to fit */
	if(mq->cursize + len > mq->maxsize) {
		/* buffer full, or congested. */
		/* drop */
		lock_basic_unlock(&mq->lock);
		free(buf);
		free(entry);
		return;
	}
	mq->cursize += len;
	/* append to list */
	if(mq->last) {
		mq->last->next = entry;
	} else {
		mq->first = entry;
	}
	mq->last = entry;
	/* release lock */
	lock_basic_unlock(&mq->lock);
	if(wakeup)
		dtio_wakeup(mq->dtio);
}

My guess is the problem is here:

unbound/dnstap/dtstream.c

Lines 201 to 203 in 753487f

	/* list was empty, wakeup dtio */
	if(mq->first == NULL)
		wakeup = 1;

If I'm understanding the code correctly, it looks like individual dnstap log payloads are being queued, and when the queue size changes from 0 → 1, a wakeup message is sent, which causes a write(). This is problematic, because if the log payloads are being drained faster than they're produced (which is likely the case), this will cause a lot of wakeups.

I ran into this problem when developing fstrm, which is why that library allows a number of messages to be queued (default 32, the queue_notify_threshold parameter) before sending a wakeup signal to the I/O thread. Because a >1 low watermark value can cause log payloads to sit in the queue indefinitely, this also has to be paired with a periodic, time-based unconditional flush (default 1 second, the flush_timeout parameter).

If my understanding is correct, my suggestion to the Unbound developers would be to implement a similar "low watermark" threshold for sending a wakeup to drain the queue, combined with a timeout flush. That should significantly reduce the number of syscalls caused by dnstap logging when the server is under load.


wcawijngaards commented Sep 11, 2020

Your analysis is excellent; thank you, and also for the CPU flame graphs (that wakeup routine sure looks expensive). I want to wait for the event handling fix from the previous issue (#304) before tackling this.


abulimov commented Sep 11, 2020

Thank you Robert, excellent analysis!


wcawijngaards commented Sep 23, 2020

The commit should fix the issue. It is hard to reproduce the performance numbers, but it implements the threshold wakeup solution. That means the threads wake up when there are 32 messages, or the queue is almost full (90% of the 1 Mb buffer), or after 1 second has passed.


wcawijngaards commented Sep 23, 2020

And a fix for that commit so that it only wakes up once, when it reaches the threshold, instead of doing so repeatedly.
