
Rec: Purge map of failed auths periodically by keeping a last changed timestamp. #8525

Merged
merged 6 commits into PowerDNS:master from the rec-prune-failed-servers branch on Nov 19, 2019

Conversation

@omoerbeek (Member) commented Nov 12, 2019

Short description

SyncRes thread-local storage includes a map of failed auths which was
only cleaned if a specific IP was contacted again and that contact
succeeded. Persistently failing auths, or auths that were never tried
again, remained in the map.

While here, add code to dump the failed servers map. Might (partially?)
solve #7771.
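
A minimal sketch of the approach described above (illustrative only; the names FailedEntry, failedauths and pruneFailedAuths are placeholders, not the actual recursor code, which keys on ComboAddress and lives in SyncRes thread-local storage): each entry keeps a last-changed timestamp, and a periodic housekeeping pass erases entries whose timestamp is older than a cutoff.

#include <ctime>
#include <map>
#include <string>

// Illustrative stand-in for a failed-auth record with a last-changed timestamp.
struct FailedEntry {
  unsigned long failures{0};
  time_t last{0}; // updated every time the failure counter changes
};

using failedauths_t = std::map<std::string, FailedEntry>;

// Periodic housekeeping: drop entries not touched since 'cutoff', so
// persistently failing or never-retried auths no longer stay forever.
void pruneFailedAuths(failedauths_t& failedauths, time_t cutoff)
{
  for (auto it = failedauths.begin(); it != failedauths.end(); ) {
    if (it->second.last < cutoff) {
      it = failedauths.erase(it);
    } else {
      ++it;
    }
  }
}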

Checklist

I have:

  • read the CONTRIBUTING.md document
  • compiled this code
  • tested this code
  • included documentation (including possible behaviour changes)
  • documented the code
  • added or modified regression test(s)
  • added or modified unit test(s)

@omoerbeek changed the title from "Purge map of failed auths periodically by keeping a last changed timestamp." to "Rec: Purge map of failed auths periodically by keeping a last changed timestamp." on Nov 12, 2019
@rgacogne (Member) left a comment

Looks good! Don't we have the same situation with the ednsstatus map, by the way? We do need to keep EDNS status entries quite a bit longer, though; I would say several hours, perhaps one day, since the last time we used them.

@omoerbeek (Member, Author) commented:

I'll take a look at the EDNS status map as well.

@omoerbeek (Member, Author) commented:

The mode is set to UNKNOWN if the entry is older than an hour in asyncResolveWrapper(), so the prune age can be an hour if I am not mistaken.

@rgacogne (Member) commented:

The mode is set to UNKNOWN if the entry is older than an hour in asyncResolveWrapper(), so the prune age can be an hour if I am not mistaken.

I expected a longer validity but OK.
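
A sketch of the rule being discussed (assumed shape and names, not the verbatim recursor code): an EDNS status entry whose mode was set more than an hour ago is treated as UNKNOWN again, which is why a one-hour prune cutoff loses no information.

#include <ctime>

// Illustrative only: a cut-down EDNS status record with just the fields relevant here.
struct EDNSStatusSketch {
  enum EDNSMode { UNKNOWN, EDNSOK, EDNSIGNORANT, NOEDNS } mode{UNKNOWN};
  time_t modeSetAt{0};
};

// If the recorded mode is older than an hour, fall back to UNKNOWN. An entry
// old enough to be pruned would have been reset anyway.
void maybeResetEDNSStatus(EDNSStatusSketch& status, time_t now)
{
  const time_t maxAge = 3600; // one hour, per the discussion above
  if (status.modeSetAt != 0 && now - status.modeSetAt > maxAge) {
    status.mode = EDNSStatusSketch::UNKNOWN;
    status.modeSetAt = 0;
  }
}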

pdns/pdns_recursor.cc (resolved review thread)
Switch away from a ref to a pointer because of the above, and use modern
init for EDNSStatus.
pdns/syncres.hh (outdated, resolved review thread)
  size_t size() const
  {
    return d_cont.size();
  }

  void prune(time_t cutoff) {
    for (auto it = d_cont.begin(); it != d_cont.end(); ) {
@rgacogne (Member) commented:

Any idea of how many entries we usually have in there? If it might be a lot perhaps we need to use a multi_index keyed on the last field to avoid having to scan the whole map.

{
  for (auto it = t_sstorage.ednsstatus.begin(); it != t_sstorage.ednsstatus.end(); ) {
    if (it->second.modeSetAt <= cutoff) {
      it = t_sstorage.ednsstatus.erase(it);
@rgacogne (Member) commented:

Same question about the number of entries?

A Contributor commented:

On one of our small recursors (sub 1kqps) that has only been up for a week we have 200,000 edns status entries - if that data point helps.

The same Contributor commented:

With cleanup, I suppose it will follow the same trend as the nsspeed size, which in that instance is an order of magnitude smaller.

@omoerbeek (Member, Author) commented Nov 13, 2019:

In my test runs (with the purging), continuously resolving alexa-top-1m in a loop with calidns, I see the ednsstatus count grow to about 600k. Failed servers hovers around 30k, and nsspeeds is of the same order of magnitude as (but a bit bigger than) ednsstatus.

@rgacogne (Member) commented Nov 13, 2019:

In my test runs (with the purging), continuously resolving alexa-top-1m in a loop with calidns, I see the ednsstatus count grow to about 600k. Failed servers hovers around 30k, and nsspeeds is of the same order of magnitude as (but a bit bigger than) ednsstatus.

Are these numbers from before, without this PR? Or does it stay that high even when regularly purged?
Although, we won't start purging the EDNS statuses until 2 hours after startup, so I'm guessing it will grow almost up to 600k quickly. So perhaps it would make sense to use a multi_index, with the existing index plus one on the timestamp, so that we only scan expired entries?

@omoerbeek (Member, Author) commented:

The numbers I reported are with this PR included, so with purging. I'll look into the multi-index thing soon. I currently do not want to stop this test run, so I can see how it is doing in the long run (multiple days). After that I can add some code to see how fast the purging is and compare it with a multi-index solution.

@rgacogne (Member) commented:

The numbers I reported are with this PR included, so with purging. I'll look into the multi-index thing soon.

OK! Let me know if you are not familiar with how multi-index works and want some pointers (no pun intended)!

@omoerbeek (Member, Author) commented:

I'm not familiar with multi-index, but I'd like to give it a shot myself as a start.

BTW, the numbers above are the accumulated ones for a rec with 16 threads. So to get per-thread estimates you need to divide by 16.

@omoerbeek (Member, Author) commented:

@rgacogne Can you take a look to see if I'm doing this right? If so we might need to generalize this, since the other tables are purged in a similar manner.

pdns/syncres.hh (outdated, resolved review thread)
pdns/syncres.hh (outdated, resolved review thread)
pdns/syncres.cc (outdated, resolved review thread)
@omoerbeek (Member, Author) commented:

Here's the graph of resident memory usage after more than 5 days of running 4.1.x with a rough backport of this PR.

(screenshot: resident memory usage graph, taken 2019-11-18 08:05)

@omoerbeek (Member, Author) commented Nov 18, 2019:

On Friday I became smart enough to have a periodic ps run each hour.
Two lines from it:

2019-11-15T14:00:01+00:00       nobody    1158  286 15.2 5415688 2507572 pts/3 Sl+  Nov12 12048:21 ./pdns_recursor --config-dir=.
...
2019-11-18T08:00:01+00:00       nobody    1158  286 16.6 5415688 2740668 pts/3 Sl+  Nov12 23414:51 ./pdns_recursor --config-dir=.

While the RSS creep is visible here (2507572 -> 2740668), it must be noted that the VSZ (5415688) does not grow.
A possible explanation: since there is no pressure on physical memory on this machine:

MiB Mem :  16044.8 total,   5760.7 free,   3137.4 used,   7146.7 buff/cache
MiB Swap:  16384.0 total,  16384.0 free,      0.0 used.  12431.9 avail Mem 

there is no incentive for the kernel to evict physical pages, even if they are unused. Add to that that glibc malloc hangs on to pages: it does not unmap them, as far as I know.

To summarize:

  • This PR fixes two real leaks
  • With it, I do not see growth in VSZ

Next steps:

  1. Run some tests to see if using multi-index is worth the trouble, by comparing a full scan against a cleanup using a time-based index.
  2. Based on 1, merge (a modified version of) this PR
  3. Prepare backports to 4.1 and 4.2

On the longer term:

  • Also consider using multi-index for the other time-based data structures in SyncRes (nsspeeds and throttle)
  • Try to find a memory usage statistic that does not put us and others on the wrong foot. That is hard to do in general.
  • It might be worth revisiting the mthread stack allocation: some form of caching (and thus better re-use) could reduce heap fragmentation.
  • Other ideas that could reduce memory use/fragmentation: shared cache, custom allocators for specific data types.

@omoerbeek (Member, Author) commented Nov 18, 2019:

Another observation: I applied memory pressure to the point where the machine started needing swap, by running a program allocating and touching lots of memory. I saw almost the whole rec process being paged out to swap. After I stopped the memory-eating process, about 1.1G of rec was paged in again after 10 min or so. The whole time VSZ remained constant. Now the process is at:

nobody    1158  286  7.0 5415688 1151420 pts/3 Sl+  Nov12 23729:33 ./pdns_recursor --config-dir=.

So the short-term working set of this process is about 1.1G.

@omoerbeek (Member, Author) commented Nov 18, 2019:

Measurement shows that a linear scan of a thread's ednsstatus map takes about 0.024s for 123515 entries. This was on a 2-thread test run, where (with the given test setup) it is expected to grow to about 300k entries per thread. For the 16-thread case the table size would be smaller. Testing that setup I'm seeing cleanup times around 4-23ms for 24k entries. Note that the variance is higher here, since my test machine is a 4-core one, so running 16 threads on it will introduce (much) more scheduling noise.

So I think that a multi_index solution is not really needed at the moment, and I wonder if it's worth the added complexity. I'll do some test runs with the multi_index version next.
...
For the multi_index-based version we're looking at 0.3ms for a cleanup of a table with 20k entries, so it is much faster.

I think a good plan is:

  1. merge the linear versions of the cleanup for failed servers and ednsstatus into master
  2. backport it
  3. later, rework all tables that should be purged on a time-based decision to use a multi_index map. Unification is needed to avoid re-implementing the same logic in multiple places. It might even be possible to use fewer tables that combine the data that is now stored in several separate tables.
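
For reference, a rough sketch (not from the PR) of how prune timings like the ones above could be taken; timedPrune is a hypothetical helper and only assumes a container wrapper exposing size() and prune(time_t), like the failed-servers one shown earlier in this thread.

#include <chrono>
#include <cstdio>
#include <ctime>

// Hypothetical helper: time a single prune() call and report how many
// entries were removed.
template <typename Container>
void timedPrune(Container& cont, time_t cutoff)
{
  const size_t before = cont.size();
  const auto start = std::chrono::steady_clock::now();
  cont.prune(cutoff);
  const auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
      std::chrono::steady_clock::now() - start);
  fprintf(stderr, "pruned %zu of %zu entries in %lld us\n",
          before - cont.size(), before, static_cast<long long>(elapsed.count()));
}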

@omoerbeek (Member, Author) commented:

I'll be squashing the multi_index code out; saving a copy here for future reference.

class fails_t : public boost::noncopyable
{
public:
  typedef unsigned long counter_t;
  struct value_t {
    value_t(const ComboAddress &a) : address(a) {}
    ComboAddress address;
    mutable counter_t value{0};
    time_t last{0};
  };

  typedef multi_index_container<value_t,
                                indexed_by<
                                  ordered_unique<tag<ComboAddress>, member<value_t, ComboAddress, &value_t::address>>,
                                  ordered_non_unique<tag<time_t>, member<value_t, time_t, &value_t::last>>
                                  >> cont_t;

  cont_t getMap() const {
    return d_cont;
  }
  counter_t value(const ComboAddress& t) const
  {
    auto i = d_cont.find(t);

    if (i == d_cont.end()) {
      return 0;
    }
    return i->value;
  }

  counter_t incr(const ComboAddress& t, const struct timeval & now)
  {
    auto i = d_cont.insert(t).first;

    if (i->value < std::numeric_limits<counter_t>::max()) {
      i->value++;
    }
    auto &ind = d_cont.get<ComboAddress>();
    ind.modify(i, [t = now.tv_sec](value_t &val) { val.last = t;});
    return i->value;
  }

  void clear(const ComboAddress& a)
  {
    d_cont.erase(a);
  }

  void clear()
  {
    d_cont.clear();
  }

  size_t size() const
  {
    return d_cont.size();
  }

  void prune(time_t cutoff) {
    auto &ind = d_cont.get<time_t>();
    ind.erase(ind.begin(), ind.upper_bound(cutoff));
  }

private:
  cont_t d_cont;
};

struct EDNSStatus {
  EDNSStatus(const ComboAddress &arg) : address(arg) {}
  ComboAddress address;
  time_t modeSetAt{0};
  mutable enum EDNSMode { UNKNOWN=0, EDNSOK=1, EDNSIGNORANT=2, NOEDNS=3 } mode{UNKNOWN};
};

struct ednsstatus_t : public multi_index_container<EDNSStatus,
                              indexed_by<
                                ordered_unique<tag<ComboAddress>, member<EDNSStatus, ComboAddress, &EDNSStatus::address>>,
                                ordered_non_unique<tag<time_t>, member<EDNSStatus, time_t, &EDNSStatus::modeSetAt>>
                                >> {
  void reset(index<ComboAddress>::type &ind, iterator it) {
    ind.modify(it, [](EDNSStatus &s) { s.mode = EDNSStatus::EDNSMode::UNKNOWN; s.modeSetAt = 0; });
  }
  void setMode(index<ComboAddress>::type &ind, iterator it, EDNSStatus::EDNSMode mode) {
    it->mode = mode;
  }
  void setTS(index<ComboAddress>::type &ind, iterator it, time_t ts) {
    ind.modify(it, [ts](EDNSStatus &s) { s.modeSetAt = ts; });
  }
};
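
A hypothetical usage sketch of the saved fails_t above (assuming it compiles in the recursor's context, where ComboAddress and Boost.MultiIndex are available); the point is that prune() only erases the expired prefix of the time-ordered index instead of scanning the whole map.

#include <sys/time.h>

// Hypothetical usage, illustrative only.
void exampleFailsUsage(fails_t& fails)
{
  struct timeval now;
  gettimeofday(&now, nullptr);

  fails.incr(ComboAddress("192.0.2.1"), now); // record a failed contact
  fails.incr(ComboAddress("192.0.2.2"), now);

  // Drop everything whose last failure is at least an hour old. Thanks to
  // the ordered_non_unique index on 'last', only expired entries are visited.
  fails.prune(now.tv_sec - 3600);
}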

@rgacogne (Member) commented:

For the 16 threads case the table size would be smaller.

This assumes that the number of authoritative servers contacted by a thread decreases with the number of threads, which might be true in the long-term. It's clearly not true when pdns-distributes-queries is off.

I think a good plan is:

  1. merge the linear versions of the cleanup for failed servers and ednsstatus into master
  2. backport it
  3. later, rework all tables that should be purged on a time-based decision to use a multi_index map. Unification is needed to avoid re-implementing the same logic in multiple places. It might even be possible to use fewer tables that combine the data that is now stored in several separate tables.

OK, I can see how that makes sense. I'm still a bit worried about the impact of linear scanning in real-world setups, though. We have learned in the past that things that were negligible in small setups could have dramatic consequences in large-scale ones, like metrics reporting, blocking pipes, and so on. So I think we really should get the multi_index version into master quickly, so that we have a solution ready if the linear scan proves to be a problem.

@omoerbeek (Member, Author) commented:

Agreed. But note that the 600k ednsstatus size was an observed aggregate number for a 16-thread test run.

@omoerbeek (Member, Author) commented:

@rgacogne can you take a last look? Especially the SyncRes::asyncresolveWrapper() part. Thanks.

@@ -556,32 +576,35 @@ int SyncRes::asyncresolveWrapper(const ComboAddress& ip, bool ednsMANDATORY, con
  else {
    ret=asyncresolve(ip, sendQname, type, doTCP, sendRDQuery, EDNSLevel, now, srcmask, ctx, d_outgoingProtobufServers, d_frameStreamServers, luaconfsLocal->outgoingProtobufExportConfig.exportTypes, res, chained);
  }
  // ednsstatus might be cleared, so do a new lookup
  ednsstatus = &t_sstorage.ednsstatus[ip]; // does this include port? YES
  mode = &ednsstatus->mode;
@rgacogne (Member) commented:

I wonder if we should update oldmode as well here? The only difference it would make is that if the status has already been updated in the meantime, we will set modeSetAt again. It's a matter of seconds, though, so it probably doesn't matter.

@omoerbeek omoerbeek merged commit 548cf07 into PowerDNS:master Nov 19, 2019
@omoerbeek omoerbeek deleted the rec-prune-failed-servers branch November 19, 2019 10:54