
Rec: Purge map of failed auths periodically by keeping a last changed timestamp. #8525

Merged
merged 6 commits into PowerDNS:master from the rec-prune-failed-servers branch on Nov 19, 2019

Conversation

@omoerbeek (Member) commented Nov 12, 2019

Short description

SyncRes thread-local storage includes a map of failed auths which was
only cleaned if a specific IP was contacted again and that contact
succeeded. Persistently failing auths, or auths that were never tried
again, remained in the map.

While here, add code to dump the failed servers map. Might (partially?)
solve #7771.
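
A minimal sketch of the approach described above (illustrative only; the names FailedEntry, failedauths and pruneFailedAuths are placeholders, not the actual recursor code, which keys on ComboAddress and lives in SyncRes thread-local storage): each entry keeps a last-changed timestamp, and a periodic housekeeping pass erases entries whose timestamp is older than a cutoff.

#include <ctime>
#include <map>
#include <string>

// Illustrative stand-in for a failed-auth record with a last-changed timestamp.
struct FailedEntry {
  unsigned long failures{0};
  time_t last{0}; // updated every time the failure counter changes
};

using failedauths_t = std::map<std::string, FailedEntry>;

// Periodic housekeeping: drop entries not touched since 'cutoff', so
// persistently failing or never-retried auths no longer stay forever.
void pruneFailedAuths(failedauths_t& failedauths, time_t cutoff)
{
  for (auto it = failedauths.begin(); it != failedauths.end(); ) {
    if (it->second.last < cutoff) {
      it = failedauths.erase(it);
    } else {
      ++it;
    }
  }
}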

Checklist

I have:

  • read the CONTRIBUTING.md document
  • compiled this code
  • tested this code
  • included documentation (including possible behaviour changes)
  • documented the code
  • added or modified regression test(s)
  • added or modified unit test(s)

@omoerbeek changed the title from "Purge map of failed auths periodically by keeping a last changed timestamp." to "Rec: Purge map of failed auths periodically by keeping a last changed timestamp." on Nov 12, 2019
@rgacogne (Member) left a comment

Looks good! Don't we have the same situation with the ednsstatus map, by the way? We do need to keep EDNS status entries quite a bit longer, though; I would say several hours, perhaps one day, since the last time we used them.

@omoerbeek (Member, Author) commented:

I'll take a look at the EDNS status map as well.

@omoerbeek (Member, Author) commented:

The mode is set to UNKNOWN if the entry is older than an hour in asyncResolveWrapper(), so the prune age can be an hour if I am not mistaken.

@rgacogne (Member) commented:

The mode is set to UNKNOWN if the entry is older than an hour in asyncResolveWrapper(), so the prune age can be an hour if I am not mistaken.

I expected a longer validity but OK.
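
A sketch of the rule being discussed (assumed shape and names, not the verbatim recursor code): an EDNS status entry whose mode was set more than an hour ago is treated as UNKNOWN again, which is why a one-hour prune cutoff loses no information.

#include <ctime>

// Illustrative only: a cut-down EDNS status record with just the fields relevant here.
struct EDNSStatusSketch {
  enum EDNSMode { UNKNOWN, EDNSOK, EDNSIGNORANT, NOEDNS } mode{UNKNOWN};
  time_t modeSetAt{0};
};

// If the recorded mode is older than an hour, fall back to UNKNOWN. An entry
// old enough to be pruned would have been reset anyway.
void maybeResetEDNSStatus(EDNSStatusSketch& status, time_t now)
{
  const time_t maxAge = 3600; // one hour, per the discussion above
  if (status.modeSetAt != 0 && now - status.modeSetAt > maxAge) {
    status.mode = EDNSStatusSketch::UNKNOWN;
    status.modeSetAt = 0;
  }
}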

pdns/pdns_recursor.cc (resolved review thread)
Switch away from a ref to a pointer because of the above, and use modern
init for EDNSStatus.
pdns/syncres.hh (outdated, resolved review thread)
  size_t size() const
  {
    return d_cont.size();
  }

  void prune(time_t cutoff) {
    for (auto it = d_cont.begin(); it != d_cont.end(); ) {
@rgacogne (Member) commented:

Any idea of how many entries we usually have in there? If it might be a lot perhaps we need to use a multi_index keyed on the last field to avoid having to scan the whole map.

{
  for (auto it = t_sstorage.ednsstatus.begin(); it != t_sstorage.ednsstatus.end(); ) {
    if (it->second.modeSetAt <= cutoff) {
      it = t_sstorage.ednsstatus.erase(it);
@rgacogne (Member) commented:

Same question about the number of entries?

A Contributor commented:

On one of our small recursors (sub 1kqps) that has only been up for a week we have 200,000 edns status entries - if that data point helps.

The same Contributor commented:

With cleanup, I suppose it will follow the same trend as the nsspeed size, which in that instance is an order of magnitude smaller.

@omoerbeek (Member, Author) commented Nov 13, 2019:

In my test runs (with the purging), continuously resolving alexa-top-1m in a loop with calidns, I see the ednsstatus count grow to about 600k. Failed servers hovers around 30k, and nsspeeds is of the same order of magnitude as (but a bit bigger than) ednsstatus.

@rgacogne (Member) commented Nov 13, 2019:

In my test runs (with the purging), continuously resolving alexa-top-1m in a loop with calidns, I see the ednsstatus count grow to about 600k. Failed servers hovers around 30k, and nsspeeds is of the same order of magnitude as (but a bit bigger than) ednsstatus.

Are these numbers from before, without this PR? Or does it stay that high even when regularly purged?
Although, we won't start purging the EDNS statuses until 2 hours after startup, so I'm guessing it will grow almost up to 600k quickly. So perhaps it would make sense to use a multi_index, with the existing index plus one on the timestamp, so that we only scan expired entries?

@omoerbeek (Member, Author) commented:

The numbers I reported are with this PR included, so with purging. I'll look into the multi-index thing soon. I currently do not want to stop this test run, so I can see how it is doing in the long run (multiple days). After that I can add some code to see how fast the purging is and compare it with a multi-index solution.

@rgacogne (Member) commented:

The numbers I reported are with this PR included, so with purging. I'll look into the multi-index thing soon.

OK! Let me know if you are not familiar with how multi-index works and want some pointers (no pun intended)!

@omoerbeek (Member, Author) commented:

I'm not familiar with multi-index, but I'd like to give it a shot myself as a start.

BTW, the numbers above are the accumulated ones for a rec with 16 threads. So to get per-thread estimates you need to divide by 16.

@omoerbeek (Member, Author) commented:

@rgacogne Can you take a look to see if I'm doing this right? If so we might need to generalize this, since the other tables are purged in a similar manner.

pdns/syncres.hh (outdated, resolved review thread)
pdns/syncres.hh (outdated, resolved review thread)
pdns/syncres.cc (outdated, resolved review thread)
@omoerbeek (Member, Author) commented:

Here's the graph of resident memory usage after more than 5 days of running 4.1.x with a rough backport of this PR.

(screenshot: resident memory usage graph, taken 2019-11-18 08:05)

@omoerbeek (Member, Author) commented Nov 18, 2019:

On Friday I became smart enough to have a periodic ps run each hour.
Two lines from it:

2019-11-15T14:00:01+00:00       nobody    1158  286 15.2 5415688 2507572 pts/3 Sl+  Nov12 12048:21 ./pdns_recursor --config-dir=.
...
2019-11-18T08:00:01+00:00       nobody    1158  286 16.6 5415688 2740668 pts/3 Sl+  Nov12 23414:51 ./pdns_recursor --config-dir=.

While the RSS creep is visible here (2507572 -> 2740668), it must be noted that the VSZ (5415688) does not grow.
A possible explanation: since there is no pressure on physical memory on this machine:

MiB Mem :  16044.8 total,   5760.7 free,   3137.4 used,   7146.7 buff/cache
MiB Swap:  16384.0 total,  16384.0 free,      0.0 used.  12431.9 avail Mem 

there is no incentive for the kernel to evict physical pages, even if they are unused. Add to that that glibc malloc hangs on to pages: it does not unmap them, as far as I know.

To summarize:

  • This PR fixes two real leaks
  • With it, I do not see growth in VSZ

Next steps:

  1. Run some tests to see if using multi-index is worth the trouble, by comparing a full scan against a cleanup using a time-based index.
  2. Based on 1, merge (a modified version of) this PR
  3. Prepare backports to 4.1 and 4.2

On the longer term:

  • Also consider using multi-index for the other time-based data structures in SyncRes (nsspeeds and throttle)
  • Try to find a memory usage statistic that does not put us and others on the wrong foot. That is hard to do in general.
  • It might be worth revisiting the mthread stack allocation: some form of caching (and thus better re-use) could reduce heap fragmentation.
  • Other ideas that could reduce memory use/fragmentation: shared cache, custom allocators for specific data types.

@omoerbeek (Member, Author) commented Nov 18, 2019:

Another observation: I applied memory pressure to the point where the machine started needing swap, by running a program allocating and touching lots of memory. I saw almost the whole rec process being paged out to swap. After I stopped the memory-eating process, about 1.1G of rec was paged in again after 10 min or so. The whole time VSZ remained constant. Now the process is at:

nobody    1158  286  7.0 5415688 1151420 pts/3 Sl+  Nov12 23729:33 ./pdns_recursor --config-dir=.

So the short-term working set of this process is about 1.1G.

@omoerbeek (Member, Author) commented Nov 18, 2019:

Measurement shows that a linear scan of a thread's ednsstatus map takes about 0.024s for 123515 entries. This was on a 2-thread test run, where (with the given test setup) it is expected to grow to about 300k entries per thread. For the 16-thread case the table size would be smaller. Testing that setup I'm seeing cleanup times around 4-23ms for 24k entries. Note that the variance is higher here, since my test machine is a 4-core one, so running 16 threads on it will introduce (much) more scheduling noise.

So I think that a multi_index solution is not really needed at the moment, and I wonder if it's worth the added complexity. I'll do some test runs with the multi_index version next.
...
For the multi_index-based version we're looking at 0.3ms for a cleanup of a table with 20k entries, so it is much faster.

I think a good plan is:

  1. merge the linear versions of the cleanup for failed servers and ednsstatus into master
  2. backport it
  3. later, rework all tables that should be purged on a time-based decision to use a multi_index map. Unification is needed to avoid re-implementing the same logic in multiple places. It might even be possible to use fewer tables that combine the data that is now stored in several separate tables.
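
For reference, a rough sketch (not from the PR) of how prune timings like the ones above could be taken; timedPrune is a hypothetical helper and only assumes a container wrapper exposing size() and prune(time_t), like the failed-servers one shown earlier in this thread.

#include <chrono>
#include <cstdio>
#include <ctime>

// Hypothetical helper: time a single prune() call and report how many
// entries were removed.
template <typename Container>
void timedPrune(Container& cont, time_t cutoff)
{
  const size_t before = cont.size();
  const auto start = std::chrono::steady_clock::now();
  cont.prune(cutoff);
  const auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
      std::chrono::steady_clock::now() - start);
  fprintf(stderr, "pruned %zu of %zu entries in %lld us\n",
          before - cont.size(), before, static_cast<long long>(elapsed.count()));
}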

@omoerbeek (Member, Author) commented:

I'll be squashing the multi_index code out; saving a copy here for future reference.

class fails_t : public boost::noncopyable
{
public:
  typedef unsigned long counter_t;
  struct value_t {
    value_t(const ComboAddress &a) : address(a) {}
    ComboAddress address;
    mutable counter_t value{0};
    time_t last{0};
  };

  typedef multi_index_container<value_t,
                                indexed_by<
                                  ordered_unique<tag<ComboAddress>, member<value_t, ComboAddress, &value_t::address>>,
                                  ordered_non_unique<tag<time_t>, member<value_t, time_t, &value_t::last>>
                                  >> cont_t;

  cont_t getMap() const {
    return d_cont;
  }
  counter_t value(const ComboAddress& t) const
  {
    auto i = d_cont.find(t);

    if (i == d_cont.end()) {
      return 0;
    }
    return i->value;
  }

  counter_t incr(const ComboAddress& t, const struct timeval & now)
  {
    auto i = d_cont.insert(t).first;

    if (i->value < std::numeric_limits<counter_t>::max()) {
      i->value++;
    }
    auto &ind = d_cont.get<ComboAddress>();
    ind.modify(i, [t = now.tv_sec](value_t &val) { val.last = t;});
    return i->value;
  }

  void clear(const ComboAddress& a)
  {
    d_cont.erase(a);
  }

  void clear()
  {
    d_cont.clear();
  }

  size_t size() const
  {
    return d_cont.size();
  }

  void prune(time_t cutoff) {
    auto &ind = d_cont.get<time_t>();
    ind.erase(ind.begin(), ind.upper_bound(cutoff));
  }

private:
  cont_t d_cont;
};

struct EDNSStatus {
  EDNSStatus(const ComboAddress &arg) : address(arg) {}
  ComboAddress address;
  time_t modeSetAt{0};
  mutable enum EDNSMode { UNKNOWN=0, EDNSOK=1, EDNSIGNORANT=2, NOEDNS=3 } mode{UNKNOWN};
};

struct ednsstatus_t : public multi_index_container<EDNSStatus,
                              indexed_by<
                                ordered_unique<tag<ComboAddress>, member<EDNSStatus, ComboAddress, &EDNSStatus::address>>,
                                ordered_non_unique<tag<time_t>, member<EDNSStatus, time_t, &EDNSStatus::modeSetAt>>
                                >> {
  void reset(index<ComboAddress>::type &ind, iterator it) {
    ind.modify(it, [](EDNSStatus &s) { s.mode = EDNSStatus::EDNSMode::UNKNOWN; s.modeSetAt = 0; });
  }
  void setMode(index<ComboAddress>::type &ind, iterator it, EDNSStatus::EDNSMode mode) {
    it->mode = mode;
  }
  void setTS(index<ComboAddress>::type &ind, iterator it, time_t ts) {
    ind.modify(it, [ts](EDNSStatus &s) { s.modeSetAt = ts; });
  }
};
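
A hypothetical usage sketch of the saved fails_t above (assuming it compiles in the recursor's context, where ComboAddress and Boost.MultiIndex are available); the point is that prune() only erases the expired prefix of the time-ordered index instead of scanning the whole map.

#include <sys/time.h>

// Hypothetical usage, illustrative only.
void exampleFailsUsage(fails_t& fails)
{
  struct timeval now;
  gettimeofday(&now, nullptr);

  fails.incr(ComboAddress("192.0.2.1"), now); // record a failed contact
  fails.incr(ComboAddress("192.0.2.2"), now);

  // Drop everything whose last failure is at least an hour old. Thanks to
  // the ordered_non_unique index on 'last', only expired entries are visited.
  fails.prune(now.tv_sec - 3600);
}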

@rgacogne (Member) commented:

For the 16 threads case the table size would be smaller.

This assumes that the number of authoritative servers contacted by a thread decreases with the number of threads, which might be true in the long-term. It's clearly not true when pdns-distributes-queries is off.

I think a good plan is:

  1. merge the linear versions of the cleanup for failed servers and ednsstatus into master
  2. backport it
  3. later, rework all tables that should be purged on a time-based decision to use a multi_index map. Unification is needed to avoid re-implementing the same logic in multiple places. It might even be possible to use fewer tables that combine the data that is now stored in several separate tables.

OK, I can see how that makes sense. I'm still a bit worried about the impact of linear scanning in real-world setups, though. We have learned in the past that things that were negligible in small setups could have dramatic consequences in large-scale ones, like metrics reporting, blocking pipes, and so on. So I think we really should get the multi_index version into master quickly, so that we have a solution ready if the linear scan proves to be a problem.

@omoerbeek (Member, Author) commented:

Agreed. But note that the 600k ednsstatus size was an observed aggregate number for a 16-thread test run.

@omoerbeek (Member, Author) commented:

@rgacogne can you take a last look? Especially the SyncRes::asyncresolveWrapper() part. Thanks.

@@ -556,32 +576,35 @@ int SyncRes::asyncresolveWrapper(const ComboAddress& ip, bool ednsMANDATORY, con
  else {
    ret=asyncresolve(ip, sendQname, type, doTCP, sendRDQuery, EDNSLevel, now, srcmask, ctx, d_outgoingProtobufServers, d_frameStreamServers, luaconfsLocal->outgoingProtobufExportConfig.exportTypes, res, chained);
  }
  // ednsstatus might be cleared, so do a new lookup
  ednsstatus = &t_sstorage.ednsstatus[ip]; // does this include port? YES
  mode = &ednsstatus->mode;
@rgacogne (Member) commented:

I wonder if we should update oldmode as well here? The only difference it would make is that if the status has already been updated in the meantime, we will set modeSetAt again. It's a matter of seconds, though, so it probably doesn't matter.

@omoerbeek omoerbeek merged commit 548cf07 into PowerDNS:master Nov 19, 2019
@omoerbeek omoerbeek deleted the rec-prune-failed-servers branch November 19, 2019 10:54