-
-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unbound 1.20 Cachedb broken? #1064
Comments
What could be happening is that the setup includes options that now act differently for cachedb, since the release. The release fixes problems that cachedb has with them, and it works, and this may be the cause of the change. Or perhaps it is going wrong because of the change. There was a new option introduced, In particular, if cachedb is configured but also serve-expired with the serve-expired-client-timeout, then the serve-expired-client-timeout is now respected. And that means that an expired response is not sent immediately from the cachedb, but instead it attempts to fully resolve the query within the time interval. And if this works, in the case queries take 17 - 25 ms, then that answer is used. For the case the time increase with a couple msec makes it look like that could be what is going on. If it was to fail and take a longer time the cachedb expired response is then used. So it could be that for the server the expired options have started working because of the new expired and cachedb fixes in 1.20.0, and for this reason it spends time looking up stuff, instead of immediately responding with expired answers from cachedb. If also edns client subnet is involved, there are other fixes with cachedb and serve-expired and that as well. |
So here is my Unbound Configuration -- maybe i miss something that should be changed with the new build:
If i understand that right, the settings "cachedb-check-when-serve-expired: yes" changed to NO could be not the solution - cause the explanaition is: So - this is the behavior I want... or have I misunderstood something? I think the following fixes do something that must be explained a little bit more from the Devs, i dont know why the entire cachedb function is completly different with this build... because - what is fixed if the previous build worked better ?!?
|
Yes those are the changes that make things different, and the new config option can control them; if that option is turned to a different value it should behave in a different manner. And perhaps in the old manner, but I can also not be sure because I do not know what it did before and now. Perhaps it is possible to find out what it did before and now, for one query, with verbosity: 5 and then look at the output to see what it did before and what it does now. The cachedb function changed to fix it to make this possible, previously it would not actually pick up expired data from cachedb correctly, if serve expired is enabled, but instead pick it up as briefly valid data. The change makes it pick up the data and set it expired. Another change is perhaps that it can ignore the expired contents from the cache, in order to check the cachedb, perhaps redis is the slow component, and now that it incorporates the expired data as expired, it ignores it the next lookup, because it is expired and checks cachedb again for possibly fresh data. |
Mhm.. maybe I've wait for other build and watch where it's going. Otherwise, Unbound is left on version 1.19.3. The build at least runs smoothly - at least I mean that😅 |
I'm a bit confused about the |
I’m encountering the same behavior with my docker setup for unbound in combination with redis. When cachedb-check-when-serve-expired is set to yes (default), queries getting a resolve time of like 20ms after some time has passed since last query (I have serve expired enabled with 24h as max TTL). When cachedb-check-when-serve-expired is set to no the expected behavior is happening, queries are instantly served from cache. Would this indicate that searching the cachedb/redis takes up to 20ms? |
I don't know, but this will not make sense. The cachedb via socket is really fast and in the build 1.19.3 you can see via logs that the queries are answered from redis. And that with 3-5ms. The cache from unbound is no longer active when you work with Redis. Or did I overlook something? |
@whyisthisbroken i always thought redis was only used as second level cache. So unbound still uses the in memory cache, with a fallback to redis if the no record is found in the in memory cache. But I indeed see lookups to redis even when the in memory cache should have the record. I’m unsure what the new option really does… |
I hope that some developer gives some clarification and can explain in an understandable way what has now been changed and why the behavior is completely different... |
Experimented a little today. When I set cachedb-check-when-serve-expired to no, queries are served immediately, even when expired. This is the expected behavior. However, when I restart unbound the same query will not immediately be returned from redis, from the logs it looks like a normal resolve takes place, which is strange in my opinion, from my understanding it should just serve the query from redis as stale record. This way using redis is only useful for non expired queries. |
Yeah, the entire cachedb is with build 1.20 completely useless. This is the result with 1.20 I've run now for 12h... I'll go back to 1.19.3 - I don't know what's wrong, but as long no dev reports or a new version comes up i will not test it any further. I'll post a screenshot tomorrow with build 1.19.3 with the same settings - of course without the new ones added for 1.20. |
So, I asked for detailed logs and you do not give them, so it is not sure what is actually happening apart from speculation? Just looking at the timing does not actually answer the questions? So, running with a verbosity level of 4 and 5, you can test what actually happens that makes queries take longer. Since the queries are actually resolved, I do not see this as a problem. Because the resolution works. The option can change between the old behaviour and the new behaviour. The correct setting is the new one, because expired information needs to be categorized as expired. Despite what is commented earlier, the data shown seems to not show actual bugs apart from curiosity at the timeframe. The setting to change Of course, there could be bugs, and that would be nice to fix them. I actually do not see any bugs in the 1.20.0 implementation here. Of course, if you get fresh lookups with a serve-expired-client-timeout, then the responses are classified as cache misses. With the new option enabled, it can in fact pick up expired responses from the cachedb, and use them. Previously it would briefly mark them as present, giving trouble, and making expired messages really expired makes sure that domains do not swap between older and newer versions of themselves, so that is not really something that is wrong here. If there is a reload, of course the cache is lost, and also no expired messages are present, in memory. The cachedb cache can then give an expired response. With the new option difference perhaps this is counted as a cache miss now, but it should have been but may have been counted as a cache hit before. Regardless the user gets the expired answer, if so, and the resolution continues. With the new option, the user may then, after the brief lookup, receive the new lookup answer if that happens to be able to be fetched within the serve-expired-client-timeout timeframe. |
Hi, thanks for the response. The last part of your explanation is where I'm having trouble. When you set serve-expired-client-timeout to 0 (disabling it), serve-expired to true and cachedb-check-when-serve-expired to no, I would expect the existing expired record van de cachedb to be returned immediatelly after a reload (when the memory cache is gone). It should not wait on a new resolve, because serve-expired-client-timeout is set to 0. But instead this response takes a couple of tens of milliseconds, indicating it waiting for a resolve. I'm not sure if this is a bug or I'm misunderstanding something. |
Well with the new option to no, it does the previous behaviour. So that would be the same as 1.19 does. Perhaps redis is that slow, that could be visible from the logs to see what part is slow. It would be useful to have an option that logs milliseconds for that, but log-time-ascii does not do it. Or it is resolving before the expired response but I also do not know why it would, but see below how I do, if that is broken, intend to then remove the option altogether and not fix it. Regardless of these issues, the fix with the new option yes it really a mandatory fix. Without it the serve-expired-client-timeout does not work. Also without it the information about a domain can fluctuate between different versions of the zone, once it is updated and also expired. The fix improves consistency and makes the serve expired options actually work, so the option turned on is a nicety. Realistically I think it should be removed and the current default the only choice, also the 1.19 behaviour, regardless of whether it was nice for you, has to go, because the bug has to be fixed for it. That said, it should also work, in this case. Once we figure out what that is, or what the problem, if any, is in this case. |
I was not specifically referring to 1.19. I myself do not have the option to test that version at the moment. I was just wondering how the current setting works. I will try to look in the logging if I can find something about slowness. The problem I had with setting cachedb-check-when-serve-expired to yes, is that now and then some expired entries did take additional time to resolve which could indeed indicate a problem with redis. I would have to investigate this further. |
Well if the cache has an expired message, then with the new option enabled, it will actually check cachedb for a fresh new message. This is perhaps the change in behaviour that is the trouble. |
So if I understood that correctly and I set "Check cachedb" to "No", even after restart Unbound and the cache is gone, does Unbound no longer look in the Redis database? So the whole Redis story would be unnecessary since it is no longer checked whether data is in the DB. This results, Redis is completely unnecessary, because the time for Redis lookups is sometimes slower than if I let it resolve again from Unbound... |
I’ve checked the logs again, from what I can see the time is coming from cached entries which needs validation before being served: [1715628891] unbound[1:1] debug: Serve expired: Trying to reply with expired data This causes a resolve to happening sometimes, which explains the time elapsed. But I wonder what the purpose is of having a cachedb if eventually a resolve is needed anyway. |
Based on the conversation, I wonder if the
Based on this analysis, it seems to me that this option would only add unnecessary Redis checks for a single-instance of Unbound. I'm not sure if my understanding is correct. It's just a summary based on the conversation and developers' explanations. Please feel free to correct any errors. |
@kkkgo You are mostly right indeed. That’s why the option cachedb-check-when-serve-expired is mandatory in this case and in your situation it would probably be better to set it to no. In step 3 it is theoretically possible to have fresh records if your restart unbound which will delete the in memory cache. But today’s TTL’s are so low that this chance is also low. So basically the option cachedb-check-when-serve-expired is useful in the end and I don’t think it should be left out. The only thing I wonder is why there happens to be a resolve after getting the item out of the cachedb, in my opinion it should be served immediately if serve-expired is enabled and the time is within serve-expired-ttl. |
@jcbvm When you restart Unbound, it's essentially equivalent to scenario 1, as described above. In this case, I believe it is unrelated to the |
So, for me the issue is closed. But I left it open for others. For me, Redis is no longer an alternative than a secondary cache with the new options. It's not worth looking for an entry via Redis if it's faster to resolve again. |
Yes, these changes appear to completely break the use of cachedb. Prior to 1.20.0, I used cachedb + redis to provide a cache following a restart of the Unbound service or reboot of the DNS server, but now Unbound essentially doesn't use redis at all. Instead of returning an entry from the redis cache following a reboot, Unbound does a lookup instead (?), even if the lookup takes a long time (e.g., 200ms on some test domains). This all happens even if "serve-expired-client-timeout" is set to zero (and I have "service-expired-ttl" set to "259200" as well). Maybe there is some millisecond comparison going on in Unbound, but I have to believe that a 200ms DNS lookup is a couple of orders of magnitude slower than looking up an entry from redis... |
There seems to be separate problem, perhaps, where it resolves instead of looks up in cachedb, as different from other cases, where it looks at cachedb before returning a reply. With the client timeout disabled, so it should either respond immediately with a cache response, or expired response, possibly after a cachedb lookup of course, but not then wait for recursion. Are there logs, say verbosity 4 or 5, of this going wrong? It should check the cachedb if that is configured. |
In have a log file here with verbosity 4. It looks at the cachedb and finds an item. But instead of serving it right away it starts resolving. I have serve-expired-client-timeout on 0 and serve-expired-ttl on 1 day. |
What I am seeing is that if the cachedb entry hasn't expired, it returns the cachedb entry and doesn't resolve and is done - so this situation works correctly. However, if the cachedb entry has expired, now in 1.20.0 it ignores the cachedb entry and resolves instead, blocking the return of the record until it is resolved, and then it returns the new resolved records. It shouldn't do this if "serve-expired-client-timeout" is zero and "serve-expired-ttl" is set to 1-3 days, right? Why doesn't it return the expired record with TTL 0 and then resolve in background, like it did previously to 1.20.0? This new behavior makes cachedb useless except for cases where the Unbound cache happens to be cleared within the record's TTL (?), which seems to be very unlikely to normally occur...?
|
For what it's worth (which is almost certainly not very much), undoing one of the changes made in 7c5e765 appears to work around these new issues with cachedb being broken for the "serve-expired-client-timeout: 0" case:
With this change everything works again, just like it used to (i.e., prior to 1.20.0). |
insecure, and thus can be served to clients when dnssec is enabled.
The commit fbdc06e contains a fix for the example log and for the excerpt posted. It sets the dnssec validation status of the expired cachedb contents to insecure. This allows it to be returned to the client when DNSSEC is enabled. Otherwise it would need dnssec validation, but the message is expired and the validator does not work on that; it works on currently looked up data. With the fix, I think, it is going to make the two examples return an expired reply straight away, and then also continue to resolve. The fixup suggested points in the right direction, but is not the fix to use; it would return a full reply but not resolve a fresh entry. |
Hi @wcawijngaards, thanks for the change! I have a couple of questions: Will this change work with DNSSEC validation disabled (as I am not using DNSSEC)? It does seem to be okay for this case. Also, this change requires "cachedb-check-when-serve-expired" to be enabled, correct? I did some quick tests and this new change appears to address the issue, albeit behavior is now a bit different than in 1.19.x (?). I just want to confirm: Instead of returning a zero (0) TTL from the cachedb result, it now returns the value for "serve-expired-reply-ttl" (which appears to make more sense than zero anyway). Thank you! |
I have the same question. Does this fix address the case when cachedb-check-when-serve-expired is set to yes? If I set cachedb-check-when-serve-expired to no, will it have the same effect? In a scenario where the in-memory cache always has fresher DNS records than Redis, wouldn't it be more appropriate to set it to no? |
Yes it would work with validation disabled. In fact, things should already work for that. It requires the cachedb-check-when-serve-expired. It could apply when that is disabled too, but I wanted to preserve the old behaviour that you are asking for. That the serve-expired-reply-ttl is honored, is a new feature for the cachedb-check-when-serve-expired option, also the serve-expired-client-timeout works. It is nice that it works. Supposedly when cachedb-check-when-serve-expired is disabled, then it should already be working like in previous versions and no change needed. Hence the fix does not change that. |
I have closed the issue with the commit, this adds a unit test to the previous commit that fixes the validation status. This makes the new 1.20.0 cachedb changes work with validation. The unit test fails with similar debug output as the shown logs without the validation status fixup, and with the validation status fixup the unit test works and produces neat serve expired responses from cachedb. It shows that the fix up works and makes this kind of situation work with the new cachedb code, and this is nice because that code supports options like serve-expired-client-timeout and serve-expired-reply-ttl. |
Thank you for taking the time for this issue and for the provided fix. |
+1 @wcawijngaards - the new behavior does work nicely, thank you very much for looking into this and addressing it! |
Anyone else runs the actual unbound 1.20.0 where the cachedb don't work properly anymore?
My avg ttl goes up to 50-60ms from 4-5ms.
My setup is on a Rasperry Pi 4:
AGH (cache disabled)- Unbound (with redis cachedb) - DNScrypt
AdGuardHome resolve with 127.0.0.1:53 / [::1]:53 to Unbound ->
Unbound resolve from CacheDB or DNSCrypt with forward-addr: 127.0.0.1@6053 / forward-addr: ::1@6053
Since the update from 1.19.3 to 1.20.0 there is something broken. Compiled exactly the same.
All queries are 25-30ms, although obviously the cache responds with "get".
DBSize also growning and the log from unbound is without errors.
On verbostiy lvl 4 there are entries with
info: redis ;; ->>HEADER<<-
so i expect that unbound resolve from cachedb, but AdGuardHome show me over 30-50ms for example resolves.The setup is the same but something doesn't work anymore correctly... It seems that unbound doesn't respect the cache entries and resolve again...
I'm back to 1.19.3 and everything works fine again...see screenshot.
Compiling Code:
./configure --build=aarch64-linux-gnu --prefix=/usr --includedir=\${prefix}/include --infodir=\${prefix}/share/info --libdir=\${prefix}/lib/aarch64-linux-gnu --mandir=\${prefix}/share/man --localstatedir=/var --runstatedir=/run --sysconfdir=/etc --with-chroot-dir= --with-dnstap-socket-path=/run/dnstap.sock --with-libevent --with-libhiredis --with-libnghttp2 --with-pidfile=/run/unbound.pid --with-pythonmodule --with-pyunbound --with-rootkey-file=/var/lib/unbound/root.key --disable-dependency-tracking --disable-flto --disable-maintainer-mode --disable-option-checking --disable-rpath --disable-silent-rules --enable-cachedb --enable-dnstap --enable-subnet --enable-systemd --enable-tfo-client --enable-tfo-server CFLAGS="-O2"
With 1.19.3
With 1.20.0
The text was updated successfully, but these errors were encountered: