Resolving records through more than 8 CNAME fails due to hardcoded MAX_RESTART_COUNT #438

Closed
tylerszabo opened this issue Mar 2, 2021 · 19 comments · Fixed by #461

@tylerszabo

I've run into an edge case where a name passes through 9 CNAMEs (10 total resolutions to get the final answer). Because MAX_RESTART_COUNT is hardcoded, there's no way to tune the configuration to work around this situation.

The original bug report is in the pfSense bug tracker, where detailed traces from drill are available.

@cgallred
Contributor

cgallred commented Apr 3, 2021

I've hit this same issue. It prevents me (and anybody using Unbound) from setting up a Windows device with a Microsoft Account (MSA). logincdn.msauth.net, used during MSA authentication, has 9 CNAME records (see dig output below). On my home network behind an OPNsense router, I get SERVFAIL when trying to authenticate an MSA.

christian@IG-88:~$ dig logincdn.msauth.net @8.8.8.8

; <<>> DiG 9.16.1-Ubuntu <<>> logincdn.msauth.net @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10402
;; flags: qr rd ra; QUERY: 1, ANSWER: 11, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;logincdn.msauth.net.           IN      A

;; ANSWER SECTION:
logincdn.msauth.net.    10      IN      CNAME   lgincdn.trafficmanager.net.
lgincdn.trafficmanager.net. 29  IN      CNAME   lgincdnmsftuswe2.azureedge.net.
lgincdnmsftuswe2.azureedge.net. 1493 IN CNAME   lgincdnmsftuswe2.afd.azureedge.net.
lgincdnmsftuswe2.afd.azureedge.net. 16 IN CNAME star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net. 23 IN CNAME dual.t-0009.t-msedge.net.
dual.t-0009.t-msedge.net. 227   IN      CNAME   t-0009.t-msedge.net.
t-0009.t-msedge.net.    59      IN      CNAME   Edge-Prod-WSTr3.ctrl.t-0009.t-msedge.net.
Edge-Prod-WSTr3.ctrl.t-0009.t-msedge.net. 239 IN CNAME edge-prod-wstr3.ctrl.t-0001.trafficmanager.net.
edge-prod-wstr3.ctrl.t-0001.trafficmanager.net. 0 IN CNAME standard.t-0009.t-msedge.net.
standard.t-0009.t-msedge.net. 21 IN     A       13.107.246.19
standard.t-0009.t-msedge.net. 21 IN     A       13.107.213.19

;; Query time: 70 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Apr 02 17:04:47 PDT 2021
;; MSG SIZE  rcvd: 376
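
To make the count concrete, here is a minimal Python sketch that tallies record types in the answer section above. The helper name `count_record_types` is made up for illustration, and the value 8 is the MAX_RESTART_COUNT discussed in this issue; the exact comparison Unbound performs is not shown in this thread, so the final check is only an assumption based on the issue title.

```python
# Answer-section lines copied from the dig output above.
ANSWER_SECTION = """\
logincdn.msauth.net.    10      IN      CNAME   lgincdn.trafficmanager.net.
lgincdn.trafficmanager.net. 29  IN      CNAME   lgincdnmsftuswe2.azureedge.net.
lgincdnmsftuswe2.azureedge.net. 1493 IN CNAME   lgincdnmsftuswe2.afd.azureedge.net.
lgincdnmsftuswe2.afd.azureedge.net. 16 IN CNAME star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net. 23 IN CNAME dual.t-0009.t-msedge.net.
dual.t-0009.t-msedge.net. 227   IN      CNAME   t-0009.t-msedge.net.
t-0009.t-msedge.net.    59      IN      CNAME   Edge-Prod-WSTr3.ctrl.t-0009.t-msedge.net.
Edge-Prod-WSTr3.ctrl.t-0009.t-msedge.net. 239 IN CNAME edge-prod-wstr3.ctrl.t-0001.trafficmanager.net.
edge-prod-wstr3.ctrl.t-0001.trafficmanager.net. 0 IN CNAME standard.t-0009.t-msedge.net.
standard.t-0009.t-msedge.net. 21 IN     A       13.107.246.19
standard.t-0009.t-msedge.net. 21 IN     A       13.107.213.19
"""

# Limit value quoted in this issue (before the bump to 11).
OLD_LIMIT = 8


def count_record_types(answer_text):
    """Return (cname_count, address_count) for dig-style answer lines."""
    cnames = 0
    addresses = 0
    for line in answer_text.splitlines():
        fields = line.split()
        # Expected layout: owner, TTL, class, type, rdata. Skip comments.
        if len(fields) < 5 or fields[0].startswith(";"):
            continue
        if fields[3] == "CNAME":
            cnames += 1
        elif fields[3] in ("A", "AAAA"):
            addresses += 1
    return cnames, addresses


cnames, addresses = count_record_types(ANSWER_SECTION)
print(f"{cnames} CNAME records, {addresses} address records")
# Assumption: a chain longer than the limit is what triggers the failure.
print("exceeds old limit:", cnames > OLD_LIMIT)
```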

@jkugler

jkugler commented Apr 7, 2021

I have hit this too, and it makes LinkedIn inaccessible at times. Their CDN host, static-exp1.licdn.com, sometimes resolves through this chain (but not always):

static-exp1.licdn.com.  300     IN      CNAME   2-01-2c3e-003d.cdx.cedexis.net.
2-01-2c3e-003d.cdx.cedexis.net. 300 IN  CNAME   li-prod-static.azureedge.net.
li-prod-static.azureedge.net. 1800 IN   CNAME   li-prod-static.afd.azureedge.net.
li-prod-static.afd.azureedge.net. 30 IN CNAME   star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net. 30 IN CNAME dual.t-0009.t-msedge.net.
dual.t-0009.t-msedge.net. 240   IN      CNAME   t-0009.t-msedge.net.
t-0009.t-msedge.net.    60      IN      CNAME   Edge-Prod-WSTr3.ctrl.t-0009.t-msedge.net.
Edge-Prod-WSTr3.ctrl.t-0009.t-msedge.net. 240 IN CNAME edge-prod-wstr3.ctrl.t-0001.trafficmanager.net.
edge-prod-wstr3.ctrl.t-0001.trafficmanager.net. 0 IN CNAME standard.t-0009.t-msedge.net.
standard.t-0009.t-msedge.net. 240 IN    A       13.107.246.19
standard.t-0009.t-msedge.net. 240 IN    A       13.107.213.19

@bdrewery

bdrewery commented Apr 7, 2021

Failures I ran into with this for the sake of anyone else confused:

  • During initial Windows 10 setup you cannot sign in with a Microsoft account. It just spins and then drops back into the local user setup.
  • Cannot sign in with a Microsoft account once in Windows 10. A window opens and then closes.
  • Cannot add a local user in an existing Windows 10 system. You just get a screen with a broken MS logo image, 2 back buttons, and a cookie and privacy policy link.

Granted, those are Microsoft Windows problems, but 8 is arbitrarily low for a cheap way to prevent CNAME loops.
PR #461 is nice, but perhaps the default needs to be bumped up significantly too. There are a lot of MS Windows systems and CDNs out there.

The only workaround I've found beyond hardcoding Microsoft's IPs is to not use Unbound. dig @8.8.8.8 logincdn.msauth.net A returns full results, so I figured forwarding msauth.net would work, but Unbound insists on prioritizing and chaining CNAMEs locally and hits the limit. I believe #132 is relevant here. It's surprising to me that "forward" injects more behavior than the forwarded server would provide, with no way to disable it, rather than acting as a transparent forwarder.

@Arnavion

Arnavion commented Apr 20, 2021

I hit this issue too, with Unbound 1.13.1 on OPNsense. The strange thing is, Unbound does sometimes resolve the domain successfully, but querying immediately afterwards fails again. Based on level 5 logs, I believe the difference is that it succeeds when its cache is empty, and fails when it finds the CNAME chain in its cache.


Amusingly, for logincdn.msauth.net this isn't a worldwide problem and might be specific to people around US-West. A person in Italy told me they have no problems resolving the domain, because the CNAME chain they get is much shorter:

;; ANSWER SECTION:
logincdn.msauth.net.    261     IN      CNAME   lgincdn.trafficmanager.net.
lgincdn.trafficmanager.net. 29  IN      CNAME   lgincdnvzeuno.azureedge.net.
lgincdnvzeuno.azureedge.net. 1762 IN    CNAME   lgincdnvzeuno.ec.azureedge.net.
lgincdnvzeuno.ec.azureedge.net. 3562 IN CNAME   cs1227.wpc.alphacdn.net.
cs1227.wpc.alphacdn.net. 3562   IN      A       192.229.221.185

@jkugler

jkugler commented Apr 20, 2021

@Arnavion It certainly would change from place to place, as that is what a CDN is designed for. In the case of LinkedIn, it will even change throughout the day for me, from a 3-name chain to an 8+ name chain. So...there are certain times of the day LinkedIn will fail to work for me. :)

@Arnavion

Yes, it's just that on a few other forums I saw people wondering why something as fundamental as logincdn.msauth.net could be broken and only they were noticing it. I know I certainly was confused, until I saw the debug logs + got the Italian user's CNAME chain confirming that that was what the issue was.

It doesn't help that there are also a few red herrings on the internet, because some people claim they got it working by disabling qname-minimisation, but at least for me doing that doesn't make any difference.

@cgallred
Contributor

Naturally there are other names besides logincdn.msauth.net affected by this issue, and it can cause problems beyond the ken of mortal man. :-) I was trying to do my taxes in TurboTax for Mac and it hit an error trying to connect to my ItsDeductible account to download charitable contribution info. On a hunch I set my MacBook's DNS resolver to 8.8.8.8. Problem went away. I see folks on the Intuit forums complaining about the issue, and of course nobody there has any idea what the cause could be.

@gthess
Member

gthess commented Jun 18, 2021

Hi,
Thanks for bringing this up!
Since the problem is that the domain in question has a longer CNAME chain than Unbound is willing to accept, we believe the solution is to raise the MAX_RESTART_COUNT value to something higher, like 11, which is still small enough. Small values ensure that Unbound will not waste resources trying to resolve potentially harmful queries.

However we would not like to introduce a configuration option for that since this will only solve the problem for users that mess with the value. The proper solution is for domains to check their CNAME chains because different resolvers have different strategies and limits on how to deal with CNAME chains/query restarts.

@gthess gthess self-assigned this Jun 18, 2021
@jkugler

jkugler commented Jun 18, 2021

@gthess Could we raise it to something like 20? I know that sounds high, but it should still resolve pretty fast, most places won't do this, caching should mitigate when it does happen, and it will keep this bug from recurring too soon. :)

@gthess
Member

gthess commented Jun 21, 2021

I am reluctant to go that "high". For legitimate purposes that should work fine. But since Unbound does not trust CNAME chains and goes and asks for each record itself, the illegitimate/DoS purposes are what we don't want to enable.
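
As a rough illustration of the trade-off described here, the following Python sketch follows a CNAME chain one lookup at a time and gives up after a fixed number of restarts. It is a toy model over an in-memory table, not Unbound's implementation: `resolve`, `make_chain`, `TooManyRestarts`, and the `hopN.example.` names are all hypothetical, and the "more than `max_restarts`" comparison is an assumption.

```python
class TooManyRestarts(Exception):
    """Raised when a lookup restarts more often than the limit allows."""


def resolve(name, records, max_restarts):
    """Follow CNAMEs one step at a time in a toy record table.

    records maps a name to ("CNAME", target) or ("A", [addresses]).
    Returns (addresses, restarts_used).
    """
    restarts = 0
    current = name
    while True:
        rtype, data = records[current]
        if rtype == "A":
            return data, restarts
        # Each CNAME forces a fresh lookup for the target name.
        restarts += 1
        if restarts > max_restarts:
            raise TooManyRestarts(
                f"gave up after {max_restarts} restarts at {current}"
            )
        current = data


def make_chain(length):
    """Build a table with `length` CNAMEs ending in one A record."""
    records = {
        f"hop{i}.example.": ("CNAME", f"hop{i + 1}.example.")
        for i in range(length)
    }
    records[f"hop{length}.example."] = ("A", ["192.0.2.1"])
    return records


chain = make_chain(9)  # same length as the chains reported in this issue
for limit in (8, 11):
    try:
        addresses, used = resolve("hop0.example.", chain, limit)
        print(f"limit {limit}: resolved to {addresses} after {used} restarts")
    except TooManyRestarts as err:
        print(f"limit {limit}: failed ({err})")
```

In this model the same cap that rejects the 9-CNAME chain at 8 also stops a CNAME loop, and every extra unit of headroom is one more upstream lookup an abusive name can force.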

@jkugler

jkugler commented Jun 21, 2021

@gthess Yeah, I fully understand that. We are currently dealing with this bug because we never thought someone would make a CNAME chain this long. To prevent this from coming up in the future, I would think we'd either need to make it a high number, or make it configurable so it's an easy workaround if something "stupid" happens in the future. Right now, there are a LOT of downstream users who can't (or don't know how to) upgrade because they use Unbound in a project such as pfSense. The bug won't be fixed for them until a fix is released...and then pfSense upgrades the package. A configuration option would let users mitigate this in the future.

@vcunat
Contributor

vcunat commented Jun 22, 2021

Can someone provide reasoning why anyone would need such long CNAME chains? To me this sounds like the mistake was on that side. Perhaps it's a bit unfortunate that the protocol does not standardize a particular global limit; I agree you don't want a (web) service telling all their users to reconfigure their DNS.

@tylerszabo
Author

With regard to why this change might be helpful: if you're in a situation (like I was) where you're maintaining a system that uses Unbound (like pfSense), and swapping out the package for a custom build isn't particularly easy or well supported, then you're forced to use another DNS resolver altogether to get unblocked. One of the reasons I thought this would be valuable as a configuration option is that it avoids defaulting everyone to a higher count just to cover the few cases where it could help, while giving people who have identified the problem a way to work around it.

Part of this discussion has become what the value should be increased to, and while that might help for the next instance, I think it would just be kicking the can down the road. There's still a good reason to have a limit: even without cycles in the resolution, it would be possible to create a domain that DoSes a resolver by walking it through an absurdly long CNAME chain. But making that limit overridable in config would have allowed me to be unblocked while waiting for the service to be fixed.

I'd very much like #461 to be adopted. All of the defaults could remain in place so that existing deployments see no new behavior, but the next time I get into this state because someone set up an unusual DNS configuration (very easy to do accidentally) and only tested it in a few environments, I can still unblock myself.

@jkugler

jkugler commented Jun 22, 2021

@vcunat Some CDN systems will configure a very long CNAME chain as part of their normal operation. See the LinkedIn example above.

@Arnavion

(For anyone who was tracking this issue because of logincdn.msauth.net specifically, that one's chain has become short enough to work now.)

logincdn.msauth.net.    263     IN      CNAME   lgincdn.trafficmanager.net.
lgincdn.trafficmanager.net. 29  IN      CNAME   lgincdnmsftuswe2.azureedge.net.
lgincdnmsftuswe2.azureedge.net. 1763 IN CNAME   lgincdnmsftuswe2.afd.azureedge.net.
lgincdnmsftuswe2.afd.azureedge.net. 263 IN CNAME firstparty-azurefd-prod.trafficmanager.net.
firstparty-azurefd-prod.trafficmanager.net. 0 IN CNAME dual.part-0042.t-0009.t-msedge.net.
dual.part-0042.t-0009.t-msedge.net. 203 IN CNAME part-0042.t-0009.t-msedge.net.
part-0042.t-0009.t-msedge.net. 203 IN   A       13.107.246.70
part-0042.t-0009.t-msedge.net. 203 IN   A       13.107.213.70

@coddec

coddec commented Jul 21, 2021

This issue is very annoying. At least make it possible to change the value in configuration rather than hardcoding it.

Yes, a long chain of CNAMEs is probably not a good design, but we can't simply go to Akamai, Microsoft, Apple, or other CDN and cloud companies and say, "Hey, my software is not compatible with your design, can you change your design?"

Chances are the answer will be: "No. Why is everyone else fine and only you are complaining? We will not change our design for a minority of users who are having issues with their own software. Just switch to other software that works."

So, please make this a configuration parameter rather than hardcoded.

gthess added a commit that referenced this issue Aug 4, 2021
@gthess
Member

gthess commented Aug 4, 2021

Hi all,

Thanks for your input.
For now we are bumping the value to 11 from the previous 8 for the next version. This will allow both examples given here (one of them is already fixed in DNS) to work.

Since this is the first time we get something similar regarding CNAME chain length, we are reluctant to jump to a configuration option at this moment.

@gthess gthess closed this as completed Aug 4, 2021
@jkugler

jkugler commented Aug 5, 2021

I'm not sure which one you are referencing as fixed, but for me, the LinkedIn one (static-exp1.licdn.com) will fluctuate between a 3-long chain and an 8-long chain, meaning sometimes I can access LinkedIn and sometimes I can't. :)

@dkstringer

dkstringer commented Nov 4, 2021

I had the same issue today in pfSense 2.5.2 with the Microsoft FQDN content.powerapps.com.
My ISP's (Zen Internet) DNS server returned a response with 9 aliases listed (3 of them were internal-looking fallback edge server aliases). Google DNS only listed 6 aliases for the same FQDN. So I had to repoint Unbound at Google DNS with forwarding enabled to fix the issue.
The more I think about it, why doesn't Unbound just give up on recursively following all aliases when large lists are returned? The excess aliases don't really need to be resolved, as many of them may well not be that useful for public access.
