Implement Iptables based Proxy #9210
Conversation
This has a long way to go before it should be merged, but I wanted to get an issue/PR up tracking the state of it, since otherwise I've been pretty quietly working on this. I'd also like to get some eyes on it just to verify that it's headed in the right direction. Vagrant/e2e haven't been playing very nicely [bugging out during validation mostly; occasionally the vbox kext has even crashed my laptop(!)], so testing this has been very slow. Edit: It looks like I've also fallen ~300 commits behind master; I'll be rebasing soon to catch up. /cc @thockin
Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist") If this message is too spammy, please complain to ixdy.
I'm not sure what happened to the commit history here; it looks like I messed up a rebase or something.
I've pushed some smaller PRs working on breaking down some of the things that this will eventually rely on. I'll clean it and the commit history up, and try to finish debugging/vetting the iptables rules once #9563 lands.
@BenTheElder Thanks for your comments on #7245. Two things worth drawing your attention to:

Firstly, in Calico, instead of attaching the container veth to a local bridge, we leave the host end in the root namespace and use Linux routing instead of bridging. We don't assign an IP address to the veth in the root namespace (only the container namespace). In the current userspace proxy, we find that

Secondly, it is highly desirable for the traffic that arrives at the service instance pod to have the correct source IP (and source port, although in practice port is less important than IP) from the pod that is connecting to the service (rather than, say, an IP associated with the pod host). This is because Calico uses iptables rules to enforce network policy. Kubernetes doesn't support defining network policy, but will be doing so very soon with the introduction of namespaces, so it's important we get the proxy working in concert with those efforts.

Hopefully that makes sense! I'm happy to discuss further if you want clarification or detail.
@spikecurtis Thanks. I believe the source IP should be preserved correctly, as I'm only using DNAT otherwise; no

Other than that, I'm currently adding/maintaining the rules for connecting
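To make the REDIRECT/DNAT distinction concrete, here is a sketch (made-up addresses and ports, not rules from this PR): REDIRECT hands the connection to a local proxy process, so the backend later sees a node-local source address, while DNAT rewrites only the destination and leaves the pod's source IP intact.

```go
// Sketch: the two nat-table styles under discussion. Hypothetical service
// VIP 10.0.0.10:80, endpoint 10.244.1.5:80, proxy port 40000.
package main

import "os/exec"

func run(args ...string) {
	if err := exec.Command("iptables", args...).Run(); err != nil {
		panic(err)
	}
}

func main() {
	// Userspace-proxy style: REDIRECT to a local proxy port; the proxy then
	// dials the backend itself, so the backend sees the node as the source.
	run("-t", "nat", "-A", "PREROUTING", "-d", "10.0.0.10/32", "-p", "tcp",
		"--dport", "80", "-j", "REDIRECT", "--to-ports", "40000")

	// Iptables-proxy style: DNAT straight to an endpoint; only the
	// destination is rewritten, so the pod's source IP is preserved.
	run("-t", "nat", "-A", "PREROUTING", "-d", "10.0.0.10/32", "-p", "tcp",
		"--dport", "80", "-j", "DNAT", "--to-destination", "10.244.1.5:80")
}
```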
How easy will it be to automate the removal of the new iptables rules if kube-proxy is terminated and k8s is uninstalled from a node?
@jdef Well, all of the chains are known or have a known prefix; we have 4 chains that always exist (
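That known-prefix property is what would make automated teardown straightforward. A minimal sketch of the idea, assuming a "KUBE-"-style prefix and the nat table (this is not the PR's actual cleanup code): list the chains from iptables-save, flush every matching chain, then delete them. A real cleanup would also need to remove any jump rules installed in the built-in chains (PREROUTING, OUTPUT, ...) before the deletes can succeed.

```go
// Sketch: tear down all nat chains whose names carry an assumed prefix.
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("iptables-save", "-t", "nat").Output()
	if err != nil {
		panic(err)
	}
	var chains []string
	sc := bufio.NewScanner(strings.NewReader(string(out)))
	for sc.Scan() {
		// Chain declarations look like ":KUBE-FOO - [0:0]".
		if line := sc.Text(); strings.HasPrefix(line, ":KUBE-") {
			chains = append(chains, strings.Fields(line[1:])[0])
		}
	}
	// Flush every chain first so they no longer reference each other,
	// then delete them (jump rules from built-in chains must already be gone).
	for _, c := range chains {
		exec.Command("iptables", "-t", "nat", "-F", c).Run()
	}
	for _, c := range chains {
		exec.Command("iptables", "-t", "nat", "-X", c).Run()
	}
	fmt.Printf("removed %d chains\n", len(chains))
}
```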
The history should be cleaned up now and unified with the other PRs. I've just finished running a full e2e before pushing this and I see 13 failures (output: https://gist.github.com/BenTheElder/f8d7f70fa5bd45f9ad06#file-e2e-snippet-L5130); however, as noted in #9681, vagrant + e2e has numerous failures on master currently.
That last commit (54eae42) also fixes an odd bug: while running e2e I saw instances where the port allocator was returning 0, which appears to be the default behavior for one without a configured port-range(?). Iptables of course won't let you use 0 for your ports, so I have a fallback routine that attempts to get a port assigned randomly by the OS: it opens a socket with port 0, reads back the actual port, closes the socket, and returns that port.
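The fallback amounts to something like the following (a minimal sketch; `getRandomPort` is a hypothetical name, not the PR's helper):

```go
// Sketch: ask the kernel for a free TCP port by binding port 0, read back
// the port it actually assigned, then release it for the caller to use.
package main

import (
	"fmt"
	"net"
)

func getRandomPort() (int, error) {
	l, err := net.Listen("tcp", ":0") // port 0 = "OS, pick one for me"
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := getRandomPort()
	if err != nil {
		panic(err)
	}
	fmt.Println("allocated port:", port)
}
```

Note the inherent race: the port is free when probed but could be taken by another process before it's used, which is acceptable for a best-effort fallback.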
Pretty typical rules on a minion from e2e currently look like:
As far as I can tell, the service and endpoint monitoring should be working 100% and the chain tracking works. I'll be working on those tests today, but I'd appreciate any help reviewing the rule generation.
@BenTheElder The person on my team I would most like to look at this is @Symmetric, but he is on vacation. He's back in the office Tuesday next week (23rd). He should be able to look at it then. CC'ing @fasaxc in case he gets a chance to look at this later today too.
Thanks!
I'd be happy to help with testing this with my application if that would be useful.
@statik That would be useful. I'm currently testing in a vagrant test cluster, on which a number of tests are always failing even on master, so it's been difficult to finish validating it.
Rebased to current master.
@BenTheElder I've just started giving this another look. I'll apply our patches on top of the latest code here and check that the iptables rules look correct. Should (hopefully) have something tomorrow; I'll let you know how I get on.
Thanks!
Hi @BenTheElder, I gave your PR a spin, and I'm seeing some unexpected rules; it looks like services (e.g. kube-dns) are now getting the old REDIRECT rules instead of DNAT:
I'm on iptables 1.4.21. The `ShouldUseProxierIptables` logic looks incorrect:
This will return false for 1.4.21; I think you want >= on the minor version. I checked the previous commit and the DNAT rules are getting created as expected there. I'll test a bit more with my changes merged into your HEAD^ (without the `ShouldUseProxierIptables` check) and let you know how that goes.
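The suggested fix, sketched with hypothetical names and a hypothetical 1.4.11 minimum (the PR picks its own): compare version components in order, and only let a component decide the outcome when it actually differs.

```go
// Sketch: component-wise version comparison that accepts equal minors.
package main

import "fmt"

// versionAtLeast reports whether got >= required, comparing
// [major, minor, patch] and falling through on ties.
func versionAtLeast(got, required [3]int) bool {
	for i := 0; i < 3; i++ {
		if got[i] != required[i] {
			return got[i] > required[i]
		}
	}
	return true // exactly equal
}

func main() {
	required := [3]int{1, 4, 11} // hypothetical minimum
	fmt.Println(versionAtLeast([3]int{1, 4, 21}, required)) // true
	fmt.Println(versionAtLeast([3]int{1, 4, 7}, required))  // false
	// A strict '>' on the minor component would wrongly reject 1.4.21
	// here, which matches the behavior reported above.
}
```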
LGTM
GCE e2e build/test failed for commit aa45ac255d8ea532bf9607684ec73a1b7079b968.
Reviewing log...
Edit again: it's an older commit anyhow; we should be able to just ignore this.
I figured out the LB issue; it's a GCE thing, and we have a fix in the works. Next: graphs, nodeports, and hairpins.
@ArtfulCoder you'll have to patch your external IPs work onto this, too
Also the bridge-nf module sysctl/modprobe, right?
yes - I think of what's left as the "advanced use cases" :)
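For context on that exchange: with bridged pod networking, iptables only sees bridge-forwarded traffic once bridge-nf is enabled. A sketch of what the setup amounts to (module name and sysctl path are the standard Linux ones; the error handling is illustrative):

```go
// Sketch: load br_netfilter and make bridged traffic traverse iptables.
package main

import (
	"os"
	"os/exec"
)

func main() {
	// modprobe br_netfilter (on older kernels this is part of the bridge module).
	if err := exec.Command("modprobe", "br_netfilter").Run(); err != nil {
		panic(err)
	}
	// Equivalent of: sysctl -w net.bridge.bridge-nf-call-iptables=1
	path := "/proc/sys/net/bridge/bridge-nf-call-iptables"
	if err := os.WriteFile(path, []byte("1"), 0644); err != nil {
		panic(err)
	}
}
```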
GCE e2e build/test passed for commit fc928c5f1d8ffaf4c381a3bf12684acc1d72c50c.
@thockin, which do you want to tackle first? I'll open a PR in the morning. Though I'm not sure how we want to handle nodePorts yet (obviously with some iptables rules, but...).
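Purely to make the open nodePort question concrete (the thread explicitly hasn't settled this, so everything below is speculative, including the chain name): one conceivable shape is matching traffic to a node-local address on the allocated port and jumping into the per-service chain that already balances across endpoints.

```go
// Speculative sketch: route nodePort 30080 into a hypothetical
// per-service chain KUBE-SVC-XYZ.
package main

import "os/exec"

func main() {
	args := []string{"-t", "nat", "-A", "PREROUTING",
		"-m", "addrtype", "--dst-type", "LOCAL", // node-local addresses only
		"-p", "tcp", "--dport", "30080",
		"-j", "KUBE-SVC-XYZ"}
	if err := exec.Command("iptables", args...).Run(); err != nil {
		panic(err)
	}
}
```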
GCE e2e build/test passed for commit ae569e2.
@thockin if I may, the hairpin requirement is for running an Apache Kafka cluster without having to map the advertised hostname to 127.0.0.1 in /etc/hosts, which doesn't seem like an advanced use case to me ;) It will probably surprise some users (here, literally generating more than 100MBps of "hard-to-understand-why" logs), and I think it should be fixed before integrating the iptables-proxy implementation in a release. Of course, I can understand if it shouldn't be, but I really don't know how many other use cases it may impact. At least, I give this issue a +1 in the priority vote :) In any case, of course, I'll test with the current setup (with the workaround for when hairpin is disabled).
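Background on the hairpin case being described: when a pod connects to a service VIP and the DNAT happens to pick that same pod as the backend, the reply short-circuits inside the pod and the connection fails. One common mitigation, shown as a sketch with a made-up address (not necessarily what this PR ends up doing), is to masquerade self-directed traffic so replies flow back through the network:

```go
// Sketch: masquerade hairpin traffic (pod talking to itself via a VIP).
package main

import "os/exec"

func main() {
	podIP := "10.244.1.5" // hypothetical pod that is also the chosen endpoint
	args := []string{"-t", "nat", "-A", "POSTROUTING",
		"-s", podIP + "/32", "-d", podIP + "/32", "-j", "MASQUERADE"}
	if err := exec.Command("iptables", args...).Run(); err != nil {
		panic(err)
	}
}
```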
I think by "advanced use cases" @thockin is mainly referring to things that I don't have anything to do with releases of course, but I don't believe it Thanks for testing :)
Better safe than sorry ;-)
Don't get me wrong, we know how to solve it but we wanted to get the core
Totally fine, I just wanted to express my view about this "secondary
Implement Iptables based Proxy
🎆 😄
W00t!!
I'm looking at NodePorts. I want you to focus on the test - I want to show a graph of how this is better :)
Even if the test is largely manual, it just needs to be reproducible.
Sounds good. :)
This is an implementation of #3760.

It implements an `iptables-restore` wrapper, a dummy 'loadbalancer' that serves as a `config.EndpointsConfigHandler`, and a purely iptables based version of `Proxier`, as well as a method for determining whether to use the new `ProxierIptables` or the old `Proxier` based on the `iptables` version. `kube-proxy` now autoselects between the two and uses the new `ProxierIptables` + `DummyLoadBalancerIptables` pair when possible.

TODO:
- `Proxier` and `LoadBalancer` (`ProxierIptables` and `DummyLoadBalancerIptables`)
- `kube-proxy` to select between old `Proxier` and `ProxierIptables`
- `iptables` rule generation
- `iptables` minimum version for switching over from `Proxier` to `ProxierIptables` (I've selected a version now, but we may want to change it; this probably needs some discussion)
- `iptables-save` output to extract existing chains and rules
- `iptables-save` output to preserve counters

Further TODO:
- `iptables-save` output to minimize excess rule rewriting.
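A sketch of the core mechanism the description refers to (hypothetical chain and addresses; this is not the PR's wrapper code): assemble the ruleset as `iptables-save`-format text and apply it in a single `iptables-restore` call, using `--noflush` so unrelated rules survive and `--counters` so packet/byte counts parsed from a prior `iptables-save` can be carried over in the `[pkts:bytes]` fields.

```go
// Sketch: batch-apply nat rules through iptables-restore instead of one
// iptables exec per rule.
package main

import (
	"bytes"
	"os/exec"
)

func main() {
	var b bytes.Buffer
	b.WriteString("*nat\n")
	// ":<chain> <policy> [pkts:bytes]"; "-" = no policy for user chains.
	// Real counter preservation would copy [pkts:bytes] from iptables-save.
	b.WriteString(":KUBE-SERVICES - [0:0]\n")
	// One hypothetical service rule: VIP 10.0.0.10:53 -> endpoint.
	b.WriteString("-A KUBE-SERVICES -d 10.0.0.10/32 -p tcp --dport 53 " +
		"-j DNAT --to-destination 10.244.1.5:53\n")
	b.WriteString("COMMIT\n")

	cmd := exec.Command("iptables-restore", "--noflush", "--counters")
	cmd.Stdin = &b
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```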