[BUGFIX] fix iptables rules to make them idempotent #1214
Conversation
This needs a total rework - I would suggest disabling the following (in serviceManager) until a proper fix is applied:
This is only blocking access for containers that run on the default Docker network - which Flux containers don't - so all it is doing currently is adding iptables rules every time FluxOS restarts. Ideally, it needs to be done per Docker network, so each bridge would have its own filter rules. This can be done; it just means iptables needs to be updated every time a network is created / removed. However, there is actually a much easier way to do this. I'll go with the latter for now.
Okay, I've fixed it. It now works like it should, as long as Flux continues to use 172.23.0.0/16 for app networks. Also, if someone deletes the DROP rule, it will get appended instead of inserted (which would kill DNS if the host is using the local router for DNS).

Of note, it looks like FluxOS is starting before Docker, as FluxOS inserts the Docker chain rules.

Flux logs on FluxOS restart:

```
2024-02-10T16:35:30.854Z DOCKER-USER iptables chain already created
2024-02-10T16:35:30.861Z jump to DOCKER-USER chain already enabled in iptables
2024-02-10T16:35:30.872Z Access to host from containers already removed
2024-02-10T16:35:30.883Z Access to containers from host already accepted
2024-02-10T16:35:30.895Z Access to host DNS from containers already accepted
```

iptables:

```
-A INPUT -j ufw-track-input
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
<snip>
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -s 172.23.0.0/16 -d 172.16.32.0/24 -p udp -m udp --dport 53 -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 172.16.32.0/24 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 172.16.32.0/24 -j DROP
-A DOCKER-USER -j RETURN
```
Here is the container being blocked from my router (and home network), while still able to ping google:

```
/app # ping -W 1 -c 3 google.com
PING google.com (142.250.200.46): 56 data bytes
64 bytes from 142.250.200.46: seq=0 ttl=114 time=14.068 ms
64 bytes from 142.250.200.46: seq=1 ttl=114 time=8.605 ms
64 bytes from 142.250.200.46: seq=2 ttl=114 time=12.328 ms

--- google.com ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 8.605/11.667/14.068 ms
/app # ping -W 1 -c 3 172.16.32.1
PING 172.16.32.1 (172.16.32.1): 56 data bytes

--- 172.16.32.1 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
```

Route to router on host:

```
davew@chud:~$ ip route
default via 172.16.32.1 dev ens18 proto static
172.16.32.0/24 dev ens18 proto kernel scope link src 172.16.32.12
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.23.191.0/24 dev br-72d1725e481c proto kernel scope link src 172.23.191.1
172.23.210.0/24 dev br-048fde111132 proto kernel scope link src 172.23.210.1
```

Ping from host to container:

```
davew@chud:~$ ping 172.23.191.2
PING 172.23.191.2 (172.23.191.2) 56(84) bytes of data.
64 bytes from 172.23.191.2: icmp_seq=1 ttl=64 time=0.131 ms
64 bytes from 172.23.191.2: icmp_seq=2 ttl=64 time=0.042 ms
^C
--- 172.23.191.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1020ms
rtt min/avg/max/mdev = 0.042/0.086/0.131/0.044 ms
davew@chud:~$
```

IP address of container:

```
/app # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
8: eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:17:bf:02 brd ff:ff:ff:ff:ff:ff
    inet 172.23.191.2/24 brd 172.23.191.255 scope global eth0
       valid_lft forever preferred_lft forever
/app #
```
I'll do a full app test tomorrow - unless someone else wants to verify.
I got rid of all the shell wizardry, in favor of using the
```js
const fluxSrc = '172.23.0.0/16';
```
Each app creates its own network using:

```js
const fluxNetworkOptions = {
  Name: `fluxDockerNetwork_${appname}`,
  IPAM: {
    Config: [{
      Subnet: `172.23.${number}.0/24`,
      Gateway: `172.23.${number}.1`,
    }],
  },
};
```

That's why I don't understand this /16.
This is network subnetting. Flux docker networks allocate out of the 172.23.0.0/16 range. I.e., like you say, a Flux app docker network will have any /24 that is 172.23.x.0. These are all encapsulated by the /16.
172.23.0.0/16 is a "supernet" of the 172.23.x.0/24 networks.
What this means is that any Flux network is included in the 172.23.0.0/16 subnet, which is why we can use it as the source address in the iptables rules. It would be better to do this per Docker interface (br interface), but that is considerably more work.
It's just saying, "I want to match any address that falls in this range"
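The supernet relationship can be sketched with a quick membership check (illustrative code only, not part of this PR; the addresses are taken from the examples earlier in this thread):

```js
// Sketch: why 172.23.0.0/16 matches every per-app 172.23.x.0/24 network.
// inCidr tests whether an IPv4 address falls inside a CIDR block by
// comparing the network-prefix bits as 32-bit integers.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

function inCidr(ip, cidr) {
  const [base, bits] = cidr.split('/');
  // Build the netmask from the prefix length, e.g. /16 -> 0xFFFF0000.
  const mask = bits === '0' ? 0 : (~0 << (32 - Number(bits))) >>> 0;
  return (ipToInt(ip) & mask) === (ipToInt(base) & mask);
}

console.log(inCidr('172.23.191.2', '172.23.0.0/16')); // app network address: true
console.log(inCidr('172.16.32.1', '172.23.0.0/16'));  // host LAN gateway: false
```

Any app network address matches the /16, while the host LAN does not, so a single source match covers every Flux bridge.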
PS - I'll do up some tests for this after lunch (I'm on GMT).
Thinking more about this - we can do this a lot easier. It's not really the host network we're trying to block here - it's container access to non-routable address space (except the Flux container network, DNS, and established connections). I.e. we should be blocking all RFC1918 private address space. There is no legitimate reason for a container to connect to this address space. RFC1918 address space is as follows:

- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16

Blocking the host network doesn't remedy the problem 100% - the node provider may have other private networks that they are routing, which would still be accessible. Granted, it would be hard for a container to know about them, but a determined operator could find them. Also, there is no guarantee that the host only has one default gateway, which makes things unpredictable. I propose the following:

- For each RFC1918 network: accept access to Flux containers, accept DNS and established / related return traffic from Flux containers, and drop all other access from Flux containers.

Doing this means we can eliminate the requirement to find the default gateway, and the shenanigans involved in determining the host net.
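The proposal above could generate a rule set like this (a sketch only; the variable and function names are assumptions, not the PR's actual code, but the rule specs mirror the ruleset shown later in this thread):

```js
// Sketch: build the three iptables rule specs (DNS accept, established
// accept, drop) for each RFC1918 destination. fluxSrc is the Flux app
// supernet, so one source match covers every app bridge.
const fluxSrc = '172.23.0.0/16';
const rfc1918 = ['10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16'];

function rulesFor(dst) {
  return [
    // Allow DNS lookups to private resolvers.
    `-s ${fluxSrc} -d ${dst} -p udp -m udp --dport 53 -j ACCEPT`,
    // Allow return traffic for connections initiated from the private side.
    `-s ${fluxSrc} -d ${dst} -m state --state RELATED,ESTABLISHED -j ACCEPT`,
    // Drop everything else from containers into private space.
    `-s ${fluxSrc} -d ${dst} -j DROP`,
  ];
}

const ruleSpecs = rfc1918.flatMap(rulesFor);
ruleSpecs.forEach((r) => console.log(`-A DOCKER-USER ${r}`));
```

Three networks times three rules gives the nine DOCKER-USER entries, with one DROP per private range.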
I've updated it - no longer reliant on the

First run:

```
davew@chud:~/zelflux$ node test_newcmd.js
IPTABLES: DOCKER-USER chain created
IPTABLES: New rule in FORWARD inserted to jump to DOCKER-USER chain
IPTABLES: Access to Flux containers from 10.0.0.0/8 accepted
IPTABLES: DNS access to 10.0.0.0/8 from Flux containers accepted
IPTABLES: Access to 10.0.0.0/8 from Flux containers removed
IPTABLES: Access to Flux containers from 172.16.0.0/12 accepted
IPTABLES: DNS access to 172.16.0.0/12 from Flux containers accepted
IPTABLES: Access to 172.16.0.0/12 from Flux containers removed
IPTABLES: Access to Flux containers from 192.168.0.0/16 accepted
IPTABLES: DNS access to 192.168.0.0/16 from Flux containers accepted
IPTABLES: Access to 192.168.0.0/16 from Flux containers removed
```

Second run:

```
davew@chud:~/zelflux$ node test_newcmd.js
IPTABLES: DOCKER-USER chain already created
IPTABLES: Jump to DOCKER-USER chain already enabled
IPTABLES: Access to Flux containers from 10.0.0.0/8 already accepted
IPTABLES: DNS access to 10.0.0.0/8 from Flux containers already accepted
IPTABLES: Access to 10.0.0.0/8 from Flux containers already removed
IPTABLES: Access to Flux containers from 172.16.0.0/12 already accepted
IPTABLES: DNS access to 172.16.0.0/12 from Flux containers already accepted
IPTABLES: Access to 172.16.0.0/12 from Flux containers already removed
IPTABLES: Access to Flux containers from 192.168.0.0/16 already accepted
IPTABLES: DNS access to 192.168.0.0/16 from Flux containers already accepted
IPTABLES: Access to 192.168.0.0/16 from Flux containers already removed
```

I'll do up some tests now, then throw it on a node for e2e.
I think it also needs to be made clear to node operators, in whatever documentation they use, that the 172.23.0.0/16 network is reserved for Flux use only (and has been since before this pull).
I need to check access to the 172.23.0.0/16 network - in theory I don't think it's needed, as you're using layer 2 - you don't need to cross the bridge (and iptables) to route within your own network. Will test a bit later.
I was wrong here - iptables was dropping this traffic. Have added an allow rule for this. Also had to remove Docker's explicit

This solution isn't quite 100% - it still allows traffic between Flux applications (e.g. app1 <-> app2), which should be denied. So if that's the case, this would need a rework. Which I can do... and it probably makes sense in the long run to do it properly. However, I think this is good enough for the meantime.

Due to these changes, 3 of the tests are now failing. I will fix these up tomorrow morning.
After reflection, I'm not happy with how this works. Going to do it properly today; will have it done this evening (before tomorrow's release).
Total refactor. Works 100% now.

Here is an example of the ruleset that gets created (this host has one Flux app network bridge on it):

```
-A DOCKER-USER -s 172.23.0.0/16 -d 192.168.0.0/16 -p udp -m udp --dport 53 -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 192.168.0.0/16 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 172.16.0.0/12 -p udp -m udp --dport 53 -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 172.16.0.0/12 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 10.0.0.0/8 -p udp -m udp --dport 53 -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 10.0.0.0/8 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A DOCKER-USER -i docker0 -o docker0 -j ACCEPT
-A DOCKER-USER -i br-a9ebf17b6e8d -o br-a9ebf17b6e8d -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 10.0.0.0/8 -j DROP
-A DOCKER-USER -s 172.23.0.0/16 -d 172.16.0.0/12 -j DROP
-A DOCKER-USER -s 172.23.0.0/16 -d 192.168.0.0/16 -j DROP
-A DOCKER-USER -j RETURN
```

I haven't done e2e yet - and all the tests will be broken. Have to head out for a few hours - will fix up after that.
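The per-bridge ACCEPT pairs in that ruleset are what keep each app network isolated. A minimal sketch of how they could be generated (illustrative only; the bridge names are just examples from this thread, not discovered at runtime here):

```js
// Sketch: one intra-network ACCEPT per Docker bridge. Because -i and -o
// must match the SAME bridge, traffic between two different app networks
// (app1 <-> app2) does not match these rules and falls through to the
// RFC1918 DROP rules instead.
const bridges = ['docker0', 'br-a9ebf17b6e8d'];

const intraNetworkRules = bridges.map(
  (br) => `-A DOCKER-USER -i ${br} -o ${br} -j ACCEPT`,
);

intraNetworkRules.forEach((r) => console.log(r));
```

In a real implementation the bridge list would be refreshed whenever a Flux app network is created or removed.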
Live testing on node. Ruleset added:

```
-A DOCKER-USER -s 172.23.0.0/16 -d 192.168.0.0/16 -p udp -m udp --dport 53 -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 192.168.0.0/16 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 172.16.0.0/12 -p udp -m udp --dport 53 -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 172.16.0.0/12 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 10.0.0.0/8 -p udp -m udp --dport 53 -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 10.0.0.0/8 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A DOCKER-USER -i docker0 -o docker0 -j ACCEPT
-A DOCKER-USER -i br-098bac43a7f1 -o br-098bac43a7f1 -j ACCEPT
-A DOCKER-USER -i br-aaf87aa57b20 -o br-aaf87aa57b20 -j ACCEPT
-A DOCKER-USER -s 172.23.0.0/16 -d 10.0.0.0/8 -j DROP
-A DOCKER-USER -s 172.23.0.0/16 -d 172.16.0.0/12 -j DROP
-A DOCKER-USER -s 172.23.0.0/16 -d 192.168.0.0/16 -j DROP
-A DOCKER-USER -j RETURN
```

Active Docker networks (these match the rules above):

```
davew@chud:~$ docker network ls
NETWORK ID     NAME                       DRIVER    SCOPE
9d00ff270b68   bridge                     bridge    local
098bac43a7f1   fluxDockerNetwork_beaver   bridge    local
aaf87aa57b20   fluxDockerNetwork_chud     bridge    local
9728c8571954   host                       host      local
c50d38a94e0e   none                       null      local
```

Here is the access being removed on a new app deploy:

I've tested intra-network communications - all work fine. Have tested a live app from the internet back in - works fine. I'm pretty much done here I think.
Yup - finished here. Happy to run someone through it... but ready for UA. Cheers.
```
NODE_CONFIG_DIR=./ZelBack/config npx mocha -g "remove flux container access" tests/unit/fluxNetworkHelper.test.js

  fluxNetworkHelper tests
    remove flux container access to private address space tests
      ✔ should add the DOCKER-USER chain to iptables if it is missing
      ✔ should skip addding the DOCKER-USER chain to iptables if it already exists
      ✔ should bail out if there is an error addding the DOCKER-USER chain to iptables
      ✔ should add the jump to DOCKER-USER chain from FORWARD chain to iptables if it is missing
      ✔ should skip adding the jump to DOCKER-USER chain from FORWARD chain to iptables if it already exists
      ✔ should bail out if there is an error addding the DOCKER-USER chain to iptables
      ✔ should flush the DOCKER-USER chain
      ✔ should bail if there is an error flushing the DOCKER-USER chain
      ✔ should add two allow and one drop rule for each private network
      ✔ should add an allow for intra-network traffic per docker network
      ✔ should bail out as soon as a rule errors out

  11 passing (50ms)
```
Great work, I will start testing it after Tuesday's FluxOS release.
```js
if (dockerUserChainExists) log.info('IPTABLES: DOCKER-USER chain already created');
```

```js
const checkJumpToDockerChain = await cmdAsync(checkJumpChain).catch(async (checkErr) => {
  if (checkErr.message.includes('Bad rule')) {
```
Is the error returned always in English? We have many FluxNodes installed with different OS languages.
Good point - I'll check
Looking at the source for iptables 1.8.7 (installed on 22.04), the error responses are hardcoded English:

```c
{ nft_chain_user_add, EEXIST, "Chain already exists" },
{ nft_chain_user_rename, EEXIST, "File exists" },
{ nft_rule_insert, E2BIG, "Index of insertion too big" },
{ nft_rule_check, ENOENT, "Bad rule (does a matching rule exist in that chain?)" },
{ nft_rule_replace, E2BIG, "Index of replacement too big" },
{ nft_rule_delete_num, E2BIG, "Index of deletion too big" },
/* { TC_READ_COUNTER, E2BIG, "Index of counter too big" },
   { TC_ZERO_COUNTER, E2BIG, "Index of counter too big" }, */
/* ENOENT for DELETE probably means no matching rule */
{ nft_rule_delete, ENOENT,
  "Bad rule (does a matching rule exist in that chain?)" },
{ nft_chain_set, ENOENT, "Bad built-in chain name" },
```

However, just checking on a 20.04 box, it's using iptables 1.8.4 (legacy), which has a different response. I will fix this up so that it just looks for an error and doesn't read the text. Thanks!
Fixed
```js
const giveContainerAccessToDNS = baseAllowDnsCmd.replace('#DST', network);
```
Shouldn't this be `/#DST/g`?
It could be, but there is only one `#DST`, so I just went with the simpler non-regex replace. The ones with the regex replace twice.
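For anyone following along, the difference is that a string pattern to `String.prototype.replace` substitutes only the first occurrence, while a `/g` regex replaces them all (the command template below is illustrative, not the exact one from the PR):

```js
// With a single #DST placeholder, the plain string replace is enough.
const baseAllowDnsCmd = 'iptables -I DOCKER-USER -s 172.23.0.0/16 -d #DST -p udp --dport 53 -j ACCEPT';
const cmd = baseAllowDnsCmd.replace('#DST', '10.0.0.0/8');
// cmd: 'iptables -I DOCKER-USER -s 172.23.0.0/16 -d 10.0.0.0/8 -p udp --dport 53 -j ACCEPT'

// A template with TWO placeholders would need the global regex form:
const both = '#DST to #DST'.replace(/#DST/g, '10.0.0.0/8');  // '10.0.0.0/8 to 10.0.0.0/8'
const first = '#DST to #DST'.replace('#DST', '10.0.0.0/8');  // '10.0.0.0/8 to #DST'
```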
```js
// This always gets appended, so the drop is at the end
const dropAccessToHostNetwork = baseDropCmd.replace('#DST', network);
```
Shouldn't this be `/#DST/g`?
It could be, but there is only one `#DST`, so I just went with the simpler non-regex replace. The ones with the regex replace twice.
This commit now blocks 100% of access to private address space, while maintaining isolation for each Flux docker network. Now apps can be sure no other app is snooping their traffic, and operators can be sure that apps do not have access to ANY private network they are routing. Tests will all be broken - I'll fix up in next commit.
Older iptables (legacy) on Ubuntu 20.04 operates slightly differently from the nf_tables version: the check command doesn't return any output. Some of the output strings are different too, so we don't check those anymore. Have also added a check to make sure the iptables binary is in the root user's path.
Not sure what your workflow is, usually I would branch off master for a bugfix and merge back into master (and rebase or merge on development) but have branched off develop here as I believe you will merge this into next release.
Fixes #1213
Regarding the existing extra rules, I think they're okay to leave; they will get removed by attrition over time as nodes reboot etc. (iptables rules get flushed on reboot). It's non-trivial to remove them manually, as there are rules in between. A new rule only gets added when the FluxOS service is restarted - I only noticed because I'm doing testing on one of my nodes for some other work, and have restarted many times.
I've updated this function to include a check for each iptables command, using the `-C` option. There was also a problem where, if someone deleted one rule, it was impossible to know where to insert new rules, as you can't have the DROP rule first.
Have implemented a FLUX chain, jumped to immediately from the FORWARD chain. This way, I can INSERT the ACCEPT rules and APPEND the DROP rule. Of note, once the end of the FLUX chain is reached, there is an implicit RETURN back to the FORWARD chain, and rule evaluation continues if no match is found.
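A sketch of that insert/append strategy (the command strings and helper name are assumptions, not the PR's actual code):

```js
// Sketch: with a dedicated FLUX chain, each rule can be (re)added
// idempotently. ACCEPTs are -I (inserted at the top) and the DROP is -A
// (appended), so the DROP always evaluates last no matter what order the
// rules are re-added in. Falling off the end of FLUX implicitly RETURNs
// to FORWARD.
function fluxChainCommands(dst, fluxSrc = '172.23.0.0/16') {
  return [
    // Create the chain if missing; -N fails harmlessly when it exists.
    'iptables -N FLUX || true',
    // Jump to FLUX from FORWARD, only if the jump isn't already there.
    'iptables -C FORWARD -j FLUX || iptables -I FORWARD -j FLUX',
    // ACCEPT rules get INSERTed...
    `iptables -C FLUX -s ${fluxSrc} -d ${dst} -p udp --dport 53 -j ACCEPT`
      + ` || iptables -I FLUX -s ${fluxSrc} -d ${dst} -p udp --dport 53 -j ACCEPT`,
    // ...and the DROP gets APPENDed so it stays at the bottom.
    `iptables -C FLUX -s ${fluxSrc} -d ${dst} -j DROP`
      + ` || iptables -A FLUX -s ${fluxSrc} -d ${dst} -j DROP`,
  ];
}

fluxChainCommands('192.168.0.0/16').forEach((c) => console.log(c));
```

The `-C || -I/-A` pattern is what makes re-running the whole sequence on every FluxOS restart safe: existing rules are detected and skipped instead of duplicated.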
I've only done quick testing on this (I'm pretty sure it's good) by running the function on one of my nodes. I have to pop out for a couple of hours and will deploy it properly on a node when I get back.