
Conversation

@huntergregory (Contributor) commented Nov 4, 2021

Fix #1088

Problem

In the iptables nat table, under certain conditions, KUBE-SERVICES will mark a packet sent to an LB service before DNATing it to a pod's IP. Then, in the iptables filter table, the KUBE-FORWARD chain will accept packets with this mark before they reach the AZURE-NPM chain.

As a result, NPM might never get a chance to drop packets that it should.
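For reference, the chain listings below can be reproduced on a node with something like:

# Show filter-table FORWARD rules in order, with rule numbers
iptables -t filter -L FORWARD -n -v --line-numbers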

iptables chains

filter table

FORWARD chain (as it is now)

Currently, the jump to AZURE-NPM comes after the jump to KUBE-FORWARD.

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
17386 1819K KUBE-FORWARD  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */
17129 1806K KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate NEW /* kubernetes service portals */
  845 43940 AZURE-NPM  all  --  *      *       0.0.0.0/0            0.0.0.0/0
17129 1806K KUBE-EXTERNAL-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate NEW /* kubernetes externally-visible service portals */
    0     0 DROP       tcp  --  *      *       0.0.0.0/0            168.63.129.16        tcp dpt:80

KUBE-FORWARD chain

Notice the ACCEPT on the masquerade mark (0x4000).

Chain KUBE-FORWARD (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate INVALID
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */ mark match 0x4000/0x4000
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED
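For context, the mark-based ACCEPT above corresponds to a rule of roughly this shape (a sketch of what kube-proxy programs, not its exact source):

# ACCEPT any packet whose fwmark has the 0x4000 bit set (value/mask match)
iptables -t filter -A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT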

nat table (example with an ILB service named elf/nginx-svc)

For this example, elf/nginx-svc load-balances incoming traffic on TCP port 80 (or NodePort 30525) across two pods.

KUBE-SERVICES is Kubernetes' nat chain. It is referenced in the OUTPUT and PREROUTING chains.

Depending on the source, the packet may be marked for masquerading when port 80 is used. Depending on the IP/port used to reach the service, the packet jumps to the service (SVC) chain, the firewall (FW) chain, or the nodeports chain.

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination 
      ...
    0     0 KUBE-MARK-MASQ  tcp  --  *      *      !10.240.0.0/12        ILB_CLUSTER_IP          /* elf/nginx-svc cluster IP */ tcp dpt:80
    0     0 KUBE-SVC-TQTTMDXT2ILLOFTL  tcp  --  *      *       0.0.0.0/0            ILB_CLUSTER_IP          /* elf/nginx-svc cluster IP */ tcp dpt:80
    0     0 KUBE-FW-TQTTMDXT2ILLOFTL  tcp  --  *      *       0.0.0.0/0            ILB_EXTERNAL_IP          /* elf/nginx-svc loadbalancer IP */ tcp dpt:80
      ...
 1453 79972 KUBE-NODEPORTS  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
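The per-service rules above take roughly this command form (a sketch; the cluster CIDR 10.240.0.0/12 and the placeholder IPs are from the listing):

# Mark traffic from outside the cluster CIDR for masquerade, then dispatch by cluster IP
iptables -t nat -A KUBE-SERVICES ! -s 10.240.0.0/12 -d ILB_CLUSTER_IP/32 -p tcp --dport 80 -m comment --comment "elf/nginx-svc cluster IP" -j KUBE-MARK-MASQ
iptables -t nat -A KUBE-SERVICES -d ILB_CLUSTER_IP/32 -p tcp --dport 80 -m comment --comment "elf/nginx-svc cluster IP" -j KUBE-SVC-TQTTMDXT2ILLOFTL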

Here's the mark-for-masquerade chain:

Chain KUBE-MARK-MASQ (91 references)
 pkts bytes target     prot opt in     out     source               destination         
  890 46280 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK or 0x4000
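The MARK or 0x4000 target sets the masquerade bit without clearing any other mark bits; in command form it is roughly:

# OR the 0x4000 bit into the packet's fwmark, leaving other bits untouched
iptables -t nat -A KUBE-MARK-MASQ -j MARK --or-mark 0x4000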

Here's the LB's svc chain, which DNATs to one of the pods at random. Within each endpoint (SEP) chain, the packet is first marked for masquerade (0x4000) if its source IP matches the destination pod's IP (i.e., hairpin traffic).

Chain KUBE-SVC-TQTTMDXT2ILLOFTL (3 references)
 pkts bytes target     prot opt in     out     source               destination         
  449 23348 KUBE-SEP-7NIBX3P3TLQVYL3E  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* elf/nginx-svc */ statistic mode random probability 0.50000000000
  453 23556 KUBE-SEP-KJY2B5OMZRWCDXG5  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* elf/nginx-svc */
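The 50/50 split uses the statistic match; rules of roughly this shape produce it (sketch):

# With two endpoints, the first rule matches ~50% of packets; the rest fall through
iptables -t nat -A KUBE-SVC-TQTTMDXT2ILLOFTL -m statistic --mode random --probability 0.5 -j KUBE-SEP-7NIBX3P3TLQVYL3E
iptables -t nat -A KUBE-SVC-TQTTMDXT2ILLOFTL -j KUBE-SEP-KJY2B5OMZRWCDXG5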

Chain KUBE-SEP-7NIBX3P3TLQVYL3E (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 KUBE-MARK-MASQ  all  --  *      *       POD2_IP          0.0.0.0/0            /* elf/nginx-svc */
  449 23348 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* elf/nginx-svc */ tcp to:POD2_IP:80

Chain KUBE-SEP-KJY2B5OMZRWCDXG5 (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 KUBE-MARK-MASQ  all  --  *      *       POD1_IP          0.0.0.0/0            /* elf/nginx-svc */
  453 23556 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* elf/nginx-svc */ tcp to:POD1_IP:80
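Each DNAT above corresponds to a command of roughly this form (sketch; POD1_IP is the placeholder from the listing):

# Rewrite the destination to the chosen pod endpoint
iptables -t nat -A KUBE-SEP-KJY2B5OMZRWCDXG5 -p tcp -m comment --comment "elf/nginx-svc" -j DNAT --to-destination POD1_IP:80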

If the traffic was sent to the LB's external IP, it goes through the FW chain, which marks it for masquerade before sending it to the svc chain above for DNAT.

Chain KUBE-FW-TQTTMDXT2ILLOFTL (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 KUBE-MARK-MASQ  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* elf/nginx-svc loadbalancer IP */
    0     0 KUBE-SVC-TQTTMDXT2ILLOFTL  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* elf/nginx-svc loadbalancer IP */
    0     0 KUBE-MARK-DROP  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* elf/nginx-svc loadbalancer IP */

If the nodeport was used, the packet is marked for masquerade and sent to the svc chain:

Chain KUBE-NODEPORTS (1 references)
 pkts bytes target     prot opt in     out     source               destination         
  902 46904 KUBE-MARK-MASQ  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* elf/nginx-svc */ tcp dpt:30525
  902 46904 KUBE-SVC-TQTTMDXT2ILLOFTL  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* elf/nginx-svc */ tcp dpt:30525

Solution

When a toggle is turned on (default: off), move the FORWARD chain's jump to AZURE-NPM above its jump to KUBE-FORWARD.

As a result, packets DNATed to a pod from the ILB will pass through NPM instead of being accepted beforehand.

new FORWARD chain in filter table

We add a ctstate NEW requirement to the jump to the AZURE-NPM chain, and position the jump depending on the toggle. This follows KUBE-FORWARD's practice and is necessary so that, for example, we don't deny an HTTP response to an HTTP request that we allowed.

When the toggle is set to place the AZURE-NPM chain first:

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
  845 43940 AZURE-NPM  all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate NEW
17386 1819K KUBE-FORWARD  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */
17129 1806K KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate NEW /* kubernetes service portals */
17129 1806K KUBE-EXTERNAL-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate NEW /* kubernetes externally-visible service portals */
    0     0 DROP       tcp  --  *      *       0.0.0.0/0            168.63.129.16        tcp dpt:80
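A minimal sketch of the repositioning in command form, assuming the AZURE-NPM chain already exists (the PR implements this inside NPM, not via a shell script):

# Delete the old jump (wherever it sits), then insert the new one at position 1,
# restricted to new connections so replies to allowed traffic are not re-evaluated
iptables -t filter -D FORWARD -j AZURE-NPM
iptables -t filter -I FORWARD 1 -m conntrack --ctstate NEW -j AZURE-NPM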

Keep the rest the same, including the redundant check for state RELATED/ESTABLISHED in the final ACCEPT rule of the AZURE-NPM chain.

Chain AZURE-NPM (1 references)
 pkts bytes target     prot opt in     out     source               destination         
  845 43940 AZURE-NPM-INGRESS  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
    0     0 AZURE-NPM-EGRESS  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
    0     0 AZURE-NPM-ACCEPT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x3000 /* ACCEPT-on-INGRESS-and-EGRESS-mark-0x3000 */
    0     0 AZURE-NPM-ACCEPT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x2000 /* ACCEPT-on-INGRESS-mark-0x2000 */
    0     0 AZURE-NPM-ACCEPT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x1000 /* ACCEPT-on-EGRESS-mark-0x1000 */
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED /* ACCEPT-on-connection-state */
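Per the comments in the listing, NPM's ingress and egress chains communicate verdicts through mark bits (0x2000 for ingress-allowed, 0x1000 for egress-allowed, 0x3000 for both). A hypothetical sketch of an allow rule in that scheme (illustrative only, not NPM's actual rule):

# OR in the ingress-allowed bit so AZURE-NPM can later ACCEPT on 0x2000 or 0x3000
iptables -t filter -A AZURE-NPM-INGRESS -j MARK --or-mark 0x2000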

@huntergregory (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@vakalapa (Contributor) commented Nov 9, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@huntergregory huntergregory changed the title [NPM] Fix: reposition iptables jump to AZURE-NPM chain fix: [NPM] reposition iptables jump to AZURE-NPM chain Nov 9, 2021
@huntergregory (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@vakalapa (Contributor)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@vakalapa (Contributor)

placeAzureChainFirst == true test results:

Cyclonus: https://github.com/Azure/azure-container-networking/runs/4193283318?check_suite_focus=true
Conformance: https://msazure.visualstudio.com/One/_build/results?buildId=48905776&view=results

Let's switch the flag and test with false to make sure there are no regressions.

@vakalapa (Contributor) left a comment:

Lgtm

@vakalapa (Contributor)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@huntergregory huntergregory added the npm Related to NPM. label Nov 16, 2021
@huntergregory huntergregory merged commit db3c706 into master Nov 16, 2021
@rbtr rbtr deleted the iptables-chain-placement branch November 30, 2021 20:15

Labels

npm Related to NPM.


Development

Successfully merging this pull request may close these issues.

Default deny in Azure network policy is not working with Loadbalancer service (internal/external)

4 participants