Cilium dropping IPIP packets w/ unknown drop reason of 119 #32473

Open
tehnerd opened this issue May 10, 2024 · 16 comments
Labels
  • info-completed: The GH issue has received a reply from the author
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
  • needs/triage: This issue requires triaging to establish severity and next steps.
  • sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

tehnerd commented May 10, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Cilium is dropping packets with an unknown drop reason.
Expected behavior: not error code 119, but a meaningful drop reason (e.g. one indicating a misconfiguration).

Cilium Version

Client: 1.15.1 a368c8f 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64
Daemon: 1.15.1 a368c8f 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64

Kernel Version

Linux dfw5a-rg19-9b 5.15.0-73-generic #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Client Version: v1.28.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5

Regression

No response

Sysdump

No response

Relevant log output

xx drop (119, 0) flow 0x94b1cf61 to endpoint 2125, ifindex 34, file bpf_lxc.c:251, , identity world->10294: 10.80.84.41:28757 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0x8a358f62 to endpoint 1349, ifindex 33, file bpf_lxc.c:251, , identity world->29312: 10.80.84.41:26331 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0xdcd19bbf to endpoint 2125, ifindex 34, file bpf_lxc.c:251, , identity world->10294: 10.80.82.54:16255 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0xc255dbbc to endpoint 1349, ifindex 33, file bpf_lxc.c:251, , identity world->29312: 10.80.82.54:16167 -> 10.220.23.10:3991 tcp SYN
xx drop (119, 0) flow 0xff1a3516 to endpoint 3503, ifindex 32, file bpf_lxc.c:251, , identity world->32410: 10.80.107.38:16053 -> 10.220.23.9:3991 tcp SYN

Anything else?

Environment where it is happening:

An LB (not controlled by Cilium) is sending IPIP packets to a pod/k8s cluster where Cilium is installed with the default configuration. The flows in the logs above (e.g. 10.80.107.38:xxx -> 10.220.23.9:3991) come from the payload of the IPIP packets (i.e. the inner packets).

It feels like the drop happens somewhere around here:
https://github.com/cilium/cilium/blob/v1.15.1/bpf/bpf_lxc.c#L283
https://github.com/cilium/cilium/blob/v1.15.1/bpf/lib/conntrack.h#L884
https://github.com/cilium/cilium/blob/v1.15.1/bpf/lib/conntrack.h#L715

ct_extract_ports4 does not have a case for IPIP, and 119 is 256 - 137, i.e. DROP_CT_UNKNOWN_PROTO (137) wrapped around into an unsigned value, but so far I have failed to find how/where this gets miscalculated.
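As an illustration of that arithmetic (my own standalone sketch, assuming the negative drop code gets squeezed into an unsigned 8-bit field somewhere on the way to the monitor):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int ret = -137;                /* DROP_CT_UNKNOWN_PROTO as a negative return code */
	uint8_t reason = (uint8_t)ret; /* two's complement wraps modulo 256 */
	printf("%d\n", reason);        /* prints 119, i.e. 256 - 137 */
	return 0;
}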

Also, in general it is unclear why the logs show a line for the inner flow while ct_lookup is apparently done against the outer IPIP packet (this is a theory; unfortunately, even with debug-verbose datapath there are zero log lines related to this).

Does Cilium even support passing IPIP from an external load balancer (e.g. IPVS)?

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@tehnerd tehnerd added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 10, 2024

squeed commented May 14, 2024

@tehnerd interesting! Would you be able to capture a pwru trace of an affected packet?

@squeed squeed added the need-more-info More information is required to further debug or fix the issue. label May 14, 2024

tehnerd commented May 14, 2024

Yes. How do I do that? I actually have a repro in a dev environment, so I can collect any debug info required.


tehnerd commented May 14, 2024

Oh, never mind. I missed that this is a link (I'm on mobile). Will try to do it today.

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels May 14, 2024

tehnerd commented May 14, 2024

I haven't run pwru yet, but I've confirmed that the drop is indeed in
https://github.com/cilium/cilium/blob/v1.15.1/bpf/lib/conntrack.h#L761

I added a printk there:

default:
		printk("drop ct unknown proto\n");
		/* Can't handle extension headers yet */
		return DROP_CT_UNKNOWN_PROTO;

and in the bpf trace log:

           gping-202700  [001] b.s1.  7402.652876: bpf_trace_printk: in ct extract ports

           gping-202700  [001] b.s1.  7402.652894: bpf_trace_printk: drop ct unknown proto

           gping-202700  [001] b.s1.  7402.652896: bpf_trace_printk: sending drop notification

(The tests are running against the latest commit on GitHub.)


tehnerd commented May 14, 2024

@squeed pwru output:

tehnerd:~/gh/cilium$ sudo ../pwru/pwru 'proto 4'
2024/05/14 17:13:29 Attaching kprobes (via kprobe-multi)...
1554 / 1554 [-----------------------------------------------------------------------------------] 100.00% ? p/s
2024/05/14 17:13:29 Attached (ignored 0)
2024/05/14 17:13:29 Listening for events..
               SKB    CPU          PROCESS                     FUNC
0xffff9f7db6178200      4   [gping:223270]     packet_parse_headers
0xffff9f7db6178200      4   [gping:223270]              packet_xmit
0xffff9f7db6178200      4   [gping:223270]         __dev_queue_xmit
0xffff9f7db6178200      4   [gping:223270]       qdisc_pkt_len_init
0xffff9f7db6178200      4   [gping:223270]      netdev_core_pick_tx
0xffff9f7db6178200      4   [gping:223270]        validate_xmit_skb
0xffff9f7db6178200      4   [gping:223270]       netif_skb_features
0xffff9f7db6178200      4   [gping:223270]  passthru_features_check
0xffff9f7db6178200      4   [gping:223270]     skb_network_protocol
0xffff9f7db6178200      4   [gping:223270]       validate_xmit_xfrm
0xffff9f7db6178200      4   [gping:223270]      dev_hard_start_xmit
0xffff9f7db6178200      4   [gping:223270]       dev_queue_xmit_nit
0xffff9f7db6178200      4   [gping:223270]                 skb_pull
0xffff9f7db6178200      4   [gping:223270]             nf_hook_slow
0xffff9f7db6178200      4   [gping:223270]                 skb_push
0xffff9f7db6178200      4   [gping:223270]         __dev_queue_xmit
0xffff9f7db6178200      4   [gping:223270]       qdisc_pkt_len_init
0xffff9f7db6178200      4   [gping:223270]      netdev_core_pick_tx
0xffff9f7db6178200      4   [gping:223270]        validate_xmit_skb
0xffff9f7db6178200      4   [gping:223270]       netif_skb_features
0xffff9f7db6178200      4   [gping:223270]  passthru_features_check
0xffff9f7db6178200      4   [gping:223270]     skb_network_protocol
0xffff9f7db6178200      4   [gping:223270]       validate_xmit_xfrm
0xffff9f7db6178200      4   [gping:223270]      dev_hard_start_xmit
0xffff9f7db6178200      4   [gping:223270]   skb_clone_tx_timestamp
0xffff9f7db6178200      4   [gping:223270]        __dev_forward_skb
0xffff9f7db6178200      4   [gping:223270]       __dev_forward_skb2
0xffff9f7db6178200      4   [gping:223270]         skb_scrub_packet
0xffff9f7db6178200      4   [gping:223270]           eth_type_trans
0xffff9f7db6178200      4   [gping:223270]               __netif_rx
0xffff9f7db6178200      4   [gping:223270]        netif_rx_internal
0xffff9f7db6178200      4   [gping:223270]       enqueue_to_backlog
0xffff9f7db6178200      4   [gping:223270]      __netif_receive_skb
0xffff9f7db6178200      4   [gping:223270] __netif_receive_skb_one_core
0xffff9f7db6178200      4   [gping:223270]             tcf_classify
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270]                   ip_rcv
0xffff9f7db6178200      4   [gping:223270]              ip_rcv_core
0xffff9f7db6178200      4   [gping:223270]               sock_wfree
0xffff9f7db6178200      4   [gping:223270]             nf_hook_slow
0xffff9f7db6178200      4   [gping:223270]     ip_route_input_noref
0xffff9f7db6178200      4   [gping:223270]      ip_route_input_slow
0xffff9f7db6178200      4   [gping:223270]          __mkroute_input
0xffff9f7db6178200      4   [gping:223270]      fib_validate_source
0xffff9f7db6178200      4   [gping:223270]    __fib_validate_source
0xffff9f7db6178200      4   [gping:223270]               ip_forward
0xffff9f7db6178200      4   [gping:223270]             nf_hook_slow
0xffff9f7db6178200      4   [gping:223270]        ip_forward_finish
0xffff9f7db6178200      4   [gping:223270]                ip_output
0xffff9f7db6178200      4   [gping:223270]             nf_hook_slow
0xffff9f7db6178200      4   [gping:223270]    apparmor_ip_postroute
0xffff9f7db6178200      4   [gping:223270]         ip_finish_output
0xffff9f7db6178200      4   [gping:223270]       __ip_finish_output
0xffff9f7db6178200      4   [gping:223270]        ip_finish_output2
0xffff9f7db6178200      4   [gping:223270]         __dev_queue_xmit
0xffff9f7db6178200      4   [gping:223270]       qdisc_pkt_len_init
0xffff9f7db6178200      4   [gping:223270]             tcf_classify
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270]      skb_ensure_writable
0xffff9f7db6178200      4   [gping:223270] kfree_skb_reason(SKB_DROP_REASON_TC_EGRESS)
0xffff9f7db6178200      4   [gping:223270]   skb_release_head_state
0xffff9f7db6178200      4   [gping:223270]         skb_release_data
0xffff9f7db6178200      4   [gping:223270]            skb_free_head
0xffff9f7db6178200      4   [gping:223270]             kfree_skbmem
^C2024/05/14 17:13:40 Received signal, exiting program..
2024/05/14 17:13:40 Detaching kprobes...
5 / 5 [----------------------------------------------------------------------------------------] 100.00% 20 p/s
~/gh/cilium$ 


tehnerd commented May 14, 2024

This was an IPIP (IPv4-in-IPv4) packet sent from the dev server to the k8s pod running with kind on the same dev server (Cilium is installed on that cluster, with the default config as generated during make kind per the Cilium dev docs).


tehnerd commented May 14, 2024

pwru.txt
pwru with more flags:

sudo ../pwru/pwru 'proto 4' --output-tuple --output-stack --output-skb --output-meta  --output-file /tmp/pwru.txt


tehnerd commented May 14, 2024

The generated packet was:

outer destination of ipip: 10.244.1.205
inner destination of ipip: 10.244.1.205
inner source of ipip: 192.168.14.14
outer source of ipip: 10.11.12.13
sport 31337
dport 80
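For reference, a rough standalone sketch of how such a packet can be generated (this is not the tool I actually used; it assumes a Linux raw socket with IP_HDRINCL and needs root, and only the addresses/ports above are taken from the real test):

#include <arpa/inet.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* RFC 1071 Internet checksum. */
static uint16_t csum(const void *buf, size_t len)
{
	const uint16_t *p = buf;
	uint32_t sum = 0;

	while (len > 1) {
		sum += *p++;
		len -= 2;
	}
	if (len)
		sum += *(const uint8_t *)p;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

int main(void)
{
	uint8_t pkt[2 * sizeof(struct iphdr) + sizeof(struct tcphdr)] = {0};
	struct iphdr *outer = (struct iphdr *)pkt;
	struct iphdr *inner = (struct iphdr *)(pkt + sizeof(*outer));
	struct tcphdr *tcp = (struct tcphdr *)(pkt + 2 * sizeof(*outer));

	/* inner TCP SYN 192.168.14.14:31337 -> 10.244.1.205:80 */
	tcp->source = htons(31337);
	tcp->dest = htons(80);
	tcp->seq = htonl(1);
	tcp->doff = sizeof(*tcp) / 4;
	tcp->syn = 1;
	tcp->window = htons(65535);

	/* inner IPv4 header, carried as the IPIP payload */
	inner->version = 4;
	inner->ihl = 5;
	inner->ttl = 64;
	inner->protocol = IPPROTO_TCP;
	inner->tot_len = htons(sizeof(*inner) + sizeof(*tcp));
	inner->saddr = inet_addr("192.168.14.14");
	inner->daddr = inet_addr("10.244.1.205");
	inner->check = csum(inner, sizeof(*inner));

	/* TCP checksum over the pseudo-header + TCP header (no payload) */
	struct {
		uint32_t saddr, daddr;
		uint8_t zero, proto;
		uint16_t len;
	} ph = { inner->saddr, inner->daddr, 0, IPPROTO_TCP, htons(sizeof(*tcp)) };
	uint8_t tmp[sizeof(ph) + sizeof(*tcp)];
	memcpy(tmp, &ph, sizeof(ph));
	memcpy(tmp + sizeof(ph), tcp, sizeof(*tcp));
	tcp->check = csum(tmp, sizeof(tmp));

	/* outer IPv4 header, protocol 4 (IPIP), 10.11.12.13 -> 10.244.1.205;
	 * with IP_HDRINCL the kernel fills in the outer IPv4 checksum (raw(7)) */
	outer->version = 4;
	outer->ihl = 5;
	outer->ttl = 64;
	outer->protocol = IPPROTO_IPIP;
	outer->tot_len = htons(sizeof(pkt));
	outer->saddr = inet_addr("10.11.12.13");
	outer->daddr = inet_addr("10.244.1.205");

	/* IPPROTO_RAW implies IP_HDRINCL: we supply the full outer header */
	int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
	if (fd < 0) {
		perror("socket");
		return 1;
	}
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = outer->daddr,
	};
	if (sendto(fd, pkt, sizeof(pkt), 0, (struct sockaddr *)&dst, sizeof(dst)) < 0)
		perror("sendto");
	close(fd);
	return 0;
}

Compile with gcc and run as root; the host needs a route towards the pod IP for the packet to reach the kind node.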


tehnerd commented May 14, 2024

So with a patch like this:

        if (ct_buffer.ret < 0)                                                  \
-               return drop_for_direction(ctx, DIR, ct_buffer.ret, ext_err);    \
+               return drop_for_direction(ctx, DIR, -ct_buffer.ret, ext_err);   \
        if (map_update_elem(&CT_TAIL_CALL_BUFFER4, &zero, &ct_buffer, 0) < 0)   \

The issue seems to be that ct_buffer.ret is an int while drop_for_direction expects an unsigned value, so we have a problem translating the two's-complement negative code to unsigned.

I've got:

xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 3294, ifindex 3, file bpf_lxc.c:248, , identity world-ipv4->50250: 192.168.14.14:31337 -> 10.244.1.205:80 tcp SYN
xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 3294, ifindex 3, file bpf_lxc.c:248, , identity world-ipv4->50250: 192.168.14.14:31337 -> 10.244.1.205:80 tcp SYN
xx drop (CT: Unknown L4 protocol) flow 0x0 to endpoint 3294, ifindex 3, file bpf_lxc.c:248, , identity world-ipv4->50250: 192.168.14.14:31337 -> 10.244.1.205:80 tcp SYN

as expected. But the question is: is there any config option for Cilium to pass IPIP (i.e. conntrack should check against the inner packet, not the outer IPIP header)? I thought it was supported.


tehnerd commented May 15, 2024

So I made this work by recalculating offsets so that the lookup is done against the inner IPv4 header and transport ports. But I have no idea what this could possibly break, so I wonder who could give us more info on how IPIP is supposed to be processed on the ingress side.


tehnerd commented May 15, 2024

The changes which made this work (for IPv4 only; this is just to continue the discussion on what to do with IPIP. Maybe there is a config option which already allows the same, i.e. to allow ingress IPIP into a pod running under Cilium?):

        __u32 zero = 0;                                                         \
-       void *map;                                                              \
-                                                                               \
+       void *map;                                                                      \
+       int off;                                                        \
        ct_state = (struct ct_state *)&ct_buffer.ct_state;                      \
        tuple = (struct ipv4_ct_tuple *)&ct_buffer.tuple;                       \
                                                                                \
        if (!revalidate_data(ctx, &data, &data_end, &ip4))                      \
                return drop_for_direction(ctx, DIR, DROP_INVALID, ext_err);     \
                                                                                \
+       off = ETH_HLEN;                                 \
        tuple->nexthdr = ip4->protocol;                                         \
+       if (tuple->nexthdr == IPPROTO_IPIP) { \
+               printk("IPIP\n"); \
+               off  = off + 20; \
+               if (!revalidate_data_l3_off(ctx, &data, &data_end, &ip4, off))  {               \
+                       printk("drop ipip with invalid size\n"); \
+                       return drop_for_direction(ctx, DIR, DROP_INVALID, ext_err);     \
+               } \
+               tuple->nexthdr = ip4->protocol;                                         \
+       } \
        tuple->daddr = ip4->daddr;                                              \
        tuple->saddr = ip4->saddr;                                              \
-       ct_buffer.l4_off = ETH_HLEN + ipv4_hdrlen(ip4);                         \
+       ct_buffer.l4_off = off + ipv4_hdrlen(ip4);                              \
                                                    
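For the sake of discussion, here is the same offset logic as a standalone userspace helper (my own sketch; it makes the same simplifying assumption as the diff above, i.e. a fixed 20-byte outer IPv4 header with no options):

#include <linux/if_ether.h>   /* ETH_HLEN */
#include <netinet/in.h>       /* IPPROTO_IPIP */
#include <netinet/ip.h>       /* struct iphdr */
#include <stdbool.h>
#include <stddef.h>

struct inner_offsets {
	size_t l3_off;          /* offset of the (inner) IPv4 header */
	size_t l4_off;          /* offset of the L4 header the CT tuple uses */
	unsigned char nexthdr;  /* protocol the CT lookup would key on */
};

/* Mirrors the patched macro: start at ETH_HLEN, and if the packet is IPIP,
 * skip a fixed 20-byte outer header and re-read the (now inner) IPv4 header. */
static bool extract_inner_offsets(const unsigned char *frame, size_t len,
				  struct inner_offsets *out)
{
	size_t off = ETH_HLEN;
	const struct iphdr *ip;

	if (len < off + sizeof(*ip))
		return false;                   /* would be DROP_INVALID */
	ip = (const struct iphdr *)(frame + off);

	if (ip->protocol == IPPROTO_IPIP) {
		off += 20;                      /* assumes no outer IP options */
		if (len < off + sizeof(*ip))
			return false;           /* "drop ipip with invalid size" */
		ip = (const struct iphdr *)(frame + off);
	}

	out->l3_off = off;
	out->nexthdr = ip->protocol;
	out->l4_off = off + ip->ihl * 4;        /* off + ipv4_hdrlen(ip4) in the patch */
	return true;
}

A proper version would presumably use ipv4_hdrlen() on the outer header as well instead of the hard-coded 20, and would need an IPv6 counterpart.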

@lmb lmb added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label May 21, 2024

oblazek commented May 21, 2024

I think @borkmann already has an implementation for this, but we have a tunnel iface on each node which we pass as --device along with eth0 😉

borkmann (Member) commented

Hi @tehnerd, great to see you here! :) Do you expect the inbound LB traffic to be terminated in the hostns of the Cilium nodes? Some time ago I added #31213, which just sets up an ipip device to do that.


tehnerd commented May 21, 2024

Hey, Daniel! In our setup each container (aka pod) has a tunl interface in its namespace, so we terminate IPIP there.

borkmann (Member) commented

> Hey, Daniel! In our setup each container (aka pod) has a tunl interface in its namespace, so we terminate IPIP there.

Ok, so that is currently not supported and Cilium needs to be extended for it. I had some old code in https://github.com/cilium/cilium/pull/30547/commits for extracting the inner tuple for the service lookup; maybe it can be of help, or the diff above can be properly cooked into a patch.


tehnerd commented May 22, 2024

Yeah. I think recalculating the offset as I proposed above seems easier, and in our internal setup it actually works as expected (at least all FW features seem to work as expected on inner packets). OK, I think I will put something together in a few weeks; I just need to run more internal tests to make sure nothing else is required.
