
Falco 0.37.1 modern_ebpf crashes server #3181

Open
apsega opened this issue Apr 25, 2024 · 4 comments
apsega commented Apr 25, 2024

Describe the bug

After upgrading Falco from 0.36.2 to 0.37.1 and switching the driver from ebpf to modern_ebpf, physical servers under higher load crash.

How to reproduce it

The crash occurs randomly over time on the more heavily loaded physical servers.

Environment

  • Falco version: 0.37.1
  • System info:
{
  "machine": "x86_64",
  "nodename": "falcosecurity-falco-<...>",
  "release": "6.1.42-1.el8.x86_64",
  "sysname": "Linux",
  "version": "#1 SMP PREEMPT_DYNAMIC Tue Aug  1 07:24:16 UTC 2023"
}
  • Cloud provider or hardware configuration:
  • OS: Rocky Linux 8.8
  • CPU: AMD EPYC 7742 64-Core Processor 128 cores
  • Kernel: Linux 6.1.42-1.el8.x86_64 SMP PREEMPT_DYNAMIC x86_64 GNU/Linux
  • Installation method: plain OSS Kubernetes

Additional context

Crashdump:

[17284898.905756] IPv6: ADDRCONF(NETDEV_CHANGE): cali841dc279d4d: link becomes ready
[17285388.370981] IPv6: ADDRCONF(NETDEV_CHANGE): cali6a7f0dad2a8: link becomes ready
[17285491.259227] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[17285491.259283] IPv6: ADDRCONF(NETDEV_CHANGE): cali5d758ecb513: link becomes ready
[17285552.983963] BUG: unable to handle page fault for address: ffffffffff6000c7
[17285552.987818] #PF: supervisor read access in kernel mode
[17285552.991552] #PF: error_code(0x0000) - not-present page
[17285552.995304] PGD 6a0e067 P4D 6a0e067 PUD 6a10067 PMD 6a12067 PTE 0
[17285552.999051] Oops: 0000 [#1] PREEMPT SMP NOPTI
[17285553.002776] CPU: 31 PID: 95831 Comm: kube-proxy Kdump: loaded Not tainted 6.1.42-1.el8.x86_64 #1
[17285553.006737] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.11.4 03/22/2023
[17285553.010774] RIP: 0010:copy_from_kernel_nofault+0x6d/0x120
[17285553.014852] Code: f8 4c 89 e7 4b 8d 14 2c 31 f6 48 c1 e8 03 4d 8d 44 c4 08 eb 13 48 83 c7 08 48 89 d1 48 83 c3 08 48 29 f9 4c 39 c7 74 34 89 f1 <48> 8b 03 48 89 07 85 c9 74 e1 65 48 8b 04 25 c0 bb 01 00 83 a8 18
[17285553.023657] RSP: 0018:ffffc90003be7d80 EFLAGS: 00010256
[17285553.028208] RAX: 0000000000000000 RBX: ffffffffff6000c7 RCX: 0000000000000000
[17285553.033957] RDX: ffffc90003be7e18 RSI: 0000000000000000 RDI: ffffc90003be7e10
[17285553.038745] RBP: ffffc90003be7d98 R08: ffffc90003be7e18 R09: 0000000000000000
[17285553.043381] R10: 0000000000000001 R11: ffff88826a519990 R12: ffffc90003be7e10
[17285553.048067] R13: 0000000000000008 R14: 0000000000000000 R15: ffffc90003be7e98
[17285553.052769] FS:  000000c000d90890(0000) GS:ffff88fe7d9c0000(0000) knlGS:0000000000000000
[17285553.057962] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17285553.062947] CR2: ffffffffff6000c7 CR3: 000000153ab3c000 CR4: 0000000000350ee0
[17285553.068504] Call Trace:
[17285553.074274]  <TASK>
[17285553.079218]  ? show_regs.cold.14+0x1a/0x1f
[17285553.084320]  ? __die_body+0x1f/0x70
[17285553.089309]  ? __die+0x2a/0x35
[17285553.094284]  ? _end+0x7b5da0c7/0x0
[17285553.099340]  ? page_fault_oops+0xaf/0x270
[17285553.104379]  ? bpf_probe_read_kernel+0x1d/0x50
[17285553.109575]  ? bpf_ringbuf_submit+0x10/0x20
[17285553.115044]  ? bpf_prog_182d4293644cc965_pf_kernel+0x549/0x558
[17285553.121418]  ? _end+0x7b5da0c7/0x0
[17285553.127468]  ? do_user_addr_fault+0x30b/0x590
[17285553.132943]  ? _end+0x7b5da0c7/0x0
[17285553.138381]  ? exc_page_fault+0x6f/0x160
[17285553.143782]  ? asm_exc_page_fault+0x27/0x30
[17285553.149265]  ? _end+0x7b5da0c7/0x0
[17285553.154742]  ? copy_from_kernel_nofault+0x6d/0x120
[17285553.160220]  bpf_probe_read_kernel+0x1d/0x50
[17285553.166254]  bpf_prog_3a9838b3cf5001f5_accept4_x+0x2e6/0x1589
[17285553.172566]  ? bpf_probe_read_kernel+0x1d/0x50
[17285553.178263]  ? bpf_prog_c5b1b737d5cb01c5_sys_exit+0x28f/0x50c
[17285553.184115]  bpf_trace_run2+0x54/0xd0
[17285553.189977]  __bpf_trace_sys_exit+0x9/0x10
[17285553.195917]  syscall_exit_to_user_mode_prepare+0x171/0x1d0
[17285553.202015]  syscall_exit_to_user_mode+0xd/0x40
[17285553.207926]  do_syscall_64+0x46/0x90
[17285553.214281]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[17285553.221453] RIP: 0033:0x42130e
[17285553.228105] Code: 20 4c 89 44 24 38 e8 31 3d ff ff 48 85 f6 0f 84 97 00 00 00 48 8b 54 24 78 49 89 f1 48 8b 74 24 48 4d 89 c8 49 29 d0 4d 8b 09 <4d> 85 c9 74 b1 4d 89 ca 49 29 d1 4c 39 ce 77 a6 4c 89 44 24 70 48
[17285553.240964] RSP: 002b:000000c000e51e88 EFLAGS: 00000206
[17285553.247460] RAX: 000000c003f36f70 RBX: 00000000000000d0 RCX: 000000000002aaa0
[17285553.254109] RDX: 000000c003f36f70 RSI: 00000000000000d0 RDI: 0000000000000012
[17285553.260788] RBP: 000000c000e51f08 R08: 0000000000000018 R09: 0000000000000000
[17285553.267735] R10: 000000000002aaaa R11: 0000000000000002 R12: 000000c000e51f08
[17285553.274514] R13: 000000000000000e R14: 000000c0005c6ea0 R15: 0000000002f14f80
[17285553.280551]  </TASK>
[17285553.286380] Modules linked in: xt_CT xt_multiport ipt_rpfilter ip_set_hash_net veth ip6t_REJECT nf_reject_ipv6 nf_conntrack_netlink ipt_REJECT nf_reject_ipv4 xt_addrtype xt_set ip_set_hash_ipportnet ip_set_hash_ipport ip_set_hash_ipportip ip_set_hash_ip ip_set_bitmap_port dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr xt_MASQUERADE xt_mark nft_chain_nat nf_nat xt_conntrack xt_comment nft_compat overlay ip_vs_sed ip_vs_lc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tcp_diag inet_diag amd64_edac edac_mce_amd kvm_amd kvm irqbypass wmi_bmof pcspkr rapl nf_tables sp5100_tco acpi_ipmi i2c_piix4 k10temp nfnetlink ipmi_si acpi_power_meter vfat fat sch_fq_codel ipmi_devintf ipmi_msghandler xfs libcrc32c dm_crypt sd_mod t10_pi crc64_rocksoft crc64 crct10dif_pclmul crc32_pclmul crc32c_intel sg ghash_clmulni_intel sha512_ssse3 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci i2c_algo_bit aesni_intel drm_shmem_helper crypto_simd libahci cryptd tg3 i40e drm ptp libata ccp pps_core
[17285553.286445]  megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod
[17285553.350752] CR2: ffffffffff6000c7

Installed using the official Helm chart, version 0.4.2, with the following values:

services:
  - name: k8saudit-webhook
    type: ClusterIP
    ports:
      - port: 9765
        protocol: TCP

# -- Tolerations to allow Falco to run on Kubernetes masters.
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane

driver:
  kind: modern_ebpf
  modernEbpf:
    bufSizePreset: 8
  loader:
    initContainer:
      resources:
        requests:
          cpu: 10m
          memory: 1Gi
        limits:
          cpu: 1000m
          memory: 1Gi

falcoctl:
  config:
    indexes:
    - name: falcosecurity
      url: https://falcosecurity.github.io/falcoctl/index.yaml
    artifact:
      allowedTypes:
        - rulesfile
        - plugin
      install:
        refs: [k8saudit-rules:0.7]
      follow:
      # -- List of artifacts to be followed by the falcoctl sidecar container.
        refs: [k8saudit-rules:0.7]
        # -- How often the tool checks for new versions of the followed artifacts.
        every: 1h

falco:
  rules_file:
    - /etc/falco/falco_rules.local.yaml
    - /etc/falco/rules.d
  json_output: true
  json_include_output_property: true
  json_include_tags_property: true
  http_output:
    enabled: true
    url: "http://falcosecurity-falcosidekick:80/"
  grpc:
    enabled: true
    bind_address: "unix:///run/falco/falco.sock"
    threadiness: 0 # 0 means "auto"
  grpc_output:
    enabled: true
  plugins:
    - name: k8saudit
      library_path: libk8saudit.so
      init_config:
        maxEventSize: "125829120"
        webhookMaxBatchSize: "125829120"
      open_params: "http://:9765/k8s-audit"
    - name: json
      library_path: libjson.so
      init_config: ""
  buffered_outputs: true
  load_plugins: [k8saudit, json]
  syscall_event_drops:
    actions:
      - ignore
    rate: "0.03333"
    max_burst: 10
  log_level: notice

resources:
  requests:
    cpu: 1
    memory: 12Gi
  limits:
    cpu: 2
    memory: 16Gi

# Collectors for data enrichment (scenario requirement)
collectors:
  docker:
    enabled: false
  crio:
    enabled: false
  kubernetes:
    enabled: false
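
For reference, the syscall_event_drops settings above (rate: 0.03333, max_burst: 10) follow token-bucket semantics: actions can fire in a burst of up to max_burst, then at roughly one per 30 seconds (1 / 0.03333). A minimal sketch of that behavior (illustrative only, not Falco's actual implementation):

```python
class TokenBucket:
    """Illustrative token bucket: `rate` tokens are added per second,
    capped at `max_burst`; each triggered action spends one token."""

    def __init__(self, rate: float, max_burst: float, now: float = 0.0):
        self.rate = rate
        self.max_burst = max_burst
        self.tokens = max_burst  # bucket starts full
        self.last = now

    def claim(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at max_burst.
        self.tokens = min(self.max_burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# With the values from the config above: 10 actions can fire back-to-back,
# after which roughly one action per ~30 seconds is allowed.
bucket = TokenBucket(rate=0.03333, max_burst=10)
```

So with `actions: [ignore]` the drop-handling action itself is cheap, and this bucket only bounds how often it runs.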
Andreagit97 (Member) commented:

Hey @apsega, thank you for reporting! We will take a look ASAP!

Andreagit97 (Member) commented:

falcosecurity/libs#1858 should be the cause of the failure! The fix will probably be released with Falco 0.38.0 by the end of the month!

Just a question, do you see this page fault sporadically or is this something that always happens?

apsega commented May 9, 2024

@Andreagit97 occasionally; it probably depends on the server load.

Andreagit97 (Member) commented:

Ok, got it, thank you!
I've seen that you have the page-fault eBPF programs enabled (bpf_prog_182d4293644cc965_pf_kernel in the stack trace). Do you use page_fault events in your rules, i.e. a condition like evt.type=page_fault?

I ask because it is unusual to see the page-fault programs enabled, and these programs are probably generating a lot of events...
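
For context, the page-fault eBPF programs are typically only attached when a loaded rule's condition references page_fault events. A hypothetical rule that would cause this (rule name, filter, and output are illustrative, not taken from the reporter's config):

```yaml
# Hypothetical example: any rule whose condition matches page_fault
# events causes Falco to enable the page-fault eBPF programs, which
# can fire at a very high rate on busy machines.
- rule: Page faults in kube-proxy (illustrative)
  desc: Illustrative rule that subscribes to page_fault events
  condition: evt.type = page_fault and proc.name = kube-proxy
  output: Page fault observed (proc=%proc.name pid=%proc.pid)
  priority: INFO
  tags: [example]
```

Grepping the loaded rules files for page_fault would confirm whether the pf_kernel program is expected to be attached.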
