Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCX use a large amount of SYSV HugePage memory #9830

Open
littleneko opened this issue Apr 18, 2024 · 5 comments
Open

UCX use a large amount of SYSV HugePage memory #9830

littleneko opened this issue Apr 18, 2024 · 5 comments
Labels

Comments

@littleneko
Copy link

littleneko commented Apr 18, 2024

Describe the bug

When we use UCX, we found that the system's Huge Pages are heavily utilized, with a total of over 4000 huge pages(2MB) occupied. In some scenarios, there is a large amount of memory being requested in a short period of time (around 1 minute), totaling around 2GB of huge page memory.

By examining the information in /proc/[pid]/numa_maps, we found that SYSV huge pages are heavily utilized. After analyzing our code, we found that only UCX uses SYSV to allocate large page memory, so we suspect that UCX is abnormally occupying these pages.

What could be occupying this SYSV huge page memory? What scenarios might trigger this problem? Are there any methods to avoid it?

Our program using UCX uses the ucp_tag_send_nb and ucp_tag_recv_nb interfaces. Internally, there is a UCX server that receives external requests, with multiple clients establishing connections with it, which could be either rc_x or tcp. Additionally, there are several UCX clients establishing connections with other services, only using rc_x. A schematic diagram is shown below:

未命名绘图 drawio

The SYSV hugepage in /proc/[pid]/numa_maps:
SYSV

Steps to Reproduce

  • Command line
  • UCX version used: v1.12.0
  • UCX configure flags
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni
  • Any UCX environment variables used
    setenv("UCX_MAX_EAGER_LANES", "2", 1);
    setenv("UCX_IB_SEG_SIZE", "2k", 1);
    setenv("UCX_RC_RX_QUEUE_LEN", "1024", 1);
    setenv("UCX_RC_MAX_RD_ATOMIC", "16", 1);
    setenv("UCX_RC_ROCE_PATH_FACTOR", "2", 1);
    setenv("UCX_RNDV_THRESH", "32k", 1);
    setenv("UCX_IB_TRAFFIC_CLASS", "166", 1);
    setenv("UCX_SOCKADDR_CM_ENABLE", "y", 1);
    setenv("UCX_RC_MAX_GET_ZCOPY", "32k", 1);
    setenv("UCX_RC_TX_NUM_GET_OPS", "8", 1);
    setenv("UCX_RC_TX_NUM_GET_BYTES", "256k", 1);
    setenv("UCX_RC_TX_CQ_MODERATION", "1", 1);
    setenv("UCX_HANDLE_ERRORS", "", 1);
    setenv("UCX_IB_FORK_INIT", "n", 1);

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • CentOS Linux release 7.2.1511 (Core) 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rdma-core-52mlnx1-1.52104.x86_64
      • MLNX_OFED_LINUX-5.2-1.0.4.0:
    • HW information from ibstat or ibv_devinfo -vv command
hca_id: mlx5_bond_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.27.1016
        node_guid:                      b8ce:f603:00e9:4d5e
        sys_image_guid:                 b8ce:f603:00e9:4d5e
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000080
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffffffffff000
        max_qp:                         262144
        max_qp_wr:                      32768
        device_cap_flags:               0xed721c36
                                        BAD_PKEY_CNTR
                                        BAD_QKEY_CNTR
                                        AUTO_PATH_MIG
                                        CHANGE_PHY_PORT
                                        PORT_ACTIVE_EVENT
                                        SYS_IMAGE_GUID
                                        RC_RNR_NAK_GEN
                                        MEM_WINDOW
                                        XRC
                                        MEM_MGT_EXTENSIONS
                                        MEM_WINDOW_TYPE_2B
                                        RAW_IP_CSUM
                                        MANAGED_FLOW_STEERING
                                        Unknown flags: 0xC8400000
        max_sge:                        30
        max_sge_rd:                     30
        max_cq:                         16777216
        max_cqe:                        4194303
        max_mr:                         16777216
        max_pd:                         16777216
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                4194304
        max_qp_init_rd_atom:            16
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         16777216
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  2097152
        max_mcast_qp_attach:            240
        max_total_mcast_qp_attach:      503316480
        max_ah:                         2147483647
        max_fmr:                        0
        max_srq:                        8388608
        max_srq_wr:                     32767
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             16
        general_odp_caps:
                                        ODP_SUPPORT
                                        ODP_SUPPORT_IMPLICIT
        rc_odp_caps:
                                        SUPPORT_SEND
                                        SUPPORT_RECV
                                        SUPPORT_WRITE
                                        SUPPORT_READ
                                        SUPPORT_SRQ
        uc_odp_caps:
                                        NO SUPPORT
        ud_odp_caps:
                                        SUPPORT_SEND
        xrc_odp_caps:
                                        SUPPORT_SEND
                                        SUPPORT_WRITE
                                        SUPPORT_READ
                                        SUPPORT_SRQ
        completion timestamp_mask:                      0x7fffffffffffffff
        hca_core_clock:                 156250kHZ
        raw packet caps:
                                        C-VLAN stripping offload
                                        Scatter FCS offload
                                        IP csum offload
                                        Delay drop
        device_cap_flags_ex:            0x30000055ED721C36
                                        RAW_SCATTER_FCS
                                        PCI_WRITE_END_PADDING
                                        Unknown flags: 0x3000004100000000
        tso_caps:
                max_tso:                        262144
                supported_qp:
                                        SUPPORT_RAW_PACKET
        rss_caps:
                max_rwq_indirection_tables:                     65536
                max_rwq_indirection_table_size:                 2048
                rx_hash_function:                               0x1
                rx_hash_fields_mask:                            0x800000FF
                supported_qp:
                                        SUPPORT_RAW_PACKET
        max_wq_type_rq:                 8388608
        packet_pacing_caps:
                qp_rate_limit_min:      1kbps
                qp_rate_limit_max:      25000000kbps
                supported_qp:
                                        SUPPORT_RAW_PACKET
        tag matching not supported

        cq moderation caps:
                max_cq_count:   65535
                max_cq_period:  4095 us

        maximum available device memory:        131072Bytes

                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x04010000
                        port_cap_flags2:        0x0000
                        max_vl_num:             invalid value (0)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            256
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           1X (1)
                        active_speed:           25.0 Gbps (32)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:bace:f6ff:fee9:4d5e, RoCE v1
                        GID[  1]:               fe80::bace:f6ff:fee9:4d5e, RoCE v2
                        GID[  2]:               fe80:0000:0000:0000:bace:f6ff:fee9:4d5e, RoCE v1
                        GID[  3]:               fe80::bace:f6ff:fee9:4d5e, RoCE v2
                        GID[  4]:               0000:0000:0000:0000:0000:ffff:0a4e:0588, RoCE v1
                        GID[  5]:               ::ffff:10.78.5.136, RoCE v2

Additional information (depending on the issue)

  • Output of ucx_info -d to show transports and devices recognized by UCX
 ucx_info -d
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 64967552K
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: self
#         Device: memory0
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: bond1
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 5658.18/ppn + 0.00 MB/sec
#              latency: 5212 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: mlx5_bond_0
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#      Transport: rc_verbs
#         Device: mlx5_bond_0:1
#           Type: network
#  System device: mlx5_bond_0 (0)
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 2
#              max eps: 256
#       device address: 18 bytes
#           ep address: 5 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_bond_0:1
#           Type: network
#  System device: mlx5_bond_0 (0)
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 2
#              max eps: 256
#       device address: 18 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: dc_mlx5
#         Device: mlx5_bond_0:1
#           Type: network
#  System device: mlx5_bond_0 (0)
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 860 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 38
#     device num paths: 2
#              max eps: inf
#       device address: 18 bytes
#        iface address: 5 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_bond_0:1
#           Type: network
#  System device: mlx5_bond_0 (0)
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 1016
#             am_zcopy: <= 1016, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 880
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 2
#              max eps: inf
#       device address: 18 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_bond_0:1
#           Type: network
#  System device: mlx5_bond_0 (0)
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 1016
#             am_zcopy: <= 1016, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 2
#              max eps: inf
#       device address: 18 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
#      max_conn_priv: 54 bytes
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: knem
#     Component: knem
#             register: unlimited, cost: 18446744073709551616000000000 nsec
#           remote key: 16 bytes
#
#      Transport: knem
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 13862.00/ppn + 0.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 0 bytes
#       error handling: none
#
@littleneko littleneko added the Bug label Apr 18, 2024
@tvegas1
Copy link
Contributor

tvegas1 commented Apr 22, 2024

assuming UCX allocates huge pages with sysv transport you could try:

  • disable huge pages UCX_SYSV_HUGETLB_MODE=no
  • disable sysv transport like UCX_TLS=rc_x,tcp,self

else if it is related to internal buffers there is also:

  • UCX_SELF_ALLOC=huge,thp,md,mmap,heap
  • UCX_TCP_ALLOC=huge,thp,md,mmap,heap
  • UCX_RC_MLX5_ALLOC=thp,mmap,heap
  • UCX_ALLOC_PRIO=md:sysv,md:posix,thp,md:*,mmap,heap

you could try removing md:sysv/thp/huge. also, ucp_mem_map(UCP_MEM_MAP_ALLOCATE) call from the app could lead to sysv allocation.

@littleneko
Copy link
Author

assuming UCX allocates huge pages with sysv transport you could try:

  • disable huge pages UCX_SYSV_HUGETLB_MODE=no
  • disable sysv transport like UCX_TLS=rc_x,tcp,self

else if it is related to internal buffers there is also:

  • UCX_SELF_ALLOC=huge,thp,md,mmap,heap
  • UCX_TCP_ALLOC=huge,thp,md,mmap,heap
  • UCX_RC_MLX5_ALLOC=thp,mmap,heap
  • UCX_ALLOC_PRIO=md:sysv,md:posix,thp,md:*,mmap,heap

you could try removing md:sysv/thp/huge. also, ucp_mem_map(UCP_MEM_MAP_ALLOCATE) call from the app could lead to sysv allocation.

I tried starting the program with the following configuration and found that SYSV huge pages are still being used. In the UCX log, I discovered that the huge pages are being used by ucp_am_bufs.

UCX_ALLOC_PRIO=md:sysv,md:posix,md:*,mmap,heap
UCX_POSIX_ALLOC=md,mmap,heap
UCX_SYSV_ALLOC=md,mmap,heap
UCX_SELF_ALLOC=md,mmap,heap
UCX_TCP_ALLOC=md,mmap,heap
UCX_RC_VERBS_ALLOC=md,mmap,heap
UCX_RC_MLX5_ALLOC=md,mmap,heap
UCX_DC_MLX5_ALLOC=md,mmap,heap
UCX_UD_VERBS_ALLOC=md,mmap,heap
UCX_UD_MLX5_ALLOC=md,mmap,heap
UCX_CMA_ALLOC=mmap,heap
UCX_KNEM_ALLOC=md,mmap,heap
UCX_POSIX_HUGETLB_MODE=n
UCX_SYSV_HUGETLB_MODE=n

The alloc_chunk function of ucp_am_bufs is ucs_mpool_hugetlb_malloc, which always allocates huge pages for 64KB-sized elements unless the allocation of huge pages fails. There are no parameters to modify this behavior.

I speculate that the TCP server in UCX is using this mpool.

@tvegas1
Copy link
Contributor

tvegas1 commented Apr 26, 2024

@yosefe, shall we allow non-huge pages allocation for ucp_am_bufs?

@littleneko
Copy link
Author

@yosefe, shall we allow non-huge pages allocation for ucp_am_bufs?

I still have a question: why does using TCP transports require so many ucp_am_bufs? In our production environment, there have been instances where 1000 (2GB) huge pages were requested in a short period of time.

@yosefe
Copy link
Contributor

yosefe commented Apr 26, 2024

hi, @littleneko is it an option to upgrade the UCX version, or check if 7ca49c4 is part of the version in use?
In older UCX versions, the receive buffers were allocated according to the max possible frag size, however in newer versions it's allocated according to actual size, in powers of 2.
ucp_am_bufs is allocating memory for unexpected received messages - those that were received before ucp_tag_recv_nbx() with a matching tag/mask was called.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants