Commits on Nov 22, 2021
-
samples/bpf: Add example to checkpoint/restore io_uring
The sample demonstrates how BPF iterators for task and io_uring can be used to checkpoint the state of an io_uring instance and then recreate it from that information, as a working example of how userspace projects like CRIU will use the iterators for the same purpose.

This is very similar to how CRIU actually works in principle: it writes all data on dump to protobuf images, which are then read during restore to reconstruct the task and its resources. Here we use a custom binary format and pipe the io_uring "image(s)" (in case of wq_fd there will be multiple images) to the restorer, which then consumes this information to form a total ordering of restore actions it has to execute to reach the same state.

The sample restores all features that currently cannot be restored without bpf iterators, hence is a good demonstration of what we would like to achieve using these new facilities. As is evident, we need a single iteration pass in each iterator to obtain all the information we require. io_uring ring buffer restoration is orthogonal and not specific to iterators, so it has been left out.

Our example app also shares the workqueue with a parent io_uring. The dumper tool detects this and dumps the parent io_uring first. io_uring doesn't allow creating cycles in this case, so the chain ends eventually in practice. For now only a single parent is supported, but it is easy to extend to arbitrary length chains (by recursing with a limit in do_dump_parent after detecting presence of wq_fd > 0).

The epoll iterator usecase is similar to what we do in dump_io_uring_file, and would significantly simplify the current implementation [0].
[0]: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/eventpoll.c

The dry-run mode of bpf_cr tool prints the dump image:

  $ ./bpf_cr app &
  PID: 318, Parent io_uring: 3, Dependent io_uring: 4
  $ ./bpf_cr dump 318 4 | ./bpf_cr restore --dry-run
  DUMP_SETUP:
    io_uring_fd: 3
    end: true
    flags: 14
    sq_entries: 2
    cq_entries: 4
    sq_thread_cpu: 0
    sq_thread_idle: 1500
    wq_fd: 0
  DUMP_SETUP:
    io_uring_fd: 4
    end: false
    flags: 46
    sq_entries: 2
    cq_entries: 4
    sq_thread_cpu: 0
    sq_thread_idle: 1500
    wq_fd: 3
  DUMP_EVENTFD:
    io_uring_fd: 4
    end: false
    eventfd: 5
    async: true
  DUMP_REG_FD:
    io_uring_fd: 4
    end: false
    reg_fd: 0
    index: 0
  DUMP_REG_FD:
    io_uring_fd: 4
    end: false
    reg_fd: 0
    index: 2
  DUMP_REG_FD:
    io_uring_fd: 4
    end: false
    reg_fd: 0
    index: 4
  DUMP_REG_BUF:
    io_uring_fd: 4
    end: false
    addr: 0
    len: 0
    index: 0
  DUMP_REG_BUF:
    io_uring_fd: 4
    end: true
    addr: 140721288339216
    len: 120
    index: 1
  Nothing to do, exiting...

======

The trace is as follows:

  // We can shift fd number around randomly, it doesn't impact C/R
  $ exec 3<> /dev/urandom
  $ exec 4<> /dev/random
  $ exec 5<> /dev/null
  $ strace ./bpf_cr app &
  ...
  io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE, sq_thread_cpu=0, sq_thread_idle=1500, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 6
  getpid() = 324
  ...
  io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE|IORING_SETUP_ATTACH_WQ, sq_thread_cpu=0, sq_thread_idle=1500, wq_fd=6, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 7
  ...
  // PID: 324, Parent io_uring: 6, Dependent io_uring: 7
  ...
  eventfd2(42, 0) = 8
  io_uring_register(7, IORING_REGISTER_EVENTFD_ASYNC, [8], 1) = 0
  io_uring_register(7, IORING_REGISTER_FILES, [0, -1, 1, -1, 2], 5) = 0
  io_uring_register(7, IORING_REGISTER_BUFFERS, [{iov_base=NULL, iov_len=0}, {iov_base=0x7ffdf1a27680, iov_len=120}], 2) = 0

The restore trace is as follows. The dumper detects the wq_fd on its own, so the parent io_uring is dumped and restored as well, before restoring fd 7:

  $ ./bpf_cr dump 326 7 | strace ./bpf_cr restore
  ...
  io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE, sq_thread_cpu=0, sq_thread_idle=1500, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 6
  dup2(6, 6) = 6
  ...
  io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE|IORING_SETUP_ATTACH_WQ, sq_thread_cpu=0, sq_thread_idle=1500, wq_fd=6, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 7
  dup2(7, 7) = 7
  ...
  eventfd2(42, 0) = 8
  io_uring_register(7, IORING_REGISTER_EVENTFD_ASYNC, [8], 1) = 0
  ...
  // fd number 0 is same as 1 and 2, hence the lowest one is used during restore,
  // it doesn't matter as underlying struct file is same...
  io_uring_register(7, IORING_REGISTER_FILES, [0, -1, 0, -1, 0], 5) = 0
  // This step would happen after restoring mm, so it fails for now for second iovec
  io_uring_register(7, IORING_REGISTER_BUFFERS, [{iov_base=NULL, iov_len=0}, {iov_base=0x7ffdf1a27680, iov_len=120}], 2) = -1 EFAULT (Bad address)
  ...
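The "total ordering of restore actions" mentioned above can be illustrated with a small sketch. It is not taken from bpf_cr: the record type SetupRecord and the function restore_order are hypothetical names, and only the two fields that matter for ordering (io_uring_fd and wq_fd, as printed in the dry-run dump) are modelled. The idea is that a ring whose wq_fd points at another ring has to be recreated after that ring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupRecord:
    """Hypothetical, reduced view of one DUMP_SETUP record."""
    io_uring_fd: int
    wq_fd: int  # 0 means the ring has no parent workqueue


def restore_order(records):
    """Return io_uring fds ordered so that every parent precedes its dependents."""
    by_fd = {r.io_uring_fd: r for r in records}
    ordered, done = [], set()

    def visit(fd, depth):
        if fd in done:
            return
        # io_uring refuses wq_fd cycles, but a corrupt image should not hang us.
        if depth > len(by_fd):
            raise ValueError("wq_fd chain is longer than the number of rings")
        record = by_fd[fd]
        if record.wq_fd > 0:
            if record.wq_fd not in by_fd:
                raise KeyError(f"parent ring {record.wq_fd} missing from image")
            visit(record.wq_fd, depth + 1)
        done.add(fd)
        ordered.append(fd)

    for r in records:
        visit(r.io_uring_fd, 0)
    return ordered


if __name__ == "__main__":
    # Mirrors the dry-run dump: ring 4 attaches to the workqueue of ring 3.
    print(restore_order([SetupRecord(4, 3), SetupRecord(3, 0)]))
```

The depth limit plays the same role as the "recursing with a limit" suggestion for do_dump_parent: it bounds the walk even if the input is malformed.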
-
selftests/bpf: Fix btf_dump test for bpf_iter_link_info
Since we changed the definition while adding io_uring and epoll iterator support, adjust the selftest to check against the updated definition. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
-
selftests/bpf: Test partial reads for io_uring, epoll iterators
Ensure that the output is consistent in the face of partial reads that return to userspace and then resume again later. To this end, we do reads in 1-byte chunks, which is a bit stupid in real life, but works well to simulate interrupted iteration. This also tests the case where the seq_file buffer is consumed (after seq_printf) on an interrupted read before the iterator invokes the BPF prog again. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
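A minimal userspace sketch of the 1-byte-chunk idea follows. It is not the selftest itself: read_in_chunks and one_byte_reads_match are hypothetical helpers, and a pipe stands in for the iterator fd so that the example runs anywhere. With a real iterator, the fd would come from the BPF iterator link instead.

```python
import os


def read_in_chunks(fd, chunk_size):
    """Drain fd with reads of at most chunk_size bytes and return all data."""
    parts = []
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:  # an empty read means end of file
            break
        parts.append(chunk)
    return b"".join(parts)


def one_byte_reads_match(expected):
    """Feed `expected` through a pipe and check that 1-byte reads rebuild it."""
    read_end, write_end = os.pipe()
    try:
        os.write(write_end, expected)  # small payloads fit in the pipe buffer
    finally:
        os.close(write_end)
    try:
        return read_in_chunks(read_end, 1) == expected
    finally:
        os.close(read_end)


if __name__ == "__main__":
    print(one_byte_reads_match(b"index 0 fd 0\nindex 2 fd 1\n"))
```

Every read(2) of one byte forces a return to userspace, so any state the producer keeps between calls gets exercised at each byte boundary.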
-
selftests/bpf: Add test for epoll BPF iterator
This tests the epoll iterator, including peeking into the epitem to inspect the registered file and fd number, and verifying that in userspace. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
-
selftests/bpf: Add test for io_uring BPF iterators
This exercises the io_uring_buf and io_uring_file iterators, and tests sparse file sets as well. Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
-
bpftool: Output io_uring iterator info
Output the sole field related to io_uring iterator (inode of attached io_uring) so that it can be useful in informational and also debugging cases (trying to find actual io_uring fd attached to the iterator).

Output:

  89: iter  prog 262  target_name io_uring_file  io_uring_inode 16764
          pids test_progs(384)

  [
    {
      "id": 123,
      "type": "iter",
      "prog_id": 463,
      "target_name": "io_uring_buf",
      "io_uring_inode": 16871,
      "pids": [
        {
          "pid": 443,
          "comm": "test_progs"
        }
      ]
    }
  ]

  [
    {
      "id": 126,
      "type": "iter",
      "prog_id": 483,
      "target_name": "io_uring_file",
      "io_uring_inode": 16887,
      "pids": [
        {
          "pid": 448,
          "comm": "test_progs"
        }
      ]
    }
  ]

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
-
epoll: Implement eBPF iterator for registered items
This patch adds eBPF iterator for epoll items (epitems) registered in an epoll instance. It gives access to the eventpoll ctx, and the registered epoll item (struct epitem). This allows the iterator to inspect the registered file and be able to use other iterators to associate it with a task's fdtable.

The primary usecase this is enabling is expediting existing eventpoll checkpoint/restore support in the CRIU project. This iterator allows us to switch from a worst case O(n^2) algorithm to a single O(n) pass over task and epoll registered descriptors.

We also make sure we're iterating over a live file, one that is not going away. The case we're concerned about is a file that has its f_count as zero, but is waiting for iterator bpf_seq_read to release ep->mtx, so that it can remove its epitem. Since such a file will disappear once iteration is done, and it is being destructed, we use get_file_rcu to ensure it is alive when invoking the BPF program. Getting access to a file that is going to disappear after iteration is not useful anyway. This does have a performance overhead however (since file reference will be raised and dropped for each file).

The rcu_read_lock around get_file_rcu isn't strictly required for lifetime management since fput path is serialized on ep->mtx to call ep_remove, hence the epi->ffd.file pointer remains stable during our seq_start/seq_stop bracketing.

To be able to continue back from the position we were iterating, we store the epi->ffd.fd and use ep_find_tfd to find the target file again. It would be more appropriate to use both struct file pointer and fd number to find the last file, but see below for why that cannot be done.

Taking a reference to struct file and walking the RB-Tree to find it again will lead to a reference cycle issue if the iterator, after a partial read, takes a reference to a socket which is later used in creating a descriptor cycle using SCM_RIGHTS. An example that was encountered when working on this is mentioned below.
Let there be Unix sockets SK1, SK2, epoll fd EP, and epoll iterator ITER. Let SK1 be registered in EP, then on a partial read it is possible that ITER returns from read and takes reference to SK1 to be able to find it later in RB-Tree and continue the iteration. If SK1 sends ITER over to SK2 using SCM_RIGHTS, and SK2 sends SK2 over to SK1 using SCM_RIGHTS, and both fds are not consumed on the corresponding receive ends, a cycle is created. When all of SK1, SK2, EP, and ITER are closed, SK1's receive queue holds reference to SK2, and SK2's receive queue holds reference to ITER, which holds a reference to SK1. All file descriptors except EP leak.

To resolve it, we would need to hook into the Unix Socket GC mechanism, but the alternative of using ep_find_tfd is much simpler. Finding the last position in the face of concurrent modification of the epoll set is at best an approximation anyway. For the case of CRIU, the epoll set remains stable.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
-
io_uring: Implement eBPF iterator for registered files
This change adds eBPF iterator for files registered in io_uring ctx. It gives access to the ctx, the index of the registered file, and a pointer to the struct file itself. This allows the iterator to save info related to the file added to an io_uring instance, that isn't easy to export using the fdinfo interface (like being able to match registered files to a task's file set). Getting access to underlying struct file allows deduplication and efficient pairing with task file set (obtained using task_file iterator).

The primary usecase this is enabling is checkpoint/restore support.

Note that we need to use mutex_trylock when the file is read from, in seq_start functions, as the order of lock taken is opposite of what it would be when io_uring operation reads the same file. We take seq_file->lock, then ctx->uring_lock, while io_uring would first take ctx->uring_lock and then seq_file->lock for the same ctx.

This can lead to a deadlock scenario described below:

The sequence on CPU 0 is for normal read(2) on iterator. For CPU 1, it is an io_uring instance trying to do same on iterator attached to itself.

So CPU 0 does

  sys_read
  vfs_read
   bpf_seq_read
   mutex_lock(&seq_file->lock)    # A
    io_uring_buf_seq_start
    mutex_lock(&ctx->uring_lock)  # B

and CPU 1 does

  io_uring_enter
  mutex_lock(&ctx->uring_lock)    # B
   io_read
    bpf_seq_read
    mutex_lock(&seq_file->lock)   # A
    ...

Since the order of locks is opposite, it can deadlock. So we switch the mutex_lock in io_uring_buf_seq_start to trylock, so it can return an error for this case, then it will release seq_file->lock and CPU 1 will make progress.

The trylock also protects the case where io_uring tries to read from iterator attached to itself (same ctx), where the order of locks would be:

  io_uring_enter
  mutex_lock(&ctx->uring_lock) <------------.
  io_read                                    \
   seq_read                                   \
    mutex_lock(&seq_file->lock)               /
    mutex_lock(&ctx->uring_lock) # deadlock--`

In both these cases (recursive read and contended uring_lock), -EDEADLK is returned to userspace.

With the advent of descriptorless files supported by io_uring, this iterator provides the required visibility and introspection of io_uring instance for the purposes of dumping and restoring it.

In the future, this iterator will be extended to support direct inspection of a lot of file state (currently descriptorless files are obtained using openat2 and socket) to dump file state for these hidden files. Later, we can explore filling in the gaps for dumping file state for more file types (those not hidden in io_uring ctx). All this is out of scope for the current series however, but builds upon this iterator.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
-
bpf: Add bpf_page_to_pfn helper
In CRIU, we need to be able to determine whether the page pinned by io_uring is still present in the same range in the process VMA. /proc/<pid>/pagemap gives us the PFN, hence using this helper we can establish this mapping easily from the iterator side. It is a simple wrapper over the in-kernel page_to_pfn macro, and ensures the passed in pointer is a struct page PTR_TO_BTF_ID. This is obtained from the bvec of io_uring_ubuf for the CRIU usecase. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
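The userspace half of that comparison can be sketched as follows. This is an illustration rather than CRIU code: the helper names are hypothetical, and the entry layout (bit 63 = page present, bits 0-54 = PFN) is the one described in the kernel's pagemap documentation. Note that the kernel reports the PFN as zero to readers without CAP_SYS_ADMIN, so read_pfn is only meaningful for a privileged dumper.

```python
import os
import struct

PAGEMAP_ENTRY_SIZE = 8          # one 64-bit entry per virtual page
PFN_MASK = (1 << 55) - 1        # bits 0-54 hold the PFN when the page is present
PAGE_PRESENT = 1 << 63          # bit 63: page is present in RAM


def pagemap_offset(vaddr, page_size):
    """Byte offset of the pagemap entry that covers vaddr."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_SIZE


def decode_pfn(entry):
    """Return the PFN stored in a pagemap entry, or None if not present."""
    if not entry & PAGE_PRESENT:
        return None
    return entry & PFN_MASK


def read_pfn(pid, vaddr):
    """Look up the PFN backing vaddr in the given process (needs privileges)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as pagemap:
        pagemap.seek(pagemap_offset(vaddr, page_size))
        raw = pagemap.read(PAGEMAP_ENTRY_SIZE)
    if len(raw) != PAGEMAP_ENTRY_SIZE:
        return None
    (entry,) = struct.unpack("=Q", raw)  # entries use native byte order
    return decode_pfn(entry)


def buffer_page_still_mapped(pid, vaddr, pfn_from_iterator):
    """True if the page seen by the iterator still backs vaddr in the task."""
    return read_pfn(pid, vaddr) == pfn_from_iterator
```

Here pfn_from_iterator stands for the value a BPF program would obtain from the helper and write out through the iterator's seq_file.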
-
io_uring: Implement eBPF iterator for registered buffers
This change adds eBPF iterator for buffers registered in io_uring ctx. It gives access to the ctx, the index of the registered buffer, and a pointer to the io_uring_ubuf itself. This allows the iterator to save info related to buffers added to an io_uring instance, that isn't easy to export using the fdinfo interface (like exact struct page composing the registered buffer).

The primary usecase this is enabling is checkpoint/restore support.

Note that we need to use mutex_trylock when the file is read from, in seq_start functions, as the order of lock taken is opposite of what it would be when io_uring operation reads the same file. We take seq_file->lock, then ctx->uring_lock, while io_uring would first take ctx->uring_lock and then seq_file->lock for the same ctx.

This can lead to a deadlock scenario described below:

The sequence on CPU 0 is for normal read(2) on iterator. For CPU 1, it is an io_uring instance trying to do same on iterator attached to itself.

So CPU 0 does

  sys_read
  vfs_read
   bpf_seq_read
   mutex_lock(&seq_file->lock)    # A
    io_uring_buf_seq_start
    mutex_lock(&ctx->uring_lock)  # B

and CPU 1 does

  io_uring_enter
  mutex_lock(&ctx->uring_lock)    # B
   io_read
    bpf_seq_read
    mutex_lock(&seq_file->lock)   # A
    ...

Since the order of locks is opposite, it can deadlock. So we switch the mutex_lock in io_uring_buf_seq_start to trylock, so it can return an error for this case, then it will release seq_file->lock and CPU 1 will make progress.

The trylock also protects the case where io_uring tries to read from iterator attached to itself (same ctx), where the order of locks would be:

  io_uring_enter
  mutex_lock(&ctx->uring_lock) <------------.
  io_read                                    \
   seq_read                                   \
    mutex_lock(&seq_file->lock)               /
    mutex_lock(&ctx->uring_lock) # deadlock--`

In both these cases (recursive read and contended uring_lock), -EDEADLK is returned to userspace.
In the future, this iterator will be extended to directly support iteration of bvec Flexible Array Member, so that when there is no corresponding VMA that maps to the registered buffer (e.g. if VMA is destroyed after pinning pages), we are able to reconstruct the registration on restore by dumping the page contents and then replaying them into a temporary mapping used for registration later. All this is out of scope for the current series however, but builds upon this iterator. Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
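The trylock approach described in the two io_uring iterator commits can be modelled in userspace. The sketch below is an analogy under stated assumptions, not kernel code: Ring and IterFile are hypothetical classes, threading.Lock stands in for the kernel mutexes, and the recursive-read case is simulated by calling read() while the ring lock is already held.

```python
import errno
import threading


class Ring:
    """Stand-in for an io_uring ctx; only the uring_lock is modelled."""

    def __init__(self):
        self.uring_lock = threading.Lock()


class IterFile:
    """Stand-in for an iterator seq_file attached to one ring."""

    def __init__(self, ring):
        self.lock = threading.Lock()  # plays the role of seq_file->lock
        self.ring = ring

    def read(self):
        with self.lock:  # lock A is always taken first on this path
            # Lock B is only tried: blocking here could deadlock against a
            # path that already holds B and is waiting for A.
            if not self.ring.uring_lock.acquire(blocking=False):
                return -errno.EDEADLK
            try:
                return 0  # the real iterator would walk registered items here
            finally:
                self.ring.uring_lock.release()


def ring_reads_own_iterator(ring, iter_file):
    """Simulate io_uring reading an iterator attached to its own ctx."""
    with ring.uring_lock:        # B is held by the io_uring path...
        return iter_file.read()  # ...so the iterator's trylock on B must fail


if __name__ == "__main__":
    ring = Ring()
    it = IterFile(ring)
    print(it.read(), ring_reads_own_iterator(ring, it))
```

Because the inner acquisition never blocks, the reader always releases lock A promptly, which is what lets the other path make progress.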
Commits on Nov 19, 2021
-
libbpf: Change bpf_program__set_extra_flags to bpf_program__set_flags
bpf_program__set_extra_flags has just been introduced so we can still change it without breaking users. This new interface is a bit more flexible (for example if someone wants to clear a flag). Signed-off-by: Florent Revest <revest@chromium.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211119180035.1396139-1-revest@chromium.org
-
selftests/bpf: Add btf_dedup case with duplicated structs within CU
Add an artificial minimal example simulating compilers producing two different types within a single CU that correspond to identical struct definitions. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211117194114.347675-2-andrii@kernel.org
-
libbpf: Accommodate DWARF/compiler bug with duplicated structs
According to [0], compilers sometimes might produce duplicate DWARF definitions for exactly the same struct/union within the same compilation unit (CU). We've had similar issues with identical arrays and handled them with a similar workaround in 6b6e6b1 ("libbpf: Accomodate DWARF/compiler bug with duplicated identical arrays"). Do the same for struct/union by ensuring that two structs/unions are exactly the same, down to the integer values of field referenced type IDs.

Solving this more generically (allowing referenced types to be equivalent, but using different type IDs, all within a single CU) requires a huge complexity increase to handle many-to-many mappings between canonical and candidate type graphs. Before we invest in that, let's see if this approach handles all the instances of this issue in practice. Thankfully it's pretty rare, it seems.

[0] https://lore.kernel.org/bpf/YXr2NFlJTAhHdZqq@krava/

Reported-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211117194114.347675-1-andrii@kernel.org
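What "exactly the same, down to the integer values of field referenced type IDs" means can be shown with a simplified model. The types below are hypothetical stand-ins, not libbpf's BTF structures, and the check is only a sketch of the strictness described in the commit message: two definitions count as duplicates only if every member agrees on name, offset and the literal type ID it references.

```python
from typing import NamedTuple, Tuple


class Member(NamedTuple):
    name: str
    type_id: int      # ID of the referenced type, compared as a plain integer
    bit_offset: int


class Composite(NamedTuple):
    kind: str         # "struct" or "union"
    name: str
    size: int
    members: Tuple[Member, ...]


def identical_composites(a, b):
    """True only if a and b match field by field, including raw type IDs."""
    if (a.kind, a.name, a.size) != (b.kind, b.name, b.size):
        return False
    if len(a.members) != len(b.members):
        return False
    return all(
        m.name == n.name and m.bit_offset == n.bit_offset and m.type_id == n.type_id
        for m, n in zip(a.members, b.members)
    )


if __name__ == "__main__":
    s1 = Composite("struct", "s", 8, (Member("x", 5, 0), Member("y", 5, 32)))
    s2 = Composite("struct", "s", 8, (Member("x", 5, 0), Member("y", 5, 32)))
    # Same layout, but "y" points at a different (possibly equivalent) type ID.
    s3 = Composite("struct", "s", 8, (Member("x", 5, 0), Member("y", 9, 32)))
    print(identical_composites(s1, s2), identical_composites(s1, s3))
```

The third definition is rejected even if type 9 happens to be equivalent to type 5; accepting it is the more general problem that the commit message deliberately leaves unsolved.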
-
libbpf: Add runtime APIs to query libbpf version
Libbpf provided LIBBPF_MAJOR_VERSION and LIBBPF_MINOR_VERSION macros to check libbpf version at compilation time. This doesn't cover all the needs, though, because the version of libbpf that an application is compiled against doesn't necessarily match the version of libbpf at runtime, especially if libbpf is used as a shared library.

Add libbpf_major_version() and libbpf_minor_version() returning major and minor versions, respectively, as integers. Also add a convenience libbpf_version_string() for various tooling using libbpf to print out libbpf version in a human-readable form. Currently it will return "v0.6", but in the future it can contain some extra information, so the format itself is not part of a stable API and shouldn't be relied upon.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20211118174054.2699477-1-andrii@kernel.org
Commits on Nov 18, 2021
-
selfetests/bpf: Adapt vmtest.sh to s390 libbpf CI changes
[1] added s390 support to libbpf CI and added an ${ARCH} prefix to a number of paths and identifiers in libbpf GitHub repo, which vmtest.sh relies upon. Update these and make use of the new s390 support. [1] libbpf/libbpf#204 Co-developed-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211118115225.1349726-1-iii@linux.ibm.com
Commits on Nov 17, 2021
-
selftests/bpf: Fix xdpxceiver failures for no hugepages
xsk_configure_umem() needs hugepages to work in unaligned mode. So when hugepages are not configured, 'unaligned' tests should be skipped which is determined by the helper function hugepages_present(). This function erroneously returns true with MAP_NORESERVE flag even when no hugepages are configured. The removal of this flag fixes the issue. The test TEST_TYPE_UNALIGNED_INV_DESC also needs to be skipped when there are no hugepages. However, this was not skipped as there was no check for presence of hugepages and hence was failing. The check to skip the test has now been added. Fixes: a4ba98d (selftests: xsk: Add test for unaligned mode) Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211117123613.22288-1-tirthendu.sarkar@intel.com
-
bpf, docs: Fix ordering of bpf documentation
This commit fixes the display of the BPF documentation in the sidebar when rendered as HTML.

Before this patch, the sidebar would render as follows for some sections:

  | BPF Documentation
    |- BPF Type Format (BTF)
      |- BPF Type Format (BTF)

This was due to creating a heading in index.rst followed by a sphinx toctree, where the file referenced carries the same title as the section heading.

To fix this I applied a pattern that has been established in other subfolders of Documentation:

1. Re-wrote index.rst to have a single toctree
2. Split the sections out in to their own files

Additionally maps.rst and programs.rst make use of a glob pattern to include map_* or prog_* rst files in their toctree, meaning future map or program type documentation will be automatically included.

Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/1a1eed800e7b9dc13b458de113a489641519b0cc.1636749493.git.dave@dtucker.co.uk
-
bpf, docs: Rename bpf_lsm.rst to prog_lsm.rst
This allows for documentation relating to BPF Program Types to be matched by the glob pattern prog_* for inclusion in a sphinx toctree Signed-off-by: Dave Tucker <dave@dtucker.co.uk> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/bpf/49fe0f370a2b28500c1b60f1fdb6fb7ec90de28a.1636749493.git.dave@dtucker.co.uk
-
bpf, docs: Change underline in btf to match style guide
This changes the type of underline used to follow the guidelines in Documentation/doc-guide/sphinx.rst which also ensures that the headings are rendered at the correct level in the HTML sidebar Signed-off-by: Dave Tucker <dave@dtucker.co.uk> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/981b27485cc294206480df36fca46817e2553e39.1636749493.git.dave@dtucker.co.uk
-
selftests/bpf: Mark variable as static
Fix warnings from checkpatch.pl. Signed-off-by: Yucong Sun <sunyucong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211112192535.898352-4-fallentree@fb.com
-
selftests/bpf: Variable naming fix
Change log_fd to log_fp to reflect its type correctly. Signed-off-by: Yucong Sun <sunyucong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211112192535.898352-3-fallentree@fb.com
-
selftests/bpf: Move summary line after the error logs
Makes it easier to find the summary line when there are a lot of logs to scroll back through. Signed-off-by: Yucong Sun <sunyucong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211112192535.898352-2-fallentree@fb.com
Commits on Nov 16, 2021
-
selftests/bpf: Add uprobe triggering overhead benchmarks
Add benchmark to measure overhead of uprobes and uretprobes. Also have a baseline (no uprobe attached) benchmark.

On my dev machine, baseline benchmark can trigger 130M user_target() invocations per second. When uprobe is attached, this falls to just 700K. With uretprobe, we get down to 520K:

  $ sudo ./bench trig-uprobe-base -a
  Summary: hits  131.289 ± 2.872M/s

  # UPROBE
  $ sudo ./bench -a trig-uprobe-without-nop
  Summary: hits    0.729 ± 0.007M/s

  $ sudo ./bench -a trig-uprobe-with-nop
  Summary: hits    1.798 ± 0.017M/s

  # URETPROBE
  $ sudo ./bench -a trig-uretprobe-without-nop
  Summary: hits    0.508 ± 0.012M/s

  $ sudo ./bench -a trig-uretprobe-with-nop
  Summary: hits    0.883 ± 0.008M/s

So there is almost 2.5x performance difference between probing nop vs non-nop instruction for entry uprobe. And 1.7x difference for uretprobe. This means that non-nop uprobe overhead is around 1.4 microseconds for uprobe and 2 microseconds for non-nop uretprobe. For nop variants, uprobe and uretprobe overhead is down to 0.556 and 1.13 microseconds, respectively.

For comparison, just doing a very low-overhead syscall (with no BPF programs attached anywhere) gives:

  $ sudo ./bench trig-base -a
  Summary: hits    4.830 ± 0.036M/s

So uprobes are about 2.7x slower than pure context switch (4.830 / 1.798, using the nop variant).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211116013041.4072571-1-andrii@kernel.org
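The per-call overheads and ratios quoted in this commit message follow directly from the throughput figures. The short script below redoes that arithmetic. The dictionary keys are just labels chosen here, and the rates are the "hits" values from the benchmark output above.

```python
# Throughput in millions of hits per second, copied from the bench output.
RATES_M_PER_S = {
    "uprobe-without-nop": 0.729,
    "uprobe-with-nop": 1.798,
    "uretprobe-without-nop": 0.508,
    "uretprobe-with-nop": 0.883,
    "syscall-base": 4.830,
}


def per_call_us(rate_m_per_s):
    """Time per invocation in microseconds: 1 / (N * 1e6 per second) = 1/N us."""
    return 1.0 / rate_m_per_s


def speed_ratio(faster, slower):
    """How many times more hits per second `faster` achieves than `slower`."""
    return RATES_M_PER_S[faster] / RATES_M_PER_S[slower]


if __name__ == "__main__":
    for name, rate in RATES_M_PER_S.items():
        print(f"{name:24s} {per_call_us(rate):6.3f} us/call")
    print("uprobe nop vs non-nop:   ", round(speed_ratio("uprobe-with-nop", "uprobe-without-nop"), 2))
    print("uretprobe nop vs non-nop:", round(speed_ratio("uretprobe-with-nop", "uretprobe-without-nop"), 2))
    print("syscall vs nop uprobe:   ", round(speed_ratio("syscall-base", "uprobe-with-nop"), 2))
```

Since the rates are in millions per second, the reciprocal is already in microseconds, which is why no further unit conversion appears.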
-
bpf: Change value of MAX_TAIL_CALL_CNT from 32 to 33
In the current code, the actual max tail call count is 33, which is greater than MAX_TAIL_CALL_CNT (defined as 32). The actual limit is not consistent with the meaning of MAX_TAIL_CALL_CNT and is thus confusing at first glance. The historical evolution can be seen in commit 04fd61a ("bpf: allow bpf programs to tail-call other bpf programs") and commit f9dabe0 ("bpf: Undo off-by-one in interpreter tail call count limit"). To avoid changing existing behavior, the actual limit stays at 33, which is reasonable.

After commit 874be05 ("bpf, tests: Add tail call test suite"), there is a failing testcase.

On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:

  # echo 0 > /proc/sys/net/core/bpf_jit_enable
  # modprobe test_bpf
  # dmesg | grep -w FAIL
  Tail call error path, max count reached jited:0 ret 34 != 33 FAIL

On some archs:

  # echo 1 > /proc/sys/net/core/bpf_jit_enable
  # modprobe test_bpf
  # dmesg | grep -w FAIL
  Tail call error path, max count reached jited:1 ret 34 != 33 FAIL

Although the above failed testcase has been fixed in commit 18935a7 ("bpf/tests: Fix error in tail call limit tests"), it would still be good to change the value of MAX_TAIL_CALL_CNT from 32 to 33 to make the code more readable.

The 32-bit x86 JIT was using a limit of 32; fix the wrong comments and limit it to 33 tail calls now that the constant MAX_TAIL_CALL_CNT is updated. For the mips64 JIT, use "ori" instead of "addiu" as suggested by Johan Almbladh. For the riscv JIT, use RV_REG_TCC directly to save one register move as suggested by Björn Töpel. For the other implementations there is no functional change: the limit remains 33, the new value of MAX_TAIL_CALL_CNT reflects the actual max tail call count, and the related tail call testcases in the test_bpf module and selftests work for both the interpreter and the JIT.
Here are the test results on x86_64:

  # uname -m
  x86_64
  # echo 0 > /proc/sys/net/core/bpf_jit_enable
  # modprobe test_bpf test_suite=test_tail_calls
  # dmesg | tail -1
  test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
  # rmmod test_bpf
  # echo 1 > /proc/sys/net/core/bpf_jit_enable
  # modprobe test_bpf test_suite=test_tail_calls
  # dmesg | tail -1
  test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
  # rmmod test_bpf
  # ./test_progs -t tailcalls
  #142 tailcalls:OK
  Summary: 1/11 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/bpf/1636075800-3264-1-git-send-email-yangtiezhu@loongson.cn
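The off-by-one that motivates this change is easy to see in a toy model. The two loops below are a sketch, not the kernel code: they assume the old check compared the counter with "greater than" before incrementing, and that the new check uses "greater than or equal" with the constant raised to 33, which is how the commit message's "32 but actually 33" description reads. Function names are hypothetical.

```python
def tail_calls_allowed_old(max_tail_call_cnt=32):
    """Count tail calls permitted by a `cnt > MAX` check done before cnt++."""
    cnt = 0
    performed = 0
    while True:
        if cnt > max_tail_call_cnt:   # bail out only once cnt exceeds the limit
            break
        cnt += 1
        performed += 1                # the tail call happens
    return performed


def tail_calls_allowed_new(max_tail_call_cnt=33):
    """Count tail calls permitted by a `cnt >= MAX` check done before cnt++."""
    cnt = 0
    performed = 0
    while True:
        if cnt >= max_tail_call_cnt:  # bail out as soon as cnt reaches the limit
            break
        cnt += 1
        performed += 1
    return performed


if __name__ == "__main__":
    print(tail_calls_allowed_old(), tail_calls_allowed_new())
```

Both variants permit the same number of tail calls, so behavior is unchanged. Only the second one makes the constant equal to the count it actually enforces.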
-
selftests/bpf: Configure dir paths via env in test_bpftool_synctypes.py
Script test_bpftool_synctypes.py parses a number of files in the bpftool directory (or even elsewhere in the repo) to make sure that the lists of types or options in those different files are consistent. Instead of having fixed paths, let's make the directories configurable through environment variables. This should make it easier in the future to run the script in a different setup, for example on an out-of-tree bpftool mirror with a different layout. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211115225844.33943-4-quentin@isovalent.com
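The pattern of "default path, overridable from the environment" might look like the sketch below. The variable name EXAMPLE_TOOL_DIR, the default path and the helper resolve_dir are all hypothetical; the actual names used by test_bpftool_synctypes.py are not given in this commit message.

```python
import os


def resolve_dir(env_name, default, environ=None):
    """Return the directory named by env_name, or `default` if unset or empty."""
    environ = os.environ if environ is None else environ
    return os.path.abspath(environ.get(env_name) or default)


if __name__ == "__main__":
    # Hypothetical variable name and in-tree default, for illustration only.
    tool_dir = resolve_dir("EXAMPLE_TOOL_DIR", "tools/bpf/bpftool")
    print("parsing sources under", tool_dir)
```

Passing the environment as an optional argument keeps the helper easy to test without touching the real process environment.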
-
bpftool: Update doc (use susbtitutions) and test_bpftool_synctypes.py
test_bpftool_synctypes.py helps detect inconsistencies in bpftool between the different lists of types and options scattered in the sources, the documentation, and the bash completion. For options that apply to all bpftool commands, the script had a hardcoded list of values, and would use them to check whether the man pages are up-to-date. When writing the script, it felt acceptable to have this list in order to avoid opening and parsing bpftool's main.h every time, and because the list of global options in bpftool doesn't change so often.

However, this is prone to omissions, and we recently added a new -l|--legacy option which was described in common_options.rst, but not listed in the options summary of each manual page. The script did not complain, because it keeps comparing the hardcoded list to the (now) outdated list in the header file.

To address the issue, this commit brings the following changes:

- Options that are common to all bpftool commands (--json, --pretty, and --debug) are moved to a dedicated file, and used in the definition of a RST substitution. This substitution is used in the sources of all the man pages.

- This list of common options is updated, with the addition of the new -l|--legacy option.

- The script test_bpftool_synctypes.py is updated to compare:
    - Options specific to a command, found in C files, for the interactive help messages, with the same specific options from the relevant man page for that command.
    - Common options, checked just once: the list in main.h is compared with the new list in substitutions.rst.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211115225844.33943-3-quentin@isovalent.com
-
bpftool: Add SPDX tags to RST documentation files
Most files in the kernel repository have an SPDX tag. The files that don't have such a tag (or another license boilerplate) tend to fall under the GPL-2.0 license. In the past, bpftool's Makefile (for example) has been marked as GPL-2.0 for that reason, when in fact all of bpftool is dual-licensed. To prevent a similar confusion from happening with the RST documentation files for bpftool, let's explicitly mark all files as dual-licensed. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211115225844.33943-2-quentin@isovalent.com
-
selftests/bpf: Add a dedup selftest with equivalent structure types
Without the previous libbpf patch, the following error will occur:

  $ ./test_progs -t btf
  ...
  do_test_dedup:FAIL:check btf_dedup failed errno:-22
  #13/205 btf/dedup: btf_type_tag #5, struct:FAIL

And the previous libbpf patch fixed the issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211115163943.3922547-1-yhs@fb.com
-
libbpf: Fix a couple of missed btf_type_tag handling in btf.c
Commit 2dc1e48 ("libbpf: Support BTF_KIND_TYPE_TAG") added the BTF_KIND_TYPE_TAG support. But to test vmlinux build with

  ...
  #define __user __attribute__((btf_type_tag("user")))
  ...

I needed to sync the libbpf repo and manually copy libbpf sources to pahole. To simplify the process, I used BTF_KIND_RESTRICT to simulate BTF_KIND_TYPE_TAG with the vmlinux build, as the "restrict" modifier is barely used in the kernel. But this approach missed one case in dedup with structures where BTF_KIND_RESTRICT is handled and BTF_KIND_TYPE_TAG is not handled in btf_dedup_is_equiv(), and this will result in a pahole dedup failure. This patch fixes the issue, and a selftest is added in the subsequent patch to test this scenario.

The other missed handling is in btf__resolve_size(). Currently the compiler always emits a chain like PTR->TYPE_TAG->..., so in practice we don't hit the missing BTF_KIND_TYPE_TAG handling issue with compiler-generated code. But let's add case BTF_KIND_TYPE_TAG in the switch statement to be future proof.

Fixes: 2dc1e48 ("libbpf: Support BTF_KIND_TYPE_TAG")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211115163937.3922235-1-yhs@fb.com
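To illustrate why a size-resolution routine must treat a type tag like the other modifiers, here is a toy Python model. It is not libbpf code: the dict-based type table, the kind names as strings, the function `resolve_size`, and the depth limit are assumptions made for this sketch. The point is only that a kind missing from the "skip and follow" set makes resolution fail as soon as that kind appears in a chain.

```python
MAX_RESOLVE_DEPTH = 32  # depth limit chosen for this sketch

# Kinds that carry no size of their own and only refer to another type.
MODIFIER_KINDS = {"typedef", "volatile", "const", "restrict", "type_tag"}


def resolve_size(types, type_id, ptr_size=8, modifiers=MODIFIER_KINDS):
    """Follow modifiers and arrays from type_id until a sized type is found."""
    nelems = 1
    for _ in range(MAX_RESOLVE_DEPTH):
        t = types[type_id]
        kind = t["kind"]
        if kind in ("int", "struct"):
            return nelems * t["size"]
        if kind == "ptr":
            return nelems * ptr_size
        if kind in modifiers:
            type_id = t["type"]  # skip the modifier, keep resolving
        elif kind == "array":
            nelems *= t["nelems"]
            type_id = t["type"]
        else:
            raise ValueError(f"cannot resolve size of kind {kind!r}")
    raise ValueError("type chain too deep")


# Hypothetical type table: int[3] whose element type is wrapped in a type tag.
types = {
    1: {"kind": "int", "size": 4},
    2: {"kind": "type_tag", "type": 1},
    3: {"kind": "array", "type": 2, "nelems": 3},
}
size = resolve_size(types, 3)
```

Passing `modifiers=MODIFIER_KINDS - {"type_tag"}` reproduces the failure mode in this model: the walk reaches the tag, finds no matching case, and raises instead of returning a size.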
-
bpftool: Add current libbpf_strict mode to version output
+ bpftool --legacy --version
bpftool v5.15.0
features: libbfd, skeletons
+ bpftool --version
bpftool v5.15.0
features: libbfd, libbpf_strict, skeletons
+ bpftool --legacy --help
Usage: bpftool [OPTIONS] OBJECT { COMMAND | help }
       bpftool batch file FILE
       bpftool version
       OBJECT := { prog | map | link | cgroup | perf | net | feature | btf | gen | struct_ops | iter }
       OPTIONS := { {-j|--json} [{-p|--pretty}] | {-d|--debug} | {-l|--legacy} | {-V|--version} }
+ bpftool --help
Usage: bpftool [OPTIONS] OBJECT { COMMAND | help }
       bpftool batch file FILE
       bpftool version
       OBJECT := { prog | map | link | cgroup | perf | net | feature | btf | gen | struct_ops | iter }
       OPTIONS := { {-j|--json} [{-p|--pretty}] | {-d|--debug} | {-l|--legacy} | {-V|--version} }
+ bpftool --legacy
Usage: bpftool [OPTIONS] OBJECT { COMMAND | help }
       bpftool batch file FILE
       bpftool version
       OBJECT := { prog | map | link | cgroup | perf | net | feature | btf | gen | struct_ops | iter }
       OPTIONS := { {-j|--json} [{-p|--pretty}] | {-d|--debug} | {-l|--legacy} | {-V|--version} }
+ bpftool
Usage: bpftool [OPTIONS] OBJECT { COMMAND | help }
       bpftool batch file FILE
       bpftool version
       OBJECT := { prog | map | link | cgroup | perf | net | feature | btf | gen | struct_ops | iter }
       OPTIONS := { {-j|--json} [{-p|--pretty}] | {-d|--debug} | {-l|--legacy} | {-V|--version} }
+ bpftool --legacy version
bpftool v5.15.0
features: libbfd, skeletons
+ bpftool version
bpftool v5.15.0
features: libbfd, libbpf_strict, skeletons
+ bpftool --json --legacy version
{"version":"5.15.0","features":{"libbfd":true,"libbpf_strict":false,"skeletons":true}}
+ bpftool --json version
{"version":"5.15.0","features":{"libbfd":true,"libbpf_strict":true,"skeletons":true}}

Suggested-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20211116000448.2918854-1-sdf@google.com
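A consumer of the JSON form shown in this commit could check the strict-mode flag roughly as follows. This is a small sketch: the two JSON strings are copied from the commit message, while the helper name `libbpf_strict_enabled` is hypothetical and treating a missing key as "not strict" is an assumption.

```python
import json


def libbpf_strict_enabled(version_json):
    """Return True if the version JSON reports libbpf_strict as enabled.

    A missing "features" object or "libbpf_strict" key is treated as False,
    which is an assumption of this sketch rather than documented behavior.
    """
    info = json.loads(version_json)
    return bool(info.get("features", {}).get("libbpf_strict", False))


# Output strings as they appear in the commit message above.
legacy_out = '{"version":"5.15.0","features":{"libbfd":true,"libbpf_strict":false,"skeletons":true}}'
default_out = '{"version":"5.15.0","features":{"libbfd":true,"libbpf_strict":true,"skeletons":true}}'

legacy_strict = libbpf_strict_enabled(legacy_out)
default_strict = libbpf_strict_enabled(default_out)
```

Reading the flag from the JSON output avoids parsing the comma-separated plain-text "features:" line, where the feature is simply absent in legacy mode.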
Commits on Nov 15, 2021
-
Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-11-15

We've added 72 non-merge commits during the last 13 day(s) which contain a total of 171 files changed, 2728 insertions(+), 1143 deletions(-).

The main changes are:

1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to BTF such that BPF verifier will be able to detect misuse, from Yonghong Song.

2) Big batch of libbpf improvements including various fixes, future proofing APIs, and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko.

3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush.

4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to ensure exception handling still works, from Russell King and Alan Maguire.

5) Add a new bpf_find_vma() helper for tracing to map an address to the backing file such as shared library, from Song Liu.

6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump, updating documentation and bash-completion among others, from Quentin Monnet.

7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as the API is heavily tailored around perf and is non-generic, from Dave Marchevsky.

8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an opt-out for more relaxed BPF program requirements, from Stanislav Fomichev.

9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
  bpftool: Use libbpf_get_error() to check error
  bpftool: Fix mixed indentation in documentation
  bpftool: Update the lists of names for maps and prog-attach types
  bpftool: Fix indent in option lists in the documentation
  bpftool: Remove inclusion of utilities.mak from Makefiles
  bpftool: Fix memory leak in prog_dump()
  selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning
  selftests/bpf: Fix an unused-but-set-variable compiler warning
  bpf: Introduce btf_tracing_ids
  bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs
  bpftool: Enable libbpf's strict mode by default
  docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support
  selftests/bpf: Clarify llvm dependency with btf_tag selftest
  selftests/bpf: Add a C test for btf_type_tag
  selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c
  selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication
  selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests
  selftests/bpf: Test libbpf API function btf__add_type_tag()
  bpftool: Support BTF_KIND_TYPE_TAG
  libbpf: Support BTF_KIND_TYPE_TAG
  ...
====================

Link: https://lore.kernel.org/r/20211115162008.25916-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski committed Nov 15, 2021
-
Revert "Merge branch 'mctp-i2c-driver'"
This reverts commit 71812af, reversing changes made to cc0be1a. Wolfram Sang says: Please revert. Besides the driver in net, it modifies the I2C core code. This has not been acked by the I2C maintainer (in this case me). So, please don't pull this in via the net tree. The question raised here (extending SMBus calls to 255 byte) is complicated because we need ABI backwards compatibility. Link: https://lore.kernel.org/all/YZJ9H4eM%2FM7OXVN0@shikoro/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski committed Nov 15, 2021
-
Merge branch 'generic-phylink-validation'
Russell King says:

====================
introduce generic phylink validation

The various validate method implementations we have in phylink users have been quite repetitive but also prone to bugs. These patches introduce a generic implementation which relies solely on the supported_interfaces bitmap introduced during the last cycle, and in the first patch, a bit array of MAC capabilities.

MAC drivers are free to continue to do their own thing if they have special requirements - such as mvneta and mvpp2 which do not support 1000base-X without AN enabled.

Most implementations currently in the kernel can be converted to call phylink_generic_validate() directly from the phylink MAC operations structure once they fill in the supported_interfaces and mac_capabilities members of phylink_config.

This series introduces the generic implementation, and converts mvneta and mvpp2 to use it.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
net: mvpp2: use phylink_generic_validate()
Convert mvpp2 to use phylink_generic_validate() for the bulk of its validate() implementation. This network adapter has a restriction that for 802.3z links, autonegotiation must be enabled. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
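The pattern this commit describes, a driver-specific restriction combined with generic validation for everything else, can be modeled in a few lines of Python. This is a toy model and not kernel code: link modes are plain strings in sets rather than bitmaps, the function names are hypothetical, and applying the restriction before the generic step is an assumption of the sketch.

```python
# 802.3z interface modes, written as lowercase strings for this model.
IEEE_8023Z = {"1000base-x", "2500base-x"}


def generic_validate(supported, mac_link_modes):
    """Keep only the link modes the MAC declares it can do."""
    return supported & mac_link_modes


def driver_validate(interface, supported, autoneg_advertised, mac_link_modes):
    """Apply the driver's own restriction, then defer to the generic step."""
    # Restriction from the commit text: 802.3z links need autonegotiation.
    if interface in IEEE_8023Z and not autoneg_advertised:
        return set()
    return generic_validate(supported, mac_link_modes)


# Hypothetical link-mode names, for illustration only.
supported = {"1000baseX/Full", "10000baseT/Full"}
mac_modes = {"1000baseX/Full"}

without_an = driver_validate("1000base-x", supported, False, mac_modes)
with_an = driver_validate("1000base-x", supported, True, mac_modes)
```

The driver keeps only the check that is genuinely specific to its hardware, and the repetitive masking logic lives in one shared function.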
-
net: mvneta: use phylink_generic_validate()
Convert mvneta to use phylink_generic_validate() for the bulk of its validate() implementation. This network adapter has a restriction that for 802.3z links, autonegotiation must be enabled. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>