Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

criu/plugin: Add environment variable to cap size of buffers. #10

Open
wants to merge 382 commits into
base: criu-dev
Choose a base branch
from

Conversation

fdavid-amd
Copy link

The amdgpu plugin would create a memory buffer at the size of the largest VRAM bo (buffer object). On some systems, VRAM size exceeds RAM size, so the largest bo might be larger than the available memory.

Add an environment variable KFD_MAX_BUFFER_SIZE, which caps the size of this buffer. By default, it is set to 0, and has no effect. When active, any bo larger than its value will be saved to/restored from file in multiple passes.

mclapinski and others added 30 commits April 15, 2023 21:17
This kernel feature contained some bugs initially. Those logs are useful in identifing what the
underlaying issue is and which kernel patch to backport.

Signed-off-by: Michal Clapinski <mclapinski@google.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
As our pr_* functions are complex and can call different system calls
inside before actual printing (e.g. gettimeofday for timestamps) actual
errno at the time of printing may be changed.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
As our pr_* functions are complex and can call different system calls
inside before actual printing (e.g. gettimeofday for timestamps) actual
errno at the time of printing may be changed.

Let's just use %s + strerror(errno) instead of %m with pr_* functions to
be explicit that errno to string transformation happens before calling
anything else.

Note: tcp_repair_off is called from pie with no pr_perror defined due to
CR_NOGLIBC set and if I use errno variable there I get "Unexpected
undefined symbol: `__errno_location'. External symbol in PIE?", so it
seems there is no way to print errno there, so let's just skip it.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We see that libbsd redefines __has_include to be always true, which
breaks such checks for rseq. The idea behind this patch is remove the
use of libbsd functions and always export our replacement functions.

Using __strlcat and __strlcpy everywhere in existing code:
git grep --files-with-matches "strlcat" | xargs sed -i 's/strlcat/__strlcat/g'
git grep --files-with-matches "strlcpy" | xargs sed -i 's/strlcpy/__strlcpy/g'

Fixes: checkpoint-restore#2036
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
… bsd headers

We see that libbsd redefines __has_include to be always true, which
breaks such checks for rseq. The idea behind this patch is to put all
uses of libbsd functions to separate c files and only export wrapper
functions for them.

Using __setproctitle and __setproctitle_init everywhere in existing
code:

git grep --files-with-matches "setproctitle" | xargs sed -i 's/setproctitle/__setproctitle/g'
git grep --files-with-matches "setproctitle_init" | xargs sed -i 's/setproctitle_init/__setproctitle_init/g'

Fixes: checkpoint-restore#2036
Suggested-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This patch sets VMA_AREA_REGULAR on hugetlb and anon shmem VMAs since
they can be handled the same way as other kinds of regular memory.

Co-authored-by: Ivanq <imachug@yandex.ru>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
If trying to open /proc/$pid/map_files/x-x for a given VMA fails with
EPERM (can happen in unprivileged mode when running in a non-init user
ns), fall back to reading the content from /proc/$pid/mem.

Co-authored-by: Ivanq <imachug@yandex.ru>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Co-authored-by: Ivanq <imachug@yandex.ru>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
We see that when lint is called for push action git has only one last
commit which makes make indent with git-clang-format fail to operate.

Fix it by increasing fetch depth to one more commit.

Fixes: checkpoint-restore#2066
Fixes: d6db333 ("clang-format: rework make indent to check specific commits")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Fixes: 237bd26 ("remap: Rename global lock", 2017-05-18)
Signed-off-by: Michał Mirosław <emmir@google.com>
Make the code a bit more readable by uncovering a while loop from
a if() goto sequence.

Signed-off-by: Michał Mirosław <emmir@google.com>
Checking errno in outer function is really strange, also saving errno of
mount syscall after calling pr_perror is completely wrong. So let's try
to simplify things.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Simplify code a bit: make exit codes of those functions more
transparent, rename ret to exit_code.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This is done to follow 'Linux kernel coding style', same change was
added to .clang-format in linux kernel source recently:
torvalds/linux@d7f6604341c74

We don't change it in current code base but let's follow it in all
future uses.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This patch applies the '-ffreestanding' flag that was introduced
with checkpoint-restore#1726 to MIPS.

Fixes: checkpoint-restore#1725

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
If we don't have access to map_files and instead have to get the data
from /proc/$pid/mem we can close and reset the fd before passing it to
do_dump_one_shmem() which can then check it before trying to seek past
holes, eliminating the need for a separate seek_data_supported boolean.

Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
CAP_CHECKPOINT_RESTORE does not give access to /proc/$pid/map_files in
user namespaces. In order to test that CRIU in unprivileged mode can
dump and restore anonymous shared memory pages we will run the maps00
tests in a user namespace.

Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
If we can't access a map_files entry directly and instead have to follow
the link and access the file via a filesystem path we need to properly
deal with files on btrfs subvolumes.

Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The IP_FREEBIND option is supported for RAW sockets, why not save it
while we do this for other ip sockets anyway?

One difference is that for SOCK_RAW there is no fallback between
IP_FREEBIND and IPV6_FREEBIND, see:

https://github.com/torvalds/linux/blob/ef4d3ea40565a781c25847e9cb96c1bd9f462bc6/net/ipv6/ipv6_sockglue.c#L1497

So let's have explicit IPV6_FREEBIND for ipv6.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We see systemd-resolved relying on these options, and after migration
the options are lost and systemd-resolved stops serving dns requests.

The socket options make kernel add cmsg with destination address to
packets, see more how systemd-resolved uses them:

https://github.com/systemd/systemd/blob/00a60eaf5fcb3a0e415349aa649f2699550d26b0/src/resolve/resolved-manager.c#L826

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Just creates ipv4/ipv6 raw/dgram sockets with IP_PKTINFO and IP_FREEBIND
socket options enabled/disabled and checks that these options persist.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The --ghost-fiemap option was introduced with checkpoint-restore#1963.

It enables an optimized algorithm based on fiemap ioctl that can reduce
the number of syscalls used to checkpoint highly sparse ghost files. This
option is enabled by default. It can be disabled with --no-ghost-fiemap
when using SEEK_HOLE/SEEK_DATA is preferred. In addition, an automatic
fallback to SEEK_HOLE/SEEK_DATA is used for filesystems that do not
supporting fiemap.

Co-authored-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
If we build tags for our repo:

[criu]$ make tags
  GEN      tags

And then run codespell, we get an error:

[criu]$ codespell
./tags:3755: struc ==> struct

Let's exclude tags file from codespell search, this would add usability
to `make lint`.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
When error happens on file dumping stage the only information about the
task we dumping is its PID. For debug purpose show task's @comm early.

It proves useful when trying to understand which of dumped applications
is "guilty" in brokern dump when pid is not there anymore.

Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
In Python 3 b'' == '' is False. This causes the info action to fail with

  File "/usr/lib/python3.11/site-packages/crit-3.17-py3.11.egg/pycriu/images/images.py", line 178, in count
    size, = struct.unpack('i', buf)
            ^^^^^^^^^^^^^^^^^^^^^^^
  struct.error: unpack requires a buffer of 4 bytes

Reported-by: Sankalp Acharya (@sankalp-12)
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
osctobe and others added 28 commits September 26, 2023 16:23
When -- after restore -- sockets can't communicate, the test times out
while waiting on recvfrom(). Since the communication is local, send()
works instantaneously - so mark sockets with SOCK_NONBLOCK and report
failure if the message is not received immediately.

Signed-off-by: Michał Mirosław <emmir@google.com>
All test logs are flooded with the "userns is supported" messages...

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Currently page_size() returns unsigned int value that is after "bitwise
not" is promoted to unsigned long value e.g. in uffd.c
handle_page_fault. Since the value is unsigned promotion is done with 0
MSB that results in lost of MSB pagefault address bits. So make
page_size to return  unsigned long to avoid such situation.

Signed-off-by: Vladislav Khmelevsky <och95@yandex.ru>
Currently most of the times we don't have problems with VVAR segment and
lazy restore because when VDSO is parked there is an munmap call that
calls UFFDIO_UNREGISTER on the destination address.
But we don't want to enable userfaultfd for VDSO and VVAR at the first
place.
Signed-off-by: Michal Clapinski <mclapinski@google.com>
It means CRIU has to close it when it is not needed.

It looks more logically correct and matches the behaviour of
the RESTORE_EXT_FILE callback.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
This patch adds the `libdrm-dev` package to the list of CRIU
dependencies installed in CI to build CRIU with amdgpu plugin.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
amdgpu_plugin.c:930:6: error: variable 'buffer' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
        if (ret) {
            ^~~
amdgpu_plugin.c:988:8: note: uninitialized use occurs here
        xfree(buffer);

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
One memfd can be shared by a few restored files. Only of these files is
restored with a file created with memfd_open. Others are restored by reopening
memfd files via /proc/self/fd/.

It seems unnecessary for restoring memfd memory mappings. We can always use the
origin file.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The "ColumnLimit: 120" is not only allowing lines to be longer than 80
characters but it also forces line wrapping at 120 characters. If total
expression length is more than 120 characters, clang-format will try to
wrap it as close to 120 as it can, it would not even allow to wrap at 80
characters if we really want it. But as we all know 80 characters is
Linux kernel coding style default and as far as our coding style is
based on it it is really strange to prohibit wrapping lines at 80
characters...

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
GCC's lto source:
> To avoid this problem the compiler must assume that it sees the
> whole program when doing link-time optimization.  Strictly
> speaking, the whole program is rarely visible even at link-time.
> Standard system libraries are usually linked dynamically or not
> provided with the link-time information.  In GCC, the whole
> program option (@option{-fwhole-program}) asserts that every
> function and variable defined in the current compilation
> unit is static, except for function @code{main} (note: at
> link time, the current unit is the union of all objects compiled
> with LTO).  Since some functions and variables need to
> be referenced externally, for example by another DSO or from an
> assembler file, GCC also provides the function and variable
> attribute @code{externally_visible} which can be used to disable
> the effect of @option{-fwhole-program} on a specific symbol.

As far as I read gcc's source, ipa_comdats() will avoid placing symbols
that are either already in a user-defined section or have
externally_visible attribute into new optimized gcc sections.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
fork_and_ptrace_attach has to fork a child with CLONE_UNTRACED,
so that strace doesn't trace it.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
read_ns_sys_file() can return an error, but we are trying to parse a
buffer before checking a return code.

CID 417395 (#3 of 3): String not null terminated (STRING_NULL)
2. string_null: Passing unterminated string buf to strtol, which expects
   a null-terminated string.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
This check is redundant as line 201 checks for this condition.

Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The is_native field is a boolean. Therefore, else if() should can be
changed to a simple else{}.

Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The condition meant to check fd2 instead of fd1, which is checked in
line 24.

Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
line 131 checks if (ret >= 0). line 133 could be replaced by a simple else statement

Signed-off-by: Taemin Ha <taeminha@cs.utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Eventpollentry's fields are set only when ret == 3 or ret == 6. The
remaining cases can be grouped together to an error

Signed-off-by: Taemin Ha <taemin.ha@utexas.edu>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
At this point the correct position is already restored, so reading from
the fd results in the position being moved forward by 5 bytes.

Fixes: 9191f87 ("criu/files-reg.c: add build-id validation functionality")
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Signed-off-by: Michal Clapinski <mclapinski@google.com>
cgroup_ifpriomap test needs net_prio cgroup, which might not be
available. Make the .checkskip script check it.

Signed-off-by: Michał Mirosław <emmir@google.com>
Newer versions of pip use an isolated virtual environment when building
Python projects. However, when the source code of CRIT is copied into
the isolated environment, the symlink for `../lib/py` (pycriu) becomes
invalid. As a workaround, we used the `--no-build-isolation` option for
`pip install`. However, this functionality has issues in some versions
of PIP [1, 2]. To fix this problem, this patch adds separate packages
for pycriu and crit, and each package is installed independently.

[1] pypa/pip#8221
[2] pypa/pip#8165 (comment)

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Do not use $(USERCFLAGS) for anything other than what the user provide.

Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>
Copy link

A friendly reminder that this PR had no activity for 30 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet