
Error code indicates success when OOM #68

Closed
adriendelsalle opened this issue Nov 3, 2021 · 4 comments · Fixed by #69

Comments

@adriendelsalle (Contributor)

Hey @DaanDeMeyer thanks for this project!

Description

I'm running reproc++ inside a Docker container and I would like to catch OOM errors.
When a sub-process gets killed by the OOM killer, the status is 137, reflecting a fatal SIGKILL (AFAIK, per https://tldp.org/LDP/abs/html/exitcodes.html), but the error code is system:0.
I don't really get why the error code doesn't reflect that.

Does the error_code_from function, called from process::stop, consider any non-negative status as successful?

static std::error_code error_code_from(int r)
{
  if (r >= 0) {
    return {};
  }
  (...)
}

Thanks!

To reproduce

A minimal example trying to allocate a numpy array:

// test.cpp
#include <reproc++/run.hpp>

#include <vector>
#include <iostream>

int main()
{
        reproc::options options;

        std::vector<std::string> cmd = { "python3", "-c", "import numpy; numpy.random.rand(1000000)" };
        auto [status, ec] = reproc::run(cmd, options);

        std::cout << status << std::endl;
        std::cout << ec << std::endl;

        return 1;
}

The CMake lists to build the project:

cmake_minimum_required (VERSION 3.2)

project(reproc-oom)

add_executable(reproc test.cpp)

find_package(reproc++ CONFIG REQUIRED)
find_package(Threads REQUIRED)

target_link_options(reproc PUBLIC -static-libstdc++ -static-libgcc)
target_link_libraries(reproc PRIVATE libreproc++.a libreproc.a rt dl resolv Threads::Threads)

set_property(TARGET reproc PROPERTY CXX_STANDARD 17)

A Docker image with numpy installed:

FROM ubuntu:20.04

RUN apt update && apt-get install -y python3 python3-numpy

Image built with tag reprocoom: docker build . -t reprocoom
And finally the Docker command to reproduce:

docker run -it --rm -v <path-to-build-dir>/reproc:/bin/reproc --memory=6m --memory-swap=6m reprocoom:latest reproc
137
system:0
adriendelsalle (Contributor, Author) commented Nov 3, 2021

Digging a bit more, I'm wondering why process_wait returns a positive status (and hence no error) when a signal caused the termination of the child process:

int process_wait(pid_t process)
{
  ASSERT(process != PROCESS_INVALID);

  int status = 0;
  int r = waitpid(process, &status, 0);
  if (r < 0) {
    return -errno;
  }

  ASSERT(r == process);

  return parse_status(status);
}

With parse_status:

static int parse_status(int status)
{
  return WIFEXITED(status) ? WEXITSTATUS(status) : WTERMSIG(status) + 128;
}

Is it to support stop actions (reproc_terminate and reproc_kill) being treated as "normal"/"controlled" exits (i.e. without error codes)?

Do I then need to filter the status code depending on the stop options I passed (or left empty)?
It looks like reproc_wait could be called from reproc_stop with a flag specifying that those signals should be considered "normal" terminations?

We could also document this behavior a bit more.

DaanDeMeyer (Owner) commented Nov 3, 2021

I agree we should probably deal better with signals killing processes that aren't sent by reproc. For now, you've understood it correctly: we don't interpret exit statuses, so you'd have to do that yourself after the process exits. I'd happily merge a PR that states this more explicitly in the documentation.

(We also don't expose a constant for the OOM signal yet, but that should be easy to implement as well.)

@adriendelsalle (Contributor, Author)

I would like to document the workaround I implemented in mamba, both to get your feedback on it @DaanDeMeyer and to help anyone else facing this situation:

#include <reproc/reproc.h>      // REPROC_SIGKILL, REPROC_SIGTERM
#include <reproc++/reproc.hpp>  // reproc::options

#include <stdexcept>
#include <system_error>

bool reproc_killed(int status)
{
    return status == REPROC_SIGKILL;
}

bool reproc_terminated(int status)
{
    return status == REPROC_SIGTERM;
}

void assert_reproc_success(reproc::options options, int status, std::error_code ec)
{
    bool killed_not_an_err = (options.stop.first.action == reproc::stop::kill)
                             || (options.stop.second.action == reproc::stop::kill)
                             || (options.stop.third.action == reproc::stop::kill);

    bool terminated_not_an_err = (options.stop.first.action == reproc::stop::terminate)
                                 || (options.stop.second.action == reproc::stop::terminate)
                                 || (options.stop.third.action == reproc::stop::terminate);

    if (ec || (!killed_not_an_err && reproc_killed(status))
        || (!terminated_not_an_err && reproc_terminated(status)))
    {
        // LOG_ERROR is mamba's logging macro; substitute your own logger.
        if (ec)
            LOG_ERROR << "Subprocess call failed: " << ec.message();
        else if (reproc_killed(status))
            LOG_ERROR << "Subprocess call failed (killed)";
        else
            LOG_ERROR << "Subprocess call failed (terminated)";
        throw std::runtime_error("Subprocess call failed. Aborting.");
    }
}

jli commented Jan 20, 2022

@adriendelsalle Heya, just checking: do recent versions of mamba/micromamba have a mitigation/workaround for this? (My team is still using an old micromamba version for our Docker images, and new teammates run into this issue until they bump up their Docker memory resources.)
