Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dev.icinga.com #11390] Command pipe overloaded: Can't send external Icinga command to the local command file #4037

Closed
icinga-migration opened this issue Mar 15, 2016 · 20 comments

Comments

Projects
None yet
1 participant
@icinga-migration
Copy link
Member

commented Mar 15, 2016

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11390

Created by critical on 2016-03-15 17:47:41 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2016-05-11 09:10:04 +00:00)
Target Version: 2.4.8
Last Update: 2016-05-11 09:34:36 +00:00 (in Redmine)

Icinga Version: 2.3.11
Backport?: Not yet backported
Include in Changelog: 1

I have a custom icingaweb2 module that uses the icinga php command pipe library to schedule dynamic downtime in icinga2.

$cmd = new ScheduleHostDowntimeCommand();
// set data
$this->transport->send($cmd);

Before I schedule the downtime (above) I have to make sure that all other scheduled downtimes are cleared. Without knowing the downtime ID's I have to make a query:

$query = $this->backend->select()->from('downtime',
            array(
                'id'              => 'downtime_internal_id',
            ))
        ->where('downtime_author_name', 'My Scheduler');

And then use the command pipe library:

foreach ($downtimes as $downtime) {
    foreach ([true, false] as $flag) {
        $cmd = new DeleteDowntimeCommand();
        $cmd->setDowntimeId($downtime->id);
        $cmd->setIsService($flag);
        $this->transport->send($cmd);
    }
}

So for each host that is currently in downtime I have two commands to send (clear service and host downtime), and if the host requires downtime scheduling I need one more. So with 100 hosts I have (at worst) 300 commands to send. When this happens a few of my `$this~~transport~~>send($cmd);` sometimes fail with:

Can't send external Icinga command to the local command file "/var/run/icinga2/cmd/icinga2.cmd":  No such device or address

But the pipe exists prior and after the crash. It seems to get overloaded with data, close, and then reopen.

Helpful info:

Disabled features: compatlog debuglog gelf icingastatus livestatus notification opentsdb perfdata statusdata syslog
Enabled features: api checker command graphite ido-mysql mainlog

/etc/icingaweb2/modules/monitoring/instances.ini 
[cmd]
transport   = "local"
path        = "/var/run/icinga2/cmd/icinga2.cmd"

ls -lah /var/run/icinga2/cmd/icinga2.cmd 
prw-rw---- 1 nagios www-data 0 Mar 15 11:27 /var/run/icinga2/cmd/icinga2.cmd

Changesets

2016-05-11 09:04:28 +00:00 by mfriedrich a529725

Fix overloaded command pipe errors

fixes #11390

2016-05-12 09:11:02 +00:00 by mfriedrich b39634d

Fix overloaded command pipe errors

fixes #11390

Relations:

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Mar 15, 2016

Updated by critical on 2016-03-15 17:54:36 +00:00

Versions:

icinga2:         2.3.11
icingaweb2: 2.0.0-rc1
@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Mar 15, 2016

Updated by critical on 2016-03-15 18:42:52 +00:00

Logs

icingaweb2 log: nothing found (seems to throw PHP exception which I catch):
icinga2 log:         nothing found
@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Mar 15, 2016

Updated by critical on 2016-03-15 18:57:02 +00:00

https://dev.icinga.org/issues/8815: Already have this patch.
https://dev.icinga.org/issues/9570: Suggests to upgrade to icinga2 2.4 (as the issue has not been found there). Unfortunately that is not possible in my case.
https://dev.icinga.org/issues/10410: The possible fix in 2.4 (ref: 0a6505c#diff-41f5f9b62fd89e63b82e66bbe76c0e73)

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Mar 15, 2016

Updated by critical on 2016-03-15 21:41:22 +00:00

Can someone move this to icinga2?

I've applied the patches from 0a6505c#diff-41f5f9b62fd89e63b82e66bbe76c0e73 to my 2.3.11 release and still I am having problems.

After reviewing the code further I see some implementation problems. Specifically, https://github.com/Icinga/icinga2/blob/master/lib/compat/externalcommandlistener.cpp#L120, there is a break command for all return codes of sock->Read(...). This should only occur if errno != EAGAIN because we are using non blocking resources.

            try {
                rc = sock->Read(buffer, sizeof(buffer));
            } catch (const std::exception& ex) {
                if (errno == EAGAIN)
                    continue;
                Log(LogWarning, "ExternalCommandListener")
                    << "Cannot read from socket." << DiagnosticInformation(ex);
                break;
            }

            if (rc == 0)
                break;

After applying these fixes and running my tests I have not seen the pipe unexpectedly close.

Some extra work might be required in base/socket.cpp Read(...) to decrease log spam, these shouldn't be critical:

[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-15 15:39:50 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"

Doing more testing.

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Mar 16, 2016

Updated by mfriedrich on 2016-03-16 09:16:50 +00:00

  • Project changed from Icinga Web 2 to Icinga 2
  • Category changed from 133 to Compat
  • Icinga Version set to 2

First, thanks for all the comments. I would suggest though to get a test vm with 2.4.x+ allowing you to develop and test changes.

We'll happily review and test your patch then.

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Mar 16, 2016

Updated by critical on 2016-03-16 14:59:18 +00:00

Not familiar with centos or vagrant. Can we spin up a Debian 8.1 i386 minimal rather than using icinga-vagrant?

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Mar 16, 2016

Updated by mfriedrich on 2016-03-16 15:08:21 +00:00

If you are developing and compiling stuff it is fairly up to you which distribution you prefer :)

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Mar 16, 2016

Updated by critical on 2016-03-16 17:36:50 +00:00

VM Info

OS: Debian 8.1 Jessie i386 (minimal clean)
icinga2: 2.4.4
icingaweb2: 2.0.0rc1

Testing

  1. Enable command pipe
  2. Reproduce error by sending downtime clear and downtime schedule events for 300 hosts:
  • Run 1: pass
  • Run 2: pass
  • Run 3: fail

OUTPUT:

OK
OK
OK
FAIL: Can't send external Icinga command to the local command file "/var/run/icinga2/cmd/icinga2.cmd":  No such device or address
FAIL: Can't send external Icinga command to the local command file "/var/run/icinga2/cmd/icinga2.cmd":  No such device or address
FAIL: Can't send external Icinga command to the local command file "/var/run/icinga2/cmd/icinga2.cmd":  No such device or address
FAIL: Can't send external Icinga command to the local command file "/var/run/icinga2/cmd/icinga2.cmd":  No such device or address
FAIL: Can't send external Icinga command to the local command file "/var/run/icinga2/cmd/icinga2.cmd":  No such device or address
FAIL: Can't send external Icinga command to the local command file "/var/run/icinga2/cmd/icinga2.cmd":  No such device or address
OK
OK
...

LOG:

[2016-03-16 05:14:46 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-16 05:14:46 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-16 05:14:46 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"

Note: Seems to fail both on rc < 0 and rc == 0 when it shouldn't [1].

  1. Apply patch:

      try {
                 rc = sock->Read(buffer, sizeof(buffer));
             } catch (const std::exception& ex) {
                 if (errno == EAGAIN) /* we've read all the data */
                     continue;
                 Log(LogWarning, "ExternalCommandListener")
                     << "Cannot read from socket." << DiagnosticInformation(ex);
                 break;
             }
    
     if (rc == 0) /* empty pipe (EOF) */
             continue;
    
  2. Rebuild and retry tests

  • Run 1: pass
  • Run 2: pass
  • Run 3: pass
  • Run 4: pass
  • Run 5: pass
  • Run 6: pass

LOG:

[2016-03-16 05:16:31 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
[2016-03-16 05:16:31 -0600] critical/Socket: recv() failed with error code 11, "Resource temporarily unavailable"
  1. TODO

A way to verify that all data has passed through the pipe when this situation occurs.

References

  1. http://pubs.opengroup.org/onlinepubs/007908775/xsh/read.html
@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Mar 18, 2016

Updated by mfriedrich on 2016-03-18 10:10:40 +00:00

  • Status changed from New to Assigned
  • Assigned to set to critical

Cool, thanks. Please let us know when you've got a final patch :)

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Apr 6, 2016

Updated by tgelf on 2016-04-06 10:35:34 +00:00

Probably same problem here, but running still 2.4.3. Not so funny for automation, as commands tend to "randomly fail" based on the number of commands sent through the pipe.

Cheers,
Thomas

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Apr 6, 2016

Updated by mfriedrich on 2016-04-06 15:33:56 +00:00

@critical

Do you have any updates from your side? :) Thanks.

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Apr 7, 2016

Updated by mfriedrich on 2016-04-07 08:26:36 +00:00

  • Assigned to changed from critical to mfriedrich
  • Priority changed from Normal to High
  • Target Version set to 2.4.6
@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Apr 11, 2016

Updated by elippmann on 2016-04-11 08:54:21 +00:00

  • Blocks set to 10181
@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Apr 20, 2016

Updated by gbeutner on 2016-04-20 16:35:41 +00:00

  • Target Version changed from 2.4.6 to 2.4.7
@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Apr 21, 2016

Updated by gbeutner on 2016-04-21 07:47:00 +00:00

  • Target Version changed from 2.4.7 to 2.4.8
@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Apr 21, 2016

Updated by mfriedrich on 2016-04-21 15:41:17 +00:00

Hm, I'm looking into a reasonably easy way to reproduce the issue. Putting my Icinga 2 box under stress level (10k hosts, 100k services) and then firing that small script does not give any errors at the moment. Any hints or scripts which would help tackle the issue? Thanks.

const countHosts = 10000;
const countServices = 10;


for (i in range(countHosts)) {
  object Host "many-test-" + i {
    check_command = "random"
  }
}

for (j in range(countServices)) {
  apply Service "many-test-" + j {
    check_command = "random"
    assign where match("many*", host.name)
  }
}

#!/bin/bash

cnt=1
attempts=1000

ECHO="/bin/echo"
CMDFILE="/usr/local/icinga2/var/run/icinga2/cmd/icinga2.cmd"

N=1000
output=`seq 1 $N | sed 's/.*/x/' | tr -d '\n'`

while [ $cnt -le $attempts ]
do
  DATETIME=`date +%s`
  CMDLINE="[$DATETIME] PROCESS_SERVICE_CHECK_RESULT;mbmif.int.netways.de;ping4;0;loop $cnt: $output"

  echo "Executing command: '$CMDLINE'"

  $ECHO $CMDLINE >> $CMDFILE

  sleep 0.00001
done

echo "Done."
@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented Apr 21, 2016

Updated by critical on 2016-04-21 16:17:08 +00:00

Unfortunately I have no updates on my end - I have been using the patch I have provided to mitigate this issue.

Would this bash script fail if the CMDFILE did not exist? Or would it write a file instead? Could you try using a C alternative?

// Crude, untested example
#include 
#include 
#include 
#include 
#include 
#include 

int main(int argc, char *argv[])
{
    int fd;
    const size_t MAX_ATTEMPTS = 1000;
    const char *CMD_FILE = "/usr/local/icinga2/var/run/icinga2/cmd/icinga2.cmd";
    size_t attempt = 1;
    const char *CMD_FMT = "[%d] PROCESS_SERVICE_CHECK_RESULT;mbmif.int.netways.de;ping4;0;loop %zu: xxxxxxxxxxxxxxx
    time_t now;
    char buf[2048] = {0};

    do {
        printf("Attempt: %zu\n", attempt);

        now = time(NULL);
        sprintf(buf, CMD_FMT, now, attempt);

        printf("Executing command: %s\n", buf);

        fd = open(CMD_FILE, O_WRONLY);
        if (fd == -1) {
            perror("open");
            break;
        }
        if (write(fd, buf, strlen(buf)) == -1) {
            perror("write");
            break;
        }
        close(fd);
    } while(attempt++ < MAX_ATTEMPTS);

    return 0;
}

Also, what are the specs of the machine you are using? When I am executing these commands my machine's CPU usage (icinga2 and MySQL) raises to 80-90%. Could you try these tests on a VM where you can limit the CPU to 1 or 2 cores with 1GB memory?

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented May 11, 2016

Updated by mfriedrich on 2016-05-11 09:10:04 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

Applied in changeset a529725.

@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented May 11, 2016

Updated by mfriedrich on 2016-05-11 09:18:17 +00:00

  • Backport? changed from __ to Not yet backported
  • Include in Changelog changed from __ to 1
@icinga-migration

This comment has been minimized.

Copy link
Member Author

commented May 11, 2016

Updated by mfriedrich on 2016-05-11 09:34:36 +00:00

FYI - the tests were run on my macbook pro (early 2015, i5, 8 gb ram) taking the resources for the icinga2 and mariadb process when adding 10k hosts and 100k services. At some point the database might be blocking, but all tests were fired before that point. Your patch is reasonable (checking EAGAIN as well as rc=0 continue) so we applied it. Please test the git master / snapshot packages :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.