2.11 Release Tasks #7380

Open
dnsmichi opened this issue Jul 31, 2019 · 17 comments


@dnsmichi
Member

commented Jul 31, 2019

RC Feedback

  • Wait for RC feedback

General Feedback

Reported Issues

  • Cipher lists RHEL7 agents #7366
  • Windows agent 2.10.4 -> 2.11.0 RC1 master: no shared cipher #7386
  • downtime start/end time API validation #7384 (I've reverted the incomplete fix)
  • R2.11.0-rc1-1 endless config sync loop master-satellite #7382 - this syncs unsupported binaries - updated the docs in #7390
  • Problems with changing "run as user" on Windows with 2.11 RC1 - Was: Check command 'powershell' does not exist. with new Agent: v2.11.0-rc1 #7387 Turns out that the Powershell module doesn't change this accordingly. Out of our support scope, tested this extensively w/o the module.
  • v2.11.0-rc1 Windows Agent does not create debug.log #7388 - cannot be reproduced with win10 nor win2012. The PS module works differently thus creating some trouble here.
  • Deny non-utf8 files other than .conf for the cluster config sync #7391
  • 2.11.0 rc1 - An error occurred while upgrading the database #7393 - now using upgrade safe procedure
  • New umbrella system seems to break startup logging under systemd #7394
  • Improve logging of Downtimes #7374 - to mitigate future loops better
  • 2.11 rc1: built-in check command "icinga" doesn't work with version compare #7415
  • Same downtime being created repeatedly in a cluster loop #7198
    • Fix and improve logging for runtime object sync #7423
  • 2.11 RC1: Nessus Scan crash the Windows-Client. #7431

Docs

Release

02-getting-started.md -> 02-Installation.md
04-configuring-icinga-2.md -> 04-configuration.md

@dnsmichi dnsmichi added this to the 2.11.0 milestone Jul 31, 2019

@dnsmichi dnsmichi self-assigned this Jul 31, 2019

@dnsmichi dnsmichi referenced this issue Jul 31, 2019
7 of 8 tasks complete
@dnsmichi

Member Author

commented Jul 31, 2019

@mcktr @Al2Klimov @Crunsher @htriem @lippserd @bobapple The master is now fully frozen. Do not merge anything except for small typo fixes. All remaining PRs are on hold.

Waiting for RC/snapshot customer and user feedback.

@widhalmt

Member

commented Aug 1, 2019

Just to let you know: We got the first replies from customers who started test runs of 2.11 RC on Monday. I'll let you know ASAP when I get feedback on how the tests went.

@dnsmichi dnsmichi added the ref/NC label Aug 1, 2019

@dnsmichi

Member Author

commented Aug 1, 2019

ref/NC/627739

@dnsmichi

Member Author

commented Aug 1, 2019

The "no shared cipher" problem with Windows agents was successfully mitigated and fixed with one of our customers.

Next up is #7382, a possible upgrade and config sync loop.

@dnsmichi

Member Author

commented Aug 2, 2019

The sync loop was caused by binary files, which the config sync does not support. Adding detection is hard and not reasonable for the 99.9% of users who use the config sync just for config files. Therefore, this gets a docs fix only.

Two new issues: a missing check command (in feedback loop) and a missing debug log on Windows.

@Thomas-Gelf

Contributor

commented Aug 2, 2019

@dnsmichi: couldn't we be strict and refuse to work with anything but 100% valid UTF-8?

@Al2Klimov

Contributor

commented Aug 2, 2019

@Thomas-Gelf We already auto-sanitize JSON I/O (as otherwise our new JSON lib would complain).

@Thomas-Gelf

Contributor

commented Aug 2, 2019

Auto-sanitization has its place. It is required to deal with unclean plugin output and (eventually) configuration "from /etc". I would not apply it to data from "trusted" sources. Read: invalid (non-UTF-8) data in "/var" should lead to an error log message followed by an immediate process shutdown. Invalid data via Netstring should lead to a terminated connection. In these contexts auto-sanitization doesn't help and instead becomes part of the problem.
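
The strict policy described above can be sketched in shell. This is only an illustration and not how Icinga 2 implements it: `is_valid_utf8` and `check_trusted_dir` are hypothetical helper names, and the check assumes an `iconv` that exits non-zero on the first invalid input sequence.

```shell
#!/bin/sh
# Hypothetical helper, not part of Icinga 2: succeeds only if the file
# decodes as strict UTF-8. Nothing is repaired or replaced.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

# Strict policy for a "trusted" directory: report the first offending file
# and fail, instead of sanitizing its content.
# Assumption: file names do not contain newlines.
check_trusted_dir() {
    find "$1" -type f | while IFS= read -r f; do
        if ! is_valid_utf8 "$f"; then
            echo "invalid UTF-8: $f" >&2
            exit 1   # leaves the pipeline subshell; becomes the function's status
        fi
    done
}
```

A caller would treat a non-zero status from `check_trusted_dir` as fatal, for example by logging the error and refusing to start, rather than continuing with altered data.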

@dnsmichi

Member Author

commented Aug 2, 2019

Please move this discussion into #7391 - I've been working on this offline with Tom's help already.

This issue is solely for tracking the tasks left open for 2.11, to keep @lippserd & @bobapple updated.

@dnsmichi

Member Author

commented Aug 5, 2019

Cluster config sync is done. The missing powershell check command turned out to be a permissions problem and not really a bug. The Windows debug log issue remains non-reproducible.

New to the party is the systemd logging issue, which is part of this week's fixes.

@dnsmichi

Member Author

commented Aug 6, 2019

  • v2.11.0-rc1 Windows Agent does not create debug.log #7388

is solved. The Powershell module is being used, which doesn't support icinga2 feature list and variants. It also collides with our graphical setup wizard, which uses the default configuration layout instead of a single icinga2.conf file.

TL;DR - don't use the Powershell module for RC tests.

@dnsmichi

Member Author

commented Aug 7, 2019

Reloading with a failed config validation under systemd (#7394) is now logged correctly. Alex and I also decided to add a log line that points users to running icinga2 daemon -C afterwards.

Aug 07 11:53:43 icinga2-centos7-dev.vagrant.demo.icinga.com icinga2[22031]: [2019-08-07 11:53:43 +0200] critical/cli: Config validation failed. Re-run with 'icinga2 daemon -C' after fixing the config.

While testing the Windows agent I was looking at something in the docs and decided to restructure the agents chapter. This follows the updates to the distributed monitoring chapter. Done.

The Windows permission problem in #7387 turned out to be a problem with the Powershell module, and @LordHepipud pointed me to a Director issue: Icinga/icingaweb2-module-director#1297. The fix was already released with 1.6.1 🕺

@dnsmichi

Member Author

commented Aug 8, 2019

The network stack with Boost Asio may create FIFO pipes, which are visible with lsof. If there are too many of them, fork errors with "too many open files" may occur. This is under investigation at a customer; raising the nofiles limits is the first attempt.

@dnsmichi

Member Author

commented Aug 9, 2019

It is not the network stack; it has to do with the check process execution. To mitigate the issue, we've raised the number of open files.

systemctl edit icinga2

[Service]
LimitNOFILE=50000
LimitNPROC=50000
TasksMax=infinity

vim /etc/default/icinga2

ICINGA2_RLIMIT_FILES=50000

systemctl daemon-reload
systemctl restart icinga2
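
For hosts where an interactive editor is not practical, the override can also be written as a drop-in file. This is a minimal sketch, assuming the standard drop-in location that `systemctl edit icinga2` uses; `write_limits_override` is a hypothetical helper name, and the optional directory argument only exists so the function can be tried without root.

```shell
#!/bin/sh
# Hypothetical helper: writes the limits from this comment as a systemd drop-in.
# The default path is where "systemctl edit icinga2" stores its override.
write_limits_override() {
    dropin_dir=${1:-/etc/systemd/system/icinga2.service.d}
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
LimitNOFILE=50000
LimitNPROC=50000
TasksMax=infinity
EOF
}
```

After calling it as root without arguments, the daemon-reload and restart above are still needed for the new limits to take effect.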

for p in $(pidof icinga2); do echo -e "$p\n" && ps -ef | grep $p && echo && cat /proc/$p/limits | grep 'open files' && echo; done

for p in $(pidof icinga2); do echo -e "$p\n" && ps -ef | grep $p && echo && lsof -p $p && echo; done

This increased the number of pipes in the main process, and the fork errors are now gone. Why the check execution rate may drop is still under investigation: 1000/s vs. current_concurrent_checks=10k.
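
To compare pipe and file descriptor usage against the limit without reading the full lsof output, the same information can be condensed per process. This is a rough sketch for Linux only, reading /proc directly; `fd_usage` is a hypothetical helper name and not something shipped with Icinga 2.

```shell
#!/bin/sh
# Hypothetical helper: prints open descriptors, pipes among them, and the
# soft "open files" limit for one PID, based on /proc (Linux only).
fd_usage() {
    pid=$1
    # Every entry in /proc/<pid>/fd is one open file descriptor.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    # Pipe descriptors are symlinks of the form "pipe:[inode]".
    pipes=$(ls -l "/proc/$pid/fd" 2>/dev/null | grep -c 'pipe:' || true)
    # Field 4 of the "Max open files" row is the soft limit.
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$pid open=$open pipes=$pipes soft_limit=$limit"
}

# Report every running icinga2 process; prints nothing if none is found.
for p in $(pidof icinga2 2>/dev/null || true); do
    fd_usage "$p"
done
```

If `open` gets close to `soft_limit`, fork errors of the kind described above become likely. Reading another user's /proc entries usually requires root.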

@dnsmichi dnsmichi pinned this issue Aug 13, 2019

@dnsmichi

Member Author

commented Aug 13, 2019

MaxConcurrentChecks is under investigation in our cleanup sprint week, same as the downtime loop. Team @Al2Klimov @bobapple @dnsmichi.

Small version parse fix incoming for the icinga check.

@dnsmichi

Member Author

commented Aug 16, 2019

Fork errors are resolved by raising the number of open files, as described in the troubleshooting docs. General performance is being analysed and tested once more.

@dnsmichi

Member Author

commented Aug 16, 2019

Coming late to the party: the downtime create/delete loop in HA clusters was fixed this week with #7198, a nearly four-year-old problem.
