Service origin-master-controllers crash when service systemd-journald reloads #40

Closed
jperville opened this Issue Jan 3, 2017 · 7 comments

Collaborator

jperville commented Jan 3, 2017

Following the release of v1.10.26 of this cookbook, I want to submit my last issue with running this cookbook in openshift_HA mode (with external etcd).

Some context: in my environment cookbook, the systemd-journald configuration gets reloaded at some point (I enable persisting the journal to disk and set up some maximum sizes).

The bug: reloading systemd-journald reproducibly crashes the origin-master-controllers service, which does not come back until I run chef again or restart the service manually. I still don't know whether this is an OpenShift issue or an issue with this cookbook (the origin-master-* systemd units are created by this cookbook).

Here is how to reproduce, step by step:

  1. checkout https://github.com/PerfectMemory/origin-provision-bug-demo.git

  2. vagrant up master

This boots a Vagrant VM with a working openshift3 1.3.1 master configured:

  • to use external etcd (HA mode)
  • to directly log to journald (important!)

You may need to install Vagrant, the latest chef-dk and the vagrant-berkshelf plugin to make it work.

Once the VM is provisioned, ssh into it for the rest of the reproduction steps.

vagrant ssh master
  3. in the VM, check that the different openshift master services are working

[vagrant@master ~]$ sudo netstat -ntlp | egrep ':(8443|8444)'
tcp        0      0 0.0.0.0:8443            0.0.0.0:*               LISTEN      24445/openshift
tcp        0      0 0.0.0.0:8444            0.0.0.0:*               LISTEN      24468/openshift

  4. in another vagrant ssh terminal, tail system messages

[vagrant@master ~]$ sudo journalctl -f

  5. restart systemd-journald service

[vagrant@master ~]$ sudo systemctl restart systemd-journald.service

In the log tailed in step 4, the message "Started Flush Journal to Persistent Storage" appears, and immediately afterwards some of the openshift master services crash.

  6. observe crashed origin-master services

[vagrant@master ~]$ sudo netstat -ntlp | egrep ':(8443|8444)'
tcp        0      0 0.0.0.0:8443            0.0.0.0:*               LISTEN      24445/openshift

[vagrant@master ~]$ sudo service origin-master-controllers status
Redirecting to /bin/systemctl status  origin-master-controllers.service
● origin-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/usr/lib/systemd/system/origin-master-controllers.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2017-01-03 16:05:13 UTC; 50s ago
     Docs: https://github.com/openshift/origin
 Main PID: 24468 (code=killed, signal=PIPE)

Jan 03 16:04:23 master origin-master-controllers[24468]: I0103 16:04:23.521639   24468 nodecontroller.go:609] NodeController is entering network segmentation mode.
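
The before/after comparison in steps 3 and 6 can be automated. The following is a minimal sketch, not part of the cookbook: the function name `check_master_ports` is hypothetical, and it assumes `netstat -ntlp`-style output on stdin. In the output above, port 8444 belongs to the controllers process (PID 24468), and 8443 to the other openshift process.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -ntlp`-style output on stdin and report
# whether the two master ports seen in this issue are in the LISTEN state.
# Returns non-zero if at least one of them is missing.
check_master_ports() {
  input=$(cat)
  rc=0
  for port in 8443 8444; do
    # Match a tcp line whose local address ends in ":<port>" and is LISTENing.
    if printf '%s\n' "$input" | grep -Eq "^tcp.*:${port}[[:space:]].*LISTEN"; then
      echo "port $port: listening"
    else
      echo "port $port: NOT listening"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage inside the VM would be `sudo netstat -ntlp | check_master_ports`, run once before and once after restarting systemd-journald.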

Owner

IshentRas commented Jan 3, 2017

@jperville The issue was on our side...👎
The service file for master-controllers was outdated and was missing some conditions that help against the issue you reported.
Check this one and let me know if it works better: https://github.com/IshentRas/cookbook-openshift3/tree/release/1.10.27

3d9e605#diff-dda3923836a9a27946faf8134cc3445d
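
The contents of the linked commit are not reproduced here. As a general illustration only, and not necessarily what that commit changes: systemd treats termination by SIGPIPE as a clean exit, which is consistent with the "inactive (dead)" state in the report, so `Restart=on-failure` would not bring the service back while `Restart=always` would. A minimal sketch of a drop-in with that setting follows; the helper name, the file name `10-restart.conf`, and the 5-second delay are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that restarts the unit after
# any exit. $1 is the drop-in directory, for example
# /etc/systemd/system/origin-master-controllers.service.d
write_restart_dropin() {
  dropin_dir="$1"
  mkdir -p "$dropin_dir" || return 1
  cat > "$dropin_dir/10-restart.conf" <<'EOF'
[Service]
# Restart even after a "clean" exit such as being killed by SIGPIPE.
Restart=always
RestartSec=5s
EOF
}
```

After writing the drop-in as root, `systemctl daemon-reload` followed by a restart of the unit would be needed for it to take effect.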

Collaborator

jperville commented Jan 3, 2017

Cheers!

Collaborator

jperville commented Jan 3, 2017

I will test this when I return to work tomorrow morning. Thanks for the quick investigation and fix, @IshentRas.

Owner

IshentRas commented Jan 3, 2017

Let me know if it works and I'll merge and upload the code.

Collaborator

jperville commented Jan 4, 2017

Hello @IshentRas, your fix works for me. After restarting systemd-journald, the origin-master daemons come back as expected.

Owner

IshentRas commented Jan 4, 2017

Perfect, will be merged soon 👍
Thanks for pointing it out.

@IshentRas IshentRas closed this Jan 4, 2017

@IshentRas IshentRas reopened this Jan 4, 2017

@IshentRas IshentRas closed this Jan 4, 2017
