Service origin-master-controllers crash when service systemd-journald reloads #40

Closed
jperville opened this issue Jan 3, 2017 · 7 comments

jperville commented Jan 3, 2017

Following the release of v1.10.26 of this cookbook, I want to submit my last issue with running this cookbook in openshift_HA mode (with external etcd).

Some context: In my environment cookbook, at some point the systemd-journald configuration gets reloaded (I enable persisting the journal to disk and set up some maximum sizes).

The bug: Reloading systemd-journald reliably crashes the origin-master-controllers service, which does not come back until I run chef again or restart the service manually. I still don't know whether this is an openshift issue or an issue with this cookbook (the origin-master-* systemd units are created by this cookbook).

Here is how to reproduce, step by step:

  1. check out https://github.com/PerfectMemory/origin-provision-bug-demo.git

  2. vagrant up master

This boots a Vagrant VM with a working openshift3 1.3.1 master configured:

  • to use external etcd (HA mode)
  • to directly log to journald (important!)

You may need to install Vagrant, the latest chef-dk and the vagrant-berkshelf plugin to make it work.

Once the VM is provisioned, ssh into it for the rest of the reproduction steps.

vagrant ssh master
  3. in the VM, check that the different openshift master services are working
[vagrant@master ~]$ sudo netstat -ntlp | egrep ':(8443|8444)'
tcp        0      0 0.0.0.0:8443            0.0.0.0:*               LISTEN      24445/openshift
tcp        0      0 0.0.0.0:8444            0.0.0.0:*               LISTEN      24468/openshift
  4. in another vagrant ssh terminal, tail system messages
[vagrant@master ~]$ sudo journalctl -f
  5. restart systemd-journald service
[vagrant@master ~]$ sudo systemctl restart systemd-journald.service

In the log we tailed in step 4, the message "Started Flush Journal to Persistent Storage" appears, and right after that point some of the openshift master services crash.

  6. observe crashed origin-master services
[vagrant@master ~]$ sudo netstat -ntlp | egrep ':(8443|8444)'
tcp        0      0 0.0.0.0:8443            0.0.0.0:*               LISTEN      24445/openshift

[vagrant@master ~]$ sudo service origin-master-controllers status
Redirecting to /bin/systemctl status  origin-master-controllers.service
● origin-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/usr/lib/systemd/system/origin-master-controllers.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2017-01-03 16:05:13 UTC; 50s ago
     Docs: https://github.com/openshift/origin
 Main PID: 24468 (code=killed, signal=PIPE)

Jan 03 16:04:23 master origin-master-controllers[24468]: I0103 16:04:23.521639   24468 nodecontroller.go:609] NodeController is entering network segmentation mode.
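
The `signal=PIPE` above is consistent with a process that keeps writing log output after the reader on the other end has gone away. This is a minimal sketch of that mechanism in plain shell, not a reproduction of the openshift crash itself. As assumptions for illustration, `yes` stands in for the logging daemon and `head -n 1` for a log reader that disappears. The function name is hypothetical.

```shell
#!/bin/sh
# `yes` stands in for a daemon that keeps writing log lines;
# `head -n 1` stands in for a log reader that goes away after one line.
writer_status_after_reader_exits() {
  # File descriptor 3 carries the writer's exit status out of the pipeline.
  # `yes` never exits successfully here, so the `||` branch always runs.
  { { yes "log line" || echo "$?" >&3; } | head -n 1 >/dev/null; } 3>&1
}

status=$(writer_status_after_reader_exits)
# With the default SIGPIPE disposition the shell reports 128 + 13 = 141.
echo "writer exit status: $status"
```

A status of 141 means the writer was terminated by SIGPIPE. If SIGPIPE is ignored in the calling environment, the writer instead fails with a "Broken pipe" write error and a different non-zero status.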

IshentRas commented Jan 3, 2017

@jperville The issue was on our side... 👎
The service file for master-controllers was outdated and did not include some conditions that help against the issue you reported.
Check this one and let me know if it works better: https://github.com/IshentRas/cookbook-openshift3/tree/release/1.10.27

3d9e605#diff-dda3923836a9a27946faf8134cc3445d
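
The thread does not quote the updated unit file, so the following is only a sketch of one setting that addresses this symptom, not necessarily what the linked commit changed. systemd counts termination by SIGPIPE as a clean exit, so a unit with `Restart=on-failure` (or with no restart policy) stays down after a `signal=PIPE` death, while `Restart=always` brings it back. The helper name, the drop-in file name `10-restart.conf` and the `RestartSec` value are assumptions.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that restarts the unit however it exited.
# The target directory is passed in so the helper can be tried in a scratch dir.
write_restart_dropin() {
  dropin_dir=$1
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/10-restart.conf" <<'EOF'
[Service]
Restart=always
RestartSec=5s
EOF
}

# On a real master (as root) one would then run something like:
#   write_restart_dropin /etc/systemd/system/origin-master-controllers.service.d
#   systemctl daemon-reload
#   systemctl restart origin-master-controllers.service
```

A drop-in leaves the unit file shipped by the cookbook untouched, which makes it easy to remove once the cookbook's own unit carries the fix.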

@jperville
Collaborator Author

Cheers!

@jperville
Collaborator Author

I will test this when I return to work tomorrow morning. Thanks for the quick investigation and fix, @IshentRas.

@IshentRas
Owner

Let me know if it works and I'll merge and upload the code.

@jperville
Collaborator Author

Hello @IshentRas, your fix works. After restarting systemd-journald, the origin-master daemons come back as expected.
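
For anyone repeating this check, the manual `netstat` comparison from the reproduction steps can be scripted. This is a rough sketch, assuming the two master ports from the report (8443 and 8444). The function name is made up for illustration.

```shell
#!/bin/sh
# Reads `netstat -ntlp` output on stdin and prints each expected master port
# that has no LISTEN socket. Prints nothing when both ports are up.
missing_master_ports() {
  input=$(cat)
  for port in 8443 8444; do
    printf '%s\n' "$input" | grep -Eq ":${port}[[:space:]].*LISTEN" || echo "$port"
  done
}

# In the VM, after restarting systemd-journald:
#   sudo netstat -ntlp | missing_master_ports
```

Fed the crashed-state output from the report, it prints `8444`, the port held by PID 24468, which is the Main PID of origin-master-controllers in the status output.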


IshentRas commented Jan 4, 2017

Perfect, will be merged soon 👍
Thanks for pointing it out.

IshentRas reopened this Jan 4, 2017