Cluster/HA Install of AWX #26

Closed · MrMEEE opened this issue Jun 4, 2018 · 49 comments

MrMEEE (Owner) commented Jun 4, 2018

Moved from here:
subuk/awx-rpm#11

MrMEEE self-assigned this Jun 4, 2018
Aglidic commented Aug 9, 2018

Hello, we have a successful HA deployment thanks to your RPM.

Here is what we have done:
RabbitMQ clustering
disabled the celery-beat service
modified the celery-worker ExecStart command

We have run a lot of tests on it and everything seems fine.

MrMEEE (Owner) commented Aug 10, 2018

That is great to hear... thanks for your feedback..

If you have a more detailed installation description, I would love to add it to the documentation..

MrMEEE closed this as completed Aug 10, 2018
Aglidic commented Aug 13, 2018

OK, so here is the process:
Install the DB on an external server with your install guide.
Install the 1st AWX server with your install guide (connect it to the DB).
Install the 2nd and 3rd AWX servers with your install guide (connect them to the DB, and don't run these commands:
echo "from django.contrib.auth.models import User; User.objects.create_superuser('admin', 'root@localhost', 'password')" | sudo -u awx /opt/awx/bin/awx-manage shell
sudo -u awx /opt/awx/bin/awx-manage create_preload_data)

When all nodes are installed, we can now build the RabbitMQ cluster.
Connect on node 1 and copy the Erlang cookie to nodes 2 and 3:
/var/lib/rabbitmq/.erlang.cookie
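
For example, a minimal sketch (the hostnames node2/node3 and copying over root SSH are assumptions):
# on node1: copy the Erlang cookie to the other nodes
scp /var/lib/rabbitmq/.erlang.cookie root@node2:/var/lib/rabbitmq/.erlang.cookie
scp /var/lib/rabbitmq/.erlang.cookie root@node3:/var/lib/rabbitmq/.erlang.cookie
# on nodes 2 and 3: the cookie must keep its ownership and strict permissions
chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
chmod 400 /var/lib/rabbitmq/.erlang.cookie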

Connect to nodes 2 and 3:
Restart the app to make it see the new cookie:
rabbitmqctl stop_app
rabbitmqctl start_app
Create the RabbitMQ cluster:
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@node1
rabbitmqctl start_app
Set the HA policy:
rabbitmq-plugins enable rabbitmq_management
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
systemctl restart rabbitmq-server

RabbitMQ is now clustered.
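
You can verify the cluster from any node (a standard check, not part of the original steps):
# all three nodes should appear under running_nodes
rabbitmqctl cluster_status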

The second step is Celery:
First, disable and stop celery-beat on all servers.
Second, modify the ExecStart command of the Celery worker service:
/etc/systemd/system/multi-user.target.wants/awx-celery-worker.service ->
ExecStart=/opt/awx/bin/celery worker -A awx -l info --autoscale=50,4 -Ofair -Q tower_scheduler,tower,%(ENV_HOSTNAME)s -n celery@%(ENV_HOSTNAME)s
Restart the Celery services on all nodes.
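
A minimal sketch of those two service steps, assuming the RPM's unit names awx-celery-beat and awx-celery-worker:
# on every node: stop and disable the beat scheduler
systemctl disable --now awx-celery-beat
# after editing the worker unit, reload systemd and restart the worker
systemctl daemon-reload
systemctl restart awx-celery-worker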

We also saw that at this step it can be better to reboot all 3 nodes, but one by one, to keep the RabbitMQ cluster in good shape.

Hope that can help.

Aglidic commented Aug 13, 2018

I forgot, but of course the final step is to go to the web interface and create the instance with all 3 nodes.

sujiar37 commented Apr 23, 2019

@MrMEEE, this is a bit off-topic, but those who wish to explore and automate the HA / instance group setup using the official Docker standalone method can access it from my repository. Would you be able to add this piece of info to your wiki? It might be helpful to some people out there. Thanks

MrMEEE (Owner) commented Apr 23, 2019

@sujiar37

So, basically, everything that is needed for an HA setup is a standalone PostgreSQL (cluster) and a RabbitMQ (cluster),

and then frontends that connect to these??

Should be pretty simple to implement..

I have added a links section

sujiar37 commented Apr 24, 2019

@MrMEEE, thank you for adding this piece of info to your wiki.

The only requirement is to set up a standalone PostgreSQL; everything else, such as building and configuring the RabbitMQ cluster and enabling the Docker version of HA on all nodes, is taken care of by the playbook. And yes, it is pretty simple to implement now through my playbook.

powertim commented:

Hi guys,
Having worked on it with @Aglidic to build the first HA implementation of the RPM, we have a playbook which does the full setup automatically.
It's corporate-only currently, so I need to find some time to generalize it if you want to add it somewhere.

Best,

Tim.

MrMEEE (Owner) commented Apr 24, 2019

@powertim

I would love to include playbooks for installing in the RPM...

powertim commented:

OK so I'll add that to my TODO for the next days...

bufooo commented Jun 14, 2019

> OK so I'll add that to my TODO for the next days...

Did you have a chance to do it? I would love to try them.

dnc92301 commented Jun 24, 2019

Hi all, very much interested in the playbook. If the playbook is not available now, can someone highlight what's needed for pointing to an external Postgres server using the RPM installation method, please?
Something like this -

# Set pg_hostname if you have an external postgres server, otherwise
# a new postgres service will be created

pg_hostname=hostname
pg_username=awx
pg_password=xxxxx
pg_database=awx
pg_port=5432

Thanks, everyone, for the great efforts!

MrMEEE (Owner) commented Jun 24, 2019

In regards to the external Postgres, you basically only need to set up an external Postgres (cluster?) and change the configuration in /etc/tower/settings.py to point to that server, before running the database initialization..
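
For reference, the database section of /etc/tower/settings.py follows the standard Django DATABASES layout; a rough sketch with placeholder values (keep whatever ENGINE value the shipped file already uses):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',  # leave the shipped ENGINE as-is
        'NAME': 'awx',
        'USER': 'awx',
        'PASSWORD': 'xxxxx',
        'HOST': 'pg.example.com',  # the external Postgres server (or cluster VIP)
        'PORT': '5432',
    }
}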

dnc92301 commented:

Thanks very much for the quick response. Yes, it will be a Postgres 2-node cluster with streaming replication. And yes, I see there's a section for configuring USER/PASSWORD/HOST/PORT in settings.py. So initializing the DB / all the steps listed in awx.wiki/multi-section-page/configuration are still required?

dnc92301 commented:

No issues setting up the external Postgres DB; the issues are with setting up the cluster. I followed the previous comments on setting up clustering and got to the point of enabling the RabbitMQ cluster across 2 nodes, but AWX didn't detect the additional node. The endpoint api/v2/ping only displays one active node. Also, there's no awx-celery-worker - this service appears to have been deprecated? Thanks.

MrMEEE (Owner) commented Jun 26, 2019

Hi..

I think you have to enable each of the awx nodes with the command:

sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage register_queue --queuename=tower --hostnames=$(hostname)"

and yes, the celery worker is deprecated...

dnc92301 commented:

Thanks very much, it worked! I had to run this command first - sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage provision_instance --hostname=$(hostname)"

before running your command -
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage register_queue --queuename=tower --hostnames=$(hostname)"

dnc92301 commented:

Also, as far as upgrading to the latest AWX version: I presume it will still work, provided we upgrade all nodes within the cluster. Thanks again!

MrMEEE (Owner) commented Jun 27, 2019

Ah, yes.. of course you have to do the provision_instance first :)..

I will do a write-up on this and put it on awx.wiki as soon as possible.. I'm also planning a setup tool for simpler installation and configuration, which will also cover HA... Could you share the exact changes you have made to the systemd files??

Remember not to change the files themselves, but to overwrite them with copies in /etc/systemd/system.. otherwise they will get reverted to default on the next update...
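
A minimal sketch of that approach (the /usr/lib/systemd/system path and the awx-celery-worker unit name are assumptions based on a standard RPM layout):
# copy the shipped unit; units in /etc/systemd/system take precedence
cp /usr/lib/systemd/system/awx-celery-worker.service /etc/systemd/system/
# edit the copy in /etc/systemd/system, then:
systemctl daemon-reload
systemctl restart awx-celery-worker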

In regards to updating, I think you should update the ansible-awx (and dependencies) on all nodes before running the database migrations...

dnc92301 commented:

I spoke too soon :)
Yes, both nodes are within the cluster, but for some reason jobs couldn't execute on the newly added node. When attempting to run a job against the new node, it goes to a "Wait" state before timing out with the message -
Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed..

I tried rabbitmqctl stop_app / rabbitmqctl start_app and systemctl restart rabbitmq-server on the server, and also bounced both nodes. In the web GUI, I had switched the node from OFF to ON, but USED CAPACITY eventually becomes "UNAVAILABLE."

dnc92301 commented:

Ignore that - the issue was that AWX was not running at startup on the new node :)
Still doing more testing. Thanks again..

dnc92301 commented:

So far so good. I didn't make any systemd changes since Celery has been deprecated.

dnc92301 commented:

One issue that has come up so far: when a job finishes running on the new node, the node's USED CAPACITY goes to "UNAVAILABLE." It is as though the node lost its heartbeat to the RabbitMQ cluster. I need to troubleshoot further.

MrMEEE (Owner) commented Jun 27, 2019

I'm in Prague for the week for a Red Hat event.. I will try to set up an HA environment when I get home, then we can debug together.

dnc92301 commented:

This is the error message I'm getting.

2019-06-27 14:01:34.390 [info] <0.1498.0> connection <0.1498.0> (127.0.0.1:42950 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2019-06-27 14:01:44.556 [warning] <0.1498.0> closing AMQP connection <0.1498.0> (127.0.0.1:42950 -> 127.0.0.1:5672, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

Thanks

dnc92301 commented:

Looks like the issue has to do with the fact that AWX requires the 'tower' vhost. Currently we're using the default vhost '/', so we're getting a bunch of closing AMQP connections.

2019-06-27 15:30:29.289 [info] <0.2130.0> connection <0.2130.0> (127.0.0.1:47012 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2019-06-27 15:30:49.603 [info] <0.2139.0> accepting AMQP connection <0.2139.0> (127.0.0.1:47390 -> 127.0.0.1:5672)
2019-06-27 15:30:49.610 [info] <0.2139.0> connection <0.2139.0> (127.0.0.1:47390 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2019-06-27 15:30:49.619 [info] <0.2139.0> closing AMQP connection <0.2139.0> (127.0.0.1:47390 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
2019-06-27 15:30:49.687 [info] <0.2150.0> accepting AMQP connection <0.2150.0> (127.0.0.1:47394 -> 127.0.0.1:5672)
2019-06-27 15:30:49.695 [info] <0.2150.0> connection <0.2150.0> (127.0.0.1:47394 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2019-06-27 15:30:49.710 [info] <0.2150.0> closing AMQP connection <0.2150.0> (127.0.0.1:47394 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
2019-06-27 15:30:49.748 [info] <0.2161.0> accepting AMQP connection <0.2161.0> (127.0.0.1:47396 -> 127.0.0.1:5672)
2019-06-27 15:30:49.755 [info] <0.2161.0> connection <0.2161.0> (127.0.0.1:47396 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2019-06-27 15:30:49.771 [info] <0.2161.0> closing AMQP connection <0.2161.0> (127.0.0.1:47396 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
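
If that's the cause, the missing vhost can be created with standard rabbitmqctl commands (a sketch; the 'guest' user and blanket permissions are assumptions taken from the log above):
rabbitmqctl add_vhost tower
rabbitmqctl set_permissions -p tower guest ".*" ".*" ".*"
# then point AWX's broker settings at the 'tower' vhost and restart the AWX services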

MrMEEE reopened this Jun 28, 2019
MrMEEE closed this as completed Jun 28, 2019
MrMEEE (Owner) commented Jun 28, 2019

@dnc92301 Let's move the discussion to #121

powertim commented:

Hi guys,

Finally the playbook is here: https://github.com/powertim/deploy_awx-rpm
It is currently designed for RHEL7 x86_64 with Satellite repos.
I will try to update it with manual repos as described on https://awx.wiki/installation/repositories/rhel7-x86_64.
And why not, in the future, for the different OSes supported on awx.wiki...

Please try first to adapt the playbook before opening an issue.
I'll fill in the README soon.

Best,

Tim

gowthamakanthan commented Jul 24, 2019 via email

powertim commented:

Hi @gowthamakanthan,

It should work on CentOS 7 with a few changes:

  1. Add local repos with the module 'yum_repository', instead of the Satellite repos I'm using with the module 'rhsm_repository', in roles/db_prereqs/tasks/main.yml and roles/nodes_prereqs/tasks/main.yml.

  2. Maybe change line #26 of roles/nodes_prereqs/tasks/main.yml so that the installation of dependencies succeeds.

But I'll try to add this content when I find the time for it (hopefully soon).

dnc92301 commented:

@powertim - Thanks for the efforts! I've tested it and it works as expected. However, the previously reported issue still exists, where the 2nd node (I have a 2-node RabbitMQ cluster) goes into the "UNAVAILABLE" state as soon as a job finishes running. hostnameB is the 2nd node, which has a capacity of 0 because it's NOT available. The primary node I have DISABLED intentionally.

[root@hostnameA deploy_awx-rpm]# sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage list_instances"

[tower capacity=0]
hostnameB capacity=0 version=6.1.0
[DISABLED] hostnameA capacity=0 version=6.1.0

dnc92301 commented:

This was installed using the latest AWX, 6.1.0. This is an example of a run where the node becomes "unavailable" and the job no longer exists in the queue, with the below explanation.

EXPLANATION
Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed.
STARTED
7/26/2019 1:18:34 PM
FINISHED
7/26/2019 1:20:25 PM

dnc92301 commented Aug 6, 2019

Hi all,
It looks like the problem no longer surfaces after setting up a new server. However, I'm hitting the following issue when starting up AWX.

Issue with - scl: RuntimeError: Django version other than 2.2.2 detected: 2.2.4.

Django is what comes by default - rh-python36-Django-2.2.4-1.noarch

Thanks.

Aug 6 18:40:19 hostnameA scl: Traceback (most recent call last):
Aug 6 18:40:19 hostnameA scl: File "/opt/rh/rh-python36/root/usr/bin/daphne", line 11, in
Aug 6 18:40:19 hostnameA scl: load_entry_point('daphne==1.3.0', 'console_scripts', 'daphne')()
Aug 6 18:40:19 hostnameA scl: File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/daphne/cli.py", line 144, in entrypoint
Aug 6 18:40:19 hostnameA scl: cls().run(sys.argv[1:])
Aug 6 18:40:19 hostnameA scl: File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/daphne/cli.py", line 174, in run
Aug 6 18:40:19 hostnameA scl: channel_layer = importlib.import_module(module_path)
Aug 6 18:40:19 hostnameA scl: File "/opt/rh/rh-python36/root/usr/lib64/python3.6/importlib/init.py", line 126, in import_module
Aug 6 18:40:19 hostnameA scl: return _bootstrap._gcd_import(name[level:], package, level)
Aug 6 18:40:19 hostnameA scl: File "", line 994, in _gcd_import
Aug 6 18:40:19 hostnameA scl: File "", line 971, in _find_and_load
Aug 6 18:40:19 hostnameA scl: File "", line 941, in _find_and_load_unlocked
Aug 6 18:40:19 hostnameA scl: File "", line 219, in _call_with_frames_removed
Aug 6 18:40:19 hostnameA scl: File "", line 994, in _gcd_import
Aug 6 18:40:19 hostnameA scl: File "", line 971, in _find_and_load
Aug 6 18:40:19 hostnameA scl: File "", line 955, in _find_and_load_unlocked
Aug 6 18:40:19 hostnameA scl: File "", line 665, in _load_unlocked
Aug 6 18:40:19 hostnameA scl: File "", line 678, in exec_module
Aug 6 18:40:19 hostnameA scl: File "", line 219, in _call_with_frames_removed
Aug 6 18:40:19 hostnameA scl: File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/awx/init.py", line 49, in
Aug 6 18:40:19 hostnameA scl: current=django.version)
Aug 6 18:40:19 hostnameA scl: RuntimeError: Django version other than 2.2.2 detected: 2.2.4. Overriding names_digest is known to work for Django 2.2.2 and may not work in other Django versions.
Aug 6 18:40:19 hostnameA systemd: awx-daphne.service: main process exited, code=exited, status=1/FAILURE
Aug 6 18:40:19 hostnameA systemd: Unit awx-daphne.service entered failed state.
Aug 6 18:40:19 hostnameA systemd: awx-daphne.service failed.
Aug 6 18:40:21 hostnameA systemd: awx-cbreceiver.service holdoff time over, scheduling restart.
Aug 6 18:40:21 hostnameA systemd: awx-channels-worker.service holdoff time over, scheduling restart.
Aug 6 18:40:21 hostnameA systemd: awx-dispatcher.service holdoff time over, scheduling restart.
Aug 6 18:40:21 hostnameA systemd: Stopped AWX Dispatcher.
Aug 6 18:40:21 hostnameA systemd: Stopped AWX channels worker service.
Aug 6 18:40:21 hostnameA systemd: Stopping AWX web service...
Aug 6 18:40:21 hostnameA systemd: Stopped AWX cbreceiver service.
Aug 6 18:40:21 hostnameA systemd: awx-daphne.service holdoff time over, scheduling restart.
Aug 6 18:40:21 hostnameA systemd: Stopped AWX daphne service.
[root@hostnameA ~]#

powertim commented Aug 6, 2019 via email

MrMEEE (Owner) commented Aug 6, 2019

@dnc92301 Please create new issues, instead of reusing old ones...

Have you remembered to update the ansible-awx package???

MrMEEE (Owner) commented Aug 6, 2019

@powertim Maybe the playbook doesn't update the ansible-awx package??

dnc92301 commented Aug 6, 2019

@tim - yes, this happens after rerunning the playbook. After upgrading to the latest ansible-awx version, it worked!

powertim commented Aug 9, 2019

> @powertim Maybe the playbook doesn't update the ansible-awx package??

It's updated now!
See commit df571c0

powertim commented Aug 9, 2019

> @tim - yes, this happens after rerunning the playbook. After upgrading to the latest ansible-awx version, it worked!

Yeah, unfortunately re-running the playbook causes failures.
I need to improve that.

VJoshi0 commented Oct 18, 2019

Hello, I have offline VMs where I need to build AWX. As listed above, I saw about 160 rh-python36-* dependencies. Where can I find a tarball or URL for all the RPMs I need for AWX?
I'm not using Docker; I plan to use RHEL7 VMs to create the HA setup.
But I'm lost collecting all the rh-python36-* packages from mirror sites one by one, and would appreciate knowing in what order the RPMs need to be installed. Thanks.

cameronkerrnz commented:

@VJoshi0: yum install --downloadonly --downloaddir=/to/here/ 'rh-python36-*'

Further example at https://unix.stackexchange.com/questions/259640/how-to-use-yum-to-get-all-rpms-required-for-offline-use
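
Once downloaded, the directory can be turned into a local yum repo for the offline VMs (a sketch; createrepo is in the createrepo package, and all paths are placeholders):
# on a machine with internet access
yum install --downloadonly --downloaddir=/to/here/ 'rh-python36-*'
createrepo /to/here/
# copy /to/here/ to the offline VM and point a .repo file at it:
# [local-awx-deps]
# name=Local AWX dependencies
# baseurl=file:///to/here/
# gpgcheck=0
# yum resolves the install order itself, so there is no need to install the RPMs one by one.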

cs-laurentiuvasiescu commented:

So after having the 3 instances clustered, is a load balancer used at all?

What about manual projects that are on the local filesystem? Rsync them?

elstoncawley commented:

Hi all, thanks for the great work that you are doing. I was wondering if there is a step-by-step guide for the HA setup, similar to the standalone setup in this wiki guide: https://awx.wiki/installation/installation

powertim commented:

Hi @elstoncawley,

Unfortunately not, and I haven't worked on the HA setup for a long time, but you'll find the steps in the playbook here: https://github.com/powertim/deploy_awx-rpm.
The role names should help you find the steps for building the cluster.

Cheers,

Tim.

elstoncawley commented:

Thanks @powertim
I am actually installing on a CentOS 7 server and was wondering about the repo in the vars/nodes.yml file. Could I use https://awx.wiki/repository/ for the awx_repo variable?

powertim commented Apr 1, 2020

Yes, in theory you can use any repos you want, but you need to change the way you enable and call them, because I only provided a RHEL configuration with Satellite, so the subscription-manager command won't be available for you.

bryanasdev000 commented:

Hi everybody!

Did anyone get HA/clustering running with AWX 11.X.X and Redis?

bryanasdev000 commented:

> Hi everybody!
> Did anyone get HA/clustering running with AWX 11.X.X and Redis?

Responding to myself, and leaving reference material for those who need it: https://github.com/sujiar37/AWX-HA-InstanceGroup/issues/26 seems to shed some light.

I will test ASAP.

Nikkurer commented:

https://github.com/fitbeard/awx-ha-cluster - this playbook is working well. I've been using it for a while.
