AWS Hubs cloud went offline after 4 months of working perfectly ... and it refuses to go online again #4500

donromeo · 2021-08-10T17:57:12Z

Description
We run a permanent museum that went live on April 15, 2021. It worked perfectly until a few days ago when the message "Hubs Cloud is currently offline. Check back shortly." showed up.
I tried everything in bugs 3071, 3429, and 4097 with no luck.
The message shows up when trying to use the admin panel, Spoke, or the site.

Once I found the offline message:
1.- Checked that the server was up and running (t3.medium).
2.- Connected to the server through SSH.
3.- Ran "top" and couldn't identify a Hubs-related process.
4.- With "ps", I found a few processes belonging to the user "hab"; I'm not sure whether they are related to Hubs.

5.- As per one of the recommendations, I terminated the server; as expected, another one started automatically. After 30 minutes the offline message was still there.

6.- Then I updated the stack to put it offline, and the server was shut down properly. I waited a few minutes and put it back online. A new server started nicely, but after 30 minutes the offline message was still there. Note: I have AutoPauseDb set to "Yes - Pause database when not in use", but it has been running like that all along and, as I understand it, that may delay the process for a few seconds, not minutes.
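
Steps 3 and 4 above can be scripted. The sketch below is a hypothetical helper, not part of Hubs Cloud: it filters the output of `ps -eo user,pid,comm` for processes owned by one user. The syslog excerpts later in this thread show services being started as user `hab`, so that is the default here. The sample text and its process names are made up for illustration.

```python
import subprocess


def processes_for_user(ps_output, user="hab"):
    """Return (pid, command) pairs for rows of `ps -eo user,pid,comm` owned by `user`."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)          # USER, PID, COMMAND
        if len(parts) == 3 and parts[0] == user:
            rows.append((int(parts[1]), parts[2].strip()))
    return rows


def live_ps_output():
    """Run ps on the current machine (assumes a procps-style `ps` is installed)."""
    return subprocess.run(
        ["ps", "-eo", "user,pid,comm"],
        capture_output=True, text=True, check=True,
    ).stdout


# Made-up sample output, for illustration only.
SAMPLE = """USER       PID COMMAND
root         1 systemd
hab       2517 service-a
hab       2601 service-b
ubuntu    3100 bash
"""

if __name__ == "__main__":
    for pid, command in processes_for_user(SAMPLE):
        print(pid, command)
```

On the server itself, `processes_for_user(live_ps_output())` would list whatever is currently running as `hab`; an empty list would suggest the services never started.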

Questions:

What processes should be running?
What logs can I check?
Any other suggestions?

To Reproduce
Steps to reproduce the behavior:

Start a stack
Wait 4 months (or less) until it fails and shows the "Hubs Cloud is currently offline. Check back shortly." message.
Try restarting the server.
Try restarting the stack

Expected behavior
My Museum should show up.

Screenshots

Hardware

Device: t3.medium
OS: ubuntu
Browser: any

pattersonbl2 · 2021-08-11T16:56:39Z

When you checked the instance, did you see whether it had run out of space on the EBS volume?

donromeo · 2021-08-11T17:21:20Z

Thanks Brandon.

Current status, as per "df", 3GB free out of 8GB.

Filesystem 1K-blocks Used Available Use% Mounted on
.
.
/dev/nvme0n1p1 8065444 4946860 3102200 62% /
.
.
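
For anyone repeating this check, here is a minimal sketch that parses the `df` row above and turns it into a free-space figure. The function names and the 90% threshold are assumptions chosen for illustration, not anything Hubs Cloud defines.

```python
def parse_df_line(line):
    """Split one data row of `df` (1K-blocks) into named fields."""
    device, blocks, used, available, use_pct, mount = line.split()
    return {
        "device": device,
        "total_kb": int(blocks),
        "used_kb": int(used),
        "available_kb": int(available),
        "use_pct": int(use_pct.rstrip("%")),
        "mount": mount,
    }


def is_nearly_full(row, threshold_pct=90):
    """True when usage is at or above the (arbitrary) threshold."""
    return row["use_pct"] >= threshold_pct


# The root filesystem row quoted above.
row = parse_df_line("/dev/nvme0n1p1 8065444 4946860 3102200 62% /")
free_gib = row["available_kb"] / (1024 * 1024)
print(f"{row['mount']}: {free_gib:.1f} GiB free, {row['use_pct']}% used")
print("nearly full:", is_nearly_full(row))
```

With the numbers from this instance, the root volume comes out at roughly 3 GiB free and is not flagged as nearly full.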

The message "Hubs Cloud is currently offline. Check back shortly." is still there.

donromeo · 2021-08-11T17:34:48Z

Syslog shows an endless loop of Certbot errors:

Aug 11 17:26:21 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): Wed Aug 11 17:26:21 UTC 2021 Renewing LetsEncrypt certificates if neccessary
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): usage:
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): certbot [SUBCOMMAND] [options] [-d DOMAIN] [-d DOMAIN] ...
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O):
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): Certbot can obtain and install HTTPS/TLS/SSL certificates. By default,
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): it will attempt to use a webserver both for obtaining and installing the
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): certificate.
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): certbot: error: unrecognized arguments: -- --config=/hab/svc/certbot/config/certbot.renew.ini --register-unsafely-without-email --agree-tos
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): chown: cannot access '/hab/svc/certbot/data/live': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/live': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/live': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): chown: cannot access '/hab/svc/certbot/data/archive': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/archive': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/archive': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): sleep: missing operand
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): Try 'sleep --help' for more information.

Maybe a problem with certificates? I don't know anything about how Hubs handles certificates.
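
To see which service is producing the failures without reading the whole syslog, a small parser can help. This is only a sketch: the regular expression is written against the line format shown in the excerpt above, and the list of error markers is an assumption based on those same lines.

```python
import re
from collections import Counter

# Matches supervisor service lines as they appear in the excerpt above, e.g.
# "Aug 11 17:26:22 host bash[2100]: certbot.default@sopart-01(O): message"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+bash\[\d+\]:\s+"
    r"(?P<service>[\w.-]+)@(?P<group>[\w.-]+)\((?P<tag>\w+)\):\s?(?P<message>.*)$"
)

# Substrings treated as failures here; this list is an assumption.
ERROR_MARKERS = ("error:", "No such file or directory", "missing operand")


def count_errors(lines):
    """Count failure-looking messages per service name."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and any(marker in match["message"] for marker in ERROR_MARKERS):
            counts[match["service"]] += 1
    return counts


# A few lines copied from the excerpt above.
PREFIX = "Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): "
SAMPLE_LINES = [
    PREFIX + "usage:",
    PREFIX + "certbot: error: unrecognized arguments: -- "
             "--config=/hab/svc/certbot/config/certbot.renew.ini "
             "--register-unsafely-without-email --agree-tos",
    PREFIX + "chown: cannot access '/hab/svc/certbot/data/live': No such file or directory",
    PREFIX + "sleep: missing operand",
    PREFIX + "Try 'sleep --help' for more information.",
]

if __name__ == "__main__":
    for service, total in count_errors(SAMPLE_LINES).most_common():
        print(service, total)
```

To run it against a real file, pass an open file handle instead, for example `count_errors(open("/var/log/syslog"))`, assuming the log lives at that path on the instance.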

pattersonbl2 · 2021-08-11T17:55:38Z

To my knowledge, the Hubs instance shouldn't be handling the SSL cert; for an AWS setup that would be handled at the CDN level. I just looked for certbot on my Hubs instance and it's not installed.

pattersonbl2 · 2021-08-11T18:04:11Z

I'm being advised to have you terminate the instance again. It may take two or three attempts for the stack to self-heal.

donromeo · 2021-08-11T18:09:06Z

I'll do that and will keep you posted.

Thanks, Brandon.

donromeo · 2021-08-11T23:39:41Z

I've restarted the whole stack 7 times, 4 in previous days, and 3 more today. Same result.

If I send the syslog, will it help to analyze the problem?

A portion of syslog:

Aug 11 23:12:50 romantic-rogue bash[2193]: bio-sup(MR): Updating from mozillareality/ita to mozillareality/ita/0.0.1/20200526203229
Aug 11 23:12:50 romantic-rogue bash[2193]: ita.default@sopart-01(ST): Terminating service (PID: 2517)
Aug 11 23:12:50 romantic-rogue bash[2193]: ita.default@sopart-01(SR): Health checking has been stopped
Aug 11 23:12:50 romantic-rogue bash[2193]: reticulum(HK): The 'reload' hook has been deprecated. You should use the 'reconfigure' hook instead.
Aug 11 23:12:50 romantic-rogue bash[2193]: bio-launch(SV): Child for service 'ita.default@sopart-01' with PID 2517 exited with code signal: 15
Aug 11 23:12:50 romantic-rogue bash[2193]: ita.default@sopart-01(ST): Service gracefully terminated (PID: 2517)
Aug 11 23:12:50 romantic-rogue amazon-ssm-agent.amazon-ssm-agent[2693]: 2021-08-11 23:12:50 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker is not running, starting worker process
Aug 11 23:12:50 romantic-rogue amazon-ssm-agent.amazon-ssm-agent[2693]: 2021-08-11 23:12:50 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker (pid:2739) started
Aug 11 23:12:50 romantic-rogue amazon-ssm-agent.amazon-ssm-agent[2693]: 2021-08-11 23:12:50 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] Monitor long running worker health every 60 seconds
Aug 11 23:12:50 romantic-rogue bash[2193]: bio-sup(AG): Supervisor starting mozillareality/certbot. See the Supervisor output for more details.
Aug 11 23:12:50 romantic-rogue cloud-init[1262]: Supervisor starting mozillareality/certbot. See the Supervisor output for more details.
Aug 11 23:12:51 romantic-rogue bash[2193]: bio-sup(MR): Starting mozillareality/certbot (mozillareality/certbot/1.0.0/20191224043510)
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(UCW): Watching user.toml
Aug 11 23:12:51 romantic-rogue bash[2193]: bio-sup(MR): Starting mozillareality/ita (mozillareality/ita/0.0.1/20200526203229)
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(UCW): Watching user.toml
Aug 11 23:12:51 romantic-rogue bash[2193]: reticulum(HK): The 'reload' hook has been deprecated. You should use the 'reconfigure' hook instead.
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(HK): Modified hook content in /hab/svc/ita/hooks/run
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(SR): Hooks recompiled
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(SR): Initializing
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(SV): Starting service as user=hab, group=hab
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(HK): Modified hook content in /hab/svc/certbot/hooks/run
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(SR): Hooks recompiled
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(CF): Created configuration file /hab/svc/certbot/config/certbot.renew.ini
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(CF): Created configuration file /hab/svc/certbot/config/certbot.ini
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(SR): Initializing
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(SV): Starting service as user=root, group=hab
Aug 11 23:12:51 romantic-rogue systemd-resolved[787]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(O): 2021-08-11T23:12:51.279Z ita Firing up...
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(O): Wed Aug 11 23:12:51 UTC 2021 Getting any needed LetsEncrypt certificates for 'romantic-rogue.gelattinahub.info,sopart-01-app.gelattinahub.info'
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(O): 2021-08-11T23:12:51.841Z ita Listening on port 6000
Aug 11 23:12:52 romantic-rogue bash[2193]: reticulum(HK): The 'reload' hook has been deprecated. You should use the 'reconfigure' hook instead.
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): usage:
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): certbot [SUBCOMMAND] [options] [-d DOMAIN] [-d DOMAIN] ...
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O):
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): Certbot can obtain and install HTTPS/TLS/SSL certificates. By default,
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): it will attempt to use a webserver both for obtaining and installing the
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): certificate.
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): certbot: error: unrecognized arguments: -- --config=/hab/svc/certbot/config/certbot.ini --noninteractive --register-unsafely-without-email --agree-tos
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): chown: cannot access '/hab/svc/certbot/data/live': No such file or directory
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/live': No such file or directory
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/live': No such file or directory
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): chown: cannot access '/hab/svc/certbot/data/archive': No such file or directory
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/archive': No such file or directory
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/archive': No such file or directory

pattersonbl2 · 2021-08-11T23:57:25Z

Wait, are you restarting the instances or terminating them? Terminating is similar to deleting the EC2 instance: https://www.youtube.com/watch?v=Zwjc1VMKOv0. I know the terminology says restart, but you would need to terminate them. The error could be related to this: "because {{cfg.general.plugin}} is empty, the certbot renew 'chef-habitat-run-hook' templated command is not correctly constructed,
and the cause of {{cfg.general.plugin}} being empty is that his certbot-habitat-pkg files are missing".
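
One way to picture the diagnosis quoted above: if a command line is built from a template and the plugin value is empty, the flag that should carry the plugin name collapses to a bare `--`, which option parsers conventionally treat as "end of options". That would be consistent with the `unrecognized arguments: -- --config=...` line in the syslog. The sketch below uses Python's `argparse` as a stand-in; the template string, the `standalone` plugin value, and the tiny parser are all hypothetical and are not the real Habitat hook or Certbot's own parser.

```python
import argparse

# Hypothetical template; the real hook in the certbot Habitat package may differ.
TEMPLATE = "--{plugin} --config={config} --agree-tos"


def render_args(plugin, config):
    """Fill the template and split it into an argv-style list."""
    return TEMPLATE.format(plugin=plugin, config=config).split()


def build_parser():
    """Tiny stand-in parser with a few flags; not Certbot's real option set."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--standalone", action="store_true")
    parser.add_argument("--config")
    parser.add_argument("--agree-tos", action="store_true")
    return parser


parser = build_parser()

# With a plugin value, every flag is recognized.
ok_args, ok_extra = parser.parse_known_args(render_args("standalone", "/tmp/demo.ini"))

# With an empty plugin, the first token is a bare "--", so the remaining
# tokens are treated as positionals and end up unrecognized.
bad_args, bad_extra = parser.parse_known_args(render_args("", "/tmp/demo.ini"))

print("with plugin:   config =", ok_args.config, "| leftovers:", ok_extra)
print("empty plugin:  config =", bad_args.config, "| leftovers:", bad_extra)
```

In the empty-plugin case the `--config` value is never picked up, so the command cannot do its job even though nothing about the config file itself is wrong.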

donromeo · 2021-08-12T00:35:15Z

Thanks, Brandon.

I've actually been putting the stack offline in CloudFormation; as I understand it, this action terminates the instance.

I love Robin, a very nice video. I'll try terminating the instance a few times.

I'll be back.

donromeo · 2021-08-12T03:51:53Z

I terminated the EC2 instance 5 times (waiting 15 minutes) and I still get "Hubs Cloud is currently offline. Check back shortly."

pattersonbl2 · 2021-08-12T04:03:09Z

Let me report back to my team in the morning to see if we can come up with a possible fix for this.

donromeo · 2021-08-12T05:02:48Z

TQVM Brandon. Good night.

pattersonbl2 · 2021-08-12T17:15:27Z

I have a couple of things I want you to do to help with troubleshooting. If you are in the Discord community, please DM me, because the steps will output non-public information about your stack.

donromeo · 2021-08-12T17:26:54Z

Thanks Brandon.
I just created an account on Discord: donromeo#4161. Please DM me.

pattersonbl2 · 2021-08-13T14:57:08Z

Issue required the stack to be deleted and refreshed.

donromeo added the "bug" and "needs triage" labels Aug 10, 2021

pattersonbl2 closed this as completed Aug 13, 2021
