AWS Hubs cloud went offline after 4 months of working perfectly ... and it refuses to go online again #4500

Closed
donromeo opened this issue Aug 10, 2021 · 15 comments
Labels
bug needs triage For bugs that have not yet been assigned a fix priority

Comments

@donromeo

Description
We run a permanent museum that went live on April 15, 2021. It worked perfectly until a few days ago when the message "Hubs Cloud is currently offline. Check back shortly." showed up.
I tried everything in bugs 3071, 3429, and 4097 with no luck.
The message shows up when trying to use admin, Spoke, or the site.

Once I found the offline message:
1.- Checked that the server was up and running (t3.medium).
2.- Connected to the server through SSH.
3.- Ran "top" and couldn't identify a Hubs-related process.
4.- With "ps", I found a few processes belonging to the user "hab"; not sure if they are related to Hubs.

5.- As per one of the recommendations, I terminated the server; as expected, another one started automatically. After 30 minutes the offline message was still there.

6.- Then I updated the stack to put it offline, and the server was shut down properly. I waited a few minutes and put it back online. A new server started nicely, but after 30 minutes the offline message was still there. Note: I have AutoPauseDb set to "Yes - Pause database when not in use", but it has been running like that all along and, as I understand it, that may delay the process by a few seconds, not minutes.

Questions:

  • What processes should be running?
  • What logs can I check?
  • Any other suggestions?

To Reproduce
Steps to reproduce the behavior:

  1. Start a stack
  2. Wait 4 months (or less) until it fails and shows the "Hubs Cloud is currently offline. Check back shortly." message.
  3. Try restarting the server.
  4. Try restarting the stack

Expected behavior
My Museum should show up.

Screenshots
image

Hardware

  • Device: t3.medium
  • OS: ubuntu
  • Browser: any
@donromeo donromeo added bug needs triage For bugs that have not yet been assigned a fix priority labels Aug 10, 2021
@pattersonbl2
Contributor

When you checked the instance, did you check whether you ran out of space on the EBS volume?

@donromeo
Author

Thanks Brandon.

Current status, as per "df", 3GB free out of 8GB.

Filesystem 1K-blocks Used Available Use% Mounted on
.
.
/dev/nvme0n1p1 8065444 4946860 3102200 62% /
.
.

The message "Hubs Cloud is currently offline. Check back shortly." is still there.
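
For reference, the figures in that "df" row can be double-checked with a small sketch like the one below. It assumes the column order shown in the header above (Filesystem, 1K-blocks, Used, Available); the function name `summarize_df_row` is made up for illustration.

```python
def summarize_df_row(row):
    """Return (available GiB, used percent) for one row of `df` output in 1K blocks."""
    # Assumed column order: Filesystem, 1K-blocks, Used, Available, Use%, Mounted on
    fields = row.split()
    used_kib = int(fields[2])
    available_kib = int(fields[3])
    available_gib = available_kib / (1024 * 1024)
    # Percentage relative to used + available space.
    used_percent = 100 * used_kib / (used_kib + available_kib)
    return available_gib, used_percent


gib, pct = summarize_df_row("/dev/nvme0n1p1 8065444 4946860 3102200 62% /")
print(f"{gib:.2f} GiB available, {pct:.1f}% used")
```

With the row above this reports roughly 3 GiB available, so a full root volume does not look like the cause here.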

@donromeo
Author

Syslog shows an endless loop of Certbot errors:

Aug 11 17:26:21 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): Wed Aug 11 17:26:21 UTC 2021 Renewing LetsEncrypt certificates if neccessary
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): usage:
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): certbot [SUBCOMMAND] [options] [-d DOMAIN] [-d DOMAIN] ...
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O):
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): Certbot can obtain and install HTTPS/TLS/SSL certificates. By default,
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): it will attempt to use a webserver both for obtaining and installing the
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): certificate.
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): certbot: error: unrecognized arguments: -- --config=/hab/svc/certbot/config/certbot.renew.ini --register-unsafely-without-email --agree-tos
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): chown: cannot access '/hab/svc/certbot/data/live': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/live': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/live': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): chown: cannot access '/hab/svc/certbot/data/archive': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/archive': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/archive': No such file or directory
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): sleep: missing operand
Aug 11 17:26:22 flamboyant-giant bash[2100]: certbot.default@sopart-01(O): Try 'sleep --help' for more information.

Maybe a problem with certificates? I don't know anything about how Hubs handles certificates.
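
A minimal sketch of how such a loop could be confirmed without reading the whole file: it counts error-looking lines per service in syslog text. The line pattern is inferred only from the excerpt above, and the list of error markers and the name `summarize_errors` are assumptions for illustration, not anything provided by Hubs Cloud.

```python
import re
from collections import Counter

# Matches service output lines shaped like the excerpt above, e.g.
#   "Aug 11 17:26:22 host bash[2100]: certbot.default@sopart-01(O): certbot: error: ..."
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s+\S+\s+bash\[\d+\]:\s+"
    r"(?P<name>[\w.-]+)@[\w.-]+\(\w+\):\s?(?P<msg>.*)$"
)

# Substrings treated as failures; an assumption based on the messages seen above.
ERROR_MARKERS = ("error:", "No such file or directory", "missing operand")


def summarize_errors(lines):
    """Count error-looking messages per service name in an iterable of syslog lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and any(marker in match["msg"] for marker in ERROR_MARKERS):
            counts[match["name"]] += 1
    return counts
```

On Ubuntu it could be run as `summarize_errors(open("/var/log/syslog"))` (reading that file may need elevated permissions); a count for "certbot.default" that keeps growing between runs would point to the same loop.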

@pattersonbl2
Contributor

To my knowledge, the Hubs instance shouldn't be handling the SSL certificate; in the AWS setup, that would be handled at the CDN level. I just looked for certbot on my Hubs instance and it's not installed.

@pattersonbl2
Contributor

I am getting direction to terminate the instance again. It may take two or three tries for the stack to self-heal.

@donromeo
Author

I'll do that and will keep you posted.

Thanks, Brandon.

@donromeo
Author

I've restarted the whole stack 7 times, 4 in previous days, and 3 more today. Same result.

If I send the syslog, will it help to analyze the problem?

A portion of syslog:

Aug 11 23:12:50 romantic-rogue bash[2193]: bio-sup(MR): Updating from mozillareality/ita to mozillareality/ita/0.0.1/20200526203229
Aug 11 23:12:50 romantic-rogue bash[2193]: ita.default@sopart-01(ST): Terminating service (PID: 2517)
Aug 11 23:12:50 romantic-rogue bash[2193]: ita.default@sopart-01(SR): Health checking has been stopped
Aug 11 23:12:50 romantic-rogue bash[2193]: reticulum(HK): The 'reload' hook has been deprecated. You should use the 'reconfigure' hook instead.
Aug 11 23:12:50 romantic-rogue bash[2193]: bio-launch(SV): Child for service 'ita.default@sopart-01' with PID 2517 exited with code signal: 15
Aug 11 23:12:50 romantic-rogue bash[2193]: ita.default@sopart-01(ST): Service gracefully terminated (PID: 2517)
Aug 11 23:12:50 romantic-rogue amazon-ssm-agent.amazon-ssm-agent[2693]: 2021-08-11 23:12:50 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker is not running, starting worker process
Aug 11 23:12:50 romantic-rogue amazon-ssm-agent.amazon-ssm-agent[2693]: 2021-08-11 23:12:50 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker (pid:2739) started
Aug 11 23:12:50 romantic-rogue amazon-ssm-agent.amazon-ssm-agent[2693]: 2021-08-11 23:12:50 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] Monitor long running worker health every 60 seconds
Aug 11 23:12:50 romantic-rogue bash[2193]: bio-sup(AG): Supervisor starting mozillareality/certbot. See the Supervisor output for more details.
Aug 11 23:12:50 romantic-rogue cloud-init[1262]: #33[0m#033[0mSupervisor starting mozillareality/certbot. See the Supervisor output for more details.
Aug 11 23:12:51 romantic-rogue bash[2193]: bio-sup(MR): Starting mozillareality/certbot (mozillareality/certbot/1.0.0/20191224043510)
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(UCW): Watching user.toml
Aug 11 23:12:51 romantic-rogue bash[2193]: bio-sup(MR): Starting mozillareality/ita (mozillareality/ita/0.0.1/20200526203229)
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(UCW): Watching user.toml
Aug 11 23:12:51 romantic-rogue bash[2193]: reticulum(HK): The 'reload' hook has been deprecated. You should use the 'reconfigure' hook instead.
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(HK): Modified hook content in /hab/svc/ita/hooks/run
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(SR): Hooks recompiled
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(SR): Initializing
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(SV): Starting service as user=hab, group=hab
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(HK): Modified hook content in /hab/svc/certbot/hooks/run
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(SR): Hooks recompiled
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(CF): Created configuration file /hab/svc/certbot/config/certbot.renew.ini
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(CF): Created configuration file /hab/svc/certbot/config/certbot.ini
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(SR): Initializing
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(SV): Starting service as user=root, group=hab
Aug 11 23:12:51 romantic-rogue systemd-resolved[787]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(O): 2021-08-11T23:12:51.279Z ita Firing up...
Aug 11 23:12:51 romantic-rogue bash[2193]: certbot.default@sopart-01(O): Wed Aug 11 23:12:51 UTC 2021 Getting any needed LetsEncrypt certificates for 'romantic-rogue.gelattinahub.info,sopart-01-app.gelattinahub.info'
Aug 11 23:12:51 romantic-rogue bash[2193]: ita.default@sopart-01(O): 2021-08-11T23:12:51.841Z ita Listening on port 6000
Aug 11 23:12:52 romantic-rogue bash[2193]: reticulum(HK): The 'reload' hook has been deprecated. You should use the 'reconfigure' hook instead.
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): usage:
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): certbot [SUBCOMMAND] [options] [-d DOMAIN] [-d DOMAIN] ...
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O):
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): Certbot can obtain and install HTTPS/TLS/SSL certificates. By default,
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): it will attempt to use a webserver both for obtaining and installing the
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): certificate.
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): certbot: error: unrecognized arguments: -- --config=/hab/svc/certbot/config/certbot.ini --noninteractive --register-unsafely-without-email --agree-tos
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): chown: cannot access '/hab/svc/certbot/data/live': No such file or directory
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/live': No such file or directory
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/live': No such file or directory
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): chown: cannot access '/hab/svc/certbot/data/archive': No such file or directory
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/archive': No such file or directory
Aug 11 23:12:53 romantic-rogue bash[2193]: certbot.default@sopart-01(O): find: '/hab/svc/certbot/data/archive': No such file or directory

@pattersonbl2
Contributor

Wait, are you restarting the instances or terminating them? Terminating is similar to deleting the EC2 instance: https://www.youtube.com/watch?v=Zwjc1VMKOv0. I know the terminology says restart, but you would need to terminate them. The error could be related to this explanation I was given: "because {{cfg.general.plugin}}'s empty, so the certbot renew's 'chef-habitat-run-hook' templated command's not correctly constructed, and the cause of {{cfg.general.plugin}} being empty is that his certbot-habitat-pkg files are missing".

@donromeo
Author

Thanks, Brandon.

I've actually been putting the stack offline in CloudFormation; as I understand it, this action terminates the instance.

I love Robin, a very nice video. I'll try terminating the instance a few times.

I'll be back.

@donromeo
Author

I terminated the EC2 instance 5 times (waiting 15 minutes) and I still get "Hubs Cloud is currently offline. Check back shortly."

@pattersonbl2
Contributor

Let me report back to my team in the morning to see if we can come up with a possible fix for this.

@donromeo
Author

TQVM Brandon. Good night.

@pattersonbl2
Contributor

I have a couple of things I want you to do to help with troubleshooting. If you are in the Discord community, please DM me, because this will output non-public information about your stack.

@donromeo
Author

Thanks Brandon.
I just created an account on Discord: donromeo#4161. Please DM me.

@pattersonbl2
Contributor

The issue required the stack to be deleted and refreshed.
