Restart problem with datadog-agent if configured mongo not running #2330

Closed
spiritmonger opened this issue Mar 8, 2016 · 6 comments

@spiritmonger

datadog-agent: 5.7.0
os: amazon linux 2015.09

configured mongo (/etc/dd-agent/conf.d/mongo.yaml):

instances:
  - server: mongodb://localhost/db1:27017
    tags:
    - db1
  - server: mongodb://localhost/admin:27017
    tags:
    - admin

init_config:
# No init_config details needed
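
A side note on the config above: in the standard MongoDB connection-string format the port follows the host (`mongodb://host:port/database`), whereas the URIs here place `:27017` after the database name. The sketch below uses Python's generic URL parser, not pymongo's own parser, to show where each piece ends up. `describe_mongo_uri` is a hypothetical helper, and whether this has any bearing on the restart problem is not established in this thread.

```python
from urllib.parse import urlsplit


def describe_mongo_uri(uri):
    """Return (host, port, database) as a generic URL parser sees them.

    Hypothetical helper for illustration only; pymongo has its own
    URI parser whose handling may differ.
    """
    parts = urlsplit(uri)
    database = parts.path.lstrip("/") or None
    return parts.hostname, parts.port, database


# Port written after the database name, as in the config above:
print(describe_mongo_uri("mongodb://localhost/db1:27017"))
# Port written after the host, as in the standard format:
print(describe_mongo_uri("mongodb://localhost:27017/db1"))
```

In the first form the parser reports no port and treats `db1:27017` as the path component; in the second it reports port 27017 and database `db1`.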

PROBLEM
If mongo is not running, datadog-agent can't stop correctly. When I tried to restart it, I saw:

$ service datadog-agent restart
Stopping Datadog Agent (using killproc on supervisord):    [  OK  ]
Starting Datadog Agent (using supervisord):Unlinking stale socket /opt/datadog-agent/run/datadog-supervisor.sock

datadog-agent:collector          FATAL     Exited too quickly (process log may have details)
datadog-agent:dogstatsd          RUNNING   pid 14792, uptime 0:00:10
datadog-agent:forwarder          RUNNING   pid 14794, uptime 0:00:10
datadog-agent:go-metro           EXITED    Mar 08 02:27 PM
datadog-agent:jmxfetch           EXITED    Mar 08 02:27 PM
Datadog Agent (supervisor) is NOT running all child process[FAILED]
Stopping Datadog Agent (using killproc on supervisord):    [  OK  ]

...and the process list still showed "/opt/datadog-agent/embedded/bin/python /opt/datadog-agent/agent/agent.py foreground --use-local-forwarder".

So datadog-agent can't be restarted immediately while mongo is stopped.

Once the agent.py process has finished or died, datadog-agent can be started again.
If mongo is running, datadog-agent restarts fine.
With datadog-agent package version 5.6.3, restart works correctly.

@hkaj
Member

hkaj commented Mar 8, 2016

Hi @spiritmonger
Thank you for reporting this with great details. Could you also send us a flare please? https://help.datadoghq.com/hc/en-us/articles/204991415-Send-logs-and-configs-to-Datadog-via-flare-command
It looks like the collector is failing somewhere and the flare would help us investigate that.

Thanks again

@spiritmonger
Author

Hi,
I have now rolled datadog-agent back on my servers. If you can't reproduce the problem, please ask again and I will try to send a flare.

with regards
Michal Kucera


@olivielpeau
Member

@spiritmonger I've tried reproducing your issue but haven't been able to replicate this behavior.

Could you send us a flare please? This could allow us to understand what happened.

Thanks!

@olivielpeau olivielpeau added this to the Triage milestone Mar 9, 2016
@spiritmonger
Author

Hi,
I will try today. In the meantime, some quick info:

  • Amazon Linux
  • mongo is disabled in init.d
  • Chef orchestration (OpsWorks)
  • datadog-agent is started on boot
  • mongo is started by Chef in the "configuration" step

But as far as I can see, the problem is that after the stop phase of the init.d script, the process /opt/datadog-agent/embedded/bin/python /opt/datadog-agent/agent/agent.py foreground --use-local-forwarder is still up.
The start step then can't start anything, because for the init script datadog-agent is still "running" :(

The init script then finishes with exit 1.

From the log:

2016-03-08 14:26:48 UTC | INFO | dd.collector | utils.pidfile(pidfile.py:31) | Pid file is: /opt/datadog-agent/run/dd-agent.pid
2016-03-08 14:26:48 UTC | INFO | dd.collector | collector(agent.py:297) | Agent version 5.7.0
2016-03-08 14:26:48 UTC | INFO | dd.collector | root(agent.py:318) | Running in foreground
2016-03-08 14:26:48 UTC | INFO | dd.collector | daemon(daemon.py:157) | Starting
2016-03-08 14:26:48 UTC | ERROR | dd.collector | daemon(daemon.py:165) | Not starting, another instance is already running (using pidfile /opt/datadog-agent/run/dd-agent.pid)
2016-03-08 14:27:03 UTC | ERROR | dd.collector | checks.mongo(__init__.py:763) | Check 'mongo' instance #1 failed
Traceback (most recent call last):
  File "/opt/datadog-agent/agent/checks/__init__.py", line 746, in run
    self.check(copy.deepcopy(instance))
  File "/opt/datadog-agent/agent/checks.d/mongo.py", line 577, in check
    status = db.command('serverStatus', tcmalloc=collect_tcmalloc_metrics)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/pymongo/database.py", line 478, in command
    with client._socket_for_reads(read_preference) as (sock_info, slave_ok):
  File "/opt/datadog-agent/embedded/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/pymongo/mongo_client.py", line 748, in _socket_for_reads
    with self._get_socket(read_preference) as sock_info:
  File "/opt/datadog-agent/embedded/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/pymongo/mongo_client.py", line 712, in _get_socket
    server = self._get_topology().select_server(selector)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/pymongo/topology.py", line 141, in select_server
    address))
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/pymongo/topology.py", line 117, in select_servers
    self._error_message(selector))
ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused

2016-03-08 14:27:03 UTC | INFO | dd.collector | collector(agent.py:215) | Exiting. Bye bye.
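
The "Not starting, another instance is already running (using pidfile ...)" line in the log points at a pid-file guard. The following is a minimal sketch of how such a guard commonly works on POSIX systems, assuming the pid file holds a single integer. It is not the agent's actual daemon.py code, and the function names are hypothetical.

```python
import errno
import os


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, sends nothing
    except OSError as exc:
        # EPERM means the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def read_pidfile(path):
    """Return the pid stored in `path`, or None if missing or unreadable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def another_instance_running(pidfile_path):
    """Hypothetical guard: True if the pid file points at a live process."""
    pid = read_pidfile(pidfile_path)
    return pid is not None and pid_is_running(pid)
```

With a guard like this, a start attempt is refused for as long as the previous agent.py process is alive, which matches the sequence in the log: the refusal is logged at 14:26:48 and the old collector only exits at 14:27:03.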

  1. Because /opt/datadog-agent/agent/agent.py is still running, I see "Not starting" in the log.
  2. Why does the log contain the full Python exception instead of a one-line error such as "can't connect to mongo"?
  3. Could the problem be a timeout while trying to reconnect to a mongo that is not running?
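
To illustrate questions 2 and 3, here is a minimal sketch of a bounded connectivity pre-check that logs a single line on failure. It uses only the standard library and is not how the agent's mongo check works; the function name, logger name, and default timeout are assumptions. In pymongo itself, the wait before `ServerSelectionTimeoutError` is raised is governed by the `serverSelectionTimeoutMS` client option.

```python
import logging
import socket

log = logging.getLogger("mongo_precheck")  # hypothetical logger name


def mongo_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection succeeds within `timeout` seconds.

    Only tests that something accepts connections on host:port; it does
    not speak the MongoDB protocol.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:  # covers refused connections and timeouts
        # One line instead of a full traceback.
        log.error("can't connect to mongo at %s:%d: %s", host, port, exc)
        return False
    sock.close()
    return True
```

A caller could skip the real check when `mongo_reachable()` returns False, so that a stopped mongo costs at most `timeout` seconds per instance.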

with regards
Michal Kucera


@olivielpeau
Member

Thanks @spiritmonger for the details.

After some digging, here's what I've found: there's currently an issue with our init script on RHEL, which explains why the restart command doesn't work properly. #2349 will fix the issue (you can apply the same change to your init script in /etc/dd-agent/datadog-agent).

I haven't found exactly why the issue only appears since 5.7.0 on your setup. There have been a lot of changes in the mongodb check in 5.7.0 (including an upgrade of the pymongo module), so it's likely related to one of those changes.

Hope these bits help.

@irabinovitch
Contributor

@spiritmonger #2439 should be available in our most recent release. Please let us know if this helped address the issues you were experiencing.

@ian28223 ian28223 closed this as completed Dec 9, 2021