OMSAgent fills the OS disk and wreaks havoc on the system #366

Closed
brandubh opened this issue Feb 2, 2017 · 26 comments

@brandubh

brandubh commented Feb 2, 2017

Two different causes, same effect: omsagent.log grows out of control, filling the OS disk and disrupting the monitored system. This condition must be avoided at all costs; the agent must not disrupt the system it monitors.

  1. During a solution upgrade pushed from the cloud, the following entry filled the log and the disk of several VMs: [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" err$

  2. The in_tail plugin does not handle access-denied errors properly, filling the log and the disk with:
    2017-02-02 10:21:49 +0100 [error]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_tail.rb:484:in `on_timer'
    2017-02-02 10:21:49 +0100 [error]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/cool.io-1.4.4/lib/cool.io/loop.rb:88:in `run_once'
    2017-02-02 10:21:49 +0100 [error]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/cool.io-1.4.4/lib/cool.io/loop.rb:88:in `run'
    2017-02-02 10:21:49 +0100 [error]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_tail.rb:253:in `run'
    2017-02-02 10:21:50 +0100 [error]: Permission denied @ rb_file_s_stat - /var/log/php-fpm/error.log

@EtienneDeneuve

Where can we send the logs? They come to 24 GB ;)

@EtienneDeneuve

I needed to recover my service, so I deleted the log, but here is a tail -n 200 of the omsagent log:
https://gist.github.com/EtienneDeneuve/cd6b86d6f175663f02e21cbc7d60ce59

@EtienneDeneuve

I've published a quick workaround on my blog that uses logrotate to avoid this until a proper fix is released: http://etienne.deneuve.xyz/2017/02/02/agent-oms-sur-linux-dans-azure/

@ngroon

ngroon commented Feb 3, 2017

+1 on issue #1 mentioned by @brandubh

@slumos

slumos commented Feb 3, 2017

Getting the same issue. Same error message. Relevant lines from our logs below. There's nothing relevant for hours before this happens. After, the last line below is logged about 2000 times per second until omsagent is restarted.

2017-02-02 22:30:47 +0000 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
  2017-02-02 22:30:47 +0000 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_exec.rb:140:in `popen'
  2017-02-02 22:30:47 +0000 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_exec.rb:140:in `run_periodic'
2017-02-02 22:30:47 +0000 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
  2017-02-02 22:30:47 +0000 [warn]: suppressed same stacktrace
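For a rough sense of scale, the rate reported above can be turned into a time-to-full-disk estimate. This is a back-of-the-envelope sketch only: the function name `seconds_to_fill` and the 20 GiB free-space figure are hypothetical, and it assumes every flooded line is as long as the [error] line quoted above (the "suppressed same stacktrace" lines are shorter, so the real figure would be somewhat higher).

```python
def seconds_to_fill(free_bytes: int, lines_per_second: float, bytes_per_line: int) -> float:
    """Return how many seconds a log flood needs to consume free_bytes."""
    if lines_per_second <= 0 or bytes_per_line <= 0:
        raise ValueError("rate and line size must be positive")
    return free_bytes / (lines_per_second * bytes_per_line)


# The [error] line quoted in this thread, used only to measure a typical line size.
SAMPLE_LINE = (
    '2017-02-02 22:30:47 +0000 [error]: exec failed to run or shutdown child '
    'process error="No such file or directory - '
    '/opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"\n'
)

if __name__ == "__main__":
    free = 20 * 1024**3  # hypothetical: 20 GiB of free space on the OS disk
    secs = seconds_to_fill(free, 2000, len(SAMPLE_LINE.encode()))
    print(f"about {secs / 3600:.1f} hours until the disk is full")
```

Plugging in the free space of a real machine gives an idea of how quickly someone has to react once the flood starts.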

@williamayerst

Getting this issue also on Ubuntu 16.04 in multiple OMS agents connecting to multiple dashboards

@hepf68

hepf68 commented Feb 6, 2017

I'm getting this issue on CentOS release 6.8 (Final). It creates 20GB of logs on each machine :O

@williamayerst

Any update on this? We're not the only ones affected!

@webash

webash commented Feb 8, 2017

+1 to having this issue

@webash

webash commented Feb 8, 2017

@robbiezhang @lagalbra @agup006 @NarineM - anyone able to take a look at this for us?

@hbrother

hbrother commented Feb 8, 2017

We have been fighting this issue too. It has filled the disk on several of our prod servers. Thankfully, our Platform Admins have prevented outages.

Here is what we are seeing in our logs:
2017-02-02 09:48:02 -0500 [info]: reading config file path="/etc/opt/microsoft/omsagent/conf/omsagent.conf"
2017-02-02 09:48:02 -0500 [info]: starting fluentd-0.12.24 without supervision
2017-02-02 09:48:02 -0500 [info]: gem 'fluentd' version '0.12.24'
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.changetracking.package" type="filter_changetracking"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.health.**" type="filter_operation"
2017-02-02 09:48:02 -0500 [info]: adding match pattern="oms.health.** oms.heartbeat.**" type="out_oms"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.operation.**" type="filter_operation"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.patch_management" type="filter_patch_management"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.patch_management_immediate_run" type="filter_patch_management"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.update_progress.apt" type="grep"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.update_progress.apt" type="filter_linux_update_run_progress"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.update_progress.yum" type="filter_linux_update_run_progress"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.security_baseline" type="filter_security_baseline"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.changetracking.service" type="filter_changetracking"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.syslog.**" type="filter_syslog"
2017-02-02 09:48:02 -0500 [info]: adding match pattern="oms.blob.**" type="out_oms_blob"
2017-02-02 09:48:02 -0500 [info]: adding match pattern="oms.** docker.**" type="out_oms"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="monitor_agent"
2017-02-02 09:48:02 -0500 [info]: adding source type="oms_heartbeat"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="dsc_monitor"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="tail"
2017-02-02 09:48:02 -0500 [info]: adding source type="sudo_tail"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="oms_omi"
2017-02-02 09:48:02 -0500 [info]: adding source type="oms_omi"
2017-02-02 09:48:02 -0500 [info]: adding source type="oms_omi"
2017-02-02 09:48:02 -0500 [info]: adding source type="syslog"
2017-02-02 09:48:02 -0500 [info]: using configuration file: <ROOT>

Then:
2017-02-02 09:48:02 -0500 [warn]: /var/log/apt/history.log not found. Continuing without tailing it.
2017-02-02 09:48:02 -0500 [info]: listening syslog socket on 127.0.0.1:25224 with udp
2017-02-03 09:24:20 -0500 [info]: Encountered retryable exception. Will retry sending data later.
2017-02-03 09:24:20 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2017-02-03 09:23:47 -0500 error_class="RuntimeError" error="Net::HTTP.Post raises exception: Net::ReadTimeout, 'Net::ReadTimeout'" plugin_id="object:3fd6c17d328c"
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/plugin/out_oms.rb:62:in `rescue in handle_record'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/plugin/out_oms.rb:43:in `handle_record'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/plugin/out_oms.rb:113:in `block in write'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/plugin/out_oms.rb:112:in `each'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/plugin/out_oms.rb:112:in `write'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/buffer.rb:354:in `write_chunk'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/buffer.rb:333:in `pop'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/output.rb:338:in `try_flush'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/output.rb:149:in `run'
2017-02-03 09:24:20 -0500 [warn]: retry succeeded. plugin_id="object:3fd6c17d328c"
2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
2017-02-03 09:48:02 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_exec.rb:140:in `popen'
2017-02-03 09:48:02 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_exec.rb:140:in `run_periodic'
2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
2017-02-03 09:48:02 -0500 [warn]: suppressed same stacktrace
2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
2017-02-03 09:48:02 -0500 [warn]: suppressed same stacktrace
2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
2017-02-03 09:48:02 -0500 [warn]: suppressed same stacktrace
2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
2017-02-03 09:48:02 -0500 [warn]: suppressed same stacktrace

The last two entries repeat until the disk fills.
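Many identical errors within a single second, as in the log above, is the pattern a periodic task produces when a failure skips its wait and the loop retries immediately. Whether that is what happens inside the agent is an assumption here; the sketch below is not fluentd's code. It is a minimal Python illustration of the usual remedy, a retry loop that backs off after each failure. The names `run_periodic`, `max_runs` and `max_backoff` are hypothetical.

```python
import time
from typing import Callable, List


def run_periodic(
    task: Callable[[], None],
    interval: float,
    max_runs: int,
    sleep: Callable[[float], None] = time.sleep,
    max_backoff: float = 300.0,
) -> List[str]:
    """Run task up to max_runs times and return the error messages seen.

    After a success the loop waits `interval` seconds. After a failure it
    waits 1, 2, 4, ... seconds (capped at max_backoff) instead of retrying
    at once, so a permanently missing binary yields a trickle of log lines
    rather than thousands per second.
    """
    errors: List[str] = []
    backoff = 1.0
    for _ in range(max_runs):
        try:
            task()
            backoff = 1.0  # reset the delay after a success
            sleep(interval)
        except OSError as exc:
            errors.append(str(exc))
            sleep(backoff)  # wait before the next attempt
            backoff = min(backoff * 2, max_backoff)
    return errors
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting.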

Etienne's fix does not work for us, as we already had similar settings in our logrotate configuration. The only sure way I know of to prevent this is to put the logs on a separate partition.

We are at the point of shutting down the agent on all Linux prod systems. We had already disabled centralized configuration due to another issue that was consuming RAM on our systems. Clearly that wasn't enough to prevent updates from impacting prod.

@hbrother

hbrother commented Feb 8, 2017

Update: We are setting our logging level to "fatal" to avoid filling up disk.

If anyone knows a way to change the location of the logs, that would give us better options.

@EtienneDeneuve

Sorry @hbrother that my fix does not work on your side. Feel free to give my contact info to your admins; maybe we can find a better solution together :)

@hbrother

hbrother commented Feb 8, 2017

@EtienneDeneuve No worries. It may be that one of our other settings interfered with your working solution. Our settings were:

    {
        rotate 5
        missingok
        notifempty
        compress
        size 50k
        copytruncate
        postrotate
            /sbin/service omsagent restart
        endscript
    }
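One likely reason a `size 50k` rule cannot keep up is that logrotate only evaluates its rules when it is invoked, which on most systems is once a day from cron, so nothing bounds the file between runs. As a stopgap, a size check can be run much more often. The sketch below is a hypothetical helper, not part of the agent or of logrotate: `cap_log_size` and its threshold are assumptions, and it discards the log content it truncates.

```python
import os


def cap_log_size(path: str, max_bytes: int) -> bool:
    """Truncate the file at path if it is larger than max_bytes.

    Returns True when the file was truncated, False when it was small
    enough or does not exist. The file is truncated in place (similar in
    spirit to logrotate's copytruncate, but without keeping a copy), so a
    writer that holds it open does not need to be restarted.
    """
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return False
    if size <= max_bytes:
        return False
    # Open without truncating, then cut the file down to zero bytes.
    with open(path, "r+b") as handle:
        handle.truncate(0)
    return True
```

Run every minute from cron or a systemd timer against the agent log, with a threshold chosen for the machine, this would limit growth to roughly one minute of flood, at the cost of losing the truncated entries.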

@lagalbra
Contributor

lagalbra commented Feb 8, 2017

The root issue has been mitigated. An error-causing conf file security_baseline.conf has been removed via a DSC update to affected machines (those with the Security and Audit OMS solution enabled). To check that it has been removed from your machine, verify that the file /etc/opt/microsoft/omsagent/conf/omsagent.d/security_baseline.conf does not exist.

Our team is currently working on improving our logging strategy to avoid generating large log files.
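As an illustration of what a size-bounded logging strategy can look like in general, here is a minimal Python sketch using the standard library's `RotatingFileHandler`. It is not how the agent (which is Ruby/fluentd based) logs, and the function name `make_bounded_logger` and its defaults are hypothetical. The point it shows is that the writer itself caps disk usage at roughly `max_bytes * (backups + 1)`, without depending on an external rotation job.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_bounded_logger(path: str, max_bytes: int = 1_000_000, backups: int = 3) -> logging.Logger:
    """Return a logger whose files never exceed about max_bytes * (backups + 1)."""
    logger = logging.getLogger(f"bounded.{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s]: %(message)s"))
        logger.addHandler(handler)
    return logger
```

With this arrangement, even an error repeated thousands of times per second only overwrites the oldest backup instead of consuming the disk.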

If the security_baseline.conf file does exist in the above location, follow these steps:

  1. Stop the OMS Agent daemon: sudo /opt/microsoft/omsagent/bin/service_control stop
  2. Stop the OMI daemon: sudo /opt/omi/bin/service_control stop
  3. Delete the large log file(s) in /var/opt/microsoft/omsagent/log/
  4. Uninstall the OMS Agent using the purge option. This will uninstall the OMS Agent package and remove all related files (including the logs).
    For example: sudo sh ./omsagent-1.2.0-148.universal.x64.sh --purge (you can use the agent version installed)
  5. Install the OMS Agent according to the documented instructions
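For anyone who has to repeat the check on many machines, the steps above can be sketched as a dry-run helper that only prints what to do. This is a sketch under assumptions: the function name `remediation_plan` is hypothetical, the installer file name is the example from step 4 and should be replaced with the installed version, and nothing is executed.

```python
import os
from typing import List

CONF_PATH = "/etc/opt/microsoft/omsagent/conf/omsagent.d/security_baseline.conf"


def remediation_plan(
    conf_path: str = CONF_PATH,
    installer: str = "./omsagent-1.2.0-148.universal.x64.sh",  # example version from step 4
) -> List[str]:
    """Return the steps from this thread if conf_path exists, else an empty list.

    Lines starting with '#' are manual steps; the others are the commands
    quoted in the instructions. Nothing is run by this function.
    """
    if not os.path.exists(conf_path):
        return []
    return [
        "sudo /opt/microsoft/omsagent/bin/service_control stop",
        "sudo /opt/omi/bin/service_control stop",
        "# delete the large log file(s) in /var/opt/microsoft/omsagent/log/",
        f"sudo sh {installer} --purge",
        "# install the OMS Agent according to the documented instructions",
    ]


if __name__ == "__main__":
    plan = remediation_plan()
    print("\n".join(plan) if plan else "security_baseline.conf not present; nothing to do")
```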

@lagalbra closed this as completed Feb 8, 2017

@webash

webash commented Feb 9, 2017

Thanks @lagalbra. I notice the installation instructions refer to version 1.3.0, yet that version isn't in the releases. Will that be released soon?

@webash

webash commented Feb 9, 2017

I can confirm that my machine did still have the security_baseline.conf file, so I had to follow the purge steps and reinstall.

@lagalbra
Contributor

lagalbra commented Feb 9, 2017

@webash Our goal is to release OMSAgent version 1.3.0 this month.

@williamayerst

@lagalbra can you please advise how we should deal with the OMS Agent Extension object? It will still be present on the machines even though the OMS agent will be uninstalled in Step 4 of your instructions, and ideally we'd like to use this method to reinstall rather than transferring files to EVERY affected Linux machine in our enterprise.

Thanks,

@webash

webash commented Feb 11, 2017

Unfortunately this issue hasn't been resolved by reinstalling the agent. security_baseline.conf has returned to the machine and the log has filled the drive again. @lagalbra

@lagalbra
Contributor

@webash Could you copy the contents of security_baseline.conf into this thread?
If you want to temporarily resolve the issue before the fixed version is fully deployed, the file will be removed from your machine if you disable the Security and Audit solution in your OMS workspace.

@ngroon

ngroon commented Feb 13, 2017

We too had the security_baseline.conf file. Just wanted to confirm that the purge uninstall/reinstall fixed the issue for us.

Thanks, @lagalbra

@webash

webash commented Feb 21, 2017

Apologies for the delay @lagalbra, here are the contents of security_baseline.conf:

# Security Baseline plugins

<source>
    type exec
    tag oms.security_baseline
    command sleep 60 && /opt/microsoft/omsagent/plugin/omsbaseline -d  /etc/opt/microsoft/omsagent/conf/omsagent.d/
    format json
</source>

<source>
    type exec
    tag oms.security_baseline
    command /opt/microsoft/omsagent/plugin/omsbaseline -d /etc/opt/microsoft/omsagent/conf/omsagent.d/
    format json
    run_interval 24h
</source>

<filter oms.security_baseline>
    type filter_security_baseline
    log_level info
</filter>

#<match oms.security_baseline.** oms.security_baseline_summary.**>
#    type stdout
#</match>

@lagalbra
Contributor

@webash Your security_baseline.conf file is the buggy one that should have been removed from all customers' machines. I would recommend purging the agent and disabling the Security and Audit solution on the connected workspace. Delete any remaining large log files and make sure no omsconfig/ or omsagent/ folders are left on the machine. Then install the agent again.

We should have a new configuration file for this issue deployed within the week.

@webash

webash commented Feb 25, 2017

It's a personal lab environment where I'm seeing the most severe effects from this, so I've simply stopped the OMS service on the affected machines for now. Will you be able to update us here when the new conf file has been deployed?

@lagalbra
Contributor

The new conf file has been deployed, but it is not yet publicly available, to allow for a safer preview period. If you re-onboard your machines to the OMS service now, you will not receive any version of the security_baseline.conf file.
