OMSAgent fills the OS disk and wreaks havoc on the system #366

brandubh · 2017-02-02T11:02:55Z

Two different causes, same effect: omsagent.log grows out of control, filling the OS disk and disrupting the monitored system. This condition must be avoided at all costs: the agent must not disrupt the system it monitors.

1. During a solution upgrade pushed from the cloud, we got the following entry filling the log and the disk of several VMs: [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" err$
2. The in_tail plugin doesn't handle access-denied errors appropriately, filling the log and the disk with:
2017-02-02 10:21:49 +0100 [error]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_tail.rb:484:in `on_timer'
2017-02-02 10:21:49 +0100 [error]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/cool.io-1.4.4/lib/cool.io/loop.rb:88:in `run_once'
2017-02-02 10:21:49 +0100 [error]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/cool.io-1.4.4/lib/cool.io/loop.rb:88:in `run'
2017-02-02 10:21:49 +0100 [error]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_tail.rb:253:in `run'
2017-02-02 10:21:50 +0100 [error]: Permission denied @ rb_file_s_stat - /var/log/php-fpm/error.log
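
The second cause can be checked ahead of time. The following is a minimal sketch, not part of the agent, that calls stat on each path a tail source is expected to read and reports which ones fail. The function name `stat_problems` is hypothetical, and the result only means something if the script runs as the same user as the agent (assumed here to be a dedicated service account).

```python
import os

def stat_problems(paths):
    """Return {path: exception name} for every path that cannot be stat'ed."""
    problems = {}
    for path in paths:
        try:
            os.stat(path)
        except OSError as exc:
            # PermissionError and FileNotFoundError are both subclasses of OSError.
            problems[path] = type(exc).__name__
    return problems

# The path below is the one named in the log excerpt above.
print(stat_problems(["/var/log/php-fpm/error.log"]))
```

An empty dictionary means every listed path could be stat'ed by the current user.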


EtienneDeneuve · 2017-02-02T18:00:18Z

Where can we send the logs? 24 GB ;)

EtienneDeneuve · 2017-02-02T18:17:44Z

I needed to recover my service, so I deleted the log, but here is a tail -n 200 of the omsagent log:
https://gist.github.com/EtienneDeneuve/cd6b86d6f175663f02e21cbc7d60ce59

EtienneDeneuve · 2017-02-02T18:59:02Z

I've published a quick workaround on my blog, using logrotate, to avoid this until a fix is available: http://etienne.deneuve.xyz/2017/02/02/agent-oms-sur-linux-dans-azure/

ngroon · 2017-02-03T14:59:51Z

+1 on issue #1 mentioned by @brandubh

slumos · 2017-02-03T17:38:48Z

Getting the same issue. Same error message. Relevant lines from our logs below. There's nothing relevant for hours before this happens. After, the last line below is logged about 2000 times per second until omsagent is restarted.

2017-02-02 22:30:47 +0000 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
  2017-02-02 22:30:47 +0000 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_exec.rb:140:in `popen'
  2017-02-02 22:30:47 +0000 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_exec.rb:140:in `run_periodic'
2017-02-02 22:30:47 +0000 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
  2017-02-02 22:30:47 +0000 [warn]: suppressed same stacktrace
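
To see which entry is responsible for the flood, the timestamps can be stripped and the remaining messages counted. This is a minimal sketch that assumes the line prefix format shown in the excerpt above; the function name `top_messages` is hypothetical.

```python
import re
from collections import Counter

# Matches the timestamp and level prefix seen in the excerpt above,
# e.g. "2017-02-02 22:30:47 +0000 [error]: ".
PREFIX = re.compile(r"^\s*\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4} \[(\w+)\]: ")

def top_messages(lines, n=3):
    """Count (level, message) pairs with the timestamp removed, most frequent first."""
    counts = Counter()
    for line in lines:
        match = PREFIX.match(line)
        if match:
            message = line[match.end():].strip()
            counts[(match.group(1), message)] += 1
    return counts.most_common(n)

ERROR_MESSAGE = ('exec failed to run or shutdown child process '
                 'error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" '
                 'error_class="Errno::ENOENT"')

sample = [
    "2017-02-02 22:30:47 +0000 [error]: " + ERROR_MESSAGE,
    "  2017-02-02 22:30:47 +0000 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_exec.rb:140:in `popen'",
    "2017-02-02 22:30:47 +0000 [error]: " + ERROR_MESSAGE,
    "  2017-02-02 22:30:47 +0000 [warn]: suppressed same stacktrace",
]

for (level, message), count in top_messages(sample):
    print(count, level, message[:60])
```

For a real log, an open file handle can be passed instead of the sample list, since the function only iterates over lines.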

williamayerst · 2017-02-06T09:56:47Z

Getting this issue also on Ubuntu 16.04 in multiple OMS agents connecting to multiple dashboards

hepf68 · 2017-02-06T10:06:29Z

I'm getting this issue on CentOS release 6.8 (Final). It creates 20GB of logs on each machine :O

williamayerst · 2017-02-08T10:01:55Z

Any update on this? We're not the only ones affected!

webash · 2017-02-08T11:44:43Z

+1 to having this issue

webash · 2017-02-08T12:06:17Z

@robbiezhang @lagalbra @agup006 @NarineM - anyone able to take a look at this for us?

hbrother · 2017-02-08T14:25:24Z

We have been fighting this issue too. It has filled the disk on several of our prod servers. Thankfully, our Platform Admins have prevented outages.

Here is what we are seeing in our logs:
2017-02-02 09:48:02 -0500 [info]: reading config file path="/etc/opt/microsoft/omsagent/conf/omsagent.conf"
2017-02-02 09:48:02 -0500 [info]: starting fluentd-0.12.24 without supervision
2017-02-02 09:48:02 -0500 [info]: gem 'fluentd' version '0.12.24'
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.changetracking.package" type="filter_changetracking"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.health.**" type="filter_operation"
2017-02-02 09:48:02 -0500 [info]: adding match pattern="oms.health.** oms.heartbeat.**" type="out_oms"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.operation.**" type="filter_operation"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.patch_management" type="filter_patch_management"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.patch_management_immediate_run" type="filter_patch_management"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.update_progress.apt" type="grep"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.update_progress.apt" type="filter_linux_update_run_progress"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.update_progress.yum" type="filter_linux_update_run_progress"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.security_baseline" type="filter_security_baseline"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.changetracking.service" type="filter_changetracking"
2017-02-02 09:48:02 -0500 [info]: adding filter pattern="oms.syslog.**" type="filter_syslog"
2017-02-02 09:48:02 -0500 [info]: adding match pattern="oms.blob.**" type="out_oms_blob"
2017-02-02 09:48:02 -0500 [info]: adding match pattern="oms.** docker.**" type="out_oms"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="monitor_agent"
2017-02-02 09:48:02 -0500 [info]: adding source type="oms_heartbeat"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="dsc_monitor"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="tail"
2017-02-02 09:48:02 -0500 [info]: adding source type="sudo_tail"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="exec"
2017-02-02 09:48:02 -0500 [info]: adding source type="oms_omi"
2017-02-02 09:48:02 -0500 [info]: adding source type="oms_omi"
2017-02-02 09:48:02 -0500 [info]: adding source type="oms_omi"
2017-02-02 09:48:02 -0500 [info]: adding source type="syslog"
2017-02-02 09:48:02 -0500 [info]: using configuration file: <ROOT>

Then:
2017-02-02 09:48:02 -0500 [warn]: /var/log/apt/history.log not found. Continuing without tailing it.
2017-02-02 09:48:02 -0500 [info]: listening syslog socket on 127.0.0.1:25224 with udp
2017-02-03 09:24:20 -0500 [info]: Encountered retryable exception. Will retry sending data later.
2017-02-03 09:24:20 -0500 [warn]: temporarily failed to flush the buffer. next_retry=2017-02-03 09:23:47 -0500 error_class="RuntimeError" error="Net::HTTP.Post raises exception: Net::ReadTimeout, 'Net::ReadTimeout'" plugin_id="object:3fd6c17d328c"
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/plugin/out_oms.rb:62:in `rescue in handle_record'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/plugin/out_oms.rb:43:in `handle_record'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/plugin/out_oms.rb:113:in `block in write'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/plugin/out_oms.rb:112:in `each'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/plugin/out_oms.rb:112:in `write'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/buffer.rb:354:in `write_chunk'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/buffer.rb:333:in `pop'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/output.rb:338:in `try_flush'
2017-02-03 09:24:20 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/output.rb:149:in `run'
2017-02-03 09:24:20 -0500 [warn]: retry succeeded. plugin_id="object:3fd6c17d328c"
2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
2017-02-03 09:48:02 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_exec.rb:140:in `popen'
2017-02-03 09:48:02 -0500 [warn]: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.2.0/gems/fluentd-0.12.24/lib/fluent/plugin/in_exec.rb:140:in `run_periodic'
2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
2017-02-03 09:48:02 -0500 [warn]: suppressed same stacktrace
2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
2017-02-03 09:48:02 -0500 [warn]: suppressed same stacktrace
2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
2017-02-03 09:48:02 -0500 [warn]: suppressed same stacktrace
2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" error_class="Errno::ENOENT"
2017-02-03 09:48:02 -0500 [warn]: suppressed same stacktrace

The last two entries repeat until the disk fills.

Etienne's fix does not work for us as we already had similar settings on our logrotate. The only sure way to prevent this, that I know of, is to put the logs on a separate partition.

We are at the point of shutting down the agent on all Linux prod systems. We had already disabled centralized configuration due to another issue that was consuming RAM on our systems. Clearly that wasn't enough to prevent updates from impacting prod.

hbrother · 2017-02-08T14:54:36Z

Update: We are setting our logging level to "fatal" to avoid filling up disk.

If anyone knows of a way to change the location of the logs that would give us better options.

EtienneDeneuve · 2017-02-08T15:41:55Z

Sorry, @hbrother, that my fix doesn't work on your side. Feel free to give my contact info to your admins; maybe we can find a better solution together :)

hbrother · 2017-02-08T15:51:52Z

@EtienneDeneuve No worries. It may be that one of the other settings we have interfered with your working solution. Our settings were:

{
    rotate 5
    missingok
    notifempty
    compress
    size 50k
    copytruncate
    postrotate
        /sbin/service omsagent restart
    endscript
}
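
A rough back-of-the-envelope calculation may explain why a size rule alone does not help here: logrotate evaluates `size` only when it is invoked, and it is commonly invoked once a day from cron (an assumption about this setup, not something stated in the thread). The sketch below uses the rate of about 2000 lines per second reported earlier in this thread and the two entries that repeat in the log. Treating those two entries as strictly alternating is also an assumption.

```python
ERROR_LINE = ('2017-02-03 09:48:02 -0500 [error]: exec failed to run or shutdown child process '
              'error="No such file or directory - /opt/microsoft/omsagent/plugin/omsbaseline" '
              'error_class="Errno::ENOENT"')
WARN_LINE = "2017-02-03 09:48:02 -0500 [warn]: suppressed same stacktrace"

LINES_PER_SECOND = 2000  # approximate rate reported earlier in this thread

# Assumption: the two entries alternate, so the average line is the mean of
# both lengths, each with one extra byte for the newline.
avg_line_bytes = (len(ERROR_LINE) + 1 + len(WARN_LINE) + 1) / 2
bytes_per_second = LINES_PER_SECOND * avg_line_bytes

def seconds_to_reach(size_bytes):
    """Seconds of continuous logging needed to write size_bytes."""
    return size_bytes / bytes_per_second

print("seconds to reach 50k:", seconds_to_reach(50 * 1024))
print("hours to reach 20 GiB:", seconds_to_reach(20 * 1024**3) / 3600)
print("GiB written per day:", bytes_per_second * 86400 / 1024**3)
```

Under these assumptions the 50k threshold is crossed in well under a second, so how often logrotate runs matters far more than the size limit itself.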

lagalbra · 2017-02-08T19:30:59Z

The root issue has been mitigated. An error-causing conf file security_baseline.conf has been removed via a DSC update to affected machines (those with the Security and Audit OMS solution enabled). To check that it has been removed from your machine, verify that the file /etc/opt/microsoft/omsagent/conf/omsagent.d/security_baseline.conf does not exist.

Our team is currently working on improving our logging strategy to avoid generating large log files.

If the security_baseline.conf file does exist in the above location, follow these steps:

1. Stop the OMS Agent daemon: sudo /opt/microsoft/omsagent/bin/service_control stop
2. Stop the OMI daemon: sudo /opt/omi/bin/service_control stop
3. Delete the large log file(s) at: /var/opt/microsoft/omsagent/log/
4. Uninstall the OMS Agent using the purge option. This will uninstall the OMS Agent package and remove all related files (including the logs).
   For example: sudo sh ./omsagent-1.2.0-148.universal.x64.sh --purge (you can use the agent version installed)
5. Install the OMS Agent according to the documented instructions
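
Before going through the steps above, the state of a machine can be checked with a short script. This is a minimal sketch using the two paths given in this comment. The function names and the 1 GiB threshold are arbitrary illustrative choices, and reading these directories may require elevated privileges.

```python
from pathlib import Path

CONF = Path("/etc/opt/microsoft/omsagent/conf/omsagent.d/security_baseline.conf")
LOG_DIR = Path("/var/opt/microsoft/omsagent/log")

def large_files(directory, min_bytes):
    """Return (path, size) for regular files of at least min_bytes, largest first."""
    found = []
    if directory.is_dir():
        for entry in directory.iterdir():
            if entry.is_file():
                size = entry.stat().st_size
                if size >= min_bytes:
                    found.append((entry, size))
    return sorted(found, key=lambda item: item[1], reverse=True)

def report(conf=CONF, log_dir=LOG_DIR, min_bytes=1024**3):
    """Build report lines: conf file presence, then any large files in log_dir."""
    state = "still present" if conf.exists() else "not present"
    lines = [f"{conf}: {state}"]
    for path, size in large_files(log_dir, min_bytes):
        lines.append(f"{path}: {size / 1024**3:.1f} GiB")
    return lines

print("\n".join(report()))
```

If the first line reports the file as still present, the purge and reinstall steps apply.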

webash · 2017-02-09T11:59:50Z

Thanks @lagalbra. I notice the installation instructions refer to version 1.3.0, yet that version isn't in the releases. Will that be released soon?

webash · 2017-02-09T12:05:58Z

I can confirm that my machine did still have the security_baseline.conf file, so had to follow the purge steps and reinstall.

lagalbra · 2017-02-09T17:06:49Z

@webash Our goal is to release OMSAgent version 1.3.0 this month.

williamayerst · 2017-02-10T10:32:44Z

@lagalbra can you please advise how we deal with the OMS Agent Extension object? It will still be extant on the machines even though the OMS agent will be uninstalled in Step 4 of your instructions, and ideally we'd like to use this method to reinstall rather than transferring files to EVERY affected Linux machine in our enterprise.

Thanks,

webash · 2017-02-11T14:51:04Z

Unfortunately this issue hasn't been resolved by reinstalling the agent. security_baseline.conf has returned to the machine and the log has filled the drive again. @lagalbra

lagalbra · 2017-02-13T17:46:46Z

@webash Could you copy the contents of security_baseline.conf into this thread?
If you want to temporarily resolve the issue before the fixed version is fully deployed, the file will be removed from your machine if you disable the Security and Audit solution in your OMS workspace.

ngroon · 2017-02-13T18:30:21Z

We too had the security_baseline.conf file. Just wanted to confirm that the purge uninstall/reinstall fixed the issue for us.

Thanks, @lagalbra

webash · 2017-02-21T23:31:01Z

Apologies for the delay, @lagalbra. Here are the contents of security_baseline.conf:

# Security Baseline plugins

<source>
    type exec
    tag oms.security_baseline
    command sleep 60 && /opt/microsoft/omsagent/plugin/omsbaseline -d  /etc/opt/microsoft/omsagent/conf/omsagent.d/
    format json
</source>

<source>
    type exec
    tag oms.security_baseline
    command /opt/microsoft/omsagent/plugin/omsbaseline -d /etc/opt/microsoft/omsagent/conf/omsagent.d/
    format json
    run_interval 24h
</source>

<filter oms.security_baseline>
    type filter_security_baseline
    log_level info
</filter>

#<match oms.security_baseline.** oms.security_baseline_summary.**>
#    type stdout
#</match>
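
Both `command` lines above call the omsbaseline path named in the ENOENT error reported earlier in the thread. The following is a minimal sketch, not part of the agent, that extracts `command` values from a config of this shape and reports executables that cannot be found. The function names are hypothetical, and checking only the first word of each `&&`-separated part is a simplification.

```python
import re
import shutil

COMMAND_LINE = re.compile(r"^\s*command\s+(.+?)\s*$")

def exec_commands(conf_text):
    """Return the value of every uncommented `command` line in the config text."""
    commands = []
    for line in conf_text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        match = COMMAND_LINE.match(line)
        if match:
            commands.append(match.group(1))
    return commands

def missing_executables(command):
    """Return programs in a command string that cannot be found or executed."""
    missing = []
    for part in command.split("&&"):
        words = part.split()
        # shutil.which checks absolute paths directly and bare names via PATH.
        if words and shutil.which(words[0]) is None:
            missing.append(words[0])
    return missing

SAMPLE_CONF = """
<source>
    type exec
    tag oms.security_baseline
    command /opt/microsoft/omsagent/plugin/omsbaseline -d /etc/opt/microsoft/omsagent/conf/omsagent.d/
    format json
    run_interval 24h
</source>
"""

for command in exec_commands(SAMPLE_CONF):
    print(command, "->", missing_executables(command))
```

On a machine where the omsbaseline binary was never installed, the path would show up in the list of missing executables.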

lagalbra · 2017-02-22T00:18:27Z

@webash Your security_baseline.conf file is the buggy one that should have been removed from all customers' machines. I would recommend purging the agent and disabling the Security and Audit solution on the connected workspace. Delete any remaining large log files and make sure no omsconfig/ or omsagent/ folders are left on the machine. Then install the agent again.

We should have a new configuration file for this issue deployed within the week.

webash · 2017-02-25T13:43:02Z

It's a personal lab environment where I'm having the most profound effects from this, so I've simply stopped the OMS service on the affected machines for now. Will you be able to update us here when the new conf file has been deployed?

lagalbra · 2017-02-27T17:43:50Z

The new conf file has been deployed, but is not publicly available for the purposes of a safer preview period. If you re-onboard your machines to the OMS service now, you will not receive any version of the security_baseline.conf file.

webash mentioned this issue Feb 8, 2017

OMI Consumes all available space #315

Closed

lagalbra closed this as completed Feb 8, 2017

brandubh mentioned this issue Feb 15, 2017

Custom Log ingestion issue with noperm files #375

Closed
