This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

feat: updates for container monitoring addon omsagent agent September 2020 release #3942

Merged 9 commits on Oct 28, 2020

Conversation

ganga1980
Contributor

@ganga1980 ganga1980 commented Oct 15, 2020

Reason for Change:

This release includes the updates specified in:
https://github.com/microsoft/Docker-Provider/blob/ci_prod/ReleaseNotes.md#10052020--
https://github.com/microsoft/Docker-Provider/blob/ci_prod/ReleaseNotes.md#09162020--
https://github.com/microsoft/Docker-Provider/blob/ci_prod/ReleaseNotes.md#08072020--
https://github.com/microsoft/Docker-Provider/blob/ci_prod/ReleaseNotes.md#07152020--

Both the Windows and Linux agents support containerd with this version.

Issue Fixed:

Credit Where Due:

Does this change contain code from or inspired by another project?

  • No
  • Yes

If "Yes," did you notify that project's maintainers and provide attribution?

  • No
  • Yes

Requirements:

Notes:

@ganga1980 ganga1980 changed the title updates for omsagent agent September 2020 release feat: updates for omsagent agent September 2020 release Oct 15, 2020
@ganga1980
Contributor Author

/assign @jackfrancis

@ganga1980
Contributor Author

/assign @rashmichandrashekar

@ganga1980
Contributor Author

/assign @vishiy

@jackfrancis
Member

Thanks @ganga1980! Will this address #2764?

@ganga1980 ganga1980 changed the title feat: updates for omsagent agent September 2020 release feat: updates for container monitoring addon omsagent agent September 2020 release Oct 15, 2020
@ganga1980
Contributor Author

Thanks @ganga1980! Will this address #2764?

Yes, @jackfrancis, this agent version has containerd support for both Windows and Linux.

@jackfrancis
Member

/azp run pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jackfrancis
Member

Is this update supposed to work for 1.16? I got this error on the omsagent pod on a 1.16 test cluster:

INFO:  Configuring OMS agent service 9257fb9c-24d2-433a-a57a-fed7ff3eb584 ...
 invoke-rc.d: could not determine current runlevel
  * Starting Operations Management Suite agent (9257fb9c-24d2-433a-a57a-fed7ff3eb584): 
    ...done.
 -e error	MetaConfig generation script not available at /opt/microsoft/omsconfig/Scripts/OMS_MetaConfigHelper.py
  * Starting periodic command scheduler cron
    ...done.
 Primary Workspace: 9257fb9c-24d2-433a-a57a-fed7ff3eb584    Status: Onboarded(OMSAgent Running)
 omsagent 1.10.0.1
 docker-cimprov 10.1.0.0
 DOCKER_CIMPROV_VERSION=10.1.0.0
 since container run time is containerd update the container log fluentbit Parser to cri from docker
 nodename: k8s-master-24293709-0
 replacing nodename in telegraf config
 File Doesnt Exist. Creating file...
 Fluent Bit v1.4.2
 * Copyright (C) 2019-2020 The Fluent Bit Authors
 * Copyright (C) 2015-2018 Treasure Data
 * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
 * https://fluentbit.io
 
 2020-10-15T19:04:48Z I! Starting Telegraf 
 2020-10-15T19:04:48Z E! [agent] Failed to connect to output socket_writer, retrying in 15s, error was 'dial tcp 0.0.0.0:25226: getsockopt: connection refused' 
 Telegraf unknown (git: fork 50cd124)
 td-agent-bit 1.4.2
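
For context on the "update the container log fluentbit Parser to cri from docker" line in the log above: containerd writes container logs in the CRI text format (`<timestamp> <stream> <P|F> <message>`), while Docker's json-file driver writes one JSON object per line, so a log collector needs a different parser for each. The following is only a minimal Go sketch of that difference, not the agent's or Fluent Bit's actual parser; the `logEntry` type and the `parseCRI`, `parseDocker`, and `parseLine` names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry is a hypothetical normalized record; field names are illustrative.
type logEntry struct {
	Time    string
	Stream  string
	Message string
}

// parseCRI parses one line in the CRI logging format used under containerd:
// "<timestamp> <stream> <tag> <message>", where tag is P (partial) or F (full).
func parseCRI(line string) (logEntry, error) {
	parts := strings.SplitN(line, " ", 4)
	if len(parts) < 4 {
		return logEntry{}, fmt.Errorf("not a CRI log line: %q", line)
	}
	return logEntry{Time: parts[0], Stream: parts[1], Message: parts[3]}, nil
}

// parseDocker parses one line written by Docker's json-file logging driver.
func parseDocker(line string) (logEntry, error) {
	var raw struct {
		Log    string `json:"log"`
		Stream string `json:"stream"`
		Time   string `json:"time"`
	}
	if err := json.Unmarshal([]byte(line), &raw); err != nil {
		return logEntry{}, err
	}
	msg := strings.TrimSuffix(raw.Log, "\n")
	return logEntry{Time: raw.Time, Stream: raw.Stream, Message: msg}, nil
}

// parseLine picks a parser from the container runtime name (assumed input).
func parseLine(containerRuntime, line string) (logEntry, error) {
	if containerRuntime == "containerd" {
		return parseCRI(line)
	}
	return parseDocker(line)
}

func main() {
	e, err := parseLine("containerd", "2020-10-15T19:04:48.000000000Z stdout F hello")
	fmt.Println(e.Stream, e.Message, err)
}
```

The sketch ignores reassembly of partial (`P`) lines, which a real CRI parser would have to handle.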

@jackfrancis
Member

Same symptom on the 1.17 test as well

@ganga1980
Contributor Author

ganga1980 commented Oct 15, 2020

Same symptom on the 1.17 test as well

Yes, we have seen this initial startup error, 'Failed to connect to output socket_writer, ...'. It occurs because Fluent Bit takes some time to start, and it shouldn't affect container monitoring. Container monitoring should work on all supported Kubernetes versions; we have tested on all of them.

@jackfrancis
Member

/hold

until we can update our tests to tolerate the additional startup time due to fluentd

@jackfrancis
Member

@ganga1980 anecdotally it seems that it's taking 20 minutes or longer for the omsagent pods to come online, which is why our tests are failing

@ganga1980
Contributor Author

@ganga1980 anecdotally it seems that it's taking 20 minutes or longer for the omsagent pods to come online, which is why our tests are failing

Are you referring to the 20-minute startup time for the Windows omsagent? If so, this is expected: we have seen high startup times because of the large image size that comes from the servercore base image.

@ganga1980
Contributor Author

/hold

until we can update our tests to tolerate the additional startup time due to fluentd

Can you please point me to the tests you are referring to? AFAIK, this should not have any impact on the tests.

@jackfrancis
Member

This test is failing 100% of the time:

https://github.com/Azure/aks-engine/blob/master/test/e2e/kubernetes/kubernetes_test.go#L1027

(It is not failing at all on the current omsagent version)

@ganga1980
Contributor Author

This test is failing 100% of the time:

https://github.com/Azure/aks-engine/blob/master/test/e2e/kubernetes/kubernetes_test.go#L1027

(It is not failing at all on the current omsagent version)

Hi @jackfrancis, the only change in this agent that could have an impact is the change in the Windows base image:
from: mcr.microsoft.com/windows/servercore:1809
to: mcr.microsoft.com/windows/servercore:ltsc2019

It looks like we either have to increase the test timeout or see whether we can cache or pre-pull this base image before the test starts.

@ganga1980
Contributor Author

This test is failing 100% of the time:
https://github.com/Azure/aks-engine/blob/master/test/e2e/kubernetes/kubernetes_test.go#L1027
(It is not failing at all on the current omsagent version)

Hi @jackfrancis, the only change in this agent that could have an impact is the change in the Windows base image:
from: mcr.microsoft.com/windows/servercore:1809
to: mcr.microsoft.com/windows/servercore:ltsc2019

It looks like we either have to increase the test timeout or see whether we can cache or pre-pull this base image before the test starts.

@jackfrancis, is it possible to pre-bake or pre-pull the base image mcr.microsoft.com/windows/servercore:ltsc2019 so that our Windows agent change can get unblocked? Would you be able to help with this? Please let me know if you need any help from us.
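
To make the "increase the test timeout" option concrete, here is a minimal Go sketch of a poll-until-ready helper with an explicit timeout. It is not the code in kubernetes_test.go; `waitFor`, `errTimeout`, and the `podsReady` stand-in are hypothetical names, and a real test would query the cluster for omsagent pod status instead of counting calls.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the condition is not met before the deadline.
var errTimeout = errors.New("timed out waiting for condition")

// waitFor calls ready() every interval until it reports true, returns an
// error, or the next poll would fall after the timeout.
func waitFor(ready func() (bool, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := ready()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	// Hypothetical stand-in for "all omsagent pods are Running":
	// it reports ready on the third check.
	checks := 0
	podsReady := func() (bool, error) {
		checks++
		return checks >= 3, nil
	}
	// A slow Windows image pull would call for a much larger timeout here.
	err := waitFor(podsReady, 10*time.Millisecond, time.Second)
	fmt.Println("err:", err, "checks:", checks)
}
```

Raising the timeout only hides the pull time; pre-pulling the base image onto the Windows nodes would remove it from the test's critical path.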

Member

@jackfrancis jackfrancis left a comment


/lgtm

@acs-bot acs-bot added the lgtm label Oct 28, 2020
@jackfrancis jackfrancis merged commit 4302a60 into Azure:master Oct 28, 2020
@acs-bot

acs-bot commented Oct 28, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ganga1980, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


5 participants