
Filter out false positive errors during cluster maintenance #36

Conversation

Contributor

@atanasdinov atanasdinov commented Dec 12, 2022

Description

What

Introduces AWS SDK connection in order to determine cluster maintenance windows.

Why

JIRA Ticket

Scope and particulars of this PR (Please tick all that apply)

  • Tech hygiene (dependency updating & other tech debt)
  • Bug fix
  • Feature
  • Documentation
  • Breaking change
  • Minor change (e.g. fixing a typo, adding config)

This Pull Request follows the rules described in our Pull Requests Guide

@atanasdinov atanasdinov force-pushed the feature/UPPSF-3587-avoid-monitoring-errors-on-maintenance branch from 98dba24 to b2bd7af on December 21, 2022 at 16:12
@coveralls

coveralls commented Dec 21, 2022

Coverage Status

Coverage: 86.782% (-2.7%) from 89.44% when pulling 7cf08ec on feature/UPPSF-3587-avoid-monitoring-errors-on-maintenance into 8270e47 on v4.

@atanasdinov atanasdinov force-pushed the feature/UPPSF-3587-avoid-monitoring-errors-on-maintenance branch from a57c53d to dc2ed5f on December 21, 2022 at 18:31
@atanasdinov atanasdinov force-pushed the feature/UPPSF-3587-avoid-monitoring-errors-on-maintenance branch from f5e0955 to 63a3476 on December 23, 2022 at 16:07

// Verifies whether the Kafka cluster is available.
// False-positive healthcheck errors are ignored during maintenance windows.
func checkClusterAvailability(healthErr error, describer clusterDescriber, arn *string) error {
Contributor Author

Definitely not satisfied with this, but copy-pasting the body of the function across all healthchecks isn't great either.

Modifying the error seems to be the correct way to avoid the repetition, but I'm open to suggestions if I've missed something obvious.

Contributor

nitpick (non-blocking): Just semantics - from a function named checkClusterAvailability I wouldn't expect to pass in an error as an argument. I would expect it to check whether the cluster is available and to let the calling code decide what to do in case of a maintenance error. The downside is what you've described. I would rename the function to something like verifyHealthErrorSeverity.

@atanasdinov atanasdinov changed the title Filter out false positive monitor errors Filter out false positive errors during cluster maintenance Dec 23, 2022
@atanasdinov atanasdinov marked this pull request as ready for review December 23, 2022 16:19
@atanasdinov atanasdinov requested review from a team as code owners December 23, 2022 16:19
@epavlova epavlova self-requested a review January 3, 2023 13:18

// Verifies whether the Kafka cluster is available.
// False-positive healthcheck errors are ignored during maintenance windows.
func checkClusterAvailability(healthErr error, describer clusterDescriber, arn *string) error {

ctx, cancel := context.WithTimeout(context.Background(), clusterDescriptionTimeout)
defer cancel()

cluster, err := describer.DescribeClusterV2(ctx, &kafka.DescribeClusterV2Input{
Contributor

nitpick (non-blocking): A very theoretical scenario would be for newClusterDescriber() to return an error when it is created in NewConsumer or NewProducer. When using checkClusterAvailability you only check whether config.ClusterArn is set. So, theoretically, describer could be nil here.

Contributor Author

The NewConsumer and NewProducer functions will only attempt to create a describer if ClusterArn is provided. If that fails, they return an error, which terminates the whole application. This way the describer can never be nil.

I then opted for checking ClusterArn simply because it's simpler than nil-checking an interface. Do you think it's better to check the describer instead?
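The "nil-checking an interface" subtlety referenced here is a standard Go gotcha: an interface variable holding a typed nil pointer compares non-nil, so a naive `describer == nil` check can pass even when the underlying pointer is nil. A self-contained illustration (the `describer`/`awsDescriber` names are hypothetical):

```go
package main

import "fmt"

// describer stands in for the cluster-describer interface.
type describer interface{ Describe() string }

// awsDescriber stands in for a concrete SDK-backed implementation.
type awsDescriber struct{}

func (a *awsDescriber) Describe() string { return "cluster" }

func main() {
	var p *awsDescriber // typed nil pointer
	var d describer = p // interface now holds (type=*awsDescriber, value=nil)

	fmt.Println(p == nil) // true
	fmt.Println(d == nil) // false: the interface carries a non-nil type
}
```

Checking a plain config value such as ClusterArn sidesteps this entirely, which supports the choice described above.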

Contributor

Okay, I understand the idea now, my bad. I think it's fine the way it's implemented, given that the whole logic is private.
If retrieveClusterState were public and we wanted to be super foolproof, we could check that the describer is not nil before the describer.DescribeClusterV2 call.

@atanasdinov atanasdinov changed the base branch from v4 to feature/cluster-maintenance-monitoring January 13, 2023 08:19
@atanasdinov atanasdinov merged commit bee19d4 into feature/cluster-maintenance-monitoring Jan 13, 2023
@atanasdinov atanasdinov deleted the feature/UPPSF-3587-avoid-monitoring-errors-on-maintenance branch January 13, 2023 08:20