
Filter out false positive errors during cluster maintenance #36

Conversation

Contributor

@atanasdinov atanasdinov commented Dec 12, 2022

Description

What

Introduces AWS SDK connection in order to determine cluster maintenance windows.

Why

JIRA Ticket

Scope and particulars of this PR (Please tick all that apply)

  • Tech hygiene (dependency updating & other tech debt)
  • Bug fix
  • Feature
  • Documentation
  • Breaking change
  • Minor change (e.g. fixing a typo, adding config)

This Pull Request follows the rules described in our Pull Requests Guide

@atanasdinov atanasdinov force-pushed the feature/UPPSF-3587-avoid-monitoring-errors-on-maintenance branch from 98dba24 to b2bd7af on December 21, 2022 at 16:12
@coveralls

coveralls commented Dec 21, 2022

Coverage Status

Coverage: 86.782% (-2.7%) from 89.44% when pulling 7cf08ec on feature/UPPSF-3587-avoid-monitoring-errors-on-maintenance into 8270e47 on v4.

@atanasdinov atanasdinov force-pushed the feature/UPPSF-3587-avoid-monitoring-errors-on-maintenance branch from a57c53d to dc2ed5f on December 21, 2022 at 18:31
@atanasdinov atanasdinov force-pushed the feature/UPPSF-3587-avoid-monitoring-errors-on-maintenance branch from f5e0955 to 63a3476 on December 23, 2022 at 16:07

// Verifies whether the Kafka cluster is available.
// False-positive healthcheck errors are ignored during maintenance windows.
func checkClusterAvailability(healthErr error, describer clusterDescriber, arn *string) error {
Contributor Author

Definitely not satisfied with this, but copy-pasting the body of the function across all healthchecks isn't great either.

Modifying the error seems to be the correct way to avoid the repetition, but I'm open to suggestions if I've missed something obvious.

Contributor

nitpick (non-blocking): Just semantics - from a function named checkClusterAvailability I wouldn't expect to pass in an error as an argument. I would expect it to check whether the cluster is available and to let the calling code decide what to do in case of a maintenance error. The downside is what you've described. I would rename the function to something like verifyHealthErrorSeverity.

@atanasdinov atanasdinov changed the title Filter out false positive monitor errors Filter out false positive errors during cluster maintenance Dec 23, 2022
@atanasdinov atanasdinov marked this pull request as ready for review December 23, 2022 16:19
@atanasdinov atanasdinov requested review from a team as code owners December 23, 2022 16:19
@epavlova epavlova self-requested a review January 3, 2023 13:18

// Verifies whether the Kafka cluster is available.
// False-positive healthcheck errors are ignored during maintenance windows.
func checkClusterAvailability(healthErr error, describer clusterDescriber, arn *string) error {

ctx, cancel := context.WithTimeout(context.Background(), clusterDescriptionTimeout)
defer cancel()

cluster, err := describer.DescribeClusterV2(ctx, &kafka.DescribeClusterV2Input{
Contributor

nitpick (non-blocking): A very theoretical scenario would be for newClusterDescriber() to return an error when it is created in NewConsumer or NewProducer. When using checkClusterAvailability you only check whether config.ClusterArn is set. So, theoretically, describer could be nil here.

Contributor Author

The NewConsumer and NewProducer functions will only attempt to create a describer if ClusterArn is provided. If that fails, they return an error, which terminates the whole application. This way the describer can never be nil.

I then opted for checking ClusterArn simply because it's simpler than nil-checking an interface. Do you think it's better to check the describer instead?
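The "nil-checking an interface" subtlety referenced here is a standard Go gotcha: an interface variable holding a typed nil pointer compares non-nil, so a naive `describer == nil` check can pass even when the underlying pointer is nil. A self-contained illustration (the `describer`/`awsDescriber` names are hypothetical):

```go
package main

import "fmt"

// describer stands in for the cluster-describer interface.
type describer interface{ Describe() string }

// awsDescriber stands in for a concrete SDK-backed implementation.
type awsDescriber struct{}

func (a *awsDescriber) Describe() string { return "cluster" }

func main() {
	var p *awsDescriber // typed nil pointer
	var d describer = p // interface now holds (type=*awsDescriber, value=nil)

	fmt.Println(p == nil) // true
	fmt.Println(d == nil) // false: the interface carries a non-nil type
}
```

Checking a plain config value such as ClusterArn sidesteps this entirely, which supports the choice described above.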

Contributor

Okay, I understand the idea now, my bad. I think it's fine the way it's implemented, given that the whole logic is private.
If retrieveClusterState were public and we wanted to be super foolproof, we could check that the describer is not nil before the describer.DescribeClusterV2 call.

@atanasdinov atanasdinov changed the base branch from v4 to feature/cluster-maintenance-monitoring January 13, 2023 08:19
@atanasdinov atanasdinov merged commit bee19d4 into feature/cluster-maintenance-monitoring Jan 13, 2023
@atanasdinov atanasdinov deleted the feature/UPPSF-3587-avoid-monitoring-errors-on-maintenance branch January 13, 2023 08:20