Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud Managed Service Disruption #583

Conversation

clairecng
Copy link
Contributor

@clairecng clairecng commented Aug 29, 2022

What does this PR do?

  • Adds new functionality
  • Alters existing functionality
  • Fixes a bug
  • Improves documentation or testing

Please briefly describe your changes as well as the motivation behind them:

Context

Large cloud services providers are using wide IP ranges. Hostnames used to identify those services are resolving with some IPs of that range, and resolved IPs can change between each DNS request. Applying a network disruption using those hostnames only doesn’t work well since retrying the resolution of such hostname would return new IPs (not disrupted) and the disruption would be ineffective.
We should add a new option to the network disruption to target specific cloud services providers. Each provider hosts a JSON file with the list of all IP ranges. The format differs between each provider but all of them contain information about the IP ranges and the associated service(s).

In this PR, we're creating:

  • the logic behind the pull and parse of those json files
  • the logic behind transforming the new disruption into a hosts disruption
  • a first cloud provider: AWS.

How

  1. On startup, we get the ip range file for each cloud provider. We parse it and keep in memory those ip ranges per service cloud service.
  2. On disruption creation, we verify the user is creating a valid cloud managed service disruption and we transform the cloud disruption to create a hosts disruption, where we put all ip ranges as hosts.

Spec

Similar to a network disruption, a new field has been added: cloud field.

cloud:
  aws: # cloud provider name. For now, we can only support AWS
    - service: "S3" ## the list of services we want to affect
    - service: "EC2"
    - ...

Full disruption example:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: network-cloud
  namespace: chaos-demo
spec:
  level: pod
  selector:
    app: demo-cirl
  count: 1
  network:
    cloud:
      aws:
        - service: "S3"
          flow: "egress" # optional
          protocol: "tcp" # optional
    delay: 1000 # delay (in milliseconds) to add to outgoing packets, 10% of jitter will be added by default
    delayJitter: 5 # (optional) add X % (1-100) of delay as jitter to delay (+- X% ms to original delay), defaults to 10%

Code Quality Checklist

  • The documentation is up to date.
  • My code is sufficiently commented and passes continuous integration checks.
  • I have signed my commit (see Contributing Docs).

Testing

  • I leveraged continuous integration testing
    • by depending on existing unit tests or end-to-end tests.
    • by adding new unit tests or end-to-end tests.
  • I manually tested the following steps:
    • aws cloud managed disruption with 1 service / with too much services
    • locally.
    • as a canary deployment to a cluster.

@clairecng clairecng changed the title Claire/cloud managed service disruption Cloud Managed Service Disruption Aug 29, 2022
}
}
}

Copy link
Contributor Author

@clairecng clairecng Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this in disruption_webhook.go instead of the specific Validate() method of the NetworkDisruptionSpec here due to a problem with the package deepcopy-gen. It's preventing us from passing the cloudProviderManager to those methods because we can't generate the DeepCopyInto method on interfaces (the interface at fault here).
In my opinion, we should put this in Validate() but I'm not sure the time wasted trying to find a workaround is worth it.


return hosts
}

Copy link
Contributor Author

@clairecng clairecng Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same problem as the comment above:

Added this in helpers.go instead of the specific GenerateArgs() method of the NetworkDisruptionSpec here due to a problem with the package deepcopy-gen. It's preventing us from passing the cloudProviderManager to those methods because we can't generate the DeepCopyInto method on interfaces (the interface at fault here).
In my opinion, we should put this in GenerateArgs() but I'm not sure the time wasted trying to find a workaround is worth it.

@clairecng clairecng requested a review from Azoam August 29, 2022 12:56
@ptnapoleon ptnapoleon self-requested a review August 29, 2022 18:32
examples/network_cloud.yaml Show resolved Hide resolved
api/v1beta1/network_disruption.go Outdated Show resolved Hide resolved
cloudservice/cloudservice.go Outdated Show resolved Hide resolved
if len(serviceList) == 0 {
ips, err = cloudManager.GetServiceIPRanges(cloudName, "")
if err != nil {
log.Errorf("could not retrieve the ip range of all services of %s", cloudName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to log the error message here and below. The only errors I can possibly see coming from this function are "cloud provider doesn't exist" and "service name doesn't exist", both of which should already have been caught via validation before we get to this point. So these might be serious errors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chose to log it in the createChaosPods func instead, wdyt?

main.go Outdated
@@ -245,6 +249,9 @@ func main() {
"Threshold which safemode checks against to see if the number of targets is over safety measures within a cluster")
handleFatalError(viper.BindPFlag("controller.safemode.clusterThreshold", pflag.Lookup("safemode-cluster-threshold")))

pflag.StringVar(&cfg.Controller.CloudProviders.Aws.IPRangesURL, "cloud-providers-aws-ip-ranges-url", "", "URL used to pull ip ranges from AWS (defaulted to \"\")")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We default to an empty string, but won't we error out in our Cloud Service if that happens? How do we want to handle a situation where we don't get a config (or if the config we do have doesn't work)? Fail all disruptions that try to block to aws?

Copy link
Contributor Author

@clairecng clairecng Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this part to not have any configuration as the URL is the same for everyone.
For now, we don't start the controller if an error is thrown at the parsing of the ip ranges.

@clairecng clairecng marked this pull request as ready for review September 8, 2022 07:53
@clairecng clairecng requested a review from a team as a code owner September 8, 2022 07:53
api/v1beta1/network_disruption.go Outdated Show resolved Hide resolved
api/v1beta1/network_disruption.go Show resolved Hide resolved
chart/values.yaml Outdated Show resolved Hide resolved
cloudservice/aws/aws_test.go Outdated Show resolved Hide resolved
cloudservice/aws/aws_test.go Outdated Show resolved Hide resolved
@@ -30,7 +30,7 @@ import (
type linkOperation func([]string, string, uint32) error

// tcPriority the lowest priority set by tc automatically when adding a tc filter
var tcPriority = uint32(49149)
var tcPriority = uint32(1000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know where the old value comes from? Just to be sure it has no side effect to start from 1000 now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the old value comes from the automatically set priority. I took it because of this, but it's totally safe to use this instead

injector/network_disruption.go Outdated Show resolved Hide resolved
cloudservice/manager.go Show resolved Hide resolved
controllers/helpers.go Outdated Show resolved Hide resolved
chart/templates/configmap.yaml Outdated Show resolved Hide resolved
chart/values.yaml Outdated Show resolved Hide resolved
@clairecng clairecng merged commit 1ef1f7c into feature/cloud-managed-service-disruption Sep 12, 2022
@clairecng clairecng deleted the claire/cloud-managed-service-disruption branch September 12, 2022 13:41
clairecng added a commit that referenced this pull request Oct 7, 2022
* Creating boilerplate for pull and parse of ip range file from cloud providers

* transmitting the ip ranges of cloud aws to network with hosts disruption

* Read chaos-controller config for cloud providers

* Update license headers

* Lint

* Tidy

* Resolve cyclic import

* Unit tests + start of doc

* Removing conf about urls

* lint...

* complete doc

* Remove useless parameter, add tc filter count

* Tests + lint

* fix tests

* Tests again

* better logging + better naming

* limiting the nb of networking resources to be applied at the same time in a disruption

* removing service AMAZON from aws services: it's useless

* document limitations

* lint...

* feedback correction

* fix tests

* Add flow and protocol for cloud service

* Fix tests

* Fix tests
clairecng added a commit that referenced this pull request Oct 13, 2022
* Creating boilerplate for pull and parse of ip range file from cloud providers

* transmitting the ip ranges of cloud aws to network with hosts disruption

* Read chaos-controller config for cloud providers

* Update license headers

* Lint

* Tidy

* Resolve cyclic import

* Unit tests + start of doc

* Removing conf about urls

* lint...

* complete doc

* Remove useless parameter, add tc filter count

* Tests + lint

* fix tests

* Tests again

* better logging + better naming

* limiting the nb of networking resources to be applied at the same time in a disruption

* removing service AMAZON from aws services: it's useless

* document limitations

* lint...

* feedback correction

* fix tests

* Add flow and protocol for cloud service

* Fix tests

* Fix tests
clairecng added a commit that referenced this pull request Oct 19, 2022
* Cloud Managed Service Disruption (#583)

* Creating boilerplate for pull and parse of ip range file from cloud providers

* transmitting the ip ranges of cloud aws to network with hosts disruption

* Read chaos-controller config for cloud providers

* Update license headers

* Lint

* Tidy

* Resolve cyclic import

* Unit tests + start of doc

* Removing conf about urls

* lint...

* complete doc

* Remove useless parameter, add tc filter count

* Tests + lint

* fix tests

* Tests again

* better logging + better naming

* limiting the nb of networking resources to be applied at the same time in a disruption

* removing service AMAZON from aws services: it's useless

* document limitations

* lint...

* feedback correction

* fix tests

* Add flow and protocol for cloud service

* Fix tests

* Fix tests

* Fix tests

* Fix tests

* Fix tests

* fix some bugs

* fix some bugs

* Fix bugs

* Better spec verification on cloud service transformation to hosts

* Update doc file
clairecng added a commit that referenced this pull request Oct 25, 2022
* Cloud Managed Service Disruption (#583)

* Creating boilerplate for pull and parse of ip range file from cloud providers

* transmitting the ip ranges of cloud aws to network with hosts disruption

* Read chaos-controller config for cloud providers

* Update license headers

* Lint

* Tidy

* Resolve cyclic import

* Unit tests + start of doc

* Removing conf about urls

* lint...

* complete doc

* Remove useless parameter, add tc filter count

* Tests + lint

* fix tests

* Tests again

* better logging + better naming

* limiting the nb of networking resources to be applied at the same time in a disruption

* removing service AMAZON from aws services: it's useless

* document limitations

* lint...

* feedback correction

* fix tests

* Add flow and protocol for cloud service

* Fix tests

* Fix tests

* GCP managed service disruption

* fix tests

* Uses multiple urls for one cloud service + tests

* Switch back to only one url

* Update doc file

* Use Google Cloud as service name instead of Google

* Add missing license headers

* Lint

* Removing non valid changes

* Removing non valid changes

* Better code

* Lint

* Updating tests and clarifying the gcp convert func

* Changing Google Cloud to Google for gcp service name

* Remove merging mistake

* Remove useless slice init

* Adding one unit test case
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants