-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cloud Managed Service Disruption #583
Cloud Managed Service Disruption #583
Conversation
…loud-managed-service-disruption
} | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this in disruption_webhook.go instead of the specific Validate() method of the NetworkDisruptionSpec here due to a problem with the package deepcopy-gen. It's preventing us from passing the cloudProviderManager to those methods because we can't generate the DeepCopyInto method on interfaces (the interface at fault here).
In my opinion, we should put this in Validate() but I'm not sure the time wasted trying to find a workaround is worth it.
|
||
return hosts | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same problem as the comment above:
Added this in helpers.go instead of the specific GenerateArgs() method of the NetworkDisruptionSpec here due to a problem with the package deepcopy-gen. It's preventing us from passing the cloudProviderManager to those methods because we can't generate the DeepCopyInto method on interfaces (the interface at fault here).
In my opinion, we should put this in GenerateArgs() but I'm not sure the time wasted trying to find a workaround is worth it.
controllers/helpers.go
Outdated
if len(serviceList) == 0 { | ||
ips, err = cloudManager.GetServiceIPRanges(cloudName, "") | ||
if err != nil { | ||
log.Errorf("could not retrieve the ip range of all services of %s", cloudName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to log the error message here and below. The only errors I can possibly see coming from this function are "cloud provider doesn't exist" and "service name doesn't exist", both of which should already have been caught via validation before we get to this point. So these might be serious errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chose to log it in the createChaosPods func instead, wdyt?
main.go
Outdated
@@ -245,6 +249,9 @@ func main() { | |||
"Threshold which safemode checks against to see if the number of targets is over safety measures within a cluster") | |||
handleFatalError(viper.BindPFlag("controller.safemode.clusterThreshold", pflag.Lookup("safemode-cluster-threshold"))) | |||
|
|||
pflag.StringVar(&cfg.Controller.CloudProviders.Aws.IPRangesURL, "cloud-providers-aws-ip-ranges-url", "", "URL used to pull ip ranges from AWS (defaulted to \"\")") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We default to an empty string, but won't we error out in our Cloud Service if that happens? How do we want to handle a situation where we don't get a config (or if the config we do have doesn't work)? Fail all disruptions that try to block to aws?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this part to not have any configuration as the URL is the same for everyone.
For now, we don't start the controller if an error is thrown at the parsing of the ip ranges.
…DataDog/chaos-controller into claire/cloud-managed-service-disruption
…e in a disruption
injector/network_disruption.go
Outdated
@@ -30,7 +30,7 @@ import ( | |||
type linkOperation func([]string, string, uint32) error | |||
|
|||
// tcPriority the lowest priority set by tc automatically when adding a tc filter | |||
var tcPriority = uint32(49149) | |||
var tcPriority = uint32(1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know where the old value comes from? Just to be sure it has no side effect to start from 1000 now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the old value comes from the automatically set priority. I took it because of this, but it's totally safe to use this instead
* Creating boilerplate for pull and parse of ip range file from cloud providers * transmitting the ip ranges of cloud aws to network with hosts disruption * Read chaos-controller config for cloud providers * Update license headers * Lint * Tidy * Resolve cyclic import * Unit tests + start of doc * Removing conf about urls * lint... * complete doc * Remove useless parameter, add tc filter count * Tests + lint * fix tests * Tests again * better logging + better naming * limiting the nb of networking resources to be applied at the same time in a disruption * removing service AMAZON from aws services: it's useless * document limitations * lint... * feedback correction * fix tests * Add flow and protocol for cloud service * Fix tests * Fix tests
* Creating boilerplate for pull and parse of ip range file from cloud providers * transmitting the ip ranges of cloud aws to network with hosts disruption * Read chaos-controller config for cloud providers * Update license headers * Lint * Tidy * Resolve cyclic import * Unit tests + start of doc * Removing conf about urls * lint... * complete doc * Remove useless parameter, add tc filter count * Tests + lint * fix tests * Tests again * better logging + better naming * limiting the nb of networking resources to be applied at the same time in a disruption * removing service AMAZON from aws services: it's useless * document limitations * lint... * feedback correction * fix tests * Add flow and protocol for cloud service * Fix tests * Fix tests
* Cloud Managed Service Disruption (#583) * Creating boilerplate for pull and parse of ip range file from cloud providers * transmitting the ip ranges of cloud aws to network with hosts disruption * Read chaos-controller config for cloud providers * Update license headers * Lint * Tidy * Resolve cyclic import * Unit tests + start of doc * Removing conf about urls * lint... * complete doc * Remove useless parameter, add tc filter count * Tests + lint * fix tests * Tests again * better logging + better naming * limiting the nb of networking resources to be applied at the same time in a disruption * removing service AMAZON from aws services: it's useless * document limitations * lint... * feedback correction * fix tests * Add flow and protocol for cloud service * Fix tests * Fix tests * Fix tests * Fix tests * Fix tests * fix some bugs * fix some bugs * Fix bugs * Better spec verification on cloud service transformation to hosts * Update doc file
* Cloud Managed Service Disruption (#583) * Creating boilerplate for pull and parse of ip range file from cloud providers * transmitting the ip ranges of cloud aws to network with hosts disruption * Read chaos-controller config for cloud providers * Update license headers * Lint * Tidy * Resolve cyclic import * Unit tests + start of doc * Removing conf about urls * lint... * complete doc * Remove useless parameter, add tc filter count * Tests + lint * fix tests * Tests again * better logging + better naming * limiting the nb of networking resources to be applied at the same time in a disruption * removing service AMAZON from aws services: it's useless * document limitations * lint... * feedback correction * fix tests * Add flow and protocol for cloud service * Fix tests * Fix tests * GCP managed service disruption * fix tests * Uses multiple urls for one cloud service + tests * Switch back to only one url * Update doc file * Use Google Cloud as service name instead of Google * Add missing license headers * Lint * Removing non valid changes * Removing non valid changes * Better code * Lint * Updating tests and clarifying the gcp convert func * Changing Google Cloud to Google for gcp service name * Remove merging mistake * Remove useless slice init * Adding one unit test case
What does this PR do?
Please briefly describe your changes as well as the motivation behind them:
Context
Large cloud services providers are using wide IP ranges. Hostnames used to identify those services are resolving with some IPs of that range, and resolved IPs can change between each DNS request. Applying a network disruption using those hostnames only doesn’t work well since retrying the resolution of such hostname would return new IPs (not disrupted) and the disruption would be ineffective.
We should add a new option to the network disruption to target specific cloud services providers. Each provider hosts a JSON file with the list of all IP ranges. The format differs between each provider but all of them contain information about the IP ranges and the associated service(s).
In this PR, we're creating:
How
Spec
Similar to a network disruption, a new field has been added:
cloud
field.Full disruption example:
Code Quality Checklist
Testing
unit
tests orend-to-end
tests.unit
tests orend-to-end
tests.aws cloud managed disruption with 1 service / with too much services