Add chaos resource debugging to readme #2409

Merged
merged 5 commits into from Dec 10, 2021
104 changes: 81 additions & 23 deletions tools/stress-cluster/chaos/README.md
@@ -9,20 +9,21 @@ The chaos environment is an AKS cluster (Azure Kubernetes Service) with several
* [Quick Testing with no Dependencies](#quick-testing-with-no-dependencies)
* [Creating a Stress Test](#creating-a-stress-test)
* [Layout](#layout)
* [Stress Test Metadata](#stress-test-metadata)
* [Stress Test Secrets](#stress-test-secrets)
* [Stress Test File Share](#stress-test-file-share)
* [Stress Test Azure Resources](#stress-test-azure-resources)
* [Helm Chart Dependencies](#helm-chart-dependencies)
* [Job Manifest](#job-manifest)
* [Chaos Manifest](#chaos-manifest)
* [Scenarios and values.yaml](#scenarios-and-valuesyaml)
* [Node size requirements](#node-size-requirements)
* [Node Size Requirements](#node-size-requirements)
* [Deploying a Stress Test](#deploying-a-stress-test)
* [Configuring faults](#configuring-faults)
* [Faults via Dashboard](#faults-via-dashboard)
* [Faults via Config](#faults-via-config)
* [Running the example test with a network fault](#running-the-example-test-with-a-network-fault)
* [Configure Faults via Dashboard](#configure-faults-via-dashboard)
* [Debugging chaos resources and events](#debugging-chaos-resources-and-events)
* [Running the example test with a network fault](#running-the-example-test-with-a-network-fault)


Technologies used:
@@ -507,6 +508,8 @@ Faults can be configured via kubernetes manifests or via the UI (which is a help

### Faults via Dashboard

NOTE: The Chaos Mesh dashboard is just a helper that generates manifests under the hood. You can also create and submit these manifests directly. See the [docs](https://chaos-mesh.org/docs/simulate-network-chaos-on-kubernetes/).

To configure faults via the UI, make sure you can access the chaos dashboard by running the command below and navigating to `localhost:2333` in your browser.

```
@@ -523,16 +526,85 @@ a label like `job-name: <your job name>` in the drop down.

See [Chaos Manifest](#chaos-manifest).

## Running the example test with a network fault
### Debugging chaos resources and events

Once your stress test pod(s) reach a `Running` state, there are a few ways to check the status of your chaos resources.

From the [test cluster dashboard](https://aka.ms/azsdk/stress/dashboard), select your stress test pods from the dropdown
and verify there are entries in the logs in the **Chaos Daemon Events** table.

On the stress cluster, you can view the status of your chaos resources. For example, to check on all the network chaos
resources you have deployed:

```
kubectl get networkchaos -n <your alias>
```

Pick the one relevant to your test and print the detailed view:

```
kubectl get networkchaos -n <your alias> <networkchaos resource name> -o yaml
```

The `status` section of the YAML output shows whether the chaos experiment succeeded or failed:

**Example Success**

```
status:
  experiment:
    duration: 10.000411955s
    endTime: "2021-12-09T01:20:57Z"
    phase: Waiting
    podRecords:
    - action: loss
      hostIP: 10.240.0.7
      message: This is a source pod.network traffic control action duration 10s
      name: stress-python-eventhubs-stress-test-1-m2hhh
      namespace: yuling
      podIP: 10.244.1.40
    startTime: "2021-12-09T01:20:47Z"
  scheduler:
    nextRecover: "2021-12-09T01:21:27Z"
    nextStart: "2021-12-09T01:21:17Z"
```

**Example Failure**

```
status:
  experiment:
    phase: Failed
  failedMessage: 'lookup some-bad-host.foobar.net;:
    no such host'
  scheduler: {}
```

Chaos resource types other than network chaos can be queried the same way, by their `kind`. To list the available kinds:

```
⇉ ⇉ ⇉ kubectl api-resources | grep chaos-mesh.org
awschaos          chaos-mesh.org/v1alpha1   true   AwsChaos
dnschaos          chaos-mesh.org/v1alpha1   true   DNSChaos
httpchaos         chaos-mesh.org/v1alpha1   true   HTTPChaos
iochaos           chaos-mesh.org/v1alpha1   true   IoChaos
jvmchaos          chaos-mesh.org/v1alpha1   true   JVMChaos
kernelchaos       chaos-mesh.org/v1alpha1   true   KernelChaos
networkchaos      chaos-mesh.org/v1alpha1   true   NetworkChaos
podchaos          chaos-mesh.org/v1alpha1   true   PodChaos
podiochaos        chaos-mesh.org/v1alpha1   true   PodIoChaos
podnetworkchaos   chaos-mesh.org/v1alpha1   true   PodNetworkChaos
stresschaos       chaos-mesh.org/v1alpha1   true   StressChaos
timechaos         chaos-mesh.org/v1alpha1   true   TimeChaos
```

### Running the example test with a network fault

Run the commands below to execute a sample test.

```
cd ./examples/network_stress_example
# This will build the docker images and helm chart dependencies
./build.sh
# This will log in to the cluster and container registry, publish the image and the chart
./deploy.sh <your alias>
pwsh ../../../../../eng/common/scripts/stress-testing/deploy-stress-tests.ps1 -Login -PushImages
```

Verify the pods in the job have booted and are running ok (with chaos network failures):
@@ -542,7 +614,7 @@ Verify the pods in the job have booted and are running ok (with chaos network fa
NAME READY STATUS RESTARTS AGE
network-example-0629200737-bk647 1/1 Running 0 89s

⇉ ⇉ ⇉ kubectl logs -n <YOUR NAMESPACE> network-example-0629200737-bk647 -f
⇉ ⇉ ⇉ kubectl logs -n <your alias> network-example-0629200737-bk647 -f
Spider mode enabled. Check if remote file exists.
--2021-06-09 00:51:52-- http://www.bing.com/
Resolving www.bing.com (www.bing.com)... 204.79.197.200, 13.107.21.200, 2620:1ec:c11::200
@@ -551,17 +623,3 @@ Connecting to www.bing.com (www.bing.com)|13.107.21.200|:80... failed: Connectio
Connecting to www.bing.com (www.bing.com)|2620:1ec:c11::200|:80... failed: Cannot assign requested address.
Giving up.
```
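
Once the example is running, the status check from "Debugging chaos resources and events" can be scripted. The sketch below is one possible approach, not part of the repository: `chaos_phase` is a hypothetical helper name, and it assumes the YAML layout shown in the example outputs, where `phase` is nested under the top-level `status` key.

```shell
# Hypothetical helper (not part of this repo): reads the output of
# `kubectl get <kind> ... -o yaml` on stdin and prints the experiment phase.
# Assumes `phase:` appears inside the top-level `status:` block.
chaos_phase() {
  awk '
    /^status:/ { in_status = 1; next }   # top-level status block begins
    /^[^ ]/    { in_status = 0 }         # any other top-level key ends it
    in_status && $1 == "phase:" { print $2; exit }
  '
}

# Usage against the cluster (placeholders as used elsewhere in this README):
#   kubectl get networkchaos -n <your alias> <networkchaos resource name> -o yaml | chaos_phase
```

If you prefer not to parse YAML, `kubectl get ... -o jsonpath='{.status.experiment.phase}'` should return the same field directly, assuming the status layout above.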

### Configure Faults via Dashboard

Navigate to the chaos dashboard at `localhost:2333`

NOTE: The Chaos Mesh dashboard is just a helper that generates manifests under the hood. You can also create and submit these manifests directly. See the [docs](https://chaos-mesh.org/docs/chaos_experiments/networkchaos_experiment).

1. From the UI, click `New Experiment`
1. Select `Network Attack` and then select `LOSS`
1. In the `Loss` textbox, enter `100`
1. Scroll down to `Scope`. Enter your namespace in the `Namespace Selectors` field, and find a Label Selector that matches your test (e.g. `testInstance: network-example-<your alias>`).
1. Enter a name for the experiment, such as `<YOUR NAME>-<chaos type>`.
1. Enable `Run Continuously`
1. Click through the multiple `Submit` buttons.
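
Since the dashboard only generates a manifest, the steps above can be approximated by submitting a manifest directly. The following is a minimal sketch under stated assumptions: the function name `network_loss_manifest` and the resource name `<alias>-network-loss` are hypothetical, `mode: all` is an assumed choice, and the field names follow the Chaos Mesh `v1alpha1` `NetworkChaos` schema as best understood here. Check them against the Chaos Mesh docs for the version installed on the cluster.

```shell
# Hypothetical helper: print a NetworkChaos manifest for 100% packet loss.
#   $1: namespace (your alias)
#   $2: value of the testInstance label on the pods to target
network_loss_manifest() {
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: $1-network-loss   # hypothetical naming scheme
  namespace: $1
spec:
  action: loss
  mode: all               # assumption: apply to every matching pod
  selector:
    namespaces:
      - $1
    labelSelectors:
      testInstance: $2
  loss:
    loss: "100"
EOF
}

# Usage (not run here; requires cluster access):
#   network_loss_manifest <your alias> network-example-<your alias> | kubectl apply -f -
```

No `duration` or `scheduler` is set, which is intended to mirror the `Run Continuously` option in the dashboard.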