ci: [NPM] scale test pipeline using KWOK #1915
/azp run
Azure Pipelines successfully started running 1 pipeline(s), but failed to run 1 pipeline(s).
@@ -345,14 +345,14 @@ jobs:
done

exit $exitCode
- displayName: "Run Test Suite and Get Logs"
+ displayName: "Run Test Suite and Get Logs - $(PROFILE)"
this change lets us distinguish job failures in our pipeline report
@@ -463,7 +463,169 @@ jobs:
name: cyclonus
displayName: "Run Cyclonus Test"
failOnStderr: false
condition: succeeded()
Previously, Cyclonus ran even if the previous step failed.
Please add timing details that explain why it takes almost 2 hours for the connectivity tests to start.
Now takes 6 minutes. Kwok wasn't running, so the fake Pods weren't being scheduled.
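A quick way to catch this kind of failure early is to count how many fake Pods have actually reached the Running phase before starting the connectivity tests. The following is a minimal sketch, not the check used in this PR: `count_running` is a hypothetical helper, and it assumes the default single-namespace `kubectl get pods --no-headers` column layout, where the third column is STATUS.

```shell
# count_running is a hypothetical helper, not part of this PR.
# It reads `kubectl get pods --no-headers` output on stdin and prints the
# number of rows whose STATUS column (3rd field) is exactly "Running".
# Assumption: output comes from a single namespace (no -A flag), so the
# columns are NAME READY STATUS RESTARTS AGE.
count_running() {
    awk '$3 == "Running" { n++ } END { print n + 0 }'
}

# Example usage (namespace and label selector are placeholders):
# kubectl get pods -n <namespace> -l <selector> --no-headers | count_running
```

Note that this counts Pods that are Running but not Ready as well, which matches the "only check if kwok pods are running, not necessarily ready" approach mentioned in the commit list.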
…ure-container-networking into hgregory/04-14-pipeline
This reverts commit 7b21125.
.pipelines/npm/npm-scale-test.yaml
Outdated
echo "waiting for NPM to start running..."
kubectl wait --for=condition=Ready pod -l k8s-app=azure-npm -n kube-system --timeout=20m
echo "sleep 1m to let NPM restart in case of bootup failure due to HNS errors"
sleep 1m
Windows pods can sometimes take longer than 1m, I think?
Now sleeps for longer.
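As an alternative to tuning a fixed sleep, the readiness check in this thread could be wrapped in a bounded polling loop, so a slow Windows pod is tolerated up to a deadline. This is only a sketch under assumptions: `wait_until` is a hypothetical helper that does not exist in the pipeline, and the 30-second interval and per-attempt timeout in the usage comment are illustrative values.

```shell
# wait_until is a hypothetical helper, not something defined in this PR.
# Usage: wait_until <deadline_seconds> <interval_seconds> <command> [args...]
# Re-runs the command until it exits 0 or the deadline has elapsed.
wait_until() {
    wu_deadline=$1
    wu_interval=$2
    shift 2
    wu_start=$(date +%s)
    while ! "$@"; do
        wu_now=$(date +%s)
        if [ $((wu_now - wu_start)) -ge "$wu_deadline" ]; then
            echo "timed out after ${wu_deadline}s waiting for: $*" >&2
            return 1
        fi
        sleep "$wu_interval"
    done
    return 0
}

# Example usage (interval and per-attempt timeout are assumed values):
# wait_until 1200 30 kubectl wait --for=condition=Ready pod \
#     -l k8s-app=azure-npm -n kube-system --timeout=30s
```

Polling does not replace the sleep if the goal is specifically to give NPM time to crash and restart after an HNS error; it only bounds how long the step waits for readiness.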
inlineScript: |
set -e
mkdir -p $(System.DefaultWorkingDirectory)/$(RESOURCE_GROUP)-$(PROFILE)
./kwok --kubeconfig ~/.kube/config \
this is pretty useful for other scenarios, maybe after this add a make recipe to attach kwok nodes
going to hold off on merging until HNS fixes are in prod (probably next week)
* wip
* temporarily disable most conf runs
* update readme
* back to raw yamls and clone the branch to run scale test
* fix raw yaml URLs
* fix inline script
* fix length of rg name
* uncomment all conf again
* comment out everything unnecessary for testing
* remove commented out dependencies
* use master branch for pipeline
* label nodes
* multiple nodes
* uncomment rest of conformance pipeline (originally commented for testing)
* fix print out for time taken in test-connectivity.sh
* fix: run kwok command in background
* mkdir for kwok log
* try azure cli 1 to fix login error
* Revert "try azure cli 1 to fix login error" This reverts commit f1671e3.
* move scale test to new pipeline yaml
* remove scale test from conformance pipeline yaml
* revert name change for cyclonus job
* remove unnecessary image build and variable
* error codes and display names
* change sleep and wait for npm logic
* look at directory
* use pre-cloned repo
* fix directory path
* install kubectl first
* FIXME: comment out succeeded condition
* kubectl binary arg
* kubectl for scale test
* fix label selector
* fix kubectl path
* fix kubectl binary arg
* fix kwok, more steps
* FIXME: temporarily use custom fast image
* fix kwok pid and add comment
* 10m timeout for connectivity after crud
* fix kwok command invocation
* bump up timeouts for testing
* higher memory limit
* add note to connectivity script
* fix sed
* no need to curl npm yaml
* tmp: comment things out to test final step
* only check if kwok pods are running, not necessarily ready
* Revert "tmp: comment things out to test final step" This reverts commit 7b21125.
* update registry keys to fix HNS reliability
* update regkey code
* sleep to let NPM restart in case of bootup failure
* adaptive wait timeout
* change some errors to warnings
* log date
* make sure all pods are labeled
* delete and readd labels after deleting pods
* tmp: skip large scale up and connectivity check for testing
* fix overwrite arg
* rename tasks and uncomment things
* update command for updating reg key
* make timeout logic simpler
* back to reg add command for regkeys
* official timeouts instead of test values
* delete task updating registry keys and stop hardcoding npm image
* increase sleep
Reason for Change:
Automate NPM scale tests.
Issue Fixed:
Requirements:
Notes:
Test Steps
1. Scale Up Large (~1K fake Pods, 30 real Pods)
2. Test NPM Bootup Latency and Connectivity
Must achieve initial connectivity within 15 minutes, and apply a new NetworkPolicy within 30 seconds.
3. General CRUD at Medium Scale (~30 fake Pods, 10 real Pods)
4. Test Connectivity After CRUD
Must achieve initial connectivity within 10 minutes, and apply a new NetworkPolicy within 20 seconds.
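The time limits in the steps above amount to "run a check, and fail if it errors or takes longer than N seconds". A minimal sketch of such a gate is shown below. It is not the logic in test-connectivity.sh: `timed_within` is a hypothetical helper, and the check commands in the usage comment are placeholders.

```shell
# timed_within is a hypothetical helper, not taken from this PR.
# Usage: timed_within <limit_seconds> <command> [args...]
# Runs the command, prints how long it took, and succeeds only if the
# command exited 0 and finished within the limit.
timed_within() {
    tw_limit=$1
    shift
    tw_start=$(date +%s)
    if "$@"; then
        tw_status=0
    else
        tw_status=$?
    fi
    tw_elapsed=$(( $(date +%s) - tw_start ))
    echo "took ${tw_elapsed}s (limit ${tw_limit}s): $*"
    [ "$tw_status" -eq 0 ] && [ "$tw_elapsed" -le "$tw_limit" ]
}

# Example usage with the limits listed above (check commands are placeholders):
# timed_within 900 <initial-connectivity-check>   # step 2: 15 minutes
# timed_within 30  <new-policy-check>             # step 2: 30 seconds
# timed_within 600 <initial-connectivity-check>   # step 4: 10 minutes
# timed_within 20  <new-policy-check>             # step 4: 20 seconds
```

This measures elapsed time after the fact rather than interrupting a check that overruns, so a hung check would still need its own timeout.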