ci: [NPM] scale test pipeline using KWOK #1915

Merged
merged 71 commits into from
May 18, 2023

Conversation

huntergregory
Contributor

@huntergregory huntergregory commented Apr 14, 2023

Reason for Change:
Automate NPM scale tests.

Issue Fixed:

Requirements:

Notes:

Test Steps

1. Scale Up Large (~1K fake Pods, 30 real Pods)

  • Create Fake + Real Pods.
  • Label them.
  • Add NetPols.
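The NetPol part of step 1 could be driven by generated manifests. The sketch below prints NetworkPolicy manifests to stdout. The helper names (`emit_netpol`, `emit_netpols`), the `app` label, and the policy names are hypothetical and are not taken from this PR's scripts.

```shell
# Hypothetical helper: print one NetworkPolicy manifest to stdout.
# Arguments: policy name, namespace, value of the (assumed) "app" label.
emit_netpol() {
  local name="$1" namespace="$2" app="$3"
  cat <<EOF
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  podSelector:
    matchLabels:
      app: ${app}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: ${app}
EOF
}

# Print a multi-document YAML stream containing COUNT policies.
# Arguments: count, namespace.
emit_netpols() {
  local count="$1" namespace="$2" i=1
  while [ "$i" -le "$count" ]; do
    emit_netpol "policy-${i}" "$namespace" "app-${i}"
    i=$((i + 1))
  done
}
```

For example, `emit_netpols 10 scale-test | kubectl apply -f -` would create ten policies in an existing namespace (the namespace name `scale-test` is also an assumption).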

2. Test NPM Bootup Latency and Connectivity

Must achieve initial connectivity within 15 minutes, and apply a new NetworkPolicy within 30 seconds.

  • Restart NPM to see performance improvements from perf: [WIN-NPM] fast bootup #1900.
  • Check connectivity. Fail if a timeout is exceeded.
    • Create two pinger Pods.
    • Check initial connectivity. If unsuccessful, retry until timeout exceeded.
      • Allowed: pinger <-> pinger
      • Allowed: real <-> real
      • Blocked: real <-> pinger
    • Add NetPol and assert connectivity. If unsuccessful, retry until timeout exceeded.
      • Allowed: real <-> pinger
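The "retry until timeout exceeded" behavior in step 2 can be sketched as a small helper. This is a minimal illustration, not the logic in `test-connectivity.sh`; the function name and its argument order are assumptions.

```shell
# Retry a command until it succeeds or the timeout is exceeded.
# Usage: retry_until_timeout TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
retry_until_timeout() {
  local timeout="$1" interval="$2" start now
  shift 2
  start=$(date +%s)
  while true; do
    # Run the check; stop as soon as it passes.
    if "$@"; then
      return 0
    fi
    now=$(date +%s)
    # Give up once the elapsed time reaches the timeout.
    if [ $((now - start)) -ge "$timeout" ]; then
      echo "timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example (check_initial_connectivity is a hypothetical function that
# returns 0 only when all allowed/blocked expectations hold):
#   retry_until_timeout $((15 * 60)) 5 check_initial_connectivity
```

The command is always attempted at least once, so a timeout of 0 means "try exactly once".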

3. General CRUD at Medium Scale (~30 fake Pods, 10 real Pods)

  • Delete previous namespaces (including Pods and NetPols).
  • Recreate everything in step 1.
  • Delete and re-add labels.
  • Delete and re-create Pods.
  • Delete and re-create NetPols.
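The "delete and re-add labels" step could look roughly like the sketch below. The function name and the `KUBECTL` override variable are assumptions for illustration; only standard `kubectl label` syntax is used.

```shell
# Path to the kubectl binary; overridable so a pipeline can supply its own.
KUBECTL="${KUBECTL:-kubectl}"

# Remove a label from every Pod in a namespace, then add it back.
# Arguments: namespace, label key, label value.
relabel_pods() {
  local namespace="$1" key="$2" value="$3"
  # A trailing "-" after the key tells kubectl to remove that label.
  "$KUBECTL" label pods --all -n "$namespace" "${key}-" || return 1
  # --overwrite keeps the re-add safe even if some Pods still carry the key.
  "$KUBECTL" label pods --all -n "$namespace" "${key}=${value}" --overwrite
}
```

Setting `KUBECTL=echo` prints the commands instead of running them, which is a cheap way to dry-run the sequence.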

4. Test Connectivity After CRUD

Must achieve initial connectivity within 10 minutes, and apply a new NetworkPolicy within 20 seconds.

  • Check connectivity like in step 2.
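Steps 2 and 4 share the same check but use different limits. One way to keep the limits in a single place is sketched below; the phase names `large` and `medium` and the function name are assumptions, while the numbers come from the requirements stated above.

```shell
# Print "<initial-connectivity timeout> <new-NetworkPolicy timeout>" in seconds.
# Argument: phase name ("large" for step 2, "medium" for step 4).
phase_timeouts() {
  case "$1" in
    large)  echo "$((15 * 60)) 30" ;;  # step 2: 15 minutes, 30 seconds
    medium) echo "$((10 * 60)) 20" ;;  # step 4: 10 minutes, 20 seconds
    *)
      echo "unknown phase: $1" >&2
      return 1
      ;;
  esac
}

# Example: split the two values into positional parameters.
#   set -- $(phase_timeouts medium)
#   echo "initial=${1}s netpol=${2}s"
```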

@huntergregory huntergregory added the npm (Related to NPM.), ci (Infra or tooling.), and windows labels Apr 14, 2023
@huntergregory
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s), but failed to run 1 pipeline(s).

@huntergregory huntergregory marked this pull request as draft April 14, 2023 20:28
@@ -345,14 +345,14 @@ jobs:
done

exit $exitCode
- displayName: "Run Test Suite and Get Logs"
+ displayName: "Run Test Suite and Get Logs - $(PROFILE)"
Contributor Author

this change lets us distinguish job failures in our pipeline report

@@ -463,7 +463,169 @@ jobs:
name: cyclonus
displayName: "Run Cyclonus Test"
failOnStderr: false
condition: succeeded()
Contributor Author

previously, cyclonus ran even if the previous step failed

@huntergregory huntergregory marked this pull request as ready for review April 19, 2023 20:18
vakalapa
vakalapa previously approved these changes Apr 20, 2023
Contributor

@vakalapa vakalapa left a comment

Make sure you add timings showing why it takes almost 2 hrs for the connectivity tests to start.

@huntergregory
Contributor Author

huntergregory commented May 8, 2023

Make sure you add timings showing why it takes almost 2 hrs for the connectivity tests to start.

It now takes 6 minutes. KWOK wasn't running, so the fake Pods weren't being scheduled.

@huntergregory huntergregory requested a review from a team as a code owner May 9, 2023 16:57
echo "waiting for NPM to start running..."
kubectl wait --for=condition=Ready pod -l k8s-app=azure-npm -n kube-system --timeout=20m
echo "sleep 1m to let NPM restart in case of bootup failure due to HNS errors"
sleep 1m
Contributor

Windows Pods can sometimes take longer than 1m, I think?

Contributor Author

It now sleeps for longer.
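An alternative to a fixed sleep (not what this PR does, which simply increases the sleep) is to require several consecutive successful readiness checks, so a Pod that restarts shortly after first becoming Ready resets the count. The helper name and arguments below are assumptions.

```shell
# Succeed once COMMAND has passed REQUIRED times in a row.
# Fail if that has not happened within MAX_ATTEMPTS checks.
# Usage: wait_until_stable REQUIRED MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until_stable() {
  local required="$1" max_attempts="$2" interval="$3"
  shift 3
  local streak=0 attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    attempt=$((attempt + 1))
    if "$@"; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$required" ]; then
        return 0
      fi
    else
      # Any failure (for example, a restart) resets the streak.
      streak=0
    fi
    sleep "$interval"
  done
  echo "not stable after ${max_attempts} attempts: $*" >&2
  return 1
}

# Example, reusing the readiness check shown above with a short per-check timeout:
#   wait_until_stable 3 60 10 kubectl wait --for=condition=Ready pod \
#     -l k8s-app=azure-npm -n kube-system --timeout=10s
```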

inlineScript: |
set -e
mkdir -p $(System.DefaultWorkingDirectory)/$(RESOURCE_GROUP)-$(PROFILE)
./kwok --kubeconfig ~/.kube/config \
Member

This is pretty useful for other scenarios. Maybe after this, add a make recipe to attach KWOK nodes.
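A reusable recipe for attaching fake nodes could start from a manifest generator like the sketch below. The annotation, label, and taint follow the convention shown in KWOK's documentation as best understood here and should be verified against the KWOK version in use; the helper names, node names, and capacity values are assumptions.

```shell
# Hypothetical helper: print one fake Node manifest for KWOK to manage.
# Argument: node name.
emit_kwok_node() {
  local name="$1"
  cat <<EOF
---
apiVersion: v1
kind: Node
metadata:
  name: ${name}
  annotations:
    kwok.x-k8s.io/node: fake
  labels:
    kubernetes.io/hostname: ${name}
    type: kwok
spec:
  taints:
  - effect: NoSchedule
    key: kwok.x-k8s.io/node
    value: fake
status:
  allocatable:
    cpu: "32"
    memory: 256Gi
    pods: "110"
  capacity:
    cpu: "32"
    memory: 256Gi
    pods: "110"
EOF
}

# Print COUNT fake Node manifests named kwok-node-0, kwok-node-1, ...
emit_kwok_nodes() {
  local count="$1" i=0
  while [ "$i" -lt "$count" ]; do
    emit_kwok_node "kwok-node-${i}"
    i=$((i + 1))
  done
}
```

A make recipe could then pipe `emit_kwok_nodes 5` into `kubectl apply -f -`. Because of the taint, only Pods that tolerate `kwok.x-k8s.io/node` would land on these nodes.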

@huntergregory
Contributor Author

going to hold off on merging until HNS fixes are in prod (probably next week)

@vakalapa vakalapa merged commit ce11a8d into master May 18, 2023
37 of 45 checks passed
@vakalapa vakalapa deleted the hgregory/04-14-pipeline branch May 18, 2023 16:57
jpayne3506 pushed a commit that referenced this pull request May 18, 2023
* wip

* temporarily disable most conf runs

* update readme

* back to raw yamls and clone the branch to run scale test

* fix raw yaml URLs

* fix inline script

* fix length of rg name

* uncomment all conf again

* comment out everything unnecessary for testing

* remove commented out dependencies

* use master branch for pipeline

* label nodes

* multiple nodes

* uncomment rest of conformance pipeline (originally commented for testing)

* fix print out for time taken in test-connectivity.sh

* fix: run kwok command in background

* mkdir for kwok log

* try azure cli 1 to fix login error

* Revert "try azure cli 1 to fix login error"

This reverts commit f1671e3.

* move scale test to new pipeline yaml

* remove scale test from conformance pipeline yaml

* revert name change for cyclonus job

* remove unnecessary image build and variable

* error codes and display names

* change sleep and wait for npm logic

* look at directory

* use pre-cloned repo

* fix directory path

* install kubectl first

* FIXME: comment out succeeded condition

* kubectl binary arg

* kubectl for scale test

* fix label selector

* fix kubectl path

* fix kubectl binary arg

* fix kwok, more steps

* FIXME: temporarily use custom fast image

* fix kwok pid and add comment

* 10m timeout for connectivity after crud

* fix kwok command invocation

* bump up timeouts for testing

* higher memory limit

* add note to connectivity script

* fix sed

* no need to curl npm yaml

* tmp: comment things out to test final step

* only check if kwok pods are running, not necessarily ready

* Revert "tmp: comment things out to test final step"

This reverts commit 7b21125.

* update registry keys to fix HNS reliability

* update regkey code

* sleep to let NPM restart in case of bootup failure

* adaptive wait timeout

* change some errors to warnings

* log date

* make sure all pods are labeled

* delete and readd labels after deleting pods

* tmp: skip large scale up and connectivity check for testing

* fix overwrite arg

* rename tasks and uncomment things

* update command for updating reg key

* make timeout logic simpler

* back to reg add command for regkeys

* official timeouts instead of test values

* delete task updating registry keys and stop hardcoding npm image

* increase sleep
Labels
ci (Infra or tooling.), npm (Related to NPM.), windows
3 participants