New Controller: OperationJob #208

Open
wants to merge 70 commits into
base: main
c682dbf
import kruise v1.4.0
ColdsteelRail Apr 8, 2024
cf045b3
ensure TTL and activeDeadline
ColdsteelRail May 13, 2024
9994ef5
support admission webhook
ColdsteelRail May 15, 2024
088a399
rephrase spec.partition
ColdsteelRail May 15, 2024
b5fe9d3
support feature partition
ColdsteelRail May 16, 2024
8fd7f46
fix bug: replace same targets by partition
ColdsteelRail May 16, 2024
c8d6b17
rename file: pkg/controllers/operationjob/recreate
ColdsteelRail May 16, 2024
3ed2839
rename file: pkg/controllers/operationjob/utils/finalizer.go
ColdsteelRail May 16, 2024
5e22103
add annotation to register recreate interface
ColdsteelRail May 16, 2024
7f87921
fix collaset ut: remove operationDelays
ColdsteelRail May 16, 2024
88128ac
validating webhook: deny partition decreased
ColdsteelRail May 17, 2024
7644859
(1) not check pod exist in webhook; (2) support recreate non exist po…
ColdsteelRail May 17, 2024
2a1192e
add ReleaseTarget (TODO)
ColdsteelRail May 17, 2024
a019ec4
refactor directory and structs
ColdsteelRail May 17, 2024
85b554d
(1) execute recreate and update in parallel; (2) release target when dele…
ColdsteelRail May 20, 2024
a67e67f
add ut for deadline and TTL
ColdsteelRail May 20, 2024
30fb43f
deleteExpectation after targets released when deletionTimestamp not nil
ColdsteelRail May 21, 2024
06ad732
refactor job and pod operation progress
ColdsteelRail May 22, 2024
9fd4e95
api: remove ExtraInfo, add Reason, Message
ColdsteelRail May 22, 2024
31786e5
validating webhook: targets field is immutable
ColdsteelRail May 22, 2024
c02e895
add observedGeneration
ColdsteelRail May 22, 2024
13c2adc
stop operate target if job finished
ColdsteelRail May 22, 2024
d5b34d2
update and recreate execute parallel
ColdsteelRail May 23, 2024
c2197bc
reset collaset ut
ColdsteelRail May 23, 2024
ec7a772
remove finalizer utils, and use exist utils
ColdsteelRail May 23, 2024
9a9bdf4
operation delay seconds
ColdsteelRail May 24, 2024
8a568d0
add reason: ReplaceByNewPod
ColdsteelRail Jun 3, 2024
ddd4795
operationjob validating webhook simplified
ColdsteelRail Jul 8, 2024
846e8bb
move recreatMethod Anno from oj to pod
ColdsteelRail Jul 9, 2024
f158d3a
add charts of oj
ColdsteelRail Jul 9, 2024
b5f2f4b
support replace-oj deleted and cancel replace
ColdsteelRail Jul 9, 2024
e76a54a
operationJob e2e
ColdsteelRail Jul 10, 2024
5da808b
add featuregate: EnableKruiseToRecreate
ColdsteelRail Jul 12, 2024
2b6e948
(1) choose to import kruise api; (2) add ReasonInvalidRecreateMethod
ColdsteelRail Jul 12, 2024
5c2a87f
Action: Recreate -> Restart
ColdsteelRail Jul 13, 2024
000ae29
fix: finish restart ops lifecycle is operation finished
ColdsteelRail Jul 14, 2024
fcb28e7
fix: continue to operate target if ops finished, to clean some lifecycle
ColdsteelRail Jul 14, 2024
8e18953
fix: do not panic when opsAction not supported
ColdsteelRail Jul 14, 2024
c59958e
operate candidates using slowStartBatch with goroutine
ColdsteelRail Jul 15, 2024
27d49bd
refactor operationjob api: opsStatus and targetDetails
ColdsteelRail Jul 15, 2024
8da13fc
fix golint
ColdsteelRail Jul 15, 2024
86ad7ee
take out common codes of ActionOperater
ColdsteelRail Jul 15, 2024
9e242a4
fix: ReleaseTargetsForDeletion npe
ColdsteelRail Jul 15, 2024
59def70
(1) re-define crr name; (2) register crr to new cache if EnableKruise…
ColdsteelRail Jul 15, 2024
4cda254
fix golint
ColdsteelRail Jul 15, 2024
d6ed6a4
fix golint2
ColdsteelRail Jul 15, 2024
ff6e878
refactor ActionHandler
ColdsteelRail Jul 15, 2024
c40df51
set EnableKruiseToRestart to false by default
ColdsteelRail Jul 15, 2024
b5aaece
set EnableKruiseToRestart in manager.yaml
ColdsteelRail Jul 15, 2024
87ccf1d
refactor OpsLifecycleAdapter
ColdsteelRail Jul 15, 2024
339ade8
remove AnnotationOperationJobRestartMethod
ColdsteelRail Jul 16, 2024
d369919
remove OperateInfo
ColdsteelRail Jul 16, 2024
8c7dd19
refactor KruiseRestartHandler PodReplaceHandler
ColdsteelRail Jul 16, 2024
caf2206
refactor ActionHandler
ColdsteelRail Jul 16, 2024
7b7f740
refactor usingOpsLifecycle logic
ColdsteelRail Jul 16, 2024
69a8560
refactor GetOpsProgress
ColdsteelRail Jul 16, 2024
af2f0c7
GetOpsStatus: do not change progress if crr not found
ColdsteelRail Jul 16, 2024
6a4f0d0
add note for register
ColdsteelRail Jul 16, 2024
0f88a43
ensure PodReplaceHandler and KruiseRestartHandler is good to register
ColdsteelRail Jul 16, 2024
70b01ff
featureGate: EnableKruiseToRestart
ColdsteelRail Jul 16, 2024
d4a98a5
featureGate: EnableKruiseToRestart
ColdsteelRail Jul 16, 2024
065a752
fix replace succeeded reason and message
ColdsteelRail Jul 16, 2024
c04a586
refactor register
ColdsteelRail Jul 16, 2024
dc4bf9b
add containers for all candidates and actions
ColdsteelRail Jul 17, 2024
6681b31
refactor handler parameters
ColdsteelRail Jul 17, 2024
da9bde8
remove redundant file
ColdsteelRail Jul 17, 2024
520e6dc
webhook validating: add action validate
ColdsteelRail Jul 17, 2024
077160b
improve logger: pass controller's logger to handler
ColdsteelRail Jul 17, 2024
fc51be2
webhook error: improve webhook response
ColdsteelRail Jul 17, 2024
80f38bb
fix: lifecycle id using oj name
ColdsteelRail Jul 18, 2024
144 changes: 144 additions & 0 deletions .github/workflows/e2e.yaml
@@ -13,6 +13,7 @@ env:
GO_VERSION: '1.19'
KIND_VERSION: 'v0.14.0'
KIND_IMAGE: 'kindest/node:v1.22.2'
KIND_IMAGE_KRUISE: 'kindest/node:v1.24.6'
KIND_CLUSTER_NAME: 'e2e-test'

jobs:
@@ -82,3 +83,146 @@ jobs:
kubectl get pod -n kusionstack-system --no-headers -l control-plane=controller-manager | awk '{print $1}' | xargs kubectl logs -p -n kusionstack-system
exit 1
fi

OperationJob:
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v3
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v3
with:
go-version: ${{ env.GO_VERSION }}
- name: Cache Go Dependencies
uses: actions/cache@v2
with:
path: ~/go/pkg/mod
key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
restore-keys: ${{ runner.os }}-go-
- name: Setup Kind Cluster
uses: helm/kind-action@v1.10.0
with:
node_image: ${{ env.KIND_IMAGE_KRUISE }}
cluster_name: ${{ env.KIND_CLUSTER_NAME }}
config: ./test/e2e/scripts/kind-conf.yaml
version: ${{ env.KIND_VERSION }}
- name: Setup Kube-config
run: |
mkdir -p /tmp/kind
make kind-kube-config
- name: Clone Kruise Repo
uses: GuillaumeFalourd/clone-github-repo-action@v2.3
with:
branch: release-1.6
owner: openkruise
repository: kruise
- name: Install Kruise
run: |
set -ex
cd kruise
export KRUISE_IMG="openkruise/kruise-manager:v1.6"
docker build -t ${KRUISE_IMG} .
kind load docker-image --name=${KIND_CLUSTER_NAME} ${KRUISE_IMG}
make deploy
NODES=$(kubectl get node | wc -l)
for ((i=1;i<10;i++));
do
set +e
PODS=$(kubectl get pod -n kruise-system | grep '1/1' | wc -l)
set -e
if [ "$PODS" -eq "$NODES" ]; then
break
fi
sleep 3
done
set +e
PODS=$(kubectl get pod -n kruise-system | grep '1/1' | wc -l)
kubectl get node -o yaml
kubectl get all -n kruise-system -o yaml
kubectl get pod -n kruise-system --no-headers | grep daemon | awk '{print $1}' | xargs kubectl logs -n kruise-system
kubectl get pod -n kruise-system --no-headers | grep daemon | awk '{print $1}' | xargs kubectl logs -n kruise-system --previous=true
set -e
if [ "$PODS" -eq "$NODES" ]; then
echo "Wait for kruise-manager and kruise-daemon ready successfully"
else
echo "Timeout to wait for kruise-manager and kruise-daemon ready"
exit 1
fi
- name: Install Operating
run: |
set -ex
kubectl cluster-info
make docker-build
make sync-kind-image
make deploy
for ((i=1;i<10;i++));
do
set +e
PODS=$(kubectl get pod -n kusionstack-system | grep -c '1/1')
set -e
if [ "$PODS" -eq 1 ]; then
break
fi
sleep 3
done
set -e
PODS=$(kubectl get pod -n kusionstack-system | grep -c '1/1')
if [ "$PODS" -eq 1 ]; then
echo "Wait for Kusionstack-manager ready successfully"
else
echo "Timeout to wait for Kusionstack-manager ready"
exit 1
fi
- name: Run e2e Tests
run: |
make ginkgo
set -e
KUBECONFIG=/tmp/kind/kubeconfig.yaml ./bin/ginkgo -timeout 10m -v --focus='\[apps\] OperationJob' test/e2e
- name: Check Operating Manager
run: |
restartCount=$(kubectl get pod -n kusionstack-system -l control-plane=controller-manager --no-headers | awk '{print $4}')
if [ "${restartCount}" -eq "0" ];then
echo "Kusionstack-manager has not restarted"
else
kubectl get pod -n kusionstack-system -l control-plane=controller-manager --no-headers
echo "Kusionstack-manager has restarted, abort!!!"
kubectl get pod -n kusionstack-system --no-headers -l control-plane=controller-manager | awk '{print $1}' | xargs kubectl logs -p -n kusionstack-system
exit 1
fi
- name: Check Kruise Manager
run: |
retVal=$?
restartCount=$(kubectl get pod -n kruise-system -l control-plane=controller-manager --no-headers | awk '{print $4}')
if [ "${restartCount}" -eq "0" ];then
echo "$out"
echo "Kruise-manager has not restarted"
else
echo "$out"
echo "Kruise-manager has restarted, abort!!!"
kubectl get pod -n kruise-system --no-headers -l control-plane=controller-manager | awk '{print $1}' | xargs kubectl logs -p -n kruise-system
exit 1
fi
kubectl get pods -n kruise-system -l control-plane=daemon -o=jsonpath="{range .items[*]}{.metadata.namespace}{\"\t\"}{.metadata.name}{\"\n\"}{end}" | while read ns name;
do
restartCount=$(kubectl get pod -n ${ns} ${name} --no-headers | awk '{print $4}')
if [ "${restartCount}" -eq "0" ];then
echo "Kruise-daemon has not restarted"
else
kubectl get pods -n ${ns} -l control-plane=daemon --no-headers
echo "Kruise-daemon has restarted, abort!!!"
kubectl logs -p -n ${ns} ${name}
exit 1
fi
done

if [ "$retVal" -ne 0 ];then
echo "test fail, dump kruise-manager logs"
while read pod; do
kubectl logs -n kruise-system $pod
done < <(kubectl get pods -n kruise-system -l control-plane=controller-manager --no-headers | awk '{print $1}')
echo "test fail, dump kruise-daemon logs"
while read pod; do
kubectl logs -n kruise-system $pod
done < <(kubectl get pods -n kruise-system -l control-plane=daemon --no-headers | awk '{print $1}')
fi
exit $retVal
168 changes: 168 additions & 0 deletions apis/apps/v1alpha1/operationjob_types.go
@@ -0,0 +1,168 @@
/*
Copyright 2024 The KusionStack Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package v1alpha1

import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
OpsActionRestart = "Restart"
OpsActionReplace = "Replace"
)

const (
ReasonPodNotFound = "PodNotFound"
ReasonContainerNotFound = "ContainerNotFound"
ReasonReplacedByNewPod = "ReplacedByNewPod"
)

// OperationProgress indicates operation progress of pod
type OperationProgress string

const (
OperationProgressPending OperationProgress = "Pending"
OperationProgressProcessing OperationProgress = "Processing"
OperationProgressFailed OperationProgress = "Failed"
OperationProgressSucceeded OperationProgress = "Succeeded"
)

// OperationJobSpec defines the desired state of OperationJob
type OperationJobSpec struct {
// Specify the operation actions including: Restart, Replace
// +optional
Action string `json:"action,omitempty"`

// Define the operation target pods
// +optional
Targets []PodOpsTarget `json:"targets,omitempty"`

// Partition controls the operation progress by indicating how many pods should be operated.
// Defaults to nil (all pods will be operated)
// +optional
Partition *int32 `json:"partition,omitempty"`

// OperationDelaySeconds indicates how many seconds to delay before the operation starts.
// +optional
OperationDelaySeconds *int32 `json:"operationDelaySeconds,omitempty"`

// Specify the duration in seconds relative to the startTime
// that the job may be active before the system tries to terminate it
// +optional
ActiveDeadlineSeconds *int32 `json:"activeDeadlineSeconds,omitempty"`

// Limit the lifetime of an operation that has finished execution (either Succeeded or Failed)
// +optional
TTLSecondsAfterFinished *int32 `json:"TTLSecondsAfterFinished,omitempty"`
}

// PodOpsTarget defines the target pods of the OperationJob
type PodOpsTarget struct {
// Name of the target pod
// +optional
Name string `json:"name,omitempty"`

// Specify the containers to restart
// +optional
Containers []string `json:"containers,omitempty"`
}

// OperationJobStatus defines the observed state of OperationJob
type OperationJobStatus struct {
// ObservedGeneration is the most recent generation observed for this OperationJob. It corresponds to the
// OperationJob's generation, which is updated on mutation by the API Server.
// +optional
ObservedGeneration int64 `json:"observedGeneration,omitempty"`

// Progress indicates the overall progress of the OperationJob
// +optional
Progress OperationProgress `json:"progress,omitempty"`

// Operation start time
// +optional
StartTimestamp *metav1.Time `json:"startTimestamp,omitempty"`

// Operation end time
// +optional
EndTimestamp *metav1.Time `json:"endTimestamp,omitempty"`

// Total number of pods involved in the OperationJob
// +optional
TotalPodCount int32 `json:"totalPodCount,omitempty"`

// Number of pods involved in the OperationJob that succeeded
// +optional
SucceededPodCount int32 `json:"succeededPodCount,omitempty"`

// Number of pods involved in the OperationJob that failed
// +optional
FailedPodCount int32 `json:"failedPodCount,omitempty"`

// Operation details of the target pods
// +optional
TargetDetails []OpsStatus `json:"targetDetails,omitempty"`
}

type OpsStatus struct {
// name of the target pod
// +optional
Name string `json:"name,omitempty"`

// operation progress of target pod
// +optional
Progress OperationProgress `json:"progress,omitempty"`

// reason for current operation progress
// +optional
Reason string `json:"reason,omitempty"`

// message displays detail of reason
// +optional
Message string `json:"message,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// +k8s:openapi-gen=true
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// +kubebuilder:resource:shortName=oj
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="PROGRESS",type="string",JSONPath=".status.progress"
// +kubebuilder:printcolumn:name="AGE",type="date",JSONPath=".metadata.creationTimestamp"

// OperationJob is the Schema for the operationjobs API
type OperationJob struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

Spec OperationJobSpec `json:"spec,omitempty"`
Status OperationJobStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// OperationJobList contains a list of OperationJob
type OperationJobList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []OperationJob `json:"items"`
}

func init() {
SchemeBuilder.Register(&OperationJob{}, &OperationJobList{})
}
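
Reviewer sketch, not part of the PR: the following shows how a spec built from these types might serialize, and one possible reading of the Partition comment ("how many pods should be operated"). The structs are local copies of a subset of the fields above so the snippet compiles without the project. `selectTargets` is a hypothetical helper, and the assumption that a partition of N selects the first N targets in list order is not confirmed by the diff shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PodOpsTarget mirrors the PR's type (local copy for illustration only).
type PodOpsTarget struct {
	Name       string   `json:"name,omitempty"`
	Containers []string `json:"containers,omitempty"`
}

// OperationJobSpec mirrors a subset of the PR's spec fields.
type OperationJobSpec struct {
	Action    string         `json:"action,omitempty"`
	Targets   []PodOpsTarget `json:"targets,omitempty"`
	Partition *int32         `json:"partition,omitempty"`
}

// selectTargets is a hypothetical helper: it returns the targets that may be
// operated under the current partition. ASSUMPTION: a nil partition means all
// targets, and a partition of N means the first N targets in list order.
func selectTargets(spec OperationJobSpec) []PodOpsTarget {
	if spec.Partition == nil {
		return spec.Targets
	}
	n := int(*spec.Partition)
	if n < 0 {
		n = 0
	}
	if n > len(spec.Targets) {
		n = len(spec.Targets)
	}
	return spec.Targets[:n]
}

func main() {
	partition := int32(1)
	spec := OperationJobSpec{
		Action: "Restart", // value of OpsActionRestart in the PR
		Targets: []PodOpsTarget{
			{Name: "pod-a", Containers: []string{"app"}},
			{Name: "pod-b"},
		},
		Partition: &partition,
	}

	// Show the JSON form the API server would store for this spec.
	out, err := json.Marshal(spec)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))

	// List the targets allowed to proceed under the assumed semantics.
	for _, t := range selectTargets(spec) {
		fmt.Println("operate:", t.Name)
	}
}
```

Raising the partition step by step (and never lowering it, which the validating webhook commit in this PR denies) would let more targets through on each reconcile under this reading.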