-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAISTRA-980 Introduce the ServiceMeshMember CRD and controller #250
Conversation
type ServiceMeshMemberStatus struct { | ||
ObservedGeneration int64 `json:"observedGeneration,omitempty"` | ||
ServiceMeshGeneration int64 `json:"meshGeneration,omitempty"` | ||
ConfiguredMembers []string `json:"configuredMembers,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want this on the member resources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this a while ago. I just pushed the commits instead of also commenting here. PTAL.
I've now squashed the three commits and rebased to current maistra-1.1 branch. Please take a look. |
verbs: | ||
- use | ||
resourceNames: | ||
- multitenant-install |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be something like minimal-install?
Also, not to expand the scope, but should we look at constraining the name of the smcp to be the namespace within which it's created? Or maybe a fixed name, like default? Seems like this might simplify usage of the member resource. Something for a follow on discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for enforcing 'default' and only referencing a namespace in the Member resource. But that's out of scope here I guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I feel like requiring a specific resource name is an anti-pattern. Also, the usual structure of a reference (e.g. secretRef
) in core k8s objects is: name + namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this to minimal-install.
kind: Role | ||
name: mesh-user | ||
subjects: | ||
- kind: User |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove the users and just keep the group? Maybe add a comment about how to add users using cli?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is an example, it makes sense to show both users and groups are possible (although someone familiar with RBAC should already know this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move the role into the smcp yaml? We can do that as follow on work.
|
||
// ServiceMeshMemberStatus contains the state last used to reconcile the list | ||
type ServiceMeshMemberStatus struct { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have something here? Maybe something as simple as joinTimestamp
and/or reconcileTimestamp
? Basically, something that lets the user know whether or not they are a fully configured member.
Perhaps this could be added by the smmr controller when it updates the namespace containing the smm resource?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, we should also have ways to report errors, if any occur (e.g. control plane disappeared)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was planning to add this later - when the rest of the team starts using this and it becomes clear what users need. The only thing I'd add right now is some kind of status.
return err | ||
} | ||
// watch namespaces and trigger reconcile requests as those that match a member come and go | ||
err = c.Watch(&source.Kind{Type: &corev1.Namespace{}}, &handler.EnqueueRequestsFromMapFunc{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? If the smcp resource doesn't exist, then creation of the smm resource should fail, so it seems like there shouldn't be a need to watch namespaces at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, don't believe we need to watch namespaces; possibly SMCPs though? In case they're deleted and re-created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't fail. As elsewhere in k8s, users should be able to create resources out of order. Preventing the creation of the Member
object before the SMCP
is created would break scenarios where you deploy all resources at once (e.g. SMCP
, Member
and app itself).
} | ||
// MemberRoll doesn't exist, let's create it | ||
isNewMemberRoll = true | ||
memberRoll = &v1.ServiceMeshMemberRoll{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add an owner reference back to the smm resource? Would this facilitate automatic deletion of the smmr resource, once the last smm is deleted?
Do we prevent deletion of the referenced smcp resource? Should we add an owner reference to smcp too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The Member controller takes care of deleting the SMMR
. This allows it to determine if the SMMR should even be deleted. Currently, it deletes it if it was the one who created it. But I'll need to change this, since it would also delete the SMMR in the following scenario (when it really shouldn't):
- user creates Member object
- controller creates SMMR
- user manually adds another namespace to the SMMR
- user deletes Member object
The controller shouldn't delete the SMMR in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work here. Please excuse my wall of text.
TL;DR this will work fine if the user does not make mistakes but I'm afraid they're going to.
The first issue I see is that it's not completely idempotent in that we rely on capturing every event. If we e.g. miss the deletion of a Member object for whatever reason, we could try and add it to another MemberRoll when it's re-created because we don't explicitly check if it is already a member of any mesh. That would then cause errors in the MemberRollValidator. For the user, it looks as if she created a valid ServiceMeshMember object (validation passed) but then nothing happens (except for lots of errors in the logs).
Then there are manual edits of MemberRolls - that's less a problem of implementation but rather of having multiple APIs. Examples:
- If the namespace was manually added to a MemberRoll before, we will just go and try to add it to another one (can be fixed the same way as above)
- Namespaces could be removed from MemberRolls after the fact, so we have dangling Member resources hanging around
- SMCPs can be removed, also leading to dangling Members (potentially solved by ownerRefs?)
So I'm starting to think: should we maybe only support either the MemberRoll or Member API for a given control plane, to avoid these problems?
And we have to make sure to report any errors back to the user in a way that is visible and helps them find a way to recover.
These are all edge cases of course. I think the code as it is will work fine 99% of the time.
if smm.Spec.ControlPlaneRef.Name != oldSmm.Spec.ControlPlaneRef.Name || | ||
smm.Spec.ControlPlaneRef.Namespace != oldSmm.Spec.ControlPlaneRef.Namespace { | ||
logger.Info("Client tried to mutate ServiceMeshMember.spec.controlPlaneRef") | ||
return admission.ErrorResponse(http.StatusBadRequest, fmt.Errorf("Mutation of .spec.controlPlaneRef isn't allowed")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that it simplifies the controller logic, but this could be frustrating - is this standard behaviour when editing refs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving a namespace between meshes is not a thing that most people will want to do. Forcing them to delete the Member and create a new one therefore isn't a big deal. We can always extend the API and allow them to mutate the refs, but it makes sense to make the field immutable at first, and then open it up instead of the other way around. Allowing mutation of this field is problematic, because the controller can't just use the old value of the ref to figure out which SMMR to remove the member from. It would have to list all SMMRs to find the right one. Not a big deal. I can do that in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Moving between meshes should IMO be an epic in itself that we should strive to make easier, because it allows for blue/green deployments of control planes. Anyway, separate discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Moving a namespace between meshes is not a thing that most people will want to do." And yet there's already a JIRA to document that very scenario (https://issues.jboss.org/browse/MAISTRA-859). Just an FYI.
var _ inject.Client = (*memberValidator)(nil) | ||
var _ inject.Decoder = (*memberValidator)(nil) | ||
|
||
func (v *memberValidator) Handle(ctx context.Context, req atypes.Request) atypes.Response { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should check if the namespace is already a member of another mesh, otherwise the MemberController and MemberRollValidator will throw errors back and forth if that's the case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the MemberRollValidator prevents the controller from adding the namespace to the SMMR, the returned validation message will be written to the Member's status. I don't see how the error would be bouncing back and forth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looked like we're just wrapping the error in "Could not update ServiceMeshMemberRoll" and return it, which IIRC triggers another reconciliation? Which will then again throw an error, ... Haven't tested this though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just verified this; it will throw errors forever in the operator log. The scenario I'm testing is:
- namespace a is part of mesh cp (manually added to cp's MemberRoll)
- create a Member resource in namespace a pointing to mesh cp2
- the validating webhook for Member resources passes *
- the Member controller tries to add namespace a to cp's MemberRoll
- validating webhook for MemberRoll resources fails: already member of cp
- Member controller will throw errors forever (exponential backoff means they become rarer over time, but they're there)
I think the most consistent way would be to not accept a Member resource that tries to re-assign a namespace ie checking whether it is already member of another namespace in the validating webhook (see *) - this is the same behaviour as with MemberRolls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wouldn't throw the error forever - only until it can finally successfully add the member to cp2 (when the user manually removes it from cp). That's not really wrong - that's how all k8s controllers should work - they should retry indefinitely.
What is wrong is that the operator logs this as an error. Instead of doing that, it should only emit an event and/or update the Member's status to notify the user of why it can't add the member to the mesh.
While adding a validating webhook would help notify the user immediately, it would also force users to first remove the member from the old cp and only then create the Member object. This would effectively prevent users from doing a kubectl apply -f *.yaml
, where one of the yaml removes the member from the SMMR and the other one creates the Member. That's why I'm not a big fan of these types of validating webhooks. The beauty of Kubernetes is that it doesn't force you to do things in a specific order (initially this was true for virtually everything; now some operations do demand performing them in a specific order - e.g. create CRD first, then do other stuff - but we should do our best to not introduce these types of ordering requirements).
There's one other reason against introducing a validating webhook like this: if the webhook uses a local cache (as it should and as it usually does), it could reject a perfectly valid request simply because its cache is stale (e.g. it still contains a SMMR that contains the member, even though the user has just removed it). To work around this, the webhook would have to bypass the cache and obtain a list of all SMMRs from the API server every time it is invoked. List operations aren't cheap, so this would cause additional strain on the API server.
In any case, we can add the validating webhook later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is wrong is that the operator logs this as an error. Instead of doing that, it should only emit an event and/or update the Member's status to notify the user of why it can't add the member to the mesh.
Agreed. I can live with moving the error message to the status.
return err | ||
} | ||
// watch namespaces and trigger reconcile requests as those that match a member come and go | ||
err = c.Watch(&source.Kind{Type: &corev1.Namespace{}}, &handler.EnqueueRequestsFromMapFunc{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, don't believe we need to watch namespaces; possibly SMCPs though? In case they're deleted and re-created
|
||
// ServiceMeshMemberStatus contains the state last used to reconcile the list | ||
type ServiceMeshMemberStatus struct { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, we should also have ways to report errors, if any occur (e.g. control plane disappeared)
verbs: | ||
- use | ||
resourceNames: | ||
- multitenant-install |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for enforcing 'default' and only referencing a namespace in the Member resource. But that's out of scope here I guess
We can't miss the deletion because of the finalizer. The controller removes the finalizer only after removing the namespace from the SMMR's member list.
I'll add the status.
The MemberRoll validator prevents this.
The periodic resync should take care of this, but I'll replace the namespace watch with a SMMR watch and this should then add the member to the SMMR immediately after someone manually removes it.
There's nothing wrong with a dangling Member. Users should always be able to create resources in any order they want. This implies that they should also be able to delete a referenced resource and then re-create it, without having to delete every other object that refers to it. Also, k8s shouldn't delete resources that a users creates manually. When a user creates a resource, they own it.
No need. We can make everything work even if we allow both at once. Allowing only one would mean that all existing deployments would have to use just the MemberRoll. This would effectively prevent users from switching to the Member mechanism.
Absolutely. I never intended the Member to have an empty status. I just wanted to let others from the team to try this out and see what users really need. |
I've remembered why the controller needs to watch namespaces. If you create a Member that references a SMCP in a nonexistent namespace, the controller needs to create the SMMR as soon as the namespace is created. On that note, I have now also added a watch on SMMRs, to ensure the controller reverts any incompatible changes to the SMMR made by a user (e.g. a user removes a namespace from the SMMR's member list, but the ns still has a Member object). |
Regarding SMMR/SMM interaction, I think we should enforce a single usage pattern: you either use SMMR or you use SMM, but not both. We could enforce this through the validators, e.g. if the user creates SMMR, then creation of SMM should be rejected, and vice versa. We could do this in a follow on PR. |
Regarding namespace watch, won't the validation fail, because the referenced smcp resource is non existent, hence the user creating the smm would be able to |
But that kind of validation would just be complicating things. Right now, the Member controller is just a helper that adds or removes members to/from a SMMR instead of the user doing that manually. The whole member system was designed to be an additive change instead of replacing the MemberRoll - specifically to reduce the chance of breaking things. |
The SMCP doesn't need to exist for a user to have the privilege to use it. This ensures you can create objects out of order. RBAC rules are always completely separate from the (existence of) objects. |
break | ||
} | ||
|
||
unstructuredStatus, err := converter.ToUnstructured(member.Status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ToUnstructured()
part of this commit, or does it already exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's part of apimachinery.
The operator is the only one updating the status of the ServiceMeshMember object. Receiving an update event where only the status has changed shouldn't trigger a new reconcile, otherwise we'll run into race conditions between the status updates and the updates of owned objects.
After deleting the SMMR, it still exists until the MemberRoll controller performs the clean up. In the mean time, the Member controller may be triggered again and will delete the SMMR again. This isn't problematic, since the deletion operation is idempotent, but we do get multiple "Removing ServiceMeshMember from ServiceMeshMemberRoll" statements in the log. This commit fixes this.
The requeueing is necessary, because the Member controller isn't notified when the SMMR is modified (e.g. when the user manually removes the namespace from the other SMMR).
The operator owns the ServiceMeshMember's status (it is the only actor allowed to update it). If the operator encounters a conflict error when doing so, it now fetches a fresh version of the object directly from the API and tries to update it again.
… label concurrently The label update caused the finalizer removal code to run into a conflict. Since we don't requeue on conflicts and count on the watch event to trigger another reconciliation, if the watch predicate returns false, reconciliation won't get invoked until the next periodic resync. The predicate should therefore return true for all events where the new object has the deletion timestamp set.
Message string `json:"message,omitempty"` | ||
} | ||
|
||
// GetCondition removes a condition for the list of conditions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retrieves? This isn't removing anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with this, although I think we should have all the member logic implemented in a single controller. I think that would ease the synchronization issues between the two controllers. I would recommend using the existing member roll logic, but build the required members list off the union between the member roll and the list of member objects.
kind: Role | ||
name: mesh-user | ||
subjects: | ||
- kind: User |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move the role into the smcp yaml? We can do that as follow on work.
I can also imagine that we might want to merge them in the future, to do things such as create/delete Member resources when a MemberRoll is updated- then the two APIs would be fully aligned. But we'll have to see if we need that. |
Yes, merging the two controller may make sense, although I haven't yet seen any other controller operate on two separate CRs. We can try doing that later. The biggest reason for making the Member controller separate was to reduce the impact on existing functionality - the member controller is just an add-on, which doesn't require any changes to the existing SMCP and SMMR controller, ensuring it can't impact existing users at all. |
I guess it depends on how you view the controller: does it manage a specific resource or does it manage a feature? I was also thinking about this last night and it was wondering if there might be any performance implications doing a list across all namespaces for each reconcile. I was wondering if we could tune the logic by using custom events (e.g. NamespaceCreated, NamespaceDeleted, MemberCreated, MemberRollUpdated, etc.). Anyway, food for thought. |
No description provided.