Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAISTRA-980 Introduce the ServiceMeshMember CRD and controller #250

Merged
merged 17 commits into from
Dec 23, 2019

Conversation

luksa
Copy link
Contributor

@luksa luksa commented Sep 30, 2019

No description provided.

type ServiceMeshMemberStatus struct {
ObservedGeneration int64 `json:"observedGeneration,omitempty"`
ServiceMeshGeneration int64 `json:"meshGeneration,omitempty"`
ConfiguredMembers []string `json:"configuredMembers,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this on the member resources?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this a while ago. I just pushed the commits instead of also commenting here. PTAL.

pkg/controller/servicemesh/member/controller.go Outdated Show resolved Hide resolved
@luksa
Copy link
Contributor Author

luksa commented Oct 29, 2019

I've now squashed the three commits and rebased to current maistra-1.1 branch. Please take a look.

verbs:
- use
resourceNames:
- multitenant-install
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be something like minimal-install?

Also, not to expand the scope, but should we look at constraining the name of the smcp to be the namespace within which it's created? Or maybe a fixed name, like default? Seems like this might simplify usage of the member resource. Something for a follow on discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for enforcing 'default' and only referencing a namespace in the Member resource. But that's out of scope here I guess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I feel like requiring a specific resource name is an anti-pattern. Also, the usual structure of a reference (e.g. secretRef) in core k8s objects is: name + namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this to minimal-install.

kind: Role
name: mesh-user
subjects:
- kind: User
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the users and just keep the group? Maybe add a comment about how to add users using cli?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an example, it makes sense to show both users and groups are possible (although someone familiar with RBAC should already know this).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move the role into the smcp yaml? We can do that as follow on work.


// ServiceMeshMemberStatus contains the state last used to reconcile the list
type ServiceMeshMemberStatus struct {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have something here? Maybe something as simple as joinTimestamp and/or reconcileTimestamp? Basically, something that lets the user know whether or not they are a fully configured member.

Perhaps this could be added by the smmr controller when it updates the namespace containing the smm resource?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, we should also have ways to report errors, if any occur (e.g. control plane disappeared)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning to add this later - when the rest of the team starts using this and it becomes clear what users need. The only thing I'd add right now is some kind of status.

return err
}
// watch namespaces and trigger reconcile requests as those that match a member come and go
err = c.Watch(&source.Kind{Type: &corev1.Namespace{}}, &handler.EnqueueRequestsFromMapFunc{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? If the smcp resource doesn't exist, then creation of the smm resource should fail, so it seems like there shouldn't be a need to watch namespaces at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, don't believe we need to watch namespaces; possibly SMCPs though? In case they're deleted and re-created

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't fail. As elsewhere in k8s, users should be able to create resources out of order. Preventing the creation of the Member object before the SMCP is created would break scenarios where you deploy all resources at once (e.g. SMCP, Member and app itself).

}
// MemberRoll doesn't exist, let's create it
isNewMemberRoll = true
memberRoll = &v1.ServiceMeshMemberRoll{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add an owner reference back to the smm resource? Would this facilitate automatic deletion of the smmr resource, once the last smm is deleted?

Do we prevent deletion of the referenced smcp resource? Should we add an owner reference to smcp too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The Member controller takes care of deleting the SMMR. This allows it to determine if the SMMR should even be deleted. Currently, it deletes it if it was the one who created it. But I'll need to change this, since it would also delete the SMMR in the following scenario (when it really shouldn't):

  • user creates Member object
  • controller creates SMMR
  • user manually adds another namespace to the SMMR
  • user deletes Member object

The controller shouldn't delete the SMMR in this case.

Copy link
Contributor

@dgn dgn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work here. Please excuse my wall of text.

TL;DR this will work fine if the user does not make mistakes but I'm afraid they're going to.

The first issue I see is that it's not completely idempotent in that we rely on capturing every event. If we e.g. miss the deletion of a Member object for whatever reason, we could try and add it to another MemberRoll when it's re-created because we don't explicitly check if it is already a member of any mesh. That would then cause errors in the MemberRollValidator. For the user, it looks as if she created a valid ServiceMeshMember object (validation passed) but then nothing happens (except for lots of errors in the logs).

Then there are manual edits of MemberRolls - that's less a problem of implementation but rather of having multiple APIs. Examples:

  • If the namespace was manually added to a MemberRoll before, we will just go and try to add it to another one (can be fixed the same way as above)
  • Namespaces could be removed from MemberRolls after the fact, so we have dangling Member resources hanging around
  • SMCPs can be removed, also leading to dangling Members (potentially solved by ownerRefs?)

So I'm starting to think: should we maybe only support either the MemberRoll or Member API for a given control plane, to avoid these problems?

And we have to make sure to report any errors back to the user in a way that is visible and helps them find a way to recover.

These are all edge cases of course. I think the code as it is will work fine 99% of the time.

if smm.Spec.ControlPlaneRef.Name != oldSmm.Spec.ControlPlaneRef.Name ||
smm.Spec.ControlPlaneRef.Namespace != oldSmm.Spec.ControlPlaneRef.Namespace {
logger.Info("Client tried to mutate ServiceMeshMember.spec.controlPlaneRef")
return admission.ErrorResponse(http.StatusBadRequest, fmt.Errorf("Mutation of .spec.controlPlaneRef isn't allowed"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that it simplifies the controller logic, but this could be frustrating - is this standard behaviour when editing refs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving a namespace between meshes is not a thing that most people will want to do. Forcing them to delete the Member and create a new one therefore isn't a big deal. We can always extend the API and allow them to mutate the refs, but it makes sense to make the field immutable at first, and then open it up instead of the other way around. Allowing mutation of this field is problematic, because the controller can't just use the old value of the ref to figure out which SMMR to remove the member from. It would have to list all SMMRs to find the right one. Not a big deal. I can do that in a follow-up PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Moving between meshes should IMO be an epic in itself that we should strive to make easier, because it allows for blue/green deployments of control planes. Anyway, separate discussion.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Moving a namespace between meshes is not a thing that most people will want to do." And yet there's already a JIRA to document that very scenario (https://issues.jboss.org/browse/MAISTRA-859). Just an FYI.

var _ inject.Client = (*memberValidator)(nil)
var _ inject.Decoder = (*memberValidator)(nil)

func (v *memberValidator) Handle(ctx context.Context, req atypes.Request) atypes.Response {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should check if the namespace is already a member of another mesh, otherwise the MemberController and MemberRollValidator will throw errors back and forth if that's the case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the MemberRollValidator prevents the controller from adding the namespace to the SMMR, the returned validation message will be written to the Member's status. I don't see how the error would be bouncing back and forth.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looked like we're just wrapping the error in "Could not update ServiceMeshMemberRoll" and return it, which IIRC triggers another reconciliation? Which will then again throw an error, ... Haven't tested this though

Copy link
Contributor

@dgn dgn Dec 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just verified this; it will throw errors forever in the operator log. The scenario I'm testing is:

  • namespace a is part of mesh cp (manually added to cp's MemberRoll)
  • create a Member resource in namespace a pointing to mesh cp2
  • the validating webhook for Member resources passes *
  • the Member controller tries to add namespace a to cp's MemberRoll
  • validating webhook for MemberRoll resources fails: already member of cp
  • Member controller will throw errors forever (exponential backoff means they become rarer over time, but they're there)

I think the most consistent way would be to not accept a Member resource that tries to re-assign a namespace ie checking whether it is already member of another namespace in the validating webhook (see *) - this is the same behaviour as with MemberRolls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't throw the error forever - only until it can finally successfully add the member to cp2 (when the user manually removes it from cp). That's not really wrong - that's how all k8s controllers should work - they should retry indefinitely.

What is wrong is that the operator logs this as an error. Instead of doing that, it should only emit an event and/or update the Member's status to notify the user of why it can't add the member to the mesh.

While adding a validating webhook would help notify the user immediately, it would also force users to first remove the member from the old cp and only then create the Member object. This would effectively prevent users from doing a kubectl apply -f *.yaml, where one of the yaml removes the member from the SMMR and the other one creates the Member. That's why I'm not a big fan of these types of validating webhooks. The beauty of Kubernetes is that it doesn't force you to do things in a specific order (initially this was true for virtually everything; now some operations do demand performing them in a specific order - e.g. create CRD first, then do other stuff - but we should do our best to not introduce these types of ordering requirements).

There's one other reason against introducing a validating webhook like this: if the webhook uses a local cache (as it should and as it usually does), it could reject a perfectly valid request simply because its cache is stale (e.g. it still contains a SMMR that contains the member, even though the user has just removed it). To work around this, the webhook would have to bypass the cache and obtain a list of all SMMRs from the API server every time it is invoked. List operations aren't cheap, so this would cause additional strain on the API server.

In any case, we can add the validating webhook later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is wrong is that the operator logs this as an error. Instead of doing that, it should only emit an event and/or update the Member's status to notify the user of why it can't add the member to the mesh.

Agreed. I can live with moving the error message to the status.

return err
}
// watch namespaces and trigger reconcile requests as those that match a member come and go
err = c.Watch(&source.Kind{Type: &corev1.Namespace{}}, &handler.EnqueueRequestsFromMapFunc{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, don't believe we need to watch namespaces; possibly SMCPs though? In case they're deleted and re-created


// ServiceMeshMemberStatus contains the state last used to reconcile the list
type ServiceMeshMemberStatus struct {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, we should also have ways to report errors, if any occur (e.g. control plane disappeared)

verbs:
- use
resourceNames:
- multitenant-install
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for enforcing 'default' and only referencing a namespace in the Member resource. But that's out of scope here I guess

@luksa
Copy link
Contributor Author

luksa commented Oct 30, 2019

The first issue I see is that it's not completely idempotent in that we rely on capturing every event. If we e.g. miss the deletion of a Member object for whatever reason, we could try and add it to another MemberRoll when it's re-created because we don't explicitly check if it is already a member of any mesh.

We can't miss the deletion because of the finalizer. The controller removes the finalizer only after removing the namespace from the SMMR's member list.

That would then cause errors in the MemberRollValidator. For the user, it looks as if she created a valid ServiceMeshMember object (validation passed) but then nothing happens (except for lots of errors in the logs).

I'll add the status.

Then there are manual edits of MemberRolls - that's less a problem of implementation but rather of having multiple APIs. Examples:

  • If the namespace was manually added to a MemberRoll before, we will just go and try to add it to another one (can be fixed the same way as above)

The MemberRoll validator prevents this.

  • Namespaces could be removed from MemberRolls after the fact, so we have dangling Member resources hanging around

The periodic resync should take care of this, but I'll replace the namespace watch with a SMMR watch and this should then add the member to the SMMR immediately after someone manually removes it.

  • SMCPs can be removed, also leading to dangling Members (potentially solved by ownerRefs?)

There's nothing wrong with a dangling Member. Users should always be able to create resources in any order they want. This implies that they should also be able to delete a referenced resource and then re-create it, without having to delete every other object that refers to it. Also, k8s shouldn't delete resources that a users creates manually. When a user creates a resource, they own it.

So I'm starting to think: should we maybe only support either the MemberRoll or Member API for a given control plane, to avoid these problems?

No need. We can make everything work even if we allow both at once. Allowing only one would mean that all existing deployments would have to use just the MemberRoll. This would effectively prevent users from switching to the Member mechanism.

And we have to make sure to report any errors back to the user in a way that is visible and helps them find a way to recover.

Absolutely. I never intended the Member to have an empty status. I just wanted to let others from the team to try this out and see what users really need.

@luksa
Copy link
Contributor Author

luksa commented Oct 30, 2019

I've remembered why the controller needs to watch namespaces. If you create a Member that references a SMCP in a nonexistent namespace, the controller needs to create the SMMR as soon as the namespace is created.

On that note, I have now also added a watch on SMMRs, to ensure the controller reverts any incompatible changes to the SMMR made by a user (e.g. a user removes a namespace from the SMMR's member list, but the ns still has a Member object).

@rcernich
Copy link
Contributor

Regarding SMMR/SMM interaction, I think we should enforce a single usage pattern: you either use SMMR or you use SMM, but not both. We could enforce this through the validators, e.g. if the user creates SMMR, then creation of SMM should be rejected, and vice versa. We could do this in a follow on PR.

@rcernich
Copy link
Contributor

Regarding namespace watch, won't the validation fail, because the referenced smcp resource is non existent, hence the user creating the smm would be able to use it? (Although, I guess cluster-admin could create it regardless, so maybe just leave it to be safe.)

@luksa
Copy link
Contributor Author

luksa commented Nov 4, 2019

Regarding SMMR/SMM interaction, I think we should enforce a single usage pattern: you either use SMMR or you use SMM, but not both. We could enforce this through the validators, e.g. if the user creates SMMR, then creation of SMM should be rejected, and vice versa. We could do this in a follow on PR.

But that kind of validation would just be complicating things. Right now, the Member controller is just a helper that adds or removes members to/from a SMMR instead of the user doing that manually. The whole member system was designed to be an additive change instead of replacing the MemberRoll - specifically to reduce the chance of breaking things.

@luksa
Copy link
Contributor Author

luksa commented Nov 4, 2019

Regarding namespace watch, won't the validation fail, because the referenced smcp resource is non existent, hence the user creating the smm would be able to use it? (Although, I guess cluster-admin could create it regardless, so maybe just leave it to be safe.)

The SMCP doesn't need to exist for a user to have the privilege to use it. This ensures you can create objects out of order. RBAC rules are always completely separate from the (existence of) objects.

break
}

unstructuredStatus, err := converter.ToUnstructured(member.Status)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ToUnstructured() part of this commit, or does it already exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's part of apimachinery.

The operator is the only one updating the status of the ServiceMeshMember object.
Receiving an update event where only the status has changed shouldn't trigger a
new reconcile, otherwise we'll run into race conditions between the status
updates and the updates of owned objects.
After deleting the SMMR, it still exists until the MemberRoll controller
performs the clean up. In the mean time, the Member controller may be
triggered again and will delete the SMMR again. This isn't problematic,
since the deletion operation is idempotent, but we do get multiple
"Removing ServiceMeshMember from ServiceMeshMemberRoll" statements in the
log. This commit fixes this.
The requeueing is necessary, because the Member controller isn't
notified when the SMMR is modified (e.g. when the user manually
removes the namespace from the other SMMR).
The operator owns the ServiceMeshMember's status (it is the only actor
allowed to update it). If the operator encounters a conflict error when
doing so, it now fetches a fresh version of the object directly from
the API and tries to update it again.
… label concurrently

The label update caused the finalizer removal code to run into a conflict. Since we don't requeue on conflicts and count on the watch event to trigger another reconciliation, if the watch predicate returns false, reconciliation won't get invoked until the next periodic resync. The predicate should therefore return true for all events where the new object has the deletion timestamp set.
Message string `json:"message,omitempty"`
}

// GetCondition removes a condition for the list of conditions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrieves? This isn't removing anything.

Copy link
Contributor

@rcernich rcernich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with this, although I think we should have all the member logic implemented in a single controller. I think that would ease the synchronization issues between the two controllers. I would recommend using the existing member roll logic, but build the required members list off the union between the member roll and the list of member objects.

kind: Role
name: mesh-user
subjects:
- kind: User
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move the role into the smcp yaml? We can do that as follow on work.

@dgn
Copy link
Contributor

dgn commented Dec 20, 2019

I'm okay with this, although I think we should have all the member logic implemented in a single controller. I think that would ease the synchronization issues between the two controllers. I would recommend using the existing member roll logic, but build the required members list off the union between the member roll and the list of member objects.

I can also imagine that we might want to merge them in the future, to do things such as create/delete Member resources when a MemberRoll is updated- then the two APIs would be fully aligned. But we'll have to see if we need that.

@luksa
Copy link
Contributor Author

luksa commented Dec 20, 2019

I'm okay with this, although I think we should have all the member logic implemented in a single controller. I think that would ease the synchronization issues between the two controllers. I would recommend using the existing member roll logic, but build the required members list off the union between the member roll and the list of member objects.

I can also imagine that we might want to merge them in the future, to do things such as create/delete Member resources when a MemberRoll is updated- then the two APIs would be fully aligned. But we'll have to see if we need that.

Yes, merging the two controller may make sense, although I haven't yet seen any other controller operate on two separate CRs. We can try doing that later. The biggest reason for making the Member controller separate was to reduce the impact on existing functionality - the member controller is just an add-on, which doesn't require any changes to the existing SMCP and SMMR controller, ensuring it can't impact existing users at all.

@rcernich
Copy link
Contributor

I guess it depends on how you view the controller: does it manage a specific resource or does it manage a feature?

I was also thinking about this last night and it was wondering if there might be any performance implications doing a list across all namespaces for each reconcile. I was wondering if we could tune the logic by using custom events (e.g. NamespaceCreated, NamespaceDeleted, MemberCreated, MemberRollUpdated, etc.). Anyway, food for thought.

@mergify mergify bot merged commit 6ce56f2 into maistra:maistra-1.1 Dec 23, 2019
@luksa luksa deleted the member-crd branch December 23, 2019 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants