feat: Watch and Read cilium network policies from static directory path #32599

Open
wants to merge 2 commits into
base: main

Contributor

@tamilmani1989 tamilmani1989 commented May 17, 2024

Cilium reads CNP files from a directory if a path is configured via the static-cnp-path field in the Cilium config. It watches the directory for changes, reads the files, converts each one to a CNP object, and adds it to the policy engine. This allows an admin to configure policy that denies traffic to certain endpoints without the policy showing up as a resource in Kubernetes. This is implemented based on this discussion: #30060 (comment)

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

Fixes: #issue-number

Targeting release 1.16

@tamilmani1989 tamilmani1989 requested review from a team as code owners May 17, 2024 05:11
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 17, 2024
@github-actions github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label May 17, 2024
@tamilmani1989 tamilmani1989 changed the title feat: Configure static cilium network policy feat: Watch and Read cilium network policies from static directory path May 17, 2024
@gandro gandro self-requested a review May 21, 2024 15:27
Member

@gandro gandro left a comment

Thanks for the PR! I think the motivation of the feature makes sense overall, but I have some feedback on the implementation

pkg/policy/k8s/watcher.go (outdated)
}

cnp := &cilium_v2.CiliumNetworkPolicy{}
err = json.Unmarshal(jsonData, cnp)
Member

I also think we should consider what we want to do with the CNP fields which don't make sense when loaded from file, e.g. K8s name and namespace. In particular, if there are two files with the same policy name in the object metadata, then I think the current system is non-deterministic in which policy gets applied.

Instead of parsing in the full cilium_v2.CiliumNetworkPolicy schema, we could consider only reading in api.Rule and overwrite the labels of each rule to contain the filename or something.

Contributor Author

If there are 2 policies with same name, then it would merge both. If the policies are identical, it skips adding the duplicate rule; if they differ, it merges the rules from both policies. Wouldn't namespace be significant if user wants to apply policy for specific namespace.? If we skip name and namespace, then it supports only clusterwide policies.

Member

If there are 2 policies with same name, then it would merge both.

Yes, I think this could be a good solution. This is not yet implemented in the current version, right?

I am still unsure though if we even want these policies to have names. The names don't serve any purpose and I think the file name is a much better identifier (since it is actually unique). Maybe it would be better if we just ignored the name completely.

Wouldn't namespace be significant if user wants to apply policy for specific namespace.?

Ah, I missed this bit. The way this is implemented is that when a rule is read from a CNP, we simply attach the namespace to the endpoint selector when converting it to low-level api.Rules:

if namespace != "" {
	userNamespace, present := r.EndpointSelector.GetMatch(podPrefixLbl)
	if present && !namespacesAreValid(namespace, userNamespace) {
		log.WithFields(logrus.Fields{
			logfields.K8sNamespace:              namespace,
			logfields.CiliumNetworkPolicyName:   name,
			logfields.K8sNamespace + ".illegal": userNamespace,
		}).Warn("CiliumNetworkPolicy contains illegal namespace match in EndpointSelector." +
			" EndpointSelector always applies in namespace of the policy resource, removing illegal namespace match'.")
	}
	retRule.EndpointSelector.AddMatch(podPrefixLbl, namespace)

If we skip name and namespace, then it supports only clusterwide policies.

Sort of, but not necessarily - a clusterwide policy can always be translated into a namespaced policy by modifying the endpoint selector. So reading api.Rules does not take away any capabilities.
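
To illustrate the point about namespacing through the selector, here is a minimal sketch. The `selector` type is a simplified, hypothetical stand-in for Cilium's endpoint selector (a plain key/value match), `scopeToNamespace` is an illustrative helper, and the value of `namespaceLabel` is an assumption about what `podPrefixLbl` expands to.

```go
package main

import "fmt"

// selector is a hypothetical, simplified endpoint selector:
// every key must match the given value.
type selector map[string]string

// namespaceLabel is assumed to correspond to podPrefixLbl in the snippet above.
const namespaceLabel = "k8s:io.kubernetes.pod.namespace"

// scopeToNamespace returns a copy of sel that additionally requires the
// endpoint to live in namespace. An empty namespace leaves the selector
// clusterwide.
func scopeToNamespace(sel selector, namespace string) selector {
	out := make(selector, len(sel)+1)
	for k, v := range sel {
		out[k] = v
	}
	if namespace != "" {
		out[namespaceLabel] = namespace
	}
	return out
}

func main() {
	clusterwide := selector{"app": "db"}
	namespaced := scopeToNamespace(clusterwide, "prod")
	fmt.Println(len(clusterwide), len(namespaced), namespaced[namespaceLabel])
}
```

The input selector is copied rather than mutated, so the same clusterwide rule could be scoped to several namespaces independently.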

But I do see an argument to be made that the CNP/CCNP format might be easier for most Cilium users to understand, since this is how most users interact with policies these days. It does have the downside that the "K8s" name field in the CNP metadata causes some confusion, since it is completely unused.

Contributor

@derailed derailed left a comment

@tamilmani1989 Nice work! Thank you for this PR. Might be out of context here but I have a few concerns as well regarding this feature.

}
}
if event.Op&fsnotify.Remove == fsnotify.Remove {
p.log.WithField("file", event.Name).Debug("CNP file removed from directory")
Contributor

We should move this to a method, e.g. deleteFromPolicyEngine or the like, so we can test the functionality outside the watcher.

Contributor

I agree, this should be moved to its own function. Why has this been marked as resolved?

pkg/policy/k8s/watcher.go (outdated)
@tamilmani1989
Contributor Author

Thanks for reviewing this, @gandro and @derailed. Initially I thought about having a separate watcher for loading from file, but it seemed to duplicate a few APIs and structs (PolicyResourcesWatcher, policyWatcher, newPolicyWatcher, newPolicyResourcesWatcher, WatchK8sPolicyResources, etc.), so I combined it with the k8s policy watcher to keep changes minimal. But I agree with your concern that not differentiating this policy from k8s policy makes debugging difficult. I can separate this out into its own watcher and also set labels on this policy so that cilium-dbg or hubble can identify its source.

@gandro
Member

gandro commented May 28, 2024

Thank you for the feedback! Overall I think one design decision to take is if we want ToServices and CIDRGroupRef to work for policies loaded from file. Those rules refer to resources in K8s - and therefore might not make sense in the context of a policy loaded from disk - it really depends on the usecase. The majority of the logic in pkg/policy/k8s exists to deal with those. If we don't want or need to support those K8s references, then I think the separate watcher can be much simpler.

CIDRGroupRef and ToServices currently don't work for policies imported via the API (cilium policy import), so I think there is precedent for not supporting all fields when a policy is not loaded from K8s.

Cilium reads CNP yaml if `static-cnp-path` is specified in cilium
config. It converts to rules and add those rules to policy engine. This
allows admin to configure policy to not allow traffic to certain secure
infrastructure endpoints from pods running in cloud.

Signed-off-by: tamanoha <tamanoha@microsoft.com>
@tamilmani1989 tamilmani1989 force-pushed the tamilmani/staticPolicies branch 3 times, most recently from fcd8a43 to 02bc933 Compare June 1, 2024 09:23
@tamilmani1989
Contributor Author

tamilmani1989 commented Jun 1, 2024

@gandro @derailed I updated the PR based on your suggestions. I separated out the directory watcher and now start it from a different cell. I also removed CIDRGroupRef and ToServices, which are not relevant for policy loaded from file, and added labels to the rules to differentiate them from policy created via k8s. Please let me know if this approach looks ok.

…g it.

Signed-off-by: tamanoha <tamanoha@microsoft.com>
Member

@dylandreimerink dylandreimerink left a comment

Ideally we want to not add more logic to the daemon initialization. I understand the concern regarding the circular dependency on Daemon. I took a quick look at this and it seems we can move the "policy manager" part into its own type.
I made a draft PR to do just that #32847.

I suggest we merge this PR in its current state, once it merges I can rebase and update this watcher to use the separate policy manager so a lifecycle can be used here.

@joestringer
Member

joestringer commented Jun 4, 2024

@tamilmani1989 please adjust the ```release-note ...``` in the issue description. The text in that part of the description will appear verbatim in the actual release notes after this PR is merged, so the current text there doesn't make sense as a description of this field.

I'll also note that the release freeze for v1.16 is coming up quick, so you'll need to coordinate with sig-policy folks about timing and whether you think it's viable to land in the next week or so.

Additionally, the checkboxes in the issue description are for you to fill out to ensure the PR follows the guidelines for merging. Please check through them and check them off to indicate the status of the PR. Thanks.

@joestringer joestringer added the release-note/major This PR introduces major new functionality to Cilium. label Jun 4, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 4, 2024
@joestringer
Member

/test

Contributor

@learnitall learnitall left a comment

API changes look good to go, but I have a couple of questions regarding the implementation. Thanks!

// Listen for file add, update and delete
for {
select {
case event := <-watcher.Events:
Contributor

Is there a reason why the Rename and Write events are not handled? Not handling these events could cause policy leaks and/or the state in the directory to not match the state in Cilium's policy engine. For example:

  • If a user modifies an existing policy on disk, that change will not be pushed to the policy engine.
  • If a user moves an existing policy on disk to a different directory, that policy will not be removed from the policy engine.
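
A minimal sketch of how all four relevant event kinds could be mapped to policy-engine actions, using the `Has` style of check. The `Op` type below is a local stand-in that mirrors the bitmask layout of `fsnotify.Op` so the example runs without the dependency; `classify` and the `action` values are hypothetical names, not part of this PR.

```go
package main

import "fmt"

// Op is a local stand-in mirroring the bitmask layout of fsnotify.Op.
type Op uint32

const (
	Create Op = 1 << iota
	Write
	Remove
	Rename
	Chmod
)

// Has reports whether o contains the given operation bit.
func (o Op) Has(h Op) bool { return o&h != 0 }

// action is a hypothetical description of what the watcher should do.
type action string

const (
	actionUpsert action = "upsert" // (re)read the file and replace its rules
	actionDelete action = "delete" // remove the rules owned by the file
	actionIgnore action = "ignore"
)

// classify maps an event to an action. Remove and Rename are checked first,
// because after either one the old path no longer holds the policy.
func classify(op Op) action {
	switch {
	case op.Has(Remove), op.Has(Rename):
		return actionDelete
	case op.Has(Create), op.Has(Write):
		return actionUpsert
	default:
		return actionIgnore
	}
}

func main() {
	for _, op := range []Op{Create, Write, Remove, Rename, Chmod} {
		fmt.Printf("%05b -> %s\n", op, classify(op))
	}
}
```

Treating Rename as a delete of the old path relies on the new path, if it is still inside the watched directory, arriving as its own Create event.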

for {
select {
case event := <-watcher.Events:
if !(event.Op&fsnotify.Create == fsnotify.Create ||
Contributor

I think it would be safer and easier to read if event.Has was used instead of event.Op&, as this is the recommended usage in the documentation.

}
reportCNPChangeMetrics(err)
case err := <-watcher.Errors:
p.log.Error("Error:", err)
Contributor

Could we expand on the error handling here to provide more information on the context of this message? Specifically, it would be useful if we could add a prefix like "unknown error from fsnotify while watching policy directory" and even handle specific errors such as ErrEventOverflow.
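
A rough sketch of what such error handling could look like. To keep the example free of external dependencies, `errEventOverflow` is a local stand-in for `fsnotify.ErrEventOverflow`; `describeWatchError`, `errPermission`, and the idea of returning a "resync" hint are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errEventOverflow is a local stand-in for fsnotify.ErrEventOverflow.
var errEventOverflow = errors.New("queue or buffer overflow")

// errPermission is an arbitrary example of any other watcher error.
var errPermission = errors.New("permission denied")

// describeWatchError builds a contextual log message for an error received
// from the watcher and reports whether a full rescan of the directory is
// advisable because events may have been dropped.
func describeWatchError(dir string, err error) (msg string, resync bool) {
	if errors.Is(err, errEventOverflow) {
		return fmt.Sprintf("event overflow while watching policy directory %q, events may have been lost: %v", dir, err), true
	}
	return fmt.Sprintf("unknown error from fsnotify while watching policy directory %q: %v", dir, err), false
}

func main() {
	for _, err := range []error{errEventOverflow, errPermission} {
		msg, resync := describeWatchError("/etc/cilium/policies", err)
		fmt.Printf("resync=%t %s\n", resync, msg)
	}
}
```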

Comment on lines +62 to +63
labels.NewLabel("name", name, labels.LabelSourceDirectory),
labels.NewLabel("path", filePath, labels.LabelSourceDirectory),
Contributor

Kubernetes has restrictions on the characters that can be put into labels, as well as the label's format (see here). The filePath argument needs to undergo input validation to ensure it fits within these guidelines. For example, the forward-slashes in the path should probably be replaced with a different character, such as an underscore, since the forward-slash character has a specific meaning. Additionally, Kubernetes labels are restricted to ASCII characters, but file paths can contain UTF-8 characters.

Member

These are not K8s labels - they are Cilium labels. Slashes should be fine - to my knowledge, they are not exposed to K8s anywhere, so I don't think we need to be overly restrictive with the label format. But I'm not 100% certain where labels on policy are used - the only place I know of is Hubble policy correlation.

Still, it probably would be worth sanitizing any non-UTF8 characters
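
One possible way to do that sanitizing with the standard library is sketched below. `sanitizePathLabel` is a hypothetical helper and the choice of `_` as the replacement is arbitrary; `strings.ToValidUTF8` replaces each run of invalid bytes with a single replacement string.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizePathLabel replaces every run of invalid UTF-8 bytes in path with a
// single underscore, so the result is always valid UTF-8.
func sanitizePathLabel(path string) string {
	return strings.ToValidUTF8(path, "_")
}

func main() {
	raw := "/etc/cilium/policies/deny\xff\xfe.yaml"
	clean := sanitizePathLabel(raw)
	fmt.Println(utf8.ValidString(raw), utf8.ValidString(clean), clean)
}
```

Valid non-ASCII file names pass through unchanged; only byte sequences that are not valid UTF-8 are rewritten.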

Contributor

Ah interesting that's good to know, TIL. Do we have the format of Cilium labels documented somewhere?

Member

@gandro gandro left a comment

Awesome, thank you! I like this version much better.

I have focused mainly on questions around policy ingestion and policy lifetime. I have not yet focused on the details of the file watcher itself, but I see Ryan has looked at that.

I think there are still some questions around the lifecycle of policies and in particular the meaning of the name of a policy

@@ -760,6 +760,12 @@ func newDaemon(ctx context.Context, cleaner *daemonCleanup, params *daemonParams
}
}

if option.Config.StaticCiliumNetworkPolicyPath != "" {
Member

Instead of having a global option, we could move this flag into the directory watcher cell. newDirectoryPolicyResourcesWatcher would then return nil if no path is set, thereby simplifying the code a bit and only requiring the nil check here
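
A small sketch of that constructor shape. The `config` and `directoryWatcher` types and the `newDirectoryWatcher` name are hypothetical simplifications; the real cell wiring is not shown.

```go
package main

import "fmt"

// config is a hypothetical stand-in for the flag owned by the watcher cell.
type config struct {
	StaticCNPPath string
}

// directoryWatcher is a hypothetical, stripped-down watcher.
type directoryWatcher struct {
	dir string
}

// newDirectoryWatcher returns nil when no path is configured, so callers
// only need a single nil check to know whether the feature is enabled.
func newDirectoryWatcher(cfg config) *directoryWatcher {
	if cfg.StaticCNPPath == "" {
		return nil
	}
	return &directoryWatcher{dir: cfg.StaticCNPPath}
}

func main() {
	for _, cfg := range []config{{}, {StaticCNPPath: "/etc/cilium/policies"}} {
		if w := newDirectoryWatcher(cfg); w != nil {
			fmt.Println("watching", w.dir)
		} else {
			fmt.Println("static policy directory disabled")
		}
	}
}
```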

// WatchDirectoryPolicyResources starts watching Cilium Network policy files created under a directory.
func (p *PolicyResourcesWatcher) WatchDirectoryPolicyResources(ctx context.Context, policyManager PolicyManager) {
w := newPolicyWatcher(ctx, policyManager, p)
w.watchDirectory(ctx)
Member

One thing the directory watcher doesn't do (which the K8s watcher does) is block endpoint regeneration until all initial policies have been ingested. The K8s watcher does this via synced.Resources.BlockWaitGroupToSyncResources, which in turn blocks startDaemon here before it calls initRestore:

if params.Clientset.IsEnabled() {
	// Wait only for certain caches, but not all!
	// (Check Daemon.InitK8sSubsystem() for more info)
	<-params.CacheStatus
}
bootstrapStats.k8sInit.End(true)
d.initRestore(restoredEndpoints, params.EndpointRegenerator)

Ensuring that all policies are ingested before initRestore is called is important, as otherwise we run into races where policies are not discovered for the initial endpoint version, thus causing unwanted drops or unwanted allows (both bad).

In my opinion, it doesn't make sense for the directory watcher to use the same synced.Resources mechanism, especially since that one is highly tailored towards K8s resources. Instead, I think we could just block this call here until all files in the directory have been added and before we start the fsnotify watcher. By blocking this call we're blocking the Daemon from continuing bootstrap, which is basically what we want.
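
The blocking initial load described here could be sketched roughly as follows. `loadInitialPolicies` and its `add` callback are hypothetical; the callback stands in for whatever parses a file and hands its rules to the policy engine.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// loadInitialPolicies synchronously calls add for every non-directory entry
// in dir and returns only after all of them have been processed. It reports
// how many files were ingested.
func loadInitialPolicies(dir string, add func(path string) error) (int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, fmt.Errorf("reading policy directory %q: %w", dir, err)
	}
	loaded := 0
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		if err := add(filepath.Join(dir, e.Name())); err != nil {
			return loaded, fmt.Errorf("adding policy %q: %w", e.Name(), err)
		}
		loaded++
	}
	return loaded, nil
}

func main() {
	dir, err := os.MkdirTemp("", "policies")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "deny.yaml"), []byte("{}"), 0o600); err != nil {
		panic(err)
	}

	n, err := loadInitialPolicies(dir, func(path string) error {
		fmt.Println("ingesting", path)
		return nil
	})
	fmt.Println(n, err)
	// Only after this call returns would the caller start the fsnotify
	// watcher and let daemon bootstrap continue.
}
```

Whether a failing file should abort the whole load, as it does in this sketch, or merely be logged and skipped is a design choice the discussion above leaves open.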

Comment on lines +89 to +93
resourceID := ipcacheTypes.NewResourceID(
ipcacheTypes.ResourceKindCNP,
cnp.ObjectMeta.Namespace,
cnp.ObjectMeta.Name,
)
Member

This is incorrect. We must not use the K8s name and namespace for tracking the lifecycle of IPCache entries owned by this policy. This ID is used, for example, when the policy contains a ToCIDR rule that requires us to insert a CIDR prefix into IPCache. If we use the K8s name here, and you have a real K8s policy and a non-K8s policy read from disk with the same name and namespace, both resources would claim ownership of the IPCache CIDR entry, causing conflicting updates for those IPCache entries.

Instead, I would suggest introducing a new ipcacheTypes.ResourceKind here (e.g. ResourceKind("file")), leave the namespace empty and use the filename as the name. Then the IPCache entry is always owned by the file (and thus updated when the file is updated and removed when the file is removed, which is exactly what we want).
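
The ownership problem and the suggested fix can be illustrated with a simplified model. The `resourceID` struct, the `"cnp"` and `"file"` kind strings, and both constructor helpers are hypothetical stand-ins rather than the actual `ipcacheTypes` API.

```go
package main

import "fmt"

// resourceID is a hypothetical stand-in for an IPCache resource identifier
// made up of kind, namespace, and name.
type resourceID struct {
	kind, namespace, name string
}

// k8sPolicyID identifies a policy that came from a K8s CNP object.
func k8sPolicyID(namespace, name string) resourceID {
	return resourceID{kind: "cnp", namespace: namespace, name: name}
}

// filePolicyID identifies a policy by the file it was read from: a separate
// kind, an empty namespace, and the file path as the name.
func filePolicyID(path string) resourceID {
	return resourceID{kind: "file", name: path}
}

func main() {
	// owners maps an IPCache prefix to the set of resources that claim it.
	owners := map[string]map[resourceID]bool{"10.0.0.0/8": {}}
	claim := func(id resourceID) { owners["10.0.0.0/8"][id] = true }

	claim(k8sPolicyID("default", "deny-infra"))
	claim(filePolicyID("/etc/cilium/policies/deny-infra.yaml"))

	// Two distinct owners: removing the file does not drop the K8s claim.
	fmt.Println(len(owners["10.0.0.0/8"]))
}
```

Had both policies been keyed by the K8s name and namespace, the two claims would collapse into a single map entry, which is the conflict described above.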


//update labels
for _, r := range rules {
r.Labels = getLabels(cnpFilePath, cnp)
Member

nit: We could probably call getLabels only once before the loop

if err == nil {
delete(p.fileNameToCnpCache, event.Name)
} else {
p.log.WithField("err", err).Error("Deleting network policy failed")
Member

We probably still want to delete the filename from the map in this case, otherwise it's leaked.
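
A sketch of a delete helper that cannot leak the map entry, factored out so it is testable without a running watcher. The `policyDeleter` interface, the `dirWatcher` type, and `failingEngine` are hypothetical; only the `deleteFromPolicyEngine` name comes from an earlier review comment.

```go
package main

import (
	"errors"
	"fmt"
)

// policyDeleter is a hypothetical, minimal view of the policy engine.
type policyDeleter interface {
	DeleteByFile(path string) error
}

// dirWatcher is a hypothetical, stripped-down directory watcher.
type dirWatcher struct {
	engine policyDeleter
	cache  map[string]struct{} // file name -> known policy
}

// deleteFromPolicyEngine removes the policy for path from the engine. The
// cache entry is dropped unconditionally, so a failed delete cannot leak it.
func (w *dirWatcher) deleteFromPolicyEngine(path string) error {
	defer delete(w.cache, path)
	if err := w.engine.DeleteByFile(path); err != nil {
		return fmt.Errorf("deleting network policy for %q: %w", path, err)
	}
	return nil
}

// failingEngine always fails, to exercise the error path.
type failingEngine struct{}

func (failingEngine) DeleteByFile(string) error {
	return errors.New("engine unavailable")
}

func main() {
	w := &dirWatcher{
		engine: failingEngine{},
		cache:  map[string]struct{}{"/p/deny.yaml": {}},
	}
	err := w.deleteFromPolicyEngine("/p/deny.yaml")
	fmt.Println(err, len(w.cache))
}
```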


// add to policy engine
_, err = p.policyManager.PolicyAdd(rules, &policy.AddOptions{
ReplaceWithLabels: getLabels(cnpFilePath, cnp),
Member

We may actually want to use ReplaceByResource: true here instead (assuming we use the filename as the resourceID as suggested above) and leave ReplaceWithLabels empty. This way, the policy lifetime is tied to the file itself (which I think is what we want).

ReplaceWithLabels is mainly (only?) used when importing a policy from the API, where we don't have a unique resource handle for the API request. But when importing policies from CNP, we also use ReplaceByResource.

fileNameToCnpCache map[string]*types.SlimCNP
}

func (p *policyWatcher) translateToCNPObject(file string) (*cilium_v2.CiliumNetworkPolicy, error) {
Member

If possible, please squash your commits into one, it makes it easier to review what changes are actual changes and which ones are just undoing prior approaches.

Labels
release-note/major This PR introduces major new functionality to Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
6 participants