azure-cni may leak IP allocations after failing to ADD them to a pod #214

@PatrickLang

Description
Is this a request for help?: No


Is this an ISSUE or FEATURE REQUEST? (choose one): Issue


Which release version?: master + cherry-pick of #212


Which component (CNI/IPAM/CNM/CNS): CNI


Which Operating System (Linux/Windows): Windows Server version 1803


Which Orchestrator and version (e.g. Kubernetes, Docker): Kubernetes


What happened:

After scaling up a replica set, some containers failed to start. When this happened, their IPs were not freed. Here's an example of the end state after scaling back down: only one pod IP should be in use on the node, but three are marked as in use in the IPAM file.

kubectl get pod -o wide
NAME                           READY     STATUS    RESTARTS   AGE       IP             NODE
psh-5d98ff98b5-qpbjv           1/1       Running   0          18h       10.240.0.141   k8s-linuxpool-13955535-1
whoami-1803-78fd64846f-lq9m7   1/1       Running   0          18h       10.240.0.99    13955k8s9001



# Run on 13955k8s9001
(get-content c:\k\azure-vnet-ipam.json | convertfrom-json).IPAM.AddressSpaces.local.Pools.'10.240.0.0/12'.Addresses


10.240.0.100 : @{ID=; Addr=10.240.0.100; InUse=False}
10.240.0.101 : @{ID=; Addr=10.240.0.101; InUse=False}
10.240.0.102 : @{ID=; Addr=10.240.0.102; InUse=False}
10.240.0.103 : @{ID=; Addr=10.240.0.103; InUse=False}
10.240.0.104 : @{ID=; Addr=10.240.0.104; InUse=False}
10.240.0.105 : @{ID=; Addr=10.240.0.105; InUse=False}
10.240.0.106 : @{ID=; Addr=10.240.0.106; InUse=False}
10.240.0.107 : @{ID=; Addr=10.240.0.107; InUse=False}
10.240.0.108 : @{ID=; Addr=10.240.0.108; InUse=False}
10.240.0.109 : @{ID=; Addr=10.240.0.109; InUse=False}
10.240.0.110 : @{ID=; Addr=10.240.0.110; InUse=False}
10.240.0.111 : @{ID=; Addr=10.240.0.111; InUse=False}
10.240.0.112 : @{ID=; Addr=10.240.0.112; InUse=False}
10.240.0.113 : @{ID=; Addr=10.240.0.113; InUse=True}
10.240.0.114 : @{ID=; Addr=10.240.0.114; InUse=False}
10.240.0.115 : @{ID=; Addr=10.240.0.115; InUse=False}
10.240.0.116 : @{ID=; Addr=10.240.0.116; InUse=False}
10.240.0.117 : @{ID=; Addr=10.240.0.117; InUse=False}
10.240.0.118 : @{ID=; Addr=10.240.0.118; InUse=False}
10.240.0.119 : @{ID=; Addr=10.240.0.119; InUse=False}
10.240.0.120 : @{ID=; Addr=10.240.0.120; InUse=False}
10.240.0.121 : @{ID=; Addr=10.240.0.121; InUse=False}
10.240.0.122 : @{ID=; Addr=10.240.0.122; InUse=False}
10.240.0.123 : @{ID=; Addr=10.240.0.123; InUse=False}
10.240.0.124 : @{ID=; Addr=10.240.0.124; InUse=False}
10.240.0.125 : @{ID=; Addr=10.240.0.125; InUse=True}
10.240.0.126 : @{ID=; Addr=10.240.0.126; InUse=False}
10.240.0.97  : @{ID=; Addr=10.240.0.97; InUse=False}
10.240.0.98  : @{ID=; Addr=10.240.0.98; InUse=False}
10.240.0.99  : @{ID=; Addr=10.240.0.99; InUse=True}
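
The mismatch can be checked programmatically as well. Below is a minimal sketch (not part of azure-cni) that parses the IPAM state file and reports addresses marked `InUse` that no running pod owns; the JSON layout (`IPAM.AddressSpaces.local.Pools.<prefix>.Addresses`, with `ID`/`Addr`/`InUse` fields) is assumed from the PowerShell output above, and the `leaked_candidates` helper name is mine:

```python
import json

def leaked_candidates(path, expected_ips):
    """Return IPAM addresses marked InUse that are not in expected_ips.

    path         -- the azure-cni IPAM state file, e.g. c:\\k\\azure-vnet-ipam.json
    expected_ips -- pod IPs actually scheduled on this node (from `kubectl get pod -o wide`)
    """
    with open(path) as f:
        state = json.load(f)
    # Structure assumed from the output above; adjust if your version differs.
    pools = state["IPAM"]["AddressSpaces"]["local"]["Pools"]
    in_use = [
        entry["Addr"]
        for pool in pools.values()
        for entry in pool["Addresses"].values()
        if entry["InUse"]
    ]
    return sorted(ip for ip in in_use if ip not in expected_ips)
```

For the state shown above, passing `{"10.240.0.99"}` (the only pod on 13955k8s9001) would flag `10.240.0.113` and `10.240.0.125` as leaked.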

What you expected to happen:

No leaks


How to reproduce it (as minimally and precisely as possible):

# cordon all Windows nodes except 1
kubectl apply -f https://raw.githubusercontent.com/PatrickLang/Windows-K8s-Samples/master/HyperVExamples/whoami-1803.yaml
kubectl scale deploy whoami-1803 --replicas=6
# wait some time, not all 6 will start successfully
kubectl scale deploy whoami-1803 --replicas=1

Anything else we need to know:

Found this while testing the fix for #195.
