
Feature/Add HA to backend controller #29

Merged: 10 commits merged into master from feature/add-ha-backend on Jul 28, 2020

Conversation

@slopezz commented on Jul 27, 2020 (Member)

Closes #8

  • Adds a Makefile target to run the operator locally
  • Adds a PodDisruptionBudget for backend-worker and backend-listener (see the PDB sketch after this list)
    • Enabled by default (ansible state present)
    • Configured by default with maxUnavailable: 1
    • The CR field can be a number or a percentage (both cases tested and working)
    • If you want to use minAvailable instead of the default maxUnavailable:
      • If you add it to the CR from the very beginning, the PDB will be created with minAvailable (ignoring the default maxUnavailable)
      • If the PDB was already created with maxUnavailable, the operator task managing the PDB will fail, because the ansible operator executes a patch operation and these 2 fields are mutually exclusive and cannot coexist on the same PDB
      • The same happens the other way around, if you have minAvailable and want to switch to maxUnavailable
      • This race condition is not easy to handle with the limited ansible operator, so it has been documented at the CR level, along with a possible workaround: either use the CR fields or delete the object manually
    • There is a task converting the CR boolean into the internal ansible state "absent" in case you want to disable it (so we guarantee that if you want to ensure it is not created, it won't be created, and if it was already enabled and you disable it, it will be deleted)
  • Adds a HorizontalPodAutoscaler for backend-worker and backend-listener (see the HPA sketch after this list)
    • Enabled by default (ansible state present)
    • Configured by default with:
      • minReplicas: 2
      • maxReplicas: 4
      • resourceName: cpu (only cpu/memory admitted at CRD level)
      • resourceUtilization: 90 (a percentage)
    • If enabled, the replicas CR field is ignored and replicas are no longer managed by the Deployments
    • There is a task converting the CR boolean into the internal ansible state "absent" in case you want to disable it (same guarantee as for the PDB: if you want to ensure it is not created, it won't be created, and if it was already enabled and you disable it, it will be deleted)
  • Adds podAntiAffinity for backend-worker and backend-listener following best practices (a sketch of the affinity block is shown in the testing section below):
    • It is a soft podAntiAffinity (using preferred instead of required)
    • With the highest priority, it tries to use hosts where there is no pod with the specific label
    • With a lower priority, it tries to use hosts from different AWS AZs where there is no pod with the specific label
    • So finally, it will try to balance pods across different AZs and across different hosts
  • Updates Backend CRD validation with the new pdb/hpa fields
  • Adds documentation for both PDB/HPA
  • Adds a backend-listener-internal Service, because:
    • The current backend-listener Service is published via an NLB with proxy-protocol enabled (both 80/443 ports)
    • The Marin3r destination ports for Service 80/443 have proxy-protocol configured (so internal communication directly to the k8s Service, instead of through the public NLB, won't work, because the NLB is the one adding proxy-protocol)
    • backend-listener needs to be accessed internally by, at least, the System component, so we require an extra Service whose marin3r port won't have proxy-protocol enabled
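For reference, this is a minimal sketch of the PDB created by default. It is illustrative only (the resource name and labels are assumptions, not the actual template from the backend role):

```yaml
# Hypothetical default PDB for backend-listener (name/labels are illustrative).
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: backend-listener
spec:
  maxUnavailable: 1          # default; the CR field also accepts a percentage, e.g. "25%"
  selector:
    matchLabels:
      app: backend-listener
# maxUnavailable and minAvailable are mutually exclusive, which is why switching
# between them on an already-created PDB needs the documented workaround.
```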

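And a similar sketch of the default HPA, again illustrative: the target Deployment name is an assumption and autoscaling/v2beta2 syntax is used here.

```yaml
# Hypothetical default HPA for backend-worker (name is illustrative).
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-worker
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu                    # only cpu/memory admitted at CRD level
        target:
          type: Utilization
          averageUtilization: 90     # resourceUtilization: 90 (a percentage)
```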
PodAntiAffinity tests

Regarding podAntiAffinity, I have done several tests to ensure it is working as expected.
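For context, the affinity block looks roughly like this (the weights and the label key are assumptions; the real values live in the deployment templates):

```yaml
# Hypothetical soft podAntiAffinity for backend-listener:
# the higher weight spreads pods across nodes, the lower one across AWS AZs.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname                    # prefer nodes without the label
          labelSelector:
            matchLabels:
              app: backend-listener
      - weight: 99
        podAffinityTerm:
          topologyKey: failure-domain.beta.kubernetes.io/zone    # then prefer other AZs
          labelSelector:
            matchLabels:
              app: backend-listener
```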

These are the worker nodes, 2 per AZ:

ip-10-65-0-91.ec2.internal    Ready     worker    82d       v1.16.2 --> us-east-1a
ip-10-65-1-200.ec2.internal   Ready     worker    132d      v1.16.2 --> us-east-1a
ip-10-65-6-169.ec2.internal   Ready     worker    132d      v1.16.2 --> us-east-1b
ip-10-65-7-70.ec2.internal    Ready     worker    82d       v1.16.2 --> us-east-1b
ip-10-65-9-40.ec2.internal    Ready     worker    132d      v1.16.2 --> us-east-1c
ip-10-65-11-2.ec2.internal    Ready     worker    82d       v1.16.2 --> us-east-1c
  • I have increased, one by one, the number of replicas per deployment, and it has always ensured the expected distribution across AZs and nodes
  • Even when I decreased from 6 to 3 replicas, it still balanced one pod per AZ

Evolution history of AZs used by each deployment depending on the number of replicas

  • Listener:

    • Initial 2 replicas: a/b
    • UP to 3 replicas: a/b/c
    • UP to 4 replicas: a/b/c/b
    • UP to 5 replicas: a/b/c/b/a
    • UP to 6 replicas: a/b/c/b/a/c
    • DOWN to 3 replicas: a/b/c
  • Worker:

    • Initial 2 replicas: c/b
    • UP to 3 replicas: c/b/a
    • UP to 4 replicas: c/b/a/b
    • UP to 5 replicas: c/b/a/b/c
    • UP to 6 replicas: c/b/a/b/c/a
    • DOWN to 3 replicas: a/b/c

@slopezz slopezz requested review from raelga and roivaz July 27, 2020 11:52
@slopezz slopezz self-assigned this Jul 27, 2020
@roivaz left a comment (Member)

Just a minor typo needs fixing. Good job!

roles/backend/tasks/main.yml (review thread, outdated, resolved)
roles/backend/tasks/main.yml (review thread, outdated, resolved)
@slopezz slopezz merged commit f2a8757 into master Jul 28, 2020
@slopezz slopezz deleted the feature/add-ha-backend branch July 28, 2020 08:29
@raelga raelga added kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple sprints to complete. size/M Requires about a day to complete the PR or the issue. labels Sep 30, 2020