Move store to bbolt database #251
Signed-off-by: John Howard <jhoward@microsoft.com>
Have attached output/Linux_amd64/cni/azure-vnet and azure-vnet-ipam as built under WSL against this PR.

(Looks like I will have to do some more vendoring work to get it to pass CI and unit tests though. For tomorrow....)
0e0f44f to 0c31f85
Signed-off-by: John Howard <jhoward@microsoft.com>
As per the code comment:
// @jhowardmsft - These following lines are the original non-implementation.
// I have kept these here, as by introducing a real implementation of this
// function (which I have below) has side effects which I have no means
// of being able to verify correct operation, have to leave that to the
// owners of this repo to verify.
Signed-off-by: John Howard <jhoward@microsoft.com>

Move store to bbolt database

This PR is a follow on to Azure#247

@tamilmani1989 @sharmasushant PTAL.

@PatrickLang, @DavidSchott @dineshgovindasamy @madhanrm @jterry75 FYI. @msuiche perhaps you are able to perform more verification on this as well?

As per Azure#247 (comment), while that PR was better, it was far from perfect.

This PR replaces the store entirely and uses a bolt database to store the data. See Azure#247 (comment) Azure#247 (comment)

Patrick gave me access to one of his Windows clusters to perform verification. While there were some errors, none appear attributable to this change.

I was able to scale from 1 to 25, back to 1 and back up again. Hopefully this is finally the end of those lock store-related errors.

It is not, however, the end of all errors during scaling. I will leave that to others to investigate...

I have NOT been able to test this against a linux node - perhaps @tamilmani1989 would be able to do that as per before.

In addition, this PR has a bunch of commits which fix (most) vendoring issues in this repo. There is still more to do there, but again, I will leave that for others to resolve. I had to tackle vendoring to some extent to pull in bbolt.

Finally, there are two other commits in this PR.
- I have put in an implementation of GetLastRebootTime on Windows. As its implementation changes the startup functionality, I have left that effectively stubbed out for someone else to follow through with.
- I hit a SIGSEGV in testing in UpdateSendAndReport. Made that safe.
Here's the 25 pods scaling up-and-down on Patrick's cluster:

```
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE
iis-1803-687cdddf9f-28vrj   1/1     Running   0          10m     10.240.0.10   13833k8s9000   <none>
iis-1803-687cdddf9f-5fjcn   1/1     Running   0          6m33s   10.240.0.30   13833k8s9000   <none>
iis-1803-687cdddf9f-6dk28   1/1     Running   0          10m     10.240.0.31   13833k8s9000   <none>
iis-1803-687cdddf9f-6j8wg   1/1     Running   0          6m33s   10.240.0.11   13833k8s9000   <none>
iis-1803-687cdddf9f-8f5kc   1/1     Running   1          6m33s   10.240.0.14   13833k8s9000   <none>
iis-1803-687cdddf9f-bkd7n   1/1     Running   0          10m     10.240.0.28   13833k8s9000   <none>
iis-1803-687cdddf9f-bth4v   1/1     Running   0          6m33s   10.240.0.23   13833k8s9000   <none>
iis-1803-687cdddf9f-csm2x   1/1     Running   0          10m     10.240.0.5    13833k8s9000   <none>
iis-1803-687cdddf9f-dtvqp   1/1     Running   1          6m33s   10.240.0.9    13833k8s9000   <none>
iis-1803-687cdddf9f-fv9rn   1/1     Running   1          6m33s   10.240.0.20   13833k8s9000   <none>
iis-1803-687cdddf9f-gmzcz   1/1     Running   1          6m33s   10.240.0.12   13833k8s9000   <none>
iis-1803-687cdddf9f-kzmcf   1/1     Running   0          10m     10.240.0.7    13833k8s9000   <none>
iis-1803-687cdddf9f-lltjr   1/1     Running   1          6m33s   10.240.0.13   13833k8s9000   <none>
iis-1803-687cdddf9f-lx2vf   1/1     Running   0          10m     10.240.0.26   13833k8s9000   <none>
iis-1803-687cdddf9f-nn9pp   1/1     Running   1          6m33s   10.240.0.21   13833k8s9000   <none>
iis-1803-687cdddf9f-pjcws   1/1     Running   1          6m33s   10.240.0.22   13833k8s9000   <none>
iis-1803-687cdddf9f-q7hsf   1/1     Running   1          6m33s   10.240.0.33   13833k8s9000   <none>
iis-1803-687cdddf9f-qn5c7   1/1     Running   0          10m     10.240.0.27   13833k8s9000   <none>
iis-1803-687cdddf9f-rt6r5   1/1     Running   1          6m33s   10.240.0.17   13833k8s9000   <none>
iis-1803-687cdddf9f-s2jsb   1/1     Running   0          10m     10.240.0.8    13833k8s9000   <none>
iis-1803-687cdddf9f-sgwb8   1/1     Running   1          6m33s   10.240.0.25   13833k8s9000   <none>
iis-1803-687cdddf9f-x9tpt   1/1     Running   0          10m     10.240.0.29   13833k8s9000   <none>
iis-1803-687cdddf9f-xf6x9   1/1     Running   0          6m33s   10.240.0.24   13833k8s9000   <none>
iis-1803-687cdddf9f-xwfxg   1/1     Running   0          10m     10.240.0.15   13833k8s9000   <none>
iis-1803-687cdddf9f-zf8kv   1/1     Running   1          6m33s   10.240.0.16   13833k8s9000   <none>
azureuser@k8s-master-13833463-0:~/john$ curl http://10.240.0.16
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>IIS Windows Server</title>
<style type="text/css">
<!--
body {
color:#000000;
background-color:#0072C6;
margin:0;
}
margin-left:auto;
margin-right:auto;
text-align:center;
}
a img {
border:none;
}
-->
</style>
</head>
<body>
<div id="container">
<a href="http://go.microsoft.com/fwlink/?linkid=66138&clcid=0x409"><img src="iisstart.png" alt="IIS" width="960" height="600" /></a>
</div>
</body>
</html>
```

Then scaling back down:

```
azureuser@k8s-master-13833463-0:~/john$ kubectl scale deploy iis-1803 --replicas=1
deployment.extensions/iis-1803 scaled
azureuser@k8s-master-13833463-0:~/john$
```

Some time later...

```
azureuser@k8s-master-13833463-0:~/john$ kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE
iis-1803-687cdddf9f-x9tpt   1/1     Running   0          16m   10.240.0.29   13833k8s9000   <none>
```

And scaling back up again

```
azureuser@k8s-master-13833463-0:~/john$ kubectl scale deploy iis-1803 --replicas=25
deployment.extensions/iis-1803 scaled
azureuser@k8s-master-13833463-0:~/john$
```

Some time later...

```
azureuser@k8s-master-13833463-0:~/john$ kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE    IP            NODE           NOMINATED NODE
iis-1803-687cdddf9f-26p6l   1/1     Running   0          7m     10.240.0.11   13833k8s9000   <none>
iis-1803-687cdddf9f-2ktdz   1/1     Running   0          7m     10.240.0.10   13833k8s9000   <none>
iis-1803-687cdddf9f-48ggp   1/1     Running   0          7m     10.240.0.5    13833k8s9000   <none>
iis-1803-687cdddf9f-4gtmb   1/1     Running   0          7m     10.240.0.26   13833k8s9000   <none>
iis-1803-687cdddf9f-4hd72   1/1     Running   0          7m     10.240.0.24   13833k8s9000   <none>
iis-1803-687cdddf9f-4pwsq   1/1     Running   1          7m     10.240.0.33   13833k8s9000   <none>
iis-1803-687cdddf9f-5kw22   1/1     Running   0          7m     10.240.0.9    13833k8s9000   <none>
iis-1803-687cdddf9f-664z7   1/1     Running   1          7m     10.240.0.16   13833k8s9000   <none>
iis-1803-687cdddf9f-8swz7   1/1     Running   1          7m1s   10.240.0.7    13833k8s9000   <none>
iis-1803-687cdddf9f-9h98r   1/1     Running   1          7m     10.240.0.8    13833k8s9000   <none>
iis-1803-687cdddf9f-9h9jd   1/1     Running   1          7m     10.240.0.14   13833k8s9000   <none>
iis-1803-687cdddf9f-lftd7   1/1     Running   1          7m     10.240.0.19   13833k8s9000   <none>
iis-1803-687cdddf9f-m9knq   1/1     Running   1          7m     10.240.0.31   13833k8s9000   <none>
iis-1803-687cdddf9f-mplcc   1/1     Running   1          7m     10.240.0.21   13833k8s9000   <none>
iis-1803-687cdddf9f-p7jn2   1/1     Running   0          7m     10.240.0.20   13833k8s9000   <none>
iis-1803-687cdddf9f-sml2x   1/1     Running   0          7m1s   10.240.0.13   13833k8s9000   <none>
iis-1803-687cdddf9f-tjfws   1/1     Running   0          7m     10.240.0.18   13833k8s9000   <none>
iis-1803-687cdddf9f-vxdl4   1/1     Running   0          7m     10.240.0.15   13833k8s9000   <none>
iis-1803-687cdddf9f-x26vj   1/1     Running   1          7m1s   10.240.0.30   13833k8s9000   <none>
iis-1803-687cdddf9f-x2hll   1/1     Running   1          7m     10.240.0.28   13833k8s9000   <none>
iis-1803-687cdddf9f-x9tpt   1/1     Running   0          24m    10.240.0.29   13833k8s9000   <none>
iis-1803-687cdddf9f-xg5bm   1/1     Running   1          7m     10.240.0.23   13833k8s9000   <none>
iis-1803-687cdddf9f-zkkzm   1/1     Running   1          7m     10.240.0.32   13833k8s9000   <none>
iis-1803-687cdddf9f-zqv69   1/1     Running   0          7m     10.240.0.17   13833k8s9000   <none>
iis-1803-687cdddf9f-zvzn9   1/1     Running   0          7m     10.240.0.27   13833k8s9000   <none>
azureuser@k8s-master-13833463-0:~/john$
```
Fixed and passing CI now.

@sharmasushant can you review or assign this to someone? We're still getting reports of this failing in Kubernetes conformance tests.

@PatrickLang What are the failures? Please share the failing tests so that we can run them ourselves. This is quite a big change and will take us time to review. It will not be merged soon.

@PatrickLang Can someone run binaries from this branch with the conformance tests?

@jhowardmsft I believe we ran conformance tests from your previous PR #247 and it resolved errors of kubelet posting "NotReady" status. We can try this PR next.

@jhowardmsft Thanks for the PR! We'll validate the PR soon for both Linux & Windows.

@saiyan86 Any progress? I'm loath to rebase this if it's just sitting here.

Hi @jhowardmsft
1. Ops Experience: We currently just read the json file, and any json parser can help us look up what is in the state. With this transition, do we need new tools to read state?
2. Changing state: We can currently simply edit the json file to change the state. What tools will we need for changing state once we do the transition?
3. Migration: There are many VMs (especially the ones running multi-tenant workloads) where we have to have a migration story from json to bbolt. This PR currently does not address migration.
Looks like there are a few tools out there, for example https://github.com/br0xen/boltbrowser and https://github.com/nisboo/BoltGUI.
Right, someone else is going to have to tackle that part of this problem.

I don't think we need to deal with migration of existing nodes. AKS-Engine and AKS always deploy new nodes by updating the VM scale set definition. They don't upgrade in place. If there are other use cases where you can't migrate, I think we could put this behind a config flag to use boltdb instead. That would mean we could test the stability and use data to make a final decision.

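The config-flag idea above could look roughly like the following sketch. Everything in it is hypothetical and for illustration only: the `Config` fields, the `"json"`/`"bolt"` values, and the `KeyValueStore` interface are not taken from this repo, and the two store types are empty placeholders rather than real backends.

```go
package main

import "fmt"

// KeyValueStore is a hypothetical, deliberately tiny store interface.
type KeyValueStore interface {
	Name() string
}

// jsonStore and boltStore are placeholders for the two backends.
type jsonStore struct{ path string }

func (jsonStore) Name() string { return "json" }

type boltStore struct{ path string }

func (boltStore) Name() string { return "bolt" }

// Config is a hypothetical plugin configuration.
type Config struct {
	StoreBackend string // "", "json" or "bolt"; empty keeps the current default
	StatePath    string
}

// newStore picks a backend from the config, defaulting to the JSON store so
// that existing deployments keep their behaviour unless they opt in.
func newStore(cfg Config) (KeyValueStore, error) {
	switch cfg.StoreBackend {
	case "", "json":
		return jsonStore{path: cfg.StatePath}, nil
	case "bolt":
		return boltStore{path: cfg.StatePath}, nil
	default:
		return nil, fmt.Errorf("unknown store backend %q", cfg.StoreBackend)
	}
}

func main() {
	for _, backend := range []string{"", "bolt", "sqlite"} {
		s, err := newStore(Config{StoreBackend: backend, StatePath: "state"})
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("backend %q selects the %s store\n", backend, s.Name())
	}
}
```

Defaulting to the existing store is the design choice that makes the flag safe to ship: nodes only change behaviour when they opt in, which is what allows stability data to be gathered before a final decision.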
Closing as I am no longer working on container-related things. This is still something which needs to be addressed as the implementation is not thread/multi-process safe, but I'll leave the container networking folks to take up if they wish.

@lowenna - Thanks for your dedication to cleaning up loose ends :)! |