Contributor

@lowenna lowenna commented Sep 20, 2018

Signed-off-by: John Howard jhoward@microsoft.com

Fixes #242

  • First, the store timeout is woefully low. Bumped to 20 seconds from 2 seconds.
    This appears to fix #242 ("[cni] Timed out on locking store, err:Store is locked.") (comment).
    IMO, as only test code calls it non-blocking, why even have a block parameter to Lock()?
    IMO also, why a timeout at all? They're always fraught with error and dependent on machine timing.

  • Presence of a key should be checked using raw, ok := hvs.data[key], not the current nil check.

  • ErrKeyNotFound should be returned if the store file does not exist. It shouldn't ignore that error.

  • Now correctly reports whether a timeout occurred, and likewise a non-blocking lock attempt when the store is already locked.

  • Fixed the lock file not always being closed.

  • Some Go correctness fixes (error strings should be lower case).

  • go build ./... actually passes on Windows now; there were various compile errors previously.

  • Go pattern conformance: if err := <test>; err != nil {...}

  • Take the mutex in GetModificationTime. It was not thread safe!

  • Simplified timeout duration (no need for time.Duration(...))
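
To illustrate the key-presence bullet above, here is a minimal Go sketch. The map type, the `lookup` helper and `errKeyNotFound` are stand-ins invented for this example, not the store package's real definitions; the only thing taken from the PR is the `raw, ok := data[key]` form. It shows why a nil check alone cannot tell "key missing" from "key present with a nil value".

```go
package main

import (
	"errors"
	"fmt"
)

// errKeyNotFound is a hypothetical stand-in for the store's ErrKeyNotFound.
var errKeyNotFound = errors.New("key not found")

// lookup uses the comma-ok form, so presence is decided by the map itself
// rather than by whether the stored value happens to be nil.
func lookup(data map[string][]byte, key string) ([]byte, error) {
	raw, ok := data[key]
	if !ok {
		return nil, errKeyNotFound
	}
	return raw, nil
}

func main() {
	// The value type here is an assumption made for the example.
	data := map[string][]byte{"present-but-nil": nil}

	// A nil check on the value would wrongly treat this key as missing.
	_, err := lookup(data, "present-but-nil")
	fmt.Println(err == nil)

	_, err = lookup(data, "missing")
	fmt.Println(errors.Is(err, errKeyNotFound))
}
```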

@lowenna
Contributor Author

lowenna commented Sep 20, 2018

@sharmasushant PTAL. @jterry75 FYI - a once over would be appreciated.

From @PatrickLang via email:

That change definitely seems to be scaling better. Can you open a PR? Before I would typically see a gap of a few minutes where no pods could start, then one more every few minutes. There were a lot of retries on the lock error, and I haven’t noticed any yet.

kubectl get rs -w
NAME DESIRED CURRENT READY AGE
iis-1803-8b7fdd569 20 20 2 20m
iis-1803-8b7fdd569 20 20 3 21m
iis-1803-8b7fdd569 20 20 4 21m
iis-1803-8b7fdd569 20 20 5 22m
iis-1803-8b7fdd569 20 20 6 22m
iis-1803-8b7fdd569 20 20 7 22m
iis-1803-8b7fdd569 20 20 8 22m
iis-1803-8b7fdd569 20 20 9 22m
iis-1803-8b7fdd569 20 20 10 22m
iis-1803-8b7fdd569 20 20 11 22m
iis-1803-8b7fdd569 20 20 12 22m
iis-1803-8b7fdd569 20 20 13 23m
iis-1803-8b7fdd569 20 20 14 23m
iis-1803-8b7fdd569 20 20 15 23m
iis-1803-8b7fdd569 20 20 16 23m

This is by no means a full test pass, but it looks better than before.

@lowenna
Contributor Author

lowenna commented Sep 20, 2018

@sharmasushant A question, since I don't know the architecture from which the store is invoked: is it called from multiple external processes simultaneously, or from different threads of a single process, with each process/thread (goroutine) accessing the -same- store file? If multiple -processes- are accessing the same store, then this, while better, is still not safe. I strongly recommend someone move this across to bbolt regardless, as that's both multi-thread and multi-process safe.

@sharmasushant
Contributor

@jhowardmsft It is accessed across different processes (different invocations of the CNI binary). Sure, we can look into bbolt and evaluate it. Thanks a lot for the suggestion.

plugin.Store, err = store.NewJsonFileStore(platform.CNIRuntimePath + plugin.Name + ".json")
if err != nil {
- log.Printf("[cni] Failed to create store, err:%v.", err)
+ log.Printf("[cni] Failed to create store: %v.", err)


@jhowardmsft - I know it's not your code, but log messages typically have no trailing punctuation in other Go projects.


Also. Typically the "[cni]" portion would be done via a logrus context or equivalent. (Again just FYI)

Contributor Author

Yeah, but I'm not changing the world here :)

// Copyright 2017 Microsoft. All rights reserved.
// MIT License

// +build linux


Does this work? I thought it had to be above header?

Contributor Author

Certainly does in 1.11

Contributor Author

...that lists the conditions under which a file should be included in the package. Constraints may appear in any kind of source file (not just Go), but they must appear near the top of the file, preceded only by blank lines and other line comments. These rules mean that in Go files a build constraint must appear before the package clause.

https://golang.org/pkg/go/build/

// Copyright 2017 Microsoft. All rights reserved.
// MIT License

// +build linux


Should we rename? endpoint_linux.go specifically implies to me "// +build linux" by convention.


Oh, you are just doing this because it's already included by default via the naming?

Contributor Author

I was just being consistent with what's there.


if !block || i == lockMaxRetries {
return ErrStoreLocked
if !block {


Again not your code. But a better pattern would be to have:

func TryLock() bool -> Non-Blocking return value indicates wasLocked
func Lock() -> Blocking
func Unlock() -> Same for both
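
As a rough illustration of the three-method shape suggested above, the sketch below uses a one-slot channel as an in-process lock. It is not the store's file-based lock and offers no cross-process safety; the type name `chanLock` and the choice that `TryLock` returns true when the lock was acquired are assumptions made for the example.

```go
package main

import "fmt"

// chanLock is a hypothetical in-process lock backed by a one-slot channel.
type chanLock struct {
	held chan struct{}
}

func newChanLock() *chanLock {
	return &chanLock{held: make(chan struct{}, 1)}
}

// TryLock never blocks; it reports whether the lock was acquired.
func (l *chanLock) TryLock() bool {
	select {
	case l.held <- struct{}{}:
		return true
	default:
		return false
	}
}

// Lock blocks until the lock is acquired.
func (l *chanLock) Lock() {
	l.held <- struct{}{}
}

// Unlock releases the lock, however it was acquired.
func (l *chanLock) Unlock() {
	select {
	case <-l.held:
	default:
		panic("unlock of unlocked chanLock")
	}
}

func main() {
	l := newChanLock()
	fmt.Println(l.TryLock()) // true: the lock was free
	fmt.Println(l.TryLock()) // false: already held, returns immediately
	l.Unlock()

	l.Lock() // the blocking variant needs no flag
	l.Unlock()
}
```

Splitting the API this way removes the boolean block parameter, so callers state their intent through the method they call.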

@jterry75 jterry75 left a comment

LGTM. Minor comments and improvements unrelated to change.

@lowenna
Contributor Author

lowenna commented Sep 22, 2018

ping @sharmasushant Can you review please? This seems better than it was. I'm working on a larger fix to entirely remove the store, replacing it with bolt. Almost have it ready for submission. @DavidSchott FYI

@PatrickLang
Contributor

ping @sharmasushant - this seems more stable. Can we merge so it's not blocking other improvements?

@sharmasushant
Contributor

@tamilmani1989 Can you please look?

@PatrickLang
Contributor

@tamilmani1989 - can you help here? We have a lot of people working on Windows container stability over the next few months. The faster we can get changes in, the better.

@tamilmani1989
Member

@PatrickLang Ya sure. I will do it in a day or two.

Member

@tamilmani1989 tamilmani1989 left a comment

Did we test these changes on Linux and make sure that it's not breaking anything?


// Maximum number of retries before failing a lock call.
- lockMaxRetries = 20
+ lockMaxRetries = 200
Member

Is there any reason for changing this to 200?

Contributor Author

The timeout (2s) was woefully small. 20 seconds is still reasonably low. Arguably, there should be no need whatsoever for a timeout in a correctly operating system. In fact, in a subsequent change I have which removes the store entirely and moves this to boltdb, there is no timeout.
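
For reference, the arithmetic behind those two figures can be sketched as below. The 100 ms value for `lockRetryDelay` is an assumption inferred from "20 retries gave 2 seconds"; it is not quoted anywhere in this thread, and `worstCaseWait` is a helper invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Assumed delay between lock attempts (inferred, not taken from the diff).
	lockRetryDelay = 100 * time.Millisecond

	oldLockMaxRetries = 20
	newLockMaxRetries = 200
)

// worstCaseWait is the longest a blocking lock call sleeps before giving up.
func worstCaseWait(retries int) time.Duration {
	return time.Duration(retries) * lockRetryDelay
}

func main() {
	fmt.Println(worstCaseWait(oldLockMaxRetries)) // 2s
	fmt.Println(worstCaseWait(newLockMaxRetries)) // 20s
}
```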

Member

ok

if err != nil {
if os.IsNotExist(err) {
- return nil
+ return ErrKeyNotFound
Member

This doesn't break anything when CNI/CNM is called for the first time on boot-up, when there is no state file, right?

Contributor Author

Right, but it's a question of correctness, not obfuscation. The store package should behave and return errors as expected - a read of something which doesn't exist should return a not-found error.
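
A minimal sketch of that division of responsibility, assuming hypothetical helper names (`readStoreFile`, `loadOrEmpty`) that are not the repo's functions: the store layer reports a missing file as a not-found error, and a caller that wants "first boot means empty state" makes that decision explicitly.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ErrKeyNotFound mirrors the error name used in this PR; the definition
// here is local to the sketch.
var ErrKeyNotFound = errors.New("key not found")

// readStoreFile (hypothetical) surfaces a missing file as ErrKeyNotFound
// instead of silently returning nil.
func readStoreFile(path string) ([]byte, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, ErrKeyNotFound
		}
		return nil, err
	}
	return b, nil
}

// loadOrEmpty (hypothetical) is a caller that treats "no state file yet",
// for example the first call after boot, as empty state.
func loadOrEmpty(path string) ([]byte, error) {
	b, err := readStoreFile(path)
	if errors.Is(err, ErrKeyNotFound) {
		return nil, nil
	}
	return b, err
}

func main() {
	const missing = "no-such-dir/no-such-store.json"

	_, err := readStoreFile(missing)
	fmt.Println(errors.Is(err, ErrKeyNotFound))

	state, err := loadOrEmpty(missing)
	fmt.Println(len(state), err)
}
```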

Member

yeah I got it. I'm just making sure we tested that scenario?

return err
time.Sleep(lockRetryDelay)
}
defer lockFile.Close()
Member

I'm not sure how defer works in a loop. Just clarifying: if open is called multiple times, does close get called for each open?

Contributor Author

The defer of close is outside the loop.... It's only called at function exit, and only in the case that the open succeeded.
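
The shape being described can be sketched roughly as follows. This is not the PR's code: the function names, the retry delay, the use of `O_EXCL` to detect an existing lock file, and writing the PID into the file are all assumptions made for the example. The point is only that the single `defer` sits after the loop, so it is registered once, and only when an open has succeeded.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

var errStoreLocked = errors.New("store is locked")

// retryDelay is an assumed value for this sketch.
const retryDelay = 10 * time.Millisecond

// tempLockPath returns a scratch location for the example lock file.
func tempLockPath() string {
	return filepath.Join(os.TempDir(), "store-sketch.lock")
}

// acquireLock retries creating the lock file. Failed opens return no file,
// so there is nothing to close inside the loop.
func acquireLock(path string, maxRetries int) error {
	var lockFile *os.File
	for i := 0; ; i++ {
		f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_RDWR, 0644)
		if err == nil {
			lockFile = f
			break
		}
		if !os.IsExist(err) {
			return err
		}
		if i == maxRetries {
			return errStoreLocked
		}
		time.Sleep(retryDelay)
	}
	// Reached only after a successful open; runs once, at function exit.
	defer lockFile.Close()

	_, err := fmt.Fprintln(lockFile, os.Getpid())
	return err
}

// releaseLock removes the lock file.
func releaseLock(path string) error {
	return os.Remove(path)
}

func main() {
	path := tempLockPath()
	_ = releaseLock(path) // clear any leftover from an earlier run

	fmt.Println(acquireLock(path, 2)) // first attempt creates the file
	fmt.Println(acquireLock(path, 2)) // file exists, so retries run out
	_ = releaseLock(path)
}
```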

Member

sorry! I missed it

@lowenna
Contributor Author

lowenna commented Sep 26, 2018

Did we test these changes on Linux and make sure that it's not breaking anything?

@tamilmani1989 (See internal email too). The only thing I have been able to run is the unit tests, and ask Patrick to do some ad-hoc testing, but that's only on Windows. I would need to rely on you guys to verify this change fully.

@tamilmani1989
Member

@jhowardmsft - We are still working on running tests automatically when a PR is opened against master. For now, we have to do it manually and make sure the change doesn't break anything on either Linux or Windows. So we expect whoever opens a PR to at least do sanity testing and make sure it doesn't break anything.

Basic Tests:

  1. Able to create and delete pods multiple times without error from CNI
  2. Pod to Pod connectivity (same node and pods in different nodes)
  3. Pod to Outbound connectivity

I can help you/Patrick run these tests on Linux.

@lowenna
Contributor Author

lowenna commented Sep 26, 2018

I can help you/Patrick run these tests on Linux.

I would need a LOT of assistance here. As previously mentioned, I know next to nothing about how this is supposed to work, or how to set up or configure an environment. You would have to assume I am starting from scratch. I would also prefer a means to verify in a local VM/VMs rather than requiring Azure VMs to do this testing.

@tamilmani1989
Member

How did you test it for windows then? Didn't you create any kubernetes cluster?

@lowenna
Contributor Author

lowenna commented Sep 26, 2018

How did you test it for windows then? Didn't you create any kubernetes cluster?

Patrick did in Kubernetes - see his previous comment and the internal email. I only ran go test ./..., and I neither have a Kubernetes setup nor would know what to do with one if I did.
The barrier to entry for getting a k8s setup is exceptionally high, and I do not have the time to go through that. (That said, I tried, wasted several hours, and failed.) Hence I have to rely on others who are familiar with the overall code and architecture of this repo, and how it fits into other repos, for validation.

@tamilmani1989
Member

@PatrickLang Have you run these tests on Linux? If not, can you please do so?

@PatrickLang
Contributor

Unfortunately I can't do any more testing on it this week as I'm out at a conference. @jackfrancis or @khenidak can you help out here?

I would also be open to merging this but only producing a Windows binary for now if there isn't enough testing on Linux. ACS-Engine can deploy different versions for Windows and Linux.

@tamilmani1989
Member

@PatrickLang @jhowardmsft If you give me the Linux binaries azure-vnet and azure-vnet-ipam, I will test it when I'm done with my current work.

@sharmasushant
Contributor

sharmasushant commented Sep 26, 2018

@tamilmani1989 Attaching azure-vnet and azure-vnet-ipam binaries for linux

azure-cni-binaries.zip

@tamilmani1989
Member

tamilmani1989 commented Oct 1, 2018

@sharmasushant I have tested it. LGTM.

@lowenna
Contributor Author

lowenna commented Oct 1, 2018

Can we merge this? I'm ready to submit part 2 - moving to a bolt database.

@tamilmani1989 tamilmani1989 merged commit 2aace36 into Azure:master Oct 1, 2018
lowenna pushed a commit to lowenna/azure-container-networking that referenced this pull request Oct 1, 2018
Signed-off-by: John Howard <jhoward@microsoft.com>

Move store to bbolt database

This PR is a follow on to Azure#247

@tamilmani1989 @sharmasushant PTAL.

@PatrickLang, @DavidSchott @dineshgovindasamy @madhanrm @jterry75 FYI. @msuiche perhaps you are able to perform more verification on this as well?

As per Azure#247 (comment), while that PR was better, it was far from perfect.
This PR replaces the store entirely and uses a bolt database to store the data. See Azure#247 (comment) Azure#247 (comment)

Patrick gave me access to one of his Windows clusters to perform verification. While there were some errors, none appear attributable to this change.
I was able to scale from 1 to 25, back to 1 and back up again. Hopefully this is finally the end of those lock store-related errors.
It does not, however, mean that scaling is entirely error-free. I will leave that to others to investigate...

I have NOT been able to test this against a Linux node - perhaps @tamilmani1989 would be able to do that, as before.

In addition, this PR has a bunch of commits which fix (most) vendoring issues in this repo. There is still more to do there, but again, I will leave that for others to resolve. I had to tackle vendoring to some extent to pull in bbolt.

Finally, there are two other commits in this PR.
- I have put in an implementation of GetLastRebootTime on Windows. As its implementation changes the startup functionality, I have left that effectively stubbed out for someone else to follow through with.
- I hit a SIGSEGV in testing in UpdateSendAndReport. Made that safe.

Here's the 25 pods scaling up-and-down on Patrick's cluster:

```
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE
iis-1803-687cdddf9f-28vrj   1/1     Running   0          10m     10.240.0.10   13833k8s9000   <none>
iis-1803-687cdddf9f-5fjcn   1/1     Running   0          6m33s   10.240.0.30   13833k8s9000   <none>
iis-1803-687cdddf9f-6dk28   1/1     Running   0          10m     10.240.0.31   13833k8s9000   <none>
iis-1803-687cdddf9f-6j8wg   1/1     Running   0          6m33s   10.240.0.11   13833k8s9000   <none>
iis-1803-687cdddf9f-8f5kc   1/1     Running   1          6m33s   10.240.0.14   13833k8s9000   <none>
iis-1803-687cdddf9f-bkd7n   1/1     Running   0          10m     10.240.0.28   13833k8s9000   <none>
iis-1803-687cdddf9f-bth4v   1/1     Running   0          6m33s   10.240.0.23   13833k8s9000   <none>
iis-1803-687cdddf9f-csm2x   1/1     Running   0          10m     10.240.0.5    13833k8s9000   <none>
iis-1803-687cdddf9f-dtvqp   1/1     Running   1          6m33s   10.240.0.9    13833k8s9000   <none>
iis-1803-687cdddf9f-fv9rn   1/1     Running   1          6m33s   10.240.0.20   13833k8s9000   <none>
iis-1803-687cdddf9f-gmzcz   1/1     Running   1          6m33s   10.240.0.12   13833k8s9000   <none>
iis-1803-687cdddf9f-kzmcf   1/1     Running   0          10m     10.240.0.7    13833k8s9000   <none>
iis-1803-687cdddf9f-lltjr   1/1     Running   1          6m33s   10.240.0.13   13833k8s9000   <none>
iis-1803-687cdddf9f-lx2vf   1/1     Running   0          10m     10.240.0.26   13833k8s9000   <none>
iis-1803-687cdddf9f-nn9pp   1/1     Running   1          6m33s   10.240.0.21   13833k8s9000   <none>
iis-1803-687cdddf9f-pjcws   1/1     Running   1          6m33s   10.240.0.22   13833k8s9000   <none>
iis-1803-687cdddf9f-q7hsf   1/1     Running   1          6m33s   10.240.0.33   13833k8s9000   <none>
iis-1803-687cdddf9f-qn5c7   1/1     Running   0          10m     10.240.0.27   13833k8s9000   <none>
iis-1803-687cdddf9f-rt6r5   1/1     Running   1          6m33s   10.240.0.17   13833k8s9000   <none>
iis-1803-687cdddf9f-s2jsb   1/1     Running   0          10m     10.240.0.8    13833k8s9000   <none>
iis-1803-687cdddf9f-sgwb8   1/1     Running   1          6m33s   10.240.0.25   13833k8s9000   <none>
iis-1803-687cdddf9f-x9tpt   1/1     Running   0          10m     10.240.0.29   13833k8s9000   <none>
iis-1803-687cdddf9f-xf6x9   1/1     Running   0          6m33s   10.240.0.24   13833k8s9000   <none>
iis-1803-687cdddf9f-xwfxg   1/1     Running   0          10m     10.240.0.15   13833k8s9000   <none>
iis-1803-687cdddf9f-zf8kv   1/1     Running   1          6m33s   10.240.0.16   13833k8s9000   <none>
azureuser@k8s-master-13833463-0:~/john$ curl http://10.240.0.16
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>IIS Windows Server</title>
<style type="text/css">
<!--
body {
        color:#000000;
        background-color:#0072C6;
        margin:0;
}

#container {
        margin-left:auto;
        margin-right:auto;
        text-align:center;
        }

a img {
        border:none;
}

-->
</style>
</head>
<body>
<div id="container">
<a href="http://go.microsoft.com/fwlink/?linkid=66138&amp;clcid=0x409"><img src="iisstart.png" alt="IIS" width="960" height="600" /></a>
</div>
</body>
</html>
```

Then scaling back down:

```
azureuser@k8s-master-13833463-0:~/john$ kubectl scale deploy iis-1803 --replicas=1
deployment.extensions/iis-1803 scaled
azureuser@k8s-master-13833463-0:~/john$
```

Some time later...

```
azureuser@k8s-master-13833463-0:~/john$ kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE
iis-1803-687cdddf9f-x9tpt   1/1     Running   0          16m   10.240.0.29   13833k8s9000   <none>
```

And scaling back up again

```
azureuser@k8s-master-13833463-0:~/john$ kubectl scale deploy iis-1803 --replicas=25
deployment.extensions/iis-1803 scaled
azureuser@k8s-master-13833463-0:~/john$
```

Some time later...

```
azureuser@k8s-master-13833463-0:~/john$ kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE    IP            NODE           NOMINATED NODE
iis-1803-687cdddf9f-26p6l   1/1     Running   0          7m     10.240.0.11   13833k8s9000   <none>
iis-1803-687cdddf9f-2ktdz   1/1     Running   0          7m     10.240.0.10   13833k8s9000   <none>
iis-1803-687cdddf9f-48ggp   1/1     Running   0          7m     10.240.0.5    13833k8s9000   <none>
iis-1803-687cdddf9f-4gtmb   1/1     Running   0          7m     10.240.0.26   13833k8s9000   <none>
iis-1803-687cdddf9f-4hd72   1/1     Running   0          7m     10.240.0.24   13833k8s9000   <none>
iis-1803-687cdddf9f-4pwsq   1/1     Running   1          7m     10.240.0.33   13833k8s9000   <none>
iis-1803-687cdddf9f-5kw22   1/1     Running   0          7m     10.240.0.9    13833k8s9000   <none>
iis-1803-687cdddf9f-664z7   1/1     Running   1          7m     10.240.0.16   13833k8s9000   <none>
iis-1803-687cdddf9f-8swz7   1/1     Running   1          7m1s   10.240.0.7    13833k8s9000   <none>
iis-1803-687cdddf9f-9h98r   1/1     Running   1          7m     10.240.0.8    13833k8s9000   <none>
iis-1803-687cdddf9f-9h9jd   1/1     Running   1          7m     10.240.0.14   13833k8s9000   <none>
iis-1803-687cdddf9f-lftd7   1/1     Running   1          7m     10.240.0.19   13833k8s9000   <none>
iis-1803-687cdddf9f-m9knq   1/1     Running   1          7m     10.240.0.31   13833k8s9000   <none>
iis-1803-687cdddf9f-mplcc   1/1     Running   1          7m     10.240.0.21   13833k8s9000   <none>
iis-1803-687cdddf9f-p7jn2   1/1     Running   0          7m     10.240.0.20   13833k8s9000   <none>
iis-1803-687cdddf9f-sml2x   1/1     Running   0          7m1s   10.240.0.13   13833k8s9000   <none>
iis-1803-687cdddf9f-tjfws   1/1     Running   0          7m     10.240.0.18   13833k8s9000   <none>
iis-1803-687cdddf9f-vxdl4   1/1     Running   0          7m     10.240.0.15   13833k8s9000   <none>
iis-1803-687cdddf9f-x26vj   1/1     Running   1          7m1s   10.240.0.30   13833k8s9000   <none>
iis-1803-687cdddf9f-x2hll   1/1     Running   1          7m     10.240.0.28   13833k8s9000   <none>
iis-1803-687cdddf9f-x9tpt   1/1     Running   0          24m    10.240.0.29   13833k8s9000   <none>
iis-1803-687cdddf9f-xg5bm   1/1     Running   1          7m     10.240.0.23   13833k8s9000   <none>
iis-1803-687cdddf9f-zkkzm   1/1     Running   1          7m     10.240.0.32   13833k8s9000   <none>
iis-1803-687cdddf9f-zqv69   1/1     Running   0          7m     10.240.0.17   13833k8s9000   <none>
iis-1803-687cdddf9f-zvzn9   1/1     Running   0          7m     10.240.0.27   13833k8s9000   <none>
azureuser@k8s-master-13833463-0:~/john$
```
lowenna pushed a commit to lowenna/azure-container-networking that referenced this pull request Oct 2, 2018
Signed-off-by: John Howard <jhoward@microsoft.com>

Move store to bbolt database

This PR is a follow on to Azure#247

@tamilmani1989 @sharmasushant PTAL.

@PatrickLang, @DavidSchott @dineshgovindasamy @madhanrm @jterry75 FYI. @msuiche perhaps you are able to perform more verification on this as well?

As per Azure#247 (comment), while that PR was better, it was far from perfect.
This PR replaces the store entirely and uses a bolt database to store the data. See Azure#247 (comment) Azure#247 (comment)

Patrick gave me access to one of his Windows clusters to perform verification. While there were some errors, none appear attributable to this change.
I was able to scale from 1 to 25, back to 1 and back up again. Hopefully this is finally the end of those store lock-related errors.
It is not, however, the end of all errors during scaling. I will leave those to others to investigate...

I have NOT been able to test this against a Linux node - perhaps @tamilmani1989 would be able to do that, as before.

In addition, this PR has a bunch of commits which fix (most) vendoring issues in this repo. There is still more to do there, but again, I will leave that for others to resolve. I had to tackle vendoring to some extent to pull in bbolt.

Finally, there are two other commits in this PR.
- I have put in an implementation of GetLastRebootTime on Windows. As its implementation changes the startup functionality, I have left that effectively stubbed out for someone else to follow through with.
- I hit a SIGSEGV in testing in UpdateSendAndReport. Made that safe.
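
The fix itself is not shown in this thread. In Go, a SIGSEGV of this kind is typically a nil pointer dereference, and the usual remedy is a guard that returns an error before the pointer is used. The sketch below illustrates that pattern only; `report`, `reportManager` and `sendReport` are hypothetical names, not the actual types or signature of UpdateSendAndReport.

```go
package main

import (
	"errors"
	"fmt"
)

// report and reportManager are hypothetical stand-ins for illustration.
type report struct {
	Context string
}

type reportManager struct {
	Report *report // may be nil if earlier initialisation failed
}

var errNilReport = errors.New("report manager or report is nil")

// sendReport returns an error instead of dereferencing a nil pointer,
// which would otherwise panic with a SIGSEGV.
func sendReport(rm *reportManager) (string, error) {
	if rm == nil || rm.Report == nil {
		return "", errNilReport
	}
	return "sent: " + rm.Report.Context, nil
}

func main() {
	if _, err := sendReport(nil); err != nil {
		fmt.Println("skipped:", err)
	}
	msg, _ := sendReport(&reportManager{Report: &report{Context: "AzureCNI"}})
	fmt.Println(msg)
}
```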

Here are the 25 pods scaling up and down on Patrick's cluster:

```
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE
iis-1803-687cdddf9f-28vrj   1/1     Running   0          10m     10.240.0.10   13833k8s9000   <none>
iis-1803-687cdddf9f-5fjcn   1/1     Running   0          6m33s   10.240.0.30   13833k8s9000   <none>
iis-1803-687cdddf9f-6dk28   1/1     Running   0          10m     10.240.0.31   13833k8s9000   <none>
iis-1803-687cdddf9f-6j8wg   1/1     Running   0          6m33s   10.240.0.11   13833k8s9000   <none>
iis-1803-687cdddf9f-8f5kc   1/1     Running   1          6m33s   10.240.0.14   13833k8s9000   <none>
iis-1803-687cdddf9f-bkd7n   1/1     Running   0          10m     10.240.0.28   13833k8s9000   <none>
iis-1803-687cdddf9f-bth4v   1/1     Running   0          6m33s   10.240.0.23   13833k8s9000   <none>
iis-1803-687cdddf9f-csm2x   1/1     Running   0          10m     10.240.0.5    13833k8s9000   <none>
iis-1803-687cdddf9f-dtvqp   1/1     Running   1          6m33s   10.240.0.9    13833k8s9000   <none>
iis-1803-687cdddf9f-fv9rn   1/1     Running   1          6m33s   10.240.0.20   13833k8s9000   <none>
iis-1803-687cdddf9f-gmzcz   1/1     Running   1          6m33s   10.240.0.12   13833k8s9000   <none>
iis-1803-687cdddf9f-kzmcf   1/1     Running   0          10m     10.240.0.7    13833k8s9000   <none>
iis-1803-687cdddf9f-lltjr   1/1     Running   1          6m33s   10.240.0.13   13833k8s9000   <none>
iis-1803-687cdddf9f-lx2vf   1/1     Running   0          10m     10.240.0.26   13833k8s9000   <none>
iis-1803-687cdddf9f-nn9pp   1/1     Running   1          6m33s   10.240.0.21   13833k8s9000   <none>
iis-1803-687cdddf9f-pjcws   1/1     Running   1          6m33s   10.240.0.22   13833k8s9000   <none>
iis-1803-687cdddf9f-q7hsf   1/1     Running   1          6m33s   10.240.0.33   13833k8s9000   <none>
iis-1803-687cdddf9f-qn5c7   1/1     Running   0          10m     10.240.0.27   13833k8s9000   <none>
iis-1803-687cdddf9f-rt6r5   1/1     Running   1          6m33s   10.240.0.17   13833k8s9000   <none>
iis-1803-687cdddf9f-s2jsb   1/1     Running   0          10m     10.240.0.8    13833k8s9000   <none>
iis-1803-687cdddf9f-sgwb8   1/1     Running   1          6m33s   10.240.0.25   13833k8s9000   <none>
iis-1803-687cdddf9f-x9tpt   1/1     Running   0          10m     10.240.0.29   13833k8s9000   <none>
iis-1803-687cdddf9f-xf6x9   1/1     Running   0          6m33s   10.240.0.24   13833k8s9000   <none>
iis-1803-687cdddf9f-xwfxg   1/1     Running   0          10m     10.240.0.15   13833k8s9000   <none>
iis-1803-687cdddf9f-zf8kv   1/1     Running   1          6m33s   10.240.0.16   13833k8s9000   <none>
azureuser@k8s-master-13833463-0:~/john$ curl http://10.240.0.16
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>IIS Windows Server</title>
<style type="text/css">
<!--
body {
        color:#000000;
        background-color:#0072C6;
        margin:0;
}

#container {
        margin-left:auto;
        margin-right:auto;
        text-align:center;
        }

a img {
        border:none;
}

-->
</style>
</head>
<body>
<div id="container">
<a href="http://go.microsoft.com/fwlink/?linkid=66138&amp;clcid=0x409"><img src="iisstart.png" alt="IIS" width="960" height="600" /></a>
</div>
</body>
</html>
```

Then scaling back down:

```
azureuser@k8s-master-13833463-0:~/john$ kubectl scale deploy iis-1803 --replicas=1
deployment.extensions/iis-1803 scaled
azureuser@k8s-master-13833463-0:~/john$
```

Some time later...

```
azureuser@k8s-master-13833463-0:~/john$ kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE
iis-1803-687cdddf9f-x9tpt   1/1     Running   0          16m   10.240.0.29   13833k8s9000   <none>
```

And scaling back up again:

```
azureuser@k8s-master-13833463-0:~/john$ kubectl scale deploy iis-1803 --replicas=25
deployment.extensions/iis-1803 scaled
azureuser@k8s-master-13833463-0:~/john$
```

Some time later...

```
azureuser@k8s-master-13833463-0:~/john$ kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE    IP            NODE           NOMINATED NODE
iis-1803-687cdddf9f-26p6l   1/1     Running   0          7m     10.240.0.11   13833k8s9000   <none>
iis-1803-687cdddf9f-2ktdz   1/1     Running   0          7m     10.240.0.10   13833k8s9000   <none>
iis-1803-687cdddf9f-48ggp   1/1     Running   0          7m     10.240.0.5    13833k8s9000   <none>
iis-1803-687cdddf9f-4gtmb   1/1     Running   0          7m     10.240.0.26   13833k8s9000   <none>
iis-1803-687cdddf9f-4hd72   1/1     Running   0          7m     10.240.0.24   13833k8s9000   <none>
iis-1803-687cdddf9f-4pwsq   1/1     Running   1          7m     10.240.0.33   13833k8s9000   <none>
iis-1803-687cdddf9f-5kw22   1/1     Running   0          7m     10.240.0.9    13833k8s9000   <none>
iis-1803-687cdddf9f-664z7   1/1     Running   1          7m     10.240.0.16   13833k8s9000   <none>
iis-1803-687cdddf9f-8swz7   1/1     Running   1          7m1s   10.240.0.7    13833k8s9000   <none>
iis-1803-687cdddf9f-9h98r   1/1     Running   1          7m     10.240.0.8    13833k8s9000   <none>
iis-1803-687cdddf9f-9h9jd   1/1     Running   1          7m     10.240.0.14   13833k8s9000   <none>
iis-1803-687cdddf9f-lftd7   1/1     Running   1          7m     10.240.0.19   13833k8s9000   <none>
iis-1803-687cdddf9f-m9knq   1/1     Running   1          7m     10.240.0.31   13833k8s9000   <none>
iis-1803-687cdddf9f-mplcc   1/1     Running   1          7m     10.240.0.21   13833k8s9000   <none>
iis-1803-687cdddf9f-p7jn2   1/1     Running   0          7m     10.240.0.20   13833k8s9000   <none>
iis-1803-687cdddf9f-sml2x   1/1     Running   0          7m1s   10.240.0.13   13833k8s9000   <none>
iis-1803-687cdddf9f-tjfws   1/1     Running   0          7m     10.240.0.18   13833k8s9000   <none>
iis-1803-687cdddf9f-vxdl4   1/1     Running   0          7m     10.240.0.15   13833k8s9000   <none>
iis-1803-687cdddf9f-x26vj   1/1     Running   1          7m1s   10.240.0.30   13833k8s9000   <none>
iis-1803-687cdddf9f-x2hll   1/1     Running   1          7m     10.240.0.28   13833k8s9000   <none>
iis-1803-687cdddf9f-x9tpt   1/1     Running   0          24m    10.240.0.29   13833k8s9000   <none>
iis-1803-687cdddf9f-xg5bm   1/1     Running   1          7m     10.240.0.23   13833k8s9000   <none>
iis-1803-687cdddf9f-zkkzm   1/1     Running   1          7m     10.240.0.32   13833k8s9000   <none>
iis-1803-687cdddf9f-zqv69   1/1     Running   0          7m     10.240.0.17   13833k8s9000   <none>
iis-1803-687cdddf9f-zvzn9   1/1     Running   0          7m     10.240.0.27   13833k8s9000   <none>
azureuser@k8s-master-13833463-0:~/john$
```
@lowenna lowenna deleted the jjh/lock branch December 4, 2018 16:41