
Error during upgrades, secret already exists #21

Closed
qu1queee opened this issue May 3, 2018 · 22 comments

@qu1queee

qu1queee commented May 3, 2018

Hi,

We faced an issue regarding the generation of a secret through the secret-generation pod in SCF. It seems that doing SCF version upgrades between images that do not have commit 94cbe1 (the current images) and images that have it (the upgrade-to images) ends with an error from the secret-generation pod. The error is as follows:

....
2018/05/02 08:53:28 Done with generation
2018/05/02 08:53:28 Error creating secret secrets-1.15.0-rc.31-1: secrets "secrets-1.15.0-rc.31-1" already exist

It seems that config-version did not exist before the introduction of commit 94cbe1. During an upgrade (to an image with the 94cbe1 changes), the generator tries to retrieve that config-version and, because it isn't there, tries to create a configmap, which already exists (since we are doing an upgrade). Eventually the pod fails and restarts, but because the secret was already created, it ends with the error above.

Could we add another check for this, to verify that the configmap is there before trying to create one?
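The requested check amounts to a get-or-create pattern. A minimal sketch, using a plain map to stand in for the Kubernetes API (all names and helpers here are illustrative, not the actual scf-secret-generator code):

```go
package main

import (
	"errors"
	"fmt"
)

// fakeAPI stands in for the Kubernetes API: a map of configmap
// names to their data. Names here are illustrative only.
type fakeAPI map[string]map[string]string

var errAlreadyExists = errors.New("already exists")

func (api fakeAPI) Get(name string) (map[string]string, bool) {
	cm, ok := api[name]
	return cm, ok
}

func (api fakeAPI) Create(name string, data map[string]string) error {
	if _, ok := api[name]; ok {
		return errAlreadyExists
	}
	api[name] = data
	return nil
}

// ensureConfigMap looks the configmap up first and only creates it
// when it is genuinely missing; on an upgrade the existing map is
// reused instead of triggering an "already exists" error.
func ensureConfigMap(api fakeAPI, name string, data map[string]string) (map[string]string, error) {
	if cm, ok := api.Get(name); ok {
		return cm, nil // reuse: we are upgrading
	}
	if err := api.Create(name, data); err != nil {
		return nil, err
	}
	return data, nil
}

func main() {
	api := fakeAPI{"secrets-config": {"current-secrets-name": "secrets-270.28.0-b7add10-1"}}
	cm, err := ensureConfigMap(api, "secrets-config", map[string]string{})
	fmt.Println(err == nil, cm["current-secrets-name"]) // true secrets-270.28.0-b7add10-1
}
```

With a lookup up front, a retry after a partial failure becomes a no-op instead of a crash loop.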

Regards,
Enrique Encalada

@jandubois
Member

Hi @qu1queee,

I tried to create a workaround for this issue, but it is not going to work. I'll have another go at it tomorrow, and will write up some background on it once I have a PR to review/test.

Cheers,
-Jan

@qu1queee
Author

qu1queee commented May 4, 2018

Great, thanks!

Regards,
Enrique Encalada

@Xiaohua-Shen

Xiaohua-Shen commented May 4, 2018

@qu1queee @jandubois I hit the error Error creating secret <secret-name>: <secret-name> already exists in another scenario that needs a fix too.

https://github.com/SUSE/scf-secret-generator/blob/master/secrets/secrets.go#L139

The steps to reproduce are:

  1. Install version A; it creates secret A.
  2. Upgrade to version B; the generator checks whether the new secret name B is the same as the current secret (now A). It is not, so it creates secret B.
  3. Roll back to version A; the generator checks whether the new secret name A is the same as the current secret (now B).
    It is not, so it tries to create secret A.
    But secret A already exists, so it hits Error creating secret <secret-name>: <secret-name> already exists.
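The rollback path can be sketched as a pure decision function. This is a toy model of the name check (the names and the existing-secrets set are illustrative; the eventual fix may take a different approach):

```go
package main

import "fmt"

// decideSecretAction models the name check: the problematic path
// compares only the desired name against the configmap's current name.
// On a rollback the desired name differs from the current one but the
// secret already exists, so a blind create fails; consulting the set of
// existing secrets lets the rollback reuse secret A instead.
func decideSecretAction(desired, current string, existing map[string]bool) string {
	if desired == current {
		return "nothing-to-do"
	}
	if existing[desired] { // rollback: the old release's secret is still there
		return "reuse-existing"
	}
	return "create"
}

func main() {
	existing := map[string]bool{"secrets-A-1": true, "secrets-B-1": true}
	// After upgrading A -> B and rolling back to A:
	fmt.Println(decideSecretAction("secrets-A-1", "secrets-B-1", existing)) // reuse-existing
}
```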

@jandubois
Member

Hi @qu1queee,

I've been looking into your problem some more, and I don't think I understand how you get into that situation.

If the secret already exists, then the current-secret-name in the configmap should be set to this name, even if config-version does not exist. And then this should be caught by

if sg.SecretsName == currentName {
	log.Printf("Secret `%s` already exists; nothing to do\n", sg.SecretsName)
	return nil, nil
}

Maybe the original error condition was something else. The problem is that if the secret-generator fails with an error, then kube will re-run it over and over, and on subsequent runs you might see different problems that are caused by the state the first failing run left things in.

The real source of your problem is that you are trying to upgrade a system that has been built from a random development commit and not from a verified release. When we did the first implementation of versioned secrets, we found out during QA that this system did not upgrade properly. It also had a number of other problems, causing frequent pod restarts whenever config parameters were modified.

For that reason we never made any releases of the 2.71 to 2.73 tags of the SCF repo: https://github.com/SUSE/scf/releases

Instead we redesigned the versioned secrets system from scratch and reimplemented it, which took quite a number of iterations.

We therefore never expected to support updates from any of the interim states, but only from SCF 2.7.0 to 2.8.0.

It is unfortunate that the non-upgradable code has lived for an extended time on the master branch, probably misleading you to assume that it was release-ready. Maybe we should have an RC branch, and only merge to master after we make a QA'ed release. But you do get the same information from the https://github.com/SUSE/scf/releases pages already.

Anyways, I will try to help you find a workaround, but I need more information about the exact state you are in, like the contents of the configmap, and the list of all the secrets that currently exist in the namespace. Ideally this information should be recorded before attempting the upgrade, as a failed secret-generator run could mess up things even more.

Or, if these are only internal systems, maybe it makes sense to just reinstall them instead of adding code to support upgrades for this legacy situation, which should not happen with installations of released versions?

@jandubois
Member

Hi @Xiaohua-Shen,

Your issue is different from the one @qu1queee is describing; it just shows the same symptoms.

We had previously worked under the assumption that rollback to older versions isn't supported. And in general it can't be, because database migrations might make the current state of the data incompatible with earlier versions of the software.

However, it seems sensible that you should be able to rollback to an earlier patch release of the same version, e.g. roll back 2.8.3 to 2.8.2 or even 2.8.0, so we would like to add support for that. I will try to produce a PR for this, as the basis of further discussion. I'll leave a comment here when I have something to share.

@qu1queee
Author

qu1queee commented May 7, 2018

Hi @jandubois ,

Let me summarize my issue again. As mentioned, we are doing an upgrade, in this case from CF v270 to cf-deployment v1.15. Initially (v270), the initial config is as follows:
configmap

{
  "apiVersion": "v1",
  "items": [
    {
      "apiVersion": "v1",
      "data": {
        "current-secrets-generation": "1",
        "current-secrets-name": "secrets-270.28.0-b7add10-1",
        "previous-secrets-name": ""
      },
....
}

As you can see ^^, the current-secrets-name is based on that v270. Also, this existing configmap was generated by a secret-generation pod, as follows:
logs from pod

....
....
2018/05/07 08:11:07 Done with generation
2018/05/07 08:11:07 Created secret `secrets-270.28.0-b7add10-1`
2018/05/07 08:11:07 Created configmap `secrets-config`

Then we do the upgrade, where a new secret-generator pod is created. The ENV variables of that pod are the following:

[
...
  {
    "name": "KUBE_SECRETS_GENERATION_COUNTER",
    "value": "1"
  },
  {
    "name": "KUBE_SECRETS_GENERATION_NAME",
    "value": "secrets-1.15.0-rc.31-1"
  },
...
]

During the first iteration, the secret-generator pod fails as follows:

2018/05/07 08:52:52 Done with generation
2018/05/07 08:52:52 Created secret `secrets-1.15.0-rc.31-1`
2018/05/07 08:52:52 Error creating configmap secrets-config: resourceVersion should not be set on objects to be created

As you can see above ^^, I believe the configmap was intended to be created because of the missing parameter, but when the configmap creation happens, it fails, probably because there is already an existing configmap.
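The resourceVersion message in that log matches a general Kubernetes API rule: an object read back from the cluster carries a resourceVersion, and the API server rejects a create request that still has one set. A stripped-down model of that validation (the configMap struct here is a stand-in, not the real client-go type):

```go
package main

import (
	"errors"
	"fmt"
)

// configMap is a stripped-down stand-in for a Kubernetes object.
type configMap struct {
	ResourceVersion string
	Data            map[string]string
}

var errResourceVersionSet = errors.New("resourceVersion should not be set on objects to be created")

// create mimics the API server's validation: passing an object that
// still carries a ResourceVersion to a create call is rejected.
// Clearing the field first, or calling an update instead of a create,
// avoids the error reported in the log above.
func create(cm *configMap) error {
	if cm.ResourceVersion != "" {
		return errResourceVersionSet
	}
	return nil
}

func main() {
	fetched := &configMap{ResourceVersion: "12345"}
	fmt.Println(create(fetched)) // rejected, like the log above
	fetched.ResourceVersion = ""
	fmt.Println(create(fetched) == nil) // cleared: create succeeds
}
```

This is consistent with the theory that the generator reused a fetched object in a create call when the configmap already existed.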

In the next secret-generator pod iteration, you hit the issue of the secret already being there, as follows:

2018/05/07 08:53:13 Done with generation
2018/05/07 08:53:13 Error creating secret secrets-1.15.0-rc.31-1: secrets "secrets-1.15.0-rc.31-1" already exists

From your last comment, and I quote:

If the secret already exists, then the current-secret-name in the configmap should be set to this name, even if config-version does not exist. And then this should be caught by

That logic will never trigger, because in our case the names are not the same, since creating the configmap failed during the first iteration.

I hope this helps clear up the issue a bit. If you have questions regarding the SCF tags we use, please let us know.

Regards,
Enrique Encalada

jandubois added a commit that referenced this issue May 7, 2018
…ersion

The version number was added at the end of the re-design of the versioned
secrets mechanism, and is also used as a marker if the configmap needs to
be created or updated.

This change allows updates from pre-release versions missing the config-version,
but otherwise compatible with the version 1 format.

Ref #21
jandubois added a commit that referenced this issue May 7, 2018
This used to fail because we keep the pre-upgrade secrets around as a
fail-safe in case anything goes wrong during the upgrade and the cluster
needs to be manually resurrected. This secret was getting in the way
because the secrets generator assumes that the new secret cannot exist
yet.

Ref #21 (comment)
@jandubois
Member

@qu1queee @Xiaohua-Shen

I've created PR #22 with separate commits for both your issues. I've tested that it works normally to deploy SCF, but we are not testing either of your scenarios in our QA setups, so could you please report back if these changes solve your problems?

@qu1queee
Author

qu1queee commented May 9, 2018

@jandubois thanks!! I will test this soon and let you know.

@qu1queee
Author

qu1queee commented May 9, 2018

@jandubois one more thing: is there any reason why the secret-generator pod has a restartPolicy of OnFailure, as defined here? I'm asking because it was tricky to get the logs for this issue before the first container restarted. If there is no reason behind OnFailure, I would prefer the Never restartPolicy; what do you think?

@jandubois
Member

@qu1queee @Xiaohua-Shen Totally unrelated, but just wanted to give you a quick warning that our QA discovered some upgrade issues with the SCF version tagged as 2.9.0; so this will not become a regular release.

The problem is that some internal certs need changes to the SAN info (adding IP address 127.0.0.1), so re-using the certs from a 2.8.0 deployment will fail. Forcing a secrets rotation may solve this, but will cause temporary unavailability, so is not desirable. We are still investigating how to resolve this.

@jandubois
Member

is there any reason why the secret-generator pod has a restartPolicy of OnFailure

Yes, there is no way that SCF will start up unless the secrets-generator finishes successfully. There are various failure modes, like a kube node failure or a random timeout, that could easily self-heal by restarting the job on a different node, or at a later time. So we would like to keep the restart policy for the operator experience in case of temporary failure.

I've run into the problem of getting logs of the initial failure myself, so I understand where your request is coming from. We are considering just overwriting a secret if it already exists (the linked PR already does this in case the name matches the previous release). That would cover the case where the generator can create the secret but then fails to update the configmap. If we overwrite the secret, then at least the error about updating the configmap should recur and be in the log as well.

I'm still pondering, though, whether this should be done even when the configmap doesn't exist yet, because in that case we would overwrite immutable secrets, breaking the currently running setup. But then there is no way to ever upgrade from this scenario either, so neither solution seems ideal. I just hope that by fixing the other issues it becomes extremely unlikely to get into the broken state at all.
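The overwrite-on-retry idea can be sketched as a create-or-update step (a toy model with a map as the secrets store; the real generator talks to the Kubernetes API):

```go
package main

import "fmt"

// store stands in for the secrets API.
type store map[string]string

// createOrUpdate overwrites an existing secret instead of failing.
// This keeps retries idempotent: if the generator created the secret
// but then failed to update the configmap, the next run repeats both
// steps, so the configmap error surfaces in its own log. Whether to do
// this when no configmap exists at all is the open question above,
// since that would overwrite immutable secrets.
func createOrUpdate(s store, name, value string) (overwrote bool) {
	_, existed := s[name]
	s[name] = value
	return existed
}

func main() {
	s := store{"secrets-1.15.0-rc.31-1": "v1"}
	fmt.Println(createOrUpdate(s, "secrets-1.15.0-rc.31-1", "v2")) // true: a retry overwrote
}
```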

@jandubois
Member

2.9.0 [...] will not become a regular release

It is actually still undecided. We may end up releasing it as-is, with a release note that says that you must trigger a secrets rotation during the upgrade to 2.9.0, to get working certificates. Just pointing this out here as well, so you are aware.

@Xiaohua-Shen

@jandubois I also noticed that the immutable: true attribute was removed from some passwords in SCF 2.9, including the db passwords, which may leave data in the db inaccessible after an upgrade. https://github.com/SUSE/scf/blob/2.9.0/container-host-files/etc/scf/config/role-manifest.yml

What's the reason for this change?

@zhangtbj

zhangtbj commented May 11, 2018

Hi @jandubois, we are upgrading SCF to 2.9.0, but because of the upgrade issues it is not a regular release. Do you know when you can publish the new official release? I also need some upgrade fixes that are in 2.9.0 but not in 2.8.0, and 2.8.0 is old.

@jandubois
Member

what's the reason for this change?

We are trying to remove the immutable attribute from all secrets that don't really need it anymore, to make those secrets rotatable.

For example the mysql passwords are no longer immutable after updating to https://github.com/cloudfoundry/cf-mysql-release/releases/tag/v36.11.0 which includes https://www.pivotaltracker.com/n/projects/969486/stories/153529269.

which may cause data in db not accessible after upgrade

Do you mean the temporary downtime while the related pods are restarting, or did you observe this breaking things long term?

We have tested secrets rotation after making the password mutable, and everything was working as expected.

The (expected, short) downtime during secrets rotation is the main reason we don't automatically rotate secrets every time the user makes any changes via helm upgrade, but let the user request it explicitly. That way the operator can run it during scheduled maintenance windows (or immediately in case of any kind of exposure).

For reference, the PRs that made additional secrets mutable are SUSE/scf#1448 and SUSE/uaa-fissile-release#93

@jandubois
Member

Do you know when you can publish the new official release?

Hi @zhangtbj, I don't know yet; it will be decided by product management whether to release with the upgrade issue mentioned in the release notes, or to delay for a proper fix.

For the proper fix, we are planning to store a normalized version of the generator config section from the role manifest together with the generated secrets. That way we can determine during a chart upgrade whether a specific secret needs to be rotated (normally only new secrets are generated at this time). Given that previous releases don't have this information, the initial upgrade will cause an implicit rotation of all secrets and therefore the usual short downtime. But work on this fix hasn't started yet.
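That plan could look roughly like the following sketch, where a fingerprint of the normalized generator config is stored alongside each secret (configFingerprint and needsRotation are hypothetical names, not the actual SCF implementation):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configFingerprint hashes a normalized generator config so it can be
// stored next to the generated secret. Hypothetical sketch of the plan
// described above, not SCF's actual code.
func configFingerprint(normalized string) string {
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

// needsRotation compares the stored fingerprint with the one computed
// from the current role manifest. A missing stored fingerprint (as in
// releases predating this scheme) forces rotation, matching the
// implicit full rotation on the initial upgrade.
func needsRotation(stored, current string) bool {
	return stored == "" || stored != current
}

func main() {
	current := configFingerprint("type: password\nlength: 64")
	fmt.Println(needsRotation("", current))      // pre-scheme release: rotate
	fmt.Println(needsRotation(current, current)) // config unchanged: keep secret
}
```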

@zhangtbj

Hi @jandubois, if we don't have the use case of upgrading from a previous version to SCF 2.9.0 (for example, 2.9.0 is our first version, with no migration to it in our production), what is your opinion? Keep using 2.9.0 or roll back to the official version 2.8.0?
Thanks!
Thanks!

@jandubois
Member

Hi @zhangtbj, I would have said to use 2.9.0, but I just learned that we had 2 instances in testing where diego-api did not start up correctly. So far it looks like it was some issue between the BBS migration and the locket server, but it is still under investigation, so we don't really know.

We have never seen this particular failure mode before 2.9.0, so I'm not so sure about recommending 2.9.0 right now. I don't have any more information at the moment.

If possible I would recommend waiting a couple of days. We are looking at this urgently, as 2.9.0 is already behind our (internal) schedule.

@zhangtbj

Hi @jandubois, I think a couple of days is fine for us. We'll wait for your good news. Thanks!
@qu1queee @Xiaohua-Shen FYI, if possible, I will provide two versions, one using 2.8.0 and one using 2.9.0. If we can catch the SCF 2.9.0 schedule, we will continue with it; otherwise we will use 2.8.0 as the official one. Thanks!

@qu1queee
Author

@jandubois sorry for the late response. I just tested PR #22; it worked as expected for my upgrade scenario. The logs are the following:

2018/05/16 11:47:14 testing..
2018/05/16 11:47:15 Going to re use existing configmap
...
...
2018/05/16 11:47:35 Done with generation
2018/05/16 11:47:35 Created secret `secrets-1.15.0-rc.40-0`
2018/05/16 11:47:35 previous secret `secrets-270.28.0-b7add10-1`
2018/05/16 11:47:35 Updated configmap `secrets-config`

I added some extra logs in my fork (on top of your branch) to test whether I was getting the changes; finding the right place under the scf submodules was cumbersome.

Thanks again.

jandubois added a commit that referenced this issue May 16, 2018
…ersion

The version number was added at the end of the re-design of the versioned
secrets mechanism, and is also used as a marker if the configmap needs to
be created or updated.

This change allows updates from pre-release versions missing the config-version,
but otherwise compatible with the version 1 format.

Ref #21
jandubois added a commit that referenced this issue May 16, 2018
This used to fail because we keep the pre-upgrade secrets around as a
fail-safe in case anything goes wrong during the upgrade and the cluster
needs to be manually resurrected. This secret was getting in the way
because the secrets generator assumes that the new secret cannot exist
yet.

Ref #21 (comment)
mook-as added a commit that referenced this issue May 16, 2018
Fix issue #21: configmap/secret already existing
jandubois added a commit to SUSE/scf-helper-release that referenced this issue May 16, 2018
jandubois added a commit to SUSE/uaa-fissile-release that referenced this issue May 16, 2018
@qu1queee
Author

@jandubois can we close this?

@jandubois
Member

Yes, I think we can close this now. 😄

We did some extensive refactoring of the secrets generator in #28 that should make it much more robust and predictable, mostly by passing in the install/upgrade status via helm, and not by using heuristics.

These changes are not yet in the latest SCF release (but are included in HEAD of develop). If you have the time, it would be great if you could review the code to make sure it covers all your scenarios as well! If you find any issues, or have any comments, please file a new issue...
