-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error during upgrades, secret already exist #21
Comments
Hi @qu1queee, I tried to create a workaround for this issue, but it is not going to work. I'll have another go at it tomorrow, and will write up some background on it once I have a PR to review/test. Cheers, |
Great, thanks! Regards, |
@qu1queee @jandubois I hit the error of https://github.com/SUSE/scf-secret-generator/blob/master/secrets/secrets.go#L139 The reproduce steps are :
|
Hi @qu1queee, I've been looking into your problem some more, and I don't think i understand how you get into that situation. If the secret already exists, then the scf-secret-generator/secrets/secrets.go Lines 139 to 142 in a600c8b
Maybe the original error condition was something else. The problem is that if the secret-generator fails with an error, then kube will re-run it over and over, and on subsequent runs you might see different problems that are caused by the state the first failing run left things in. The real source of your problem is that you are trying to upgrade a system that has been built from a random development commit and not from a verified release. When we did the first implementation of versioned secrets, we found out during QA that this system did not upgrade properly. It also had a number of other problems, causing frequent pod restarts whenever config parameters were modified. For that reason we never made any releases of the 2.71 to 2.73 tags of the SCF repo: https://github.com/SUSE/scf/releases Instead we redesigned the versioned secrets system from scratch and reimplemented it, which took quite a number of iterations:
We therefore never expected to support updates from any of the interim states, but only from SCF 2.7.0 to 2.8.0. It is unfortunate that the non-upgradable code has lived for an extended time on the master branch, probably misleading you to assume that it was release-ready. Maybe we should have an RC branch, and only merge to master after we make a QA'ed release. But you do get the same information from the https://github.com/SUSE/scf/releases pages already. Anyways, I will try to help you find a workaround, but I need more information about the exact state you are in, like the contents of the configmap, and the list of all the secrets that currently exist in the namespace. Ideally this information should be recorded before attempting the upgrade, as a failed secret-generator run could mess up things even more. Or, if these are only internal systems, maybe it makes sense to just reinstall those instead of adding code to support upgrades for this legacy situation, that should not happen with installations of released versions? |
Hi @Xiaohua-Shen, your issue is different from the one @qu1queee is describing; it just shows the same symptoms. We had previously worked under the assumption that rollback to older versions isn't supported. And in general it can't be, because database migrations might make the current state of the data incompatible with earlier versions of the software. However, it seems sensible that you should be able to rollback to an earlier patch release of the same version, e.g. roll back 2.8.3 to 2.8.2 or even 2.8.0, so we would like to add support for that. I will try to produce a PR for this, as the basis of further discussion. I'll leave a comment here when I have something to share. |
Hi @jandubois , Let me summarize again my issue. As mentioned, we are doing an upgrade, in this case is from CF v270 to cf-deployment v1.15. Initially(v270), the inital config its as follows: {
"apiVersion": "v1",
"items": [
{
"apiVersion": "v1",
"data": {
"current-secrets-generation": "1",
"current-secrets-name": "secrets-270.28.0-b7add10-1",
"previous-secrets-name": ""
},
....
} As you can see ^^, the
Then we do the upgrade, where a new
During the first iteration, the
As you can see above ^^, I believe the configmap was intended to be created due to the missing parameter, but when the configmap creation happens, it fails, probably because there is already an existing configmap. In the next
From your last comment, and I quote
That logic, will never happen, because the names are not the same in our case, because creating the configmap failed during the first iteration. I hope this help to clear up the issue a bit. If you have questions regarding SCF tags we use, please let us know. Regards, |
…ersion The version number was added at the end of the re-design of the versioned secrets mechanism, and is also used as a marker if the configmap needs to be created or updated. This change allows updates from pre-release versions missing the config-version, but otherwise compatible with the version 1 format. Ref #21
This used to fail because we keep the pre-upgrade secrets around as a fail-safe in case anything goes wrong during the upgrade and the cluster needs to be manually resurrected. This secret was getting in the way because the secrets generator assumes that the new secret cannot exist yet. Ref #21 (comment)
I've created PR #22 with separate commits for both your issues. I've tested that it works normally to deploy SCF, but we are not testing either of your scenarios in our QA setups, so could you please report back if these changes solve your problems? |
@jandubois thanks!! I will test this soon, and then i will let you know. |
@jandubois one more thing, is there any reason why the |
@qu1queee @Xiaohua-Shen Totally unrelated, but just wanted to give you a quick warning that our QA discovered some upgrade issues with the SCF version tagged as 2.9.0; so this will not become a regular release. The problem is that some internal certs need changes to the SAN info (adding IP address |
Yes, there is no way that SCF will start up unless the secrets-generator finishes successfully. There are various failure modes, like a kube node failure or a random timeout, that could easily self-heal by restarting the job on a different node, or at a later time. So we would like to keep the restart policy for the operator experience in case of temporary failure. I've run into the problem of getting logs of the initial failure myself, so I understand where your request is coming from. We are considering to just overwrite a secret if it already exists (the linked PR already does this in case the name matches the previous release). That would cover the problem when the generator can create the secret, but fails to update the config map. If we overwrite the secret, then at least the error about updating the config map should re-occur and be in the log as well. I'm still pondering though if this should be done even when the config map doesn't exist yet. Because in that case we would overwrite immutable secrets, breaking the currently running setup. But then there is no way to ever upgrade from this scenario either, so neither solution seems ideal. I just hope that by fixing the other issues it becomes extremely unlikely to get into the broken state at all. |
It is actually still undecided. We may end up releasing it as-is, with a release note that says that you must trigger a secrets rotation during the upgrade to 2.9.0, to get working certificates. Just pointing this out here as well, so you are aware. |
@jandubois I also noticed some password's what's the reason for this change? |
Hi @jandubois, we are upgrading SCF to the 2.9.0, but because the upgrade issues, it is not a regular release. Do you know when can you publish the new official release? I also need some upgrade fixes in 2.9.0 which are not in 2.8.0. And 2.8.0 is old. |
We are trying to remove the immutable attribute from all secrets that don't really need it anymore, to make those secrets rotatable. For example the mysql passwords are no longer immutable after updating to https://github.com/cloudfoundry/cf-mysql-release/releases/tag/v36.11.0 which includes https://www.pivotaltracker.com/n/projects/969486/stories/153529269.
Do you mean the temporary downtime while the related pods are restarting, or did you observe this breaking things long term? We have tested secrets rotation after making the password mutable, and everything was working as expected. The (expected, short) downtime during secrets rotation is the main reason we don't automatically rotate secrets every time the user makes any changes via For reference, the PRs that made additional secrets mutable are SUSE/scf#1448 and SUSE/uaa-fissile-release#93 |
Hi @zhangtbj, I don't know yet; it will be decided by product management, if they will release with the upgrade issue mentioned in the release notes, or if they want to delay for a proper fix. For the proper fix, we are planning to store a normalized version of the generator config section from the role manifest together with the generated secrets. That way we can determine during upgrades if a specific secret needs to be rotated during a chart upgrade (normally only new secrets are generated at this time). Given that previous releases don't have this information, the initial upgrade will therefore cause an implicit rotation of all secrets and therefore cause the usual short downtime. But work on this fix hasn't started yet. |
Hi @jandubois , If we don't have the use case that upgrades from the previous version to SCF 2.9.0 (For example, 2.9.0 is our first version, no any migration to this version in our production), what is your opinion? Keep using 2.9.0 or rollback to official version 2.8.0? |
Hi @zhangtbj, I would have said to use 2.9.0, but I just learned that we had 2 instances in testing where We have never seen this particular failure mode before 2.9.0, so I'm not so sure about recommending 2.9.0 right now. I don't have any more information right now. If possible I would recommend to wait a couple of days. We are looking at this urgently, as 2.9.0 is already behind our (internal) schedule. |
Hi @jandubois, I think a couple of days is fine for us. Wait for your good news. Thanks! |
@jandubois sorry for the late response. I just tested the PR #22 , worked as expected for my upgrade scenario, logs are the following:
I added some extra logs on my fork(on top of your branch), to test if I was getting the changes, finding the right place under scf submodules was cumbersome. Thanks again. |
…ersion The version number was added at the end of the re-design of the versioned secrets mechanism, and is also used as a marker if the configmap needs to be created or updated. This change allows updates from pre-release versions missing the config-version, but otherwise compatible with the version 1 format. Ref #21
This used to fail because we keep the pre-upgrade secrets around as a fail-safe in case anything goes wrong during the upgrade and the cluster needs to be manually resurrected. This secret was getting in the way because the secrets generator assumes that the new secret cannot exist yet. Ref #21 (comment)
Fix issue #21: configmap/secret already existing
@jandubois can we close this? |
Yes, I think we can close this now. 😄 We did some extensive refactoring of the secrets generator in #28 that should make it much more robust and predictable, mostly by passing in the install/upgrade status via helm, and not by using heuristics. These changes are not yet in the latest SCF release (but are included in HEAD of |
Hi,
We faced an issue regarding the generation of a secret through the
secret-generation
pod in SCF. It seems that doing SCF version upgrades between images that do not have this(current images) commit 94cbe1 and images that have it(upgrade-to images), will end up in an error of thesecret-generation
pod. Error is as follows:It seems that config-version did not exist before the introduction of commit 94cbe1, which during upgrade(image with commit 94cbe1 changes) will try to retrieve that config-version, and because isn't there, will try to create a
configmap
, which already exists(cause we re doing an upgrade). Eventually the pod will fail and restart, but because the secret was already created, it will end with the error above.Could we add another check for this , to verify that the configmap is there, before trying to create one?
Regards,
Enrique Encalada
The text was updated successfully, but these errors were encountered: