zync needs plaintext tokens #4309

Open
akostadinov wants to merge 1 commit into master from zync_plaintext_tokens

@akostadinov
Contributor

@akostadinov akostadinov commented May 26, 2026

Alternative to #4304

@jlledom, how about something like this instead? I think there is some value in keeping the tokens for auth.

update: if needed we can also memcache the token plaintexts but I doubt it is really needed. Also we can check Rails.configuration.zync.skip_non_oidc_applications but either way, unless tests in stg show a significant performance difference, I wouldn't complicate things with such logic.

@akostadinov akostadinov force-pushed the zync_plaintext_tokens branch from 929395c to dda5113 Compare May 26, 2026 20:42
@jlledom
Contributor

jlledom commented May 27, 2026

I've been examining this approach and I think it could work. The main issues would be:

  1. The zync worker in porta would rotate the token at every event, see the code:

    with_manual_retry_count(event_id, manual_retry_count) do
      update_tenant(event)
      http_put(notification_url, notification, event_id)
    end

This is not necessarily bad, but I think it justifies adding some caching mechanism.

  2. When a token is invalid in zync DB, there's no way for zync to get a new one, so it'll remain in an invalid state until porta takes the initiative to send a new event to zync.
  3. There is a very tiny but non-zero window for race conditions, between the moment token T1 is deleted and the moment the newly created token T2 reaches zync. A zync UpdateJob running at that precise moment will fail with a 403.
  4. When creating a new token, we must decide whether to make it expirable or not. If we make it expirable, point 2 will happen often. If not, we would be storing a non-expirable token in plaintext in zync DB. OIDC tokens wouldn't be protected via hashing.
  5. When a job fails with a 403, zync doesn't retry it, so in the scenarios above the jobs would be lost forever. However, adding retry logic to zync is as trivial as:
    retry_on ForbiddenError, wait: :polynomially_longer, attempts: 5
  • 5 attempts -> ~15 minutes
  • 10 attempts -> ~7 hours
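
To sanity-check those durations, here is a minimal sketch of the cumulative wait, assuming Active Job's documented `:polynomially_longer` formula of roughly `executions**4 + 2` seconds per retry and ignoring the random jitter it adds on top. The helper names are made up for illustration. One caveat worth verifying against the Rails version zync runs: as far as I can tell, `attempts:` counts total executions including the first one, so N retries would need `attempts: N + 1`.

```ruby
# Base delay (in seconds) before the retry that follows execution number
# `executions`, per the :polynomially_longer formula, with jitter ignored.
def polynomial_wait(executions)
  executions**4 + 2
end

# Total time spent waiting across `retries` consecutive retries.
def cumulative_wait(retries)
  (1..retries).sum { |n| polynomial_wait(n) }
end

[5, 7, 10].each do |retries|
  total = cumulative_wait(retries)
  puts format("%2d retries: %6d s (~%.1f min)", retries, total, total / 60.0)
end
```

Under these assumptions 5 retries add up to about 16 minutes and 10 retries to about 7 hours, in line with the estimates above; jitter would stretch both somewhat.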

So, summarizing the pros and cons of your approach and mine:

  • Using zync token:
    • Pros:
      • No breaking changes, zero downtime
      • No need to make changes in zync
      • A leaked zync token can be rotated easily
    • Cons:
      • All security relies on the ZYNC_TOKEN not being leaked
      • A leaked zync token will give read only access to 3 endpoints on all providers, not only one.
        • These endpoints return credentials for Cinstances.
  • Rotating OIDC tokens:
    • Pros:
      • Doesn't widen the risk vector. In order to get credentials, an attacker would need the zync token and also an access token for each provider it wants to target
      • The fix is simpler.
    • Cons:
      • An attacker able to get the zync token would also be able to get any access token, since they are stored and sent within the exact same trust boundary.
      • Constant rotation of tokens; not sure how much overhead that would add
      • Still storing plaintext tokens in zync DB, and not just any tokens, but tokens that can be used to get credentials for Cinstances. This makes the effort of hashing access tokens less effective for all providers, since all providers will have at least one pretty important token not hashed.
      • Requires changes in zync to add the retry logic for 403 errors
      • Downtime on deploy until zync gets new tokens
      • No solution for jobs when an invalid token doesn't get refreshed before the retry logic gives up.

I wouldn't accept this approach unless we add some caching mechanism to avoid constant rotation of tokens and reduce the race condition window each rotation implies. Also, I think we definitely should add retry logic for 403 errors if we go with this approach.

@mayorova WDYT?

@mayorova
Contributor

On one hand, I like this solution, because it's simple and probably effective.

On the other hand, I am concerned about potential race conditions, when a job in zync is already being executed with an "older" token, and in the meantime, this token is being deleted by porta, and thus is invalidated.

@jlledom
Contributor

jlledom commented May 27, 2026

> On one hand, I like this solution, because it's simple and probably effective.
>
> On the other hand, I am concerned about potential race conditions, when a job in zync is already being executed with an "older" token, and in the meantime, this token is being deleted by porta, and thus is invalidated.

Yeah, but I think the race condition is not the main concern. Caching the token for, say, 1 hour will open the race condition window for a small fraction of a second once per hour. Besides, adding the retry logic to zync will fix the effects of the race condition in probably every case where it happens. So I consider that problem solved.

For me the main concern is what I mentioned about the tokens: storing them in plaintext in zync DB is wrong IMO; it is an existing risk vector that we could eliminate.

If we accept storing tokens in plaintext in zync DB, then those tokens should be very short-lived. But then we would hit the problem of tokens expiring and zync not being able to refresh them unless porta decides to send a new one. In practice, every time a job is queued in zync, zync will have just received a request that included a new valid token. So zync failing to update due to expired tokens will only happen when the load is so high that the zync queue worker takes longer to reach the job than the token's lifetime. Possible but unlikely, I think.

@mayorova
Contributor

> For me the main concern is what I mentioned about the tokens: storing them in plaintext in zync DB is wrong IMO; it is an existing risk vector that we could eliminate.
>
> If we accept storing tokens in plaintext in zync DB, then those tokens should be very short-lived. But then we would hit the problem of tokens expiring and zync not being able to refresh them unless porta decides to send a new one. In practice, every time a job is queued in zync, zync will have just received a request that included a new valid token. So zync failing to update due to expired tokens will only happen when the load is so high that the zync queue worker takes longer to reach the job than the token's lifetime. Possible but unlikely, I think.

Yeah, but every update in porta that triggers a zync notification will rotate the token. Of course, we can't know how often these events will happen.

But on the other hand, if we opt for the ZYNC_TOKEN authentication, then we'll have a token that is also plaintext (stored in clear text in OCP secrets, env vars etc.) and that is never ever rotated 🤷 Well, it is "safer" in the sense that it only gives access to a subset of endpoints, but I'm not sure which approach is better, to be honest 🤷

@jlledom
Contributor

jlledom commented May 27, 2026

> > For me the main concern is what I mentioned about the tokens: storing them in plaintext in zync DB is wrong IMO; it is an existing risk vector that we could eliminate.
> >
> > If we accept storing tokens in plaintext in zync DB, then those tokens should be very short-lived. But then we would hit the problem of tokens expiring and zync not being able to refresh them unless porta decides to send a new one. In practice, every time a job is queued in zync, zync will have just received a request that included a new valid token. So zync failing to update due to expired tokens will only happen when the load is so high that the zync queue worker takes longer to reach the job than the token's lifetime. Possible but unlikely, I think.
>
> Yeah, but every update in porta that triggers a zync notification will rotate the token. Of course, we can't know how often these events will happen.
>
> But on the other hand, if we opt for the ZYNC_TOKEN authentication, then we'll have a token that is also plaintext (stored in clear text in OCP secrets, env vars etc.) and that is never ever rotated 🤷 Well, it is "safer" in the sense that it only gives access to a subset of endpoints, but I'm not sure which approach is better, to be honest 🤷

I think we can reach this agreement:

  1. Go with this approach
  2. Cache the tokens in porta to open the race condition window just once per hour
  3. Add retry logic to zync to retry 403 errors for, say, an hour (6-7 attempts)
  4. Make the oidc tokens expirable to mitigate leaking effects
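
Point 2 could look roughly like the sketch below. It is a plain-Ruby stand-in, not porta code: `TokenCache`, its `fetch` method and the injected clock are hypothetical names, and a real implementation would more likely sit on top of the existing cache store (memcached was mentioned earlier in the thread). The idea is only that a new token is minted, and the old one rotated, at most once per TTL instead of on every event.

```ruby
# Hypothetical in-memory cache of plaintext tokens, keyed by provider id.
class TokenCache
  Entry = Struct.new(:token, :expires_at)

  # ttl is in seconds; clock is injectable so expiry can be tested.
  def initialize(ttl:, clock: -> { Time.now })
    @ttl = ttl
    @clock = clock
    @entries = {}
  end

  # Returns the cached token for provider_id while it is still fresh.
  # Otherwise runs the block (which would rotate/mint a token) and caches it.
  def fetch(provider_id)
    now = @clock.call
    entry = @entries[provider_id]
    return entry.token if entry && entry.expires_at > now

    token = yield
    @entries[provider_id] = Entry.new(token, now + @ttl)
    token
  end
end

cache = TokenCache.new(ttl: 3600)
first  = cache.fetch(42) { "freshly-minted-token" }
second = cache.fetch(42) { "would-only-be-minted-after-expiry" }
puts first == second
```

With a one-hour TTL, every event within that hour reuses the same token, so the window between deleting T1 and delivering T2 only opens once per hour per provider.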

The expiration time for tokens should be the shortest possible that still leaves zync reasonable time to do its best trying to process the jobs. That would be:

  • 1 hour (because the tokens are cached for 1 hour, and we don't want to send expired tokens)
  • plus another 1.5 hours (time to retry 7 attempts)
  • plus the amount of time we think it's reasonable for jobs to wait in the queue until they are processed. We can check stats in Grafana to figure out how much time this is in SaaS.

We assume it's fine to lose jobs that still fail after that time. If new events come from porta, they will include a new token and restore the provider.
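
As a worked example of that sum, here is a small sketch. The cache TTL and retry window come from the list above; the queue wait is a placeholder, since the real figure still has to be read from the SaaS stats, and the constant and method names are hypothetical.

```ruby
CACHE_TTL_SECONDS    = 60 * 60 # tokens are cached for 1 hour
RETRY_WINDOW_SECONDS = 90 * 60 # roughly 1.5 hours of retries

# Minimum token lifetime, given an acceptable queue wait in seconds.
def token_lifetime_seconds(queue_wait_seconds)
  CACHE_TTL_SECONDS + RETRY_WINDOW_SECONDS + queue_wait_seconds
end

# Placeholder value only: assume jobs may wait up to 30 minutes in the queue.
lifetime = token_lifetime_seconds(30 * 60)
puts "token lifetime: #{lifetime} s (#{lifetime / 3600.0} h)"
```

With the 30-minute placeholder the lifetime comes out at 3 hours; the real number depends entirely on the measured queue wait.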

@akostadinov @mayorova WDYT?

@akostadinov
Contributor Author

> For me the main concern is what I mentioned about the tokens: storing them in plaintext in zync DB is wrong IMO; it is an existing risk vector that we could eliminate.

My suggestion is that, in a separate PR, we switch zync to encrypt the tokens as passwords to prevent such storage. I think this is the standard way to do it. I'm not in favor of expiring tokens; that will require a separate mechanism to update them before expiry... I mean, it is fine but more complex and not immediately necessary, as previously we were still storing the tokens in plaintext, so no regression in that regard. Also the tokens are read-only with limited scope.
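
As a rough illustration of what encrypting the stored tokens could mean, here is a minimal sketch using Ruby's standard OpenSSL bindings with AES-256-GCM. It assumes reversible encryption is what is intended, since zync has to present the token to porta again (a one-way password hash would not allow that). `TokenCipher` and the key handling are hypothetical; this is not zync's actual code.

```ruby
require "openssl"
require "base64"

# Hypothetical helper: encrypts a token before it is written to the DB
# and decrypts it when a job needs to call porta.
module TokenCipher
  ALGORITHM = "aes-256-gcm"
  IV_BYTES  = 12
  TAG_BYTES = 16

  # Returns Base64(iv + auth tag + ciphertext). key must be 32 bytes.
  def self.encrypt(plaintext, key)
    cipher = OpenSSL::Cipher.new(ALGORITHM)
    cipher.encrypt
    cipher.key = key
    iv = cipher.random_iv
    ciphertext = cipher.update(plaintext) + cipher.final
    Base64.strict_encode64(iv + cipher.auth_tag + ciphertext)
  end

  # Reverses encrypt; raises OpenSSL::Cipher::CipherError if the key is
  # wrong or the stored value was tampered with.
  def self.decrypt(encoded, key)
    raw = Base64.strict_decode64(encoded)
    cipher = OpenSSL::Cipher.new(ALGORITHM)
    cipher.decrypt
    cipher.key = key
    cipher.iv = raw.byteslice(0, IV_BYTES)
    cipher.auth_tag = raw.byteslice(IV_BYTES, TAG_BYTES)
    ciphertext = raw.byteslice(IV_BYTES + TAG_BYTES, raw.bytesize)
    (cipher.update(ciphertext) + cipher.final).force_encoding(Encoding::UTF_8)
  end
end

key = OpenSSL::Random.random_bytes(32) # in practice the key would come from a secret
stored = TokenCipher.encrypt("example-access-token", key)
puts TokenCipher.decrypt(stored, key)
```

Note that this only protects the tokens at rest in the DB; whoever holds the key can still recover them, so the key would have to live outside the database.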

@jlledom
Contributor

jlledom commented May 27, 2026

> My suggestion is that, in a separate PR, we switch zync to encrypt the tokens as passwords to prevent such storage. I think this is the standard way to do it.

That would remove the need to make the tokens expirable. But it has its own problems too, like performance impact; maybe it's negligible, but it needs to be investigated. Also, how would this be deployed: with or without a migration, and with or without downtime?

> I'm not in favor of expiring tokens; that will require a separate mechanism to update them before expiry...

In my comments above I mentioned that we don't need that mechanism if we accept losing jobs that haven't succeeded after a reasonable amount of time.

> Also the tokens are read-only with limited scope.

The read-only limited scope still grants access to client credentials for Cinstances.
