THREESCALE-14969: Rotate OIDC tokens for Zync#4310
Previously, the OIDC sync token was reused indefinitely via find_or_create_by!. Zync stored the plaintext token in its DB without encryption, so a long-lived token is a security liability. This change rotates the token hourly: on each cache miss the active token is expired (expires_at set to 1 day from now) and a fresh one is created. The plaintext is cached for 1 hour so rotation does not happen on every Zync job. Row locks serialize concurrent cache misses to avoid stampede. Expired tokens are kept for 1 day so Zync can finish any in-flight requests before they become invalid. A new worker (DeleteExpiredOIDCSyncTokensWorker) handles pruning expired OIDC tokens from the database when called. Assisted-by: Claude Code
Now that OIDC sync tokens are expired instead of deleted immediately, there is a need to periodically purge them from the database. The janitor runs weekly and is the right place for this housekeeping. Assisted-by: Claude Code
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##           master    #4310      +/-   ##
==========================================
- Coverage   88.92%   88.87%   -0.06%
==========================================
  Files        1752     1753       +1
  Lines       44131    44146      +15
  Branches      689      689
==========================================
- Hits        39245    39235      -10
- Misses       4870     4895      +25
  Partials       16       16
def self.refresh_oidc_sync
  user_id = scope_attributes["owner_id"]
  cache_key = "access_tokens/user:#{user_id}/oidc"

  # Hot path: skip the transaction entirely on cache hit (zero DB queries).
  cached = Rails.cache.read(cache_key)
  return cached if cached

  transaction do
    # Lock existing OIDC tokens to serialize concurrent cache misses (e.g. full resync).
    # Workers that arrive simultaneously will queue here. On cold start (no tokens yet)
    # there is nothing to lock and stampede churn is harmless; all created tokens are valid.
    lock.where(name: OIDC_SYNC_TOKEN).load

    # Double-check inside the transaction: a concurrent worker may have populated the cache
    # while we were waiting on the row lock above.
    Rails.cache.fetch(cache_key, expires_in: 1.hour) do
      # Expire (not delete) the current active token so Zync can keep using it for up to
      # 1 day while it picks up the new one. The janitor cleans up expired tokens weekly.
      where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)
      create!(name: OIDC_SYNC_TOKEN, scopes: %w[account_management], permission: 'ro').plaintext_value
    end
  end
end
This method could be much simpler, like:
def self.refresh_oidc_sync
  user_id = scope_attributes["owner_id"]
  Rails.cache.fetch("access_tokens/user:#{user_id}/oidc", expires_in: 1.hour) do
    where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)
    create!(name: OIDC_SYNC_TOKEN, scopes: %w[account_management], permission: 'ro').plaintext_value
  end
end
But that would cause a non-deterministic result on concurrent cache misses. For example, when we run the full resync task from #4307 and the cache key is invalid, many concurrent processes could enter the block at the same time and update-then-create tokens, which would end up with the last of them prevailing and expiring all previously created tokens. Claude explains:
Simple version (bare Rails.cache.fetch)
1. A, B, C all call Rails.cache.read — all miss
2. A enters Rails.cache.fetch — cache miss, enters block
3. B enters Rails.cache.fetch — also cache miss (A hasn't finished yet), enters block
4. C enters Rails.cache.fetch — also cache miss, enters block
5. A: update_all(expires_at: 1.day.from_now) — expires the active token. Then create! — new token.
6. B: update_all(expires_at: 1.day.from_now) — expires A's freshly created token (it has expires_at: nil). Then create! — another new token.
7. C: same — expires B's token, creates yet another.
8. Each process writes its own value to the cache — last writer wins.
Result: 3 rotations, 2 wasted tokens (still valid for 1 day, so no broken auth, but unnecessary churn).
But we could also end up with multiple non-expired tokens, in this scenario:
1. A: update_all — expires old token, commits
2. B: update_all — no rows with expires_at: nil, no-op
3. A: create! — creates T1 (expires_at: nil)
4. C: update_all — expires T1
5. B: create! — creates T2 (expires_at: nil)
6. C: create! — creates T3 (expires_at: nil)
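Both interleavings above can be replayed deterministically without a database. The following is only a sketch: `TokenTable`, `run_sequential` and `run_interleaved` are hypothetical names, and an in-memory array stands in for the access tokens table, with `expire_active` and `create` mimicking the `update_all` and `create!` calls.

```ruby
# Hypothetical in-memory stand-in for the access tokens table (no Rails needed).
StoredToken = Struct.new(:id, :expires_at)

class TokenTable
  def initialize
    @rows = [StoredToken.new("T0", nil)] # one active token to start with
  end

  # Mimics where(expires_at: nil).update_all(expires_at: 1.day.from_now)
  def expire_active(now)
    @rows.each { |t| t.expires_at = now + 86_400 if t.expires_at.nil? }
  end

  # Mimics create!(...): the new row has no expiry
  def create(id)
    @rows << StoredToken.new(id, nil)
  end

  def active_ids
    @rows.select { |t| t.expires_at.nil? }.map(&:id)
  end
end

# First scenario: each process runs update-then-create back to back.
def run_sequential(now = Time.at(0))
  table = TokenTable.new
  %w[T1 T2 T3].each do |id|
    table.expire_active(now)
    table.create(id)
  end
  table.active_ids
end

# Second scenario: the statements of A, B and C interleave.
def run_interleaved(now = Time.at(0))
  table = TokenTable.new
  table.expire_active(now) # A: expires T0
  table.expire_active(now) # B: nothing active, no-op
  table.create("T1")       # A
  table.expire_active(now) # C: expires T1
  table.create("T2")       # B
  table.create("T3")       # C
  table.active_ids
end

puts run_sequential.inspect  # a single active token, the last one created
puts run_interleaved.inspect # two active tokens left behind
```

In the first run only the last writer's token stays active; in the second, B's and C's tokens both end up with `expires_at: nil`.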
The committed code prevents this because it takes a row lock at lock.where(name: OIDC_SYNC_TOKEN).load, so processes B and C wait there until A closes the transaction block. Since the cache.fetch call is inside the transaction, the cache is already written by the time A finishes, so B and C will always get a hit when calling cache.fetch.
1. A, B, C all call Rails.cache.read — all miss
2. A enters the transaction first, runs SELECT ... FOR UPDATE — locks the OIDC token row
3. B and C enter transactions, hit SELECT ... FOR UPDATE — blocked at DB level, waiting for A's lock
4. A runs Rails.cache.fetch — cache miss, executes block: expires old token, creates new one, writes to cache. Transaction commits, lock released.
5. B gets the lock, runs SELECT ... FOR UPDATE. Then Rails.cache.fetch — cache hit (A populated it). Returns cached value. Transaction commits.
6. C same as B — cache hit, no DB writes.
Result: 1 rotation, 0 wasted tokens.
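The same lock-then-double-check shape can be tried out in a single Ruby process. This is a rough analogy rather than the PR's code: a `Mutex` stands in for the row lock taken by `SELECT ... FOR UPDATE`, a `Hash` stands in for `Rails.cache`, and `RotatingToken` and `run_concurrent_refresh` are hypothetical names.

```ruby
# Hypothetical in-process model of the locked refresh.
class RotatingToken
  attr_reader :rotations

  def initialize
    @cache = {}
    @row_lock = Mutex.new
    @rotations = 0
  end

  def refresh
    # Hot path: return early on a cache hit without touching the lock.
    cached = @cache[:oidc]
    return cached if cached

    @row_lock.synchronize do
      # Double-check: another thread may have rotated while we waited.
      @cache[:oidc] ||= rotate
    end
  end

  private

  def rotate
    @rotations += 1
    sleep 0.01 # widen the window so the other threads pile up on the lock
    "token-#{@rotations}"
  end
end

# Starts several workers at once and reports how many rotations happened
# and which distinct token values the workers received.
def run_concurrent_refresh(workers = 3)
  store = RotatingToken.new
  values = Array.new(workers) { Thread.new { store.refresh } }.map(&:value)
  [store.rotations, values.uniq]
end

rotations, tokens = run_concurrent_refresh
puts "rotations=#{rotations} tokens=#{tokens.inspect}"
```

Whichever thread takes the lock first performs the only rotation; the rest either find the cache populated once they get the lock or never reach the lock at all.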
This is convenient but not really necessary since such race conditions will be rare and the result is just some wasted tokens that the Janitor will purge anyway. So if you want the simple version, I'm fine with it.
I also considered the :race_condition_ttl option for Rails.cache.fetch but discarded it because it only prevents the cache stampede during the N seconds after the cache entry expires; in any other scenario all processes would enter the block anyway.
The lock increases the DB query count. If we really need locking, then it may be better to use redlock, like we do for billing.
Given the complexity, I wonder if we should avoid caching, expire the tokens, and clear them with the janitor every night.
# Expire (not delete) the current active token so Zync can keep using it for up to
# 1 day while it picks up the new one. The janitor cleans up expired tokens weekly.
where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)
Doesn't Zync pick up the new one almost immediately? 1 day just to pick up the key looks excessive.
Also, didn't we plan to always provide expiring tokens to begin with?
What this PR does / why we need it:
The issue description explains the situation in full. In summary, #4236 broke the integration with Zync. Porta calls the PUT /tenant endpoint on Zync and pushes the access token to it; Zync later uses that token to pull data from Porta. Since #4236, the access token is sent to Zync hashed, which makes it useless as an authentication method, so Zync can no longer retrieve data from Porta.
To solve that, this PR implements token rotation every hour; that way we mitigate possible Zync DB leaks containing plaintext tokens.
The rotation process implies a small race condition window between the moment the old token is expired and the new one is received by zync and available for next requests. For that reason, old tokens are not set to expire immediately or deleted, instead, they are set to expire 1 day after being discarded.
This way, even in the worst-case scenario, where the Zync queue gets stuck for some reason and stops processing jobs while still holding a discarded token in its DB, it has one complete day to recover.
In order to further mitigate the leaking problem, our plan is to make changes in Zync to implement client token encryption.
To further mitigate the race condition problems when rotating, we also include some caching that limits rotation to once per hour, plus protection against a cache stampede when many processes get a cache miss concurrently. This is better explained in the in-code comments below.
Besides, we also plan to implement retry logic for auth errors in Zync.
Finally, the PR also includes some changes in the Janitor to purge all discarded tokens once per week.
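The grace period and the purge can be pictured with a small model. This is a sketch under assumptions: `SyncToken`, `usable?` and `purgeable` are hypothetical names, and the rule that a token authenticates until its `expires_at` has passed is inferred from the description above, not taken from the actual worker.

```ruby
DAY = 86_400 # seconds

# Hypothetical stand-in for an access token row.
SyncToken = Struct.new(:id, :expires_at) do
  # A token with no expiry, or with an expiry still in the future, is usable.
  def usable?(now)
    expires_at.nil? || expires_at > now
  end
end

# What a purge pass could delete: only tokens whose expiry has already passed.
def purgeable(tokens, now)
  tokens.reject { |t| t.usable?(now) }
end

rotated_at = Time.at(0)
old_token = SyncToken.new("old", rotated_at + DAY) # discarded at rotation time
new_token = SyncToken.new("new", nil)              # freshly created

puts old_token.usable?(rotated_at + 3600)    # one hour after rotation
puts old_token.usable?(rotated_at + DAY + 1) # just past the grace period
puts purgeable([old_token, new_token], rotated_at + 7 * DAY).map(&:id).inspect
```

The discarded token keeps working during the one-day window, stops working afterwards, and is the only row a later purge pass would pick up.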
Replaces #4304 and #4309
Which issue(s) this PR fixes
https://redhat.atlassian.net/browse/THREESCALE-14969
Verification steps
Tests should pass. Also, work normally with Porta + Zync and verify that everything works as expected: creating domains, pushing apps to Keycloak, etc.