THREESCALE-14969: Rotate OIDC tokens for Zync #4310

Open
jlledom wants to merge 2 commits into master from THREESCALE-14969-zync-rotate-tokens

@jlledom jlledom commented May 29, 2026

What this PR does / why we need it:

The issue description explains the situation in full. In short, #4236 broke the integration with Zync. Porta calls the PUT /tenant endpoint on Zync to push an access token, which Zync later uses to pull data from Porta. Since #4236, the access token is sent to Zync hashed, which makes it useless as an authentication method, so Zync can no longer retrieve data from Porta.

To solve that, this PR rotates the token every hour, which limits the impact of a possible Zync DB leak containing plaintext tokens.

The rotation process opens a small race condition window between the moment the old token is expired and the moment Zync has received the new one and can use it for subsequent requests. For that reason, old tokens are neither deleted nor expired immediately. Instead, they are set to expire 1 day after being discarded.

This way, even in the worst-case scenario, where the Zync queue gets stuck and stops processing jobs while Zync still holds a discarded token in its DB, Zync has one full day to recover.
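A minimal sketch of the grace period described above, assuming a token is accepted when `expires_at` is nil or still in the future. The `Token` struct, the `usable?` helper and the timestamps are hypothetical and only illustrate the rule; they are not the AccessToken model from this PR.

```ruby
# Hypothetical in-memory model of a token row; the real model lives in the DB.
Token = Struct.new(:value, :expires_at) do
  # The active token has no expiry; a discarded token stays usable
  # until its expires_at has passed.
  def usable?(now)
    expires_at.nil? || expires_at > now
  end
end

GRACE_PERIOD = 24 * 60 * 60 # 1 day, in seconds

rotated_at = Time.utc(2026, 5, 29, 10, 0, 0)
old_token  = Token.new("old", rotated_at + GRACE_PERIOD) # discarded at rotation
new_token  = Token.new("new", nil)                       # active replacement

# 23 hours after the rotation Zync may still hold the old token.
puts old_token.usable?(rotated_at + 23 * 3600) # true
# Two days later the old token is rejected while the new one is accepted.
puts old_token.usable?(rotated_at + 48 * 3600) # false
puts new_token.usable?(rotated_at + 48 * 3600) # true
```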

In order to further mitigate the leaking problem, our plan is to make changes in Zync to implement client token encryption.

To further mitigate race conditions during rotation, we also add caching that limits rotation to once per hour and protects against cache stampede when many processes get a cache miss at the same time. The in-code comments below explain this in more detail.

Besides, we also plan to implement retry logic for auth errors in Zync.

Finally, the PR also includes some changes in the Janitor to purge all discarded tokens once per week.
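As a rough illustration of the purge, here is a sketch on an in-memory list. It assumes the worker only removes OIDC sync tokens whose `expires_at` has already passed; the `Token` struct, the placeholder constant value and the sample rows are hypothetical, and the real worker operates on the database.

```ruby
# Hypothetical in-memory stand-in for the access_tokens table.
Token = Struct.new(:name, :expires_at)

# Placeholder value: the real constant's value is not shown in this PR.
OIDC_SYNC_TOKEN = "oidc-sync"

now = Time.utc(2026, 6, 5, 0, 0, 0)
tokens = [
  Token.new(OIDC_SYNC_TOKEN, nil),        # active token, must survive
  Token.new(OIDC_SYNC_TOKEN, now - 3600), # grace period is over, purge
  Token.new(OIDC_SYNC_TOKEN, now + 3600), # still inside its grace period, keep
  Token.new("user token", now - 3600)     # expired, but not an OIDC sync token, keep
]

# Split the rows into the ones the purge removes and the ones it leaves alone.
purged, kept = tokens.partition do |t|
  t.name == OIDC_SYNC_TOKEN && !t.expires_at.nil? && t.expires_at <= now
end

puts "purged #{purged.size}, kept #{kept.size}"
```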

Replaces #4304 and #4309

Which issue(s) this PR fixes

https://redhat.atlassian.net/browse/THREESCALE-14969

Verification steps

Tests should pass. Also, work normally with Porta + Zync and verify that everything works as expected: creating domains, pushing apps to Keycloak, etc.

jlledom added 2 commits May 29, 2026 10:15
Previously, the OIDC sync token was reused indefinitely via
find_or_create_by!. Zync stored the plaintext token in its DB
without encryption, so a long-lived token is a security liability.

This change rotates the token hourly: on each cache miss the active
token is expired (expires_at set to 1 day from now) and a fresh one
is created. The plaintext is cached for 1 hour so rotation does not
happen on every Zync job. Row locks serialize concurrent cache misses
to avoid stampede. Expired tokens are kept for 1 day so Zync can
finish any in-flight requests before they become invalid.

A new worker (DeleteExpiredOIDCSyncTokensWorker) handles pruning
expired OIDC tokens from the database when called.

Assisted-by: Claude Code

Now that OIDC sync tokens are expired instead of deleted immediately,
there is a need to periodically purge them from the database. The
janitor runs weekly and is the right place for this housekeeping.

Assisted-by: Claude Code
@jlledom jlledom self-assigned this May 29, 2026

codecov Bot commented May 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.87%. Comparing base (c708373) to head (1950368).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4310      +/-   ##
==========================================
- Coverage   88.92%   88.87%   -0.06%     
==========================================
  Files        1752     1753       +1     
  Lines       44131    44146      +15     
  Branches      689      689              
==========================================
- Hits        39245    39235      -10     
- Misses       4870     4895      +25     
  Partials       16       16              

@jlledom jlledom marked this pull request as ready for review May 29, 2026 11:09
Comment on lines +150 to 173
def self.refresh_oidc_sync
  user_id = scope_attributes["owner_id"]
  cache_key = "access_tokens/user:#{user_id}/oidc"

  # Hot path: skip the transaction entirely on cache hit (zero DB queries).
  cached = Rails.cache.read(cache_key)
  return cached if cached

  transaction do
    # Lock existing OIDC tokens to serialize concurrent cache misses (e.g. full resync).
    # Workers that arrive simultaneously will queue here. On cold start (no tokens yet)
    # there is nothing to lock and stampede churn is harmless: all created tokens are valid.
    lock.where(name: OIDC_SYNC_TOKEN).load

    # Double-check inside the transaction: a concurrent worker may have populated the cache
    # while we were waiting on the row lock above.
    Rails.cache.fetch(cache_key, expires_in: 1.hour) do
      # Expire (not delete) the current active token so Zync can keep using it for up to
      # 1 day while it picks up the new one. The janitor cleans up expired tokens weekly.
      where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)
      create!(name: OIDC_SYNC_TOKEN, scopes: %w[account_management], permission: 'ro').plaintext_value
    end
  end
end

@jlledom jlledom May 29, 2026

This method could be much simpler, like:

def self.refresh_oidc_sync
  user_id = scope_attributes["owner_id"]
  Rails.cache.fetch("access_tokens/user:#{user_id}/oidc", expires_in: 1.hour) do
    where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)
    create!(name: OIDC_SYNC_TOKEN, scopes: %w[account_management], permission: 'ro').plaintext_value
  end
end

But that would cause a non-deterministic result on concurrent cache misses. For example, when we run the full resync task from #4307 and the cache key is invalid, many concurrent processes could enter the block at the same time and each update then create tokens. The last of them would prevail and expire all previously created tokens. Claude explains:

Simple version (bare Rails.cache.fetch)
  
  1. A, B, C all call Rails.cache.read — all miss
  2. A enters Rails.cache.fetch — cache miss, enters block
  3. B enters Rails.cache.fetch — also cache miss (A hasn't finished yet), enters block
  4. C enters Rails.cache.fetch — also cache miss, enters block
  5. A: update_all(expires_at: 1.day.from_now) — expires the active token. Then create! — new token.
  6. B: update_all(expires_at: 1.day.from_now) — expires A's freshly created token (it has expires_at: nil). Then create! — another new token.
  7. C: same — expires B's token, creates yet another.
  8. Each process writes its own value to the cache — last writer wins.

  Result: 3 rotations, 2 wasted tokens (still valid for 1 day, so no broken auth, but unnecessary churn).
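The interleaving above can be replayed with plain Ruby objects. This is only a sketch: the `Token` struct, the array standing in for the table and the hash standing in for Rails.cache are hypothetical, and the three processes are run one after another because each of them has already missed the cache.

```ruby
# Hypothetical in-memory stand-ins for the access_tokens table and Rails.cache.
Token = Struct.new(:id, :expires_at)

tokens  = [Token.new(0, nil)] # the pre-existing active token
cache   = {}
now     = Time.utc(2026, 5, 29, 10, 0, 0)
one_day = 24 * 60 * 60

# Mirrors the block passed to Rails.cache.fetch in the simple version:
# expire every active token, then create a fresh one.
rotate = lambda do
  tokens.each { |t| t.expires_at = now + one_day if t.expires_at.nil? }
  Token.new(tokens.size, nil).tap { |t| tokens << t }
end

# A, B and C have all missed the cache, so each one runs the block
# and then writes its own result to the cache.
%w[A B C].each { |_process| cache[:oidc] = rotate.call.id }

active = tokens.select { |t| t.expires_at.nil? }
puts "rotations: #{tokens.size - 1}"           # one new token per process
puts "active ids: #{active.map(&:id).inspect}" # only the last writer's token
puts "cached id: #{cache[:oidc]}"
```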

But we could also end up with multiple non-expired tokens, in this scenario:

1. A: update_all — expires old token, commits
2. B: update_all — no rows with expires_at: nil, no-op
3. A: create! — creates T1 (expires_at: nil)
4. C: update_all — expires T1
5. B: create! — creates T2 (expires_at: nil)
6. C: create! — creates T3 (expires_at: nil)

Result: T2 and T3 are both left active (expires_at: nil).
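Replaying those six steps against an in-memory list shows which tokens are left active. The `Token` struct and both lambdas are hypothetical stand-ins for the `update_all` and `create!` calls; T0 is a name chosen here for the old token.

```ruby
# Hypothetical in-memory stand-in for the access_tokens table.
Token = Struct.new(:name, :expires_at)

now     = Time.utc(2026, 5, 29, 10, 0, 0)
one_day = 24 * 60 * 60
tokens  = [Token.new("T0", nil)] # the old active token

# Stand-in for update_all: expire every active token, return the rows touched.
expire_active = lambda do
  active = tokens.select { |t| t.expires_at.nil? }
  active.each { |t| t.expires_at = now + one_day }
  active.size
end

# Stand-in for create!: add a new active token.
create = ->(name) { tokens << Token.new(name, nil) }

expire_active.call # 1. A expires the old token
expire_active.call # 2. B finds no active token, no-op
create.call("T1")  # 3. A creates T1
expire_active.call # 4. C expires T1
create.call("T2")  # 5. B creates T2
create.call("T3")  # 6. C creates T3

still_active = tokens.select { |t| t.expires_at.nil? }.map(&:name)
puts still_active.inspect
```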

The committed code prevents this because it takes a DB lock at lock.where(name: OIDC_SYNC_TOKEN).load. Processes B and C wait there until A closes the transaction block. Since the cache.fetch call is inside the transaction, the cache is already written by the time A finishes, so B and C always get a hit when they call cache.fetch.

1. A, B, C all call Rails.cache.read — all miss
2. A enters the transaction first, runs SELECT ... FOR UPDATE — locks the OIDC token row
3. B and C enter transactions, hit SELECT ... FOR UPDATE — blocked at DB level, waiting for A's lock
4. A runs Rails.cache.fetch — cache miss, executes block: expires old token, creates new one, writes to cache. Transaction commits, lock released.
5. B gets the lock, runs SELECT ... FOR UPDATE. Then Rails.cache.fetch — cache hit (A populated it). Returns cached value. Transaction commits.
6. C same as B — cache hit, no DB writes.

Result: 1 rotation, 0 wasted tokens.
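The same lock-then-double-check pattern can be sketched with threads. This is an analogy under stated assumptions, not the committed code: a Mutex stands in for the row lock taken by SELECT ... FOR UPDATE, a Hash stands in for Rails.cache, and the `refresh` lambda and token strings are hypothetical.

```ruby
# Hypothetical stand-ins: a Mutex for the DB row lock, a Hash for Rails.cache.
row_lock  = Mutex.new
cache     = {}
rotations = 0

refresh = lambda do
  # Hot path: return straight away on a cache hit, without taking the lock.
  cached = cache[:oidc]
  next cached if cached

  row_lock.synchronize do
    # Double-check: another thread may have filled the cache while we waited.
    cache[:oidc] ||= begin
      sleep 0.05 # simulate the DB writes so the other threads queue on the lock
      rotations += 1
      "token-#{rotations}"
    end
  end
end

# Three concurrent callers, like processes A, B and C.
results = Array.new(3) { Thread.new { refresh.call } }.map(&:value)

puts "rotations: #{rotations}"
puts "values seen: #{results.uniq.inspect}"
```

Only the first thread to take the lock runs the rotation; the others find the cache populated when they get the lock and return the same value.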

This is convenient but not really necessary since such race conditions will be rare and the result is just some wasted tokens that the Janitor will purge anyway. So if you want the simple version, I'm fine with it.

Contributor Author

I also considered the :race_condition_ttl option for Rails.cache.fetch but discarded it because it only prevents the cache stampede during the short window of N seconds after the cache entry expires. In any other scenario all processes would enter the block anyway.

Contributor

The lock increases the DB query count. If we really need locking, then it may be better to use redlock like we do for billing.

Given the complexity I wonder if we should avoid caching, expire the tokens and clear them with janitor every night.

Comment on lines +167 to +169
# Expire (not delete) the current active token so Zync can keep using it for up to
# 1 day while it picks up the new one. The janitor cleans up expired tokens weekly.
where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)

@akostadinov akostadinov May 29, 2026

Doesn't Zync pick up the new one almost immediately? One day just to pick up the key looks excessive.

Also, didn't we consider always providing expiring tokens to begin with?
