
Credential api use of tokens #1913

Merged: yoks merged 24 commits into NVIDIA:main from yoks:credential-api-use-of-tokens on Jun 2, 2026

Conversation

@yoks
Contributor

@yoks yoks commented May 23, 2026

Description

First phase of SessionTokens API support.

Enforces GetBmcCredentials to use SessionService tokens, meaning if BMC does not support Session, API will error out.

The API first gets the SPIFFE identifier of the calling service, then tries to rotate the token: if there is a token in the database (a new table stores token IDs), it revokes the old token and issues a new one. If there is no token, it just issues a new one. Clients are expected to call this API themselves to rotate expired tokens (on auth failure).
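
The rotation flow described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from this PR: `SessionKey`, `SessionService`, `SessionStore`, and `FakeBmc` are hypothetical names, an in-memory map stands in for the new database table, and ignoring a failed revoke is an assumption.

```rust
use std::collections::HashMap;

/// Key for a stored session: one session per (calling service, BMC) pair.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct SessionKey {
    pub spiffe_id: String,
    pub bmc_mac: String,
}

/// Hypothetical stand-in for a BMC's session service.
pub trait SessionService {
    /// Creates a session and returns (session id, auth token).
    fn create_session(&mut self) -> Result<(String, String), String>;
    /// Deletes a previously created session.
    fn delete_session(&mut self, session_id: &str) -> Result<(), String>;
}

/// In-memory stand-in for the table that stores issued session IDs.
#[derive(Default)]
pub struct SessionStore {
    sessions: HashMap<SessionKey, String>,
}

impl SessionStore {
    /// Revokes the caller's old session (if any), then issues a new token.
    pub fn rotate(
        &mut self,
        key: SessionKey,
        bmc: &mut dyn SessionService,
    ) -> Result<String, String> {
        if let Some(old_id) = self.sessions.remove(&key) {
            // Assumption: a failed revoke is ignored, since the old
            // session may already have expired on the BMC.
            let _ = bmc.delete_session(&old_id);
        }
        let (session_id, token) = bmc.create_session()?;
        self.sessions.insert(key, session_id);
        Ok(token)
    }
}

/// Fake BMC used only to exercise the sketch.
#[derive(Default)]
pub struct FakeBmc {
    next_id: u32,
    pub active: Vec<String>,
}

impl SessionService for FakeBmc {
    fn create_session(&mut self) -> Result<(String, String), String> {
        self.next_id += 1;
        let id = format!("session-{}", self.next_id);
        self.active.push(id.clone());
        Ok((id.clone(), format!("token-for-{id}")))
    }

    fn delete_session(&mut self, session_id: &str) -> Result<(), String> {
        self.active.retain(|s| s != session_id);
        Ok(())
    }
}
```

Calling `rotate` twice with the same key leaves only the second session active on the fake BMC, which is the "revoke old, issue new" behavior a client would rely on after an auth failure.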

Another major change is the beginning of the move of the AvoidLockout circuit breaker to this function, since in the future this should be the only place that handles Basic credentials. Auth tokens themselves could cause lockout. This is also why we prefer not to share credentials at all (to consolidate the CircuitBreaker behavior here).
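
A lockout circuit breaker of the kind mentioned above might look like the sketch below. It is only an assumption about the general shape: `LockoutBreaker`, its threshold, and all method signatures are hypothetical (the name `note_credentials_updated` is borrowed from the PR, but its real signature is not shown here).

```rust
use std::collections::HashMap;

/// Hypothetical lockout breaker: blocks Basic-auth attempts against a BMC
/// after too many consecutive failures.
pub struct LockoutBreaker {
    max_failures: u32,
    /// Consecutive failure count, keyed by BMC MAC address.
    failures: HashMap<String, u32>,
}

impl LockoutBreaker {
    pub fn new(max_failures: u32) -> Self {
        Self { max_failures, failures: HashMap::new() }
    }

    /// Returns true while another Basic-auth attempt is still allowed.
    pub fn allow_attempt(&self, bmc_mac: &str) -> bool {
        self.failures.get(bmc_mac).copied().unwrap_or(0) < self.max_failures
    }

    /// Records a failed authentication attempt.
    pub fn note_auth_failure(&mut self, bmc_mac: &str) {
        *self.failures.entry(bmc_mac.to_string()).or_insert(0) += 1;
    }

    /// Records a successful attempt, which clears the failure count.
    pub fn note_auth_success(&mut self, bmc_mac: &str) {
        self.failures.remove(bmc_mac);
    }

    /// Resets the breaker after the stored credentials were changed.
    pub fn note_credentials_updated(&mut self, bmc_mac: &str) {
        self.failures.remove(bmc_mac);
    }
}
```

Keeping the breaker next to the only code path that uses Basic credentials means every attempt is counted in one place, which is the consolidation the description argues for.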

This should, in general, work for sharded environments, but it is preferred that specific API instances work with a specific set of BMC MACs, to avoid races/simultaneous refreshes and DB locks.

After this PR is merged, each service must have a SPIFFE identifier to get BmcCredentials; this ensures that each service gets its own credentials (one set per SPIFFE identifier). It also requires all sharded services to maintain a proper sharding strategy per SPIFFE identifier (e.g. shards should not overlap in the BMCs they cover); otherwise credentials will keep being rotated, which can cause a credential reissue storm.
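
One way to keep shards from overlapping is to derive the owning shard from the BMC MAC address. The sketch below is purely illustrative and is not the sharding strategy used by any service in this repository; `shard_for_bmc` and `owns_bmc` are hypothetical helpers, and FNV-1a is chosen here only because it is simple and gives the same result in every process.

```rust
/// FNV-1a hash over the MAC bytes; stable across processes and Rust versions.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Maps a BMC MAC address to exactly one shard index in `0..shard_count`.
pub fn shard_for_bmc(mac: [u8; 6], shard_count: u64) -> u64 {
    assert!(shard_count > 0, "shard_count must be non-zero");
    fnv1a(&mac) % shard_count
}

/// Returns true if the instance with `shard_index` owns the given BMC.
pub fn owns_bmc(mac: [u8; 6], shard_index: u64, shard_count: u64) -> bool {
    shard_for_bmc(mac, shard_count) == shard_index
}
```

Because every instance of a service computes the same owner for a given MAC, two instances sharing one SPIFFE identifier would not both request (and thereby rotate) the token for the same BMC.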

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Implements a big chunk of: #460

Should finally fix this bug for good: #1292

Breaking Changes

  • This PR contains breaking changes
    The Credentials API no longer returns passwords. It explicitly does not work with BMCs that do not support SessionService. We can add a flag in the future to make an exception for that.

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@yoks yoks requested a review from a team as a code owner May 23, 2026 02:08
@yoks yoks requested a review from Matthias247 May 23, 2026 02:09
yoks added 2 commits May 22, 2026 19:21
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
pub bmc_mac_address: MacAddress,
pub session_odata_id: String,
pub issued_at: DateTime<Utc>,
}
Contributor

nit: It is better to move the definition of StoredSession to the model crate. At least, this is the pattern we use for most data types.

Contributor

@kensimon kensimon left a comment

Questions about some of the main issues, I haven't reviewed the whole PR yet.

Comment thread crates/health/src/discovery/spawn.rs Outdated
let key = endpoint.key();
let endpoint_arc = endpoint.clone();

let credentials = endpoint.credentials().ok_or_else(|| {
Contributor

I think these will be constructed without credentials (or at least I don't see a code path that sets them prior to spawn_collectors_for_endpoint getting called), should we call endpoint.ensure_credentials().await?; first here?

Contributor Author

Yeah, I moved the credentials initialization around a few times, and I think I forgot to call init in spawn on the last move.

Comment thread crates/health/src/api_client.rs Outdated

Self { client }
let credential_provider: Arc<dyn CredentialProvider> = Arc::new(ApiCredentialProvider {
client: client.clone(),
Contributor

(See previous comment) I think the credentials are initialized here, but we don't call endpoint.ensure_credentials().await?; between here and run_discovery_iteration.

Comment thread crates/api-core/src/credentials/bmc_session_manager.rs
Contributor

@kensimon kensimon left a comment

(Updating my review to Request changes... I'm not sure if I'm right about my feedback but it's probably better if this didn't merge until we discuss)

Comment thread crates/health/src/endpoint/model.rs Outdated
pub(crate) credentials: Arc<RwLock<Option<BmcCredentials>>>,
pub(crate) provider: Arc<dyn CredentialProvider>,
// Needed to ensure only one collector fetches endpoint
pub(crate) fetch_lock: Arc<AsyncMutex<()>>,
Contributor

Instead of an out-of-band mutex on an empty tuple, could we instead lock self.credentials before fetching and it would accomplish the same thing? (We'd need to make self.credentials a tokio::RwLock, but that's it)

That is, instead of doing:

let _guard = self.fetch_lock.lock().await;
let fresh = self.provider.fetch_credentials(&self.addr).await?;
*self.credentials.write().expect("lock poisoned") = Some(fresh.clone());

Couldn't we just guard on self.credentials itself?

let mut credentials = self.credentials.write().await;
let fresh = self.provider.fetch_credentials(&self.addr).await?;
*credentials = Some(fresh.clone());

Contributor Author

All this layered synchronization needs to go. I will try to wrap it all inside BMC: it has the most important function (set_credentials), and we need to synchronize on it and not leak it outside.

@yoks
Contributor Author

yoks commented May 26, 2026

@kensimon thanks for the review. I think I puzzled myself with several layers of collectors. I was completely focused on running just one BMC collector in all my tests and forgot about this issue (with multiple collectors), so that one slipped through.

I need to rethink how this whole credentials refresh works. It is several layers of historical (pre-token) refreshes, so it is better to rewrite it from scratch.

@Matthias247
Contributor

> Enforces GetBmcCredentials to use SessionService tokens, meaning if BMC does not support Session, API will error out.

Do we have to introduce that constraint?

My assumption was that most callers now get credentials from some abstract credential provider per entity, and that the provider could then hand out either tokens or username/password, depending on what's available.
In that case it could be up to the provider to check what's available: if tokens are available, manage them (including rotating them) and hand them out; if not, hand out username/password.

@Matthias247
Contributor

Can you add a bit more detail to the description about when sessions are established and tokens are rotated? E.g.:

  • session establishment and token rotation is done in site-explorer
  • session establishment and token rotation is done by any callpath in current nico-core which fetches credentials, including site-explorer, state-handler code which interacts with BMCs, fetchBmcCredentials APIs, etc
  • a dedicated new process which is supposed to manage and rotate credentials

I think it probably works either way in the "there is just 1 nico-core instance" case, but for sharding things might become more interesting (because site-explorer sharding would not necessarily match how hw-health is sharded).

@yoks
Contributor Author

yoks commented May 26, 2026

> Enforces GetBmcCredentials to use SessionService tokens, meaning if BMC does not support Session, API will error out.
>
> Do we have to introduce that constraint?
>
> My assumption was that most callers now get credentials from some abstract credential provider per entity, and that the provider could then hand out either tokens or username/password, depending on what's available. In that case it could be up to the provider to check what's available: if tokens are available, manage them (including rotating them) and hand them out; if not, hand out username/password.

This constraint is artificial; if we hide/enforce it via a config param, would that be OK? The thought is that exposing Basic credentials prevents us from ensuring they are not locked out or abused in any way. It is also easier to add new integrations which would use credentials.
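
The two positions in this exchange (fall back to username/password when tokens are unavailable, but gate that behind a config parameter) can be combined as in the sketch below. Everything here is hypothetical: `IssuedCredentials`, `CredentialError`, `BmcInfo`, `issue_credentials`, and the `allow_basic` parameter are illustrative names, not the types or flag added by this PR.

```rust
/// Credentials handed to a caller (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum IssuedCredentials {
    SessionToken(String),
    Basic { username: String, password: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum CredentialError {
    /// The BMC has no session support and Basic fallback is disabled.
    SessionsUnsupported,
}

/// What the provider knows about one BMC (hypothetical).
pub struct BmcInfo {
    pub supports_sessions: bool,
    pub username: String,
    pub password: String,
}

/// Prefers a session token; falls back to Basic only when `allow_basic` is set.
pub fn issue_credentials(
    bmc: &BmcInfo,
    allow_basic: bool,
    issue_token: impl FnOnce() -> String,
) -> Result<IssuedCredentials, CredentialError> {
    if bmc.supports_sessions {
        return Ok(IssuedCredentials::SessionToken(issue_token()));
    }
    if allow_basic {
        return Ok(IssuedCredentials::Basic {
            username: bmc.username.clone(),
            password: bmc.password.clone(),
        });
    }
    Err(CredentialError::SessionsUnsupported)
}
```

With `allow_basic` off, a BMC without session support yields an error instead of a password, which matches the breaking change listed in the description; turning it on restores the fallback for such BMCs.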

@yoks
Contributor Author

yoks commented May 26, 2026

> Can you add a bit more detail to the description about when sessions are established and tokens are rotated? E.g.:
>
>   • session establishment and token rotation is done in site-explorer
>   • session establishment and token rotation is done by any callpath in current nico-core which fetches credentials, including site-explorer, state-handler code which interacts with BMCs, fetchBmcCredentials APIs, etc
>   • a dedicated new process which is supposed to manage and rotate credentials
>
> I think it probably works either way in the "there is just 1 nico-core instance" case, but for sharding things might become more interesting (because site-explorer sharding would not necessarily match how hw-health is sharded).

As long as each shard works with only one BMC it should be fine. Tokens should be issued per entity (in my case spiffe). So for current NICo it would be NICO-Core token, but if explorer some day becomes a separate service it should have its own token.

@yoks
Contributor Author

yoks commented May 27, 2026

@kensimon I removed most of the synchronization logic and made Endpoint own BMCClient, which is now the only place where auth credentials are rejected and updated/fetched, via the provided credential provider.

Also, for NVUE I modelled it in a similar fashion, with credentials provider refresh.
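
The "client owns the credentials and refreshes them on rejection" arrangement described in this comment can be illustrated with the sketch below. It is a simplified, synchronous assumption of the design, not the PR's code: the real `CredentialProvider` and `BMCClient` are async and have different signatures, and `BmcClient`, `RequestError`, and `CountingProvider` here are hypothetical.

```rust
/// Hypothetical provider that can fetch fresh credentials (e.g. a rotated token).
pub trait CredentialProvider {
    fn fetch_credentials(&mut self) -> Result<String, String>;
}

#[derive(Debug, PartialEq, Eq)]
pub enum RequestError {
    Unauthorized,
    Other(String),
}

/// Client that owns the current token and is the only place that replaces it.
pub struct BmcClient<P: CredentialProvider> {
    token: Option<String>,
    provider: P,
}

impl<P: CredentialProvider> BmcClient<P> {
    pub fn new(provider: P) -> Self {
        Self { token: None, provider }
    }

    /// Runs `send` with the current token; on an auth rejection, refreshes
    /// the token once through the provider and retries a single time.
    pub fn request<T>(
        &mut self,
        mut send: impl FnMut(&str) -> Result<T, RequestError>,
    ) -> Result<T, RequestError> {
        if self.token.is_none() {
            self.refresh()?;
        }
        let token = self.token.clone().unwrap_or_default();
        match send(&token) {
            Err(RequestError::Unauthorized) => {
                self.refresh()?;
                let token = self.token.clone().unwrap_or_default();
                send(&token)
            }
            other => other,
        }
    }

    /// Fetches fresh credentials and stores them as the current token.
    fn refresh(&mut self) -> Result<(), RequestError> {
        let fresh = self
            .provider
            .fetch_credentials()
            .map_err(RequestError::Other)?;
        self.token = Some(fresh);
        Ok(())
    }
}

/// Test provider that hands out numbered tokens.
pub struct CountingProvider(pub u32);

impl CredentialProvider for CountingProvider {
    fn fetch_credentials(&mut self) -> Result<String, String> {
        self.0 += 1;
        Ok(format!("token-{}", self.0))
    }
}
```

Because `request` takes `&mut self`, callers cannot race on the token in this sketch; an async version would need an equivalent single point of synchronization inside the client.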

@Matthias247
Contributor

> should be fine. Tokens should be issued per entity (in my case spiffe). So for current NICo it would be NICO-Core token, but if explorer some day becomes a separate service it should have its own token.

I meant adding these details to the PR description. Right now I'd really need to reverse engineer the code to understand how and when tokens are issued.

Contributor

@kensimon kensimon left a comment

Approving with a question, but make sure to get feedback from @Matthias247 before merging (I wish GitHub had a way to add required reviews to a PR)


// Reset breaker
api.bmc_session_manager
.note_credentials_updated(bmc_mac_address)
Contributor

Question: Do we want to reset the lockout on every code path that sets credentials? If so we're missing a bunch... at the very least:

crates/api/src/credentials/mod.rs:65
crates/api/src/handlers/credential.rs:390
crates/site-explorer/src/bmc_endpoint_explorer.rs:259
crates/site-explorer/src/bmc_endpoint_explorer.rs:298

Contributor Author

@yoks yoks May 28, 2026

The plan is to remove all Basic credentials interaction outside of the bmc_session_manager, and to enforce tokens on every path except the one which issues the token. So lockout clearing should live just here, and nowhere else.

As for the paths you mentioned, I think it is safe to add it to crates/api/src/handlers/credential.rs:390, but it calls the credential manager anyway, so we can skip it (redundant). As for site explorer, it is on a different path and I would omit it, as adding it there involves passing the session manager to site explorer, which couples them, and we want to decouple site explorer from the API as much as we can (in the future it should not use the credential manager directly).

TL;DR: in the end we want a single point that accesses the session (and possibly triggers lockout), and a single point that sets credentials and so clears the lockout.

@yoks
Contributor Author

yoks commented May 28, 2026

Added two more fixes, proposed by @poroh:

  1. Concurrent requests for token refresh from different spiffe services can overcome the lockout circuit breaker. This seems very unlikely (they would have to land concurrently), but it is still worth the fix.
  2. The in-flight tracking state can grow without bound. This is also very unlikely in normal operation, but as a theoretical problem it is worth the fix. I used AI to help, and it came up with Arc::strong_count for it, which I found quite elegant, with a low amount of actual LoC for that fix.
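
The `Arc::strong_count` idea from the second fix can be illustrated as follows. This is a guess at the general technique rather than the code in the PR: `InFlight`, `lock_for`, and `prune` are hypothetical names, and the map of per-BMC locks is an assumed data structure.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Per-BMC locks used to serialize refreshes for the same BMC.
#[derive(Default)]
pub struct InFlight {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl InFlight {
    /// Returns the lock for `bmc_mac`, creating it on first use.
    pub fn lock_for(&self, bmc_mac: &str) -> Arc<Mutex<()>> {
        let mut locks = self.locks.lock().expect("lock poisoned");
        locks.entry(bmc_mac.to_string()).or_default().clone()
    }

    /// Drops entries nobody is using: a strong count of 1 means the map
    /// holds the only remaining reference.
    pub fn prune(&self) {
        let mut locks = self.locks.lock().expect("lock poisoned");
        locks.retain(|_, lock| Arc::strong_count(lock) > 1);
    }

    /// Number of BMCs currently tracked.
    pub fn len(&self) -> usize {
        self.locks.lock().expect("lock poisoned").len()
    }
}
```

New clones are only handed out while the outer map lock is held, so the count checked inside `prune` cannot increase while it runs, and an entry that is still in use is never removed.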

feat: credential api use of tokens
@yoks yoks force-pushed the credential-api-use-of-tokens branch from 554c63a to be18719 Compare May 28, 2026 21:45
@yoks
Contributor Author

yoks commented Jun 1, 2026

@Matthias247 Added a flag to control issuing of Basic credentials

@yoks yoks enabled auto-merge (squash) June 1, 2026 22:11
@yoks yoks disabled auto-merge June 2, 2026 11:44
@yoks yoks enabled auto-merge (squash) June 2, 2026 14:21
@yoks yoks merged commit 1199e1b into NVIDIA:main Jun 2, 2026
52 checks passed

4 participants