Credential api use of tokens#1913
Conversation
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
| pub bmc_mac_address: MacAddress, | ||
| pub session_odata_id: String, | ||
| pub issued_at: DateTime<Utc>, | ||
| } |
There was a problem hiding this comment.
nit: It is better to move definition of StoredSession to model crate. At least, this is the pattern we use for most data types.
kensimon
left a comment
There was a problem hiding this comment.
Questions about some of the main issues, I haven't reviewed the whole PR yet.
| let key = endpoint.key(); | ||
| let endpoint_arc = endpoint.clone(); | ||
|
|
||
| let credentials = endpoint.credentials().ok_or_else(|| { |
There was a problem hiding this comment.
I think these will be constructed without credentials (or at least I don't see a code path that sets them prior to spawn_collectors_for_endpoint getting called), should we call endpoint.ensure_credentials().await?; first here?
There was a problem hiding this comment.
Yea, i moved credentials around few times (initialization), and i think i forgot to call init in spawn. On last move.
|
|
||
| Self { client } | ||
| let credential_provider: Arc<dyn CredentialProvider> = Arc::new(ApiCredentialProvider { | ||
| client: client.clone(), |
There was a problem hiding this comment.
(See previous comment) I think the credentials are initialized here, but we don't call endpoint.ensure_credentials().await?; between here and run_discovery_iteration.
kensimon
left a comment
There was a problem hiding this comment.
(Updating my review to Request changes... I'm not sure if I'm right about my feedback but it's probably better if this didn't merge until we discuss)
| pub(crate) credentials: Arc<RwLock<Option<BmcCredentials>>>, | ||
| pub(crate) provider: Arc<dyn CredentialProvider>, | ||
| // Neded to ensure only one collector fetches endpoint | ||
| pub(crate) fetch_lock: Arc<AsyncMutex<()>>, |
There was a problem hiding this comment.
Instead of an out-of-band mutex on an empty tuple, could we instead lock self.credentials before fetching and it would accomplish the same thing? (We'd need to make self.credentials a tokio::RwLock, but that's it)
That is, instead of doing:
let _guard = self.fetch_lock.lock().await;
let fresh = self.provider.fetch_credentials(&self.addr).await?;
*self.credentials.write().expect("lock poisoned") = Some(fresh.clone());
Couldn't we just guard on self.credentials itself?
let mut credentials = self.credentials.write().await;
let fresh = self.provider.fetch_credentials(&self.addr).await?;
*credentials = Some(fresh.clone());
There was a problem hiding this comment.
All this layerd syncronization need to go. I will try to wrap all inside BMC, is has the most important function (set_credentials) and we need synchronize on it. And not leak it outside
|
@kensimon thanks for review, i think i puzzled myself with several layers of collectors. i was completly focused on just running one BMC collector in all my tests and forgot this issue (with multiple collectors) so that one slipped through. I need rethink how this whole credentials refresh works, it is several layers of historical (before tokens) refreshes, so better to rewrite it from scratch. |
Do we have to introduce that constraint? My assumption was that most callers now get credentials from some abstract credentialprovider per entity. And that provider could then either hand out tokens or username/password - depending on whats available. |
|
Can you add a bit more detail to the description of when sessions are established and tokens are rotated. Eg.
I think it probably works either way in the "there is just 1 nico-core instance" case, but for sharding things might become more interesting (because site-explorer sharding would not necessarily match how hw-health is sharded). |
This is artifical, if we hide/enforce it by config param, is this be ok? Thought is exposing Basic credentials prevent us from ensuring they not be locked out/abused in any way. Easier to add new integrations which would use credentials. |
As long as each shard works with only one BMC it should be fine. Tokens should be issued per entity (in my case spiffe). So for current NICo it would be NICO-Core token, but if explorer som day become separat service it should have their own token. |
|
@kensimon I removed most of the synchronization logic out and made what Endpoint owns BMCClient, which is only place where auth credentials are rejected and updated/fetched, via provided credential provider. Also for NVUE i modelled it in similar fashion, with credentials provider refresh. |
I meant adding these details to the PR description. Right now I'd really need to reverse engineer the code to understand how and when tokens are issued. |
kensimon
left a comment
There was a problem hiding this comment.
Approving with a question, but make sure to get feedback from @Matthias247 before merging (I wish GitHub had a way to add required reviews to a PR)
|
|
||
| // Reset breaker | ||
| api.bmc_session_manager | ||
| .note_credentials_updated(bmc_mac_address) |
There was a problem hiding this comment.
Question: Do we want to reset the lockout on every code path that sets credentials? If so we're missing a bunch... at the very least:
crates/api/src/credentials/mod.rs:65
crates/api/src/handlers/credential.rs:390
crates/site-explorer/src/bmc_endpoint_explorer.rs:259
crates/site-explorer/src/bmc_endpoint_explorer.rs:298
There was a problem hiding this comment.
Plan is to remove all basic credentials interaction outside of the bmc_session_manager, and enforce token on every path execept the one which issues token. So lockout clearing should live just here, and nowhere else.
As for paths you mentioned, i think it is safe to add it to crates/api/src/handlers/credential.rs:390, it calls credental manager anyway, so we can skip it (redundant). As for site explorer, it is in different path and i would omit it, as adding it here involves passing session manager to site explorer which couples it, and we want to decouple it from api as much as we can (in future not using credential manager directly).
TLDR in the end we want single point to access session (and possible lockout), and single point which sets credentials so it would clear lockout.
|
Added to more fixes, proposed by @poroh
|
feat: credential api use of tokens
554c63a to
be18719
Compare
|
@Matthias247 Added flag to control basic credentials issue |
Description
First phase of SessionTokens API support.
Enforces GetBmcCredentials to use SessionService tokens, meaning if BMC does not support Session, API will error out.
API would first get spiffe identifier of the calling services, then try to rotate token, meaning if there is token in database (there is new table which stored token IDs), it will revoke old token and issue new one. If there is no token, it would just issue new token. Clients expected to call this api to rotate expired tokens themselves (on auth failure).
Another major change is the begging of movent of AvoidLockout circuit breaker to this function, as in future, this should be only place what handles Basic credentials. Auth tokens themselvels could cause lockout. This also why we preffer to not share credentials at all (to consilidate this CircuitBreaker behavior here).
Should in general, work for Sharded envs, but it is preffered what there is specific API instances work with specific set of BMC macs to avoid races/simultanious refreshes and avoid DB locks.
To get BmcCredentials, after this PR is merged, each service is required to have spiffe indentifier, this ensures what each service can get their own credentials/per spiffe. This also adds requirement for all sharded services to maintain propper sharding strategy per spiffe identifier (e.g. they should not overlap BMCs in shards), otherwise credentials will be rotated and can cause credentials reissue storm.
Type of Change
Related Issues (Optional)
Implements big chunk of: #460
Should finaly fix this bug for good: #1292
Breaking Changes
Credentials API no longer returns passwords. It would explicitly not work with BMC which do not support SessionService. We can add flag in future to make exception for that.
Testing
Additional Notes