security: harden agent against local and network attack vectors#163
security: harden agent against local and network attack vectors#163
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds Unix-domain-socket defaults and socket-aware HTTP clients, centralizes token resolution from Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI
participant Endpoint as endpoint helper\n(NewAgentHTTPClient / AgentBaseURL)
participant Transport as HTTP Transport\n(DialContext)
participant Socket as Unix Socket\n/run/fleetint/fleetint.sock
participant Server as Fleetint Server
CLI->>Endpoint: Resolve server URL and token
CLI->>Endpoint: Request HTTP client via NewAgentHTTPClient(serverURL)
Endpoint->>Endpoint: Detect unix scheme -> create client with DialContext (unix)
CLI->>Endpoint: Build request using AgentBaseURL(serverURL)
CLI->>Transport: Execute HTTP request
Transport->>Socket: Dial unix socket (/run/fleetint/fleetint.sock)
Socket-->>Transport: Connection established
Transport->>Server: Send HTTP request over socket
Server->>Server: Handle request and respond
Server-->>Transport: Return HTTP response
Transport-->>CLI: Deliver response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9157a43510
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (1)
cmd/fleetint/enroll.go (1)
50-57: Bound token input size to avoid unbounded memory reads.Reading stdin/file token content with unbounded
ReadAll/ReadFilecan be abused with very large input. Add a small max size guard for token material.Hardening sketch
+const maxTokenBytes = 16 * 1024 @@ if tokenFile != "" { var raw []byte var err error if tokenFile == "-" { - raw, err = io.ReadAll(os.Stdin) + raw, err = io.ReadAll(io.LimitReader(os.Stdin, maxTokenBytes+1)) } else { - raw, err = os.ReadFile(tokenFile) + f, openErr := os.Open(tokenFile) + if openErr != nil { + return "", fmt.Errorf("failed to read token from %q: %w", tokenFile, openErr) + } + defer f.Close() + raw, err = io.ReadAll(io.LimitReader(f, maxTokenBytes+1)) } if err != nil { return "", fmt.Errorf("failed to read token from %q: %w", tokenFile, err) } + if len(raw) > maxTokenBytes { + return "", fmt.Errorf("token exceeds maximum allowed size") + } token = strings.TrimSpace(string(raw)) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/fleetint/enroll.go` around lines 50 - 57, The token reading path using io.ReadAll/os.ReadFile is unbounded; limit the amount read by introducing a maxTokenSize constant (e.g., 4KB) and replace io.ReadAll(os.Stdin) with io.ReadAll(io.LimitReader(os.Stdin, maxTokenSize+1)) and replace os.ReadFile(tokenFile) with opening the file and reading up to maxTokenSize+1 bytes (or using io.LimitReader on the file) so you can detect and error if input exceeds maxTokenSize; update the tokenFile handling branch (the code around the tokenFile variable and the io.ReadAll/os.ReadFile calls) to enforce the limit and return a clear error when the input is too large.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/fleetint/enroll.go`:
- Around line 43-67: Trim the raw --token input before any validation: after
reading token := cliContext.String("token") call strings.TrimSpace on token
(e.g., token = strings.TrimSpace(token)) so that whitespace-only values are
treated like an empty --token-file read; then proceed with the mutual-exclusion
check of token vs tokenFile and the final empty-token error. This ensures the
token variable (used in the existing token/tokenFile logic) is normalized the
same way as the token read path that uses strings.TrimSpace.
In `@cmd/fleetint/run.go`:
- Around line 53-67: The validateOfflinePath function currently only uses
filepath.Clean and lexical checks; update it to reject symlinks and resolve real
paths before comparing to restrictedOfflinePaths: use os.Lstat on the provided
path and its final component to return an error if any is ModeSymlink, use
filepath.EvalSymlinks (or iterate up parents calling EvalSymlinks on existing
components) to get the canonical/resolved path, then compare that resolved path
(and resolved parents) against restrictedOfflinePaths (using exact equals or
prefix checks like strings.HasPrefix(resolved, r+"/")) to prevent symlink-based
escapes while keeping the existing checks for emptiness and absoluteness in
validateOfflinePath.
In `@internal/config/default.go`:
- Around line 41-53: The compact command currently probes a hardcoded TCP port
(15133) to decide whether the daemon is running; update the probe in the compact
logic (the routine that checks "is daemon running" in cmd/fleetint/compact.go)
to respect the configured transport by using the
DefaultListenAddress/DefaultClientURL constants from internal/config (instead of
assuming TCP:15133): detect whether the listen address is a unix socket (e.g.,
starts with "unix://" or equals DefaultListenAddress) and attempt a unix-domain
socket connect, otherwise fall back to a TCP connect to the configured
host:port; make sure the same connection method the agent/client uses is used
for the liveness probe so compaction will not proceed while the daemon is
reachable via its configured transport.
In `@internal/endpoint/endpoint.go`:
- Around line 33-40: The unix:// branch in the URL handling bypasses url parsing
and therefore accepts query/fragment/host/user info; update the branch (the code
that currently checks strings.HasPrefix(raw, "unix://") and returns
&url.URL{Scheme:"unix", Path:socketPath}) to first call url.Parse(raw), then
validate that u.Scheme == "unix" and that u.Host, u.User, u.RawQuery and
u.Fragment are empty and that u.Path is an absolute path (starts with "/"); if
any of those validations fail return an error like "unix socket URL must not
contain host/user/query/fragment and must have an absolute path" otherwise
return the parsed *url.URL.
In `@internal/server/server.go`:
- Around line 410-415: The cleanup currently unconditionally calls stdos.Remove
on any absolute s.config.Address; instead, call stdos.Lstat on s.config.Address
first and reject removal if the target is a symlink or its FileMode does not
include os.ModeSocket, logging a warning; only then call stdos.Remove and
preserve the existing stdos.IsNotExist check. Apply the same change to the other
cleanup site that also removes s.config.Address (the second removal block around
the stop/shutdown logic) so both places validate with Lstat and require
os.ModeSocket before unlinking.
- Around line 464-496: The unix socket srv and listener are local to startServer
so Stop() cannot shut them down; add fields to the Server struct (e.g., srv
*http.Server and listener net.Listener) and assign the created srv and listener
inside startServer (use the existing variable names srv and listener to locate
the code), then update Stop() to check those fields, call srv.Shutdown(ctx)
(with a timeout context) and listener.Close() (handling http.ErrServerClosed and
nil checks), clear the fields after successful close, and ensure socket file
removal happens only after Shutdown/Close complete; consider guarding access
with the existing mutex if Server has one.
In `@third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go`:
- Around line 56-60: The current prefix check on p (strings.HasPrefix(p,
"/dev/")) is bypassable with paths like "/dev/../etc/passwd"; fix by normalizing
and resolving the path before validating containment: call filepath.Clean(p),
then filepath.EvalSymlinks on the cleaned path, and finally use
filepath.Rel("/dev", resolved) (or compare resolved path components) to ensure
the resolved path is actually inside /dev (reject if Rel returns an error or
starts with ".."). Update the check around variable p in watcher.go to perform
Clean + EvalSymlinks + Rel instead of the raw HasPrefix check.
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go`:
- Around line 70-74: When an invalid DCGM_URL is detected and addr is set to ""
in the invalid-check branch, also reset the unix-socket mode so the fallback
"localhost" uses TCP; update the block that checks isValidDCGMAddress(addr) to
clear or set DCGM_URL_IS_UNIX_SOCKET to false (or unset the env) so the
unix-socket flag does not survive the fallback—refer to the addr variable,
isValidDCGMAddress function, and the DCGM_URL_IS_UNIX_SOCKET flag when making
the change.
---
Nitpick comments:
In `@cmd/fleetint/enroll.go`:
- Around line 50-57: The token reading path using io.ReadAll/os.ReadFile is
unbounded; limit the amount read by introducing a maxTokenSize constant (e.g.,
4KB) and replace io.ReadAll(os.Stdin) with io.ReadAll(io.LimitReader(os.Stdin,
maxTokenSize+1)) and replace os.ReadFile(tokenFile) with opening the file and
reading up to maxTokenSize+1 bytes (or using io.LimitReader on the file) so you
can detect and error if input exceeds maxTokenSize; update the tokenFile
handling branch (the code around the tokenFile variable and the
io.ReadAll/os.ReadFile calls) to enforce the limit and return a clear error when
the input is too large.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: f4c7b276-8e73-449f-8f85-37153aafcb67
📒 Files selected for processing (11)
cmd/fleetint/enroll.gocmd/fleetint/inject.gocmd/fleetint/root.gocmd/fleetint/run.gocmd/fleetint/status.gointernal/attestation/attestation.gointernal/config/default.gointernal/endpoint/endpoint.gointernal/server/server.gothird_party/fleet-intelligence-sdk/pkg/kmsg/watcher.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
9157a43 to
67e790d
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
internal/server/server.go (1)
484-490:⚠️ Potential issue | 🟠 Major
Stop()still lacks access to the live HTTP server for graceful shutdown.The
http.Server(srv) and Unixlistenerare local tostartServer, so an external call toStop()cannot invokesrv.Shutdown()orlistener.Close(). While the deferreds.Stop()at line 446 runs aftersrv.Servereturns naturally, ifStop()is called externally (e.g., signal handler), the socket file may be removed while the accept loop is still running.Consider storing
srvandlisteneron theServerstruct to enable proper graceful shutdown.,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/server.go` around lines 484 - 490, The http.Server instance (srv) and the Unix listener are local to startServer so Stop() cannot call srv.Shutdown() or listener.Close(); move these into the Server struct (add fields e.g., srv *http.Server and listener net.Listener) and assign them in startServer before calling srv.Serve, protecting access with the existing mutex/lock used by Server. Update Stop() to check and call s.srv.Shutdown(ctx) and s.listener.Close() (with nil-checks) to perform a graceful shutdown; ensure deferred s.Stop() behavior remains correct and clear the struct fields after shutdown.
🧹 Nitpick comments (1)
internal/endpoint/endpoint.go (1)
41-49: Add test coverage for Unix socket URL validation inTestValidateLocalServerURL.The test suite currently lacks test cases for Unix socket URLs, despite the validation code being present in the
ValidateLocalServerURLfunction. Add tests for:
- Valid:
unix:///run/fleetint/fleetint.sock- Invalid:
unix://relative/path,unix:///path?query=1,unix://host/path🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/endpoint/endpoint.go` around lines 41 - 49, Add unit tests to TestValidateLocalServerURL that exercise the unix-scheme branch in ValidateLocalServerURL: assert that "unix:///run/fleetint/fleetint.sock" parses successfully to a URL with Scheme "unix" and the absolute Path "/run/fleetint/fleetint.sock", and assert that "unix://relative/path", "unix:///path?query=1", and "unix://host/path" all return errors (i.e., validation fails). Locate the validation logic around ValidateLocalServerURL (the parsed.Scheme == "unix" branch) and add table-driven cases with expected error/no-error checks and appropriate error assertions for these four inputs. Ensure the tests check both error presence and, for the valid case, the returned url.URL fields (Scheme and Path).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/server/server.go`:
- Around line 484-490: The http.Server instance (srv) and the Unix listener are
local to startServer so Stop() cannot call srv.Shutdown() or listener.Close();
move these into the Server struct (add fields e.g., srv *http.Server and
listener net.Listener) and assign them in startServer before calling srv.Serve,
protecting access with the existing mutex/lock used by Server. Update Stop() to
check and call s.srv.Shutdown(ctx) and s.listener.Close() (with nil-checks) to
perform a graceful shutdown; ensure deferred s.Stop() behavior remains correct
and clear the struct fields after shutdown.
---
Nitpick comments:
In `@internal/endpoint/endpoint.go`:
- Around line 41-49: Add unit tests to TestValidateLocalServerURL that exercise
the unix-scheme branch in ValidateLocalServerURL: assert that
"unix:///run/fleetint/fleetint.sock" parses successfully to a URL with Scheme
"unix" and the absolute Path "/run/fleetint/fleetint.sock", and assert that
"unix://relative/path", "unix:///path?query=1", and "unix://host/path" all
return errors (i.e., validation fails). Locate the validation logic around
ValidateLocalServerURL (the parsed.Scheme == "unix" branch) and add table-driven
cases with expected error/no-error checks and appropriate error assertions for
these four inputs. Ensure the tests check both error presence and, for the valid
case, the returned url.URL fields (Scheme and Path).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 82caadfe-229d-4fea-a2a4-9a021068f289
📒 Files selected for processing (15)
SECURITY.mdcmd/fleetint/compact.gocmd/fleetint/enroll.gocmd/fleetint/inject.gocmd/fleetint/root.gocmd/fleetint/run.gocmd/fleetint/status.godocs/configuration.mddocs/usage.mdinternal/attestation/attestation.gointernal/config/default.gointernal/endpoint/endpoint.gointernal/server/server.gothird_party/fleet-intelligence-sdk/pkg/kmsg/watcher.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
✅ Files skipped from review due to trivial changes (2)
- SECURITY.md
- docs/usage.md
🚧 Files skipped from review as they are similar to previous changes (6)
- third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go
- cmd/fleetint/enroll.go
- internal/attestation/attestation.go
- cmd/fleetint/root.go
- internal/config/default.go
- cmd/fleetint/run.go
67e790d to
a0f8cab
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cmd/fleetint/run.go (1)
338-346:⚠️ Potential issue | 🟡 MinorTOCTOU race condition between validation and directory creation.
The
validateOfflinePathfunction resolves symlinks in existing parent directories, but a race window exists between this validation (line 339) andos.MkdirAll(line 346). An attacker with write access to the parent directory could create a symlink after validation succeeds but before the directory is created, allowing writes to an unintended location.Consider re-validating the target or using safer directory creation semantics (e.g., checking the final path properties after
MkdirAll). That said, exploiting this race requires both parent directory write access and precise timing, making it a concern primarily for defense-in-depth rather than typical threat scenarios.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/fleetint/run.go` around lines 338 - 346, There is a TOCTOU between validateOfflinePath(offlineModePath) and os.MkdirAll(offlineModePath); after you create the directory you must re-check the final path to ensure no symlink attack slipped in. Fix by invoking the same validation (or an equivalent check) immediately after os.MkdirAll: re-resolve symlinks (filepath.EvalSymlinks), use os.Lstat on offlineModePath to assert it is not a symlink (ModeSymlink), and verify the resolved path matches the expected safe parent/resolution from validateOfflinePath; if the checks fail, remove the created directory and return an error. Ensure these checks reference validateOfflinePath and offlineModePath so reviewers can find them easily.internal/server/server.go (1)
460-469:⚠️ Potential issue | 🟠 MajorAdd synchronization to prevent concurrent
Stop()calls.When
startServerexits (as a background goroutine at line 360), its defer callss.Stop(). If the signal handler also callsserver.Stop()concurrently, both goroutines accesssrv,listener, and other fields without synchronization. The nil-checks provide only TOCTOU protection—both goroutines can pass the check before either modifies the fields, leading to concurrentShutdown()calls on the same server object.Wrap
Stop()withsync.Onceto ensure it executes only once:Suggested synchronization
type Server struct { // ... existing fields ... + stopOnce sync.Once } func (s *Server) Stop() { + s.stopOnce.Do(func() { + s.doStop() + }) +} + +func (s *Server) doStop() { // ... existing Stop() body ... }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/server.go` around lines 460 - 469, The defer in startServer currently calls s.Stop() and can race with other callers (signal handler) leading to concurrent shutdowns; add a sync.Once field (e.g., shutdownOnce sync.Once) to the Server struct and use it inside the Stop method to ensure shutdown logic (closing srv, listener, nvmlInstance handling, etc.) runs exactly once; update any direct calls to Stop to remain the same but rely on Stop invoking shutdownOnce.Do(...) so concurrent callers are serialized and TOCTOU races are eliminated.
🧹 Nitpick comments (2)
internal/endpoint/endpoint.go (1)
74-90: Consider honoring context cancellation in the Unix socket dialer.The
DialContextcallback ignores the providedcontext.Context, so connection attempts cannot be canceled by upstream timeouts or request cancellation. For short-lived local socket connections the 5s client timeout is likely sufficient, but for consistency with Go's context-aware patterns, consider usingnet.Dialer.DialContext.♻️ Optional: Use context-aware dialer
func NewAgentHTTPClient(serverURL *url.URL) *http.Client { if serverURL.Scheme == "unix" { socketPath := serverURL.Path return &http.Client{ Timeout: 5 * time.Second, Transport: &http.Transport{ - DialContext: func(_ context.Context, _, _ string) (net.Conn, error) { - return net.Dial("unix", socketPath) + DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) { + var d net.Dialer + return d.DialContext(ctx, "unix", socketPath) }, }, } } return &http.Client{Timeout: 5 * time.Second} }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/endpoint/endpoint.go` around lines 74 - 90, The Unix-socket DialContext in NewAgentHTTPClient ignores the provided context, preventing cancellation; replace the inline net.Dial call with a context-aware dial using net.Dialer.DialContext so the passed context is respected (use the existing socketPath and the DialContext function signature to call (&net.Dialer{}).DialContext(ctx, "unix", socketPath)), preserving the 5s Timeout and Transport setup otherwise.cmd/fleetint/run.go (1)
45-50: Consider adding/hometo restricted paths for multi-user systems.The restricted list covers system directories but excludes
/home. On multi-user systems, a privileged agent could inadvertently write to another user's home directory. Consider whether/homeshould also be restricted, or if this is an acceptable use case.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/fleetint/run.go` around lines 45 - 50, Update the restrictedOfflinePaths slice in cmd/fleetint/run.go to include "/home" so the agent will never write into users' home directories on multi-user systems; locate the variable restrictedOfflinePaths and add the string "/home" to the slice (preserve existing entries and ordering), or if you intentionally allow writes to /home, add a code comment on restrictedOfflinePaths explaining that exception and why it's safe.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@cmd/fleetint/run.go`:
- Around line 338-346: There is a TOCTOU between
validateOfflinePath(offlineModePath) and os.MkdirAll(offlineModePath); after you
create the directory you must re-check the final path to ensure no symlink
attack slipped in. Fix by invoking the same validation (or an equivalent check)
immediately after os.MkdirAll: re-resolve symlinks (filepath.EvalSymlinks), use
os.Lstat on offlineModePath to assert it is not a symlink (ModeSymlink), and
verify the resolved path matches the expected safe parent/resolution from
validateOfflinePath; if the checks fail, remove the created directory and return
an error. Ensure these checks reference validateOfflinePath and offlineModePath
so reviewers can find them easily.
In `@internal/server/server.go`:
- Around line 460-469: The defer in startServer currently calls s.Stop() and can
race with other callers (signal handler) leading to concurrent shutdowns; add a
sync.Once field (e.g., shutdownOnce sync.Once) to the Server struct and use it
inside the Stop method to ensure shutdown logic (closing srv, listener,
nvmlInstance handling, etc.) runs exactly once; update any direct calls to Stop
to remain the same but rely on Stop invoking shutdownOnce.Do(...) so concurrent
callers are serialized and TOCTOU races are eliminated.
---
Nitpick comments:
In `@cmd/fleetint/run.go`:
- Around line 45-50: Update the restrictedOfflinePaths slice in
cmd/fleetint/run.go to include "/home" so the agent will never write into users'
home directories on multi-user systems; locate the variable
restrictedOfflinePaths and add the string "/home" to the slice (preserve
existing entries and ordering), or if you intentionally allow writes to /home,
add a code comment on restrictedOfflinePaths explaining that exception and why
it's safe.
In `@internal/endpoint/endpoint.go`:
- Around line 74-90: The Unix-socket DialContext in NewAgentHTTPClient ignores
the provided context, preventing cancellation; replace the inline net.Dial call
with a context-aware dial using net.Dialer.DialContext so the passed context is
respected (use the existing socketPath and the DialContext function signature to
call (&net.Dialer{}).DialContext(ctx, "unix", socketPath)), preserving the 5s
Timeout and Transport setup otherwise.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 969365e6-5b86-45f6-9613-cc2b54e1c9fd
📒 Files selected for processing (15)
SECURITY.mdcmd/fleetint/compact.gocmd/fleetint/enroll.gocmd/fleetint/inject.gocmd/fleetint/root.gocmd/fleetint/run.gocmd/fleetint/status.godocs/configuration.mddocs/usage.mdinternal/attestation/attestation.gointernal/config/default.gointernal/endpoint/endpoint.gointernal/server/server.gothird_party/fleet-intelligence-sdk/pkg/kmsg/watcher.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
✅ Files skipped from review due to trivial changes (2)
- SECURITY.md
- third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
🚧 Files skipped from review as they are similar to previous changes (8)
- cmd/fleetint/inject.go
- cmd/fleetint/compact.go
- third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go
- docs/configuration.md
- cmd/fleetint/enroll.go
- internal/config/default.go
- internal/attestation/attestation.go
- docs/usage.md
a0f8cab to
f0f0826
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
internal/attestation/attestation.go (1)
468-493: Good security hardening for nonce validation.The allowlist correctly covers hex and base64url encodings. The length limit and character validation prevent command injection via malicious backend-supplied nonces.
Optional defense-in-depth consideration: Since
-is allowed, a nonce like--verbosewould pass validation. Whileexec.CommandContextpasses arguments directly and most argument parsers correctly treat the value after--nonceas literal, adding a check that the nonce doesn't start with-could guard against edge-case quirks innvattest's argument parser.🛡️ Optional: Reject nonces starting with `-`
func validateNonce(nonce string) error { if nonce == "" { return fmt.Errorf("nonce is empty") } + if nonce[0] == '-' { + return fmt.Errorf("nonce cannot start with '-'") + } const maxLen = 512🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/attestation/attestation.go` around lines 468 - 493, validateNonce currently allows '-' which lets values like "--verbose" pass; add a guard in validateNonce (after the empty check) to reject nonces that start with '-' to prevent them being interpreted as flags by nvattest or other argument parsers; update the function validateNonce to return an error if nonce[0] == '-' (and keep existing length/char checks) and add/adjust tests for the new rejection case so callers using exec.CommandContext/nvattest are protected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/fleetint/compact.go`:
- Around line 57-69: The current reachability check only tests the unix socket
when config.DefaultListenAddress starts with "/" so the legacy TCP loopback port
(config.DefaultHealthPort via netutil.IsPortOpen) can be skipped; update the
probe in cmd/fleetint/compact.go to test both transports: if
config.DefaultListenAddress is a unix path keep the unix DialTimeout check
(net.DialTimeout + conn.Close) but also always call
netutil.IsPortOpen(config.DefaultHealthPort) and fail if either probe indicates
a running daemon, or alternatively inspect config.DefaultListenAddress for a ":"
and if it looks like a TCP addr probe that specific TCP address instead of
skipping the port check; reference config.DefaultListenAddress, net.DialTimeout,
netutil.IsPortOpen, and config.DefaultHealthPort when making the change.
In `@internal/server/server.go`:
- Around line 77-79: Stop() is not idempotent: guard the shutdown/teardown path
so concurrent calls from startServer's deferred s.Stop() and an external Stop()
cannot race and double-tear down srv/listener/exporter/DB. Implement a
single-teardown mechanism (e.g., add a sync.Once field on the server struct or a
dedicated mutex) and wrap the body of Stop() (the calls to s.srv.Shutdown,
closing s.listener and exporter/DB teardown) in that guard; alternatively remove
the deferred s.Stop() in startServer and ensure all shutdowns funnel through the
guarded Stop() to prevent concurrent mutation of srv and listener. Ensure
references to srv, listener, and any exporter/DB teardown are protected by the
chosen guard (sync.Once or mutex) so Stop() becomes safe to call multiple times.
- Around line 432-437: The code currently unconditionally calls
removeStaleSocket(s.config.Address) which only checks inode type and can unlink
a live peer socket; instead, modify startup and shutdown flow so you probe the
socket before deleting and only unlink sockets this process actually created:
after successfully creating the Unix listener (where net.Listen / the server's
listener is assigned), set a boolean on the server instance (e.g.,
s.createdListener or s.socketOwner) to true; change removeStaleSocket to first
attempt a Unix domain dial/connect to s.config.Address and only unlink if the
connect fails (or the file is not a socket), and in Stop() and all cleanup sites
(the blocks that currently call removeStaleSocket) check s.createdListener
before attempting unlink so we never remove a socket we didn't create.
---
Nitpick comments:
In `@internal/attestation/attestation.go`:
- Around line 468-493: validateNonce currently allows '-' which lets values like
"--verbose" pass; add a guard in validateNonce (after the empty check) to reject
nonces that start with '-' to prevent them being interpreted as flags by
nvattest or other argument parsers; update the function validateNonce to return
an error if nonce[0] == '-' (and keep existing length/char checks) and
add/adjust tests for the new rejection case so callers using
exec.CommandContext/nvattest are protected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: fd251f92-2886-4384-b87c-3502a15b8d10
📒 Files selected for processing (18)
SECURITY.mdcmd/fleetint/compact.gocmd/fleetint/enroll.gocmd/fleetint/inject.gocmd/fleetint/root.gocmd/fleetint/run.gocmd/fleetint/status.godocs/configuration.mddocs/usage.mdinternal/attestation/attestation.gointernal/config/default.gointernal/endpoint/endpoint.gointernal/server/server.gothird_party/fleet-intelligence-sdk/pkg/kmsg/watcher.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.gothird_party/fleet-intelligence-sdk/pkg/process/runner.gothird_party/fleet-intelligence-sdk/pkg/process/runner_exclusive.gothird_party/fleet-intelligence-sdk/pkg/process/runner_exclusive_test.go
💤 Files with no reviewable changes (3)
- third_party/fleet-intelligence-sdk/pkg/process/runner.go
- third_party/fleet-intelligence-sdk/pkg/process/runner_exclusive.go
- third_party/fleet-intelligence-sdk/pkg/process/runner_exclusive_test.go
✅ Files skipped from review due to trivial changes (2)
- docs/configuration.md
- SECURITY.md
🚧 Files skipped from review as they are similar to previous changes (7)
- cmd/fleetint/run.go
- cmd/fleetint/status.go
- third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go
- cmd/fleetint/enroll.go
- cmd/fleetint/root.go
- docs/usage.md
- internal/endpoint/endpoint.go
Switch the default listener from TCP 127.0.0.1:15133 to a Unix socket (/run/fleetint/fleetint.sock) with 0600 permissions so that access is governed by filesystem ownership rather than firewall rules. TCP remains available via --listen-address for backward compatibility. Client commands (status, inject) transparently support both transports through the new endpoint.NewAgentHTTPClient / AgentBaseURL helpers. Additional hardening in this commit: - Validate attestation nonce character set and length before passing it to the nvattest exec call (defense-in-depth against a compromised backend crafting malicious arguments). - Add --token-file flag to `enroll` so the SAK token can be read from a file or stdin instead of appearing in /proc/<pid>/cmdline. - Reject --path values that point into restricted system directories (/etc, /usr, /sys, ...) when running in offline mode. - Restrict KMSG_FILE_PATH to paths under /dev/ and validate DCGM_URL character set to prevent environment-variable-based file read or connection redirection by a local attacker. - Add sync.WaitGroup to attestation Manager so Stop() blocks until the background goroutine exits, eliminating a data race on the global defaultStateFileFn in tests. Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
Add tests covering the new security-hardening code paths: - endpoint: unix socket URL validation (bare paths, unix:// scheme, reject query/fragment/host), NewAgentHTTPClient, AgentBaseURL - attestation: validateNonce (valid chars, empty, too long, shell metacharacters, null bytes, pipes) - server: removeStaleSocket (nonexistent, regular file, symlink, actual socket), Stop() idempotency via sync.Once - cmd/fleetint: validateOfflinePath (restricted dirs, traversal, symlinks), resolveToken (mutual exclusion, file read, missing token) Also simplify Stop() socket cleanup -- Go's UnixListener.Close() already unlinks the socket file, so the explicit removeStaleSocket call in Stop() was redundant. Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
f0f0826 to
d146629
Compare
Two changes to prevent breaking existing customer deployments: 1. fleetint status: when the default unix socket connection fails, transparently fall back to TCP localhost:15133. This handles the upgrade scenario where the daemon was started with an explicit --listen-address on TCP (via /etc/default/fleetint) but the CLI tools now default to the unix socket. 2. Remove /var from restrictedOfflinePaths. Blocking all of /var was too aggressive -- paths like /var/data/ and /var/opt/ are legitimate output locations. The remaining restricted list (/etc, /usr, /sys, /bin, /boot, /dev, /lib, /proc, /run, /sbin) covers directories where writes would damage the OS. Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
Description
Switch the default listener from TCP 127.0.0.1:15133 to a Unix socket (/run/fleetint/fleetint.sock) with 0600 permissions so that access is governed by filesystem ownership rather than firewall rules. TCP remains available via --listen-address for backward compatibility. Client commands (status, inject) transparently support both transports through the new endpoint.NewAgentHTTPClient / AgentBaseURL helpers.
Additional hardening in this commit:
enrollso the SAK token can be read from a file or stdin instead of appearing in /proc//cmdline.Checklist
Summary by CodeRabbit
New Features
Improvements
Chores
Tests
Documentation