Skip to content

security: harden agent against local and network attack vectors#163

Merged
rsampaio merged 3 commits intomainfrom
security/additional-hardening-fixes
Apr 14, 2026
Merged

security: harden agent against local and network attack vectors#163
rsampaio merged 3 commits intomainfrom
security/additional-hardening-fixes

Conversation

@rsampaio
Copy link
Copy Markdown
Collaborator

@rsampaio rsampaio commented Apr 13, 2026

Description

Switch the default listener from TCP 127.0.0.1:15133 to a Unix socket (/run/fleetint/fleetint.sock) with 0600 permissions so that access is governed by filesystem ownership rather than firewall rules. TCP remains available via --listen-address for backward compatibility. Client commands (status, inject) transparently support both transports through the new endpoint.NewAgentHTTPClient / AgentBaseURL helpers.

Additional hardening in this commit:

  • Validate attestation nonce character set and length before passing it to the nvattest exec call (defense-in-depth against a compromised backend crafting malicious arguments).
  • Add --token-file flag to enroll so the SAK token can be read from a file or stdin instead of appearing in /proc//cmdline.
  • Reject --path values that point into restricted system directories (/etc, /usr, /sys, ...) when running in offline mode.
  • Restrict KMSG_FILE_PATH to paths under /dev/ and validate DCGM_URL character set to prevent environment-variable-based file read or connection redirection by a local attacker.
  • Add sync.WaitGroup to attestation Manager so Stop() blocks until the background goroutine exits, eliminating a data race on the global defaultStateFileFn in tests.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Accept enrollment token from file or stdin (mutually exclusive with token flag)
    • Default server listens on a Unix-domain socket with socket-based client connectivity
  • Improvements

    • Offline-mode path validation to forbid restricted system directories and symlinks
    • Stronger nonce validation and safer attestation request handling
    • Unified agent request/health handling and more robust server socket lifecycle
  • Chores

    • Removed legacy process-runner API and its tests
  • Tests

    • Added unit tests covering token resolution, unix-socket URL/clients, nonce validation, and server/socket behaviors
  • Documentation

    • Docs and security guidance updated to reflect socket-first defaults and token-file usage

@rsampaio rsampaio requested a review from jingxiang-z April 13, 2026 18:55
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds Unix-domain-socket defaults and socket-aware HTTP clients, centralizes token resolution from --token/--token-file (including stdin), validates offline paths and attestation nonces, tightens several env overrides, adjusts server socket lifecycle, and removes the third-party process runner.

Changes

Cohort / File(s) Summary
Token handling & CLI flags
cmd/fleetint/enroll.go, cmd/fleetint/root.go, cmd/fleetint/security_test.go
Added resolveToken() to accept --token or --token-file (file path or - for stdin), enforce mutual exclusivity, trim whitespace, limit size, and error on empty/missing token; adjusted flags and added tests.
Agent HTTP client & base URL
internal/endpoint/endpoint.go, cmd/fleetint/inject.go, cmd/fleetint/status.go, internal/endpoint/endpoint_test.go
Introduced NewAgentHTTPClient() (5s timeout, unix-dial transport) and AgentBaseURL() to normalize base URL for requests; replaced ad-hoc clients and adjusted request URL construction; added tests.
Unix-socket defaults & server lifecycle
internal/config/default.go, internal/server/server.go, internal/server/server_test.go, cmd/fleetint/compact.go
Switched defaults to /run/fleetint/fleetint.sock; server now creates, perms, serves, and cleans up unix socket listeners, makes Stop() idempotent and gracefully shutdown HTTP server; compact now probes socket first; added tests for socket removal and Stop() idempotence.
Offline path validation & docs
cmd/fleetint/run.go, cmd/fleetint/security_test.go, docs/configuration.md, docs/usage.md
Added validateOfflinePath() to require absolute path, reject symlinks and restricted directories, validate before mkdir; updated docs and tests to reflect restrictions and examples.
Attestation safety & lifecycle
internal/attestation/attestation.go, internal/attestation/attestation_test.go
Manager now uses sync.WaitGroup to track background goroutine; getNonce is cancellable; added validateNonce() to constrain nonce length/allowed chars and tests.
Env override tightening (third-party SDKs)
third_party/.../kmsg/watcher.go, third_party/.../dcgm/instance.go
Added resolveKmsgFilePath() to validate KMSG_FILE_PATH (must start with /dev/) with warnings/fallback; added isValidDCGMAddress() to vet DCGM_URL and clear invalid overrides.
Removed process runner (third-party SDK)
third_party/fleet-intelligence-sdk/pkg/process/*
Deleted process package runner implementation and its tests, removing Runner interface and ErrProcessAlreadyRunning.
Security & docs updates
SECURITY.md, docs/usage.md, docs/configuration.md
Documentation updated to reflect unix-socket default, --token-file/stdin usage, offline-path restrictions, and filesystem-permissions-based access model.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI
    participant Endpoint as endpoint helper\n(NewAgentHTTPClient / AgentBaseURL)
    participant Transport as HTTP Transport\n(DialContext)
    participant Socket as Unix Socket\n/run/fleetint/fleetint.sock
    participant Server as Fleetint Server

    CLI->>Endpoint: Resolve server URL and token
    CLI->>Endpoint: Request HTTP client via NewAgentHTTPClient(serverURL)
    Endpoint->>Endpoint: Detect unix scheme -> create client with DialContext (unix)
    CLI->>Endpoint: Build request using AgentBaseURL(serverURL)
    CLI->>Transport: Execute HTTP request
    Transport->>Socket: Dial unix socket (/run/fleetint/fleetint.sock)
    Socket-->>Transport: Connection established
    Transport->>Server: Send HTTP request over socket
    Server->>Server: Handle request and respond
    Server-->>Transport: Return HTTP response
    Transport-->>CLI: Deliver response
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I sniffed the socket, neat and small,
tokens read from file or stdin's sprawl,
paths guarded tight, nonces kept straight,
goroutines finish — shutdown's great,
hop along, fleetint stands tall.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 59.46% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main security focus: hardening against both local and network attack vectors, which aligns with the PR's multiple security improvements including Unix socket defaults, input validation, and restricted path checks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch security/additional-hardening-fixes

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9157a43510

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread internal/server/server.go Outdated
Comment thread internal/attestation/attestation.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (1)
cmd/fleetint/enroll.go (1)

50-57: Bound token input size to avoid unbounded memory reads.

Reading stdin/file token content with unbounded ReadAll/ReadFile can be abused with very large input. Add a small max size guard for token material.

Hardening sketch
+const maxTokenBytes = 16 * 1024
@@
 	if tokenFile != "" {
 		var raw []byte
 		var err error
 		if tokenFile == "-" {
-			raw, err = io.ReadAll(os.Stdin)
+			raw, err = io.ReadAll(io.LimitReader(os.Stdin, maxTokenBytes+1))
 		} else {
-			raw, err = os.ReadFile(tokenFile)
+			f, openErr := os.Open(tokenFile)
+			if openErr != nil {
+				return "", fmt.Errorf("failed to read token from %q: %w", tokenFile, openErr)
+			}
+			defer f.Close()
+			raw, err = io.ReadAll(io.LimitReader(f, maxTokenBytes+1))
 		}
 		if err != nil {
 			return "", fmt.Errorf("failed to read token from %q: %w", tokenFile, err)
 		}
+		if len(raw) > maxTokenBytes {
+			return "", fmt.Errorf("token exceeds maximum allowed size")
+		}
 		token = strings.TrimSpace(string(raw))
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/fleetint/enroll.go` around lines 50 - 57, The token reading path using
io.ReadAll/os.ReadFile is unbounded; limit the amount read by introducing a
maxTokenSize constant (e.g., 4KB) and replace io.ReadAll(os.Stdin) with
io.ReadAll(io.LimitReader(os.Stdin, maxTokenSize+1)) and replace
os.ReadFile(tokenFile) with opening the file and reading up to maxTokenSize+1
bytes (or using io.LimitReader on the file) so you can detect and error if input
exceeds maxTokenSize; update the tokenFile handling branch (the code around the
tokenFile variable and the io.ReadAll/os.ReadFile calls) to enforce the limit
and return a clear error when the input is too large.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/fleetint/enroll.go`:
- Around line 43-67: Trim the raw --token input before any validation: after
reading token := cliContext.String("token") call strings.TrimSpace on token
(e.g., token = strings.TrimSpace(token)) so that whitespace-only values are
treated like an empty --token-file read; then proceed with the mutual-exclusion
check of token vs tokenFile and the final empty-token error. This ensures the
token variable (used in the existing token/tokenFile logic) is normalized the
same way as the token read path that uses strings.TrimSpace.

In `@cmd/fleetint/run.go`:
- Around line 53-67: The validateOfflinePath function currently only uses
filepath.Clean and lexical checks; update it to reject symlinks and resolve real
paths before comparing to restrictedOfflinePaths: use os.Lstat on the provided
path and its final component to return an error if any is ModeSymlink, use
filepath.EvalSymlinks (or iterate up parents calling EvalSymlinks on existing
components) to get the canonical/resolved path, then compare that resolved path
(and resolved parents) against restrictedOfflinePaths (using exact equals or
prefix checks like strings.HasPrefix(resolved, r+"/")) to prevent symlink-based
escapes while keeping the existing checks for emptiness and absoluteness in
validateOfflinePath.

In `@internal/config/default.go`:
- Around line 41-53: The compact command currently probes a hardcoded TCP port
(15133) to decide whether the daemon is running; update the probe in the compact
logic (the routine that checks "is daemon running" in cmd/fleetint/compact.go)
to respect the configured transport by using the
DefaultListenAddress/DefaultClientURL constants from internal/config (instead of
assuming TCP:15133): detect whether the listen address is a unix socket (e.g.,
starts with "unix://" or equals DefaultListenAddress) and attempt a unix-domain
socket connect, otherwise fall back to a TCP connect to the configured
host:port; make sure the same connection method the agent/client uses is used
for the liveness probe so compaction will not proceed while the daemon is
reachable via its configured transport.

In `@internal/endpoint/endpoint.go`:
- Around line 33-40: The unix:// branch in the URL handling bypasses url parsing
and therefore accepts query/fragment/host/user info; update the branch (the code
that currently checks strings.HasPrefix(raw, "unix://") and returns
&url.URL{Scheme:"unix", Path:socketPath}) to first call url.Parse(raw), then
validate that u.Scheme == "unix" and that u.Host, u.User, u.RawQuery and
u.Fragment are empty and that u.Path is an absolute path (starts with "/"); if
any of those validations fail return an error like "unix socket URL must not
contain host/user/query/fragment and must have an absolute path" otherwise
return the parsed *url.URL.

In `@internal/server/server.go`:
- Around line 410-415: The cleanup currently unconditionally calls stdos.Remove
on any absolute s.config.Address; instead, call stdos.Lstat on s.config.Address
first and reject removal if the target is a symlink or its FileMode does not
include os.ModeSocket, logging a warning; only then call stdos.Remove and
preserve the existing stdos.IsNotExist check. Apply the same change to the other
cleanup site that also removes s.config.Address (the second removal block around
the stop/shutdown logic) so both places validate with Lstat and require
os.ModeSocket before unlinking.
- Around line 464-496: The unix socket srv and listener are local to startServer
so Stop() cannot shut them down; add fields to the Server struct (e.g., srv
*http.Server and listener net.Listener) and assign the created srv and listener
inside startServer (use the existing variable names srv and listener to locate
the code), then update Stop() to check those fields, call srv.Shutdown(ctx)
(with a timeout context) and listener.Close() (handling http.ErrServerClosed and
nil checks), clear the fields after successful close, and ensure socket file
removal happens only after Shutdown/Close complete; consider guarding access
with the existing mutex if Server has one.

In `@third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go`:
- Around line 56-60: The current prefix check on p (strings.HasPrefix(p,
"/dev/")) is bypassable with paths like "/dev/../etc/passwd"; fix by normalizing
and resolving the path before validating containment: call filepath.Clean(p),
then filepath.EvalSymlinks on the cleaned path, and finally use
filepath.Rel("/dev", resolved) (or compare resolved path components) to ensure
the resolved path is actually inside /dev (reject if Rel returns an error or
starts with ".."). Update the check around variable p in watcher.go to perform
Clean + EvalSymlinks + Rel instead of the raw HasPrefix check.

In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go`:
- Around line 70-74: When an invalid DCGM_URL is detected and addr is set to ""
in the invalid-check branch, also reset the unix-socket mode so the fallback
"localhost" uses TCP; update the block that checks isValidDCGMAddress(addr) to
clear or set DCGM_URL_IS_UNIX_SOCKET to false (or unset the env) so the
unix-socket flag does not survive the fallback—refer to the addr variable,
isValidDCGMAddress function, and the DCGM_URL_IS_UNIX_SOCKET flag when making
the change.

---

Nitpick comments:
In `@cmd/fleetint/enroll.go`:
- Around line 50-57: The token reading path using io.ReadAll/os.ReadFile is
unbounded; limit the amount read by introducing a maxTokenSize constant (e.g.,
4KB) and replace io.ReadAll(os.Stdin) with io.ReadAll(io.LimitReader(os.Stdin,
maxTokenSize+1)) and replace os.ReadFile(tokenFile) with opening the file and
reading up to maxTokenSize+1 bytes (or using io.LimitReader on the file) so you
can detect and error if input exceeds maxTokenSize; update the tokenFile
handling branch (the code around the tokenFile variable and the
io.ReadAll/os.ReadFile calls) to enforce the limit and return a clear error when
the input is too large.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f4c7b276-8e73-449f-8f85-37153aafcb67

📥 Commits

Reviewing files that changed from the base of the PR and between 594104e and 9157a43.

📒 Files selected for processing (11)
  • cmd/fleetint/enroll.go
  • cmd/fleetint/inject.go
  • cmd/fleetint/root.go
  • cmd/fleetint/run.go
  • cmd/fleetint/status.go
  • internal/attestation/attestation.go
  • internal/config/default.go
  • internal/endpoint/endpoint.go
  • internal/server/server.go
  • third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go

Comment thread cmd/fleetint/enroll.go Outdated
Comment thread cmd/fleetint/run.go
Comment thread internal/config/default.go Outdated
Comment thread internal/endpoint/endpoint.go Outdated
Comment thread internal/server/server.go Outdated
Comment thread internal/server/server.go Outdated
Comment thread third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go Outdated
@rsampaio rsampaio force-pushed the security/additional-hardening-fixes branch from 9157a43 to 67e790d Compare April 13, 2026 19:41
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
internal/server/server.go (1)

484-490: ⚠️ Potential issue | 🟠 Major

Stop() still lacks access to the live HTTP server for graceful shutdown.

The http.Server (srv) and Unix listener are local to startServer, so an external call to Stop() cannot invoke srv.Shutdown() or listener.Close(). While the deferred s.Stop() at line 446 runs after srv.Serve returns naturally, if Stop() is called externally (e.g., signal handler), the socket file may be removed while the accept loop is still running.

Consider storing srv and listener on the Server struct to enable proper graceful shutdown.

,

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/server.go` around lines 484 - 490, The http.Server instance
(srv) and the Unix listener are local to startServer so Stop() cannot call
srv.Shutdown() or listener.Close(); move these into the Server struct (add
fields e.g., srv *http.Server and listener net.Listener) and assign them in
startServer before calling srv.Serve, protecting access with the existing
mutex/lock used by Server. Update Stop() to check and call s.srv.Shutdown(ctx)
and s.listener.Close() (with nil-checks) to perform a graceful shutdown; ensure
deferred s.Stop() behavior remains correct and clear the struct fields after
shutdown.
🧹 Nitpick comments (1)
internal/endpoint/endpoint.go (1)

41-49: Add test coverage for Unix socket URL validation in TestValidateLocalServerURL.

The test suite currently lacks test cases for Unix socket URLs, despite the validation code being present in the ValidateLocalServerURL function. Add tests for:

  • Valid: unix:///run/fleetint/fleetint.sock
  • Invalid: unix://relative/path, unix:///path?query=1, unix://host/path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/endpoint/endpoint.go` around lines 41 - 49, Add unit tests to
TestValidateLocalServerURL that exercise the unix-scheme branch in
ValidateLocalServerURL: assert that "unix:///run/fleetint/fleetint.sock" parses
successfully to a URL with Scheme "unix" and the absolute Path
"/run/fleetint/fleetint.sock", and assert that "unix://relative/path",
"unix:///path?query=1", and "unix://host/path" all return errors (i.e.,
validation fails). Locate the validation logic around ValidateLocalServerURL
(the parsed.Scheme == "unix" branch) and add table-driven cases with expected
error/no-error checks and appropriate error assertions for these four inputs.
Ensure the tests check both error presence and, for the valid case, the returned
url.URL fields (Scheme and Path).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/server/server.go`:
- Around line 484-490: The http.Server instance (srv) and the Unix listener are
local to startServer so Stop() cannot call srv.Shutdown() or listener.Close();
move these into the Server struct (add fields e.g., srv *http.Server and
listener net.Listener) and assign them in startServer before calling srv.Serve,
protecting access with the existing mutex/lock used by Server. Update Stop() to
check and call s.srv.Shutdown(ctx) and s.listener.Close() (with nil-checks) to
perform a graceful shutdown; ensure deferred s.Stop() behavior remains correct
and clear the struct fields after shutdown.

---

Nitpick comments:
In `@internal/endpoint/endpoint.go`:
- Around line 41-49: Add unit tests to TestValidateLocalServerURL that exercise
the unix-scheme branch in ValidateLocalServerURL: assert that
"unix:///run/fleetint/fleetint.sock" parses successfully to a URL with Scheme
"unix" and the absolute Path "/run/fleetint/fleetint.sock", and assert that
"unix://relative/path", "unix:///path?query=1", and "unix://host/path" all
return errors (i.e., validation fails). Locate the validation logic around
ValidateLocalServerURL (the parsed.Scheme == "unix" branch) and add table-driven
cases with expected error/no-error checks and appropriate error assertions for
these four inputs. Ensure the tests check both error presence and, for the valid
case, the returned url.URL fields (Scheme and Path).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 82caadfe-229d-4fea-a2a4-9a021068f289

📥 Commits

Reviewing files that changed from the base of the PR and between 9157a43 and 67e790d.

📒 Files selected for processing (15)
  • SECURITY.md
  • cmd/fleetint/compact.go
  • cmd/fleetint/enroll.go
  • cmd/fleetint/inject.go
  • cmd/fleetint/root.go
  • cmd/fleetint/run.go
  • cmd/fleetint/status.go
  • docs/configuration.md
  • docs/usage.md
  • internal/attestation/attestation.go
  • internal/config/default.go
  • internal/endpoint/endpoint.go
  • internal/server/server.go
  • third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
✅ Files skipped from review due to trivial changes (2)
  • SECURITY.md
  • docs/usage.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go
  • cmd/fleetint/enroll.go
  • internal/attestation/attestation.go
  • cmd/fleetint/root.go
  • internal/config/default.go
  • cmd/fleetint/run.go

@rsampaio rsampaio force-pushed the security/additional-hardening-fixes branch from 67e790d to a0f8cab Compare April 13, 2026 22:47
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cmd/fleetint/run.go (1)

338-346: ⚠️ Potential issue | 🟡 Minor

TOCTOU race condition between validation and directory creation.

The validateOfflinePath function resolves symlinks in existing parent directories, but a race window exists between this validation (line 339) and os.MkdirAll (line 346). An attacker with write access to the parent directory could create a symlink after validation succeeds but before the directory is created, allowing writes to an unintended location.

Consider re-validating the target or using safer directory creation semantics (e.g., checking the final path properties after MkdirAll). That said, exploiting this race requires both parent directory write access and precise timing, making it a concern primarily for defense-in-depth rather than typical threat scenarios.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/fleetint/run.go` around lines 338 - 346, There is a TOCTOU between
validateOfflinePath(offlineModePath) and os.MkdirAll(offlineModePath); after you
create the directory you must re-check the final path to ensure no symlink
attack slipped in. Fix by invoking the same validation (or an equivalent check)
immediately after os.MkdirAll: re-resolve symlinks (filepath.EvalSymlinks), use
os.Lstat on offlineModePath to assert it is not a symlink (ModeSymlink), and
verify the resolved path matches the expected safe parent/resolution from
validateOfflinePath; if the checks fail, remove the created directory and return
an error. Ensure these checks reference validateOfflinePath and offlineModePath
so reviewers can find them easily.
internal/server/server.go (1)

460-469: ⚠️ Potential issue | 🟠 Major

Add synchronization to prevent concurrent Stop() calls.

When startServer exits (as a background goroutine at line 360), its defer calls s.Stop(). If the signal handler also calls server.Stop() concurrently, both goroutines access srv, listener, and other fields without synchronization. The nil-checks provide only TOCTOU protection—both goroutines can pass the check before either modifies the fields, leading to concurrent Shutdown() calls on the same server object.

Wrap Stop() with sync.Once to ensure it executes only once:

Suggested synchronization
 type Server struct {
 	// ... existing fields ...
+	stopOnce sync.Once
 }

 func (s *Server) Stop() {
+	s.stopOnce.Do(func() {
+		s.doStop()
+	})
+}
+
+func (s *Server) doStop() {
 	// ... existing Stop() body ...
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/server/server.go` around lines 460 - 469, The defer in startServer
currently calls s.Stop() and can race with other callers (signal handler)
leading to concurrent shutdowns; add a sync.Once field (e.g., shutdownOnce
sync.Once) to the Server struct and use it inside the Stop method to ensure
shutdown logic (closing srv, listener, nvmlInstance handling, etc.) runs exactly
once; update any direct calls to Stop to remain the same but rely on Stop
invoking shutdownOnce.Do(...) so concurrent callers are serialized and TOCTOU
races are eliminated.
🧹 Nitpick comments (2)
internal/endpoint/endpoint.go (1)

74-90: Consider honoring context cancellation in the Unix socket dialer.

The DialContext callback ignores the provided context.Context, so connection attempts cannot be canceled by upstream timeouts or request cancellation. For short-lived local socket connections the 5s client timeout is likely sufficient, but for consistency with Go's context-aware patterns, consider using net.Dialer.DialContext.

♻️ Optional: Use context-aware dialer
 func NewAgentHTTPClient(serverURL *url.URL) *http.Client {
 	if serverURL.Scheme == "unix" {
 		socketPath := serverURL.Path
 		return &http.Client{
 			Timeout: 5 * time.Second,
 			Transport: &http.Transport{
-				DialContext: func(_ context.Context, _, _ string) (net.Conn, error) {
-					return net.Dial("unix", socketPath)
+				DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
+					var d net.Dialer
+					return d.DialContext(ctx, "unix", socketPath)
 				},
 			},
 		}
 	}
 	return &http.Client{Timeout: 5 * time.Second}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/endpoint/endpoint.go` around lines 74 - 90, The Unix-socket
DialContext in NewAgentHTTPClient ignores the provided context, preventing
cancellation; replace the inline net.Dial call with a context-aware dial using
net.Dialer.DialContext so the passed context is respected (use the existing
socketPath and the DialContext function signature to call
(&net.Dialer{}).DialContext(ctx, "unix", socketPath)), preserving the 5s Timeout
and Transport setup otherwise.
cmd/fleetint/run.go (1)

45-50: Consider adding /home to restricted paths for multi-user systems.

The restricted list covers system directories but excludes /home. On multi-user systems, a privileged agent could inadvertently write to another user's home directory. Consider whether /home should also be restricted, or if this is an acceptable use case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/fleetint/run.go` around lines 45 - 50, Update the restrictedOfflinePaths
slice in cmd/fleetint/run.go to include "/home" so the agent will never write
into users' home directories on multi-user systems; locate the variable
restrictedOfflinePaths and add the string "/home" to the slice (preserve
existing entries and ordering), or if you intentionally allow writes to /home,
add a code comment on restrictedOfflinePaths explaining that exception and why
it's safe.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@cmd/fleetint/run.go`:
- Around line 338-346: There is a TOCTOU between
validateOfflinePath(offlineModePath) and os.MkdirAll(offlineModePath); after you
create the directory you must re-check the final path to ensure no symlink
attack slipped in. Fix by invoking the same validation (or an equivalent check)
immediately after os.MkdirAll: re-resolve symlinks (filepath.EvalSymlinks), use
os.Lstat on offlineModePath to assert it is not a symlink (ModeSymlink), and
verify the resolved path matches the expected safe parent/resolution from
validateOfflinePath; if the checks fail, remove the created directory and return
an error. Ensure these checks reference validateOfflinePath and offlineModePath
so reviewers can find them easily.

In `@internal/server/server.go`:
- Around line 460-469: The defer in startServer currently calls s.Stop() and can
race with other callers (signal handler) leading to concurrent shutdowns; add a
sync.Once field (e.g., shutdownOnce sync.Once) to the Server struct and use it
inside the Stop method to ensure shutdown logic (closing srv, listener,
nvmlInstance handling, etc.) runs exactly once; update any direct calls to Stop
to remain the same but rely on Stop invoking shutdownOnce.Do(...) so concurrent
callers are serialized and TOCTOU races are eliminated.

---

Nitpick comments:
In `@cmd/fleetint/run.go`:
- Around line 45-50: Update the restrictedOfflinePaths slice in
cmd/fleetint/run.go to include "/home" so the agent will never write into users'
home directories on multi-user systems; locate the variable
restrictedOfflinePaths and add the string "/home" to the slice (preserve
existing entries and ordering), or if you intentionally allow writes to /home,
add a code comment on restrictedOfflinePaths explaining that exception and why
it's safe.

In `@internal/endpoint/endpoint.go`:
- Around line 74-90: The Unix-socket DialContext in NewAgentHTTPClient ignores
the provided context, preventing cancellation; replace the inline net.Dial call
with a context-aware dial using net.Dialer.DialContext so the passed context is
respected (use the existing socketPath and the DialContext function signature to
call (&net.Dialer{}).DialContext(ctx, "unix", socketPath)), preserving the 5s
Timeout and Transport setup otherwise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 969365e6-5b86-45f6-9613-cc2b54e1c9fd

📥 Commits

Reviewing files that changed from the base of the PR and between 67e790d and a0f8cab.

📒 Files selected for processing (15)
  • SECURITY.md
  • cmd/fleetint/compact.go
  • cmd/fleetint/enroll.go
  • cmd/fleetint/inject.go
  • cmd/fleetint/root.go
  • cmd/fleetint/run.go
  • cmd/fleetint/status.go
  • docs/configuration.md
  • docs/usage.md
  • internal/attestation/attestation.go
  • internal/config/default.go
  • internal/endpoint/endpoint.go
  • internal/server/server.go
  • third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
✅ Files skipped from review due to trivial changes (2)
  • SECURITY.md
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • cmd/fleetint/inject.go
  • cmd/fleetint/compact.go
  • third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go
  • docs/configuration.md
  • cmd/fleetint/enroll.go
  • internal/config/default.go
  • internal/attestation/attestation.go
  • docs/usage.md

Comment thread cmd/fleetint/compact.go Outdated
Comment thread cmd/fleetint/enroll.go
@rsampaio rsampaio force-pushed the security/additional-hardening-fixes branch from a0f8cab to f0f0826 Compare April 14, 2026 00:27
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
internal/attestation/attestation.go (1)

468-493: Good security hardening for nonce validation.

The allowlist correctly covers hex and base64url encodings. The length limit and character validation prevent command injection via malicious backend-supplied nonces.

Optional defense-in-depth consideration: Since - is allowed, a nonce like --verbose would pass validation. While exec.CommandContext passes arguments directly and most argument parsers correctly treat the value after --nonce as literal, adding a check that the nonce doesn't start with - could guard against edge-case quirks in nvattest's argument parser.

🛡️ Optional: Reject nonces starting with `-`
 func validateNonce(nonce string) error {
 	if nonce == "" {
 		return fmt.Errorf("nonce is empty")
 	}
+	if nonce[0] == '-' {
+		return fmt.Errorf("nonce cannot start with '-'")
+	}
 	const maxLen = 512
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/attestation/attestation.go` around lines 468 - 493, validateNonce
currently allows '-' which lets values like "--verbose" pass; add a guard in
validateNonce (after the empty check) to reject nonces that start with '-' to
prevent them being interpreted as flags by nvattest or other argument parsers;
update the function validateNonce to return an error if nonce[0] == '-' (and
keep existing length/char checks) and add/adjust tests for the new rejection
case so callers using exec.CommandContext/nvattest are protected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/fleetint/compact.go`:
- Around line 57-69: The current reachability check only tests the unix socket
when config.DefaultListenAddress starts with "/" so the legacy TCP loopback port
(config.DefaultHealthPort via netutil.IsPortOpen) can be skipped; update the
probe in cmd/fleetint/compact.go to test both transports: if
config.DefaultListenAddress is a unix path keep the unix DialTimeout check
(net.DialTimeout + conn.Close) but also always call
netutil.IsPortOpen(config.DefaultHealthPort) and fail if either probe indicates
a running daemon, or alternatively inspect config.DefaultListenAddress for a ":"
and if it looks like a TCP addr probe that specific TCP address instead of
skipping the port check; reference config.DefaultListenAddress, net.DialTimeout,
netutil.IsPortOpen, and config.DefaultHealthPort when making the change.

In `@internal/server/server.go`:
- Around line 77-79: Stop() is not idempotent: guard the shutdown/teardown path
so concurrent calls from startServer's deferred s.Stop() and an external Stop()
cannot race and double-tear down srv/listener/exporter/DB. Implement a
single-teardown mechanism (e.g., add a sync.Once field on the server struct or a
dedicated mutex) and wrap the body of Stop() (the calls to s.srv.Shutdown,
closing s.listener and exporter/DB teardown) in that guard; alternatively remove
the deferred s.Stop() in startServer and ensure all shutdowns funnel through the
guarded Stop() to prevent concurrent mutation of srv and listener. Ensure
references to srv, listener, and any exporter/DB teardown are protected by the
chosen guard (sync.Once or mutex) so Stop() becomes safe to call multiple times.
- Around line 432-437: The code currently unconditionally calls
removeStaleSocket(s.config.Address) which only checks inode type and can unlink
a live peer socket; instead, modify startup and shutdown flow so you probe the
socket before deleting and only unlink sockets this process actually created:
after successfully creating the Unix listener (where net.Listen / the server's
listener is assigned), set a boolean on the server instance (e.g.,
s.createdListener or s.socketOwner) to true; change removeStaleSocket to first
attempt a Unix domain dial/connect to s.config.Address and only unlink if the
connect fails (or the file is not a socket), and in Stop() and all cleanup sites
(the blocks that currently call removeStaleSocket) check s.createdListener
before attempting unlink so we never remove a socket we didn't create.

---

Nitpick comments:
In `@internal/attestation/attestation.go`:
- Around line 468-493: validateNonce currently allows '-' which lets values like
"--verbose" pass; add a guard in validateNonce (after the empty check) to reject
nonces that start with '-' to prevent them being interpreted as flags by
nvattest or other argument parsers; update the function validateNonce to return
an error if nonce[0] == '-' (and keep existing length/char checks) and
add/adjust tests for the new rejection case so callers using
exec.CommandContext/nvattest are protected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: fd251f92-2886-4384-b87c-3502a15b8d10

📥 Commits

Reviewing files that changed from the base of the PR and between a0f8cab and f0f0826.

📒 Files selected for processing (18)
  • SECURITY.md
  • cmd/fleetint/compact.go
  • cmd/fleetint/enroll.go
  • cmd/fleetint/inject.go
  • cmd/fleetint/root.go
  • cmd/fleetint/run.go
  • cmd/fleetint/status.go
  • docs/configuration.md
  • docs/usage.md
  • internal/attestation/attestation.go
  • internal/config/default.go
  • internal/endpoint/endpoint.go
  • internal/server/server.go
  • third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
  • third_party/fleet-intelligence-sdk/pkg/process/runner.go
  • third_party/fleet-intelligence-sdk/pkg/process/runner_exclusive.go
  • third_party/fleet-intelligence-sdk/pkg/process/runner_exclusive_test.go
💤 Files with no reviewable changes (3)
  • third_party/fleet-intelligence-sdk/pkg/process/runner.go
  • third_party/fleet-intelligence-sdk/pkg/process/runner_exclusive.go
  • third_party/fleet-intelligence-sdk/pkg/process/runner_exclusive_test.go
✅ Files skipped from review due to trivial changes (2)
  • docs/configuration.md
  • SECURITY.md
🚧 Files skipped from review as they are similar to previous changes (7)
  • cmd/fleetint/run.go
  • cmd/fleetint/status.go
  • third_party/fleet-intelligence-sdk/pkg/kmsg/watcher.go
  • cmd/fleetint/enroll.go
  • cmd/fleetint/root.go
  • docs/usage.md
  • internal/endpoint/endpoint.go

Comment thread cmd/fleetint/compact.go Outdated
Comment thread internal/server/server.go
Comment thread internal/server/server.go Outdated
Switch the default listener from TCP 127.0.0.1:15133 to a Unix socket
(/run/fleetint/fleetint.sock) with 0600 permissions so that access is
governed by filesystem ownership rather than firewall rules. TCP remains
available via --listen-address for backward compatibility. Client
commands (status, inject) transparently support both transports through
the new endpoint.NewAgentHTTPClient / AgentBaseURL helpers.

Additional hardening in this commit:
- Validate attestation nonce character set and length before passing it
  to the nvattest exec call (defense-in-depth against a compromised
  backend crafting malicious arguments).
- Add --token-file flag to `enroll` so the SAK token can be read from a
  file or stdin instead of appearing in /proc/<pid>/cmdline.
- Reject --path values that point into restricted system directories
  (/etc, /usr, /sys, ...) when running in offline mode.
- Restrict KMSG_FILE_PATH to paths under /dev/ and validate DCGM_URL
  character set to prevent environment-variable-based file read or
  connection redirection by a local attacker.
- Add sync.WaitGroup to attestation Manager so Stop() blocks until the
  background goroutine exits, eliminating a data race on the global
  defaultStateFileFn in tests.

Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
Add tests covering the new security-hardening code paths:

- endpoint: unix socket URL validation (bare paths, unix:// scheme,
  reject query/fragment/host), NewAgentHTTPClient, AgentBaseURL
- attestation: validateNonce (valid chars, empty, too long, shell
  metacharacters, null bytes, pipes)
- server: removeStaleSocket (nonexistent, regular file, symlink,
  actual socket), Stop() idempotency via sync.Once
- cmd/fleetint: validateOfflinePath (restricted dirs, traversal,
  symlinks), resolveToken (mutual exclusion, file read, missing token)

Also simplify Stop() socket cleanup -- Go's UnixListener.Close()
already unlinks the socket file, so the explicit removeStaleSocket
call in Stop() was redundant.

Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
@rsampaio rsampaio force-pushed the security/additional-hardening-fixes branch from f0f0826 to d146629 Compare April 14, 2026 05:29
Two changes to prevent breaking existing customer deployments:

1. fleetint status: when the default unix socket connection fails,
   transparently fall back to TCP localhost:15133. This handles the
   upgrade scenario where the daemon was started with an explicit
   --listen-address on TCP (via /etc/default/fleetint) but the CLI
   tools now default to the unix socket.

2. Remove /var from restrictedOfflinePaths. Blocking all of /var was
   too aggressive -- paths like /var/data/ and /var/opt/ are legitimate
   output locations. The remaining restricted list (/etc, /usr, /sys,
   /bin, /boot, /dev, /lib, /proc, /run, /sbin) covers directories
   where writes would damage the OS.

Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
@rsampaio rsampaio merged commit b305abd into main Apr 14, 2026
9 checks passed
@rsampaio rsampaio deleted the security/additional-hardening-fixes branch April 14, 2026 18:16
jingxiang-z pushed a commit that referenced this pull request Apr 14, 2026
Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
jingxiang-z pushed a commit that referenced this pull request Apr 14, 2026
Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
jingxiang-z pushed a commit that referenced this pull request Apr 14, 2026
Signed-off-by: Rodrigo Sampaio Vaz <rvaz@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants