Copilot/sub pr 16#18

Open
Igor Holt (igor-holt) wants to merge 3 commits into main from copilot/sub-pr-16

@igor-holt
Member

/

Igor Holt (igor-holt) and others added 3 commits February 1, 2026 13:42
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Igor Holt <iholt@mymail.aacc.edu>
@gemini-code-assist

Summary of Changes

Hello Igor Holt (@igor-holt), I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for the AIOS Layer, a system designed to provide a secure and reproducible environment for AI/GPU inference workloads. It encompasses the initial development of a Go-based host agent responsible for managing GPU resources through a lease-based scheduling mechanism and configurable policies. The changes also introduce a command-line interface for user interaction, along with extensive documentation and deployment artifacts to facilitate setup, operation, and future development.

Highlights

  • Core AIOS Agent Implementation: Introduced the aios-agent in Go, featuring GPU discovery, a lease-based scheduler, and policy enforcement for managing GPU resources.
  • Comprehensive Tooling and Documentation: Added a CLI (aiosctl), a basic web UI, detailed documentation (architecture, security, roadmap), and scripts for building, testing, and deploying the system (including Dockerfiles and systemd service).
  • CI/CD and Development Workflow: Established a GitHub Actions CI pipeline, a Makefile for common tasks, and contributing guidelines to streamline development.
Changelog
  • aios-layer/.github/workflows/ci.yml
    • Added a GitHub Actions workflow to automate building, testing, and running checks for the 'aios-layer' component.
  • aios-layer/.gitignore
    • Added entries to ignore compiled binaries (/bin) and test/benchmark reports (/reports).
  • aios-layer/CONTRIBUTING.md
    • Introduced a contributing guide detailing development setup, code style requirements (gofmt), and pull request guidelines.
  • aios-layer/LICENSE
    • Added the MIT License for the project, with a copyright year of 2025.
  • aios-layer/Makefile
    • Created a Makefile to simplify common development operations such as build, test, bench, and run.
  • aios-layer/README.md
    • Provided a comprehensive README for the AIOS Layer MVP, including a project overview, 10-minute quickstart guide, architecture summary, NVIDIA GPU integration details, security notes, and testing instructions.
  • aios-layer/SECURITY.md
    • Established a security policy document covering vulnerability reporting, supported versions, and hardening recommendations for the aios-agent.
  • aios-layer/agent/internal/config/config.go
    • Implemented Go structures and a loading function for managing agent configuration from YAML files, covering server, policy, scheduler, runtime, model, and metrics settings.
  • aios-layer/agent/internal/gpu/discovery.go
    • Developed GPU discovery logic in Go, utilizing nvidia-smi to identify available GPUs and parse their details.
  • aios-layer/agent/internal/policy/policy.go
    • Defined a policy engine in Go to validate GPU lease requests based on user quotas and maximum duration.
  • aios-layer/agent/internal/runtime/docker.go
    • Created a Docker runtime launcher in Go to start model server containers with specified GPU access.
  • aios-layer/agent/internal/scheduler/scheduler.go
    • Implemented a GPU scheduler in Go to manage exclusive GPU leases, track active leases, and automatically reap expired ones.
  • aios-layer/agent/internal/scheduler/scheduler_bench_test.go
    • Added Go benchmark tests for the scheduler's CreateLease and ReapExpired functions to measure performance.
  • aios-layer/agent/internal/scheduler/scheduler_test.go
    • Included Go unit tests for the scheduler, verifying exclusive GPU assignment and the correct reaping of expired leases.
  • aios-layer/agent/main.go
    • Developed the main aios-agent application, which serves HTTP API endpoints for health checks, GPU inventory, lease management (create, list, delete), and Prometheus-compatible metrics.
  • aios-layer/cli/main.go
    • Introduced aiosctl, a command-line interface tool for users to request GPU leases and send inference prompts to the model server.
  • aios-layer/config/aios.yaml
    • Provided a default YAML configuration file for the aios-agent, specifying network settings, policy limits, Docker runtime parameters, and metrics enablement.
  • aios-layer/deploy/aios-agent.service
    • Added a systemd service unit file to enable running the aios-agent as a background service with defined security contexts.
  • aios-layer/deploy/docker-compose.yml
    • Created a Docker Compose file for easily deploying the aios-agent alongside a vllm model server for local development and demonstration.
  • aios-layer/deploy/security/apparmor.profile
    • Included an AppArmor profile to enhance the security of the aios-agent by restricting its system capabilities.
  • aios-layer/deploy/security/seccomp.json
    • Added a seccomp profile to limit the system calls available to the aios-agent, further improving its security posture.
  • aios-layer/docker/agent.Dockerfile
    • Provided a Dockerfile for building a lightweight Docker image for the aios-agent.
  • aios-layer/docker/vllm.Dockerfile
    • Provided a Dockerfile for building a vllm model server Docker image, configured with a default model.
  • aios-layer/docs/architecture.md
    • Documented the architectural overview of the AIOS Layer, distinguishing between control and data planes, trust boundaries, and the policy model.
  • aios-layer/docs/nvidia_setup.md
    • Provided a detailed guide for setting up NVIDIA drivers and the container toolkit on Ubuntu 24.04.
  • aios-layer/docs/operations.md
    • Documented operational aspects including observability (logs, metrics), update strategies, and the plan for signed releases.
  • aios-layer/docs/roadmap.md
    • Outlined the project roadmap, detailing plans for MVP, v1.0, multi-node support, and future platform/hardware integrations.
  • aios-layer/docs/threat_model.md
    • Presented a threat model for the MVP, identifying potential threats, their risks, and proposed mitigations.
  • aios-layer/go.mod
    • Initialized the Go module for the aios-layer project and declared its dependencies.
  • aios-layer/go.sum
    • Generated the Go module checksums for dependency verification.
  • aios-layer/scripts/benchmark.sh
    • Added a shell script to benchmark inference latency and monitor GPU utilization.
  • aios-layer/scripts/check_prereqs.sh
    • Created a script to verify the presence of necessary prerequisites like Docker and nvidia-smi.
  • aios-layer/scripts/install.sh
    • Provided an installation script to deploy the aios-agent as a systemd service.
  • aios-layer/scripts/run_checks.sh
    • Developed a script to execute all Go tests and benchmarks, directing their output to report files.
  • aios-layer/scripts/run_demo.sh
    • Created a script to set up and run the AIOS Layer demo environment using Docker Compose, including building necessary images.
  • aios-layer/tests/integration/demo.sh
    • Added an integration test script to validate the functionality of the running aios-agent and model server by making API calls.
  • aios-layer/ui/index.html
    • Developed a simple web-based dashboard to display GPU inventory and active leases from the aios-agent.
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on Gemini (@gemini-code-assist) comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the aios-layer project, a comprehensive system for managing AI/GPU resources. It includes a Go agent, a CLI, Docker configurations, and documentation. The overall structure is well-organized. My review focuses on improving correctness, robustness, and maintainability. Key areas for improvement include fixing a critical Docker build issue, correcting flawed benchmark tests, improving error handling in the CLI, and addressing potential race conditions and configuration oversights in the agent.

COPY go.mod ./
RUN go mod download
COPY agent ./agent
RUN cd agent && go build -o /bin/aios-agent

critical

The command cd agent && go build is fragile. The Go toolchain locates go.mod by searching parent directories, so the module can still be resolved from within the agent subdirectory, but the build then depends on the shell's working directory rather than on an explicit package path. It is more conventional to run go build from the WORKDIR (/src), where go.mod is located, and pass the package path.

RUN go build -o /bin/aios-agent ./agent

Comment on lines +8 to +23
func BenchmarkCreateLease(b *testing.B) {
	gpus := []GPU{{Index: 0, Name: "gpu0", MemoryTotal: 1000}, {Index: 1, Name: "gpu1", MemoryTotal: 1000}}
	for i := 0; i < b.N; i++ {
		sched := New(gpus)
		_, _ = sched.CreateLease("lease", "user", time.Second)
	}
}

func BenchmarkReapExpired(b *testing.B) {
	gpus := []GPU{{Index: 0, Name: "gpu0", MemoryTotal: 1000}}
	for i := 0; i < b.N; i++ {
		sched := New(gpus)
		_, _ = sched.CreateLease("lease", "user", time.Nanosecond)
		sched.ReapExpired()
	}
}

high

The benchmarks in this file are flawed. They re-create the scheduler inside the benchmark loop (for i := 0; i < b.N; i++), so every iteration also measures the setup cost (New(gpus)), not just the function under test. Moving the scheduler out of the loop is not enough on its own: BenchmarkCreateLease would then measure the error path once the available GPUs are leased, so each iteration must also release the lease it created. Benchmarks should isolate the code being measured and correctly handle state.

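func BenchmarkCreateLease(b *testing.B) {
	gpus := []GPU{{Index: 0, Name: "gpu0", MemoryTotal: 1000}}
	sched := New(gpus)
	b.ReportAllocs()
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		lease, err := sched.CreateLease("lease", "user", time.Second)
		if err != nil {
			b.Fatal(err) // fail loudly instead of silently timing the error path
		}
		// Free the GPU so the next iteration can lease it again.
		// Each iteration therefore times CreateLease plus Release.
		sched.Release(lease.ID)
	}
}

func BenchmarkReapExpired(b *testing.B) {
	gpus := []GPU{{Index: 0, Name: "gpu0", MemoryTotal: 1000}}
	sched := New(gpus)
	_, _ = sched.CreateLease("lease", "user", time.Nanosecond)
	time.Sleep(time.Millisecond) // Ensure lease is expired
	b.ReportAllocs()
	b.ResetTimer()

	// The first call reaps the single expired lease; later iterations
	// measure ReapExpired with nothing left to reap.
	for i := 0; i < b.N; i++ {
		sched.ReapExpired()
	}
}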

Comment thread aios-layer/Makefile
Comment on lines +5 to +7
build:
	go build -o bin/aios-agent ./agent
	go build -o bin/aiosctl ./cli

high

The build target attempts to write binaries to the bin/ directory, but it doesn't ensure this directory exists first. If the bin/ directory is missing, the go build command will fail.

build:
	mkdir -p bin
	go build -o bin/aios-agent ./agent
	go build -o bin/aiosctl ./cli

Comment thread aios-layer/cli/main.go
Comment on lines +76 to +95
func sendInference(endpoint, model, prompt string) {
	payload := map[string]any{
		"model": model,
		"messages": []map[string]string{
			{"role": "user", "content": prompt},
		},
		"max_tokens": 64,
		"temperature": 0.2,
	}
	body, _ := json.Marshal(payload)
	client := http.Client{Timeout: time.Second * 60}
	resp, err := client.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	defer resp.Body.Close()
	data, _ := io.ReadAll(resp.Body)
	fmt.Println(string(data))
}

high

Errors from json.Marshal and io.ReadAll are being ignored. This can hide bugs and lead to unexpected behavior. For a CLI tool, it's crucial to check all errors and report them to the user, preferably on stderr.

func sendInference(endpoint, model, prompt string) {
	payload := map[string]any{
		"model": model,
		"messages": []map[string]string{
			{"role": "user", "content": prompt},
		},
		"max_tokens": 64,
		"temperature": 0.2,
	}
	body, err := json.Marshal(payload)
	if err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		return
	}
	client := http.Client{Timeout: time.Second * 60}
	resp, err := client.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		return
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		return
	}
	fmt.Println(string(data))
}

Comment thread aios-layer/cli/main.go
Comment on lines +63 to +74
func requestLease(agent, user string, duration int, launch bool) {
	payload := leaseRequest{User: user, DurationSeconds: duration, LaunchRuntime: launch}
	body, _ := json.Marshal(payload)
	resp, err := http.Post(agent+"/v1/leases", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	defer resp.Body.Close()
	data, _ := io.ReadAll(resp.Body)
	fmt.Println(string(data))
}

high

Errors from json.Marshal and io.ReadAll are being ignored. This can hide bugs and lead to unexpected behavior. For a CLI tool, it's crucial to check all errors and report them to the user, preferably on stderr.

func requestLease(agent, user string, duration int, launch bool) {
	payload := leaseRequest{User: user, DurationSeconds: duration, LaunchRuntime: launch}
	body, err := json.Marshal(payload)
	if err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		return
	}
	resp, err := http.Post(agent+"/v1/leases", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		return
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		return
	}
	fmt.Println(string(data))
}

Comment thread aios-layer/cli/main.go
Comment on lines +27 to +30
type openAIRequest struct {
	Model    string   `json:"model"`
	Messages []string `json:"messages"`
}

medium

The openAIRequest struct is defined but never used in the code. It should be removed to avoid dead code.

Comment on lines +10 to +21
func (p Policy) ValidateRequest(existingLeases int, requestedDuration int) error {
	if p.MaxGPUsPerUser > 0 && existingLeases >= p.MaxGPUsPerUser {
		return errors.New("gpu quota exceeded")
	}
	if p.MaxDurationSec > 0 && requestedDuration > p.MaxDurationSec {
		return errors.New("requested duration exceeds policy")
	}
	if requestedDuration <= 0 {
		return errors.New("invalid duration")
	}
	return nil
}

medium

The ValidateRequest function returns errors created with errors.New(). It's a better practice to define these as exported error variables. This allows callers to use errors.Is() to check for specific error types and handle them programmatically.

Suggested change
-func (p Policy) ValidateRequest(existingLeases int, requestedDuration int) error {
-	if p.MaxGPUsPerUser > 0 && existingLeases >= p.MaxGPUsPerUser {
-		return errors.New("gpu quota exceeded")
-	}
-	if p.MaxDurationSec > 0 && requestedDuration > p.MaxDurationSec {
-		return errors.New("requested duration exceeds policy")
-	}
-	if requestedDuration <= 0 {
-		return errors.New("invalid duration")
-	}
-	return nil
-}
+var (
+	ErrQuotaExceeded         = errors.New("gpu quota exceeded")
+	ErrDurationExceedsPolicy = errors.New("requested duration exceeds policy")
+	ErrInvalidDuration       = errors.New("invalid duration")
+)
+
+func (p Policy) ValidateRequest(existingLeases int, requestedDuration int) error {
+	if p.MaxGPUsPerUser > 0 && existingLeases >= p.MaxGPUsPerUser {
+		return ErrQuotaExceeded
+	}
+	if p.MaxDurationSec > 0 && requestedDuration > p.MaxDurationSec {
+		return ErrDurationExceedsPolicy
+	}
+	if requestedDuration <= 0 {
+		return ErrInvalidDuration
+	}
+	return nil
+}

- uses: actions/setup-go@v5
  with:
    go-version: "1.22"
- run: go mod download

medium

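The CI workflow can be optimized by caching Go modules to speed up subsequent builds. Note that actions/setup-go@v5 already caches the module and build caches by default, keyed on go.sum; if go.sum is not at the repository root, set the action's cache-dependency-path input to point at it (for example aios-layer/go.sum). Alternatively, add an actions/cache step for ~/go/pkg/mod and the Go build cache before the go mod download step.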

Comment on lines +21 to +23
	if err != nil {
		return nil, errors.New("nvidia-smi not available")
	}

medium

The error returned when nvidia-smi is not found loses the original error context. Wrapping the error provides more detailed information for debugging, such as why the command failed to run (e.g., not in PATH).

		return nil, fmt.Errorf("nvidia-smi not available: %w", err)

Comment on lines +11 to +12
sleep 5
curl -s http://127.0.0.1:8080/v1/health

medium

Using a fixed sleep is a fragile way to wait for a service to become available. It can fail on slower systems or under load. It's more robust to use a loop that polls the health endpoint until it succeeds or a timeout is reached.

Suggested change
-sleep 5
-curl -s http://127.0.0.1:8080/v1/health
+echo "Waiting for aios-agent to be healthy..."
+for i in {1..10}; do
+  if curl -s http://127.0.0.1:8080/v1/health | grep -q "ok"; then
+    echo "aios-agent is healthy."
+    break
+  fi
+  if [ "$i" -eq 10 ]; then
+    echo "aios-agent failed to start in time." >&2
+    exit 1
+  fi
+  sleep 1
+done
+curl -s http://127.0.0.1:8080/v1/health

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
