Copilot/sub pr 16#18
Conversation
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Igor Holt <iholt@mymail.aacc.edu>
Summary of ChangesHello Igor Holt (@igor-holt), I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request lays the groundwork for the AIOS Layer, a system designed to provide a secure and reproducible environment for AI/GPU inference workloads. It encompasses the initial development of a Go-based host agent responsible for managing GPU resources through a lease-based scheduling mechanism and configurable policies. The changes also introduce a command-line interface for user interaction, along with extensive documentation and deployment artifacts to facilitate setup, operation, and future development. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on Gemini (@gemini-code-assist) comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces the aios-layer project, a comprehensive system for managing AI/GPU resources. It includes a Go agent, a CLI, Docker configurations, and documentation. The overall structure is well-organized. My review focuses on improving correctness, robustness, and maintainability. Key areas for improvement include fixing a critical Docker build issue, correcting flawed benchmark tests, improving error handling in the CLI, and addressing potential race conditions and configuration oversights in the agent.
| COPY go.mod ./ | ||
| RUN go mod download | ||
| COPY agent ./agent | ||
| RUN cd agent && go build -o /bin/aios-agent |
There was a problem hiding this comment.
The command cd agent && go build is incorrect. The go build command should be run from the WORKDIR (/src) where go.mod is located. The current command will likely fail the Docker build because the Go toolchain won't be able to resolve the module paths correctly from within the agent subdirectory.
RUN go build -o /bin/aios-agent ./agent
| func BenchmarkCreateLease(b *testing.B) { | ||
| gpus := []GPU{{Index: 0, Name: "gpu0", MemoryTotal: 1000}, {Index: 1, Name: "gpu1", MemoryTotal: 1000}} | ||
| for i := 0; i < b.N; i++ { | ||
| sched := New(gpus) | ||
| _, _ = sched.CreateLease("lease", "user", time.Second) | ||
| } | ||
| } | ||
|
|
||
| func BenchmarkReapExpired(b *testing.B) { | ||
| gpus := []GPU{{Index: 0, Name: "gpu0", MemoryTotal: 1000}} | ||
| for i := 0; i < b.N; i++ { | ||
| sched := New(gpus) | ||
| _, _ = sched.CreateLease("lease", "user", time.Nanosecond) | ||
| sched.ReapExpired() | ||
| } | ||
| } |
There was a problem hiding this comment.
The benchmarks in this file are flawed. They re-create the scheduler inside the benchmark loop (for i := 0; i < b.N; i++), which means the benchmark is measuring the setup cost (New(gpus)) on every iteration, not just the function under test. Additionally, BenchmarkCreateLease will start benchmarking the error path after available GPUs are leased. Benchmarks should isolate the code being measured and correctly handle state.
func BenchmarkCreateLease(b *testing.B) {
gpus := []GPU{{Index: 0, Name: "gpu0", MemoryTotal: 1000}}
sched := New(gpus)
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
lease, _ := sched.CreateLease("lease", "user", time.Second)
sched.Release(lease.ID)
}
}
func BenchmarkReapExpired(b *testing.B) {
gpus := []GPU{{Index: 0, Name: "gpu0", MemoryTotal: 1000}}
sched := New(gpus)
_, _ = sched.CreateLease("lease", "user", time.Nanosecond)
time.Sleep(time.Millisecond) // Ensure lease is expired
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
sched.ReapExpired()
}
}| build: | ||
| go build -o bin/aios-agent ./agent | ||
| go build -o bin/aiosctl ./cli |
There was a problem hiding this comment.
| func sendInference(endpoint, model, prompt string) { | ||
| payload := map[string]any{ | ||
| "model": model, | ||
| "messages": []map[string]string{ | ||
| {"role": "user", "content": prompt}, | ||
| }, | ||
| "max_tokens": 64, | ||
| "temperature": 0.2, | ||
| } | ||
| body, _ := json.Marshal(payload) | ||
| client := http.Client{Timeout: time.Second * 60} | ||
| resp, err := client.Post(endpoint, "application/json", bytes.NewReader(body)) | ||
| if err != nil { | ||
| fmt.Printf("error: %v\n", err) | ||
| return | ||
| } | ||
| defer resp.Body.Close() | ||
| data, _ := io.ReadAll(resp.Body) | ||
| fmt.Println(string(data)) | ||
| } |
There was a problem hiding this comment.
Errors from json.Marshal and io.ReadAll are being ignored. This can hide bugs and lead to unexpected behavior. For a CLI tool, it's crucial to check all errors and report them to the user, preferably on stderr.
func sendInference(endpoint, model, prompt string) {
payload := map[string]any{
"model": model,
"messages": []map[string]string{
{"role": "user", "content": prompt},
},
"max_tokens": 64,
"temperature": 0.2,
}
body, err := json.Marshal(payload)
if err != nil {
fmt.Fprintf(os.Stderr, "error: %v\n", err)
return
}
client := http.Client{Timeout: time.Second * 60}
resp, err := client.Post(endpoint, "application/json", bytes.NewReader(body))
if err != nil {
fmt.Fprintf(os.Stderr, "error: %v\n", err)
return
}
defer resp.Body.Close()
data, err := io.ReadAll(resp.Body)
if err != nil {
fmt.Fprintf(os.Stderr, "error: %v\n", err)
return
}
fmt.Println(string(data))
}| func requestLease(agent, user string, duration int, launch bool) { | ||
| payload := leaseRequest{User: user, DurationSeconds: duration, LaunchRuntime: launch} | ||
| body, _ := json.Marshal(payload) | ||
| resp, err := http.Post(agent+"/v1/leases", "application/json", bytes.NewReader(body)) | ||
| if err != nil { | ||
| fmt.Printf("error: %v\n", err) | ||
| return | ||
| } | ||
| defer resp.Body.Close() | ||
| data, _ := io.ReadAll(resp.Body) | ||
| fmt.Println(string(data)) | ||
| } |
There was a problem hiding this comment.
Errors from json.Marshal and io.ReadAll are being ignored. This can hide bugs and lead to unexpected behavior. For a CLI tool, it's crucial to check all errors and report them to the user, preferably on stderr.
func requestLease(agent, user string, duration int, launch bool) {
payload := leaseRequest{User: user, DurationSeconds: duration, LaunchRuntime: launch}
body, err := json.Marshal(payload)
if err != nil {
fmt.Fprintf(os.Stderr, "error: %v\n", err)
return
}
resp, err := http.Post(agent+"/v1/leases", "application/json", bytes.NewReader(body))
if err != nil {
fmt.Fprintf(os.Stderr, "error: %v\n", err)
return
}
defer resp.Body.Close()
data, err := io.ReadAll(resp.Body)
if err != nil {
fmt.Fprintf(os.Stderr, "error: %v\n", err)
return
}
fmt.Println(string(data))
}| type openAIRequest struct { | ||
| Model string `json:"model"` | ||
| Messages []string `json:"messages"` | ||
| } |
| func (p Policy) ValidateRequest(existingLeases int, requestedDuration int) error { | ||
| if p.MaxGPUsPerUser > 0 && existingLeases >= p.MaxGPUsPerUser { | ||
| return errors.New("gpu quota exceeded") | ||
| } | ||
| if p.MaxDurationSec > 0 && requestedDuration > p.MaxDurationSec { | ||
| return errors.New("requested duration exceeds policy") | ||
| } | ||
| if requestedDuration <= 0 { | ||
| return errors.New("invalid duration") | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
The ValidateRequest function returns errors created with errors.New(). It's a better practice to define these as exported error variables. This allows callers to use errors.Is() to check for specific error types and handle them programmatically.
| func (p Policy) ValidateRequest(existingLeases int, requestedDuration int) error { | |
| if p.MaxGPUsPerUser > 0 && existingLeases >= p.MaxGPUsPerUser { | |
| return errors.New("gpu quota exceeded") | |
| } | |
| if p.MaxDurationSec > 0 && requestedDuration > p.MaxDurationSec { | |
| return errors.New("requested duration exceeds policy") | |
| } | |
| if requestedDuration <= 0 { | |
| return errors.New("invalid duration") | |
| } | |
| return nil | |
| } | |
| var ( | |
| ErrQuotaExceeded = errors.New("gpu quota exceeded") | |
| ErrDurationExceedsPolicy = errors.New("requested duration exceeds policy") | |
| ErrInvalidDuration = errors.New("invalid duration") | |
| ) | |
| func (p Policy) ValidateRequest(existingLeases int, requestedDuration int) error { | |
| if p.MaxGPUsPerUser > 0 && existingLeases >= p.MaxGPUsPerUser { | |
| return ErrQuotaExceeded | |
| } | |
| if p.MaxDurationSec > 0 && requestedDuration > p.MaxDurationSec { | |
| return ErrDurationExceedsPolicy | |
| } | |
| if requestedDuration <= 0 { | |
| return ErrInvalidDuration | |
| } | |
| return nil | |
| } |
| - uses: actions/setup-go@v5 | ||
| with: | ||
| go-version: "1.22" | ||
| - run: go mod download |
| if err != nil { | ||
| return nil, errors.New("nvidia-smi not available") | ||
| } |
There was a problem hiding this comment.
| sleep 5 | ||
| curl -s http://127.0.0.1:8080/v1/health |
There was a problem hiding this comment.
Using a fixed sleep is a fragile way to wait for a service to become available. It can fail on slower systems or under load. It's more robust to use a loop that polls the health endpoint until it succeeds or a timeout is reached.
| sleep 5 | |
| curl -s http://127.0.0.1:8080/v1/health | |
| echo "Waiting for aios-agent to be healthy..." | |
| for i in {1..10}; do | |
| if curl -s http://127.0.0.1:8080/v1/health | grep -q "ok"; then | |
| echo "aios-agent is healthy." | |
| break | |
| fi | |
| if [ "$i" -eq 10 ]; then | |
| echo "aios-agent failed to start in time." >&2 | |
| exit 1 | |
| fi | |
| sleep 1 | |
| done | |
| curl -s http://127.0.0.1:8080/v1/health |
/