Skip to content

feat(cli): size purchases by target RI/SP utilization (--target-utilization, closes #338)#339

Merged
cristim merged 42 commits into
feat/multicloud-web-frontendfrom
feat/cli-target-utilization
May 13, 2026
Merged

feat(cli): size purchases by target RI/SP utilization (--target-utilization, closes #338)#339
cristim merged 42 commits into
feat/multicloud-web-frontendfrom
feat/cli-target-utilization

Conversation

@cristim
Copy link
Copy Markdown
Member

@cristim cristim commented May 12, 2026

Summary

  • Adds a new CLI flag --target-utilization (-u) that sizes RI/SP purchases by projected post-purchase utilization instead of --coverage % of recommendations.
  • Reuses utilization signals already on the wire (AverageNumberOfInstancesUsedPerHour, AverageUtilization, EstimatedAverageUtilization) — no new AWS API call.
  • Emits three new columns on the dry-run/purchase CSV: ProjectedUtilization, ProjectedCoverage, RecommendedUtilization.

Closes #338.

Why a separate sizing mode

--coverage 80 is "buy 80% of each recommendation's instance count." It answers "how much of demand do we cover?" but ignores "how much of what we buy will actually be used?" In bursty/seasonal workloads, --coverage 80 can still over-buy, leaving commitments idle.

--target-utilization 95 is "buy enough that projected utilization stays ≥ 95%." Counter-intuitively, higher target ⇒ fewer commitments bought (the math is used÷bought; raising the target shrinks the bought side). Use it when under-buying-and-spilling-to-on-demand beats over-buying-and-wasting-commitment.

The two modes are mutually exclusive: when --target-utilization is set, --coverage is ignored with an info log.

Math

  • RI: n_target = floor(AverageInstancesUsedPerHour / (target/100)), capped at rec.Count (never exceed AWS's recommended ceiling). Drop with explanatory INFO log when n_target == 0 (target unreachable). Pass through unmodified when avg is zero (no signal); counted in a single end-of-run skip summary.
  • SP: if RecommendedUtilization >= target, no scaling (AWS already projects above target). Else ratio = RecommendedUtilization / target (always < 1); scales SavingsPlanDetails.HourlyCommitment and rec.EstimatedSavingsexactly the fields ApplyCoverage scales on the SP branch, so the two modes stay structurally consistent. Pass through when RecommendedUtilization <= 0 OR HourlyCommitment <= 0 (CE placeholder rows).

Changes

Commit Scope
7d7baf6d1 pkg/common: 4 new fields on Recommendation
bbbc36a24 parser_ri.go + parser_sp.go: populate the new fields
7d83ec948 cmd/main.go: flag + Config field + validation
79cf1a5ad cmd/helpers.go: ApplyTargetUtilization + applySizing
9468b0aaf cmd/multi_service*.go: route both call sites through applySizing
5f1e982e1 CSV writer: 3 new columns
d40001f6a README: flag table + semantics explainer
2bded0b15 Post-impl review: SP edge cases + target=100 test

Test plan

  • go build ./... clean
  • go test ./... — full suite passes (no regressions in 30+ existing packages)
  • cudly --help shows --target-utilization with the inversion-warning copy
  • TestApplyTargetUtilization_RI covers happy path / AWS-ceiling cap / target unreachable drop / no signal pass-through / target met exactly
  • TestApplyTargetUtilization_RI_Target100 covers the AC's "target == 100 → only purchases with perfectly-matched existing usage" boundary
  • TestApplyTargetUtilization_RI_CostScaling — cost fields untouched on RI branch (matches ApplyCoverage)
  • TestApplyTargetUtilization_SP — no-op / scale-down / no-signal
  • TestApplyTargetUtilization_SP_NoSignalGuardsHourlyCommitment=0 + wrong-Details-type edge cases
  • TestApplySizing — routes based on cfg.TargetUtilization > 0
  • TestValidateTargetUtilization — range + both-flags-set behaviour
  • TestParseRIUtilizationSignals + TestParseSavingsPlanDetail_RecommendedUtilization — fall back to zero on nil/unparseable
  • TestWriteMultiServiceCSVReport_UtilizationColumns — header + blank-when-zero formatting

Design notes

  • Two self-review passes (plan: 10 passes against the design; impl: a fresh-eyes Sonnet fork against the actual code) caught: floor-vs-ceil math, missing ProjectedCoverage AC field, wrong parser function for SPs (parseSavingsPlanDetail, not parseRecommendationDetail), wrong enum identifier names, wrong SP cost-field list, missing explanatory log for unreachable target, and two SP edge cases in this branch. All fixed before the push.
  • Frontend visibility of the new columns is out of scope for this PR per the issue's "Out of scope" section. The CSV change is for the CLI's existing CSV output, not the React frontend.

Summary by CodeRabbit

  • New Features
    • Added --target-coverage sizing (supersedes --coverage when set), plus --rebuy-window-days and --min-pool-size flags.
    • Fetches existing RI coverage, accounts for expiring commitments, and adds RDS family-NU sizing to hit target coverage.
    • CSV report expanded with utilization/coverage columns, richer RDS fields, sorting and a TOTAL row.
  • Documentation
    • README clarifies --coverage vs --target-coverage semantics and precedence.
  • Tests
    • Large suite of unit tests added/extended for sizing, coverage, CSV and validation logic.
  • Other
    • Summary output now shows estimated savings as monthly.

Review Change Stack

cristim added 8 commits May 12, 2026 10:25
Adds four fields to common.Recommendation to carry the utilization data
needed by the upcoming --target-utilization CLI flag:

- AverageInstancesUsedPerHour / RecommendedUtilization — the AWS-returned
  signal (read from the CE response by the recommendations parser in a
  follow-up commit).
- ProjectedUtilization / ProjectedCoverage — populated by the sizing
  step after we pick our own quantity.

Wire-format compatible: all four fields are omitempty so existing
consumers (frontend, CSV writer) ignore them until they're populated.
Populates the utilization fields added to common.Recommendation in the
previous commit:

- parser_ri.go: new parseRIUtilizationSignals helper, called from
  parseRecommendationDetail. Reads AverageNumberOfInstancesUsedPerHour
  and AverageUtilization from the existing CE response — no new API
  call. Nil pointer or parse failure leaves the field at zero (the
  sizing path treats zero as "no signal" and skips).

- parser_sp.go: reads EstimatedAverageUtilization via the existing
  parseOptionalFloat helper and stores it in RecommendedUtilization.
  AverageInstancesUsedPerHour is left at zero for SPs — the field
  isn't applicable; SP sizing uses the utilization value directly.

Tests in parser_ri_test.go and parser_sp_test.go cover both happy paths
and the nil/unparseable fallbacks.
Adds the user-facing CLI surface for issue #338. The flag accepts a
% in [0, 100]; default 0 means "disabled, fall back to --coverage".
Sizing logic lands in a follow-up commit.

- Config.TargetUtilization (cmd/main.go)
- --target-utilization / -u cobra flag with help text spelling out the
  inversion ("higher target = fewer commitments bought")
- Range validation extracted into validateTargetUtilization (keeps
  validateNumericRanges under the gocyclo 10-complexity threshold);
  explicit info-log when both --target-utilization and --coverage are
  set on the same invocation (target wins). Uses cobra's Changed()
  rather than default-comparison so --coverage 80 (which equals the
  default) still triggers the notice.

validateNumericRanges now takes the *cobra.Command for Changed() access;
existing test callers updated to pass nil where the flag-source-detection
branch is irrelevant. New TestValidateTargetUtilization covers the new
range checks and the both-flags-set path.
The core sizing logic for --target-utilization:

- RIs: n_target = floor(AverageInstancesUsedPerHour / (target/100)),
  capped at rec.Count (we never exceed AWS's recommended ceiling).
  Drop with explanatory INFO log when n_target == 0 (target unreachable,
  per the issue AC). Pass through unmodified when avg <= 0 (no signal);
  counted in a single end-of-run WARN summary.

- SPs: when AWS already projects at-or-above target, no scaling. Else
  ratio = RecommendedUtilization / target (< 1); scale exactly the
  fields ApplyCoverage scales on the SP branch
  (SavingsPlanDetails.HourlyCommitment + EstimatedSavings — NOT
  CommitmentCost/OnDemandCost). Discrimination via IsSavingsPlan(rec.Service)
  to match the existing ApplyCoverage convention.

- Routing: new applySizing(recs, cfg, coverage) helper picks between
  ApplyTargetUtilization and ApplyCoverage based on cfg.TargetUtilization.
  The explicit `coverage` parameter lets the CSV path pass its
  csvModeCoverage-substituted value while the main path passes
  cfg.Coverage. Helper-wiring at the call sites lands in the next
  commit.

Per-rec dispatch extracted into applyTargetUtilizationOne to keep
ApplyTargetUtilization under gocyclo's complexity threshold.

Tests cover: RI happy path / AWS-ceiling cap / target unreachable drop /
no signal pass-through / target-met-exactly; SP no-op / scale-down /
no-signal pass-through; cost fields untouched for RIs and partial cost
fields untouched for SPs (so divergence with ApplyCoverage's column set
is impossible without a failing test); applySizing routing logic.
Routes the two sites that previously called applyCommonCoverage directly
through the new applySizing helper, so --target-utilization takes effect
in both pipelines:

- cmd/multi_service.go: CSV-input path. The CSV mode short-circuit
  (skip sizing when csvModeCoverage == 100) is updated to NOT skip
  when target-utilization is set — TargetUtilization applies regardless
  of the coverage value.

- cmd/multi_service_helpers.go: main recommendation-fetch path
  (applyCoverageAndOverrides). Drops the per-mode coverage call in
  favour of applySizing; emits a 🎯 vs 📈 log line based on which
  sizing mode fired.

applyCommonCoverage stays in place (called by applySizing's else
branch) so the legacy coverage path is byte-for-byte unchanged when
--target-utilization is unset.
Adds three columns to the dry-run / purchase CSV emitted by
writeMultiServiceCSVReport:

- ProjectedUtilization — populated by ApplyTargetUtilization on the
  sized rec (RI: avg/n_bought * 100, capped at 100; SP: target or
  AWS's RecommendedUtilization when no scaling needed).
- ProjectedCoverage — RI only (n_bought / avg * 100, capped at 100);
  blank for SPs since the CE response doesn't expose total-demand-$.
- RecommendedUtilization — AWS's projected utilization for the full
  recommendation count (RI AverageUtilization or SP
  EstimatedAverageUtilization).

Zero values render as blank cells via the new formatPercentOrBlank
helper — zero means "unknown / not applicable" (e.g. when running on
the legacy coverage path), and "0.0" cells would mislead.
Adds the new flag to the Purchase Configuration table and spells out
the semantic difference vs `--coverage` plus the inversion
(higher target → fewer commitments). The `--help` copy was added in
the earlier flag-declaration commit and is sufficient at the CLI;
this is for the README's reference table that's the typical first
stop for new users.
Two SP-branch bugs surfaced by post-implementation review of #338:

1. **Details=nil set ProjectedUtilization anyway.** When rec.Details was
   not a *SavingsPlanDetails (a defensive log-and-pass-through path),
   the function still set ProjectedUtilization = targetPct on its way
   out. Result: CSV row showed target utilization while cost fields
   stayed unscaled — misleading the operator. Now returns the rec
   unscaled with zero projection metrics on the Details-type mismatch.

2. **HourlyCommitment=0 with positive utilization signal slipped past
   the no-signal guard.** Cost Explorer occasionally returns SP recs
   with a $0 HourlyCommitment placeholder. The ratio-scaling branch
   then did 0 * ratio = 0 while still claiming target utilization was
   achieved. Added an explicit `HourlyCommitment <= 0` no-signal check
   above the scaling block — pass through, count in skip summary.

Also fills the AC's "target == 100 → only purchases with perfectly-
matched existing usage" test gap (TestApplyTargetUtilization_RI_Target100).
Verifies the avg < 1 drop case, avg=1.0 buy-1 boundary, avg=10.0
perfect-match, and the avg > N truncation-with-clamp case.
@cristim cristim added enhancement New feature or request triaged Item has been triaged priority/p2 Backlog-worthy severity/medium Moderate harm urgency/this-quarter Within the quarter impact/many Affects most users effort/m Days type/feat New capability labels May 12, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 58e56f50-3529-427a-8ac6-5faeee050ca3

📥 Commits

Reviewing files that changed from the base of the PR and between 5a30572 and bbae9ba.

📒 Files selected for processing (1)
  • providers/aws/recommendations/expiry_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • providers/aws/recommendations/expiry_test.go

📝 Walkthrough

Walkthrough

Adds a target-coverage sizing mode, fetches/applies existing RI coverage, implements RDS family NU sizing and expiry-aware adjustments, parses utilization signals for RI/SP, updates CSV/reporting, adds CLI flags and validation, and includes unit tests integrating these changes into the recommendation pipeline.

Changes

Target Coverage & Sizing Pipeline

Layer / File(s) Summary
Docs & CLI flags
README.md, cmd/main.go, cmd/main_test.go
Adds --target-coverage/-u, --rebuy-window-days, --min-pool-size; documents --coverage vs --target-coverage; introduces effectiveSizingPct and updates purchase-id suffix.
Recommendation contract and parsers
pkg/common/types.go, providers/aws/recommendations/parser_ri.go, providers/aws/recommendations/parser_sp.go, tests
Recommendation gains AverageInstancesUsedPerHour, RecommendedUtilization, RecommendedCount, ExistingCoveragePct, ExistingCoverageKnown, ProjectedUtilization, and ProjectedCoverage. RI parser preserves AWS RecommendedNumberOfInstancesToPurchase into RecommendedCount and parses utilization signals; SP parser parses EstimatedAverageUtilization.
Coverage fetcher & applicator
providers/aws/recommendations/coverage.go, coverage_test.go
Implements GetRICoverageMap with CE GetReservationCoverage pagination/retry, canonical pool-keying (RDS/non-RDS), computes per-pool coverage percent and average concurrent instances, exposes PoolCoverage/PoolCoverageMap, and ApplyCoverageMapToRecommendations to annotate recs and rebase per-rec averages.
Expiry adjustments
providers/aws/recommendations/expiry.go, expiry_test.go
AdjustExistingCoverageForExpiringCommitments reduces ExistingCoveragePct for commitments expiring within windowDays, with clamping and tests for matching and edge cases.
RDS family NU sizing
providers/aws/recommendations/family_nu.go, family_nu_test.go
Adds RDS size-suffix→NU mapping, aggregates family NU coverage, computes family-level scaling factors, floors per-rec counts, rescales cost fields, drops/keeps recs per family target and annotates ProjectedCoverage/ProjectedUtilization; tests validate behaviors.
Sizing helpers
cmd/helpers.go, cmd/helpers_test.go
Updates ApplyCoverage to scale cost-bearing fields; adds ApplyTargetCoverage with RI and SP sizing branches (RI uses floor-based target counts from average and existing coverage; SP scales HourlyCommitment); applySizing routes on cfg.TargetCoverage; unit tests added.
Pipeline wiring
cmd/multi_service.go, cmd/multi_service_helpers.go
fetchExistingCoverage added and coverageMap threaded through fetchAllRecs/fetchAndFilterRegionRecs; applyCoverageAndOverrides replaced to apply coverage map, expiry adjustments, RDS family-NU sizing (when target enabled), and applySizing; CommitmentID generation uses effectiveSizingPct.
CSV/reporting
cmd/multi_service_csv.go, cmd/multi_service_csv_test.go, internal/reporter/reporter.go
Expanded CSV columns (RDS family/engine/deployment, avg/covered/existing/projected coverage), conditional blanking helpers, sorted rows, TOTAL aggregation row, and reporter wording fix to show monthly savings and upfront commitment; tests added.
Filters & validation
cmd/multi_service_filters.go, cmd/validators.go, tests
Centralizes dimension filters (passesDimensionFilters) and MinPoolSize logic, refactors numeric validation to accept *cobra.Command, adds validateTargetCoverage enforcing 0–100 and precedence messaging, and test coverage for validators.
Service client & mocks
providers/aws/recommendations/client.go, client_test.go, providers/aws/service_client.go, service_client_test.go
Adds GetReservationCoverage to CE API interface and test mocks; RecommendationsClientAdapter.GetRICoverageMap delegates to recommendations client; mocks updated for tests.
Misc tests
multiple files
Adds and updates tests across helpers, CSV, coverage, expiry, family_nu, parsers, validators, and main effectiveSizingPct test.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • LeanerCloud/CUDly#312: Both PRs touch AWS Savings Plan recommendation parsing and downstream cost-bearing fields: main PR’s sizing logic scales OnDemandCost/EstimatedSavings while extending parseSavingsPlanDetail (via RecommendedUtilization), and the retrieved PR modifies the same parseSavingsPlanDetail to additionally populate OnDemandCost, which would be the value being scaled/displayed by the main PR.
  • LeanerCloud/CUDly#123: Both PRs modify cmd/helpers.go’s Savings Plans handling in ApplyCoverage (one changes how SP recommendations are detected, the other expands/scales the sizing logic and adds ApplyTargetCoverage).

Suggested labels

urgency/this-sprint, effort/l

🐰 I hopped through code with a careful nibble,
Target set tight so buy-not-waste won't dribble.
CE signals parsed, families sized with care,
CSVs tidy, projections now fair.
A crunchy carrot for reviewers to share.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/cli-target-utilization

@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 12, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cmd/validators_test.go (1)

417-423: ⚡ Quick win

“Both flags explicitly set” branch is not actually exercised.

Line 453 passes nil cmd, so the cmd.Flags().Changed("coverage") path never runs. This leaves the precedence-log gate untested.

Suggested test adjustment
 import (
+	"fmt"
 	"os"
 	"path/filepath"
 	"strings"
 	"testing"

 	"github.com/LeanerCloud/CUDly/pkg/common"
+	"github.com/spf13/cobra"
 )
@@
-			toolCfg = Config{Coverage: tt.coverage, TargetUtilization: tt.target}
-			err := validateNumericRanges(nil)
+			toolCfg = Config{Coverage: tt.coverage, TargetUtilization: tt.target}
+			cmd := &cobra.Command{Use: "test"}
+			cmd.Flags().Float64("coverage", 80, "")
+			if tt.name == "target + coverage both set, both valid" {
+				if err := cmd.Flags().Set("coverage", fmt.Sprintf("%v", tt.coverage)); err != nil {
+					t.Fatalf("failed to mark coverage as changed: %v", err)
+				}
+			}
+			err := validateNumericRanges(cmd)

Also applies to: 450-454

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/validators_test.go` around lines 417 - 423, The test
TestValidateTargetUtilization isn't exercising the "both flags explicitly set"
branch because it passes nil for cmd so cmd.Flags().Changed("coverage") never
runs; update the test to create a real cobra.Command (or equivalent command type
used in validators_test.go), register the "coverage" flag, and mark it as set
(e.g., via cmd.Flags().Set("coverage", "<value>") or by calling the appropriate
Changed mechanism) before calling ValidateTargetUtilization so the
cmd.Flags().Changed("coverage") path executes; apply the same fix to the related
case around lines 450-454 to ensure the precedence-log gate is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/helpers_test.go`:
- Around line 922-930: The test table includes wantProjCovGTE but the test body
never asserts projected coverage; update the test loop in cmd/helpers_test.go to
check the function result's ProjectedCoverage (or the field name used in the
returned recommendation/result) and assert that if tests[i].wantProjCovGTE > 0
then result.ProjectedCoverage >= tests[i].wantProjCovGTE; do this alongside the
existing projected utilization check so projected coverage regressions are
caught (apply the same addition for the other similar table near lines 992-997
that also declares wantProjCovGTE).

---

Nitpick comments:
In `@cmd/validators_test.go`:
- Around line 417-423: The test TestValidateTargetUtilization isn't exercising
the "both flags explicitly set" branch because it passes nil for cmd so
cmd.Flags().Changed("coverage") never runs; update the test to create a real
cobra.Command (or equivalent command type used in validators_test.go), register
the "coverage" flag, and mark it as set (e.g., via cmd.Flags().Set("coverage",
"<value>") or by calling the appropriate Changed mechanism) before calling
ValidateTargetUtilization so the cmd.Flags().Changed("coverage") path executes;
apply the same fix to the related case around lines 450-454 to ensure the
precedence-log gate is covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 00f8f8b7-ce2e-43d4-b506-c5bacddfffd5

📥 Commits

Reviewing files that changed from the base of the PR and between f33ae7c and 2bded0b.

📒 Files selected for processing (15)
  • README.md
  • cmd/helpers.go
  • cmd/helpers_test.go
  • cmd/main.go
  • cmd/multi_service.go
  • cmd/multi_service_csv.go
  • cmd/multi_service_csv_test.go
  • cmd/multi_service_helpers.go
  • cmd/validators.go
  • cmd/validators_test.go
  • pkg/common/types.go
  • providers/aws/recommendations/parser_ri.go
  • providers/aws/recommendations/parser_ri_test.go
  • providers/aws/recommendations/parser_sp.go
  • providers/aws/recommendations/parser_sp_test.go

Comment thread cmd/helpers_test.go
cristim added 10 commits May 12, 2026 12:44
RenderSummary printed "estimated savings $X/yr" but rec.EstimatedSavings
is monthly (sourced from AWS EstimatedMonthlySavingsAmount in
parser_ri.go). The lower stats block already says "/mo" correctly, so
the two lines disagreed about the same value. Also relabel the cost
field as "upfront commitment" since CommitmentCost is the one-time
upfront portion, not a recurring/lifetime figure.

Refs #338.
generatePurchaseID always took cfg.Coverage (default 80), so dry-runs
sized by --target-utilization=70 still emitted IDs containing "80pct" —
a misleading audit signal that would carry into real --purchase runs.

Extract effectiveSizingPct(cfg) so target wins over coverage when set,
and route all three generatePurchaseID call sites through it. The
function signature now names the parameter sizingPct to make the
expectation explicit at the call site. No format change (still "Xpct"),
since the audit log records the source flag separately.

Refs #338.
Two changes that travel together because the algorithm reinterpretation
is what motivates the rename.

Algorithm change (under-buy):
  Previously the flag computed N = floor(avg / target) and capped at
  rec.Count, which made it a no-op whenever the target was below AWS's
  implied ~99% projected utilization (target=70 on a 99%-utilization
  fleet produced 99% projected utilization, not 70). Replaced by
  N = floor(avg * target/100) for RIs and a target/100 scalar on
  HourlyCommitment / EstimatedSavings for SPs. The result deliberately
  under-buys versus AWS's 100%-coverage baseline, leaving
  (100-target)% of historical demand on-demand. RI projected
  utilization stays near 100% (RIs are always fully used when
  count < avg), and projected coverage approximates target.

Rename:
  --target-utilization -> --target-coverage (CLI flag, Config.TargetCoverage,
  ApplyTargetCoverage / apply* helpers, validateTargetCoverage, all log
  messages, doc comments, README, test names). The flag never targeted a
  utilization figure under any operationally useful interpretation; the
  rename matches what the math actually produces and what operators
  expect when they pass --target-coverage=70.

Verified against shift-prd: --target-coverage=70 on Aurora yields
ProjectedUtilization=100 and ProjectedCoverage clustered at 38-70%
across rows (under target by floor's effect on small counts).
--target-coverage=80 on plain RDS yields the same shape with coverage
clustered at 66-79%.

Refs #338.
…olumns

Under under-buy --target-coverage sizing, ProjectedUtilization is 100%
on every RI row (count <= avg, so RIs are always full) and
RecommendedUtilization is ~99-100% on every row (AWS sizes for that).
Both columns add noise without information; only ProjectedCoverage
varies meaningfully across rows.

Keep the underlying fields on common.Recommendation: ApplyTargetCoverage's
SP branch still reads RecommendedUtilization as a no-signal guard, and
the algorithm still sets ProjectedUtilization for completeness even
though no caller renders it. Internal API stays the same; only the
on-disk CSV shrinks.

Refs #338.
The Count column shows the count we'll actually purchase (post-sizing).
Without a separate column for AWS's pre-sizing recommendation, the dry-
run CSV loses the "what AWS proposed" signal that operators want when
deciding how aggressively to under-buy via --target-coverage.

Add RecommendedCount to common.Recommendation (preserves AWS's pre-sizing
count before ApplyCoverage / ApplyTargetCoverage mutates Count) and emit
it in the CSV right next to Count. SP recs leave the field at zero (SPs
are dollar-denominated, not count-denominated) and the column renders
blank, matching the existing "0 = unknown / not applicable" convention.

Refs #338.
The Recommendation struct's CommitmentCost field holds AWS's upfront for
the pre-sizing RecommendedCount; ApplyCoverage and ApplyTargetCoverage
deliberately leave it untouched when shrinking Count, so rendering the
field directly would over-state the upfront whenever the user under-buys.

Emit a new UpfrontPayment column that pro-rates CommitmentCost by
Count/RecommendedCount for RIs. SP recs (RecommendedCount == 0) emit
CommitmentCost as-is — SP sizing scales HourlyCommitment/EstimatedSavings
but leaves the upfront untouched, a pre-existing inconsistency to track
separately. Blank cell when CommitmentCost is zero, matching the
"0 = unknown / not applicable" convention.

Refs #338.
ApplyCoverage and ApplyTargetCoverage previously mutated Count (RIs) or
HourlyCommitment (SPs) but left CommitmentCost, OnDemandCost, and (for
RIs) EstimatedSavings at AWS's pre-sized values. Every downstream
consumer (reporter table, summary line, audit log, CSV writer) then
displayed sized counts alongside pre-sized cost, over-stating upfront
and savings whenever a user under-buys.

Scale all cost-bearing fields at sizing time by the same ratio that
shrinks Count / HourlyCommitment:

  - ApplyCoverage RI:   ratio = coverage/100
  - ApplyCoverage SP:   ratio = coverage/100
  - ApplyTargetCoverage RI: ratio = nTarget/rec.Count (the actual count ratio)
  - ApplyTargetCoverage SP: ratio = targetPct/100

SavingsPercentage stays untouched (savings-vs-on-demand ratio is
count-invariant). RecommendedCount preserves AWS's pre-sized count for
RIs so the dry-run CSV can still surface AWS's proposal alongside the
sized values.

CSV writer simplified: the temporary formatUpfrontForSizedCount helper
(which pro-rated CommitmentCost at render time) is replaced by
formatCurrencyOrBlank, since rec.CommitmentCost now reflects the sized
purchase directly. Tests updated to assert the scaled values; the
prior "must not scale" assertions were locking in the bug.

Refs #338.
…rage sizing

The under-buy formula used floor(avg * target/100), which ignored any
existing commitments in the same pool: a user with 50% of demand already
covered by 10 owned RIs, asking for target=80%, would get a recommendation
to buy floor(0.80 * 20) = 16 more — an overbuy of 10 vs the correct 6.

Replace the formula with the coverage-gap form:

  gap_pct = target_pct - existing_coverage_pct
  n_target = floor(avg * gap_pct / 100)

Drop the rec entirely (info log) when existing already meets or exceeds
the target. Drop when the gap exists but is too small for a single RI.
Subtraction is kept in percentage units before dividing by 100, so
whole-percent inputs (e.g. 70 - 60) don't lose precision to float
subtraction near integer boundaries.

ExistingCoveragePct is a new field on common.Recommendation that defaults
to zero, so recs without the fetched signal degenerate to the old
no-existing-commitments path (floor(avg * target/100)). The actual CE
GetReservationCoverage fetch is split into a follow-up commit so the
formula and CSV plumbing can land in isolation.

ProjectedCoverage now reflects TOTAL coverage (existing + new
contribution) rather than just the new purchase's share, so operators
can verify they hit the target they asked for.

A new ExistingCoverage CSV column shows the input the sizing used;
blank-when-zero matches the other optional columns.

Refs #338.
Add GetRICoverageMap on the recommendations Client, plus the
ApplyCoverageMapToRecommendations helper that joins the fetched map
back onto each Recommendation.ExistingCoveragePct in place.

The fetcher calls Cost Explorer's GetReservationCoverage grouped by
REGION + INSTANCE_TYPE (CE allows at most 2 GroupBy dimensions), pages
through with the existing rate-limiter pattern that GetRIUtilization
established, and returns a map keyed by "region:instance_type"
(lowercased). Missing pools are absent from the map; the apply helper
leaves ExistingCoveragePct at zero for unmatched recs, which the
sizing path treats as "no signal" and falls back to the
no-existing-commitments formula.

Finer pool dimensions (OS, tenancy for EC2; engine, multi-AZ for RDS)
are aggregated together because the CE API can't slice all of them in
one call. That's a known imprecision for mixed pools but the dominant
case (single-engine, single-AZ-config workloads) gets accurate sizing.

Pipeline wiring lands in a follow-up so the formula and CSV plumbing
can stay independently testable. Three existing mocks (mockCostExplorerAPI,
mockCostExplorerForSP, mockCostExplorerClient) get a GetReservationCoverage
stub so they continue to satisfy the CostExplorerAPI interface.

Refs #338.
Phase 1 of runToolMultiService now fetches existing-pool coverage from
Cost Explorer once per run, threads the resulting PoolCoverageMap down
through fetchAllRecs into applyCoverageAndOverrides, and lets the
sizing path subtract pool coverage that's already owned. Recs whose
pool already meets/exceeds --target-coverage are dropped with an info
log; the rest get sized to fill just the remaining coverage gap.

Implementation notes:

  - CE's GetReservationCoverage returns only EC2 coverage when no SERVICE
    filter is set, so the fetcher loops once per (EC2, RDS, ElastiCache,
    OpenSearch, Redshift, MemoryDB) service. The dimension values match
    CE's canonical SERVICE-dimension strings.
  - GroupBy uses REGION + INSTANCE_TYPE. CE returns attribute keys in
    camelCase ("region", "instanceType") despite the GroupBy input
    accepting SCREAMING_SNAKE_CASE — the lookup tolerates both forms.
  - Metrics is "Hour" (not "HoursPercentage" — CE rejects that as a
    ValidationException); the percentage field is included automatically
    inside the Hour coverage block.
  - Skipping the CE fetch entirely when --target-coverage <= 0 avoids
    the per-call charge on --coverage-only runs.
  - Best-effort: a transient CE failure logs a warning and the pipeline
    continues with an empty map, which makes sizing degenerate to the
    no-existing-commitments path (matches behavior when a pool has no
    fetched coverage entry).
  - Legacy processService entry (test-only) passes nil — those tests
    never had coverage data either.

Verified against shift-prd: a --target-coverage=70 dry-run on Aurora
shows ExistingCoverage values like 46.4% / 38.5% / 36.6% on rows where
existing RDS RIs cover part of demand, with new Count sized to fill
just the gap (e.g. db.t4g.medium us-east-1: AWS rec 25 → cov 46.4% +
buy 5 = 67% total).

Refs #338.
cristim added 7 commits May 13, 2026 11:32
RDS RI list prices for Multi-AZ are roughly 2x Single-AZ. Without a
Deployment column operators couldn't verify the upfront figure was
sized for the deployment they expected; they had to cross-reference
the AWS console.

extractDeployment pulls AZConfig from *common.DatabaseDetails (or its
value form for the CSV-loader path), emitting blank for non-RDS recs.
Column slots between Engine and Instances so the (region, type,
engine, deployment) tuple reads as a unit.
)

Family-NU sizing is now driving RDS buys but the math was invisible in
the CSV — operators couldn't tell at a glance which recs belonged to
the same size-flex family or how much NU each rec contributed.

Family column shows the family prefix (e.g. db.r7g) so rows in the
same size-flex bucket sort together. NormalizedUnits shows
rec.Count × NU(size) (e.g. 15 × db.r7g.large = 60 NU), surfacing the
math AWS rec API uses to bundle .xlarge demand into .large RIs.

RDSFamilyFromType and RDSInstanceNUFromType added as exported wrappers
over the unexported helpers in providers/aws/recommendations so the
CSV writer can reuse the canonical NU scale rather than duplicating it.
)

For partial-upfront / no-upfront RIs the upfront figure is only part
of the total cost; the recurring per-month fee multiplied by the term
needs to be known to compute the real spend and ROI. AWS CE already
exposes RecurringStandardMonthlyCost and CUDly's parser stores it in
rec.RecurringMonthlyCost, but it was never emitted in the CSV.

The sizing steps (per-pool ApplyTargetCoverage{RI,SP} and family-NU
scaleRDSRecInFamily) now scale the pointer-valued monthly cost by the
same ratio they use for CommitmentCost / OnDemandCost / EstimatedSavings,
allocating a new pointer so the original rec target isn't mutated.

formatRecurringMonthlyOrBlank renders nil as blank (AWS didn't return
the field — unknown) and 0.0 as "0.00" (definitely no recurring fee,
e.g. all-upfront) so operators can distinguish the two cases.
Each rec in a multi-rec family-NU group was reporting its own
contribution only (existing% + this rec's new NU / family total NU),
so a family with two per-account recs each adding 20% of family NU
would show "existing + 20%" on both rows instead of the true
"existing + 40%". Misleading when operators eyeball a row to verify
the family will land at the target.

Split sizing into two passes: scaleFamilyRecs handles per-rec count
and cost scaling; annotateFamilyProjection sums new NU across every
surviving rec and writes the same cumulative ProjectedCoverage onto
each one. Each row now reflects "where the family will land" rather
than just "where my slice of it lands".
…338)

ExistingCoverage previously rendered blank for two semantically
distinct cases: CE returned 0% for the pool (legitimate zero coverage,
pool exists and is uncovered) and CE returned nothing for the pool
(rec surfaced from some other AWS signal, CE's lookback hadn't seen
the pool yet). Operators couldn't tell which one a blank cell meant.

Add ExistingCoverageKnown bool to common.Recommendation, set by
ApplyCoverageMapToRecommendations and ApplyFamilyNUSizingRDS whenever
a pool lookup actually hits. formatExistingCoverage renders "n/a" when
Known=false, "0.0" when Known=true with zero pct, and "X.X" otherwise.

The "n/a" cells now flag recently-launched or fluctuating pools where
AWS rec API has fresher data than CE coverage. Buying for them is
legitimate; the column just makes the data-source mismatch visible.
Data rows previously came out in whatever order AWS rec API returned
them, so a $20K row could sit below a $50 row, forcing operators to
scroll or sort manually before reviewing budget impact. Sorting by
UpfrontPayment descending surfaces the biggest decisions first.

TOTAL row at the bottom aggregates Count, NormalizedUnits,
UpfrontPayment, RecurringMonthlyCost (when any rec has one), and
EstimatedSavings so operators get the run-level totals inline rather
than copy-pasting columns into a spreadsheet. "TOTAL" label lands in
the Service column for easy spotting; per-rec identifiers (Account,
CommitmentID, Timestamp, percentages) stay blank in that row since
they don't aggregate meaningfully.
…338)

Closes two CodeRabbit nits on PR #339:

1. **TestApplyTargetCoverage_RI: `wantProjCovGTE` was dead test data.**
   The field was declared on the struct but never populated and never
   asserted on. Now populated for the three keep-the-rec cases
   (94/50/75% respectively, derived from `nTarget/avg*100` with
   existing=0), zero on the drop and pass-through cases (matching the
   `wantProjUtil` "0 means don't assert" convention), and asserted with
   a `>= want - epsilon` check that tolerates the float clamp at 100.

2. **TestValidateTargetCoverage: "both flags set" branch was untested.**
   Passing nil cmd skips the `cmd.Flags().Changed("coverage")` check
   entirely, so the precedence info-log gate had zero coverage. Added
   `useCobraCmd` to the test struct; the "both set" case now builds a
   real cobra command, marks `--coverage` as Changed via Set, and runs
   validation against it. The log line still isn't asserted (stderr
   capture isn't worth the friction here), but the branch is now
   reached.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 13, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/helpers.go`:
- Around line 159-164: The Count and cost fields are inconsistent because
newCount is an integer but costs are still multiplied by the original floating
`ratio`; change the cost-scaling to use a discrete count ratio computed from
`newCount` and `rec.Count` (e.g., `discreteRatio = float64(newCount) /
float64(rec.Count)`) and apply that to `adjusted.CommitmentCost`,
`adjusted.OnDemandCost`, and `adjusted.EstimatedSavings`; also guard against
`rec.Count == 0` to avoid divide-by-zero and keep `adjusted.Count = newCount` as
before.

In `@providers/aws/recommendations/expiry.go`:
- Around line 74-75: The expiringByPool lookup is missing matches because
commitmentPoolKey omits the deployment component while lookupPoolKey (used for
recs[i]) is deployment-aware; update the key construction so both sides use the
same deployment-aware format: either add a deployment field to common.Commitment
and include it in commitmentPoolKey, then rebuild keys consistently, or change
the lookup in the loop (around expCount, ok :=
expiringByPool[lookupPoolKey(recs[i])]) to attempt a safe fallback (if the full
deployment-aware key is not found, try a deployment-empty variant) so RDS pools
aren’t silently missed; apply the same consistent key strategy to the other
lookup sites mentioned (lines ~110-113).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d0bbeb9-a90b-488d-a4bd-c9c7624b77f0

📥 Commits

Reviewing files that changed from the base of the PR and between 2bded0b and 490b003.

📒 Files selected for processing (29)
  • README.md
  • cmd/helpers.go
  • cmd/helpers_test.go
  • cmd/main.go
  • cmd/main_test.go
  • cmd/multi_service.go
  • cmd/multi_service_csv.go
  • cmd/multi_service_csv_test.go
  • cmd/multi_service_filters.go
  • cmd/multi_service_helpers.go
  • cmd/validators.go
  • cmd/validators_test.go
  • internal/reporter/reporter.go
  • pkg/common/types.go
  • providers/aws/recommendations/client.go
  • providers/aws/recommendations/client_test.go
  • providers/aws/recommendations/coverage.go
  • providers/aws/recommendations/coverage_test.go
  • providers/aws/recommendations/expiry.go
  • providers/aws/recommendations/expiry_test.go
  • providers/aws/recommendations/family_nu.go
  • providers/aws/recommendations/family_nu_test.go
  • providers/aws/recommendations/parser_ri.go
  • providers/aws/recommendations/parser_ri_test.go
  • providers/aws/recommendations/parser_sp.go
  • providers/aws/recommendations/parser_sp_additional_test.go
  • providers/aws/recommendations/parser_sp_test.go
  • providers/aws/service_client.go
  • providers/aws/service_client_test.go
✅ Files skipped from review due to trivial changes (2)
  • providers/aws/recommendations/client_test.go
  • README.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • providers/aws/recommendations/parser_sp_test.go
  • providers/aws/recommendations/parser_sp.go
  • providers/aws/recommendations/parser_ri_test.go

Comment thread cmd/helpers.go Outdated
Comment thread providers/aws/recommendations/expiry.go
cristim added 2 commits May 13, 2026 21:39
…rage (#338)

ApplyCoverage's RI branch truncated newCount to an integer but
multiplied cost-bearing fields by the unrounded coverage ratio,
desynchronising Count and costs. Example: rec.Count=3 + coverage=50%
yields newCount=1 (33% of instances) but costs scaled to 50%,
overstating the sized purchase price by ~50%.

Use sizedRatio = newCount / rec.Count for cost scaling, matching the
discipline already in ApplyTargetCoverage and ApplyFamilyNUSizingRDS.
RecurringMonthlyCost gets the same scaling so partial / no-upfront
plans line up with their sized count too.

Fixes CodeRabbit finding on cmd/helpers.go:164.
Recommendation pool keys are deployment-aware for RDS
(region:instance_type:engine:deployment), but commitmentPoolKey
emitted "" for the deployment slot because common.Commitment lacked
the field. Result: RDS commitments expiring within --rebuy-window-days
silently failed to match any rec's deployment-aware lookup key, so the
expiring share was never deducted from ExistingCoveragePct. CUDly
would under-buy replacements for expiring RIs.

Adds Deployment to common.Commitment, populates it from
ReservedDBInstance.MultiAZ in the RDS service client, and uses it in
commitmentPoolKey. Non-RDS services leave Deployment empty so their
2-part pool keys still match. Adds an expiry test that proves a
Single-AZ commitment only adjusts the Single-AZ rec (not the Multi-AZ
rec in the same instance/region/engine pool).

Fixes CodeRabbit finding on providers/aws/recommendations/expiry.go:75.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 13, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
providers/aws/recommendations/expiry_test.go (1)

208-239: ⚡ Quick win

Rename this subtest to match what it actually asserts.

The current name says “skipped” but the assertions validate the opposite (n == 1). Aligning the title/comments with expected behavior avoids misleading failures later.

✏️ Suggested edit
-	t.Run("already-expired commitments are skipped", func(t *testing.T) {
+	t.Run("already-expired active commitments are counted if within cutoff", func(t *testing.T) {
 		// A commitment with EndDate in the past was already filtered by the
-		// State check upstream, but defensively the function should treat
-		// past dates as "not in the window" too. Past < cutoff but
-		// (past < now) means the RI is no longer providing coverage and
-		// shouldn't be deducted from current ExistingCoveragePct (which
-		// presumably already excludes it).
+		// State check upstream. Current behavior intentionally counts any
+		// commitment marked active when EndDate <= cutoff, even if EndDate < now.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@providers/aws/recommendations/expiry_test.go` around lines 208 - 239, The
subtest title is misleading: it says "already-expired commitments are skipped"
but the assertions check that an active commitment with EndDate before the
cutoff is counted (n == 1, ExistingCoveragePct adjusted). Rename the t.Run
description to reflect that active commitments with EndDate within the cutoff
are treated as expiring and included (e.g., "active commitments before cutoff
are counted as expiring"), and update the inline comment to match this intended
behavior; reference AdjustExistingCoverageForExpiringCommitments, the
recs/commits setup, pastDate, and the 30-day cutoff when making the
label/comment change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@providers/aws/recommendations/expiry_test.go`:
- Around line 208-239: The subtest title is misleading: it says "already-expired
commitments are skipped" but the assertions check that an active commitment with
EndDate before the cutoff is counted (n == 1, ExistingCoveragePct adjusted).
Rename the t.Run description to reflect that active commitments with EndDate
within the cutoff are treated as expiring and included (e.g., "active
commitments before cutoff are counted as expiring"), and update the inline
comment to match this intended behavior; reference
AdjustExistingCoverageForExpiringCommitments, the recs/commits setup, pastDate,
and the 30-day cutoff when making the label/comment change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf1280af-c886-480a-9f28-722e5466e671

📥 Commits

Reviewing files that changed from the base of the PR and between 490b003 and 23b6b97.

📒 Files selected for processing (6)
  • cmd/helpers.go
  • cmd/helpers_test.go
  • pkg/common/types.go
  • providers/aws/recommendations/expiry.go
  • providers/aws/recommendations/expiry_test.go
  • providers/aws/services/rds/client.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/common/types.go
  • providers/aws/recommendations/expiry.go
  • cmd/helpers_test.go
  • cmd/helpers.go

cristim added 2 commits May 13, 2026 22:11
…ginated loop (#338)

fetchCoverageForRDSEngineRegion and fetchCoverageForServiceRegion each
had their own ~15-line input struct + 12-line paginated loop + a
service-specific accumulator (accumulateRDSEngineGroups vs
accumulateGroups). Both loops were identical except for which pool-key
shape they wrote to the map.

fetchCoveragePaged owns the loop and invokes a per-row record callback;
each fetch path now just builds its input and passes a closure that
writes the right key shape. rdsEngineRegionFilter and serviceRegionFilter
extract the (engine, region) and (service, region) filter expressions
so the input struct literal stays readable.

Net ~60 lines removed; the two accumulator helpers were deleted entirely
since their work now lives in fetchCoveragePaged's record callback.
No behaviour change — 4410 tests pass.
…paths (#338)

Four sizing paths independently scaled the same four cost-bearing
fields (CommitmentCost, OnDemandCost, EstimatedSavings,
RecurringMonthlyCost) by a ratio, including the same allocate-new-pointer
dance for the RecurringMonthlyCost pointer to avoid mutating the
upstream rec target:

- ApplyCoverage SP branch
- ApplyCoverage RI branch
- ApplyTargetCoverageRI
- ApplyTargetCoverageSP
- ApplyFamilyNUSizingRDS / scaleRDSRecInFamily

Same 7-line pattern duplicated five places. ScaleRecommendationCosts
in pkg/common returns the scaled copy in one call so each sizing site
shrinks to one line. The pointer-safe alloc lives in the helper.

No behaviour change; the SP branches still scale HourlyCommitment in
their copied Details separately because that's an SP-specific detail
field, not a generic rec cost field. 4410 tests pass.
cristim added a commit that referenced this pull request May 13, 2026
…tempty on user/group slices (closes #349, #350) (#353)

* feat(db): migration 000050 drops users_one_admin index (#349)

Removes the partial unique index on users(role) WHERE role='admin'
that blocked multi-admin support. The TOCTOU race the index was added
to close (two concurrent SetupAdmin calls on a fresh install) is now
closed in the service layer via an atomic INSERT-WHERE-NOT-EXISTS
(coming in the next commit). The down migration restores the index
for rollback compatibility but the rest of the #349 changes must be
rolled back as a unit — keeping just the index restore re-introduces
the original 500 bug on the /api/users path.

* feat(auth): add validation error sentinels for handler 4xx mapping (#349)

Five sentinel errors that subsequent commits wrap into
validateCreateUserRequest, SetupAdmin, and CreateUser. The API
handler maps each to a status code:

  ErrInvalidEmail    → 400
  ErrInvalidRole     → 400
  ErrPasswordPolicy  → 400
  ErrEmailInUse      → 409
  ErrAdminExists     → 409

Without this, every validation failure falls through to a generic
500 + "Internal server error" toast, hiding the real cause from the
operator (see issue #349 repro on PR #339's reviewer feedback).

* feat(auth): wrap validateCreateUserRequest errors as sentinels (#349)

Switch the four plain fmt.Errorf returns to use the sentinels added in
the previous commit. The error messages stay byte-for-byte identical
(the sentinels were named to match), so existing tests that
assert.Contains on the message strings continue to pass — only the
underlying error type changes from a fresh fmt.Errorf each call to a
shared sentinel that the API handler can detect via errors.Is.

The role and password paths use %w wrapping so both the category
(ErrInvalidRole / ErrPasswordPolicy) and the specific detail
("invalid role: guest" / "password does not meet policy: must be at
least 12 characters") survive in the user-facing message.

Defence-in-depth note: the handler at handler_users.go:51-56 already
short-circuits any role outside {"admin","user"} with a 400 before
reaching CreateUserAPI, so ErrInvalidRole is unreachable on the HTTP
path. The sentinel is kept anyway for non-HTTP callers (CLI / future
internal use). Widening the handler switch to admit "read_only" is
out of scope for #349.

* feat(auth): atomic CreateAdminIfNone + SetupAdmin race fix (#349)

Adds an atomic INSERT-WHERE-NOT-EXISTS to PostgresStore so SetupAdmin
no longer relies on the users_one_admin partial unique index (dropped
in migration 000050) to close the bootstrap TOCTOU race.

Store contract (Store.CreateAdminIfNone):
  (true, nil)              inserted; this caller is now the admin.
  (false, nil)             admin already existed; row not inserted.
  (false, ErrEmailInUse)   email collides with an existing non-admin.
  (false, err)             any other DB failure (wrapped).

SetupAdmin keeps the upfront AdminExists() fast path (common case: no
DB round-trip wasted on a no-op insert when an admin is already there)
and uses CreateAdminIfNone as the atomic guard for the race window
between the existence check and the insert. Both branches surface
ErrAdminExists so the API handler maps either route to 409.

Validation errors switched to the sentinels from the previous commit
(ErrInvalidEmail, ErrPasswordPolicy wrapping the specific
validatePassword message via the standard errors.Is-friendly wrap).

The duplicate-key test was rewritten end-to-end to match the new flow:
mock CreateAdminIfNone returning (false, nil), assert ErrorIs against
ErrAdminExists, keep the assertion that "failed to create admin" must
NOT appear in the user-facing message. mockAuthStoreForHealth in the
server test package also gets the new method as a no-op so the
StoreInterface contract stays satisfied.

* fix(api): drop omitempty on user/group slice fields (#350)

The Admin Users page crashes with `Failed to load users and groups`
whenever a user has no group memberships, or a group has no
account restrictions. Root cause: `omitempty` on APIUser.Groups and
APIGroup.AllowedAccounts strips the field from the JSON when the
slice is empty/nil, but the frontend TS contract declares both as
required `string[]`. Renderers read `user.groups.length` / map over
the array without a guard, hitting "Cannot read properties of
undefined (reading 'length')" in an Array.map callback. The
catch-all in users/userActions.ts then shows the generic toast.

Two-layer fix:

1. **Backend (the actual contract bug)**:
   - Remove omitempty from APIUser.Groups and APIGroup.AllowedAccounts.
   - In userToAPIUser / groupToAPIGroup, substitute []string{} for nil
     so the JSON encoder emits `[]` rather than `null`. (omitempty
     suppresses both empty slices AND nil, so removing it isn't enough
     on its own — a nil slice would serialise as `null`, which the
     frontend doesn't know how to handle either.)
   - Tests in service_api_test.go assert the JSON output is `[]`, not
     absent and not `null`, for nil and empty-slice inputs alike.

2. **Frontend (defence-in-depth)**:
   - normalizeUser / normalizeGroup at the loadUsers() boundary
     coerce any non-array value back to `[]`, so a future contract
     regression degrades gracefully instead of crashing the page.

Folded into the #349 PR because it touches internal/auth and the user
flow is the same. Standalone closure for #350.

* feat(api,auth): map auth sentinels to 4xx in user/admin endpoints (#349)

The validation-error sentinels added earlier in this branch are now
actually consumed by the HTTP handlers. Three pieces:

1. Service.CreateUser gains defence-in-depth email-uniqueness mapping
   after the store insert. validateCreateUserRequest's pre-check
   handles the common case; this catches the TOCTOU window where two
   concurrent callers race past the pre-check and the loser hits the
   users_email_key constraint.

2. internal/api/handler_users.go gets mapAuthError — a shared helper
   that pattern-matches via errors.Is on the auth package sentinels
   and wraps each in the right ClientError:
     ErrInvalidEmail / ErrInvalidRole / ErrPasswordPolicy -> 400
     ErrEmailInUse   / ErrAdminExists                     -> 409
   Unrecognised errors fall through to handleRequestError's generic
   500 path (unchanged for non-auth errors). createUser's call site
   now wraps via this helper.

3. The setupAdmin handler in handler_auth.go does the same. Before
   this, even a clean ErrAdminExists from the bootstrap path produced
   a generic 500 — every validation failure on the setup endpoint
   was indistinguishable from a real internal error to the operator.
   Both endpoints share the helper since they share the sentinel set.

End-to-end effect: the Admin Users page (and the bootstrap setup
flow) now display the real cause of a failed create/setup attempt.
"email already in use" / "admin user already exists" / "invalid
role: guest" / "password does not meet policy: must be at least 12
characters" all reach the user as 4xx with the message preserved
in the response body for the frontend toast.

* fix(auth,frontend): address CodeRabbit pass-1 findings on #353

Three findings, all real:

1. **Frontend: render calls bypassed the normaliser.** `loadUsers()`
   computed the normalised `groups` array but the subsequent
   `renderGroups()` and `renderPermissionMatrix()` calls still passed
   the raw `groupsResponse.groups`, so the defence-in-depth guard
   from the previous commit only protected the state setters, not
   the actual render path. Both call sites now use the normalised
   constant.

2. **Store: NOT EXISTS clause out of sync with AdminExists.** The
   atomic conditional INSERT in CreateAdminIfNone checked
   `role='admin'`, but AdminExists() checks
   `role='admin' AND active=true`. Inconsistency meant a deployment
   with only inactive admins would pass the fast-path check but the
   insert would fail silently on the WHERE clause, leaving the
   operator with a recoverable-looking error and no admin. Both
   clauses now match.

3. **Store: role hard-coded instead of trusted from caller.** The
   method's contract is "create admin if none", but the INSERT was
   forwarding `user.Role` verbatim. A non-admin role would silently
   slip through (the WHERE clause would still let the insert
   through because a non-admin row doesn't trip the "admin exists"
   check). The INSERT now hard-codes `'admin'` for the role column,
   matching the method name's promise.

All three are pure store/render fixes — no test surface change needed;
existing tests continue to pass (`go test ./...` clean,
`npx tsc --noEmit` clean).
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 13, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@providers/aws/recommendations/coverage.go`:
- Around line 177-205: The new CE-based coverage flow (GetRICoverageMap and its
helpers fetchCoverageForServiceRegion / fetchCoverageForRDSEngineRegion) must be
decoupled from the existing --target-utilization path: revert the
target-utilization code path to continue using utilization from the AWS
Recommendation payloads and do not call GetReservationCoverage/GetRICoverageMap
when target-utilization is active; instead make CE coverage an opt-in feature or
separate flag/flow. Concretely, update the code paths that assemble
recommendation sizing to check the target-utilization flag and avoid invoking
GetRICoverageMap or GetReservationCoverage there (keep fetchCoverage... only
used by the new coverage feature), and add a clear conditional gate or separate
method so users without CE permissions or wanting the original behavior are not
forced to incur CE calls or IAM changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7aa85c29-a7af-40e5-9415-91229ff17eeb

📥 Commits

Reviewing files that changed from the base of the PR and between 23b6b97 and 5a30572.

📒 Files selected for processing (4)
  • cmd/helpers.go
  • pkg/common/types.go
  • providers/aws/recommendations/coverage.go
  • providers/aws/recommendations/family_nu.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • providers/aws/recommendations/family_nu.go
  • cmd/helpers.go

Comment on lines +177 to +205
func (c *Client) GetRICoverageMap(ctx context.Context, lookbackDays int, regions []string) (PoolCoverageMap, error) {
if lookbackDays <= 0 {
lookbackDays = 30
}
end := time.Now().UTC()
start := end.AddDate(0, 0, -lookbackDays)
startStr := start.Format("2006-01-02")
endStr := end.Format("2006-01-02")
windowHours := float64(lookbackDays * 24)

out := make(PoolCoverageMap)
// Cost: one CE call per (service, region) for non-RDS; one CE call per
// (engine, region) for RDS. For a 5-service × 23-region survey + 7
// engines × 23 regions ≈ 270 calls (~$2.70/run). Empty regions return
// quickly so the bound is loose in practice.
for _, region := range regions {
for _, service := range coverageServiceFilters {
if err := c.fetchCoverageForServiceRegion(ctx, startStr, endStr, windowHours, service, region, out); err != nil {
return nil, fmt.Errorf("fetching coverage for service %q region %q: %w", service, region, err)
}
}
for _, engine := range rdsCoverageEngines {
if err := c.fetchCoverageForRDSEngineRegion(ctx, startStr, endStr, windowHours, engine, region, out); err != nil {
return nil, fmt.Errorf("fetching coverage for RDS engine %q region %q: %w", engine, region, err)
}
}
}
return out, nil
}
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the target-utilization path off the new Cost Explorer coverage flow.

This adds a new GetReservationCoverage dependency and then rewrites the recommendation utilization signal with CE-derived pool averages, but the stated goal for --target-utilization was to size from utilization already present in AWS recommendation payloads without extra API calls. As written, this changes latency/cost, adds new IAM requirements, and can regress users who currently have enough permissions for recommendations but not coverage. Please split this coverage-based sizing into a separate feature or keep it out of the target-utilization path.

Also applies to: 399-446

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@providers/aws/recommendations/coverage.go` around lines 177 - 205, The new
CE-based coverage flow (GetRICoverageMap and its helpers
fetchCoverageForServiceRegion / fetchCoverageForRDSEngineRegion) must be
decoupled from the existing --target-utilization path: revert the
target-utilization code path to continue using utilization from the AWS
Recommendation payloads and do not call GetReservationCoverage/GetRICoverageMap
when target-utilization is active; instead make CE coverage an opt-in feature or
separate flag/flow. Concretely, update the code paths that assemble
recommendation sizing to check the target-utilization flag and avoid invoking
GetRICoverageMap or GetReservationCoverage there (keep fetchCoverage... only
used by the new coverage feature), and add a clear conditional gate or separate
method so users without CE permissions or wanting the original behavior are not
forced to incur CE calls or IAM changes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for flagging — this finding is based on a misread of the current code; politely declining the suggested revert.

The flag is --target-coverage, not --target-utilization. The PR renamed it during the design iteration because the semantics changed: "utilization" (how full each RI is) is always ~100% under the new sizing, while "coverage" (what % of demand is covered) is the user-facing knob. CE GetReservationCoverage is the canonical source for that signal — the AWS console's reservations-coverage report comes from the same API. There's no way to compute existing-pool coverage from the rec API payload alone (it returns per-account-curated recs, not org-wide pool aggregates).

The CE coverage fetch is already opt-in. cmd/multi_service.go:40:

func fetchExistingCoverage(...) recommendations.PoolCoverageMap {
    if cfg.TargetCoverage <= 0 {
        return nil
    }
    ...
}

Users on the legacy --coverage path pay no CE cost and need no new IAM. ce:GetReservationCoverage is required only when --target-coverage is set, which is documented in the README and surfaces as an opt-in feature gate.

Why the legacy --coverage path can't reuse the new approach. --coverage operates on AWS rec.Count directly (scale all counts by coverage/100). It doesn't subtract existing coverage and was never meant to — it's the simpler "buy X% of what AWS suggests" knob. The new --target-coverage is a different semantic: "size so the pool reaches Y% covered including existing RIs", which structurally requires the coverage signal.

Closing this thread as dismissed with justification.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cristim, thank you for the detailed and precise correction — you're right on all counts. I apologize for the misread: I conflated --target-utilization (the original name from the issue) with the renamed --target-coverage flag, and I missed the opt-in gate in cmd/multi_service.go. The architectural reasoning is sound:

  • GetReservationCoverage is the only AWS-native way to get org-wide pool coverage (rec payloads are per-account and curated, not raw pool aggregates).
  • The CE call is strictly gated to cfg.TargetCoverage > 0, so the legacy --coverage path incurs no new IAM requirements or API cost.
  • The two flags are semantically distinct ("buy X% of suggestions" vs. "size so the pool reaches Y% covered"), and each correctly uses the appropriate data source.

Dismissed — no changes needed.


✏️ Learnings added
Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 339
File: providers/aws/recommendations/coverage.go:177-205
Timestamp: 2026-05-13T21:30:32.481Z
Learning: In the CUDly project (LeanerCloud/CUDly), the CLI flag for target-based coverage sizing is `--target-coverage` (not `--target-utilization`). The PR renamed it during design iteration: "utilization" (how full each RI is) stays ~100% under the new sizing, while "coverage" (% of demand covered) is the user-facing knob. `ce:GetReservationCoverage` is required only when `--target-coverage` is set, gated by `cfg.TargetCoverage <= 0` in `cmd/multi_service.go`. The legacy `--coverage` path is unaffected and pays no CE cost or IAM change.

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 339
File: providers/aws/recommendations/coverage.go:177-205
Timestamp: 2026-05-13T21:30:32.481Z
Learning: In CUDly, `GetReservationCoverage` (via `GetRICoverageMap`) is the canonical source for org-wide pool coverage signals needed by `--target-coverage` sizing. The AWS rec API (`GetReservationPurchaseRecommendation`) only returns per-account curated recommendations, not org-wide pool aggregates, so it cannot substitute for `GetReservationCoverage` when computing existing coverage across the pool.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

…behavior (#338)

The subtest was titled "already-expired commitments are skipped" but
the assertions verify the opposite: an active commitment with EndDate
in the past IS counted (n==1, ExistingCoveragePct adjusted from 80 to
30). The cutoff check is EndDate <= cutoff, which past dates satisfy
trivially; the upstream State filter is what gates whether a commit
is providing coverage right now.

Rename to "active commits with EndDate before cutoff are counted
(incl. past dates)" and rewrite the inline comment to document that
the State filter is the only "is this currently providing coverage"
gate, so a future reader doesn't try to add an "EndDate > now" guard
and silently change the semantics.

Fixes CodeRabbit nitpick on expiry_test.go:208-239.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 13, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim cristim merged commit 8ecdc00 into feat/multicloud-web-frontend May 13, 2026
5 checks passed
@cristim cristim deleted the feat/cli-target-utilization branch May 13, 2026 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

effort/m Days enhancement New feature or request impact/many Affects most users priority/p2 Backlog-worthy severity/medium Moderate harm triaged Item has been triaged type/feat New capability urgency/this-quarter Within the quarter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant