Skip to content

ci: integrate pr-agent-context with coverage XML and refresh flow#2

Merged
shaypal5 merged 2 commits into
mainfrom
feat/pr-agent-context
Apr 7, 2026
Merged

ci: integrate pr-agent-context with coverage XML and refresh flow#2
shaypal5 merged 2 commits into
mainfrom
feat/pr-agent-context

Conversation

@shaypal5
Copy link
Copy Markdown
Member

@shaypal5 shaypal5 commented Apr 7, 2026

Summary

  • CI workflow: test jobs now upload raw .coverage.py{version} artifacts; a new coverage job combines them and produces coverage.xml; a new pr-agent-context job posts an AI-agent-readable context comment on every PR
  • Refresh workflow: new pr-agent-context-refresh.yml re-posts the comment (in append mode, hiding the previous one) whenever a check completes, a review is submitted, or a review comment changes
  • Template: .github/pr-agent-context-template.md with the standard section layout (preamble, opening instructions, copilot comments, review comments, failing checks, patch coverage)
  • Coverage config: [tool.coverage.*] sections added to pyproject.toml so coverage combine and coverage xml work correctly across matrix runs

How it works

PR opened / updated
  → CI runs (test × 2 Python versions, lint, coverage)
  → coverage job combines .coverage.py3.11 + .coverage.py3.12 → coverage.xml
  → pr-agent-context job posts context comment (patch coverage + review threads + failing checks)

review submitted / check completes / review comment changed
  → pr-agent-context-refresh triggers
  → fetches coverage.xml from the CI producer run (cross-run lookup)
  → posts a new context comment in append mode
  → hides the previous context comment

Pattern reference

Follows the implementation in DataHackIL/tfht_enforce_idx:

  • patch_coverage_source_mode: coverage_xml_artifact (combined XML, not raw artifacts)
  • coverage_source_workflows: CI for cross-run lookup in refresh
  • publish_mode: append with hide_previous_managed_comments_on_append: true (default)
  • wait_for_reviews_to_settle: true in refresh so resolved threads are not re-reported

Test plan

  • Open a PR → CI runs → pr-agent-context comment appears with coverage section
  • Submit a review → refresh workflow triggers → old comment hidden, new comment posted
  • Complete all checks with no issues → comment is deleted (delete_comment_when_empty default)

🤖 Generated with Claude Code

Follows the pattern from DataHackIL/tfht_enforce_idx:

CI workflow (ci.yml):
- Test job now runs with --cov avdp --cov-branch --cov-report= and uploads
  raw .coverage.py{version} data as pr-agent-context-coverage-py{version}
  artifacts (one per matrix entry)
- New coverage job: downloads all raw artifacts, combines with coverage combine,
  generates coverage.xml, uploads as coverage-xml artifact
- New pr-agent-context job: runs after lint/test/coverage on every PR using
  coverage_xml_artifact mode pointing at coverage-xml; uses template file

Refresh workflow (pr-agent-context-refresh.yml):
- Triggers on pull_request_review, pull_request_review_comment, check_run, status
- Runs in append mode with hide_previous_managed_comments_on_append (default true)
  so old pr-agent-context comments are hidden when a new one is posted
- wait_for_reviews_to_settle: true so resolved threads are not re-reported
- enable_cross_run_coverage_lookup: true with coverage_source_workflows: CI
  so refresh runs can reuse coverage artifacts from the CI producer run

Template (.github/pr-agent-context-template.md):
- Standard section layout: preamble, opening instructions, copilot comments,
  review comments, failing checks, patch coverage

pyproject.toml:
- Add [tool.coverage.run/paths/report/xml] config so coverage combine and
  coverage xml work correctly across matrix runs and local development

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

@shaypal5 shaypal5 requested a review from Copilot April 7, 2026 11:21
@shaypal5 shaypal5 self-assigned this Apr 7, 2026
@shaypal5 shaypal5 added CI enhancement New feature or request labels Apr 7, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Integrates pr-agent-context into CI by generating a combined coverage.xml across Python matrix runs and adds a refresh workflow that re-posts/upgrades the PR context comment as reviews and checks change.

Changes:

  • Upload per-matrix raw coverage data, then combine into a single coverage.xml artifact in a dedicated coverage job.
  • Add a pr-agent-context CI job plus a new refresh workflow to append updated context comments on review/check activity.
  • Add coverage configuration to pyproject.toml and introduce a standardized PR-agent context prompt template.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
pyproject.toml Adds [tool.coverage.*] configuration to support coverage combine and coverage xml.
.github/workflows/ci.yml Uploads raw coverage artifacts from matrix tests, combines them into coverage.xml, and posts the PR agent context comment.
.github/workflows/pr-agent-context-refresh.yml New workflow to refresh/re-post the PR agent context comment when reviews/comments/checks change.
.github/pr-agent-context-template.md Adds the prompt template layout used by pr-agent-context.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pyproject.toml Outdated
Comment thread .github/workflows/pr-agent-context-refresh.yml
Comment thread .github/workflows/pr-agent-context-refresh.yml
Comment thread .github/workflows/ci.yml
Comment thread .github/workflows/pr-agent-context-refresh.yml
@github-actions

This comment has been minimized.

Replace hardcoded /home/runner/work/SynthBanshee/SynthBanshee/avdp with
/home/runner/work/*/*/avdp so coverage path normalization works regardless
of repo name, fork, or self-hosted runner workspace layout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

@shaypal5 shaypal5 merged commit 0e69ee8 into main Apr 7, 2026
6 checks passed
@shaypal5 shaypal5 deleted the feat/pr-agent-context branch April 7, 2026 11:33
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 7, 2026

pr-agent-context report:

🚨 `pr-agent-context` failed while preparing PR context.

PR: #2
Error: CalledProcessError: Command '['git', '-C', '/home/runner/work/SynthBanshee/SynthBanshee/caller-repo', 'diff', '--unified=0', '0efa0f83fad69b448f3037f09c3642a5ea209007...edd0b1407fa77605d0e1b8dcf4ffcf6309b047e1']' returned non-zero exit status 128.
Run: https://github.com/DataHackIL/SynthBanshee/actions/runs/24079184993

The workflow continued gracefully so this failure does not block CI.
Check the job logs for the full traceback.

Run metadata:

Tool ref: v4
Tool version: 4.0.13
Trigger: status updated
Workflow run: 24079184993 attempt 1
Comment timestamp: 2026-04-07T11:34:06.450860+00:00
PR head commit: edd0b1407fa77605d0e1b8dcf4ffcf6309b047e1

shaypal5 added a commit that referenced this pull request Apr 15, 2026
Implements §4.2a of docs/audio_generation_v3_design.md.

AGG speakers (AGG_M_30-45_001, BEN_M_40-55_003):
- Align volume_delta_db to target table: I2→+2, I3→+5, I4→+9, I5→+13
- I5 was +10/+11; now +13 dB above I1 baseline (≥8 dB range required)

VIC speakers (VIC_F_25-40_002, SW_F_30-45_001):
- Invert pitch trajectory: was 0→+7 st (rises into child-voice range);
  now -4→-1 st (lower adult female baseline, hard cap at I4–I5)
- Fix debug_run_1 defect #2: HilaNeural baseline ~215 Hz rising to
  285–287 Hz; new offsets target ≤200 Hz at I1, capped at I4–I5
- Redesign volume: was 0/−1/−3/−2/+4; now 0/+1/+2/+4/+5 (positive
  escalation so VIC is audible under AGG without being child-loud)
- Additional distress at I4–I5 to come from rate instability and
  timing compression (M6/M7), not further pitch elevation

All values are initial calibration defaults from §4.2a — validate by
listening test and F0/RMS measurement before scale generation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request Apr 16, 2026
…on (#24)

* feat(m2a): SSML prosody defaults — VIC F0 cap + AGG loudness escalation

Implements §4.2a of docs/audio_generation_v3_design.md.

AGG speakers (AGG_M_30-45_001, BEN_M_40-55_003):
- Align volume_delta_db to target table: I2→+2, I3→+5, I4→+9, I5→+13
- I5 was +10/+11; now +13 dB above I1 baseline (≥8 dB range required)

VIC speakers (VIC_F_25-40_002, SW_F_30-45_001):
- Invert pitch trajectory: was 0→+7 st (rises into child-voice range);
  now -4→-1 st (lower adult female baseline, hard cap at I4–I5)
- Fix debug_run_1 defect #2: HilaNeural baseline ~215 Hz rising to
  285–287 Hz; new offsets target ≤200 Hz at I1, capped at I4–I5
- Redesign volume: was 0/−1/−3/−2/+4; now 0/+1/+2/+4/+5 (positive
  escalation so VIC is audible under AGG without being child-loud)
- Additional distress at I4–I5 to come from rate instability and
  timing compression (M6/M7), not further pitch elevation

All values are initial calibration defaults from §4.2a — validate by
listening test and F0/RMS measurement before scale generation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(m2a): measure-prosody CLI command + prosody_metrics module

New module synthbanshee/labels/prosody_metrics.py:
- measure_clip(wav_path): reads WAV + .jsonl, returns TurnMetrics per event
- aggregate_metrics(): groups by (role, intensity), computes median F0 / mean RMS
- run_threshold_checks(): evaluates §4.2a thresholds (VIC I1 ≤200 Hz,
  VIC I4/I5 <250 Hz, AGG I5−I1 ≥8 dB RMS); returns (label, passed, detail)

New CLI command: synthbanshee measure-prosody <clip_dir>
- Walks clip_dir recursively for .wav files
- Prints Rich table of F0 median/std and RMS by role × intensity
- Prints §4.2a pass/fail checks with measured values
- --output/-o: export per-turn CSV for offline analysis
- --roles: filter to specific speaker roles (default AGG,VIC)
- Exit code 1 on threshold failure

29 unit tests covering _measure_segment, measure_clip, aggregate_metrics,
and run_threshold_checks including edge cases (no data, exact thresholds,
librosa unavailable, stereo WAV, missing JSONL).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(cli): add --max-clips/-n to generate-batch for wet tests

Truncates the selected scene config list after typology-proportional
selection, so small smoke/wet-test runs don't require editing run configs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(scenes): add Tier A she_proves scene configs (SP_SV/IT/NEG/NEU_A_0001-0002)

8 Tier A scene configs covering all four typologies (SV, IT, NEG, NEU),
mirroring the Tier B configs without the acoustic_scene augmentation block.
Seeds in the 1100–1402 range to avoid collisions with Tier B (100–402).

Required by: configs/run_configs/tier_a_500_she_proves.yaml,
             the M2a wet-test batch run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(m2a): address Copilot review comments + patch coverage gaps

COPILOT-1: Add test_vic_i4/i5_exactly_at_threshold_fails — explicit boundary
tests confirming that exactly 250 Hz fails the strict < threshold for I4/I5.

COPILOT-2: Update PR description to accurately reflect scope (config changes
+ measurement tooling + Tier A scene configs + CLI additions).

COPILOT-3: Fix VIC I4/I5 threshold comparison: was <=, now strict < per the
spec label "< 250 Hz". Add strict= parameter to _f0_check inner function.

COPILOT-4: Rename RoleIntensityStats.f0_std_hz → f0_std_hz_mean and update
docstring to "mean of per-turn F0 standard deviations" (not a true pooled
std, which would require frame-level counts). Update CLI and test references.

Coverage gaps:
- prosody_metrics: add TestMeasureSegmentCoverage with int16/float32 RMS
  equivalence, successful librosa path (sys.modules mock), and all-unvoiced
  path; add TestMeasureClipStereo for explicit stereo→mono downmix coverage;
  add test_stats_with_none_f0_reports_no_data for the s.f0_median_hz is None
  branch in run_threshold_checks
- cli: add TestMeasureProsodyCLI covering empty dir, missing JSONL, table
  output, threshold-failure exit code, CSV export, and --roles filter;
  add test_max_clips_truncates_selected_configs for the --max-clips branch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(config): route generated clips to corpus repo via SYNTHBANSHEE_DATA_DIR

Remove hardcoded output_dir: data/he from all scene/run/example YAML
configs so the SYNTHBANSHEE_DATA_DIR env var default in SceneConfig and
RunConfig takes effect. Add envvar="SYNTHBANSHEE_DATA_DIR" to --output-dir
on both generate and generate-batch CLI commands.

Add .env.example documenting all required env vars including:
  SYNTHBANSHEE_DATA_DIR=/path/to/avdp-synth-corpus/data/he

With SYNTHBANSHEE_DATA_DIR set, clips land in the corpus repo rather than
inside the SynthBanshee source tree.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(speakers): flatten AGG/BEN pitch trajectory — cap at +1 st (M2a)

Azure normalizes WAV output so volume_delta_db has no effect; the only
perceptible change across intensity levels was pitch, causing the
aggressor to sound like helium at I4/I5 (+3/+4 st). Cap pitch_delta_st
at 0 for I2/I3 and +1 for I4/I5 on both AGG_M_30-45_001 and
BEN_M_40-55_003. Rate multiplier (1.0→1.25/1.30) now carries all
urgency cues until M3 adds per-turn RMS gain.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(measure-prosody): address PR #24 Copilot review (COPILOT-1 through COPILOT-5)

COPILOT-1: strengthen test_roles_filter_excludes_other_roles — add AGG
I1/I5 clips with sufficient RMS delta so AGG checks pass, assert
exit_code == 0 and "VIC" not in output.

COPILOT-2 + COPILOT-5: rename CSV header f0_std_hz_mean → f0_std_hz to
match the per-turn TurnMetrics field being exported (not an aggregate).

COPILOT-3: wrap EventLabel.model_validate_json() per line in
measure_clip() with try/except Exception; emit warnings.warn for
malformed lines so the tool remains usable on partial datasets. Also
restructure loop to skip empty lines explicitly (if not line: continue).

COPILOT-4: add include_roles: set[str] | None = None param to
run_threshold_checks(); gate VIC checks and AGG escalation check on role
membership. Pass include_roles=include_roles from the CLI so --roles AGG
does not emit failing VIC no-data checks.

Coverage: add test_malformed_jsonl_line_skipped, test_empty_jsonl_lines_
skipped, test_threshold_failure_sets_exit_code_1, and three
TestRunThresholdChecks tests for include_roles parameter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci(pr-agent-context): pin to v4.0.18, enable include_outdated_review_threads

Pin workflow ref and tool_ref to v4.0.18 (from v4 floating tag) and
set include_outdated_review_threads: true, introduced in that release.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(generate-batch): shuffle selected configs before --max-clips truncation

Without shuffling, selected[:max_clips] always picks the first N configs
in typology-insertion order (e.g. all IT scenes for a Tier A run), making
wet-test runs unrepresentative. Shuffle with random.Random(run_cfg.random_seed)
before truncating so the sample is typology-balanced and deterministic.

Update test assertion to match the new log line format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: clarify has_violence is a derived field, not a prohibited label

The previous blanket "no binary Violence/Non-Violence labels" wording was
being misread as a reason to drop has_violence from metadata and manifests.

Clarify the actual policy across CLAUDE.md, AGENTS.md, README.md,
docs/design_approaches.md, and docs/audio_violence_datasdet_project_docs.md:

  - has_violence is a DERIVED convenience field computed from the
    hierarchical taxonomy (violence_typology, tier1_category, max_intensity).
    It belongs in weak_label JSON and manifest.csv; AI teams need it for
    baseline models and stratified sampling.
  - What is prohibited is REPLACING the taxonomy with a binary flag as the
    sole or primary label.
  - The taxonomy columns remain the ground truth.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 11, 2026
…lence rule, measured-peak note) (#103)

* docs(spec): #101 — corpus-handoff schema drift

Three doc drifts surfaced from a corpus pre-handoff review of
DataHackIL/avdp-synth-corpus PR #2 (the 8-clip M2a wet test):

1. The §2 directory diagram and §2.4 / §5.1 / §5.2 / §10 examples used
   uppercase paths, contradicting §2.5 ("all filenames are lowercase")
   and the on-disk reality produced by `cli.py:365–367`
   (`scene_id.lower()`, `speaker_id.lower()`).  Consumer teams who try
   to reconstruct paths from `speakers[].speaker_id` (canonical
   uppercase) hit a directory that does not exist.

2. The §5.1 example JSON lacked the M11 `generation_metadata` block
   and `speakers[].voice_family`.  Both are part of the current
   ClipMetadata schema; their absence in the spec left consumer teams
   guessing at the contract.

3. `weak_label.has_violence` was never pinned in the spec.  The
   downstream corpus README invented a rule
   (`typology in {SV,IT,NEG} and max_intensity >= 3`) that disagrees
   with the actual code (`any(e.tier1_category != "NONE" for e in events)`)
   on every NEG row.  Pinned the code rule with a source-file reference
   so external docs can mirror it.

Changes:
  - §2.1: lowercase-path footnote pointing to §2.3 (no inline narrative)
  - §2.3: paragraph distinguishing uppercase `speaker_id` (canonical
          identifier) from the lowercase directory form, with the
          accurate `speakers[0]` rule and the cli.py source pointer
  - §2.4: lowercase clip-id examples + brief parenthetical
  - §5.1: example JSON now has lowercase paths, `-2.0` target,
          `generation_metadata`, `voice_family`; three short field
          notes; new sub-section pinning the `has_violence` rule
  - §5.2 / §10: lowercase `clip_id` to match §2.5

What this does NOT claim (corrections from the closed PR #100):
  - `generation_metadata` presence is by commit, not by
    `generator_version: 0.1.0` (the version was bumped before the
    field was wired — corpus on-disk clips disprove the prior claim)
  - The directory uses `scene.speakers[0]`, not necessarily AGG; for
    NEU/NEG scenes the order is whatever the YAML lists first

Test plan:
  - `pytest tests/unit/` — 1688 passed
  - `ruff check synthbanshee/ tests/ docs/` — clean
  - Visual review: all uppercase `SP_IT_B_0023` / `_EVT_007` clip-id
    forms reconciled across §2.4, §5.1, §5.2, §10

Refs #101.  Pairs with #102 (cli normalized_dbfs fix); merge order
does not matter — they are independent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): v2 — corpus-handoff schema drift

Revises 4374460 after a self-review found that v1 still had the same
documentation anti-patterns the review of closed PR #100 flagged: rule
stated in two places, prose where a table belongs, a misleading
`voice_family` example, internal-history milestone language, a
technically wrong claim about clip_id in YAML, undocumented event_id
casing, and missing `has_violence` boundary cases.  Plus a critical
correction: v1 (and #102 v1) described `normalized_dbfs` as the
configured *target*, but `labels/schema.py:175` already documents the
field as the **measured** peak — kept deliberately distinct from
`generation_metadata.loudness_target_peak_dbfs` which is the target.

Changes vs v1:
  - §2.1 footnote: shortened to a pointer ("see §2.5 for casing rules")
  - §2.3 paragraph: collapsed to one line pointing at §2.5's table
  - §2.4 parenthetical: rewritten — `clip_id` is *derived* from
    `scene_id` (lowercased + `_NN` segment); the on-disk form does NOT
    appear verbatim in any YAML
  - §2.5: new "Identifier casing (per surface)" table — single source
    of truth for which surfaces use UPPERCASE values vs lowercase paths
  - §5.1 example: `normalized_dbfs: -2.0` → `-2.013` (a realistic
    measured value, since the field is measured not target);
    `voice_family` example given DISTINCT values (`"Avri"` overrides
    `tts_voice_id`) for AGG, omitted for VIC to demonstrate the
    optional fallback
  - §5.1 field notes: four dense bullets replacing the previous
    bullets-plus-`####`-sub-section; M11 language removed (the spec
    describes the contract, not its git history); `has_violence`
    derivation rule moved into a bullet that ALSO documents boundary
    cases (empty events, typology/flag disagreement)
  - §5.2: one sentence explaining `event_id` format (literal `EVT`
    token + lowercase `clip_id` prefix — not a casing inconsistency)

Test plan:
  - `pytest tests/unit/` — 1688 passed
  - `ruff check synthbanshee/ tests/ docs/` — clean

Refs #101 (2/2).  Pairs with #102 (`fix(cli):` measured-peak wiring)
— either order is safe; both converge on the schema docstring's
`normalized_dbfs = measured` contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): #101 — fix Tier-B scene_config path in §5.1 example

The example clip is `sp_it_b_0023_00` — the `b` denotes Tier B — but
the `scene_config` field showed `configs/scenes/she_proves/`.  Tier B
She-Proves scene configs actually live under
`configs/scenes/she_proves_tier_b/`.  Caught by Copilot review on
ed65750; consumers reconstructing paths from this example would not
find the file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants