[TTS] Added TTS comparison report#15621
Conversation
a8f0054 to
a5a4485
Compare
a5a4485 to
13962a9
Compare
blisc
left a comment
There was a problem hiding this comment.
Let's make some changes to the documentation. We can discuss the comment on visualization.
| from matplotlib.patches import PathPatch | ||
| from scripts.tts_comparison_report.reporting.metrics import DistributionMetricSpec, DistributionMetricsRegistry | ||
| from scripts.tts_comparison_report.reporting.models import BucketData, StatTestResult, Winner | ||
|
|
There was a problem hiding this comment.
Why don't we merge this file with https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/tts/modules/magpietts_inference/visualization.py? I feel like this file should depend on visualization.py.
There was a problem hiding this comment.
I think merging them would not be a good idea.
The two implementations may look similar at first glance because both use plt.boxplot, but they serve different responsibilities, take different inputs and produce outputs in different formats.
The reporting version is tightly coupled to report-specific semantics such as BucketData, StatTestResult, winner highlighting, in-memory rendering with BytesIO and styling aligned with the report layout. In contrast, the version in magpietts_inference/visualization.py is a more generic visualization helper that writes plots to disk.
There is also an important difference in how the plots are structured. The reporting version builds box plots in the Baseline vs Candidate format for a given benchmark, while visualization.py creates plots for a single model across multiple benchmarks. So even though both functions generate box plots, they are designed for different comparison scenarios.
For these reasons, I think two separate implementations are justified.
If we wanted to move prepare_boxplots into visualization.py, it would not be enough to only refactor the existing function there. We would also need to introduce an additional abstraction layer to separate generic box-plot generation from report-specific semantics. In practice, this would mean having:
- a more generic plotting utility, and
- an adapter layer to align it with the reporting API.
In my view, it would introduce additional complexity, make the abstraction less clear and reduce maintainability.
Another important concern is coupling. The current reporting logic is encapsulated in a separate module, which is an advantage. If reporting becomes dependent on magpietts_inference, then future changes in the inference module could more easily break the reporting pipeline.
219e653 to
ed13548
Compare
WalkthroughIntroduces a comprehensive TTS comparison reporting module that generates evaluation reports comparing baseline and candidate text-to-speech buckets, including metrics aggregation, statistical testing, storage abstraction (local/SFTP), S3 artifact uploads, and Jinja2-based HTML report rendering with optional audio comparisons. Changes
Sequence DiagramsequenceDiagram
participant CLI as CLI/generate_report
participant Orchestrator
participant Storage as Storage<br/>(Local/SFTP)
participant S3Client
participant Renderer
participant Metrics as Metrics &<br/>Stats
CLI->>Orchestrator: run(baseline, candidate, benchmarks, ...)
Orchestrator->>Storage: load bucket data
Storage-->>Orchestrator: BucketData
Orchestrator->>Metrics: prepare_eval_artifacts()
Metrics-->>Orchestrator: EvalArtifacts
Orchestrator->>Metrics: prepare_boxplots()
Metrics-->>Orchestrator: PNG BytesIO
Orchestrator->>S3Client: upload_bytes(boxplot)
S3Client-->>Orchestrator: boxplot_url
alt generate_audio_report
Orchestrator->>Metrics: prepare_audio_pairs()
Metrics-->>Orchestrator: AudioPair dict
Orchestrator->>S3Client: upload_fileobj(audio files)
S3Client-->>Orchestrator: audio_urls
Orchestrator->>Renderer: render(audio_report)
Renderer-->>Orchestrator: html
Orchestrator->>S3Client: upload_bytes(audio report)
S3Client-->>Orchestrator: audio_report_url
end
Orchestrator->>Renderer: render(eval_report)
Renderer-->>Orchestrator: html
Orchestrator->>S3Client: upload_bytes(eval_report)
S3Client-->>Orchestrator: eval_report_url
Orchestrator-->>CLI: (eval_url, audio_url)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 10
🧹 Nitpick comments (5)
scripts/tts_comparison_report/requirements.txt (1)
1-7: Explicitly declarebotocorein requirements.txt.
scripts/tts_comparison_report/reporting/s3_client.pyimportsfrom botocore.config import Config(line 18) directly, butrequirements.txtonly declaresboto3. Whilebotocoreis a transitive dependency ofboto3, explicitly declaring packages you directly import improves clarity and reproducibility, especially in constrained or offline environments.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/tts_comparison_report/requirements.txt` around lines 1 - 7, The requirements file omits a direct declaration of botocore even though s3_client.py imports from botocore.config (Config); add "botocore" to scripts/tts_comparison_report/requirements.txt so the dependency is explicit and reproducible, ensuring any environment install includes the botocore package required by the import in reporting/s3_client.py.scripts/tts_comparison_report/reporting/constants.py (1)
39-40: Consider reducing default presigned URL lifetime and making it configurable.At Line 40, a 1-year default (
31536000) is operationally convenient but increases link leakage risk. Prefer a shorter default (e.g., days) with env/CLI override.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/tts_comparison_report/reporting/constants.py` around lines 39 - 40, Reduce the S3 presigned URL default and make it configurable: change the S3_LINK_EXPIRES_IN constant to a shorter sensible default (e.g., 86400 for 1 day) and load its value from configuration/environment so callers can override it; update the constant definition (S3_LINK_EXPIRES_IN) to read an int from an env var (with fallback to the new default) or from the existing CLI/config loader used by the project, and ensure any tests or documentation referencing S3_LINK_EXPIRES_IN are updated accordingly.scripts/tts_comparison_report/reporting/components/audio_report.py (1)
90-94: Addstacklevel=2to the warning call.Without
stacklevel=2, the warning will appear to originate from within this function rather than from the caller's context, making it harder to trace.Suggested fix
if len(sampled_pairs) < samples_per_benchmark: warnings.warn( f"\nBenchmark '{benchmark_name}' contains only {len(sampled_pairs)} available paired samples, " - f"but {samples_per_benchmark} were requested." + f"but {samples_per_benchmark} were requested.", + stacklevel=2, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/tts_comparison_report/reporting/components/audio_report.py` around lines 90 - 94, The warnings.warn call that notifies when len(sampled_pairs) < samples_per_benchmark should include stacklevel=2 so the warning points to the caller's context; update the warnings.warn invocation (the one using benchmark_name, sampled_pairs, and samples_per_benchmark) to pass stacklevel=2 as an argument.scripts/tts_comparison_report/reporting/components/stat_tests.py (2)
45-49: Addstacklevel=2to the warning call.Without
stacklevel=2, the warning will appear to originate from within this function rather than from the caller's context.Suggested fix
if len(baseline) != len(candidate): warnings.warn( "\nBaseline and candidate contain different numbers of samples. " - "This may indicate missing filewise metrics or dataset mismatch." + "This may indicate missing filewise metrics or dataset mismatch.", + stacklevel=2, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/tts_comparison_report/reporting/components/stat_tests.py` around lines 45 - 49, The warnings.warn call that warns about different numbers of samples should include stacklevel=2 so the warning reports the caller's location; update the warnings.warn invocation in reporting/components/stat_tests.py (the warnings.warn(...) call that checks len(baseline) != len(candidate)) to pass stacklevel=2 as an argument.
107-127: Potential type mismatch:metric.lower_is_betterisOptional[bool]but_run_single_stat_testexpectsbool.While current
DistributionMetricsRegistryentries all have explicit boolean values, theDistributionMetricSpec.lower_is_betterfield is typed asOptional[bool]. If a metric withlower_is_better=Noneis added to the registry, this will cause a runtime type error.Consider adding a guard or skipping metrics where
lower_is_better is None:Suggested fix
for metric in DistributionMetricsRegistry: + if metric.lower_is_better is None: + continue + winner, alternative, p_value = _run_single_stat_test( baseline=bucket_baseline.get_metric_samples(🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/tts_comparison_report/reporting/components/stat_tests.py` around lines 107 - 127, DistributionMetricSpec.lower_is_better is Optional[bool] but _run_single_stat_test expects a bool; update the loop over DistributionMetricsRegistry to guard on metric.lower_is_better before calling _run_single_stat_test (e.g., if metric.lower_is_better is None: skip this metric or decide a safe default) so you never pass None into _run_single_stat_test; locate the loop using metric.key and metric.report_name and adjust it to continue when metric.lower_is_better is None (or provide an explicit bool) and ensure StatTestResult is only created for metrics actually tested.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/tts_comparison_report/README.md`:
- Line 190: Fix the grammar in the user-facing note string "Both generated
reports includes a clickable Jira link derived from `--task_id`." by changing
"includes" to "include" so the sentence reads "Both generated reports include a
clickable Jira link derived from `--task_id`."; update the README.md text where
that exact sentence appears.
In `@scripts/tts_comparison_report/reporting/__init__.py`:
- Around line 21-33: The __all__ export list in this module is not
alphabetically sorted which triggers Ruff RUF022; reorder the items in the
__all__ list alphabetically (by string) — e.g., ensure entries like
"BaseStorage", "BucketStructure", "DUMMY_TASK_ID", "LocalStorage",
"Orchestrator", "Renderer", "S3Client", "S3Config", "SFTPStorage",
"SUPPORTED_BENCHMARK_NAMES", "TEMPLATES_DIR" appear in sorted order — by
updating the __all__ assignment so the list is sorted to satisfy the linter.
In `@scripts/tts_comparison_report/reporting/components/__init__.py`:
- Around line 18-22: The __all__ export list is not alphabetically sorted
causing Ruff RUF022; reorder the entries in the __all__ list so they are
lexicographically sorted (e.g., "BoxPlotsConfig", "prepare_audio_pairs",
"prepare_eval_artifacts") to satisfy the linter and keep the module exports
consistent.
In `@scripts/tts_comparison_report/reporting/components/boxplots.py`:
- Around line 147-156: The issue is that axs is sized for num_rows (only metrics
with add_to_box_plot=True) but the loop uses the registry index (metric_idx) to
index axs, causing misalignment when metrics are skipped; change the loop to
maintain a separate plotted_idx (e.g., plotted_idx = 0 before the loop) and for
each metric in DistributionMetricsRegistry, if metric.add_to_box_plot is False
continue, otherwise use axs[plotted_idx] for plotting and increment plotted_idx
after plotting (update any references at metric_idx usage locations such as
around the current axs indexing at metric_idx and later at the second affected
spot). Ensure plotted_idx starts at 0 and is used consistently wherever axs is
referenced in that loop.
In `@scripts/tts_comparison_report/reporting/metrics/__init__.py`:
- Around line 17-22: The __all__ list is unsorted and triggers Ruff RUF022;
reorder the exported symbol names alphabetically in the __all__ assignment so
the list becomes sorted (e.g., put "DistributionMetricSpec",
"DistributionMetricsRegistry", "MetricSpec", "MetricsRegistry" in alphabetical
order) ensuring the identifiers DistributionMetricsRegistry, MetricsRegistry,
DistributionMetricSpec, and MetricSpec are present and only re-ordered.
In `@scripts/tts_comparison_report/reporting/s3_client.py`:
- Around line 41-48: Update the boto3 client Config in the S3 wrapper so it
includes a read_timeout and a retry policy: when building self.client in
reporting/s3_client.py (the boto3.client(...) call), pass
Config(connect_timeout=cfg.connect_timeout, read_timeout=cfg.read_timeout,
retries={'max_attempts': cfg.s3_max_retries or 3, 'mode': 'standard'}) so reads
don't hang on the default 60s and transient errors are retried; ensure cfg
exposes read_timeout and s3_max_retries or provide sane defaults and import/use
botocore.config.Config where the client is created.
In `@scripts/tts_comparison_report/reporting/storage.py`:
- Around line 138-142: The current except block in
scripts/tts_comparison_report/reporting/storage.py catches FileNotFoundError and
then generically catches OSError and returns False, which masks real
SFTP/transport/auth failures; change the second except OSError to not
blanket-return False—instead either re-raise the exception or only treat
specific errno values that truly mean "missing path" (e.g., errno.ENOENT) before
returning False; keep the FileNotFoundError branch as-is, and for other OSError
cases log the full error and propagate (raise) so SFTP/auth/network errors are
not silently ignored.
In `@scripts/tts_comparison_report/templates/eval_report_image.jinja`:
- Line 4: The image tag rendering the plot (<img class="report-plot" src="{{
image_url }}">) lacks an alt attribute for accessibility; update the template to
add a descriptive alt using an available variable (e.g., image_alt) or a
sensible default (e.g., "Plot" or "Evaluation plot") so the tag becomes <img
class="report-plot" src="{{ image_url }}" alt="{{ image_alt |
default('Evaluation plot') }}">, ensuring screen readers and accessibility tools
have meaningful text.
In `@scripts/tts_comparison_report/templates/eval_report_stat_analysis.jinja`:
- Around line 6-7: The template currently uses the |safe filter on the winner
and advantages variables ({{ winner|safe }} and {{ advantages|safe }}), which
disables auto-escaping and risks HTML/JS injection; remove the |safe filters and
render them as {{ winner }} and {{ advantages }} (or explicitly {{ advantages|e
}} if you prefer explicit escaping) and, if you actually need to include HTML
markup in advantages, instead produce/sanitize that HTML server-side and pass a
trusted/sanitized value rather than using |safe in the template; update the
occurrences of winner and advantages in eval_report_stat_analysis.jinja
accordingly.
In `@scripts/tts_comparison_report/templates/eval_report_table.jinja`:
- Line 16: Replace the unconditional use of "{{ cell|safe }}" with
escaped-by-default rendering: stop applying |safe to the raw "cell" variable,
and instead expect structured cell values (e.g., a dict with keys like "html"
and "is_safe" or "text"); when rendering, only call |safe on the trusted field
(e.g., cell.html) if cell.is_safe is true, otherwise render the escaped text
field (e.g., cell.text) so untrusted content is always escaped to avoid XSS.
---
Nitpick comments:
In `@scripts/tts_comparison_report/reporting/components/audio_report.py`:
- Around line 90-94: The warnings.warn call that notifies when
len(sampled_pairs) < samples_per_benchmark should include stacklevel=2 so the
warning points to the caller's context; update the warnings.warn invocation (the
one using benchmark_name, sampled_pairs, and samples_per_benchmark) to pass
stacklevel=2 as an argument.
In `@scripts/tts_comparison_report/reporting/components/stat_tests.py`:
- Around line 45-49: The warnings.warn call that warns about different numbers
of samples should include stacklevel=2 so the warning reports the caller's
location; update the warnings.warn invocation in
reporting/components/stat_tests.py (the warnings.warn(...) call that checks
len(baseline) != len(candidate)) to pass stacklevel=2 as an argument.
- Around line 107-127: DistributionMetricSpec.lower_is_better is Optional[bool]
but _run_single_stat_test expects a bool; update the loop over
DistributionMetricsRegistry to guard on metric.lower_is_better before calling
_run_single_stat_test (e.g., if metric.lower_is_better is None: skip this metric
or decide a safe default) so you never pass None into _run_single_stat_test;
locate the loop using metric.key and metric.report_name and adjust it to
continue when metric.lower_is_better is None (or provide an explicit bool) and
ensure StatTestResult is only created for metrics actually tested.
In `@scripts/tts_comparison_report/reporting/constants.py`:
- Around line 39-40: Reduce the S3 presigned URL default and make it
configurable: change the S3_LINK_EXPIRES_IN constant to a shorter sensible
default (e.g., 86400 for 1 day) and load its value from
configuration/environment so callers can override it; update the constant
definition (S3_LINK_EXPIRES_IN) to read an int from an env var (with fallback to
the new default) or from the existing CLI/config loader used by the project, and
ensure any tests or documentation referencing S3_LINK_EXPIRES_IN are updated
accordingly.
In `@scripts/tts_comparison_report/requirements.txt`:
- Around line 1-7: The requirements file omits a direct declaration of botocore
even though s3_client.py imports from botocore.config (Config); add "botocore"
to scripts/tts_comparison_report/requirements.txt so the dependency is explicit
and reproducible, ensuring any environment install includes the botocore package
required by the import in reporting/s3_client.py.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: a3cbb4a4-03b4-42f1-8ac9-490b986ab8fe
📒 Files selected for processing (32)
scripts/tts_comparison_report/README.mdscripts/tts_comparison_report/__init__.pyscripts/tts_comparison_report/generate_report.pyscripts/tts_comparison_report/reporting/__init__.pyscripts/tts_comparison_report/reporting/components/__init__.pyscripts/tts_comparison_report/reporting/components/audio_report.pyscripts/tts_comparison_report/reporting/components/boxplots.pyscripts/tts_comparison_report/reporting/components/eval_report.pyscripts/tts_comparison_report/reporting/components/metrics_table.pyscripts/tts_comparison_report/reporting/components/stat_tests.pyscripts/tts_comparison_report/reporting/constants.pyscripts/tts_comparison_report/reporting/helpers.pyscripts/tts_comparison_report/reporting/metrics/__init__.pyscripts/tts_comparison_report/reporting/metrics/registry.pyscripts/tts_comparison_report/reporting/metrics/specs.pyscripts/tts_comparison_report/reporting/models.pyscripts/tts_comparison_report/reporting/orchestrator.pyscripts/tts_comparison_report/reporting/renderer.pyscripts/tts_comparison_report/reporting/s3_client.pyscripts/tts_comparison_report/reporting/storage.pyscripts/tts_comparison_report/requirements.txtscripts/tts_comparison_report/templates/audio_report.jinjascripts/tts_comparison_report/templates/audio_report_block.jinjascripts/tts_comparison_report/templates/audio_report_header.jinjascripts/tts_comparison_report/templates/audio_report_pair.jinjascripts/tts_comparison_report/templates/eval_report.jinjascripts/tts_comparison_report/templates/eval_report_block.jinjascripts/tts_comparison_report/templates/eval_report_configuration.jinjascripts/tts_comparison_report/templates/eval_report_header.jinjascripts/tts_comparison_report/templates/eval_report_image.jinjascripts/tts_comparison_report/templates/eval_report_stat_analysis.jinjascripts/tts_comparison_report/templates/eval_report_table.jinja
ac04748 to
0e4848d
Compare
0e4848d to
fcbdd28
Compare
Summary
This PR introduces a report generation pipeline for TTS evaluation buckets.
The new tool compares two evaluation buckets produced by
magpietts_inferenceand generates:Both reports are uploaded to S3-compatible object storage and returned as presigned URLs that can be opened directly in a browser.
Main features
to S3-compatible object storage;
Implementation details
Orchestrator;BaseStoragewith local and SFTP implementations;Notes
0;--benchmarks;--audio_report_benchmarks;task_id.Summary by CodeRabbit
New Features
Documentation