[TTS] Added TTS comparison report by artem-gorodetskii · Pull Request #15621 · NVIDIA-NeMo/NeMo

artem-gorodetskii · 2026-04-17T16:43:55Z

Summary

This PR introduces a report generation pipeline for TTS evaluation buckets.

The new tool compares two evaluation buckets produced by magpietts_inference and generates:

an HTML evaluation report with aggregated and per-benchmark metrics;
an optional HTML audio comparison report with side-by-side audio samples.

Both reports are uploaded to S3-compatible object storage and returned as presigned URLs that can be opened directly in a browser.

Main features

support for local and remote evaluation buckets;
SFTP-based access to remote artifacts;
HTML evaluation report with:
- summary metrics,
- benchmark-level metrics,
- statistical test results,
- box plots for selected distribution metrics,
- model configuration section,
- report navigation sidebar,
- Jira link integration;
optional HTML audio comparison report with:
- context, baseline, and candidate audio,
- benchmark navigation sidebar,
- synchronized audio playback behavior;
upload of:
- reports,
- audio assets,
- box plot images
  to S3-compatible object storage;
presigned URLs for all uploaded reports and assets.

Implementation details

report rendering was moved to Jinja templates;
report generation is coordinated through Orchestrator;
storage access is abstracted through BaseStorage with local and SFTP implementations;
plotting is done in memory and uploaded directly to S3;
benchmark and audio sample validation was added to reduce mismatches between bucket versions.

Notes

the tool currently compares artifacts from repetition 0;
benchmark selection can be controlled through --benchmarks;
audio report benchmark selection can be overridden through --audio_report_benchmarks;
generated reports include expiration information for presigned links;
reports include a Jira link derived from task_id.

Summary by CodeRabbit

New Features
- Added a new TTS comparison report generation tool for evaluating baseline and candidate text-to-speech models.
- Generates evaluation reports with aggregated metrics, statistical test analysis, and box plot visualizations.
- Added optional audio comparison reports featuring side-by-side audio samples and metadata.
- Supports local and remote bucket storage access, plus cloud artifact uploads.
- HTML reports include interactive navigation and metric highlights.
Documentation
- Added comprehensive guide covering installation, usage workflows, and configuration options.

blisc

Let's make some changes to the documentation. We can discuss the comment on visualization.

blisc · 2026-04-17T18:14:25Z

+from matplotlib.patches import PathPatch
+from scripts.tts_comparison_report.reporting.metrics import DistributionMetricSpec, DistributionMetricsRegistry
+from scripts.tts_comparison_report.reporting.models import BucketData, StatTestResult, Winner
+


Why don't we merge this file with https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/tts/modules/magpietts_inference/visualization.py? I feel like this file should depend on visualization.py.

I think merging them would not be a good idea.

The two implementations may look similar at first glance because both use plt.boxplot, but they serve different responsibilities, take different inputs and produce outputs in different formats.

The reporting version is tightly coupled to report-specific semantics such as BucketData, StatTestResult, winner highlighting, in-memory rendering with BytesIO and styling aligned with the report layout. In contrast, the version in magpietts_inference/visualization.py is a more generic visualization helper that writes plots to disk.

There is also an important difference in how the plots are structured. The reporting version builds box plots in the Baseline vs Candidate format for a given benchmark, while visualization.py creates plots for a single model across multiple benchmarks. So even though both functions generate box plots, they are designed for different comparison scenarios.

For these reasons, I think two separate implementations are justified.

If we wanted to move prepare_boxplots into visualization.py, it would not be enough to only refactor the existing function there. We would also need to introduce an additional abstraction layer to separate generic box-plot generation from report-specific semantics. In practice, this would mean having:

a more generic plotting utility, and

an adapter layer to align it with the reporting API.

In my view, it would introduce additional complexity, make the abstraction less clear and reduce maintainability.

Another important concern is coupling. The current reporting logic is encapsulated in a separate module, which is an advantage. If reporting becomes dependent on magpietts_inference, then future changes in the inference module could more easily break the reporting pipeline.

coderabbitai · 2026-04-18T13:43:11Z

Walkthrough

Introduces a comprehensive TTS comparison reporting module that generates evaluation reports comparing baseline and candidate text-to-speech buckets, including metrics aggregation, statistical testing, storage abstraction (local/SFTP), S3 artifact uploads, and Jinja2-based HTML report rendering with optional audio comparisons.

Changes

Cohort / File(s)	Summary
CLI Entry Point `scripts/tts_comparison_report/generate_report.py`	Main script orchestrating end-to-end report generation with argument parsing for bucket paths, benchmarks, S3 credentials, remote SSH access, and audio report options; validates inputs and manages resource cleanup.
Core Orchestration & Rendering `scripts/tts_comparison_report/reporting/orchestrator.py`, `scripts/tts_comparison_report/reporting/renderer.py`	`Orchestrator` coordinates workflow from bucket loading through artifact uploads and report rendering; `Renderer` wraps Jinja2 environment for template preloading and rendering with HTML autoescaping.
Data Models & Helpers `scripts/tts_comparison_report/reporting/models.py`, `scripts/tts_comparison_report/reporting/helpers.py`, `scripts/tts_comparison_report/reporting/constants.py`	Core dataclasses for bucket discovery (`BucketData`, `BenchmarkData`), report artifacts (`EvalArtifacts`, `EvalResult`), metadata (`TaskInfo`, `ExpirationInfo`, `AudioPair`), and enums (`Winner`); helper functions for S3 prefixes and task/expiration info; centralized constants for benchmarks, templates, and S3 configuration.
Storage & S3 Integration `scripts/tts_comparison_report/reporting/storage.py`, `scripts/tts_comparison_report/reporting/s3_client.py`	`BaseStorage` abstract interface with `LocalStorage` and `SFTPStorage` implementations for artifact discovery and reading; `S3Config` and `S3Client` for boto3-based uploads with presigned URL generation and custom timeout configuration.
Metrics & Statistical Testing `scripts/tts_comparison_report/reporting/metrics/specs.py`, `scripts/tts_comparison_report/reporting/metrics/registry.py`, `scripts/tts_comparison_report/reporting/components/stat_tests.py`, `scripts/tts_comparison_report/reporting/components/metrics_table.py`	`MetricSpec` and `DistributionMetricSpec` dataclasses defining metric properties; `MetricsRegistry` and `DistributionMetricsRegistry` cataloging TTS evaluation metrics; Mann–Whitney U test wrapper generating `StatTestResult` objects; table row generators for benchmark and summary metric comparisons.
Report Components `scripts/tts_comparison_report/reporting/components/eval_report.py`, `scripts/tts_comparison_report/reporting/components/boxplots.py`, `scripts/tts_comparison_report/reporting/components/audio_report.py`	`prepare_eval_artifacts` orchestrates summary/per-benchmark metrics and statistical tests; `prepare_boxplots` generates matplotlib PNG figures with winner highlighting and confidence intervals; `prepare_audio_pairs` collects and deterministically samples audio pairs for benchmarks.
HTML Templates `scripts/tts_comparison_report/templates/*.jinja`	Jinja2 templates for evaluation report (main page, header, configuration, tables, statistical analysis, box plots, blocks) and audio report (main page, header, audio pair cards, blocks) with responsive CSS, client-side JavaScript for audio playback sync and scroll-driven navigation highlighting.
Package Init & Dependencies `scripts/tts_comparison_report/__init__.py`, `scripts/tts_comparison_report/reporting/__init__.py`, `scripts/tts_comparison_report/reporting/metrics/__init__.py`, `scripts/tts_comparison_report/reporting/components/__init__.py`, `scripts/tts_comparison_report/requirements.txt`	Re-exports of public APIs for orchestrator, renderer, storage, S3, metrics, and constants; declares dependencies (numpy, matplotlib, scipy, paramiko, tqdm, jinja2, boto3).
Documentation `scripts/tts_comparison_report/README.md`	Comprehensive guide explaining bucket comparison workflow, artifact structure, terminology, installation methods, environment variables, CLI flags, and maintenance pointers.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI/generate_report
    participant Orchestrator
    participant Storage as Storage<br/>(Local/SFTP)
    participant S3Client
    participant Renderer
    participant Metrics as Metrics &<br/>Stats
    
    CLI->>Orchestrator: run(baseline, candidate, benchmarks, ...)
    Orchestrator->>Storage: load bucket data
    Storage-->>Orchestrator: BucketData
    Orchestrator->>Metrics: prepare_eval_artifacts()
    Metrics-->>Orchestrator: EvalArtifacts
    Orchestrator->>Metrics: prepare_boxplots()
    Metrics-->>Orchestrator: PNG BytesIO
    Orchestrator->>S3Client: upload_bytes(boxplot)
    S3Client-->>Orchestrator: boxplot_url
    alt generate_audio_report
        Orchestrator->>Metrics: prepare_audio_pairs()
        Metrics-->>Orchestrator: AudioPair dict
        Orchestrator->>S3Client: upload_fileobj(audio files)
        S3Client-->>Orchestrator: audio_urls
        Orchestrator->>Renderer: render(audio_report)
        Renderer-->>Orchestrator: html
        Orchestrator->>S3Client: upload_bytes(audio report)
        S3Client-->>Orchestrator: audio_report_url
    end
    Orchestrator->>Renderer: render(eval_report)
    Renderer-->>Orchestrator: html
    Orchestrator->>S3Client: upload_bytes(eval_report)
    S3Client-->>Orchestrator: eval_report_url
    Orchestrator-->>CLI: (eval_url, audio_url)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hop along, dear reader! ✨

Two TTS buckets meet to compare,
Metrics dance, box plots declare,
Statistical tests find the best,
Audio pairs pass the test—
Reports rendered with Jinja flair! 📊🎵

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 56.58% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title '[TTS] Added TTS comparison report' clearly and concisely summarizes the main change: addition of a TTS comparison report feature. It is specific, directly related to the changeset, and communicates the primary purpose without unnecessary verbosity.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch magpietts_eval_reports

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 10

🧹 Nitpick comments (5)

scripts/tts_comparison_report/requirements.txt (1)
1-7: Explicitly declare botocore in requirements.txt.

scripts/tts_comparison_report/reporting/s3_client.py imports from botocore.config import Config (line 18) directly, but requirements.txt only declares boto3. While botocore is a transitive dependency of boto3, explicitly declaring packages you directly import improves clarity and reproducibility, especially in constrained or offline environments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/tts_comparison_report/requirements.txt` around lines 1 - 7, The
requirements file omits a direct declaration of botocore even though
s3_client.py imports from botocore.config (Config); add "botocore" to
scripts/tts_comparison_report/requirements.txt so the dependency is explicit and
reproducible, ensuring any environment install includes the botocore package
required by the import in reporting/s3_client.py.
scripts/tts_comparison_report/reporting/constants.py (1)
39-40: Consider reducing default presigned URL lifetime and making it configurable.

At Line 40, a 1-year default (31536000) is operationally convenient but increases link leakage risk. Prefer a shorter default (e.g., days) with env/CLI override.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/tts_comparison_report/reporting/constants.py` around lines 39 - 40,
Reduce the S3 presigned URL default and make it configurable: change the
S3_LINK_EXPIRES_IN constant to a shorter sensible default (e.g., 86400 for 1
day) and load its value from configuration/environment so callers can override
it; update the constant definition (S3_LINK_EXPIRES_IN) to read an int from an
env var (with fallback to the new default) or from the existing CLI/config
loader used by the project, and ensure any tests or documentation referencing
S3_LINK_EXPIRES_IN are updated accordingly.
scripts/tts_comparison_report/reporting/components/audio_report.py (1)
90-94: Add stacklevel=2 to the warning call.

Without stacklevel=2, the warning will appear to originate from within this function rather than from the caller's context, making it harder to trace.
Suggested fix
         if len(sampled_pairs) < samples_per_benchmark:
             warnings.warn(
                 f"\nBenchmark '{benchmark_name}' contains only {len(sampled_pairs)} available paired samples, "
-                f"but {samples_per_benchmark} were requested."
+                f"but {samples_per_benchmark} were requested.",
+                stacklevel=2,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/tts_comparison_report/reporting/components/audio_report.py` around
lines 90 - 94, The warnings.warn call that notifies when len(sampled_pairs) <
samples_per_benchmark should include stacklevel=2 so the warning points to the
caller's context; update the warnings.warn invocation (the one using
benchmark_name, sampled_pairs, and samples_per_benchmark) to pass stacklevel=2
as an argument.
scripts/tts_comparison_report/reporting/components/stat_tests.py (2)
45-49: Add stacklevel=2 to the warning call.

Without stacklevel=2, the warning will appear to originate from within this function rather than from the caller's context.
Suggested fix
     if len(baseline) != len(candidate):
         warnings.warn(
             "\nBaseline and candidate contain different numbers of samples. "
-            "This may indicate missing filewise metrics or dataset mismatch."
+            "This may indicate missing filewise metrics or dataset mismatch.",
+            stacklevel=2,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/tts_comparison_report/reporting/components/stat_tests.py` around
lines 45 - 49, The warnings.warn call that warns about different numbers of
samples should include stacklevel=2 so the warning reports the caller's
location; update the warnings.warn invocation in
reporting/components/stat_tests.py (the warnings.warn(...) call that checks
len(baseline) != len(candidate)) to pass stacklevel=2 as an argument.
107-127: Potential type mismatch: metric.lower_is_better is Optional[bool] but _run_single_stat_test expects bool.

While current DistributionMetricsRegistry entries all have explicit boolean values, the DistributionMetricSpec.lower_is_better field is typed as Optional[bool]. If a metric with lower_is_better=None is added to the registry, this will cause a runtime type error.

Consider adding a guard or skipping metrics where lower_is_better is None:
Suggested fix
     for metric in DistributionMetricsRegistry:
+        if metric.lower_is_better is None:
+            continue
+
         winner, alternative, p_value = _run_single_stat_test(
             baseline=bucket_baseline.get_metric_samples(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/tts_comparison_report/reporting/components/stat_tests.py` around
lines 107 - 127, DistributionMetricSpec.lower_is_better is Optional[bool] but
_run_single_stat_test expects a bool; update the loop over
DistributionMetricsRegistry to guard on metric.lower_is_better before calling
_run_single_stat_test (e.g., if metric.lower_is_better is None: skip this metric
or decide a safe default) so you never pass None into _run_single_stat_test;
locate the loop using metric.key and metric.report_name and adjust it to
continue when metric.lower_is_better is None (or provide an explicit bool) and
ensure StatTestResult is only created for metrics actually tested.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/tts_comparison_report/README.md`:
- Line 190: Fix the grammar in the user-facing note string "Both generated
reports includes a clickable Jira link derived from `--task_id`." by changing
"includes" to "include" so the sentence reads "Both generated reports include a
clickable Jira link derived from `--task_id`."; update the README.md text where
that exact sentence appears.

In `@scripts/tts_comparison_report/reporting/__init__.py`:
- Around line 21-33: The __all__ export list in this module is not
alphabetically sorted which triggers Ruff RUF022; reorder the items in the
__all__ list alphabetically (by string) — e.g., ensure entries like
"BaseStorage", "BucketStructure", "DUMMY_TASK_ID", "LocalStorage",
"Orchestrator", "Renderer", "S3Client", "S3Config", "SFTPStorage",
"SUPPORTED_BENCHMARK_NAMES", "TEMPLATES_DIR" appear in sorted order — by
updating the __all__ assignment so the list is sorted to satisfy the linter.

In `@scripts/tts_comparison_report/reporting/components/__init__.py`:
- Around line 18-22: The __all__ export list is not alphabetically sorted
causing Ruff RUF022; reorder the entries in the __all__ list so they are
lexicographically sorted (e.g., "BoxPlotsConfig", "prepare_audio_pairs",
"prepare_eval_artifacts") to satisfy the linter and keep the module exports
consistent.

In `@scripts/tts_comparison_report/reporting/components/boxplots.py`:
- Around line 147-156: The issue is that axs is sized for num_rows (only metrics
with add_to_box_plot=True) but the loop uses the registry index (metric_idx) to
index axs, causing misalignment when metrics are skipped; change the loop to
maintain a separate plotted_idx (e.g., plotted_idx = 0 before the loop) and for
each metric in DistributionMetricsRegistry, if metric.add_to_box_plot is False
continue, otherwise use axs[plotted_idx] for plotting and increment plotted_idx
after plotting (update any references at metric_idx usage locations such as
around the current axs indexing at metric_idx and later at the second affected
spot). Ensure plotted_idx starts at 0 and is used consistently wherever axs is
referenced in that loop.

In `@scripts/tts_comparison_report/reporting/metrics/__init__.py`:
- Around line 17-22: The __all__ list is unsorted and triggers Ruff RUF022;
reorder the exported symbol names alphabetically in the __all__ assignment so
the list becomes sorted (e.g., put "DistributionMetricSpec",
"DistributionMetricsRegistry", "MetricSpec", "MetricsRegistry" in alphabetical
order) ensuring the identifiers DistributionMetricsRegistry, MetricsRegistry,
DistributionMetricSpec, and MetricSpec are present and only re-ordered.

In `@scripts/tts_comparison_report/reporting/s3_client.py`:
- Around line 41-48: Update the boto3 client Config in the S3 wrapper so it
includes a read_timeout and a retry policy: when building self.client in
reporting/s3_client.py (the boto3.client(...) call), pass
Config(connect_timeout=cfg.connect_timeout, read_timeout=cfg.read_timeout,
retries={'max_attempts': cfg.s3_max_retries or 3, 'mode': 'standard'}) so reads
don't hang on the default 60s and transient errors are retried; ensure cfg
exposes read_timeout and s3_max_retries or provide sane defaults and import/use
botocore.config.Config where the client is created.

In `@scripts/tts_comparison_report/reporting/storage.py`:
- Around line 138-142: The current except block in
scripts/tts_comparison_report/reporting/storage.py catches FileNotFoundError and
then generically catches OSError and returns False, which masks real
SFTP/transport/auth failures; change the second except OSError to not
blanket-return False—instead either re-raise the exception or only treat
specific errno values that truly mean "missing path" (e.g., errno.ENOENT) before
returning False; keep the FileNotFoundError branch as-is, and for other OSError
cases log the full error and propagate (raise) so SFTP/auth/network errors are
not silently ignored.

In `@scripts/tts_comparison_report/templates/eval_report_image.jinja`:
- Line 4: The image tag rendering the plot (<img class="report-plot" src="{{
image_url }}">) lacks an alt attribute for accessibility; update the template to
add a descriptive alt using an available variable (e.g., image_alt) or a
sensible default (e.g., "Plot" or "Evaluation plot") so the tag becomes <img
class="report-plot" src="{{ image_url }}" alt="{{ image_alt |
default('Evaluation plot') }}">, ensuring screen readers and accessibility tools
have meaningful text.

In `@scripts/tts_comparison_report/templates/eval_report_stat_analysis.jinja`:
- Around line 6-7: The template currently uses the |safe filter on the winner
and advantages variables ({{ winner|safe }} and {{ advantages|safe }}), which
disables auto-escaping and risks HTML/JS injection; remove the |safe filters and
render them as {{ winner }} and {{ advantages }} (or explicitly {{ advantages|e
}} if you prefer explicit escaping) and, if you actually need to include HTML
markup in advantages, instead produce/sanitize that HTML server-side and pass a
trusted/sanitized value rather than using |safe in the template; update the
occurrences of winner and advantages in eval_report_stat_analysis.jinja
accordingly.

In `@scripts/tts_comparison_report/templates/eval_report_table.jinja`:
- Line 16: Replace the unconditional use of "{{ cell|safe }}" with
escaped-by-default rendering: stop applying |safe to the raw "cell" variable,
and instead expect structured cell values (e.g., a dict with keys like "html"
and "is_safe" or "text"); when rendering, only call |safe on the trusted field
(e.g., cell.html) if cell.is_safe is true, otherwise render the escaped text
field (e.g., cell.text) so untrusted content is always escaped to avoid XSS.

---

Nitpick comments:
In `@scripts/tts_comparison_report/reporting/components/audio_report.py`:
- Around line 90-94: The warnings.warn call that notifies when
len(sampled_pairs) < samples_per_benchmark should include stacklevel=2 so the
warning points to the caller's context; update the warnings.warn invocation (the
one using benchmark_name, sampled_pairs, and samples_per_benchmark) to pass
stacklevel=2 as an argument.

In `@scripts/tts_comparison_report/reporting/components/stat_tests.py`:
- Around line 45-49: The warnings.warn call that warns about different numbers
of samples should include stacklevel=2 so the warning reports the caller's
location; update the warnings.warn invocation in
reporting/components/stat_tests.py (the warnings.warn(...) call that checks
len(baseline) != len(candidate)) to pass stacklevel=2 as an argument.
- Around line 107-127: DistributionMetricSpec.lower_is_better is Optional[bool]
but _run_single_stat_test expects a bool; update the loop over
DistributionMetricsRegistry to guard on metric.lower_is_better before calling
_run_single_stat_test (e.g., if metric.lower_is_better is None: skip this metric
or decide a safe default) so you never pass None into _run_single_stat_test;
locate the loop using metric.key and metric.report_name and adjust it to
continue when metric.lower_is_better is None (or provide an explicit bool) and
ensure StatTestResult is only created for metrics actually tested.

In `@scripts/tts_comparison_report/reporting/constants.py`:
- Around line 39-40: Reduce the S3 presigned URL default and make it
configurable: change the S3_LINK_EXPIRES_IN constant to a shorter sensible
default (e.g., 86400 for 1 day) and load its value from
configuration/environment so callers can override it; update the constant
definition (S3_LINK_EXPIRES_IN) to read an int from an env var (with fallback to
the new default) or from the existing CLI/config loader used by the project, and
ensure any tests or documentation referencing S3_LINK_EXPIRES_IN are updated
accordingly.

In `@scripts/tts_comparison_report/requirements.txt`:
- Around line 1-7: The requirements file omits a direct declaration of botocore
even though s3_client.py imports from botocore.config (Config); add "botocore"
to scripts/tts_comparison_report/requirements.txt so the dependency is explicit
and reproducible, ensuring any environment install includes the botocore package
required by the import in reporting/s3_client.py.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a3cbb4a4-03b4-42f1-8ac9-490b986ab8fe

📥 Commits

Reviewing files that changed from the base of the PR and between b1cb6de and ed13548.

📒 Files selected for processing (32)

scripts/tts_comparison_report/README.md
scripts/tts_comparison_report/__init__.py
scripts/tts_comparison_report/generate_report.py
scripts/tts_comparison_report/reporting/__init__.py
scripts/tts_comparison_report/reporting/components/__init__.py
scripts/tts_comparison_report/reporting/components/audio_report.py
scripts/tts_comparison_report/reporting/components/boxplots.py
scripts/tts_comparison_report/reporting/components/eval_report.py
scripts/tts_comparison_report/reporting/components/metrics_table.py
scripts/tts_comparison_report/reporting/components/stat_tests.py
scripts/tts_comparison_report/reporting/constants.py
scripts/tts_comparison_report/reporting/helpers.py
scripts/tts_comparison_report/reporting/metrics/__init__.py
scripts/tts_comparison_report/reporting/metrics/registry.py
scripts/tts_comparison_report/reporting/metrics/specs.py
scripts/tts_comparison_report/reporting/models.py
scripts/tts_comparison_report/reporting/orchestrator.py
scripts/tts_comparison_report/reporting/renderer.py
scripts/tts_comparison_report/reporting/s3_client.py
scripts/tts_comparison_report/reporting/storage.py
scripts/tts_comparison_report/requirements.txt
scripts/tts_comparison_report/templates/audio_report.jinja
scripts/tts_comparison_report/templates/audio_report_block.jinja
scripts/tts_comparison_report/templates/audio_report_header.jinja
scripts/tts_comparison_report/templates/audio_report_pair.jinja
scripts/tts_comparison_report/templates/eval_report.jinja
scripts/tts_comparison_report/templates/eval_report_block.jinja
scripts/tts_comparison_report/templates/eval_report_configuration.jinja
scripts/tts_comparison_report/templates/eval_report_header.jinja
scripts/tts_comparison_report/templates/eval_report_image.jinja
scripts/tts_comparison_report/templates/eval_report_stat_analysis.jinja
scripts/tts_comparison_report/templates/eval_report_table.jinja

artem-gorodetskii self-assigned this Apr 17, 2026

artem-gorodetskii added the TTS label Apr 17, 2026

artem-gorodetskii force-pushed the magpietts_eval_reports branch from a8f0054 to a5a4485 Compare April 17, 2026 16:45

github-actions Bot removed the TTS label Apr 17, 2026

github-advanced-security AI found potential problems Apr 17, 2026

View reviewed changes

artem-gorodetskii force-pushed the magpietts_eval_reports branch from a5a4485 to 13962a9 Compare April 17, 2026 17:00

artem-gorodetskii requested review from XuesongYang, blisc and rfejgin April 17, 2026 17:23

blisc requested changes Apr 17, 2026

View reviewed changes

artem-gorodetskii force-pushed the magpietts_eval_reports branch 5 times, most recently from 219e653 to ed13548 Compare April 18, 2026 13:42

coderabbitai Bot reviewed Apr 18, 2026

View reviewed changes

artem-gorodetskii force-pushed the magpietts_eval_reports branch 4 times, most recently from ac04748 to 0e4848d Compare April 20, 2026 09:17

[TTS] Added TTS comparison report

fcbdd28

artem-gorodetskii force-pushed the magpietts_eval_reports branch from 0e4848d to fcbdd28 Compare April 20, 2026 09:23

blisc approved these changes Apr 20, 2026

View reviewed changes

blisc merged commit 0b7d8c5 into main Apr 20, 2026
46 checks passed

blisc deleted the magpietts_eval_reports branch April 20, 2026 15:03

Conversation

artem-gorodetskii commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Main features

Implementation details

Notes

Summary by CodeRabbit

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

blisc left a comment

Choose a reason for hiding this comment

Uh oh!

blisc Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

artem-gorodetskii Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

artem-gorodetskii commented Apr 17, 2026 •

edited

Loading

coderabbitai Bot commented Apr 18, 2026 •

edited

Loading