Skip to content

bench: add benchmarks and results#5

Merged
al8n merged 20 commits intomainfrom
0.2.0
Mar 12, 2026
Merged

bench: add benchmarks and results#5
al8n merged 20 commits intomainfrom
0.2.0

Conversation

@al8n
Copy link
Copy Markdown
Contributor

@al8n al8n commented Mar 11, 2026

No description provided.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds Criterion benchmarks for SRT/WebVTT parsing and includes performance-oriented parsing changes (byte-based timestamp parsing + optional memchr fast paths) to improve throughput.

Changes:

  • Optimize SRT/WebVTT timestamp parsing by extracting fields from bytes instead of split/parse.
  • Speed up line iteration in both parsers using memchr when enabled.
  • Add Criterion benchmark suites for SRT and WebVTT and document benchmark results in the README.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/vtt.rs Reworks timestamp parsing and line scanning for performance.
src/srt.rs Reworks header/timestamp parsing and line scanning for performance.
benches/vtt.rs Adds Criterion benchmarks covering small inputs and fixture-based workloads.
benches/srt.rs Adds Criterion benchmarks covering small/medium inputs and fixture-based workloads.
README.md Documents benchmark results and how to run them.
Cargo.toml Registers the new WebVTT benchmark target.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread benches/srt.rs
Comment thread benches/srt.rs Outdated
Comment thread src/srt.rs
Comment on lines +348 to +353
let hour_len = len - 10;
let hours = match hour_len {
2 => Hour(digit2(&b[..2]) as u16),
3 => Hour(digit3(&b[..3])),
_ => return Err(ParseHourError::NotPadded.into()),
};
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_timestamp_bytes assumes the hour component is exactly 2 or 3 digits (hour_len match), but the lexer regex for Token::Header allows 1–3 digits ([0-9]{1,3}). This is a behavior regression: inputs like 0:00:01,000 will now be rejected with ParseHourError::NotPadded even though they match the token regex and were previously accepted. Update the hour parsing to handle 1-digit hours (or tighten the regex to require 2–3 digits) so lexer and parser constraints stay consistent.

Copilot uses AI. Check for mistakes.
Comment thread src/vtt.rs
Comment thread src/vtt.rs Outdated
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 83.63636% with 18 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/vtt.rs 81.48% 15 Missing ⚠️
src/srt.rs 89.28% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/srt.rs
Comment thread benches/srt.rs
Comment on lines +14 to +17
";

const MEDIUM_SRT: &str = include_str!("../fixtures/srt/DeathNote_01.eng.srt");

Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MEDIUM_SRT uses include_str!("../fixtures/srt/DeathNote_01.eng.srt"), but Cargo.toml excludes the entire fixtures directory from the published package (exclude = ["tests", "fixtures"]). That means cargo bench will fail to compile for downstream users of the published crate because the included file won’t exist. Consider embedding a smaller sample under benches/, generating the data at runtime, or adjusting the package include/exclude so the referenced fixture is available when benches are built.

Copilot uses AI. Check for mistakes.
Comment thread src/vtt.rs
Comment on lines +126 to +143
if b[len - 4] != b'.' || b[len - 7] != b':' {
return Err(ParseVttError::InvalidTimestamp("invalid format"));
}
let millis = Millisecond(vtt_digit3(&b[len - 3..]));
let seconds = Second(vtt_digit2(&b[len - 6..len - 4]));
let minutes = Minute(vtt_digit2(&b[len - 9..len - 7]));
let hours = if len > 9 {
if b[len - 10] != b':' {
return Err(ParseVttError::InvalidTimestamp("invalid format"));
}
let hour_str = &s[..len - 10];
// VTT hours are unbounded u64; parse the variable-length prefix
parse_vtt_hour_bytes(hour_str.as_bytes())
.map_err(|_| ParseVttError::InvalidTimestamp("invalid hours"))?
} else {
Hour::new()
};
Ok(Timestamp::from_hmsm(hours, minutes, seconds, millis))
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_timestamp now constructs Minute, Second, and Millisecond via their tuple constructors (Minute(..), Second(..), Millisecond(..)), which bypasses the range/padding validation enforced by their FromStr/with/try_with APIs. This function is also used by cue-text parsing (vtt/cue.rs) where the timestamp regex allows arbitrary digits/colons, so inputs like 00:99.000 or 0:00.000 can create invalid units (or trigger underflow/panic in debug builds) and later break as_str()/encoding assumptions. Please validate digits + ranges (e.g., use Minute::try_with, Second::try_with, Millisecond::try_with and reject on None) before constructing the timestamp, or fall back to the previous parse()-based path for cue timestamps.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/vtt.rs
Comment on lines +118 to +149
let b = s.as_bytes();
let len = b.len();
if len < 9 {
return Err(ParseVttError::InvalidTimestamp("too short"));
}
// Validate expected separators before doing byte arithmetic.
// This guards against malformed input from cue-text parsing where
// the regex only loosely validates the format.
if b[len - 4] != b'.' || b[len - 7] != b':' {
return Err(ParseVttError::InvalidTimestamp("invalid format"));
}
let millis_val = vtt_digit3(&b[len - 3..]);
let seconds_val = vtt_digit2(&b[len - 6..len - 4]);
let minutes_val = vtt_digit2(&b[len - 9..len - 7]);
// Validate ranges — needed for cue-text path where input is not regex-validated.
if seconds_val >= 60 || minutes_val >= 60 {
return Err(ParseVttError::InvalidTimestamp("out of range"));
}
let millis = Millisecond(millis_val);
let seconds = Second(seconds_val);
let minutes = Minute(minutes_val);
let hours = if len > 9 {
if b[len - 10] != b':' {
return Err(ParseVttError::InvalidTimestamp("invalid format"));
}
let hour_str = &s[..len - 10];
// VTT hours are unbounded u64; parse the variable-length prefix
parse_vtt_hour_bytes(hour_str.as_bytes())
.map_err(|_| ParseVttError::InvalidTimestamp("invalid hours"))?
} else {
Hour::new()
};
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the cue-text path you explicitly note the input may not be regex-validated, but vtt_digit2/vtt_digit3 will silently produce incorrect values for non-ASCII digits (due to u8 underflow/wrap on b[0] - b'0'). Additionally, hour_str can be empty (e.g. \":00:00.000\"), and parse_vtt_hour_bytes returns Ok(Hour(0)), incorrectly accepting an invalid long-form timestamp. Consider validating that all digit positions are is_ascii_digit() before arithmetic and rejecting empty hour_str when len > 9.

Copilot uses AI. Check for mistakes.
Comment thread src/vtt.rs
Comment on lines +158 to +170
fn parse_vtt_hour_bytes(b: &[u8]) -> Result<Hour, ParseHourError> {
let mut val: u64 = 0;
for &byte in b {
if !byte.is_ascii_digit() {
return Err(ParseHourError::NotPadded);
}
val = val
.checked_mul(10)
.and_then(|v| v.checked_add((byte - b'0') as u64))
.ok_or(ParseHourError::NotPadded)?;
}
Ok(Hour(val))
}
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParseHourError::NotPadded is used for both non-digit input and overflow (checked_mul/checked_add failure). That makes it hard to distinguish formatting errors from overflow when mapping to ParseVttError. Consider using a dedicated error variant for overflow (or mapping overflow to a more specific InvalidTimestamp message) so callers can report a more accurate failure reason.

Copilot uses AI. Check for mistakes.
Comment thread src/srt.rs
Comment thread benches/vtt.rs Outdated
Comment thread README.md
al8n and others added 2 commits March 12, 2026 02:42
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/vtt.rs Outdated
Comment on lines +129 to +131
let millis_val = vtt_digit3(&b[len - 3..]);
let seconds_val = vtt_digit2(&b[len - 6..len - 4]);
let minutes_val = vtt_digit2(&b[len - 9..len - 7]);
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_timestamp is called from cue-text parsing (vtt::cue) where timestamp tags are only loosely regex-validated (<[0-9:]+\.[0-9]{3}>). Here vtt_digit2/vtt_digit3 do unchecked u8 subtraction (b[0] - b'0'), which can panic on bytes < b'0' in debug builds and is generally unsafe for malformed input. Consider validating that all expected digit bytes are ASCII digits (or using checked conversion) before doing arithmetic, and return InvalidTimestamp on failure.

Copilot uses AI. Check for mistakes.
Comment thread src/vtt.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/srt.rs Outdated
Comment thread src/error.rs
Comment on lines 40 to +57
/// The error type for parsing hour components of timestamps.
#[derive(Debug, Clone, PartialEq, Eq, IsVariant, Unwrap, TryUnwrap, thiserror::Error)]
#[unwrap(ref, ref_mut)]
#[try_unwrap(ref, ref_mut)]
pub enum ParseHourError {
/// The hour component is not zero-padded to 2 digits.
#[error("hour component is not zero-padded to 2 digits")]
#[unwrap(ignore)]
#[try_unwrap(ignore)]
NotPadded,
/// The hour component is out of range (not between 0-999).
#[error("hour component must be between 0-999, but was {0}")]
Overflow(u16),
/// The hour component overflowed u64 (VTT unbounded hours).
#[error("hour component overflowed")]
#[unwrap(ignore)]
#[try_unwrap(ignore)]
HourOverflow,
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParseHourError is used by both SRT (0–999) and WebVTT (unbounded u64) hour parsing. Adding HourOverflow makes the enum’s current doc comment (0-999) misleading; update the ParseHourError docs to describe that some variants are format-specific (e.g., SRT range vs WebVTT unbounded + overflow).

Copilot uses AI. Check for mistakes.
Comment thread src/vtt.rs
Comment on lines 117 to +151
pub(crate) fn parse_timestamp(s: &str) -> Result<Timestamp, ParseVttError> {
let (time, ms) = s
.split_once('.')
.ok_or(ParseVttError::InvalidTimestamp("missing '.'"))?;
let millis: Millisecond = ms.parse()?;
let mut parts = time.splitn(3, ':');
let first = parts
.next()
.ok_or(ParseVttError::InvalidTimestamp("empty timestamp"))?;
let second = parts.next().ok_or(ParseVttError::InvalidTimestamp(
"missing minutes or seconds",
))?;
match parts.next() {
// Long form: HH:MM:SS
Some(sec_str) => Ok(Timestamp::from_hmsm(
first.parse()?,
second.parse()?,
sec_str.parse()?,
millis,
)),
// Short form: MM:SS
None => Ok(Timestamp::from_hmsm(
Hour::new(),
first.parse()?,
second.parse()?,
millis,
)),
let b = s.as_bytes();
let len = b.len();
if len < 9 {
return Err(ParseVttError::InvalidTimestamp("too short"));
}
// Validate expected separators before doing byte arithmetic.
// This guards against malformed input from cue-text parsing where
// the regex only loosely validates the format.
if b[len - 4] != b'.' || b[len - 7] != b':' {
return Err(ParseVttError::InvalidTimestamp("invalid format"));
}
let millis_val = vtt_digit3(&b[len - 3..]);
let seconds_val = vtt_digit2(&b[len - 6..len - 4]);
let minutes_val = vtt_digit2(&b[len - 9..len - 7]);
// Use try_with to validate ranges — needed for cue-text path where input
// is not regex-validated.
let millis = Millisecond::try_with(millis_val)
.ok_or(ParseVttError::InvalidTimestamp("milliseconds out of range"))?;
let seconds =
Second::try_with(seconds_val).ok_or(ParseVttError::InvalidTimestamp("seconds out of range"))?;
let minutes =
Minute::try_with(minutes_val).ok_or(ParseVttError::InvalidTimestamp("minutes out of range"))?;
let hours = if len > 9 {
if b[len - 10] != b':' {
return Err(ParseVttError::InvalidTimestamp("invalid format"));
}
let hour_str = &s[..len - 10];
// VTT hours are unbounded u64; parse the variable-length prefix
parse_vtt_hour_bytes(hour_str.as_bytes())
.map_err(|_| ParseVttError::InvalidTimestamp("invalid hours"))?
} else {
Hour::new()
};
Ok(Timestamp::from_hmsm(hours, minutes, seconds, millis))
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new byte-based timestamp parsing in parse_timestamp has several edge cases (non-digit bytes from cue-text path, empty hour prefix, huge hour overflow). There are existing cue-text timestamp tests, but none that assert malformed timestamp tags are safely rejected (no panic) and that hour-overflow paths produce an error. Add focused tests covering these cases to lock in the intended behavior.

Copilot uses AI. Check for mistakes.
Comment thread src/vtt.rs Outdated
Comment on lines +118 to +139
let b = s.as_bytes();
let len = b.len();
if len < 9 {
return Err(ParseVttError::InvalidTimestamp("too short"));
}
// Validate expected separators before doing byte arithmetic.
// This guards against malformed input from cue-text parsing where
// the regex only loosely validates the format.
if b[len - 4] != b'.' || b[len - 7] != b':' {
return Err(ParseVttError::InvalidTimestamp("invalid format"));
}
let millis_val = vtt_digit3(&b[len - 3..]);
let seconds_val = vtt_digit2(&b[len - 6..len - 4]);
let minutes_val = vtt_digit2(&b[len - 9..len - 7]);
// Use try_with to validate ranges — needed for cue-text path where input
// is not regex-validated.
let millis = Millisecond::try_with(millis_val)
.ok_or(ParseVttError::InvalidTimestamp("milliseconds out of range"))?;
let seconds =
Second::try_with(seconds_val).ok_or(ParseVttError::InvalidTimestamp("seconds out of range"))?;
let minutes =
Minute::try_with(minutes_val).ok_or(ParseVttError::InvalidTimestamp("minutes out of range"))?;
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_timestamp does unchecked digit arithmetic (b[i] - b'0') for minutes/seconds/millis. When this function is called from cue-text parsing (e.g., unterminated tags), those byte positions may not be ASCII digits, which can panic in debug builds and/or yield wrapped values that slip past the later range checks. Add explicit ASCII-digit validation for the extracted slices (and return InvalidTimestamp) before calling vtt_digit2/vtt_digit3, or make those helpers validate and return Result.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/vtt.rs Outdated
Comment thread src/srt.rs
Comment on lines +326 to +331
let slice = s.slice();
// The regex guarantees `H+:MM:SS,mmm <ws> --> <ws> H+:MM:SS,mmm`.
// Find the arrow separator and extract both timestamps via byte arithmetic.
let arrow = slice.find("-->").unwrap();
let start_str = slice[..arrow].trim();
let end_str = slice[arrow + 3..].trim();
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above parse_header says the regex guarantees H+:MM:SS,mmm ..., but the Token::Header regex currently enforces 2–3 hour digits. Either adjust the comment to match the regex, or relax the regex if H+ is intended.

Copilot uses AI. Check for mistakes.
Comment thread benches/vtt.rs Outdated
Comment thread benches/srt.rs Outdated
al8n and others added 2 commits March 12, 2026 10:52
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
al8n and others added 2 commits March 12, 2026 10:53
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/error.rs Outdated
Comment thread src/srt.rs
Comment on lines 295 to 298
/// Header "HH:MM:SS,mmm --> HH:MM:SS,mmm"
#[regex(
r"[0-9]{1,3}:[0-5][0-9]:[0-5][0-9],[0-9]{3}[ \t\x0C]+-->[ \t\x0C]+[0-9]{1,3}:[0-5][0-9]:[0-5][0-9],[0-9]{3}",
r"[0-9]{2,3}:[0-5][0-9]:[0-5][0-9],[0-9]{3}[ \t\x0C]+-->[ \t\x0C]+[0-9]{2,3}:[0-5][0-9]:[0-5][0-9],[0-9]{3}",
parse_header
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SRT header lexer regex was tightened from allowing 1–3 hour digits to 2–3 ([0-9]{2,3}). This changes parsing behavior (e.g., 0:00:01,000 --> ... will no longer be recognized as a header line in either strict or lossy mode). If 1-digit hours should remain supported, keep the regex at {1,3} and update parse_timestamp_bytes to handle hour_len == 1; otherwise, consider documenting this as an intentional strictness increase.

Copilot uses AI. Check for mistakes.
Comment thread benches/vtt.rs Outdated
al8n and others added 4 commits March 12, 2026 11:06
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread benches/srt.rs
Goodbye world!
";

const MEDIUM_SRT: &str = include_str!("../fixtures/srt/DeathNote_01.eng.srt");
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include_str!("../fixtures/srt/DeathNote_01.eng.srt") hard-depends on a fixture file at compile time, but Cargo.toml excludes fixtures/ from the published package. This will make cargo bench fail to compile from crates.io or any checkout without fixtures. Consider reading the file at runtime with a fallback (like the VTT benches do), or include the needed fixtures in the package.

Copilot uses AI. Check for mistakes.
Comment thread benches/srt.rs Outdated
Comment thread src/vtt.rs
Comment on lines 46 to 49
/// A timestamp could not be parsed.
#[error("invalid timestamp: {0}")]
InvalidTimestamp(&'static str),
InvalidTimestamp(TimestampError),

Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title suggests this change is benchmark-only, but this diff also changes the public WebVTT error API (InvalidTimestamp now contains TimestampError instead of &'static str) and updates core timestamp parsing logic. Please either update the PR title/description (and ideally the changelog) to reflect these behavior/API changes, or split the parsing changes into a separate PR.

Copilot uses AI. Check for mistakes.
al8n and others added 2 commits March 12, 2026 11:54
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@al8n al8n merged commit ac5338c into main Mar 12, 2026
44 checks passed
@al8n al8n deleted the 0.2.0 branch March 12, 2026 04:38
al8n pushed a commit that referenced this pull request Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants