Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds Criterion benchmarks for SRT/WebVTT parsing and includes performance-oriented parsing changes (byte-based timestamp parsing + optional memchr fast paths) to improve throughput.
Changes:
- Optimize SRT/WebVTT timestamp parsing by extracting fields from bytes instead of
split/parse. - Speed up line iteration in both parsers using
memchrwhen enabled. - Add Criterion benchmark suites for SRT and WebVTT and document benchmark results in the README.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/vtt.rs |
Reworks timestamp parsing and line scanning for performance. |
src/srt.rs |
Reworks header/timestamp parsing and line scanning for performance. |
benches/vtt.rs |
Adds Criterion benchmarks covering small inputs and fixture-based workloads. |
benches/srt.rs |
Adds Criterion benchmarks covering small/medium inputs and fixture-based workloads. |
README.md |
Documents benchmark results and how to run them. |
Cargo.toml |
Registers the new WebVTT benchmark target. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| let hour_len = len - 10; | ||
| let hours = match hour_len { | ||
| 2 => Hour(digit2(&b[..2]) as u16), | ||
| 3 => Hour(digit3(&b[..3])), | ||
| _ => return Err(ParseHourError::NotPadded.into()), | ||
| }; |
There was a problem hiding this comment.
parse_timestamp_bytes assumes the hour component is exactly 2 or 3 digits (hour_len match), but the lexer regex for Token::Header allows 1–3 digits ([0-9]{1,3}). This is a behavior regression: inputs like 0:00:01,000 will now be rejected with ParseHourError::NotPadded even though they match the token regex and were previously accepted. Update the hour parsing to handle 1-digit hours (or tighten the regex to require 2–3 digits) so lexer and parser constraints stay consistent.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| "; | ||
|
|
||
| const MEDIUM_SRT: &str = include_str!("../fixtures/srt/DeathNote_01.eng.srt"); | ||
|
|
There was a problem hiding this comment.
MEDIUM_SRT uses include_str!("../fixtures/srt/DeathNote_01.eng.srt"), but Cargo.toml excludes the entire fixtures directory from the published package (exclude = ["tests", "fixtures"]). That means cargo bench will fail to compile for downstream users of the published crate because the included file won’t exist. Consider embedding a smaller sample under benches/, generating the data at runtime, or adjusting the package include/exclude so the referenced fixture is available when benches are built.
| if b[len - 4] != b'.' || b[len - 7] != b':' { | ||
| return Err(ParseVttError::InvalidTimestamp("invalid format")); | ||
| } | ||
| let millis = Millisecond(vtt_digit3(&b[len - 3..])); | ||
| let seconds = Second(vtt_digit2(&b[len - 6..len - 4])); | ||
| let minutes = Minute(vtt_digit2(&b[len - 9..len - 7])); | ||
| let hours = if len > 9 { | ||
| if b[len - 10] != b':' { | ||
| return Err(ParseVttError::InvalidTimestamp("invalid format")); | ||
| } | ||
| let hour_str = &s[..len - 10]; | ||
| // VTT hours are unbounded u64; parse the variable-length prefix | ||
| parse_vtt_hour_bytes(hour_str.as_bytes()) | ||
| .map_err(|_| ParseVttError::InvalidTimestamp("invalid hours"))? | ||
| } else { | ||
| Hour::new() | ||
| }; | ||
| Ok(Timestamp::from_hmsm(hours, minutes, seconds, millis)) |
There was a problem hiding this comment.
parse_timestamp now constructs Minute, Second, and Millisecond via their tuple constructors (Minute(..), Second(..), Millisecond(..)), which bypasses the range/padding validation enforced by their FromStr/with/try_with APIs. This function is also used by cue-text parsing (vtt/cue.rs) where the timestamp regex allows arbitrary digits/colons, so inputs like 00:99.000 or 0:00.000 can create invalid units (or trigger underflow/panic in debug builds) and later break as_str()/encoding assumptions. Please validate digits + ranges (e.g., use Minute::try_with, Second::try_with, Millisecond::try_with and reject on None) before constructing the timestamp, or fall back to the previous parse()-based path for cue timestamps.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| let b = s.as_bytes(); | ||
| let len = b.len(); | ||
| if len < 9 { | ||
| return Err(ParseVttError::InvalidTimestamp("too short")); | ||
| } | ||
| // Validate expected separators before doing byte arithmetic. | ||
| // This guards against malformed input from cue-text parsing where | ||
| // the regex only loosely validates the format. | ||
| if b[len - 4] != b'.' || b[len - 7] != b':' { | ||
| return Err(ParseVttError::InvalidTimestamp("invalid format")); | ||
| } | ||
| let millis_val = vtt_digit3(&b[len - 3..]); | ||
| let seconds_val = vtt_digit2(&b[len - 6..len - 4]); | ||
| let minutes_val = vtt_digit2(&b[len - 9..len - 7]); | ||
| // Validate ranges — needed for cue-text path where input is not regex-validated. | ||
| if seconds_val >= 60 || minutes_val >= 60 { | ||
| return Err(ParseVttError::InvalidTimestamp("out of range")); | ||
| } | ||
| let millis = Millisecond(millis_val); | ||
| let seconds = Second(seconds_val); | ||
| let minutes = Minute(minutes_val); | ||
| let hours = if len > 9 { | ||
| if b[len - 10] != b':' { | ||
| return Err(ParseVttError::InvalidTimestamp("invalid format")); | ||
| } | ||
| let hour_str = &s[..len - 10]; | ||
| // VTT hours are unbounded u64; parse the variable-length prefix | ||
| parse_vtt_hour_bytes(hour_str.as_bytes()) | ||
| .map_err(|_| ParseVttError::InvalidTimestamp("invalid hours"))? | ||
| } else { | ||
| Hour::new() | ||
| }; |
There was a problem hiding this comment.
In the cue-text path you explicitly note the input may not be regex-validated, but vtt_digit2/vtt_digit3 will silently produce incorrect values for non-ASCII digits (due to u8 underflow/wrap on b[0] - b'0'). Additionally, hour_str can be empty (e.g. \":00:00.000\"), and parse_vtt_hour_bytes returns Ok(Hour(0)), incorrectly accepting an invalid long-form timestamp. Consider validating that all digit positions are is_ascii_digit() before arithmetic and rejecting empty hour_str when len > 9.
| fn parse_vtt_hour_bytes(b: &[u8]) -> Result<Hour, ParseHourError> { | ||
| let mut val: u64 = 0; | ||
| for &byte in b { | ||
| if !byte.is_ascii_digit() { | ||
| return Err(ParseHourError::NotPadded); | ||
| } | ||
| val = val | ||
| .checked_mul(10) | ||
| .and_then(|v| v.checked_add((byte - b'0') as u64)) | ||
| .ok_or(ParseHourError::NotPadded)?; | ||
| } | ||
| Ok(Hour(val)) | ||
| } |
There was a problem hiding this comment.
ParseHourError::NotPadded is used for both non-digit input and overflow (checked_mul/checked_add failure). That makes it hard to distinguish formatting errors from overflow when mapping to ParseVttError. Consider using a dedicated error variant for overflow (or mapping overflow to a more specific InvalidTimestamp message) so callers can report a more accurate failure reason.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| let millis_val = vtt_digit3(&b[len - 3..]); | ||
| let seconds_val = vtt_digit2(&b[len - 6..len - 4]); | ||
| let minutes_val = vtt_digit2(&b[len - 9..len - 7]); |
There was a problem hiding this comment.
parse_timestamp is called from cue-text parsing (vtt::cue) where timestamp tags are only loosely regex-validated (<[0-9:]+\.[0-9]{3}>). Here vtt_digit2/vtt_digit3 do unchecked u8 subtraction (b[0] - b'0'), which can panic on bytes < b'0' in debug builds and is generally unsafe for malformed input. Consider validating that all expected digit bytes are ASCII digits (or using checked conversion) before doing arithmetic, and return InvalidTimestamp on failure.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| /// The error type for parsing hour components of timestamps. | ||
| #[derive(Debug, Clone, PartialEq, Eq, IsVariant, Unwrap, TryUnwrap, thiserror::Error)] | ||
| #[unwrap(ref, ref_mut)] | ||
| #[try_unwrap(ref, ref_mut)] | ||
| pub enum ParseHourError { | ||
| /// The hour component is not zero-padded to 2 digits. | ||
| #[error("hour component is not zero-padded to 2 digits")] | ||
| #[unwrap(ignore)] | ||
| #[try_unwrap(ignore)] | ||
| NotPadded, | ||
| /// The hour component is out of range (not between 0-999). | ||
| #[error("hour component must be between 0-999, but was {0}")] | ||
| Overflow(u16), | ||
| /// The hour component overflowed u64 (VTT unbounded hours). | ||
| #[error("hour component overflowed")] | ||
| #[unwrap(ignore)] | ||
| #[try_unwrap(ignore)] | ||
| HourOverflow, |
There was a problem hiding this comment.
ParseHourError is used by both SRT (0–999) and WebVTT (unbounded u64) hour parsing. Adding HourOverflow makes the enum’s current doc comment (0-999) misleading; update the ParseHourError docs to describe that some variants are format-specific (e.g., SRT range vs WebVTT unbounded + overflow).
| pub(crate) fn parse_timestamp(s: &str) -> Result<Timestamp, ParseVttError> { | ||
| let (time, ms) = s | ||
| .split_once('.') | ||
| .ok_or(ParseVttError::InvalidTimestamp("missing '.'"))?; | ||
| let millis: Millisecond = ms.parse()?; | ||
| let mut parts = time.splitn(3, ':'); | ||
| let first = parts | ||
| .next() | ||
| .ok_or(ParseVttError::InvalidTimestamp("empty timestamp"))?; | ||
| let second = parts.next().ok_or(ParseVttError::InvalidTimestamp( | ||
| "missing minutes or seconds", | ||
| ))?; | ||
| match parts.next() { | ||
| // Long form: HH:MM:SS | ||
| Some(sec_str) => Ok(Timestamp::from_hmsm( | ||
| first.parse()?, | ||
| second.parse()?, | ||
| sec_str.parse()?, | ||
| millis, | ||
| )), | ||
| // Short form: MM:SS | ||
| None => Ok(Timestamp::from_hmsm( | ||
| Hour::new(), | ||
| first.parse()?, | ||
| second.parse()?, | ||
| millis, | ||
| )), | ||
| let b = s.as_bytes(); | ||
| let len = b.len(); | ||
| if len < 9 { | ||
| return Err(ParseVttError::InvalidTimestamp("too short")); | ||
| } | ||
| // Validate expected separators before doing byte arithmetic. | ||
| // This guards against malformed input from cue-text parsing where | ||
| // the regex only loosely validates the format. | ||
| if b[len - 4] != b'.' || b[len - 7] != b':' { | ||
| return Err(ParseVttError::InvalidTimestamp("invalid format")); | ||
| } | ||
| let millis_val = vtt_digit3(&b[len - 3..]); | ||
| let seconds_val = vtt_digit2(&b[len - 6..len - 4]); | ||
| let minutes_val = vtt_digit2(&b[len - 9..len - 7]); | ||
| // Use try_with to validate ranges — needed for cue-text path where input | ||
| // is not regex-validated. | ||
| let millis = Millisecond::try_with(millis_val) | ||
| .ok_or(ParseVttError::InvalidTimestamp("milliseconds out of range"))?; | ||
| let seconds = | ||
| Second::try_with(seconds_val).ok_or(ParseVttError::InvalidTimestamp("seconds out of range"))?; | ||
| let minutes = | ||
| Minute::try_with(minutes_val).ok_or(ParseVttError::InvalidTimestamp("minutes out of range"))?; | ||
| let hours = if len > 9 { | ||
| if b[len - 10] != b':' { | ||
| return Err(ParseVttError::InvalidTimestamp("invalid format")); | ||
| } | ||
| let hour_str = &s[..len - 10]; | ||
| // VTT hours are unbounded u64; parse the variable-length prefix | ||
| parse_vtt_hour_bytes(hour_str.as_bytes()) | ||
| .map_err(|_| ParseVttError::InvalidTimestamp("invalid hours"))? | ||
| } else { | ||
| Hour::new() | ||
| }; | ||
| Ok(Timestamp::from_hmsm(hours, minutes, seconds, millis)) |
There was a problem hiding this comment.
The new byte-based timestamp parsing in parse_timestamp has several edge cases (non-digit bytes from cue-text path, empty hour prefix, huge hour overflow). There are existing cue-text timestamp tests, but none that assert malformed timestamp tags are safely rejected (no panic) and that hour-overflow paths produce an error. Add focused tests covering these cases to lock in the intended behavior.
| let b = s.as_bytes(); | ||
| let len = b.len(); | ||
| if len < 9 { | ||
| return Err(ParseVttError::InvalidTimestamp("too short")); | ||
| } | ||
| // Validate expected separators before doing byte arithmetic. | ||
| // This guards against malformed input from cue-text parsing where | ||
| // the regex only loosely validates the format. | ||
| if b[len - 4] != b'.' || b[len - 7] != b':' { | ||
| return Err(ParseVttError::InvalidTimestamp("invalid format")); | ||
| } | ||
| let millis_val = vtt_digit3(&b[len - 3..]); | ||
| let seconds_val = vtt_digit2(&b[len - 6..len - 4]); | ||
| let minutes_val = vtt_digit2(&b[len - 9..len - 7]); | ||
| // Use try_with to validate ranges — needed for cue-text path where input | ||
| // is not regex-validated. | ||
| let millis = Millisecond::try_with(millis_val) | ||
| .ok_or(ParseVttError::InvalidTimestamp("milliseconds out of range"))?; | ||
| let seconds = | ||
| Second::try_with(seconds_val).ok_or(ParseVttError::InvalidTimestamp("seconds out of range"))?; | ||
| let minutes = | ||
| Minute::try_with(minutes_val).ok_or(ParseVttError::InvalidTimestamp("minutes out of range"))?; |
There was a problem hiding this comment.
parse_timestamp does unchecked digit arithmetic (b[i] - b'0') for minutes/seconds/millis. When this function is called from cue-text parsing (e.g., unterminated tags), those byte positions may not be ASCII digits, which can panic in debug builds and/or yield wrapped values that slip past the later range checks. Add explicit ASCII-digit validation for the extracted slices (and return InvalidTimestamp) before calling vtt_digit2/vtt_digit3, or make those helpers validate and return Result.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| let slice = s.slice(); | ||
| // The regex guarantees `H+:MM:SS,mmm <ws> --> <ws> H+:MM:SS,mmm`. | ||
| // Find the arrow separator and extract both timestamps via byte arithmetic. | ||
| let arrow = slice.find("-->").unwrap(); | ||
| let start_str = slice[..arrow].trim(); | ||
| let end_str = slice[arrow + 3..].trim(); |
There was a problem hiding this comment.
The comment above parse_header says the regex guarantees H+:MM:SS,mmm ..., but the Token::Header regex currently enforces 2–3 hour digits. Either adjust the comment to match the regex, or relax the regex if H+ is intended.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| /// Header "HH:MM:SS,mmm --> HH:MM:SS,mmm" | ||
| #[regex( | ||
| r"[0-9]{1,3}:[0-5][0-9]:[0-5][0-9],[0-9]{3}[ \t\x0C]+-->[ \t\x0C]+[0-9]{1,3}:[0-5][0-9]:[0-5][0-9],[0-9]{3}", | ||
| r"[0-9]{2,3}:[0-5][0-9]:[0-5][0-9],[0-9]{3}[ \t\x0C]+-->[ \t\x0C]+[0-9]{2,3}:[0-5][0-9]:[0-5][0-9],[0-9]{3}", | ||
| parse_header |
There was a problem hiding this comment.
The SRT header lexer regex was tightened from allowing 1–3 hour digits to 2–3 ([0-9]{2,3}). This changes parsing behavior (e.g., 0:00:01,000 --> ... will no longer be recognized as a header line in either strict or lossy mode). If 1-digit hours should remain supported, keep the regex at {1,3} and update parse_timestamp_bytes to handle hour_len == 1; otherwise, consider documenting this as an intentional strictness increase.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| Goodbye world! | ||
| "; | ||
|
|
||
| const MEDIUM_SRT: &str = include_str!("../fixtures/srt/DeathNote_01.eng.srt"); |
There was a problem hiding this comment.
include_str!("../fixtures/srt/DeathNote_01.eng.srt") hard-depends on a fixture file at compile time, but Cargo.toml excludes fixtures/ from the published package. This will make cargo bench fail to compile from crates.io or any checkout without fixtures. Consider reading the file at runtime with a fallback (like the VTT benches do), or include the needed fixtures in the package.
| /// A timestamp could not be parsed. | ||
| #[error("invalid timestamp: {0}")] | ||
| InvalidTimestamp(&'static str), | ||
| InvalidTimestamp(TimestampError), | ||
|
|
There was a problem hiding this comment.
The PR title suggests this change is benchmark-only, but this diff also changes the public WebVTT error API (InvalidTimestamp now contains TimestampError instead of &'static str) and updates core timestamp parsing logic. Please either update the PR title/description (and ideally the changelog) to reflect these behavior/API changes, or split the parsing changes into a separate PR.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
No description provided.