Skip to content

test: add 140 extensive edge-case tests across all modules#4

Merged
Alex-Wengg merged 7 commits intomainfrom
test/extensive-edge-case-tests
Mar 12, 2026
Merged

test: add 140 extensive edge-case tests across all modules#4
Alex-Wengg merged 7 commits intomainfrom
test/extensive-edge-case-tests

Conversation

@Alex-Wengg
Copy link
Member

@Alex-Wengg Alex-Wengg commented Mar 12, 2026

Summary

  • Adds 140 new tests covering ITN, TN, sentence-level processing, roundtrip consistency, boundary conditions, cross-tagger interference, and realistic TTS preprocessing scenarios
  • Comprehensive date testing per croqueteer's flag: all 31 days, all 12 months, numeric formats (slash/dash), decades, year verbalization (1776-2025), ordinal suffixes, trailing punctuation, invalid inputs
  • Documents 7 known issues found during testing (see below)

Known Issues Documented

# Issue Root Cause Impact
1 normalize("one million")"1 million" not "1000000" Decimal tagger's scale handling intercepts before cardinal Scale numbers (million/billion/trillion) not fully expanded
2 normalize("twelve fifteen")"1215" not "12:15" Time tagger rejects hours 10-19 without AM/PM to avoid year conflicts Ambiguous hour-minute without AM/PM falls to cardinal
3 normalize("two o clock") → passthrough Time tagger expects "o'clock" or "oclock", not "o clock" ASR output may produce space-separated form
4 normalize("fifty percent")"50 %" not "50%" Measure tagger format: "{number} {unit}" Minor formatting
5 Sentence ITN drops "and" between number groups Cardinal's words_to_number filters "and" as separator "21 apples and 32 oranges" → "21 apples 32 oranges"
6 Adjacent number words combined greedily Sliding window longest-match: "twenty one forty two" → 2043 Inherent to greedy algorithm
7 Comma-attached tokens break sentence date matching Tokenization splits "5," as one token "January 5, 2025" partially matched in sentence mode

Test plan

  • All 140 new tests pass
  • All 217 existing tests still pass (143 unit + 57 ITN + 12 TN + 5 doc)
  • Total: 357 tests, 0 failures

Open with Devin

@Alex-Wengg Alex-Wengg force-pushed the test/extensive-edge-case-tests branch from 39d5e30 to 35a0771 Compare March 12, 2026 09:50
devin-ai-integration[bot]

This comment was marked as resolved.

smdesai and others added 7 commits March 12, 2026 06:11
Covers ITN and TN directions, sentence-level processing, roundtrip
consistency, boundary conditions, cross-tagger interference, and
realistic TTS preprocessing scenarios. Includes comprehensive date
testing (all 31 days, all 12 months, numeric formats, decades, year
verbalization).

Documents 7 known issues found during testing:
- Decimal tagger intercepts scale numbers before cardinal (e.g. "one million" → "1 million")
- Time tagger rejects hours 10-19 without AM/PM to avoid year conflicts
- "o clock" (with space) not matched by time tagger
- Measure tagger adds space before % ("50 %" not "50%")
- Sentence ITN consumes "and" between number groups
- Adjacent number words combined greedily in sentence mode
- Comma-attached tokens break sentence-mode date matching
Rename internal modules from academic naming (ITN/TN) to consumer-
oriented naming (ASR/TTS) to match FluidAudio conventions.

Fix three bugs flagged in PR review:
- Year verbalization: X01-X09 years now produce "oh" form
  (1901 → "nineteen oh one" instead of "nineteen one")
- URL parser: case-insensitive protocol detection now matches
  case-insensitive prefix stripping (HTTP:// works correctly)
- number_to_words: i64::MIN no longer panics on negation overflow
  (uses wrapping_neg + u64 instead of -n)
@Alex-Wengg Alex-Wengg force-pushed the test/extensive-edge-case-tests branch from 9123b7f to 4afad25 Compare March 12, 2026 10:11
@Alex-Wengg
Copy link
Member Author

@devin

@Alex-Wengg Alex-Wengg merged commit 38a0acb into main Mar 12, 2026
2 checks passed
Copy link

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

View 12 additional findings in Devin Review.

Open in Devin Review

Comment on lines +151 to +158
let cents: i64 = if cents_str.len() == 1 {
cents_str.parse::<i64>().ok()? * 10
} else if cents_str.len() == 2 {
cents_str.parse().ok()?
} else {
// More than 2 decimal places — take first 2
cents_str[..2].parse().ok()?
};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Panic on currency input with trailing decimal point (e.g., "$5.")

In parse_dollars_cents, when the amount string ends with a dot (e.g., "$5."), splitn(2, '.') produces ["5", ""], making cents_str an empty string (len=0). The three-branch if/else if/else at lines 151-158 doesn't handle len=0: it falls through to the else branch which assumes cents_str.len() >= 3 and executes cents_str[..2], causing a panic (index out of bounds). This is reachable from the public tn_normalize("$5.") API — the money tagger's validation at src/tts/money.rs:90-96 allows dots, so "5." passes through to parse_dollars_cents. Compare with src/tts/decimal.rs:53 which correctly guards against empty fractional parts with if frac_str.is_empty() { return None; }.

Suggested change
let cents: i64 = if cents_str.len() == 1 {
cents_str.parse::<i64>().ok()? * 10
} else if cents_str.len() == 2 {
cents_str.parse().ok()?
} else {
// More than 2 decimal places — take first 2
cents_str[..2].parse().ok()?
};
let cents: i64 = if cents_str.is_empty() {
0
} else if cents_str.len() == 1 {
cents_str.parse::<i64>().ok()? * 10
} else if cents_str.len() == 2 {
cents_str.parse().ok()?
} else {
// More than 2 decimal places — take first 2
cents_str[..2].parse().ok()?
};
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

if clean.contains('.') {
let parts: Vec<&str> = clean.splitn(2, '.').collect();
if parts.len() == 2 {
let int_val: i64 = parts[0].parse().ok()?;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 ? operator inside for loop causes premature return from entire function instead of continuing to next unit match

In the measure TN tagger's parse() function, the ? operator on lines 155 and 169 propagates None out of the entire parse() function when a number fails to parse as i64. Since these are inside a for loop iterating over candidate unit matches, this aborts the loop prematurely instead of continue-ing to the next unit. For example, if a number like .5 (empty integer part) is matched against %, the "".parse::<i64>() fails and the function returns None immediately, without trying any remaining units. While the practical impact is limited (other units for the same input usually have the same number part), this violates the intended control flow of trying all units before giving up.

Prompt for agents
In src/tts/measure.rs, replace the two uses of the ? operator inside the for loop with continue statements to avoid prematurely returning None from the entire parse() function.

Line 155: Change `let int_val: i64 = parts[0].parse().ok()?;` to use a match or if-let that calls continue on parse failure:
  let Ok(int_val) = parts[0].parse::<i64>() else { continue; };

Line 169: Change `let n: i64 = clean.parse().ok()?;` similarly:
  let Ok(n) = clean.parse::<i64>() else { continue; };

This ensures that if a number fails to parse for one unit match, the loop continues trying other unit matches.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants