test: add 140 extensive edge-case tests across all modules by Alex-Wengg · Pull Request #4 · FluidInference/text-processing-rs

Alex-Wengg · 2026-03-12T09:47:19Z

Summary

Adds 140 new tests covering ITN, TN, sentence-level processing, roundtrip consistency, boundary conditions, cross-tagger interference, and realistic TTS preprocessing scenarios
Comprehensive date testing per croqueteer's flag: all 31 days, all 12 months, numeric formats (slash/dash), decades, year verbalization (1776-2025), ordinal suffixes, trailing punctuation, invalid inputs
Documents 7 known issues found during testing (see below)

Known Issues Documented

#	Issue	Root Cause	Impact
1	`normalize("one million")` → `"1 million"` not `"1000000"`	Decimal tagger's scale handling intercepts before cardinal	Scale numbers (million/billion/trillion) not fully expanded
2	`normalize("twelve fifteen")` → `"1215"` not `"12:15"`	Time tagger rejects hours 10-19 without AM/PM to avoid year conflicts	Ambiguous hour-minute without AM/PM falls to cardinal
3	`normalize("two o clock")` → passthrough	Time tagger expects `"o'clock"` or `"oclock"`, not `"o clock"`	ASR output may produce space-separated form
4	`normalize("fifty percent")` → `"50 %"` not `"50%"`	Measure tagger format: `"{number} {unit}"`	Minor formatting
5	Sentence ITN drops "and" between number groups	Cardinal's `words_to_number` filters "and" as separator	"21 apples and 32 oranges" → "21 apples 32 oranges"
6	Adjacent number words combined greedily	Sliding window longest-match: "twenty one forty two" → 2043	Inherent to greedy algorithm
7	Comma-attached tokens break sentence date matching	Tokenization splits "5," as one token	"January 5, 2025" partially matched in sentence mode

Test plan

All 140 new tests pass
All 217 existing tests still pass (143 unit + 57 ITN + 12 TN + 5 doc)
Total: 357 tests, 0 failures

Covers ITN and TN directions, sentence-level processing, roundtrip consistency, boundary conditions, cross-tagger interference, and realistic TTS preprocessing scenarios. Includes comprehensive date testing (all 31 days, all 12 months, numeric formats, decades, year verbalization). Documents 7 known issues found during testing: - Decimal tagger intercepts scale numbers before cardinal (e.g. "one million" → "1 million") - Time tagger rejects hours 10-19 without AM/PM to avoid year conflicts - "o clock" (with space) not matched by time tagger - Measure tagger adds space before % ("50 %" not "50%") - Sentence ITN consumes "and" between number groups - Adjacent number words combined greedily in sentence mode - Comma-attached tokens break sentence-mode date matching

Rename internal modules from academic naming (ITN/TN) to consumer- oriented naming (ASR/TTS) to match FluidAudio conventions. Fix three bugs flagged in PR review: - Year verbalization: X01-X09 years now produce "oh" form (1901 → "nineteen oh one" instead of "nineteen one") - URL parser: case-insensitive protocol detection now matches case-insensitive prefix stripping (HTTP:// works correctly) - number_to_words: i64::MIN no longer panics on negation overflow (uses wrapping_neg + u64 instead of -n)

Alex-Wengg · 2026-03-12T10:13:52Z

@devin

devin-ai-integration

Devin Review found 2 new potential issues.

View 12 additional findings in Devin Review.

devin-ai-integration · 2026-03-12T10:22:47Z

src/tts/money.rs

+    let cents: i64 = if cents_str.len() == 1 {
+        cents_str.parse::<i64>().ok()? * 10
+    } else if cents_str.len() == 2 {
+        cents_str.parse().ok()?
+    } else {
+        // More than 2 decimal places — take first 2
+        cents_str[..2].parse().ok()?
+    };


🔴 Panic on currency input with trailing decimal point (e.g., "$5.")

In parse_dollars_cents, when the amount string ends with a dot (e.g., "$5."), splitn(2, '.') produces ["5", ""], making cents_str an empty string (len=0). The three-branch if/else if/else at lines 151-158 doesn't handle len=0: it falls through to the else branch which assumes cents_str.len() >= 3 and executes cents_str[..2], causing a panic (index out of bounds). This is reachable from the public tn_normalize("$5.") API — the money tagger's validation at src/tts/money.rs:90-96 allows dots, so "5." passes through to parse_dollars_cents. Compare with src/tts/decimal.rs:53 which correctly guards against empty fractional parts with if frac_str.is_empty() { return None; }.

Suggested change

let cents: i64 = if cents_str.len() == 1 {

cents_str.parse::<i64>().ok()? * 10

} else if cents_str.len() == 2 {

cents_str.parse().ok()?

} else {

// More than 2 decimal places — take first 2

cents_str[..2].parse().ok()?

};

let cents: i64 = if cents_str.is_empty() {

0

} else if cents_str.len() == 1 {

cents_str.parse::<i64>().ok()? * 10

} else if cents_str.len() == 2 {

cents_str.parse().ok()?

} else {

// More than 2 decimal places — take first 2

cents_str[..2].parse().ok()?

};

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-03-12T10:22:48Z

src/tts/measure.rs

+        if clean.contains('.') {
+            let parts: Vec<&str> = clean.splitn(2, '.').collect();
+            if parts.len() == 2 {
+                let int_val: i64 = parts[0].parse().ok()?;


🟡 ? operator inside for loop causes premature return from entire function instead of continuing to next unit match

In the measure TN tagger's parse() function, the ? operator on lines 155 and 169 propagates None out of the entire parse() function when a number fails to parse as i64. Since these are inside a for loop iterating over candidate unit matches, this aborts the loop prematurely instead of continue-ing to the next unit. For example, if a number like .5 (empty integer part) is matched against %, the "".parse::<i64>() fails and the function returns None immediately, without trying any remaining units. While the practical impact is limited (other units for the same input usually have the same number part), this violates the intended control flow of trying all units before giving up.

Prompt for agents

In src/tts/measure.rs, replace the two uses of the ? operator inside the for loop with continue statements to avoid prematurely returning None from the entire parse() function. Line 155: Change `let int_val: i64 = parts[0].parse().ok()?;` to use a match or if-let that calls continue on parse failure: let Ok(int_val) = parts[0].parse::<i64>() else { continue; }; Line 169: Change `let n: i64 = clean.parse().ok()?;` similarly: let Ok(n) = clean.parse::<i64>() else { continue; }; This ensures that if a number fails to parse for one unit match, the loop continues trying other unit matches.

Was this helpful? React with 👍 or 👎 to provide feedback.

Alex-Wengg force-pushed the test/extensive-edge-case-tests branch from 39d5e30 to 35a0771 Compare March 12, 2026 09:50

This comment was marked as resolved.

Sign in to view

smdesai and others added 7 commits March 12, 2026 06:11

add written to spoken support

4801c3a

updated README

1119737

add apis

45a1d98

correct tagging on date and money

d200582

style: run cargo fmt

4afad25

Alex-Wengg force-pushed the test/extensive-edge-case-tests branch from 9123b7f to 4afad25 Compare March 12, 2026 10:11

Alex-Wengg merged commit 38a0acb into main Mar 12, 2026
2 checks passed

devin-ai-integration bot reviewed Mar 12, 2026

View reviewed changes

Alex-Wengg mentioned this pull request Mar 12, 2026

fix: currency trailing dot panic and measure parser premature exit #6

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add 140 extensive edge-case tests across all modules#4

test: add 140 extensive edge-case tests across all modules#4
Alex-Wengg merged 7 commits intomainfrom
test/extensive-edge-case-tests

Alex-Wengg commented Mar 12, 2026 •

edited by devin-ai-integration bot

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

Alex-Wengg commented Mar 12, 2026

Uh oh!

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

devin-ai-integration bot Mar 12, 2026

Uh oh!

devin-ai-integration bot Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Alex-Wengg commented Mar 12, 2026 • edited by devin-ai-integration bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Known Issues Documented

Test plan

Uh oh!

This comment was marked as resolved.

Uh oh!

Alex-Wengg commented Mar 12, 2026

Uh oh!

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Alex-Wengg commented Mar 12, 2026 •

edited by devin-ai-integration bot

Loading