Fix parsing of citations with Markdown-escaped underscores #52

Merged
bensonwong merged 2 commits into main from 26f6-fix-parsing-bug
Jan 19, 2026
@bensonwong bensonwong commented Jan 19, 2026

Summary

Fixes citation parsing when attribute names and values contain backslash-escaped underscores (e.g., attachment\_id instead of attachment_id). This occurs when LLM output passes through Markdown processing that escapes underscores.

Problem

Citations were not being parsed correctly when the output contained escaped underscores like:

<cite attachment\_id='D8bv8mItwv6VOmIBo2nr' full\_phrase='...' start\_page\_key='page\_number\_1\_index\_0' line\_ids='7-8' />

This is a common artifact when LLM responses are processed through Markdown renderers that escape underscores to prevent them from being interpreted as italic markers.

Solution

Added a normalization step in normalizeCitationContent() that unescapes all backslash-escaped underscores (\_ → _) before any other citation processing occurs.

Changes

  • src/parsing/normalizeCitation.ts: Added underscore unescaping as the first normalization step
  • src/tests/normalizeCitation.test.ts: Added tests for escaped underscore handling in attribute names and values
  • src/tests/parseCitation.test.ts: Added end-to-end tests for extracting citations with escaped underscores

Test Plan

  • Added unit tests for normalizeCitations() with escaped underscores
  • Added integration tests for getAllCitationsFromLlmOutput() with escaped underscores
  • All existing tests pass (157 tests in citation parsing modules)

🤖 Generated with Claude Code

## Summary

**Problem:** Citations weren't being parsed when attribute names contained backslash-escaped underscores (e.g., `attachment\_id` instead of `attachment_id`). This happens when LLM output goes through Markdown processing that escapes underscores.

**Example of problematic input:**
```
<cite attachment\_id='D8bv8mItwv6VOmIBo2nr' full\_phrase='...' key\_span='...' start\_page\_key='page\_number\_1\_index\_0' line\_ids='7-8' />
```

**Fix:** Added a normalization step in `normalizeCitationContent()` at `src/parsing/normalizeCitation.ts:220` that unescapes all backslash-escaped underscores (`\_` → `_`) before any other processing:

```typescript
normalized = normalized.replace(/\\_/g, "_");
```
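
To illustrate why this step has to run first, here is a minimal, self-contained sketch. The names `unescapeUnderscores` and `extractAttributes` are hypothetical helpers written for this example and are not the library's actual functions; only the `replace(/\\_/g, "_")` call is taken from the fix above.

```typescript
// Hypothetical helper wrapping the one-line fix; not the library's real API.
function unescapeUnderscores(input: string): string {
  // Turn every backslash-underscore pair into a bare underscore.
  return input.replace(/\\_/g, "_");
}

// Hypothetical, deliberately simple attribute extractor used only to show the effect.
// It assumes single-quoted values with no embedded quotes.
function extractAttributes(tag: string): Map<string, string> {
  const attrs = new Map<string, string>();
  const pattern = /([A-Za-z_]+)='([^']*)'/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(tag)) !== null) {
    attrs.set(match[1], match[2]);
  }
  return attrs;
}

// The escaped form, as it might arrive after Markdown processing.
const escapedTag =
  "<cite attachment\\_id='D8bv8mItwv6VOmIBo2nr' start\\_page\\_key='page\\_number\\_1\\_index\\_0' line\\_ids='7-8' />";

// Without normalization, the backslash splits the name, so "attachment_id" is never found.
const withoutFix = extractAttributes(escapedTag);

// With normalization applied first, names and values come out clean.
const withFix = extractAttributes(unescapeUnderscores(escapedTag));

console.log(withoutFix.has("attachment_id"), withFix.get("attachment_id"));
```

In this sketch the un-normalized tag yields a truncated key (`_id`) because the backslash is not a valid name character, which mirrors the kind of failure described above.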

**Tests added:**
1. In `normalizeCitation.test.ts`: Two tests verifying that escaped underscores in attribute names and values are properly unescaped
2. In `parseCitation.test.ts`: Two tests verifying end-to-end extraction of citations with escaped underscores
@bensonwong bensonwong changed the title from "fix parsing bug (vibe-kanban)" to "Fix parsing of citations with Markdown-escaped underscores" Jan 19, 2026
@bensonwong bensonwong merged commit 1a5d58b into main Jan 19, 2026
1 check passed
@bensonwong bensonwong deleted the 26f6-fix-parsing-bug branch January 19, 2026 06:53
bensonwong added a commit that referenced this pull request Feb 15, 2026
Addresses GitHub CodeQL security alerts:

Prototype Pollution (alert #52):
- Add isSafeKey() validation for attachmentId in groupCitationsByAttachmentIdObject()
- Prevents __proto__ pollution via malicious attachmentId values

Remote Property Injection (alert #46 - false positive):
- Restructure expandCompactKeys() to make safety checks more explicit
- Add continue statement to clarify control flow for static analysis

Incomplete String Escaping (alerts #31-32):
- Fix quote normalization in normalizeCitation.ts
- Escape backslashes before processing quotes to prevent injection

Log Injection (alert #49):
- Add sanitizeForLog() to example app chat route
- Prevents log injection via user-controlled provider field

User-Controlled Bypass (alert #48):
- Add suppression comment with justification
- Intentional feature: allows citation extraction without verification

All changes maintain backward compatibility and pass type checking.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
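
The prototype-pollution guard described in this commit message can be sketched roughly as follows. This is an assumption-laden illustration: the real `isSafeKey()` and `groupCitationsByAttachmentIdObject()` are not shown in this conversation, so the `Citation` shape, the blocked-key list, and `groupByAttachmentId` below are all hypothetical.

```typescript
// Hypothetical citation shape for this example only.
interface Citation {
  attachmentId: string;
  fullPhrase: string;
}

// Assumed set of keys that must never be used as plain-object property names.
const UNSAFE_KEYS = new Set(["__proto__", "constructor", "prototype"]);

// Hypothetical stand-in for the isSafeKey() named in the commit message.
function isSafeKey(key: string): boolean {
  return !UNSAFE_KEYS.has(key);
}

// Groups citations into a plain object keyed by attachmentId, skipping unsafe keys.
function groupByAttachmentId(citations: Citation[]): Record<string, Citation[]> {
  const groups: Record<string, Citation[]> = {};
  for (const citation of citations) {
    const id = citation.attachmentId;
    if (!isSafeKey(id)) {
      continue; // Drop entries whose key could reach Object.prototype.
    }
    // Only treat the key as present if it is an own property of the result object.
    if (Object.prototype.hasOwnProperty.call(groups, id)) {
      groups[id].push(citation);
    } else {
      groups[id] = [citation];
    }
  }
  return groups;
}

const grouped = groupByAttachmentId([
  { attachmentId: "__proto__", fullPhrase: "malicious" },
  { attachmentId: "doc1", fullPhrase: "first" },
  { attachmentId: "doc1", fullPhrase: "second" },
]);
console.log(Object.keys(grouped));
```

An alternative design is to build the result in a `Map` or an object created with `Object.create(null)`, which avoids inherited keys entirely; the key check is useful when a plain object has to be returned.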
bensonwong added a commit that referenced this pull request Feb 15, 2026
* docs: complete security migration assessment and mark all items done

Updated SECURITY_MIGRATION.md with comprehensive assessment:
- ✅ Prototype pollution prevention (already implemented)
- ✅ URL domain verification (already implemented)
- ✅ ReDoS risk assessment (complete - no action needed)

After thorough code review, ReDoS protection wrappers are not required
because:
1. All regex operations process structured LLM output/cite tags with
   natural length constraints
2. No catastrophic backtracking patterns present in codebase regexes
3. Input format is controlled, not arbitrary user text

Added clear status indicators and file-by-file analysis. Document can
now be archived or removed as all migration tasks are complete.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* security: fix CodeQL alerts for prototype pollution and log injection

Addresses GitHub CodeQL security alerts:

Prototype Pollution (alert #52):
- Add isSafeKey() validation for attachmentId in groupCitationsByAttachmentIdObject()
- Prevents __proto__ pollution via malicious attachmentId values

Remote Property Injection (alert #46 - false positive):
- Restructure expandCompactKeys() to make safety checks more explicit
- Add continue statement to clarify control flow for static analysis

Incomplete String Escaping (alerts #31-32):
- Fix quote normalization in normalizeCitation.ts
- Escape backslashes before processing quotes to prevent injection

Log Injection (alert #49):
- Add sanitizeForLog() to example app chat route
- Prevents log injection via user-controlled provider field

User-Controlled Bypass (alert #48):
- Add suppression comment with justification
- Intentional feature: allows citation extraction without verification

All changes maintain backward compatibility and pass type checking.

* docs: remove SECURITY_MIGRATION.md after completing all tasks

All security migration items have been completed:
✅ Prototype pollution prevention - implemented and fixed
✅ URL domain verification - implemented
✅ ReDoS risk assessment - complete (no action needed)
✅ Log injection - fixed in example app
✅ Incomplete string escaping - fixed

Security utilities (objectSafety, urlSafety, regexSafety, logSafety)
are now documented in their respective source files and exported
from the main package.

The migration phase is complete.

* security: fix incomplete string escaping alert with suppression

CodeQL alert #31-32 flagged quote normalization as incomplete because
backslashes weren't being escaped. However, this is intentional:

- Backslashes are used for escape sequences (\n, \', \") in cite tags
- These sequences are properly handled downstream in parseCitation.ts
- Escaping backslashes would break this intentional functionality
- Tests verify that \n is correctly converted to spaces

Added lgtm suppressions with detailed justification explaining why
this is safe and intentional behavior.

Fixes test: "parses citation with literal newlines (\n) in full_phrase"

* security: add suppressions for CodeQL false positives

CodeQL is flagging code that is already protected by isSafeKey() checks
as vulnerable. These are false positives because:

1. citationParser.ts line 94: fullKey is checked by isSafeKey() on line 79,
   and unsafe keys trigger continue on line 80, so line 94 is never reached
   with an unsafe key

2. parseCitation.ts line 700-704: Both attachmentId and key are validated
   by isSafeKey() on line 696, with continue on line 697 for unsafe values

3. chat/route.ts line 29: Already uses sanitizeForLog() to prevent log
   injection (CodeQL may be scanning an earlier commit)

Added lgtm[] suppression comments with detailed justifications explaining
why these are false positives and the code is secure.
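
For readers unfamiliar with the log-injection concern mentioned in item 3, a sanitizer of this kind might look like the sketch below. The real `sanitizeForLog()` is not shown here, so this body, the `maxLength` parameter, and its default are assumptions made for illustration.

```typescript
// Hypothetical sketch of a log sanitizer; the project's actual sanitizeForLog() may differ.
function sanitizeForLog(value: unknown, maxLength = 200): string {
  return String(value)
    // Replace ASCII control characters (including CR and LF) so that
    // user input cannot start a forged log line.
    .replace(/[\u0000-\u001f\u007f]/g, " ")
    // Cap the length so one field cannot flood the log.
    .slice(0, maxLength);
}

// A user-controlled field containing an embedded newline and a forged entry.
const provider = "openai\nINFO: admin login succeeded";
console.log(`provider=${sanitizeForLog(provider)}`);
```

Because the newline is replaced with a space, the forged text stays on the same log line as the field it came from.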