Release v1.19.2: Enhanced character encoding and spacing #293

Merged
paxcalpt merged 4 commits into main from release/v1.19.2
Feb 5, 2026

Conversation

paxcalpt (Contributor) commented Feb 5, 2026

Release v1.19.2

Fixes

bioRxiv Character Encoding - Named HTML Entities

  • Enhanced to use named HTML entities instead of numeric references
  • Added 30+ extended entity mappings: č → &ccaron;, ū → &umacr;, ė → &edot;
  • Covers Lithuanian, Polish, Turkish, Romanian, and other Eastern European alphabets
  • Example: "Vaitkevičiūtė" → "Vaitkevi&ccaron;i&umacr;t&edot;"
  • Better compatibility with bioRxiv TSV import system
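
The encoding behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `EXTENDED_ENTITIES` is a hypothetical three-entry subset of the real mapping, and `encode_named_entities` is an assumed name chosen for this example. It relies only on the standard-library `html.entities.name2codepoint` table, which covers HTML 4 names and therefore needs the extended mapping for characters such as č.

```python
import html
import html.entities

# Hypothetical subset of the extended mapping (the release adds 30+ entries).
EXTENDED_ENTITIES = {"č": "ccaron", "ū": "umacr", "ė": "edot"}


def build_char_map(extended):
    """Map each non-ASCII character to a named entity, extended names first."""
    char_map = {ch: f"&{name};" for ch, name in extended.items()}
    # Fill in the standard HTML 4 names without overriding extended ones.
    for name, codepoint in html.entities.name2codepoint.items():
        ch = chr(codepoint)
        if codepoint > 127 and ch not in char_map:
            char_map[ch] = f"&{name};"
    return char_map


CHAR_MAP = build_char_map(EXTENDED_ENTITIES)


def encode_named_entities(text):
    """Leave ASCII alone, prefer named entities, else use a numeric reference."""
    if not text:
        return text
    return "".join(
        ch if ord(ch) <= 127 else CHAR_MAP.get(ch, f"&#{ord(ch)};")
        for ch in text
    )


print(encode_named_entities("Vaitkevičiūtė"))  # Vaitkevi&ccaron;i&umacr;t&edot;
```

Decoding the result with `html.unescape()` returns the original string, which gives a quick way to check that a mapping is lossless.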

LaTeX Section Spacing - Increased Visibility

  • Increased spacing from \enskip (0.5em) to \quad (1em)
  • Provides better visual separation after run-in section titles
  • Addresses feedback that spacing was present but too tight

Testing

  • ✅ All 23 bioRxiv tests pass with new named entities
  • ✅ PDF generation verified with manuscript-rxiv-maker
  • ✅ Visual confirmation of improved spacing

Issues Fixed

Addresses character encoding and spacing feedback from Guillaume Jacquemet


🤖 Generated with Claude Code

paxcalpt and others added 4 commits February 5, 2026 18:21
Enhanced encode_html_entities() to use extended HTML entity names
(e.g., &ccaron;, &umacr;, &edot;) instead of numeric references
for Lithuanian and Eastern European characters.

Changes:
- Added extended_entities dictionary with 30+ character mappings
- Lithuanian: č → &ccaron;, ū → &umacr;, ė → &edot;
- Other Eastern European: š → &scaron;, ž → &zcaron;, etc.
- Updated tests to expect named entities

Example: "Vaitkevičiūtė" → "Vaitkevi&ccaron;i&umacr;t&edot;"

Addresses feedback from Guillaume Jacquemet

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Enhanced bioRxiv character encoding with named HTML entities
- Supports Lithuanian and Eastern European characters properly

Changed from \enskip (0.5em) to \quad (1em) for more visible spacing
after runin subsubsection titles. Addresses feedback that spacing was
present but still too tight.

Before: \enskip = 0.5em spacing
After:  \quad = 1em spacing (2x wider)

Addresses feedback from Guillaume Jacquemet

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 5, 2026 18:23

github-actions bot commented Feb 5, 2026

Code Review: Release v1.19.2

I've reviewed PR #293 and overall the changes look solid. Here's my detailed feedback:

Strengths

1. Enhanced HTML Entity Encoding

  • The addition of named HTML entities (e.g., &ccaron;, &umacr;, &edot;) instead of numeric references is excellent for compatibility
  • Comprehensive coverage of 30+ Eastern European characters (Lithuanian, Polish, Turkish, Romanian, etc.)
  • Well-documented with clear examples in docstrings
  • Proper priority handling: extended entities are added first, preventing override by standard entities

2. Test Coverage

  • Tests updated to match new named entity format
  • All 23 bioRxiv tests mentioned as passing
  • Good test coverage for edge cases (empty strings, None values, complex text)

3. LaTeX Spacing Improvement

  • Simple, focused change from \enskip to \quad (0.5em → 1em)
  • Addresses user feedback about tight spacing
  • Consistent with other spacing in the style file

4. Documentation

  • Changelog properly formatted and detailed
  • Clear commit messages with co-authorship attribution
  • Examples in docstrings updated to reflect new behavior

🔍 Code Quality Observations

1. Performance Consideration
The encode_html_entities() function rebuilds the extended_entities and char_to_entity dictionaries on every call. For better performance, consider moving these to module-level constants:

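# At module level (after imports)
import html.entities

# Names missing from html.entities.name2codepoint, which covers HTML 4 only
_EXTENDED_ENTITIES = {
    "č": "ccaron",
    "Č": "Ccaron",
    # ... rest of mappings
}

# Build the full char-to-entity map once at module load
_CHAR_TO_ENTITY = {}
for char, entity_name in _EXTENDED_ENTITIES.items():
    _CHAR_TO_ENTITY[char] = f"&{entity_name};"
for entity_name, codepoint in html.entities.name2codepoint.items():
    char = chr(codepoint)
    if codepoint > 127 and char not in _CHAR_TO_ENTITY:
        _CHAR_TO_ENTITY[char] = f"&{entity_name};"

def encode_html_entities(text: str) -> str:
    # Sketch: use the pre-built _CHAR_TO_ENTITY instead of rebuilding it per call
    if not text:
        return text
    return "".join(
        char if ord(char) <= 127 else _CHAR_TO_ENTITY.get(char, f"&#{ord(char)};")
        for char in text
    )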

This would avoid rebuilding ~200+ dictionary entries for every author name processed.

2. Memory Efficiency
Using "".join(result) is good, but you could use a generator expression with "".join() to avoid the intermediate list:

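def _encode_char(char):
    # ASCII passes through, known characters get a named entity,
    # and anything else falls back to a numeric reference
    char_code = ord(char)
    if char_code <= 127:
        return char
    if char in _CHAR_TO_ENTITY:
        return _CHAR_TO_ENTITY[char]
    return f"&#{char_code};"

# Inside encode_html_entities():
return "".join(_encode_char(char) for char in text)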

3. Type Hints
The function handles None input gracefully (lines 46-47), but the type hint says text: str. Consider:

def encode_html_entities(text: str | None) -> str | None:

🔒 Security Analysis

Good security practices observed:

  • ✅ Proper HTML entity encoding prevents XSS in TSV output
  • ✅ No dynamic code execution or injection vulnerabilities
  • ✅ Input validation exists in validate_author_data()
  • ✅ Uses safe CSV writer with proper escaping

No security concerns identified.


🧪 Testing Recommendations

  1. Edge Cases to Consider:

    • Mixed case preservation: encode_html_entities("VAITKEVIČIŪTĖ") should use uppercase entities
    • Emoji or unusual Unicode (should fall back to numeric entities)
    • Very long strings with many special characters (performance test)
  2. Integration Testing:

    • Verify actual bioRxiv TSV import accepts the named entities
    • Test with real-world author names from various languages
  3. Regression Testing:

    • Ensure previously working characters (Portuguese, Spanish, Nordic) still work
    • The tests already cover this, which is excellent ✅
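
The edge cases recommended above could be written as plain test functions along these lines. This is only a sketch: `encode_stub` is a hypothetical stand-in for the project's `encode_html_entities()`, and `_NAMED` is an assumed six-entry mapping used to keep the example self-contained. In the real test suite the project function would be imported instead.

```python
import html

# Hypothetical stand-in mapping; the real dictionary has 30+ entries.
_NAMED = {
    "č": "ccaron", "Č": "Ccaron",
    "ū": "umacr", "Ū": "Umacr",
    "ė": "edot", "Ė": "Edot",
}


def encode_stub(text):
    """Stand-in encoder: named entity if known, numeric reference otherwise."""
    return "".join(
        ch if ord(ch) <= 127
        else f"&{_NAMED[ch]};" if ch in _NAMED
        else f"&#{ord(ch)};"
        for ch in text
    )


def test_uppercase_uses_uppercase_entities():
    # Entity names are case-sensitive, so Č must not become &ccaron;
    assert encode_stub("VAITKEVIČIŪTĖ") == "VAITKEVI&Ccaron;I&Umacr;T&Edot;"


def test_emoji_falls_back_to_numeric_reference():
    # U+1F600 has no named entity, so it is encoded by code point
    assert encode_stub("😀") == "&#128512;"


def test_round_trip_is_lossless():
    for name in ("Vaitkevičiūtė", "VAITKEVIČIŪTĖ", "plain ASCII", "😀"):
        assert html.unescape(encode_stub(name)) == name
```

The round-trip test is the most general of the three: it catches a wrong or misspelled entity name without hard-coding the expected output for each input.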

📝 Minor Suggestions

  1. Dependency Management (uv.lock)

    • The uv.lock change adds PyPDF2 3.0.1, which appears to already be listed in pyproject.toml
    • This is expected behavior for lock file updates
    • ⚠️ Note: PyPDF2 is deprecated in favor of pypdf (already in dependencies). Consider if PyPDF2 is still needed in the dev dependencies.
  2. Changelog Format

    • Perfectly follows the required format (### Fixed section, proper version header)
    • Breaking change markers not needed here (no breaking changes) ✅
  3. LaTeX Style File

    • The change is minimal and focused
    • Consider documenting in comments why \quad was chosen over other spacing options

🎯 Verdict

Recommend: APPROVE with optional performance enhancements

The PR addresses real user feedback, improves bioRxiv compatibility, and maintains good test coverage. The code is well-documented and follows project conventions.

The performance optimizations suggested above are optional - the current implementation is correct and will work fine for typical manuscript author lists (5-20 authors). Only optimize if profiling shows this as a bottleneck.


📋 Pre-merge Checklist

Before merging, verify:

  • All tests pass locally and in CI
  • PDF generation tested with ../manuscript-rxiv-maker
  • Visual confirmation of spacing improvement in generated PDF
  • Homebrew formula will be updated after PyPI release (as per CLAUDE.md)

Great work addressing the user feedback! 🎉

Copilot AI left a comment

Pull request overview

This PR enhances character encoding for bioRxiv submissions and improves LaTeX section spacing. The changes move from numeric HTML character references to named HTML entities for better compatibility with bioRxiv's TSV import system.

Changes:

  • Added 30+ extended HTML entity mappings (Lithuanian, Polish, Turkish, Romanian characters)
  • Increased LaTeX subsubsection spacing from \enskip (0.5em) to \quad (1em)
  • Updated tests to validate named entity encoding instead of numeric references

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| uv.lock | Added PyPDF2 3.0.1 to dev dependencies; lock file revision changed |
| tests/unit/test_prepare_biorxiv.py | Updated test assertions to expect named entities instead of numeric references |
| src/tex/style/rxiv_maker_style.cls | Increased subsubsection spacing from \enskip to \quad |
| src/rxiv_maker/engines/operations/prepare_biorxiv.py | Added extended_entities dictionary with 30+ character mappings to named HTML entities |
| src/rxiv_maker/version.py | Version bumped to 1.19.2 |
| CHANGELOG.md | Added v1.19.2 release notes documenting both changes |

Comment on lines +51 to +104
extended_entities = {
    # Lithuanian and Eastern European
    "č": "ccaron",
    "Č": "Ccaron",  # c with caron
    "ė": "edot",
    "Ė": "Edot",  # e with dot above
    "ū": "umacr",
    "Ū": "Umacr",  # u with macron
    "ā": "amacr",
    "Ā": "Amacr",  # a with macron
    "ē": "emacr",
    "Ē": "Emacr",  # e with macron
    "ī": "imacr",
    "Ī": "Imacr",  # i with macron
    "ō": "omacr",
    "Ō": "Omacr",  # o with macron
    # Other common extended entities
    "ă": "abreve",
    "Ă": "Abreve",  # a with breve
    "ą": "aogon",
    "Ą": "Aogon",  # a with ogonek
    "ć": "cacute",
    "Ć": "Cacute",  # c with acute
    "ę": "eogon",
    "Ę": "Eogon",  # e with ogonek
    "ğ": "gbreve",
    "Ğ": "Gbreve",  # g with breve
    "İ": "Idot",  # I with dot above
    "ı": "inodot",  # i without dot
    "ł": "lstrok",
    "Ł": "Lstrok",  # l with stroke
    "ń": "nacute",
    "Ń": "Nacute",  # n with acute
    "œ": "oelig",
    "Œ": "OElig",  # oe ligature
    "ř": "rcaron",
    "Ř": "Rcaron",  # r with caron
    "ś": "sacute",
    "Ś": "Sacute",  # s with acute
    "š": "scaron",
    "Š": "Scaron",  # s with caron
    "ş": "scedil",
    "Ş": "Scedil",  # s with cedilla
    "ţ": "tcedil",
    "Ţ": "Tcedil",  # t with cedilla
    "ů": "uring",
    "Ů": "Uring",  # u with ring
    "ź": "zacute",
    "Ź": "Zacute",  # z with acute
    "ż": "zdot",
    "Ż": "Zdot",  # z with dot above
    "ž": "zcaron",
    "Ž": "Zcaron",  # z with caron
}

Copilot AI Feb 5, 2026

The extended entities dictionary includes uppercase character mappings (e.g., "Č" → "Ccaron", "Ė" → "Edot", "Ū" → "Umacr") but there are no test cases validating these uppercase entities. Consider adding test coverage for uppercase extended characters to ensure they encode correctly, especially since HTML entity names are case-sensitive.
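
One way to cover the uppercase entries without writing a test per character is to validate the mapping itself against Python's HTML5 entity table. The sketch below is an assumption about how such a check might look, not code from this PR: `SAMPLE` is a subset copied from the dictionary quoted above, and `check_mapping` is a hypothetical helper name. It uses `html.entities.html5`, whose keys are entity names with a trailing semicolon.

```python
from html.entities import html5

# Subset of the mapping under review, including uppercase and dotted/dotless i.
SAMPLE = {
    "č": "ccaron", "Č": "Ccaron",
    "ė": "edot", "Ė": "Edot",
    "ū": "umacr", "Ū": "Umacr",
    "İ": "Idot", "ı": "inodot",
}


def check_mapping(mapping):
    """Return (char, name, resolved) for every entry whose name does not
    resolve to exactly that character in the HTML5 entity table."""
    problems = []
    for ch, name in mapping.items():
        resolved = html5.get(f"{name};")  # None if the name is unknown
        if resolved != ch:
            problems.append((ch, name, resolved))
    return problems


# Every sampled entry resolves to its own character.
assert check_mapping(SAMPLE) == []

# A wrong-case name is caught: &ccaron; is the lowercase letter.
assert check_mapping({"Č": "ccaron"}) == [("Č", "ccaron", "č")]
```

Running the same check over the full `extended_entities` dictionary would flag both misspelled names and upper/lowercase mix-ups in one pass.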

@paxcalpt paxcalpt merged commit 368f27c into main Feb 5, 2026
15 of 16 checks passed
@paxcalpt paxcalpt deleted the release/v1.19.2 branch February 5, 2026 18:32