Release v1.19.2: Enhanced character encoding and spacing by paxcalpt · Pull Request #293 · HenriquesLab/rxiv-maker

paxcalpt · 2026-02-05T18:23:13Z

Release v1.19.2

Fixes

bioRxiv Character Encoding - Named HTML Entities

Enhanced to use named HTML entities instead of numeric references
Added 30+ extended entity mappings: č → &ccaron;, ū → &umacr;, ė → &edot;
Covers Lithuanian, Polish, Turkish, Romanian, and other Eastern European alphabets
Example: "Vaitkevičiūtė" → "Vaitkevi&ccaron;i&umacr;t&edot;"
Better compatibility with bioRxiv TSV import system
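
The mapping described in this list can be sketched as follows. This is a minimal illustration rather than the project's actual encode_html_entities(): the helper name encode_named and the three-entry table are assumptions that cover only the example name, and any other non-ASCII character falls back to a numeric reference.

```python
# Hypothetical three-entry table; the real extended_entities dict has 30+ entries.
NAMED = {"č": "&ccaron;", "ū": "&umacr;", "ė": "&edot;"}


def encode_named(text: str) -> str:
    """Keep ASCII as-is; use a named entity if known, else a numeric reference."""
    out = []
    for ch in text:
        if ord(ch) <= 127:
            out.append(ch)
        else:
            out.append(NAMED.get(ch, f"&#{ord(ch)};"))
    return "".join(out)


print(encode_named("Vaitkevičiūtė"))  # Vaitkevi&ccaron;i&umacr;t&edot;
```

The sketch assumes precomposed (NFC) input; a name typed with combining diacritics would not match the table keys.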

LaTeX Section Spacing - Increased Visibility

Increased spacing from \enskip (0.5em) to \quad (1em)
Provides better visual separation after runin section titles
Addresses feedback that spacing was present but too tight

Testing

✅ All 23 bioRxiv tests pass with new named entities
✅ PDF generation verified with manuscript-rxiv-maker
✅ Visual confirmation of improved spacing

Issues Fixed

Addresses character encoding and spacing feedback from Guillaume Jacquemet

🤖 Generated with Claude Code

Enhanced encode_html_entities() to use extended HTML entity names (e.g., &ccaron;, &umacr;, &edot;) instead of numeric references for Lithuanian and Eastern European characters.

Changes:
- Added extended_entities dictionary with 30+ character mappings
- Lithuanian: č → &ccaron;, ū → &umacr;, ė → &edot;
- Other Eastern European: š → &scaron;, ž → &zcaron;, etc.
- Updated tests to expect named entities

Example: "Vaitkevičiūtė" → "Vaitkevi&ccaron;i&umacr;t&edot;"

Addresses feedback from Guillaume Jacquemet

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Enhanced bioRxiv character encoding with named HTML entities
- Supports Lithuanian and Eastern European characters properly

Changed from \enskip (0.5em) to \quad (1em) for more visible spacing after runin subsubsection titles. Addresses feedback that spacing was present but still too tight.

Before: \enskip = 0.5em spacing
After: \quad = 1em spacing (2x wider)

Addresses feedback from Guillaume Jacquemet

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

github-actions · 2026-02-05T18:25:14Z

Code Review: Release v1.19.2

I've reviewed PR #293 and overall the changes look solid. Here's my detailed feedback:

✅ Strengths

1. Enhanced HTML Entity Encoding

The addition of named HTML entities (e.g., &ccaron;, &umacr;, &edot;) instead of numeric references is excellent for compatibility
Comprehensive coverage of 30+ Eastern European characters (Lithuanian, Polish, Turkish, Romanian, etc.)
Well-documented with clear examples in docstrings
Proper priority handling: extended entities are added first, preventing override by standard entities

2. Test Coverage

Tests updated to match new named entity format
All 23 bioRxiv tests mentioned as passing
Good test coverage for edge cases (empty strings, None values, complex text)

3. LaTeX Spacing Improvement

Simple, focused change from \enskip to \quad (0.5em → 1em)
Addresses user feedback about tight spacing
Consistent with other spacing in the style file

4. Documentation

Changelog properly formatted and detailed
Clear commit messages with co-authorship attribution
Examples in docstrings updated to reflect new behavior

🔍 Code Quality Observations

1. Performance Consideration
The encode_html_entities() function rebuilds the extended_entities and char_to_entity dictionaries on every call. For better performance, consider moving these to module-level constants:

# At module level (after imports)
_EXTENDED_ENTITIES = {
    "č": "ccaron",
    "Č": "Ccaron",
    # ... rest of mappings
}

# Build the full char-to-entity map once at module load
_CHAR_TO_ENTITY = {}
for char, entity_name in _EXTENDED_ENTITIES.items():
    _CHAR_TO_ENTITY[char] = f"&{entity_name};"
for entity_name, codepoint in html.entities.name2codepoint.items():
    char = chr(codepoint)
    if ord(char) > 127 and char not in _CHAR_TO_ENTITY:
        _CHAR_TO_ENTITY[char] = f"&{entity_name};"

def encode_html_entities(text: str) -> str:
    # Use the pre-built _CHAR_TO_ENTITY instead of rebuilding the dicts here
    ...  # rest of the function body unchanged

This would avoid rebuilding ~200+ dictionary entries for every author name processed.
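
To check whether the rebuild cost is measurable before refactoring, a rough timing comparison along these lines could be used. It is a sketch under stated assumptions: EXTENDED is a three-entry subset, and build_map, encode_rebuilding, and encode_prebuilt are hypothetical stand-ins, not the functions in prepare_biorxiv.py.

```python
import html.entities
import timeit

EXTENDED = {"č": "ccaron", "ū": "umacr", "ė": "edot"}  # hypothetical subset


def build_map() -> dict:
    """Extended names first, then the HTML 4 names from the standard library."""
    mapping = {ch: f"&{name};" for ch, name in EXTENDED.items()}
    for name, codepoint in html.entities.name2codepoint.items():
        ch = chr(codepoint)
        if codepoint > 127 and ch not in mapping:
            mapping[ch] = f"&{name};"
    return mapping


PREBUILT = build_map()  # built once, at import time


def encode(text: str, mapping: dict) -> str:
    return "".join(
        ch if ord(ch) <= 127 else mapping.get(ch, f"&#{ord(ch)};") for ch in text
    )


def encode_rebuilding(text: str) -> str:
    return encode(text, build_map())  # rebuilds the map on every call


def encode_prebuilt(text: str) -> str:
    return encode(text, PREBUILT)  # reuses the module-level map


if __name__ == "__main__":
    name = "Vaitkevičiūtė"
    for fn in (encode_rebuilding, encode_prebuilt):
        seconds = timeit.timeit(lambda: fn(name), number=2000)
        print(f"{fn.__name__}: {seconds:.4f}s for 2000 calls")
```

Both variants produce the same output; only the per-call setup differs, so the printed timings isolate the cost of rebuilding the map.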

2. Memory Efficiency
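Using "".join(result) is good, but you could pass a generator expression to "".join() instead of building the list by hand. Note that CPython's str.join still materializes the iterable internally, so the gain is tidier code rather than a real memory saving: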

def _encode_char(char):
    char_code = ord(char)
    if char_code <= 127:
        return char
    elif char in _CHAR_TO_ENTITY:
        return _CHAR_TO_ENTITY[char]
    else:
        return f"&#{char_code};"

def encode_html_entities(text: str) -> str:
    # None/empty handling omitted for brevity
    return "".join(_encode_char(char) for char in text)

3. Type Hints
The function handles None input gracefully (lines 46-47), but the type hint says text: str. Consider:

def encode_html_entities(text: str | None) -> str | None:
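
A small sketch of what the widened signature implies for callers. The name encode_optional is hypothetical, and returning None or "" unchanged is an assumption about the existing behaviour, which the review only describes as handling None "gracefully". The sketch uses Optional[str], which is the equivalent spelling on Python versions before 3.10, where str | None in an annotation fails at definition time.

```python
from typing import Optional


def encode_optional(text: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in: pass None and "" through, encode everything else."""
    if not text:
        return text  # assumed pass-through for None and empty strings
    return "".join(ch if ord(ch) <= 127 else f"&#{ord(ch)};" for ch in text)


print(encode_optional(None))  # None
print(encode_optional("ū"))   # &#363;
```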

🔒 Security Analysis

Good security practices observed:

✅ Proper HTML entity encoding prevents XSS in TSV output
✅ No dynamic code execution or injection vulnerabilities
✅ Input validation exists in validate_author_data()
✅ Uses safe CSV writer with proper escaping

No security concerns identified.

🧪 Testing Recommendations

Edge Cases to Consider:
- Mixed case preservation: encode_html_entities("VAITKEVIČIŪTĖ") should use uppercase entities
- Emoji or unusual Unicode (should fall back to numeric entities)
- Very long strings with many special characters (performance test)
Integration Testing:
- Verify actual bioRxiv TSV import accepts the named entities
- Test with real-world author names from various languages
Regression Testing:
- Ensure previously working characters (Portuguese, Spanish, Nordic) still work
- The tests already cover this, which is excellent ✅
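
The edge cases listed above could be written as pytest-style tests roughly like this. The encode function and its six-entry SAMPLE table are hypothetical stand-ins for the encoder under test; in the real suite the tests would call encode_html_entities() from prepare_biorxiv.py instead.

```python
# Hypothetical stand-in for the encoder under test; only six mappings are included.
SAMPLE = {
    "č": "ccaron", "Č": "Ccaron",
    "ė": "edot", "Ė": "Edot",
    "ū": "umacr", "Ū": "Umacr",
}


def encode(text: str) -> str:
    return "".join(
        ch if ord(ch) <= 127
        else f"&{SAMPLE[ch]};" if ch in SAMPLE
        else f"&#{ord(ch)};"
        for ch in text
    )


def test_uppercase_uses_uppercase_entities():
    assert encode("VAITKEVIČIŪTĖ") == "VAITKEVI&Ccaron;I&Umacr;T&Edot;"


def test_emoji_falls_back_to_numeric_reference():
    assert encode("🎉") == f"&#{ord('🎉')};"


def test_long_input_keeps_one_entity_per_character():
    assert encode("č" * 10_000) == "&ccaron;" * 10_000
```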

📝 Minor Suggestions

Dependency Management (uv.lock)
- The uv.lock change adds PyPDF2 3.0.1, which appears to already be listed in pyproject.toml
- This is expected behavior for lock file updates
- ⚠️ Note: PyPDF2 is deprecated in favor of pypdf (already in dependencies). Consider if PyPDF2 is still needed in the dev dependencies.
Changelog Format
- Perfectly follows the required format (### Fixed section, proper version header)
- Breaking change markers not needed here (no breaking changes) ✅
LaTeX Style File
- The change is minimal and focused
- Consider documenting in comments why \quad was chosen over other spacing options

🎯 Verdict

Recommend: APPROVE with optional performance enhancements

The PR addresses real user feedback, improves bioRxiv compatibility, and maintains good test coverage. The code is well-documented and follows project conventions.

The performance optimizations suggested above are optional - the current implementation is correct and will work fine for typical manuscript author lists (5-20 authors). Only optimize if profiling shows this as a bottleneck.

📋 Pre-merge Checklist

Before merging, verify:

All tests pass locally and in CI
PDF generation tested with ../manuscript-rxiv-maker
Visual confirmation of spacing improvement in generated PDF
Homebrew formula will be updated after PyPI release (as per CLAUDE.md)

Great work addressing the user feedback! 🎉

Copilot

Pull request overview

This PR enhances character encoding for bioRxiv submissions and improves LaTeX section spacing. The changes move from numeric HTML character references to named HTML entities for better compatibility with bioRxiv's TSV import system.

Changes:

Added 30+ extended HTML entity mappings (Lithuanian, Polish, Turkish, Romanian characters)
Increased LaTeX subsubsection spacing from \enskip (0.5em) to \quad (1em)
Updated tests to validate named entity encoding instead of numeric references

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
uv.lock	Added PyPDF2 3.0.1 to dev dependencies; lock file revision changed
tests/unit/test_prepare_biorxiv.py	Updated test assertions to expect named entities instead of numeric references
src/tex/style/rxiv_maker_style.cls	Increased subsubsection spacing from \enskip to \quad
src/rxiv_maker/engines/operations/prepare_biorxiv.py	Added extended_entities dictionary with 30+ character mappings to named HTML entities
src/rxiv_maker/version.py	Version bumped to 1.19.2
CHANGELOG.md	Added v1.19.2 release notes documenting both changes


Copilot · 2026-02-05T18:28:20Z

src/rxiv_maker/engines/operations/prepare_biorxiv.py

+    extended_entities = {
+        # Lithuanian and Eastern European
+        "č": "ccaron",
+        "Č": "Ccaron",  # c with caron
+        "ė": "edot",
+        "Ė": "Edot",  # e with dot above
+        "ū": "umacr",
+        "Ū": "Umacr",  # u with macron
+        "ā": "amacr",
+        "Ā": "Amacr",  # a with macron
+        "ē": "emacr",
+        "Ē": "Emacr",  # e with macron
+        "ī": "imacr",
+        "Ī": "Imacr",  # i with macron
+        "ō": "omacr",
+        "Ō": "Omacr",  # o with macron
+        # Other common extended entities
+        "ă": "abreve",
+        "Ă": "Abreve",  # a with breve
+        "ą": "aogon",
+        "Ą": "Aogon",  # a with ogonek
+        "ć": "cacute",
+        "Ć": "Cacute",  # c with acute
+        "ę": "eogon",
+        "Ę": "Eogon",  # e with ogonek
+        "ğ": "gbreve",
+        "Ğ": "Gbreve",  # g with breve
+        "İ": "Idot",  # I with dot above
+        "ı": "inodot",  # i without dot
+        "ł": "lstrok",
+        "Ł": "Lstrok",  # l with stroke
+        "ń": "nacute",
+        "Ń": "Nacute",  # n with acute
+        "œ": "oelig",
+        "Œ": "OElig",  # oe ligature
+        "ř": "rcaron",
+        "Ř": "Rcaron",  # r with caron
+        "ś": "sacute",
+        "Ś": "Sacute",  # s with acute
+        "š": "scaron",
+        "Š": "Scaron",  # s with caron
+        "ş": "scedil",
+        "Ş": "Scedil",  # s with cedilla
+        "ţ": "tcedil",
+        "Ţ": "Tcedil",  # t with cedilla
+        "ů": "uring",
+        "Ů": "Uring",  # u with ring
+        "ź": "zacute",
+        "Ź": "Zacute",  # z with acute
+        "ż": "zdot",
+        "Ż": "Zdot",  # z with dot above
+        "ž": "zcaron",
+        "Ž": "Zcaron",  # z with caron
+    }


The extended entities dictionary includes uppercase character mappings (e.g., "Č" → "Ccaron", "Ė" → "Edot", "Ū" → "Umacr") but there are no test cases validating these uppercase entities. Consider adding test coverage for uppercase extended characters to ensure they encode correctly, especially since HTML entity names are case-sensitive.
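
One way to cover the case-sensitivity concern is a round-trip check: decode each entity name with the standard library's html.unescape(), which knows the HTML5 named entities, and confirm it yields the original character. The sketch below uses a six-entry sample copied from the diff; the helper name mismatched_entities is hypothetical.

```python
import html

# Sample of the mappings from the diff above (lowercase and uppercase pairs).
EXTENDED_SAMPLE = {
    "č": "ccaron", "Č": "Ccaron",
    "ė": "edot", "Ė": "Edot",
    "ū": "umacr", "Ū": "Umacr",
}


def mismatched_entities(mapping: dict) -> dict:
    """Return entries whose entity name does not decode back to its character."""
    return {
        ch: name
        for ch, name in mapping.items()
        if html.unescape(f"&{name};") != ch
    }


print(mismatched_entities(EXTENDED_SAMPLE))   # {}
print(mismatched_entities({"Č": "ccaron"}))   # {'Č': 'ccaron'}
```

The second call shows the kind of mistake this catches: a lowercase entity name paired with an uppercase character decodes to the wrong letter.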

paxcalpt and others added 4 commits February 5, 2026 18:21

chore: bump version to 1.19.2

397e98a

- Enhanced bioRxiv character encoding with named HTML entities
- Supports Lithuanian and Eastern European characters properly

docs: update changelog for v1.19.2

0baef1a

Copilot AI review requested due to automatic review settings February 5, 2026 18:23

Copilot started reviewing on behalf of paxcalpt February 5, 2026 18:23

Copilot AI reviewed Feb 5, 2026

paxcalpt merged commit 368f27c into main Feb 5, 2026
15 of 16 checks passed

paxcalpt deleted the release/v1.19.2 branch February 5, 2026 18:32

paxcalpt merged 4 commits into main from release/v1.19.2
