Python: Bug: XML parsing error in Semantic Kernel when translating markdown files with complex formatting #12608

Open
@skytin1004

Describe the bug

Semantic Kernel's XML parser fails on markdown content during translation with Co-op Translator. The error occurs when the markdown contains complex formatting such as GitHub badges, tables, or special characters. The log shows "not well-formed (invalid token): line 26, column 119", and the translation process fails.

To Reproduce
Steps to reproduce the behavior:

  1. Install Co-op Translator (pip install co-op-translator)
  2. Prepare a markdown file with complex formatting (GitHub badges, tables, links)
    • For example, this line triggers an error:
      [French](./translations/fr/README.md) | [Spanish](./translations/es/README.md) | [German](./translations/de/README.md) | [Russian](./translations/ru/README.md) | [Arabic](./translations/ar/README.md) | [Persian (Farsi)](./translations/fa/README.md) | [Urdu](./translations/ur/README.md) | [Chinese (Simplified)](./translations/zh/README.md) | [Chinese (Traditional, Macau)](./translations/mo/README.md) | [Chinese (Traditional, Hong Kong)](./translations/hk/README.md) | [Chinese (Traditional, Taiwan)](./translations/tw/README.md) | [Japanese](./translations/ja/README.md) | [Korean](./translations/ko/README.md) | [Hindi](./translations/hi/README.md) | [Bengali](./translations/bn/README.md) | [Marathi](./translations/mr/README.md) | [Nepali](./translations/ne/README.md) | [Punjabi (Gurmukhi)](./translations/pa/README.md) | [Portuguese (Portugal)](./translations/pt/README.md) | [Portuguese (Brazil)](./translations/br/README.md) | [Italian](./translations/it/README.md) | [Polish](./translations/pl/README.md) | [Turkish](./translations/tr/README.md) | [Greek](./translations/el/README.md) | [Thai](./translations/th/README.md) | [Swedish](./translations/sv/README.md) | [Danish](./translations/da/README.md) | [Norwegian](./translations/no/README.md) | [Finnish](./translations/fi/README.md) | [Dutch](./translations/nl/README.md) | [Hebrew](./translations/he/README.md) | [Vietnamese](./translations/vi/README.md) | [Indonesian](./translations/id/README.md) | [Malay](./translations/ms/README.md) | [Tagalog (Filipino)](./translations/tl/README.md) | [Swahili](./translations/sw/README.md) | [Hungarian](./translations/hu/README.md) | [Czech](./translations/cs/README.md) | [Slovak](./translations/sk/README.md) | [Romanian](./translations/ro/README.md) | [Bulgarian](./translations/bg/README.md) | [Serbian (Cyrillic)](./translations/sr/README.md) | [Croatian](./translations/hr/README.md) | [Slovenian](./translations/sl/README.md) | [Ukrainian](./translations/uk/README.md) | 
[Burmese (Myanmar)](./translations/my/README.md)
      
  3. Run the translation command: translate -l "ko" -d (Note: internally, Co-op Translator uses Semantic Kernel to perform the translation, inserting the markdown content into the prompt.)
    • When the above markdown content is passed as part of the prompt, Semantic Kernel tries to parse the prompt as XML, which fails on invalid tokens.
  4. Check the debug log for the "not well-formed (invalid token)" error.
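
The failure mode in step 3 can be sketched without Co-op Translator or Semantic Kernel. This is a minimal illustration that assumes the prompt is wrapped in a root element and handed to an expat-based XML parser; the standard library's xml.etree.ElementTree stands in for whatever parser Semantic Kernel actually uses, and the function name, the root tag, and the sample badge URL are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def try_parse_as_xml(prompt: str, root_tag: str = "root") -> Optional[str]:
    """Return the parser's error message, or None if the prompt is well-formed XML.

    Assumption: the prompt is wrapped in a single root element before parsing.
    """
    try:
        ET.fromstring(f"<{root_tag}>{prompt}</{root_tag}>")
    except ET.ParseError as exc:
        return str(exc)
    return None


# Hypothetical badge line: the bare "&" in the query string is not valid XML.
badge = "![badge](https://img.shields.io/badge/docs-latest-blue?style=flat&logo=github)"

print(try_parse_as_xml(badge))
print(try_parse_as_xml("plain text with a | table | separator"))
```

Plain markdown punctuation such as pipes and brackets parses fine as XML text; characters with XML meaning, such as a bare & or <, are what the parser rejects.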

Note

The issue occurs within markdown_translator.py when Co-op Translator invokes Semantic Kernel for prompt generation and translation.

See: markdown_translator.py (GitHub)

Expected behavior

The markdown file should be properly parsed and translated without XML parsing errors. Semantic Kernel should handle markdown content appropriately, even when it contains special characters or formatting that might be invalid in XML context.
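
One possible caller-side mitigation, offered as a sketch and not as a confirmed fix: escape XML-special characters in the markdown before it is placed into the prompt, so that the wrapped prompt is well-formed. The helper names below are hypothetical, the parser is again the standard library's ElementTree as a stand-in, and whether Semantic Kernel restores the original characters before sending the prompt to the model is an assumption that would need to be verified.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape


def escape_markdown_for_prompt(markdown: str) -> str:
    """Replace &, < and > with XML entities so the text is valid XML character data."""
    return escape(markdown)


def is_well_formed(prompt: str) -> bool:
    """Check whether the prompt parses as XML when wrapped in a root element."""
    try:
        ET.fromstring(f"<root>{prompt}</root>")
    except ET.ParseError:
        return False
    return True


# Hypothetical table row containing a query-string "&" and a literal "<".
sample = "| [Docs](https://example.com/?a=1&b=2) | x < y |"
escaped = escape_markdown_for_prompt(sample)

print(is_well_formed(sample), is_well_formed(escaped))
# The original text is recoverable from the escaped form.
print(unescape(escaped) == sample)
```

The trade-off is that the model may see entity references such as `&amp;` in place of the original characters unless something unescapes them again before the request is sent.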

Screenshots

INFO:semantic_kernel.contents.chat_history:Could not parse prompt Translate the following markdown file to Korean (ko).
        IMPORTANT RULES:
        1. DO NOT add '''markdown or any other tags around the translation
        2. Make sure the translation does not sound too literal
        3. Translate comments as well
        4. This file is written in Markdown format - do not treat it as XML or HTML
        
        ... 
        
as xml, treating as text, error was: not well-formed (invalid token): line 26, column 119
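
The "line 26, column 119" position in the log can be mapped back to the prompt text. The sketch below assumes the position is reported relative to the whole wrapped prompt (instructions included), which is what the log suggests but is not confirmed; the function name and sample prompt are hypothetical, and ElementTree is used as a stand-in parser.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def locate_xml_error(prompt: str, root_tag: str = "root") -> Optional[Tuple[int, int, str]]:
    """Return (line, column, offending line text) for the first XML error, or None.

    The line number is 1-based. On line 1 the column also counts the
    characters of the opening wrapper tag.
    """
    wrapped = f"<{root_tag}>{prompt}</{root_tag}>"
    try:
        ET.fromstring(wrapped)
    except ET.ParseError as exc:
        line_no, column = exc.position
        lines = wrapped.split("\n")
        # Guard against a position past the last line (e.g. an error at end of input).
        text = lines[line_no - 1] if 0 < line_no <= len(lines) else ""
        return line_no, column, text
    return None


# Hypothetical two-line prompt with an unescaped "&" on the second line.
print(locate_xml_error("first line\nsee https://example.com/?a=1&b=2 here"))
```

Running this against the full prompt text from the failing translation should show which markdown line the parser rejected.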

Platform

  • Language: Python
  • Source: pip package co-op-translator using latest semantic kernel version
  • AI model: Azure OpenAI GPT models via Semantic Kernel
  • IDE: VS Code
  • OS: Windows

Additional context

This issue appears similar to the previously reported issue #1483, which also involved XML parsing problems, but it occurs specifically in the context of markdown translation. Co-op Translator uses Semantic Kernel internally for AI processing, and the error happens when parsing markdown content that contains special characters or formatting that is not valid XML.

The issue particularly affects files with GitHub badges, URLs with query parameters, and other specialized markdown formatting.
