Python: Bug: XML parsing error in Semantic Kernel when translating markdown files with complex formatting #12608

**Describe the bug**

Semantic Kernel's XML parser fails when handling markdown content during translation with Co-op Translator. The error occurs when the markdown contains complex formatting such as GitHub badges, tables, or special characters. The error message reads "not well-formed (invalid token): line 26, column 119", and the translation process fails.

**To Reproduce**
Steps to reproduce the behavior:

1. Install Co-op Translator (`pip install co-op-translator`)
2. Prepare a markdown file with complex formatting (GitHub badges, tables, links)
    - For example, this line triggers an error:
        ```
        [French](./translations/fr/README.md) | [Spanish](./translations/es/README.md) | [German](./translations/de/README.md) | [Russian](./translations/ru/README.md) | [Arabic](./translations/ar/README.md) | [Persian (Farsi)](./translations/fa/README.md) | [Urdu](./translations/ur/README.md) | [Chinese (Simplified)](./translations/zh/README.md) | [Chinese (Traditional, Macau)](./translations/mo/README.md) | [Chinese (Traditional, Hong Kong)](./translations/hk/README.md) | [Chinese (Traditional, Taiwan)](./translations/tw/README.md) | [Japanese](./translations/ja/README.md) | [Korean](./translations/ko/README.md) | [Hindi](./translations/hi/README.md) | [Bengali](./translations/bn/README.md) | [Marathi](./translations/mr/README.md) | [Nepali](./translations/ne/README.md) | [Punjabi (Gurmukhi)](./translations/pa/README.md) | [Portuguese (Portugal)](./translations/pt/README.md) | [Portuguese (Brazil)](./translations/br/README.md) | [Italian](./translations/it/README.md) | [Polish](./translations/pl/README.md) | [Turkish](./translations/tr/README.md) | [Greek](./translations/el/README.md) | [Thai](./translations/th/README.md) | [Swedish](./translations/sv/README.md) | [Danish](./translations/da/README.md) | [Norwegian](./translations/no/README.md) | [Finnish](./translations/fi/README.md) | [Dutch](./translations/nl/README.md) | [Hebrew](./translations/he/README.md) | [Vietnamese](./translations/vi/README.md) | [Indonesian](./translations/id/README.md) | [Malay](./translations/ms/README.md) | [Tagalog (Filipino)](./translations/tl/README.md) | [Swahili](./translations/sw/README.md) | [Hungarian](./translations/hu/README.md) | [Czech](./translations/cs/README.md) | [Slovak](./translations/sk/README.md) | [Romanian](./translations/ro/README.md) | [Bulgarian](./translations/bg/README.md) | [Serbian (Cyrillic)](./translations/sr/README.md) | [Croatian](./translations/hr/README.md) | [Slovenian](./translations/sl/README.md) | [Ukrainian](./translations/uk/README.md) | 
        [Burmese (Myanmar)](./translations/my/README.md)
        ```
3. Run the translation command: `translate -l "ko" -d` (Note: internally, Co-op Translator uses [Semantic Kernel](https://github.com/microsoft/semantic-kernel) to perform the translation, inserting the markdown content into the prompt.)
    - When the above markdown content is passed as part of the prompt, Semantic Kernel tries to parse the rendered prompt as XML, which fails due to invalid tokens.
4. See the "not well-formed (invalid token)" error message in the debug log

> [!NOTE]
> The issue occurs within markdown_translator.py when Co-op Translator invokes Semantic Kernel for prompt generation and translation.
>
> See: [markdown_translator.py (GitHub)](https://github.com/Azure/co-op-translator/blob/main/src/co_op_translator/core/llm/providers/azure/markdown_translator.py)
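
The failure mode can be sketched in isolation with Python's standard library. This is a minimal illustration, not the Semantic Kernel code path: the `<root>` wrapper, the `try_parse_as_xml` helper, and the badge URL are all assumptions made for the example. It shows that ordinary markdown containing a raw `&` (as in a URL with query parameters) is rejected by an XML parser with the same class of error.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical badge line; the raw "&" in the query string is not valid XML text.
markdown_line = (
    "[![Build](https://example.com/badge.svg?branch=main&style=flat)]"
    "(https://example.com)"
)


def try_parse_as_xml(prompt: str) -> Optional[str]:
    """Return the parser's error message, or None if the prompt is well-formed."""
    try:
        # Assumption: the prompt is wrapped in a single root element before parsing.
        ET.fromstring(f"<root>{prompt}</root>")
    except ET.ParseError as err:
        return str(err)
    return None


error_message = try_parse_as_xml(markdown_line)
print(error_message)
```

Plain markdown without `&` or `<` parses cleanly under the same wrapper, so only some lines of a file are affected.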

**Expected behavior**

The markdown file should be properly parsed and translated without XML parsing errors. Semantic Kernel should handle markdown content appropriately, even when it contains special characters or formatting that might be invalid in XML context.

**Screenshots**

```
INFO:semantic_kernel.contents.chat_history:Could not parse prompt Translate the following markdown file to Korean (ko).
        IMPORTANT RULES:
        1. DO NOT add '''markdown or any other tags around the translation
        2. Make sure the translation does not sound too literal
        3. Translate comments as well
        4. This file is written in Markdown format - do not treat it as XML or HTML
        
        ... 
        
as xml, treating as text, error was: not well-formed (invalid token): line 26, column 119
```
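
Because the log reports only "line 26, column 119" and the prompt is elided here, a small helper can map such a position back to the offending text. The sketch below is a debugging aid under stated assumptions: `locate_xml_error`, `XmlErrorLocation`, and the `<root>` wrapper are hypothetical names, and the positions it reports match Semantic Kernel's only if the prompt is wrapped and parsed the same way.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional

ROOT_OPEN = "<root>"  # assumed wrapper element, not taken from the Semantic Kernel source


class XmlErrorLocation(NamedTuple):
    line: int     # 1-based line number within the prompt
    column: int   # 0-based column within that line
    context: str  # text surrounding the reported position


def locate_xml_error(prompt: str, window: int = 20) -> Optional[XmlErrorLocation]:
    """Parse the wrapped prompt and report where the XML parser gave up."""
    try:
        ET.fromstring(f"{ROOT_OPEN}{prompt}</root>")
    except ET.ParseError as err:
        line, column = err.position  # 1-based line, 0-based column
        if line == 1:
            # The wrapper shares the first line with the prompt, so undo its offset.
            column -= len(ROOT_OPEN)
        lines = prompt.replace("\r\n", "\n").replace("\r", "\n").split("\n")
        text = lines[line - 1] if line <= len(lines) else ""
        start = max(column - window, 0)
        return XmlErrorLocation(line, column, text[start:column + window])
    return None


# Hypothetical two-line prompt; the second line contains a raw "&".
sample = "Translate this file.\n[docs](https://example.com/?a=1&b=2)"
location = locate_xml_error(sample)
print(location)
```

The reported column may point just past the problematic character rather than at it, which is why the helper returns a window of surrounding text instead of a single character.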


**Platform**
 - Language: Python
 - Source: pip package `co-op-translator`, using the latest Semantic Kernel version
 - AI model: Azure OpenAI GPT models via Semantic Kernel
 - IDE: VS Code
 - OS: Windows

**Additional context**

This issue appears similar to the previously reported issue #1483, which also involved XML parsing problems, but this one occurs specifically in the context of markdown translation. Co-op Translator uses Semantic Kernel internally for AI processing, and the error happens when parsing markdown content that contains special characters or formatting that isn't XML-compliant.

The issue particularly affects files with GitHub badges, URLs with query parameters, and other specialized markdown formatting.
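
One possible caller-side mitigation, offered only as a sketch and not confirmed against either Co-op Translator or Semantic Kernel, is to XML-escape the markdown before it is embedded in the prompt. The example uses the standard library's `xml.sax.saxutils.escape`; the sample text and the `<root>` wrapper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical markdown with characters that are special in XML.
markdown_text = (
    "| Badge | Link |\n"
    "|---|---|\n"
    "| build | https://example.com/?a=1&b=2 |\n"
    "Use x < y in prose."
)

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
escaped = escape(markdown_text)

# The escaped text is now well-formed inside a wrapper element,
# and the parser hands back the original markdown unchanged.
root = ET.fromstring(f"<root>{escaped}</root>")
round_tripped = root.text
print(round_tripped == markdown_text)
```

Whether escaping is appropriate in practice depends on how the prompt text is unescaped before it reaches the model, which this sketch does not cover.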
