Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix JsonException: Malformed UTF-8 characters, possibly incorrectly encoded #43

Merged
merged 4 commits into from
Apr 24, 2024

Conversation

VincentLanglet
Copy link
Contributor

@VincentLanglet VincentLanglet commented Feb 21, 2024

Hi @JanEbbing @daniel-jones-deepl,

We recently encountered an issue with text input which after deepl translation cannot be json_decoded by the library.
I create a reproducer of the issue. Notice this only occurs with the option

TranslateTextOptions::TAG_HANDLING => 'xml',

This is how is rendered the input in my IDE
image

The test added is failing with the error:

JsonException: Malformed UTF-8 characters, possibly incorrectly encoded

This is especially annoying because when translating a payload with 1000 texts, if one of them has such a character, the whole payload is failing and no text is translated (when 999 could have been).

Is something can be done:

  • On deepl side ?
  • On this library ?
  • On my side ?

Thanks

@JanEbbing
Copy link
Member

JanEbbing commented Feb 22, 2024

Hi, @VincentLanglet - I'm still investigating this but it might be an issue with PHP/fixable in the library.
The equivalent curl command returns a valid response (though the tag handling is a bit messed up).

$ curl -X POST 'https://api.deepl.com/v2/translate' \
--header 'Authorization: DeepL-Auth-Key MYKEY' \
--header 'Content-Type: application/json' \
--data '{"text": ["Portal<span></span>"], "target_lang": "FR", "source_lang":"EN", "tag_handling":"xml", "ignore_tags":["notranslate"]}'

{"translations":[{"detected_source_language":"EN","text":"Portail<span>/span&gt;"}]}

I tried logging the request/response that gets sent over the wire via PHP yesterday but the logger I used modified the data, so I'm checking with a new one now.

@VincentLanglet
Copy link
Contributor Author

Hi, @VincentLanglet - I'm still investigating this but it might be an issue with PHP/fixable in the library. The equivalent curl command returns a valid response (though the tag handling is a bit messed up).

I tried logging the request/response that gets sent over the wire via PHP yesterday but the logger I used modified the data, so I'm checking with a new one now.

When I tried to log the \CURLOPT_POSTFIELDS generated in HttpClientWrapper::urlEncodeWithRepeatedParams, I had

target_lang=fr&source_lang=en&text=Portal%3Cspan%3E%EE%A0%83%3C%2Fspan%3E&tag_handling=xml&ignore_tags=notranslate

notice that without the tag_handling, the request with

target_lang=fr&source_lang=en&text=Portal%3Cspan%3E%EE%A0%83%3C%2Fspan%3E&ignore_tags=notranslate

works.

When looking at sendCustomHttpRequest, the content type used is

$headers['Content-Type'] = 'application/x-www-form-urlencoded';

Maybe the 'Content-Type: application/json' would be better instead then if it works by json ?
Or maybe the API lib need a better support for form-urlencoded data ?

@daniel-jones-dev
Copy link
Member

Hi @VincentLanglet, I was sick until yesterday, looking into this now too.

I don't think this is likely a problem of PHP, as I can reproduce it with our Python library too. Nor is it likely a problem with sending a JSON-encoded request or URL-encoded request; I could reproduce the response in both cases.

It seems to be caused because this input (combined with XML tag-handling) triggers some unusual case, and our API response includes an invalid UTF-8 sequence: \xEE\xA0 rather than \xEE\xA0\x83. I've forwarded the issue to our backend teams.

@VincentLanglet
Copy link
Contributor Author

It seems to be caused because this input (combined with XML tag-handling) triggers some unusual case, and our API response includes an invalid UTF-8 sequence: \xEE\xA0 rather than \xEE\xA0\x83. I've forwarded the issue to our backend teams.

Thanks @daniel-jones-dev, any idea about when it will be solve ?
We're still getting the error and it fails a lot of our API calls.

@daniel-jones-dev
Copy link
Member

daniel-jones-dev commented Apr 23, 2024

Hi @VincentLanglet, the backend team has looked into the cause of this issue; unfortunately it will not be easily fixed.

In the meantime, I wonder if a workaround in this library could help you: we could suppress these invalid UTF-8 sequences by replacing them with the replacement character “�” (U+FFFD), this would at least allow you to use the other requests. Do you think this would help you?

@VincentLanglet
Copy link
Contributor Author

In the meantime, I wonder if a workaround in this library could help you: we could suppress these invalid UTF-8 sequences by replacing them with the replacement character “�” (U+FFFD), this would at least allow you to use the other requests. Do you think this would help you?

Sure, I talked with our team and it would help a lot. (As a first step until the issue is fixed on the api side).

What implementation did you have in mind ? I think that

mb_substitute_character(0xFFFD);
$content = mb_convert_encoding($content, 'UTF-8', 'UTF-8');

could do the job.

@VincentLanglet VincentLanglet changed the title Reproducer for a bug JsonException: Malformed UTF-8 characters, possibly incorrectly encoded Fix JsonException: Malformed UTF-8 characters, possibly incorrectly encoded Apr 23, 2024
@VincentLanglet VincentLanglet marked this pull request as ready for review April 23, 2024 16:03
@daniel-jones-dev
Copy link
Member

Thanks @VincentLanglet, your change for the workaround looks good. We need to check some internal tests and then we should be able to merge this tomorrow.

The backend team will still work on fixing the issue in the API.

@deepl-bot deepl-bot merged commit 04ef190 into DeepLcom:main Apr 24, 2024
@daniel-jones-dev
Copy link
Member

Workaround is published in v1.7.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants