Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: String validation could leave invalid Utf8 encoded strings #9096

Merged
merged 1 commit into from Apr 29, 2021

Conversation

@rubidium42
Copy link
Contributor

@rubidium42 rubidium42 commented Apr 24, 2021

Motivation / Problem

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.

For example, a character array with: (bits) 11000000 10000000 'a', '\0' would yield 11000000, '\0'. Based on the existing could I would have expected it to yield '\0', but then I found that the Utf8Decode decodes the first two bytes to 0 but then checks whether the decode value is more than 0x80. If that is not the case, it return a question mark and a length of 1. This means that 1 byte is copied. Then it starts the loop again, seeing 1000000 which is not a valid first byte for a Utf8Encoded character thus it assumes a length of 4, which would read beyond the buffer and it terminates.

Description

Rewrote the logic. See the documentation in the diff for the details.
The example about will now return yield 'a', '\0' as validated string.

Limitations

If someone rewrites Utf8Decode to not be as strict, a multibyte encoded '\0' would not terminate the string in the validation, only the length of the string would. Later things that use Utf8Decode to get characters might then terminate on '\0'. But, this is all under the premise that someone wants to make it less strict in the first place, and then most of the logic in the validation would be moot.

Checklist for review

Some things are not automated, and forgotten often. This list is a reminder for the reviewers.

  • The bug fix is important enough to be backported? (label: 'backport requested')
  • This PR affects the save game format? (label 'savegame upgrade')
  • This PR affects the GS/AI API? (label 'needs review: Script API')
    • ai_changelog.hpp, gs_changelog.hpp need updating.
    • The compatibility wrappers (compat_*.nut) need updating.
  • This PR affects the NewGRF API? (label 'needs review: NewGRF')
src/string.cpp Outdated Show resolved Hide resolved
In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
@rubidium42 rubidium42 force-pushed the string_validation_improvement branch from 4e89581 to 03b092c Apr 24, 2021
PeterN
PeterN approved these changes Apr 29, 2021
@PeterN PeterN merged commit f00564e into OpenTTD:master Apr 29, 2021
13 checks passed
@rubidium42 rubidium42 deleted the string_validation_improvement branch Apr 30, 2021
LordAro added a commit to LordAro/OpenTTD that referenced this issue May 1, 2021
…TTD#9096)

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro added a commit to LordAro/OpenTTD that referenced this issue May 1, 2021
…TTD#9096)

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro added a commit to LordAro/OpenTTD that referenced this issue May 2, 2021
…TTD#9096)

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro added a commit to LordAro/OpenTTD that referenced this issue May 2, 2021
…TTD#9096)

In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
LordAro added a commit that referenced this issue May 3, 2021
In case a character was encoded in multiple bytes, but required fewer bytes to be encoded, the first byte would be copied to the output leaving an invalid Utf8 encoded string. Later uses of the validated string would use the same decode logic, which would yield a question mark and just read a single byte, so nothing dangerous happened.
Furthermore, because the next byte would not be a first byte of an encoded Utf8 character, the last few valid characters could be removed by the validation as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants