Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeDecodeError when retrieving engine string that contains invalid bytes #27

Closed
jsza opened this issue Oct 5, 2014 · 6 comments
Closed
Labels

Comments

@jsza
Copy link
Contributor

jsza commented Oct 5, 2014

It appears that this can happen in a number of places. Here are some easily reproducible situations:

from commands.say import SayFilter
from events import Event
from players.helpers import playerinfo_from_userid

@SayFilter
def say_filter(playerinfo, teamonly, command):
    command.get_arg_string()

@Event
def player_changename(event):
    event.get_string('newname')

@Event
def player_activate(event):
    playerinfo = playerinfo_from_userid(event.get_int('userid'))
    playerinfo.get_name()

Running the above example, the following steps will raise a UnicodeEncodeError:

  • Changing your Steam name to: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa££
[SP] Caught an Exception:
Traceback (most recent call last):
  File '../addons/source-python/packages/source-python/events/listener.py', line 90, in fire_game_event
    callback(game_event)
  File '../addons/source-python/plugins/sandbox/sandbox.py', line 11, in player_changename
    event.get_string('newname')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 30: unexpected end of data
  • Saying in chat: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa£
[SP] Caught an Exception:
Traceback (most recent call last):
  File '../addons/source-python/plugins/sandbox/sandbox.py', line 7, in say_filter
    command.get_arg_string()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 127: invalid continuation byte
  • Connecting to server with a Steam name of: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa££
[SP] Caught an Exception:
Traceback (most recent call last):
  File '../addons/source-python/packages/source-python/events/listener.py', line 90, in fire_game_event
    callback(game_event)
  File '../addons/source-python/plugins/sandbox/sandbox.py', line 16, in player_activate
    playerinfo.get_name()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 30: unexpected end of data
@Ayuto Ayuto added the bug label Oct 10, 2014
@Ayuto
Copy link
Member

Ayuto commented Feb 14, 2015

Okay, I have now done some testings and it seems like this is an issue of the game itself. Let's take a look at the chat: It allows you to enter 127 characters. You can't enter more. However, characters like "£" have the size of 2 bytes. If you use them you are actually able to pass more than 127 bytes to the server. But the client will only handle the first 127 bytes -- the rest will be cut.
For example if you enter the following characters (63 * £ + abc = 63 * 2 bytes + 3 * 1 byte = 129 bytes), it will cut "b" and "c":

££££££££££££££££££££££££££££££££££££££££££££££££££££££££££££££££abc

Let's take a look at your example string:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa£

It consists of 126 * a + £ (126 * 1 byte + 1 * 2 bytes = 128 bytes). The client will now cut the last character (£) into two halves and only keeps the first byte, which is invalid without the second half.

I think in this case we should remove the invalid byte.

Edit: If you use the console to say something, you are able to send more bytes. They won't be cut.

@jsza
Copy link
Contributor Author

jsza commented Jun 3, 2015

So, I implemented a possible overall solution for this as a patch to Boost Python: jsza/python@0d9b7b3

The surrogateescape error handling for decode will replace invalid bytes from a UTF-8 string with placeholder characters. These can then be converted back to the original byte string when passed back to the engine if the same error handling is used in a unicode encode function.

It works as expected in practice. However, this means scripters will need to be aware of the surrogateescape encoding when working with strings that contain surrogate characters. This can cause complications when performing actions like writing an invalid player name to a database which is likely to raise an exception.

My reasoning behind adding surrogateescape to the Boost Python string converters was to allow for certainty that the string you're working with is the exact same one that the source engine is aware of.

It may however be a better solution to add "replace" error handling to any functions where the UnicodeDecodeError can occur in Source Python.

Would love to hear the team's thoughts on this.

@Ayuto
Copy link
Member

Ayuto commented Jun 5, 2015

Thank you for taking a look at this! :)

But I don't think we should modify Boost's built-in converter. Then we could also apply that to the whole Python interpreter, which is a little bit easier. That obviously modifies too much of the normal behaviour.

codecs.register_error('strict', codecs.lookup_error('surrogateescape'))

I think I agree with your last statement that we should just add the error handling in specific cases. We can easily fix these special cases by doing this for every special case:

class CCommandExt
{
public:
    static PyObject* ArgS(CCommand& command)
    {
        const char* szCommand = command.ArgS();
        return PyUnicode_DecodeUTF8(szCommand, strlen(szCommand), "ignore");
    }
};

Ayuto added a commit that referenced this issue Jun 6, 2015
Fixed IGameEvent::SetString being exposed twice
@Ayuto
Copy link
Member

Ayuto commented Jun 6, 2015

Are there more occurences where we need to add this fix?

@jsza
Copy link
Contributor Author

jsza commented Jun 6, 2015

Awesome stuff! Thanks for fixing this.

I think another potential source of grief might be Entity.get_prop_string(). I haven't personally run into any corrupt entity keys, but I would guess that it's possible.

@Ayuto
Copy link
Member

Ayuto commented Jun 6, 2015

Yeah, but the only property I can currently think of is "CBasePlayer.m_szNetname", which stores the player's name in an array. So, it can contain an invalid byte as well. However, there is no reason to use this property, because we have a function that returns the name.

I think I will close this issue now. If we should notice another potential source, we can re-open it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants