Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv) #5189

Kami · 2021-03-13T12:35:08Z

Description

This pull request fixes a bug in the API and CLI side to make sure we correctly handle scenarios where the resource primary key contains a non-ascii character.

Before this fix, st2api and also the CLI would throw an exception when trying to retrieve a resource with a primary key which contains non-ascii characters. This means users would not be able to perform "get one" API operation on such resources (either using the API or using the CLI).

Resolved #5188.

Background, Context

This issue was found and reported by @blag while testing the new release - #5188.

After digging in, I was able to reproduce the issue, find the root cause and fix it. I also confirmed this issue already existed in the past (I went as far as v3.0.0) so it's not a recent regression.

In fact, there are two issues / bugs as described below.

st2api API bug

webob.compat.url_unquote function doesn't work correctly under Python 3 - it tries to decode byte string using latin character set which means it will explode when bytestring contains utf-8 values.

We only support Python 3 now so we don't need that (broken) compat layer from webob anymore and can just use upstream urlunquote which works correctly.

After fixing the issue, I confirmed the fix using curl:

curl -X GET -H  'User-Agent: python-requests/2.23.0' -H  'Accept-Encoding: gzip, deflate' -H  'Accept: */*' -H  'Connection: keep-alive' http://127.0.0.1:9101/v1/rules/examples.test_rule_utf8_náme

curl -X GET -H  'User-Agent: python-requests/2.23.0' -H  'Accept-Encoding: gzip, deflate' -H  'Accept: */*' -H  'Connection: keep-alive' "http://127.0.0.1:9101/v1/rules/examples.test_rule_utf8_n%C3%A1me"

So that's all good, but there is actually also a bug in the client code.

st2client CLI bug

Even after fixing the bug in the API, retrieving a resource using the CLI command returned an exception and didn't work.

After digging into the docs and Python docs, I found out that sys.argv will contain unicode characters which have been decoded using surrogate escape sequences.

This of course won't work because it means we will try to send incorrect URL path with surrogate escape sequences (e.g. examples.test_rule_utf8_n%ED%B3%83%ED%B2%A1me) with the API request.

In addition to that, various other places in the CLI will break as well since they expect unicode values with unicode characters and not surrogate escape sequences.

First I tried to fix this issue by simply only re-encoding the string in the HTTPClient layer. This worked fine for the actual API requests, but as per my comment above, there are also other places which break if value contains surrogate escape sequences which means I moved this re-encoding to as early as possible to ensure rest of the CLI code works with valid unicode values.

As an additional safe-guard I added environment variable / feature flag with which this functionality can be disabled, but that really shouldn't be needed since even if re-encoding fails that exception is non-fatal (it's there mostly for testing purposes).

Some more context is also available in my comments in this issue - #5188.

Target release

For now I assigned this issue to the 3.5.0 milestone, but likely we should also do v3.4.2 release with this bug fix.

TODO

Unit tests
End to end tests - Add a test which verifies we can create and retrieve rule with unicode name st2tests#204
Changelog entry

characters are handled correctly.

This reverts commit 4e0bb04.

where the argument contains unicode character.

easier testability, etc. Add unit tests for that function.

Keep in mind that API tests are problematic - we don't exercise exactly the same code paths as we do when running outside tests because the tests itself utilize webtest module which has the same bug as webob. This means I needed to patch the code so it utilizes the same version of the functions as we do in prod. Not ideal and also shows why it's importat we also have actual end to end tests for the api.

Kami · 2021-03-13T14:54:14Z

Alright, I added various unit tests - e08f9f1, 2f06857.

Adding unit tests for API turned out to be quite challenging and painful.

The root cause is that unit tests for the API layer rely on webtest module which contains the same bug for unicode handling as webob code which I fixed in our router.

To get the tests to work, I needed to patch the test code so we use the same version we use for non-test code.

This is of course not ideal and also shows that it's important we have actual end to end tests for API because those webtest based one have limitations and don't 100% mimic actual non-test code paths.

I will also work on actual end to end test.

In the future we should probably also look at upgrading webob + webtest, but I wanted to avoid this for bug-fix release (v3.4.2) since that would be a larger and more invasive change (with more chance for regressions).

by end to end tests.

similar name.

Kami · 2021-03-13T17:03:29Z

Test fails on Ubuntu 16.04 because locale is not set to utf-8 - that's expected since if locale is not set to utf-8, exception will be thrown when trying to save resource with unicode in the name.

So going forward we need to decide if we should either update locale on Ubuntu 16.04 end to end test system to use UTF-8 or drop it all together (since as discussed in the past, it's EOL).

And to clarify, #5182 likely wont help or make a difference here - this behavior is expected and desired - if locale is not set to utf-8, trying to save data with unicode characters will fail and thrown an exception - as expected and designed (#5182 will only help with the underlying bug which resulted in infinite loop instead of the exception being propagated and logged).

the request URL contains invalid or incorrectly URL encoded characters. Previously we didn't handle this scenario which means original UnicodeDecodeError error with the stack trace got propagated to the user which is a no go. Related issue - GoogleCloudPlatform/webapp2#152.

Kami · 2021-03-13T18:26:17Z

While working and testing this fix, I found another edge case we don't correctly handle - 04e58b6.

If the request URL / path contains invalid characters (that's of course still possible even with this fix - e.g. invalid escape sequences, etc.), we didn't catch this error and the stack track would be propagated and returned directly to the end user:

(virtualenv) vagrant@ubuntu-xenial:~/st2$ curl -X GET -H  'User-Agent: python-requests/2.23.0' -H  'Accept-Encoding: gzip, deflate' -H  'Accept: */*' -H  'Connection: keep-alive' "http://127.0.0.1:9101/v1/rules/%D0%C2%BD%A8%CE%C4%BC%FE%BC%D0.rar"
Traceback (most recent call last):
  File "/home/vagrant/st2/st2common/st2common/router.py", line 246, in match
    path = urllib.parse.unquote(req.path)
  File "/home/vagrant/st2/virtualenv/lib/python3.6/site-packages/webob/request.py", line 476, in path
    bpath = bytes_(self.path_info, self.url_encoding)
  File "/home/vagrant/st2/virtualenv/lib/python3.6/site-packages/webob/descriptors.py", line 70, in fget
    return req.encget(key, encattr=encattr)
  File "/home/vagrant/st2/virtualenv/lib/python3.6/site-packages/webob/request.py", line 165, in encget
    return bytes_(val, 'latin-1').decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 10: invalid continuation byte

That's of course is not a good thing.

I fixed it to return user-friendly error and ensure we don't leak stack info to the end user.

blag · 2021-03-13T22:03:20Z

Regarding Ubuntu 16.04 - the current plan is to drop 16.04 support in ST2 3.5 IF we merge support for 20.04 by then.

But by that point Ubuntu 16.04 will be EOL, and I don't think it makes any sense whatsoever to support our software on an OS that is not supported by its upstream.

The CentOS story might be similar with the shenanigans happening there. So what exactly our OS support is going to look like is a bit up in the air at this point.

This comment here is a little disorganized (sorry), but the main point is that we should definitely add Ubuntu 20.04 support so we can drop Ubuntu 16.04 support.

tests on Ubuntu 16.04. Ubuntu 16.04 is EOL and it makes no sense to spend a lot of time working on a test workaround since it won't be needed in the near future anyway once we remove support for 16.04.

Kami · 2021-04-05T10:34:20Z

For now, I decided to exclude example rule fixture which was used by new end to end tests to make sure tests still pass on Ubuntu 16.04 for the time being - c0ed21b.

Ssadly there is no other easy option which would not require many changes which will soon be removed anyway when removing support for 16.04.

In the future, when we remove support for Ubuntu 16.04, I will add that test fixture back + merge new end to end test.

Kami · 2021-04-05T12:30:31Z

I'll go ahead and merge this since this bug is rather annoying and affecting all deployments which utilize unicode characters in resource refs (+ Python 3).

Kami added 4 commits March 13, 2021 12:21

Add WIP fix for handling urls with unicode characters.

b367ef2

Add WIP fix for the client side and make sure urls with unicode

4e0bb04

characters are handled correctly.

Revert "Add WIP fix for the client side and make sure urls with unicode"

20e9b81

This reverts commit 4e0bb04.

Add a fix for CLI which would result in exceptions in various scenarios

cab8b52

where the argument contains unicode character.

Kami added this to the 3.5.0 milestone Mar 13, 2021

Kami requested a review from blag March 13, 2021 12:35

pull-request-size bot added the size/M PR that changes 30-99 lines. Good size to review. label Mar 13, 2021

Kami added 2 commits March 13, 2021 15:07

Move functionality for surrogate re-encoding to a uility function for

e08f9f1

easier testability, etc. Add unit tests for that function.

pull-request-size bot added size/L PR that changes 100-499 lines. Requires some effort to review. and removed size/M PR that changes 30-99 lines. Good size to review. labels Mar 13, 2021

Kami added 2 commits March 13, 2021 15:58

Add changelog entry.

3ef58c9

Add an example rule fixture for rule with unicode name which can be used

db8e85e

by end to end tests.

Kami mentioned this pull request Mar 13, 2021

Add a test which verifies we can create and retrieve rule with unicode name StackStorm/st2tests#204

Open

Kami modified the milestones: 3.5.0, 3.4.2 Mar 13, 2021

Kami changed the title ~~[WIP] Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv)~~ Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv) Mar 13, 2021

Use a different more unique name to avoid breaking other test with a

f3a92b1

similar name.

Add a test case for it.

15274ea

Kami added 2 commits April 5, 2021 12:20

Merge branch 'master' into url_path_unicode_fix

fde03cc

For now, remove test fixture which is causing failure in end to end

c0ed21b

tests on Ubuntu 16.04. Ubuntu 16.04 is EOL and it makes no sense to spend a lot of time working on a test workaround since it won't be needed in the near future anyway once we remove support for 16.04.

Kami modified the milestones: 3.4.2, 3.5.0 Apr 5, 2021

Kami merged commit c687259 into master Apr 5, 2021

Kami deleted the url_path_unicode_fix branch April 5, 2021 12:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv) #5189

Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv) #5189

Kami commented Mar 13, 2021 •

edited

Kami commented Mar 13, 2021 •

edited

Kami commented Mar 13, 2021 •

edited

Kami commented Mar 13, 2021 •

edited

blag commented Mar 13, 2021

Kami commented Apr 5, 2021

Kami commented Apr 5, 2021

Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv) #5189

Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv) #5189

Conversation

Kami commented Mar 13, 2021 • edited

Description

Background, Context

st2api API bug

st2client CLI bug

Target release

TODO

Kami commented Mar 13, 2021 • edited

Kami commented Mar 13, 2021 • edited

Kami commented Mar 13, 2021 • edited

blag commented Mar 13, 2021

Kami commented Apr 5, 2021

Kami commented Apr 5, 2021

Kami commented Mar 13, 2021 •

edited

Kami commented Mar 13, 2021 •

edited

Kami commented Mar 13, 2021 •

edited

Kami commented Mar 13, 2021 •

edited