Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv) #5189

Merged
merged 13 commits into from
Apr 5, 2021

Conversation

Kami
Copy link
Member

@Kami Kami commented Mar 13, 2021

Description

This pull request fixes a bug in the API and CLI side to make sure we correctly handle scenarios where the resource primary key contains a non-ascii character.

Before this fix, st2api and also the CLI would throw an exception when trying to retrieve a resource with a primary key which contains non-ascii characters. This means users would not be able to perform "get one" API operation on such resources (either using the API or using the CLI).

Resolved #5188.

Background, Context

This issue was found and reported by @blag while testing the new release - #5188.

After digging in, I was able to reproduce the issue, find the root cause and fix it. I also confirmed this issue already existed in the past (I went as far as v3.0.0) so it's not a recent regression.

In fact, there are two issues / bugs as described below.

st2api API bug

webob.compat.url_unquote function doesn't work correctly under Python 3 - it tries to decode byte string using latin character set which means it will explode when bytestring contains utf-8 values.

We only support Python 3 now so we don't need that (broken) compat layer from webob anymore and can just use upstream urlunquote which works correctly.

After fixing the issue, I confirmed the fix using curl:

curl -X GET -H  'User-Agent: python-requests/2.23.0' -H  'Accept-Encoding: gzip, deflate' -H  'Accept: */*' -H  'Connection: keep-alive' http://127.0.0.1:9101/v1/rules/examples.test_rule_utf8_náme

curl -X GET -H  'User-Agent: python-requests/2.23.0' -H  'Accept-Encoding: gzip, deflate' -H  'Accept: */*' -H  'Connection: keep-alive' "http://127.0.0.1:9101/v1/rules/examples.test_rule_utf8_n%C3%A1me"

So that's all good, but there is actually also a bug in the client code.

st2client CLI bug

Even after fixing the bug in the API, retrieving a resource using the CLI command returned an exception and didn't work.

After digging into the docs and Python docs, I found out that sys.argv will contain unicode characters which have been decoded using surrogate escape sequences.

This of course won't work because it means we will try to send incorrect URL path with surrogate escape sequences (e.g. examples.test_rule_utf8_n%ED%B3%83%ED%B2%A1me) with the API request.

In addition to that, various other places in the CLI will break as well since they expect unicode values with unicode characters and not surrogate escape sequences.

First I tried to fix this issue by simply only re-encoding the string in the HTTPClient layer. This worked fine for the actual API requests, but as per my comment above, there are also other places which break if value contains surrogate escape sequences which means I moved this re-encoding to as early as possible to ensure rest of the CLI code works with valid unicode values.

As an additional safe-guard I added environment variable / feature flag with which this functionality can be disabled, but that really shouldn't be needed since even if re-encoding fails that exception is non-fatal (it's there mostly for testing purposes).

Some more context is also available in my comments in this issue - #5188.

Target release

For now I assigned this issue to the 3.5.0 milestone, but likely we should also do v3.4.2 release with this bug fix.

TODO

@Kami Kami added this to the 3.5.0 milestone Mar 13, 2021
@Kami Kami requested a review from blag March 13, 2021 12:35
@pull-request-size pull-request-size bot added the size/M PR that changes 30-99 lines. Good size to review. label Mar 13, 2021
easier testability, etc.

Add unit tests for that function.
Keep in mind that API tests are problematic - we don't exercise exactly
the same code paths as we do when running outside tests because the
tests itself utilize webtest module which has the same bug as webob.

This means I needed to patch the code so it utilizes the same version of
the functions as we do in prod.

Not ideal and also shows why it's importat we also have actual end to
end tests for the api.
@pull-request-size pull-request-size bot added size/L PR that changes 100-499 lines. Requires some effort to review. and removed size/M PR that changes 30-99 lines. Good size to review. labels Mar 13, 2021
@Kami
Copy link
Member Author

Kami commented Mar 13, 2021

Alright, I added various unit tests - e08f9f1, 2f06857.

Adding unit tests for API turned out to be quite challenging and painful.

The root cause is that unit tests for the API layer rely on webtest module which contains the same bug for unicode handling as webob code which I fixed in our router.

To get the tests to work, I needed to patch the test code so we use the same version we use for non-test code.

This is of course not ideal and also shows that it's important we have actual end to end tests for API because those webtest based one have limitations and don't 100% mimic actual non-test code paths.

I will also work on actual end to end test.

In the future we should probably also look at upgrading webob + webtest, but I wanted to avoid this for bug-fix release (v3.4.2) since that would be a larger and more invasive change (with more chance for regressions).

@Kami Kami modified the milestones: 3.5.0, 3.4.2 Mar 13, 2021
@Kami Kami changed the title [WIP] Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv) Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv) Mar 13, 2021
@Kami
Copy link
Member Author

Kami commented Mar 13, 2021

Test fails on Ubuntu 16.04 because locale is not set to utf-8 - that's expected since if locale is not set to utf-8, exception will be thrown when trying to save resource with unicode in the name.

So going forward we need to decide if we should either update locale on Ubuntu 16.04 end to end test system to use UTF-8 or drop it all together (since as discussed in the past, it's EOL).

And to clarify, #5182 likely wont help or make a difference here - this behavior is expected and desired - if locale is not set to utf-8, trying to save data with unicode characters will fail and thrown an exception - as expected and designed (#5182 will only help with the underlying bug which resulted in infinite loop instead of the exception being propagated and logged).

the request URL contains invalid or incorrectly URL encoded characters.

Previously we didn't handle this scenario which means original
UnicodeDecodeError error with the stack trace got propagated to the user
which is a no go.

Related issue -
GoogleCloudPlatform/webapp2#152.
@Kami
Copy link
Member Author

Kami commented Mar 13, 2021

While working and testing this fix, I found another edge case we don't correctly handle - 04e58b6.

If the request URL / path contains invalid characters (that's of course still possible even with this fix - e.g. invalid escape sequences, etc.), we didn't catch this error and the stack track would be propagated and returned directly to the end user:

(virtualenv) vagrant@ubuntu-xenial:~/st2$ curl -X GET -H  'User-Agent: python-requests/2.23.0' -H  'Accept-Encoding: gzip, deflate' -H  'Accept: */*' -H  'Connection: keep-alive' "http://127.0.0.1:9101/v1/rules/%D0%C2%BD%A8%CE%C4%BC%FE%BC%D0.rar"
Traceback (most recent call last):
  File "/home/vagrant/st2/st2common/st2common/router.py", line 246, in match
    path = urllib.parse.unquote(req.path)
  File "/home/vagrant/st2/virtualenv/lib/python3.6/site-packages/webob/request.py", line 476, in path
    bpath = bytes_(self.path_info, self.url_encoding)
  File "/home/vagrant/st2/virtualenv/lib/python3.6/site-packages/webob/descriptors.py", line 70, in fget
    return req.encget(key, encattr=encattr)
  File "/home/vagrant/st2/virtualenv/lib/python3.6/site-packages/webob/request.py", line 165, in encget
    return bytes_(val, 'latin-1').decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 10: invalid continuation byte

That's of course is not a good thing.

I fixed it to return user-friendly error and ensure we don't leak stack info to the end user.

@blag
Copy link
Contributor

blag commented Mar 13, 2021

Regarding Ubuntu 16.04 - the current plan is to drop 16.04 support in ST2 3.5 IF we merge support for 20.04 by then.

But by that point Ubuntu 16.04 will be EOL, and I don't think it makes any sense whatsoever to support our software on an OS that is not supported by its upstream.

The CentOS story might be similar with the shenanigans happening there. So what exactly our OS support is going to look like is a bit up in the air at this point.

This comment here is a little disorganized (sorry), but the main point is that we should definitely add Ubuntu 20.04 support so we can drop Ubuntu 16.04 support.

Kami added 2 commits April 5, 2021 12:20
tests on Ubuntu 16.04.

Ubuntu 16.04 is EOL and it makes no sense to spend a lot of time working
on a test workaround since it won't be needed in the near future anyway
once we remove support for 16.04.
@Kami Kami modified the milestones: 3.4.2, 3.5.0 Apr 5, 2021
@Kami
Copy link
Member Author

Kami commented Apr 5, 2021

For now, I decided to exclude example rule fixture which was used by new end to end tests to make sure tests still pass on Ubuntu 16.04 for the time being - c0ed21b.

Ssadly there is no other easy option which would not require many changes which will soon be removed anyway when removing support for 16.04.

In the future, when we remove support for Ubuntu 16.04, I will add that test fixture back + merge new end to end test.

@Kami
Copy link
Member Author

Kami commented Apr 5, 2021

I'll go ahead and merge this since this bug is rather annoying and affecting all deployments which utilize unicode characters in resource refs (+ Python 3).

@Kami Kami merged commit c687259 into master Apr 5, 2021
@Kami Kami deleted the url_path_unicode_fix branch April 5, 2021 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/L PR that changes 100-499 lines. Requires some effort to review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot view rule with UTF-8 character in name
2 participants