Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swap http-parse to llhttp #56

Merged
merged 20 commits into from Mar 30, 2021
Merged

Swap http-parse to llhttp #56

merged 20 commits into from Mar 30, 2021

Conversation

victoraugustolls
Copy link
Contributor

@victoraugustolls victoraugustolls commented Oct 24, 2020

Hi!

First of all, I'm pretty new to Cython, so I'm sorry if I made any dumb mistakes here (pretty sure I might have messed up setup.py).

Some observations:

  • The tests are failing due to some import error, and I'm pretty sure it is because of the changes I made to setup.py Tests are now passing
  • I changed the upgrade behavior and don't know if it is the right approach, would love some feedback
  • The upgrade tests used offset that I couldn't reproduce with llhttp so I might have missed something

@elprans I would love some feedback here to get this going!

Also tagging @indutny in case you have any 2cents or some warnings about something we might miss on this migration, or that we need extra care :D

@victoraugustolls
Copy link
Contributor Author

victoraugustolls commented Oct 24, 2020

Hey @elprans ! All tests are now passing, I was able to understand the problem to setup.py and llhttp dependency with github actions. If you want, I can throw python 3.9 to the matrix too.

The thing here that I don't think it is ok is the upgrade behavior and tests, so would love the feedback! Most specifically, I don't think that I should be calling cparser.llhttp_resume_after_upgrade(self._cparser), but expose this as a method like def resume_after_upgrade(self), but left it there to hear your opinion.

llhttp has a release branch (see issue), that might be the best option for distribution (the approach I took after finding this). This way, I think there is no need to touch test / release pipelines.

@victoraugustolls
Copy link
Contributor Author

@elprans I swapped the PyMem_Free calling order due to this comment

@methane
Copy link

methane commented Oct 28, 2020

@elprans I swapped the PyMem_Free calling order due to this comment

I think the comment doesn't mean free(settings) after free(parser) leaks memory.
See this commit. Ruby binding of llhttp did not free(parser->settings). It caused memory leak.

@indutny
Copy link

indutny commented Oct 28, 2020

Yeah, the order is irrelevant as long as you free both.

@victoraugustolls
Copy link
Contributor Author

victoraugustolls commented Oct 28, 2020

@methane and @indutny thanks, you guys are totally right! I didn't open the ruby code to inspect, totally my fault. If you find any other mistake here, don't hesitate to say! 🙂

@victoraugustolls
Copy link
Contributor Author

Hey @elprans ! Were you able to take a look here?

@@ -1,3 +1,7 @@
[submodule "vendor/http-parser"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we no longer need the http-parser submodule, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Actually it is used to parse urls, as llhttp don't have this feature!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uparser.pxd is used only to extern url parsing methods from http_parser.h, and url_parser..pyx contains previous code related to url parsing. I couldn't extern both llhttp.h and http_parser.h on the same file as they have naming conflicts

@1st1
Copy link
Member

1st1 commented Nov 1, 2020

@tomchristie @sethmlarson Do you guys still use httptools in your projects? Could you please check if this branch works with them? I'd really like to merge this, and overall it looks like it should be a drop in replacement, but I still worry a bit because our test suite could be better.

@sethmlarson
Copy link
Contributor

I don't use httptools for anything, looks like uvicorn pins httptools==0.1.*.

@victoraugustolls
Copy link
Contributor Author

victoraugustolls commented Nov 1, 2020

@tomchristie @sethmlarson Do you guys still use httptools in your projects? Could you please check if this branch works with them? I'd really like to merge this, and overall it looks like it should be a drop in replacement, but I still worry a bit because our test suite could be better.

If httptools is installed, uvicorn will use it, but is an optional dependency!

@1st1
Copy link
Member

1st1 commented Nov 1, 2020

If httptools is installed, uvicorn will use it, but is an optional dependency!

Sure, but we still need to make sure it will work when uvicorn pins a new version. E.g.
Starlette wouldn't break.

@1st1
Copy link
Member

1st1 commented Nov 1, 2020

@victoraugustolls fwiw if you could verify that starlette and uvicorn work fine and their test suites pass with the updated http-tools it would be great!

@victoraugustolls
Copy link
Contributor Author

If httptools is installed, uvicorn will use it, but is an optional dependency!

Sure, but we still need to make sure it will work when uvicorn pins a new version. E.g.
Starlette wouldn't break.

Oh, this I totally agree!

@victoraugustolls
Copy link
Contributor Author

@victoraugustolls fwiw if you could verify that starlette and uvicorn work fine and their test suites pass with the updated http-tools it would be great!

Will do!

@victoraugustolls
Copy link
Contributor Author

@1st1 regarding uvicorn:

  • I swapped the pinned httptools to git+git://github.com/victoraugustolls/httptools.git@master#egg=httptools
  • All tests passed
  • Double checked the venv to make sure that the fork version was installed

image
image

@victoraugustolls
Copy link
Contributor Author

victoraugustolls commented Nov 1, 2020

@1st1 regarding Starlette, it seems like they don't force you to use uvicorn with starlette, the user can choose the ASGI server they want, so their tests don't cover uvicorn usage. At least this is what I could find. Please let me know if you still want me to do something related to Starlette.

Also, can you give me your 2cent on parser.pyx line 189?

if self._cparser.upgrade == 1 and nb == cparser.HPE_PAUSED_UPGRADE:
    read_bytes = cparser.llhttp_get_error_pos(self._cparser)
    cparser.llhttp_resume_after_upgrade(self._cparser)
    raise HttpParserUpgrade(read_bytes)

I don't think I should call cparser.llhttp_resume_after_upgrade(self._cparser) here, but I'm quite in doubt.

@victoraugustolls
Copy link
Contributor Author

victoraugustolls commented Nov 1, 2020

I also just ran a quick load test against an api of mine (locally and on Azure) using a forked version of uvicorn, which uses my fork from this pr and it worked flawlessly

@victoraugustolls
Copy link
Contributor Author

victoraugustolls commented Nov 1, 2020

One other point that concerns me, besides upgrade, is this none on llhttp.h:

 * NOTE: if this function ever returns a non-pause type error, it will continue
 * to return the same error upon each successive call up until `llhttp_init()`
 * is called.
llhttp_errno_t llhttp_execute(llhttp_t* parser, const char* data, size_t len);

Should we be calling llhttp_init() on our end? It seems like we do. @indutny do you have an opinion on this?

UPDATE:

I just added a test to add this behavior against this pr and against master, and this NOTE that I posted above and it seems to be the same behavior on llhttp and http-parser, so I guess no changes needed, but would love confirmation.

I modified this test:

    def test_parser_request_3(self):
        m = mock.Mock()
        p = httptools.HttpRequestParser(m)
        with self.assertRaises(httptools.HttpParserInvalidURLError):
            p.feed_data(b'POST  HTTP/1.2')

        p.feed_data(CHUNKED_REQUEST1_3)

        self.assertEqual(p.get_method(), b'POST')

        m.on_url.assert_called_once_with(b'/test.php?a=b+c')
        self.assertEqual(p.get_http_version(), '1.2')

        m.on_header.assert_called_with(b'Transfer-Encoding', b'chunked')
        m.on_chunk_header.assert_called_with()
        m.on_chunk_complete.assert_called_with()

        self.assertTrue(m.on_message_complete.called)

And in both cases httptools.HttpParserInvalidURLError were raised again.

@indutny
Copy link

indutny commented Nov 3, 2020

Yeah, similar requirement exists for http_parser too. After receiving a fatal error out of it you have to reinitialize it before feeding a new request into it.

@victoraugustolls
Copy link
Contributor Author

@indutny Thanks for the confirmation!

@1st1 Is there something you want me to do here? I'm all ears!

@1st1
Copy link
Member

1st1 commented Nov 18, 2020

I plan to soon (in a few days) work on a project that will involve httptools. I plan to merge this PR then. Sorry for the delay, guys.

@victoraugustolls
Copy link
Contributor Author

Great to hear @1st1 ! Just let me know if you need anything from me here!

@victoraugustolls
Copy link
Contributor Author

Hey @1st1 ! Hope you're doing well! Was just wondering if you had the chance to take a look at this PR!

@fantix
Copy link
Member

fantix commented Mar 25, 2021

Hi @victoraugustolls , sorry for the delay and appreciate your help on this! I'll be helping with getting this PR merged. Looks like llhttp fixed a CVE in a recent 4.0.0 release (or the 2.2.1 patch), I think it makes sense to include that fix. I'll be aslo trying locally to see if the "major API changes" in llhttp cause any large extra effort to this PR. Looks like either 4.0.0 or 2.2.1 works fine.

llhttp has a release branch (see issue), that might be the best option for distribution (the approach I took after finding this). This way, I think there is no need to touch test / release pipelines.

We should pin the submodule to the release/vx.x.x tags. The release branch itself I believe is only to keep all the release/vx.x.x tags, itself as a branch jumps between versions back and forth. But anyways I think git submodule pins to a revision anyways, so we could probably remove the branch = release in .gitmodules?

Copy link
Member

@fantix fantix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll include comments for the pyx files in the next review.

.gitignore Outdated
Comment on lines 32 to 33
vendor/llhttp/node_modules
vendor/llhttp/build
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we pinned to llhttp release/* tags, this is not needed any more. (llhttp should have its own .gitignore on the vx.x.x tags and in the master branch, right?)

.gitmodules Outdated
[submodule "vendor/llhttp"]
path = vendor/llhttp
url = https://github.com/nodejs/llhttp.git
branch = release
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned earlier, we should remove the confusing branch = release.

setup.py Outdated Show resolved Hide resolved


cdef extern from "../../vendor/http-parser/http_parser.h":
ctypedef int (*http_data_cb) (http_parser*,
cdef extern from "../../vendor/llhttp/include/llhttp.h":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cdef extern from "../../vendor/llhttp/include/llhttp.h":
cdef extern from "llhttp.h":

This was a leftover issue - we should let the compiler find the right header files instead of hard-coding here - as we do support --use-system-http-parser.

from libc.stdint cimport uint16_t


cdef extern from "../../vendor/http-parser/http_parser.h":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cdef extern from "../../vendor/http-parser/http_parser.h":
cdef extern from "http_parser.h":

Ditto, leftover issue.

from .errors import HttpParserInvalidURLError

cimport cython
from . cimport uparser
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name uparser feels weird to me comparing to cparser.pxd standing for "the Cython header file to the parser written in C". Perhaps url_cparser is more aligned, if we don't want to name the pxd files after llhttp or http-parser for good reasons like using future llurl for url_cparser. So let's have:

  • parser.pyx -> cparser.pxd for HTTP parsing
  • url_parser.pyx -> url_cparser.pxd for URL parsing

Copy link
Member

@fantix fantix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing here that I don't think it is ok is the upgrade behavior and tests, so would love the feedback! Most specifically, I don't think that I should be calling cparser.llhttp_resume_after_upgrade(self._cparser), but expose this as a method like def resume_after_upgrade(self), but left it there to hear your opinion.

Looks like what llhttp_resume_after_upgrade() does is quite simple - it reset the parser's internal error state if it was paused for upgrade. So according to another doc you quoted:

 * NOTE: if this function ever returns a non-pause type error, it will continue
 * to return the same error upon each successive call up until `llhttp_init()`
 * is called.
llhttp_errno_t llhttp_execute(llhttp_t* parser, const char* data, size_t len);

It's all about how the parser behaves after detecting an upgrade. In short, I think you're right to call llhttp_resume_after_upgrade() in feed_data() (the only missing step in your PR is to calculate the pointer's difference, see below), or successive calls to feed_data() will raise HttpParserUpgrade errors, which is inconsistent with the current released httptools. What's more, according to the docs of llhttp_get_error_pos():

Returns the pointer to the last parsed byte before the returned error. The
pointer is relative to the data argument of llhttp_execute().

Note: this method might be useful for counting the number of parsed bytes.

There's no guarantee that the 2nd call to feed_data() is using the same data parameter - in other words, the return value of llhttp_get_error_pos() is considered an invalid pointer without the context of the first call to feed_data(), so we should calculate the number of parsed bytes while we can.

For the behavior of calling feed_data() again after handling HttpParserUpgrade errors, we should add a test for it, even though this is a rare use case.

TL;DR: Please see suggested patch below:

diff --git a/httptools/parser/parser.pyx b/httptools/parser/parser.pyx
index 6656bef..3158d0e 100644
--- a/httptools/parser/parser.pyx
+++ b/httptools/parser/parser.pyx
@@ -161,13 +161,15 @@ cdef class HttpParser:
     def feed_data(self, data):
         cdef:
             size_t data_len
-            cparser.llhttp_errno_t nb
+            cparser.llhttp_errno_t err
             Py_buffer *buf
+            bint owning_buf = False
+            char* err_pos

         if PyMemoryView_Check(data):
             buf = PyMemoryView_GET_BUFFER(data)
             data_len = <size_t>buf.len
-            nb = cparser.llhttp_execute(
+            err = cparser.llhttp_execute(
                 self._cparser,
                 <char*>buf.buf,
                 data_len)
@@ -175,21 +177,34 @@ cdef class HttpParser:
         else:
             buf = &self.py_buf
             PyObject_GetBuffer(data, buf, PyBUF_SIMPLE)
+            owning_buf = True
             data_len = <size_t>buf.len

-            nb = cparser.llhttp_execute(
+            err = cparser.llhttp_execute(
                 self._cparser,
                 <char*>buf.buf,
                 data_len)

-            PyBuffer_Release(buf)
-
-        if self._cparser.upgrade == 1 and nb == cparser.HPE_PAUSED_UPGRADE:
-            read_bytes = cparser.llhttp_get_error_pos(self._cparser)
-            cparser.llhttp_resume_after_upgrade(self._cparser)
-            raise HttpParserUpgrade(read_bytes)
-
-        if nb != cparser.HPE_OK:
+        try:
+            if self._cparser.upgrade == 1 and err == cparser.HPE_PAUSED_UPGRADE:
+                err_pos = cparser.llhttp_get_error_pos(self._cparser)
+
+                # Immediately free the parser from "error" state, simulating
+                # http-parser behavior here because 1) we never had the API to
+                # allow users manually "resume after upgrade", and 2) the use
+                # case for resuming parsing is very rare.
+                cparser.llhttp_resume_after_upgrade(self._cparser)
+
+                # The err_pos here is specific for the input buf. So if we ever
+                # switch to the llhttp behavior (re-raise HttpParserUpgrade for
+                # successive calls to feed_data() until resume_after_upgrade is
+                # called), we have to store the result and keep our own state.
+                raise HttpParserUpgrade(err_pos - <char*>buf.buf)
+        finally:
+            if owning_buf:
+                PyBuffer_Release(buf)
+
+        if err != cparser.HPE_OK:
             ex = parser_error_from_errno(
                 self._cparser,
                 <cparser.llhttp_errno_t> self._cparser.error)
diff --git a/tests/test_parser.py b/tests/test_parser.py
index 101da22..cd6f59c 100644
--- a/tests/test_parser.py
+++ b/tests/test_parser.py
@@ -332,11 +332,11 @@ class TestRequestParser(unittest.TestCase):
         try:
             p.feed_data(UPGRADE_REQUEST1)
         except httptools.HttpParserUpgrade as ex:
-            offset = len(ex.args[0])
+            offset = ex.args[0]
         else:
             self.fail('HttpParserUpgrade was not raised')

-        self.assertEqual(UPGRADE_REQUEST1[-offset:], b'Hot diggity dogg')
+        self.assertEqual(UPGRADE_REQUEST1[offset:], b'Hot diggity dogg')

         self.assertEqual(headers, {
             b'Sec-WebSocket-Key2': b'12998 5 Y3 1  .P00',
@@ -347,6 +347,11 @@ class TestRequestParser(unittest.TestCase):
             b'Host': b'example.com',
             b'Upgrade': b'WebSocket'})

+        # The parser can be used again for further parsing - this is a legacy
+        # behavior from the time we were still using http-parser.
+        p.feed_data(CHUNKED_REQUEST1_1)
+        self.assertEqual(p.get_method(), b'POST')
+
     def test_parser_request_upgrade_flag(self):

         class Protocol:

def feed_data(self, data):
cdef:
size_t data_len
size_t nb
cparser.llhttp_errno_t nb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is no longer nb (number of bytes parsed).

Suggested change
cparser.llhttp_errno_t nb
cparser.llhttp_errno_t err

@@ -200,11 +200,11 @@ def test_parser_upgrade_response_1(self):
try:
p.feed_data(UPGRADE_RESPONSE1)
except httptools.HttpParserUpgrade as ex:
offset = ex.args[0]
offset = len(ex.args[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like mentioned earlier - we shouldn't change the HttpParserUpgrade API here.

Copy link
Member

@fantix fantix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I've added a few fixes and let's get this merged first. Please let me know if you have concerns about my fixes, we could address them in a new PR.

@fantix fantix merged commit 63b5de2 into MagicStack:master Mar 30, 2021
@fantix fantix mentioned this pull request Apr 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants