Deadlock when closing the http_listener #13

Closed
gigaplex opened this Issue Nov 9, 2015 · 13 comments

Contributor

gigaplex commented Nov 9, 2015

Duplicated from CodePlex (issue 421) since the documentation says that issues should be reported to GitHub instead:

When calling http_listener.close().wait() as we're shutting down our service, the application will occasionally deadlock because the wait operation never returns. The deadlock appears to be in the http_windows_server::stop() function; there's a wait that never returns.

pplx::task<void> http_windows_server::stop()
{
    // Shutdown request queue.
    if(m_hRequestQueue != nullptr)
    {
        HttpShutdownRequestQueue(m_hRequestQueue);
        m_receivingTask.wait();

        // Wait for all requests to be finished processing.
        m_zeroOutstandingRequests.wait(); // Deadlock appears to be here

        HttpCloseRequestQueue(m_hRequestQueue);
    }
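
For context, the calling pattern on our side is nothing unusual. A minimal sketch of it (the URI and handler below are placeholders, not our actual service code):

#include <cpprest/http_listener.h>

using namespace web::http;
using namespace web::http::experimental::listener;

int main()
{
    http_listener listener(U("http://localhost:8080/api"));
    listener.support(methods::GET, [](http_request request)
    {
        request.reply(status_codes::OK, U("hello"));
    });

    listener.open().wait();

    // ... service runs and handles requests ...

    // Service shutdown: this wait occasionally never returns.
    listener.close().wait();
    return 0;
}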

Contributor

gigaplex commented Nov 10, 2015

I suspect this area of code is of interest:

pplx::task<void> http_windows_server::respond(http::http_response response)
{
    windows_request_context * p_context = static_cast<windows_request_context *>(response._get_server_context());
    return pplx::create_task(p_context->m_response_completed).then([p_context](::pplx::task<void> t)
    {
        // After response is sent, break circular reference between http_response and the request context.
        // Otherwise http_listener::close() can hang.
        p_context->m_response._get_impl()->_set_server_context(nullptr);
        t.get();
    });
}

It looks like if some exception prevents this bit of code from running, then the circular reference is never broken, causing the hang.
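
To make the dependency concrete, this is the pplx behaviour I believe is in play (a standalone sketch, not library code): the cleanup continuation only runs once the completion event is signalled, so anything waiting on that continuation blocks forever if no code path ever sets the event.

#include <pplx/pplxtasks.h>
#include <iostream>

int main()
{
    pplx::task_completion_event<void> response_completed;

    auto cleanup = pplx::create_task(response_completed)
        .then([](pplx::task<void> t)
    {
        // Stand-in for the "break the circular reference" step above.
        std::cout << "cleanup continuation ran" << std::endl;
        t.get();
    });

    // In the failure case nothing ever signals the event, so a wait on
    // `cleanup` blocks forever -- the same shape as the hang seen in
    // http_windows_server::stop(). Signalling it releases the waiter:
    response_completed.set();
    cleanup.wait();
    return 0;
}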

Contributor

gigaplex commented Nov 10, 2015

From windows_request_context::async_process_request:

const unsigned long error_code = HttpReceiveHttpRequest(
        pServer->m_hRequestQueue,
        m_request_id,
        0,
        m_request,
        headers_size,
        NULL,
        &m_overlapped);

if(error_code != NO_ERROR && error_code != ERROR_IO_PENDING)
{
    CancelThreadpoolIo(pServer->m_threadpool_io);
    m_msg.reply(status_codes::InternalError);
}

I've managed to reproduce the issue by putting a breakpoint at HttpReceiveHttpRequest in windows_request_context::async_process_request and waiting for the client to time out, causing a network disconnection. HttpReceiveHttpRequest returns ERROR_CONNECTION_INVALID, which prevents m_msg.reply(status_codes::InternalError); from succeeding. The response now waits indefinitely for a broken network connection, and holds on to the smart pointer for the server context, keeping the number of outstanding requests above zero indefinitely. I wouldn't be surprised if there were other error conditions that prevented the reference from being released.

Contributor

gigaplex commented Nov 10, 2015

I forgot to mention that I was running version 2.6 of the library.

Contributor

gigaplex commented Nov 10, 2015

It seems a lot of the error handling code in the windows_request_context class relies on m_msg.reply(status_codes::InternalError); doing the right thing. However, if the lambda in windows_request_context::dispatch_request_to_listener that runs in response to the http_request.content_ready() task never runs, then bad things will happen. The callback for the http_request.get_response() task is not initialised, preventing replies from working, or at least from releasing the cyclic reference to windows_request_context. Also, the m_msg handle won't be reset, which means there could still be a cyclic reference between the http_request and windows_request_context if no attempt is made to send a reply.

Contributor

ras0219-msft commented Nov 10, 2015

First off, thanks a bunch for this very detailed analysis. I'm going to look into reproducing the issue on my local machine.

Please continue posting any further details you find out.

Contributor

ras0219-msft commented Nov 10, 2015

One piece of information that might be useful in your analysis: The only way that the lambda in

pplx::task<void> http_windows_server::respond(http::http_response response)
{
    windows_request_context * p_context = static_cast<windows_request_context *>(response._get_server_context());
    return pplx::create_task(p_context->m_response_completed).then([p_context](::pplx::task<void> t)
    {
        // After response is sent, break circular reference between http_response and the request context.
        // Otherwise http_listener::close() can hang.
        p_context->m_response._get_impl()->_set_server_context(nullptr);
        t.get();
    });
}

could possibly not run is if m_response_completed is never set. Even if it is set with an exception, the continuation will still run (and the server context will be reset correctly).
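
To illustrate that outside the library (a standalone sketch, not the actual cpprest code): completing a task_completion_event with an exception still fires a task-based continuation, and t.get() inside it rethrows, so the server-context reset would still happen.

#include <pplx/pplxtasks.h>
#include <exception>
#include <iostream>
#include <stdexcept>

int main()
{
    pplx::task_completion_event<void> response_completed;

    auto cleanup = pplx::create_task(response_completed)
        .then([](pplx::task<void> t)
    {
        // Stand-in for "reset the server context"; runs for both normal
        // and exceptional completion of the event.
        std::cout << "continuation ran" << std::endl;
        try
        {
            t.get(); // rethrows whatever the event was set with
        }
        catch (const std::exception & e)
        {
            std::cout << "observed: " << e.what() << std::endl;
        }
    });

    response_completed.set_exception(std::make_exception_ptr(std::runtime_error("send failed")));
    cleanup.wait();
    return 0;
}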

Contributor

gigaplex commented Nov 10, 2015

Thanks for the update, but that isn't entirely accurate. There are several ways in which that lambda can fail to run correctly. First, when HttpReceiveHttpRequest fails in windows_request_context::async_process_request, m_response_completed is never set because no exception is thrown. Secondly, I've also noticed that if the get_response() task I mentioned from dispatch_request_to_listener doesn't run (there are many ways this can fail), then m_response isn't initialised, so the task in http_windows_server::respond can't break the cyclic reference as it's not pointing to the correct member (it's under p_context->m_msg.get_response().get() instead).

I've seen an instance where the destructor of m_response_completed triggered the lambda, but it crashed because p_context had already been deleted. I haven't been able to reliably reproduce that situation; I've got more analysis to do.

I've also found a case where m_response_completed is set with an exception, but since no response was created (and thus the protective task-based continuation from details::_http_request::_reply_impl isn't registered), the destructor of m_response_completed takes down the whole process with an unobserved exception violation. This occurs when HttpReceiveRequestEntityBody returns a serious (non-EOF) failure inside windows_request_context::read_request_body_chunk.

Contributor

gigaplex commented Nov 11, 2015

In case you're having difficulty reproducing, I've got my client set to a 1-second timeout, and I'm strategically placing breakpoints in the cpprest library on the server side, waiting for the client to time out the request, and then resuming the server.

Key breakpoint places that cause issues so far include:

  • Immediately before HttpReceiveHttpRequest in windows_request_context::async_process_request (prevents m_response_completed from being triggered)
  • Immediately before HttpReceiveRequestEntityBody in windows_request_context::read_request_body_chunk (causes an unobserved exception on m_response_completed)

Additionally, forcing the client to send a malformed URL causes a similar failure. The function dispatch_request_to_listener is not called, preventing registration of the get_response() lambda and therefore preventing m_response_completed from being signalled.
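
For reference, the client side of the repro is roughly the following (a sketch with a placeholder URI; the only important part is the aggressive timeout):

#include <cpprest/http_client.h>
#include <chrono>
#include <iostream>

int main()
{
    web::http::client::http_client_config config;
    config.set_timeout(std::chrono::seconds(1)); // give up quickly so the connection gets torn down

    web::http::client::http_client client(U("http://localhost:8080/api"), config);

    try
    {
        // While the server is paused at one of the breakpoints above, this
        // request times out and the connection goes away underneath the listener.
        client.request(web::http::methods::GET).get();
    }
    catch (const std::exception & e)
    {
        std::cout << "request failed as expected: " << e.what() << std::endl;
    }
    return 0;
}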

Contributor

gigaplex commented Nov 12, 2015

I'm currently drafting a patch for a pull request to hopefully resolve some/all of the issues I've identified; however, there are some questions I'd like answered before continuing.

  • There are quite a few places that call m_msg.reply(...) with some kind of error code. The replies aren't being returned because the callbacks required to process the reply are conditionally initialised inside windows_request_context::dispatch_request_to_listener. There's a guard with a comment saying "If an exception occurred while processing the body then there is no reason to even try sending the response, just re-surface the same exception." This goes against the whole point of sending the error replies back to the client. Is there some justification for blocking the reply? I don't want to reintroduce an old bug if there's a specific reason for this.
  • The code ensures that the response can't be sent until the request body has been fully received. The current code has lost the comment explaining that this is intentional, and I didn't find a specific reason why it should be the case. If there's an error in the response header, or some other internal error, I'm not sure why we'd need to wait for the request body before informing the client of the error reply. What issue does this delay resolve?
  • The "asio" based listener breaks out the reply callbacks from the connection::dispatch_request_to_listener function, so that error replies are properly handled. Is there a specific reason that the "httpsys" variant doesn't follow this model? It's the model I'm planning on basing my patch on. While comparing the two implementations, I did notice that the "asio" version doesn't handle the task_canceled exception in the get_response handler; I'd suggest porting that handling over from the "httpsys" version (see the sketch below).
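
Regarding the last point, the guard I have in mind is just the usual task-based continuation around get_response(). A sketch of the pattern only (observe_response is a hypothetical helper name for illustration, not the actual asio or httpsys listener code):

#include <cpprest/http_msg.h>
#include <pplx/pplxtasks.h>

// Hypothetical helper: `request` would be the message handed to
// dispatch_request_to_listener.
static void observe_response(web::http::http_request request)
{
    request.get_response().then([](pplx::task<web::http::http_response> t)
    {
        try
        {
            t.get();
        }
        catch (const pplx::task_canceled &)
        {
            // The listener was closed (or the reply abandoned) before a
            // response was produced; there is nothing left to send.
        }
        catch (const std::exception &)
        {
            // Observe any other failure so it can't surface as an
            // unobserved exception; reply with InternalError if the
            // connection is still usable.
        }
    });
}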

Contributor

ras0219-msft commented Nov 12, 2015

  • I can think of no reason why we have to block the response. Not blocking it shouldn't be an issue by itself, though it might exacerbate any other bugs that are lurking (example: connection was dropped during transmission, so sending a response is going to fail. Due to this being a relatively rare scenario, the code may have other bugs). I don't think that's a reason to stop your change though; the objective is to fix the bugs, not cover them up :).
  • I don't believe the delay is required, similarly to the above. The original intent was probably to conceptually simplify the implementation since it's easier to think about "Ok, now we receive the request. Ok, that's done. Ok, now we send the response." than to consider all the possible mixed-states that can arise when simultaneously processing the request and response. For example, there might be data races around m_msg where the request continuation resets it while the response continuation is reading from it.
  • Yeah, that sounds like a good idea. I guess as it currently works, the asio implementation will always respond with at least an InternalError? I'm always in favor of reducing the differences between the platforms, so bringing them in line here would be great.

Additionally, I want to again say thanks for putting more eyes on this code. The http_listener code is the least tested code in the library despite its age, so there are certainly a lot of issues here that need fixing.

Contributor

gigaplex commented Nov 13, 2015

I've pushed another change for review. I'm currently stress testing it as we speak.

I found an answer as to why the library currently delays the response until the body download is complete. The m_overlapped member is reused between the body fetch and response send functions, so we don't want to race over that member. Also, since the m_response member now owns the std::unique_ptr of the context, we need to maintain ownership of the response so the context doesn't go out of scope while the content is still being fetched.

I'm pretty sure there's still a race condition lurking: if m_response_completed is signalled with an error before the m_response member is initialised, the callback that breaks the circular reference will run too early. I'd like to move the creation of that callback to the point where m_response is initialised; however, I'm not sure what the implications of that will be for other observers of the task returned by http_windows_server::respond.

ras0219-msft added the bug label Nov 19, 2015

Contributor

gigaplex commented Dec 3, 2015

Can I get my pull request reviewed please? My boss is asking what the hold up is. Unless any issues are raised during the review, I think the pull request should be ready to merge.

Contributor

ras0219-msft commented Dec 7, 2015

Alright, the pull request has been merged. Thanks for the ping reminder :)
