Deadlock when closing the http_listener #13

Closed
gigaplex opened this issue Nov 9, 2015 · 13 comments

@gigaplex
Contributor

gigaplex commented Nov 9, 2015

Duplicated from CodePlex (issue 421) since the documentation says that issues should be reported to GitHub instead:

When calling http_listener.close().wait() as we're shutting down our service, the application occasionally deadlocks because the wait operation never returns. The hang appears to be in the http_windows_server::stop() function, which contains a wait that never completes:

pplx::task<void> http_windows_server::stop()
{
    // Shutdown request queue.
    if(m_hRequestQueue != nullptr)
    {
        HttpShutdownRequestQueue(m_hRequestQueue);
        m_receivingTask.wait();

        // Wait for all requests to be finished processing.
        m_zeroOutstandingRequests.wait(); // Deadlock appears to be here

        HttpCloseRequestQueue(m_hRequestQueue);
    }
    // ...
}

@gigaplex
Contributor Author

I suspect this area of code is of interest:

pplx::task<void> http_windows_server::respond(http::http_response response)
{
    windows_request_context * p_context = static_cast<windows_request_context *>(response._get_server_context());
    return pplx::create_task(p_context->m_response_completed).then([p_context](::pplx::task<void> t)
    {
        // After response is sent, break circular reference between http_response and the request context.
        // Otherwise http_listener::close() can hang.
        p_context->m_response._get_impl()->_set_server_context(nullptr);
        t.get();
    });
}

It looks like if there's some exception that prevents this bit of code from running, then the circular reference isn't broken, causing the hang.

@gigaplex
Contributor Author

From windows_request_context::async_process_request:

const unsigned long error_code = HttpReceiveHttpRequest(
    pServer->m_hRequestQueue,
    m_request_id,
    0,
    m_request,
    headers_size,
    NULL,
    &m_overlapped);

if(error_code != NO_ERROR && error_code != ERROR_IO_PENDING)
{
    CancelThreadpoolIo(pServer->m_threadpool_io);
    m_msg.reply(status_codes::InternalError);
}

I've managed to reproduce the issue by putting a breakpoint at HttpReceiveHttpRequest in windows_request_context::async_process_request and waiting for the client to time out, causing a network disconnection. HttpReceiveHttpRequest returns ERROR_CONNECTION_INVALID, which prevents m_msg.reply(status_codes::InternalError); from succeeding. The response now waits indefinitely for a broken network connection, and holds on to the smart pointer for the server context, keeping the number of outstanding requests above zero indefinitely. I wouldn't be surprised if there were other error conditions that prevented the reference from being released.

@gigaplex
Contributor Author

I forgot to mention that I was running version 2.6 of the library.

@gigaplex
Contributor Author

It seems a lot of the error handling code in the windows_request_context class relies on m_msg.reply(status_codes::InternalError); doing the right thing. However, if the lambda in windows_request_context::dispatch_request_to_listener that runs in response to the http_request.content_ready() task never runs, bad things happen. The callback for the http_request.get_response() task is never initialised, which prevents replies from working, or at least from releasing the cyclic reference to windows_request_context. Also, the m_msg handle won't be reset, which means there can still be a cyclic reference between the http_request and windows_request_context if no attempt is made to send a reply.

@ras0219-msft
Contributor

First off, thanks a bunch for this very detailed analysis. I'm going to look into reproducing the issue on my local machine.

Please continue posting any further details you find out.

@ras0219-msft
Contributor

One piece of information that might be useful in your analysis: The only way that the lambda in

pplx::task<void> http_windows_server::respond(http::http_response response)
{
    windows_request_context * p_context = static_cast<windows_request_context *>(response._get_server_context());
    return pplx::create_task(p_context->m_response_completed).then([p_context](::pplx::task<void> t)
    {
        // After response is sent, break circular reference between http_response and the request context.
        // Otherwise http_listener::close() can hang.
        p_context->m_response._get_impl()->_set_server_context(nullptr);
        t.get();
    });
}

could possibly not run is if m_response_completed is never set. Even if it is set with an exception, the continuation will still run (and the server context will be reset correctly).

@gigaplex
Contributor Author

Thanks for the update, but that isn't entirely accurate. There are several ways in which that lambda can fail to run correctly. First, when HttpReceiveHttpRequest fails in windows_request_context::async_process_request, m_response_completed is never set because no exception is thrown. Secondly, I've also noticed that if the get_response() task I mentioned from dispatch_request_to_listener doesn't run (there are many ways this can fail), then m_response isn't initialised, so the task in http_windows_server::respond can't break the cyclic reference, as it's not pointing at the correct member (it's under p_context->m_msg.get_response().get() instead).

I've seen an instance where the destructor of m_response_completed triggered the lambda, but it crashed because p_context had already been deleted. I haven't been able to reliably reproduce that situation, I've got more analysis to do.

I've also found a case where m_response_completed is set with an exception, but since no response was created (and thus the protective task-based continuation from details::_http_request::_reply_impl isn't registered), the destructor of m_response_completed takes down the whole process with an unobserved exception violation. This occurs when HttpReceiveRequestEntityBody returns a serious (non-EOF) failure inside windows_request_context::read_request_body_chunk.

@gigaplex
Contributor Author

In case you're having difficulty reproducing, I've got my client side set to a 1 second timeout, and I'm just strategically putting in breakpoints in the cpprest library server-side, waiting for the client to time out the request, and then resuming the server.

Key breakpoint places that cause issues so far include:

  • Immediately before HttpReceiveHttpRequest in windows_request_context::async_process_request (prevents m_response_completed being triggered)
  • Immediately before HttpReceiveRequestEntityBody in windows_request_context::read_request_body_chunk (causes m_response_completed unobserved exception)

Additionally, forcing the client to send a malformed URL causes a similar failure. The function dispatch_request_to_listener is not called, preventing registration of the get_response() lambda and therefore preventing m_response_completed from being signalled.

@gigaplex
Contributor Author

I'm currently in the process of drafting a patch for a pull request to hopefully resolve some/all of the issues I've identified, however there are some questions I'd like answered before continuing.

  • There are quite a few places that call m_msg.reply(...) with some kind of error code. The replies aren't being returned because the callbacks required to process the reply are conditionally initialised inside windows_request_context::dispatch_request_to_listener. There's a guard with a comment saying "If an exception occurred while processing the body then there is no reason to even try sending the response, just re-surface the same exception." This goes against the whole point of sending the error replies back to the client. Is there some justification for blocking the reply? I don't want to reintroduce an old bug if there's a specific reason for this.
  • The code ensures that the response can't be sent until the request body has been fully received. The current state of the code has lost the comment that explains that this is intentional, but I didn't find a specific reason why this should be the case. If there's an error in the response header, or other internal error, I'm not sure why we'd need to wait for the request body before informing the client of the error reply. What issue does this delay resolve?
  • The "asio" based listener breaks out the reply callbacks from the connection::dispatch_request_to_listener function, so that error replies are properly handled. Is there a specific reason that the "httpsys" variant doesn't follow this model? It's the model I'm planning on basing my patch on. While comparing the two implementations, I did notice that the "asio" version doesn't handle the task_canceled exception in the get_response handler, I'd suggest that this functionality from the "httpsys" version should be ported over.

@ras0219-msft
Contributor

  • I can think of no reason why we have to block the response. Not blocking it shouldn't be an issue by itself, though it might exacerbate any other bugs that are lurking (example: connection was dropped during transmission, so sending a response is going to fail. Due to this being a relatively rare scenario, the code may have other bugs). I don't think that's a reason to stop your change though; the objective is to fix the bugs, not cover them up :).
  • I don't believe the delay is required, similarly to the above. The original intent was probably to conceptually simplify the implementation since it's easier to think about "Ok, now we receive the request. Ok, that's done. Ok, now we send the response." than to consider all the possible mixed-states that can arise when simultaneously processing the request and response. For example, there might be data races around m_msg where the request continuation resets it while the response continuation is reading from it.
  • Yeah, that sounds like a good idea. I guess as it currently works, the asio implementation will always respond with at least an InternalError? I'm always in favor of reducing the differences between the platforms, so bringing them in line here would be great.

Additionally, I want to again say thanks for putting more eyes on this code. The http_listener code is the least tested code in the library despite its age, so there are certainly a lot of issues here that need fixing.

@gigaplex
Contributor Author

I've pushed another change for review. I'm currently stress testing it as we speak.

I found an answer as to why the library currently delays the response until the body download is complete. The m_overlapped member is reused between the body-fetch and response-send functions, so we don't want to race over that member. Also, since the m_response member now owns the std::unique_ptr to the context, we need to keep the response alive so it doesn't go out of scope while the content is still being fetched.

I'm pretty sure there's still a race condition lurking, where if m_response_completed is signalled with an error before the m_response member is initialised, then the callback that breaks the circular reference will run too early. I'd like to move the creation of that callback implementation to the point where m_response is initialised, however I'm not sure what the implications of that will be on other observers of the task returned by http_windows_server::respond.

@gigaplex
Contributor Author

gigaplex commented Dec 3, 2015

Can I get my pull request reviewed please? My boss is asking what the hold up is. Unless any issues are raised during the review, I think the pull request should be ready to merge.

@ras0219-msft
Contributor

Alright, the pull request has been merged. Thanks for the ping reminder :)

GerHobbelt pushed a commit to GerHobbelt/cpprestsdk that referenced this issue Mar 20, 2024