Deadlock when closing the http_listener #13

Closed
gigaplex opened this issue Nov 9, 2015 · 13 comments

@gigaplex
Contributor

gigaplex commented Nov 9, 2015

Duplicated from CodePlex (issue 421) since the documentation says that issues should be reported to GitHub instead:

When calling http_listener.close().wait() as we're shutting down our service, the application occasionally deadlocks because the wait operation never returns. The hang appears to be in the http_windows_server::stop() function, which contains a wait that never completes:

pplx::task<void> http_windows_server::stop()
{
    // Shutdown request queue.
    if(m_hRequestQueue != nullptr)
    {
        HttpShutdownRequestQueue(m_hRequestQueue);
        m_receivingTask.wait();

        // Wait for all requests to be finished processing.
        m_zeroOutstandingRequests.wait(); // Deadlock appears to be here

        HttpCloseRequestQueue(m_hRequestQueue);
    }
    // ...
}

@gigaplex
Contributor Author

I suspect this area of code is of interest:

pplx::task<void> http_windows_server::respond(http::http_response response)
{
    windows_request_context * p_context = static_cast<windows_request_context *>(response._get_server_context());
    return pplx::create_task(p_context->m_response_completed).then([p_context](::pplx::task<void> t)
    {
        // After response is sent, break circular reference between http_response and the request context.
        // Otherwise http_listener::close() can hang.
        p_context->m_response._get_impl()->_set_server_context(nullptr);
        t.get();
    });
}

It looks like if there's some exception that prevents this bit of code from running, then the circular reference isn't broken, causing the hang.

@gigaplex
Contributor Author

From windows_request_context::async_process_request:

const unsigned long error_code = HttpReceiveHttpRequest(
    pServer->m_hRequestQueue,
    m_request_id,
    0,
    m_request,
    headers_size,
    NULL,
    &m_overlapped);

if(error_code != NO_ERROR && error_code != ERROR_IO_PENDING)
{
    CancelThreadpoolIo(pServer->m_threadpool_io);
    m_msg.reply(status_codes::InternalError);
}

I've managed to reproduce the issue by putting a breakpoint at HttpReceiveHttpRequest in windows_request_context::async_process_request and waiting for the client to time out, causing a network disconnection. HttpReceiveHttpRequest returns ERROR_CONNECTION_INVALID, which prevents m_msg.reply(status_codes::InternalError); from succeeding. The response now waits indefinitely for a broken network connection, and holds on to the smart pointer for the server context, keeping the number of outstanding requests above zero indefinitely. I wouldn't be surprised if there were other error conditions that prevented the reference from being released.

@gigaplex
Contributor Author

I forgot to mention that I was running version 2.6 of the library.

@gigaplex
Contributor Author

It seems a lot of the error handling code in the windows_request_context class relies on m_msg.reply(status_codes::InternalError); doing the right thing. However, if the lambda in windows_request_context::dispatch_request_to_listener that runs in response to the http_request.content_ready() task never runs, bad things happen. The callback for the http_request.get_response() task is never initialised, which prevents replies from working, or at least from releasing the cyclic reference to windows_request_context. Also, the m_msg handle won't be reset, which means there can still be a cyclic reference between the http_request and windows_request_context if no attempt is made to send a reply.

@ras0219-msft
Contributor

First off, thanks a bunch for this very detailed analysis. I'm going to look into reproducing the issue on my local machine.

Please continue posting any further details you find out.

@ras0219-msft
Contributor

One piece of information that might be useful in your analysis: The only way that the lambda in

pplx::task<void> http_windows_server::respond(http::http_response response)
{
    windows_request_context * p_context = static_cast<windows_request_context *>(response._get_server_context());
    return pplx::create_task(p_context->m_response_completed).then([p_context](::pplx::task<void> t)
    {
        // After response is sent, break circular reference between http_response and the request context.
        // Otherwise http_listener::close() can hang.
        p_context->m_response._get_impl()->_set_server_context(nullptr);
        t.get();
    });
}

could possibly not run is if m_response_completed is never set. Even if it is set with an exception, the continuation will still run (and the server context will be reset correctly).

@gigaplex
Contributor Author

Thanks for the update, but that isn't entirely accurate. There are several ways in which that lambda can fail to run correctly. First, when HttpReceiveHttpRequest fails in windows_request_context::async_process_request, m_response_completed is never set because no exception is thrown. Secondly, I've also noticed that if the get_response() task I mentioned from dispatch_request_to_listener doesn't run (there are many ways this can fail), then m_response isn't initialised, so the task in http_windows_server::respond can't break the cyclic reference, as it's not pointing at the correct member (it's under p_context->m_msg.get_response().get() instead).

I've seen an instance where the destructor of m_response_completed triggered the lambda, but it crashed because p_context had already been deleted. I haven't been able to reliably reproduce that situation, I've got more analysis to do.

I've also found a case where m_response_completed is set with an exception, but since no response was created (and thus the protective task-based continuation from details::_http_request::_reply_impl isn't registered), the destructor of m_response_completed takes down the whole process with an unobserved exception violation. This occurs when HttpReceiveRequestEntityBody returns a serious (non-EOF) failure inside windows_request_context::read_request_body_chunk.

@gigaplex
Contributor Author

In case you're having difficulty reproducing, I've got my client side set to a 1 second timeout, and I'm just strategically putting in breakpoints in the cpprest library server-side, waiting for the client to time out the request, and then resuming the server.

Key breakpoint places that cause issues so far include:

  • Immediately before HttpReceiveHttpRequest in windows_request_context::async_process_request (prevents m_response_completed being triggered)
  • Immediately before HttpReceiveRequestEntityBody in windows_request_context::read_request_body_chunk (causes m_response_completed unobserved exception)

Additionally, forcing the client to send a malformed URL causes a similar failure. The function dispatch_request_to_listener is not called, preventing registration of the get_response() lambda and therefore preventing m_response_completed from being signalled.

@gigaplex
Contributor Author

I'm currently in the process of drafting a patch for a pull request to hopefully resolve some/all of the issues I've identified, however there are some questions I'd like answered before continuing.

  • There are quite a few places that call m_msg.reply(...) with some kind of error code. The replies aren't being returned because the callbacks required to process the reply are conditionally initialised inside windows_request_context::dispatch_request_to_listener. There's a guard with a comment saying "If an exception occurred while processing the body then there is no reason to even try sending the response, just re-surface the same exception." This goes against the whole point of sending the error replies back to the client. Is there some justification for blocking the reply? I don't want to reintroduce an old bug if there's a specific reason for this.
  • The code ensures that the response can't be sent until the request body has been fully received. The current state of the code has lost the comment that explains that this is intentional, but I didn't find a specific reason why this should be the case. If there's an error in the response header, or other internal error, I'm not sure why we'd need to wait for the request body before informing the client of the error reply. What issue does this delay resolve?
  • The "asio" based listener breaks out the reply callbacks from the connection::dispatch_request_to_listener function, so that error replies are properly handled. Is there a specific reason that the "httpsys" variant doesn't follow this model? It's the model I'm planning on basing my patch on. While comparing the two implementations, I did notice that the "asio" version doesn't handle the task_canceled exception in the get_response handler, I'd suggest that this functionality from the "httpsys" version should be ported over.

@ras0219-msft
Contributor

  • I can think of no reason why we have to block the response. Not blocking it shouldn't be an issue by itself, though it might exacerbate any other bugs that are lurking (example: connection was dropped during transmission, so sending a response is going to fail. Due to this being a relatively rare scenario, the code may have other bugs). I don't think that's a reason to stop your change though; the objective is to fix the bugs, not cover them up :).
  • I don't believe the delay is required, similarly to the above. The original intent was probably to conceptually simplify the implementation since it's easier to think about "Ok, now we receive the request. Ok, that's done. Ok, now we send the response." than to consider all the possible mixed-states that can arise when simultaneously processing the request and response. For example, there might be data races around m_msg where the request continuation resets it while the response continuation is reading from it.
  • Yeah, that sounds like a good idea. I guess as it currently works, the asio implementation will always respond with at least an InternalError? I'm always in favor of reducing the differences between the platforms, so bringing them in line here would be great.

Additionally, I want to again say thanks for putting more eyes on this code. The http_listener code is the least tested code in the library despite its age, so there are certainly a lot of issues here that need fixing.

@gigaplex
Contributor Author

I've pushed another change for review. I'm currently stress testing it as we speak.

I found an answer as to why the library currently delays the response until the body download is complete. The m_overlapped member is reused between the body-fetch and response-send functions, so we don't want to race over that member. Also, since the m_response member now owns the std::unique_ptr to the context, we need to keep the response alive so it doesn't go out of scope while the content is still being fetched.

I'm pretty sure there's still a race condition lurking, where if m_response_completed is signalled with an error before the m_response member is initialised, then the callback that breaks the circular reference will run too early. I'd like to move the creation of that callback implementation to the point where m_response is initialised, however I'm not sure what the implications of that will be on other observers of the task returned by http_windows_server::respond.

@gigaplex
Contributor Author

gigaplex commented Dec 3, 2015

Can I get my pull request reviewed please? My boss is asking what the hold up is. Unless any issues are raised during the review, I think the pull request should be ready to merge.

@ras0219-msft
Contributor

Alright, the pull request has been merged. Thanks for the ping reminder :)

GerHobbelt pushed a commit to GerHobbelt/cpprestsdk that referenced this issue Mar 20, 2024