Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection time out causes Segmentation fault in iothubtransport_mqtt_common module #446

Closed
hhirvonen opened this issue Apr 5, 2018 · 21 comments

Comments

@hhirvonen
Copy link

Hello,

Backround:
My network firewall blocks AMQP port and that causes delay on edgeHub connection procedure. After 1 minute connection time out it changes connection protocol to AMQP over WebSocket.
Meanwhile in my edgeModule connection time our occurs and it crashes in azure-iot-sdk-c library's iothubtransport_mqtt_common module.

edgeHub container logs:
2018-04-05 10:29:52.703 +00:00 [INF] - Attempting to connect to IoT Hub for client xxx-xxx via AMQP...
2018-04-05 10:30:52.794 +00:00 [INF] - Attempting to connect to IoT Hub for client xx-xxx via AMQP over WebSocket...
2018-04-05 10:30:52.990 +00:00 [INF] - New token requested by client xxx-xxx, but using existing token as it is usable.
2018-04-05 10:30:53.605 +00:00 [INF] - Connected to IoT Hub for client xxx-xxx via AMQP over WebSocket, with client operation timeout 60000.
2018-04-05 10:30:53.829 +00:00 [INF] - Set subscriptions from session state for xxx-xxx on cloud reconnect

edgeModule log:
-> 10:29:51 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: Edge-Test-Hub.azure-devices.net/xxx-xxx/api-version=2016-11-14&DeviceClientType=iothubclient%2f1.2.2%20(native%3b%20Linux%3b%20x86_64) | PWD: XXXX | CLEAN: 0
Error: Time:Thu Apr 5 10:30:19 2018 File:/build/azure-iot-sdk-c-MaKnm3/azure-iot-sdk-c-0.2.0.0/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:1865 mqtt_client timed out waiting for CONNACK
Segmentation fault (core dumped)

Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial
azure-iot-sdk-c release 2018-04-02
Same problem exist also previous release

br Hannu

@villepalo
Copy link
Contributor

I'm having the same segfault occasionally using MQTT and iot-edge v1 Azure/iot-edge-v1#520

@ewertons
Copy link
Contributor

ewertons commented Apr 5, 2018

Hi @villepalo ,
we will investigate the issue right away.
In the mean time, would it be possible for you to capture and share a memory dump using gdb? That would help expedite the root-causing.

@ewertons
Copy link
Contributor

ewertons commented Apr 5, 2018

If you generate a memory dump it will most likely contain PII, so it would be recommended to send it to us on a secure channel. We can provide one but a support ticket will be necessary so our Support Team can reach you. If you prefer to do that way, please open a ticket using the Azure portal.

@villepalo
Copy link
Contributor

I'm trying get the core dump, but this segfault happens so seldom on our environment, that it could take some time

@hhirvonen
Copy link
Author

Here is core dump details
(gdb) bt
#0 0x00007ff0f00020a0 in ?? ()
#1 0x00000000004f9edd in mqtt_client_dowork ()
#2 0x00000000004b1cde in IoTHubTransport_MQTT_Common_DoWork ()
#3 0x000000000048f84c in IoTHubClient_LL_DoWork ()
#4 0x000000000048c5e1 in ScheduleWork_Thread ()
#5 0x00000000004be31e in ThreadWrapper ()
#6 0x00007ff0fbc376ba in start_thread (arg=0x7ff0f5de6700) at pthread_create.c:333
#7 0x00007ff0faec741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

@ewertons
Copy link
Contributor

ewertons commented Apr 6, 2018

Thanks, but that's a callstack only, and not from a debug build.
Unfortunately it doesn't provide any more info than we already had.
A full mem dump will shed more light on this issue.
We are also trying to repro it, but haven't hit the issue.

@villepalo
Copy link
Contributor

Our environment where this happens once a day:

  1. Ubuntu 14.0.4
  2. MQTT
  3. Connection break happens every five minutes
  4. Connection break last about two minutes
  5. about 30 000 messages sent in 24h

Unfortunately getting debug crash dump is not possible.

@jspaith
Copy link
Contributor

jspaith commented Apr 10, 2018

Is there anyway you could run this under a debugger - ideally with a debug build, but if there's timing issues a release would be helpful too - and see if you can get a repro with some additional info? We really want to get the bottom of this but we're definitely going to need some additional info. Thanks

@villepalo
Copy link
Contributor

Correct me if I'm wrong, but isn't so that in iothubtransport_mqtt_common.c:DisconnectFromClient xioTransport is destroyed and memory freed, but in mqtt_client.c:mqtt_client_disconnect mqtt_client->xioHandle is set to NULL only if mqtt_client->clientConnected is true. And that handle is used later in the xio_dowork called from mqtt_client_dowork?

@hhirvonen
Copy link
Author

I made modification to mqtt_client.c module's mqtt_client_disconnect() function and set mqtt_client->xioHandle to NULL also on (if mqtt_client->clientConnected) else branch and that fixes the core dump problem.
I used release 2018-04-02 build with modified mqtt_client.c module.

@pboettch
Copy link
Contributor

pboettch commented Apr 17, 2018

It seems my issue (#459) is identical to this one and it's great that a solution seems to have been found. I can easily reproduce this issue on my platform and I think I can even force it by setting up a blocking rule in the right moment.

First I will run my program with valgrind, it should show the "use after free". Or is this a threading issue? Then one might see something with a thread-sanitizer.

@villepalo
Copy link
Contributor

I'm pretty sure that this isn't threading issue. I'm also pretty sure that this segfault happens every time if "mqtt_client timed out waiting for CONNACK" line gets logged.

@pboettch
Copy link
Contributor

Valgrind will tell the truth.

@pboettch
Copy link
Contributor

pboettch commented Apr 17, 2018

To trig the problem I'm forcing the mqtt-client to think that it timed out on the connection:

        // iothub_client/src/iothubtransport_mqtt_common.c:1845
        else if (transport_data->mqttClientStatus == MQTT_CLIENT_STATUS_CONNECTING)
        {
            //tickcounter_ms_t current_time;
            //if (tickcounter_get_current_ms(transport_data->msgTickCounter, &current_time) != 0)
            //{
            //    LogError("failed verifying MQTT_CLIENT_STATUS_CONNECTING timeout");
            //    result = __FAILURE__;
            //}
            //else if ((current_time - transport_data->mqtt_connect_time) / 1000 > transport_data->connect_timeout_in_sec)
            //{
                LogError("mqtt_client timed out waiting for CONNACK");
                DisconnectFromClient(transport_data);
                result = 0;
            //}
        }

I'm getting a segfault systematically now.

@pboettch
Copy link
Contributor

And adding

            mqtt_client->xioHandle = NULL;

to the else branch of umqtt/src/mqtt_client.c's if (mqtt_client->clientConnected) fixes the seg-fault.

@hhirvonen did you issue a pull-request?

@hhirvonen
Copy link
Author

No I didn't. I just try it on my own PC.

@pboettch
Copy link
Contributor

It'll be nice to have a non-regression along with the fix. Having said that, I wouldn't know how to write one here.

@villepalo
Copy link
Contributor

villepalo commented Apr 18, 2018

I created a pull request about the issue: Azure/azure-umqtt-c#18

But a thing I do not understand is why this connecting timeout is not considered as a failure:

            else if ((current_time - transport_data->mqtt_connect_time) / 1000 > transport_data->connect_timeout_in_sec)
            {
                LogError("mqtt_client timed out waiting for CONNACK");
                DisconnectFromClient(transport_data);
                result = 0;
            }

Setting result to FAILURE would solve this segFault, wouldn't it?

@villepalo
Copy link
Contributor

This should be fixed now in the current master (with updated submodules). Coud you verify this?

@hhirvonen
Copy link
Author

Yes, I'll verify it. I let you know the results hopefully during this week.

@hhirvonen
Copy link
Author

Yes, It's working and connection time out does not cause Segmentation fault anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants