Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python Interpreter shutdown when using multithreading #148

Closed
Cnidarias opened this issue Jun 15, 2018 · 17 comments
Closed

Python Interpreter shutdown when using multithreading #148

Cnidarias opened this issue Jun 15, 2018 · 17 comments

Comments

@Cnidarias
Copy link

Cnidarias commented Jun 15, 2018

  • OS and version used: Ubuntu 16.04.3

  • Python runtime used: Python 3.5.2

  • SDK version used: azure-iothub-device-client==1.3.5

Description of the issue:

When sending data to an IoT-Hub using the azure-iothub-device-client lib the python interpreter crashes after some undefined period of time when using multithreading.
This behaviour is 100% reproducible but the time it takes for it to happen is varied. When using more threads and sending more frequently it seems that it will happen faster. When using 50 threads and sending every second it crashed after about 30 minutes.

The error one gets is simply:
Fatal Python error: GC object already tracked

Sometimes this comes with a SIGABRT but not always.

Since I ran python with gdb I have attached a backtrace will all the info
I think the most important parts are at the top - I also included a py-bt of all threads which most of them are just sleeping
gdb.txt

Code sample exhibiting the issue:

import threading
import json
import time
import random

from iothub_client import IoTHubClient, IoTHubTransportProvider, IoTHubMessage, IoTHubError


class AzureThreadingTest(threading.Thread):
    def __init__(self, azure_config_file, name):
        super().__init__(name=name)
        self.MSG_COUNT = 0
        self.SEND_CALLBACKS = 0

        self.sleep_time = 1
        self.message_print_time = 30
        self.name = name

        with open(azure_config_file) as json_data:
            self.azure_config = json.load(json_data)

        self.iothub_client = self._iothub_client_init()

    def _iothub_client_init(self):
        connection_string = self.azure_config["iot_hub"]["ConnectionString"]
        protocol = self.azure_config["iot_hub"]["Options"]["PROTOCOL"]
        if protocol == "HTTP":
            protocol_class = IoTHubTransportProvider.HTTP
        elif protocol == "AMQP":
            protocol_class = IoTHubTransportProvider.AMQP
        elif protocol == "MQTT":
            protocol_class = IoTHubTransportProvider.MQTT
        else:
            protocol_class = IoTHubTransportProvider.HTTP
        client = IoTHubClient(connection_string, protocol_class)
        client.set_option("messageTimeout", self.azure_config["iot_hub"]["Options"]["MESSAGE_TIMEOUT"])

        if client.protocol == IoTHubTransportProvider.HTTP:
            timeout = 241000
            minimum_polling_time = 9
            client.set_option("timeout", timeout)
            client.set_option("MinimumPollingTime", minimum_polling_time)

        if client.protocol == IoTHubTransportProvider.MQTT:
            client.set_option("logtrace", 0)

        return client

    def _send_confirmation_callback(self, message, result, user_context):
        user_context.SEND_CALLBACKS += 1

    def _send_to_azure(self, msg):
        try:
            msg = IoTHubMessage(bytearray(str(msg), 'utf8'))
            self.MSG_COUNT += 1
            self.iothub_client.send_event_async(msg, self._send_confirmation_callback, self)
        except IoTHubError as iothub_error:
            print("Unexpected error {} from IoTHub".format(iothub_error))

    def run(self):
        while True:
            try:
                data = {'some_data': random.random()}
                vs = json.dumps(data)
                if self.MSG_COUNT % self.message_print_time == 0:
                    print("Thread {} sent {}".format(self.name, vs))
                self._send_to_azure(vs)
                time.sleep(self.sleep_time)
            except Exception as e:
                print(e)


if __name__ == '__main__':
    thread_count = 50
    for i in range(thread_count):
        test = AzureThreadingTest('azure_config.json', str(i))
        test.daemon = True
        print('Starting thread {}'.format(i))
        test.start()
        time.sleep(1)

    while True:
        pass

The code is pretty simple - just sets up the connection and sends random data.

The protocol we are using is HTTP - the rest of the config file are just the connection strings and shouldn't matter for the issue.

I think the issues lies with the callback handling in the send_event_async method as I am not sure if the GIL is done correctly - but I am more or less just blindly guessing

@bulletlink
Copy link

Experiencing the same issue here.

OS and version used: UbuntuMate 16.04.02

Python runtime used: Python 2.7

Platform used: Raspberry Pi B+

Script uses PyQt4 and QThreads.

I am spawning 3 QThreads in total, 1 of which handles Serial Comms, 1 handles a Phidgets USB device and 1 handles the Azure connection.

Also using HTTP protocol and my Azure thread class looks similar to the one posted.

Without spawning the Azure thread, the script has been running for over a week. As soon as I introduce the Azure thread I get:

Fatal Python error: GC object already tracked

When running through gdb, the error message is followed by:

Thread 1 "python" received signal SIGABRT, Aborted.
__libc_do_syscall ()
at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
47 ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory

I can also confirm that I can reproduce it 100% of the time, but the time it takes for it to happen can vary from anywhere between a couple of minutes to an hour. I agree that the issue seems to lie somewhere with the confirmation callback in the send_event_async method

@jackt-moran
Copy link

I'm also experiencing this issue or something similar. Are there any updates to this? Is there a fix in the works? Or has anyone found any workarounds?

@ceidam
Copy link

ceidam commented Feb 7, 2019

same issue here:
Raspberry Pi 3 B+
python 3.5.3
azure-iothub-device-client (1.4.6)
I confirm to bulletlink's assumption that the issue is related to the "send_event_async - callback"

is there any solution or workaround?

@Cnidarias
Copy link
Author

Just as a little update on this - since I didn't want to spend a lot of time digging around I resorted to switching to multiproccessing. I suppose I would recommend here do this until someone finds the actual issue.

@jackt-moran
Copy link

jackt-moran commented Feb 8, 2019

Context:
Hardware: Dell's Embedded Box PC 5000
OS: Ubuntu 18.04
Architecture: amd64
Language: Python 2.7
azure-iot-sdk-python version: 1.4.6

I might have had a slightly different issue than is documented here actually.
I initially had this issue using multiprocessing, so I tried switching to multithreading, which didn't solve the problem. After that, I switched to a single threaded version of my app and the issue still persisted. I have a class that wraps all azure-iot-sdk-python elements that I use, and handles all of the messages the app generates. I found that the only way to get my particular issue to stop happening is to not invoke any calls to IoTHubModuleClient.send_event_async.

As a workaround I have a C# HTTP server in another module which my Python module forwards it's requests to. It works fine but obviously it would be nice to be able to use the IoTHubModuleClient class without reliability issues.

@Cnidarias
Copy link
Author

It is funny that you say that because at some point I had exactly the same setup (a C# TCP server which acted as a broker)

@LouanDuToitS3
Copy link

LouanDuToitS3 commented Aug 26, 2019

This is happening to me as well. I found that using "send_event_async" and waiting for the tx confirmation before using "send_event_async" again helps a bit, but still crashes every couple of hours.

This is actually pathetic that after more than a year we still have no official response on this from the devs. What more is needed? Test code exhibiting the issue were provided.

Do the devs think we play with the code for "hobby" projects. Maybe it's time to consider another provider for our services...

pierreca pushed a commit that referenced this issue Aug 28, 2019
* combine connection args ops to make pipeline simpler
@pierreca
Copy link
Contributor

@LouanDuToitS3 @Cnidarias @uitdam @bulletlink @jackt-moran
sorry for the lack of communication or solutions until now. A big part of the reason why we haven't addressed this is because we've been busy rewriting the whole SDK in pure python. we realized last fall after taking a good look at issues (especially related to native code behaviors , import errors and platform support) that the way the v1 of the SDK was built was problematic and definitely not future-proof.
Getting to the point where we could release a new SDK that offers a better experience for python users took longer than expected and we definitely should have communicated this more openly.

While it may be "too little too late" at that point, i'm also hoping we're still "better late than never" at least for folks who've experienced this issue recently. We've reached a point where we feel confident pushing the v2 of the SDK into master and previewing it with customers. The API are a lot more pythonic, and if bugs are found they will be fixed much faster because the environment is much friendlier for pythonistas and we don't depend on another team to build the product.

I hope that what you'll find in master today (and in the azure-iot-device package on PyPI) will suit you a lot more than what we had before. please do let us know.

@jackt-moran
Copy link

@pierreca
I'm looking forward to previewing the v2 SDK! Is this repository now the home of the v2 SDK as opposed to where it previously was ?

@pierreca
Copy link
Contributor

@jackt-moran correct, at that point and even though it's still in preview, we feel it's such an improvement over v1 that it's worth pushing it to master already.

the -preview repository is going to be archived after we're done moving all the builds and issues etc.

@LouanDuToitS3
Copy link

LouanDuToitS3 commented Sep 2, 2019

@pierreca
So far I'm enjoying the V2 SDK. But unfortunately it leaves me wanting. I need the SDK to be production ready. Any idea when the SDK will exit "Preview"?

I'm unable to log issues against the V2 SDK, but here are a condensed list of the current issues preventing me from using V2 in production:

  1. Some of our edge modules require Python2, and therefore I am using the "sync" interface for device communications. Problem is that the "pipeline" exits without any way for me to monitor state. I can see in the SDK code the following:

sys.exit(1) # TODO: raise an error instead

Also there seems to be no way for me to query the connection state. I hope that I'm overlooking something. A comment in the code suggests that a connection will be established if needed, which does not happen, therefore leading me to believe that the code itself at the point of sending data is unaware of the connection state.

  1. "INFO" level debug logging very verbose. I see a lot of debug messages which I would prefer to be lowered to "DEBUG" level. Our solution is using Azure Log Analytics, and therefore too many messages are being uploaded.

  2. The pipeline exits with "Out of memory".
    Once again as in point (1) I have no insight into the current connection to the hub, and therefore can not check if all is connected and fine. Here are a log of a crash in the "pipeline" thread, my code continued without noticing any issues.

2019-09-01T23:32:08.203417009Z ERROR,MQTTTransportStage: _on_mqtt_disconnect called: Out of memory.
2019-09-01T23:32:08.203756480Z WARNING,MQTTTransportStage: disconnection was unexpected
2019-09-01T23:32:08.206882621Z ERROR,Exception caught in background thread.  Unable to handle.
2019-09-01T23:32:08.206916561Z azure.iot.device.common.errors.ProtocolClientError: Out of memory.
2019-09-01T23:32:08.206921772Z 
2019-09-01T23:32:08.206925210Z The above exception was the direct cause of the following exception:
2019-09-01T23:32:08.206928703Z 
2019-09-01T23:32:08.206931677Z Traceback (most recent call last):
2019-09-01T23:32:08.206934793Z   File "/usr/local/lib/python3.6/dist-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 250, in _on_mqtt_disconnected
2019-09-01T23:32:08.206938614Z     six.raise_from(errors.ConnectionDroppedError, cause)
2019-09-01T23:32:08.206941693Z   File "<string>", line 3, in raise_from
2019-09-01T23:32:08.206945247Z azure.iot.device.common.errors.ConnectionDroppedError
2019-09-01T23:32:08.207451602Z ERROR,Exception caught in background thread.  Unable to handle.
2019-09-01T23:32:08.207460452Z azure.iot.device.common.errors.ProtocolClientError: Out of memory.
2019-09-01T23:32:08.207463484Z 
2019-09-01T23:32:08.207465850Z The above exception was the direct cause of the following exception:
2019-09-01T23:32:08.207468221Z 
2019-09-01T23:32:08.207470369Z Traceback (most recent call last):
2019-09-01T23:32:08.207472623Z   File "/usr/local/lib/python3.6/dist-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 250, in _on_mqtt_disconnected
2019-09-01T23:32:08.207475040Z     six.raise_from(errors.ConnectionDroppedError, cause)
2019-09-01T23:32:08.207479203Z   File "<string>", line 3, in raise_from
2019-09-01T23:32:08.207481724Z azure.iot.device.common.errors.ConnectionDroppedError
2019-09-01T23:32:08.207483854Z 
2019-09-01T23:32:08.207485919Z During handling of the above exception, another exception occurred:
2019-09-01T23:32:08.207488161Z 
2019-09-01T23:32:08.207490230Z Traceback (most recent call last):
2019-09-01T23:32:08.207492386Z   File "/usr/local/lib/python3.6/dist-packages/azure/iot/device/common/pipeline/pipeline_thread.py", line 113, in thread_proc
2019-09-01T23:32:08.207506550Z     return func(*args, **kwargs)
2019-09-01T23:32:08.207511834Z   File "/usr/local/lib/python3.6/dist-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 252, in _on_mqtt_disconnected
2019-09-01T23:32:08.207515757Z     unhandled_exceptions.exception_caught_in_background_thread(e)
2019-09-01T23:32:08.207518506Z   File "/usr/local/lib/python3.6/dist-packages/azure/iot/device/common/unhandled_exceptions.py", line 27, in exception_caught_in_background_thread
2019-09-01T23:32:08.207520727Z     logger.error(msg="Exception caught in background thread.  Unable to handle.", exc_info=e)
2019-09-01T23:32:08.207526472Z   File "/usr/lib/python3.6/logging/__init__.py", line 1337, in error
2019-09-01T23:32:08.207530479Z     self._log(ERROR, msg, args, **kwargs)
2019-09-01T23:32:08.207532956Z   File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
2019-09-01T23:32:08.207535532Z     self.handle(record)
2019-09-01T23:32:08.207537957Z   File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
2019-09-01T23:32:08.207540513Z     self.callHandlers(record)
2019-09-01T23:32:08.207542715Z   File "/usr/lib/python3.6/logging/__init__.py", line 1516, in callHandlers
2019-09-01T23:32:08.207545552Z     hdlr.handle(record)
2019-09-01T23:32:08.207548042Z   File "/usr/lib/python3.6/logging/__init__.py", line 865, in handle
2019-09-01T23:32:08.207550735Z     self.emit(record)
2019-09-01T23:32:08.207553383Z   File "/app/edge_node/python_modules/utils/logging_module.py", line 620, in emit
2019-09-01T23:32:08.207556309Z     self.logmod_parent.exception_ai()
2019-09-01T23:32:08.207558877Z   File "/app/edge_node/python_modules/utils/logging_module.py", line 1011, in exception_ai
2019-09-01T23:32:08.207561666Z     "value": value.message,
2019-09-01T23:32:08.207564872Z AttributeError: 'ConnectionDroppedError' object has no attribute 'message'
2019-09-01T23:32:09.213950513Z WARNING,Connection was unexpected
  1. Routing on message body not working. I had to revert to using "custom_properties" on messages.

I would love to switch over to V2, but unfortunately, for me, it's not production ready.

@pierreca
Copy link
Contributor

pierreca commented Sep 3, 2019

@LouanDuToitS3 thanks for the very valuable feedback.
@cartertinney @BertKleewein please take a look at these errors.

As to when we expect v2 to be production ready: we're incredibly close to feature complete but also very aware that we can't call it production ready until we squash bugs that haven't been found by end-to-end and unit tests. I'm guessing some time in November.

@LouanDuToitS3
Copy link

LouanDuToitS3 commented Sep 4, 2019

@pierreca Thanks.
@cartertinney @BertKleewein let me know in case more info is required on the above mentioned issues.

@BertKleewein
Copy link
Member

Hi @LouanDuToitS3, Here is some specific feedback following the same numbering you used above.
(thanks for numbering your issues):

  1. Has a PR that is currently in it's second round of review - [CARRYOVER] Return pipeline errors to user by raising exceptions from… #292
  2. Is something that we have been aware of but I didn't realize it could impact bandwidth usage. I will start on this ASAP.
  3. I have no answer for yet. Reconnecting and retrying operations is a work item on my near horizon. The exception changes for 1 will hopefully help
  4. can you provide an example of a routing rule that exhibits this behavior?

@LouanDuToitS3
Copy link

LouanDuToitS3 commented Sep 6, 2019

Hi @BertKleewein , I logged issue: #294 for no 4.

@jebrando
Copy link
Contributor

With the Python v2 reaching General Availability we are recommending users move away from the v1 SDK. We will be closing the v1 SDK issues whose features are covered by the v2 release. Feel free to open an new issue if you find an issue with the v2 SDK.

Thank you for your patronage

@az-iot-builder-01
Copy link
Collaborator

@bulletlink, @jackt-moran, @uitdam, @Cnidarias, @LouanDuToitS3, @BertKleewein, thank you for your contribution to our open-sourced project! Please help us improve by filling out this 2-minute customer satisfaction survey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants